Embedding Words and Senses Together via Joint Knowledge-Enhanced Training

Word embeddings are widely used in Natural Language Processing, mainly due to their success in capturing semantic information from massive corpora. However, their creation process does not allow the different meanings of a word to be automatically separated, as it conflates them into a single vector. We address this issue by proposing a new model which learns word and sense embeddings jointly. Our model exploits large corpora and knowledge from semantic networks in order to produce a unified vector space of word and sense embeddings. We evaluate the main features of our approach both qualitatively and quantitatively in a variety of tasks, highlighting the advantages of the proposed method in comparison to state-of-the-art word- and sense-based models.


Introduction
Recently, approaches based on neural networks which embed words into low-dimensional vector spaces from text corpora (i.e. word embeddings) have become increasingly popular (Mikolov et al., 2013;Pennington et al., 2014). Word embeddings have proved to be beneficial in many Natural Language Processing tasks, such as Machine Translation (Zou et al., 2013), syntactic parsing (Weiss et al., 2015), and Question Answering (Bordes et al., 2014), to name a few. Despite their success in capturing semantic properties of words, these representations are generally hampered by an important limitation: the inability to discriminate among different meanings of the same word.
Authors marked with an asterisk (*) contributed equally.
Previous works have addressed this limitation by automatically inducing word senses from monolingual corpora (Schütze, 1998;Reisinger and Mooney, 2010;Huang et al., 2012;Di Marco and Navigli, 2013;Neelakantan et al., 2014;Tian et al., 2014;Li and Jurafsky, 2015;Vu and Parker, 2016;Qiu et al., 2016), or bilingual parallel data (Guo et al., 2014;Ettinger et al., 2016;Suster et al., 2016). However, these approaches learn solely on the basis of statistics extracted from text corpora and do not exploit knowledge from semantic networks. Additionally, their induced senses are neither readily interpretable (Panchenko et al., 2017) nor easily mappable to lexical resources, which limits their application. Recent approaches have utilized semantic networks to inject knowledge into existing word representations (Yu and Dredze, 2014;Faruqui et al., 2015;Goikoetxea et al., 2015;Speer and Lowry-Duda, 2017;Mrksic et al., 2017), but without solving the meaning conflation issue. In order to obtain a representation for each sense of a word, a number of approaches have leveraged lexical resources to learn sense embeddings as a result of post-processing conventional word embeddings (Johansson and Pina, 2015;Rothe and Schütze, 2015;Pilehvar and Collier, 2016;Camacho-Collados et al., 2016).
Instead, we propose SW2V (Senses and Words to Vectors), a neural model that exploits knowledge from both text corpora and semantic networks in order to simultaneously learn embeddings for both words and senses. Moreover, our model provides three additional key features: (1) both word and sense embeddings are represented in the same vector space, (2) it is flexible, as it can be applied to different predictive models, and (3) it is scalable for very large semantic networks and text corpora.

Related work
Embedding words from large corpora into a low-dimensional vector space has been a popular task since the appearance of the probabilistic feedforward neural network language model (Bengio et al., 2003) and later developments such as word2vec (Mikolov et al., 2013) and GloVe (Pennington et al., 2014). However, little research has focused on exploiting lexical resources to overcome the inherent ambiguity of word embeddings. Iacobacci et al. (2015) overcame this limitation by applying an off-the-shelf disambiguation system (i.e. Babelfy (Moro et al., 2014)) to a corpus and then using word2vec to learn sense embeddings over the pre-disambiguated text. However, in their approach words are replaced by their intended senses, consequently producing only sense representations as output. The representation of words and senses in the same vector space proves essential for applying these knowledge-based sense embeddings in downstream applications, particularly for their integration into neural architectures (Pilehvar et al., 2017). In the literature, various methods have attempted to overcome this limitation. One line of work proposed a model for obtaining both word and sense representations based on a first training step of conventional word embeddings, a second disambiguation step based on sense definitions, and a final training phase which uses the disambiguated text as input. Likewise, Rothe and Schütze (2015) aimed at building a shared space of word and sense embeddings based on two steps: a first training step of only word embeddings and a second training step to produce sense and synset embeddings. These two approaches require multiple training steps and make use of a relatively small resource like WordNet, which limits their coverage and applicability. Camacho-Collados et al. (2016) increased the coverage of these WordNet-based approaches by exploiting the complementary knowledge of WordNet and Wikipedia along with pre-trained word embeddings. Finally, Fang et al. (2016), among others, proposed a model to align vector spaces of words and entities from knowledge bases. However, these approaches are restricted to nominal instances only (i.e. Wikipedia pages or entities).
In contrast, we propose a model which learns both words and sense embeddings from a single joint training phase, producing a common vector space of words and senses as an emerging feature.

Connecting words and senses in context
In order to jointly produce embeddings for words and senses, SW2V needs as input a corpus where words are connected to senses in each given context. One option for obtaining such connections could be to take a sense-annotated corpus as input. However, manually annotating large amounts of data is extremely expensive and therefore impractical in normal settings. Obtaining sense-annotated data from current off-the-shelf disambiguation and entity linking systems is possible, but generally suffers from two major problems. First, supervised systems are hampered by the very same problem of needing large amounts of sense-annotated data. Second, the relatively slow speed of current disambiguation systems, such as graph-based approaches (Hoffart et al., 2012;Agirre et al., 2014;Moro et al., 2014) or word-expert supervised systems (Zhong and Ng, 2010;Iacobacci et al., 2016;Melamud et al., 2016), could become an obstacle when applied to large corpora. This is why we propose a simple yet effective unsupervised shallow word-sense connectivity algorithm, which can be applied to virtually any given semantic network and is linear in the corpus size. The main idea of the algorithm is to exploit the connections of a semantic network by associating words with the senses that are most connected within the sentence, according to the underlying network.
Shallow word-sense connectivity algorithm. Formally, a corpus and a semantic network are taken as input and a set of connected words and senses is produced as output. We define a semantic network as a graph (S, E) where the set S contains synsets (nodes) and E represents a set of semantically connected synset pairs (edges). Algorithm 1 describes how to connect words and senses in a given text (sentence or paragraph) T. First, we gather in a set S_T all candidate synsets of the words (including multiwords up to trigrams) in T (lines 1 to 3). Second, for each candidate synset s we calculate the number of synsets which are connected with s in the semantic network and are included in S_T, excluding connections of synsets which only appear as candidates of the same word (lines 5 to 10). Finally, each word is associated with its top candidate synset(s) according to its/their number of connections in context, provided that the number of connections exceeds a threshold θ = (|S_T| + |T|) / (2δ) (lines 11 to 17). This parameter aims to retain relevant connectivity across senses, as only senses above the threshold will be connected to words in the output corpus.

Algorithm 1: Shallow word-sense connectivity
Input: Semantic network (S, E) and text T represented as a bag of words
Output: Set of connected words and senses T* ⊆ T × S
1:  Set of synsets S_T ← ∅
2:  for each word w ∈ T
3:      S_T ← S_T ∪ S_w   (S_w: set of candidate synsets of w)
4:  Minimum connections threshold θ ← (|S_T| + |T|) / (2δ)
5:  Output set of connections T* ← ∅
6:  for each word w ∈ T
7:      Relative maximum connections max ← 0
8:      Set of senses associated with w, C_w ← ∅
9:      for each candidate synset s ∈ S_w
10:         Number of edges n ← |{s' ∈ S_T : (s, s') ∈ E and ∃ w' ∈ T, w' ≠ w, s' ∈ S_w'}|
11:         if n ≥ θ and n ≥ max then
12:             if n > max then
13:                 C_w ← {(w, s)}
14:                 max ← n
15:             else
16:                 C_w ← C_w ∪ {(w, s)}
17:     T* ← T* ∪ C_w
18: return T*
The threshold θ is inversely proportional to the parameter δ, and directly proportional to the average of the text length and the number of candidate synsets within the text.
The complexity of the proposed algorithm is N + (N × α), where N is the number of words in the training corpus and α is the average polysemy degree of a word in the corpus according to the input semantic network. Considering that non-content words are not taken into account (i.e. polysemy degree 0) and that the average polysemy degree of words in current lexical resources (e.g. WordNet or BabelNet) does not exceed a small constant (3) in any language, we can safely assume that the algorithm is linear in the size of the training corpus. Hence, training time is not significantly increased in comparison to training on words only, irrespective of the corpus size. This enables fast training on large text corpora, in contrast to current unsupervised disambiguation algorithms. Additionally, as we will show in Section 5.2, this algorithm not only speeds up the training phase significantly, but also leads to more accurate results.
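As an illustration, Algorithm 1 can be sketched in plain Python. This is a minimal, unoptimized sketch under assumed data structures (candidate synsets as a dict from word to set of synset ids, and the semantic network as a set of undirected edges); all names are hypothetical, not from the released code:

```python
def connect_words_and_senses(text, candidate_synsets, edges, delta=100):
    """Shallow word-sense connectivity (sketch of Algorithm 1).

    text: list of words (the bag of words T)
    candidate_synsets: dict mapping each word w to its set of candidate synsets S_w
    edges: set of frozensets {s1, s2} representing undirected semantic edges E
    delta: the delta parameter controlling the threshold theta (assumed value)
    """
    # Lines 1-3: gather all candidate synsets of the words in T
    S_T = set()
    for w in text:
        S_T |= candidate_synsets.get(w, set())

    # Line 4: minimum connections threshold theta = (|S_T| + |T|) / (2 * delta)
    theta = (len(S_T) + len(text)) / (2 * delta)

    connections = set()  # the output set T*
    for w in text:
        max_n = 0
        C_w = set()
        for s in candidate_synsets.get(w, set()):
            # Line 10: count synsets connected to s in the network and present
            # in S_T, excluding synsets that only appear as candidates of w itself
            n = sum(
                1 for s2 in S_T
                if frozenset((s, s2)) in edges
                and any(w2 != w and s2 in candidate_synsets.get(w2, set())
                        for w2 in text)
            )
            # Lines 11-16: keep only the top candidate synset(s) above theta
            if n >= theta and n >= max_n:
                if n > max_n:
                    C_w = {(w, s)}
                    max_n = n
                else:
                    C_w.add((w, s))
        connections |= C_w
    return connections
```

On a toy sentence such as ["bank", "river", "water"], with the river sense of "bank" linked to the synsets of the other two words, the algorithm keeps the river sense and discards the unconnected financial sense.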
Note that with our algorithm a word is allowed to have more than one associated sense. In fact, current lexical resources like WordNet (Miller, 1995) or BabelNet (Navigli and Ponzetto, 2012) are hampered by the high granularity of their sense inventories (Hovy et al., 2013). In Section 6.2 we show how our sense embeddings are particularly well suited to deal with this issue.

Joint training of words and senses
The goal of our approach is to obtain a shared vector space of words and senses. To this end, our model extends conventional word embedding models by integrating explicit knowledge into its architecture. While we focus on the Continuous Bag Of Words (CBOW) architecture of word2vec (Mikolov et al., 2013), our extension can similarly be applied to Skip-Gram, or to other predictive approaches based on neural networks. The CBOW architecture is based on the feedforward neural network language model (Bengio et al., 2003) and aims at predicting the current word using its surrounding context. The architecture consists of input, hidden and output layers. The input layer has the size of the word vocabulary and encodes the context as a combination of one-hot vector representations of the surrounding words of a given target word. The output layer has the same size as the input layer and contains a one-hot vector of the target word during the training phase.
Our model extends the input and output layers of the neural network with word senses by exploiting the intrinsic relationship between words and senses. The leading principle is that, since a word is the surface form of an underlying sense, updating the embedding of the word should produce a consequent update to the embedding representing that particular sense, and vice-versa. As a consequence of the algorithm described in the previous section, each word in the corpus may be connected with zero, one or more senses. We refer to the set of senses connected to a given word within the specific context as its associated senses.

Figure 1: The SW2V architecture on a sample training instance using four context words. Dotted lines represent the virtual link between words and associated senses in context. In this example, the input layer consists of a context of two previous words (w_{t-2}, w_{t-1}) and two subsequent words (w_{t+1}, w_{t+2}) with respect to the target word w_t. Two words (w_{t-1}, w_{t+2}) have no associated senses in context, while w_{t-2} has three associated senses (s_{t-2}^1, s_{t-2}^2, s_{t-2}^3) and w_{t+1} has one (s_{t+1}^1). The output layer consists of the target word w_t, which has two associated senses in context (s_t^1, s_t^2).
Formally, we define a training instance as a sequence of words W = w_{t-n}, ..., w_t, ..., w_{t+n} (with w_t the target word), and denote by S_i the sequence of all associated senses in context of w_i ∈ W. Note that S_i might be empty if the word w_i does not have any associated sense. In our model each target word takes as context both its surrounding words and all the senses associated with them. In contrast to the original CBOW architecture, where the training criterion is to correctly classify w_t, our approach aims to predict the word w_t and its set S_t of associated senses. This is equivalent to minimizing the following loss function:

E = -log p(w_t | W^t, S^t) - Σ_{s ∈ S_t} log p(s | W^t, S^t)

where W^t = w_{t-n}, ..., w_{t-1}, w_{t+1}, ..., w_{t+n} and S^t = S_{t-n}, ..., S_{t-1}, S_{t+1}, ..., S_{t+n}. Figure 1 shows the organization of the input and the output layers on a sample training instance. In what follows we present a set of variants of the model on the output and the input layers.
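The joint objective can be illustrated with a minimal forward-pass sketch in numpy. This is not the paper's implementation (which extends word2vec and uses hierarchical softmax); it is a simplified sketch with a plain softmax and made-up dimensions, showing how context words and context senses are combined into the hidden state and how the loss sums the word term and one term per associated target sense:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, n_words, n_senses = 50, 1000, 400  # toy sizes, for illustration only

# Separate embedding matrices for words and senses, trained in the same space
W_in  = rng.normal(scale=0.1, size=(n_words, dim))   # input word embeddings
S_in  = rng.normal(scale=0.1, size=(n_senses, dim))  # input sense embeddings
W_out = rng.normal(scale=0.1, size=(n_words, dim))   # output word weights
S_out = rng.normal(scale=0.1, size=(n_senses, dim))  # output sense weights

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def joint_cbow_loss(ctx_words, ctx_senses, target_word, target_senses):
    """E = -log p(w_t | W^t, S^t) - sum over s in S_t of log p(s | W^t, S^t)."""
    # Hidden state: average of context word and context sense embeddings
    vecs = [W_in[w] for w in ctx_words] + [S_in[s] for s in ctx_senses]
    h = np.mean(vecs, axis=0)
    # Plain softmax here for clarity; the paper uses hierarchical softmax
    p_w = softmax(W_out @ h)
    p_s = softmax(S_out @ h)
    loss = -np.log(p_w[target_word])
    for s in target_senses:
        loss -= np.log(p_s[s])
    return loss
```

Note how senses add extra terms to the loss without changing the basic CBOW machinery, which is what makes the extension applicable to Skip-Gram and other predictive models as well.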

Output layer alternatives
Both words and senses. This is the default case explained above. If a word has one or more associated senses, these senses are also used as target on a separate output layer.
Only words. In this case we exclude senses as target. There is a single output layer with the size of the word vocabulary as in the original CBOW model.
Only senses. In contrast, this alternative excludes words, using only senses as target. In this case, if a word does not have any associated sense, it is not used as target instance.

Input layer alternatives
Both words and senses. Words and their associated senses are included in the input layer and contribute to the hidden state. Both words and senses are updated as a consequence of the backpropagation algorithm.
Only words. In this alternative only the surrounding words contribute to the hidden state, i.e. the target word/sense (depending on the alternative of the output layer) is predicted only from word features. The update of an input word is propagated to the embeddings of its associated senses, if any. In other words, despite not being included in the input layer, senses still receive the same gradient of the associated input word, through a virtual connection. This configuration, coupled with the only-words output layer configuration, corresponds exactly to the default CBOW architecture of word2vec with the only addition of the update step for senses.
Only senses. Words are excluded from the input layer and the target is predicted only from the senses associated with the surrounding words. The weights of the words are updated through the updates of the associated senses, in contrast to the only-words alternative.
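The "virtual connection" used by the only-words input alternative can be sketched as a post-hoc update step: senses are absent from the input layer, yet each one receives the same gradient as its associated input word. The function below is an illustrative sketch (names, shapes and learning rate are assumptions, not the paper's code):

```python
import numpy as np

def virtual_sense_update(word_vecs, sense_vecs, word_grads, assoc, lr=0.025):
    """Propagate each input word's gradient to its associated senses.

    word_grads: dict word_id -> gradient vector for that input word.
    assoc: dict word_id -> list of sense ids associated with it in this context.
    Senses are not part of the input layer, but receive the same gradient
    as their word through a 'virtual connection'.
    """
    for w, grad in word_grads.items():
        word_vecs[w] -= lr * grad
        for s in assoc.get(w, []):
            # Identical update, as in the only-words input variant
            sense_vecs[s] -= lr * grad
    return word_vecs, sense_vecs
```

With the only-words output configuration this reduces exactly to standard CBOW plus this extra sense-update loop, which is why the variant is described as the default word2vec architecture with an added update step for senses.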

Analysis of Model Components
In this section we analyze the different components of SW2V, including the nine model configurations (Section 5.1) and the algorithm which generates the connections between words and senses in context (Section 5.2). In what follows we describe the common analysis setting:

• Training model and hyperparameters. For evaluation purposes, we use the CBOW model of word2vec with standard hyperparameters: the dimensionality of the vectors is set to 300, the window size to 8, and hierarchical softmax is used for normalization. These hyperparameter values are set across all experiments.
• Corpus and semantic network. We use a 300M-words corpus from the UMBC project (Han et al., 2013), which contains English paragraphs extracted from the web. As semantic network we use BabelNet 3.0, a large multilingual semantic network with over 350 million semantic connections, integrating resources such as Wikipedia and WordNet. We chose BabelNet owing to its wide coverage of named entities and lexicographic knowledge.
• Benchmark. Word similarity has been one of the most popular benchmarks for in-vitro evaluation of vector space models (Pennington et al., 2014;Levy et al., 2015). For the analysis we use two word similarity datasets: the similarity portion (Agirre et al., 2009, WS-Sim) of the WordSim-353 dataset (Finkelstein et al., 2002) and RG-65 (Rubenstein and Goodenough, 1965). In order to compute the similarity of two words using our sense embeddings, we apply the standard closest senses strategy (Resnik, 1995;Budanitsky and Hirst, 2006;Camacho-Collados et al., 2015), using cosine similarity (cos) as comparison measure between senses:

sim(w_1, w_2) = max_{s_1 ∈ S_{w_1}, s_2 ∈ S_{w_2}} cos(s_1, s_2)    (1)

where S_{w_i} represents the set of all candidate senses of w_i and s_i refers to the sense vector representation of the sense s_i.
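The closest senses strategy of Equation (1) is straightforward to implement; a minimal sketch, assuming each word's candidate senses are given as a list of numpy vectors:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def closest_senses_similarity(senses_w1, senses_w2):
    """sim(w1, w2): maximum cosine similarity over all candidate sense pairs.

    senses_w1 / senses_w2: non-empty lists of sense vectors for each word.
    """
    return max(cos(s1, s2) for s1 in senses_w1 for s2 in senses_w2)
```

Because only the best-matching sense pair counts, unrelated senses of polysemous words do not penalize the score, which is the point of the closest senses strategy.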

Model configurations
In this section we analyze the different configurations of our model with respect to the input and output layers on a word similarity experiment.
Recall from Section 4 that our model can have words, senses or both in either the input or the output layer. Table 1 shows the results of all nine configurations on the WS-Sim and RG-65 datasets. As shown in Table 1, the best configuration according to both Spearman and Pearson correlation measures is the one which has only senses in the input layer and both words and senses in the output layer. In fact, taking only senses as input seems to be consistently the best alternative for the input layer. Our hunch is that the knowledge learned from both the co-occurrence information and the semantic network is more balanced with this input setting. For instance, when both words and senses are included in the input layer, the co-occurrence information learned by the network is duplicated for both words and senses.

Disambiguation / Shallow word-sense connectivity algorithm
In this section we evaluate the impact of our shallow word-sense connectivity algorithm (Section 3) by testing our model directly taking a pre-disambiguated text as input. In this case the network exploits the connections between each word and its disambiguated sense in context. For this comparison we used Babelfy (Moro et al., 2014), a state-of-the-art graph-based disambiguation and entity linking system based on BabelNet. We compare to both the default Babelfy system, which uses the Most Common Sense (MCS) heuristic as a back-off strategy, and, following Iacobacci et al. (2015), a version in which only instances above the Babelfy default confidence threshold are disambiguated (i.e. the MCS back-off strategy is disabled). We will refer to this latter version as Babelfy* and report the best configuration of each strategy according to our analysis. (Note: in this analysis we used the word similarity task for optimizing the sense embeddings, without considering the performance of word embeddings or their interconnectivity. Therefore, this configuration may not be optimal for word embeddings and may be further tuned for specific applications; more information about the different configurations can be found in the documentation of the source code.)

Table 2: Pearson (r) and Spearman (ρ) correlation performance of SW2V integrating our shallow word-sense connectivity algorithm (default), Babelfy, or Babelfy*.

Table 2 shows the results of our model using the three different strategies on RG-65 and WS-Sim. Our shallow word-sense connectivity algorithm achieves the best overall results. We believe that these results are due to the semantic connectivity ensured by our algorithm and to the possibility of associating words with more than one sense, which seems beneficial for training, making it more robust to possible disambiguation errors and to the sense granularity issue (Erk et al., 2013). The results are especially significant considering that our algorithm took a tenth of the time needed by Babelfy to process the corpus.

Evaluation
We perform a qualitative and quantitative evaluation of important features of SW2V in three different tasks. First, in order to compare our model against standard word-based approaches, we evaluate our system in the word similarity task (Section 6.1). Second, we measure the quality of our sense embeddings in a sense-specific application: sense clustering (Section 6.2). Finally, we evaluate the coherence of our unified vector space by measuring the interconnectivity of word and sense embeddings (Section 6.3).
Experimental setting. Throughout all the experiments we use the same standard hyperparameters mentioned in Section 5 for both the original word2vec implementation and our proposed model SW2V. For SW2V we use, for all tasks, the optimal configuration according to the analysis of the previous section (only senses as input, and both words and senses as output). As training corpus we take the full 3B-words UMBC webbase corpus and the English Wikipedia (dump of November 2014), used by three of the comparison systems. We use BabelNet 3.0 (SW2V_BN) and WordNet 3.0 (SW2V_WN) as semantic networks.

Word Similarity
In this section we evaluate our sense representations on the standard SimLex-999 (Hill et al., 2015) and MEN (Bruni et al., 2014) word similarity datasets. SimLex and MEN contain 999 and 3000 word pairs, respectively, which constitute, to our knowledge, the two largest similarity datasets comprising a balanced set of noun, verb and adjective instances. As explained in Section 5, we use the closest senses strategy for the word similarity measurement of our model and all sense-based comparison systems. As regards the word embedding models, words are directly compared by using cosine similarity. We also include a retrofitted version of the original word2vec word vectors (Faruqui et al., 2015, Retrofitting), using WordNet (Retrofitting_WN) and BabelNet (Retrofitting_BN) as lexical resources. Table 3 shows the results of SW2V and all comparison models on SimLex and MEN. SW2V consistently outperforms all sense-based comparison systems using the same corpus, and clearly performs better than the original word2vec trained on the same corpus. Retrofitting decreases the performance of the original word2vec on the Wikipedia corpus using BabelNet as lexical resource, but significantly improves the original word vectors on the UMBC corpus, obtaining results comparable to our approach. However, while our approach provides a shared space of words and senses, Retrofitting still conflates different meanings of a word into the same vector.
Additionally, we noticed that most of the score divergences between our system and the gold standard scores in SimLex-999 were produced on antonym pairs, which are over-represented in this dataset: 38 word pairs hold a clear antonymy relation (e.g. encourage-discourage or long-short), while 41 additional pairs hold some degree of antonymy (e.g. new-ancient or man-woman). In contrast to the consistently low gold similarity scores given to antonym pairs, our system varies its similarity scores depending on the specific nature of the pair. Recent works have managed to obtain significant improvements by tweaking usual word embedding approaches into providing low similarity scores for antonym pairs (Pham et al., 2015;Schwartz et al., 2015;Nguyen et al., 2016;Mrksic et al., 2017), but this is outside the scope of this paper. (The Retrofitting implementation is available at https://github.com/mfaruqui/retrofitting.)

Sense Clustering
Current lexical resources tend to suffer from the high granularity of their sense inventories. In fact, a meaningful clustering of their senses may lead to improvements on downstream tasks (Hovy et al., 2013;Flekova and Gurevych, 2016;Pilehvar et al., 2017). In this section we evaluate our synset representations on the Wikipedia sense clustering task. For a fair comparison with respect to the BabelNet-based comparison systems, we use the sense clustering datasets of Dandala et al. (2013). In these datasets sense clustering is viewed as a binary classification task in which, given a pair of Wikipedia pages, the system has to decide whether to cluster them into a single instance or not. To this end, we use our synset embeddings and cluster Wikipedia pages together if their similarity exceeds a threshold γ. In order to set the optimal value of γ, we follow Dandala et al. (2013) and use the first 500-pair sense clustering dataset for tuning. We set the threshold γ to 0.35, which is the value leading to the highest F-Measure among all values from 0 to 1 with a 0.05 step size on the 500-pair dataset. Likewise, we set a threshold for the NASARI (0.7) and SensEmbed (0.3) comparison systems. Finally, we evaluate our approach on the SemEval sense clustering test set. This test set consists of 925 pairs which were obtained from a set of highly ambiguous words gathered from past SemEval tasks. For comparison, we also include the supervised approach of Dandala et al. (2013) based on a multi-feature Support Vector Machine classifier trained on an automatically-labeled dataset of the English Wikipedia (Mono-SVM) and Wikipedia in four different languages (Multi-SVM). As a naive baseline we include the system which clusters all given pairs. Table 4 shows the F-Measure and accuracy results on the SemEval sense clustering dataset. SW2V outperforms all comparison systems according to both measures, including the sense representations of NASARI and SensEmbed using the same setup and the same underlying lexical resource. This confirms the capability of our system to accurately capture the semantics of word senses on this sense-specific task.
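The clustering decision and the tuning of γ described above can be sketched as follows. This is a simplified illustration under assumed inputs (synset vectors as numpy arrays, a tuning set of labeled pairs); the function names and the grid-search loop are assumptions, not the paper's code:

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two dense vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def should_cluster(synset_vec_a, synset_vec_b, gamma=0.35):
    """Cluster two Wikipedia pages if their synset-embedding similarity exceeds gamma."""
    return cos(synset_vec_a, synset_vec_b) > gamma

def tune_gamma(pairs, labels, step=0.05):
    """Pick the gamma maximizing F-Measure on a tuning set of (vec_a, vec_b) pairs."""
    best_gamma, best_f1 = 0.0, -1.0
    for gamma in np.arange(0.0, 1.0 + 1e-9, step):
        tp = fp = fn = 0
        for (a, b), y in zip(pairs, labels):
            pred = should_cluster(a, b, gamma)
            tp += pred and y
            fp += pred and not y
            fn += (not pred) and y
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        if f1 > best_f1:
            best_gamma, best_f1 = gamma, f1
    return best_gamma
```

The same grid search over 0 to 1 with step 0.05 is how the thresholds for NASARI and SensEmbed would be set on the 500-pair tuning set.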

Word and sense interconnectivity
In the previous experiments we evaluated the effectiveness of the sense embeddings. In contrast, this experiment aims at testing the interconnectivity between word and sense embeddings in the vector space. As explained in Section 2, there have been previous approaches building a shared space of word and sense embeddings, but to date little research has focused on testing the semantic coherence of the vector space. To this end, we evaluate our model on a Word Sense Disambiguation (WSD) task, using our shared vector space of words and senses to obtain a Most Common Sense (MCS) baseline. The insight behind this experiment is that a semantically coherent shared space of words and senses should yield a relatively strong baseline for the task, as the MCS of a given word should be closer to the word vector than any other sense. The MCS baseline is generally integrated into the pipeline of state-of-the-art WSD and Entity Linking systems as a back-off strategy (Navigli, 2009;Jin et al., 2009;Zhong and Ng, 2010;Moro et al., 2014;Raganato et al., 2017) and is used in various NLP applications (Bennett et al., 2016). Therefore, a system which automatically identifies the MCS of words from non-annotated text may be quite valuable, especially for resource-poor languages or large knowledge resources for which obtaining sense-annotated corpora is extremely expensive. Moreover, even in a resource like WordNet, for which sense-annotated data is available (Miller et al., 1993, SemCor), 61% of its polysemous lemmas have no sense annotations (Bennett et al., 2016).
Given an input word w, we compute the cosine similarity between w and all its candidate senses, picking the sense leading to the highest similarity:

MCS(w) = argmax_{s ∈ S_w} cos(w, s)    (2)

where cos(w, s) refers to the cosine similarity between the embeddings of w and s. In order to assess the reliability of SW2V against previous models using WordNet as sense inventory, we test our model on the all-words SemEval-2007 (task 17) (Pradhan et al., 2007) and SemEval-2013 (task 12) WSD datasets. Note that our model using BabelNet as semantic network has a far larger coverage than just WordNet and may additionally be used for Wikification (Mihalcea and Csomai, 2007) and Entity Linking tasks.
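This MCS strategy amounts to one argmax over cosine similarities in the shared space; a minimal sketch, assuming candidate sense vectors are stored in a dict keyed by sense id:

```python
import numpy as np

def most_common_sense(word_vec, candidate_sense_vecs):
    """MCS(w): the candidate sense whose embedding is closest to the word embedding.

    word_vec: embedding of the input word w.
    candidate_sense_vecs: dict mapping sense id -> sense embedding (S_w).
    """
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    # argmax over candidate senses of cos(w, s)
    return max(candidate_sense_vecs,
               key=lambda s: cos(word_vec, candidate_sense_vecs[s]))
```

Note that this only works as a disambiguation baseline if words and senses live in one coherent space; with disjoint spaces the cosine between a word and a sense vector would be meaningless.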
Since the versions of WordNet vary across datasets and comparison systems, we decided to evaluate the systems on the portion of the datasets covered by all comparison systems (less than 10% of instances were removed from each dataset). Table 5 shows the results of our system and AutoExtend on the SemEval-2007 and SemEval-2013 WSD datasets. SW2V provides the best MCS results on both datasets. In general, AutoExtend does not accurately capture the predominant sense of a word and performs worse than a baseline that selects the intended sense randomly from the set of all possible senses of the target word.
In fact, AutoExtend tends to create clusters which include a word and all its possible senses. As an example, Table 6 shows the closest word and sense embeddings of our SW2V model and AutoExtend to the military and fish senses of, respectively, company and school. AutoExtend creates clusters with all the senses of company and school and their related instances, even if they belong to different domains (e.g., firm_n^2 or business_n^1 clearly concern the business sense of company). Instead, SW2V creates a semantic cluster of word and sense embeddings which are semantically close to the corresponding company_n^2 and school_n^7 senses.

Conclusion and Future Work
In this paper we proposed SW2V (Senses and Words to Vectors), a neural model which learns vector representations for words and senses in a joint training phase by exploiting both text corpora and knowledge from semantic networks. Data (including the preprocessed corpora and pre-trained embeddings used in the evaluation) and source code to apply our extension of the word2vec architecture to learn word and sense embeddings from any preprocessed corpus are freely available at http://lcl.uniroma1.it/sw2v. Unlike previous sense-based models which require post-processing steps and use WordNet as sense inventory, our model achieves a semantically coherent vector space of both words and senses as an emerging feature of a single training phase and is easily scalable to larger semantic networks like BabelNet. Finally, we showed, both quantitatively and qualitatively, some of the advantages of using our approach against previous state-of-the-art word- and sense-based models in various tasks, and highlighted interesting semantic properties of the resulting unified vector space of word and sense embeddings. As future work we plan to integrate a WSD and Entity Linking system for applying our model to downstream NLP applications, along the lines of Pilehvar et al. (2017). We are also planning to apply our model to languages other than English and to study its potential in multilingual and cross-lingual applications.

Table 6: Ten closest word and sense embeddings to the senses company_n^2 (military unit) and school_n^7 (group of fish).

Notes: We were unable to obtain the word embeddings of one of the comparison systems, even after contacting the authors. Following Navigli (2009), word_p^n denotes the n-th sense of word with part of speech p (using WordNet 3.0).