Beyond Bilingual: Multi-sense Word Embeddings using Multilingual Context

Word embeddings, which represent a word as a point in a vector space, have become ubiquitous in several NLP tasks. A recent line of work uses bilingual (two-language) corpora to learn a different vector for each sense of a word, by exploiting crosslingual signals to aid sense identification. We present a multi-view Bayesian non-parametric algorithm which improves multi-sense word embeddings by (a) using multilingual (i.e., more than two languages) corpora to significantly improve sense embeddings beyond what one achieves with bilingual information, and (b) using a principled approach to learn a variable number of senses per word, in a data-driven manner. Ours is the first approach with the ability to leverage multilingual corpora efficiently for multi-sense representation learning. Experiments show that multilingual training significantly improves performance over monolingual and bilingual training, by allowing us to combine different parallel corpora to leverage multilingual context. Multilingual training yields comparable performance to a state-of-the-art monolingual model trained on five times more training data.


Introduction
Word embeddings (Turian et al., 2010; Mikolov et al., 2013, inter alia) represent a word as a point in a vector space. This space is able to capture semantic relationships: vectors of words with similar meanings have high cosine similarity (Turney, 2006; Turian et al., 2010). Embeddings used as features have been shown to benefit several NLP tasks and to serve as good initializations for deep architectures, ranging from dependency parsing (Bansal et al., 2014) to named entity recognition (Guo et al., 2014b).
Although these representations are now ubiquitous in NLP, most algorithms for learning word embeddings do not allow a word to have different meanings in different contexts, a phenomenon known as polysemy. For example, the word bank assumes different meanings in financial (e.g. "bank pays interest") and geographical (e.g. "river bank") contexts, which cannot be represented adequately with a single embedding vector. Unfortunately, no large sense-tagged corpora are available, so such polysemy must be inferred from the data during the embedding process. Several attempts (Reisinger and Mooney, 2010; Neelakantan et al., 2014; Li and Jurafsky, 2015) have been made to infer multi-sense word representations by modeling the sense as a latent variable in a Bayesian non-parametric framework. These approaches rely on the "one sense per collocation" heuristic (Yarowsky, 1995), which assumes that the presence of nearby words correlates with the sense of the word of interest. This heuristic provides only a weak signal for sense identification, and such algorithms require large amounts of training data to achieve competitive performance.
Recently, several approaches (Guo et al., 2014a; Šuster et al., 2016) have proposed to learn multi-sense embeddings by exploiting the fact that different senses of the same word may be translated into different words in a foreign language (Dagan and Itai, 1994; Resnik and Yarowsky, 1999; Diab and Resnik, 2002; Ng et al., 2003). For example, bank in English may be translated to banc or banque in French, depending on whether the sense is financial or geographical. Such bilingual distributional information allows the model to identify which sense of a word is being used during training.
However, bilingual distributional signals often do not suffice: it is common for the polysemy of a word to survive translation. Fig. 1 shows an illustrative example: both senses of interest are translated to intérêt in French. However, this becomes much less likely as the number of languages under consideration grows. Looking at the Chinese translations in Fig. 1, we can observe that the two senses translate to different surface forms. Note that the opposite can also happen (i.e., the same surface form in Chinese, but different ones in French). Existing crosslingual approaches are inherently bilingual and cannot naturally extend to include additional languages due to several limitations (details in Section 4). Furthermore, work like that of Šuster et al. (2016) sets a fixed number of senses for each word, leading to inefficient use of parameters and unnecessary model complexity. 1

This paper addresses these limitations by proposing a multi-view Bayesian non-parametric word representation learning algorithm which leverages multilingual distributional information. Our representation learning framework is the first multilingual (not merely bilingual) approach, allowing us to utilize arbitrarily many languages to disambiguate words in English. To move to a multilingual system, it is necessary to ensure that the embeddings of each foreign language are relatable to each other (i.e., they live in the same space). We solve this by proposing an algorithm in which word representations are learned jointly across languages, using English as a bridge. While large parallel corpora between two languages are scarce, our approach lets us concatenate multiple parallel corpora to obtain a large multilingual corpus. The parameters are estimated in a Bayesian non-parametric framework that associates a word with a new sense vector only when evidence (from either same-language or foreign-language context) requires it.
As a result, the model infers a different number of senses for each word in a data-driven manner, avoiding wasted parameters.
Together, these two ideas (multilingual distributional information and non-parametric sense modeling) allow us to disambiguate multiple senses using far less data than previous methods require. We experimentally demonstrate that our algorithm can achieve competitive performance after training on a small multilingual corpus, comparable to a model trained monolingually on a much larger corpus. We present an analysis discussing the effect of various parameters (choice of language family for deriving the multilingual signal, crosslingual window size, etc.) and also show qualitative improvement in the embedding space.

Related Work
Work on inducing multi-sense embeddings can be divided into two broad categories: two-staged approaches and joint learning approaches. Two-staged approaches (Reisinger and Mooney, 2010; Huang et al., 2012) induce multi-sense embeddings by first clustering the contexts and then using the clustering to obtain the sense vectors. The contexts can be topics induced using latent topic models (Liu et al., 2015a,b), Wikipedia (Wu and Giles, 2015), or coarse part-of-speech tags (Qiu et al., 2014). A more recent line of work in the two-staged category is that of retrofitting (Faruqui et al., 2015), which aims to infuse semantic ontologies from resources like WordNet (Miller, 1995) and FrameNet (Baker et al., 1998) into embeddings during a post-processing step. Such resources list (albeit not exhaustively) the senses of a word, and by retrofitting it is possible to tease apart the different senses of a word. While some resources like WordNet (Miller, 1995) are available for many languages, they are not exhaustive in listing all possible senses. Indeed, the number of senses of a word is highly dependent on the task and cannot be pre-determined using a lexicon (Kilgarriff, 1997). Ideally, the senses should be inferred in a data-driven manner, so that new senses not listed in such lexicons can be discovered. While recent work has attempted to remedy this by using parallel text for retrofitting sense-specific embeddings (Ettinger et al., 2016), their procedure requires the creation of sense graphs, which introduces additional tuning parameters. On the other hand, our approach only requires two tuning parameters (the prior α and the maximum number of senses T).
In contrast, joint learning approaches (Neelakantan et al., 2014; Li and Jurafsky, 2015) jointly learn the sense clusters and embeddings using non-parametrics. Our approach belongs to this category. The closest non-parametric approach to ours is that of Bartunov et al. (2016), who proposed a multi-sense variant of the skip-gram model which learns a different number of sense vectors for each word from a large monolingual corpus (e.g., English Wikipedia). Our work can be viewed as the multi-view extension of their model which leverages both monolingual and crosslingual distributional signals for learning the embeddings. In our experiments, we compare our model to a monolingually trained version of their model.
Incorporating crosslingual distributional information is a popular technique for learning word embeddings, and improves performance on several downstream tasks (Faruqui and Dyer, 2014; Guo et al., 2016; Upadhyay et al., 2016). However, there has been little work on learning multi-sense embeddings using crosslingual signals (Bansal et al., 2012; Guo et al., 2014a; Šuster et al., 2016), with only Šuster et al. (2016) being a joint approach. Kawakami and Dyer (2015) also used bilingual distributional signals in a deep neural architecture to learn context-dependent representations for words, though they do not learn separate sense vectors.

Model Description
Let x^e = {x^e_1, ..., x^e_{N_e}} denote the words of the English side and x^f = {x^f_1, ..., x^f_{N_f}} denote the words of the foreign side of the parallel corpus. We assume that we have access to word alignments A_{e→f} and A_{f→e} mapping words in an English sentence to their translations in the foreign sentence (and vice versa), so that an aligned pair (x^e, x^f) are translations of each other. We define Nbr(x, L, d) as the neighborhood in language L of size d (on either side) around word x in its sentence. The English and foreign neighboring words are denoted by y^e and y^f, respectively. Note that y^e and y^f need not be translations of each other. Each word x^f in the foreign vocabulary is associated with a dense vector x^f in R^m, and each word x^e in the English vocabulary admits at most T sense vectors, with the k-th sense vector denoted as x^e_k. 2 As our main goal is to model multiple senses for words in English, we do not model polysemy in the foreign language and use a single vector to represent each word in the foreign vocabulary.
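To make the notation concrete, here is a minimal sketch of the neighborhood function Nbr(x, L, d) and the alignment map A_{e→f}. The toy sentences, function names, and data layout are illustrative, not from the paper:

```python
# Sketch of Nbr(x, L, d) and word alignments, assuming sentences are token
# lists and alignments are dicts mapping token positions across the pair.

def nbr(sentence, i, d):
    """Return up to d words on either side of position i (the word itself excluded)."""
    left = sentence[max(0, i - d):i]
    right = sentence[i + 1:i + 1 + d]
    return left + right

# A toy aligned En-Fr sentence pair; A_e2f maps English positions to
# foreign positions.
en = ["the", "bank", "pays", "interest"]
fr = ["la", "banque", "paie", "des", "interets"]
A_e2f = {0: 0, 1: 1, 2: 2, 3: 4}

i = 1                          # English word "bank"
y_e = nbr(en, i, d=2)          # monolingual context y^e
x_f = fr[A_e2f[i]]             # aligned translation x^f = "banque"
y_f = nbr(fr, A_e2f[i], d=1)   # crosslingual context y^f around the translation
```

Note that y_e and y_f are gathered independently on each side, which is why they need not be translations of each other.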
We model the joint conditional distribution of the context words y^e, y^f given an English word x^e and its corresponding translation x^f on the parallel corpus:

P(y^e, y^f | x^e, x^f; θ, α)   (1)

where θ are the model parameters (i.e., all embeddings) and α governs the hyper-prior on latent senses. Assuming that x^e has multiple senses, indexed by the random variable z, Eq. (1) can be rewritten as

P(y^e, y^f | x^e, x^f; θ, α) = Σ_k P(y^e, y^f | x^e, x^f, z = k; θ) P(z = k | x^e, β)   (2)

where β are the parameters determining the model probability of each sense for x^e (i.e., the weight on each possible value of z). We place a Dirichlet process (Ferguson, 1973) prior on the sense assignment for each word. Thus, adding the word subscript x to emphasize that these are word-specific senses,

β_{xk} = β'_{xk} ∏_{r<k} (1 - β'_{xr}),   β'_{xk} ~ Beta(1, α)   (3)

That is, the potentially infinite number of senses for each word x have probabilities determined by the sequence of independent stick-breaking weights β_{xk} in the constructive definition of the DP (Sethuraman, 1994). The hyper-prior concentration α encodes how many senses we expect to observe in our corpus. After conditioning upon the word sense, we decompose the context probability as

P(y^e, y^f | x^e, x^f, z) = P(y^e | x^e, x^f, z) P(y^f | x^e, x^f, z)   (4)

Both the first and the second terms are sense-dependent, and each factors as

P(y | x^e, x^f, z = k) ∝ Ψ(x^e, z = k, y) Ψ(x^f, y)

where x^e_k is the embedding corresponding to the k-th sense of the word x^e, and y is either y^e or y^f. The factor Ψ(x^e, z = k, y) uses the corresponding sense vector in a skip-gram-like formulation. This results in a total of 4 factors: Ψ(x^e, z, y^e), Ψ(x^e, z, y^f), Ψ(x^f, y^e), and Ψ(x^f, y^f). See Figure 2 for an illustration of each factor. This modeling approach is reminiscent of Luong et al. (2015), who jointly learned embeddings for two languages l_1 and l_2 by optimizing a joint objective containing 4 skip-gram terms for the aligned pair (x^e, x^f): two predicting monolingual contexts l_1 → l_1, l_2 → l_2, and two predicting crosslingual contexts l_1 → l_2, l_2 → l_1.
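The stick-breaking construction behind the DP prior can be sketched as a standard truncated simulation: each β'_k ~ Beta(1, α) takes a fraction of the remaining stick. This is a generic illustration of the prior, not the paper's inference code:

```python
# Truncated stick-breaking draw: beta_k = beta'_k * prod_{r<k} (1 - beta'_r),
# with beta'_k ~ Beta(1, alpha). A small alpha concentrates mass on few senses.
import random

def stick_breaking(alpha, T, rng):
    """Draw truncated stick-breaking weights for at most T senses."""
    weights, remaining = [], 1.0
    for _ in range(T):
        b = rng.betavariate(1.0, alpha)   # beta'_k ~ Beta(1, alpha)
        weights.append(b * remaining)     # beta_k takes a slice of the leftover stick
        remaining *= (1.0 - b)
    return weights

rng = random.Random(0)
beta = stick_breaking(alpha=0.1, T=10, rng=rng)
# With a small concentration alpha, nearly all mass falls on the first few senses.
```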
Learning. Learning involves maximizing the log-likelihood log P(y^e, y^f | x^e, x^f; θ, α), for which we use a variational lower bound with the fully factorized approximation

q(z, β) = ∏_i q(z_i) ∏_{w=1}^{V} ∏_{k=1}^{T} q(β_{wk})   (5)

of the true posterior P(z, β | y^e, y^f, x^e, x^f, α), where V is the size of the English vocabulary and T is the maximum number of senses for any word. The optimization problem solves for θ, q(z) and q(β) using the stochastic variational inference technique (Hoffman et al., 2013), similar to Bartunov et al. (2016) (see that work for details).
The resulting learning algorithm is shown as Algorithm 1. The first for-loop (line 1) updates the English sense vectors using the crosslingual and monolingual contexts. First, the expected sense distribution for the current English word w is computed using the current estimate of q(β) (line 4). The sense distribution is updated (line 7) using the combined monolingual and crosslingual contexts (Eq. (4)) collected in line 5, and re-normalized (line 8). Using the updated sense distribution, the sufficient statistics of q(β) are re-computed (line 9) and the global parameters θ are updated (line 10) as follows:

θ ← θ + ρ_t Σ_{k : q(z_i = k) > ε} q(z_i = k) ∇_θ log P(y^e, y^f | x^e, x^f, z_i = k; θ)   (6)

Note that in the above sum, a sense participates in an update only if its probability exceeds a threshold ε (= 0.001). The final model retains only those sense vectors whose sense probability exceeds the same threshold. The last for-loop (line 11) jointly optimizes the foreign embeddings using the English context with standard skip-gram updates.

[Figure 2: Illustration of the four factors in Eq. (4). Each sense vector of interest (here the 2nd) is picked to perform the weighted update. We only model polysemy in English.]
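The thresholded, responsibility-weighted update can be sketched as follows. This is our own simplified illustration: the plain dot-product step stands in for the actual skip-gram gradient, and all values are toy data:

```python
# Sketch of the thresholded sense update: each sense vector receives a
# gradient step weighted by its posterior responsibility q(z=k), but only
# when that responsibility exceeds a small threshold.

THRESHOLD = 0.001

def update_sense_vectors(sense_vecs, q_z, context_vec, lr=0.025):
    """In-place weighted update of each retained sense vector toward the context."""
    for vec, resp in zip(sense_vecs, q_z):
        if resp <= THRESHOLD:          # sense does not participate in the update
            continue
        for j in range(len(vec)):      # step scaled by learning rate and responsibility
            vec[j] += lr * resp * context_vec[j]
    return sense_vecs

vecs = [[0.0, 0.0], [0.0, 0.0]]
q_z = [0.9995, 0.0005]                 # second sense falls below the threshold
update_sense_vectors(vecs, q_z, context_vec=[1.0, 2.0], lr=0.1)
```

Here the second sense vector is left untouched, mirroring how low-probability senses are eventually dropped from the final model.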
Disambiguation. Similar to Bartunov et al. (2016), we disambiguate the sense of the word x^e given a monolingual context y^e as

z* = argmax_k E_{q(β)}[P(z = k | x^e, β)] ∏_{y ∈ y^e} P(y | x^e, z = k; θ)   (7)

Although the model trains embeddings using both monolingual and crosslingual contexts, we use only the monolingual context at test time. We found that as long as the model has been trained with multilingual context, it performs well in sense disambiguation on new data even if that data contains only monolingual context. A similar observation was made by Šuster et al. (2016).
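A hedged sketch of this disambiguation rule, with unnormalized log scores standing in for the exact probabilities and toy two-dimensional vectors:

```python
# Pick the sense maximizing (log prior) + sum of skip-gram-like scores of the
# monolingual context words under that sense's vector.
import math

def disambiguate(sense_vecs, sense_priors, context_vecs):
    """Return the index of the most probable sense given context word vectors."""
    best_k, best_score = 0, -math.inf
    for k, (vec, prior) in enumerate(zip(sense_vecs, sense_priors)):
        score = math.log(prior)
        for y in context_vecs:        # accumulate dot-product context scores
            score += sum(v * u for v, u in zip(vec, y))
        if score > best_score:
            best_k, best_score = k, score
    return best_k

financial = [1.0, 0.0]                # toy sense vectors for "bank"
river = [0.0, 1.0]
context = [[0.9, 0.1], [0.8, 0.0]]    # context resembling financial words
k = disambiguate([financial, river], [0.5, 0.5], context)
```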

Multilingual Extension
Bilingual distributional signal alone may not be sufficient, as polysemy may survive translation in the second language. Unlike existing approaches, we can easily incorporate multilingual distributional signals into our model. To use languages l_1 and l_2 to learn multi-sense embeddings for English, we train on the concatenation of an En-l_1 parallel corpus with an En-l_2 parallel corpus. This technique can easily be generalized to more than two foreign languages to obtain a large multilingual corpus.

[Algorithm 1: training procedure. For each English token x^e_i (line 1), set w ← x^e_i (line 2); for k = 1 to T, compute the expected sense probabilities from q(β) (lines 3-4); collect the combined context y_c (line 5); for each y in y_c, apply SENSE-UPDATE(x^e_i, y, z_i) (lines 6-7); renormalize z_i using a softmax (line 8); update the sufficient statistics for q(β) as in Bartunov et al. (2016) (line 9); update θ using Eq. (6) (line 10). Then, for i = 1 to N_f, jointly update the foreign vectors with skip-gram updates over each y in y_c (lines 11-14).]
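The corpus concatenation described above amounts to streaming training pairs from several parallel corpora in turn, tagged with the foreign language so the correct foreign vocabulary is used. A minimal sketch with illustrative names and toy data:

```python
# Concatenate an En-Fr and an En-Es parallel corpus into one multilingual
# training stream; each example carries its foreign-language tag.

def multilingual_stream(corpora):
    """Yield (english_sentence, foreign_sentence, language) training triples."""
    for lang, pairs in corpora.items():
        for en_sent, fg_sent in pairs:
            yield en_sent, fg_sent, lang

corpora = {
    "fr": [(["the", "bank"], ["la", "banque"])],
    "es": [(["the", "bank"], ["el", "banco"])],
}
examples = list(multilingual_stream(corpora))
```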
Value of Ψ(y^e, x^f). The factor modeling the dependence of the English context word y^e on the foreign word x^f is crucial to performance when using multiple languages. Consider the case of using French and Spanish contexts to disambiguate the financial sense of the English word bank. In this case, the (financial) sense vector of bank will be used to predict the vector of banco (Spanish context) and banque (French context). If the vectors for banco and banque do not reside in the same space or are not close, the model will incorrectly assume they are different contexts and introduce a new sense for bank. This is precisely why bilingual models like that of Šuster et al. (2016) cannot be extended to the multilingual setting: they pre-train the embeddings of the second language before running the multi-sense embedding process. As a result of such naive pre-training, the French and Spanish vectors of semantically similar pairs like (banco, banque) will lie in different spaces and need not be close. A similar reason holds for Guo et al. (2014a), as they use a two-step approach instead of joint learning.
To avoid this, the vectors for pairs like banco and banque should lie in the same space, close to each other and to the sense vector for bank. The Ψ(y^e, x^f) term attempts to ensure this by using the vectors for banco and banque to predict the vector of bank. This way, the model brings the embedding spaces for Spanish and French closer by using English as a bridge language during joint training. A similar idea of using English as a bridging language was used in the models proposed by Hermann and Blunsom (2014) and Coulmance et al. (2015). Besides its benefit in the multilingual case, the Ψ(y^e, x^f) term improves performance in the bilingual case as well, as it forces the English and second-language embeddings to remain close in space.
To show the value of the Ψ(y^e, x^f) factor in our experiments, we ran a variant of Algorithm 1 without it, using only the monolingual neighborhood Nbr(x^f_i, F) in line 12 of Algorithm 1. We call this variant the ONE-SIDED model and the model in Algorithm 1 the FULL model.

Experimental Setup
We first describe the datasets and the preprocessing methods used to prepare them. We also describe the Word Sense Induction task that we used to compare and evaluate our method.
Parallel Corpora. We use parallel corpora in English (En), French (Fr), Spanish (Es), Russian (Ru) and Chinese (Zh) in our experiments. Corpus statistics for all datasets used in our experiments are shown in Table 1. For En-Zh, we use the FBIS parallel corpus (LDC2003E14). For En-Fr, we use the first 10M lines of the Giga-EnFr corpus released as part of the WMT shared task (Callison-Burch et al., 2011). Note that the domain from which a parallel corpus is derived can affect the final result. To understand which choice of languages provides a suitable disambiguation signal, it is necessary to control for domain across all parallel corpora. To this end, we also used the En-Fr, En-Es, En-Zh and En-Ru sections of the MultiUN parallel corpus (Eisele and Chen, 2010). Word alignments were generated using the fast_align tool (Dyer et al., 2013) in the symmetric intersection mode. Tokenization and other preprocessing were performed using the cdec 3 toolkit. The Stanford Segmenter (Tseng et al., 2005) was used to preprocess the Chinese corpora.
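fast_align emits alignments in Pharaoh format: one line per sentence pair, with space-separated "i-j" pairs linking English position i to foreign position j. A small helper (our own, with one-to-one links assumed for simplicity) for turning such a line into the A_{e→f} map used by the model:

```python
# Parse a Pharaoh-format alignment line, e.g. "0-0 1-1 3-2", into a dict
# mapping English token positions to foreign token positions.

def parse_alignment(line):
    """Parse a Pharaoh-format alignment line into a position dict."""
    a_e2f = {}
    for pair in line.split():
        i, j = pair.split("-")
        a_e2f[int(i)] = int(j)
    return a_e2f

a = parse_alignment("0-0 1-1 3-2")
```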
Word Sense Induction (WSI). We evaluate our approach on the word sense induction task. In this task, we are given several sentences showing usages of the same word, and are required to cluster together all sentences which use the same sense (Nasiruddin, 2013). The predicted clustering is then compared against a provided gold clustering. Note that WSI is a harder task than Word Sense Disambiguation (WSD) (Navigli, 2009): unlike WSD, it does not involve any supervision or explicit human knowledge about the senses of words. We use the disambiguation approach in Eq. (7) to predict the sense given the target word and four context words.
To allow for a fair comparison with earlier work, we use the same benchmark datasets as Bartunov et al. (2016): Semeval-2007, Semeval-2010 and Wikipedia Word Sense Induction (WWSI). We report the Adjusted Rand Index (ARI) (Hubert and Arabie, 1985) in the experiments, as ARI is a stricter and more precise metric than F-score and V-measure.
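For reference, ARI compares two clusterings of the same sentences by pair counting. The self-contained sketch below implements the standard pair-counting definition (scikit-learn's adjusted_rand_score computes the same quantity); it is an illustration, not the paper's evaluation code:

```python
# ARI = (index - expected) / (max_index - expected), where "index" counts
# sentence pairs co-clustered in both the gold and predicted clusterings.
from itertools import combinations

def adjusted_rand_index(gold, pred):
    """ARI over two flat clusterings given as label lists of equal length."""
    pairs = list(combinations(range(len(gold)), 2))
    both = sum(1 for i, j in pairs if gold[i] == gold[j] and pred[i] == pred[j])
    g = sum(1 for i, j in pairs if gold[i] == gold[j])   # same-cluster pairs in gold
    p = sum(1 for i, j in pairs if pred[i] == pred[j])   # same-cluster pairs in pred
    n = len(pairs)
    expected = g * p / n
    max_index = (g + p) / 2
    if max_index == expected:
        return 1.0
    return (both - expected) / (max_index - expected)

# Perfect agreement up to label renaming scores 1.0.
score = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
```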
Parameter Tuning. For fairness, we used five context words on either side to update each English word vector in all experiments. In the monolingual setting, all five words are English; in the multilingual settings, we used four neighboring English words plus the one foreign word aligned to the word being updated (d = 4, d′ = 0 in Algorithm 1). We also analyze the effect of varying d′, the context window size in the foreign sentence, on model performance.
We tune the parameters α and T by maximizing the log-likelihood of a held-out English text. 4 The parameters were chosen from the following values: α ∈ {0.05, 0.1, .., 0.25}, T ∈ {5, 10, .., 30}. All models were trained for 10 iterations with a learning rate of 0.025, decayed to 0 over training. Unless otherwise stated, all embeddings are 100-dimensional.
Under various choices of α and T, we identify only about 10-20% of vocabulary words as polysemous using monolingual training, and 20-25% using multilingual training. It is evident that the non-parametric prior leads to a substantially more efficient representation than previous methods with a fixed number of senses per word.

We performed extensive experiments to evaluate the benefit of leveraging bilingual and multilingual information during training. We also analyze how different choices of language family (i.e., using more distant vs. more similar languages) affect the performance of the embeddings.

Word Sense Induction Results.
The results for WSI are shown in Table 2. Recall that the ONE-SIDED model is the variant of Algorithm 1 without the Ψ(y^e, x^f) factor. MONO refers to the AdaGram model of Bartunov et al. (2016) trained on the English side of the parallel corpus. In all cases, the MONO model is outperformed by the ONE-SIDED and FULL models, showing the benefit of using crosslingual signals in training. The best performance is attained by the multilingual model (En-FrZh), showing the value of the multilingual signal. The value of the Ψ(y^e, x^f) term is also verified by the fact that the ONE-SIDED model performs worse than the FULL model.

Table 3: Effect (in ARI) of language family distance on the WSI task. The best result for each column is shown in bold. The improvement from MONO to FULL is also shown as (3) - (1). Note that these results are not comparable to those in Table 2, as we use a different training corpus to control for the domain.
We can also compare (unfairly to our FULL model) to the best results described in Bartunov et al. (2016), which achieved ARI scores of 0.069, 0.097 and 0.286 on the three datasets respectively after training 300-dimensional embeddings on English Wikipedia (≈ 100M lines). Note that, as WWSI was derived from Wikipedia, training on Wikipedia gives the AdaGram model an undue advantage, resulting in a high ARI score on WWSI.
In comparison, our model did not train on English Wikipedia and uses 100-dimensional embeddings. Nevertheless, even in this unfair comparison, it is noteworthy that on S-2007 and S-2010 we achieve performance (0.067 and 0.094) with multilingual training comparable to a model trained on almost 5 times more data with higher-dimensional (300) embeddings.

Contextual Word Similarity Results.
For completeness, we report correlation scores on the Stanford Contextual Word Similarity dataset (SCWS) (Huang et al., 2012) in Table 2. The task requires computing the similarity between two words given their contexts. While the bilingually trained model outperforms the monolingually trained model, surprisingly the multilingually trained model does not perform well on SCWS. We believe this may be due to our parameter tuning strategy. 5

Effect of Language Family Distance.
Intuitively, the choice of language can affect the result of crosslingual training, as some languages may provide better disambiguation signals than others. We performed a systematic set of experiments to evaluate whether we should choose languages from a closer family (Indo-European languages) or a farther family (non-Indo-European languages) as training data alongside English. 6 To control for domain, we use the MultiUN corpus here. We use En paired with Fr and Es to represent Indo-European languages, and En paired with Ru and Zh to represent non-Indo-European languages.

5 Most works tune directly on the test dataset for word similarity tasks.
From Table 3, we see that using non-Indo-European languages yields a slightly higher improvement on average than using Indo-European languages. This suggests that languages from a distant family aid disambiguation more. Our findings echo those of Resnik and Yarowsky (1999), who found that the tendency to lexicalize senses of an English word differently in a second language correlates with language distance.
Effect of Window Size.

Figure 3d shows the effect of increasing the crosslingual window d′ on the average ARI on the WSI task for the En-Fr and En-Zh models. While increasing the window size improves the average score for the En-Zh model, the score for the En-Fr model goes down. This suggests that it might be beneficial to have a separate window parameter per language. This also aligns with the earlier observation that different language families have different suitability (a bigger crosslingual context from a distant family helped) and different requirements for optimal performance.

Qualitative Illustration
As an illustration of the effects of multilingual training, Figure 3 shows PCA plots of 11 sense vectors for 9 words using the monolingual, bilingual and multilingual models. From Fig. 3a, we note that with monolingual training the senses are poorly separated. Although the model infers two senses for bank, both are close to financial terms, suggesting their distinction was not recognized. The same observation can be

Conclusion
We presented a multi-view, non-parametric word representation learning algorithm which can leverage multilingual distributional information. Our approach effectively combines the benefits of crosslingual training and Bayesian non-parametrics. Ours is the first multi-sense representation learning algorithm capable of using multilingual distributional information efficiently, by combining several parallel corpora to obtain a large multilingual corpus. Our experiments show how this multi-view approach learns high-quality embeddings using substantially less data and fewer parameters than the prior state of the art. We also analyzed the effect of various parameters, such as the choice of language family and the crosslingual window size, on performance. While we focused on improving the embeddings of English words in this work, the same algorithm could learn better multi-sense embeddings for other languages. Exciting avenues for future research include extending our approach to model polysemy in the foreign languages as well. The sense vectors could then be aligned across languages to generate a multilingual WordNet-like resource in a completely unsupervised manner, thanks to our joint training paradigm.