Bilingual Learning of Multi-sense Embeddings with Discrete Autoencoders

We present an approach to learning multi-sense word embeddings relying both on monolingual and bilingual information. Our model consists of an encoder, which uses monolingual and bilingual context (i.e. a parallel sentence) to choose a sense for a given word, and a decoder which predicts context words based on the chosen sense. The two components are estimated jointly. We observe that the word representations induced from bilingual data outperform their monolingual counterparts across a range of evaluation tasks, even though crosslingual information is not available at test time.


Introduction
Approaches to learning word embeddings (i.e. real-valued vectors) relying on word context have received much attention in recent years, and the induced representations have been shown to capture syntactic and semantic properties of words. They have been evaluated intrinsically (Mikolov et al., 2013a; Baroni et al., 2014; Levy and Goldberg, 2014) and have also been used in concrete NLP applications to deal with word sparsity and improve generalization (Turian et al., 2010; Collobert et al., 2011; Bansal et al., 2014; Passos et al., 2014). While most work to date has focused on developing embedding models which represent a word with a single vector, some researchers have attempted to capture polysemy explicitly and have encoded properties of each word with multiple vectors (Huang et al., 2012; Tian et al., 2014; Neelakantan et al., 2014; Chen et al., 2014; Li and Jurafsky, 2015).
In parallel to this work on multi-sense word embeddings, another line of research has investigated integrating multilingual data, with largely two distinct goals in mind. The first goal has been to obtain representations for several languages in the same semantic space, which then enables the transfer of a model (e.g., a syntactic parser) trained on annotated training data in one language to another language lacking this annotation (Klementiev et al., 2012; Hermann and Blunsom, 2014; Gouws et al., 2014; Chandar A P et al., 2014). The second goal has been to leverage information from another language to yield better first-language embeddings (Guo et al., 2014). Our paper falls in the latter, much less explored category. We adhere to the view of multilingual learning as a means of language grounding (Faruqui and Dyer, 2014b; Zou et al., 2013; Titov and Klementiev, 2012; Snyder and Barzilay, 2010; Naseem et al., 2009). Intuitively, polysemy in one language can be at least partially resolved by looking at the translation of the word and its context in another language (Kaji, 2003; Ng et al., 2003; Diab and Resnik, 2002; Ide, 2000; Dagan and Itai, 1994; Brown et al., 1991). Better sense assignment can then lead to better sense-specific word embeddings.
We propose a model that uses second-language embeddings as a supervisory signal in learning multi-sense representations in the first language. This supervision is easy to obtain for many language pairs, as numerous parallel corpora exist nowadays. Our model, which can be seen as an autoencoder with a discrete hidden layer encoding word senses, leverages bilingual data in its encoding part, while the decoder predicts the surrounding words relying on the predicted senses. We strive to remain flexible as to the form of parallel data used in training and support both the use of word- and sentence-level alignments.
Our findings are:
• The second-language signal effectively improves the quality of multi-sense embeddings, as seen on a variety of intrinsic tasks for English, with results superior to those of the baseline Skip-Gram model, even though the crosslingual information is not available at test time.
• This finding is robust across several settings, such as varying dimensionality, vocabulary size and amount of data.
• In the extrinsic POS-tagging task, the second-language signal also offers improvements over monolingually-trained multi-sense embeddings; however, the standard Skip-Gram embeddings turn out to be the most robust in this task.
We make the implementation of all the models as well as the evaluation scripts available at http://github.com/rug-compling/bimu.

Word Embeddings with Discrete Autoencoders
Our method borrows its general structure from neural autoencoders (Rumelhart et al., 1986; Bengio et al., 2013). Autoencoders are trained to reproduce their input by first mapping their input to a (lower-dimensional) hidden layer and then predicting an approximation of the input relying on this hidden layer.
In our case, the hidden layer is not a real-valued vector, but a categorical variable encoding the sense of a word. Discrete-state autoencoders have been successful in several natural language processing applications, including POS tagging and word alignment (Ammar et al., 2014), semantic role induction (Titov and Khoddam, 2015) and relation discovery (Marcheggiani and Titov, 2016).
More formally, our model consists of two components: an encoding part which assigns a sense to a pivot word, and a reconstruction (decoding) part which recovers context words based on the pivot word and its sense. As the predictions are probabilistic ('soft'), the reconstruction step involves summation over all potential word senses. The goal is to find embedding parameters which minimize the error in recovering context words based on the pivot word and the sense assignment. The parameters of both the encoding and reconstruction components are optimized jointly. Intuitively, a good sense assignment should make the reconstruction step as easy as possible. The encoder uses not only words in the first-language sentence to choose the sense but also, at training time, conditions its decisions on the words in the second-language sentence. We hypothesize that the injection of crosslingual information will guide learning towards inducing more informative sense-specific word representations. Consequently, using this information at training time should benefit the model even though crosslingual information is not available to the encoder at test time.
We specify the encoding part as a log-linear model:

p(s | x_i, C_i, C′_i, θ) ∝ exp((1−λ)/|C_i| Σ_{x∈C_i} ϕ_{x_i,s} · γ_x + λ/|C′_i| Σ_{x′∈C′_i} ϕ_{x_i,s} · γ′_{x′}).  (1)

To choose the sense s ∈ S for a word x_i, we use the bag of context words C_i from the first language l, as well as the bag of context words C′_i from the second language l′. The context C_i is defined as a multiset C_i = {x_{i−n}, …, x_{i−1}, x_{i+1}, …, x_{i+n}}, including the words around the pivot word in a window of size n to each side. We set n to 5 in all our experiments. The crosslingual context C′_i is discussed in § 3, where we either rely on word alignments or use the entire second-language sentence as the context. We distinguish between sense-specific embeddings, denoted ϕ ∈ R^d, and generic, sense-agnostic ones, denoted γ and γ′ ∈ R^d for the first and second language, respectively. The number of sense-specific embeddings is the same for all words. We use θ to denote all these embedding parameters. They are learned jointly, with the exception of the pre-trained second-language embeddings. The hyperparameter λ ∈ R, 0 ≤ λ ≤ 1, weights the contribution of each language; setting λ = 0 drops the second-language component and uses only the first language. Our formulation allows new languages to be added easily, provided that the second-language embeddings live in the same semantic space.
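As an illustration, the encoder amounts to a softmax over senses scored by dot products between the pivot's sense embeddings and averaged generic context embeddings from each language. The sketch below uses our own function and variable names and an assumed λ/(1−λ) weighting; it is not the released implementation.

```python
import numpy as np

def sense_posterior(phi_pivot, ctx_l1, ctx_l2, lam):
    """Hypothetical sketch of the sense encoder.
    phi_pivot: (S, d) sense-specific embeddings of the pivot word.
    ctx_l1:    (n1, d) generic first-language context embeddings.
    ctx_l2:    (n2, d) generic second-language context embeddings (may be empty).
    lam:       weight of the second-language component, 0 <= lam <= 1."""
    score = (1.0 - lam) * phi_pivot @ ctx_l1.mean(axis=0)
    if len(ctx_l2) > 0 and lam > 0:
        score += lam * phi_pivot @ ctx_l2.mean(axis=0)
    e = np.exp(score - score.max())   # numerically stable softmax over senses
    return e / e.sum()

rng = np.random.default_rng(0)
phi = rng.normal(size=(3, 50))        # 3 senses per word, d = 50
c1 = rng.normal(size=(10, 50))        # window of n = 5 to each side
c2 = rng.normal(size=(3, 50))         # second-language context
q = sense_posterior(phi, c1, c2, lam=0.7)
```

With `lam=0` the second-language term is dropped, recovering the monolingual encoder.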
The reconstruction part predicts a context word x_j given the pivot x_i and the current estimate of its sense s:

p(x_j | x_i, s, θ) = exp(ϕ_{x_i,s} · γ_{x_j}) / Σ_{v=1}^{|V|} exp(ϕ_{x_i,s} · γ_v),  (2)

where |V| is the vocabulary size. This is effectively a Skip-Gram model (Mikolov et al., 2013a) extended to rely on senses.

Learning and regularization
As sense assignments are not observed during training, the learning objective includes marginalization over word senses and can thus be written as

L(θ) = Σ_i Σ_j log Σ_{s∈S} p(s | x_i, C_i, C′_i, θ) p(x_j | x_i, s, θ),

in which the index i goes over all pivot words in the first language, j over all context words to predict at each i, and s marginalizes over all possible senses of the word x_i. In practice, we avoid the costly computation of the normalization factor in the softmax of Eq. (2) and use negative sampling (Mikolov et al., 2013b) instead of log p(x_j | x_i, s, θ):

log σ(ϕ_{x_i,s} · γ_{x_j}) + Σ_{x∈N} log σ(−ϕ_{x_i,s} · γ_x),

where σ is the sigmoid function and γ_x is a word embedding from the sample of negative (noisy) words N. Optimizing the autoencoding objective is broadly similar to the learning algorithms defined for multi-sense embedding induction in some of the previous work (Neelakantan et al., 2014; Li and Jurafsky, 2015). Note, though, that this previous work has considered only monolingual context.

We use a minibatch training regime and seek to optimize the objective function L(B, θ) for each minibatch B. We found that optimizing this objective directly often resulted in very flat posterior distributions. We therefore use a form of posterior regularization (Ganchev et al., 2010), with which we can encode our prior expectation that the posteriors should be sharp. The regularized objective for a minibatch is defined as

L_R(B, θ) = L(B, θ) + λ_H Σ_i H(q_i),

where H is the entropy function and q_i are the posterior distributions p(s | x_i, C_i, C′_i, θ) from the encoder. This modified objective can also be motivated from a variational-approximation perspective; see Marcheggiani and Titov (2016) for details. By varying the parameter λ_H ∈ R, it is easy to control the amount of entropy regularization: for λ_H > 0, the objective is optimized with flatter posteriors, while λ_H < 0 infers more peaky posteriors.
When λ_H → −∞, the probability mass needs to be concentrated on a single sense, resulting in an algorithm similar to hard EM. In practice, we found that using hard-update training, which is closely related to the λ_H → −∞ setting, led to the best performance.
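The negative-sampling term and the entropy-regularized minibatch contribution can be sketched as follows. This is a simplified, illustrative rendering with our own names, not the released implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neg_sampling_logprob(phi_s, gamma_ctx, gamma_neg):
    """Negative-sampling replacement for log p(x_j | x_i, s):
    phi_s:     (d,) sense embedding of the pivot for sense s.
    gamma_ctx: (d,) embedding of the context word to predict.
    gamma_neg: (k, d) embeddings of k sampled noise words."""
    pos = np.log(sigmoid(phi_s @ gamma_ctx))
    neg = np.log(sigmoid(-(gamma_neg @ phi_s))).sum()
    return pos + neg

def regularized_term(q, logprobs, lam_H):
    """Per-pivot contribution to the regularized objective:
    expected reconstruction log-probability under the sense posterior q,
    plus lam_H times the entropy of q (lam_H < 0 favors peaky posteriors)."""
    H = -(q * np.log(q + 1e-12)).sum()
    return q @ logprobs + lam_H * H
```

For instance, with zero embeddings every sigmoid evaluates to 0.5, so the log-probability reduces to (1 + k)·log 0.5, which is a handy sanity check.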

Obtaining word representations
At test time, we construct the representation of a word x_i by averaging all of its sense embeddings, weighting them with the sense expectations (Li and Jurafsky, 2015):

v(x_i) = Σ_{s∈S} p(s | x_i, C_i, θ) ϕ_{x_i,s}.

Unlike in training, the sense prediction step here does not use the crosslingual context C′_i, since it is not available in the evaluation tasks. In this work, instead of marginalizing out the unobservable crosslingual context, we simply ignore it in the computation. Sometimes even the first-language context is missing, as is the case in many word similarity tasks. In that case, we use the uniform average, 1/|S| Σ_{s∈S} ϕ_{x_i,s}.
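A minimal sketch of this test-time averaging, with our own function names:

```python
import numpy as np

def word_vector(phi, q=None):
    """Test-time representation of a word.
    phi: (S, d) sense embeddings of the word.
    q:   (S,) sense posterior from the monolingual encoder, or None
         when no context is available (e.g. context-free similarity pairs)."""
    if q is None:
        return phi.mean(axis=0)   # uniform average 1/|S| sum_s phi_s
    return q @ phi                # expectation of phi_s under p(s | x_i, C_i)
```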

Word affiliation from alignments
In defining the crosslingual signal we draw on a heuristic inspired by Devlin et al. (2014). The second-language context words are taken to be the multiset of words around, and including, the word affiliated to x_i:

C′_i = {x′_{a_i−m}, …, x′_{a_i}, …, x′_{a_i+m}},

where x′_{a_i} is the word affiliated to x_i and the parameter m regulates the context window size. By choosing m = 0, only the affiliated word is used as the l′ context, and by choosing m = ∞, the l′ context is the entire sentence (≈ uniform alignment). To obtain the index a_i, we proceed as follows: 1) if x_i aligns to exactly one second-language word, a_i is the index of the word it aligns to; 2) if x_i aligns to multiple words, a_i is the index of the aligned word in the middle (rounding down when necessary); 3) if x_i is unaligned, C′_i is empty, and therefore no l′ context is used. We use the cdec aligner (Dyer et al., 2010) to word-align the parallel corpora.
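The affiliation heuristic can be rendered directly in code; variable names below are ours.

```python
def affiliation_index(links):
    """links: sorted list of second-language indices aligned to x_i.
    Returns the affiliated index a_i, or None when x_i is unaligned."""
    if not links:
        return None                      # case 3): no l' context
    return links[(len(links) - 1) // 2]  # case 2): middle link, rounding down

def l2_context(sentence_l2, links, m):
    """Second-language context C'_i: a window of size m around the
    affiliated word (inclusive); m = inf yields the whole sentence."""
    a = affiliation_index(links)
    if a is None:
        return []                        # empty C'_i
    if m == float("inf"):
        return list(sentence_l2)         # sentence-only (~ uniform alignment)
    lo, hi = max(0, a - m), min(len(sentence_l2), a + m + 1)
    return sentence_l2[lo:hi]
```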

Parameters and Set-up

Learning parameters
We use the AdaGrad optimizer (Duchi et al., 2011) with the initial learning rate set to 0.1. We set the minibatch size to 1000, the number of negative samples to 1, the subsampling factor to 0.001 and the window size n to 5. All embeddings are 50-dimensional (unless specified otherwise) and initialized by sampling from the uniform distribution on [−0.05, 0.05]. We include in the vocabulary all words occurring at least 20 times in the corpus. We set the number of senses per word to 3 (see further discussion in § 6.4 and § 7). All other parameters and their default values can be examined in the source code available online.

Bilingual data
In a large body of work on multilingual word representations, Europarl (Koehn, 2005) is the preferred source of parallel data. However, the domain of Europarl is rather constrained, whereas we would like to obtain word representations of more general language, also in order to carry out an effective evaluation on semantic similarity datasets whose domains are usually broader. We therefore use the following parallel corpora: News Commentary (Bojar et al., 2013) (NC), Yandex-1M (RU-EN), CzEng 1.0 (Bojar et al., 2012) (CZ-EN), which includes EU legislation texts, and GigaFrEn (Callison-Burch et al., 2009) (FR-EN). The sizes of the corpora are reported in Table 1. The word representations trained on the NC corpora are evaluated only intrinsically due to their small sizes.

Evaluation Tasks
We evaluate the quality of our word representations on a number of tasks, both intrinsic and extrinsic.

Word similarity
We are interested here in how well the semantic similarity ratings obtained from embedding comparisons correlate with human ratings. For this purpose, we use a variety of similarity benchmarks for English and report the Spearman ρ correlation between the human ratings and the cosine similarities obtained from our word representations. The SCWS benchmark (Huang et al., 2012) is probably the most suitable similarity dataset for evaluating multi-sense embeddings, since it allows us to perform the sense prediction step based on the sentential context provided for each word in the pair.
The other benchmarks we use provide the ratings for the word pairs without context. WS-353 contains 353 human-rated word pairs (Finkelstein et al., 2001), while Agirre et al. (2009) separate this benchmark into similarity (WS-SIM) and relatedness (WS-REL) portions. The RG-65 (Rubenstein and Goodenough, 1965) and MC-30 (Miller and Charles, 1991) benchmarks contain nouns only. The MTurk-287 (Radinsky et al., 2011) and MTurk-771 (Halawi et al., 2012) benchmarks include word pairs whose similarity was crowdsourced from AMT. Similarly, MEN (Bruni et al., 2012) is an AMT-annotated dataset of 3000 word pairs. YP-130 (Yang and Powers, 2006) and Verb-143 (Baker et al., 2014) measure verb similarity. Rare-Word (Luong et al., 2013) contains 2034 rare-word pairs. Finally, SimLex-999 (Hill et al., 2014b) is intended to measure pure similarity as opposed to relatedness. For these benchmarks, we prepare the word representations by taking a uniform average of all sense embeddings per word. The evaluation is carried out using the tool described in Faruqui and Dyer (2014a). Due to space constraints, we report the results by averaging over all benchmarks (Similarity), and include the individual results in the online repository.
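The evaluation step reduces to computing cosine similarities for the word pairs and correlating their ranking with the human scores. A minimal self-contained sketch, with our own helper names and a tie-free Spearman implementation for brevity:

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two word vectors."""
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v))

def spearman(a, b):
    """Spearman rho via rank correlation; assumes no tied scores."""
    ra = np.argsort(np.argsort(a))   # ranks of a
    rb = np.argsort(np.argsort(b))   # ranks of b
    n = len(a)
    d2 = ((ra - rb) ** 2).sum()
    return 1.0 - 6.0 * d2 / (n * (n ** 2 - 1))

# Toy check: perfectly monotone model scores give rho = 1.
human = np.array([1.0, 3.0, 2.0, 4.0])
model = np.array([0.1, 0.7, 0.4, 0.9])
rho = spearman(human, model)
```

In practice one would use `scipy.stats.spearmanr`, which also handles ties.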

Supersense similarity
We also evaluate on a task measuring the similarity between the embeddings (uniformly averaged, in the case of multi-sense embeddings) and a matrix of supersense features extracted from the English SemCor, using the Qvec tool (Tsvetkov et al., 2015). We choose this method because it has been shown to output scores that correlate well with extrinsic tasks, e.g. text classification and sentiment analysis. We believe that this, in combination with the word similarity tasks from the previous section, can give a reliable picture of the generic quality of the word embeddings studied in this work.

POS tagging
As our downstream evaluation task, we use the learned word representations to initialize the embedding layer of a neural network tagging model. We use the same convolutional architecture as Li and Jurafsky (2015): an input layer taking a concatenation of neighboring embeddings as input, three hidden layers with a rectified linear unit activation function and a softmax output layer. We train for 10 epochs using one sentence as a batch. Other hyperparameters can be examined in the source code. The multi-sense word embeddings are inferred from the sentential context (weighted average), as for the evaluation on the SCWS dataset. We use the standard splits of the Wall Street Journal portion of the Penn Treebank: 0-18 for training, 19-21 for development and 22-24 for testing.
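As a rough sketch, the tagger's forward pass can be written as below. Layer sizes, initialization and the tag count are illustrative assumptions, not the exact settings of Li and Jurafsky (2015).

```python
import numpy as np

rng = np.random.default_rng(1)
d, win, hid, n_tags = 50, 5, 100, 45      # embedding dim, window, hidden, PTB tags

# Random weights for illustration; in practice these are trained.
W = [rng.normal(scale=0.1, size=(d * win, hid)),
     rng.normal(scale=0.1, size=(hid, hid)),
     rng.normal(scale=0.1, size=(hid, hid))]
W_out = rng.normal(scale=0.1, size=(hid, n_tags))

def tag_probs(window_embs):
    """window_embs: (win, d) embeddings of the word and its neighbors.
    Input layer concatenates the window; three ReLU layers; softmax output."""
    h = window_embs.reshape(-1)           # concatenation of neighboring embeddings
    for Wl in W:
        h = np.maximum(0.0, h @ Wl)       # rectified linear unit
    z = h @ W_out
    e = np.exp(z - z.max())
    return e / e.sum()                    # distribution over tags

p = tag_probs(rng.normal(size=(win, d)))
```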

Results
We compare three embedding models, Skip-Gram (SG), Multi-sense (MU) and Bilingual Multi-sense (BIMU), using our own implementation for each of them. The first two can be seen as simpler variants of the BIMU model: in SG we omit the encoder entirely, and in MU we omit the second-language (l′) part of the encoder in Eq. (1). We train the SG and MU models on the English part of the parallel corpora. Those parameters common to all methods are kept fixed during the experiments. The values of λ and m controlling the second-language signal in BIMU are set on the POS-tagging development set (cf. § 6.3).
The results on the SCWS benchmark (Table 2) show consistent improvements of the BIMU model over SG and MU across all parallel corpora, except on the small CZ-EN (NC) corpus. We have also measured the 95% confidence intervals of the difference between the correlation coefficients of BIMU and SG, following the method described in Zou (2007). According to these values, BIMU significantly outperforms SG on RU-EN, and on the French, Russian and Spanish NC corpora.
Next, ignoring any language-specific factors, we would expect to observe a trend according to which the larger the corpus, the higher the correlation score. However, this is not what we find. Among the largest corpora, i.e. RU-EN, CZ-EN and FR-EN, the models trained on RU-EN perform surprisingly well, practically on par with those trained on the 23-times larger FR-EN corpus. Similarly, the quality of the embeddings trained on CZ-EN is generally lower than when trained on the 10-times smaller RU-EN corpus. One explanation for this might be the different text composition of the corpora, with RU-EN matching the domain of the evaluation task better than the two larger corpora. Also, FR-EN is known to be noisy, containing web-crawled sentences that are not parallel or are not natural language (Denkowski et al., 2012). Furthermore, language-dependent effects might be playing a role: for example, there are signs of Czech being the least helpful language among those studied. But while there is evidence for this in all intrinsic tasks, the situation in POS tagging does not confirm the speculation.
We relate our models to previously reported SCWS scores from the literature using 300-dimensional models in Table 3. Even though we train on a much smaller corpus than the previous works, the BIMU model achieves a very competitive correlation score.
The results on the similarity benchmarks and Qvec largely confirm those on SCWS, despite the lack of sentential context, which would allow the multi-sense models to weight the contribution of different senses more accurately. Why, then, does simply averaging the MU and BIMU embeddings lead to better results than using the SG embeddings? We hypothesize that the single-sense model tends to over-represent the dominant sense with its generic, one-vector-per-word representation, whereas the uniformly averaged embeddings yielded by the multi-sense models better encode the range of potential senses. Similar observations have been made in the context of selectional preference modeling of polysemous verbs (Greenberg et al., 2015).
In POS tagging, the relationship between the MU and BIMU models is similar to that discussed above. Overall, however, neither of the multi-sense models outperforms the SG embeddings. The neural network tagger may be able to perform disambiguation implicitly on top of the single-sense SG embeddings, similarly to what has been argued in Li and Jurafsky (2015). The tagging accuracies obtained with MU on CZ-EN and FR-EN are similar to that obtained by Li and Jurafsky with their multi-sense model (93.8), while the accuracy of SG is more competitive in our case (around 94.0 compared to their 92.5), although they use a larger corpus for training the word representations.
In all tasks, the addition of the bilingual component during training increases the accuracy of the encoder for most corpora, even though the bilingual information is not available during evaluation.

The amount of (parallel) data
Fig. 2a displays how the semantic similarity as measured on SCWS evolves as a function of increasingly larger sub-samples of FR-EN, our largest parallel corpus. The BIMU embeddings show relatively stable improvements over the MU and especially the SG embeddings. The same performance as that of SG at 100% is achieved by MU and BIMU sooner, using only around 40-50% of the corpus.

The dimensionality and frequent words
It is argued in Li and Jurafsky (2015) that often simply increasing the dimensionality of the SG model suffices to obtain better results than their multi-sense model. We look at the effect of dimensionality on semantic similarity in Fig. 2b, and see that simply increasing the dimensionality of the SG model (to 100, 200 or 300 dimensions) is not sufficient to outperform the MU or BIMU models. When the vocabulary is constrained to the 6,000 most frequent words, the representations are of higher quality. We can see that the models, especially SG, benefit slightly more from the increased dimensionality when looking at these most frequent words. This accords with expectations: frequent words need more representational capacity due to their complex semantic and syntactic behavior (Atkins and Rundell, 2008).

The role of bilingual signal
The degree of contribution of the second language l′ during learning is affected by two parameters: λ, for the trade-off between the importance of the first and second language in the sense prediction part (encoder), and m, for the size of the window around the second-language word affiliated with the pivot. Fig. 3a suggests that the context from the second language is useful in sense prediction, and that it should be weighted relatively heavily (around 0.7 to 0.8, depending on the language). Regarding the role of the context-window size in sense disambiguation, the WSD literature has reported both smaller (more local) and larger (more topical) monolingual contexts to be useful; see e.g. Ide and Véronis (1998) for an overview. In Fig. 3b we find that considering a very narrow context in the second language (the affiliated word only, or a m = 1 window around it) performs best, and that there is little gain in using a broader window. This is understandable, since the l′ representation participating in the sense selection is simply an average over all generic embeddings in the window, and this averaged representation probably becomes noisy for large m, i.e. more irrelevant words are included in the window. However, the negative effect on the accuracy is still relatively small: up to around −0.1 for the models using French and Russian as the second languages, and −0.25 for Czech, when setting m = ∞. The infinite window size, corresponding to the sentence-only alignment, also performs well on SCWS, improving on the monolingual multi-sense baseline on all corpora (Table 4).

The number of senses
In our work, the number of senses k is a model parameter, which we keep fixed at 3 throughout the empirical study. We comment here briefly on other choices of k ∈ {2, 4, 5}. We have found k = 2 to be a good choice on the RU-EN and FR-EN corpora (but not on CZ-EN), with around a 0.2-point improvement over k = 3 on SCWS and in POS tagging.
With larger values of k, the performance tends to degrade. For example, on RU-EN, the k = 5 score on SCWS is about 0.6 points below our default setting.

Additional Related Work
Multi-sense models. One line of research has dealt with sense induction as a separate clustering problem that is followed by an embedding learning component (Huang et al., 2012; Reisinger and Mooney, 2010). In another, the sense assignment and the embeddings are trained jointly (Neelakantan et al., 2014; Tian et al., 2014; Li and Jurafsky, 2015; Bartunov et al., 2015). Neelakantan et al. (2014) propose an extension of Skip-Gram (Mikolov et al., 2013a) by introducing sense-specific parameters together with k-means-inspired 'centroid' vectors that keep track of the contexts in which word senses have occurred.
They explore two model variants, one in which the number of senses is the same for all words, and another in which a threshold value determines the number of senses for each word.The results comparing the two variants are inconclusive, with the advantage of the dynamic variant being virtually nonexistent.
In our work, we use the static approach. Whenever there is evidence for fewer senses than the number of available sense vectors, this is unlikely to be a serious issue, as learning would concentrate on some of the senses, and these would then be the preferred predictions at test time as well. Li and Jurafsky (2015) build upon the work of Neelakantan et al. with a more principled method for introducing new senses using the Chinese Restaurant Process (CRP). Our experiments confirm the findings of Neelakantan et al. that multi-sense embeddings improve on Skip-Gram embeddings on intrinsic tasks, as well as those of Li and Jurafsky, who find that multi-sense embeddings offer little benefit to the neural network learner on extrinsic tasks. Viewed without the bilingual part of the encoder, our discrete-autoencoding method has a lot in common with their methods.
Multilingual models. The research on using multilingual information in the learning of multi-sense embedding models is scarce. Guo et al. (2014) perform a sense induction step based on clustering translations prior to learning word embeddings. Once the translations are clustered, they are mapped to a source corpus using WSD heuristics, after which a recurrent neural network is trained to obtain sense-specific representations. Unlike in our work, the sense induction and embedding learning components are entirely separate, without the possibility for one to influence the other. In a similar vein, Bansal et al. (2012) use bilingual corpora to perform soft word clustering, extending the previous monolingual work of Lin and Wu (2009). Single-sense representations in the multilingual context have been studied more extensively (Lu et al., 2015; Faruqui and Dyer, 2014b; Hill et al., 2014a; Zhang et al., 2014; Faruqui and Dyer, 2013; Zou et al., 2013), with the goal of bringing the representations into the same semantic space. A related line of work concerns the crosslingual setting, where one tries to leverage training data in one language to build models for typically lower-resource languages (Hermann and Blunsom, 2014; Gouws et al., 2014; Chandar A P et al., 2014; Soyer et al., 2014; Klementiev et al., 2012; Täckström et al., 2012).
The recent works of Kawakami and Dyer (2015) and Nalisnick and Ravi (2015) are also of interest. The latter propose an infinite Skip-Gram model in which the embedding dimensionality is stochastic; it is relevant since it demonstrates that their embeddings exploit different dimensions to encode different word meanings. Like us, Kawakami and Dyer (2015) use bilingual supervision, but in a more complex LSTM network that is trained to predict word translations. Although they do not represent different word senses separately, their method produces representations that depend on the context. In our work, the second-language signal is introduced only in the sense prediction component and is flexible: it can be defined in various ways, and it covers sentence-only alignments as a special case.

Conclusion
We have presented a method for learning multi-sense embeddings that performs sense estimation and context prediction jointly. Both mono- and bilingual information is used for sense prediction during training. We have explored the model performance on a variety of tasks, showing that the bilingual signal improves the sense predictor, even though the crosslingual information is not available at test time. In this way, we are able to obtain word representations that are of better quality than monolingually-trained multi-sense representations, and that outperform the Skip-Gram embeddings on intrinsic tasks. We have analyzed the model performance under several conditions, namely varying dimensionality, vocabulary size, amount of data, and size of the second-language context. For the latter parameter, we find that bilingual information is useful even when using the entire sentence as context, suggesting that sentence-only alignment might be sufficient in certain situations.

Figure 1 :
Figure 1: Model schema: the sense encoder with bilingual signal and the context-word predictor are learned jointly.

Figure 2 :
Figure 2: (a) Effect of the amount of data used in learning on the SCWS correlation scores. (b) Effect of embedding dimensionality on the models trained on RU-EN and evaluated on SCWS, with either the full vocabulary or the top-6000 words.

Figure 3 :
Figure 3: Controlling the bilingual signal. (a) Effect of varying the parameter λ for controlling the importance of the second-language context (0.1: least important, 0.9: most important). (b) Effect of the second-language window size m on the accuracy. In both (a) and (b), the reported accuracies are measured on the POS-tagging development set.

Table 1 :
Parallel corpora used in this paper. The word sizes reported are based on the English part of the corpus. Each language pair in NC has a different English part, hence the varying number of sentences per target language.

Table 2 :
Results, per-row best in bold. SG and MU are trained on the English part of the parallel corpora. In BIMU−SG, we report the difference between BIMU and SG, together with the 95% CI of that difference. The Similarity scores are averaged over 12 benchmarks described in § 5.1. For POS tagging, we report the accuracy.

Table 3 :
Comparison to other works (reprinted), for the vocabulary of top-6000 words. Our models are trained on RU-EN, a much smaller corpus than those used in previous work.

Table 4 :
Comparison of SCWS correlation scores of BIMU trained with an infinite l′ window to the MU baseline (vocabulary of top-6000 words).