Multilingual Training of Crosslingual Word Embeddings

Crosslingual word embeddings represent lexical items from different languages using the same vector space, enabling crosslingual transfer. Most prior work constructs embeddings for a pair of languages, with English on one side. We investigate methods for building high quality crosslingual word embeddings for many languages in a unified vector space. In this way, we can exploit and combine the strengths of many languages. We obtain high performance on bilingual lexicon induction, monolingual similarity and crosslingual document classification tasks.


Introduction
Monolingual word embeddings have facilitated advances in many natural language processing tasks, such as natural language understanding (Collobert and Weston, 2008), sentiment analysis, and dependency parsing (Dyer et al., 2015). Crosslingual word embeddings represent words from several languages in the same low dimensional space. They are helpful for multilingual tasks such as machine translation (Brown et al., 1993) and bilingual named entity recognition (Wang et al., 2013). Crosslingual word embeddings can also be used in transfer learning, where the source model is trained on one language and applied directly to another language; this is suitable for the low-resource scenario (Yarowsky and Ngai, 2001; Duong et al., 2015b; Das and Petrov, 2011; Täckström et al., 2012).
Most prior work on building crosslingual word embeddings focuses on a pair of languages. English is usually on one side, thanks to the wealth of available English resources. However, it is highly desirable to have crosslingual word embeddings for many languages so that different relations can be exploited. For example, since Italian and Spanish are similar, they are excellent candidates for transfer learning. However, few parallel resources exist between Italian and Spanish for directly building bilingual word embeddings. Our multilingual word embeddings, on the other hand, map both Italian and Spanish to the same space without using any direct bilingual signal between them. In addition, multilingual word embeddings allow transfer learning from multiple source languages, producing a more general model and overcoming data sparseness (McDonald et al., 2011; Guo et al., 2016; Agić et al., 2016). Moreover, multilingual word embeddings are also crucial for multilingual applications such as multi-source machine translation (Zoph and Knight, 2016) and multi-source transfer dependency parsing (McDonald et al., 2011; Duong et al., 2015a).
We propose several algorithms to map bilingual word embeddings to the same vector space, either during training or during post-processing. We apply a linear transformation to map the English side of each pretrained crosslingual word embedding to the same space. We also extend Duong et al. (2016), which used a lexicon to learn bilingual word embeddings. We modify the objective function to jointly build multilingual word embeddings during training. Unlike most prior work, which focuses on downstream applications, we measure the quality of our multilingual word embeddings in three ways: bilingual lexicon induction, monolingual word similarity, and crosslingual document classification. Relative to a benchmark of training on each language pair separately and to various published multilingual word embeddings, we achieve high performance on all the tasks.
In this paper we make the following contributions: (a) novel algorithms for post hoc combination of multiple bilingual word embeddings, applicable to any pretrained bilingual model; (b) a method for jointly learning multilingual word embeddings, extending Duong et al. (2016), to jointly train over monolingual corpora in several languages; (c) achieving competitive results in bilingual, monolingual and crosslingual transfer settings.

Related work
Crosslingual word embeddings are typically based on co-occurrence statistics from parallel text (Luong et al., 2015; Gouws et al., 2015; Chandar A P et al., 2014; Klementiev et al., 2012; Kočiský et al., 2014; Huang et al., 2015). Other work uses more widely available resources such as comparable data (Vulić and Moens, 2015) and shared Wikipedia entries (Søgaard et al., 2015). However, those approaches rely on data from Wikipedia, and it is non-trivial to extend them to languages that are not covered by Wikipedia. Lexicons are another source of bilingual signal, with the advantage of high coverage. Multilingual lexical resources such as PanLex (Kamholz et al., 2014) and Wiktionary cover thousands of languages, and have been used to construct high performance crosslingual word embeddings (Mikolov et al., 2013a; Xiao and Guo, 2014; Faruqui and Dyer, 2014).
Previous work mainly focuses on building word embeddings for a pair of languages, typically with English on one side, with the exception of Coulmance et al. (2015), Søgaard et al. (2015) and Ammar et al. (2016). Coulmance et al. (2015) extend the bilingual skipgram model from Luong et al. (2015), training jointly over many languages using the Europarl corpora. We also compare our models with an extension of Huang et al. (2015) adapted for multiple languages, also using bilingual corpora. However, parallel data is an expensive resource, and using parallel data seems to under-perform on the bilingual lexicon induction task (Vulić and Moens, 2015). While Coulmance et al. (2015) use English as the pivot language, Søgaard et al. (2015) learn multilingual word embeddings for many languages using Wikipedia entries which are the same for many languages. However, their approach is limited to languages covered in Wikipedia and seems to under-perform other methods. Ammar et al. (2016) propose two algorithms, MultiCluster and MultiCCA, for multilingual word embeddings using a set of bilingual lexicons. MultiCluster first builds a graph where nodes are lexical items and edges are translations. Each cluster in this graph is an anchor point for building multilingual word embeddings. MultiCCA is an extension of Faruqui and Dyer (2014), performing canonical correlation analysis (CCA) for multiple languages using English as the pivot. A shortcoming of MultiCCA is that it ignores polysemous translations by retaining only one-to-one dictionary pairs (Gouws et al., 2015), disregarding much information. As a solution, we propose a simple post hoc method that maps the English parts of each bilingual word embedding to each other. In this way, the mapping is always exact and one-to-one.
Duong et al. (2016) constructed bilingual word embeddings based on monolingual data and PanLex. In this way, their approach can be applied to more languages, as PanLex covers more than a thousand languages. They solve the polysemy problem by integrating an EM algorithm for selecting a lexicon. Relative to many previous crosslingual word embeddings, their joint training algorithm achieved state-of-the-art performance for the bilingual lexicon induction task, performing significantly better on monolingual similarity and achieving a competitive result on crosslingual document classification. Here we also adopt their approach, and extend it to multilingual embeddings.

Base model for bilingual embeddings
We briefly describe the base model (Duong et al., 2016), an extension of the continuous bag-of-words (CBOW) model (Mikolov et al., 2013a) with negative sampling. The original objective function is

$$O = \sum_{i \in D} \Big( \log \sigma(u_{w_i}^\top h_i) + \sum_{j=1}^{p} \log \sigma(-u_{w_{ij}}^\top h_i) \Big), \qquad (1)$$

where $D$ is the training data, $h_i = \frac{1}{2k} \sum_{j=-k;\, j \neq 0}^{k} v_{w_{i+j}}$ is a vector encoding the context over a window of size $k$ centred around position $i$, $V$ and $U \in \mathbb{R}^{|V_e| \times d}$ are learned matrices referred to as the context and centre word embeddings, $V_e$ is the vocabulary, and $p$ is the number of negative examples randomly drawn from a noise distribution, $w_{ij} \sim P_n(w)$.

Duong et al. (2016) extend the CBOW model for application to two languages, using monolingual text in both languages and a bilingual lexicon. Their approach augments CBOW by generating not only the middle word, but also its translation in the other language. This is done by first selecting a translation $\bar{w}_i$ from the lexicon for the middle word $w_i$, based on the cosine distance between the context $h_i$ and the context embeddings $V$ for each candidate foreign translation. In this way, source monolingual training contexts must generate both source and target words, and similarly target monolingual training contexts also generate source and target words. Overall this results in compatible word embeddings across the two languages, and highly informative nearest neighbours across the two languages. This leads to the new objective function, which up to term weights is

$$O = \sum_{i \in D_s \cup D_t} \Big( \log \sigma(u_{w_i}^\top h_i) + \log \sigma(u_{\bar{w}_i}^\top h_i) + \sum_{j=1}^{p} \log \sigma(-u_{w_{ij}}^\top h_i) \Big) - \sum_{w \in V_s \cup V_t} \| u_w - v_w \|_2^2, \qquad (2)$$

where $D_s$ and $D_t$ are the source and target monolingual data, and $V_s$ and $V_t$ are the source and target vocabularies. Compared with the CBOW objective function in Equation (1), this represents two additions: the translation cross entropy $\log \sigma(u_{\bar{w}_i}^\top h_i)$, and a regularisation term $\sum_{w \in V_s \cup V_t} \| u_w - v_w \|_2^2$ which penalises divergence between the context and centre word embedding vectors for each word type, and which was shown to improve the embedding quality (Duong et al., 2016).
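To make the per-position terms of the bilingual objective concrete, the following is a minimal sketch in plain Python. The function names (`context_vector`, `position_objective`) and the toy vectors are hypothetical and chosen only for illustration; the regularisation term and any term weights are left out, and this is not the authors' implementation.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(1 / (1 + exp(-x)))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def context_vector(context_vecs):
    """Average the context embeddings v_{w_{i+j}} of the window (the vector h_i)."""
    n = len(context_vecs)
    dim = len(context_vecs[0])
    return [sum(vec[d] for vec in context_vecs) / n for d in range(dim)]


def position_objective(h, u_word, u_translation, u_negatives):
    """Contribution of one training position to the bilingual objective:
    the centre word term, the translation term, and the negative samples."""
    score = log_sigmoid(dot(u_word, h))
    score += log_sigmoid(dot(u_translation, h))
    score += sum(log_sigmoid(-dot(u_neg, h)) for u_neg in u_negatives)
    return score


# Toy example with made-up 2-dimensional vectors.
h = context_vector([[0.8, 0.2], [1.2, -0.2]])
value = position_objective(h, [2.0, 0.0], [1.5, 0.5], [[-1.0, 0.0], [0.0, -2.0]])
```

The value is largest when the centre word and its selected translation both align with the context while the sampled negatives point away from it, which is what ties the two languages into one space.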

Post hoc Unification of Embeddings
Our goal is to learn multilingual word embeddings over more than two languages. One simple way to do this is to take several learned bilingual word embeddings which share a common target language (here, English), and map these into a shared space (Mikolov et al., 2013a; Faruqui and Dyer, 2014). In this section we propose post hoc methods; in §4 we develop an integrated multilingual method using joint inference.
Formally, the input to the post hoc combination methods is a set of $n$ pre-trained bilingual word embedding matrices, i.e., $\{(E_i, F_i)\}_{i=1}^{n}$, where $E_i \in \mathbb{R}^{|V_{e_i}| \times d}$ are the English word embeddings and $F_i \in \mathbb{R}^{|V_{f_i}| \times d}$ are the foreign language word embeddings for language $i$, with $V_{e_i}$ and $V_{f_i}$ being the English and foreign language vocabularies and $d$ the embedding dimension. These bilingual embeddings can be produced by any method, e.g., those discussed in §2.
Linear Transformation. The simplest method is to learn a linear transformation which maps the English part of each bilingual word embedding into the same space (inspired by Mikolov et al. (2013a)), as illustrated in Figure 1. One language pair is chosen as the pivot, en-it in this example, and the English side of the other language pairs, en-de, en-es, en-nl, are mapped to closely match the English side of the pivot, en-it. This is achieved through learning linear transformation matrices for each language, $W_{de}$, $W_{es}$ and $W_{nl}$, respectively, where each $W_i \in \mathbb{R}^{d \times d}$ is learned to minimize the objective $\| E_i W_i - E_{pivot} \|_2^2$, where $E_{pivot}$ is the English embedding of the pivot pair, en-it.
Each foreign language $f_i$ is then mapped to the same space using the learned matrix $W_i$, i.e., $F'_i = F_i W_i$. These projected foreign embeddings are then used in evaluation, along with the English side of the language pair with the largest English vocabulary coverage, i.e., the biggest $|V_{e_i}|$. Together these embeddings allow for querying of monolingual and crosslingual word similarity, and multilingual transfer of trained models.
The advantage of this approach is that it is very fast and simple to train, since the objective function is strictly convex and has a closed form solution. Moreover, unlike Mikolov et al. (2013a), who learn the projection from a source to a target language, we learn the projection from English to English, and thus do not require a lexicon, sidestepping the polysemy problem (see footnote 3).
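The closed form solution mentioned above is the ordinary least-squares solution $W_i = (E_i^\top E_i)^{-1} E_i^\top E_{pivot}$. The sketch below illustrates it in dependency-free Python under two assumptions that are not stated in the text: the rows of `E_i` and `E_pivot` have already been aligned on the English words shared by both vocabularies, and $E_i^\top E_i$ is invertible. All function names are hypothetical, and a numerical library would normally replace the hand-written solver.

```python
def transpose(m):
    """Transpose a matrix stored as a list of rows."""
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    bt = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in bt] for row in a]


def solve(a, b):
    """Solve a X = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(a)
    aug = [list(a[r]) + list(b[r]) for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular to working precision")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_linear_map(E_i, E_pivot):
    """Least-squares W minimising ||E_i W - E_pivot||^2 via the normal equations."""
    Et = transpose(E_i)
    return solve(matmul(Et, E_i), matmul(Et, E_pivot))


# Toy check: the pivot side is an exact linear image of E_i, so W is recovered.
E_i = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
true_W = [[2.0, 0.0], [1.0, 1.0]]
E_pivot = matmul(E_i, true_W)
W = fit_linear_map(E_i, E_pivot)

# Project (made-up) foreign embeddings into the pivot space: F' = F W.
F_projected = matmul([[0.5, 0.5]], W)
```

Because both sides of the regression are English embeddings of the same word types, no bilingual lexicon enters the fit; only the row alignment over shared English words is needed.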

Multilingual Joint Training
Instead of combining bilingual word embeddings in the post-processing step, it might be more beneficial to do it during training, so that languages can interact with each other more freely. We extend the method in §2.1 to jointly learn the multilingual word embeddings during training. The input to the model is the combined monolingual data for each language and the set of lexicons between any language pair. We modify the base model (Duong et al., 2016) to accommodate more languages. For the first step, instead of just predicting the translation for a single target language, we predict the translation for all languages in the lexicon. That is, we compute $w_i^f = \operatorname{argmax}_{w \in \mathrm{dict}_e^f(w_i^e)} \cos(v_w, h_i)$, which is the best translation in language $f$ of source word $w_i^e$ in language $e$, given the bilingual lexicon $\mathrm{dict}_e^f$ and the context $h_i$. For the second step, we jointly predict the word $w_i^e$ and all translations $w_i^f$ in all foreign languages $f \in T$ for which we have a dictionary $\mathrm{dict}_e^f$, as illustrated in Figure 2.
Footnote 3: A possible criticism of this approach is that a linear transformation is not powerful enough for the required mapping. We experimented with non-linear transformations but did not observe any improvements. Faruqui and Dyer (2014) extended Mikolov et al. (2013a) by projecting both source and target languages to the same space using canonical correlation analysis (CCA). We also adapted this approach to the multilingual setting by applying multi-view CCA to map the English part of each pre-trained bilingual word embedding to the same space. However, we only observed minor improvements.
The English word cat might have several translations in German {Katze, Raupe, Typ} and Italian {gatto, gatta}. In the first step, we select the closest translation given the context for each language, i.e., Katze and gatto for German and Italian respectively. In the second step, we jointly predict the English word cat together with the selected translations Katze and gatto using the following modified objective function, which up to term weights is

$$O = \sum_{i \in D_{all}} \Big( \log \sigma(u_{w_i^e}^\top h_i) + \sum_{f \in T} \log \sigma(u_{w_i^f}^\top h_i) + \sum_{j=1}^{p} \log \sigma(-u_{w_{ij}}^\top h_i) \Big) - \sum_{w \in V_{all}} \| u_w - v_w \|_2^2, \qquad (3)$$

where $D_{all}$ and $V_{all}$ are the combined monolingual data and vocabulary for all languages. Each of the $p$ negative samples, $w_{ij}$, is sampled from a unigram model over the combined vocabulary $V_{all}$.
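The first step of the joint model, choosing one translation per language by cosine similarity to the context, can be sketched as follows. The embedding values are invented purely so that the cat example resolves as described in the text, and `select_translations` is a hypothetical helper name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_translations(h, lexicon_entries, context_emb):
    """For each language, keep the candidate translation whose context
    embedding is closest (by cosine) to the context vector h."""
    return {
        lang: max(candidates, key=lambda w: cosine(context_emb[w], h))
        for lang, candidates in lexicon_entries.items()
    }


# Made-up 2-dimensional context embeddings for the candidates of "cat".
context_emb = {
    "Katze": [0.9, 0.1], "Raupe": [0.1, 0.9], "Typ": [-0.5, 0.5],
    "gatto": [1.0, 0.3], "gatta": [0.6, 0.8],
}
lexicon_entries = {"de": ["Katze", "Raupe", "Typ"], "it": ["gatto", "gatta"]}
h = [1.0, 0.2]  # context vector of the current English window (toy value)

selected = select_translations(h, lexicon_entries, context_emb)
```

The selected words, one per language, are then the extra prediction targets that enter the sum over $f \in T$ in the joint objective.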
Explicit mapping. As we keep adding more languages to the model, the hidden layer in our model, which is shared between all languages, might not be enough to accommodate all of them. However, we can combine the strength of the linear transformation proposed in §3 with our joint model as described in Equation (3). We explicitly learn the linear transformation jointly during training by adding the following regularization term to the objective function:

$$- \alpha \sum_{i \in D_e} \sum_{f \in F} \| u_{w_i^e} - W_f\, u_{w_i^f} \|_2^2, \qquad (4)$$

where $D_e$ is the English monolingual data (since we use English as the pivot language), $F$ is the set of foreign languages (not English), $W_f \in \mathbb{R}^{d \times d}$ is the linear transformation matrix, and $\alpha$ controls the contribution of the regularization term and will be tuned in §6. Thus, the set of learned parameters for the model are the word and context embeddings $U$, $V$ and $|F|$ linear transformation matrices, $\{W_f\}_{f \in F}$. After training is finished, we linearly transform the foreign language embeddings with the corresponding learned matrix $W_f$, such that all embeddings are in the same space.
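As an illustration of the explicit-mapping regulariser, the sketch below evaluates the penalty for a single English training position: the squared distance between the English word vector and each mapped foreign translation vector, scaled by α. It assumes column-vector multiplication (`W_f` applied on the left) and uses hypothetical names and toy values; in training the matrices would be updated by gradient steps, which is not shown here.

```python
def mat_vec(W, u):
    """Apply a d x d matrix (list of rows) to a length-d vector."""
    return [sum(w * x for w, x in zip(row, u)) for row in W]


def mapping_penalty(u_en, u_foreign, W, alpha):
    """alpha * sum over languages f of ||u_en - W_f u_f||^2 for one position."""
    total = 0.0
    for lang, u_f in u_foreign.items():
        mapped = mat_vec(W[lang], u_f)
        total += sum((a - b) ** 2 for a, b in zip(u_en, mapped))
    return alpha * total


identity = [[1.0, 0.0], [0.0, 1.0]]
swap = [[0.0, 1.0], [1.0, 0.0]]

u_en = [1.0, 0.0]                                   # English word vector (toy)
u_foreign = {"de": [0.0, 1.0], "it": [1.0, 0.0]}    # selected translations (toy)

# With the right matrices the mapped translations coincide with the English vector.
aligned = mapping_penalty(u_en, u_foreign, {"de": swap, "it": identity}, alpha=0.1)
# With identity matrices the German vector stays misaligned and is penalised.
misaligned = mapping_penalty(u_en, u_foreign, {"de": identity, "it": identity}, alpha=0.1)
```

A larger α pulls the mapped foreign vectors more strongly towards their English counterparts, at the cost of giving the prediction terms less influence.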

Experiment Setup
Our experimental setup is based on that of Duong et al. (2016). We use the first 5 million sentences from the tokenized monolingual data from the Wikipedia dump from Al-Rfou et al. (2013). The dictionary is from PanLex, which covers more than 1,000 language varieties. We build multilingual word embeddings for 5 languages (en, it, es, nl, de) jointly using the same parameters as Duong et al. (2016). During training, for a fairer comparison, we only use lexicons between English and each target language. However, it is straightforward to incorporate a lexicon between any pair of languages into our model. The pretrained bilingual word embeddings for the post-processing experiment in §3 are also from Duong et al. (2016). In the following sections, we evaluate the performance of our multilingual word embeddings in comparison with bilingual word embeddings and previously published multilingual word embeddings.

Table 1 (columns: it-en, es-en, nl-en, nl-es, Average, each with rec 1 and rec 5; the start of the caption and the cell values are not recoverable): ... joint prediction as in Equation (3) (Joint) and joint prediction with explicit mapping as in Equation (4) (+mapping). We report recall at 1 and 5 with respect to four baseline multilingual word embeddings. The best scores are shown in bold.

Bilingual Lexicon Induction
In this section we evaluate our multilingual models on the bilingual lexicon induction (BLI) task, which tests the bilingual quality of the model. Given a word in the source language, the model must predict the translation in the target language. We report recall at 1 and 5 for the various models listed in Table 1. The evaluation data for the it-en, es-en, and nl-en pairs was manually constructed (Vulić and Moens, 2015). We extend the evaluation to the nl-es pair, which does not involve English. The BiWE results for pairs involving English in Table 1 are from Duong et al. (2016), the current state of the art in this task. For the nl-es pair, we cannot build bilingual word embeddings, since we do not have a corresponding bilingual lexicon. Instead, we use English as the pivot language. To get the nl-es translation, we use the two bilingual embeddings nl-en and es-en from Duong et al. (2016). We get the best English translation for the Dutch word, and then get the top 5 Spanish translations with respect to that English word. This simple trick performs surprisingly well, probably because bilingual word embeddings involving English, such as nl-en and es-en from Duong et al. (2016), are very accurate.
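The evaluation metric and the English-pivot trick described above can be sketched as follows. Here `recall_at_k` counts a source word as correct when any gold translation appears among its top-k predictions; treating the gold standard as a set of acceptable translations per word is an assumption, and the tiny lookup tables stand in for nearest-neighbour queries over real embeddings. All names and data are hypothetical.

```python
def recall_at_k(ranked, gold, k):
    """Fraction of source words with a gold translation among the top-k predictions."""
    hits = sum(
        1 for word, answers in gold.items()
        if set(ranked.get(word, [])[:k]) & answers
    )
    return hits / len(gold)


def pivot_ranking(word, src_to_en, en_to_tgt, k=5):
    """Translate via English: take the single best English translation of the
    source word, then return the top-k target translations of that English word."""
    english = src_to_en(word)
    if not english:
        return []
    return en_to_tgt(english[0])[:k]


# Toy ranked translation tables standing in for nl-en and en-es embedding queries.
nl_en = {"kat": ["cat", "kitten"], "hond": ["dog"]}
en_es = {"cat": ["gato", "gata", "felino"], "dog": ["can", "perro"]}

ranked = {
    w: pivot_ranking(w, lambda x: nl_en.get(x, []), lambda x: en_es.get(x, []))
    for w in nl_en
}
gold = {"kat": {"gato"}, "hond": {"perro"}}

r1 = recall_at_k(ranked, gold, 1)
r5 = recall_at_k(ranked, gold, 5)
```

Note that pivoting commits to one English word before the second lookup, so an error in the first step cannot be recovered in the second; this is one reason a shared multilingual space is attractive for pairs without English.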
For the linear transformation, we use the first pair, it-en, as the pivot and learn to project the es-en, de-en and nl-en pairs to this space, as illustrated in Figure 1. We use the English part ($E_{biggest}$) from the transformed de-en pair as the English output. Despite its simplicity, the linear transformation performs surprisingly well.
Our joint model, which predicts all target languages simultaneously as described in Equation (3), performs consistently better than the linear transformation on all language pairs. The joint model with explicit mapping, as described in Equation (4), can be understood as the combination of the joint model and the linear transformation. For this model, we need to tune α in Equation (4). We tested α with values in $\{10^{-i}\}_{i=0}^{5}$ using the es-en pair on the BLI task; α = 0.1 gives the best performance. To avoid over-fitting, we use the same value of α for all experiments and all other pairs. With this tuned value of α, our joint model with mapping clearly outperforms the other proposed methods on all pairs. More importantly, this result is substantially better than all the baselines across four language pairs and two evaluation metrics. Compared with the state of the art (BiWE), our final model (Joint + Mapping) is more general and more widely applicable, yet achieves relatively better results, especially for recall at 5.

Monolingual similarity
The multilingual word embeddings should preserve the monolingual properties of the languages. We evaluate using the monolingual similarity task proposed in Luong et al. (2015). In this task, the model is asked to give the similarity score for a pair of words in the same language. This score is then measured against human judgment. Following Duong et al. (2016), we evaluate on three datasets: WordSim353 (WS-en), RareWord (RW-en), and the German version of WordSim353 (WS-de) (Finkelstein et al., 2001; Luong et al., 2013; Luong et al., 2015). Table 2 shows the results of our multilingual word embeddings with respect to several baselines. The trend is similar to the bilingual lexicon induction task. Linear transformation performs surprisingly well. Our joint model achieves a result similar to linear transformation (better on WS-de but worse on WS-en and RW-en). Our joint model with explicit mapping recovers from this drop and performs slightly better than linear transformation. More importantly, this model is substantially better than all baselines, except for MultiTrans on the RW-en dataset. This can probably be explained by the low coverage of MultiTrans on this dataset. Our final model (Joint + Mapping) is also close to the best bilingual word embeddings (BiWE) performance reported by Duong et al. (2016).
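The text does not name the statistic used to compare model scores with human judgments. A common choice for word similarity datasets is Spearman's rank correlation, and the sketch below assumes that choice; the function names and the example scores are hypothetical.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for pos in range(start, end + 1):
            ranks[order[pos]] = shared
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def spearman(model_scores, human_scores):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(model_scores), average_ranks(human_scores))


# Made-up cosine similarities and human ratings for three word pairs.
rho = spearman([0.10, 0.55, 0.90], [1.5, 6.0, 9.2])
```

Only the ordering of the model scores matters under this statistic, so embeddings are not penalised for using a different similarity scale from the human raters.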

Crosslingual Document Classification
In the previous sections, we have shown that our methods for building multilingual word embeddings, either in the post-processing step or during training, preserve high quality bilingual and monolingual relations. In this section, we demonstrate the usefulness of multi-language crosslingual word embeddings through the crosslingual document classification (CLDC) task. This task exploits transfer learning, where the document classifier is trained on the source language and tested on the target language. The source language classifier is transferred to the target language using crosslingual word embeddings, as the document is represented as the sum of bag-of-word embeddings weighted by tf.idf. This setting is useful for low-resource target languages where the annotated data is insufficient.

Table 3: Crosslingual document classification accuracy for various models. Chandar A P et al. (2014) and Luong et al. (2015), which achieved state-of-the-art results for en→de and de→en respectively, serve as the reference. The best results for bilingual and multilingual word embeddings are bold.
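The document representation used for CLDC, a tf.idf-weighted sum of word embeddings, can be sketched as below. The exact tf.idf variant is not given in the text, so raw term counts and a plain log(N/df) inverse document frequency are assumptions; `idf_table` and `doc_vector` are hypothetical names and the data is a toy example.

```python
import math
from collections import Counter


def idf_table(docs):
    """Inverse document frequency log(N / df) for every word in the collection."""
    n_docs = len(docs)
    doc_freq = Counter(word for doc in docs for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}


def doc_vector(tokens, emb, idf):
    """Sum of word embeddings, each weighted by term frequency times idf.
    Words without an embedding or an idf value are skipped."""
    dim = len(next(iter(emb.values())))
    vec = [0.0] * dim
    for word, tf in Counter(tokens).items():
        if word in emb and word in idf:
            weight = tf * idf[word]
            for d in range(dim):
                vec[d] += weight * emb[word][d]
    return vec


# Toy collection and made-up 2-dimensional crosslingual embeddings.
docs = [["bank", "market"], ["bank", "election"]]
emb = {"bank": [5.0, 5.0], "market": [1.0, 0.0], "election": [0.0, 1.0]}
idf = idf_table(docs)

vec = doc_vector(["bank", "market", "market", "unknown"], emb, idf)
```

Because source and target words live in the same space, a classifier trained on such vectors for source-language documents can be applied unchanged to vectors built from target-language documents.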
The train and test data are from the multilingual RCV1/RCV2 corpus (Lewis et al., 2004), where each document is annotated with labels from 4 categories: CCAT (Corporate/Industrial), ECAT (Economics), GCAT (Government/Social) and MCAT (Markets). We extend the evaluation from Klementiev et al. (2012) to cover more language pairs. We use the same data split for the en→de and de→en pairs but additionally construct the train and test data for it→de, it→es and en→es. For each pair, we use 1,000 documents in the source language as the training data and 5,000 documents in the target language as the test data. The training data is randomly sampled, but the test data (for es) is evenly balanced among labels. Table 3 shows the accuracy for the CLDC task for many pairs and models with respect to the baselines. For all bilingual models (Duong et al., 2016; Luong et al., 2015; Chandar A P et al., 2014), the bilingual word embeddings are constructed for each pair separately. As a result, they only cover pairs involving English, for which many bilingual resources exist. For all our models, including Linear, Joint and Joint + Mapping, the embedding space is available for multiple languages; this is why we can exploit different relations, such as it→es. This is the motivation for the work reported in this paper. Suppose we want to build a document classifier for es but lack any annotations. It is common to build en-es crosslingual word embeddings for transfer learning, but this only achieves 53.8% accuracy. Yet when we use it as the source, we get 81.0% accuracy. This can be explained by the fact that it and es are very similar.
The trend observed in Table 3 is consistent with previous observations. Linear transformation performs well. Joint training performs better, especially for the it→de pair. The joint model with explicit mapping is generally our best model, even better than the base bilingual model from Duong et al. (2016). The de→en result improves on the existing state of the art reported in Luong et al. (2015). Our final model (Joint + Mapping) achieves competitive results compared with four strong baseline multilingual word embeddings, achieving the best results for two out of five pairs. Moreover, the best scores for each language pair are all from multilingual training, emphasizing the advantages over bilingual training.
Mikolov et al. (2013b) showed that monolingual word embeddings capture some analogy relations such as Paris − France + Italy ≈ Rome. It seems that in our multilingual embeddings, these relations still hold. Table 4 shows some examples of such relations where each word in the analogy query is in a different language.
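An analogy query over a shared multilingual space can be sketched as vector arithmetic followed by a nearest-neighbour search. The embedding values below are invented so that the example behaves like the chico − bruder + sorella query; excluding the three query words from the candidates is a common convention and an assumption here, as are the function names and the language-suffixed keys.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def analogy(a, b, c, emb, topn=5):
    """Words closest to emb[a] - emb[b] + emb[c], excluding the query words."""
    target = [x - y + z for x, y, z in zip(emb[a], emb[b], emb[c])]
    candidates = [w for w in emb if w not in (a, b, c)]
    candidates.sort(key=lambda w: cosine(emb[w], target), reverse=True)
    return candidates[:topn]


# Made-up 3-dimensional vectors; keys carry a language suffix for readability.
emb = {
    "chico_es": [1.0, 0.0, 1.0],
    "bruder_de": [1.0, 1.0, 0.0],
    "sorella_it": [-1.0, 1.0, 0.0],
    "chica_es": [-1.0, 0.0, 1.0],
    "ragazza_it": [-0.9, 0.1, 1.0],
    "marito_it": [1.0, 0.0, -1.0],
}

neighbours = analogy("chico_es", "bruder_de", "sorella_it", emb, topn=2)
```

Since every language shares one space, the candidate list is not restricted to the language of any query word, which is why results in several languages can appear for a single query.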

Analysis
All our baselines (MultiCluster, MultiCCA, MultiSkip, MultiTrans) are trained using different datasets. While MultiSkip and MultiTrans are trained on parallel corpora, MultiCluster and MultiCCA use monolingual corpora and bilingual lexicons, which is similar to our proposed methods. Therefore, for a strict comparison (see footnote 8), we train our best model (Joint + Mapping) using the same monolingual data and set of bilingual lexicons on the same 12 languages as MultiCluster and MultiCCA. Table 5 shows the performance on the intrinsic and extrinsic tasks proposed in Ammar et al. (2016). Multilingual dependency parsing and document classification models are trained on a set of source languages and tested on a target language in the transfer learning setting. The monolingual word similarity task is similar to our monolingual similarity task described in §7; multilingual word similarity is an extension of the monolingual word similarity task, but tested on pairs of words in different languages. Monolingual QVEC and multilingual QVEC test the linguistic content of word embeddings in monolingual and multilingual settings. Monolingual QVEC-CCA and multilingual QVEC-CCA are the extended versions of monolingual QVEC and multilingual QVEC, also proposed in Ammar et al. (2016). Table 5 shows that our model achieves competitive results, with the best score on 4 out of 9 evaluation tasks.

Footnote 8: The comparison is also strict with respect to word coverage, since MultiSkip and MultiTrans usually have much lower word coverage, biasing the intrinsic evaluations.

Table 4: Analogy queries in which each word is in a different language, with the five words returned for each query (language code after the underscore, English gloss in parentheses).

Query 1: chico_es - bruder_de + sorella_it (boy - brother + sister)
Query 2: ehemann_de - padre_es + madre_it (husband - father + mother)
Query 3: principe_it - junge_de + meisje_nl (prince - boy + girl)

Query 1 | Query 2 | Query 3
chica_es (girl) | echtgenote_nl (wife) | principessa_it (princess)
ragazza_it (girl) | moglie_it (wife) | princess_en
meisje_nl (girl) | her_en | princesa_es (princess)
girl_en | marito_it (husband) | príncipe_es (prince)
mädchen_de (girl) | haar_nl (her) | prinzessin_de (princess)

Table 5: Performance of our model compared with MultiCluster and MultiCCA using extrinsic and intrinsic evaluation tasks on 12 languages proposed in Ammar et al. (2016); all models are trained on the same dataset. The best score for each task is bold.

Conclusion
In this paper, we introduced several methods for building unified multilingual word embeddings. These represent an improvement because they exploit more relations and combine information from many languages. The input to our model is just a set of monolingual data and a set of bilingual lexicons between any language pairs. We induce the bilingual relationship for all language pairs while keeping high quality monolingual relations. Our multilingual joint training model with explicit mapping consistently achieves better performance than linear transformation. We achieve new state-of-the-art performance on the bilingual lexicon induction task for recall at 5, and results comparable to the state-of-the-art bilingual word embeddings on the monolingual similarity task (Duong et al., 2016). Moreover, our model is competitive on the crosslingual document classification task, achieving a new state of the art for the de→en and it→de pairs.