Cross-lingual Models of Word Embeddings: An Empirical Comparison

Despite interest in using cross-lingual knowledge to learn word embeddings for various tasks, a systematic comparison of the possible approaches is lacking in the literature. We perform an extensive evaluation of four popular approaches for inducing cross-lingual embeddings, each requiring a different form of supervision, on four typologically different language pairs. Our evaluation setup spans four different tasks, including intrinsic evaluation on monolingual and cross-lingual similarity, and extrinsic evaluation on downstream semantic and syntactic applications. We show that models which require expensive cross-lingual knowledge almost always perform better, but cheaply supervised models often prove competitive on certain tasks.

Several models for inducing cross-lingual embeddings have been proposed, each requiring a different form of cross-lingual supervision: some can use document-level alignments (Vulić and Moens, 2015), others need alignments at the sentence level (Hermann and Blunsom, 2014) or word level (Faruqui and Dyer, 2014; Gouws and Søgaard, 2015), while some require both sentence and word alignments (Luong et al., 2015). However, a systematic comparison of these models is missing from the literature, making it difficult to analyze which approach is suitable for a particular NLP task. In this paper, we fill this void by empirically comparing four cross-lingual word embedding models, each of which requires a different form of alignment(s) as supervision, across several dimensions. To this end, we train these models on four different language pairs, and evaluate them on both monolingual and cross-lingual tasks.1 First, we show that different models can be viewed as instances of a more general framework for inducing cross-lingual word embeddings. Then, we evaluate these models on both extrinsic and intrinsic tasks. Our intrinsic evaluation assesses the quality of the vectors on monolingual (§4.2) and cross-lingual (§4.3) word similarity tasks, while our extrinsic evaluation spans semantic (cross-lingual document classification, §4.4) and syntactic tasks (cross-lingual dependency parsing, §4.5).
Our experiments show that word vectors trained using expensive cross-lingual supervision (word alignments or sentence alignments) perform best on semantic tasks. On the other hand, for syntactic tasks like cross-lingual dependency parsing, models requiring a weaker form of cross-lingual supervision (such as a context-agnostic translation dictionary) are competitive with models requiring expensive supervision. We also show qualitatively how the nature of the cross-lingual supervision used to train word vectors affects the proximity of translation pairs across languages, and of words with similar meaning in the same language, in the vector space.

1 Instructions and code to reproduce the experiments are available at http://cogcomp.cs.illinois.edu/page/publication_view/794

Bilingual Embeddings
A general schema for inducing bilingual embeddings is shown in Figure 1. Our comparison focuses on dense, fixed-length distributed embeddings which are obtained using some form of cross-lingual supervision. We briefly describe the embedding induction procedure for each of the selected bilingual word vector models, with the aim of providing a unified algorithmic perspective on all methods and facilitating better understanding and comparison. Our choice of models spans the different forms of supervision required for inducing the embeddings, illustrated in Figure 2.
Notation. Let W = {w_1, w_2, ..., w_{|W|}} be the vocabulary of a language l1 with |W| words, and W ∈ R^{|W|×l} be the corresponding word embeddings of length l. Let V = {v_1, v_2, ..., v_{|V|}} be the vocabulary of another language l2 with |V| words, and V ∈ R^{|V|×m} the corresponding word embeddings of length m. We denote the word vector for a word w by w.

Bilingual Skip-Gram Model (BiSkip)
Luong et al. (2015) proposed Bilingual Skip-Gram, a simple extension of the monolingual skip-gram model, which learns bilingual embeddings by using a parallel corpus along with word alignments (both sentence and word level alignments). The learning objective expands the context of a word to include bilingual links obtained from word alignments, so that the model is trained to predict words cross-lingually. In particular, given a word alignment link from word v ∈ V in language l2 to w ∈ W in language l1, the model predicts the context words of w using v, and vice versa. Formally, the cross-lingual part of the objective is

    D12 = - Σ_{(v,w) ∈ Q} Σ_{w_c ∈ NBR1(w)} log P(w_c | v)    (1)

where NBR1(w) is the context of w in language l1, Q is the set of word alignments, and P(w_c | v) ∝ exp(w_c^T v). Another similar term, D21, models the objective for v and NBR2(v). The objective can be cast into Algorithm 1 by taking C(W, V) = D12 + D21, where A(W) and B(V) are the familiar skip-gram formulations of the monolingual part of the objective. α and β are chosen hyper-parameters which set the relative importance of the monolingual terms.
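As a concrete (toy) sketch of the cross-lingual term D12, the snippet below computes the negative log-likelihood of predicting the l1-context words of w from an aligned l2 word v, using a full softmax. The function name and data layout are ours for illustration; the authors' implementation uses negative sampling for efficiency rather than a full softmax.

```python
import numpy as np

def crosslingual_nll(W, V, alignments, neighbors):
    """Cross-lingual part of the BiSkip objective (the D12 term):
    for each alignment link (v, w) in Q, use the l2 word vector v to
    predict the l1-context words NBR1(w), with P(w_c | v) a softmax
    over the l1 vocabulary."""
    total = 0.0
    for v_idx, w_idx in alignments:
        scores = W @ V[v_idx]                      # unnormalized scores over l1 vocab
        log_probs = scores - np.log(np.exp(scores).sum())
        for c_idx in neighbors[w_idx]:             # NBR1(w): context word indices
            total -= log_probs[c_idx]
    return total
```

Predicting a context word whose vector agrees with v yields a lower loss than predicting one that does not, which is exactly the pressure that pulls aligned words together across languages.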

Bilingual Compositional Model (BiCVM)
Hermann and Blunsom (2014) present a method that learns bilingual word vectors from a sentence-aligned corpus. Their model leverages the fact that aligned sentences have equivalent meaning, and thus their sentence representations should be similar.
We denote two aligned sentences by v = (x_1, ..., x_p) and w = (y_1, ..., y_q), where x_i ∈ V, y_i ∈ W are vectors corresponding to the words in the sentences. Let functions f : v → R^n and g : w → R^n map sentences to their semantic representations in R^n. BiCVM generates word vectors by minimizing the squared ℓ2 norm between the sentence representations of aligned sentences. In order to prevent the degeneracy arising from directly minimizing the ℓ2 norm, they use a noise-contrastive large-margin update, with randomly drawn sentence pairs (v, w_n) as negative samples. The loss for the sentence pairs (v, w) and (v, w_n) can be written as

    E(v, w, w_n) = [m + E(v, w) - E(v, w_n)]_+

where E(v, w) = ||f(v) - g(w)||^2 is the squared ℓ2 distance between the sentence representations and m is the margin. This can be cast into Algorithm 1 by taking C(W, V) to be the sum of these hinge losses over aligned sentence pairs, with A(W) and B(V) being regularizers and α = β.

Figure 2: Forms of supervision required by the four models compared in this paper. From left to right, the cost of the supervision required varies from expensive (BiSkip) to cheap (BiVCD). BiSkip requires a parallel corpus annotated with word alignments (Fig. 2a), BiCVM requires a sentence-aligned corpus (Fig. 2b), BiCCA only requires a bilingual lexicon (Fig. 2c), and BiVCD requires comparable documents (Fig. 2d).
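The noise-contrastive margin loss can be sketched as follows, assuming the additive composition variant of BiCVM (a sentence representation is the sum of its word vectors); the function name and the default margin are illustrative, not taken from the authors' release.

```python
import numpy as np

def bicvm_loss(sent_v, sent_w, sent_w_neg, margin=1.0):
    """Hinge loss for one aligned sentence pair (v, w) and one randomly
    drawn negative sentence w_n. Each sentence is a (num_words, dim)
    array; composition is additive (sum of word vectors)."""
    f_v = sent_v.sum(axis=0)           # f(v): representation of the l2 sentence
    g_w = sent_w.sum(axis=0)           # g(w): representation of the aligned l1 sentence
    g_wn = sent_w_neg.sum(axis=0)      # g(w_n): representation of the negative sample
    E = lambda a, b: np.sum((a - b) ** 2)   # squared l2 distance
    return max(0.0, margin + E(f_v, g_w) - E(f_v, g_wn))
```

When the aligned pair is already much closer than the negative pair, the loss is zero, so only violating triples produce gradient updates.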

Bilingual Correlation Based Embeddings (BiCCA)
The BiCCA model, proposed by Faruqui and Dyer (2014), showed that when (independently trained) monolingual vector matrices W, V are projected using CCA (Hotelling, 1936) to respect a translation lexicon, their performance improves on word similarity and word analogy tasks. They first construct W' ⊆ W, V' ⊆ V such that |W'| = |V'| and the corresponding words (w_i, v_i) in the matrices are translations of each other. The projection is then computed as

    P_W, P_V = CCA(W', V')
    W* = W P_W,  V* = V P_V

where P_W ∈ R^{l×d}, P_V ∈ R^{m×d} are the projection matrices with d ≤ min(l, m), and V* ∈ R^{|V|×d}, W* ∈ R^{|W|×d} are the word vectors that have been "enriched" using bilingual knowledge. The BiCCA objective can be viewed as the instantiation of Algorithm 1 in which W = W_0 P_W and V = V_0 P_V, with α = β = γ = ∞ imposing hard constraints.
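The CCA step can be sketched with a textbook SVD-based implementation in numpy. This is a generic CCA (with a small eigenvalue floor for numerical stability), not the tool released by Faruqui and Dyer; function names are ours.

```python
import numpy as np

def inv_sqrt(M, eps=1e-8):
    """Inverse matrix square root of a symmetric PSD matrix via eigendecomposition."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(1.0 / np.sqrt(np.maximum(vals, eps))) @ vecs.T

def cca_projections(X, Y, d):
    """Return projection matrices P_W (l x d) and P_V (m x d) maximizing the
    correlation between rows of X P_W and Y P_V. X and Y hold the vectors of
    translation pairs, one pair per row."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Cxx, Cyy = X.T @ X / n, Y.T @ Y / n          # within-view covariances
    Cxy = X.T @ Y / n                            # cross-view covariance
    Wx, Wy = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, S, Vt = np.linalg.svd(Wx @ Cxy @ Wy)      # canonical directions in whitened space
    return Wx @ U[:, :d], Wy @ Vt.T[:, :d]
```

Applying the returned projections to the full monolingual matrices (not just the lexicon rows) yields the enriched embeddings W* and V*.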

Bilingual Vectors from Comparable Data (BiVCD)
Another approach for inducing bilingual word vectors, which we refer to as BiVCD, was proposed by Vulić and Moens (2015). Their approach is designed to use a comparable corpus between the source and target language pair to induce cross-lingual vectors.
Let d_e and d_f denote a pair of comparable documents with lengths (in words) p and q respectively (assume p > q). BiVCD first merges these two comparable documents into a single pseudo-bilingual document using a deterministic strategy based on the length ratio of the two documents, R = p/q. Every Rth word of the merged pseudo-bilingual document is picked sequentially from d_f. Finally, a skip-gram model is trained on the corpus of pseudo-bilingual documents to generate vectors for all words in W* ∪ V*. The vectors constituting W* and V* can then be easily identified.
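One reading of the merging strategy is sketched below. The paper specifies only the ratio R = p/q, so the integer rounding and the handling of leftover tokens here are our assumptions, not necessarily the original implementation.

```python
def merge_documents(d_e, d_f):
    """Merge two comparable documents (token lists) into a single
    pseudo-bilingual document: after every R tokens of the longer
    document, one token of the shorter document is interleaved,
    where R is the (rounded) length ratio."""
    if len(d_e) < len(d_f):
        d_e, d_f = d_f, d_e                 # ensure d_e is the longer document
    R = max(1, round(len(d_e) / len(d_f)))
    merged, short = [], list(d_f)
    for i, tok in enumerate(d_e, 1):
        merged.append(tok)
        if i % R == 0 and short:
            merged.append(short.pop(0))     # every Rth position comes from d_f
    merged.extend(short)                    # append any leftover short-document tokens
    return merged
```

The merged document mixes both vocabularies in shared contexts, which is what lets a standard monolingual skip-gram model learn a shared space.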
Instantiating BiVCD in the general algorithm is straightforward: C(W, V) assumes the familiar word2vec skip-gram objective over the pseudo-bilingual document. Although BiVCD is designed to use comparable corpora, we provide it with parallel data in our experiments (to ensure comparability) and treat two aligned sentences as comparable.

Data
We train cross-lingual embeddings for 4 language pairs: English-German (en-de), English-French (en-fr), English-Swedish (en-sv) and English-Chinese (en-zh). For en-de and en-sv we use the Europarl v7 parallel corpus (Koehn, 2005). For en-fr, we use Europarl combined with the news-commentary and UN-corpus datasets from WMT 2015. For en-zh, we use the FBIS parallel corpus from the news domain (LDC2003E14). We use the Stanford Chinese Segmenter (Tseng et al., 2005) to preprocess the en-zh parallel corpus. Corpus statistics for all languages are shown in Table 1.

Evaluation
We measure the quality of the induced cross-lingual word embeddings in terms of their performance when used as features in the following tasks:
• Monolingual word similarity for English
• Cross-lingual dictionary induction
• Cross-lingual document classification
• Cross-lingual syntactic dependency parsing
The first two tasks intrinsically measure how much monolingual and cross-lingual similarity can benefit from cross-lingual training. The last two tasks measure the ability of cross-lingually trained vectors to extrinsically facilitate model transfer across languages, for semantic and syntactic applications respectively. These tasks have been used in previous work (Klementiev et al., 2012; Luong et al., 2015; Vulić and Moens, 2013a; Guo et al., 2015) for evaluating cross-lingual embeddings, but no comparison exists which uses them in conjunction.
To ensure fair comparison, all models are trained with embeddings of size 200. We provide all models with parallel corpora, irrespective of their requirements. Whenever possible, we also report statistical significance of our results.

Parameter Selection
We follow the BestAvg parameter selection strategy from Lu et al. (2015): we selected the parameters for all models by tuning over a set of values (described below) and picking the parameter setting which did best on average across all tasks.
BiSkip. The word alignments for training the model (available at github.com/lmthang/bivec) were generated using fast_align (Dyer et al., 2013). The number of training iterations was set to 5 (no tuning), and we set α = 1 and β = 1 (no tuning).
BiCCA. First, monolingual word vectors are trained using the skip-gram model (code.google.com/p/word2vec) with negative sampling (Mikolov et al., 2013a) and a window of size 5 (tuned over {5, 10, 20}). To generate a cross-lingual dictionary, word alignments are generated using cdec from the parallel corpus. Then, word pairs (a, b), a ∈ l1, b ∈ l2, are selected such that a is aligned to b most often and vice versa. This way, we obtained dictionaries of approximately 36k, 35k, 30k and 28k word pairs for en-de, en-fr, en-sv and en-zh respectively.
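The mutual best-alignment filter used to build these dictionaries can be sketched as follows; the function name and toy data are ours for illustration.

```python
from collections import Counter

def bilingual_dictionary(aligned_pairs):
    """Extract a translation dictionary from word-alignment instances:
    keep (a, b) only if b is the word a aligns to most often AND a is
    the word b aligns to most often (mutual best alignment)."""
    src_counts, tgt_counts = {}, {}
    for a, b in aligned_pairs:
        src_counts.setdefault(a, Counter())[b] += 1
        tgt_counts.setdefault(b, Counter())[a] += 1
    dictionary = []
    for a, counts in src_counts.items():
        b = counts.most_common(1)[0][0]          # a's most frequent alignment
        if tgt_counts[b].most_common(1)[0][0] == a:  # ...and vice versa
            dictionary.append((a, b))
    return dictionary
```

Requiring the best alignment in both directions filters out many noisy one-directional alignment links.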
The monolingual vectors are aligned using the above dictionaries with the tool (available at github.com/mfaruqui/eacl14-cca) released by Faruqui and Dyer (2014) to generate the cross-lingual word embeddings. We use k = 0.5 as the ratio of canonical components retained (tuned over {0.2, 0.3, 0.5, 1.0}). Note that this results in an embedding of size 100 after performing CCA.
BiVCD. We use word2vec's skip-gram model for training our embeddings, with a window size of 5 (tuned on {5, 10, 20, 30}) and the negative sampling parameter set to 5 (tuned on {5, 10, 25}). Every pair of parallel sentences is treated as a pair of comparable documents, and merging is performed using the sentence-length ratio strategy described earlier.

Monolingual Evaluation
We first evaluate if the inclusion of cross-lingual knowledge improves the quality of English embeddings.
Word Similarity. Word similarity datasets contain word pairs which are assigned similarity ratings by humans. The task evaluates how well the notion of word similarity according to humans is emulated in the vector space. Evaluation is based on Spearman's rank correlation coefficient (Myers and Well, 1995) between the human rankings and the rankings produced by computing cosine similarity between the vectors of the two words.
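This evaluation procedure can be sketched as below. We include a minimal Spearman implementation that assumes no tied values; real evaluations typically use scipy.stats.spearmanr, which handles ties. Out-of-vocabulary pairs are skipped, as is standard practice.

```python
import numpy as np

def _ranks(x):
    """Rank transform (0 = smallest); assumes no ties for simplicity."""
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(len(x))
    return ranks

def evaluate_word_similarity(vectors, rated_pairs):
    """Spearman correlation between human similarity ratings and the
    cosine similarities of the corresponding word vectors."""
    human, model = [], []
    for w1, w2, score in rated_pairs:
        if w1 in vectors and w2 in vectors:   # skip OOV pairs
            a, b = vectors[w1], vectors[w2]
            human.append(score)
            model.append(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return np.corrcoef(_ranks(np.array(human)), _ranks(np.array(model)))[0, 1]
```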
We use the SimLex dataset for English (Hill et al., 2014), which contains 999 pairs of English words with a balanced set of noun, adjective and verb pairs. SimLex is claimed to capture word similarity exclusively, unlike WordSim-353 (Finkelstein et al., 2001), which captures both word similarity and relatedness. We declare a significant improvement if p < 0.1 according to Steiger's method (Steiger, 1980) for calculating the statistically significant difference between two dependent correlation coefficients. Table 2 shows the performance of the English embeddings induced by all models, trained on the different language pairs, on the SimLex word similarity task. The score obtained by monolingual English embeddings trained on the English side of each language pair is shown in the column marked Mono. In all cases (except BiCCA on en-sv), the bilingually trained vectors achieve better scores than the monolingually trained vectors.
Overall, across all language pairs, BiCVM is the best performing model in terms of Spearman's correlation, but its improvement over BiSkip and BiVCD is often insignificant. It is notable that two of the three top performing models, BiCVM and BiVCD, need only sentence-aligned and document-aligned corpora, which are easier to obtain than the parallel data with word alignments required by BiSkip.
QVEC. Tsvetkov et al. (2015) proposed an intrinsic evaluation metric for estimating the quality of English word vectors. The score produced by QVEC measures how well a given set of word vectors is able to quantify the linguistic properties of words, with higher being better. The metric is shown to have strong correlation with performance on downstream semantic applications. As it can currently only be used for English, we use it to evaluate the English vectors obtained through cross-lingual training of the different models. Table 3 shows that on average across language pairs, BiSkip achieves the best score, followed by Mono (monolingually trained English vectors), BiVCD and BiCCA. A possible explanation for why the Mono scores are better than those obtained by some of the cross-lingual models is that QVEC measures monolingual semantic content based on a linguistic oracle made for English; cross-lingual training might affect these semantic properties arbitrarily.

6 We implemented the code for performing the BiVCD merging ourselves, as we could not find a tool provided by the authors.
Interestingly, BiCVM, which was the best model according to SimLex, ranks last according to QVEC. The fact that the best models according to QVEC and word similarity are different reinforces observations made in previous work that performance on word similarity tasks alone does not reflect the quantification of linguistic properties of words (Tsvetkov et al., 2015; Schnabel et al., 2015).

Table 4: Cross-lingual dictionary induction results (top-10 accuracy). The same trend was also observed across models when computing MRR (mean reciprocal rank).

Cross-lingual Dictionary Induction
The task of cross-lingual dictionary induction (Vulić and Moens, 2013a; Mikolov et al., 2013b) judges how good cross-lingual embeddings are at detecting word pairs that are semantically similar across languages. We follow the setup of Vulić and Moens (2013a), but instead of manually creating a gold cross-lingual dictionary, we derive our gold dictionaries from the Open Multilingual WordNet data released by Bond and Foster (2013). The data includes synset alignments across 26 languages with over 90% accuracy. First, we prune out words from each synset whose frequency count is less than 1000 in the vocabulary of the training data from §3. Then, for each pair of aligned synsets s1 = {k1, k2, ...} and s2 = {g1, g2, ...}, we include all elements from the set {(k, g) | k ∈ s1, g ∈ s2} in the gold dictionary, where k and g are the lemmas. Using this approach we generated dictionaries of sizes 1.5k, 1.4k, 1.0k and 1.6k pairs for en-fr, en-de, en-sv and en-zh respectively.
We report top-10 accuracy, which is the fraction of entries (e, f) in the gold dictionary for which f belongs to the list of the top-10 neighbors of the word vector of e, according to the induced cross-lingual embeddings. From the results (Table 4), it can be seen that for dictionary induction, performance improves with the quality of supervision: as we move from cheaply supervised methods (e.g., BiVCD) to methods with more expensive supervision (e.g., BiSkip), the accuracy improves. This suggests that for cross-lingual similarity tasks, the more expensive the cross-lingual knowledge available, the better. Models using weak supervision like BiVCD perform poorly in comparison to models like BiSkip and BiCVM, with performance gaps upwards of 10 points on average.

Table 5: Cross-lingual document classification accuracy when trained on language l1 and evaluated on language l2. The best score for each language is shown in bold. Scores which are significantly better (per McNemar's test with p < 0.05) than the next lower score are underlined. For example, for sv→en, BiVCD is significantly better than BiSkip, which in turn is significantly better than BiCVM.
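The top-10 accuracy metric can be sketched as follows; the function name and matrix layout are ours for illustration.

```python
import numpy as np

def topk_accuracy(E_src, E_tgt, gold, k=10):
    """Fraction of gold pairs (i, j) for which target word j is among the
    k nearest target-language neighbors (by cosine) of source word i.
    E_src: (n_src, d) source embeddings; E_tgt: (n_tgt, d) target embeddings."""
    # normalize rows so a dot product equals cosine similarity
    S = E_src / np.linalg.norm(E_src, axis=1, keepdims=True)
    T = E_tgt / np.linalg.norm(E_tgt, axis=1, keepdims=True)
    sims = S @ T.T
    hits = 0
    for i, j in gold:
        topk = np.argsort(-sims[i])[:k]   # indices of k most similar target words
        hits += int(j in topk)
    return hits / len(gold)
```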

Cross-lingual Document Classification
We follow the cross-lingual document classification (CLDC) setup of Klementiev et al. (2012), but extend it to cover all of our language pairs. We use the RCV2 Reuters multilingual corpus (trec.nist.gov/data/reuters/reuters.html) for our experiments. In this task, for a language pair (l1, l2), a document classifier is trained using document representations derived from word embeddings in language l1, and the trained model is then tested on documents from language l2 (and vice versa). By using supervised training data in one language and evaluating without further supervision in another, CLDC assesses whether the learned cross-lingual representations are semantically coherent across languages. All embeddings are learned on the data described in §3, and we use the RCV2 data only to learn the document classification models. Following previous work, we compute a document representation by taking the tf-idf weighted average of the vectors of the words present in it; tf-idf (Salton and Buckley, 1988) is computed using all documents for that language in RCV2. A multi-class classifier is trained using an averaged perceptron (Freund and Schapire, 1999) for 10 iterations, using the document vectors of language l1 as features (we use the implementation of Klementiev et al. (2012)). The majority baselines for en → l2 and l1 → en are 49.7% and 46.7% respectively, for all languages. Table 5 shows the performance of the different models across language pairs. We computed confidence values using the McNemar test (McNemar, 1947) and declare a significant improvement if p < 0.05. Table 5 shows that in almost all cases, BiSkip performs significantly better than the remaining models. For transferring semantic knowledge across languages via embeddings, sentence and word level alignment together prove superior to sentence or word level alignment alone.
This observation is consistent with the trend in cross-lingual dictionary induction, where the most expensive form of supervision also performed best.
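The tf-idf weighted document representation used in the CLDC setup can be sketched as follows; the function name and the handling of out-of-vocabulary words are our assumptions.

```python
from collections import Counter

def document_vector(doc_tokens, vectors, idf):
    """tf-idf weighted average of the embeddings of words in a document.
    `vectors` maps word -> embedding (list of floats); `idf` maps
    word -> inverse document frequency. Words missing from either
    mapping are skipped."""
    tf = Counter(doc_tokens)
    weighted_sum, total_weight = None, 0.0
    for word, freq in tf.items():
        if word in vectors and word in idf:
            w = freq * idf[word]                       # tf-idf weight
            contrib = [w * x for x in vectors[word]]
            weighted_sum = (contrib if weighted_sum is None
                            else [a + b for a, b in zip(weighted_sum, contrib)])
            total_weight += w
    if weighted_sum is None:
        return None                                    # no known words in document
    return [x / total_weight for x in weighted_sum]
```

The resulting fixed-length vector is what the averaged perceptron consumes as its feature representation.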

Cross-lingual Dependency Parsing
Using cross-lingual similarity for direct transfer of dependency parsers was first shown by Täckström et al. (2012). The idea behind direct transfer is to train a dependency parsing model using embeddings for language l1 and then test the trained model on language l2, replacing the embeddings for language l1 with those of l2. The transfer relies on the coherence of the embeddings across languages arising from the cross-lingual training. For our experiments, we use the cross-lingual transfer setup of Guo et al. (2015).10 Their framework trains a transition-based dependency parser with a nonlinear activation function, using the source-side embeddings as lexical features. These embeddings can be replaced by target-side embeddings at test time.
All models are trained for 5000 iterations with fixed word embeddings during training. Since our goal is to determine the utility of word embeddings in dependency parsing, we turn off other features that can capture distributional information like brown clusters, which were originally used in Guo et al. (2015). We use the universal dependency treebank (McDonald et al., 2013) version-2.0 for our evaluation. For Chinese, we use the treebank released as part of the CoNLL-X shared task (Buchholz and Marsi, 2006).
We first evaluate how useful the word embeddings are in cross-lingual model transfer of dependency parsers (Table 6). On average, BiCCA does better than the other models. BiSkip is a close second, with an average performance gap of less than 1 point. BiSkip outperforms BiCVM on German and French (over 2 points improvement), owing to the word alignment information BiSkip's model uses during training. It is not surprising that the English-Chinese transfer scores are low, due to the significant difference in the syntactic structure of the two languages. Surprisingly, unlike the semantic tasks considered earlier, models with expensive supervision requirements like BiSkip and BiCVM could not outperform the cheaply supervised BiCCA.

10 github.com/jiangfeng1124/acl15-clnndep

Table 6: Labeled attachment score (LAS) for cross-lingual dependency parsing when trained on language l1 and evaluated on language l2. The best score for each language is shown in bold.

We also evaluate whether using cross-lingually trained vectors for learning dependency parsers is better than using monolingually trained vectors (Table 7). We compare against parsing models trained using monolingually trained word vectors (column marked Mono in Table 7); these vectors are the same ones used as input to the BiCCA model, and all other settings remain the same. On average across language pairs, improvement over the monolingual embeddings was obtained with the BiSkip and BiCCA models, while BiCVM and BiVCD consistently performed worse. A possible reason for this is that BiCVM and BiVCD operate on sentence-level contexts to learn the embeddings, which only captures the semantic meaning of the sentences and ignores their internal syntactic structure. As a result, embeddings trained using BiCVM and BiVCD are not informative for syntactic tasks. On the other hand, BiSkip and BiCCA both utilize word alignment information to train their embeddings and thus do better at capturing some notion of syntax. [...] English, they will also be present together in its French translation. However, BiCCA uses a bilingual dictionary and BiVCD uses comparable sentence contexts, which helps in pulling apart the synonyms and antonyms.

Discussion
The goal of this paper was to formulate the task of learning cross-lingual word vector representations in a unified framework, and to conduct experiments comparing the performance of existing models in an unbiased manner. We chose existing cross-lingual word vector models that can be trained on two languages at a given time. In recent work, Ammar et al. (2016) train multilingual word vectors using more than two languages; our comparison does not cover this setting. It is also worth noting that we compare different cross-lingual word embeddings here, which are not to be confused with a collection of monolingual word embeddings trained for different languages individually (Al-Rfou et al., 2013). The paper does not cover all approaches that generate cross-lingual word embeddings. Some methods do not have publicly available code (Coulmance et al., 2015; Zou et al., 2013); for others, like BilBOWA (Gouws et al., 2015), we identified problems in the available code, which caused it to consistently produce results that are inferior even to monolingually trained vectors.11 However, the models that we included for comparison are representative of other cross-lingual models in terms of the form of cross-lingual supervision they require. For example, BilBOWA and the cross-lingual auto-encoder (Chandar et al., 2014) are similar to BiCVM in this respect. Multi-view CCA (Rastogi et al., 2015) and deep CCA (Lu et al., 2015) can be viewed as extensions of BiCCA. Our choice of models was motivated by the aim of comparing different forms of supervision, and therefore adding these models would not provide additional insight.

Conclusion
We presented the first systematic comparative evaluation of cross-lingual embedding methods on several downstream NLP tasks, both intrinsic and extrinsic. We provided a unified representation for all approaches, showing them as instances of a general algorithm. Our choice of methods spans a diverse range of approaches, in that each requires a different form of supervision.
Our experiments reveal interesting trends. When evaluating on intrinsic tasks such as monolingual word similarity, models relying on cheaper forms of supervision (such as BiVCD) perform almost on par with models requiring expensive supervision. On the other hand, for cross-lingual semantic tasks, like cross-lingual document classification and dictionary induction, the model with the most informative supervision performs best overall. In contrast, for the syntactic task of dependency parsing, models that are supervised at the word alignment level perform slightly better. Overall, this suggests that semantic tasks can benefit more from richer cross-lingual supervision than syntactic tasks can.

11 We contacted the authors of the papers and were unable to resolve the issues in the toolkit.