Accurate Cross-lingual Projection between Count-based Word Vectors by Exploiting Translatable Context Pairs

We propose a method that learns a cross-lingual projection of word representations from one language into another. Our method utilizes translatable context pairs as bonus terms of the objective function. In the experiments, our method outperformed existing methods in three language pairs, (English, Spanish), (Japanese, Chinese) and (English, Japanese), without using any additional supervision.


Introduction
Vector-based representations of word meanings, hereafter word vectors, have been widely used in a variety of NLP applications including synonym detection, paraphrase detection (Erk and Padó, 2008), and dialogue analysis (Kalchbrenner and Blunsom, 2013). The basic idea behind those representation methods is the distributional hypothesis (Harris, 1954; Firth, 1957) that similar words are likely to co-occur with similar context words.
A problem with word vectors is that they are not meant to capture the similarity between words in different languages, i.e., translation pairs such as "gato" and "cat." The meaning representations of such word pairs are usually dissimilar, because the vast majority of the context words are from the same language as the target words (e.g., Spanish for "gato" and English for "cat"). This prevents the use of word vectors in multi-lingual applications such as cross-lingual information retrieval and machine translation.
Several approaches have been proposed so far to address this problem (Fung, 1998; Klementiev et al., 2012; Mikolov et al., 2013b). In particular, Mikolov et al. (2013b) recently explored learning a linear transformation between word vectors of different languages from a small amount of training data, i.e., a set of bilingual word pairs. This study explores incorporating prior knowledge about the correspondence between dimensions of word vectors to learn a more accurate transformation when using count-based word vectors. Since the dimensions of count-based word vectors are explicitly associated with context words, we can partially identify the cross-lingual correspondence between the dimensions of word vectors by reusing the training data. Also, word surface forms provide noisy yet useful clues about the correspondence when targeting language pairs that have exchanged vocabulary (e.g., "cocktail" in English and "cóctel" in Spanish). Although apparently useful, how to exploit such knowledge within the learning framework has not been addressed so far.
We evaluated the proposed method in three language pairs. Compared with baselines including a method that uses vectors learned by neural networks, our method gave better results.

Related Work
Neural networks (Mikolov et al., 2013a; Bengio et al., 2003) have recently gained much attention as a way of inducing word vectors. Although the scope of our study is currently limited to count-based word vectors, our experiment demonstrated that the proposed method performs significantly better than strong baselines including neural networks. This suggests that count-based word vectors have a great advantage when learning a cross-lingual projection. As future work, we are also interested in extending the method presented here to word vectors learned by neural networks.
There are also methods that directly induce meaning representations shared by different languages (Klementiev et al., 2012; Lauly et al., 2014; Xiao and Guo, 2014; Hermann and Blunsom, 2014; Faruqui and Dyer, 2014; Gouws and Søgaard, 2015), rather than learning a transformation between different languages (Fung, 1998; Mikolov et al., 2013b). However, the former approach is unable to handle words that do not appear in the training data, unlike the latter approach.

Learning cross-lingual projection
We begin by introducing the previous method of learning a linear transformation from word vectors in one language into another; the two languages are hereafter referred to as the source and target languages.
Suppose we have training data of n examples {(x_1, z_1), (x_2, z_2), ..., (x_n, z_n)}, where x_i is the count-based vector representation of a word in the source language (e.g., "gato"), and z_i is the word vector of its translation in the target language (e.g., "cat"). We then seek a translation matrix, W, such that W x_i approximates z_i, by solving the following optimization problem:

E(W) = Σ_{i=1}^{n} ||W x_i − z_i||² + λ||W||²    (1)

The second term is the L2 regularizer. Although the regularization term does not appear in the original formalization (Mikolov et al., 2013b), we take this as the starting point of our investigation because the regularizer prevents over-fitting and generally helps learn better models.
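The objective in Eq. 1 is an L2-regularized least-squares (ridge regression) problem with a closed-form solution. The sketch below illustrates it with NumPy; the function name and the matrix layout (one training pair per column) are our own choices, not the paper's.

```python
import numpy as np

def learn_translation_matrix(X, Z, lam=1.0):
    """Solve min_W sum_i ||W x_i - z_i||^2 + lam * ||W||^2 in closed form.

    X: (d_src, n) source-language word vectors, one column per training pair.
    Z: (d_tgt, n) target-language word vectors of their translations.
    Returns W of shape (d_tgt, d_src) such that W @ x_i approximates z_i.
    """
    d_src = X.shape[0]
    # Normal equations of the ridge problem: W (X X^T + lam I) = Z X^T
    return Z @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(d_src))
```

With a small λ, the learned matrix reproduces the training translations almost exactly; larger λ trades fit for robustness to noise.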

Exploiting translatable context pairs
Within the learning framework above, we propose exploiting the fact that the dimensions of count-based word vectors are associated with context words, and some dimensions in the source language are translations of those in the target language. For illustration purposes, consider count-based word vectors of Spanish and English. The Spanish word vectors would have dimensions associated with context words such as "amigo," "comer," and "importante," while the dimensions of the English word vectors are associated with "eat," "run," "small," "importance," and so on. Since, for example, "friend" is an English translation of "amigo," the Spanish dimension associated with "amigo" is likely to be mapped to the English dimension associated with "friend." Such knowledge about the cross-lingual correspondence between dimensions is considered beneficial for learning an accurate translation matrix.
We take two approaches to obtaining such correspondences. Firstly, since we have already assumed that a small amount of training data is available for training the translation matrix, it can also be used to find correspondences between dimensions (referred to as D_train). Note that some words in a language naturally have multiple translations in another language. Thus, for example, D_train may include ("amigo", "friend"), ("amigo", "fan"), and ("amigo", "supporter").
Secondly, since languages have evolved over the years while often deriving or borrowing words (or concepts) from other languages, such words often have similar or even identical spellings. We take advantage of this to find correspondences between dimensions. Specifically, we define a function DIST(r, s) that measures surface-level similarity, and regard all context word pairs (r, s) whose distance is smaller than a threshold 1 as translatable ones (referred to as D_sim).
DIST(r, s) = Levenshtein(r, s) / min(len(r), len(s))

where Levenshtein(r, s) is the Levenshtein distance between the two words, and len(r) is the length of word r.
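As a minimal sketch (function names are ours), the normalized distance can be computed with a standard dynamic-programming edit distance:

```python
def levenshtein(r, s):
    """Edit distance between strings r and s via dynamic programming,
    keeping only the previous row of the DP table."""
    prev = list(range(len(s) + 1))
    for i, cr in enumerate(r, 1):
        curr = [i]
        for j, cs in enumerate(s, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (cr != cs)))   # substitution
        prev = curr
    return prev[-1]

def dist(r, s):
    """Length-normalized edit distance used to detect translatable
    context pairs; pairs below a chosen threshold go into D_sim."""
    return levenshtein(r, s) / min(len(r), len(s))
```

Normalizing by the shorter word's length keeps the score comparable across word lengths, so a single threshold can be applied to the whole vocabulary.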

New objective function
We incorporate the knowledge about the correspondence between the dimensions into the learning framework. Since the correspondences obtained by the methods presented above can be noisy, we want to treat them as a soft constraint. This consideration leads us to develop the following new objective function:

E(W) = Σ_{i=1}^{n} ||W x_i − z_i||² + λ||W||² − β_train Σ_{(j,k)∈D_train} w_jk − β_sim Σ_{(j,k)∈D_sim} w_jk    (2)

The third and fourth terms are newly added to guide the learning process to strengthen w_jk when the k-th dimension in the source language corresponds to the j-th dimension in the target language. D_train and D_sim are the sets of dimension pairs found by the two methods. β_train and β_sim are parameters representing the strength of the new terms, and are tuned on held-out development data.

Optimization
We use the Pegasos algorithm (Shalev-Shwartz et al., 2011), an instance of stochastic gradient descent (Bottou, 2004), to optimize the new objective. Given the τ-th training sample (x_τ, z_τ), we update the translation matrix W as follows:

W ← W − η_τ ∇E_τ(W)

where η_τ is the learning rate, set to η_τ = 1/(λτ), and ∇E_τ(W) is the gradient calculated from the τ-th sample (x_τ, z_τ):

∇E_τ(W) = 2(W x_τ − z_τ) x_τ^T + 2λW − β_train A − β_sim B

A and B are the gradients corresponding to the two new terms: A is a matrix in which a_jk = 1 if (j, k) ∈ D_train and 0 otherwise, and B is defined similarly for D_sim.
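A single Pegasos-style step can be sketched as below. This is our own reconstruction under the assumptions stated in the text (squared-error loss, learning rate 1/(λτ), and constant bonus gradients A and B); the exact constant factors are assumptions, not the paper's code.

```python
import numpy as np

def pegasos_update(W, x, z, t, lam, beta_train, beta_sim, A, B):
    """One stochastic update of the translation matrix W.

    W: (d_tgt, d_src) current translation matrix.
    x: (d_src,) source word vector; z: (d_tgt,) its translation's vector.
    t: 1-based iteration counter; A, B: 0/1 indicator matrices marking
    dimension pairs in D_train and D_sim respectively.
    """
    eta = 1.0 / (lam * t)                       # learning rate eta_t = 1/(lam * t)
    residual = W @ x - z                        # prediction error on this sample
    grad = (2.0 * np.outer(residual, x)         # gradient of ||W x - z||^2
            + 2.0 * lam * W                     # gradient of the L2 regularizer
            - beta_train * A - beta_sim * B)    # bonus terms push marked w_jk up
    return W - eta * grad
```

The bonus terms act even when a sample carries no signal for a given cell: any entry marked in A or B is nudged upward at every step, encoding the soft constraint.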

Experiments
We evaluate our method on translation among word vectors in four languages: English (En), Spanish (Es), Japanese (Jp) and Chinese (Cn). We chose three language pairs, (En, Es), (Jp, Cn) and (En, Jp), so that we can examine the impact of each type of translatable context pair integrated into the learning objective.

Setup
First, we prepared source texts in the four languages from Wikipedia 2 dumps. We extracted plain text from the XML dumps using wp2txt. 3 Since words are not delimited by spaces in Japanese and Chinese, we used MeCab 4 and the Stanford Word Segmenter 5 to tokenize the text. Since inflection occurs in English, Spanish, and Japanese, we used the Stanford POS tagger, 6 Pattern, 7 and MeCab to lemmatize the text.
Next, we induced count-based word vectors from the obtained text. We considered context windows of five words on both sides of the target word, and excluded function words from the extracted context words. Since the count vectors are very high-dimensional and sparse, we selected the top-10k most frequent words as context words (in other words, as the dimensions of the word vectors). We converted the counts into positive pointwise mutual information (Church and Hanks, 1990) and normalized the resulting vectors to remove the bias introduced by differences in word frequency.
Then, we compiled a seed bilingual dictionary (a set of bilingual word pairs) for each language pair that is used to learn and evaluate the translation matrix. We utilized cross-lingual synsets in the Open Multilingual Wordnet 8 to obtain bilingual pairs.
Since our method aims to be used in expanding bilingual dictionaries, we designed datasets assuming such a situation. Considering that more frequent words are likely to be registered in a dictionary, we sorted the words in the source language by frequency and used the top-11k words and their translations in the target language as training/development data, and the subsequent 1k words and their translations as test data.
We have compared our method with the following three methods: Baseline learns a translation matrix using Eq. 1 for the same count-based word vectors as the proposed method. Comparison between the proposed method and this method reveals the impact of incorporating the cross-lingual correspondences between dimensions.
CBOW learns a translation matrix using Eq. 1 for word vectors learned by a neural network (specifically, continuous bag-of-words (CBOW)) (Mikolov et al., 2013b). Comparison between this method and the above baseline reveals the impact of the vector representation. Note that the CBOW-based word vectors take rare context words as well as the top-10k frequent words into account. We used word2vec 9 to obtain the vectors for each language. 10 Since Mikolov et al. (2013b) reported that accurate translations can be obtained when the vectors in the source language are 2-4x larger than those in the target language, we prepared m-dimensional (m = 100, 200, 300) vectors for the target language and n-dimensional (n = 2m, 3m, 4m) vectors for the source language, and optimized their combination on the development data.
Direct Mapping uses the bilingual pairs in the training data to map each dimension of a word vector in the source language to the corresponding dimension in the target language (Fung, 1998). To deal with words that have more than one translation, we weighted each translation by the reciprocal rank of its frequency among the translations in the target language, as in (Prochasson et al., 2009).
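The reciprocal-rank weighting for multi-translation words can be sketched as below; the function name and data layout are our own illustrative choices.

```python
def direct_map_weights(translations, tgt_freq):
    """Weight each target-language translation of one source context word
    by the reciprocal of its frequency rank (rank 1 -> 1.0, rank 2 -> 0.5, ...).

    translations: list of target-language translations of one source word.
    tgt_freq: dict mapping target word -> corpus frequency.
    """
    ranked = sorted(translations, key=lambda w: -tgt_freq.get(w, 0))
    return {w: 1.0 / (rank + 1) for rank, w in enumerate(ranked)}
```

For example, if "amigo" translates to "friend", "fan", and "supporter" in decreasing order of frequency, the mapped dimensions receive weights 1, 1/2, and 1/3.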
Note that all methods, including the proposed method, use the same amount of supervision (training data) and are therefore directly comparable with each other.
Evaluation procedure For each word vector in the source language, we translate it into the target language and evaluate the quality of the translation as in (Mikolov et al., 2013b): i) measure the cosine similarity between the resulting word vector and all the vectors in the test data (in the target language); ii) choose the top-n (n = 1, 5) word vectors with the highest similarity to the resulting vector; and iii) examine whether the chosen vectors include the correct one.

Table 1 shows the results of translation between word vectors in each language pair. Proposed significantly improved the translation quality over Baseline, and performed the best among all of the methods. Although the use of CBOW-based word vectors (CBOW) also improved the translation quality over Baseline, the performance gain is smaller than that obtained by our new objective. Proposed w/o surface uses only the training data to find translatable context pairs, by setting β_sim = 0. Thus, its advantage over Direct Mapping confirms the importance of learning a translation matrix. In addition, the greater advantage of Proposed over Proposed w/o surface in the translation between (En, Es) and (Jp, Cn) conforms to our expectation that surface-level similarity is more useful for language pairs which have often exchanged vocabulary.

Figure 1 shows P@1 (Es → En) plotted against the size of the training data. Remember that the training data is not only used to learn a translation matrix in the methods other than Direct Mapping, but is also used to map dimensions in Direct Mapping and the proposed methods. Proposed performs the best among all methods regardless of the size of the training data.
Comparison between Direct Mapping and Proposed w/o surface reveals that learning a translation matrix is not always effective when the training data is small, since it may suffer from over-fitting (the translation matrix has too many parameters for the size of the training data). We can also see that surface-level similarity is especially beneficial when the training data is small.
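The evaluation procedure described above (cosine similarity against all test vectors, then top-n retrieval) can be sketched as a P@n computation; the function name and the column-per-word layout are our assumptions.

```python
import numpy as np

def precision_at_n(W, X_test, Z_test, gold, n=1):
    """P@n for translation-matrix evaluation: project each source test
    vector, rank all target test vectors by cosine similarity, and check
    whether the gold translation's index appears in the top n.

    X_test: (d_src, m) source vectors; Z_test: (d_tgt, m) target vectors.
    gold: gold[i] is the column index in Z_test of the correct translation.
    """
    proj = W @ X_test                                        # (d_tgt, m)
    proj = proj / np.linalg.norm(proj, axis=0, keepdims=True)
    Z = Z_test / np.linalg.norm(Z_test, axis=0, keepdims=True)
    sims = Z.T @ proj                      # cosine sims: targets x sources
    hits = 0
    for i in range(X_test.shape[1]):
        top = np.argsort(-sims[:, i])[:n]  # indices of the n nearest targets
        hits += int(gold[i] in top)
    return hits / X_test.shape[1]
```

Ranking only against the test-set vectors (rather than the full vocabulary) follows the evaluation setting described above.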

Conclusion
We have proposed the use of prior knowledge for accurately translating word vectors. We have specifically exploited two types of translatable context pairs, taken from the training data and guessed from surface-level similarity, to design a new objective function for learning the translation matrix. Experimental results confirmed that our method significantly improved the translation among word vectors in four languages, and the advantage was greater than that obtained by using word vectors learned by a neural network.