Characterizing Departures from Linearity in Word Translation

We investigate the behavior of maps learned by machine translation methods. The maps translate words by projecting between word embedding spaces of different languages. We locally approximate these maps using linear maps, and find that they vary across the word embedding space. This demonstrates that the underlying maps are non-linear. Importantly, we show that the locally linear maps vary by an amount that is tightly correlated with the distance between the neighborhoods on which they are trained. Our results can be used to test non-linear methods, and to drive the design of more accurate maps for word translation.


Introduction
Following the success of monolingual word embeddings (Collobert et al., 2011), a number of studies have recently explored multilingual word embeddings. The goal is to learn word vectors such that similar words have similar vector representations regardless of their language (Zou et al., 2013; Upadhyay et al., 2016). Multilingual word embeddings have applications in machine translation, and hold promise for cross-lingual model transfer in NLP tasks such as parsing or part-of-speech tagging.
A class of methods has emerged whose core technique is to learn linear maps between vector spaces of different languages (Mikolov et al., 2013a; Faruqui and Dyer, 2014; Vulic and Korhonen, 2016; Artetxe et al., 2016; Conneau et al., 2018). These methods work as follows: For a given pair of languages, first, monolingual word vectors are learned independently for each language, and second, under the assumption that word vector spaces exhibit comparable structure across languages, a linear mapping function is learned to connect the two monolingual vector spaces. The map can then be used to translate words between the language pair. Both seminal (Mikolov et al., 2013a) and state-of-the-art (Conneau et al., 2018) methods found linear maps to substantially outperform non-linear maps generated by feedforward neural networks. Advantages of linear maps include: 1) In settings with limited training data, accurate linear maps can still be learned (Conneau et al., 2018; Zhang et al., 2017; Artetxe et al., 2017; Smith et al., 2017). For example, in unsupervised learning, Conneau et al. (2018) found that using non-linear mapping functions made adversarial training unstable. 2) One can easily impose constraints on the linear maps at training time to ensure that the quality of the monolingual embeddings is preserved after mapping (Xing et al., 2015; Smith et al., 2017).
However, it is not well understood to what extent the assumption of linearity holds and how it affects performance. In this paper, we investigate the behavior of word translation maps, and show that there is clear evidence of departure from linearity.
Non-linear maps beyond those generated by feedforward neural networks have also been explored for this task (Lu et al., 2015; Shi et al., 2015; Wijaya et al., 2017). However, no attempt was made to characterize the resulting maps.
In this paper, we allow for an underlying mapping function that is non-linear, but assume that it can be approximated by linear maps at least in small enough neighborhoods. If the underlying map is linear, all local approximations should be identical, or, given the finite size of the training data, similar. In contrast, if the underlying map is non-linear, the locally linear approximations will depend on the neighborhood. Figure 1 illustrates the difference between the assumption of a single linear map, and our working hypothesis of locally linear approximations to a non-linear map. The variation of the linear approximations provides a characterization of the nonlinear map. We show that the local linear approximations vary across neighborhoods in the embedding space by an amount that is tightly correlated with the distance between the neighborhoods on which they are trained. The functional form of this variation can be used to test non-linear methods.

Review of Prior Work
To learn linear word translation maps, different loss functions have been proposed. The simplest is the regularized least squares loss, where the linear map M is learned as follows:

M = argmin_M ||MX − Y||_F + λ||M||,

where X and Y are matrices whose columns are the word embedding vectors of the source and target language, respectively (Mikolov et al., 2013a; Dinu et al., 2014; Vulic and Korhonen, 2016). The translation t of a source language word s is then given by:

t = argmax_t cos(Mx_s, y_t).

Xing et al. (2015) obtained improved results by imposing an orthogonality constraint on M, minimizing ||MM^T − I||, where I is the identity matrix. Another loss function used in prior work is the max-margin loss, which has been shown to significantly outperform the least squares loss (Lazaridou et al., 2015; Nakashole and Flauger, 2017).
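As a concrete sketch, the regularized least squares map above admits the standard ridge closed form M = YX^T(XX^T + λI)^{-1}. The snippet below is our own illustration of this setup (the function and variable names are not from the cited papers): it learns such a map and translates by nearest cosine neighbor.

```python
import numpy as np

def learn_linear_map(X, Y, lam=1e-3):
    """Regularized least squares map M ~ argmin ||MX - Y||_F + lam*||M||.
    X, Y: (d, m) matrices whose columns are source/target embeddings
    for m seed translation pairs. Uses the ridge closed form."""
    d = X.shape[0]
    return Y @ X.T @ np.linalg.inv(X @ X.T + lam * np.eye(d))

def translate(M, x_s, Y_vocab):
    """Index of the target word t maximizing cos(M x_s, y_t).
    Y_vocab: (d, V) matrix of all target-language word vectors."""
    z = M @ x_s
    sims = (Y_vocab.T @ z) / (
        np.linalg.norm(Y_vocab, axis=0) * np.linalg.norm(z) + 1e-12)
    return int(np.argmax(sims))
```

With a small λ and a truly linear relationship between the spaces, the learned M recovers the underlying map; the experiments in this paper probe what happens when that relationship is only locally linear.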
Unsupervised or limited supervision methods for learning word translation maps have recently been proposed (Barone, 2016;Conneau et al., 2018;Zhang et al., 2017;Artetxe et al., 2017;Smith et al., 2017). However, the underlying methods for learning the mapping function are similar to prior work (Xing et al., 2015).
Non-linear cross-lingual mapping methods have also been proposed. Wijaya et al. (2017) back off to a feed-forward neural network when dealing with rare words. Shi et al. (2015) model relations across languages. Lu et al. (2015) proposed a mapping method based on deep canonical correlation analysis. Work on phrase translation has explored the use of many local maps that are individually trained (Zhao et al., 2015). In contrast to our work, these prior papers do not attempt to characterize the behavior of the resulting maps.
Our hypothesis is similar in spirit to the use of locally linear embeddings for nonlinear dimensionality reduction (Roweis and Saul, 2000).

Neighborhoods in Word Vector Space
In order to study the behavior of word translation maps, we begin by introducing a simple notion of neighborhoods in the embedding space. For a given language (e.g., English, en), we define a neighborhood of a word as follows: First, we pick a word x_i, whose corresponding vector is x_i ∈ X_en, as an anchor. Second, we initialize a neighborhood N(x_i) containing the single vector x_i. We then grow the neighborhood by adding all words whose cosine similarity to x_i is ≥ s. The resulting neighborhood is defined as:

N(x_i, s) = {x_j : cos(x_i, x_j) ≥ s}.

Suppose we pick the word multivitamins as the anchor word. We can generate neighborhoods using N(x_multivitamins, s), where each value of s yields a different neighborhood. Neighborhoods corresponding to larger values of s are subsumed by those corresponding to smaller values of s. Figure 2 illustrates the process of generating neighborhoods around the word multivitamins. For larger values of s, the neighborhood contains only words closely related to the anchor, such as nutrition. As s gets smaller (e.g., s = 0.6), the neighborhood gets larger, and includes words that are less related to the anchor, such as antibiotic.
Using this simple method, we can define different-sized neighborhoods around any word in the vocabulary.
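The neighborhood construction can be sketched in a few lines, assuming embeddings are stored as a word-to-vector dictionary (the function name is our own):

```python
import numpy as np

def neighborhood(anchor_word, vectors, s):
    """N(x_i, s): all words whose cosine similarity to the anchor is >= s.
    vectors: dict mapping each word to its embedding (np.ndarray)."""
    x_i = vectors[anchor_word]
    x_i = x_i / np.linalg.norm(x_i)
    return {w for w, v in vectors.items()
            if np.dot(x_i, v / np.linalg.norm(v)) >= s}
```

Note the nesting property used in the text: since the same threshold test is applied pointwise, a neighborhood with a larger s is always a subset of one with a smaller s.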

Analysis of Map Behavior
Given the above definition of neighborhoods, we now seek to understand how word translation maps change as we move across neighborhoods in word embedding space.
Questions Studied. We study the following questions: [Q.1] Is there a single linear map for word translation that produces the same level of performance regardless of where in the vector space the words being translated fall? [Q.2] If there is no such single linear map, but instead multiple neighborhood-specific ones, is there a relationship between neighborhood-specific maps and the distances between their respective neighborhoods?

Experimental Setup and Data
In our first experiment, we translate from English (en) to German (de). We obtained pretrained word embeddings from FastText (Bojanowski et al., 2017). For this experiment, we followed common practice (Mikolov et al., 2013a; Ammar et al., 2016; Nakashole and Flauger, 2017; Vulic and Korhonen, 2016) and used the Google Translate API to obtain training and test data. We make our data available for reproducibility. For the second and third experiments, we repeat the first experiment, but instead of using Google Translate, we use the recently released Facebook AI Research dictionaries for train and test data. The last two experiments were performed on different language pairs: English (en) to Portuguese (pt), and English (en) to Swedish (sv).
In all our experiments, the cross-lingual maps are learned using the max-margin loss, which has been shown to perform competitively while having fast run-times (Lazaridou et al., 2015; Nakashole and Flauger, 2017). The max-margin loss aims to rank correct training data pairs (x_i, y_i) higher than incorrect pairs (x_i, y_j) with a margin of at least γ. The margin γ is a hyper-parameter, and the incorrect labels y_j can be selected randomly, such that j ≠ i, or in a more application-specific manner. In our experiments, we set γ = 0.4 and randomly selected negative examples, one negative example for each training data point.
Given a seed dictionary as training data of the form D_tr = {x_i, y_i}_{i=1}^m, the mapping function is learned by minimizing:

W = argmin_W Σ_{i=1}^m Σ_{j=1}^k max{0, γ + d(ŷ_i, y_i) − d(ŷ_i, y_j)},

where ŷ_i = Wx_i is the prediction, k is the number of incorrect examples per training instance, and d(x, y) = ||x − y||^2 is the distance measure.
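A minimal sketch of evaluating this loss, assuming one random negative per training pair (k = 1), as in our setup. This function (our own illustration) only computes the loss; training would minimize it over W, e.g., by gradient descent.

```python
import numpy as np

def max_margin_loss(W, X, Y, gamma=0.4, seed=0):
    """Max-margin loss with one random negative per training pair.
    Ranks correct pairs (x_i, y_i) above incorrect pairs (x_i, y_j),
    j != i, by a margin gamma, using d(x, y) = ||x - y||^2.
    X, Y: (d, m) matrices of source/target embeddings."""
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    loss = 0.0
    for i in range(m):
        y_hat = W @ X[:, i]                               # prediction
        j = rng.choice([k for k in range(m) if k != i])   # random negative
        d_pos = np.sum((y_hat - Y[:, i]) ** 2)
        d_neg = np.sum((y_hat - Y[:, j]) ** 2)
        loss += max(0.0, gamma + d_pos - d_neg)
    return loss / m
```

When W maps each x_i exactly onto y_i and the targets are well separated, every hinge term is zero; a poor W incurs a loss of up to γ per pair plus the distance shortfall.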
For the first experiment, we picked the following words as anchor words and obtained maps associated with each of their neighborhoods: M_multivitamins, M_antibiotic, M_disease, M_blowflies, M_dinosaur, M_orchids, M_copenhagen. For each anchor word, we set s = 0.5; thus the neighborhoods are N(x_i, 0.5), where x_i is the vector of the anchor word. The training data for learning each neighborhood-specific linear map consists of the vectors in N(x_i, 0.5) and their translations. Table 1 shows details of the training and test data for each neighborhood. The words shown in Table 1 were picked as follows: first we picked the word multivitamins, then we picked the other words to have varying degrees of similarity to it. The cosine similarities of these words to the word 'multivitamins' are shown in column 3 of Table 1. There is nothing special about these words; in fact, the last two experiments were carried out on a different set of words, and on different language pairs.

Map Similarity Analysis
If indeed there exists a map that is the same linear map everywhere, we expect the above neighborhood-specific maps to be similar. Our analysis makes use of the following definition of matrix similarity:

cos(M_1, M_2) = tr(M_1^T M_2) / sqrt(tr(M_1^T M_1) tr(M_2^T M_2)). (2)

Here tr(M) denotes the trace of the matrix M. tr(M_1^T M_1) computes the squared Frobenius norm ||M_1||^2, and tr(M_1^T M_2) is the Frobenius inner product. That is, cos(M_1, M_2) computes the cosine similarity between the vectorized versions of matrices M_1 and M_2.
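Equation 2 can be computed directly from the trace definitions (a small sketch; the function name is our own):

```python
import numpy as np

def matrix_cosine(M1, M2):
    """Cosine similarity between vectorized matrices:
    tr(M1^T M2) / sqrt(tr(M1^T M1) * tr(M2^T M2))."""
    num = np.trace(M1.T @ M2)                       # Frobenius inner product
    den = np.sqrt(np.trace(M1.T @ M1) * np.trace(M2.T @ M2))
    return num / den
```

The measure is scale-invariant: two maps that differ only by a positive scalar have similarity 1, while maps with orthogonal vectorizations have similarity 0.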

Experimental Results
The main results of our analysis are shown in Table 1.
We now analyze the results of Table 1 in detail. The 0th column contains the anchor word, x i , around which the neighborhood is formed. The 1st, and 2nd columns contain the size of the training and test data from N (x i , s = 0.5) where x i is the word vector for the anchor word.
The 3rd column contains the cosine similarity between x 0 , multivitamins, and x i . For example, x 1 (antibiotic) is the most similar to x 0 (0.6), and x 6 , copenhagen, is the least similar to x 0 (0.11).
The 4th column is the translation accuracy of a single global map, trained on the combined training data of all neighborhoods, and tested on the test data of the x_i neighborhoods. The 5th column is the translation accuracy of the map M_x_0, trained on the training data of x_0, and tested on the test data in x_i. We use precision at top-10 as a measure of translation accuracy. Going down this column, we can see that accuracy is highest on the test data from the neighborhood anchored at x_0 itself, followed by the neighborhoods anchored at x_1 and x_2: antibiotic and disease. Accuracy is lowest on the test data from the neighborhoods anchored at words further away from x_0, in particular x_3 to x_6: blowflies, dinosaur, orchids, copenhagen.
The 6th column is the translation accuracy of the map M_x_i, trained on the training data of the neighborhood anchored at x_i, and tested on the test data in x_i. We can see that, compared to the 5th column, in all cases performance is higher when we apply the map trained on data from the neighborhood itself, M_x_i, instead of M_x_0. The 7th column shows the difference in translation accuracy between the maps M_x_i and M_x_0. The more dissimilar the neighborhood anchor word x_i is from x_0 according to the cosine similarity shown in the 3rd column, the larger this difference is. The local maps M_x_i (6th column) in all cases also outperform the global map (4th column) in Table 1.
The 8th column shows the similarity between maps M_x_i and M_x_0 as computed by Equation 2. This column shows that the similarity between these learned maps is highly correlated with the cosine similarity, or distance, between the words in the 3rd column. We also see a correlation with the translation accuracy in the 5th column. This correlation is visualized in Figure 3. Finally, the 9th column shows the magnitudes of the maps. The magnitudes vary somewhat between the maps trained on the different neighborhoods, and are significantly different from the magnitude expected for an orthogonal matrix. (For an orthogonal 300 × 300 matrix O, the norm is ||O|| = √300 ≈ 17.)
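The stated norm for an orthogonal matrix follows from ||O||_F^2 = tr(O^T O) = tr(I) = n, and can be checked numerically (the QR-based construction of a random orthogonal matrix is our own illustration):

```python
import numpy as np

# Frobenius norm of an orthogonal 300 x 300 matrix equals sqrt(300) ~ 17.32,
# since ||O||_F^2 = tr(O^T O) = tr(I) = 300.
rng = np.random.default_rng(0)
O, _ = np.linalg.qr(rng.standard_normal((300, 300)))  # random orthogonal matrix
print(np.linalg.norm(O))
```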
In order to determine the generality of our results, we carried out the same experiment on different language pairs and different sets of neighborhoods, as shown in Table 2 and Table 3. Crucially, we see the same trends as those observed in Table 1. In particular, we observe the key trend that the maps vary by an amount that is tightly correlated with the distance between neighborhoods, as reflected in the 5th and 6th columns of Tables 2 and 3. This shows the generality of our findings in Table 1.
Experiments Summary. In summary, our experimental study suggests the following: i) linear maps vary across neighborhoods, implying that the assumption of a single linear map does not hold; ii) the difference between maps is tightly correlated with the distance between the neighborhoods.

Conclusions
In this paper, we provide evidence that the assumption of linearity made by a large body of current work on cross-lingual mapping for word translation does not hold. We locally approximate the underlying non-linear map using linear maps, and show that these maps vary across neighborhoods in vector space by an amount that is tightly correlated with the distance between the neighborhoods on which they are trained. These results can be used to test non-linear methods. We leave using the findings of this paper to design more accurate maps as future work.