Non-Linear Instance-Based Cross-Lingual Mapping for Non-Isomorphic Embedding Spaces

We present InstaMap, an instance-based method for learning projection-based cross-lingual word embeddings. Unlike prior work, InstaMap departs from learning a single global linear projection. It is a non-parametric model that learns a non-linear projection by iteratively (1) finding a globally optimal rotation of the source embedding space with the Kabsch algorithm, and then (2) moving each point along an instance-specific translation vector estimated from the translation vectors of the point's nearest neighbours in the training dictionary. We report performance gains with InstaMap over four representative state-of-the-art projection-based models on bilingual lexicon induction across a set of 28 diverse language pairs. Improvements are most prominent for distant language pairs, i.e., languages with non-isomorphic monolingual spaces.


Introduction and Motivation
Induction of cross-lingual word embeddings (CLWEs) (Vulić et al., 2011; Mikolov et al., 2013; Xing et al., 2015; Smith et al., 2017; Artetxe et al., 2018) has been one of the key mechanisms for enabling multilingual modeling of meaning and facilitating cross-lingual transfer for downstream NLP tasks. Even though CLWEs have recently been challenged in cross-lingual downstream transfer by pretrained multilingual language models (Pires et al., 2019; Wu and Dredze, 2019; Wu et al., 2020), they remain paramount for word-level translation, that is, bilingual lexicon induction (BLI).
Despite some recent evidence that joint CLWE induction may lead to better bilingual spaces (Ormazabal et al., 2019), projection-based methods still dominate the field (Hoshen and Wolf, 2018; Nakashole, 2018; Grave et al., 2019; Zhang et al., 2019, inter alia) due to their conceptual attractiveness: they operate on top of vectors produced with any embedding model and need at most a few thousand word pairs of supervision.
Most projection-based CLWE models induce bilingual spaces by orthogonally projecting one monolingual space onto another. Since orthogonal projections do not affect the topology of the source space, the performance of these methods is bound by the degree of isomorphism between the two monolingual spaces. Yet, evidence suggests that monolingual spaces, especially those of etymologically and typologically distant languages, are far from isomorphic (Søgaard et al., 2018; Patra et al., 2019). What is more, unsupervised CLWE models (Conneau et al., 2018; Artetxe et al., 2018; Alvarez-Melis and Jaakkola, 2018; Hoshen and Wolf, 2018, inter alia), which additionally exploit the isomorphism assumption when inducing initial translation dictionaries, have been shown to yield near-zero BLI results for pairs of distant languages (Søgaard et al., 2018). Given these theoretical limits on the effectiveness of orthogonal mapping between non-isomorphic spaces, Joulin et al. (2018) and Patra et al. (2019) relax the orthogonality constraint and report BLI improvements. These models, however, still learn only a linear transformation, i.e., an oblique projection matrix. While oblique projections may scale or skew the source space, a strong topological similarity between the original space and its oblique projection still remains.
In this work, we deviate from learning a linear projection matrix (i.e., a parametric model) and propose a non-parametric model which translates vectors by estimating instance-specific geometric translations. Our method, INSTAMAP, iteratively (1) applies the Kabsch algorithm (Horn, 1987) on the full training dictionary to learn a globally optimal rotation of the source space w.r.t. the target space; and then (2) translates each point along the instance-specific translation vector, which we compute from the translation vectors of the point's nearest neighbours from the training dictionary.
We extensively evaluate INSTAMAP on the benchmark BLI dataset of Glavaš et al. (2019), encompassing 28 diverse language pairs. Our results show that the non-linear mappings of INSTAMAP are substantially more robust than linear projections, both orthogonal (Smith et al., 2017; Artetxe et al., 2018) and oblique (Joulin et al., 2018; Patra et al., 2019). We also show that, unlike INSTAMAP, the oblique projection models RCSLS (Joulin et al., 2018) and BLISS (Patra et al., 2019) cannot surpass the performance of the best-performing orthogonal projection model, VecMap (Artetxe et al., 2018), for distant languages (i.e., for a low degree of isomorphism). Finally, we report additional significant gains by applying INSTAMAP on top of VecMap.

Instance-Based Mapping
The core idea of INSTAMAP is illustrated in Figure 1. We iteratively (1) use the entire training dictionary to learn a single global rotation matrix and then (2) perform an instance-based computation of translation vectors.

Globally Optimal Rotation
Let X and Y be the monolingual embedding spaces of the source and target language, respectively, and let $D = \{(w^i_{L1}, w^i_{L2})\}_{i=1}^{N}$ be the training dictionary. We first transform each of the two spaces by (independently) performing a full PCA transformation (i.e., with no dimensionality reduction): this way we represent vectors in each of the spaces as combinations of linearly uncorrelated principal components of that space, which facilitates learning the optimal rotation between the spaces. Let $X_D \subset X$ and $Y_D \subset Y$ be the dictionary-aligned subsets of the two monolingual spaces. We aim to learn the optimal rotation matrix between X and Y, i.e., the matrix $W_R$ that minimizes the sum of squared distances between the projected source vectors and the corresponding target vectors, $W_R = \arg\min_W \|X_D W - Y_D\|^2_F$. If we constrain $W_R$ merely to be orthogonal, the optimal solution is obtained by solving the Procrustes problem (Schönemann, 1966), adopted by most projection-based CLWE models (Smith et al., 2017; Conneau et al., 2018; Artetxe et al., 2018). The Procrustes solution, however, admits reflections as well as rotations; since we aim to learn only the optimal proper rotation between the spaces, we use the Kabsch algorithm (Horn, 1987), which computes the optimal rotation matrix $W_R$ as follows:

$$W_R = U I_R V^{\top}, \quad \text{with} \quad U \Sigma V^{\top} = \mathrm{SVD}\left(X_D^{\top} Y_D\right),$$

where $I_R$ is a modification of the identity matrix in which the last element (i.e., last row, last column) is not 1, but rather the determinant of $VU^{\top}$; this ensures $\det(W_R) = 1$, i.e., a proper rotation. Upon obtaining $W_R$, we rotate X w.r.t. Y: $X_R = X W_R$.
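For concreteness, the rotation step can be sketched in a few lines of numpy. This is an illustration under the definitions above, not the authors' implementation; the function and variable names are ours:

```python
import numpy as np

def kabsch_rotation(X_d: np.ndarray, Y_d: np.ndarray) -> np.ndarray:
    """Optimal rotation W_R minimizing ||X_d W - Y_d||_F, for (N, d)
    dictionary-aligned source/target matrices (assumed to be already
    PCA-transformed, as described above)."""
    U, _, Vt = np.linalg.svd(X_d.T @ Y_d)  # SVD of the cross-covariance
    # I_R: identity with its last diagonal entry set to det(V U^T),
    # which rules out reflections and guarantees det(W_R) = +1.
    I_r = np.eye(U.shape[0])
    I_r[-1, -1] = np.linalg.det(Vt.T @ U.T)
    return U @ I_r @ Vt

# Rotating the full (PCA-transformed) source space:
# W_R = kabsch_rotation(X_dict, Y_dict)
# X_rot = X @ W_R
```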

Instance-Specific Translations
We then perform localized, instance-specific translations in the rotationally aligned bilingual space. For each point from both $X_R$ and Y, we compute a "personalized" translation vector as the weighted average of the translation vectors of its closest dictionary entries. That is, for some vector $\mathbf{x} \in X_R$, let $\mathbf{x}_1, \ldots, \mathbf{x}_K$ be the K vectors from $X_D W_R$ (corresponding to words $w^1_{L1}, w^2_{L1}, \ldots, w^K_{L1}$ in D) that are closest to $\mathbf{x}$ in terms of cosine similarity, and let $\mathbf{y}_1, \mathbf{y}_2, \ldots, \mathbf{y}_K$ be the vectors of the corresponding dictionary translations $w^1_{L2}, w^2_{L2}, \ldots, w^K_{L2}$ from D in the target-language space. We then compute the instance-based translation of $\mathbf{x}$, denoted $\mathbf{x}'$, as follows:

$$\mathbf{x}' = \mathbf{x} + \frac{1}{2} \cdot \frac{\sum_{k=1}^{K} \cos(\mathbf{x}, \mathbf{x}_k)\,(\mathbf{y}_k - \mathbf{x}_k)}{\sum_{k=1}^{K} \cos(\mathbf{x}, \mathbf{x}_k)}.$$

We perform an instance-specific translation of the vectors from Y analogously, with each space moving halfway toward the other so that the translated points meet in a shared bilingual space. Let $\mathbf{y}_1, \ldots, \mathbf{y}_K$ be the set of vectors from $Y_D$ that are closest to some vector $\mathbf{y} \in Y$. The translation $\mathbf{y}'$ is then:

$$\mathbf{y}' = \mathbf{y} + \frac{1}{2} \cdot \frac{\sum_{k=1}^{K} \cos(\mathbf{y}, \mathbf{y}_k)\,(\mathbf{x}_k - \mathbf{y}_k)}{\sum_{k=1}^{K} \cos(\mathbf{y}, \mathbf{y}_k)}.$$

Because we compute a different translation vector for each point in both vector spaces, the final mapping function between the two spaces is globally non-linear. Moreover, being based on the K nearest neighbours in the training dictionary D, INSTAMAP is, unlike other projection-based CLWE models, a non-parametric model: the number of its parameters is not fixed, but depends on the number of entries in the training dictionary D.
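The source-side translation step can be sketched as follows. This is a minimal numpy sketch with names of our choosing; the ½ factor reflects the reading above that each space moves halfway toward the other:

```python
import numpy as np

def instance_translate(X: np.ndarray, X_d: np.ndarray, Y_d: np.ndarray,
                       K: int = 70) -> np.ndarray:
    """Translate every vector in X (the rotated source space, (V, d))
    along the cosine-weighted average of the translation vectors of
    its K nearest entries among the aligned dictionary pairs X_d, Y_d
    (both (N, d), N > K)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    Dn = X_d / np.linalg.norm(X_d, axis=1, keepdims=True)
    sims = Xn @ Dn.T                                # (V, N) cosine sims
    knn = np.argpartition(-sims, K, axis=1)[:, :K]  # K nearest dict entries
    w = np.take_along_axis(sims, knn, axis=1)       # (V, K) weights
    trans = Y_d[knn] - X_d[knn]                     # (V, K, d) translations
    avg = (w[..., None] * trans).sum(axis=1) / w.sum(axis=1, keepdims=True)
    return X + 0.5 * avg                            # move halfway toward Y
```

The target side is handled symmetrically, swapping the roles of the source and target dictionary vectors.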

Training Dictionary Expansion
We repeat the two steps, global rotation and instance-based translation, aiming for an iterative refinement of the non-linear mapping between the two spaces. Following the established practice of other iterative models (Conneau et al., 2018; Artetxe et al., 2018), we augment the training dictionary for the next iteration with the mutual nearest neighbours in the bilingual space induced in the previous iteration. Intuitively, with INSTAMAP being a non-parametric model, we expect it to benefit more from the dictionary augmentation than the parametric projection models, which have been shown to saturate in performance when training dictionaries exceed 5K-10K translation pairs (Vulić and Korhonen, 2016).
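Putting the pieces together, one iteration with dictionary expansion might look as follows. This is a brute-force sketch reusing the hypothetical helpers from the previous sketches; a practical implementation would restrict or batch the vocabularies considered:

```python
import numpy as np

def mutual_nn_pairs(X_b: np.ndarray, Y_b: np.ndarray):
    """Index pairs (i, j) of words that are each other's cross-lingual
    nearest neighbour in the current bilingual space."""
    Xn = X_b / np.linalg.norm(X_b, axis=1, keepdims=True)
    Yn = Y_b / np.linalg.norm(Y_b, axis=1, keepdims=True)
    sims = Xn @ Yn.T
    x2y, y2x = sims.argmax(axis=1), sims.argmax(axis=0)
    return [(i, j) for i, j in enumerate(x2y) if y2x[j] == i]

# for t in range(T):
#     W_R = kabsch_rotation(X[src_idx], Y[tgt_idx])   # step (1): rotation
#     X = instance_translate(X @ W_R, ...)            # step (2): translation
#     Y = instance_translate(Y, ...)                  # (target side, analogous)
#     new_pairs = mutual_nn_pairs(X, Y)               # augment D for t + 1
```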

Evaluation
We evaluate INSTAMAP on bilingual lexicon induction, the standard task for evaluating CLWEs.

Experimental Setup
Data. We evaluate on the BLI benchmark dataset introduced by Glavaš et al. (2019), encompassing 28 diverse language pairs.
Since INSTAMAP only rotates and translates vectors, applying it on top of VecMap (the composition IM ∘ VM) keeps the resulting vectors faithful to the respective original monolingual spaces; this holds promise of no undesirable side-effects originating from the composition. INSTAMAP has only two hyperparameters: the number of nearest neighbours K from D and the number of algorithm iterations T. Via fixed-split cross-validation on the training dictionaries, we identified the configuration K = 70, T = 4 as the best for most language pairs.

Results
We show BLI performance (P@1), aggregated over several different sets of language pairs, in Table 1. Overall, INSTAMAP significantly outperforms all competing models. Somewhat surprisingly, VecMap, which induces an orthogonal projection (i.e., relies more strongly on the assumption of isomorphism), significantly outperforms RCSLS and BLISS, the models that relax the orthogonality constraint and induce oblique linear projections. Only INSTAMAP, by removing the constraint of a single global linear projection altogether and inducing a non-linear mapping, consistently yields improvements over the orthogonal projection (VecMap). What is more, the IM ∘ VM composition yields even larger performance gains.

Analysis of results across different groups of language pairs identifies INSTAMAP as particularly beneficial for pairs of distant languages (setups No-EN and HARD) and languages with the least reliable monolingual vectors (TR, HR). For example, while INSTAMAP alone and IM ∘ VM yield gains of 0.9 and 2.6 points, respectively, w.r.t. VecMap across ALL language pairs, these gaps widen to 1.5 and 3.5 points on the most challenging language pairs (HARD). In contrast, BLISS, a model specifically tailored to improve mappings between non-isomorphic spaces, appears to be robust only on pairs of close languages (e.g., HR-RU) and pairs involving EN (setup EN-*). It exhibits barely any improvement over the baseline orthogonal projection (PROC) on distant language pairs (HARD) and a significant degradation w.r.t. VecMap, the state-of-the-art model based on orthogonal projection. RCSLS is more robust than BLISS on difficult language pairs, but still performs worse than VecMap.
Further Analysis. We further analyze the performance of INSTAMAP (applied on top of VecMap) with respect to (1) the size of the training dictionary |D| and (2) the number of nearest dictionary neighbours K. We analyze the performance of IM ∘ VM for the three language pairs with the lowest BLI scores: DE-TR, TR-FI, and TR-HR. We prepare dictionaries with 2.5K to 12.5K entries (with a 2.5K step), following the dictionary preparation steps described in prior work. Figure 2 shows the performance for different training dictionary sizes. Adding INSTAMAP on top of VecMap yields stable improvements for all dictionary sizes. On the one hand, this shows that INSTAMAP is equally helpful for any number of available word translations. On the other hand, since INSTAMAP is not constrained to learning a single global projection, we hoped to see bigger gains for larger dictionaries, but this is not the case. With larger dictionaries, we are more likely to find more semantically similar dictionary neighbours for each word, which should lead to better performance. We speculate, however, that larger dictionaries also increase the likelihood of selecting spurious neighbours due to hubness (Dinu et al., 2015; Conneau et al., 2018) and that this cancels out the positive effect of having more candidates to choose the neighbours from. This could perhaps be remedied by using hubness-aware similarity scores like CSLS (Conneau et al., 2018) instead of simple cosine similarity (see the sketch at the end of this section).

Figure 3 illustrates how INSTAMAP performance (on top of VecMap, i.e., IM ∘ VM) varies with the number of dictionary neighbours K. The best performance is typically reached for values of K between 50 and 90, and there are no further improvements for larger values of K (TR-FI, where K = 130 gives the best score, is an exception). For very small K, the performance drops are substantial, and INSTAMAP even degrades the quality of the input space produced by VecMap. We believe this happens because INSTAMAP then has too few dictionary neighbours to accurately model the meaning of any given word and, in turn, compute a reliable translation vector.
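For reference, below is a minimal sketch of CSLS scoring (Conneau et al., 2018) that could replace plain cosine similarity in the neighbour search; this illustrates the published formula and is not part of INSTAMAP itself:

```python
import numpy as np

def csls(S: np.ndarray, k: int = 10) -> np.ndarray:
    """CSLS rescaling of a (Vx, Vy) cosine-similarity matrix S:
    csls(x, y) = 2 * cos(x, y) - r(x) - r(y), where r(.) is the mean
    similarity to the k nearest neighbours on the other side.
    Hub vectors, which are similar to everything, get penalized."""
    r_src = np.sort(S, axis=1)[:, -k:].mean(axis=1)   # (Vx,)
    r_tgt = np.sort(S, axis=0)[-k:, :].mean(axis=0)   # (Vy,)
    return 2 * S - r_src[:, None] - r_tgt[None, :]
```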

Conclusion
We have proposed INSTAMAP, a simple and effective approach for improving the post-hoc cross-lingual alignment of non-isomorphic monolingual embedding spaces. Unlike existing projection-based CLWE induction models, which learn a global linear projection matrix, INSTAMAP couples a global rotation with instance-specific translations, thereby learning a globally non-linear projection. Our experiments show (1) that INSTAMAP significantly outperforms four state-of-the-art projection-based CLWE models on a benchmark BLI dataset with 28 language pairs and (2) that it yields the largest improvements for pairs of distant languages with a lower degree of isomorphism between their respective monolingual spaces. We plan to extend this work in two directions. First, we will explore mechanisms for instance-specific translation that are more sophisticated than the aggregation of translation vectors of nearest dictionary neighbours. Second, we plan to couple instance-based mapping with other informative features (e.g., character-level features) in classification-based BLI frameworks (Heyman et al., 2017; Karan et al., 2020).