LNMap: Departures from Isomorphic Assumption in Bilingual Lexicon Induction Through Non-Linear Mapping in Latent Space

Most successful and predominant methods for bilingual lexicon induction (BLI) are mapping-based, where a linear mapping function is learned under the assumption that the word embedding spaces of different languages exhibit similar geometric structures (i.e., are approximately isomorphic). However, several recent studies have criticized this simplified assumption, showing that it does not hold in general even for closely related languages. In this work, we propose a novel semi-supervised method to learn cross-lingual word embeddings for BLI. Our model is independent of the isomorphic assumption and uses non-linear mapping in the latent space of two independently trained autoencoders. Through extensive experiments on fifteen (15) different language pairs (in both directions) comprising resource-rich and low-resource languages from two different datasets, we demonstrate that our method outperforms existing models by a good margin. Ablation studies show the importance of different model components and the necessity of non-linear mapping.


Introduction
In recent years, a plethora of methods have been proposed to learn cross-lingual word embeddings (CLWE for short) from monolingual word embeddings. Here, words with similar meanings in different languages are represented by similar vectors, regardless of their actual language. CLWE enable us to compare the meaning of words across languages, which is key to most multilingual applications such as bilingual lexicon induction (Heyman et al., 2017), machine translation (Artetxe et al., 2018c), and multilingual information retrieval (Vulić and Moens, 2015). They also play a crucial role in cross-lingual knowledge transfer between languages (e.g., from resource-rich to low-resource languages) by providing a common representation space. Mikolov et al. (2013a), in their pioneering work, learn a linear mapping function to transform the source embedding space into the target one by minimizing the squared Euclidean distance between the translation pairs of a seed dictionary. They assume that the similarity of geometric arrangements in the embedding spaces is the key reason for their method's success, as they found linear mapping to be superior to non-linear mappings with multi-layer neural networks. Subsequent studies propose to improve the model by normalizing the embeddings, imposing an orthogonality constraint on the linear mapper, modifying the objective function, and reducing the seed dictionary size (Artetxe et al., 2016, 2017, 2018a; Smith et al., 2017).
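For concreteness, the Mikolov-style objective (a linear map minimizing the squared Euclidean distance over seed-dictionary pairs) can be sketched as follows. This is a toy NumPy illustration of the idea, not any published implementation, and `learn_linear_map` is our own name:

```python
import numpy as np

def learn_linear_map(X_src, Y_tgt):
    # Least-squares linear map W minimizing ||X W - Y||^2 over k seed
    # pairs: row i of X_src is the source embedding whose translation
    # has embedding row i of Y_tgt.
    W, *_ = np.linalg.lstsq(X_src, Y_tgt, rcond=None)
    return W  # shape (d, d)

# Toy check: if Y is an exact linear image of X, the map is recovered.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
W_true = rng.normal(size=(4, 4))
W_hat = learn_linear_map(X, X @ W_true)
```

With enough well-conditioned seed pairs, the least-squares solution recovers an exact linear relation; real embedding spaces, of course, are only approximately related.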
A more recent line of research attempts to eliminate the seed dictionary entirely and learn the mapping in a purely unsupervised way (Barone, 2016; Zhang et al., 2017; Artetxe et al., 2018b; Xu et al., 2018; Hoshen and Wolf, 2018; Alvarez-Melis and Jaakkola, 2018; Mohiuddin and Joty, 2019). While not requiring any cross-lingual supervision makes these methods attractive, recent work shows that even the most robust unsupervised method (Artetxe et al., 2018b) fails for a large number of language pairs. It suggests rethinking the main motivations behind fully unsupervised methods, showing that with a small seed dictionary (500-1K pairs) a semi-supervised method always outperforms the unsupervised one and does not fail for any language pair. Other concurrent work (Ormazabal et al., 2019; Doval et al., 2019) also advocates weak supervision in CLWE methods.
Almost all mapping-based CLWE methods, supervised and unsupervised alike, solve the Procrustes problem in the final step or during self-learning. This restricts the transformation to orthogonal linear mappings. However, learning an orthogonal linear mapping inherently assumes that the embedding spaces of different languages exhibit similar geometric structures (i.e., are approximately isomorphic). Several recent studies have questioned this strong assumption and empirically shown that it does not hold in general, even for two closely related languages like English and German (Søgaard et al., 2018; Patra et al., 2019).
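The Procrustes problem mentioned above has a closed-form SVD solution. The following sketch (ours, for illustration only) shows the orthogonal map that these methods are restricted to:

```python
import numpy as np

def procrustes(X, Y):
    # Closed-form solution of min ||X W - Y||_F over orthogonal W:
    # with X^T Y = U S V^T (SVD), the optimum is W = U V^T.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy check: recover a known rotation from exactly rotated points.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 5))
Q, _ = np.linalg.qr(rng.normal(size=(5, 5)))  # random orthogonal matrix
W_hat = procrustes(X, X @ Q)
```

The orthogonality constraint is exactly what presumes isomorphic spaces: an orthogonal map can rotate or reflect a space but cannot reshape it.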
In this work, we propose LNMAP (Latent space Non-linear Mapping), a novel semi-supervised approach that uses non-linear mapping in the latent space to learn CLWE. It uses minimal supervision from a seed dictionary, while leveraging semantic information from the monolingual word embeddings. As shown in Figure 1, LNMAP comprises two autoencoders, one for each language. The autoencoders are first trained independently in a self-supervised way to induce the latent code space of the respective languages. Then, we use a small seed dictionary to learn the non-linear mappings between the two code spaces. To guide the mapping in the latent space, we include two additional constraints: back-translation and original embedding reconstruction. Crucially, our method does not enforce any strong prior constraint like orthogonality (or isomorphism); rather, it gives the model the flexibility to induce the required latent structures such that it is easier for the non-linear mappers to align them in the code space.
To demonstrate the effectiveness and robustness of LNMAP, we conduct extensive experiments on bilingual lexicon induction (BLI) with fifteen (15) different language pairs (in both directions) comprising high- and low-resource languages from two different datasets, for different sizes of the seed dictionary. Our results show significant improvements for LNMAP over the state of the art in most of the tested scenarios. It is particularly effective for low-resource languages; for example, with a 1K seed dictionary, LNMAP yields about 18% absolute improvement on average over a state-of-the-art supervised method (Joulin et al., 2018). It also outperforms the most robust unsupervised system of Artetxe et al. (2018b) in most of the translation tasks. Interestingly, for resource-rich language pairs, linear autoencoders perform better than non-linear ones. Our ablation study of LNMAP reveals the collaborative nature of its different components and the efficacy of its non-linear mappings in the code space. We open-source our framework at https://ntunlpsg.github.io/project/lnmap/.

Background
Limitations of the Isomorphic Assumption. Almost all CLWE methods inherently assume that the embedding spaces of different languages are approximately isomorphic. However, researchers have recently questioned this simplified assumption and attributed the performance degradation of existing CLWE methods to strong mismatches in embedding spaces caused by linguistic and domain divergences (Søgaard et al., 2019; Ormazabal et al., 2019). Søgaard et al. (2018) empirically show that even closely related languages are far from being isomorphic. Nakashole and Flauger (2018) argue that mapping between embedding spaces of different languages can be approximately linear only in small local regions, but must be non-linear globally. Patra et al. (2019) also recently show that etymologically distant language pairs cannot be aligned properly using orthogonal transformations.
Towards Semi-supervised Methods. A number of recent studies have questioned the robustness of existing unsupervised CLWE methods, showing that even the most robust unsupervised method (Artetxe et al., 2018b) fails for a large number of language pairs; it gives zero (or near-zero) BLI performance for 87 out of 210 language pairs. With a seed dictionary of only 500-1,000 word pairs, a supervised method outperforms unsupervised ones by a wide margin for most language pairs. Other recent work also advocates semi-supervised methods (Patra et al., 2019; Ormazabal et al., 2019).
Mapping in Latent Space. Mohiuddin and Joty (2019) propose an adversarial autoencoder for unsupervised word translation. They use linear autoencoders in their model, and the mappers are also linear. They emphasize the benefit of using the latent space over the original embedding space. Although their method is more robust than other existing adversarial models, it still suffers from training instability for distant language pairs.

Our Contributions. Our proposed LNMAP is independent of the isomorphic assumption. It uses weak supervision from a small seed dictionary, while leveraging rich structural information from monolingual embeddings. Unlike Mohiuddin and Joty (2019), the autoencoders in LNMAP are not limited to linear transformations. More importantly, it uses non-linear mappers. These two factors contribute to its robust performance even for very low-resource languages (§5). To the best of our knowledge, we are the first to showcase such robust and improved performance with non-linear methods.1

LNMAP Semi-supervised Framework
Let V_x = {v_{x_1}, ..., v_{x_{n_x}}} and V_y = {v_{y_1}, ..., v_{y_{n_y}}} be two vocabularies of n_x and n_y words for a source (x) and a target (y) language, respectively. Each word v_{x_i} (resp. v_{y_j}) has an embedding x_i ∈ R^d (resp. y_j ∈ R^d), trained with any word embedding model, e.g., FastText (Bojanowski et al., 2017). Let E_x ∈ R^{n_x × d} and E_y ∈ R^{n_y × d} be the word embedding matrices for the source and target languages, respectively. We are also given a seed dictionary D = {(x_1, y_1), ..., (x_k, y_k)} with k word pairs. Our objective is to learn a transformation function M such that for any v_{x_i} ∈ V_x, M(x_i) corresponds to its translation y_j, where v_{y_j} ∈ V_y. Our approach LNMAP (Figure 1) follows two sequential steps: (i) unsupervised latent space induction using monolingual autoencoders (§3.1), and (ii) supervised non-linear transformation learning with back-translation and source embedding reconstruction constraints (§3.2).
1 Our experiments with (unsupervised) adversarial training showed very unstable results with the non-linear mappers.

Unsupervised Latent Space Induction
We use two autoencoders, one for each language. Each autoencoder comprises an encoder E_x (resp. E_y) and a decoder D_x (resp. D_y). Unless otherwise stated, the autoencoders are non-linear, where each of the encoder and decoder is a three-layer feed-forward neural network with two non-linear hidden layers. More formally, the encoding-decoding operations of the source autoencoder (autoenc_x) are defined as:

  h_1 = φ(W_1 x_i)          (1)
  h_2 = φ(W_2 h_1)          (2)
  z_{x_i} = W_3 h_2         (3)
  h'_1 = φ(W'_1 z_{x_i})    (4)
  h'_2 = φ(W'_2 h'_1)       (5)
  x̂_i = tanh(W'_3 h'_2)     (6)

where {W_1, W_2, W_3} and {W'_1, W'_2, W'_3} are the parameters of the layers in the encoder and decoder, respectively, and φ is a non-linear activation function; we use the Parametric Rectified Linear Unit (PReLU) in all the hidden layers and tanh in the final layer of the decoder (Eq. 6). We use linear activations in the output layer of the encoder (Eq. 3). We train autoenc_x with the l_2 reconstruction loss:

  L_{auto_x} = (1/b) Σ_{i=1}^{b} ||x_i − x̂_i||²_2    (7)

where the sum is over a batch of b embeddings and Θ_x = {W_1, W_2, W_3, W'_1, W'_2, W'_3} are the parameters of the encoder and the decoder of autoenc_x.
The encoder, decoder and the reconstruction loss for the target autoencoder (autoenc y ) are similarly defined.
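As a concrete illustration, the encoding-decoding computation and the reconstruction loss can be written in plain NumPy as below. This is a simplified sketch under our own assumptions (a fixed scalar PReLU slope, arbitrary layer widths), not the released implementation:

```python
import numpy as np

def prelu(h, a=0.25):
    # Parametric ReLU: identity for positives, slope a for negatives.
    # (In the model, a is learnable; we fix it here for illustration.)
    return np.where(h > 0, h, a * h)

def encode(x, W1, W2, W3, a=0.25):
    # Two PReLU hidden layers, linear output layer (Eq. 3).
    h1 = prelu(W1 @ x, a)
    h2 = prelu(W2 @ h1, a)
    return W3 @ h2                      # latent code z

def decode(z, W4, W5, W6, a=0.25):
    # Two PReLU hidden layers, tanh output layer (Eq. 6).
    h1 = prelu(W4 @ z, a)
    h2 = prelu(W5 @ h1, a)
    return np.tanh(W6 @ h2)             # reconstructed embedding

def reconstruction_loss(X, params):
    # Mean squared l2 reconstruction error over a batch (Eq. 7).
    W1, W2, W3, W4, W5, W6 = params
    X_hat = np.stack([decode(encode(x, W1, W2, W3), W4, W5, W6) for x in X])
    return np.mean(np.sum((X - X_hat) ** 2, axis=1))

# Tiny random instance: embedding dim 6, hidden width 8, code dim 4.
rng = np.random.default_rng(2)
W1, W2, W3 = (rng.normal(size=s) for s in [(8, 6), (8, 8), (4, 8)])
W4, W5, W6 = (rng.normal(size=s) for s in [(8, 4), (8, 8), (6, 8)])
X = rng.normal(size=(10, 6))
loss = reconstruction_loss(X, (W1, W2, W3, W4, W5, W6))
```

Note the tanh output of the decoder bounds reconstructions in (-1, 1), which matches length-normalized input embeddings.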

Supervised Non-linear Transformation
Let q(z_x|x) and q(z_y|y) be the distributions of latent codes in autoenc_x and autoenc_y, respectively. We have two non-linear mappers: M, which translates a source code into a target code, and N, which translates a target code into a source code (Figure 1). Both mappers are implemented as feed-forward neural networks with a single hidden layer and tanh activations, and they are trained using the provided seed dictionary D.
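A minimal sketch of such a mapper (single tanh hidden layer, linear output; the layer widths are our own choice for illustration, not the paper's):

```python
import numpy as np

def map_code(z, W_h, W_o):
    # Single hidden layer with tanh activation, linear output.
    return W_o @ np.tanh(W_h @ z)

rng = np.random.default_rng(3)
W_h = rng.normal(size=(8, 4))   # hidden width 8 (illustrative)
W_o = rng.normal(size=(4, 8))   # back to the code dimension
z = rng.normal(size=4)
mapped = map_code(z, W_h, W_o)
```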
Non-linear Mapping Loss. Let Θ_M and Θ_N denote the parameters of the two mappers M and N, respectively. While mapping from q(z_x|x) to q(z_y|y), we jointly train the mapper M and the source encoder E_x with the following l_2 loss:

  L_M = (1/k) Σ_{(x_i, y_j) ∈ D} ||M(E_x(x_i)) − E_y(y_j)||²_2    (8)
The mapping loss for N and E_y is similarly defined. To learn a better transformation function, we enforce two additional constraints on our objective: back-translation and reconstruction.
Back-Translation Loss. To ensure that a source code z_{x_i} ∈ q(z_x|x), translated to the target latent space q(z_y|y) and then translated back to the original latent space, remains unchanged, we enforce the back-translation (BT) constraint:

  L_{bt_x} = (1/k) Σ_{i=1}^{k} ||z_{x_i} − N(M(z_{x_i}))||²_2    (9)

The BT loss in the other direction, L_{bt_y}, is defined similarly.

Reconstruction Loss. In addition to back-translation, we include another constraint to guide the mapping further. In particular, we ask the decoder D_x of autoenc_x to reconstruct the original embedding x_i from the back-translated code N(M(z_{x_i})). We compute this original embedding reconstruction loss for autoenc_x as:

  L_{rec_x} = (1/k) Σ_{i=1}^{k} ||x_i − D_x(N(M(z_{x_i})))||²_2    (10)

The reconstruction loss for autoenc_y is defined similarly. Both back-translation and reconstruction lead to more stable training in our experiments. In our ablation study (§5.4), we empirically show the efficacy of these two constraints.
Total Loss. The total loss for mapping a batch of word embeddings from source to target is:

  L_{x→y} = L_M + λ_1 L_{bt_x} + λ_2 L_{rec_x}    (11)

where λ_1 and λ_2 control the relative importance of the loss components. The total loss for mapping in the opposite direction, L_{y→x}, is defined similarly.
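The combined objective can be sketched as follows, treating the encoders, decoder, and mappers as callables. The default values of `lam1`/`lam2` here are placeholders, not the paper's tuned hyperparameters:

```python
import numpy as np

def total_loss(X, Y, enc_x, enc_y, dec_x, M, N, lam1=1.0, lam2=1.0):
    # L_{x->y} = mapping + lam1 * back-translation + lam2 * reconstruction,
    # averaged over seed pairs (row i of X translates to row i of Y).
    Zx = np.stack([enc_x(x) for x in X])        # source codes
    Zy = np.stack([enc_y(y) for y in Y])        # target codes
    MZx = np.stack([M(z) for z in Zx])          # mapped source codes
    NMZx = np.stack([N(z) for z in MZx])        # back-translated codes
    l_map = np.mean(np.sum((MZx - Zy) ** 2, axis=1))   # Eq. 8
    l_bt = np.mean(np.sum((Zx - NMZx) ** 2, axis=1))   # Eq. 9
    X_rec = np.stack([dec_x(z) for z in NMZx])  # decode back-translation
    l_rec = np.mean(np.sum((X - X_rec) ** 2, axis=1))  # Eq. 10
    return l_map + lam1 * l_bt + lam2 * l_rec
```

With identity networks, the back-translation and reconstruction terms vanish and only the mapping term remains, which is a quick sanity check on the implementation.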
Remark. Note that our approach is fundamentally different from existing methods in two ways. First, most of the existing methods directly map the distribution of the source embeddings p(x) to the distribution of the target embeddings p(y). Second, they learn a linear mapping function assuming that the embedding spaces of the two languages are nearly isomorphic, which does not hold in general (Søgaard et al., 2018;Patra et al., 2019). Mapping the representations in the code space using non-linear transformations gives our model the flexibility to induce the required semantic structures in its latent space that could potentially yield more accurate cross-lingual mappings ( §5).

Training Procedure
We present the training method of LNMAP in Algorithm 1. In the first step, we pre-train autoenc_x and autoenc_y separately on the respective monolingual word embeddings. In this unsupervised step, we use the first 200K embeddings. This pre-training induces word semantics (and relations) in the code space (Mohiuddin and Joty, 2019). The next step is the self-training process, where we train the mappers along with the autoencoders using the seed dictionary in an iterative manner. We keep a copy of the original dictionary D; let us call it D_orig. We first update the mapper M and the source encoder E_x on the mapping loss (Eq. 8). The mappers (both M and N) then go through two more updates, one for back-translation (Eq. 9) and the other for reconstruction of the source embedding (Eq. 10). The entire source autoencoder autoenc_x (both E_x and D_x) gets updated at this stage only on the reconstruction loss.
After each iteration of training (step i. in Alg. 1), we induce a new dictionary D_new using the learned encoders and mappers. To find the nearest target word (y_j) of a source word (x_i) in the target latent space, we use the Cross-domain Similarity Local Scaling (CSLS) measure, which works better than simple cosine similarity in mitigating the hubness problem: it penalizes words that are close to many other words in the target latent space. To induce the dictionary, we compute CSLS for the K most frequent source and target words and select the translation pairs that are nearest neighbors of each other according to CSLS.
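A compact NumPy sketch of CSLS over all source-target pairs is given below. The neighborhood size `k` is an assumption on our part (10 is the common choice in the literature); the text above does not restate it:

```python
import numpy as np

def csls_scores(Zx, Zy, k=10):
    # CSLS(x, y) = 2 cos(x, y) - r_y(x) - r_x(y), where r_*(.) is the
    # mean cosine similarity to the k nearest neighbors in the other
    # space; this penalizes "hub" words close to many others.
    Zx = Zx / np.linalg.norm(Zx, axis=1, keepdims=True)
    Zy = Zy / np.linalg.norm(Zy, axis=1, keepdims=True)
    cos = Zx @ Zy.T                                  # pairwise cosines
    k = min(k, Zx.shape[0], Zy.shape[0])
    r_x = np.sort(cos, axis=1)[:, -k:].mean(axis=1)  # per-source k-NN mean
    r_y = np.sort(cos, axis=0)[-k:, :].mean(axis=0)  # per-target k-NN mean
    return 2 * cos - r_x[:, None] - r_y[None, :]
```

On orthonormal toy codes with k = 1, matching rows get score 0 and mismatched rows get -2, illustrating the hubness penalty at work.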
For the next iteration of training, we construct the dictionary D by merging D_orig with the l most similar (based on CSLS) word pairs from D_new. We set l = iter × C, where iter is the current iteration number and C is a hyperparameter; that is, we incrementally grow the dictionary. This is because the induced dictionary at the initial iterations is likely to be noisy; as training progresses, the model becomes more mature and the induced pairs become better. For convergence, we use the following criterion: if the difference between the average similarity scores of two successive iterations is less than a threshold (we use 1e-6), we stop training.
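The mutual-nearest-neighbor induction and the incremental dictionary update can be sketched as follows (our simplified version, operating on a precomputed CSLS score matrix; in the actual procedure the matrix covers the K most frequent words):

```python
import numpy as np

def induce_dictionary(scores):
    # Mutual-nearest-neighbor pairs under a score matrix
    # (sources in rows, targets in columns), best-scoring first.
    fwd = scores.argmax(axis=1)                  # best target per source
    bwd = scores.argmax(axis=0)                  # best source per target
    pairs = [(i, int(j)) for i, j in enumerate(fwd) if bwd[j] == i]
    return sorted(pairs, key=lambda p: -scores[p])

def merged_dictionary(d_orig, d_new, iteration, C=2000):
    # Grow the training dictionary: D_orig plus the l = iter * C most
    # similar induced pairs (d_new assumed sorted by similarity).
    return list(d_orig) + d_new[: iteration * C]
```

Growing l with the iteration number admits only the highest-confidence induced pairs early on, when the mapping is still noisy.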

Experimental Settings
We evaluate our approach on bilingual lexicon induction, also known as word translation.

Datasets
To demonstrate the effectiveness of our method, we evaluate our models against baselines on two popularly used datasets: MUSE and VecMap (Dinu et al., 2015).
The MUSE dataset consists of FastText monolingual embeddings of 300 dimensions (Bojanowski et al., 2017), trained on Wikipedia monolingual corpora, and gold dictionaries for 110 language pairs.2 To show the generality of different methods, we consider 15 different language pairs, i.e., 15 × 2 = 30 different translation tasks, encompassing resource-rich and low-resource languages from different language families. In particular, we evaluate English (En) from/to Spanish (Es), German (De), Italian (It), Russian (Ru), Arabic (Ar), Malay (Ms), Finnish (Fi), Estonian (Et), Turkish (Tr), Greek (El), Persian (Fa), Hebrew (He), Tamil (Ta), Bengali (Bn), and Hindi (Hi). We differentiate between high- and low-resource languages by the availability of NLP resources in general.
The VecMap dataset and its subsequent extension by Artetxe et al. (2018a) is more challenging; it contains monolingual embeddings for English, Spanish, German, Italian, and Finnish. According to Artetxe et al. (2018b), existing unsupervised methods often fail to produce meaningful results on this dataset. English, Italian, and German embeddings were trained on WacKy crawling corpora using CBOW (Mikolov et al., 2013b), while Spanish and Finnish embeddings were trained on WMT News Crawl and Common Crawl, respectively.

2 https://github.com/facebookresearch/MUSE

Baseline Methods
We compare our proposed LNMAP with several existing methods comprising supervised, semi-supervised, and unsupervised models. For each baseline model, we conduct experiments with the publicly available code. In the following, we give a brief description of the baseline models.
(a) Artetxe et al. (2017) propose a self-learning framework that performs two steps iteratively until convergence. In the first step, they use the dictionary (starting with the seed dictionary) to learn a linear mapping, which is then used in the second step to induce a new dictionary. (b) Artetxe et al. (2018a) propose a multi-step framework that generalizes previous studies. Their framework consists of several steps: whitening, orthogonal mapping, re-weighting, de-whitening, and dimensionality reduction. (c) A supervised baseline learns an orthogonal mapping between the embedding spaces by iterative Procrustes refinement; the same work proposes CSLS for nearest-neighbor search. (d) Joulin et al. (2018) show that minimizing a convex relaxation of the CSLS loss significantly improves the quality of bilingual word vector alignment. Their method achieves state-of-the-art results for many languages (Patra et al., 2019). (e) Jawanpuria et al. (2019) propose a geometric approach where they decouple CLWE learning into two steps: (i) learning rotations for language-specific embeddings to align them to a common space, and (ii) learning a similarity metric in the common space to model similarities between the embeddings of the two languages.

Among unsupervised baselines: (a) The first work to show impressive results for unsupervised word translation pairs adversarial training with effective refinement methods. Given two monolingual word embeddings, the adversarial training plays a two-player game, where a linear mapper (generator) plays against a discriminator; an orthogonality constraint is also imposed on the mapper. After adversarial training, the method applies an iterative Procrustes solution similar to the supervised approach. (b) Artetxe et al. (2018b) learn an initial dictionary by exploiting the structural similarity of the embeddings in an unsupervised way, and propose a robust self-learning procedure to improve it iteratively. This model is by far the most robust and best-performing unsupervised model.

Model Variants and Settings
We experiment with two variants of our model: the default LNMAP, which uses non-linear autoencoders, and LNMAP (LIN. AE), which uses linear autoencoders. In both variants, the mappers are non-linear. We train our models using stochastic gradient descent (SGD) with a batch size of 128, a learning rate of 1e-4, and a step learning-rate decay schedule. During the dictionary induction process in each iteration, we consider the K = 15,000 most frequent words from the source and target languages. For the dictionary update, we set C = 2000.

Results and Analysis
We present our results on low-resource and resource-rich languages from the MUSE dataset in Tables 1 and 2, respectively, and the results on the VecMap dataset in Table 3. We report precision@1: the fraction of source words for which one of the correct translations is predicted as the top choice. For each case, we show results for seed dictionaries of three different sizes, covering 1-to-1 and 1-to-many mappings; "1K Unique" and "5K Unique" contain 1-to-1 mappings of 1,000 and 5,000 source-target pairs, respectively, while "5K All" contains 1-to-many mappings of all 5,000 source and target words, that is, each source word can have multiple target words. Through experiments and analysis, our goal is to assess the following questions.
(i) Does LNMAP improve over the best existing methods in terms of mapping accuracy on low-resource languages (§5.1)?
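The precision@1 metric used throughout can be computed as in this small sketch (ours; `gold` maps each source word to its set of acceptable translations, covering the 1-to-many "5K All" setting):

```python
def precision_at_1(predictions, gold):
    # predictions: source word -> top-ranked translation.
    # gold: source word -> set of acceptable translations (a source
    # word may have several correct targets).
    hits = sum(1 for w, p in predictions.items() if p in gold.get(w, ()))
    return hits / len(predictions)
```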

Performance on Low-resource Languages
Most unsupervised models fail on the majority of low-resource languages, while the performance of supervised models on these languages is not satisfactory either, especially with small seed dictionaries. Hence, we first compare LNMAP's performance on these languages. From Table 1, we see that on average LNMAP outperforms every baseline by a good margin (1.1%-5.2% over the best baselines). For the "1K Unique" dictionary, LNMAP exhibits impressive performance: in all 20 translation tasks, it outperforms all the (semi-)supervised baselines by a wide margin. Compared with Joulin et al. (2018), a state-of-the-art supervised model, LNMAP's average improvement is ∼18%, which is remarkable. Compared to the other baselines, the average margins of improvement are also quite high: 9.9%, 14.6%, 5.2%, 9.7%, and 8.0%, beginning with Artetxe et al. (2017). If we increase the dictionary size, we still see the dominance of LNMAP over the baselines. For the "5K Unique" seed dictionary, it performs better than the baselines on 14/20 translation tasks, while for the "5K All" seed dictionary, LNMAP performs best on 13/20 translation tasks.
Interestingly, LNMAP's performance under a resource-constrained setup is impressive, making it suitable for very low-resource languages like En-Ta, En-Bn, and En-Hi. Turning to the unsupervised baselines on low-resource languages, the adversarial model fails to converge on the majority of the translation tasks (12/20), while the model of Mohiuddin and Joty (2019) fails to converge on En↔Ta, En↔Bn, and En↔Hi. Although the most robust unsupervised method of Artetxe et al. (2018b) performs better than the other unsupervised approaches, it still fails to converge on the En↔Ta tasks. Comparing its performance with LNMAP, our model outperforms this best unsupervised model on 18/20 low-resource translation tasks.

Performance on Resource-rich Languages

Table 2 shows the results for 5 resource-rich language pairs (10 translation tasks) from the MUSE dataset. We notice that our model achieves the highest accuracy in all the tasks for "1K Unique", 4 tasks for "5K Unique", and 3 for "5K All". We show the results on the VecMap dataset in Table 3, where there are 3 resource-rich language pairs and one low-resource pair (En-Fi), for a total of 8 translation tasks. Overall, we have similar observations as on MUSE: our model outperforms the other models on 7 tasks for "1K Unique", 4 tasks for "5K Unique", and 4 for "5K All".

Effect of Non-linearity in Autoencoders
The comparative results between our model variants in Tables 1-3 reveal that LNMAP (with non-linear autoencoders) works better for low-resource languages, whereas LNMAP (LIN. AE) works better for resource-rich languages. This can be explained by the geometric similarity between the embedding spaces of the two languages. In particular, we measure the geometric similarity of the language pairs using the Gromov-Hausdorff (GH) distance (Patra et al., 2019), which was recently proposed to quantitatively estimate the isometry between two embedding spaces.4 From these measurements (Tables 1-2), we see that etymologically close language pairs have a lower GH distance than etymologically distant and low-resource language pairs.5 The high GH distance of the low-resource language pairs implies that the embedding spaces of English and those languages are far from isomorphic; hence, we need strong non-linearity for those distant languages.

Dissecting LNMAP
We further analyze our model by dissecting it and measuring the contribution of its different components. Specifically, our goal is to assess the contributions of back-translation, reconstruction, non-linearity in the mapper, and non-linearity in the autoencoder. We present the ablation results in Table 4 on 8 translation tasks from 4 language pairs (2 resource-rich and 2 low-resource), using the MUSE dataset.
All the experiments for the ablation study use the "1K Unique" seed dictionary.

4 https://github.com/joelmoniz/BLISS
5 We could not compute GH distances for the VecMap dataset; the metric gives 'inf' in the BLISS framework.

Table 4: Ablation study of LNMAP with the "1K Unique" dictionary. '⊖' indicates the component is removed from the full model, and '⊕' indicates the component is added by replacing the corresponding component.

⊖ Reconstruction loss: Removing the reconstruction loss from the full model costs high-resource language pairs 0.9% and 5.3% accuracy on average in the from-English and to-English directions, respectively. For low-resource language pairs, the losses are even higher: 2.5% and 6.4% on average.
⊖ Back-translation (BT) loss: Removing the BT loss also has a negative impact, but not as large as removing reconstruction. This is because the reconstruction loss (Eq. 10) also carries the BT signal.

⊕ Linear mapper: If we replace the non-linear mapper with a linear one in the full model, the effect is not that severe. This can be explained by the fact that the autoencoders are still non-linear, and the non-linear signal passes through back-translation and reconstruction.

⊕ Procrustes solution: To assess the proper effect of the non-linear mapper, we need to replace it with a linear mapper through which no non-linear signal passes during training. This can be achieved by replacing the non-linear mapper with the Procrustes solution. The results show an adverse effect of removing non-linearity in the mapper for all language pairs, but the performance of low-resource pairs drops especially significantly.

⊕ Linear autoencoder: For high-resource language pairs, the linear autoencoder works better than the non-linear one. It is the opposite for low-resource pairs, where performance drops significantly with the linear autoencoder.

Conclusions
We have presented LNMAP, a novel semi-supervised framework to learn the cross-lingual mapping between two monolingual word embedding spaces. Apart from exploiting weak supervision from a small (1K) seed dictionary, LNMAP leverages the information in the monolingual word embeddings. In contrast to existing methods that directly map word embeddings under the isomorphic assumption, our framework is independent of any such strong prior assumption. LNMAP first learns to transform the embeddings into a latent space and then uses a non-linear transformation to learn the mapping. To further guide the non-linear mapping, we include back-translation and original embedding reconstruction constraints.
Extensive experiments with fifteen different language pairs comprising high- and low-resource languages show the efficacy of non-linear transformations, especially for low-resource and distant languages. Comparisons with existing supervised, semi-supervised, and unsupervised baselines show that LNMAP learns a better mapping. With an in-depth ablation study, we show that the different components of LNMAP work collaboratively.