Unsupervised Bilingual Lexicon Induction via Latent Variable Models

Bilingual lexicon extraction has been studied for decades, and most previous methods rely on parallel corpora or bilingual dictionaries. Recent studies have shown that it is possible to build a bilingual dictionary by aligning monolingual word embedding spaces in an unsupervised way. Drawing on recent advances in generative models, we propose a novel approach that builds cross-lingual dictionaries via latent variable models and adversarial training with no parallel corpora. To demonstrate the effectiveness of our approach, we evaluate it on several language pairs; the experimental results show that our model achieves competitive and even superior performance compared with several state-of-the-art models.


Introduction
Learning the representations of languages is a fundamental problem in natural language processing, and most existing methods exploit the hypothesis that words occurring in similar contexts tend to have similar meanings (Pennington et al., 2014; Bojanowski et al., 2017), which leads word vectors to capture semantic information. Mikolov et al. (2013) first point out that word embeddings learned on separate monolingual corpora exhibit similar structures. Based on this finding, they suggest it is possible to learn a linear mapping from a source to a target embedding space and then generate bilingual dictionaries. This simple yet effective approach has led researchers to investigate improving cross-lingual word embeddings with the help of bilingual word lexicons (Faruqui and Dyer, 2014; Xing et al., 2015).
For low-resource languages and domains, cross-lingual signals are hard and expensive to obtain, and thus it is necessary to reduce the need for bilingual supervision. Artetxe et al. (2017) successfully learn bilingual word embeddings with only a parallel vocabulary of aligned digits. Zhang et al. (2017) utilize adversarial training to obtain cross-lingual word embeddings without any parallel data. However, their performance is still significantly worse than that of supervised methods. By combining the merits of several previous works, Conneau et al. (2018) introduce a model that reaches and even outperforms supervised state-of-the-art methods with no parallel data.
In recent years, generative models have become increasingly powerful; Generative Adversarial Networks (GANs) (Goodfellow et al., 2014) and Variational Autoencoders (VAEs) (Kingma and Welling, 2014) are prominent examples. In this work, we borrow ideas from both GANs and VAEs to tackle the problem of bilingual lexicon induction. The basic idea is to learn latent variables that capture the semantic meaning of words, which is helpful for bilingual lexicon induction. We also utilize adversarial training for our model and require no form of supervision. We evaluate our approach on several language pairs, and experimental results demonstrate that our model achieves promising performance. We further combine our model with several helpful techniques and show that it performs competitively with, and sometimes better than, several state-of-the-art methods.

Bilingual Lexicon Induction
Extracting bilingual lexica has been studied by researchers for a long time. Mikolov et al. (2013) first observe that there is an isomorphic structure among word embeddings trained separately on monolingual corpora, and they learn a linear transformation between languages. Zhang et al. (2016b) improve the method by constraining the transformation matrix to be orthogonal. Xing et al. (2015) incorporate length normalization during training and maximize the cosine similarity instead. They point out that adding an orthogonality constraint can improve performance and admits a closed-form solution, which is referred to as the Procrustes approach in Smith et al. (2017). Canonical correlation analysis has also been used to map both languages to a shared vector space (Faruqui and Dyer, 2014; Lu et al., 2015).

Figure 1: Illustration of our model. φ_s and φ_t map the source and target word embeddings into latent variables. Discriminator D guides the two latent distributions to be the same.
To reduce the need for supervision signals, Artetxe et al. (2017) use identical digits and numbers to form an initial seed dictionary and then iteratively refine their results until convergence. Zhang et al. (2017) apply adversarial training to align monolingual word vector spaces with no supervision. Conneau et al. (2018) improve the model by combining adversarial training with the Procrustes approach, and their unsupervised approach reaches and even outperforms state-of-the-art supervised approaches. In this work, we make further improvements and enhance the model proposed in Conneau et al. (2018) with a latent variable model and an iterative training procedure.

Generative Models
VAEs (Kingma and Welling, 2014) represent one of the most successful families of deep generative models. Standard VAEs assume observed variables are generated from latent variables and that the latent variables are sampled from a simple Gaussian distribution. Typically, VAEs utilize a neural inference model to approximate the intractable posterior, and optimize model parameters jointly with a reparameterized variational lower bound. VAEs have been successfully applied to several natural language processing tasks before (Zhang et al., 2016a; Bowman et al., 2016).
GANs (Goodfellow et al., 2014) are another framework for estimating generative models via an adversarial process and have attracted huge attention. The basic strategy is to train a generative model and a discriminative model simultaneously via an adversarial process. The adversarial training technique for matching distributions has proven to be powerful in a variety of tasks (Bowman et al., 2016). The Adversarial Autoencoder (Makhzani et al., 2015) is a probabilistic autoencoder that uses GANs to perform variational inference. By combining a VAE with a GAN, Larsen et al. (2016) use learned feature representations in the GAN discriminator as the basis for the VAE reconstruction objective. GANs have also been applied in machine translation before (Yang et al., 2018).

Proposed Approach
In this section, we first briefly introduce VAEs, and then we illustrate the details and training techniques of our proposed model.

Variational Autoencoder
Variational Autoencoders (VAEs) are deep generative models capable of learning complex density models for data via latent variables. Given a nonlinear generative model p_θ(x|z) with input x ∈ R^D and associated latent variable z ∈ R^L drawn from a prior distribution p_0(z), the goal of VAEs is to use a recognition model q_φ(z|x) to approximate the intractable posterior distribution of the latent variables by maximizing the following variational lower bound:

L(θ, φ; x) = E_{q_φ(z|x)}[log p_θ(x|z)] − KL(q_φ(z|x) || p_0(z)),  (1)

where KL refers to the Kullback-Leibler divergence.
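As a concrete illustration of the lower bound in Equation 1, the following is a minimal NumPy sketch of a one-sample Monte Carlo ELBO estimate with a diagonal-Gaussian posterior and a standard-normal prior; the unit-variance Gaussian likelihood (so that −log p(x|z) reduces to squared error up to a constant) is an illustrative assumption, not a detail specified in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_kl(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dims."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var, axis=-1)

def elbo(x, mu, log_var, decode):
    """One-sample Monte Carlo estimate of the variational lower bound (Eq. 1).

    decode(z) returns the mean of p_theta(x|z); with a unit-variance Gaussian
    likelihood, log p(x|z) is -0.5 * ||x - decode(z)||^2 up to a constant.
    """
    eps = rng.standard_normal(mu.shape)
    z = mu + np.exp(0.5 * log_var) * eps          # reparameterization trick
    recon = -0.5 * np.sum((x - decode(z)) ** 2, axis=-1)
    return recon - gaussian_kl(mu, log_var)
```

When mu = 0 and log_var = 0, the KL term vanishes and only the (non-positive) reconstruction term remains, which matches the intuition that the bound is maximized when the posterior matches the prior and the reconstruction is exact.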

Our Model
Basically, our model assumes that the source word embeddings {x_n} and the target word embeddings {y_n} can be drawn from the same latent variable space {z_n}, where {z_n} is capable of capturing the semantic meaning of words.
In contrast to the standard VAE prior, which assumes each latent embedding z_n is drawn from the same latent Gaussian, our model only requires the distributions of latent variables for source and target word embeddings to be equal. To achieve this goal, we utilize adversarial training to guide the two latent distributions to match each other.
As in adversarial training, we have networks φ s and φ t for both source and target space, striving to map words into the same latent space, while the discriminator D is a binary classifier which tries to distinguish between the two languages. We also have reconstruction networks θ s and θ t as in VAEs.
The objective function for the discriminator D can be formulated as

L_D = − E_{x∼X}[log D(z_x)] − E_{y∼Y}[log(1 − D(z_y))],  (2)

where z_x ∼ q_{φ_s}(z|x), z_y ∼ q_{φ_t}(z|y), and D predicts the probability that a latent vector originates from the source language. For the source side, the objective is to minimize

L_s = E_{q_{φ_s}(z|x)}[− log p_{θ_s}(x|z)] + E_{q_{φ_s}(z|x)}[log D(z)],  (3)

i.e., the reconstruction loss plus an adversarial term that pushes the source latent distribution toward the target's. Here we define q_{φ_s}(z|x) = N(μ_s(x), Σ_s(x)), where μ_s(x) = W_{μ_s} x and Σ_s(x) = exp(W_{σ_s} x); W_{μ_s} and W_{σ_s} are learned parameters. We also define the mean of p_{θ_s}(x|z) to be W_{μ_s}^T z. The objective function and structure for φ_t are similar.
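The linear-Gaussian encoder and tied decoder above can be sketched as follows. The dimensionalities and random initialization are illustrative placeholders (in practice W_{μ_s} and W_{σ_s} are learned); the sketch only shows the parameterization: a linear mean, a log-linear diagonal covariance, and a decoder mean tied to the transposed mean projection.

```python
import numpy as np

rng = np.random.default_rng(1)

D, L = 300, 100                              # embedding / latent dims (illustrative)
W_mu = rng.standard_normal((L, D)) * 0.01    # mean projection, learned in practice
W_sigma = rng.standard_normal((L, D)) * 0.01 # log-variance projection

def encode(x):
    """q_phi(z|x) = N(mu(x), diag(sigma(x))): linear mean, log-linear variance."""
    mu = W_mu @ x
    sigma = np.exp(W_sigma @ x)              # exp keeps the diagonal covariance positive
    return mu, sigma

def decode_mean(z):
    """Mean of p_theta(x|z); decoder weights are tied to the encoder (W_mu^T)."""
    return W_mu.T @ z
```

Tying the decoder to W_mu^T keeps the number of parameters small and, when W_mu is close to orthogonal, makes the encode-decode round trip approximately the identity.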
The basic framework of our model is shown in Figure 1. As the figure shows, our model tries to map the source and target word embeddings into the same latent space, which captures the semantic meaning of words.
Theoretical analysis has revealed that adversarial training minimizes the Jensen-Shannon (JS) divergence between the real and fake distributions. Therefore, one can view our model as replacing the KL divergence in Equation 1 with the JS divergence and changing the Gaussian prior to the target latent distribution.
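A minimal sketch of the two adversarial objectives, under the sign convention assumed above (D outputs the probability that a latent vector comes from the source language; the function names are illustrative, not from the paper):

```python
import numpy as np

def discriminator_loss(d_src, d_tgt, eps=1e-9):
    """Binary cross-entropy for D: label source latents 1, target latents 0.

    d_src, d_tgt are D's predicted probabilities on batches of latent vectors.
    """
    return -(np.mean(np.log(d_src + eps)) + np.mean(np.log(1.0 - d_tgt + eps)))

def source_adv_loss(d_on_src_latents, eps=1e-9):
    """Adversarial term for the source mapping (the E[log D(z)] term in Eq. 3):
    minimizing it drives D(z) toward 0, i.e., the source latents become
    indistinguishable from target latents."""
    return np.mean(np.log(d_on_src_latents + eps))
```

At the optimum of this minimax game, D outputs 0.5 everywhere and the discriminator loss equals 2 log 2, the constant at which the JS divergence between the two latent distributions is zero.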

Training Strategy
Our model has two generators, φ_s and φ_t, and we have found that training them jointly is extremely unstable. In this paper, we propose an iterative method to train our models. Basically, we first initialize W_{μ_t} to the identity matrix and train φ_s and θ_s on the source side. After convergence, we freeze W_{μ_s} and then train φ_t and θ_t on the target side. The pseudo-code for this process is shown in Algorithm 1. It should be noted that once training is complete, inference is deterministic: only the means are used and the variance terms are discarded.
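The two-phase schedule can be sketched as a small orchestration skeleton; the callables are placeholders for the actual optimization runs (which the paper leaves to Algorithm 1), so this only pins down the ordering and freezing constraints.

```python
def iterative_training(train_source_side, train_target_side):
    """Sketch of the two-phase training schedule.

    Phase 1: with W_mu_t fixed to the identity, train phi_s / theta_s
             (the source encoder and decoder) until convergence.
    Phase 2: freeze W_mu_s and train phi_t / theta_t on the target side.

    Each callable receives a shared log list so the phases can record progress.
    """
    log = []
    train_source_side(log)   # phase 1: target mapping held at identity
    train_target_side(log)   # phase 2: source mean projection frozen
    return log
```

Running the source phase to convergence first gives the target phase a fixed, already-meaningful latent space to match, which is what makes the joint instability go away.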

Small-scale Datasets
In this section, our experiments focus on small-scale datasets, and our main baseline model is the adversarial autoencoder (Zhang et al., 2017). For fairness, we use the same model selection strategy as Zhang et al. (2017), i.e., we choose the model whose sum of reconstruction loss and classification accuracy is the lowest. The source and target word embeddings are first mapped into the latent space: each source word embedding x is transformed into z_x, then its k nearest target embeddings are retrieved and compared against the entry in a ground-truth bilingual lexicon. Performance is measured by top-1 accuracy.
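The evaluation procedure just described can be sketched in a few lines; cosine similarity is assumed as the retrieval metric here, and the gold lexicon is simplified to a single correct translation per source word.

```python
import numpy as np

def top1_accuracy(src_latents, tgt_latents, gold):
    """Top-1 accuracy of nearest-neighbour retrieval in the latent space.

    src_latents: (n_src, L) mapped source embeddings
    tgt_latents: (n_tgt, L) mapped target embeddings
    gold: dict mapping source row index -> correct target row index
    """
    s = src_latents / np.linalg.norm(src_latents, axis=1, keepdims=True)
    t = tgt_latents / np.linalg.norm(tgt_latents, axis=1, keepdims=True)
    nn = np.argmax(s @ t.T, axis=1)       # nearest target under cosine similarity
    hits = sum(int(nn[i] == j) for i, j in gold.items())
    return hits / len(gold)
```

Top-k accuracy would replace the argmax with the k largest entries per row and count a hit whenever the gold target appears among them.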

Experiments on Chinese-English Dataset
For this set of experiments, we use the same data as Zhang et al. (2017). The statistics of the final training data are given in Table 1. We use the Chinese-English Translation Lexicon Version 3.0 (LDC2002L27) as our ground-truth bilingual lexicon for evaluation. The baseline models are the MonoGiza system (Dou et al., 2015), translation matrix (TM) (Mikolov et al., 2013), isometric alignment (IA) (Zhang et al., 2016b), and the adversarial training approach (Zhang et al., 2017). Table 2 summarizes the performance of the baseline models and our approach. The results of the baseline models are cited from Zhang et al. (2017). As the table shows, our model achieves superior performance compared with the other baseline models.

Experiments on Other Language Pairs Datasets
We also conduct experiments on Spanish-English and Italian-English language pairs. Again, we use the same datasets as Zhang et al. (2017); the statistics are shown in Table 3. The experimental results are shown in Table 4. Because Spanish, Italian, and English are closely related languages, accuracy is higher than on the Chinese-English dataset. Our model is able to outperform the baseline model in this setting as well.

Large-scale Datasets
In this section, we integrate our method with that of Conneau et al. (2018), which improves on Zhang et al. (2017) through a more sophisticated refinement procedure and validation criterion. We replace their first step, namely the adversarial training step, with our model. Basically, we first map the source and target embeddings into the latent space using our algorithm, and then fine-tune the identity mapping in the latent space with the closed-form Procrustes solution. We use their similarity measure, namely cross-domain similarity local scaling (CSLS), to produce reliable matching pairs and as the validation criterion for unsupervised model selection.
We conduct experiments on English-Spanish, English-Russian, and English-Chinese datasets, which are the same as in Conneau et al. (2018). The results are shown in Table 5. As seen, our model consistently achieves better performance than adversarial training. After refinement, our model further achieves competitive and even superior results compared with state-of-the-art unsupervised methods. This further demonstrates the capacity of our model.

Conclusion
Based on the assumption that word vectors in different languages can be drawn from the same latent variable space, we propose a novel approach that builds cross-lingual dictionaries via latent variable models and adversarial training with no parallel corpora. Experimental results on several language pairs have demonstrated the effectiveness and universality of our model. We hope our method can also benefit other areas such as unsupervised machine translation.
Future directions include validating our model in more realistic scenarios (Dinu et al., 2015) as well as combining our algorithm with more sophisticated adversarial networks (Gulrajani et al., 2017).