MAAM: A Morphology-Aware Alignment Model for Unsupervised Bilingual Lexicon Induction

The task of unsupervised bilingual lexicon induction (UBLI) aims to induce word translations from monolingual corpora in two languages. Previous work has shown that morphological variation is an intractable challenge for UBLI: the induced translation in a failure case is usually morphologically related to the correct translation. To tackle this challenge, we propose a morphology-aware alignment model for the UBLI task. The proposed model alleviates the adverse effect of morphological variation by introducing grammatical information learned by a pre-trained denoising language model. Results show that our approach substantially outperforms several state-of-the-art unsupervised systems, and even achieves competitive performance compared to supervised methods.


Introduction
The task of unsupervised bilingual lexicon induction aims at identifying translational equivalents across two languages (Kementchedjhieva et al., 2018). It can be applied in many real-world scenarios, such as machine translation (Artetxe et al., 2018b) and transfer learning (Zhou et al., 2016).
Based on the observation that embedding spaces of different languages exhibit similar structures, a prominent approach is to align the monolingual embedding spaces of two languages with a simple linear mapping (Zhang et al., 2017a; Lample et al., 2018). However, previous work (Artetxe et al., 2018a) has shown that morphological variation is an intractable challenge for the UBLI task. The induced translations in failure cases are usually morphologically related words; due to their similar semantics, these words can easily mislead the system into incorrect alignments. Prior analysis of MUSE failure cases on the FR-EN language pair shows that all failures can be attributed to morphological variation. For instance, for the French source word "mangez", MUSE translates it to the morphologically related word "eats" instead of the correct English translation "eat".
However, we find that additional grammatical information can help alleviate the adverse effect of morphological variation. In detail, since lexicon induction (word alignment) can be regarded as word-to-word translation, the fluency of the translated sentence can reflect the quality of word alignment. If the model can retrieve the correct translation for each word in a source sentence, the translated sentence is more likely to be fluent and grammatically correct. Considering that some problems (e.g., word order) of naive word-to-word translation can also lead to poor fluency, we pre-train a denoising auto-encoder (DAE) to clean noise in the original translated sentence. Figure 1 visually shows an example. For the French source word "mangez", if the model translates it to "eats" instead of the correct English translation "eat", the denoised translated sentence "you eats meat" is grammatically unreasonable. Therefore, by considering the fluency of the denoised translated sentence, these morphologically related erroneous translations can be reasonably penalized.
Motivated by this, we propose a morphology-aware alignment model to alleviate the adverse effect of morphological variation by introducing additional grammatical information. The proposed model consists of a learnable linear transformation W between the two languages and a parameter-fixed denoising evaluator E. W is responsible for performing word-to-word translation on sentences in the source monolingual corpus. E first applies a DAE to clean noise in the original translated sentence, and then evaluates the fluency of the denoised translated sentence via a language model pre-trained on the target monolingual corpus, which guides the training of W. Due to the discrete operation of word-to-word translation, we employ the REINFORCE algorithm (Williams, 1992) to estimate the corresponding gradient. With the grammatical information contained in E, the adverse effect of morphological variation can be alleviated.
Our main contributions are listed as follows:
• We propose a morphology-aware alignment model for unsupervised bilingual lexicon induction, which alleviates the adverse effect of morphological variation by introducing grammatical information learned from a pre-trained language model.
• Extensive experimental results show that our approach achieves better performance than several state-of-the-art unsupervised systems, and even achieves competitive performance compared to supervised methods.

Proposed Model
We use X and Y to denote the source and target monolingual embeddings, respectively. The task aims to find a linear transformation W so that for any source word embedding x, Wx lies close to the embedding y of its translation. Figure 2 presents the sketch of our proposed morphology-aware alignment model, which consists of a learnable linear transformation W and a parameter-fixed denoising evaluator E.

Figure 2: The sketch of the proposed model.

Word-to-Word Translation
The word-to-word translation is accomplished by the linear transformation W. Specifically, each word s_i in a source sentence s = (s_1, ..., s_m) is translated by retrieving the nearest target word t_i based on cosine similarity:

t_i = \arg\max_{t} \cos(W x_{s_i}, y_t)    (1)

where x_{s_i} and y_t represent the pre-trained monolingual embeddings of the source word s_i and the target word t, respectively.
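The retrieval step above can be sketched as follows. This is a minimal illustration with toy embeddings; the function and variable names are ours, not from the paper's implementation:

```python
import numpy as np

def translate_sentence(sentence, src_emb, tgt_emb, tgt_vocab, W):
    """Translate each source word by retrieving the nearest target word
    under cosine similarity in the mapped space (Eq. 1)."""
    tgt_matrix = np.stack([tgt_emb[w] for w in tgt_vocab])
    # Pre-normalize target embeddings so a dot product equals cosine similarity.
    tgt_norm = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    translation = []
    for word in sentence:
        mapped = W @ src_emb[word]
        mapped = mapped / np.linalg.norm(mapped)
        scores = tgt_norm @ mapped  # cosine similarity to every target word
        translation.append(tgt_vocab[int(np.argmax(scores))])
    return translation
```

In practice the argmax over a 200K-word vocabulary is computed as one matrix-vector product per source word, as above.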

Denoising Evaluator
The denoising evaluator E utilizes learned grammatical information to guide the training of W. It contains two crucial components: a denoising auto-encoder (DAE) and a language model. Both components are pre-trained on the target monolingual corpus and remain fixed during training.

Denoising Auto-Encoder
Considering some ingrained problems (e.g., word order) of naive word-to-word translation, the original translation t can be regarded as a noisy version of the ground-truth translation. Therefore, we adopt a DAE (Vincent et al., 2008) to clean noise in t = (t_1, ..., t_m) so that E can provide a more accurate supervisory signal. Here we implement the DAE as an encoder-decoder framework (Bahdanau et al., 2015). The input is the noisy version N(c) and the output is the cleaned sentence c, where c is a sentence sampled from the target monolingual corpus. Following Kim et al. (2018), we construct N(c) with three types of noise: insertion, deletion, and reordering. Readers can refer to Kim et al. (2018) for more technical details.
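The noise function N(c) can be sketched as follows. This is only an illustration: the probabilities and window size are hypothetical hyperparameters, and insertion here duplicates words from the sentence itself as a stand-in (Kim et al. (2018) sample inserted words differently):

```python
import random

def add_noise(sentence, p_del=0.1, p_ins=0.1, k=3, rng=None):
    """Corrupt a clean target sentence with the three noise types
    used to train the DAE: deletion, insertion, and local reordering."""
    rng = rng or random.Random()
    # Deletion: drop each word independently with probability p_del
    # (keep at least one word so the sentence never becomes empty).
    noisy = [w for w in sentence if rng.random() > p_del] or sentence[:1]
    # Insertion: after each word, insert a random word with probability p_ins.
    out = []
    for w in noisy:
        out.append(w)
        if rng.random() < p_ins:
            out.append(rng.choice(sentence))
    # Reordering: jitter each position by at most k, then sort by the
    # perturbed keys, so words move only locally.
    keys = [i + rng.uniform(0, k) for i in range(len(out))]
    return [w for _, w in sorted(zip(keys, out), key=lambda t: t[0])]
```

Training pairs for the DAE are then (add_noise(c), c) for sentences c sampled from the target monolingual corpus.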

Language Model
For a source sentence s, if W is of high quality, the denoised translated sentence should remain fluent and grammatically correct. Otherwise, if W retrieves a morphologically related but erroneous word, the denoised translated sentence tends to be grammatically incorrect, leading to poor fluency. Therefore, a language model is used to evaluate the fluency of the translation and guide the training of W. We implement the language model as an LSTM (Hochreiter and Schmidhuber, 1997) with weight tying. Since this part is not the focus of our work, readers can refer to Press and Wolf (2017) for the details. With the grammatical information learned by the pre-trained language model, erroneous word alignments due to morphological variation are penalized. Therefore, W is encouraged to retrieve correct word translations with appropriate morphology.
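The fluency score reduces to the average token log-probability under the pre-trained language model. A minimal sketch, where `lm_logprob` is a hypothetical callable standing in for the pre-trained LSTM:

```python
import math

def fluency(sentence, lm_logprob):
    """Average per-token log-probability under a language model.
    lm_logprob(word, prefix) should return log q(z_i | z_<i).
    Fluent, grammatical sentences receive higher (less negative) scores."""
    total = 0.0
    for i, word in enumerate(sentence):
        total += lm_logprob(word, sentence[:i])
    return total / max(len(sentence), 1)
```

Under such a scorer, "you eat meat" would outscore "you eats meat" whenever the language model assigns the agreement error a lower conditional probability.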

Training and Testing
We encourage W to perform correct word alignment so that the denoised translated sentences are fluent and grammatically correct. Therefore, the training objective is to minimize the negative expected reward:

L(W) = - \mathbb{E}_{t \sim p(t|s)} [ R(z_t) ]    (2)

where z_t is the output of the denoising auto-encoder with t as the input, R(z_t) is the reward evaluating the fluency of z_t, and p(t|s) is the probability that W outputs t by performing word-to-word translation on s. We introduce them in detail as follows.
For the i-th word s_i in the source sentence s, the probability of W retrieving the target translation t_i can be characterized by the cosine similarity of the embeddings W x_{s_i} and y_{t_i}. Formally,

p(t_i | s) = \frac{\exp(\cos(W x_{s_i}, y_{t_i}))}{\sum_{t} \exp(\cos(W x_{s_i}, y_{t}))}    (3)

Therefore, p(t|s) can be defined as the product of the probabilities at each position:

p(t|s) = \prod_{i=1}^{m} p(t_i | s)    (4)

The reward R(z_t) evaluates the fluency of the denoised translated sentence z_t to guide the training of W, and is defined as follows:

R(z_t) = \frac{1}{|z|} \sum_{i=1}^{|z|} \log q(z_i | z_{<i})    (5)

where z_i is the i-th word in z_t = (z_1, ..., z_{|z|}), z_{<i} refers to the sequence (z_1, ..., z_{i-1}), and q(z_i | z_{<i}) is the probability that the pre-trained language model outputs the word z_i conditioned on z_{<i}. If z_t is fluent and grammatically correct, the corresponding reward R(z_t) is relatively large. Therefore, the reward R(z_t) can be used as feedback to guide the training of W. Since the word-to-word translation operation is discrete, we use the REINFORCE algorithm (Williams, 1992) to estimate the gradient of Eq. (2) as follows:

\nabla_W L(W) \approx -(R(z_t) - b) \nabla_W \log p(t|s)    (6)

where b is a baseline responsible for reducing the variance of the gradient estimate.
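A single-sample REINFORCE update for one source word can be sketched as follows. This is an illustrative sketch, not the paper's implementation: it uses a dot-product similarity instead of cosine so the gradient has a simple closed form, and all names are ours:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def reinforce_grad(W, x, Y, k, reward, baseline):
    """One-sample REINFORCE gradient for one source word.
    p(t|s) is a softmax over similarities between Wx and the rows of Y
    (target embeddings); k is the index of the sampled translation.
    Returns the gradient of -(R - b) * log p(k|s) with respect to W."""
    p = softmax(Y @ (W @ x))
    # d/dW log p_k = (y_k - E_p[y]) x^T for dot-product logits.
    grad_logp = np.outer(Y[k] - p @ Y, x)
    return -(reward - baseline) * grad_logp
```

A gradient-descent step along this estimate raises the probability of the sampled translation when its reward exceeds the baseline, and lowers it otherwise, which is exactly how fluent alignments get reinforced.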

Experiment Settings
We conduct experiments on the 300-dimensional fastText embeddings trained on Wikipedia. All words are lower-cased and only the 200K most frequent words are used. We utilize the approach of Artetxe et al. (2018a) to provide the initial linear transformation, and the lexicon constructed by Lample et al. (2018) is used for evaluation. We report accuracy with nearest-neighbor retrieval based on cosine similarity. The parameters of the DAE and language model are provided in the Appendix. We set the batch size to 64 and use SGD as the optimizer. The learning rate is initialized to 10^-5 and halved after every training epoch. The unsupervised criterion proposed in Lample et al. (2018) is adopted as both a stopping criterion and a model selection criterion.
The results show that our proposed model achieves the best performance on all test language pairs under unsupervised settings. In addition, our approach achieves comparable or even better performance than supervised systems. This illustrates that the quality of word alignment can be improved by introducing grammatical information from the pre-trained denoising language model. Our denoising evaluator encourages the model to retrieve the correct translation with appropriate morphology by assessing the fluency of sentences obtained by word-to-word translation, which alleviates the adverse effect of morphological variation.

Ablation Study
Here we perform an ablation study to understand the importance of different components. Table 3 presents the performance of different ablated versions, showing that our denoising evaluator brings stable improvements in performance. This illustrates that introducing grammatical information learned by the pre-trained denoising language model is of great help for accurate word alignment. By imposing a penalty on retrieved translations that are morphologically related but erroneous, this additional grammatical information alleviates the adverse effects of morphological variation. In addition, we find that the DAE plays an active role in improving results. By cleaning the noise in the original translated sentence, the DAE makes the reward provided by the evaluator more accurate, leading to improvements in model performance.

The Validity of Cleaning Noise
By cleaning the noise in the original word-to-word translation, the denoising auto-encoder (DAE) helps the evaluator E provide more accurate evaluation signals. Table 4 presents two examples: words are missing in the first example, and the words in the second example are not organized in a grammatical order. However, our pre-trained DAE is able to correct these errors by inserting or deleting appropriate words or adjusting the word order. This intuitively demonstrates the effectiveness of our DAE in cleaning the noise contained in the original translated sentence.
Table 5 lists several word translation examples on the FR-EN language pair. The results show that the baselines retrieve morphologically related but erroneous translations, while our approach performs the correct word alignment. By introducing grammatical information, our approach constrains the retrieved translation to have the correct morphology, leading to improved performance. Figure 3 presents a visualization of the joint semantic space of the FR-EN language pair using t-SNE (Maaten and Hinton, 2008), showing that word pairs that can be translated mutually are represented by almost the same point. This intuitively reveals that our approach captures the common linguistic regularities of the two languages.

Related Work
This paper is mainly related to the following two lines of work.
Supervised cross-lingual embedding. Inspired by the isometric observation between the monolingual word embeddings of two different languages, Mikolov et al. (2013b) propose to learn a cross-lingual word mapping by minimizing the mean squared error. Later, Dinu and Baroni (2015) investigate the hubness problem, and Faruqui and Dyer (2014) incorporate the semantics of a word in multiple languages into its embedding. Furthermore, Xing et al. (2015) propose to impose an orthogonal constraint on the linear mapping, and Artetxe et al. (2016) present a series of techniques, including length normalization and mean centering, to improve bilingual results. There also exist other representative studies. For instance, Smith et al. (2017) present the inverted softmax, which normalizes the softmax probability over source words rather than target words, and Artetxe et al. (2017) present a self-learning framework to perform iterative refinement, which is also adopted in some unsupervised settings and plays a crucial role in improving performance.
Unsupervised cross-lingual embedding. The endeavors to explore unsupervised cross-lingual embedding mainly fall into two categories. One line focuses on designing heuristics or utilizing the structural similarity of monolingual embeddings. For instance, Hoshen and Wolf (2018) present a non-adversarial method based on principal component analysis. Both Aldarmaki et al. (2018) and Artetxe et al. (2018a) take advantage of geometric properties across languages to perform word retrieval and learn the initial word mapping. Cao and Zhao (2018) formulate the problem as point set registration and adopt a corresponding registration method. However, these methods usually require plenty of random restarts or additional techniques to achieve satisfactory performance. The other line strives to learn unsupervised word mappings by direct distribution matching. For example, Lample et al. (2018) and Zhang et al. (2017a) completely eliminate the need for any supervision signal by aligning the distributions of the transferred embeddings and the target embeddings with a GAN. Furthermore, Zhang et al. (2017b) and follow-up work adopt the Earth Mover's distance and the Sinkhorn distance, respectively, as the optimized distance metrics. There are also some attempts on distant language pairs. For instance, Kementchedjhieva et al. (2018) generalize Procrustes analysis by projecting the two languages into a latent space, and Nakashole (2018) proposes to learn a neighborhood-sensitive mapping by training non-linear functions. As for the hubness problem, a latent-variable model learned with the Viterbi EM algorithm has been proposed. Recently, Alaux et al. (2018) work on the problem of aligning more than two languages simultaneously via a formulation ensuring composable mappings.

Conclusion
In this work, we present a morphology-aware alignment model for unsupervised bilingual lexicon induction. The proposed model alleviates the adverse effect of morphological variation by introducing grammatical information learned from a pre-trained denoising language model. The results show that our approach achieves better performance than several state-of-the-art unsupervised systems, and even achieves competitive performance compared to supervised methods.