Adversarial Training for Unsupervised Bilingual Lexicon Induction

Word embeddings are well known to capture linguistic regularities of the language on which they are trained. Researchers also observe that these regularities can transfer across languages. However, previous endeavors to connect separate monolingual word embeddings typically require cross-lingual signals as supervision, either in the form of parallel corpus or seed lexicon. In this work, we show that such cross-lingual connection can actually be established without any form of supervision. We achieve this end by formulating the problem as a natural adversarial game, and investigating techniques that are crucial to successful training. We carry out evaluation on the unsupervised bilingual lexicon induction task. Even though this task appears intrinsically cross-lingual, we are able to demonstrate encouraging performance without any cross-lingual clues.


Introduction
As the word is the basic unit of a language, improving its representation has a significant impact on various natural language processing tasks. Continuous word representations, commonly known as word embeddings, have formed the basis for numerous neural network models since their advent. Their popularity results from the performance boost they bring, which should in turn be attributed to the linguistic regularities they capture (Mikolov et al., 2013b).
Soon after the success on monolingual tasks, the potential of word embeddings for cross-lingual natural language processing attracted much attention. In their pioneering work, Mikolov et al. (2013a) observe that word embeddings trained separately on monolingual corpora exhibit isomorphic structure across languages, as illustrated in Figure 1. This interesting finding is in line with research on human cognition (Youn et al., 2016). It also means a linear transformation may be established to connect word embedding spaces, allowing word feature transfer. This has far-reaching implications for low-resource scenarios (Daumé III and Jagarlamudi, 2011; Irvine and Callison-Burch, 2013), because word embeddings only require plain text to train, which is the most abundant form of linguistic resource.
Figure 1: ... (Mikolov et al., 2013a). Although trained independently, the two sets of embeddings exhibit approximate isomorphism.
However, connecting separate word embedding spaces typically requires supervision from cross-lingual signals. For example, Mikolov et al. (2013a) use five thousand seed word translation pairs to train the linear transformation. In a recent study, Vulić and Korhonen (2016) show that at least hundreds of seed word translation pairs are needed for the model to generalize. This is unfortunate for low-resource languages and domains, because data encoding cross-lingual equivalence is often expensive to obtain.
Figure 2: (a) The unidirectional transformation model. The generator G tries to transform source word embeddings (squares) to make them seem like target ones (dots), while the discriminator D tries to classify whether the input embeddings are generated by G or real samples from the target embedding distribution. (b) The bidirectional transformation model. Two generators with tied weights perform transformation between languages. Two separate discriminators are responsible for each language. (c) The adversarial autoencoder model. The generator aims to make the transformed embeddings not only indistinguishable by the discriminator, but also recoverable as measured by the reconstruction loss $L_R$.
In this work, we aim to entirely eliminate the need for cross-lingual supervision. Our approach draws inspiration from recent advances in generative adversarial networks (Goodfellow et al., 2014). We first formulate our task in a fashion that naturally admits an adversarial game. Then we propose three models that implement the game, and explore techniques to ensure the success of training. Finally, our evaluation on the bilingual lexicon induction task reveals encouraging performance, even though this task appears formidable without any cross-lingual supervision.

Models
In order to induce a bilingual lexicon, we start from two sets of monolingual word embeddings with dimensionality $d$. They are trained separately on two languages. Our goal is to learn a mapping function $f : \mathbb{R}^d \to \mathbb{R}^d$ so that for a source word embedding $x$, $f(x)$ lies close to the embedding of its target language translation $y$. The learned mapping function can then be used to translate each source word $x$ by finding the nearest target embedding to $f(x)$.
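The translation step described above can be illustrated with a minimal sketch. It assumes a linear map and cosine similarity as the notion of "nearest"; the helper names (`cosine`, `apply_map`, `translate`) and the 2-dimensional toy vectors are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def apply_map(G, x):
    """Apply a linear map G (given as a list of rows) to the vector x."""
    return [sum(g * xi for g, xi in zip(row, x)) for row in G]

def translate(x, G, target_vocab):
    """Return the target word whose embedding is nearest to G x."""
    fx = apply_map(G, x)
    return max(target_vocab, key=lambda word: cosine(fx, target_vocab[word]))

# Hypothetical toy data: the target space is the source space rotated by 90 degrees.
G = [[0.0, -1.0],
     [1.0, 0.0]]
source = {"chengshi": [1.0, 0.0], "wenxue": [0.0, 1.0]}
target = {"city": [0.0, 1.0], "literature": [-1.0, 0.0]}

translations = {word: translate(vec, G, target) for word, vec in source.items()}
print(translations)
```

In this toy setting the map is known in advance; the rest of the paper is about learning it without any translation pairs.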
We consider $x$ to be drawn from a distribution $p_x$, and similarly $y \sim p_y$. The key intuition here is to find the mapping function that makes $f(x)$ seem to follow the distribution $p_y$, for all $x \sim p_x$. From this point of view, we design an adversarial game as illustrated in Figure 2(a): The generator G implements the mapping function $f$, trying to make $f(x)$ passable as target word embeddings, while the discriminator D is a binary classifier striving to distinguish between fake target word embeddings $f(x) \sim p_{f(x)}$ and real ones $y \sim p_y$. This intuition can be formalized as the minimax game
$$\min_G \max_D \; \mathbb{E}_{y \sim p_y}\left[\log D(y)\right] + \mathbb{E}_{x \sim p_x}\left[\log\left(1 - D(G(x))\right)\right]. \tag{1}$$
Theoretical analysis reveals that adversarial training tries to minimize the Jensen-Shannon divergence $\mathrm{JSD}(p_y \,\|\, p_{f(x)})$ (Goodfellow et al., 2014). Importantly, the minimization happens at the distribution level, without requiring word translation pairs to supervise training.

Model 1: Unidirectional Transformation
The first model directly implements the adversarial game, as shown in Figure 2(a). As hinted by the isomorphism shown in Figure 1, previous works typically choose the mapping function $f$ to be a linear map (Mikolov et al., 2013a). We therefore parametrize the generator as a transformation matrix $G \in \mathbb{R}^{d \times d}$. We also tried non-linear maps parametrized by neural networks, without success. In fact, if the generator is given sufficient capacity, it can in principle learn a constant mapping function to a target word embedding, which makes it impossible for the discriminator to tell the generated embedding from a real one, much like the "mode collapse" problem widely observed in the image domain (Radford et al., 2015; Salimans et al., 2016). We therefore believe it is crucial to give the generator suitable capacity.
As a generic binary classifier, a standard feedforward neural network with one hidden layer is used to parametrize the discriminator D, and its loss function is the usual cross-entropy loss, as in the value function (1):
$$L_D = -\log D(y) - \log\left(1 - D(Gx)\right). \tag{2}$$
For simplicity, here we write the loss with a minibatch size of 1; in our experiments we use 128.
The generator loss is given by
$$L_G = -\log D(Gx).$$
In line with previous work (Goodfellow et al., 2014), we find this loss easier to minimize than the original form $\log(1 - D(Gx))$.
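A small numerical sketch may help show why the $-\log D(Gx)$ form is easier to minimize than $\log(1 - D(Gx))$. It assumes the discriminator output is a sigmoid of a scalar logit (an assumption; the source only says D is a feedforward classifier), and compares the gradient of each generator loss with respect to that logit when D confidently rejects a generated sample. The function names and the logit value are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def discriminator_loss(d_real, d_fake):
    """Cross-entropy loss for one real sample y and one generated sample Gx."""
    return -math.log(d_real) - math.log(1.0 - d_fake)

def generator_loss(d_fake):
    """Generator loss -log D(Gx)."""
    return -math.log(d_fake)

def grad_nonsaturating(logit):
    """d/d(logit) of -log(sigmoid(logit)) = -(1 - sigmoid(logit))."""
    return -(1.0 - sigmoid(logit))

def grad_saturating(logit):
    """d/d(logit) of log(1 - sigmoid(logit)) = -sigmoid(logit)."""
    return -sigmoid(logit)

# Hypothetical early-training situation: D assigns probability ~0.018 to the fake sample.
logit_fake = -4.0
d_fake = sigmoid(logit_fake)
g_nonsat = grad_nonsaturating(logit_fake)
g_sat = grad_saturating(logit_fake)
print(generator_loss(d_fake), g_nonsat, g_sat)
```

When the discriminator is winning, the gradient of the original form almost vanishes while the alternative still provides a strong signal, which is the usual explanation for preferring it.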

Orthogonal Constraint
The above model is very difficult to train. One possible reason is that the parameter search space $\mathbb{R}^{d \times d}$ for the generator may still be too large. Previous works have attempted to constrain the transformation matrix to be orthogonal (Xing et al., 2015; Zhang et al., 2016b; Artetxe et al., 2016). An orthogonal transformation is also theoretically appealing for its self-consistency (Smith et al., 2017) and numerical stability. However, using constrained optimization for our purpose is cumbersome, so we opt for an orthogonal parametrization (Mhammedi et al., 2016) of the generator instead.
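To make the idea of an orthogonal parametrization concrete, the sketch below builds a matrix as a product of Householder reflections, each determined by an unconstrained vector, so that gradient steps on those vectors never leave the set of orthogonal matrices. This illustrates the general principle only; it is not claimed to reproduce the exact scheme of Mhammedi et al. (2016) or the authors' implementation, and the function names and parameter vectors are hypothetical.

```python
import math

def identity(n):
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def householder(v):
    """Reflection H = I - 2 v v^T / (v^T v); orthogonal for any nonzero v."""
    n = len(v)
    vv = sum(a * a for a in v)
    eye = identity(n)
    return [[eye[i][j] - 2.0 * v[i] * v[j] / vv for j in range(n)] for i in range(n)]

def orthogonal_from_vectors(vectors):
    """Multiply the reflections defined by each unconstrained parameter vector."""
    G = identity(len(vectors[0]))
    for v in vectors:
        G = matmul(G, householder(v))
    return G

def orthogonality_error(G):
    """Frobenius norm of G^T G - I."""
    GtG = matmul(transpose(G), G)
    eye = identity(len(G))
    n = len(G)
    return math.sqrt(sum((GtG[i][j] - eye[i][j]) ** 2 for i in range(n) for j in range(n)))

# Hypothetical unconstrained parameters for d = 3.
params = [[1.0, 2.0, -0.5], [0.3, -1.0, 4.0], [2.0, 0.1, 0.7]]
G = orthogonal_from_vectors(params)
error = orthogonality_error(G)
print(error)
```

The extra matrix products per update give some intuition for why such a parametrization is slower than updating an unconstrained matrix directly.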

Model 2: Bidirectional Transformation
The orthogonal parametrization is still quite slow. We can relax the orthogonal constraint and only require the transformation to be self-consistent (Smith et al., 2017): If $G$ transforms the source word embedding space into the target language space, its transpose $G^\top$ should transform the target language space back to the source. This can be implemented by two unidirectional models with a tied generator, as illustrated in Figure 2(b). Two separate discriminators are used, with the same cross-entropy loss as Equation (2) used by Model 1. The generator loss is given by
$$L_G = -\log D_1(Gx) - \log D_2(G^\top y),$$
where the subscript of $D$ distinguishes the two discriminators: $D_1$ judges embeddings in the target space and $D_2$ those in the source space.

Model 3: Adversarial Autoencoder
As another way to relax the orthogonal constraint, we introduce the adversarial autoencoder (Makhzani et al., 2015), depicted in Figure 2(c). After the generator $G$ transforms a source word embedding $x$ into a target language representation $Gx$, we should be able to reconstruct the source word embedding $x$ by mapping back with $G^\top$. We therefore introduce the reconstruction loss measured by cosine similarity:
$$L_R = -\cos\left(x, G^\top G x\right).$$
Note that this loss will be minimized if $G$ is orthogonal. With this term included, the loss function for the generator becomes
$$L_G = -\log D(Gx) - \lambda \cos\left(x, G^\top G x\right),$$
where $\lambda$ is a hyperparameter that balances the two terms. $\lambda = 0$ recovers the unidirectional transformation model, while larger $\lambda$ should enforce a stricter orthogonal constraint.
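The reconstruction term can be checked on a toy example. The sketch below assumes the reconstruction loss is the negative cosine similarity between $x$ and $G^\top G x$, and that the generator loss adds it, weighted by $\lambda$, to $-\log D(Gx)$; the matrices, the vector, and the discriminator output `d_fake` are hypothetical values chosen for illustration.

```python
import math

def apply_map(G, x):
    """Apply a linear map G (list of rows) to the vector x."""
    return [sum(g * xi for g, xi in zip(row, x)) for row in G]

def transpose(G):
    return [list(col) for col in zip(*G)]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def reconstruction_loss(G, x):
    """Negative cosine similarity between x and its reconstruction G^T G x."""
    reconstructed = apply_map(transpose(G), apply_map(G, x))
    return -cosine(x, reconstructed)

def generator_loss(G, x, d_fake, lam):
    """Adversarial term -log D(Gx) plus lam times the reconstruction loss."""
    return -math.log(d_fake) + lam * reconstruction_loss(G, x)

x = [1.0, 0.0]
rotation = [[0.0, -1.0], [1.0, 0.0]]   # orthogonal
shear = [[2.0, 1.0], [0.0, 1.0]]       # not orthogonal

loss_r_rotation = reconstruction_loss(rotation, x)
loss_r_shear = reconstruction_loss(shear, x)
print(loss_r_rotation, loss_r_shear)
```

The orthogonal map attains the minimum value of -1, while the non-orthogonal one is penalized; setting `lam` to 0 leaves only the adversarial term.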

Training Techniques
Generative adversarial networks are notoriously difficult to train, and investigation into stabler training remains a research frontier (Radford et al., 2015;Salimans et al., 2016;Arjovsky and Bottou, 2017). We contribute in this aspect by reporting techniques that are crucial to successful training for our task.

Regularizing the Discriminator
Recently, it has been suggested to inject noise into the input to the discriminator (Sønderby et al., 2016; Arjovsky and Bottou, 2017). The noise is typically additive Gaussian. Here we explore more possibilities, with the following types of noise, injected into the input and hidden layer:
• Multiplicative Bernoulli noise (dropout) (Srivastava et al., 2014): $\epsilon \sim \mathrm{Bernoulli}(p)$.
• Additive Gaussian noise: $\epsilon \sim \mathcal{N}(0, \sigma^2)$.
• Multiplicative Gaussian noise: $\epsilon \sim \mathcal{N}(1, \sigma^2)$.
As noise injection is a form of regularization (Bishop, 1995; Van der Maaten et al., 2013; Wager et al., 2013), we also try $\ell_2$ regularization, and directly restricting the hidden layer size to combat overfitting. Our findings include:
• Without regularization, it is not impossible for the optimizer to find a satisfactory parameter configuration, but the hidden layer size has to be tuned carefully. This indicates that a balance of capacity between the generator and discriminator is needed.
• All forms of regularization help training by allowing us to liberally set the hidden layer size to a relatively large value.
• Among the types of regularization, multiplicative Gaussian noise injected into the input is the most effective, and additive Gaussian noise is similar. On top of input noise, hidden layer noise helps slightly.
In the following experiments, we inject multiplicative Gaussian into the input and hidden layer of the discriminator with σ = 0.5.
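As a rough illustration of this setting, the sketch below runs a forward pass of a one-hidden-layer discriminator with multiplicative Gaussian noise applied to both the input and the hidden layer. The noise is assumed to be drawn elementwise with mean 1 and standard deviation σ; the ReLU hidden activation, the sigmoid output, and all weights are assumptions made for the example, since the source does not specify them.

```python
import math
import random

def multiplicative_gaussian(vec, sigma, rng):
    """Scale each component by an independent draw with mean 1 and std sigma."""
    return [v * rng.gauss(1.0, sigma) for v in vec]

def discriminator_forward(x, W1, b1, w2, b2, sigma, rng):
    """Return D(x) in (0, 1) with noise injected into input and hidden layer."""
    x_noisy = multiplicative_gaussian(x, sigma, rng)
    hidden = [max(0.0, sum(w * xi for w, xi in zip(row, x_noisy)) + b)
              for row, b in zip(W1, b1)]          # ReLU is an assumption
    hidden = multiplicative_gaussian(hidden, sigma, rng)
    logit = sum(w * h for w, h in zip(w2, hidden)) + b2
    return 1.0 / (1.0 + math.exp(-logit))

# Hypothetical tiny discriminator: 2 inputs, 3 hidden units.
W1 = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
b1 = [0.0, 0.1, -0.1]
w2 = [0.7, -0.5, 0.2]
b2 = 0.05
x = [0.6, 0.8]

rng = random.Random(0)
noisy_outputs = [discriminator_forward(x, W1, b1, w2, b2, 0.5, rng) for _ in range(5)]
clean_output = discriminator_forward(x, W1, b1, w2, b2, 0.0, rng)
print(noisy_outputs, clean_output)
```

With σ = 0 the noise factors are exactly 1 and the forward pass is deterministic, which is the behaviour one would typically want when the discriminator is not being trained.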

Model Selection
From a typical training trajectory shown in Figure 3, we observe that training is not convergent. In fact, simply using the model saved at the end of training gives poor performance. Therefore we need a mechanism to select a good model. We observe there are sharp drops of the generator loss $L_G$, and find they correspond to good models, as the discriminator gets confused at these points with its classification accuracy (D accuracy) dropping simultaneously. Interestingly, the reconstruction loss $L_R$ and the value of $\|G^\top G - I\|_F$ exhibit synchronous drops, even if we use the unidirectional transformation model ($\lambda = 0$). This means a good transformation matrix is indeed nearly orthogonal, and justifies our encouragement of $G$ towards orthogonality. With this finding, we can train for sufficient steps and save the model with the lowest generator loss.
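The selection rule stated above amounts to keeping the checkpoint with the lowest generator loss seen during training. A minimal sketch follows; the record layout (`step`, `loss_g`, `d_accuracy`) and all numbers are hypothetical and are not taken from Figure 3.

```python
def select_checkpoint(history):
    """Return the training record with the lowest generator loss."""
    return min(history, key=lambda record: record["loss_g"])

# Hypothetical training log: loss values oscillate, with one sharp drop.
history = [
    {"step": 100000, "loss_g": 0.74, "d_accuracy": 0.91},
    {"step": 200000, "loss_g": 0.71, "d_accuracy": 0.89},
    {"step": 300000, "loss_g": 0.42, "d_accuracy": 0.58},  # sharp drop, D confused
    {"step": 400000, "loss_g": 0.69, "d_accuracy": 0.88},
    {"step": 500000, "loss_g": 0.73, "d_accuracy": 0.90},
]

best = select_checkpoint(history)
print(best["step"])
```

In a real run the generator matrix would be saved alongside each record so that the selected checkpoint can be restored; the final step is deliberately not the one chosen here.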
As we aim to find the cross-lingual transformation without supervision, it would be ideal to determine hyperparameters without a validation set. The sharp drops can also be indicative in this case. If a hyperparameter configuration is poor, those values will oscillate without a clear drop. Although this criterion is somewhat subjective, we find it to be quite feasible in practice.

Other Training Details
Our approach takes monolingual word embeddings as input. We train the CBOW model (Mikolov et al., 2013b) with default hyperparameters in word2vec. The embedding dimension $d$ is 50 unless stated otherwise. Before feeding them into our system, we normalize the word embeddings to unit length. When sampling words for adversarial training, we penalize frequent words in a way similar to (Mikolov et al., 2013b). $G$ is initialized with a random orthogonal matrix. The hidden layer size of D is 500. Adversarial training involves alternating gradient updates of the generator and discriminator, which we implement with a simpler variant algorithm described in (Nowozin et al., 2016). Adam (Kingma and Ba, 2014) is used as the optimizer, with default hyperparameters. For the adversarial autoencoder model, $\lambda = 1$ generally works well, but $\lambda = 10$ appears stabler for the low-resource Turkish-English setting.
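Two of the preprocessing steps above, unit-length normalization and random orthogonal initialization, can be sketched as follows. The source does not say how the random orthogonal matrix is drawn; Gram-Schmidt orthonormalization of Gaussian random vectors is one common choice and is an assumption here, as are the function names and the small dimension used for the demonstration.

```python
import math
import random

def normalize(vec):
    """Scale a vector to unit Euclidean length."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]

def random_orthogonal(d, rng):
    """Orthonormalize Gaussian random vectors (Gram-Schmidt) into a d x d matrix."""
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:                      # remove components along earlier rows
            proj = sum(a * b for a, b in zip(v, u))
            v = [a - proj * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-8:                     # skip numerically degenerate draws
            rows.append([a / norm for a in v])
    return rows

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

rng = random.Random(0)
embedding = normalize([3.0, 4.0, 0.0, 0.0, 12.0])
G_init = random_orthogonal(5, rng)

# Largest deviation of G G^T from the identity.
max_deviation = max(
    abs(dot(G_init[i], G_init[j]) - (1.0 if i == j else 0.0))
    for i in range(5) for j in range(5)
)
print(embedding, max_deviation)
```

Because the rows are orthonormal and the matrix is square, its transpose is its inverse, which matches the self-consistency property that the later models encourage.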

Experiments
We evaluate the quality of the cross-lingual embedding transformation on the bilingual lexicon induction task. After a source word embedding is transformed into the target space, its M nearest target embeddings (in terms of cosine similarity) are retrieved, and compared against the entry in a ground truth bilingual lexicon. Performance is measured by top-M accuracy (Vulić and Moens, 2013): If any of the M translations is found in the ground truth bilingual lexicon, the source word is considered to be handled correctly, and the accuracy is calculated as the percentage of correctly translated source words. We generally report the harshest top-1 accuracy, except when comparing with published figures in Section 4.4.
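The top-M accuracy defined above can be written down directly. In the sketch below, the ranked candidate lists are the ones shown in Table 3, but the gold entries are assumed purely for illustration (the bold marking of the ground truth is not available here), and the function name `top_m_accuracy` is hypothetical.

```python
def top_m_accuracy(candidates, lexicon, m):
    """Percentage of source words with at least one gold translation in the top m.

    candidates: source word -> list of target words, ranked by cosine similarity
    lexicon:    source word -> set of ground truth translations
    """
    evaluated = [word for word in candidates if word in lexicon]
    correct = sum(
        1 for word in evaluated
        if any(t in lexicon[word] for t in candidates[word][:m])
    )
    return 100.0 * correct / len(evaluated)

candidates = {
    "chengshi": ["city", "town", "suburb", "area", "proximity"],
    "xiaoxingxing": ["asteroid", "astronomer", "comet", "constellation", "orbit"],
    "wenxue": ["poetry", "literature", "prose", "poet", "writing"],
}
# Assumed gold entries, for illustration only.
lexicon = {
    "chengshi": {"city"},
    "xiaoxingxing": {"asteroid"},
    "wenxue": {"literature"},
}

top1 = top_m_accuracy(candidates, lexicon, 1)
top5 = top_m_accuracy(candidates, lexicon, 5)
print(top1, top5)
```

The gap between the two numbers shows why top-1 is the harsher metric: a correct translation ranked second counts under top-5 but not under top-1.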

Baselines
Almost all approaches to bilingual lexicon induction from non-parallel data depend on seed lexica. An exception is decipherment (Dou and Knight, 2012; Dou et al., 2015), and we use it as our baseline. The decipherment approach is not based on distributional semantics, but rather views the source language as a cipher for the target language, and attempts to learn a statistical model to decipher the source language. We run the MonoGiza system as recommended by the toolkit. It can also utilize monolingual embeddings (Dou et al., 2015); in this case, we use the same embeddings as the input to our approach.
Sharing the underlying spirit with our approach, related methods also build upon monolingual word embeddings and find a transformation to link different languages. Although they need seed word translation pairs to train and are thus not directly comparable, we report their performance with 50 and 100 seeds for reference. These methods are:
• Translation matrix (TM) (Mikolov et al., 2013a): the pioneer of this type of method mentioned in the introduction, using a linear transformation. We use a publicly available implementation.
• Isometric alignment (IA) (Zhang et al., 2016b): an extension of TM that augments its learning objective with the isometric (orthogonal) constraint. Although Zhang et al. (2016b) had subsequent steps for their POS tagging task, it can be used for bilingual lexicon induction as well.
We ensure the same input embeddings for these methods and ours. The seed word translation pairs are obtained as follows. First, we ask Google Translate to translate the source language vocabulary. Then the target translations are queried again and translated back to the source language, and those that do not match the original source words are discarded. This helps to ensure the translation quality. Finally, the translations are discarded if they fall out of our target language vocabulary.
Table 2: Chinese-English top-1 accuracies of the MonoGiza baseline and our models, along with the translation matrix (TM) and isometric alignment (IA) methods that utilize 50 and 100 seeds.
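The round-trip filtering used to obtain seed pairs can be sketched as below. The two dictionaries stand in for queries to a translation service and are entirely hypothetical, as are the function name and the toy vocabulary; no real translation API is called.

```python
def build_seed_lexicon(source_vocab, forward, backward, target_vocab):
    """Keep (source, target) pairs that survive back-translation and vocabulary checks.

    forward:  source word -> target translation (stand-in for a translation service)
    backward: target word -> source translation (stand-in for the reverse query)
    """
    seeds = {}
    for word in source_vocab:
        translation = forward.get(word)
        if translation is None:
            continue                           # no translation returned
        if backward.get(translation) != word:
            continue                           # back-translation does not match
        if translation not in target_vocab:
            continue                           # outside the target vocabulary
        seeds[word] = translation
    return seeds

# Hypothetical toy data.
source_vocab = ["chengshi", "wenxue", "quyu", "xingxing"]
forward = {"chengshi": "city", "wenxue": "literature", "quyu": "area", "xingxing": "star"}
backward = {"city": "chengshi", "literature": "wenxue", "area": "diqu", "star": "xingxing"}
target_vocab = {"city", "literature", "area"}

seeds = build_seed_lexicon(source_vocab, forward, backward, target_vocab)
print(seeds)
```

Here "quyu" is dropped because its back-translation differs, and "xingxing" is dropped because its translation is not in the target vocabulary.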

Experiments on Chinese-English Data
For this set of experiments, the data for training word embeddings comes from Wikipedia comparable corpora. Following (Vulić and Moens, 2013), we retain only nouns with at least 1,000 occurrences. For the Chinese side, we first use OpenCC to normalize characters to be simplified, and then perform Chinese word segmentation and POS tagging with THULAC. The preprocessing of the English side involves tokenization, POS tagging, lemmatization, and lowercasing, which we carry out with the NLTK toolkit. The statistics of the final training data are given in Table 1, along with the other experimental settings. As the ground truth bilingual lexicon for evaluation, we use Chinese-English Translation Lexicon Version 3.0 (LDC2002L27).
Table 2 lists the performance of the MonoGiza baseline and our four variants of adversarial training. MonoGiza obtains low performance, likely due to the harsh evaluation protocol (cf. Section 4.4). Providing it with syntactic information can help (Dou and Knight, 2013), but in a low-resource scenario with zero cross-lingual information, parsers are likely to be inaccurate or even unavailable.

城市        小行星          文学
chengshi    xiaoxingxing    wenxue
city        asteroid        poetry
town        astronomer      literature
suburb      comet           prose
area        constellation   poet
proximity   orbit           writing
Table 3: Top-5 English translation candidates proposed by our approach for some Chinese words. The ground truth is marked in bold.
The unidirectional transformation model attains reasonable accuracy if trained successfully, but it is rather sensitive to hyperparameters and initialization. This training difficulty motivates our orthogonal constraint. But imposing a strict orthogonal constraint hurts performance. It is also about 20 times slower even though we utilize orthogonal parametrization instead of constrained optimization. The last two models represent different relaxations of the orthogonal constraint, and the adversarial autoencoder model achieves the best performance. We therefore use it in our following experiments. Table 3 lists some word translation examples given by the adversarial autoencoder model.

Comparison With Seed-Based Methods
In this section, we investigate how many seeds TM and IA require to attain the performance level of our approach. There are a total of 1,280 seed translation pairs for Chinese-English, which are removed from the test set during the evaluation for this experiment. We use the most frequent S pairs for TM and IA. Figure 4 shows the accuracies with respect to S. When the seeds are few, the seed-based methods exhibit clear performance degradation. In this case, we also observe the importance of the orthogonal constraint from the superiority of IA to TM, which supports our introduction of this constraint as we attempt zero supervision. Finally, in line with the finding in (Vulić and Korhonen, 2016), hundreds of seeds are needed for TM to generalize. Only then do seed-based methods catch up with our approach, and the performance difference is marginal even when more seeds are provided.
Table 4: Top-1 accuracies (%) of the MonoGiza baseline and our approach on Spanish-English, Italian-English, Japanese-Chinese, and Turkish-English. The results for translation matrix (TM) and isometric alignment (IA) using 50 and 100 seeds are also listed.

Effect of Embedding Dimension
As our approach takes monolingual word embeddings as input, it is conceivable that their quality significantly affects how well the two spaces can be connected by a linear map. We look into this aspect by varying the embedding dimension d in Figure 5. As the dimension increases, the accuracy improves and gradually levels off. This indicates that too low a dimension hampers the encoding of linguistic information drawn from the corpus, and it is advisable to use a sufficiently large dimension.

Experiments on Other Language Pairs
We also induce bilingual lexica from Wikipedia comparable corpora for the following language pairs: Spanish-English, Italian-English, Japanese-Chinese, and Turkish-English. For Spanish-English and Italian-English, we choose to use TreeTagger for preprocessing, as in (Vulić and Moens, 2013). For the Japanese corpus, we use MeCab for word segmentation and POS tagging. For Turkish, we utilize the preprocessing tools (tokenization and POS tagging) provided in LORELEI Language Packs (Strassel and Tracey, 2016), and its English side is preprocessed by NLTK. Unlike the other language pairs, the frequency cutoff threshold for Turkish-English is 100, as the amount of data is relatively small.
The ground truth bilingual lexica for Spanish-English and Italian-English are obtained from Open Multilingual WordNet through NLTK. For Japanese-Chinese, we use an in-house lexicon. For Turkish-English, we build a set of ground truth translation pairs in the same way as we obtain seed word translation pairs from Google Translate, described above.
As shown in Table 4, the MonoGiza baseline still does not work well on these language pairs, while our approach achieves much better performance. The accuracies are particularly high for Spanish-English and Italian-English, likely because they are closely related languages, and their embedding spaces may exhibit stronger isomorphism. The performance on Japanese-Chinese is lower, on a comparable level with Chinese-English (cf. Table 2), and these languages are relatively distantly related. Turkish-English represents a low-resource scenario, and therefore the lexical semantic structure may be insufficiently captured by the embeddings. The agglutinative nature of Turkish can also add to the challenge.
Table 5: Top-1 accuracies (%) of our approach to inducing bilingual lexica for Chinese-English from Wikipedia and Gigaword. Also listed are results for translation matrix (TM) and isometric alignment (IA) using 50 and 100 seeds.

Large-Scale Settings
We experiment with large-scale Chinese-English data from two sources: the whole Wikipedia dump and Gigaword (LDC2011T13 and LDC2011T07). We also simplify preprocessing by removing the noun restriction and the lemmatization step (cf. preprocessing decisions for the above experiments).
Although large-scale data may benefit the training of embeddings, it poses a greater challenge to bilingual lexicon induction. First, the degree of non-parallelism tends to increase. Second, with cruder preprocessing, the noise in the corpora may take its toll. Finally, but probably most importantly, the vocabularies expand dramatically compared to previous settings (see Table 1). This means a word translation has to be retrieved from a much larger pool of candidates.
For these reasons, we consider the performance of our approach presented in Table 5 to be encouraging. The imbalanced sizes of the Chinese and English Wikipedia do not seem to cause a problem for the structural isomorphism needed by our method. MonoGiza does not scale to such large vocabularies, as it already takes days to train in our Italian-English setting. In contrast, our approach is immune to scalability issues by working with embeddings provided by word2vec, which is well known for its fast speed. With the network configuration used in our experiments, the adversarial autoencoder model takes about two hours to train for 500k minibatches on a single CPU.
method                     5k       10k
MonoGiza w/o embeddings    13.74    7.80
MonoGiza w/ embeddings     17.98    10.56
(Cao et al., 2016)         23.54    17.82
Ours                       68.59    51.86
Table 6: Top-5 accuracies (%) of 5k and 10k most frequent words in the French-English setting. The figures for the baselines are taken from (Cao et al., 2016).

Comparison With (Cao et al., 2016)
In order to compare with the recent method by Cao et al. (2016), which also uses zero cross-lingual signal to connect monolingual embeddings, we replicate their French-English experiment to test our approach. This experimental setting has important differences from the above ones, mostly in the evaluation protocol. Apart from using top-5 accuracy as the evaluation metric, the ground truth bilingual lexicon is obtained by performing word alignment on a parallel corpus. We find this automatically constructed bilingual lexicon to be noisier than the ones we use for the other language pairs; it often lists tens of translations for a source word. This lenient evaluation protocol should explain MonoGiza's higher numbers in Table 6 than what we report in the other experiments. In this setting, our approach is able to considerably outperform both MonoGiza and the method by Cao et al. (2016).
As one of our baselines, the method by Cao et al. (2016) also does not require cross-lingual signals to train bilingual word embeddings. It modifies the objective for training embeddings, whereas our approach uses monolingual embeddings trained beforehand and held fixed. More importantly, its learning mechanism is substantially different from ours. It encourages word embeddings from different languages to lie in the shared semantic space by matching the mean and variance of the hidden states, assumed to follow a Gaussian distribution, which is hard to justify. Our approach does not make any assumptions and directly matches the mapped source embedding distribution with the target distribution by adversarial training.
A recent work also attempts adversarial training for cross-lingual embedding transformation (Barone, 2016). The model architectures are similar to ours, but the reported results are not positive. We tried the publicly available code on our data, but the results were not positive, either. Therefore, we attribute the outcome to the difference in the loss and training techniques, but not the model architectures or data.

Adversarial Training
Generative adversarial networks were originally proposed for generating realistic images as an implicit generative model, but the adversarial training technique for matching distributions generalizes to many more tasks, including natural language processing. For example, Ganin et al. (2016) address domain adaptation by adversarially training features to be domain invariant, and test on sentiment classification. Subsequent work extends this idea to cross-lingual sentiment classification. Our research deals with unsupervised bilingual lexicon induction based on word embeddings, and therefore works with word embedding distributions, which are more interpretable than the neural feature space of classifiers in the above works.
In the field of neural machine translation, a recent work (He et al., 2016) proposes dual learning, which also involves a two-agent game and therefore bears conceptual resemblance to the adversarial training idea. The framework is carried out with reinforcement learning, and thus differs greatly in implementation from adversarial training.

Conclusion
In this work, we demonstrate the feasibility of connecting word embeddings of different languages without any cross-lingual signal. This is achieved by matching the distributions of the transformed source language embeddings and target ones via adversarial training. The success of our approach signifies the existence of universal lexical semantic structure across languages. Our work also opens up opportunities for the processing of extremely low-resource languages and domains that lack parallel data completely.
Our work is likely to benefit from advances in techniques that further stabilize adversarial training. Future work also includes investigating other divergences that adversarial training can minimize (Nowozin et al., 2016), and broader mathematical tools that match distributions (Mohamed and Lakshminarayanan, 2016).