Weakly-Supervised Concept-based Adversarial Learning for Cross-lingual Word Embeddings

Distributed representations of words which map each word to a continuous vector have proven useful in capturing important linguistic information not only in a single language but also across different languages. Current unsupervised adversarial approaches show that it is possible to build a mapping matrix that aligns two sets of monolingual word embeddings without high quality parallel data, such as a dictionary or a sentence-aligned corpus. However, without an additional step of refinement, the preliminary mapping learnt by these methods is unsatisfactory, leading to poor performance for typologically distant languages. In this paper, we propose a weakly-supervised adversarial training method to overcome this limitation, based on the intuition that mapping across languages is better done at the concept level than at the word level. We propose a concept-based adversarial training method which improves the performance of previous unsupervised adversarial methods for most languages, and especially for typologically distant language pairs.


Introduction
Distributed representations of words which map each word to a continuous vector have proven useful in capturing important linguistic information. Vectors of words that are semantically or syntactically similar have been shown to be close to each other in the same space (Mikolov et al., 2013a,c; Pennington et al., 2014), making them widely useful in many natural language processing tasks such as machine translation and parsing (Bansal et al., 2014; Mi et al., 2016), both in a single language and across different languages. Mikolov et al. (2013b) first observed that the geometric positions of similar words in different languages were related by a linear relation. Zou et al. (2013) showed that a cross-lingually shared word embedding space is more useful than a monolingual space in an end-to-end machine translation task. However, traditional methods for mapping two monolingual word embeddings require high quality aligned sentences or dictionaries (Faruqui and Dyer, 2014; Ammar et al., 2016b).
Reducing the need for parallel data, then, has become the main issue for cross-lingual word embedding mapping. Some recent work aiming at reducing resources has shown competitive cross-lingual mappings across similar languages, using a pseudo-dictionary, such as identical character strings between two languages (Smith et al., 2017), or a simple list of numerals (Artetxe et al., 2017).
In a more general method, Zhang et al. (2017) have shown that learning mappings across languages via adversarial training (Goodfellow et al., 2014) can avoid using bilingual evidence. This generality comes at the expense of performance. To overcome this limitation, Lample et al. (2018) refine the preliminary mapping matrix trained by generative adversarial networks (GANs) and obtain a model that is again comparable to supervised models for several language pairs. Despite these big improvements, the performance of these refined GAN models depends largely on the quality of the preliminary mappings. It is probably for this reason that these models still do not show satisfactory performance for typologically distant languages.
In this paper, we explore new augmented models of the adversarial framework that are based on the intuition that the right level of lexical alignment between languages, especially distant languages, is not the word but the concept. We also observe that some cross-lingual resources, such as document-aligned data, are not as expensive as high-quality dictionaries. For example, Wikipedia provides thousands of aligned articles in many languages. Combining these two observations, we develop a concept-based, weakly supervised adversarial method for learning cross-lingual word embedding mappings that relies on these readily available cross-lingual resources.
The main novelty of our method lies in the use of Wikipedia concept-aligned article pairs in the discriminator, which encourages the generator to align words from different languages which are used to describe the same concept. We leverage the power of the adversarial framework and its ability to align distributions by defining concepts as the distribution of words in their respective Wikipedia articles.
The results of our experiments on bilingual lexicon induction show that the preliminary mappings (without post-refinement) trained by our proposed multi-discriminator model are much better than those of unsupervised adversarial training methods. When we include the post-refinement of Lample et al. (2018), in most cases, our models are comparable to or even better than our dictionary-based supervised baselines, and considerably improve the performance of the method of Lample et al. (2018), especially for distant language pairs, such as Chinese-English and Finnish-English. 1 Compared to the previous state-of-the-art, our method still shows competitive performance.

Models
The basic idea of mapping two sets of pretrained monolingual word embeddings together was first proposed by Mikolov et al. (2013b). They use a small dictionary of $n$ pairs of words $\{(w^s_1, w^t_1), ..., (w^s_n, w^t_n)\}$ obtained from Google Translate to learn a transformation matrix $W$ that projects the embeddings $v^s_i$ of the source language words $w^s_i$ onto the embeddings $v^t_i$ of their translation words $w^t_i$ in the target language, by optimizing the objective:
$$\min_W \sum_{i=1}^{n} \| W v^s_i - v^t_i \|^2 \tag{1}$$
The trained matrix $W$ can then be used for detecting the translation for any source language word $w^s$ by simply finding the word $w^t$ whose embedding $v^t$ is nearest to $W v^s$. Recent work by Smith et al. (2017) has shown that forcing the matrix $W$ to be orthogonal can effectively improve the results of mapping.
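The dictionary-based mapping described above can be illustrated with a small sketch. It is not the implementation of Mikolov et al. (2013b): the two-dimensional toy embeddings, the word lists, the learning rate and the use of cosine similarity for the nearest-neighbour search are all assumptions made for illustration, and the helper names (`fit_linear_map`, `translate`) are hypothetical.

```python
import math

def matvec(W, v):
    """Multiply matrix W (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def fit_linear_map(pairs, dim, lr=0.1, epochs=200):
    """Minimise sum_i ||W v_s_i - v_t_i||^2 by stochastic gradient descent."""
    W = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
    for _ in range(epochs):
        for v_s, v_t in pairs:
            err = [p - t for p, t in zip(matvec(W, v_s), v_t)]
            for i in range(dim):
                for j in range(dim):
                    # Gradient of ||W v_s - v_t||^2 with respect to W[i][j].
                    W[i][j] -= lr * 2.0 * err[i] * v_s[j]
    return W

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def translate(W, v_s, target_vocab):
    """Return the target word whose embedding is closest to W v_s."""
    mapped = matvec(W, v_s)
    return max(target_vocab, key=lambda w: cosine(mapped, target_vocab[w]))

# Toy data (assumed): the target space is the source space rotated by 90 degrees.
source = {"cat": [1.0, 0.0], "dog": [0.0, 1.0], "bird": [0.6, 0.8]}
target = {"chat": [0.0, 1.0], "chien": [-1.0, 0.0], "oiseau": [-0.8, 0.6]}

# Only "cat" and "dog" are in the seed dictionary; "bird" is held out.
seed = [(source["cat"], target["chat"]), (source["dog"], target["chien"])]
W = fit_linear_map(seed, dim=2)
prediction = translate(W, source["bird"], target)
```

In this toy setting the two seed pairs are enough to recover the rotation, so the held-out word is mapped next to its translation.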

GANs for cross-lingual word embeddings
Since the core of this linear mapping is derived from a dictionary, the quality and size of the dictionary can considerably affect the result. Recent attempts by Zhang et al. (2017) and Lample et al. (2018) have shown that, even without a dictionary or any other cross-lingual resource, training the transformation matrix W is still possible using the GAN framework (Goodfellow et al., 2014). The standard GAN framework plays a min-max game between two models: a generative model G and a discriminative model D. The generator G learns from the source data distribution and tries to generate new samples that appear drawn from the distribution of target data. The discriminator D discriminates the generated samples from the target data.
If we adapt the standard GAN framework to the goal of mapping cross-lingual word embeddings, the objective of the generator $G$ is to learn the transformation matrix $W$ that maps the source language embeddings to the target language embedding space, and the discriminator $D_l$, usually a neural network classifier, detects whether the input is from the target language, giving us the objective:
$$\min_G \max_{D_l} \; \mathbb{E}_{v_t \sim p_{v_t}} [\log D_l(v_t)] + \mathbb{E}_{v_s \sim p_{v_s}} [\log(1 - D_l(G(v_s)))] \tag{2}$$
where $G(v_s) = W v_s$ and $D_l(v_t)$ denotes the probability that $v_t$ came from the distribution $p_{v_t}$ rather than the distribution $p_{v_s}$. The inputs of generator $G$ are the embeddings sampled from the distribution of source language word embeddings, $p_{v_s}$, and the inputs of discriminator $D_l$ are sampled from both the distribution $p_{v_t}$ of real target language word embeddings and the distribution $p_{G(v_s)}$ of generated target language word embeddings. Both $G$ and $D_l$ are trained simultaneously by stochastic gradient descent. A good transformation matrix $W$ can make $p_{G(v_s)}$ similar to $p_{v_t}$, so that $D_l$ can no longer distinguish between them. However, this kind of similarity is at the distribution level. Good monolingual word embeddings are widely spread in the vector space. Consequently, without any post-processing, the cross-lingual mappings are usually inaccurate when we look at specific words (Lample et al., 2018).
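As a rough illustration of the min-max game above, the following sketch evaluates both sides of the standard GAN objective on a batch of discriminator outputs. The function names and the example probabilities are assumptions for illustration only; a real system would obtain these probabilities from a trained classifier and update $W$ and $D_l$ by gradient descent.

```python
import math

def discriminator_loss(d_real, d_fake):
    """Negative of the value D_l maximises.

    d_real: D_l outputs on real target embeddings v_t.
    d_fake: D_l outputs on generated embeddings G(v_s) = W v_s.
    """
    real_term = sum(math.log(p) for p in d_real) / len(d_real)
    fake_term = sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)
    return -(real_term + fake_term)

def generator_loss(d_fake):
    """Value G minimises: the mean of log(1 - D_l(G(v_s)))."""
    return sum(math.log(1.0 - p) for p in d_fake) / len(d_fake)

# A discriminator that cannot tell the two distributions apart outputs 0.5.
confused = discriminator_loss([0.5, 0.5], [0.5, 0.5])
# A discriminator that separates them well has a lower loss.
confident = discriminator_loss([0.9, 0.9], [0.1, 0.1])
# The generator is better off when its outputs are judged to be real.
fooled = generator_loss([0.9, 0.9])
caught = generator_loss([0.1, 0.1])
```

When the mapping is good enough that $D_l$ outputs 0.5 everywhere, the discriminator loss equals $2 \log 2$, which is the equilibrium value of the standard GAN game.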

Concept-based GANs for cross-lingual word embeddings
Under the assumption that words in different languages are similar when they are used to describe the same topic or concept (Søgaard et al., 2015), it seems possible to use concept-aligned documents to improve the mapping performance and make the generated embeddings of the source language words closer to the embeddings of corresponding words in the target language. Unlike dictionaries and sentence-aligned corpora, cross-lingual concept-aligned documents are much more readily available. For example, Wikipedia contains more than 40 million articles in more than 200 languages. 2 Therefore, many articles of different languages can be linked together because they are about the same concept. Given two monolingual word embeddings $V^j_s = \{v^s_1, ..., v^s_j\}$ and $V^k_t = \{v^t_1, ..., v^t_k\}$, our work consists in using a set of Wikipedia concept-aligned article pairs $(C^h_s, C^h_t) = \{(c^s_1, c^t_1), ..., (c^s_h, c^t_h)\}$, such that $c^s_i \in C^h_s$ represents the article of concept $c_i$ in the source language and $c^t_i \in C^h_t$ represents the article of the same concept in the target language.
As shown in Figure 1, the generator uses the transformation matrix $W$ to map the input embeddings $v_s$ from the source to the target language. Unlike in the original GANs, the input of the generator and the input of the discriminator on concept $D_{cl}$ are not sampled from the whole distribution of $V^j_s$ or $V^k_t$, but from their sub-distributions conditioned on the concept $c$, denoted $p_{v_s|c}$ and $p_{v_t|c}$. $v_c$ represents the embedding of the shared concept $c$. In this paper, we use the embedding of the title of $c$ in the target language. For titles consisting of multiple words, we average the embeddings of their words.
The input of the discriminator on concept $D_{cl}$ is the concatenation of the embedding of the shared concept $v_c$ and the generated target embeddings $G(v_s)$ or the real target embeddings $v_t$. Instead of determining whether its input follows the whole target language distribution $p_{v_t}$, the objective of $D_{cl}$ now becomes to judge whether its input follows the distribution $p_{v_t|c}$, and its loss function is given in equation (3). Because the sub-distribution conditioned on the concept, $p_{v_t|c}$, is less spread than $p_{v_t}$ and the proportion of similar words in $p_{v_t|c}$ is higher than in $p_{v_t}$, the embeddings of the input source words have a greater chance of being trained to align with the embeddings of their similar words in the target language.
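The construction of the input to the concept discriminator can be sketched as follows. This is only an illustration: the helper names, the toy embeddings, the lower-casing of titles and the decision to skip title words without an embedding are assumptions, not details given in the text.

```python
def average(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def concept_embedding(title, target_embeddings):
    """Embed a concept as the average of its target-language title words.

    Assumption: title words without an embedding are skipped.
    """
    words = [w for w in title.lower().split() if w in target_embeddings]
    if not words:
        raise ValueError("no title word has an embedding: " + title)
    return average([target_embeddings[w] for w in words])

def concept_discriminator_input(v_c, v):
    """Concatenate the concept embedding with a real or generated embedding."""
    return list(v_c) + list(v)

# Toy two-dimensional target embeddings (assumed values).
target_embeddings = {"new": [1.0, 0.0], "york": [0.0, 1.0], "city": [0.2, 0.3]}

v_c = concept_embedding("New York", target_embeddings)
d_cl_input = concept_discriminator_input(v_c, target_embeddings["city"])
```

With embeddings of dimension $d$, the concept discriminator therefore receives vectors of dimension $2d$, whereas the language discriminator receives vectors of dimension $d$.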
Our multi-discriminator concept-based model does not completely remove the usual language discriminator $D_l$ that determines whether its input follows the real target language distribution $p_{v_t}$ or the generated target language distribution $p_{G(v_s)}$. The simpler discriminator $D_l$ is useful when the concept-based discriminator is not stable. The objective function of the multi-discriminator model, shown in equation (4), combines all these elements.

Shared Components
Both the GAN and the concept-based GAN models use a variety of further methods that have been shown to improve results.

Orthogonalization
Previous studies (Xing et al., 2015; Smith et al., 2017) show that enforcing the mapping matrix $W$ to be orthogonal can improve the performance and make the adversarial training more stable. In this work, we perform the same update step proposed by Cisse et al. (2017) to approximate setting $W$ to an orthogonal matrix:
$$W \leftarrow (1 + \beta) W - \beta (W W^\top) W$$
According to Lample et al. (2018) and Chen and Cardie (2018), when setting $\beta$ to less than 0.01, the orthogonalization usually performs well.
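The effect of this update step can be checked numerically with a small sketch. The starting matrix, the number of iterations and the helper names are assumptions chosen for illustration; in training, the step would be interleaved with the gradient updates of the generator rather than applied in isolation.

```python
def matmul(A, B):
    """Product of two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(row) for row in zip(*A)]

def orthogonalize_step(W, beta=0.01):
    """One update W <- (1 + beta) W - beta (W W^T) W."""
    WWtW = matmul(matmul(W, transpose(W)), W)
    return [[(1.0 + beta) * w - beta * x for w, x in zip(row_w, row_x)]
            for row_w, row_x in zip(W, WWtW)]

def orthogonality_error(W):
    """Largest absolute entry of W W^T - I."""
    G = matmul(W, transpose(W))
    n = len(W)
    return max(abs(G[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))

# A matrix that is close to, but not exactly, orthogonal (assumed values).
W = [[1.1, 0.1], [0.0, 0.9]]
error_before = orthogonality_error(W)
for _ in range(2000):
    W = orthogonalize_step(W, beta=0.01)
error_after = orthogonality_error(W)
```

Each step moves every singular value $\sigma$ of $W$ to $\sigma + \beta \sigma (1 - \sigma^2)$, which pulls it towards 1 when $\sigma$ is already near 1 and $\beta$ is small; this is why the update keeps $W$ close to the set of orthogonal matrices.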

Post-refinement
Previous work has shown that the core cross-lingual mapping can be improved by refining it by bootstrapping from a dictionary extracted from the learned mapping itself (Lample et al., 2018). Since this refinement process is not the focus of this work, we perform the same refinement procedure as Lample et al. (2018). After the core mapping matrix is learned, we use it to translate the ten thousand most frequent source words $(s_1, s_2, ..., s_{10000})$ into the target language. We then take these ten thousand translations $(t_1, t_2, ..., t_{10000})$ and translate them back into the source language. We call these back-translations $(s'_1, s'_2, ..., s'_{10000})$. We then consider all the word pairs $(s_i, t_i)$ where $s_i = s'_i$ as our seed dictionary and use this dictionary to update our preliminary mapping matrix using the objective in equation 1.
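Moreover, following Xing et al. (2015) and Artetxe et al. (2016), we force the refined matrix $W^*$ to be orthogonal by using singular value decomposition (SVD):
$$W^* = U V^\top, \quad \text{where } U \Sigma V^\top = \mathrm{SVD}(Y X^\top)$$
Here the columns of $X$ and $Y$ are the source and target embeddings of the seed dictionary pairs, respectively.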
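The seed-dictionary construction used in the refinement (keeping only the pairs that survive back-translation) can be sketched as below. The two lookup tables stand in for nearest-neighbour retrieval under the current mapping; their contents and the function name are hypothetical.

```python
def mutual_translation_pairs(source_words, src_to_tgt, tgt_to_src):
    """Keep (s, t) only if translating t back yields the original s."""
    pairs = []
    for s in source_words:
        t = src_to_tgt.get(s)
        if t is not None and tgt_to_src.get(t) == s:
            pairs.append((s, t))
    return pairs

# Hypothetical best translations under the current mapping, in both directions.
src_to_tgt = {"cat": "chat", "dog": "chien", "bank": "rive"}
tgt_to_src = {"chat": "cat", "chien": "dog", "rive": "shore"}

seed_dictionary = mutual_translation_pairs(["cat", "dog", "bank"],
                                           src_to_tgt, tgt_to_src)
```

In this toy example "bank" is dropped because its translation maps back to a different source word, so the pair is considered unreliable and is not used to update the mapping.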

Cross-Domain Similarity Local Scaling
Previous work (Radovanović et al., 2010; Dinu et al., 2015) has shown that standard nearest neighbour techniques are not effective for retrieving similar target words in high-dimensional spaces. The work of Lample et al. (2018) showed that using cross-domain similarity local scaling (CSLS) to retrieve similar target words is more accurate than standard nearest neighbour techniques. Instead of just considering the similarity between the source word and its neighbours in the target language, CSLS also takes into account the similarity between the target word and its neighbours in the source language:
$$\mathrm{CSLS}(W v_s, v_t) = 2 \cos(W v_s, v_t) - r_t(W v_s) - r_s(v_t)$$
where $r_t(W v_s)$ represents the mean similarity between a source embedding and its neighbours in the target language, and $r_s(v_t)$ represents the mean similarity between a target embedding and its neighbours in the source language. In this work, we use CSLS to build a dictionary for our post-refinement.
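A toy computation may help to show why CSLS can differ from plain nearest-neighbour retrieval. The sketch below is an assumption-laden illustration: the two-dimensional vectors, the neighbourhood size `k` (the text above does not state the number of neighbours) and all function names are invented for the example.

```python
import math

def unit(degrees):
    """Unit vector in the plane at the given angle."""
    r = math.radians(degrees)
    return [math.cos(r), math.sin(r)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_top_k(query, candidates, k):
    """Mean cosine similarity between query and its k nearest candidates."""
    sims = sorted((cosine(query, c) for c in candidates), reverse=True)[:k]
    return sum(sims) / len(sims)

def csls(mapped_src, tgt, all_mapped_src, all_tgt, k):
    r_t = mean_top_k(mapped_src, all_tgt, k)   # source word's target neighbourhood
    r_s = mean_top_k(tgt, all_mapped_src, k)   # target word's source neighbourhood
    return 2.0 * cosine(mapped_src, tgt) - r_t - r_s

def retrieve(i, mapped_src, tgt, k, use_csls):
    """Index of the best target word for mapped source word i."""
    def score(j):
        if use_csls:
            return csls(mapped_src[i], tgt[j], mapped_src, tgt, k)
        return cosine(mapped_src[i], tgt[j])
    return max(range(len(tgt)), key=score)

# Target word 0 acts as a "hub": it lies very close to mapped source word 0
# and is also the cosine nearest neighbour of mapped source word 1.
mapped_src = [unit(0), unit(40)]
tgt = [unit(0), unit(85)]

nn_choice = retrieve(1, mapped_src, tgt, k=1, use_csls=False)
csls_choice = retrieve(1, mapped_src, tgt, k=1, use_csls=True)
```

Plain cosine retrieval assigns the hub to the second source word, while CSLS penalises the hub for its dense source-side neighbourhood and selects the other target word instead.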

Training
In our weakly-supervised model, the sampling and model selection procedures are important.
Sampling Procedure. As the statistics show in Table 1, even after filtering, the vocabulary of concept-aligned articles is still large. But it has been shown that the embeddings of frequent words are the most informative and useful (Luong et al., 2013; Lample et al., 2018), so we only keep the one-hundred thousand most frequent words for learning $W$. For each training step, then, the input word $s$ of our generator is randomly sampled from the vocabulary that is common both to the source monolingual word embedding $S$ and the source Wikipedia concept-aligned articles. After the input source word $s$ is sampled, we sample a concept $c$ according to the frequency of $s$ in each source article of the set of concepts. Then we uniformly sample a target word $t$ from the sub-vocabulary of the target article of concept $c$. 3
Model Selection. It is not as difficult to select the best model for weakly-supervised learning as it is for unsupervised learning. In our task, the first selection criterion is the average of the cosine similarity between the embeddings of the concept title in the source and the target languages. Moreover, we find that another good indicator for selecting the best model is the average cosine similarity between the two sets of embeddings consisting of the average embeddings of the ten most frequent words in the source and in the target languages for each aligned article pair. As shown in Figure 2, both criteria correlate well with the word translation evaluation task that we describe in section 5. In this paper, we use the average similarity of the cross-lingual title as the criterion for selecting the best model.
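The three-stage sampling procedure described above (source word, then concept, then target word) can be sketched as follows. The data structures, the toy counts and the use of raw in-article counts as sampling weights are assumptions; the text only states that the concept is sampled according to the frequency of $s$ in each source article.

```python
import random

def sample_training_triple(vocab, source_counts, target_vocab, rng):
    """Sample (s, c, t) for one training step.

    vocab: source words shared by the embeddings and the source articles.
    source_counts: concept -> {source word: count in the source article}.
    target_vocab: concept -> list of words in the target article.
    """
    s = rng.choice(vocab)
    # Concepts whose source article contains s, weighted by the count of s.
    concepts = [c for c in source_counts if source_counts[c].get(s, 0) > 0]
    weights = [source_counts[c][s] for c in concepts]
    c = rng.choices(concepts, weights=weights, k=1)[0]
    # Uniform choice among the words of the aligned target article.
    t = rng.choice(target_vocab[c])
    return s, c, t

# Toy concept-aligned articles (assumed counts and vocabularies).
source_counts = {
    "Cat": {"cat": 5, "animal": 2},
    "Dog": {"dog": 4, "animal": 3},
}
target_vocab = {
    "Cat": ["chat", "animal"],
    "Dog": ["chien", "animal"],
}
vocab = sorted({w for counts in source_counts.values() for w in counts})

rng = random.Random(0)
samples = [sample_training_triple(vocab, source_counts, target_vocab, rng)
           for _ in range(200)]
```

Every sampled triple pairs a source word with a target word drawn from an article about a concept in which the source word actually occurs, which is what restricts the adversarial game to the concept-conditioned sub-distributions.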
Other Details. For our generator, the mapping matrix $W$ is initialized with a random orthogonal matrix. Our two discriminators $D_l$ and $D_{cl}$ are two multi-layer perceptron classifiers with different hidden layer sizes (1024 for $D_l$ and 2048 for $D_{cl}$), and a ReLU function as the activation function. In this paper, we set the number of hidden layers to 2 for both $D_l$ and $D_{cl}$. In practice, one hidden layer also performs very well.

Experiments
We experimentally evaluate our proposal on the task of Bilingual Lexicon Induction (BLI). This task evaluates directly the bilingual mapping ability of a cross-lingual word embedding model. For each language pair, we retrieve the best translations for source words in the test data, and we report the accuracy. More precisely, for a given source language word, we map its embedding onto the target language and retrieve the closest word. If this closest word is included in the correct translations in the evaluation dictionary, we consider that it is a correct case. In this paper, we report the results that use CSLS to retrieve translations.
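The accuracy computation described above can be sketched in a few lines. The gold dictionary, the predictions and the function name are toy assumptions; in the experiments the predictions would come from CSLS retrieval over the mapped embeddings.

```python
def bli_accuracy(predictions, gold):
    """Percentage of test source words whose top prediction is a gold translation.

    predictions: source word -> retrieved target word.
    gold: source word -> set of acceptable translations.
    """
    correct = sum(1 for s in gold if predictions.get(s) in gold[s])
    return 100.0 * correct / len(gold)

# Toy evaluation dictionary: a source word may have several valid translations.
gold = {
    "cat": {"chat"},
    "bank": {"banque", "rive"},
    "dog": {"chien"},
    "tree": {"arbre"},
}
predictions = {"cat": "chat", "bank": "rive", "dog": "chat", "tree": "arbre"}

accuracy = bli_accuracy(predictions, gold)
```

A prediction counts as correct if it matches any of the listed translations, so polysemous source words such as "bank" are not penalised for selecting one valid sense.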
Baselines. As the objective of this work is to improve the performance of unsupervised GANs by using our weakly-supervised concept-based model, we choose the model of Lample et al. (2018) (called MUSE below) as our unsupervised baseline, since it is a typical standard GAN model. We evaluate the models both in a setting without refinement and a setting with refinement. The procedure of refinement is described in sections 3.2 and 3.3. Moreover, it is important to evaluate whether our model is comparable to previous supervised models. We use two different dictionary-based systems, VecMap, proposed by Artetxe et al. (2017), and Procrustes, provided by Lample et al. (2018), as our supervised baselines. Previous experiments have already shown that these two systems are strong baselines.

Data
Different previous pieces of work on bilingual lexicon induction use different datasets. We choose two of the publicly available ones, selected to give a comprehensive evaluation of our method.
BLI-1. The dataset provided by Lample et al. (2018) contains high quality dictionaries for more than 150 language pairs. For each language pair, it has a training dictionary of 5000 words and an evaluation dictionary of 1500 words. This dataset allows us to have a better understanding of the performance of our method on many different language pairs. We choose nine languages for testing and compare our method to our supervised and unsupervised baselines described in section 5: English (en), German (de), Finnish (fi), French (fr), Spanish (es), Italian (it), Russian (ru), Turkish (tr) and Chinese (zh). We classify similar and distant languages based on a combination of structural properties (directional dependency distance, as proposed and measured in Chen and Gerdes (2017)) and lexical properties, as measured by the clustering of current large-scale multilingual sentence embeddings. 4 We consider en-de, en-fr, en-es, en-it as similar language pairs and en-fi, en-ru, en-tr, en-zh as distant language pairs.
BLI-2. Unlike BLI-1, the dataset of Dinu et al. (2015) and its extensions provided by Artetxe et al. (2017, 2018b) only consist of dictionaries of 4 language pairs trained on a Europarl parallel corpus. Each dictionary has a training set of 5000 entries and a test set of 1500 entries. Compared to BLI-1, this dataset is much noisier and the entries are selected from different frequency ranges. However, BLI-2 has been widely used for testing by previous methods (Faruqui and Dyer, 2014; Dinu et al., 2015; Xing et al., 2015; Artetxe et al., 2016; Zhang et al., 2016; Artetxe et al., 2017; Smith et al., 2017; Lample et al., 2018; Artetxe et al., 2018a,b). 5 Using BLI-2 allows us to have a direct comparison with the state-of-the-art.

Monolingual Word Embeddings
The quality of monolingual word embeddings has a considerable impact on cross-lingual embeddings (Lample et al., 2018). Compared to CBOW and Skip-gram embeddings, FastText embeddings capture syntactic information better. The ideal situation would be to use FastText embeddings for both the BLI-1 and BLI-2 datasets. However, much previous work uses CBOW embeddings, so we use different monolingual word embeddings for BLI-1 and for BLI-2. For BLI-1, we use FastText 6 to train our monolingual word embedding models with 300 dimensions for each language and default settings. The training corpus comes from a Wikipedia dump. 7 For European languages, words are lower-cased and tokenized by the scripts of Moses (Koehn et al., 2007). 8 For Chinese, we first use OpenCC 9 to convert traditional characters to simplified characters and then use Jieba 10 to perform tokenization. For each language, we only keep the words that appear more than five times.
For BLI-2, following the work of Artetxe et al. (2018a), 11 we use their pretrained CBOW embeddings of 300 dimensions. For English, Italian and German, the models are trained on the WacKy corpus. The Finnish model is trained from Common Crawl and the Spanish model is trained from WMT News Crawl.
Concept Aligned Data. For concept-aligned articles, we use the Linguatools Wikipedia comparable corpus. 12 The statistics of our concept-aligned data are reported in Table 1.
Table 2 summarizes the results of bilingual lexicon induction on the BLI-1 dataset. We can see that with post-refinement, our method achieves the best performance on eight language pairs. Table 3 reports the results of bilingual lexicon induction on the BLI-2 dataset. Although our method (with refinement) does not achieve the best results, the gap between our method and the best state-of-the-art (Artetxe et al., 2018a) is very small, and in most cases, our method is better than previous supervised methods.

Results and Discussion
Comparison with unsupervised GANs. As we have mentioned before, the preliminary mappings trained by the method of Lample et al. (2018) perform well for some similar language pairs, such as Spanish to English and French to English. After refinement, their unsupervised GAN models can reach the same level as supervised models for these similar language pairs. However, the biggest drawback of standard GANs is that they exhibit poor performance for distant language pairs. The results from Table 2 clearly confirm this. For example, without refinement, the mapping trained by the unsupervised GAN method can only correctly predict 12% of the words from Turkish to English. Given that the quality of preliminary mappings can seriously affect the effect of refinement, the low-quality preliminary mappings for distant language pairs severely limit the improvements brought by post-refinement. Notice that the method of Lample et al. (2018) scores a null result for English to Finnish on both BLI-1 and BLI-2, indicating that totally unsupervised adversarial training can yield rather unpredictable results.
Compared to the method of Lample et al. (2018), the improvement brought by our method is apparent. Our concept-based GAN models perform better than their unsupervised models on almost every language pair for both datasets, BLI-1 and BLI-2. For those languages where no better results are achieved, the gaps from the best are very small. Figure 4 reports the error reduction rate brought by our method for bilingual lexicon induction on the BLI-1 dataset. It can be clearly seen that the improvement brought by our method is more pronounced on distant language pairs than on similar language pairs.
Comparison with supervised baselines. Our two dictionary-based supervised baselines are strong. In most cases, the preliminary mappings trained by our concept-based model are not as good, but the gap is small. After post-refinement, our method becomes comparable with these supervised methods. For some language pairs, such as French-English and for English to Finnish, our method performs better.
[Results table fragment. Columns: Method, en-de, en-es, en-fi, en-it. Supervised: Mikolov et al. (2013b): 35.0, 27.3, 25.9, 34.9; Faruqui and Dyer (2014): remaining values missing.]
Comparison with the state-of-the-art. From the results shown in Table 3, we can see that in most cases, our method works better than previous supervised and unsupervised approaches. However, the performance of Artetxe et al. (2018a) is very strong and their method always works better than ours. Two potential reasons may cause this difference: First, their self-learning framework iteratively fine-tunes the transformation until convergence, while our refinement just runs for a certain number of iterations. 13 Second, their framework consists of many optimization steps, such as symmetric re-weighting of vectors, steps that we do not have.

Mapping examples
Table 4 lists some examples of mappings, randomly selected from our experiment results on the BLI-1 dataset. From this table, we can see that for these selected English words, our model performs better than the unsupervised GAN model of Lample et al. (2018), but when we consider the top 3 translation candidates, both models are able to predict the correct translations well. Interestingly, the word "battery" is most commonly translated as Chinese "电池 (dianchi)", correctly predicted by our model, and the top 3 translation candidates provided by our model are all related to "electronic device". However, the translation candidates provided by MUSE are all related to "artillery", a different and less common sense of the word "battery" in both English and Chinese.
Impact of number of concepts. To understand whether the number of aligned concepts affects our method, we trained our concept-based models on a range of Chinese-English concept-aligned article pairs, from 550 to 10,000. We test them on the BLI-1 dataset. Following the trend of performance change shown in Figure 3, we see that, when the number of shared concepts reaches 2500, the accuracy is already very close to that of the model trained on the total number of concepts (reported in Table 1), thus indicating a direction for future optimization.
Figure 3: Accuracy of Chinese-English bilingual lexicon induction task for models trained from different concept numbers.
Figure 4: Average error reduction of our method compared to unsupervised adversarial method for bilingual lexicon induction on BLI-1 dataset (Lample et al., 2018). Since the Finnish-English pair is an outlier for the unsupervised method, we report both the average with and without this pair.

Related Work
Sharing a word embedding space across different languages has proven useful for many cross-lingual tasks, such as machine translation (Zou et al., 2013) and cross-lingual dependency parsing (Jiang et al., 2015, 2016; Ammar et al., 2016a). Generally, such spaces can be trained directly from bilingual sentence-aligned or document-aligned text (Hermann and Blunsom, 2014; Chandar A P et al., 2014; Søgaard et al., 2015; Vulić and Moens, 2013). However, the performance of directly trained models is limited by their vocabulary size.
Initial work on the topic has shown that two monolingual spaces can be combined by applying a linear mapping matrix (Mikolov et al., 2013b). This simple approach has been improved upon in several ways: using canonical correlation analysis to map source and target embeddings (Faruqui and Dyer, 2014); or by forcing the mapping matrix to be orthogonal (Xing et al., 2015).
Recently, efforts have concentrated on how to limit or avoid reliance on dictionaries. Good results were achieved with some drastically minimal techniques. Zhang et al. (2016) achieved good results at bilingual POS tagging, but not bilingual lexicon induction, using only ten word pairs to build a coarse orthonormal mapping between source and target monolingual embeddings. The work of Smith et al. (2017) has shown that a singular value decomposition (SVD) method can produce a competitive cross-lingual mapping by using identical character strings across languages. Artetxe et al. (2017, 2018b) proposed a self-learning framework, which iteratively trains its cross-lingual mapping by using dictionaries trained in previous rounds. The initial dictionary of the self-learning can be reduced to 25 word pairs or even only a list of numerals and still have competitive performance. Furthermore, Artetxe et al. (2018a) extend their self-learning framework to unsupervised models, and build the state-of-the-art for bilingual lexicon induction. Instead of using a pre-built dictionary for initialization, they sort the values of the word vectors in both the source and the target distribution, treat two vectors that have similar permutations as possible translations and use them as the initialization dictionary. Additionally, their unsupervised framework also includes many optimization augmentations, such as stochastic dictionary induction and symmetric re-weighting, among others.
Theoretically, employing GANs for training cross-lingual word embeddings is also a promising way to avoid the use of bilingual evidence. As far as we know, Miceli Barone (2016) was the first attempt at this approach, but the performance of their model is not competitive. Zhang et al. (2017) enforce the mapping matrix to be orthogonal during the adversarial training and achieve good performance on bilingual lexicon induction. The main drawback of their approach is that the vocabularies of their training data are small, and the performance of their models drops significantly when they use large training data. The recent model proposed by Lample et al. (2018) is so far the most successful, becoming competitive with previous supervised approaches through a strong CSLS-based refinement to the core mapping matrix trained by GANs. Even in this case, though, without refinement, the core mappings are not as good as hoped for some distant language pairs. More recently, Chen and Cardie (2018) extend the work of Lample et al. (2018) from the bilingual setting to the multi-lingual setting. Instead of training cross-lingual word embeddings for only one language pair, their approach allows them to train cross-lingual word embeddings for many language pairs at the same time. Another recent piece of work which is similar to Lample et al. (2018) comes from Xu et al. (2018). Their approach can be divided into 2 steps: first, using Wasserstein GAN (Arjovsky et al., 2017) to train a preliminary mapping between two monolingual distributions and then minimizing the Sinkhorn Distance across distributions. Although their method performs better than Lample et al. (2018) in several tasks, the improvement mainly comes from the second step, showing that the problem of how to train a better preliminary mapping has not been resolved.

Conclusions and Future Work
In this paper, we propose a weakly-supervised adversarial training method for cross-lingual word embedding mapping. Our approach is based on the intuition that mapping across distant languages is better done at the concept level than at the word level. We propose using the distributions of words in aligned Wikipedia articles as data for this concept mapping. The method improves the performance over previous unsupervised adversarial methods in almost all cases, especially for distant languages.
The monolingual word embeddings that we have used in this paper represent a word and not its different senses. These models do not handle polysemous words well. This also can have a negative impact on the performance of cross-lingual mapping. Recent work on contextualized word embeddings (Peters et al., 2018;Devlin et al., 2019) provides dynamic representations for words and has been shown to perform better at handling the problem of polysemy. Extending our concept-based cross-lingual mapping to contextualized word representations will be the focus of our future work.