Evaluating bilingual word embeddings on the long tail

Bilingual word embeddings are useful for bilingual lexicon induction, the task of mining translations of given words. Many studies have shown that bilingual word embeddings perform well for bilingual lexicon induction, but these studies focused on frequent words in general domains. For many applications, bilingual lexicon induction of rare and domain-specific words is critically important. We therefore design a new task to evaluate bilingual word embeddings on rare words in different domains. We show that state-of-the-art approaches fail on this task and present simple new techniques that improve bilingual word embeddings for mining rare words. We release new gold standard datasets and code to stimulate research on this task.


Introduction
Bilingual lexicon induction (BLI) is the task of generating accurate translations for each word in a list of source language words. Being able to perform BLI without parallel data is critical in many low-resource scenarios. Bilingual word embeddings (BWEs) represent words from two different languages in the same vector space. BWEs have been shown to be very effective for BLI given only a small seed lexicon (around 5000 word pairs) as bilingual signal. Until now, BWEs have been evaluated on frequent words from parliament proceedings or Wikipedia articles and have reached good accuracies on these datasets. However, evaluations on rare and domain-specific words have not yet been provided, even though such evaluation scenarios are critical for applications like machine translation (e.g., mining translations for out-of-vocabulary (OOV) items) or bilingual terminology mining. In this paper, we design a novel evaluation scenario for BWEs: given (i) large amounts of monolingual data and (ii) a seed lexicon of frequent word pairs, the goal is to create BWEs that enable accurate mining of rare words. As gold standard data, we release manually annotated pairs of rare words and their translations from three domains: (i) web crawls, (ii) news commentaries and (iii) medical texts. We show that state-of-the-art BWEs perform poorly on these datasets. We present simple techniques to build and combine BWEs that yield strong performance improvements. We study using FASTTEXT to build BWEs, using ensembles of BWEs, and dealing with orthographic distance in BWEs, all of which improve results on the new task of rare word translation mining. A secondary contribution is improvements over state-of-the-art approaches on frequent words (which have already been extensively studied in previous work). We make our datasets and code publicly available.

Bilingual Lexicon Induction of Rare Words
We briefly present how BLI is performed using BWEs and then introduce our new datasets.
Bilingual Lexicon Induction. The goal is to generate translations t in the target language vocabulary V_t for given words s from the source language vocabulary V_s. Given a BWE representing V_s and V_t, an n-best list of translations for each word s ∈ V_s can be induced by taking the top n words t_i ∈ V_t whose representations x_{t_i} in the BWE are closest to the representation x_s according to cosine distance.
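For illustration, this nearest-neighbor mining step can be sketched as follows (function and variable names are hypothetical, not from the paper's released code; a real setup would use full-vocabulary embedding matrices):

```python
import numpy as np

def mine_translations(src_vec, tgt_matrix, tgt_words, n=5):
    """Return the n target words whose embeddings are closest to
    src_vec by cosine similarity, with their scores."""
    # Normalize all vectors so that dot products equal cosine similarities.
    src = src_vec / np.linalg.norm(src_vec)
    tgt = tgt_matrix / np.linalg.norm(tgt_matrix, axis=1, keepdims=True)
    sims = tgt @ src
    best = np.argsort(-sims)[:n]
    return [(tgt_words[i], float(sims[i])) for i in best]

# Toy example: 3-dimensional "embeddings" for three German words.
tgt_words = ["hund", "katze", "haus"]
tgt_matrix = np.array([[1.0, 0.1, 0.0],
                       [0.9, 0.2, 0.1],
                       [0.0, 0.1, 1.0]])
print(mine_translations(np.array([1.0, 0.0, 0.0]), tgt_matrix, tgt_words, n=2))
```

With real BWEs the target matrix would hold the mapped representations of the full target vocabulary V_t.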
Datasets. To create BWEs we use post-hoc mapping, which requires only monolingual texts and a small seed lexicon (see §3). Our training data consists of two large monolingual corpora: GENERAL, 4,400,309 English and German sentences from parliament proceedings, news commentaries and web crawls taken from the WMT 2016 shared task (Bojar et al., 2016), and MEDICAL, a corpus of medical-domain texts.
Seed Lexicons. Throughout the paper, we work with two seed lexicons, GENLEX and MEDLEX. For each lexicon, we take the most common words and translate them by taking the top-ranked translation from a probabilistic dictionary. BWEs trained using these data are evaluated on our gold standards containing pairs of rare words (1000 validation, 1109 test pairs); we will also report results on frequent words, as in previous work (see below). As English-German BLI of frequent words has not been studied before, we annotated, following previous work, 2000 frequent English words taken from each of the GENERAL and MEDICAL corpora with their German translations, using the same probabilistic dictionary that was used to generate the lexicons. These two silver standard datasets will also be released with the paper.

Bilingual Word Embedding Creation
To create bilingual word embeddings, we use post-hoc mapping (PHM), a method that projects monolingual word embeddings (MWEs) into a shared space using a linear transformation trained with a small seed lexicon (Mikolov et al., 2013b; Faruqui and Dyer, 2014; Xing et al., 2015; Lazaridou et al., 2015; Vulić and Korhonen, 2016). Among methods to generate BWEs, PHM uses a very cheap bilingual signal. Given MWEs in two languages V_s and V_t, the goal of post-hoc mapping is to find a matrix W ∈ R^{d1×d2} that maps each representation x_s ∈ R^{d1} of a source word s ∈ V_s to the representation y_t ∈ R^{d2} of its translation t ∈ V_t. Typically, W is learned by ridge regression (RIDGE) on a seed lexicon L = {(x_i, y_i)}:

W = argmin_W ‖W·X − Y‖² + λ‖W‖²

where the columns of X and Y stack the vectors x_i and y_i respectively. Lazaridou et al. (2015) use a max-margin ranking loss (MAX-MARG) to estimate W. For each (x_i, y_i) ∈ L, a candidate y*_i = W·x_i is computed. The ranking loss is

Σ_{j=1..k} max{0, γ − sim(y_i, y*_i) + sim(y_j, y*_i)}

where y_j is a randomly selected negative example, i.e., it is not a translation of x_i, k is the number of negative examples and sim(x, y) computes the cosine similarity between x and y. The hyperparameters γ and k are tuned on held-out validation data.

Table 1: Bilingual lexicon induction of frequent word pairs on general and medical domain data. We report top-1 (and, in brackets, top-5) percentage accuracy. In this paper, bolding indicates the best result so far for a particular dataset.
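As a rough sketch of the RIDGE variant of post-hoc mapping, the closed-form solution can be computed directly on a seed lexicon. Here `fit_phm_ridge` is a hypothetical name and the data are synthetic (the "target" space is an exact rotation of the source space), not the paper's actual setup:

```python
import numpy as np

def fit_phm_ridge(X, Y, lam=1e-2):
    """Closed-form ridge solution W = argmin ||X W - Y||^2 + lam ||W||^2,
    with seed-lexicon source vectors stacked as rows of X (m x d1) and
    the corresponding target vectors as rows of Y (m x d2)."""
    d1 = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d1), X.T @ Y)

# Toy seed lexicon: the target space is a rotation of the source space.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
theta = 0.3
R = np.eye(4)
R[:2, :2] = [[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]]
Y = X @ R
W = fit_phm_ridge(X, Y, lam=1e-6)
print(np.allclose(W, R, atol=1e-3))  # True: the mapping recovers the rotation
```

With noisy, independently trained MWEs the recovery is of course only approximate, which is exactly what the max-margin alternative tries to be more robust to.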

Testing Previous Work
We reimplement Mikolov et al. (2013b) as well as Lazaridou et al. (2015). To replicate their setups on English-German texts, we first evaluate them on two standard tasks: mining frequent words from GENERAL and MEDICAL. We follow the approach of Heyman et al. (2017) and use English as the source language. First, we train 300-dimensional MWEs on the monolingual data using W2V with default parameters, except that we lowered the minimum word frequency threshold to 3 (Mikolov et al., 2013a). To generate BWEs, we use MEDICAL and MEDLEX for MEDRARE and MEDFREQ, and GENERAL and GENLEX for the rest of the test sets. We report results for the combinations of skip-gram (W2V SKIP) or cbow (W2V CBOW) with RIDGE or MAX-MARG. As in previous work, we use top-1 (the translation is the closest neighbor) and top-5 (the translation is one of the 5 closest neighbors) accuracies. The results in Table 1 show that the best performing setups are W2V SKIP with MAX-MARG for GENFREQ and W2V CBOW with RIDGE for MEDFREQ. Accuracies are comparable to previous work (which was on different language pairs). The poor performance on MEDFREQ is consistent with Heyman et al. (2017), who introduced the task of mining frequent medical terms.
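The top-n accuracy used throughout the tables can be computed in a few lines (a hypothetical helper on toy predictions, not the paper's evaluation script):

```python
def topn_accuracy(predictions, gold, n=5):
    """predictions: {source word: ranked list of candidate translations};
    gold: {source word: correct translation}. Returns the fraction of
    source words whose gold translation is among the top-n candidates."""
    hits = sum(1 for s, t in gold.items() if t in predictions.get(s, [])[:n])
    return hits / len(gold)

preds = {"dog": ["hund", "katze"], "house": ["auto", "haus"], "cat": ["maus"]}
gold = {"dog": "hund", "house": "haus", "cat": "katze"}
print(topn_accuracy(preds, gold, n=1))  # 1/3: only "dog" is correct at rank 1
print(topn_accuracy(preds, gold, n=5))  # 2/3: "house" is also covered at rank 2
```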

Applying BWEs for Mining Rare Word-Pairs
We use the exact same BWE training setup as above (§3) and perform BLI on our new test sets of rare words. The results in Table 2 show that BWEs perform very poorly on low-frequency word pairs. Compared to the standard evaluation scenarios (see Table 1), a massive performance decrease is observed. The low accuracy is caused by the inability of context-based models (W2V, https://github.com/dav/word2vec) to build accurate embedding vectors for words that occur in very few contexts. Through post-hoc mapping, these poor embeddings are projected essentially randomly into the bilingual space, which results in very poor BLI performance, especially in the medical domain.

Using Subword Models
A first way to create BWEs that are better adapted to rare words is to generate MWEs that provide better vector representations of these words. One simple idea is to add subword information; we show empirically that this helps BLI of rare words, which, to our knowledge, has not been shown before. FASTTEXT (Bojanowski et al., 2017) extends W2V by adding subword information s(w, c) to the context-based objective:

s(w, c) = Σ_{g ∈ G_w} z_g · v_c

where G_w ⊂ {1, ..., G} is the set of character n-gram indices corresponding to the n-grams that appear in the word w, z_g is the vector representation of n-gram g and v_c is the vector of the context word c. Subword information helps for rare words (by exploiting n-gram information shared between words) and generates more accurate MWEs, especially for morphologically rich languages like German. We create 300-dimensional MWEs using FASTTEXT skip-gram and cbow models with default parameters, with the same exception as before, i.e., we lowered the minimum word frequency threshold to 3. We perform PHM using RIDGE and MAX-MARG. The results in Table 3 show that this procedure yields impressive performance improvements. After evaluation, we manually inspected the predictions of our models; we present examples in Table 5 (these examples were added after the results were finalized). Examples 1-3 show that the model also improves on non-trivial cases, where the meanings of the incorrect predictions induced by W2V are not close to that of the input. We also show counterexamples 4 and 5, where subword elements cause errors by inducing hyponyms of the correct words. Generating BWEs with MAX-MARG on these improved MWEs is particularly effective. By analyzing word similarities, we saw that in BWEs acquired with RIDGE, rare English words are often mapped close to noise. Because MAX-MARG uses noisy negative word pairs as training examples, this phenomenon is less pronounced there.

Table 5: Examples comparing the predictions of the indicated models, using RIDGE for the mapping; model1 and model2 show the words induced for the given source. Bolding indicates the correct prediction; we give glosses for the incorrect predictions.
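The subword term relies on character n-grams with boundary markers. A minimal sketch of fastText-style n-gram extraction and of the score s(w, c) might look like this (names are illustrative, not fastText's actual API; real models hash n-grams into a fixed-size table):

```python
import numpy as np

def char_ngrams(word, n_min=3, n_max=6):
    """fastText-style character n-grams of a word, including the
    boundary markers '<' and '>'."""
    w = "<" + word + ">"
    return [w[i:i + n]
            for n in range(n_min, n_max + 1)
            for i in range(len(w) - n + 1)]

def subword_score(word, context_vec, ngram_vecs):
    """s(w, c) = sum over n-grams g of w of z_g . v_c, restricted here
    to n-grams that have a learned vector in ngram_vecs."""
    return sum(float(ngram_vecs[g] @ context_vec)
               for g in char_ngrams(word) if g in ngram_vecs)

print(char_ngrams("dog", 3, 4))  # ['<do', 'dog', 'og>', '<dog', 'dog>']
```

Because rare words share many of these n-grams with frequent words, their representations are pulled toward plausible regions of the space instead of staying near their random initialization.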

Model Ensembling
Although BWEs obtained with FASTTEXT and MAX-MARG clearly outperform the other methods on rare words, a combination of BWEs obtained with different models can further improve performance by integrating several sources of information. We ensemble BWEs obtained using different MWEs as follows: we generate n-best lists (n = 100) of translation candidates using each model. For each pair (s, t) of candidate translations, we compute an ensemble score given by a weighted sum of the similarity scores Sim_i(s, t) obtained on each of the M BWEs:

Score(s, t) = Σ_{i=1..M} γ_i Sim_i(s, t)      (4)

Sim_i(s, t) is computed using cosine similarity. When a candidate pair (s, t) is not in the n-best list generated by model i, Sim_i(s, t) is set to 0. The weights γ_i are tuned separately for each test set on the validation sets (presented in §2) using grid search. The results (Table 4) show that ensembling yields significant gains over subword models alone on all datasets. We again looked for examples after evaluation (Table 5) where ensembling helped compared with the previous best setup (examples 6-9) and saw that the method again improves on hard cases, where the incorrect predictions are very close in meaning to the gold annotation. Row 10 shows a counterexample. We note that this idea could also be used in a supervised neural network for BLI, where information from multiple models could be integrated by concatenating their embeddings for a given word.
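The weighted sum over per-model n-best lists can be sketched as follows (hypothetical names and toy scores; the γ_i would come from grid search on validation data):

```python
def ensemble_score(s, t, nbest_sims, gammas):
    """Weighted sum of per-model similarities for a candidate pair (s, t).
    nbest_sims: one dict per model, mapping (s, t) -> cosine similarity
    from that model's n-best list; a pair missing from a model's list
    contributes 0 for that model."""
    return sum(g * sims.get((s, t), 0.0)
               for g, sims in zip(gammas, nbest_sims))

# Two toy models; model2's n-best list does not contain ("dog", "katze").
model1 = {("dog", "hund"): 0.8, ("dog", "katze"): 0.5}
model2 = {("dog", "hund"): 0.6}
print(ensemble_score("dog", "hund", [model1, model2], [0.7, 0.3]))   # ~0.74
print(ensemble_score("dog", "katze", [model1, model2], [0.7, 0.3]))  # ~0.35
```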

Adding Orthographic Distance
While subword information captures orthographic properties of words to a certain extent, it cannot precisely represent the orthographic distance between word pairs in a predefined number of dimensions, especially not between source and target words when performing post-hoc mapping (the MWEs are trained separately, so there is no such cross-lingual information). It is therefore beneficial to strengthen BWEs by directly integrating a similarity measure between word strings. The BWE ensemble in Equation 4 can easily be augmented with a weighted term γ_{M+1} OSim(s, t) that measures the orthographic similarity (which we define as one minus the normalized Levenshtein distance) between the surface forms of the words s and t. We generate n-best lists of candidate translations using different BWE models as in §4.2. In addition, we generate a list containing the n closest target words according to OSim(s, t) and ensemble all lists together. Results are shown in Table 4. To measure the impact of orthographic information alone, we also report results obtained using only this information (all other ensemble weights set to 0). For low-frequency word pairs, orthographic information leads to massive performance gains. We analyzed the gold standard word pairs in our datasets from the perspective of orthographic similarity. For CRAWLRARE and MEDRARE, the ratio of orthographically similar words is high, which explains the large improvements obtained by adding this measure. Even though the ratio is not high for NEWSRARE and the two frequent datasets, orthographic information still improves performance, which shows the advantage of using the technique in all cases. Table 5 shows non-trivial examples (11-13) where orthographic distance improves performance.
Example 11 shows the advantage of combining vector representations with orthographic distance: our model finds translations of sleddogs that have a similar meaning. In examples 12 and 13, orthographic distance helps pick the correct translation, which is the closest in terms of edit distance. On the other hand, in example 14, orthographic distance causes an error because the incorrect prediction is orthographically too close to the source word.
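OSim as defined above, one minus the Levenshtein distance normalized by the length of the longer string, can be sketched as follows (an illustrative implementation, not the paper's code):

```python
def levenshtein(a, b):
    """Edit distance via dynamic programming, keeping one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def osim(s, t):
    """Orthographic similarity: 1 - edit distance / length of longer string."""
    if not s and not t:
        return 1.0
    return 1.0 - levenshtein(s, t) / max(len(s), len(t))

print(osim("diabetes", "diabetes"))              # 1.0
print(round(osim("infektion", "infection"), 2))  # 0.89 (one substitution)
```

Such cognate-like English-German pairs are exactly the cases in CRAWLRARE and MEDRARE where this term helps most.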

Conclusion
We evaluated BWEs on the novel task of rare term mining in different domains. Our experiments show that previous approaches to bilingual lexicon induction fail when mining rare words. We studied techniques, subword-based embeddings, model ensembling and orthographic similarity, that mitigate these problems; by ensembling different BWEs and combining them with orthographic cues, we reach state-of-the-art results. By making our code and datasets publicly available, we hope to encourage other researchers to further enhance BWEs on this important task. In the future, we would like to work on BLI of multiword translations and compound words.