Improving Bilingual Lexicon Induction for Low Frequency Words

This paper designs a Monolingual Lexicon Induction task and observes two factors that accompany the degraded accuracy of bilingual lexicon induction for rare words: first, a diminishing margin between cosine similarities in the low-frequency regime, and second, exacerbated hubness at low frequency. Based on these observations, we propose two methods that address the two factors respectively. The larger issue is hubness; addressing it improves induction accuracy significantly, especially for low-frequency words.


Introduction
Bilingual Lexicon Induction (BLI) studies how to generate word-level translations from non-parallel corpora in two languages. Recently, Irvine and Callison-Burch (2017) observed that rarer words are harder to translate than frequent ones, but their BLI method is based on various "hand-crafted" features. We show that the same phenomenon occurs in BLI methods that are based on word embeddings. Methods of this type have become especially popular in recent years (Faruqui and Dyer, 2014; Artetxe et al., 2016, 2018) and have achieved state-of-the-art accuracies (Conneau et al., 2018).
We briefly review BLI methods that are based on word embeddings. Without loss of generality, in this paper, we focus on "supervised" BLI, which assumes that a seeding dictionary is available. Unsupervised BLI (Artetxe et al., 2018;Conneau et al., 2018) often alternates between inducing a seeding dictionary and using that to refine generated translations. Therefore, to some extent, "supervised" BLI is a key step in its "unsupervised" counterpart, and a more basic prototype to study.
Let the source space be X = [x_1, . . . , x_m], where x_i ∈ R^d is the embedding vector of the i-th source word. Similarly, let the target space be Y = [y_1, . . . , y_n], where y_j is the embedding vector of the j-th target word. Here m and n are the vocabulary sizes of the two spaces. The seeding dictionary is made up of subsets of X and Y, denoted as X_s = [x_s_1, . . . , x_s_S] and Y_s = [y_s_1, . . . , y_s_S] respectively, where x_s_k and y_s_k are the word embeddings of a pair of translations. S is the size of the seeding dictionary.
The typical supervised BLI method works by first learning a transformation W that minimizes the discrepancy between X_s and Y_s:

W* = argmin_{W ∈ W} ||W X_s − Y_s||_F,   (1)

where ||·||_F is the Frobenius norm and W is a constraint set for W. The simplest choice of W is R^{d×d}, i.e., no constraint at all. On the other hand, Xing et al. (2015) observed substantial gains by letting W = O(d), the set of orthogonal matrices. In this case, (1) is also called a Procrustes problem.
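As a concrete sketch (our own illustration, not the authors' code), the orthogonal Procrustes problem has a well-known closed-form solution via SVD. A minimal NumPy version, assuming the seed embeddings are stored as d×S matrices whose columns are paired translations:

```python
import numpy as np

def procrustes(Xs, Ys):
    """Solve min_{W in O(d)} ||W Xs - Ys||_F.

    Xs, Ys: (d, S) matrices whose columns are paired seed embeddings.
    The minimizer is W = U V^T, where U S V^T is the SVD of Ys Xs^T.
    """
    U, _, Vt = np.linalg.svd(Ys @ Xs.T)
    return U @ Vt
```

A quick sanity check is that when Y_s is an exact rotation of X_s, the solver recovers that rotation.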
Once the transformation W is learned, a translation for a word x_i can be induced by retrieving the Nearest Neighbor (NN) of Wx_i in Y, typically under cosine similarity. Solving the Procrustes problem (Eq. (1)) followed by NN search is a representative framework, and we use it as the baseline of this paper.
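The NN retrieval step can be sketched as a cosine search over the transformed source embeddings (again an illustrative sketch, assuming column-wise embedding matrices):

```python
import numpy as np

def nn_translate(W, X, Y):
    """For each column x_i of X, return the index of the column of Y
    with the highest cosine similarity to W x_i."""
    WX = W @ X                                    # (d, m) transformed sources
    WX = WX / np.linalg.norm(WX, axis=0, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=0, keepdims=True)
    sims = WX.T @ Yn                              # (m, n) cosine similarities
    return sims.argmax(axis=1)
```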
Despite the existing success of word-embedding-based BLI, understanding of its performance as a function of word frequency is still lacking. This paper observes that BLI accuracy degrades significantly for low-frequency words. We then identify two factors that may explain this observation, and, motivated by them, propose two methods that address each factor; both improve BLI performance in the low-frequency regime.

Lexicon Induction at Low Frequency
We study how induction accuracies vary for words of different frequencies. Before we start, it should be emphasized that the frequency ranking of a source word and that of its translation(s) can differ substantially in their respective languages. We term this fact frequency mismatch; the extent of the mismatch also depends on the language pair. To simplify the problem, we design a "Monolingual Lexicon Induction" (MLI) task.

Monolingual Lexicon Induction (MLI)
MLI works with two sets of word embeddings for a single language. Given a word to be translated, the induction is supposed to retrieve the same word. The embeddings are trained separately on two monolingual corpora (in the same language). While frequency mismatch still exists due to differences between the two corpora, it is reduced significantly. Compared with BLI, the induction task is also much simplified, as the ground truth is a one-to-one mapping.
We take the fasttext wiki and crawl embeddings, and build a shared vocabulary of 500K words. The words are sorted from most to least frequent according to the crawl corpus, and the order is more or less preserved in the wiki corpus. We split the 500K words into 50 frequency bins: the first bin includes the 10K most frequent words, the second bin the next 10K most frequent words, and so on.
In each frequency bin, we randomly hold out 4K words as test words. The remaining 6K are used to build the seeding dictionary. To see how seed size impacts induction accuracy, we vary the number of seeds sampled from the 6K words: 0.02K, 0.2K, or all 6K per bin, resulting in seeding dictionaries of size 1K, 10K, and 300K respectively. We ensure that any smaller seeding set is a subset of a bigger one.
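One way to realize such nested seed sets (an illustrative sketch; the paper does not publish its preprocessing code, and the function name is ours) is to shuffle each bin once and take prefixes of the shuffled order, so every smaller sample is automatically contained in the larger ones:

```python
import random

def nested_seed_sets(bin_words, sizes=(20, 200, 6000), test_size=4000, seed=0):
    """Split one 10K-word frequency bin into a held-out test set and
    nested seed sets: each smaller seed set is a prefix of the larger one."""
    rng = random.Random(seed)
    words = list(bin_words)
    rng.shuffle(words)
    test, pool = words[:test_size], words[test_size:]
    seeds = {s: pool[:s] for s in sizes}   # prefixes => nested subsets
    return test, seeds
```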
An orthogonal transformation is learned using the seeds. Then, for the transformed source embeddings, nearest neighbors are retrieved in the target space. Figure 1a shows the retrieval accuracies in each frequency bin. The accuracies drop significantly at low frequency. One may wonder whether adding more seeds can help. It helps, but not very effectively, as the gain diminishes quickly; indeed, the improvement is tiny going from a seeding size of 10K to 300K. In the next two subsections, we look into two statistics as diagnostics of this observation. (The wiki and crawl embeddings are from https://fasttext.cc/docs/en/english-vectors.html.)

Cosine Similarities and Margin
Consider a source word x_i. When we apply NN retrieval, its true translation trans(x_i) is supposed to be the closest to Wx_i among all candidates in Y. In other words, we want

cos(Wx_i, trans(x_i)) > cos(Wx_i, y_j)

for every y_j ≠ trans(x_i). Further, define the margin associated with x_i as the difference

M(x_i) = cos(Wx_i, trans(x_i)) − max_{y_j ≠ trans(x_i)} cos(Wx_i, y_j).   (2)

When M(x_i) < 0, a translation error occurs. Figure 1b plots the averaged M(x) values within each frequency bin. We observe that the margin decreases in the low-frequency regime, leading to the degraded accuracies. Again, increasing the seeding size enlarges the margin, but the effect also saturates quickly.
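The margin M(x_i) is straightforward to compute from a cosine-similarity matrix; a small NumPy sketch (illustrative, assuming column-wise embeddings and a ground-truth index array `gold`):

```python
import numpy as np

def margins(W, X, Y, gold):
    """M(x_i) = cos(Wx_i, y_gold[i]) - max_{j != gold[i]} cos(Wx_i, y_j)."""
    WX = W @ X
    WX = WX / np.linalg.norm(WX, axis=0, keepdims=True)
    Yn = Y / np.linalg.norm(Y, axis=0, keepdims=True)
    sims = WX.T @ Yn                          # (m, n) cosine similarities
    rows = np.arange(sims.shape[0])
    true_sim = sims[rows, gold].copy()
    sims[rows, gold] = -np.inf                # exclude the true translation
    return true_sim - sims.max(axis=1)
```

A negative entry of the returned array marks a word whose NN retrieval would fail.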

Hubness and Tail of k-occurrence
Hubness is a tendency in high-dimensional spaces whereby some data points, called hubs, appear suspiciously close to many others (Radovanovic et al., 2010). Hubness is detrimental, as NN search may retrieve these hubs more often than it should. A variable that measures the degree of hubness is the k-occurrence N_k, defined for each item y in the target space as the number of times y is retrieved as a k-nearest neighbor of a query set Q. Formally,

N_k(y; Q) = |{x ∈ Q : y is among the k nearest neighbors of x}|.

To see the degree of hubness, one makes a histogram of N_k(y; Q) over all y's in the target space. A long tail of this histogram indicates strong hubness. The "tailness" can be measured by the number of items whose N_k value exceeds a threshold n:

T_n(N_k) = |{y : N_k(y; Q) > n}|.

A bigger T_n(N_k) indicates a longer tail of the distribution of N_k, and hence more hubness.
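Both statistics can be computed directly from a query-by-target similarity matrix; a minimal NumPy sketch (illustrative, not from the paper):

```python
import numpy as np

def k_occurrence(sims, k):
    """N_k(y_j; Q): how often target j appears among the k nearest
    neighbors of the queries. sims is a (|Q|, n) similarity matrix."""
    topk = np.argpartition(-sims, k - 1, axis=1)[:, :k]
    counts = np.zeros(sims.shape[1], dtype=int)
    np.add.at(counts, topk.ravel(), 1)
    return counts

def tail_count(Nk, n):
    """T_n(N_k): number of targets whose k-occurrence exceeds n."""
    return int((Nk > n).sum())
```

In the example below, target 0 is nearest to all three queries, so it is a hub with N_1 = 3.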
In our MLI case, the ground-truth mapping is one-to-one. Therefore, if a target word y has a large N_1(y), it is a "hub" that is incorrectly retrieved at least N_1(y) − 1 times. Note that N_k and T_n(N_k) both depend on the query set Q. We vary Q from the most frequent to the least frequent words, and plot the T_2(N_1) values in figure 1c. For all seeding sizes, hubness becomes more prominent for low-frequency words, implying that some "hubby" target words are retrieved more often than they should be.
Summarizing this section, we have identified two statistics that may explain the inferior accuracy for low-frequency words. Moreover, adding more seeds is not very effective for improving the accuracy.

Two Methods
Motivated by the two diagnostics in the last section, we introduce two methods, each of which individually improves accuracy in the low-frequency regime.

Hinge Loss for Learning Transformation
We first design a learning objective that enlarges the margin:

min_{W ∈ O(d)} Σ_i Σ_{y_j ≠ y_s_i} max{0, γ − cos(Wx_s_i, y_s_i) + cos(Wx_s_i, y_j)},   (3)

where γ > 0 is a threshold. The objective encourages each margin cos(Wx_s_i, y_s_i) − cos(Wx_s_i, y_j) to be bigger than γ.
It should be noted that using a hinge loss to learn the transformation is not a new idea; examples are seen not only in BLI (Lazaridou et al., 2015) but also in zero-shot image classification (Frome et al., 2013). Our difference from Lazaridou et al. (2015) is that W is constrained to O(d) instead of R^{d×d}, as we empirically observe some gain. This is consistent with the finding of Xing et al. (2015), although they experimented with the Procrustes loss (Eq. (1)) instead.
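An illustrative NumPy sketch of the hinge objective (our own simplification: negatives are drawn only from the seed targets, and training details are not the paper's). In practice one would minimize this loss with any autodiff optimizer, retracting W back onto O(d) after each step:

```python
import numpy as np

def hinge_loss(W, Xs, Ys, gamma=0.1):
    """Sum of penalties max(0, gamma - margin_ij) over seed pairs i and
    negative targets j != i, with cosine-similarity margins."""
    WX = W @ Xs
    WX = WX / np.linalg.norm(WX, axis=0, keepdims=True)
    Yn = Ys / np.linalg.norm(Ys, axis=0, keepdims=True)
    sims = WX.T @ Yn                          # (S, S) cosine similarities
    pos = np.diag(sims)                       # cos(W x_si, y_si)
    penalties = np.maximum(0.0, gamma - pos[:, None] + sims)
    np.fill_diagonal(penalties, 0.0)          # exclude the true pair
    return penalties.sum()

def project_orthogonal(W):
    """Retract an arbitrary matrix onto O(d) (nearest orthogonal matrix)."""
    U, _, Vt = np.linalg.svd(W)
    return U @ Vt
```

When every seed pair already has margin at least γ, the loss is exactly zero; a transformation that swaps two seeds incurs a positive penalty.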
We apply the hinge loss to train an orthogonal transformation, using a seeding dictionary of 10K. Accuracy is reported as the green line in figure 2a. A notable gain is observed over the Procrustes loss (blue line), especially in the low-frequency regime. Figure 2b validates that the margins for low-frequency words are indeed enlarged by adopting the hinge loss.

Hubless Nearest Neighbor (HNN) Search
To motivate our method, first consider the case where the translation is a one-to-one mapping. We should be able to exploit this strong prior, so that each target word is retrieved exactly once. To this end, we introduce an assignment matrix P ∈ [0, 1]^{m×n}, where P_i,j is the probability of assigning y_j as the translation of x_i. By this definition, Σ_j P_i,j = 1.
On the other hand, Σ_i P_i,j measures how likely each y_j is to be retrieved. We want all targets to be equally preferred, so we constrain Σ_i P_i,j to be uniform over j. In other words,

Σ_i P_i,j = m/n, for all j.   (4)

Observe that m does not necessarily equal n, so in fact we do not constrain the mapping to be one-to-one. The desired P is one that minimizes the total "cost" of translation, Σ_{i,j} P_i,j c_i,j with c_i,j = −cos(Wx_i, y_j). In summary, we want to solve the following optimization problem:

min_P Σ_{i,j} P_i,j c_i,j,  s.t. Σ_j P_i,j = 1 for all i, and Σ_i P_i,j = m/n for all j.   (5)

(5) is a linear assignment problem. It can be solved by the Hungarian algorithm (Jonker and Volgenant, 1987) with cubic complexity. Recently, a more efficient solver was rediscovered by Cuturi (2013), which regularizes the entropy of P:

min_P Σ_{i,j} P_i,j c_i,j − ε H(P),   (6)

where H(P) = −Σ_{i,j} P_i,j log P_i,j is the entropy of P. It is known that as ε → 0, the solution of problem (6) converges to that of problem (5).

A key challenge in solving (6) is its computational cost when m and n are huge. One way to circumvent that is to solve a dual problem of (6) instead (Genevay et al., 2016). Since the math is lengthy and less relevant here, we refer interested readers to a sister paper (Huang et al., 2019) for details. Our implementation can be found on GitHub (https://github.com/baidu-research/HNN). Once we obtain P by solving (6), the translation of x_i is induced as the y_j with j = argmax_j P_i,j.

We learn the transformation by solving the Procrustes problem (with 10K seeds) but replace NN with HNN search. The accuracies across all frequency bins are reported as the red line in figure 2a. It significantly outperforms Procrustes + NN, especially in the low-frequency regime. The reduced hubness is validated by the smaller T_2(N_1) values in figure 2c.

While HNN is effective, it is not the only method that reduces hubness. For example, CSLS (Conneau et al., 2018) is a recent state-of-the-art alternative. Fig. 2a also compares our HNN (solid red line) against CSLS (dashed red line) across all frequency bins; HNN outperforms CSLS for rare words.
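For intuition, a primal Sinkhorn iteration for the entropy-regularized problem (6) can be sketched in a few lines of NumPy. This is an illustrative toy version with names of our own choosing; the paper's released implementation solves the dual problem instead, which scales to large vocabularies:

```python
import numpy as np

def sinkhorn_hnn(C, eps=0.05, iters=2000):
    """Approximately solve problem (6): min <P, C> - eps*H(P),
    s.t. rows of P sum to 1 and columns sum to m/n.
    C is the (m, n) cost matrix, e.g. C[i, j] = -cos(W x_i, y_j)."""
    m, n = C.shape
    r = np.ones(m)               # row marginals: each source translated once
    c = np.full(n, m / n)        # column marginals: equal preference across targets
    K = np.exp(-C / eps)
    u, v = np.ones(m), np.ones(n)
    for _ in range(iters):       # alternating marginal scaling
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]

def hnn_translate(C, eps=0.05):
    """Induce translations as argmax_j P[i, j]."""
    return sinkhorn_hnn(C, eps).argmax(axis=1)
```

In the small example below, plain NN retrieval (row-wise argmin of the cost) picks target 0 for two different queries, i.e., target 0 acts as a hub, whereas the marginal constraint spreads the assignments out.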

Bilingual Experiments
We experiment with two language pairs: an easier pair (English, French) and a harder pair (English, Finnish). The embeddings and ground-truth dictionaries are downloaded from the MUSE repo. We use a vocabulary of size 200K for both the source and target languages. Following the same setup as in section 2.1, we create 30 frequency bins and a seeding dictionary of size 10K by uniformly sampling from each bin. The remaining words are used for testing. Figure 3 shows accuracy as a function of frequency rank.
In all cases, the two proposed methods both improve upon the baseline (blue curve), and HNN shows more gain than the hinge loss. However, compared with MLI (figure 2a), the improvement now seems more evenly distributed over all frequencies, especially on the harder language pair.
Moreover, HNN is on par with, or slightly better than, CSLS for the closer language pair. In contrast, en-fi (fig. 3c) is a case where CSLS works notably better than HNN. We attribute this to the rich morphology of Finnish.

Conclusion
Accuracy of bilingual lexicon induction decays for low-frequency words, which we trace to two factors: (1) a diminishing margin between cosine similarities, and (2) exacerbated hubness. We propose two methods, one addressing each factor, and experimental results validate their effectiveness.