Hubless Nearest Neighbor Search for Bilingual Lexicon Induction

Bilingual Lexicon Induction (BLI) is the task of translating words from corpora in two languages. Recent advances in BLI work by aligning the two word embedding spaces. Following that, a key step is to retrieve the nearest neighbor (NN) in the target space given the source word. However, a phenomenon called hubness often degrades the accuracy of NN. Hubness appears as some data points, called hubs, being extraordinarily close to many of the other data points. Reducing hubness is necessary for retrieval tasks. One successful example is Inverted SoFtmax (ISF), recently proposed to improve NN. This work proposes a new method, Hubless Nearest Neighbor (HNN), to mitigate hubness. HNN differs from NN by imposing an additional equal preference assumption. Moreover, the HNN formulation explains why ISF works as well as it does. Empirical results demonstrate that HNN outperforms NN, ISF and other state-of-the-art methods. For reproducibility and follow-ups, we have published all code.


Introduction
This paper presents a new method for Bilingual Lexicon Induction (BLI), which we call Hubless Nearest Neighbor (HNN). BLI is the task of automatically creating a lexicon of translation equivalents, such as bank:banc or bank:banque, from non-parallel corpora. The proposed method not only improves upon but also unifies several recent works that retrieve translations by Nearest Neighbor (NN) (Mikolov et al., 2013) and more advanced techniques like Inverted SoFtmax (ISF) (Smith et al., 2017).
Mikolov et al. (2013) observe that isomorphism exists across word embeddings of different languages. This motivates them to learn a linear mapping to align the spaces, using a seeding dictionary of 5K pairs of translations. After that, more translations can be induced by NN search. Following this seminal work, significant advances have been made. For example, Faruqui and Dyer (2014) use Canonical Correlation Analysis to align the two embedding spaces. Xing et al. (2015) show a substantial gain by normalizing the embeddings and constraining the mapping to be orthogonal. A series of works by Artetxe et al. (2017, 2018a) show that decent accuracies can be achieved even with a tiny or no seeding dictionary. The authors name their method "self-learning", which alternates between learning the mapping and inducing more translation pairs. A similar methodology is also seen in (Zhang et al., 2017b), where the induction step reduces a cost called earth mover distance. Conneau et al. (2018) propose to use a Generative Adversarial Network (Goodfellow et al., 2014) to learn the mapping when no seeding dictionary is available.
Whether or not a seeding dictionary is used, the induction always requires retrieving the translation under some distance measure. NN may be the most straightforward approach. However, it is often challenged by a phenomenon called hubness (Radovanovic et al., 2010). Hubness is the tendency of a few words (hubs) to be too near to too many other words, especially in high dimensional spaces. It degrades the accuracy of NN in various tasks (Aucouturier and Pachet, 2008; Ozaki et al., 2011; Suzuki et al., 2013; Zhang et al., 2017a), including BLI (Dinu et al., 2014). Recently, remarkable improvements have been made in BLI by mitigating hubness. For example, Smith et al. (2017) propose Inverted SoFtmax (ISF), which scales the similarities by a (global) measure of hubness. Conneau et al. (2018) develop a method called Cross-domain Similarity Local Scaling (CSLS) that relies on a local measure of hubness instead.
This work studies how to overcome hubness in BLI. The new method, HNN, is proposed by introducing an equal preference assumption. As we shall see, the assumption leads to an optimization problem that manifests the connection between HNN, NN and ISF. Empirical results demonstrate that HNN is very competitive and able to outperform NN, ISF and other state-of-the-art methods like CSLS. In summary, the paper makes the following contributions:
1. We propose an optimization based framework that connects NN, ISF and the proposed HNN.
2. We derive an efficient solver for HNN, which outperforms NN, ISF and other state-of-the-art methods.
3. We show that ISF is a part of HNN's solver, which explains why ISF works.

Bilingual Lexicon Induction with a Seeding Dictionary
Since hubness is the major concern of this work, we focus on the case with a seeding dictionary for simplicity. Representative methods (Mikolov et al., 2013; Xing et al., 2015; Artetxe et al., 2018a) often consist of two steps: 1) learning a mapping that aligns the source and target embedding spaces; 2) given a source word, retrieving its translation in the target embedding space according to some distance metric. We briefly review the two steps in this section. Let the source word embedding space be $\mathcal{X} \subset \mathbb{R}^d$, and the target space be $\mathcal{Y} \subset \mathbb{R}^d$. A typical value of the dimension $d$ is 300. Suppose the vocabulary sizes of the source and target languages are $m$ and $n$ respectively. Then $\mathcal{X}$ is a set of $m$ embedding vectors, denoted as $\mathcal{X} = \{x_1, \dots, x_m\}$, and $\mathcal{Y}$ is a set of $n$ embeddings, $\mathcal{Y} = \{y_1, \dots, y_n\}$.

Learning Linear Transformation
Suppose we can access a seeding dictionary, which reveals some correspondences from a subset of $\mathcal{X}$ to a subset of $\mathcal{Y}$. The correspondence can be one-to-one, many-to-one, or one-to-many. In matrix form, let the columns of matrix $X$ ($Y$) be the embeddings of source (target) words in the dictionary, where the $j$-th columns of $X$ and $Y$ are the embeddings of a pair of translations.
Using $X$ and $Y$, a linear mapping $T$ can be learned, which "aligns" the $\mathcal{X}$ and $\mathcal{Y}$ spaces. In particular,
$$T = \arg\min_{T \in \Omega} \|TX - Y\|_F^2 .$$
Here $\Omega$ is a constraint set on $T$. For example, Mikolov et al. (2013) simply let $\Omega$ be $\mathbb{R}^{d \times d}$, whereas Xing et al. (2015) show substantial gain by setting $\Omega$ as the set of orthogonal matrices.
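The two choices of constraint set can be illustrated with a short NumPy sketch. It assumes the objective is the Frobenius-norm fit between the mapped source columns and the target columns; the function names and the toy data are hypothetical, and the orthogonal case uses the standard SVD solution of the orthogonal Procrustes problem rather than any particular training procedure from the cited works.

```python
import numpy as np

def fit_least_squares(X, Y):
    """Unconstrained map: minimizes ||T X - Y||_F over all d-by-d matrices."""
    # T X = Y is equivalent to X^T T^T = Y^T, which lstsq solves for T^T.
    T_transposed, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)
    return T_transposed.T

def fit_orthogonal(X, Y):
    """Orthogonal map: minimizes ||T X - Y||_F subject to T^T T = I."""
    # Orthogonal Procrustes: with Y X^T = U S V^T, the minimizer is U V^T.
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

# Toy data: columns are embeddings, as in the matrix form of the dictionary.
rng = np.random.default_rng(0)
d, k = 5, 50
X = rng.standard_normal((d, k))
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # hidden orthogonal map
Y = Q @ X

T_ls = fit_least_squares(X, Y)
T_orth = fit_orthogonal(X, Y)
```

On this noiseless toy problem both solvers recover the hidden map; on real embeddings they generally differ, and only the second is guaranteed to preserve norms and angles.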

Retrieval by Nearest Neighbor (NN)
Once $T$ is obtained, translation can be cast as a retrieval problem. We define a distance matrix $D$ between the mapped source embeddings and target embeddings,
$$D_{i,j} = \mathrm{dist}(T x_i, y_j),$$
where "dist" is some distance metric. For the $i$-th source word, the Nearest Neighbor (NN) criterion determines (the index of) its translation in the set $\mathcal{Y}$ by
$$\arg\min_j D_{i,j}. \tag{NN}$$
However, several works (Radovanovic et al., 2010; Dinu et al., 2014) have observed that the accuracy of NN is often significantly degraded by a phenomenon called hubness. Mitigating hubness has thus become necessary, and we review one such approach next.
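As a minimal sketch of the retrieval step, the snippet below builds a distance matrix and applies the NN criterion row by row. Cosine distance is used here as one possible choice of "dist"; the function names and the tiny two-dimensional vectors are hypothetical, and the source vectors are assumed to be already mapped into the target space.

```python
import numpy as np

def cosine_distance_matrix(mapped_src, tgt):
    """D[i, j] = 1 - cos(mapped_src[i], tgt[j]); rows are embeddings."""
    a = mapped_src / np.linalg.norm(mapped_src, axis=1, keepdims=True)
    b = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    return 1.0 - a @ b.T

def nn_retrieve(D):
    """NN criterion: for each source row, the index of the closest target."""
    return np.argmin(D, axis=1)

tgt = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
mapped_src = np.array([[-0.9, 0.1], [0.8, 0.2], [0.1, 1.0]])

D = cosine_distance_matrix(mapped_src, tgt)
translations = nn_retrieve(D)
```

Each row is decided independently of the others, which is why nothing prevents a single target from being selected by many source words.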

Inverted Softmax: Improve NN by Mitigating Hubness
Hubness occurs in high dimensional feature spaces (Radovanovic et al., 2010). It appears as some data points, called hubs, being close to too many of the others. We look into Inverted SoFtmax (ISF) (Smith et al., 2017), a recent retrieval method that tackles hubness. Given the distance matrix $D$, ISF seeks a matrix $P^{\mathrm{ISF}}$ whose $(i,j)$-th entry decides the probability of translating the $i$-th source word to the $j$-th target word. A "temperature" parameter $\epsilon$ is introduced to construct a kernel, $\exp(-D_{i,j}/\epsilon)$. The ISF matrix is obtained by normalizing the kernel's columns first, and then the rows:
$$P^{\mathrm{ISF}}_{i,j} = \frac{K_{i,j}}{\sum_{j'} K_{i,j'}}, \quad \text{where } K_{i,j} = \frac{\exp(-D_{i,j}/\epsilon)}{\sum_{i'} \exp(-D_{i',j}/\epsilon)}. \tag{ISF}$$
Smith et al. (2017) show that ISF significantly outperforms NN in BLI tasks. However, it is still not clear why ISF works so well. One contribution of our work is to shed light on the mechanism behind ISF, which will be manifested after we study the proposed Hubless Nearest Neighbor.
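The column-then-row normalization can be sketched in a few lines of NumPy. This is an illustration of the description above rather than the reference implementation of Smith et al. (2017); the function name, the 2-by-2 distance matrix and the temperature value are hypothetical, chosen so that the first target acts as a hub under plain NN.

```python
import numpy as np

def inverted_softmax(D, eps):
    """Normalize the kernel exp(-D / eps) over columns, then over rows."""
    K = np.exp(-D / eps)
    K = K / K.sum(axis=0, keepdims=True)     # each column sums to one
    return K / K.sum(axis=1, keepdims=True)  # each row sums to one

# Target 0 is the nearest neighbor of both source words (a hub).
D = np.array([[0.10, 0.30],
              [0.20, 0.25]])

nn_choice = D.argmin(axis=1)
P_isf = inverted_softmax(D, eps=0.1)
isf_choice = P_isf.argmax(axis=1)
```

Dividing each column by its sum penalizes targets that are close to many sources, so the second source word is redirected to the second target. For small temperatures the exponentials may underflow; a production version would work in the log domain.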

Hubless Nearest Neighbor (HNN)
This section formalizes the proposed HNN. As will be clear from this section and the next, there is a unified view over NN, ISF and HNN. We now start by rephrasing the retrieval task into the following general problem. Given the distance matrix $D$ defined in section 2.2, we seek an assignment matrix $P$ where $P_{i,j}$ is the probability of the $j$-th target word being the translation of the $i$-th source word. Assume the target vocabulary is large enough so that one or more translations can always be found for any source word. Therefore $\sum_j P_{i,j} = 1$.

The (index of) translation is inferred by
$$\arg\max_j P_{i,j}.$$
The art is to determine $P$ from $D$ and various other information or priors. The above framework is general in the sense that many designs are possible for seeking $P$.

NN as a Warm-up
As a warm-up, we show that NN is a special case of the above framework. Specifically, let $\langle \cdot, \cdot \rangle$ be the matrix inner product; then $\langle D, P \rangle$ is a measure of the cost to translate from source to target, which we may want to minimize. In addition, if we minimize it along with a negative entropy regularizer (on $P$), we can reduce the gap between NN and (ISF), as stated by the following proposition.
Proposition 1. The (NN) criterion is equivalent to $\arg\max_j P_{i,j}$, where $P$ is the solution of the following optimization problem,
$$\min_P \; \langle D, P \rangle + \epsilon \sum_{i,j} P_{i,j} \log P_{i,j} \quad \text{s.t. } P_{i,j} \ge 0, \; \sum_j P_{i,j} = 1. \tag{$P_0$}$$
Proof. The solution to ($P_0$) is simply
$$P_{i,j} = \frac{\exp(-D_{i,j}/\epsilon)}{\sum_{j'} \exp(-D_{i,j'}/\epsilon)}. \tag{1}$$
Substituting Eq. (1) into $\arg\max_j P_{i,j}$, we arrive at
$$\arg\max_j \exp(-D_{i,j}/\epsilon) = \arg\min_j D_{i,j},$$
which is exactly the NN criterion.
The objective of ($P_0$) is regularized by $\sum_{i,j} P_{i,j} \log P_{i,j}$, which is the negative entropy of $P$. It smooths the linear objective, and leads to a solution, Eq. (1), as another view of NN that is closer to (ISF).
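For readers who want the intermediate step behind the closed form, the following is a sketch of the derivation. It assumes ($P_0$) is the entropy-regularized problem $\min_P \langle D, P \rangle + \epsilon \sum_{i,j} P_{i,j} \log P_{i,j}$ subject to each row of $P$ summing to one; the multiplier $\lambda_i$ is notation introduced here for illustration only. Since the problem decouples over rows, a single row $i$ suffices.

```latex
\begin{align*}
\mathcal{L}_i
  &= \sum_j D_{i,j} P_{i,j}
   + \epsilon \sum_j P_{i,j} \log P_{i,j}
   + \lambda_i \Big( \sum_j P_{i,j} - 1 \Big), \\
\frac{\partial \mathcal{L}_i}{\partial P_{i,j}}
  &= D_{i,j} + \epsilon \big( \log P_{i,j} + 1 \big) + \lambda_i = 0
  \;\Longrightarrow\;
  P_{i,j} = e^{-1 - \lambda_i / \epsilon} \, e^{-D_{i,j}/\epsilon}, \\
\sum_j P_{i,j} = 1
  &\;\Longrightarrow\;
  P_{i,j} = \frac{\exp(-D_{i,j}/\epsilon)}{\sum_{j'} \exp(-D_{i,j'}/\epsilon)}, \\
\arg\max_j P_{i,j}
  &= \arg\max_j \exp(-D_{i,j}/\epsilon)
   = \arg\min_j D_{i,j}.
\end{align*}
```

The non-negativity constraint is never active because the exponential is strictly positive, and the last line holds because the row normalizer does not depend on $j$.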

Equal Preference Assumption
Starting from ($P_0$), we now try to reduce the hubs. Since hubness results in some $y_j$'s being retrieved more frequently than others, a natural idea is to force all $y_j$'s to be equally preferred for retrieval. The preference of $y_j$ can be measured by
$$\mathrm{pf}_j = \frac{1}{m} \sum_i P_{i,j}.$$
In other words, $\mathrm{pf}_j$ measures how likely, on average, the $j$-th target word is to be picked as the translation of a source word. We therefore force $\frac{1}{m} \sum_i P_{i,j}$ to be uniform over all $j$. In addition, with the constraint that $\sum_j P_{i,j} = 1$ for any $i$, we have $\frac{1}{m} \sum_i P_{i,j} = \frac{1}{n}$ for any $j$. We term this constraint the equal preference assumption. Formally,
Definition 1. Equal Preference Assumption:
$$\frac{1}{m} \sum_i P_{i,j} = \frac{1}{n} \quad \text{for all } j.$$
If the translation is strictly one-to-one, then $m = n$ and $P$ is a permutation matrix, so the assumption holds exactly. In reality, translation is not one-to-one. But empirically, we still observe that it approximately holds, at least for some language pairs. To see that, we build a "groundtruth" $P^*$ using an English-French dictionary. The dictionary includes 113K items, with plenty of polysemy. The vocabularies are built from monolingual word embedding files. Vocabulary sizes are set as $m = n = 500{,}000$. The $P^*$ is computed as
$$P^*_{i,j} = \begin{cases} 1/Z_i, & \text{if the } j\text{-th target word is a translation of the } i\text{-th source word}, \\ 0, & \text{otherwise}, \end{cases} \tag{2}$$
where $Z_i$ is a normalizer that ensures $\sum_j P^*_{i,j} = 1$. We compute the $\mathrm{pf}_j$ values using $P^*$. To measure how much the values deviate from the equal preference assumption, we compute their variance,
$$\mathrm{var}_j[\mathrm{pf}_j] = \frac{1}{n} \sum_j \Big( \mathrm{pf}_j - \frac{1}{n} \sum_{j'} \mathrm{pf}_{j'} \Big)^2. \tag{3}$$
If the equal preference assumption holds exactly, $\mathrm{var}_j[\mathrm{pf}_j] = 0$. Otherwise, the larger the variance, the less true the assumption is. It turns out that $\mathrm{var}_j[\mathrm{pf}_j] \approx 1.9 \times 10^{-11}$, which is tiny and supports the assumption.
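The preference and its variance are straightforward to compute from any assignment matrix. The sketch below assumes the preference of a target is the column mean of $P$ and uses the population variance; the function names and the two toy matrices are hypothetical and only contrast a one-to-one assignment with an extreme hub.

```python
import numpy as np

def preference(P):
    """pf[j]: average probability that target j is picked, over all sources."""
    return P.mean(axis=0)

def preference_variance(P):
    """Population variance of the preferences; zero under equal preference."""
    return preference(P).var()

# One-to-one translation: P is a permutation matrix.
P_perm = np.eye(4)[[2, 0, 3, 1]]

# Extreme hubness: every source word is assigned to target 0.
P_hub = np.zeros((4, 4))
P_hub[:, 0] = 1.0

var_perm = preference_variance(P_perm)
var_hub = preference_variance(P_hub)
```

The permutation matrix gives a preference of 1/4 for every target and a variance of zero, while the hub matrix concentrates all preference on a single target.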

HNN
We add the equal preference assumption as a constraint to problem ($P_0$), and now solve the following new problem instead,
$$\min_P \; \langle D, P \rangle + \epsilon \sum_{i,j} P_{i,j} \log P_{i,j} \quad \text{s.t. } P_{i,j} \ge 0, \; \sum_j P_{i,j} = 1, \; \frac{1}{m} \sum_i P_{i,j} = \frac{1}{n}. \tag{$P_1$}$$
Analogous to (NN) and Proposition 1, we introduce the following definition of Hubless Nearest Neighbor (HNN).
Definition 2. HNN is the criterion that retrieves the index
$$\arg\max_j P_{i,j},$$
where $P$ is the solution of problem ($P_1$).
The remaining question is how to solve ($P_1$), which we will discuss in the next section.

Solvers for HNN
We first present a straightforward but less efficient solver, then we derive an efficient alternative. Both solvers, as will be seen, have strong connections with ISF.

A Less Efficient Solver
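($P_1$) can be solved by Sinkhorn iteration (Cuturi, 2013), which iteratively normalizes the columns and rows of a kernel matrix. The steps are summarized in algorithm 1.
Algorithm 1 Sinkhorn Solver for Problem ($P_1$)
Input: $D$
Output: $P$
1: $P \leftarrow \exp(-D/\epsilon)$, where exp is applied elementwise
2: while stopping criterion not met do
3:     $P \leftarrow P \, \mathrm{diag}\{\mathbf{1} \,./\, (P^\top \mathbf{1})\}$ (normalize columns)
4:     $P \leftarrow \mathrm{diag}\{\mathbf{1} \,./\, (P \mathbf{1})\} \, P$ (normalize rows)
5: end while
Here $./$ denotes elementwise division, $\mathbf{1}$ is an all-one vector of suitable length, and $\mathrm{diag}\{\cdot\}$ constructs a diagonal matrix from a vector.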
Remark 1. The Sinkhorn Solver reveals some connection between ISF and HNN. Indeed, ISF is equivalent to running a single iteration of algorithm 1. A deeper connection will be revealed in the next subsection, where we provide a complementary view of problem (P 1 ).
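The relation in Remark 1 can be checked numerically with a small NumPy sketch. It assumes each Sinkhorn iteration normalizes the columns of the kernel and then its rows, uses a fixed iteration count in place of a stopping criterion, and runs on a random toy distance matrix; all names and sizes are hypothetical.

```python
import numpy as np

def sinkhorn_hnn(D, eps, n_iter):
    """Alternate column and row normalization of the kernel exp(-D / eps)."""
    P = np.exp(-D / eps)
    for _ in range(n_iter):
        P = P / P.sum(axis=0, keepdims=True)  # normalize columns
        P = P / P.sum(axis=1, keepdims=True)  # normalize rows
    return P

rng = np.random.default_rng(0)
m, n = 6, 4
D = rng.uniform(0.0, 1.0, size=(m, n))
eps = 0.5

# A single iteration coincides with column-then-row normalization (ISF).
P_one = sinkhorn_hnn(D, eps, n_iter=1)
K = np.exp(-D / eps)
K = K / K.sum(axis=0, keepdims=True)
P_isf = K / K.sum(axis=1, keepdims=True)

# Many iterations approach a matrix that satisfies both constraints.
P = sinkhorn_hnn(D, eps, n_iter=500)
row_sums = P.sum(axis=1)
col_means = P.mean(axis=0)
```

After convergence every row sums to one and every column mean equals 1/n, which is the equal preference constraint; after one iteration only the row constraint is guaranteed.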

Dual Problem and an Efficient Solver
One drawback of algorithm 1 is its prohibitive memory cost. Indeed, the $P$ matrix has to be kept in memory for frequent updates, which is costly when the vocabulary sizes $m$ and $n$ are large. In this section, we study a dual form of problem ($P_1$), given in Proposition 2, which hints at a more efficient method to solve for $P$.
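Proposition 2. The solution of problem ($P_1$) can be expressed as
$$P_{i,j} = \frac{\exp((\beta_j - D_{i,j})/\epsilon)}{\sum_{j'} \exp((\beta_{j'} - D_{i,j'})/\epsilon)}, \tag{4}$$
where $\beta = (\beta_1, \dots, \beta_n)$ is the solution of
$$\min_{\beta} \; \sum_{i=1}^{m} \Big[ \epsilon \log \sum_j \exp\big((\beta_j - D_{i,j})/\epsilon\big) - \frac{1}{n} \sum_j \beta_j \Big]. \tag{D}$$
Proof. The proof is by the method of Lagrange multipliers. Details are in the supplementary material.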
The dual form (D) is a special case of the dual forms in general contexts (Genevay et al., 2016). (D) is useful because it allows us to learn a much smaller vector $\beta$ (of size $n$), instead of repeatedly updating the huge matrix $P$ ($m$ by $n$). Moreover, its loss function is a summation over the $i$'s, which can be minimized by parallelizable gradient descent. Finally, (D) is convex, which guarantees the convergence of gradient descent.
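It is now a natural idea to derive the gradient of (D) and give a gradient descent solver. Let us define the objective in (D) as $\sum_i \ell_i$, where
$$\ell_i = \epsilon \log \sum_j \exp\big((\beta_j - D_{i,j})/\epsilon\big) - \frac{1}{n} \sum_j \beta_j.$$
Then,
$$\frac{\partial \ell_i}{\partial \beta_j} = \frac{\exp((\beta_j - D_{i,j})/\epsilon)}{\sum_{j'} \exp((\beta_{j'} - D_{i,j'})/\epsilon)} - \frac{1}{n}.$$
Algorithm 2 summarizes the gradient descent solver. The equivalence between algorithms 1 and 2 will be empirically validated in section 5.1. In large-scale experiments, we apply algorithm 2 (instead of algorithm 1) for its lower memory cost.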
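A gradient descent solver of this kind can be sketched as follows. The sketch assumes the per-row gradient with respect to $\beta_j$ is the row-softmax of $(\beta_j - D_{i,j})/\epsilon$ minus $1/n$, accumulates it over row blocks so that only one block of probabilities is held at a time, and uses a fixed step size and step count; the function names, step size and toy data are hypothetical, and the short Sinkhorn routine is included only to compare the two solutions.

```python
import numpy as np

def row_softmax(A):
    """Numerically stable softmax over each row."""
    A = A - A.max(axis=1, keepdims=True)
    E = np.exp(A)
    return E / E.sum(axis=1, keepdims=True)

def hnn_dual_gd(D, eps, lr, n_steps, batch_size):
    """Gradient descent on the dual variable beta (one entry per target)."""
    m, n = D.shape
    beta = np.zeros(n)
    for _ in range(n_steps):
        grad = np.zeros(n)
        for start in range(0, m, batch_size):
            block = D[start:start + batch_size]
            P_block = row_softmax((beta - block) / eps)
            grad += (P_block - 1.0 / n).sum(axis=0)
        beta -= lr * grad / m  # averaged gradient over all source rows
    return beta

def sinkhorn(D, eps, n_iter):
    """Reference solution by alternating column and row normalization."""
    P = np.exp(-D / eps)
    for _ in range(n_iter):
        P = P / P.sum(axis=0, keepdims=True)
        P = P / P.sum(axis=1, keepdims=True)
    return P

rng = np.random.default_rng(0)
m, n = 6, 4
D = rng.uniform(0.0, 1.0, size=(m, n))
eps = 0.5

beta = hnn_dual_gd(D, eps, lr=eps, n_steps=2000, batch_size=2)
P_gd = row_softmax((beta - D) / eps)
P_sk = sinkhorn(D, eps, n_iter=500)
```

Only the length-n vector `beta` persists across steps; the rows of the assignment matrix are recomputed on demand, which is what makes the approach attractive when m and n are in the hundreds of thousands.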
Remark 2 (Unifying NN, ISF and HNN). Comparing the $P$ matrix under the NN criterion (Eq. (1)) and the HNN criterion (Eq. (4)), we observe that HNN introduces an additional set of values, the $\beta_j$'s. The quantity $\exp(-\beta_j/\epsilon)$ normalizes the $j$-th column of the kernel matrix $\exp(-D_{i,j}/\epsilon)$. In contrast, (ISF) simply sets the column normalizer as the column sum, $\sum_i \exp(-D_{i,j}/\epsilon)$. HNN thus works harder in figuring out the normalizers, which results in a higher accuracy, as we shall see in the experiment section.

Experiments
In this section, we first experiment with synthetic data to illustrate the connection between NN, ISF and the proposed HNN. Then, we report extensive results on the BLI task to show the advantage of HNN over NN, ISF and CSLS, another state-of-the-art method.
We will also demonstrate that hubness is indeed reduced by HNN. To measure hubness, we adopt the k-occurrence metric proposed in (Radovanovic et al., 2010), but with a small adaptation. In its original definition, the k-occurrence, $N_k$, is the number of times a data point appears among the $k$ nearest neighbors of all the others. In our case, we measure, for every target example, the number of times it is retrieved against the source set. If a target example is retrieved too many times, it is likely to be a "hub". For both the original definition and our adapted one, hubness is indicated by a long tail in the distribution of $N_k$.
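The adapted k-occurrence can be computed directly from a score matrix, as in the following sketch. It assumes a higher score means a better match (negative distances for NN, assignment probabilities for ISF or HNN); the function name and the toy matrices are hypothetical, and ties within the top k are broken arbitrarily.

```python
import numpy as np

def k_occurrence(scores, k):
    """N_k[j]: number of source rows whose top-k retrievals include target j."""
    n = scores.shape[1]
    # The first k entries of each partitioned row index the k largest scores.
    top_k = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    return np.bincount(top_k.ravel(), minlength=n)

# Distances in which target 0 is the nearest neighbor of both source words.
D = np.array([[0.10, 0.30],
              [0.20, 0.25]])
N1_nn = k_occurrence(-D, k=1)

# A hypothetical balanced assignment matrix, for contrast.
P_balanced = np.array([[0.66, 0.34],
                       [0.30, 0.70]])
N1_balanced = k_occurrence(P_balanced, k=1)
```

A histogram of the resulting counts gives the distribution whose tail length indicates hubness.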

Connection between NN, ISF and HNN
We simulate a retrieval task, where the source and target spaces are already aligned. Specifically, data is generated from a Gaussian mixture model, where the dimension $d = 300$. The class mean $\mu_c$ is generated by normalizing a vector in $\mathbb{R}^{300}$ whose dimensions are each drawn from uniform(−1, 1). We generate two samples per class, one in the source set, the other as the target to be retrieved. We use NN, ISF and HNN with algorithms 1 and 2 to retrieve for the 10K source samples. $\epsilon$ is set to 0.1, which gives the best accuracy of ISF. The same $\epsilon$ is also used in HNN. Table 1 reports top-1, top-5 and top-10 accuracies. The two algorithms for HNN achieve the best results and their accuracies are basically the same, validating their equivalence. To understand the improvement over NN, we measure the hubness in different methods. Figure 1 plots the distribution of $N_1$, the 1-occurrences. HNN has the shortest tail, in stark contrast to the long tail of NN, implying a significantly reduced number of hubs.
We then illustrate the connection between ISF and HNN. Figure 2 plots the retrieval accuracy over the iterations in algorithm 1. The accuracy at the first iteration matches that of ISF, validating the comments in remark 1. Next, recall the observation we made in remark 2: algorithm 2 and ISF both seek normalizers over the target examples to penalize hubness. It is therefore interesting to compare the two normalizers, shown in figure 3. We observe that the normalizer by HNN is smoother than that of ISF.
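The data generation can be sketched as below. The class means follow the description above; the within-class distribution is assumed here to be an isotropic Gaussian, and the noise level `noise_std`, the function name and the reduced number of classes are hypothetical choices, not values taken from the experiment.

```python
import numpy as np

def make_synthetic_pairs(n_classes, dim, noise_std, seed=0):
    """One source and one target sample per class, drawn around a shared mean."""
    rng = np.random.default_rng(seed)
    # Class means: uniform(-1, 1) vectors scaled to unit length.
    means = rng.uniform(-1.0, 1.0, size=(n_classes, dim))
    means /= np.linalg.norm(means, axis=1, keepdims=True)
    # Assumption: isotropic Gaussian noise with standard deviation noise_std.
    source = means + noise_std * rng.standard_normal((n_classes, dim))
    target = means + noise_std * rng.standard_normal((n_classes, dim))
    return means, source, target

means, source, target = make_synthetic_pairs(n_classes=200, dim=300, noise_std=0.05)
```

With this layout the correct target for the i-th source sample is the i-th target sample, so retrieval accuracy can be read off by comparing retrieved indices with `np.arange(n_classes)`.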

BLI Data and Setups
We now compare the different methods on real BLI tasks. We follow the setup in (Conneau et al., 2018). The word embeddings and groundtruth dictionaries (for both training and testing) can be downloaded from the MUSE repository. The dataset includes word embeddings for 45 languages. We focus on six languages in our experiments, since dictionaries are available for any two out of the six. These six languages are English (en), Spanish (es), French (fr), Italian (it), Portuguese (pt) and German (de).
Dictionaries for a pair of languages have the following three parts:
1. src-tgt.0-5000.txt is a seeding dictionary for learning the mapping, which has 5K unique source words.
2. src-tgt.5000-6500.txt is a small test dictionary that includes 1.5K unique source words.
3. src-tgt.txt is a full dictionary that includes the above two dictionaries and many more src-to-tgt translations. In later experiments, we will use it as a large test dictionary by excluding from it the items in src-tgt.0-5000.txt.
Following (Conneau et al., 2018), an orthonormal mapping $T$ is first learned using the seeding dictionary. The retrieval step uses cosine distance, i.e.,
$$D_{i,j} = 1 - \cos(T x_i, y_j).$$
The hyper-parameters ($\epsilon$ for ISF and HNN, $k$ for CSLS) should be set as the ones that achieve the best accuracy on the seeding dictionary. We choose to trust the default values ($\epsilon = 1/30$ and $k = 10$) used in the MUSE repository, since using them, we can reproduce the results reported in (Conneau et al., 2018). For HNN, $\epsilon$ is set to the same value as in ISF, and we use the gradient solver (algorithm 2) throughout. As a sanity check, we first reproduce some results reported in Tab. 1 of (Conneau et al., 2018) (the part with cross-lingual supervision), and compare those to HNN. Source and target vocabularies are both 200K. P@1 values are reported on the 1.5K small test dictionary, shown in Tab. 2. HNN is within the ballpark of the state-of-the-art results produced by ISF and CSLS.

Figure 4: (a) Visualizing $\mathrm{var}_j[\mathrm{pf}_j]$: for every pair of source and target languages, the corresponding block color-codes the variance of the preferences over the target words. The variances are significantly bigger for pairs that involve German; (b)-(f): Distribution of $N_5$, the 5-occurrences, for different methods in foreign-English BLI tasks. The 5-occurrence is the number of times a target word appears in the top-5 retrievals. A long tail of the distribution indicates hubness.

BLI Results on the Large Test Dictionary
Reporting P@1 on the 1.5K small test dictionary may not be sufficient to make a convincing comparison. We therefore repeat the same experiments but report P@1 on the large test dictionary.
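For concreteness, P@k against a dictionary with several valid translations per source word can be sketched as follows. The convention assumed here, that a source word counts as correct if any of its gold translations appears among the top-k retrievals, is an assumption of this sketch; the function name, scores and gold sets are hypothetical.

```python
import numpy as np

def precision_at_k(scores, gold, k):
    """Fraction of source words with at least one gold target in the top k.

    scores[i, j]: higher means target j is a better translation of source i.
    gold: dict mapping a source index to the set of acceptable target indices.
    """
    hits = 0
    for i, targets in gold.items():
        top_k = np.argsort(-scores[i])[:k]
        if targets.intersection(top_k.tolist()):
            hits += 1
    return hits / len(gold)

scores = np.array([[0.7, 0.2, 0.1],
                   [0.1, 0.5, 0.4],
                   [0.6, 0.3, 0.1]])
gold = {0: {0}, 1: {2}, 2: {1, 2}}

p_at_1 = precision_at_k(scores, gold, k=1)
p_at_2 = precision_at_k(scores, gold, k=2)
```

Under this convention a larger dictionary with more alternative translations per source word makes each individual query easier, but it also covers many more source words, which is why the large test dictionary gives a broader comparison.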
We first keep the vocabulary size of 200K, then try a more challenging 500K. P@1 values are reported on the large test dictionary. The results show the same trend for both vocabulary sizes. Therefore, given the space limit, we put the results for the 200K case in the supplementary material. Results of the 500K case are in Tab. 3. HNN outperforms all the other methods in all cases except pairs that involve German. Note that Spanish, French, Italian, and Portuguese are all Romance languages. German is a Germanic language. English originates from both. The results seem to suggest that the equal preference assumption is not true between Romance and Germanic languages.
To better understand this, we estimate the "groundtruth" preference of target words, and the variance of the preferences, following the process in section 3.2 (Eq. (2) and (3)). The larger the variance, the less likely the equal preference assumption holds. We visualize the variance between any pair of languages in figure 4a, where a hot block indicates large variance. We observe a large variance when either the source or target language is German. In other words, the equal preference assumption is more severely violated when translating from or to German.

Analysis of Hubs in BLI
To see how the hubs are reduced, we again calculate the k-occurrence metric. Figures 4b to 4f plot the distributions of $N_5$ for the different methods in the foreign-English BLI tasks. It is interesting to see what types of words are likely to be hubs. Table 4 lists some representative ones in the pt-en experiment, picked from the top 100 hubs with the biggest $N_5$ values. Some of them are extremely low-frequency words, e.g., "s+bd". This is consistent with the finding in (Dinu et al., 2014). However, it is also interesting to see that highly frequent words like "were" and "you" also appear to be hubs. Finally, all the $N_5$ values are reduced after applying HNN.

Conclusion
This paper studies how to reduce hubness during retrieval, a crucial step for Bilingual Lexicon Induction (BLI). The Hubless Nearest Neighbor (HNN) is proposed by assuming an "equal preference" constraint. HNN connects to NN, and also sheds light on a recent hubness-preventing method called Inverted SoFtmax (ISF). Empirical results demonstrate that HNN effectively reduces hubs, and can outperform NN, ISF and other state-of-the-art methods. Future work includes applying the method to more language pairs and to more domain-specific lexicon induction, e.g., terminologies.