A Discriminative Latent-Variable Model for Bilingual Lexicon Induction

We introduce a novel discriminative latent-variable model for the task of bilingual lexicon induction. Our model combines the bipartite matching dictionary prior of Haghighi et al. (2008) with a state-of-the-art embedding-based approach. To train the model, we derive an efficient Viterbi EM algorithm. We provide empirical improvements on six language pairs under two metrics and show that the prior theoretically and empirically helps to mitigate the hubness problem. We also demonstrate how previous work may be viewed as a similarly fashioned latent-variable model, albeit with a different prior.


Introduction
Is there a more fundamental bilingual linguistic resource than a dictionary? The task of bilingual lexicon induction seeks to create a dictionary in a data-driven manner directly from monolingual corpora in the respective languages and, perhaps, a small seed set of translations. From a practical point of view, bilingual dictionaries have found uses in a myriad of NLP tasks ranging from machine translation (Klementiev et al., 2012) to cross-lingual named entity recognition (Mayhew et al., 2017). In this work, we offer a probabilistic twist on the task, developing a novel discriminative latent-variable model that outperforms previous work.
Our proposed model is a bridge between current state-of-the-art methods in bilingual lexicon induction that take advantage of word representations, e.g., the representations induced by Mikolov et al.'s (2013b) skip-gram objective, and older ideas in the literature that build an explicit probabilistic model for the task. In contrast to previous work, our model is a discriminative probability model, inspired by Irvine and Callison-Burch (2013), but infused with the bipartite matching dictionary prior of Haghighi et al. (2008). However, similar to more recent approaches (Artetxe et al., 2017), our model operates directly over word representations, inducing a joint cross-lingual representation space, and scales to large vocabulary sizes. To train our model, we derive a generalized expectation maximization algorithm (EM; Neal and Hinton, 1998) and employ an efficient combinatorial algorithm to perform the bipartite matching.
Empirically, we experiment on three high-resource and three extremely low-resource language pairs. We evaluate intrinsically, comparing the quality of the induced bilingual dictionary, as well as analyzing the resulting bilingual word representations themselves. The latent-variable model yields gains over several previous approaches across language pairs. It also enables us to make implicit modeling assumptions explicit. To this end, we provide a reinterpretation of Artetxe et al. (2017) as a latent-variable model in the style of IBM Model 1 (Brown et al., 1993), which allows a clean side-by-side analytical comparison between our method and previous work. Viewed in this light, the difference between our approach and Artetxe et al.'s (2017), the strongest baseline, is whether one-to-one alignments or one-to-many alignments are admitted between the words of the languages' respective lexicons. We conclude that our prior over one-to-one alignments is primarily responsible for the improvements over Artetxe et al.'s (2017) model, which allows more permissive alignments.

Background
Bilingual lexicon induction is the task of finding word-level translations between the lexicons of two languages. For instance, the German word Hund and the English word dog are roughly semantically equivalent, so the pair Hund-dog should be an entry in a German-English bilingual lexicon. The task itself comes in a variety of flavors. In this paper, we consider a version of the task that only relies on monolingual corpora in the tradition of Rapp (1995) and Fung (1995). In other words, the goal is to produce a bilingual lexicon primarily from unannotated raw text in each of the respective languages. Importantly, we avoid reliance on bitext, i.e., corpora with parallel sentences that are known translations of each other, e.g., EuroParl (Koehn, 2005). The bitext assumption is quite common in the literature; see Ruder et al. (2019, Table 2) for a survey. Additionally, we will assume the existence of a small seed set of word-level translations obtained from a dictionary; we also experiment with seed sets obtained from heuristics that do not rely on the existence of linguistic resources.

Graph-Theoretic Formulation
To ease the later exposition, we will formulate the task of bilingual lexicon induction graph-theoretically. Let $\ell_{\text{src}}$ denote the source language and $\ell_{\text{trg}}$ the target language. Suppose the source language $\ell_{\text{src}}$ has $n_{\text{src}}$ word types in its lexicon $V_{\text{src}}$ and $\ell_{\text{trg}}$ has $n_{\text{trg}}$ word types in its lexicon $V_{\text{trg}}$. We will write $v_{\text{src}}(i)$ for the $i^{\text{th}}$ word type in $\ell_{\text{src}}$ and $v_{\text{trg}}(i)$ for the $i^{\text{th}}$ word type in $\ell_{\text{trg}}$. We will often write $i$ for $v_{\text{src}}(i)$ and $j$ for $v_{\text{trg}}(j)$ for brevity. Now consider the bipartite set of vertices $V_{\text{trg}} \mathbin{\dot\cup} V_{\text{src}}$. In these terms, a bilingual lexicon is just a bipartite graph $G = (V_{\text{trg}} \mathbin{\dot\cup} V_{\text{src}}, E)$ and, thus, the task of bilingual lexicon induction is a combinatorial problem: the search for a good edge set $E \subseteq V_{\text{trg}} \times V_{\text{src}}$. We depict such a bipartite graph in Fig. 1. In §3, we will operationalize the notion of goodness by assigning a weight $w_{ij}$ to each possible edge between $V_{\text{trg}}$ and $V_{\text{src}}$.
When the edge set $E$ takes the form of a bipartite matching, we will denote it as $m$. In general, we will be interested in partial bipartite matchings, i.e., bipartite matchings where some vertices may have no incident edges. We will write $\mathcal{M}$ for the set of all partial matchings on the bipartite graph $G$. The set of vertices in $V_{\text{trg}}$ (respectively, $V_{\text{src}}$) with no incident edges will be termed $u_{\text{trg}}$ (respectively, $u_{\text{src}}$). Note that for any matching $m$, we have the identity $u_{\text{trg}} = V_{\text{trg}} \setminus \{i : (i, j) \in m\}$.
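As a concrete illustration of these definitions (our own minimal sketch, not part of the original formulation), a partial matching can be represented as a set of index pairs, from which $u_{\text{trg}}$ is recovered directly:

```python
# A partial bipartite matching as a set of (i, j) pairs:
# i indexes V_trg, j indexes V_src; each index occurs at most once.
n_trg, n_src = 7, 6                       # toy lexicon sizes
m = {(0, 2), (1, 0), (3, 5)}              # a partial matching

# Unmatched target vertices: u_trg = V_trg \ {i : (i, j) in m}
u_trg = set(range(n_trg)) - {i for (i, j) in m}

# Sanity check that m really is a matching (no vertex is reused).
assert len({i for i, _ in m}) == len(m) and len({j for _, j in m}) == len(m)
```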

Word Representations
Word representations will also play a key role in our model. For the remainder of the paper, we will assume access to pre-trained word representations for each language's lexicon, for example, those provided by a standard model such as skip-gram (Mikolov et al., 2013b). Notationally, we define the real matrices $S \in \mathbb{R}^{d \times n_{\text{src}}}$ and $T \in \mathbb{R}^{d \times n_{\text{trg}}}$. Note that in this formulation $s_i \in \mathbb{R}^d$, the $i^{\text{th}}$ column of $S$, is the word representation corresponding to the word $v_{\text{src}}(i)$. Likewise, note $t_i \in \mathbb{R}^d$, the $i^{\text{th}}$ column of $T$, is the word representation corresponding to the word $v_{\text{trg}}(i)$.
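For concreteness, the following sketch (with a made-up toy vocabulary and random stand-in vectors in place of real skip-gram output) shows how $S$ would be assembled column-wise from monolingual vectors:

```python
import numpy as np

d = 300
src_vocab = ["dog", "cat", "house"]        # v_src(0), v_src(1), ...
rng = np.random.default_rng(0)
vectors = {w: rng.normal(size=d) for w in src_vocab}  # stand-in for skip-gram

# S has one column per source word type: S[:, i] is the vector for v_src(i).
S = np.stack([vectors[w] for w in src_vocab], axis=1)  # shape (d, n_src)
assert S.shape == (d, len(src_vocab))
```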

A Latent-Variable Model
The primary contribution of this paper is a novel latent-variable model for bilingual lexicon induction. The latent variable will be the edge set $E$, as discussed in §2.1. We define the following density over representations of the target language given the representations of the source language:

$$p_\theta(T \mid S) = \sum_{m \in \mathcal{M}} p(m)\, p_\theta(T \mid m, S) \quad (1)$$

Recall from §2 that $\mathcal{M}$ is the set of all bipartite matchings on the graph $G$ and $m \in \mathcal{M}$ is an individual matching. Note that, then, $p(m)$ is a distribution over all partial bipartite matchings on $G$, such as the matching shown in Fig. 1. We will take $p(m)$ to be fixed as the uniform distribution for the remainder of the exposition, though more complicated distributions could be learned. We further define the density

$$p_\theta(T \mid m, S) = \prod_{(i,j) \in m} p_\theta(t_i \mid s_j) \prod_{i \in u_{\text{trg}}} p_\theta(t_i) \quad (2)$$

where we write $(i, j) \in m$ to denote an edge in the bipartite matching. Furthermore, for notational simplicity, we have dropped the dependence of $u_{\text{trg}}$ on $m$; recall $u_{\text{trg}} = V_{\text{trg}} \setminus \{i : (i, j) \in m\}$. Next, we define the two densities present in Eq. (2):

$$p_\theta(t_i \mid s_j) = \mathcal{N}(t_i;\, \Omega\, s_j,\, I), \qquad p_\theta(t_i) = \mathcal{N}(t_i;\, \mu,\, I) \quad (3)$$

where $\Omega \in \mathbb{R}^{d \times d}$ is a real orthogonal matrix of parameters to be learned. We define the model's parameters, to be optimized, as $\theta = (\Omega, \mu)$, which justifies our use of the subscript $\theta$ in $p_\theta$. Now, given a fixed matching $m$, we may create submatrices $S_m \in \mathbb{R}^{d \times |m|}$ and $T_m \in \mathbb{R}^{d \times |m|}$ such that the columns correspond to word vectors of matched vertices, i.e., translations under the bipartite matching $m$. Now, after some algebra, we can rewrite $\prod_{(i,j) \in m} p_\theta(t_i \mid s_j)$ in matrix notation:

$$\prod_{(i,j) \in m} p_\theta(t_i \mid s_j) \propto \exp\!\left(-\tfrac{1}{2}\, \|T_m - \Omega\, S_m\|_F^2\right) \quad \text{(4d)}$$

The result of this derivation, Eq. (4d), will become useful during the discussion in §4.
Modeling Assumptions. In the previous section, we formulated the induction of a bilingual lexicon as the search for an edge set $E$, which we treat as a latent variable in a discriminative probability model. Specifically, we assume that $E$ is a partial matching. Thus, for every $(i, j) \in m$, we have $t_i \sim \mathcal{N}(\Omega\, s_j, I)$; that is, the representation for $v_{\text{trg}}(i)$ is assumed to have been drawn from a Gaussian centered around the representation for $v_{\text{src}}(j)$, after an orthogonal transformation. This gives rise to two modeling assumptions, which we make explicit: (i) there exists a single source for every word in the target lexicon, and that source cannot be used more than once; (ii) there exists an orthogonal transformation, after which the representation spaces are more or less equivalent. Assumption (i) may be true for related languages, but is likely false for morphologically rich languages that have a many-to-many relationship between the words in their respective lexicons. Empirically, we ameliorate this problem with a frequency constraint that only considers the top-$n$ most frequent words in both lexicons for matching in §6. In addition, we experiment with priors that relax this constraint in §7. As for assumption (ii), previous work (Xing et al., 2015; Artetxe et al., 2017) has achieved some success using an orthogonal transformation. Recently, however, Søgaard et al. (2018) demonstrated that monolingual representation spaces are not approximately isomorphic and that there is a complex relationship between word form and meaning that current approaches model only inadequately; for example, they cannot model polysemy. Nevertheless, we show that imbuing our model with these assumptions helps empirically in §6.
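To make these assumptions concrete, here is a small sketch (our own illustration, with the Gaussian normalization constants dropped) that scores a fixed partial matching under the densities above: matched target vectors are scored under $\mathcal{N}(\Omega s_j, I)$ and unmatched ones under $\mathcal{N}(\mu, I)$.

```python
import numpy as np

def matching_log_density(T, S, m, Omega, mu):
    """log p(T | S, m) under the model, up to additive Gaussian constants.

    T: (d, n_trg), S: (d, n_src); m: set of (target i, source j) pairs;
    Omega: (d, d) orthogonal map; mu: (d,) mean for unmatched targets.
    """
    u_trg = set(range(T.shape[1])) - {i for i, _ in m}
    ll = sum(-0.5 * np.sum((T[:, i] - Omega @ S[:, j]) ** 2) for i, j in m)
    ll += sum(-0.5 * np.sum((T[:, i] - mu) ** 2) for i in u_trg)
    return ll
```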
Why it Works: The Hubness Problem. Why should we expect the bipartite matching prior to help, given that we know of cases when multiple source words should match a target word? One answer is that the bipartite prior helps us obviate the hubness problem, a common issue in word-representation-based bilingual lexicon induction (Dinu and Baroni, 2015). The hubness problem is an intrinsic problem of high-dimensional vector spaces where certain vectors will be universal nearest neighbors, i.e., they will be the nearest neighbor of a disproportionate number of other vectors (Radovanović et al., 2010). Thus, if we allow one-to-many alignments, we will find the representations of certain elements of $V_{\text{src}}$ acting as hubs, i.e., the model will pick them to generate a disproportionate number of target representations, which reduces the quality of the representation space. Another explanation for the positive effect of the one-to-one alignment prior is its connection to the Wasserstein distance and computational optimal transport (Villani, 2008; Grave et al., 2019).
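The hub effect can be measured directly from the representations. The sketch below (illustrative only; the cosine-similarity retrieval mirrors the nearest-neighbor queries discussed above) counts how often each target vector appears among the $k$ nearest neighbors of a set of mapped source queries; hubs are targets with disproportionately large counts.

```python
import numpy as np

def hub_counts(queries, targets, k=20):
    """For each target column, count how often it is a k-NN of a query.

    queries: (d, n_q) mapped source vectors; targets: (d, n_t).
    Uses cosine similarity; returns an (n_t,) array of counts.
    """
    Q = queries / np.linalg.norm(queries, axis=0, keepdims=True)
    Y = targets / np.linalg.norm(targets, axis=0, keepdims=True)
    sims = Q.T @ Y                                   # (n_q, n_t)
    topk = np.argsort(-sims, axis=1)[:, :k]          # k-NN indices per query
    return np.bincount(topk.ravel(), minlength=Y.shape[1])
```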

Parameter Estimation
We conduct parameter estimation via Viterbi EM. We describe first the E-step, then the M-step. Viterbi EM estimates the parameters by alternating between the two steps until convergence. We give the complete pseudocode in Algorithm 1.
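In outline, the procedure alternates the two steps below. This is a minimal sketch of the loop in Algorithm 1, assuming the helper functions viterbi_match (sketched under §4.1 below) and m_step (sketched under §4.2 below):

```python
import numpy as np

def viterbi_em(S, T, m_seed, n_iters=50, tol=1e-6):
    """Sketch of Algorithm 1: alternate a hard E-step and a closed-form M-step."""
    m, prev_score = m_seed, -np.inf
    for _ in range(n_iters):
        Omega, mu = m_step(S, T, m)               # M-step: Procrustes + centroid
        m, score = viterbi_match(S, T, Omega)     # Viterbi E-step: best matching
        if score - prev_score < tol:              # stop once the objective stalls
            break
        prev_score = score
    return m, Omega, mu
```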

Viterbi E-Step
The E-step asks us to compute the posterior over latent bipartite matchings $p_\theta(m \mid S, T)$ given the representations $S$ and $T$. Computation of this distribution, however, is intractable, as it would require a sum over all bipartite matchings, which is #P-hard (Valiant, 1979). However, tricks from combinatorial optimization make it possible to maximize over all bipartite matchings in polynomial time. Thus, we fall back on the Viterbi approximation for the E-step (Brown et al., 1993; Samdani et al., 2012).
The Viterbi E-step requires us to solve the following combinatorial optimization problem:

$$m^\star = \operatorname*{argmax}_{m \in \mathcal{M}} \log p_\theta(m \mid S, T) \quad (5)$$

In the following Proposition, we give a polynomial-time solution to Eq. (5).
Proposition 4.1. The optimization problem $\operatorname*{argmax}_{m \in \mathcal{M}} \log p_\theta(m \mid S, T)$ can be solved in $O((n_{\text{src}} + n_{\text{trg}})^3)$ time with the Hungarian algorithm (Kuhn, 1955).
Exploiting Sparsity. The solution given in Proposition 4.1 requires use of the Hungarian algorithm. Despite a runtime of $O((n_{\text{src}} + n_{\text{trg}})^3)$, preliminary experimentation found it too slow for practical use with large vocabulary sizes. Thus, we consider a sparsification heuristic. For each source word, most target words are unlikely candidates for alignment. We thus only consider the top-$k$ most similar target words for alignment with every source word. After the sparsification, we employ a version of the Jonker-Volgenant algorithm (Jonker and Volgenant, 1987; Volgenant, 1996), which has been optimized for bipartite matching on sparse graphs.

Relationship to Cosine Distance. In Proposition 4.1, we weight the induced bipartite graph with edge weights

$$w_{ij} = \log p_\theta(t_j \mid s_i) \quad (6)$$

where $v_{\text{src}}(i)$ is the vertex corresponding to the $i^{\text{th}}$ word in the source language and $v_{\text{trg}}(j)$ is the $j^{\text{th}}$ word in the target language. We give a simple Proposition relating Eq. (6) to cosine distance: up to additive constants, $w_{ij} = -\frac{1}{2}\|t_j - \Omega s_i\|_2^2$, and for length-normalized representations $\|t_j - \Omega s_i\|_2^2 = 2 - 2\, t_j^\top \Omega s_i$ (since orthogonal $\Omega$ preserves norms), so maximizing the edge weight is equivalent to minimizing the cosine distance between $t_j$ and $\Omega s_i$.
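The following sketch is our own illustration of this pipeline, not the authors' implementation. SciPy's min_weight_full_bipartite_matching implements a sparse LAPJVsp-style solver in the Jonker-Volgenant family: we score edges by cosine similarity after mapping (justified by the Proposition above for normalized vectors), keep the top-$k$ targets per source word, and solve the assignment on the resulting sparse graph. Note that only stored entries of the sparse matrix are treated as edges, and a full matching of the smaller side must exist for the solver to succeed.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import min_weight_full_bipartite_matching

def viterbi_match(S, T, Omega, k=3):
    """Sparsified Viterbi E-step: top-k candidate edges, then sparse matching.

    Returns a set of (target i, source j) pairs and the total edge weight.
    """
    X = Omega @ S                                    # map sources to target space
    X /= np.linalg.norm(X, axis=0, keepdims=True)
    Y = T / np.linalg.norm(T, axis=0, keepdims=True)
    sims = Y.T @ X                                   # (n_trg, n_src) cosine weights

    # Keep only the top-k targets per source word (column-wise sparsification).
    W = np.zeros_like(sims)
    top = np.argsort(-sims, axis=0)[:k, :]
    cols = np.arange(sims.shape[1])
    W[top, cols] = sims[top, cols]

    # Sparse Jonker-Volgenant-style solver; implicit zeros are non-edges.
    rows, cols = min_weight_full_bipartite_matching(csr_matrix(W), maximize=True)
    return set(zip(rows.tolist(), cols.tolist())), float(sims[rows, cols].sum())
```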

M-Step
Next, we will describe the M-step. Given an optimal matching $m^\star$ computed in §4.1, we search for a matrix $\Omega \in \mathbb{R}^{d \times d}$. We additionally enforce the constraint that $\Omega$ is a real orthogonal matrix, i.e., $\Omega^\top \Omega = I$. Previous work (Xing et al., 2015; Artetxe et al., 2017) found that the orthogonality constraint leads to noticeable improvements.
Our M-step optimizes two objectives independently. First, making use of the result in Eq. (4d), we minimize the following:

$$\tfrac{1}{2}\|T_{m^\star} - \Omega\, S_{m^\star}\|_F^2 + C \quad (8)$$

with respect to $\Omega$ subject to $\Omega^\top \Omega = I$. Note we may ignore the constant $C$ during the optimization. Second, we minimize the objective

$$\tfrac{1}{2}\sum_{i \in u_{\text{trg}}} \|t_i - \mu\|_2^2 + D \quad (9)$$

with respect to the mean parameter $\mu$, which is simply an average. Note, again, we may ignore the constant $D$ during optimization. Optimizing Eq. (8) with respect to $\Omega$ is known as the orthogonal Procrustes problem (Schönemann, 1966; Gower and Dijksterhuis, 2004) and has a closed-form solution that exploits the singular value decomposition (Horn and Johnson, 2012). Namely, we compute $U \Sigma V^\top = T_m S_m^\top$. Then, we directly arrive at the optimum: $\Omega^\star = U V^\top$. Optimizing Eq. (9) can also be done in closed form; the point that minimizes the distance to the data points (thereby maximizing the log-probability) is the centroid: $\mu^\star = \frac{1}{|u_{\text{trg}}|} \sum_{i \in u_{\text{trg}}} t_i$.
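A minimal sketch of this closed-form M-step (our own illustration, following the SVD solution stated above; the matching is the set of (target, source) index pairs produced by the E-step):

```python
import numpy as np

def m_step(S, T, m):
    """Closed-form M-step: orthogonal Procrustes for Omega, centroid for mu."""
    pairs = sorted(m)
    S_m = S[:, [j for _, j in pairs]]        # (d, |m|) matched source vectors
    T_m = T[:, [i for i, _ in pairs]]        # (d, |m|) matched target vectors

    # Omega* = U V^T where U Sigma V^T is the SVD of T_m S_m^T (Schönemann, 1966).
    U, _, Vt = np.linalg.svd(T_m @ S_m.T)
    Omega = U @ Vt

    # mu* is the centroid of the unmatched target vectors (zeros if none exist).
    u_trg = sorted(set(range(T.shape[1])) - {i for i, _ in m})
    mu = T[:, u_trg].mean(axis=1) if u_trg else np.zeros(T.shape[0])
    return Omega, mu
```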
Algorithm 1 Viterbi EM for our latent-variable model
1: repeat
2:   $m^\star \gets \operatorname*{argmax}_{m \in \mathcal{M}} \log p_\theta(m \mid S, T)$  ▷ Viterbi E-step (§4.1)
3:   $\Omega^\star \gets \operatorname*{argmin}_{\Omega^\top \Omega = I} \|T_{m^\star} - \Omega\, S_{m^\star}\|_F^2$  ▷ M-step (§4.2)
4:   $\mu^\star \gets \frac{1}{|u_{\text{trg}}|} \sum_{i \in u_{\text{trg}}} t_i$
5: until convergence

Artetxe et al. (2017) as a Latent-Variable Model

The self-training method of Artetxe et al. (2017), our strongest baseline in §6, may also be interpreted as a latent-variable model in the spirit of our exposition in §3. Indeed, we only need to change the edge-set prior $p(m)$ to allow for edge sets other than those that are matchings. Specifically, a matching enforces a one-to-one alignment between types in the respective lexicons. Artetxe et al. (2017), on the other hand, allow for one-to-many alignments.
We show how this corresponds to an alignment distribution that is equivalent to IBM Model 1 (Brown et al., 1993), and that Artetxe et al.'s (2017) self-training method is actually a form of Viterbi EM.
To formalize Artetxe et al.'s (2017) contribution as a latent-variable model, we lay down some more notation. Let $\mathcal{A} = \{1, \ldots, n_{\text{src}} + 1\}^{n_{\text{trg}}}$, where we define $(n_{\text{src}} + 1)$ to be none, a distinguished symbol indicating unalignment. The set $\mathcal{A}$ is to be interpreted as the set of all one-to-many alignments $a$ on the bipartite vertex set $V = V_{\text{trg}} \mathbin{\dot\cup} V_{\text{src}}$ such that $a_i = j$ means the $i^{\text{th}}$ vertex in $V_{\text{trg}}$ is aligned to the $j^{\text{th}}$ vertex in $V_{\text{src}}$. Note that $a_i = n_{\text{src}} + 1 = \textit{none}$ means that the $i^{\text{th}}$ element of $V_{\text{trg}}$ is unaligned. Now, by analogy to our formulation in §3, we define

$$p_\theta(T \mid S) = \sum_{a \in \mathcal{A}} p(a) \prod_{i=1}^{n_{\text{trg}}} p_\theta(t_i \mid s_{a_i}) = \prod_{i=1}^{n_{\text{trg}}} \sum_{j=1}^{n_{\text{src}}+1} p(a_i = j)\, p_\theta(t_i \mid s_j)$$

The algebra given above is an instance of the dynamic programming trick introduced by Brown et al. (1993), which reduces the number of terms in the expression from exponentially many to polynomially many. We take $p(a)$ to be a (parameterless) uniform distribution over all alignments.

Artetxe et al.'s (2017) Viterbi E-Step. Now, we can perform the maximization necessary for a Viterbi E-step with dynamic programming. Specifically, the maximization problem over alignments decomposes additively, i.e.,

$$\log p_\theta(a \mid S, T) = \sum_{i=1}^{n_{\text{trg}}} \log p_\theta(t_i \mid s_{a_i}) + \text{const}$$

Thus, we can simply find $a^\star$ component-wise:

$$a_i^\star = \operatorname*{argmax}_{j \in \{1, \ldots, n_{\text{src}}+1\}} \log p_\theta(t_i \mid s_j)$$

Note that there is no longer a computational need to fall back on the Viterbi approximation to EM; we could also efficiently compute the expectations necessary for EM with dynamic programming.
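Under this view, the E-step reduces to independent nearest-neighbor queries, as in the following sketch (our own illustration; the none option and normalization details are simplified):

```python
import numpy as np

def one_to_many_estep(S, T, Omega):
    """Component-wise Viterbi E-step: each target word independently
    selects its most probable source word (a one-to-many alignment)."""
    X = Omega @ S
    X /= np.linalg.norm(X, axis=0, keepdims=True)
    Y = T / np.linalg.norm(T, axis=0, keepdims=True)
    a = np.argmax(Y.T @ X, axis=1)   # a[i] = argmax_j log p(t_i | s_j)
    return a                         # several i may share one j: hubs can form
```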
Artetxe et al.'s (2017) M-step. The M-step remains unchanged from the exposition in §3, with the exception that we fit $\Omega$ given submatrices $S_a$ and $T_a$ formed from a one-to-many alignment $a$, rather than a bipartite matching $m$.
Why a Reinterpretation? The reinterpretation of Artetxe et al. (2017) as a probabilistic model yields a clear analytical comparison between our method and theirs. We see that the only difference between the two models is the constraint on the bilingual lexicon that the model is allowed to induce, i.e., our model enforces a one-to-one alignment whereas Artetxe et al.'s (2017) does not.

Experiments
We first conduct experiments on bilingual dictionary induction and cross-lingual word similarity on three standard language pairs, English-Italian, English-German, and English-Finnish.

Experimental Details
Datasets. For bilingual dictionary induction, we use the English-Italian dataset from Dinu and Baroni (2015) and the English-German and English-Finnish datasets by Artetxe et al. (2017). For cross-lingual word similarity, we use the RG-65 and WordSim-353 cross-lingual datasets for English-German and the WordSim-353 cross-lingual dataset for English-Italian by Camacho-Collados et al. (2015).
Seed Dictionaries. Following Artetxe et al. (2017), we use dictionaries of 5,000 words, 25 words, and a numeral dictionary consisting of words matching the [0-9]+ regular expression in both vocabularies. In line with Søgaard et al. (2018), we additionally use a dictionary of identically spelled strings in both vocabularies.
Implementation. Similar to Artetxe et al. (2017), we stop training when the improvement in the average cosine similarity of the induced dictionary between succeeding iterations falls below $1 \times 10^{-6}$. Unless stated otherwise, we induce a dictionary of 200,000 source and 200,000 target words as in previous work (Mikolov et al., 2013c; Artetxe et al., 2016). For optimal one-to-one (1:1) alignment, we have observed the best results by keeping the top $k = 3$ most similar target words. If using a frequency constraint, we restrict the matching in the E-step to the top 40,000 words in both languages. Finding the optimal alignment on the 200,000 × 200,000 graph takes about 25 minutes on CPU; with the frequency constraint, matching takes around three minutes.
Baselines. We compare our approach, with and without the frequency constraint, to the original bilingual mapping approach by Mikolov et al. (2013c). In addition, we compare with Zhang et al. (2016) and Xing et al. (2015), who augment the former with an orthogonality constraint plus normalization, and an orthogonality constraint, respectively. Finally, we compare with Artetxe et al. (2016), who add dimension-wise mean centering to Xing et al. (2015), and with Artetxe et al. (2017). Both Mikolov et al. (2013c) and Artetxe et al. (2017) are special cases of our framework, and comparisons to these approaches thus act as an ablation study. Specifically, Mikolov et al. (2013c) do not employ orthogonal Procrustes, but rather allow the learned matrix $\Omega$ to range freely. Likewise, as discussed in §5, Artetxe et al. (2017) make use of a Viterbi EM-style algorithm with a different prior over edge sets.

Results
We show results for bilingual dictionary induction in Tab. 1 and for cross-lingual word similarity in Tab. 2. Our method with a 1:1 prior outperforms all baselines on English-German and English-Italian. Interestingly, the 1:1 prior by itself fails on English-Finnish with a 25 word and numerals seed lexicon. We hypothesize that the bipartite matching prior imposes too strong of a constraint to find a good solution for a distant language pair from a poor initialization. With a better, but still weakly supervised, starting point using identical strings, our approach finds a good solution. Alternatively, we can mitigate this deficiency effectively using a frequency constraint, which allows our model to converge to good solutions even with a 25-word or numeral-based seed lexicon. The frequency constraint generally performs similarly or boosts performance, while being about 8 times faster. All approaches do better with identical strings compared to numerals, indicating that the former may be generally suitable as a default weakly supervised seed lexicon. On cross-lingual word similarity, our approach yields the best performance on WordSim-353 and RG-65 for English-German and is only outperformed by Artetxe et al. (2017) on English-Italian WordSim-353.

Analysis
Vocabulary Sizes. The beneficial contribution of the frequency constraint demonstrates that, in similar languages, many frequent words will have one-to-one matchings, while it may be harder to find direct matches for infrequent words. We would thus expect the latent-variable model to perform better if we only learn dictionaries for the top $n$ most frequent words in both languages. We show results for our approach in comparison to the baselines in Fig. 2 for English-Italian using a 5,000 word seed lexicon across vocabularies consisting of different numbers $n$ of the most frequent words. The comparison approaches mostly perform similarly, while our approach performs particularly well for the most frequent words in a language.
Figure 2: Bilingual dictionary induction results of our method and baselines for English-Italian with a 5,000 word seed lexicon across different vocabulary sizes.
Different Priors. An advantage of having an explicit prior as part of the model is that we can experiment with priors that satisfy different assumptions. Besides the 1:1 prior, we experiment with a 2:2 prior and a 1:2 prior. For the 2:2 prior, we create copies of the source and target words $V'_{\text{src}}$ and $V'_{\text{trg}}$ and add these to our existing set of vertices. We run the Viterbi E-step on this new graph $G'$ and merge matched pairs of words and their copies in the end. Similarly, for the 1:2 prior, which allows one source word to be matched to two target words, we augment the vertices with a copy of the source words $V'_{\text{src}}$ and proceed as above. We show results for bilingual dictionary induction with different priors across different vocabulary sizes in Fig. 3. The 2:2 prior performs best for small vocabulary sizes. As solving the linear assignment problem for larger vocabularies becomes progressively more challenging, the differences between the priors become obscured and their performance converges.
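A sketch of the vertex-copying construction (our own illustration): for the 1:2 prior, the source side of the weight matrix is duplicated before matching, and matches to a copy are folded back onto the original word; the 2:2 prior duplicates both sides analogously.

```python
import numpy as np

def augment_for_one_to_two(weights):
    """Duplicate the source vertices so one source word may match two targets.

    weights: (n_trg, n_src) edge weights. Returns the augmented weights and a
    map from augmented source indices back to the original ones.
    """
    n_src = weights.shape[1]
    augmented = np.concatenate([weights, weights], axis=1)  # V_src plus a copy
    fold = lambda j: j % n_src                              # copy -> original
    return augmented, fold
```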
Hubness Problem. We analyze empirically whether the prior helps with the hubness problem. Following Lazaridou et al. (2015), we define the hubness $N_k(y)$ at $k$ of a target word $y$ as follows:

$$N_k(y) = |\{x \in Q : y \in \mathrm{NN}_k(x, G)\}|$$

where $Q$ is a set of query source language words and $\mathrm{NN}_k(x, G)$ denotes the $k$ nearest neighbors of $x$ in the graph $G$. In accordance with Lazaridou et al. (2015), we set $k = 20$ and use the words in the evaluation dictionary as query terms. We show the target language words with the highest hubness using our method and Artetxe et al. (2017) for English-German with a 5,000 seed lexicon and the full vocabulary in Tab. 3. Hubs are fewer and occur less often with our method, demonstrating that the prior, to some extent, aids with resolving hubness. Interestingly, compared to Lazaridou et al. (2015), hubs seem to occur less often and are more meaningful in current cross-lingual word representation models. For instance, the neighbors of gleichgültigkeit all relate to indifference, and words appearing close to luis or jorge are Spanish names. This suggests that the prior might also be beneficial in other ways, e.g., by enforcing more reliable translation pairs for subsequent iterations.

Low-Resource Languages. Monolingual corpora for low-resource languages are typically available, but such languages are not adequately reflected in current benchmarks (besides the English-Finnish language pair). We perform experiments with our method, with and without a frequency constraint, and Artetxe et al. (2017) for three truly low-resource language pairs, English-{Turkish, Bengali, Hindi}. We additionally conduct an experiment for Estonian-Finnish, similarly to Søgaard et al. (2018). For all languages, we use fastText representations (Bojanowski et al., 2017) trained on Wikipedia, the evaluation dictionaries provided by Lample et al. (2018), and a seed lexicon based on identical strings to reflect a realistic use case. We note that English does not share scripts with Bengali and Hindi, making this even more challenging. We show results in Tab. 4. Surprisingly, the method by Artetxe et al. (2017) is unable to leverage the weak supervision and fails to converge to a good solution for English-Bengali and English-Hindi. Our method without a frequency constraint significantly outperforms Artetxe et al. (2017), while particularly for English-Hindi the frequency constraint dramatically boosts performance.
Error Analysis. To illustrate the types of errors the model of Artetxe et al. (2017) and our method with a frequency constraint make, we query both of them with words from the German-English test dictionary of Kementchedjhieva et al. (2018): we take words in German and seek their nearest neighbors in the English representation space. P@1 over the German-English test set is 36.38 for Artetxe et al. (2017) and 39.18 for our method. We show examples where the nearest neighbors of the methods differ in Tab. 5. Similar to Kementchedjhieva et al. (2018), we find that morphologically related words are often the source of mistakes. Other common sources of mistakes in this dataset are names that are translated to different names and nearly synonymous words being predicted. Both of these sources indicate that while the learned alignment is generally good, it is often not sufficiently precise.

Related work
Cross-lingual Representation Priors. Haghighi et al. (2008) first proposed an EM algorithm for bilingual lexicon induction, representing words with orthographic and context features and using the Hungarian algorithm in the E-step to find an optimal one-to-one matching. Artetxe et al. (2017) proposed a similar self-learning method that uses word representations, with an implicit one-to-many alignment based on nearest-neighbor queries. Vulić and Korhonen (2016) proposed a stricter one-to-many alignment based on symmetric translation pairs, which is also used by Lample et al. (2018).
Our method bridges the gap between early latent-variable approaches and word-representation-based approaches, and explicitly allows us to reason about its prior.

Conclusion
We have presented a novel latent-variable model for bilingual lexicon induction, building on the work of Artetxe et al. (2017). Our model combines the prior over bipartite matchings inspired by Haghighi et al. (2008) with the discriminative, rather than generative, approach inspired by Irvine and Callison-Burch (2013). We show empirical gains on six language pairs and demonstrate, theoretically and empirically, that the bipartite matching prior helps mitigate the hubness problem.

Figure 1: Partial lexicons of German and English shown as a bipartite graph. German is the target language and English is the source language. The $n_{\text{trg}} = 7$ German words are shown in blue and the $n_{\text{src}} = 6$ English words are shown in green. A bipartite matching $m$ between the two sets of vertices is also depicted. The German nodes in $u_{\text{trg}}$ are unmatched.

Figure 3: Bilingual dictionary induction results of our method with different priors using a 5,000 word seed lexicon across different vocabulary sizes.

Table 1: Precision at 1 (P@1) scores for bilingual lexicon induction of different models with different seed dictionaries and languages on the full vocabulary.

Table 2: Spearman correlations on English-Italian and English-German cross-lingual word similarity datasets.

Table 3: Hubs in the English-German cross-lingual representation space with degree of hubness. Non-name tokens are translated.

Table 5: Examples from the German-English test set where the nearest neighbors of Artetxe et al. (2017) and our method differ.