Semi-Supervised Bilingual Lexicon Induction with Two-way Interaction

Semi-supervision is a promising paradigm for Bilingual Lexicon Induction (BLI) with limited annotations. However, previous semi-supervised methods do not fully utilize the knowledge hidden in annotated and non-annotated data, which hinders further improvement of their performance. In this paper, we propose a new semi-supervised BLI framework to encourage the interaction between the supervised signal and unsupervised alignment. We design two message-passing mechanisms to transfer knowledge between annotated and non-annotated data, named prior optimal transport and bi-directional lexicon update respectively. Then, we perform semi-supervised learning based on a cyclic or a parallel parameter feeding routine to update our models. Our framework is general and can incorporate any supervised and unsupervised BLI methods based on optimal transport. Experimental results on the MUSE and VecMap datasets show that our models achieve significant improvements. An ablation study further shows that the two-way interaction between the supervised signal and unsupervised alignment accounts for the gain in overall performance. Results on distant language pairs further illustrate the advantage and robustness of our proposed method.


Introduction
Bilingual Lexicon Induction (BLI) has attracted broad research interest. BLI methods learn cross-lingual word embeddings from separately trained monolingual embeddings. BLI is believed to be a promising way to transfer semantic information between different languages, and it spawns many NLP applications such as machine translation (Lample et al., 2018b; Artetxe et al., 2018b), Part-Of-Speech (POS) tagging (Gaddy et al., 2016), parsing (Xiao and Guo, 2014), and document classification (Klementiev et al., 2012).

† Yong Zhang is the corresponding author.
The key step of BLI is to learn a transformation between monolingual word embedding spaces (Ruder et al., 2019), which can be further used for translation retrieval or cross-lingual analogy tasks. However, it is hard to obtain a high-quality transformation with weak supervision signals, i.e., with a limited annotated lexicon. Thus, semi-supervised BLI methods (Artetxe et al., 2017; Patra et al., 2019) have been proposed to make use of both annotated and non-annotated data. Artetxe et al. (2017) bootstrapped the supervised lexicon to enhance supervision but ignored the knowledge in non-annotated data. Meanwhile, Patra et al. (2019) combined an unsupervised BLI loss that captures the structural similarity of word embeddings (Lample et al., 2018a) with a supervised loss (Joulin et al., 2018). However, this loss combination still performs poorly because the supervised optimization degrades under limited annotations; see the Experiment section for details. As a result, existing semi-supervised BLI methods suffer from either low effectiveness (Artetxe et al., 2017) or low robustness (Patra et al., 2019).
In this work, we focus on designing a new semi-supervised BLI method that makes full use of both annotated and non-annotated data. We propose a novel framework with two different strategies, which exceeds the previous semi-supervised methods (Artetxe et al., 2017; Patra et al., 2019), where the two signals are handled separately, by emphasizing the two-way interaction between the supervised signal and unsupervised alignment. In this framework, supervised training tries to align the parallel lexicon, and unsupervised training exploits the structural similarity between monolingual embedding spaces. The foundation of the two-way interaction lies in two carefully designed message passing mechanisms, see Sections 3.1 and 3.2. The two-way interaction enables semi-supervised BLI to simultaneously guide the exploitation of structural similarity by the unsupervised procedure and to extend the insufficient lexicon for the supervised procedure (Joulin et al., 2018), see Figure 1. In this paper, we only consider unsupervised BLI methods based on Optimal Transport (OT) (Zhang et al., 2017; Alaux et al., 2019; Huang et al., 2019), which have achieved impressive results on the BLI task.
More specifically, the contributions of this paper are listed below.
• We propose the two-way interaction between the supervised signal and unsupervised alignment. It consists of two message passing mechanisms, Prior Optimal Transport (POT) and Bi-directional Lexicon Update (BLU). POT enables the OT-based unsupervised BLI approach to be guided by any prior BLI transformation, i.e., it transfers what is learned by the supervised BLI method to the unsupervised BLI method. BLU employs the alignment results in bi-directional retrieval to enlarge the annotated data, and thus enhances supervised training with the unsupervised BLI transformation.
• We propose two strategies for the semi-supervised BLI framework based on POT and BLU, named Cyclic Semi-Supervision (CSS) and Parallel Semi-Supervision (PSS). They are characterized by cyclic and parallel parameter feeding routines, respectively, see Figure 1. Notably, CSS and PSS are universal and admit any supervised BLI method and any OT-based unsupervised BLI method.
• Extensive experiments on two popular datasets show that CSS and PSS exceed all previous supervised, unsupervised, and semi-supervised approaches and suit different scenarios. An ablation study of CSS and PSS demonstrates that the two-way interaction (POT and BLU) is the key to improving overall performance. Results on distant language pairs show the advantage and robustness of our method.

Background
In this section, we describe the basic formulation of related supervised and unsupervised BLI methods. We define two embedding matrices X, Y ∈ R n×d , where n is the number of words and d is the dimension of the word embedding.
The key to supervised BLI is the parallel lexicon between the two languages: say, word x_i in X translates to word y_i in Y. Mikolov et al. (2013) suggested regarding supervised BLI as a regression problem that aligns word embeddings by a linear transformation Q.
min_Q ||XQ − Y||_F^2. (1)

Artetxe et al. (2016) introduced an orthogonal constraint on Q. Problem (1) then has the closed-form solution Q = UV^⊤, where U, V are given by the SVD decomposition X^⊤Y = USV^⊤. Joulin et al. (2018) proposed replacing the 2-norm in Problem (1) with the Relaxed Cross-domain Similarity Local Scaling (RCSLS) loss to mitigate the hubness problem, formulated in Equation (2):

min_Q (1/n) Σ_{i=1}^{n} [ −2 x_iQ · y_i + (1/k) Σ_{y ∈ N_Y(x_iQ)} x_iQ · y + (1/k) Σ_{x ∈ N_X(y_i)} xQ · y_i ], (2)
where N_X(y) denotes the set of the k nearest neighbors of y in the point cloud X, and similarly for N_Y(x_iQ).
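The closed-form orthogonal solution above can be sketched in a few lines of numpy; the toy data and the function name `procrustes` are illustrative, not the paper's implementation:

```python
import numpy as np

def procrustes(X, Y):
    """Closed-form solution of min_Q ||XQ - Y||_F with Q orthogonal
    (Artetxe et al., 2016). Rows of X and Y are aligned embeddings."""
    # SVD of the cross-covariance between the two embedding spaces.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt  # orthogonal d x d transformation

# Toy check: Y is an exact rotation of X, which Procrustes recovers.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
R, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
Y = X @ R
Q = procrustes(X, Y)
```

When the annotated pairs are an exact rotation of each other, the recovered Q reproduces the rotation and is itself orthogonal.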
For unsupervised BLI, the embeddings in X and Y are totally out of order. As a result, unsupervised BLI methods need to model an unknown permutation matrix P ∈ P_n, the set of n×n permutation matrices (a subset of {0, 1}^{n×n}), together with the transformation Q ∈ O_d, where O_d is the set of orthogonal matrices:

min_{Q ∈ O_d, P ∈ P_n} ||XQ − PY||_F^2. (3)

Problem (3) can be solved by iteratively minimizing over Q and P. More specifically, prior work considered random samples X̄, Ȳ ∈ R^{m×d} from X, Y in a stochastic optimization scheme. Minimizing over P directly is hard; the key to unsupervised methods is how to solve P approximately, see Section 6. OT-based methods solve P by optimal transport (Zhang et al., 2017; Alaux et al., 2019; Huang et al., 2019). In particular, Zhang et al. (2017), among others, proposed to solve the Wasserstein problem between the two distributions supported on X̄Q and Ȳ:

min_{P ∈ Π} Σ_{i,j} P_ij D_ij, (4)

where Π is the set of doubly stochastic transport plans and D_ij is the cost between x̄_iQ and ȳ_j, such as the 2-norm, the RCSLS loss, or other costs. Adding an entropic regularizer,

min_{P ∈ Π} Σ_{i,j} P_ij D_ij + εH(P), with H(P) = Σ_{i,j} P_ij log P_ij, (5)

allows Problem (4) to be solved efficiently with the Sinkhorn algorithm (Peyré et al., 2019).
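The entropy-regularized transport problem can be solved with Sinkhorn iterations. A minimal numpy sketch, assuming uniform marginals and a toy cost matrix (illustrative values, not the paper's setup):

```python
import numpy as np

def sinkhorn(D, eps=0.1, n_iter=200):
    """Entropy-regularized OT: approximately minimizes <D, P> over
    transport plans P with uniform row/column marginals."""
    n, m = D.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)  # uniform marginals
    K = np.exp(-D / eps)                              # Gibbs kernel
    u = np.ones(n)
    for _ in range(n_iter):                           # alternating scaling
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]                # transport plan P

# Toy cost: matching i to i is free, everything else costs 1.
D = 1.0 - np.eye(4)
P = sinkhorn(D, eps=0.05)
```

With a small ε the plan concentrates its mass on the cheap diagonal matches, i.e., it approaches the permutation solving Problem (4).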
To summarize, the foundation of supervised BLI is the annotated parallel lexicon for training, and the critical step of OT-based unsupervised BLI is the solution of transport plan P .

Message Passing in BLI
In this section, we present two message passing mechanisms for the semi-supervised framework: POT and BLU. POT enhances unsupervised BLI with knowledge passed from supervised BLI, while BLU enhances supervised BLI with an additional lexicon built from unsupervised retrieval results. Together, POT and BLU form the two-way interaction between the supervised signal and unsupervised alignment.

Prior Optimal Transport
We present POT to strengthen the stochastic optimization in unsupervised BLI with prior information from supervised BLI. More specifically, POT is designed to guide the original OT solution of P, see Problem (4), and can replace the original OT problem in OT-based unsupervised BLI models. In this way, we enable the transformation Q_sup trained in supervised BLI to enhance the unsupervised BLI.
Given Q_sup learned from any supervised BLI method and random word embedding samples {x̄_i} and {ȳ_j}, we compute the cost matrix C between the transformed source embeddings and the target embeddings; in this work, we choose RCSLS as the specific cost function:

C_ij = RCSLS(x̄_iQ_sup, ȳ_j). (6)

Based on this cost function, we use the Boltzmann distribution, i.e., a softmax with temperature, to construct a prior transport plan Γ:

Γ_ij ∝ exp(−C_ij / T), (7)

where Γ_ij represents the probability that the i-th word in X is a translation of the j-th word in Y. The temperature T controls how sharply Γ concentrates on likely translations. Γ, induced from Q_sup, assigns each pair of words with a smaller cost in C a higher probability of forming a lexicon entry. Instead of considering Problem (4), we consider the POT problem regularized by the Kullback-Leibler (KL) divergence between Γ and P.
POT(X̄Q, Ȳ) = min_{P ∈ Π} ⟨D, P⟩ + εKL(P‖Γ), (8)

where ⟨D, P⟩ = Σ_{i,j} P_ij D_ij is the matrix inner product. We note that the KL regularization in the POT problem is fundamentally different from the aforementioned entropic regularization (5). For entropy-regularized OT, the regularization coefficient is expected to be as small as possible to approximate the original OT solution. For POT in (8), however, the coefficient ε controls the interpolation between the OT transport plan that minimizes Problem (4) and the prior transport plan Γ. Therefore, ε does not need to be as small as possible; instead, it should be a moderate value that coordinates the influence of the prior supervised transformation Q_sup.
The key to solving Problem (8) is to decompose the KL divergence into an entropic term and a linear term: since KL(P‖Γ) = H(P) − ⟨log Γ, P⟩, Problem (8) reduces to min_{P ∈ Π} ⟨D − ε log Γ, P⟩ + εH(P).
By treating D − ε log Γ as the Γ-prior cost matrix D(ε, Γ), Problem (8) can also be solved by the Sinkhorn algorithm. Again, since POT does not require ε to be close to zero, the solution of POT Problem (8) does not suffer from numerical instability problems (Peyré et al., 2019).
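Putting the pieces together, a minimal numpy sketch of POT: we build the prior plan Γ from a supervised cost matrix via Equation (7) and run Sinkhorn on the Γ-prior cost D − ε log Γ. The toy cost matrices stand in for RCSLS, and all hyperparameter values and function names here are illustrative assumptions:

```python
import numpy as np

def prior_plan(C, T=0.1):
    """Boltzmann prior (Eq. 7): pairs with a smaller supervised
    cost C_ij get a higher prior translation probability."""
    G = np.exp(-C / T)
    return G / G.sum()            # normalized prior transport plan

def pot(D, Gamma, eps=0.05, n_iter=200):
    """Prior OT (Eq. 8): Sinkhorn on the Gamma-prior cost D - eps*log(Gamma)."""
    Dp = D - eps * np.log(Gamma + 1e-30)  # Gamma-prior cost matrix
    n, m = Dp.shape
    a, b = np.full(n, 1.0 / n), np.full(m, 1.0 / m)
    K = np.exp(-Dp / eps)
    u = np.ones(n)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]

# Toy case: the unsupervised cost is uninformative (all equal), so the
# supervised prior, which prefers the diagonal, decides the plan.
D = np.ones((3, 3))
C = 1.0 - np.eye(3)
Gamma = prior_plan(C, T=0.1)
P = pot(D, Gamma, eps=0.05)
```

When the unsupervised cost cannot distinguish candidates, the solution follows the supervised prior Γ, which is exactly the message passing POT is designed for.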

Bi-directional Lexicon Update
As stated in Section 2, the key to supervised BLI is its parallel lexicon for training. Therefore, to enhance supervised BLI, we propose BLU to extend the parallel lexicon using the structural similarity of word embeddings exploited in unsupervised BLI. To distinguish from the unsupervised notation in Section 3.1, let S, T ∈ R^{n×d} be the parallel word embedding matrices for the source and target languages respectively; the i-th rows s_i of S and t_i of T form a translation pair in the annotated lexicon. BLU selects an additional lexicon S′, T′ ∈ R^{l×d} with high credit scores to extend S and T. Let S* = S ⊕ S′ and T* = T ⊕ T′ be the extended lexicon, where ⊕ stacks the rows of the two matrices.
Given the forward and backward transformations −→Q_unsup and ←−Q_unsup between the source and target languages from unsupervised BLI, BLU defines S′ and T′ in four steps: (1) Compute the forward and backward distance matrices: the forward distance matrix −→D is defined between the transformed source embeddings and the target embeddings, while the backward distance matrix ←−D is defined between the source embeddings and the backward-transformed target embeddings. (2) Generate forward and backward translation pairs by nearest-neighbor retrieval under −→D and ←−D, respectively. (3) Compute the credit score CS for each translation pair by combining the forward and backward credit scores defined for that pair. (4) Select the additional lexicon by credit score: translation pairs (i, j) are selected in descending order of CS.
Based on the steps above, we append the annotated lexicon with the additional lexicon containing high-credit translation pairs. This message passing mechanism is related to the bootstrap routine of Artetxe et al. (2018a). However, we select credible translation pairs from the intersection, rather than the union, of the forward and backward sets of translation pairs. In this way, we guarantee the high quality of the additional lexicon.
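The four BLU steps can be sketched as follows; the 2-norm distances and the simple round-trip credit score are stand-ins for the paper's exact definitions, and the function name `blu_candidates` is our own illustrative assumption:

```python
import numpy as np

def blu_candidates(X, Y, Q_fwd, Q_bwd, l=2):
    """Sketch of Bi-directional Lexicon Update: keep only pairs that
    are nearest neighbors in BOTH directions (intersection, not union),
    ranked by a simple stand-in credit score (round-trip distance)."""
    # Step 1: forward and backward distance matrices.
    D_fwd = np.linalg.norm((X @ Q_fwd)[:, None, :] - Y[None, :, :], axis=-1)
    D_bwd = np.linalg.norm(X[:, None, :] - (Y @ Q_bwd)[None, :, :], axis=-1)
    # Step 2: forward and backward translation pairs.
    fwd = {(i, int(D_fwd[i].argmin())) for i in range(len(X))}
    bwd = {(int(D_bwd[:, j].argmin()), j) for j in range(len(Y))}
    # Step 3: score pairs in the intersection of both directions.
    both = fwd & bwd
    scored = sorted(both, key=lambda p: D_fwd[p] + D_bwd[p])
    # Step 4: select the l highest-credit pairs.
    return scored[:l]

# Toy usage: Y is a rotation of X, so the true lexicon is (i, i).
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))
R, _ = np.linalg.qr(rng.normal(size=(3, 3)))
Y = X @ R
pairs = blu_candidates(X, Y, R, R.T, l=3)
```

With perfect forward and backward transformations, the bidirectional intersection recovers only correct pairs, illustrating why it yields a higher-quality additional lexicon than a one-directional union.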

Semi-Supervision with Two-way Interaction
In the previous section, we presented two message passing mechanisms that enhance supervised BLI and OT-based unsupervised BLI by the prior transformations Q_unsup and Q_sup, respectively. Moreover, recent state-of-the-art (SOTA) supervised (Sup) and unsupervised (UnSup) approaches are all based on stochastic optimization rather than a closed-form solution. This means that every SOTA Sup and UnSup approach can be considered a module that operates on a feed-in parameter Q. Therefore, we propose two different strategies for semi-supervision that emphasize the two-way interaction between the supervised signal and unsupervised alignment based on the message passing mechanisms, see Figure 1. Any SOTA Sup method and OT-based UnSup method can be plugged into the proposed framework seamlessly.

Cyclic Semi-Supervision
The first proposed semi-supervised BLI strategy is CSS, see Figure 1 (a). CSS feeds the parameter Q into Sup and UnSup alternately in a cyclic parameter feeding routine. Cyclic parameter feeding is a "hard" way to share parameters; on its own, it amounts to little more than Patra et al. (2019). Beyond parameter feeding, we therefore use the message passing mechanisms BLU and POT to strengthen Sup and UnSup. However, there is no convergence guarantee for this optimization scheme; as a result, it may suffer from limited performance when the BLI task is hard, as detailed in Section 5.
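The cyclic routine can be sketched as a toy loop with hypothetical stand-ins for the real modules: a Procrustes solver plays the role of Sup, and mutual-nearest-neighbor extraction stands in for the UnSup/POT step feeding BLU. This illustrates only the parameter feeding pattern, not the paper's implementation:

```python
import numpy as np

def procrustes(S, T):
    # Closed-form orthogonal alignment of the current lexicon (Sup stand-in).
    U, _, Vt = np.linalg.svd(S.T @ T)
    return U @ Vt

def css(X, Y, lex, n_rounds=3):
    """Schematic cyclic routine: Sup learns Q on the current lexicon,
    a mutual-nearest-neighbor step (stand-in for UnSup/POT) produces
    candidate pairs, and a BLU-style update enlarges the lexicon."""
    Q = np.eye(X.shape[1])
    for _ in range(n_rounds):
        S = X[[i for i, _ in lex]]
        T = Y[[j for _, j in lex]]
        Q = procrustes(S, T)                       # Sup step on the lexicon
        D = np.linalg.norm((X @ Q)[:, None] - Y[None, :], axis=-1)
        fwd = {(i, int(D[i].argmin())) for i in range(len(X))}
        bwd = {(int(D[:, j].argmin()), j) for j in range(len(Y))}
        lex = sorted(set(lex) | (fwd & bwd))       # BLU-style lexicon update
    return Q, lex

# Toy usage: a 5-pair seed lexicon grows to cover the full vocabulary.
rng = np.random.default_rng(2)
X = rng.normal(size=(20, 4))
R, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Y = X @ R
seed = [(i, i) for i in range(5)]
Q, lex = css(X, Y, seed)
```

Starting from a small seed, each cycle refines Q and enlarges the lexicon, which is the feedback loop CSS relies on.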

Parallel Semi-Supervision
The second strategy is PSS, see Figure 1 (b), where Sup and UnSup are performed in parallel and information is passed between them only through the proposed message passing mechanisms. From this point of view, Artetxe et al. (2017) only had the Sup part with lexicon update and ignored the UnSup part. Compared to CSS, PSS shares information indirectly in a "soft" way and may be better suited to some hard BLI tasks. We use the metric formulated in Equation (4) to evaluate Q_sup and Q_unsup on the word embedding spaces and choose the better one as the final output of PSS.
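The final selection step of PSS can be illustrated with a small sketch. Here the metric of Equation (4) is instantiated with 2-norm costs and an exact assignment solver; the toy data and the helper names (`alignment_cost`, `pss_select`) are our own illustrative assumptions:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def alignment_cost(X, Y, Q):
    """Evaluate a transformation by the transport objective of Eq. (4):
    cost of the best one-to-one matching between XQ and Y."""
    D = np.linalg.norm((X @ Q)[:, None] - Y[None, :], axis=-1)
    rows, cols = linear_sum_assignment(D)  # exact OT for uniform weights
    return D[rows, cols].sum()

def pss_select(X, Y, Q_sup, Q_unsup):
    """PSS output selection: keep whichever transformation achieves
    the lower transport cost on the embedding spaces."""
    return min((Q_sup, Q_unsup), key=lambda Q: alignment_cost(X, Y, Q))

# Toy usage: Q = R aligns the clouds perfectly, the identity does not.
rng = np.random.default_rng(3)
X = rng.normal(size=(15, 4))
R, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Y = X @ R
best = pss_select(X, Y, R, np.eye(4))
```

Because Sup and UnSup never share parameters directly, this final comparison is the only place their outputs meet, which is what makes PSS robust when one branch is unstable.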

Experiment
In this section, we conduct extensive experiments to evaluate the performance of CSS and PSS. We release our source code on GitHub*.

Setup
Baselines We take several methods proposed in the recent five years as baselines, including supervised, semi-supervised, and unsupervised approaches.
Datasets We evaluate CSS and PSS against the baselines on two widely used datasets: the MUSE† dataset (Lample et al., 2018a) and the VecMap‡ dataset (Dinu and Baroni, 2015). The MUSE dataset consists of FASTTEXT word embeddings (Bojanowski et al., 2017) trained on Wikipedia corpora and more than 100 bilingual dictionaries for different languages. The FASTTEXT embeddings used in MUSE are trained on very large and semantically similar corpora (Wikipedia), which means the results on MUSE are biased (Artetxe et al., 2018a) and easier to obtain. By contrast, the VecMap dataset is less biased and harder, using CBOW embeddings trained on the WaCky crawled corpora and bilingual dictionaries obtained from the Europarl word alignments (Dinu and Baroni, 2015). We evaluate on three annotated lexicons of different sizes, including one-to-one and one-to-many mappings: "100 unique" and "5K unique" contain one-to-one mappings of 100 and 5000 source-target pairs respectively, while "5K all" contains one-to-many mappings of all 5000 source and target words, i.e., each source word may have multiple target words. Moreover, we present results for five fully unsupervised baselines and three supervised ones. All accuracies reported in this section are averages over four repetitions; for detailed experimental data, such as standard deviations, please refer to the tables in the appendix.
Hyperparameter Setting We train our models using Stochastic Gradient Descent, with a batch size of 400 and a learning rate of 1.0 for Sup, and a batch size of 8K and a learning rate of 500 for UnSup. The temperature T in Equation (7) is 0.1, and the coefficient ε in Equation (8) is set to a moderate value, as discussed in Section 3.1.

Results on MUSE Dataset
In Table 1, we show the word translation results for five language pairs from the MUSE dataset, including 10 BLI tasks considering bidirectional translation.
With "100 unique" annotated lexicon, CSS outperforms all other semi-supervised methods on every task. The accuracy score of Patra et al. (2019) is less than 3% on all tasks because the limited annotated lexicon is insufficient for effective learning, while Artetxe et al. (2017) avoided this problem by lexicon bootstrap. Both CSS and PSS keep strong performance with insufficient annotated lexicon by the proposed message passing mechanisms, and achieve 2.8% and 0.9% improvement over Artetxe et al. (2017), respectively. Compared to the iterative CSS that feeds parameters by UnSup directly into Sup, the parallel PSS has fewer connections between Sup and UnSup. Thus, CSS performance is better than PSS under low supervision.
We notice that semi-supervised approaches with the "100 unique" annotated lexicon are even worse than the unsupervised methods. This indicates that a 100-pair annotated lexicon is too weak for the supervised approach to learn a meaningful transformation. It does not mean our approach makes a marginal contribution; on the contrary, these empirical results reveal that a poor supervised BLI does not hurt the overall performance of our semi-supervised framework, which previous work cannot achieve.
As the annotated lexicon size increases, the dominance of CSS and PSS persists. Moreover, the gap between CSS and PSS disappears as the annotated lexicon grows larger. With the "5K unique" annotated lexicon, CSS and PSS outperform the other semi-supervised methods on all tasks. With the "5K all" annotated lexicon, CSS and PSS outperform the other semi-supervised baselines on 9 of 10 tasks; on average, CSS exceeds Artetxe et al. (2017). Taking all methods into consideration, including supervised, semi-supervised, and unsupervised ones, CSS and PSS achieve the highest accuracy on 8 of 10 tasks and the best results on average.

Results on VecMap Dataset
In Table 2, we show the word translation accuracy for three language pairs, including 6 translation tasks on the harder VecMap dataset (Dinu and Baroni, 2015).
Notably, a couple of unsupervised approaches (Lample et al., 2018a; Mohiuddin and Joty, 2019; Alaux et al., 2019) are evaluated to have zero accuracy on some of the language pairs. On the one hand, their volatile results demonstrate the toughness of the VecMap dataset, where the structural similarity exploited by unsupervised BLI is very low. On the other hand, the unstable performance may be explained by the high dependence of those methods on initialization. Though those methods achieve the highest scores in some cases, their results are not stable. (In the tables, we also mark the second-highest score in bold with † where necessary.)
At all supervision levels, CSS and PSS outperform all other semi-supervised approaches. Taking all unsupervised, semi-supervised, and supervised methods into account, CSS and PSS achieve SOTA accuracy on average. Notably, PSS obtains the highest or the second-highest scores (excluding the unstable unsupervised baseline (Alaux et al., 2019)) on 5 of 6 language pairs. The results for the "100 unique" annotated lexicon support our finding on the MUSE dataset that CSS learns better at a low supervision level. Interestingly, with the "5K unique" and "5K all" annotated lexicons, PSS outperforms CSS on almost every task, unlike on the MUSE dataset. Given that the structural similarity between embedding spaces of different languages in VecMap is very low, the UnSup procedure is very unstable. In this case, CSS performs worse because the unstable Q_unsup is fed directly into Sup, while the parallel strategy of PSS does not suffer from this problem.

Ablation Study
In the ablation study, we disassemble CSS and PSS into the basic components to analyze the contribution of each component. Specifically, we consider the proposed two message passing mechanisms POT and BLU. For CSS, we also include the effect of Sup or UnSup module in the cyclic parameter feeding. However, if Sup or UnSup in PSS is removed, the framework falls back to the unsupervised or supervised BLI, whose results are already in Table 1 and 2.
We conduct ablation experiments with two annotated lexicons of different sizes, "5K all" and "1K unique", to compare the behavior of CSS and PSS under different annotation levels. The experimental setting is the same as in the main experiments. The ablation results on four language pairs (two from the MUSE dataset and two from the VecMap dataset) are presented in Table 3.

Effectiveness of POT and BLU:
Regardless of the annotated lexicon size, removing POT, BLU, or both from CSS brings a 2.4%, 0.9%, and 13.0% decline in accuracy respectively on average. Notably, cyclic parameter feeding alone does not bring further benefits; only when combined with at least one message passing mechanism, POT or BLU, is the accuracy improved significantly. For PSS, removing POT or BLU brings a 1.6% and 1.0% decline in the average score, respectively.
Moreover, we consider different annotated lexicon sizes. On average, removing POT, BLU, or both from CSS brings a 1.2%, 0.7%, and 4.2% decline respectively with the "5K all" annotated lexicon, and a 3.4%, 0.9%, and 21.6% decline with the "1K unique" annotated lexicon. The message passing mechanisms contribute drastically with a smaller annotated lexicon because Sup receives significantly larger additional lexicons from UnSup to strengthen its performance. As for PSS, removing POT or BLU brings a 0.9% and 1.5% decline respectively with the "5K all" annotated lexicon, and a 2.3% and 0.7% decline with the "1K unique" annotated lexicon; no significant effect of the annotation level on PSS is observed in the ablation study. For both CSS and PSS, the contribution of POT is slightly larger than that of BLU, and their combination brings impressive improvement in general.

Analysis of Sup and UnSup in CSS:
In this step, we remove Sup or UnSup from CSS and monitor the performance change. Note that if we remove UnSup from CSS, POT also needs to be removed, as no prior transport plan is needed for UnSup anymore; removing Sup likewise entails removing BLU. After removing UnSup and POT, CSS feeds Q_sup directly to BLU for the additional lexicon and then back to Sup, just like Artetxe et al. (2017, 2018a). After removing Sup and BLU, UnSup takes the transformation learned by itself in previous steps to generate the prior transport plan. After removing UnSup, the average accuracy drops by 1.5% and 4.5% with the "5K all" and "1K unique" annotated lexicons respectively; after removing Sup, it drops by 1.7% and 0.8%. Given this comparison, Sup contributes less than UnSup with the "1K unique" annotated lexicon, whereas Sup and UnSup contribute comparably with "5K all". In other words, at a low annotation level, i.e. "1K unique", where supervised BLI does not work well, the participation of UnSup contributes the valuable additional lexicon.

Results on Distant Language Pairs
In this section, we report the translation accuracy of our method on five distant language pairs with a 5000-pair lexicon. We choose three baselines: Patra et al. (2019), the semi-supervised SOTA method; Jawanpuria et al. (2019), the supervised SOTA method; and Zhou et al. (2019), who designed an unsupervised matching procedure with density matching technology that achieved significant improvement on distant language pairs. As we need to compare supervised, unsupervised, and semi-supervised methods simultaneously, we conduct this evaluation only at the "5K unique" supervision level.
As shown in Table 4, our method also retains a distinct advantage on these distant language pairs. In the cases between "EN" and "JA", Patra et al.

Related Work
This paper is mainly related to the following three lines of work.
Supervised methods. Mikolov et al. (2013) pointed out that BLI is feasible by learning a linear transformation based on the Euclidean distance. Artetxe et al. (2016) applied normalization to word embeddings and imposed an orthogonal constraint on the linear transformation, which leads to a closed-form solution. Joulin et al. (2018) replaced the Euclidean distance with the RCSLS distance to relieve the hubness phenomenon and achieved SOTA results for many languages.
Unsupervised methods. One line of OT-based work (2019) used the RCSLS as the distance metric, which addresses the hubness phenomenon better than the Euclidean distance. Zhao et al. (2020) proposed a relaxed matching procedure derived from unbalanced OT algorithms and solved the polysemy problem to a certain extent. Xu et al. (2018) used a neural network implementation to calculate the Sinkhorn distance, a well-defined OT-based distributional similarity measure, and optimized the objective through back-propagation.
Semi-supervised methods. Artetxe et al. (2017) proposed a simple self-learning approach that can be combined with any dictionary-based mapping technique and started with almost no lexicon. Patra et al. (2019) proposed a semi-supervised approach that relaxes the isometric assumption and optimizes a supervised loss and an unsupervised loss together.
Notably, compared with self-learning methods like Artetxe et al. (2018a), our framework with two message passing mechanisms is quite different. Although the lexicon updating procedures in their papers are similar to our BLU, there are two main differences: (1) Their approaches use the lexicon from the current step to extract the lexicon for the next step, whereas BLU uses the unsupervised output to extract the lexicon for the supervised part. Our models degenerate to their setting after removing the unsupervised part and POT, a situation discussed in the ablation study in Section 5.4. (2) BLU extracts the lexicon according to bidirectional matching information, while they consider only one direction; this improves the lexicon quality.
Moreover, alignment of word embeddings in latent spaces by Auto-Encoders or other projections is another trend of BLI research. Latent space alignment includes unsupervised variants (Dou et al., 2018;Bai et al., 2019;Mohiuddin and Joty, 2019) and semi-supervised variants (Mohiuddin et al., 2020). We emphasize that the latent space alignment is orthogonal to our proposed framework. Our entire framework can be transferred to any given latent space.

Conclusions
In this paper, we introduce the two-way interaction between the supervised signal and unsupervised alignment through the proposed POT and BLU message passing mechanisms. POT guides OT-based unsupervised BLI with a prior BLI transformation. BLU employs bidirectional retrieval to enlarge the annotated data and stabilize the training of supervised BLI approaches. The ablation study shows that the two-way interaction realized by POT and BLU is the key to the significant improvement.
Based on the message passing mechanisms, we design two strategies of semi-supervised BLI to integrate supervised and unsupervised approaches, CSS and PSS, which are built on cyclic and parallel strategies respectively. The results show that CSS and PSS achieve SOTA results on two popular datasets. As CSS and PSS are compatible with any supervised BLI and OT-based unsupervised BLI approaches, they can also be applied to latent space optimization.

Table 2 (appendix): Detailed experimental results for the ablation study. We repeat the experiment on each language pair four times and report best, avg, and st (best: the highest @1 accuracy; avg: the average accuracy reported in the main body of this paper; st: the standard deviation).