Classification-Based Self-Learning for Weakly Supervised Bilingual Lexicon Induction

Effective projection-based cross-lingual word embedding (CLWE) induction critically relies on an iterative self-learning procedure, which gradually expands the initial small seed dictionary to learn improved cross-lingual mappings. In this work, we present ClassyMap, a classification-based approach to self-learning, yielding a more robust and more effective induction of projection-based CLWEs. Unlike prior self-learning methods, our approach allows for the integration of diverse features into the iterative process. We show the benefits of ClassyMap for bilingual lexicon induction: we report consistent improvements in a weakly supervised setup (500 seed translation pairs) on a benchmark with 28 language pairs.


Introduction and Motivation
Cross-lingual word embeddings (CLWEs), that is, representations of words in a shared cross-lingual vector space, enable multilingual modeling of meaning and facilitate cross-lingual transfer for downstream NLP tasks. One of their primary use cases is bilingual lexicon induction (BLI), that is, learning translation correspondences across languages, which benefits the development of core language technology also for resource-poor languages and domains (Adams et al., 2017; Smith et al., 2017; Heyman et al., 2018; Hangya et al., 2018). Earlier work focused on joint CLWE induction from bilingual corpora, relying on word-level (Klementiev et al., 2012; Gouws and Søgaard, 2015), sentence-level (Zou et al., 2013; Hermann and Blunsom, 2014; Coulmance et al., 2015; Levy et al., 2017), or document-level supervision (Vulić and Moens, 2016). However, recent focus is predominantly on post-hoc alignment of independently trained monolingual word embeddings: the so-called projection-based or mapping approaches (Mikolov et al., 2013; Conneau et al., 2018; Joulin et al., 2018; Artetxe et al., 2018b; Patra et al., 2019). Such methods are particularly suitable for weakly supervised learning setups: they support CLWE induction with only a few thousand word translation pairs as the bilingual supervision.[1] One critical component of weakly supervised projection-based CLWEs is a self-learning procedure that iteratively refines the initial seed dictionary to learn projections of increasingly higher quality. This process leads to substantial improvements over the initially mapped space, especially with smaller seed dictionaries (Artetxe et al., 2017). However, current self-learning procedures are still rather basic, typically relying only on direct extraction of (mutual) nearest neighbors from the current shared space (Conneau et al., 2018; Artetxe et al., 2018b).

* Equal contribution.
In this work, we propose a more sophisticated self-learning procedure for weakly supervised projection-based CLWE methods, and show its benefits for a wide range of language pairs. We frame self-learning as an iterative classification-based process, which yields several benefits over previously used self-learning mechanisms. 1) It enables the integration of a variety of heterogeneous features at different levels of granularity (e.g., word-level vs. orthographic features); some translation cues (e.g., subword-level overlap) have been ignored by previous self-learning approaches. 2) It allows us to control for the reliability of translation pairs considered as candidates for the dictionary updates in the current iteration; effectively, this helps reduce noise in the process as the training dictionary grows. 3) As suggested by prior work on classification-based BLI (Irvine and Callison-Burch, 2017; Heyman et al., 2017), framing the actual BLI task as a classification problem results in further gains in the final BLI performance.

[1] In the extreme, fully unsupervised projection-based CLWEs extract such seed bilingual lexicons from scratch on the basis of monolingual data only (Conneau et al., 2018; Artetxe et al., 2018b; Hoshen and Wolf, 2018; Alvarez-Melis and Jaakkola, 2018; Chen and Cardie, 2018; Mohiuddin and Joty, 2019, inter alia). However, as shown in recent comparative empirical analyses, using seed sets of only 500-1,000 translation pairs, with all other components equal, always outperforms fully unsupervised methods. Therefore, we focus on a more natural weakly supervised setup (Artetxe et al., 2020) instead, i.e., we assume the existence of at least 500 seed translations for each language pair in consideration.
We extensively evaluate our classification-based self-learning procedure, termed CLASSYMAP, on the standard BLI data set spanning 28 pairs of diverse languages. The integration of the proposed self-learning method into VECMAP (Artetxe et al., 2018b), a state-of-the-art projection-based CLWE framework, yields substantial gains over previous self-learning procedures.[2] We demonstrate that the improvements are indeed achieved through the synergy of the diverse features used by the classifier. We also demonstrate further BLI improvements when we treat BLI itself as a supervised classification task.
Classification-Based Self-Learning

Projection-Based CLWE Methods (linearly) align independently trained monolingual word embeddings X_1 of the source language L_1 and X_2 of the target language L_2, using a seed word translation dictionary D (Mikolov et al., 2013; Artetxe et al., 2018a). Working in weakly supervised setups, we assume the existence of some translation pairs (≈500 pairs) in D. Let X_1,D ⊂ X_1 and X_2,D ⊂ X_2 refer to the row-aligned subsets of the monolingual embedding spaces containing the vectors of translation pairs from D. These are used to learn orthogonal transformations T_1 and T_2 that define the final shared cross-lingual space W_cl = W_1 ∪ W_2, where W_1 = X_1 T_1 and W_2 = X_2 T_2.
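The orthogonal transformations above are typically obtained by solving an orthogonal Procrustes problem over the dictionary-aligned vectors. The following is a minimal numpy sketch, simplified to a single source-to-target mapping rather than VecMap's exact two-sided formulation:

```python
import numpy as np

def learn_orthogonal_map(X_src, X_tgt):
    """Orthogonal Procrustes: find the orthogonal T minimising
    ||X_src @ T - X_tgt||_F, via SVD of the cross-covariance matrix."""
    U, _, Vt = np.linalg.svd(X_src.T @ X_tgt)
    return U @ Vt

# Toy check: recover a known rotation from 500 "dictionary" pairs.
rng = np.random.default_rng(0)
X1_D = rng.normal(size=(500, 50))                # source vectors of seed pairs
R, _ = np.linalg.qr(rng.normal(size=(50, 50)))   # a random orthogonal map
X2_D = X1_D @ R                                  # target vectors of seed pairs
T = learn_orthogonal_map(X1_D, X2_D)
assert np.allclose(X1_D @ T, X2_D, atol=1e-6)    # the rotation is recovered
```

The SVD-based closed-form solution is the standard choice for this step in projection-based CLWE work.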
Our departure point is a standard self-learning setup from related work (Artetxe et al., 2018b; Conneau et al., 2018), outlined in the following. At each iteration k, the dictionary D^(k) is first used to learn the joint space W^(k)_cl; nearest neighbours in W^(k)_cl are then used to extract the new dictionary D^(k+1). Previous work typically relies on a variant of mutual nearest neighbours in the aligned embedding space of the current iteration to select likely translation candidates for the next. However, as hinted at by Lubin et al. (2019), this procedure still results in many noisy candidates being inserted into the extended seed sets, and the error may get amplified over subsequent iterations.

[2] We use VECMAP due to its very competitive and robust BLI performance according to recent comparative studies (Doval et al., 2019). We note that our methodology is equally applicable to other projection-based methods that employ self-learning, e.g., (Conneau et al., 2018; Mohiuddin and Joty, 2019), and our preliminary results with other methods suggest similar benefits stemming from the classification-based approach.

Algorithm 1: Classification-based self-learning.
New Self-Learning Procedure. Therefore, we propose a more versatile self-learning process. We train a supervised classifier in each iteration: given a word pair, it produces a probability score denoting the extent to which the pair is a correct translation pair. The classifier can be fed a wide range of features at the character, subword, and word level.
We apply the classifier in two ways. First, at iteration k the classification scores are used to select likely translation candidates, which are added to the dictionary D^(k+1) for iteration k+1. Second, similar to Heyman et al. (2017), at test time we use the classifier scores to rerank translation candidates produced by 1) finding nearest neighbours in the final aligned embedding space and 2) considering orthographically similar candidates. A high-level overview of the proposed classification-based self-learning procedure is outlined in Algorithm 1.
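The overall loop of Algorithm 1 can be sketched as follows. This is a schematic Python rendering with the paper's default hyperparameters (n = 30 iterations, P = 1000 candidates, K = 500 additions); `align`, `extract_candidates`, and `train_classifier` are hypothetical stand-ins for the VecMap mapping step, nearest-neighbour candidate extraction, and the pair classifier:

```python
def self_learning(D0, align, extract_candidates, train_classifier,
                  n_iters=30, P=1000, K=500):
    """Sketch of Algorithm 1: iteratively re-align, re-train the classifier,
    and grow the dictionary with the most reliable candidate pairs."""
    D, clf = list(D0), None
    for _ in range(n_iters):
        W = align(D)                        # map both spaces with current dict
        clf = train_classifier(D, W)        # positives from D + sampled negatives
        cands = extract_candidates(W, P)    # P nearest-neighbour candidate pairs
        cands.sort(key=clf, reverse=True)   # rank by classifier confidence
        D += [c for c in cands[:K] if c not in D]  # keep the K most reliable
    return D, clf
```

The key difference from earlier self-learning is the `train_classifier`/`sort` step: candidates enter the dictionary only if the classifier scores them highly, rather than by raw nearest-neighbour status.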
Self-Learning: Components. For implementing the AlignEmbeddings operation (see Algorithm 1) we rely on the VECMAP system (Artetxe et al., 2018b) in its supervised variant. The nn function returns word pairs that are nearest neighbours in a given aligned embedding space. The TrainClassifier functionality can be instantiated using any standard classification framework; in this work, we opt for a simple multi-layer perceptron with a single hidden layer.
A very important design choice concerns generating negative training examples for the classifier. All word pairs in the dictionary at the current iteration, D^(k), are used as positive examples. For each positive pair (s, t), we generate two negative examples: 1) (s, x), where x is sampled uniformly from the N_o target words that are orthographically most similar to s (measured by edit distance); 2) (s, y), where y is sampled uniformly from the N_c target words closest (by cosine) to s in the current space W^(k)_cl. This strategy performed considerably better than randomly generating negative examples. The intuition is as follows: at test time the classifier must operate on word pairs that are generated using nearest-neighbour search. Such word pairs are not random, but are rather very close in the aligned embedding space and are often orthographically similar. Thus, this strategy for generating negative samples makes the training conditions for the classifier better reflect the test conditions.
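A minimal sketch of this negative-sampling scheme (the word lists and the `cosine_neighbours` callback are illustrative assumptions, not the paper's implementation):

```python
import random

def edit_distance(a, b):
    # standard dynamic-programming Levenshtein distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def sample_negatives(s, t, target_words, cosine_neighbours, N_o=5, N_c=5,
                     rng=random):
    """For a positive pair (s, t), draw one 'hard' orthographic negative and
    one 'hard' semantic negative; `cosine_neighbours(s, n)` is assumed to
    return the n target words closest to s in the current aligned space."""
    pool = [w for w in target_words if w != t]
    ortho = sorted(pool, key=lambda w: edit_distance(s, w))[:N_o]
    close = [w for w in cosine_neighbours(s, N_c) if w != t] or pool
    return [(s, rng.choice(ortho)), (s, rng.choice(close))]

# Toy usage for the (illustrative) positive pair ("dog", "hund").
neighbours = lambda s, n: ["katze", "wolf"][:n]
negs = sample_negatives("dog", "hund", ["hond", "hand", "mund", "katze"],
                        neighbours, N_o=2, N_c=2, rng=random.Random(0))
```

Both negatives share their source word with the positive pair, so the classifier is forced to discriminate among confusable candidates rather than between a translation and an arbitrary word.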
Features. The classification-based approach allows for the integration of a wide spectrum of diverse features that capture different word translation evidence. We outline the sets of features used in this work, computed for each word pair (s, t).
F1. Edit distance - Levenshtein and Jaro-Winkler distance between s and t (Cohen et al., 2003). Following Heyman et al. (2017), we also include the normalized edit distance, the log of the rank of t in a list sorted by edit distance with respect to s, as well as the product of these two values.
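A sketch of the F1 feature computation (Jaro-Winkler is omitted here for brevity, and the feature names are illustrative):

```python
import math

def levenshtein(a, b):
    # dynamic-programming edit distance
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def f1_features(s, t, target_words):
    """Edit-distance features for a candidate pair (s, t)."""
    d = levenshtein(s, t)
    norm = d / max(len(s), len(t))                # normalized edit distance
    # rank of t among all target words sorted by edit distance to s
    ranked = sorted(target_words, key=lambda w: levenshtein(s, w))
    log_rank = math.log(1 + ranked.index(t))
    return {"edit": d, "norm_edit": norm,
            "log_rank": log_rank, "product": norm * log_rank}
```

For example, `f1_features("nacht", "night", ["night", "day", "knight"])` gives an edit distance of 2 and a log-rank of 0, since "night" is the closest target word to "nacht" by edit distance.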
F2. Cosine similarity - the cosine similarity of s and t in the current aligned space W^(k)_cl.
F5. Character n-grams - we extract all character n-grams and use χ2 feature selection to select the 10 most indicative ones. The intuition is to allow the model to recognize indicative prefixes or suffixes.
F6. Subword-level similarity - we use multilingual subword embeddings (SWEs) based on BPEs (Heinzerling and Strube, 2018). We add the following features: i) the cosine similarity between the averaged BPE vectors of s and of t, ii) the maximum pairwise cosine similarity over all pairs of SWEs (one from s and the other from t), and iii) the Earth Mover's Distance between the two sets of SWEs (Kusner et al., 2015).
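A sketch of the three F6 features for two sets of subword vectors. As an assumption for brevity, the exact Earth Mover's Distance is replaced here by its cheap "relaxed" lower bound (each subword matched to its nearest counterpart, as in the relaxed variant of Kusner et al.'s word mover's distance), which is not necessarily the paper's implementation:

```python
import numpy as np

def cos(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def f6_features(S, T):
    """Subword-similarity features; S and T are arrays whose rows are the
    BPE subword embeddings of the source and target word, respectively."""
    avg_cos = cos(S.mean(axis=0), T.mean(axis=0))        # i) averaged BPEs
    pair_cos = [[cos(s, t) for t in T] for s in S]
    max_pair_cos = float(np.max(pair_cos))               # ii) best SWE pair
    dists = np.array([[np.linalg.norm(s - t) for t in T] for s in S])
    relaxed_emd = float(dists.min(axis=1).mean())        # iii) EMD lower bound
    return avg_cos, max_pair_cos, relaxed_emd

# Sanity check: identical subword sets are maximally similar.
S = np.array([[1.0, 0.0], [0.0, 1.0]])
a, m, e = f6_features(S, S.copy())
```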
F7. Frequencies - we provide the rank of each word in a list of all words sorted by frequency. The ranks are normalized by the total number of words.
At test time, if we use the classifier to perform the final reranking, we take for each source word s a set of candidate target-word translations as the union of 1) the top N_ro target-word neighbours of s by edit distance, and 2) the top N_rc target-word neighbours of s by cosine similarity in the final aligned space W_cl. We then score the N_ro + N_rc candidates using the classifier from the last self-learning iteration.
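The reranking step can be sketched as follows; the neighbour functions are assumed callbacks into the final aligned space, and the toy scorer is purely illustrative:

```python
def rerank(s, clf_score, edit_neighbours, cosine_neighbours, N_ro=3, N_rc=3):
    """Test-time reranking sketch: pool the top-N_ro edit-distance candidates
    with the top-N_rc cosine candidates, then return the candidate the
    classifier scores highest."""
    cands = set(edit_neighbours(s, N_ro)) | set(cosine_neighbours(s, N_rc))
    return max(cands, key=lambda t: clf_score(s, t))

# Toy usage with illustrative candidate lists and a dummy length-based scorer.
best = rerank("noche",
              clf_score=lambda s, t: len(t),
              edit_neighbours=lambda s, n: ["noc", "nok"][:n],
              cosine_neighbours=lambda s, n: ["night", "evening"][:n],
              N_ro=2, N_rc=2)
```

Pooling the two candidate sources is what lets orthographic evidence rescue translations that narrowly miss the nearest-neighbour cut in the embedding space, and vice versa.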

Experimental Setup
Monolingual Vectors and BLI Data. , and Finnish (FI). As our focus is on weakly supervised setups, we use only 500 translation pairs as our initial seed dictionary. We report BLI performance using the standard Precision@1 (P@1) measure.
Classifier Details. We use the Adam optimizer (Kingma and Ba, 2015) and regularize the model via an L2 penalty on the weights and early stopping on 10% of held-out data. Early stopping is performed for each language pair separately, while other hyperparameter values are found by grid search maximizing a three-fold cross-validation score on the training data for a randomly selected language pair (EN-HR), and reused in all other experiments.
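A minimal scikit-learn stand-in for this classifier configuration (the hidden size, alpha value, and toy data are illustrative assumptions; only the single hidden layer, Adam optimizer, L2 penalty, and 10% early-stopping split mirror the text):

```python
import numpy as np
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))              # toy word-pair feature vectors
y = (X[:, 0] + X[:, 1] > 0).astype(int)    # toy "is a translation" labels

clf = MLPClassifier(hidden_layer_sizes=(16,),   # single hidden layer
                    solver="adam",              # Adam optimizer
                    alpha=1e-4,                 # L2 penalty on the weights
                    early_stopping=True,        # early stopping ...
                    validation_fraction=0.1,    # ... on 10% held-out data
                    max_iter=500, random_state=0)
clf.fit(X, y)
scores = clf.predict_proba(X)[:, 1]   # probability scores used to rank pairs
```

The `predict_proba` output plays the role of the pair-reliability score used both for dictionary updates and for the final reranking.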
Hyperparameters. We find values for the other hyperparameters on held-out data for a randomly chosen language pair: EN-HR. Unless otherwise stated, we fix them to the following values for all other experiments and language pairs. In Algorithm 1, P = 1000, K = 500, n = 30. Further, we sample 2 negative examples for each positive example from sets of size N_o = N_c = 5, and set N_ro = N_rc = 3 when doing the final reranking. We note that more careful tuning of these values could lead to further improvements.
Baselines. We compare to the VECMAP system (Artetxe et al., 2018b) in its semi-supervised variant, a robust and highly competitive self-learning framework.

Results and Discussion
The main results over a representative selection of language pairs and setups are provided in Table 1; full results over all 28 pairs are provided in Appendix A. The results indicate several important findings. First, classification-based self-learning is more powerful than the standard VECMAP self-learning: we observe gains on 22/28 pairs using CLASSYMAP without the final reranking step, even without language pair-dependent fine-tuning. Second, framing BLI as a classification task leads to further gains: we report improvements on 25/28 pairs using CLASSYMAP with the final reranking step over both the supervised and semi-supervised VECMAP variants. Using reranking with CLASSYMAP thus seems useful across the board.[6] As a side finding, our results also revalidate the evident usefulness of the self-learning procedure for weakly supervised setups in general: the average P@1 score across All languages of a supervised VECMAP method based on the same initial dictionary, but without any self-learning, is only 0.111, while we report an average of 0.365 (with final reranking) in Table 1.
Importantly, the gains seem more pronounced for more "difficult", typologically dissimilar, and morphologically rich language pairs such as TR-RU or DE-TR than for similar languages such as IT-FR, with more isomorphic monolingual spaces (Søgaard et al., 2018). To analyze this further, we have run additional experiments on BLI evaluation sets comprising more typologically distant language pairs,[7] with similar conclusions. For instance, with 500 seed pairs CLASSYMAP with reranking scores 24.6 P@1 for Estonian-Esperanto and 16.6 for Hungarian-Basque; the strongest baselines achieve P@1 of 20.0 and 13.8, respectively. In sum, our classification-based approach holds promise to guide future work, especially on distant pairs.

[6] We have also probed a variant where we learn a classifier for the final reranking step on top of VECMAP's output after its self-learning procedure. However, as suggested by the results in Table 1, this leads to drops in performance compared to standard semi-supervised VECMAP. We speculate that this is due to higher levels of noise in the final VECMAP dictionary.
[7] github.com/cambridgeltl/panlex-bli

Table 1: P@1 BLI scores for a selection of language pairs. We also report the average scores over pairs that include English (EN-X) and those that do not (No EN), as well as the averages over all pairs (All). The a/b score format denotes a score without (a) and with (b) the final reranking step. All improvements of CLASSYMAP with reranking over the strongest baseline (i.e., VECMAP with self-learning) are significant (p<0.05) according to the non-parametric shuffling test (Yeh, 2000) with the Bonferroni correction.
Step Size and the Number of Iterations. We now analyze how two vital components of self-learning impact the final BLI scores: 1) the number of added dictionary entries per iteration (i.e., step size, see Table 3), and 2) the number of iterations (Figure 1). For brevity, we run the analyses on several "difficult" language pairs: DE-RU, TR-FI, HR-FR, and EN-FI. The results suggest that the step size has only moderate impact on the final scores, and is language pair-dependent. However, all three options improve over the baseline self-learning method, and final reranking is again useful across the board. According to Figure 1, the optimal number of iterations is also pair-dependent: TR-FI performance steadily increases over time, while DE-RU hits the peak after only 5 iterations and steadily declines afterwards. This finding calls for a more careful tuning of this parameter in future work.
Feature Ablation Analysis. We also perform an ablation analysis, reported in Table 4. Overall, the results suggest that different features contribute to the final performance. This corroborates our hypothesis that one of the main advantages of the classification-based approach is its ability to fuse different sources of translation evidence. However, there are cases (e.g., using BPE features for DE-RU or TR-FI) where a feature set can negatively affect performance. In sum, this small ablation study warrants finer-grained and language pair-dependent feature selection in future work.

[Table 2 appears here: P@1 scores for seed dictionary sizes of 500, 1k, 3k, and 5k pairs.]
Seed Dictionary Size. We also provide additional results when varying the size of the initial seed dictionary in Table 2. The main finding is that, while the absolute BLI scores are naturally higher with larger seed dictionaries, CLASSYMAP remains useful even with much larger dictionaries (see the results with 3k and 5k seed pairs). CLASSYMAP with reranking remains the strongest BLI method, corroborating our previous findings.

Conclusion and Future Work
We introduced CLASSYMAP, a novel classification-based approach to self-learning, which is a crucial component of projection-based cross-lingual word embedding induction models in low-data regimes. We reported its usefulness and robustness across a wide spectrum of diverse language pairs in the BLI task, confirming the benefits of learning classifiers both as part of the self-learning procedure and for the final word retrieval in the BLI task. This proof-of-concept work opens up a wide spectrum of interesting avenues for future research, including the use of more powerful classifiers, more sophisticated features (e.g., character-level transformers), and fine-grained linguistic analyses of the importance of disparate features across different language pairs. One particularly exciting direction is the application of our classification-based self-learning framework on top of the most recent methods that induce bilingual spaces via non-linear alignments (Glavaš and Vulić, 2020; Mohiuddin and Joty, 2020).

Table 5: P@1 BLI scores for all 28 language pairs. We report scores of 1) VECMAP in the supervised setting without self-learning, 2) VECMAP and CLASSYMAP with only self-learning but without reranking (SL), and 3) VECMAP and CLASSYMAP with both self-learning and reranking (SL+R). All models start with the same seed set of 500 word translation pairs.