Improving Bilingual Lexicon Induction with Unsupervised Post-Processing of Monolingual Word Vector Spaces

Work on projection-based induction of cross-lingual word embedding spaces (CLWEs) predominantly focuses on the improvement of the projection (i.e., mapping) mechanisms. In this work, in contrast, we show that a simple method for post-processing monolingual embedding spaces facilitates learning of the cross-lingual alignment and, in turn, substantially improves bilingual lexicon induction (BLI). The post-processing method we examine is grounded in the generalisation of first- and second-order monolingual similarities to the nth-order similarity. Since the post-processing is applied to the monolingual spaces before the cross-lingual alignment, the method can be coupled with any projection-based method for inducing CLWE spaces. We demonstrate the effectiveness of this simple monolingual post-processing across a set of 15 typologically diverse languages (i.e., 15×14 = 210 BLI setups), and in combination with two different projection methods.

Importantly, CLWEs are one of the central mechanisms for facilitating the transfer of language technologies to low-resource languages, which often lack bilingual signal sufficient for more direct transfer via machine translation. The lack of language resources is the main reason for the popularity of the so-called projection-based CLWE methods (Mikolov et al., 2013a; Artetxe et al., 2016, 2018a). These models align two independently trained monolingual word vector spaces post-hoc, using limited bilingual supervision in the form of several hundred to several thousand word translation pairs (Mikolov et al., 2013a; Vulić and Korhonen, 2016; Joulin et al., 2018; Ruder et al., 2018). Some models even align the monolingual spaces using only identical strings (Smith et al., 2017; Søgaard et al., 2018) or numerals (Artetxe et al., 2017). The most recent work has focused on fully unsupervised CLWE induction: such methods extract seed translation lexicons by relying on topological similarities between the monolingual spaces (Conneau et al., 2018; Artetxe et al., 2018a; Hoshen and Wolf, 2018; Alaux et al., 2019).
In this work, we do not focus on the projection itself: rather, we investigate a transformation of the input monolingual word vector spaces that facilitates the projection and leads to higher-quality CLWEs. Regardless of the actual projection method, the quality of the input monolingual spaces has a profound impact on the induced shared cross-lingual space and, in turn, on the quality of induced bilingual lexicons. We demonstrate that simple unsupervised post-processing of monolingual embedding spaces leads to substantial BLI performance gains across a large number of language pairs. Our work is inspired by the observation that monolingual "embeddings capture more information than what is immediately obvious" (Artetxe et al., 2018c). In other words, the information surfaced in the pretrained monolingual vector spaces may not be optimal for an application such as word-level translation (BLI).
We rely on a monolingual post-processing method of Artetxe et al. (2018c): a linear transformation controlled by a single parameter that adjusts the similarity order of the input embedding spaces. We demonstrate that applying this transformation to both monolingual spaces before any standard projection-based CLWE framework yields consistent BLI gains for a wide array of languages. We run a large-scale BLI evaluation with 15 typologically diverse languages (i.e., 15×14 = 210 BLI setups) and show that this simple monolingual post-processing yields gains in 183/210 setups over the current state-of-the-art BLI models, which combine self-learning (Artetxe et al., 2018a) with (weak) word-level supervision (Vulić et al., 2019). We further show that this monolingual post-processing yields improvements on other BLI datasets (Glavaš et al., 2019), for different projection-based CLWE models, and also for BLI over 210 setups with similar (major European) languages (Dubossarsky et al., 2020), indicating the importance and robustness of monolingual post-processing for BLI.

Methodology
Projection-Based CLWEs: Preliminaries. Projection-based CLWE models learn a linear projection between two independently trained monolingual spaces, X (source language L_s) and Z (target language L_t), using a word translation dictionary D to guide the alignment. X_D ⊂ X and Z_D ⊂ Z denote the row-aligned subsets of X and Z containing vectors of aligned words from D. X_D and Z_D are used to learn orthogonal projections W_x and W_z defining the bilingual space Y = XW_x ∪ ZW_z. While (weakly) supervised methods start from a readily available dictionary D, fully unsupervised models automatically induce the seed dictionary D (i.e., from monolingual data). Furthermore, it has been empirically validated (Artetxe et al., 2017; Vulić et al., 2019) that applying an iterative self-learning procedure leads to consistent BLI improvements, especially for distant languages and in low-data regimes. In a nutshell, at each self-learning iteration k, a dictionary D^(k) is first used to learn the joint space Y^(k) = XW_x^(k) ∪ ZW_z^(k). The mutual cross-lingual nearest neighbours in Y^(k) are then used to extract the new dictionary D^(k+1). Relying on mutual nearest neighbours partially removes the noise, leading to better performance. For more technical details on self-learning, we refer the reader to prior work (Ruder et al., 2019a; Vulić et al., 2019).
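To make these preliminaries concrete, the following is a minimal NumPy sketch of one self-learning iteration under the standard orthogonal (Procrustes) solution. It is an illustrative simplification, not the actual VecMap implementation: for brevity it maps only the source space with a single orthogonal matrix W_x (the frameworks discussed above learn projections for both spaces), and the function names are our own.

```python
import numpy as np

def orthogonal_projection(X_D, Z_D):
    """Orthogonal Procrustes solution for row-aligned seed-dictionary
    matrices X_D and Z_D (shape: |D| x dim)."""
    U, _, Vt = np.linalg.svd(X_D.T @ Z_D)
    return U @ Vt  # orthogonal map taking the source space towards the target space

def mutual_nn_dictionary(Xp, Zn):
    """Extract mutual cross-lingual nearest neighbours (by cosine similarity,
    assuming unit-length rows) as the next seed dictionary."""
    sims = Xp @ Zn.T
    fwd = sims.argmax(axis=1)  # best target for each source word
    bwd = sims.argmax(axis=0)  # best source for each target word
    return [(i, int(fwd[i])) for i in range(len(fwd)) if bwd[fwd[i]] == i]

def self_learning_step(X, Z, dictionary):
    """One iteration: learn a projection from the current dictionary D^(k),
    then re-extract the dictionary D^(k+1) from mutual nearest neighbours."""
    src_idx, trg_idx = zip(*dictionary)
    W_x = orthogonal_projection(X[list(src_idx)], Z[list(trg_idx)])
    Xp = X @ W_x
    Xp /= np.linalg.norm(Xp, axis=1, keepdims=True)
    Zn = Z / np.linalg.norm(Z, axis=1, keepdims=True)
    return mutual_nn_dictionary(Xp, Zn)
```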
Motivation. Most existing CLWE models ignore the properties of the initial monolingual spaces X and Z (i.e., the spaces are taken "as is") and focus on improving the projection. However, monolingual post-processing of X and Z prior to learning the projections may facilitate the projection and be beneficial for iterative setups such as self-learning. This intuition is already confirmed by a number of monolingual transformations, e.g., ℓ2-normalisation, mean centering, or whitening/de-whitening, that are performed "by default" by toolkits such as MUSE (Conneau et al., 2018) and VecMap (Artetxe et al., 2018b; Zhang et al., 2019). In this work, however, we investigate a transformation of the monolingual spaces which is applied before they undergo the series of standard normalisation and centering steps.
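For reference, a minimal sketch (under our own assumption about the exact ordering of steps) of the standard pre-mapping transformations mentioned above, in the spirit of what toolkits such as VecMap apply by default; the post-processing investigated in this work would be applied to the raw embedding matrices before such a function is called.

```python
import numpy as np

def standard_preprocess(E):
    """Typical pre-mapping pipeline: unit-length normalisation,
    mean centering, then re-normalisation to unit length."""
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # l2-normalise each word vector
    E = E - E.mean(axis=0, keepdims=True)             # centre each dimension
    E = E / np.linalg.norm(E, axis=1, keepdims=True)  # restore unit length
    return E
```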
Further, we build on a line of research that leverages unsupervised post-processing of monolingual word vectors (Mu et al., 2018; Wang et al., 2018; Raunak et al., 2019; Tang et al., 2019) to emphasise semantic properties over syntactic aspects, typically with small gains reported on intrinsic word similarity tasks (e.g., SimLex-999 (Hill et al., 2015)). In this work, we empirically validate that such unsupervised post-processing techniques can also be effective in cross-lingual scenarios for low-resource BLI, even when coupled with the current state-of-the-art CLWE frameworks that rely on "all the bells and whistles", such as self-learning and additional vector space preprocessing.
Unsupervised Monolingual Post-processing. We now outline the simple post-processing method of Artetxe et al. (2018c) used in this work, and then extend it to the bilingual setup. The core idea is to generalise the notion of first- and second-order similarity (Schütze, 1998) to nth-order similarity. Let us define the (standard, first-order) similarity matrix of the source language space X as M_1(X) = XX^T (and analogously for Z). The second-order similarity can then be defined as M_2(X) = XX^T XX^T, where it holds that M_2(X) = M_1(M_1(X)); the nth-order similarity is then M_n(X) = (XX^T)^n. The embeddings of words w_i and w_j are given by rows i and j of each M_n matrix.
We are then looking for a general linear transformation that adjusts the similarity order of the input matrices X and Z. As proven by Artetxe et al. (2018c), the nth-order similarity transformation can be obtained as M_n(X) = M_1(XR_(n−1)/2), with R_α = QΔ^α, where Q and Δ are the matrices obtained via the eigendecomposition of X^T X (X^T X = QΔQ^T): Δ is a diagonal matrix containing the eigenvalues of X^T X; Q is an orthogonal matrix with the eigenvectors of X^T X as columns. Although the motivation stems from adjusting discrete similarity orders, note that α is in fact a continuous parameter which can be carefully fine-tuned (negative values are also allowed); the post-processing code is available at: https://github.com/artetxem/uncovec. Finally, we apply the above post-processing to both monolingual vector spaces X and Z. This results in the adjusted vector spaces X_αs = XR_αs and Z_αt = ZR_αt. The transformed spaces X_αs and Z_αt then replace the original spaces X and Z as input to any standard projection-based CLWE method.
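A minimal NumPy sketch of this post-processing, following the formulation above: eigendecompose X^T X, form R_α = QΔ^α, and map the embedding matrix. It mirrors the released uncovec code only in spirit; the clipping of near-zero eigenvalues (needed for negative α under floating-point noise) is our own assumption.

```python
import numpy as np

def similarity_order_transform(X, alpha):
    """Apply R_alpha = Q diag(eigvals)^alpha to the embedding matrix X
    (vocab x dim), where X^T X = Q diag(eigvals) Q^T. alpha = 0 leaves
    pairwise similarities unchanged (R_0 = Q is a pure rotation)."""
    eigvals, Q = np.linalg.eigh(X.T @ X)       # X^T X is symmetric and PSD
    eigvals = np.clip(eigvals, 1e-10, None)    # guard against tiny negative values (assumption)
    R_alpha = Q * (eigvals ** alpha)           # equals Q @ np.diag(eigvals ** alpha)
    return X @ R_alpha

# Both monolingual spaces are post-processed before any projection-based
# CLWE method, e.g. with the values tuned later in this paper:
# X_s = similarity_order_transform(X, -0.25)   # alpha_s
# Z_t = similarity_order_transform(Z, 0.15)    # alpha_t
```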

Experimental Setup
We evaluate the impact of the unsupervised monolingual post-processing described in §2 on BLI, focusing on pairs of typologically diverse languages. The focus of this work is on the standard BLI task; however, it has recently been shown (Glavaš et al., 2019) that performance in some downstream tasks strongly correlates with BLI performance. Mean reciprocal rank (MRR) is used as the main evaluation metric, reported as MRR×100%; our findings also hold for Precision@M, for M ∈ {1, 5}.

Training and Test Data. We exploit the training and test dictionaries compiled from PanLex. The list of 15 languages is provided in Table 1, yielding a total of 210 distinct L_s → L_t BLI setups.

We analyse the impact of the unsupervised monolingual post-processing from §2 by (1) feeding the original vectors X and Z to VecMap (BASELINE), and then by (2) feeding their post-processed variants X_αs and Z_αt (POSTPROC). We experiment with projection model variants without and with self-learning, and with different initial dictionary sizes (5K and 1K). Note that the POSTPROC variant requires tuning of two hyper-parameters: α_s and α_t. Due to the lack of development sets for BLI experiments, we tune the two α-parameters on a single language pair (BG-CA) via cross-validation; we grid-search over the values [−0.5, −0.25, −0.15, 0, 0.15, 0.25, 0.5]. We then keep them fixed to α_s = −0.25 and α_t = 0.15 in all subsequent experiments.
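For completeness, a small sketch of how MRR can be computed for a BLI test set in the shared space; it assumes plain cosine retrieval and a single gold translation per query, whereas the evaluated frameworks may use other retrieval criteria (e.g., CSLS) and the test sets may list several translations per source word.

```python
import numpy as np

def mrr(query_vecs, target_vecs, gold_indices):
    """Mean reciprocal rank: rank all target words by cosine similarity
    for each query word and average 1 / rank of the gold translation."""
    q = query_vecs / np.linalg.norm(query_vecs, axis=1, keepdims=True)
    t = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    sims = q @ t.T
    reciprocal_ranks = []
    for i, gold in enumerate(gold_indices):
        # rank = 1 + number of target words scored strictly higher than the gold one
        rank = 1 + int((sims[i] > sims[i, gold]).sum())
        reciprocal_ranks.append(1.0 / rank)
    return float(np.mean(reciprocal_ranks))
```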

Results and Discussion
Main BLI results, averaged over each source language (L_s), are provided in Table 2; full per-pair results are available in the supplemental material. We also observe performance gains with a "pure" supervised model variant (i.e., without self-learning), but for clarity we focus our analysis on the more powerful baseline with self-learning. We note improvements in 183/210 BLI setups (seed dictionary size: 5K) and 181/210 setups (size: 1K) over the projection-based baselines that held the previous peak scores on the same data (Vulić et al., 2019). This validates our intuition that monolingual vectors store additional information which needs to be "uncovered" via monolingual post-processing. The effect of the monolingual post-processing persists even after applying other transformations such as ℓ2-normalisation or mean centering. For some languages (e.g., FI, TR, NO), we achieve gains in all BLI setups with those languages as sources.
What is more, we have not carefully fine-tuned α_s and α_t: even higher scores can be achieved through finer-grained tuning in future work. For instance, setting (α_s, α_t) = (−0.5, 0.25) instead of (−0.25, 0.15) for TR-BG increases the BLI score from 37.8 to 39.5; the previous peak score with BASELINE was 35.1. The baseline mapping is simply recovered by setting (α_s, α_t) = (0, 0), and the tuned post-processing validated in our work should therefore be considered a tunable option for any projection-based CLWE method.
We further probe the robustness of unsupervised post-processing by running experiments on the additional BLI evaluation set of Glavaš et al. (2019) and with another mapping model: RCSLS (Joulin et al., 2018). We again observe gains across a range of different model variants and with different seed dictionary sizes; a selection of results is summarised in Table 3. Finally, small but consistent improvements extend to a set of 15 (major) European languages from Dubossarsky et al. (2020): POSTPROC yields gains on average for all 15/15 source languages, and across 173/210 setups (5K seed dictionary); the global average improves from 43.9 (the strongest BASELINE) to 44.7. In summary, these results further underline the usefulness of the monolingual post-processing method.
Conclusion

We have demonstrated a simple and effective method for improving bilingual lexicon induction (BLI) with projection-based cross-lingual word embeddings. The method is based on standalone unsupervised post-processing of the initial monolingual word embeddings before mapping, and is as such applicable to any projection-based CLWE method. We have verified the importance and robustness of this monolingual post-processing with a wide range of (dis)similar language pairs, in different BLI setups, and with different CLWE methods.
In future work, we will test other unsupervised post-processors, and also probe similar methods that inject external lexical knowledge into monolingual word vectors towards improved BLI. We also plan to examine whether similar gains hold with recently proposed, more sophisticated self-learning methods (Karan et al., 2020) and with non-linear mapping-based CLWE methods (Glavaš and Vulić, 2020; Mohiuddin and Joty, 2020). Another idea is to apply a similar principle to contextualised word representations in cross-lingual settings (Schuster et al., 2019; Liu et al., 2019).

A Supplemental Material
We report the main BLI results for all 15 × 14 = 210 language pairs based on the PanLex training and test data in the supplemental material, grouped by the source language, and for two dictionary sizes: |D| = 1,000 and |D| = 5,000 (similar relative performance is also observed with other dictionary sizes, e.g., |D| = 500). The results are provided in Table 4-Table 18, and they are the basis of the results reported in the main paper. The language codes are available in Table 1 (in the main paper). As mentioned in the main paper, all results are obtained with the two α-hyper-parameters fixed to α_s = −0.25 and α_t = 0.15, without any further fine-tuning; more careful, language pair-specific fine-tuning results in even higher performance for many language pairs. In all tables, BASELINE refers to the best-performing weakly supervised projection-based approach without and with self-learning, as reported in the recent comparative study of Vulić et al. (2019); 5K and 1K denote the seed dictionary D size. The scores in bold indicate improvements over the BASELINE methods. All results are reported as MRR scores: an MRR score of .xyz should be read as xy.z% (e.g., the score of .432 can be read as 43.2%).