Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining

Existing models of multilingual sentence embeddings require large parallel data resources which are not available for low-resource languages. We propose a novel unsupervised method to derive multilingual sentence embeddings relying only on monolingual data. We first produce a synthetic parallel corpus using unsupervised machine translation, and use it to fine-tune a pretrained cross-lingual masked language model (XLM) to derive the multilingual sentence representations. The quality of the representations is evaluated on two parallel corpus mining tasks with improvements of up to 22 F1 points over vanilla XLM. In addition, we observe that a single synthetic bilingual corpus is able to improve results for other language pairs.


Introduction
Parallel corpora constitute an essential training data resource for machine translation as well as other cross-lingual NLP tasks. However, large parallel corpora are only available for a handful of language pairs while the rest relies on semi-supervised or unsupervised methods for training. Since monolingual data are generally more abundant, parallel sentence mining from non-parallel corpora provides another opportunity for low-resource language pairs.
An effective approach to parallel data mining is based on multilingual sentence embeddings (Schwenk, 2018; Artetxe and Schwenk, 2019b). However, existing methods to generate cross-lingual representations are either heavily supervised or only apply to static word embeddings. An alternative approach to unsupervised multilingual training is that of Devlin et al. (2018) or Lample and Conneau (2019), who train a masked language model (M-BERT, XLM) on a concatenation of monolingual corpora in different languages to learn a joint structure of these languages. While several authors (Pires et al., 2019; Wu and Dredze, 2019; Karthikeyan et al., 2019; Libovický et al., 2019) bring evidence of cross-lingual transfer within the model, its internal representations are not entirely language agnostic.
We propose a method to further align representations from such models into the cross-lingual space and use them to derive sentence embeddings. Our approach is completely unsupervised and is applicable even for very distant language pairs. The proposed method outperforms previous unsupervised approaches on the BUCC 2018 shared task, and is even competitive with several supervised baselines.
The paper is organized as follows. Section 2 gives an overview of related work; Section 3 introduces the proposed method; Section 4 describes the experiments and reports the results. Section 5 concludes.

Related Work
Related research comprises supervised methods to model multilingual sentence embeddings and unsupervised methods to model multilingual word embeddings which can be aggregated into sentences. Furthermore, our approach is closely related to the recent research in cross-lingual language model (LM) pretraining.
Supervised multilingual sentence embeddings. The state-of-the-art performance in parallel data mining is achieved by LASER (Artetxe and Schwenk, 2019b), a multilingual BiLSTM model sharing a single encoder for 93 languages, trained on parallel corpora to produce language-agnostic sentence representations. Similarly, Schwenk and Douze (2017), Schwenk (2018) and Espana-Bonet et al. (2017) derive sentence embeddings from the internal representations of a neural machine translation system with a shared encoder. The universal sentence encoder (USE) family (Yang et al., 2019) covers sentence embedding models with a multi-task dual-encoder training framework, including the tasks of question-answer prediction and natural language inference. Guo et al. (2018) directly optimize the cosine similarity between source and target sentences using a bidirectional dual-encoder. These approaches rely on heavy supervision in the form of parallel corpora, which is not available for low-resource languages.
Unsupervised multilingual word embeddings. Cross-lingual embeddings of words can be obtained by post-hoc alignment of monolingual word embeddings (Mikolov et al., 2013) and mean-pooled with IDF weights to represent sentences (Litschko et al., 2019). Unsupervised techniques to find a linear mapping between embedding spaces were proposed by Artetxe et al. (2018) and Conneau et al. (2018), using iterative self-learning or adversarial training. Several recent studies (Patra et al., 2019; Ormazabal et al., 2019) criticize this simplified approach, showing that even the embedding spaces of closely related languages are not isometric. Vulić et al. (2019) question the robustness of unsupervised mapping methods in challenging circumstances.
Cross-lingual LM pretraining. Ma et al. (2019) and Reimers and Gurevych (2019) derive monolingual sentence embeddings by mean-pooling contextualized word embeddings from BERT. Schuster et al. (2019) and Wang et al. (2019b) propose mapping such contextualized embeddings into the multilingual space and report favorable results on the task of dependency parsing. Pires et al. (2019) extract contextualized embeddings directly from unsupervised multilingual LMs and use them for parallel sentence retrieval. Other authors improve the alignment of representations in a multilingual LM using a parallel corpus as an anchor (Cao et al., 2020) or using iterative self-learning (Wang et al., 2019a). None of these works apply multilingual embeddings to mine parallel sentences. Our work is the first to improve unsupervised cross-lingual models using additional unsupervised information.

Proposed Method
We propose a method to enhance the cross-lingual ability of a pretrained multilingual model by fine-tuning it on a small synthetic parallel corpus. The parallel corpus is obtained via unsupervised machine translation (MT) so the method remains unsupervised. In this section, we describe the pretrained model (Section 3.1), the fine-tuning objective (Section 3.2) and the extraction of sentence embeddings (Section 3.3). We provide details on the unsupervised MT system in Section 3.4.

XLM Pretraining
The starting point for our experiments is a cross-lingual language model (XLM) (Lample and Conneau, 2019) of the BERT family pretrained on concatenated monolingual texts in 100 languages using the masked language model (MLM) training objective (Devlin et al., 2018). The model processes the input in BPE subword units (Sennrich et al., 2016) with a shared vocabulary for all languages. In this work, we use the publicly available pretrained model XLM-100 (Lample and Conneau, 2019) with 16 transformer layers, 16 attention heads and a hidden unit size of 1280. The model was trained on monolingual corpora in 100 languages with a BPE vocabulary of 240k subwords.

XLM Fine-tuning with a Translation Objective
When parallel data is available, it can be leveraged to train the multilingual language model with the translation language model (TLM) objective (Lample and Conneau, 2019). Pairs of sentences are concatenated, random tokens are masked in both sentences, and the model is trained to fill in the blanks by attending to any of the words of the two sentences. The Transformer self-attention layers thus have the capacity to enrich word representations with information about their monolingual context as well as their translation counterparts. This explicit cross-lingual training objective further enhances the alignment of the embeddings in the cross-lingual space. We use this objective to fine-tune the pretrained model on a small synthetic parallel data set obtained via unsupervised MT for one language pair, aiming to improve the overall cross-lingual alignment of the model's internal representations. In our experiments, we also compare the performance to fine-tuning on small authentic parallel corpora.
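The input construction for the TLM objective can be sketched as follows. This is a simplified illustration, not the authors' implementation: the `</s>` separator token, the 15% masking rate and the function name are assumptions for the sake of the example.

```python
import random

def make_tlm_example(src_tokens, tgt_tokens, mask_token="<mask>",
                     mask_prob=0.15, seed=0):
    """Concatenate a sentence pair and mask random tokens on both sides;
    the model is trained to predict the tokens at the masked positions,
    attending freely to both the source and the target context."""
    rng = random.Random(seed)
    tokens = src_tokens + ["</s>"] + tgt_tokens
    inputs, targets = [], []
    for tok in tokens:
        if tok != "</s>" and rng.random() < mask_prob:
            inputs.append(mask_token)
            targets.append(tok)    # loss is computed at masked positions
        else:
            inputs.append(tok)
            targets.append(None)   # no loss at unmasked positions
    return inputs, targets
```

Because the masked word may be recoverable only from its translation counterpart, the objective pushes representations of the two languages toward each other.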

Sentence Embeddings
Pretrained language models produce contextual representations capturing the semantic and syntactic properties of word (subword) tokens in their variable context (Devlin et al., 2018). Contextualized embeddings can be derived from any of the internal layer outputs of the model. We tune the choice of the layer on the task of parallel sentence matching and conclude that the best cross-lingual performance is achieved at the 12th (5th-to-last) layer. Therefore, we use the representations from this layer in the rest of this paper. The evaluation across layers is summarized in Figure 1 in Section 4.6.
Aggregating subword embeddings to fixed-length sentence representations necessarily leads to an information loss. We compose sentence embeddings from subword representations by simple element-wise averaging. Even though mean-pooling is a naive approach to subword aggregation, it is often used for its simplicity (Reimers and Gurevych, 2019; Ruiter et al., 2019; Ma et al., 2019) and in our scenario it yields better results than max-pooling.
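The pooling step above amounts to a single mean over the token axis; a minimal sketch (the function name is ours, and the input is assumed to be one contextualized vector per BPE token, e.g. the layer-12 outputs):

```python
import numpy as np

def sentence_embedding(subword_vectors):
    """Mean-pool contextualized subword vectors into one fixed-length
    sentence embedding by element-wise averaging over tokens."""
    return np.asarray(subword_vectors, dtype=np.float32).mean(axis=0)
```

Max-pooling would replace `.mean(axis=0)` with `.max(axis=0)`; in our scenario the averaged variant performed better.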

Unsupervised Machine Translation
Our unsupervised MT model follows the approach of Lample and Conneau (2019). It is a Transformer model with an encoder-decoder architecture. Both the encoder and the decoder are shared across languages and they are initialized with a pretrained bilingual LM to bootstrap the training. Both the encoder and the decoder have 6 layers, 8 attention heads and a hidden unit size of 768. The system is trained using the unsupervised MT training pipeline of denoising and back-translation.
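The denoising step of this pipeline trains the model to reconstruct a sentence from a corrupted copy of itself. A corruption function in the spirit of this pipeline can be sketched as below; the dropout probability, shuffle window and function name are illustrative assumptions, not the settings used in the paper.

```python
import random

def add_noise(tokens, drop_prob=0.1, shuffle_k=3, seed=0):
    """Corrupt a sentence for denoising autoencoding: randomly drop words
    and apply a bounded local shuffle (each kept word moves by at most
    roughly shuffle_k positions)."""
    rng = random.Random(seed)
    # word dropout: each token is removed with probability drop_prob
    kept = [t for t in tokens if rng.random() > drop_prob]
    # local shuffle: perturb each position by a bounded random offset
    keys = [i + rng.uniform(0, shuffle_k) for i in range(len(kept))]
    return [t for _, t in sorted(zip(keys, kept), key=lambda p: p[0])]
```

The denoising loss then scores the model's reconstruction of the original `tokens` given `add_noise(tokens)` as input, while back-translation supplies the cross-lingual training signal.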

Experiments & Results
In this section, we empirically evaluate the quality of our cross-lingual sentence embeddings and compare it with state-of-the-art supervised methods and unsupervised baselines. We evaluate the proposed method on the task of parallel corpus mining and parallel sentence matching. We fine-tune two different models using English-German and Czech-German synthetic parallel data.

Data
The XLM model was pretrained on the Wikipedia corpus of 100 languages (Lample and Conneau, 2019). The monolingual data for fine-tuning was sampled from NewsCrawl 2018 (10k Czech sentences, 10k German sentences, 10k English sentences).
Monolingual training data for the unsupervised MT models was obtained from NewsCrawl 2007-2008 (5M sentences per language). The text was cleaned and tokenized using standard Moses (Koehn et al., 2007) tools and segmented into BPE units based on 60k BPE splits.

Experiment Details
To generate synthetic data for fine-tuning, we train two unsupervised MT models (Czech-German, English-German) using the same method and parameters as in Lample and Conneau (2019) on 8 GPUs for 24 hours. We use these models to translate 10k sentences in each language. The translations are coupled with the originals into two parallel corpora of 20k synthetic sentence pairs.
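The coupling of translations with their originals can be sketched as follows; `translate` stands in for the unsupervised MT model and the function names and direction labels are assumptions for illustration.

```python
def build_synthetic_corpus(src_sents, tgt_sents, translate):
    """Pair monolingual sentences with their unsupervised translations in
    both directions: src originals with src->tgt outputs, and tgt originals
    with tgt->src outputs. For 10k sentences per side this yields a
    synthetic corpus of 20k sentence pairs."""
    pairs = [(s, translate(s, "src->tgt")) for s in src_sents]
    pairs += [(translate(t, "tgt->src"), t) for t in tgt_sents]
    return pairs
```

The resulting pairs are noisy, but as the fine-tuning results show, even imperfect synthetic translations provide a useful cross-lingual anchor.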
The small synthetic parallel corpora obtained in the first step are used to fine-tune the pretrained XLM-100 model using the TLM objective. We measure the quality of induced cross-lingual embeddings from different layers on the task of parallel sentence matching described in Section 4.5 and observe the best results at the 12th layer after fine-tuning for one epoch with a batch size of 8 sentences and all other pretraining parameters intact. The development accuracy decreases with fine-tuning on a larger data set.

Baselines
We assess our method against two unsupervised baselines to separately measure the fine-tuning effect on the XLM model and to compare our results to another possible unsupervised approach based on post-hoc alignment of word embeddings.
Vanilla XLM. Contextualized token representations are extracted from the 12th layer of the original XLM-100 model and mean-pooled into sentence embeddings.
Word Mapping. We use Word2Vec embeddings with 300 dimensions pretrained on NewsCrawl and map them into the cross-lingual space using the unsupervised version of VecMap (Artetxe et al., 2018). As above, word embeddings are aggregated by mean-pooling to represent sentences.

Evaluation I: Parallel Corpus Mining
We measure the performance of our method on the BUCC shared task of parallel corpus mining, where the system is expected to search two comparable non-aligned corpora and identify pairs of parallel sentences. We evaluate on two data sets: the original BUCC 2018 corpus, created by inserting parallel sentences into monolingual texts extracted from Wikipedia (Zweigenbaum et al., 2017), and a new BUCC-like data set (News train and test) which we created by shuffling 10k parallel sentences from News Commentary into 400k monolingual sentences from News Crawl. The BUCC and News data sets are comparable in size and contain parallel sentences from the same source, but differ in overall domain.
In order to score all candidate sentence pairs, we use the margin-based approach of Artetxe and Schwenk (2019a), which has been shown to mitigate the hubness problem of embedding spaces and yield superior results (Artetxe and Schwenk, 2019b). The score relies on cosine similarity to measure the distance between sentences, but it is defined relative to the average cosine similarity between the two sentences and their nearest neighbors. The optimal threshold for filtering the translation pairs is learned by tuning on the train-set F1 scores. Tables 1 and 2 show the results of our proposed model on the BUCC and News test sets, respectively, comparing them to related work and unsupervised baselines.
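The ratio variant of this margin score can be sketched as a dense computation over small candidate sets; a practical mining setup would use approximate nearest-neighbor search instead, and the neighborhood size `k=4` here is an assumption rather than a setting reported in the paper.

```python
import numpy as np

def margin_scores(src, tgt, k=4):
    """Ratio-margin scoring: cosine similarity of a candidate pair divided
    by the average similarity of each sentence to its k nearest neighbors
    on the other side. src, tgt: row-wise sentence embedding matrices."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    sim = src @ tgt.T                                    # cosine similarities
    k = min(k, sim.shape[0], sim.shape[1])
    # average similarity to the k nearest neighbors in each direction
    fwd = np.mean(np.sort(sim, axis=1)[:, -k:], axis=1)  # per src sentence
    bwd = np.mean(np.sort(sim, axis=0)[-k:, :], axis=0)  # per tgt sentence
    return sim / ((fwd[:, None] + bwd[None, :]) / 2)
```

Dividing by the neighborhood average penalizes "hub" sentences that are close to everything, which is exactly the failure mode of raw cosine retrieval; pairs above a tuned threshold on this score are kept as mined translations.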
When comparing our method to related work, it must be noted that the XLM model was pretrained on Wikipedia and has therefore seen the monolingual BUCC sentences during training. This could give it an advantage over other systems, as the model could exploit the fact that it has seen the non-parallel part of the comparable corpus during training. However, since both the proposed method and the vanilla XLM baseline suffer from this, their results remain comparable. We also report results on the News test set which is free from such potential bias (Table 2).
The results reveal that TLM fine-tuning brings a substantial improvement over the initial pretrained model trained only using the MLM objective (vanilla XLM). In terms of the F1 score, the gain across four BUCC language pairs is 14.0-22.3 points. Even though the fine-tuning focused on a single language pair (English-German), the improvement is notable for all evaluated language pairs. The largest margin of 21.6 points is observed for the English-Chinese mining task. We observe that using a small parallel data set of authentic translation pairs instead of synthetic ones does not have a significant effect.
The weak results of the word mapping baseline can be partially attributed to the superiority of contextualized embeddings over static ones for representing sentences. Furthermore, word mapping relies on the questionable assumption of isomorphic embedding spaces, which weakens its performance especially for distant languages. In our proposed model, it is possible that joint training of contextualized representations induces an embedding space with more convenient geometric properties, which makes it more robust to language diversity. Although the performance of our model generally lags far behind the supervised LASER benchmark, it is valuable because of its fully unsupervised nature, and it works even for distant languages such as Chinese-Czech or English-Kazakh.

Table 3: Accuracy on a parallel sentence matching task (newstest2012) averaged over both matching directions.

Evaluation II: Parallel Sentence Matching
To assess the effect of the proposed fine-tuning on other language pairs not covered by BUCC, we evaluate our embeddings on the task of parallel sentence matching (PSM). The task entails searching a pool of shuffled parallel sentences to recover the correct translation pairs. Cosine similarity is used for the nearest-neighbor search. We first evaluate the pairwise matching accuracy on a newstest multi-way parallel data set of 3k sentences in 6 languages (Czech, English, French, German, Russian, Spanish). We use newstest2012 for development and newstest2013 for testing. The results in Table 3 show that the fine-tuned model is able to match correct translations in 90-95% of cases, depending on the language pair, which is ∼7% more than vanilla XLM. It is notable that the model fine-tuned only on English-German synthetic parallel data has a positive effect on completely unrelated language pairs as well (e.g. Russian-Spanish, Czech-French).
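The PSM evaluation itself reduces to nearest-neighbor retrieval over the two embedding matrices; a minimal sketch for one matching direction (the function name is ours, and row i of `tgt` is assumed to be the gold translation of row i of `src`):

```python
import numpy as np

def psm_accuracy(src, tgt):
    """Parallel sentence matching accuracy: for each source embedding,
    retrieve the most cosine-similar target; a match is correct when it
    hits the gold translation at the same row index."""
    src = src / np.linalg.norm(src, axis=1, keepdims=True)
    tgt = tgt / np.linalg.norm(tgt, axis=1, keepdims=True)
    pred = np.argmax(src @ tgt.T, axis=1)   # nearest neighbor per source row
    return float(np.mean(pred == np.arange(len(src))))
```

The reported numbers average this accuracy over both matching directions (source-to-target and target-to-source).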
Since the greatest appeal of parallel corpus mining is to enhance the resources for low-resource languages, we also measure the PSM accuracy on the Tatoeba (Artetxe and Schwenk, 2019b) data set of 0.5-1k sentences in over 100 languages aligned with English. Aside from the two completely unsupervised models, we fine-tune two more models on small authentic parallel data in English-Nepali (5k sentence pairs from the Flores development sets) and English-Kazakh (10k sentence pairs from News Commentary). Table 4 confirms that the improvement over vanilla XLM is present for every language we evaluated, regardless of the language pair used for fine-tuning. Although we initially hypothesized that the performance of the English-German model on English-aligned language pairs would exceed that of the German-Czech model, their results are equal on average. Fine-tuning on small authentic corpora in low-resource languages exceeds both by a slight margin.
The results are clearly sensitive to the amount of monolingual sentences in the Wikipedia corpus used for XLM pretraining and the matching accuracy of very low-resource languages is significantly lower than we observed for high-resource languages. However, the benefits of fine-tuning are substantial (around 20 percentage points) and for some languages the results even reach the supervised baseline (e.g. Kazakh, Georgian, Nepali).
It seems that explicitly aligning one language pair during fine-tuning propagates through the shared parameters and improves the overall representation alignment, making the contextualized embeddings more language agnostic. The propagation effect could also positively influence the ability of cross-lingual transfer within the model in downstream tasks. A verification of this is left to future work.

Figure 1: Average PSM accuracy on newstest2012 before and after fine-tuning, from the input embedding layer (0th) to the deepest layer (16th).

Table 4: Accuracy on a parallel sentence matching task (Tatoeba) averaged over both matching directions (to and from English). The supervised baseline was obtained using the public implementation of the LASER model (Artetxe and Schwenk, 2019b). Our proposed models were fine-tuned on synthetic parallel data (en↔de, cs↔de) and authentic parallel data (en↔kk, en↔ne).

Analysis: Representations Across Layers
We derive sentence embeddings from all layers of the model and show PSM results on the development set, averaged over all language pairs, in Figure 1, both before and after fine-tuning. The accuracy differs substantially across the model depth; the best cross-lingual performance is consistently achieved around the 12th (5th-to-last) layer of the model. The TLM fine-tuning affects especially the deepest layers.

Conclusion
We proposed a completely unsupervised method to train multilingual sentence embeddings which can be used for building a parallel corpus with no previous translation knowledge. We show that fine-tuning an unsupervised multilingual model with a translation objective using as little as 20k synthetic translation pairs can significantly enhance the cross-lingual alignment of its representations. Since the synthetic translations were obtained from an unsupervised MT system, the entire procedure requires no authentic parallel sentences for training.
Our sentence embeddings yield significantly better results on the tasks of parallel data mining and parallel sentence matching than our unsupervised baselines. Interestingly, targeting only one language pair during the fine-tuning phase suffices to propagate the alignment improvement to unrelated languages. It is therefore not necessary to build a working MT system for every language pair we wish to mine.
The average F1 margin across four language pairs on the BUCC task is ∼17 points over the original XLM model and ∼7 on the News dataset where only one of the evaluated language pairs was seen during fine-tuning. The gain in accuracy in parallel sentence matching across 8 language pairs is 7.2% absolute, lagging only 7.1% absolute behind supervised methods.
For the future we would like to apply our model on other cross-lingual NLP tasks such as XNLI or cross-lingual semantic textual similarity.