Parallel sentences mining with transfer learning in an unsupervised setting

The quality and quantity of parallel sentences are known to be very important for training neural machine translation (NMT) systems. However, such resources are not available for many low-resource language pairs. Many existing mining methods require strong supervision and are therefore not suitable for these pairs. Although there have been several attempts at developing unsupervised models, they ignore language invariance between languages. In this paper, we propose an approach based on transfer learning to mine parallel sentences in the unsupervised setting. With the help of bilingual corpora of rich-resource language pairs, we can mine parallel sentences without bilingual supervision for low-resource language pairs. Experiments show that our approach improves the quality of mined parallel sentences compared with previous methods. In particular, we achieve excellent results on two real-world low-resource language pairs.


Introduction
Parallel sentences are known to be very important training data for constructing machine translation (MT) systems (Belinkov and Bisk, 2018). The volume of high-quality parallel sentences heavily affects the performance of the trained systems. However, these resources are only available for a handful of language pairs and domains, while the others suffer from data scarcity (Bouamor and Sajjad, 2018). In this situation, mining parallel sentences is crucial for training machine translation systems.
Transfer learning is an effective approach to mining parallel data in low-resource scenarios. Artetxe and Schwenk (2019) provided evidence that cross-lingual transfer can be used to mine parallel data for low-resource language pairs. However, their method is not unsupervised: it relies on bilingual supervision (e.g., a bilingual lexicon or bilingual sentences), which is not available for low-resource language pairs. Although Kvapilíková et al. (2020) removed the need for supervision by employing unsupervised MT, the performance depends heavily on the quality of the MT system.
In this paper, we propose a parallel sentence mining model based on transfer learning in an unsupervised setting¹. As illustrated in Figure 1, we obtain sentence embeddings by mean-pooling the outputs of multilingual BERT (Lample and Conneau, 2019), which is trained on monolingual corpora. In particular, we use a language discriminator to learn shared and refined language-invariant representations for transfer learning. Chen et al. (2018) and Ziser and Reichart (2018) pointed out that language invariance is helpful for transfer learning. We then treat the detection of parallel sentences as a classification task and generate multi-view semantic representations for the classifier. Generally, data from different views contain complementary information, and multi-view learning exploits the consistency among multiple views (Li et al., 2018; Fei and Li, 2020). In our model, we use two views for the classifier: (i) word representations; (ii) sentence representations. In addition to achieving good results on the BUCC 2018² shared task, we demonstrate the effectiveness of our model on two low-resource language pairs for which parallel corpora are almost unavailable.
In summary, our contributions in this paper are as follows:
(1) We propose an unsupervised method based on transfer learning to mine parallel sentences without any bilingual data for low-resource language pairs. By designing a multi-view model, we encode representations at the word level and the sentence level to obtain high-quality parallel data.
(2) We explicitly consider language invariance by constructing a language discriminator that better captures the semantic similarity among languages. This makes our model robust for transfer learning.
¹ The unsupervised setting means that we only have monolingual corpora for a language pair for which bilingual resources are not available, while some other language pairs have bilingual resources, which we use for unsupervised transfer learning to the low-resource language pairs.
² 11th Workshop on Building and Using Comparable Corpora.
Figure 1: Our proposed method, based on multi-view transfer training, for parallel phrase detection on a non-parallel sentence pair.

Related Work
Many works mine parallel corpora from monolingual data that contain potential mutual translations. Earlier methods depended on engineered features. Shi et al. (2006) and Esplà-Gomis et al. (2016) used metadata from web crawls to mine parallel data. More recent methods used cross-lingual word embeddings to obtain parallel corpora (Guo et al., 2018; Schwenk, 2018; Bouamor and Sajjad, 2018; Schwenk et al., 2019b,a). Artetxe and Schwenk (2019) encoded universal, language-agnostic embeddings and used transfer learning to mine parallel sentences for low-resource language pairs. This transfer learning method inspired our work; the main difference is that they required bilingual supervision (e.g., a bilingual lexicon or parallel sentences), which is not available for many low-resource language pairs.
Recently, several works developed unsupervised methods to mine parallel data (Hangya et al., 2018; Hangya and Fraser, 2019; Kvapilíková et al., 2020; Keung et al., 2020). These approaches mainly rely on unsupervised cross-lingual embeddings (Artetxe et al., 2018; Lample and Conneau, 2019) that are trained on monolingual corpora. However, several researchers have questioned whether these methods capture the semantic similarity among languages well (Karthikeyan et al., 2019; Pires et al., 2019). Some researchers proposed using transfer learning for cross-lingual applications in low-resource language pairs (Lakew et al., 2018; Kocmi, 2020). Eriguchi et al. (2018) used a multilingual neural machine translation system to learn word representations from rich-resource language pairs and then used transfer learning to identify parallel sentences for low-resource language pairs. However, this approach has an implicit dependency on multilingual NMT, which requires pre-training on large amounts of parallel sentences. Our transfer learning is inspired by Fei and Li (2020); the difference is that they mainly address cross-lingual unsupervised sentiment classification.

Proposed Method
An overview of the model architecture is shown in Figure 1. Our approach to mining parallel data with transfer learning is composed of three components: an unsupervised multilingual BERT, a language discriminator, and a multi-view classifier. Motivated by the success of unsupervised cross-lingual word embeddings (Artetxe et al., 2018; Lample and Conneau, 2019) and their application to mining parallel data (Hangya and Fraser, 2019; Keung et al., 2020), we use multilingual BERT to initialize word and sentence embeddings. Although previous methods are effective, they may ignore sentential context when using multilingual word embeddings, which could harm the performance of mining parallel corpora. In our work, we therefore use multi-view representations to mine parallel data. This gives good performance on rich-resource language pairs; however, our aim is to obtain parallel data for low-resource language pairs. For this purpose, we use transfer learning to mine parallel data in low-resource scenarios using rich-resource language pairs. Note that our method does not rely on any bilingual data for the low-resource language pairs. We therefore call our method unsupervised for low-resource language pairs.
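The sentence embeddings are described in the introduction as the mean-pooled outputs of multilingual BERT. As a minimal sketch of that pooling step, assuming the token vectors have already been computed and padded to a common length, the following averages only over real tokens. The function name `mean_pool`, the argument names, and the toy values are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def mean_pool(token_embeddings, attention_mask):
    """Average token vectors over real (non-padding) positions.

    token_embeddings: float array of shape (batch, seq_len, hidden)
    attention_mask:   0/1 array of shape (batch, seq_len); 1 marks a real token
    """
    mask = attention_mask[:, :, None].astype(token_embeddings.dtype)
    summed = (token_embeddings * mask).sum(axis=1)
    # Guard against an all-padding row so we never divide by zero.
    counts = np.clip(mask.sum(axis=1), 1.0, None)
    return summed / counts

# Toy example: one sentence, two real tokens and one padding position.
tokens = np.array([[[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]])
mask = np.array([[1, 1, 0]])
sentence_vec = mean_pool(tokens, mask)  # the padding vector is ignored
```

Masking matters here because padded positions would otherwise pull the sentence vector toward whatever values the encoder emits for padding.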

Language Discriminator
Previous works (Chen et al., 2018; Fei and Li, 2020) indicate that cross-lingual transfer learning works well when the representations are language-invariant. We use the unsupervised multilingual BERT to map word representations into a shared space. Although this yields shared word representations for different languages, there is still a semantic gap between languages. Following Chen et al. (2018) and Lample et al. (2018), we employ a language discriminator to obtain fine-tuned word representations, which is necessary to preserve language invariance during language transfer. In detail, the language discriminator is trained to distinguish between the mapped source and target embeddings. We then refine the embeddings of the two languages with a cross-lingual Procrustes method, following Lample et al. (2018). The language discriminator consists of a feed-forward neural network with two hidden layers as an encoder and one softmax layer. The objective of the discriminator is to maximize its ability to identify the source and target embeddings. In the discriminator loss, $\theta_D$ denotes the parameters of the discriminator and $(x, y)$ corresponds to the source and target languages. $P_{\theta_D}(\text{source} = 1 \mid z)$ is the probability that a vector $z$ is the mapping $W$ of a source embedding, and $P_{\theta_D}(\text{target} = 1 \mid z)$ is defined analogously. In parallel, we use Procrustes analysis to fine-tune the mapping $W$, following Lample et al. (2018). We obtain universal language-agnostic embeddings when the discriminator is no longer able to identify the origin of an embedding.
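The two ingredients of this section can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the discriminator loss is assumed to be the usual negative log-likelihood of the correct origin label, and the Procrustes step uses the standard closed-form solution of the orthogonal Procrustes problem (W = UVᵀ, where UΣVᵀ is the SVD of YᵀX for row-wise embedding matrices X and Y). The function names and the synthetic data are hypothetical.

```python
import numpy as np

def discriminator_loss(p_source_given_mapped_src, p_target_given_tgt, eps=1e-12):
    """Assumed loss: negative log-likelihood of labelling each origin correctly.

    p_source_given_mapped_src: discriminator probabilities P(source=1 | W x_i)
    p_target_given_tgt:        discriminator probabilities P(target=1 | y_j)
    """
    loss_src = -np.mean(np.log(p_source_given_mapped_src + eps))
    loss_tgt = -np.mean(np.log(p_target_given_tgt + eps))
    return loss_src + loss_tgt

def procrustes(X, Y):
    """Orthogonal W minimising the Frobenius norm of (X W^T - Y).

    X: (n, d) source embeddings, Y: (n, d) target embeddings, row-aligned.
    Returns W of shape (d, d); a source vector x is mapped as W @ x.
    """
    U, _, Vt = np.linalg.svd(Y.T @ X)
    return U @ Vt

# Synthetic check: the targets are an exact rotation of the sources.
rng = np.random.default_rng(0)
R, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal matrix
X = rng.normal(size=(50, 4))
Y = X @ R.T
W = procrustes(X, Y)  # recovers R up to numerical error
```

Constraining W to be orthogonal keeps distances and angles in the source space intact, which is why the refinement changes the alignment between languages without distorting the monolingual geometry.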

Transfer Learning for Mining Parallel Data
We propose to use transfer learning to mine parallel data in low-resource scenarios with the help of rich-resource language pairs. We first consider two views of the input to the classifier in rich-resource language pairs: (i) the word-level representations of the languages; (ii) the sentence-level representations of the languages. Multi-view classifiers have been shown to be useful because data from different views contain complementary information (Chen and Qian, 2019; Fei and Li, 2020).
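To make the two views concrete, the sketch below builds a small feature vector for one candidate sentence pair. The specific features are assumptions chosen for illustration (best-match token similarity in each direction for the word-level view, cosine similarity of mean-pooled vectors for the sentence-level view); the paper does not specify how the views are featurized, and all function names are hypothetical. Token embeddings are assumed to be non-zero vectors in a shared space.

```python
import numpy as np

def cosine_matrix(A, B):
    """Pairwise cosine similarity between the rows of A and the rows of B."""
    A = A / np.linalg.norm(A, axis=1, keepdims=True)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A @ B.T

def word_view(src_tokens, tgt_tokens):
    """Word-level view: how well each token finds a counterpart on the other side."""
    sim = cosine_matrix(src_tokens, tgt_tokens)
    return np.array([sim.max(axis=1).mean(), sim.max(axis=0).mean()])

def sentence_view(src_tokens, tgt_tokens):
    """Sentence-level view: cosine similarity of the mean-pooled sentence vectors."""
    u, v = src_tokens.mean(axis=0), tgt_tokens.mean(axis=0)
    return np.array([u @ v / (np.linalg.norm(u) * np.linalg.norm(v))])

def multi_view_features(src_tokens, tgt_tokens):
    """Concatenate both views into one input vector for a classifier."""
    return np.concatenate([word_view(src_tokens, tgt_tokens),
                           sentence_view(src_tokens, tgt_tokens)])

# Toy pair whose token embeddings coincide exactly.
src = np.array([[1.0, 0.0], [0.0, 1.0]])
tgt = np.array([[1.0, 0.0], [0.0, 1.0]])
features = multi_view_features(src, tgt)  # three similarity features
```

The word-level view can stay high when only part of a sentence is translated, while the sentence-level view reflects overall meaning, which is one way the two views can carry complementary information.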
We use a feed-forward neural network based on LSTM with two hidden layers as an encoder to balance the two view representations. Then, we train a classifier to match the predicted labels with the ground truth from the parallel sentences in rich-resource language pairs as follows:
$$P(s \mid t) = \frac{e^{\mathrm{enc}(\theta)}}{1 + e^{\mathrm{enc}(\theta)}} \in (0, 1)$$
where $\theta$ denotes the parameters of the encoder $\mathrm{enc}$. We then use transfer learning to mine parallel data for low-resource language pairs. The detailed process is as follows. We first train a classifier on rich-resource language pairs (such as English-Chinese or English-French). In parallel, we use the language discriminator to fine-tune the representations of the different languages into a shared space in order to preserve language invariance. After that, we transfer the pre-trained classifier to detect parallel sentences for low-resource pairs. Finally, we use the detected parallel data to train the classifier again on the low-resource language pairs for better performance.
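The train, transfer, and retrain procedure described above can be sketched in a few lines. To keep the example self-contained, a plain logistic-regression scorer is swapped in for the LSTM-based encoder, so only the sigmoid output P = e^z / (1 + e^z) matches the formula in the text. The retraining step treats thresholded predictions as pseudo-labels, which is an assumption about how the detected pairs are reused; the function names, the threshold, and the toy features are hypothetical.

```python
import numpy as np

def sigmoid(z):
    """Numerically stable e^z / (1 + e^z) for a 1-D array of scores."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    pos = z >= 0
    out[pos] = 1.0 / (1.0 + np.exp(-z[pos]))
    ez = np.exp(z[~pos])
    out[~pos] = ez / (1.0 + ez)
    return out

def add_bias(X):
    return np.hstack([X, np.ones((len(X), 1))])

def train(X, y, w=None, lr=0.5, steps=500):
    """Fit logistic-regression weights by gradient descent (bias is the last weight)."""
    Xb = add_bias(X)
    w = np.zeros(Xb.shape[1]) if w is None else w.copy()
    for _ in range(steps):
        w -= lr * Xb.T @ (sigmoid(Xb @ w) - y) / len(y)
    return w

def score(X, w):
    return sigmoid(add_bias(X) @ w)

def transfer_and_mine(X_rich, y_rich, X_low, threshold=0.5):
    w = train(X_rich, y_rich)                        # 1. train on the rich-resource pair
    pseudo = (score(X_low, w) >= threshold) * 1.0    # 2. transfer: label low-resource candidates
    w = train(X_low, pseudo, w=w, steps=100)         # 3. retrain on the detected data (assumed pseudo-labels)
    return score(X_low, w) >= threshold              # 4. final mining decision

# Toy features: one similarity score per candidate sentence pair.
X_rich = np.array([[0.9], [0.8], [0.1], [0.2]])
y_rich = np.array([1.0, 1.0, 0.0, 0.0])
X_low = np.array([[0.95], [0.05]])
mined = transfer_and_mine(X_rich, y_rich, X_low)
```

The transfer step only works if the low-resource features live in the same space as the rich-resource ones, which is the role the language discriminator plays in the method.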

Experimental Setting
In this section, we mainly present our experimental settings and describe the datasets used.
Dataset: We test our proposed method on four language pairs of the BUCC sample data (English-French, English-German, English-Russian, English-Chinese). The shared task of the workshop on Building and Using Comparable Corpora (BUCC) is a well-established evaluation framework for mining parallel corpora (Zweigenbaum et al., 2018). The shared task provides a gold standard to assess retrieval systems in terms of precision, recall, and F1-score. We applied our approach to all language pairs of the BUCC18 shared task. Moreover, we carry out an experiment on real-world low-resource scenarios (English-Esperanto, Chinese-Kazakh). For the monolingual data, we extract corpora from Wikipedia using WikiExtractor³. As there is no gold standard for evaluating mined parallel sentences in these pairs, we use the mined parallel sentences to train a machine translation system, whose performance reflects their quality.
Baselines: In our experiments, we consider supervised baselines (Bouamor and Sajjad, 2018; Schwenk, 2018; Artetxe and Schwenk, 2019). We also compare with several unsupervised baselines (Hangya and Fraser, 2019; Keung et al., 2020; Hangya et al., 2018; Kvapilíková et al., 2020).
Table 1: Results of our proposed systems on the BUCC shared task's training set for the four language pairs. We also report the results of the baselines as described in their papers. "-" indicates that the result is not reported in the corresponding paper.
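For reference, the precision, recall, and F1 measures used with a gold standard can be computed as in the following sketch, which treats mined and gold sentence pairs as sets of identifier pairs. The identifiers and the function name are made up for illustration and do not come from the BUCC data.

```python
def precision_recall_f1(mined_pairs, gold_pairs):
    """Score mined sentence pairs against a gold standard of true pairs."""
    mined, gold = set(mined_pairs), set(gold_pairs)
    correct = len(mined & gold)
    precision = correct / len(mined) if mined else 0.0
    recall = correct / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1

# Hypothetical identifiers: two of the three mined pairs are in the gold standard.
mined = [("en-1", "fr-7"), ("en-2", "fr-9"), ("en-3", "fr-4")]
gold = [("en-1", "fr-7"), ("en-2", "fr-8"), ("en-3", "fr-4"), ("en-5", "fr-5")]
p, r, f1 = precision_recall_f1(mined, gold)
```

Precision penalizes wrongly mined pairs, recall penalizes gold pairs that were missed, and F1 is their harmonic mean.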

Results and Discussions
In this section, we present the results of mining parallel sentences and compare them with previous work. We also present results on real-world low-resource language pairs and demonstrate that the parallel corpora we obtain can improve the performance of machine translation.
³ https://github.com/attardi/wikiextractor

Results on BUCC
As BUCC provides a gold standard to assess mined parallel data, we test our method on the BUCC dataset. Although the language pairs used for evaluation are all high-resource, we simulate the low-resource scenario here to validate our method, and we present results on real-world low-resource language pairs in Section 5.3. We show precision (P), recall (R), and F1 scores in Table 1 for the four language pairs. Note that we use English-German as the rich-resource language pair to initialize our model and then transfer this model to the other, low-resource language pairs. We also test different rich-resource language pairs for transfer learning, as shown in Table 2.
Note that our method does not rely on any bilingual data for the low-resource language pairs, so we call it unsupervised for low-resource language pairs. This makes the comparison with other unsupervised methods fair. As Table 1 shows, we achieve an increase in F1 compared with the unsupervised baselines for all language pairs. The precision and recall of the proposed method are also significantly higher than those of the unsupervised methods for all language pairs. Artetxe and Schwenk (2019) also used transfer learning to mine parallel sentences. However, their method needs strong supervision, which is not available for low-resource language pairs. The proposed method overcomes this limitation and obtains relatively good results compared with Artetxe and Schwenk (2019).

Table 2: Ablation study on the BUCC shared task (columns: En-Fr, En-De, En-Ru, En-Zh). Note that the first column indicates the rich-resource language pair used for transfer learning.

Ablation Study
To understand the effect of different components of our model on the overall performance, we conduct an ablation study (Table 2) that tests whether the language discriminator affects transfer learning. "-language discriminator" denotes the model without the language discriminator, and "+language discriminator" denotes the model with it. The first column of Table 2 gives the rich-resource language pair used for transfer learning when mining parallel sentences. First, different source pairs give similar results for transfer learning with our model. Second, without the language discriminator, the model does not perform well under transfer learning. When we add the language discriminator, the model obtains a clear and stable improvement on all language pairs. From Table 2, we conclude that language invariance is very important for transfer learning.

Results on Low-resource Language Pair
In the previous sections, we simulated the low-resource scenario to validate our method on the BUCC dataset. In this section, we evaluate our mined parallel sentences on real-world low-resource language pairs. We apply our method to the English-Esperanto (En-Es) and Chinese-Kazakh (Zh-Kz) language pairs. As there is no gold standard for evaluating mined parallel sentences in these pairs, we use the mined parallel sentences to train a machine translation system, whose performance reflects their quality.

Methods                       En-Es   Zh-Kz
(Hangya and Fraser, 2019)     18.5    21.6
(Keung et al., 2020)          20.2    22.8
(Hangya et al., 2018)         16.3    19.3
(Kvapilíková et al., 2020)    23.6    22.7
Proposed method               24.3    25.8

We use OpenNMT⁴ to train the machine translation system. The results are shown in Table 3. Based on the scores in Table 3, we achieve a significant performance increase compared with the unsupervised baselines. It is well known that the quality and quantity of parallel sentences heavily affect the performance of machine translation. The results in Table 3 demonstrate that the proposed method is effective, especially for low-resource language pairs.

Conclusion
In this paper, we propose an unsupervised method that uses multi-view transfer learning to mine parallel sentences. Our method effectively uses the bilingual data of rich-resource language pairs: we transfer the model trained on rich-resource language pairs to a low-resource setting without any supervision for the low-resource language pairs. In particular, we employ a language discriminator to capture language invariance, which benefits transfer learning. The experimental results show that our method significantly and consistently outperforms the baselines.
In future work, we would like to apply our model to other low-resource language pairs to test its applicability across different language pairs.