Unicoder: A Universal Language Encoder by Pre-training with Multiple Cross-lingual Tasks

We present Unicoder, a universal language encoder that is insensitive to different languages. Given an arbitrary NLP task, a model can be trained with Unicoder using training data in one language and directly applied to inputs of the same task in other languages. Compared to similar efforts such as Multilingual BERT and XLM, three new cross-lingual pre-training tasks are proposed: cross-lingual word recovery, cross-lingual paraphrase classification and cross-lingual masked language model. These tasks help Unicoder learn the mappings among different languages from more perspectives. We also find that fine-tuning on multiple languages together brings further improvement. Experiments are performed on two tasks: cross-lingual natural language inference (XNLI) and cross-lingual question answering (XQA), where XLM is our baseline. On XNLI, a 1.8% averaged accuracy improvement (on 15 languages) is obtained. On XQA, which is a new cross-lingual dataset built by us, a 5.5% averaged accuracy improvement (on French and German) is obtained.


Introduction
Data annotation is expensive and time-consuming for most NLP tasks. Recently, pre-trained models such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2018) and GPT (Radford et al., 2018) have shown strong capabilities of transferring knowledge learned from large-scale text corpora to specific NLP tasks with limited or no training data. But they still cannot handle tasks in which training and test instances are in different languages.
Motivated by this issue, some efforts have been made for cross-lingual tasks, such as Multilingual BERT (Devlin et al., 2018) and XLM (Lample and Conneau, 2019). Multilingual BERT trains a BERT model on multilingual Wikipedia, which covers 104 languages. As its vocabulary contains tokens from all languages, Multilingual BERT can be applied to cross-lingual tasks directly. XLM further improves on Multilingual BERT by introducing a translation language model (TLM). TLM takes a concatenation of a bilingual sentence pair as input and performs masked language modeling on it. By doing this, it learns the mappings among different languages and performs well on the XNLI dataset.
However, XLM only uses a single cross-lingual task during pre-training. At the same time, Liu et al. (2019) have shown that multi-task learning can further improve a BERT-style pre-trained model. We therefore expect that more cross-lingual tasks could further improve the resulting pre-trained model for cross-lingual tasks. To verify this, we propose Unicoder, a universal language encoder that is insensitive to different languages and pre-trained on 5 pre-training tasks. Besides masked language model and translation language model, 3 new cross-lingual pre-training tasks are used in the pre-training procedure: cross-lingual word recovery, cross-lingual paraphrase classification and cross-lingual masked language model. Cross-lingual word recovery leverages the attention matrix between a bilingual sentence pair to learn the cross-lingual word alignment relation. Cross-lingual paraphrase classification takes two sentences from different languages and classifies whether they have the same meaning; this task learns the cross-lingual sentence alignment relation. Inspired by the success of monolingual pre-training on long text (Radford et al., 2018; Devlin et al., 2018), we propose cross-lingual masked language model, whose input is a document written in multiple languages. We also find that fine-tuning on multiple languages together brings further improvement. For the languages without training data, we use machine-translated data from rich-resource languages.
Experiments are performed on cross-lingual natural language inference (XNLI) and cross-lingual question answering (XQA), where both Multilingual BERT and XLM are considered as our baselines. On XNLI, a 1.8% averaged accuracy improvement (on 15 languages) is obtained. On XQA, which is a new cross-lingual dataset built by us, a 5.5% averaged accuracy improvement (on French and German) is obtained.
In short, our contributions are 4-fold. First, 3 new cross-lingual pre-training tasks are proposed, which help to learn a better language-independent encoder. Second, a cross-lingual question answering (XQA) dataset is built, which can be used as a new cross-lingual benchmark dataset. Third, we verify that significant improvements can be obtained by fine-tuning on multiple languages together. Fourth, new state-of-the-art results are achieved on the XNLI dataset.

Related work
Monolingual Pre-training. Recently, pre-training an encoder with language modeling (Radford et al., 2018; Peters et al., 2018; Devlin et al., 2018) or machine translation (McCann et al., 2017) has shown significant improvement on various natural language understanding (NLU) tasks, such as those in GLUE (Wang et al., 2018). The application scheme is to fine-tune the pre-trained encoder on a single-sentence classification task or a sequence labeling task. If a task has multiple inputs, they are simply concatenated into one sentence. This approach enables one model to generalize to different language understanding tasks. Our approach is also contextual pre-training, so it can be applied to various NLU tasks.
Cross-lingual Pre-training. Cross-lingual pre-training is a kind of transfer learning with different source and target domains (Pan and Yang, 2010). A high-quality cross-lingual representation space is assumed to enable effective cross-lingual transfer. Mikolov et al. (2013) used small dictionaries to align word representations from different languages, and an orthogonal transformation is sufficient to align different languages (Xing et al., 2015), even without parallel data (Lample et al., 2018). Following this line of work, word alignment can also be applied to multiple languages (Ammar et al., 2016). Artetxe and Schwenk (2018) use multilingual machine translation to train a multilingual sentence encoder and use the resulting fixed sentence embedding to classify XNLI. We take these ideas one step further by producing a pre-trained encoder instead of word embeddings or sentence embeddings.
Our work is based on two recent pre-trained cross-lingual encoders: Multilingual BERT (Devlin et al., 2018) and XLM (Lample and Conneau, 2019). Multilingual BERT trains a masked language model (MLM) with a shared vocabulary and shared weights for all 104 languages, but each training sample is a monolingual document. Keeping the same setting, XLM proposes a new task, TLM, which concatenates parallel sentences into one sample for masked language modeling. Besides these two tasks, we propose three new cross-lingual pre-training tasks for building a better language-independent encoder.

Approach
This section describes the details of Unicoder, including the tasks used in the pre-training procedure and its fine-tuning strategy.

Model Structure
Unicoder follows the network structure of XLM (Lample and Conneau, 2019). A shared vocabulary is constructed by running the Byte Pair Encoding (BPE) algorithm (Sennrich et al., 2016) on the corpora of all languages. We also down-sample the corpora of rich-resource languages, to prevent words of the target languages from being split too much at the character level.
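The text does not state the down-sampling rule. As an illustration only, the sketch below assumes exponent-smoothed sampling of the kind described for XLM, where the probability of drawing a sentence from a language is proportional to its corpus share raised to a power alpha < 1. The function name, the value alpha = 0.5 and the corpus sizes are hypothetical.

```python
def sampling_probs(sentence_counts, alpha=0.5):
    """Smoothed language sampling probabilities for building the BPE training corpus.

    sentence_counts: dict mapping language code -> number of sentences.
    alpha: smoothing exponent (assumed value); alpha < 1 shifts probability
           mass from rich-resource to low-resource languages.
    """
    total = sum(sentence_counts.values())
    weights = {lang: (n / total) ** alpha for lang, n in sentence_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

# Made-up corpus sizes, used only to show the effect of smoothing.
counts = {"en": 1_000_000, "sw": 10_000}
probs = sampling_probs(counts)
raw_share_sw = counts["sw"] / sum(counts.values())
```

With these toy counts, the low-resource language is sampled more often than its raw corpus share, so its words contribute more to the learned BPE merges.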

Pre-training Tasks in Unicoder
Both masked language model and translation language model are used in Unicoder by default, as they have shown strong performance in XLM.
Motivated by Liu et al. (2019), which shows that pre-trained models can be further improved by involving more tasks in pre-training, we introduce three new cross-lingual tasks in Unicoder.
All training data for these three tasks are acquired from existing large-scale, high-quality machine translation corpora.
Cross-lingual Word Recovery
Similar to translation language model, this task also aims to let the model learn the word alignment between two languages. Formally, given a bilingual sentence pair $(X, Y)$, where $X = (x_1, x_2, ..., x_m)$ is a sentence with $m$ words from language $s$ and $Y = (y_1, y_2, ..., y_n)$ is a sentence with $n$ words from language $t$, this task first represents each $x_i$ as $x_i^t \in \mathbb{R}^h$ by all word embeddings of $Y$:
$$x_i^t = \sum_{j=1}^{n} \mathrm{softmax}(A_{ij}) \, y_j^t$$
where the softmax is taken over $j$, $x_i^s \in \mathbb{R}^h$ and $y_j^t \in \mathbb{R}^h$ denote the word embeddings of $x_i$ and $y_j$ respectively, $h$ denotes the word embedding dimension, and $A \in \mathbb{R}^{m \times n}$ is an attention matrix calculated by:
$$A_{ij} = W \, [x_i^s, y_j^t, x_i^s \odot y_j^t]$$
$W \in \mathbb{R}^{3h}$ is a trainable weight and $\odot$ is element-wise multiplication. Then, Unicoder takes $X^t = (x_1^t, x_2^t, ..., x_m^t)$ as input and tries to predict the original word sequence $X$.
Similar to translation language model in XLM, this task is based on bilingual sentence pairs as well. However, as it does not use the original words as input, we can train this task by recovering all words at the same time. The model structure of this task is illustrated in Figure 1.a.
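The construction of the word-recovery input can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it assumes the attention score for a word pair is the dot product of W with the concatenation of the source embedding, the target embedding and their element-wise product, and that scores are normalized by a softmax over the target words. The function names and the random inputs are hypothetical.

```python
import numpy as np

def softmax(z, axis=-1):
    # Subtract the maximum along the axis for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def recover_inputs(X_s, Y_t, W):
    """Represent each source word as an attention-weighted sum of target embeddings.

    X_s: (m, h) word embeddings of the source sentence X.
    Y_t: (n, h) word embeddings of the target sentence Y.
    W:   (3h,) trainable weight vector.
    Returns (X_t, attn): X_t has shape (m, h), attn has shape (m, n).
    """
    m, h = X_s.shape
    n = Y_t.shape[0]
    xs = np.broadcast_to(X_s[:, None, :], (m, n, h))
    ys = np.broadcast_to(Y_t[None, :, :], (m, n, h))
    feats = np.concatenate([xs, ys, xs * ys], axis=-1)  # (m, n, 3h)
    A = feats @ W                                       # attention scores, (m, n)
    attn = softmax(A, axis=1)                           # normalize over target words
    return attn @ Y_t, attn

rng = np.random.default_rng(0)
m, n, h = 4, 6, 8
X_t, attn = recover_inputs(rng.normal(size=(m, h)),
                           rng.normal(size=(n, h)),
                           rng.normal(size=3 * h))
```

Because X_t contains no source-word embeddings, every position of X can serve as a prediction target at once, which matches the remark that all words are recovered at the same time.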

Cross-lingual Paraphrase Classification
This task takes two sentences from different languages as input and classifies whether they have the same meaning. Like the next sentence prediction task in BERT, we concatenate the two sentences into one sequence and input it to Unicoder. The representation of the first token in the final layer is used for the paraphrase classification task. This procedure is illustrated in Figure 1.b.
We create the cross-lingual paraphrase classification dataset from the machine translation dataset. Each bilingual sentence pair (X, Y) serves as a positive sample. For negative samples, the most straightforward method is to replace Y with a randomly sampled sentence from the target language, but this makes the classification task too easy. We therefore introduce hard negative samples, following Guo et al. (2018). First, we train a light-weight paraphrase model with random negative samples. Then we use this model to select sentences that have a high similarity score to X but are not equal to Y as hard negative samples. We choose DAN (Iyyer et al., 2015) as the light-weight model. We create positive and negative samples in a 1:1 ratio.
Cross-lingual Masked Language Model
Previous successful language model pre-training (Devlin et al., 2018; Radford et al., 2018) is conducted on document-level corpora rather than sentence-level corpora. Language model perplexity on documents is also much lower than on sentences (Peters et al., 2018). We therefore propose cross-lingual masked language model, whose input comes from cross-lingual documents.
A cross-lingual document is a sequence of sentences written in different languages. In most cases, people do not write cross-lingual documents. But we found that a large proportion of the aligned sentence pairs in machine translation corpora are extracted from parallel documents, such as the MultiUN corpus and the OpenSubtitles corpus. In other words, these MT corpora are document-level corpora in which each sentence and its translation are well aligned. We construct a cross-lingual document by replacing the sentences with even index by their translations, as illustrated in Figure 1.c. We truncate the cross-lingual document to a sequence length of 256 and feed it to Unicoder for masked language modeling.
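Hard-negative selection can be sketched as follows. This is a simplified illustration under stated assumptions: cosine similarity between precomputed sentence vectors stands in for the score of the trained light-weight paraphrase model, and the function names and random vectors are hypothetical.

```python
import numpy as np

def mine_hard_negatives(src_vecs, tgt_vecs, top_k=1):
    """For each source sentence i, pick the most similar target sentences other than its translation.

    src_vecs, tgt_vecs: (N, d) sentence vectors; row i of tgt_vecs is the
    translation of row i of src_vecs.
    Returns an (N, top_k) array of target indices.
    """
    src = src_vecs / np.linalg.norm(src_vecs, axis=1, keepdims=True)
    tgt = tgt_vecs / np.linalg.norm(tgt_vecs, axis=1, keepdims=True)
    sim = src @ tgt.T                  # cosine similarity, (N, N)
    np.fill_diagonal(sim, -np.inf)     # exclude the true translation Y
    return np.argsort(-sim, axis=1)[:, :top_k]

def build_pairs(src_vecs, tgt_vecs):
    """Return (src_index, tgt_index, label) triples, positives and hard negatives in a 1:1 ratio."""
    negatives = mine_hard_negatives(src_vecs, tgt_vecs, top_k=1)[:, 0]
    pairs = [(i, i, 1) for i in range(len(src_vecs))]
    pairs += [(i, int(j), 0) for i, j in enumerate(negatives)]
    return pairs

# Toy vectors: each "translation" is a noisy copy of its source vector.
rng = np.random.default_rng(0)
src = rng.normal(size=(5, 16))
tgt = src + 0.1 * rng.normal(size=(5, 16))
pairs = build_pairs(src, tgt)
```

Masking the diagonal of the similarity matrix is what keeps the true translation out of the negative set; everything else is a plain nearest-neighbour lookup.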
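The document construction described above can be sketched in a few lines. This is a minimal illustration with hypothetical names; it assumes sentences are already tokenized and that sentence indices are counted from 0, which the text does not specify.

```python
def build_cross_lingual_document(src_sents, tgt_sents, max_len=256):
    """Build a cross-lingual document from a sentence-aligned parallel document.

    src_sents, tgt_sents: lists of token lists; tgt_sents[i] translates src_sents[i].
    Sentences with an even index (0-based, an assumption) are replaced by
    their translation, and the result is truncated to max_len tokens.
    """
    if len(src_sents) != len(tgt_sents):
        raise ValueError("the two documents must be aligned sentence by sentence")
    tokens = []
    for i, (src, tgt) in enumerate(zip(src_sents, tgt_sents)):
        tokens.extend(tgt if i % 2 == 0 else src)
    return tokens[:max_len]

en = [["hello", "world"], ["how", "are", "you"], ["fine"]]
fr = [["bonjour", "monde"], ["comment", "vas", "tu"], ["bien"]]
doc = build_cross_lingual_document(en, fr)
```

The resulting token sequence alternates languages sentence by sentence and can then be masked like any other masked language modeling input.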

Multi-language Fine-tuning
A typical setting of cross-lingual language understanding is that only one language has training data, while the test is conducted on other languages. We call the language with training data the source language and the other languages target languages. A scalable way to address this problem (Conneau et al., 2018) is Cross-lingual TEST, in which a pre-trained encoder is fine-tuned on data in the source language and directly evaluated on data in the target languages.
There are two other methods, based on machine translation, that make training and test data belong to the same language. TRANSLATE-TRAIN translates the source language training data to a target language and fine-tunes on this pseudo training data. TRANSLATE-TEST fine-tunes on the source language training data, but translates the target language test data to the source language and tests on it.
Inspired by multi-task learning for improving pre-trained models (Liu et al., 2018, 2019), we propose a new fine-tuning strategy, Multi-language Fine-tuning: we fine-tune on both the source language training data and the pseudo target language training data. If there are multiple target languages, we fine-tune on all of them at the same time.
Different languages may have totally different vocabulary and syntax. But our experiments show that in most cases, joint fine-tuning on multiple languages brings a large improvement. In only a few cases does it harm performance.
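The four fine-tuning settings differ only in which languages supply training data and in which language the test input is presented. The sketch below makes this explicit; the function name, the setting strings and the default language codes are hypothetical, and any training language other than the source stands for machine-translated (pseudo) training data.

```python
def finetune_plan(setting, source="en", target="zh", all_targets=("fr", "de", "zh")):
    """Return (training languages, language of the test input) for one target language."""
    if setting == "cross-lingual-test":
        # Train on the source language, test directly on the target language.
        return [source], target
    if setting == "translate-train":
        # Train on source data translated into the target language.
        return [target], target
    if setting == "translate-test":
        # Train on the source language, test on target data translated into the source language.
        return [source], source
    if setting == "multi-language":
        # Train on source data plus translated data for every target language.
        return [source, *all_targets], target
    raise ValueError(f"unknown setting: {setting}")

plans = {name: finetune_plan(name) for name in
         ("cross-lingual-test", "translate-train", "translate-test", "multi-language")}
```

Seen this way, Multi-language Fine-tuning is the only setting whose training set grows with the number of target languages, while its test input stays in the target language.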

Experiment
In this section, we describe the data processing and training details. Then we compare Unicoder with the current state-of-the-art approaches on two tasks: XNLI and XQA.

Data Processing
Our model is pre-trained on 15 languages: English (en), French (fr), Spanish (es), German (de), Greek (el), Bulgarian (bg), Russian (ru), Turkish (tr), Arabic (ar), Vietnamese (vi), Thai (th), Chinese (zh), Hindi (hi), Swahili (sw) and Urdu (ur). For MLM, we use the Wikipedia of these languages. The other four tasks need MT datasets. We use the same MT dataset as Lample and Conneau (2019), which is collected from MultiUN (Eisele and Chen, 2010), the IIT Bombay corpus (Kunchukuttan et al., 2017), OpenSubtitles 2018, the EUbookshop corpus and GlobalVoices. In the MT corpus, 13 of the 14 languages (all except the IIT Bombay corpus) come from parallel documents and can be used to train the cross-lingual document language model. The amount of data we used is reported in Table 1.
For tokenization, we follow Koehn et al. (2007) and Chang et al. (2008) for each language. We use byte-pair encoding (BPE) to process the corpus and build the vocabulary.

Pre-training details
To reduce pre-training time, we initialize our model from XLM (Lample and Conneau, 2019). We pre-train Unicoder with five tasks: MLM, TLM and our three cross-lingual tasks. In each step, we iterate over these five tasks. A batch for these tasks is available in 15 languages, and we sample several languages with equal probability. We use a batch size of 512, obtained by gradient accumulation. We train our model with the Adam optimizer (Kingma and Ba, 2015), and the learning rate starts from 1e−5 with inverse square root decay (Vaswani et al., 2017). We run our pre-training experiments on a single server with 8 V100 GPUs and use FP16 to save memory.
The max sequence length of MLM and cross-lingual language model is 256. For the other three tasks, which take two sentences as input, we set the max sequence length of each sentence to 128 so that their sum is 256.
Fine-tuning details
For the fine-tuning stage, we use the same optimizer and learning rate as in pre-training. We set the batch size to 32.
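Two of the optimization details above can be illustrated with a short sketch: the inverse square root learning rate decay and the number of accumulation steps needed to reach an effective batch size of 512 on 8 GPUs. The warm-up length and the per-GPU batch size are not given in the text and are assumed values here; the function names are hypothetical.

```python
import math

def inverse_sqrt_lr(step, base_lr=1e-5, warmup_steps=1):
    """Inverse square root decay.

    The rate stays at base_lr for the first warmup_steps updates (no linear
    ramp is modelled) and then decays proportionally to 1 / sqrt(step).
    """
    step = max(step, 1)
    return base_lr * math.sqrt(warmup_steps / max(step, warmup_steps))

def accumulation_steps(effective_batch=512, per_gpu_batch=16, num_gpus=8):
    """Number of micro-batches to accumulate before each optimizer update."""
    per_step = per_gpu_batch * num_gpus
    if effective_batch % per_step != 0:
        raise ValueError("effective_batch must be a multiple of per_gpu_batch * num_gpus")
    return effective_batch // per_step

lr_at_1 = inverse_sqrt_lr(1)
lr_at_4 = inverse_sqrt_lr(4)
steps = accumulation_steps()
```

Under these assumed values, gradients from four forward and backward passes are summed before each parameter update.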
Experimental evaluation
XNLI: Cross-lingual Natural Language Inference
Natural Language Inference (NLI) takes two sentences as input and determines whether one entails the other, contradicts it, or neither (neutral). XNLI is NLI defined on 15 languages. Each language contains 5000 human-annotated development and test examples. Only English has training data, which is a crowd-sourced collection of 433k sentence pairs from MultiNLI (Williams et al., 2018). Performance is evaluated by classification accuracy.
We report the results on XNLI in Table 2, comparing our Unicoder model with four baselines. Conneau et al. (2018) use an LSTM as the sentence encoder and constrain bilingual sentence pairs to have similar embeddings. The other baselines are pre-training based approaches. Multilingual BERT (Devlin et al., 2018) trains a masked language model on multilingual Wikipedia. Artetxe and Schwenk (2018) pre-train a machine translation model and use the MT encoder to produce sentence embeddings. XLM (Lample and Conneau, 2019) trains a masked language model on multilingual Wikipedia and additionally uses translation language model on MT bilingual sentence pairs for pre-training.
Based on the results, Unicoder obtains the best result in every fine-tuning setting. In TRANSLATE-TRAIN, TRANSLATE-TEST and Cross-lingual TEST, Unicoder obtains 76.9%, 74.9% and 75.4% accuracy on average, respectively. In Multi-language Fine-tuning, Unicoder outperforms XLM by 0.7%. By translating the English training data to the target language, both TRANSLATE-TRAIN and Multi-language Fine-tuning outperform the other fine-tuning approaches on average, no matter which encoder is used. Our Multi-language Fine-tuning approach is even better than TRANSLATE-TRAIN: with it, XLM is improved by 1.1% and Unicoder by 1.6%. By combining Unicoder and Multi-language Fine-tuning, Unicoder achieves a new state of the art of 78.5%, a 1.8% accuracy gain over the previous state of the art, XLM fine-tuned with TRANSLATE-TRAIN.
The fine-tuning settings in Table 2 are defined as follows: TRANSLATE-TRAIN machine-translates the English training data into the target language and fine-tunes on the translated data; TRANSLATE-TEST machine-translates the target language test data into English, with fine-tuning conducted on English; Cross-lingual TEST fine-tunes on English and tests directly on the target language; Multi-language Fine-tuning fine-tunes on machine-translated training data in all languages.
XQA: Cross-lingual Question Answering
We propose a new dataset, XQA. Question answering takes a question and an answer as input, then classifies whether the answer is relevant to the question. Each answer is a short passage. XQA contains three languages: English, French and German.
Table 3: Results on XQA. The average column is the average of the fr and de results.
We also evaluate Unicoder on XQA, using XLM fine-tuned on our dataset with its published code as the baseline. We split off 5K examples from the training set as development data for model selection.
The results are shown in Table 3. We find that: 1) our model outperforms XLM in every fine-tuning setting; in Multi-language Fine-tuning, we achieve a 2.0% gain. 2) With Unicoder, Multi-language Fine-tuning achieves a 3.3% gain compared to TRANSLATE-TRAIN; XLM is also improved by 3.5%. 3) By combining Unicoder and TRANSLATE-TRAIN, we achieve the best performance.
[Table 4 columns: language number; XNLI-en; XNLI-ar; XNLI-es; XNLI-fr; XNLI-ru; XNLI-zh; average. Values are accuracy.]

Analysis
In this section, we provide an ablation analysis of different variants of our approach and elucidate some interesting aspects of Unicoder. Sec. 5.1 is the ablation study of each cross-lingual pre-training task; it also shows the impact of Multi-language Fine-tuning. Sec. 5.2 explores the impact of the number of languages. Sec. 5.3 analyzes the relation between English and other languages by joint fine-tuning on two languages. We then further explore the relation between any language pair (Sec. 5.4).

Ablation Study
To examine the utility of our new cross-lingual pre-training tasks, we conduct an ablation study on the XNLI dataset. We remove each of the three cross-lingual pre-training tasks in turn and pre-train only on the remaining tasks. We then fine-tune Unicoder with Multi-language Fine-tuning. The results are presented in Table 2. The ablation experiments show that removing any task leads to a performance drop. Comparing Unicoder with XLM, we can draw several conclusions from the results in Table 2. First, removing cross-lingual paraphrase classification causes the smallest drop, while removing the word recovery task hurts performance significantly. For example, on XNLI, using just the cross-lingual language model and paraphrase classification improves average test accuracy by 0.4%, and adding the word recovery task lets Unicoder learn a better representation, improving average accuracy by another 0.3%. Second, Multi-language Fine-tuning helps exploit the relation between languages, which we analyze below. Table 2 and Table 3 both show that it brings a significant boost in cross-lingual language understanding performance. With Multi-language Fine-tuning, Unicoder is improved by 1.6% accuracy on XNLI and 3.3% on XQA.

The relation between language number and fine-tuning performance
In Table 2, we showed that Multi-language Fine-tuning with 15 languages is better than TRANSLATE-TRAIN, which fine-tunes on only 1 language. In this sub-section, we try more settings to analyze the relation between the number of languages and fine-tuning performance.
In this experiment, only English has human-labeled training data; the other languages use training data machine-translated from English. The experiment is conducted on 6 languages, which are the languages of the MT corpus Multilingual United Nations (MultiUN).
We have four settings. 1 language is equal to TRANSLATE-TRAIN: the pre-trained model is fine-tuned on the target language. 2 languages fine-tunes on English and the target language; for English, we report the average result when fine-tuning with each of the other 5 languages, respectively. 6 languages fine-tunes on the 6 languages selected for this experiment. 15 languages fine-tunes on all 15 languages our model supports and reports the results on the 6 languages; this setting is equal to the last row of Table 2.
The results are shown in Table 4. For most languages, the more languages we use in fine-tuning, the better the performance. Chinese and Russian have two numbers that do not follow this trend. But 15 languages always outperforms 1 language, for every language.
The most surprising result is that English can be improved by Multi-language Fine-tuning even though it is the source language and has human-labeled training data. In the next experiment, we show that in the 2 languages setting, the improvement on English is not stable and depends on the other language. But from 1 language to 6 languages and to 15 languages, English improves steadily.
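The four settings can be written down as a small helper that lists the fine-tuning runs behind each reported number. This is an illustrative sketch with hypothetical names; it takes the six MultiUN languages to be en, ar, es, fr, ru and zh, and returns several runs only for English in the 2 languages setting, where results are averaged over the five partner languages.

```python
# The six evaluated (MultiUN) languages and the 15 pre-training languages.
MULTIUN_LANGS = ["en", "ar", "es", "fr", "ru", "zh"]
ALL_LANGS = ["en", "fr", "es", "de", "el", "bg", "ru", "tr",
             "ar", "vi", "th", "zh", "hi", "sw", "ur"]

def finetune_runs(num_languages, target):
    """Return the fine-tuning runs for one evaluation language.

    Each run is a list of training languages; when several runs are
    returned, their results on `target` are averaged.
    """
    if target not in MULTIUN_LANGS:
        raise ValueError(f"{target} is not one of the six evaluated languages")
    if num_languages == 1:
        return [[target]]                      # same as TRANSLATE-TRAIN
    if num_languages == 2:
        if target == "en":
            # English is paired with each of the other five languages in turn.
            return [["en", other] for other in MULTIUN_LANGS if other != "en"]
        return [["en", target]]
    if num_languages == 6:
        return [list(MULTIUN_LANGS)]
    if num_languages == 15:
        return [list(ALL_LANGS)]
    raise ValueError("num_languages must be 1, 2, 6 or 15")

runs_en_2 = finetune_runs(2, "en")
runs_zh_2 = finetune_runs(2, "zh")
```

Only the English entry of the 2 languages setting requires more than one fine-tuning run.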

The relation between English and other languages
In this experiment, we jointly fine-tune on English and one other language. This lets us test the relation between English and other languages, since all languages have equal status in pre-training and fine-tuning. We report the performance on English and the average over 15 languages; the results are given in Table 6. First, most of the average results are improved by jointly fine-tuning on two languages. Only Vietnamese (vi) and Urdu (ur) lead to a performance drop. Second, the improvement on English is not stable. French (fr) and Spanish (es) improve English performance, but Vietnamese (vi) and Thai (th) lead to a big drop.

The relation between different languages
To better understand the relation between different languages, we fine-tune Unicoder on one language and test on all 15 languages. The results are shown in Table 5. The numbers on the diagonal correspond to the TRANSLATE-TRAIN results reported in Table 2.
We observe that Unicoder can transfer knowledge from one language to another. Fine-tuning on one language often leads to the best performance on that language, except for Greek (el) and Urdu (ur). In fact, TRANSLATE-TRAIN on Urdu even harms performance. Urdu also has the worst generalization ability to other languages. Russian (ru) has the best generalization ability, even better than the source language English.
We also find that transfer between English (en), Spanish (es) and French (fr) is easier than between other languages. The MT systems between these languages also outperform those for other languages (Conneau et al., 2018).

Conclusion
We have introduced Unicoder, which is insensitive to different languages. We pre-train Unicoder with three new cross-lingual tasks: cross-lingual word recovery, cross-lingual paraphrase classification and cross-lingual masked language model. We also propose a new Multi-language Fine-tuning approach. The experiments on XNLI and XQA show that Unicoder brings large improvements, and our approach becomes the new state of the art on XNLI and XQA. We also show experimentally that the more languages we use in fine-tuning, the better the results. Even a rich-resource language can be improved.

Figure 1: Unicoder consists of three cross-lingual pre-training tasks: (a) cross-lingual word recovery, which learns word relations across languages; (b) cross-lingual paraphrase classification, which classifies whether two sentences from different languages are paraphrases; (c) cross-lingual masked language model, which trains a masked language model on cross-lingual documents.

Figure 2: Cross-lingual fine-tuning currently has three baseline approaches, which can be defined by their training and test data. Suppose the target is to test on Chinese data. TRANSLATE-TRAIN trains on Chinese training data translated from English and tests on Chinese test data; TRANSLATE-TEST trains on English training data and tests on English test data translated from Chinese; Cross-lingual TEST trains on English training data and tests on Chinese test data. Multi-language Fine-tuning trains on English training data together with training data in multiple other languages translated from English, then tests on Chinese test data.

Table 1: Number of sentences used in pre-training.

Table 2: Test accuracy on the 15 XNLI languages. This table is organized by fine-tuning and test approaches.
In XQA, only English has training data. The training data cover various domains, such as health, tech and sports. The English training data contain millions of samples, and each language has 500 test examples. The experimental setup is kept the same as for XNLI.

Table 4: Fine-tuning on different numbers of languages. The model is evaluated on 6 languages, and the average result is in the last column. The results in the last row correspond to the results in the last row of Table 2.

Table 5: Accuracy on the XNLI test set when fine-tuning Unicoder on one language and testing on other languages. The results on the diagonal correspond to the TRANSLATE-TRAIN accuracy reported in Table 2.

Table 6: Results of joint fine-tuning on two languages. This table reports the result on English and the average accuracy over 15 languages on XNLI. The last row means Unicoder is fine-tuned only on English.