XGLUE: A New Benchmark Dataset for Cross-lingual Pre-training, Understanding and Generation

In this paper, we introduce XGLUE, a new benchmark dataset that can be used to train large-scale cross-lingual pre-trained models using multilingual and bilingual corpora, and to evaluate their performance across a diverse set of cross-lingual tasks. Compared to GLUE (Wang et al., 2019), which is labeled in English and includes natural language understanding tasks only, XGLUE has three main advantages: (1) it provides two corpora with different sizes for cross-lingual pre-training; (2) it provides 11 diversified tasks that cover both natural language understanding and generation scenarios; (3) for each task, it provides labeled data in multiple languages. We extend a recent cross-lingual pre-trained model, Unicoder (Huang et al., 2019), to cover both understanding and generation tasks, and evaluate it on XGLUE as a strong baseline. We also evaluate the base versions (12-layer) of Multilingual BERT, XLM and XLM-R for comparison.


Introduction
Pre-training + fine-tuning has become a new NLP paradigm, where general knowledge is first learned from a large-scale corpus by self-supervised learning and then transferred to downstream tasks by task-specific fine-tuning. Three different types of pre-trained models have been explored recently: monolingual pre-trained models (Radford et al., 2018; Devlin et al., 2019; Yang et al., 2019b; Lewis et al., 2019a), multilingual and cross-lingual pre-trained models (Devlin et al., 2019; Conneau and Lample, 2019), and multimodal pre-trained models (Lu et al., 2019; Li et al., 2020). In this paper, we focus on cross-lingual pre-trained models, due to their importance in alleviating the low-resource issue among languages, where an NLP task often has rich training data in one language (such as English) but few or no training data in other languages (such as French and German). In order to further advance the development of cross-lingual pre-trained models for various downstream tasks in different languages, this paper introduces XGLUE, a new benchmark dataset that can be used to: (i) train large-scale cross-lingual pre-trained models using multilingual and bilingual corpora, and (ii) evaluate the generalization capabilities of cross-lingual pre-trained models across a diverse set of cross-lingual tasks.
The contribution of XGLUE is two-fold. First, it provides 11 diversified cross-lingual tasks covering both understanding and generation scenarios, which, to the best of our knowledge, is the first attempt of its kind in cross-lingual dataset construction efforts. XTREME (Hu et al., 2020) is a concurrent work of XGLUE, but it includes cross-lingual understanding tasks only. Second, an extended version of Unicoder (Huang et al., 2019) is described and evaluated as a strong cross-lingual pre-trained model baseline on XGLUE for both understanding and generation tasks. We also evaluate the base versions (12-layer) of Multilingual BERT (Devlin et al., 2019), XLM (Conneau and Lample, 2019) and XLM-R (Conneau et al., 2020) for comparison. We conduct comprehensive experiments on XGLUE, which not only show interesting findings, but also point out several ways to further improve cross-lingual pre-trained models.

Pre-training Corpus
We collect two corpora, Small Corpus and Large Corpus, with different sizes for cross-lingual pre-training: the former can be used to evaluate new ideas effectively and the latter can be used to train large-scale models. Table 1 lists the data statistics.

Small Corpus (SC)
Multilingual Corpus We extract raw sentences from the Wikipedia dump using WikiExtractor, which leads to a 101G multilingual corpus covering 100 languages.

Bilingual Corpus
We use an in-house pipeline to extract bilingual sentence pairs from the Web, which leads to a 99G bilingual corpus covering 27 languages, including Arabic, Bulgarian, Danish, German, Greek, English, Spanish, Finnish, French, Hebrew, Hindi, Hungarian, Indonesian, Italian, Japanese, Korean, Dutch, Polish, Portuguese, Russian, Swedish, Swahili, Thai, Turkish, Urdu, Vietnamese and Chinese.

Large Corpus (LC)

Bilingual Corpus
We reuse the bilingual corpus described in Section 2.1.1. We will add CCMatrix (Schwenk et al., 2019) in the future.

Downstream Tasks
We select 11 cross-lingual tasks for XGLUE, which are categorized into 3 groups: single-input understanding tasks, pair-input understanding tasks, and generation tasks. For each task, the training set is available in English only. In order to obtain a good performance on XGLUE, a model should be able to learn how to do a task well using its English training set, and then transfer this ability to test sets in other languages. Table 2 gives the dataset statistics and Table 3 lists the languages covered by all tasks.

Single-input Understanding Tasks
NER We select a subset of the following two NER tasks, CoNLL-2002 NER (Sang, 2002) and CoNLL-2003 NER (Sang and De Meulder, 2003), to form this cross-lingual NER dataset. It covers 4 languages, including English, German, Spanish and Dutch, and 4 types of named entities: Person, Location, Organization, and Miscellaneous entities that do not belong to the previous three types. F1 score is used as the metric.
News Classification (NC) This task aims to predict the right category of a given news article. We crawl <news article, news category> pairs from a commercial news website. It covers 10 different news categories and 5 languages, including English, Spanish, French, German and Russian. Accuracy (ACC) of the multi-class classification is used as the metric.

Pair-input Understanding Tasks
MLQA MLQA (Lewis et al., 2019b) is a multilingual machine reading comprehension task, which contains QA annotations labeled in 7 languages, including English, Arabic, German, Spanish, Hindi, Vietnamese and Chinese. F1 score of the predicted answers is used as the metric.
PAWS-X PAWS-X (Yang et al., 2019a) is a paraphrase identification dataset, which extends the Wikipedia portion of the PAWS (Zhang et al., 2019) evaluation to more languages. We select 4 languages, including English, Spanish, French and German, from the original dataset and use them in XGLUE. Accuracy (ACC) of the binary classification is used as the metric.
Query-Ads Matching (QADSM) This task aims to find the most relevant ads for a given user query. We collect this dataset from a commercial search engine. It covers 3 languages, including English, French and German. Each labeled instance is a triple: <query, ad title+ad description, label>. Accuracy (ACC) of the binary classification is used as the metric.
Web Page Ranking (WPR) This task aims to find the most relevant web pages for a given user query. We construct this dataset based on user clicks obtained from a commercial search engine. It covers 6 languages, including English, German, French, Italian, Portuguese and Chinese. Each labeled instance is a 4-tuple: <query, web page title, web page snippet, label>. The label contains 4 ratings: high relevance, middle relevance, low relevance and not related. Normalized Discounted Cumulative Gain (nDCG) is used as the metric.
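For concreteness, below is a minimal sketch of one common nDCG formulation; the exact gain, discount, and cutoff used for WPR are not specified in this paper, so treat these details as assumptions.

    import numpy as np

    def ndcg(relevances, k=10):
        """nDCG for one query: `relevances` are the graded labels of the
        returned pages in ranked order, e.g. 3 = high relevance, 2 = middle,
        1 = low, 0 = not related."""
        rel = np.asarray(relevances, dtype=float)[:k]
        discounts = 1.0 / np.log2(np.arange(2, rel.size + 2))   # 1 / log2(rank + 1)
        dcg = float((rel * discounts).sum())
        # Ideal DCG: the same labels sorted from most to least relevant.
        ideal = np.sort(np.asarray(relevances, dtype=float))[::-1][:k]
        idcg = float((ideal * discounts[: ideal.size]).sum())
        return dcg / idcg if idcg > 0 else 0.0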
QA Matching (QAM) This task aims to determine whether a <question, passage> pair is a valid QA pair. We construct this dataset based on a commercial search engine. It covers 3 languages, including English, French and German. Each labeled instance is a 3-tuple: <question, passage, label>. The label indicates whether the passage is the answer to the question or not. Accuracy (ACC) of the binary classification is used as the metric.

Generation Tasks
Question Generation (QG) This task aims to generate a natural language question for a given passage. We construct this dataset from the same data source as the QAM task, but here the inputs are passages and the outputs are questions. BLEU-4 score is used as the metric.
News Title Generation (NTG) This task aims to generate a proper title for a given news body. We crawl <news title, news body> pairs from a commercial news website. It covers 5 languages, including German, English, French, Spanish and Russian. BLEU-4 score is used as the metric.
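As a usage note, BLEU-4 can be computed with a standard toolkit such as sacrebleu; this is our choice for illustration, as the paper does not name its implementation. The example strings are taken from the qualitative NTG examples shown later (Table 10).

    from sacrebleu.metrics import BLEU

    bleu = BLEU()   # up to 4-gram precision by default, i.e. BLEU-4
    hypotheses = ["tourist traps you should avoid in europe"]
    # One reference stream; each stream holds one reference per hypothesis.
    references = [["do yourself a favor and avoid these tourist traps in europe"]]
    print(bleu.corpus_score(hypotheses, references).score)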

Pre-train Unicoder for Cross-lingual Understanding Tasks
We select Unicoder (Huang et al., 2019) as the backbone model. Section 3 introduces a simplified version of Unicoder using two pre-training tasks (MLM and TLM) for cross-lingual understanding tasks. Section 4 describes how to extend Unicoder to cover cross-lingual generation tasks. The original Unicoder (Huang et al., 2019) includes more pre-training tasks besides MLM and TLM, but to keep the baseline pre-trained model simple and to reduce the experimental cost, we use only these two most commonly used tasks. This means that, for understanding tasks, Unicoder is almost identical to XLM, except for several hyper-parameter differences. We will add the results of Unicoder pre-trained with more tasks beyond MLM and TLM in an updated version.

Masked Language Model (MLM)
Following Devlin et al. (2019), this task extends the masked language model task to multiple languages. At each iteration, a batch is composed of sentences sampled from different languages. The sampling probability of a language l_i is:

λ_{l_i} = p_{l_i}^α / Σ_{j=1}^{N} p_{l_j}^α

where p_{l_i} is the percentage of the language l_i in the entire corpus and the smoothing factor α is set to 0.3. For each batch, we randomly sample 15% of the words and (i) replace them with a special symbol [MASK], (ii) replace them with a random token, or (iii) keep them unchanged, with probability 80%, 10% and 10%, respectively. For each token, we only use its word embedding and position embedding, and discard the segment embedding and language embedding.
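A minimal sketch of the two ingredients above, smoothed language sampling and BERT-style 80/10/10 masking; function and variable names are ours, not from the paper.

    import numpy as np

    def language_sampling_probs(counts, alpha=0.3):
        """Exponentially smoothed sampling probabilities over languages.
        counts: corpus size per language. alpha < 1 up-weights
        low-resource languages relative to their raw share p_{l_i}."""
        p = np.asarray(counts, dtype=np.float64)
        p = p / p.sum()                 # raw percentage p_{l_i}
        lam = p ** alpha
        return lam / lam.sum()          # smoothed lambda_{l_i}

    def mask_tokens(tokens, vocab, mask_token="[MASK]", rate=0.15, rng=None):
        """Choose 15% of positions; 80% -> [MASK], 10% -> random token,
        10% -> unchanged. Returns the corrupted tokens and the gold targets."""
        rng = rng or np.random.default_rng()
        out, targets = list(tokens), {}
        for i in range(len(tokens)):
            if rng.random() < rate:
                targets[i] = tokens[i]
                r = rng.random()
                if r < 0.8:
                    out[i] = mask_token
                elif r < 0.9:
                    out[i] = vocab[int(rng.integers(len(vocab)))]
                # else: keep the original token unchanged
        return out, targets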

Translation Language Model (TLM)
Following Conneau and Lample (2019), this task extends the MLM task to bilingual corpora. Given a bilingual sentence pair, TLM first concatenates the two sentences into a single sequence, and then masks words using the same strategy as MLM. The pre-trained model learns to recover each masked word based on the bilingual context. We follow MLM to sample language pairs in each batch with α = 0.3.
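Continuing the sketch above, a TLM training instance could then be built by concatenating a sentence pair before masking; the separator symbol here is an illustrative assumption.

    def make_tlm_instance(src_tokens, tgt_tokens, vocab, sep="[/S]", rng=None):
        """Concatenate a bilingual sentence pair into one sequence and apply
        the same 15% / 80-10-10 masking; the model can attend across both
        languages to recover each masked word."""
        return mask_tokens(src_tokens + [sep] + tgt_tokens, vocab, rng=rng)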

Multilingual Denoising Auto-Encoding (xDAE)
Motivated by BART (Lewis et al., 2019a), xDAE aims to predict the original text X given c(X), where c(·) is a noising function that corrupts an input text X. Four different noising strategies for c(·) are explored in this paper. (1) Shuffle the input text X by adding a noise α ∼ U(0, 3) to the input indices and then re-ordering X based on the rank of the noised indices.
(2) Drop words with a probability of 0.1. (3) Replace 10% of the input words in X with the [MASK] symbol. (4) Sample a number of token spans from X with span lengths drawn from a Poisson distribution (λ = 3), and then replace each token span with a single [MASK] token. Here, 0-length spans correspond to the insertion of [MASK] tokens. Based on the performance of different noising strategies (Table 11), we select (4) and use it in pre-training. We leave finding better noising strategies for future work.
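For concreteness, here is a rough Python sketch of strategies (1) and (4) as we read them; the helper names, the per-position trigger rule, and the 15% mask budget are our assumptions, not details from the paper.

    import numpy as np

    def shuffle_with_noise(tokens, noise_high=3.0, rng=None):
        """Strategy (1): add Uniform(0, noise_high) noise to each position
        index, then reorder the tokens by the noised indices (a mild local
        shuffle)."""
        rng = rng or np.random.default_rng()
        keys = np.arange(len(tokens)) + rng.uniform(0.0, noise_high, len(tokens))
        return [tokens[i] for i in np.argsort(keys, kind="stable")]

    def poisson_span_mask(tokens, mask_ratio=0.15, lam=3.0,
                          mask_token="[MASK]", rng=None):
        """Strategy (4): replace sampled token spans with a single [MASK]
        each. Span lengths ~ Poisson(lam); a 0-length span means inserting
        a [MASK]. The mask_ratio budget is our assumption."""
        rng = rng or np.random.default_rng()
        out, i, budget = [], 0, int(mask_ratio * len(tokens))
        while i < len(tokens):
            if budget > 0 and rng.random() < mask_ratio:
                span = int(rng.poisson(lam))
                out.append(mask_token)
                if span == 0:            # pure insertion: keep the current token too
                    out.append(tokens[i])
                    i += 1
                else:
                    i += span            # the span's tokens are consumed
                    budget -= span
            else:
                out.append(tokens[i])
                i += 1
        return out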
We train Unicoder using this task by maximizing the following objective L_xDAE:

L_xDAE = Σ_{l_i ∈ L} Σ_{X ∈ l_i} Σ_{t=1}^{|X|} log p(x_t | x_{<t}, c(X))

where L = {l_1, ..., l_N} denotes the N languages, X is an instance in the i-th language l_i, and p(x_t | x_{<t}, c(X)) denotes the probability of generating the token x_t at time step t given c(X) and x_{<t}.

Multilingual Future N-gram Prediction (xFNP)
Motivated by ProphetNet (Yan et al., 2020), xFNP introduces a future n-gram prediction mechanism to natural language generation. It encourages the model to plan for the future tokens explicitly and prevents over-fitting on strong local correlations.
Given an input text X = (x_1, x_2, ..., x_{|X|}) in a language l_i, we randomly mask k token spans of X to generate the masked text X′ as the input, and concatenate all masked token spans into Y as the output. Details of this masking strategy are described in Section 6.1. After this, xFNP first encodes X′ into hidden states H_enc with the encoder:

H_enc = Encoder(X′)

Then, instead of predicting only the next token at each time step, xFNP generates n future tokens simultaneously at time step t with the decoder:

p(y_t | y_{<t}, X′), ..., p(y_{t+n-1} | y_{<t}, X′) = Decoder(y_{<t}, H_enc)

Following Yan et al. (2020), we set n = 2.
We train Unicoder using this task by maximizing the following objective L_xFNP:

L_xFNP = Σ_{l_i ∈ L} Σ_{X ∈ l_i} Σ_{j=0}^{n-1} α_j Σ_t log p(y_{t+j} | y_{<t}, X′)

where X′ and Y are generated from X by the method described above, and the α_j are weights balancing the future-token predictions, following Yan et al. (2020).
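To make the objective concrete, here is a small PyTorch-style sketch of a future n-gram loss; the head structure, the loss weights alphas, and the padding handling are our assumptions rather than the paper's implementation (ProphetNet uses an n-stream self-attention decoder, which this sketch simplifies away).

    import torch.nn.functional as F

    def future_ngram_loss(hidden, targets, heads, alphas=(0.7, 0.3), pad_id=0):
        """Future n-gram objective (n = len(heads)): from the decoder state
        at step t, head j predicts the gold token at step t + j.

        hidden:  (B, T, H) decoder hidden states.
        targets: (B, T) gold output tokens (the concatenated spans Y).
        heads:   per-offset nn.Linear(H, vocab) projections, weighted by alphas.
        """
        loss = hidden.new_zeros(())
        T = hidden.size(1)
        for j, (head, a) in enumerate(zip(heads, alphas)):
            if T - j <= 0:
                break
            logits = head(hidden[:, : T - j])    # predict the token at t + j
            gold = targets[:, j:]                # gold tokens shifted by j
            loss = loss + a * F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), gold.reshape(-1),
                ignore_index=pad_id)
        return loss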

Related Work
Dataset GLUE (Wang et al., 2019) includes 9 natural language understanding tasks that are labeled in English only. Compared to GLUE, XGLUE not only expands task annotations to multiple languages, but also includes natural language generation tasks. XNLI (Conneau et al., 2018), NER (Sang, 2002; Sang and De Meulder, 2003), POS Tagging (Kim et al., 2017), MLQA (Lewis et al., 2019b) and PAWS-X (Yang et al., 2019a) are 5 multilingual datasets built for specific tasks. XGLUE not only includes these 5 existing tasks, but also introduces 6 new tasks selected from real-world scenarios (i.e., Search, Ads and News), which gives XGLUE more practical value. XTREME (Hu et al., 2020) is a concurrent work of XGLUE. Compared to it, XGLUE includes both understanding and generation tasks, which, to the best of our knowledge, is the first attempt of its kind in cross-lingual dataset construction efforts.
Pre-trained Model BART (Lewis et al., 2019a) and ProphetNet (Yan et al., 2020) are two recent generative pre-trained models. We borrow ideas from these two works and extend Unicoder to cross-lingual generation tasks, which goes a step further in verifying and exploring different text generation approaches in cross-lingual scenarios.

Experimental Settings
Understanding Tasks The hyper-parameters are set as follows: 768 hidden units, 12 heads, GELU activation, a dropout rate of 0.1, a maximum input length of 512, and 12 encoder layers.
In the pre-training stage, we first initialize Unicoder LC with XLM-R base, and then continue pre-training with an accumulated batch size of 8,192 using gradient accumulation. We use the Adam optimizer with a linear warm-up (Vaswani et al., 2017) and set the learning rate to 3e-5. We select different pre-training tasks randomly in different batches.
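For illustration, a continued pre-training loop with this kind of gradient accumulation might look as follows; this is a minimal sketch assuming a HuggingFace-style model output with a .loss field, and all names are placeholders.

    def pretrain_with_accumulation(model, loader, optimizer, scheduler,
                                   effective_batch=8192, micro_batch=32):
        """Reach a large effective batch by accumulating gradients over
        many micro-batches (here 8192 / 32 = 256) before each update."""
        accum_steps = effective_batch // micro_batch
        optimizer.zero_grad()
        for step, batch in enumerate(loader):
            # Scale the loss so the accumulated sum matches one big batch.
            loss = model(**batch).loss / accum_steps
            loss.backward()
            if (step + 1) % accum_steps == 0:
                optimizer.step()       # Adam update
                scheduler.step()       # linear warm-up schedule
                optimizer.zero_grad()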
In the fine-tuning stage, the batch size is set to 32. We use the Adam optimizer (Kingma and Ba, 2014) with warm-up and set the learning rate to 5e-6. For all understanding tasks, we fine-tune Unicoder models for 10 epochs, with two exceptions: for POS Tagging we set the learning rate to 2e-5; for MLQA we set the learning rate to 3e-5, the batch size to 12, and train for 2 epochs.
Table 4: Unicoder xDAE SC and Unicoder xFNP SC are pre-trained by xDAE (for 15 languages) and xFNP (for 100 languages), respectively. For the results of M-BERT/XLM on generation tasks, we initialize the encoder-decoder model with M-BERT/XLM and fine-tune it on each downstream task without pre-training. All models are 12-layer base ones. Given a task, each pre-trained model is fine-tuned using its English training set only, and then applied to all test sets in different languages. AVG-U and AVG-G denote the average of the per-task average scores over the 9 understanding tasks and the 2 generation tasks, respectively. Due to GPU limitations and time costs, Unicoder LC is pre-trained using only 10% of the large corpus.
Generation Tasks For Unicoder xDAE SC, the hyper-parameters are set as follows: 1,024 hidden units, 8 heads, GELU activation, a dropout rate of 0.1, a maximum input length of 512, 12 encoder layers, and 6 decoder layers.
In the pre-training stage, we first initialize the encoder and decoder with XLM (Conneau and Lample, 2019), and then continue pre-training with an accumulated batch size of 1,024 using gradient accumulation. We use the Adam optimizer with a linear warm-up and set the learning rate to 1e-4.
In the fine-tuning stage, the batch size is set to 32. We use the Adam optimizer (Kingma and Ba, 2014) and set the learning rate to 5e-6.
For Unicoder xFNP SC, the hyper-parameters are set as follows: 1,024 hidden units, 12 encoder layers, 12 decoder layers, a maximum input length of 512, and a feed-forward filter size of 4,096.
In the pre-training stage, we pre-train the model from scratch, and follow ProphetNet (Yan et al., 2020) to randomly mask a continuous span (with a fixed length of 9) in every 64 tokens, so that about 15% of the tokens in the original sequence are masked. We replace 80% of the masked tokens with a special symbol [MASK], keep 10% unchanged, and replace the remaining 10% with random tokens. We set the batch size to 1,024 and the number of training steps to 120,000. The learning rate is set to 1e-4, and the number of future tokens n to 2.
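A minimal sketch of this fixed-span masking, assuming spans are placed uniformly at random within each 64-token block (the placement rule is our assumption).

    import numpy as np

    def xfnp_mask(tokens, vocab, block=64, span_len=9,
                  mask_token="[MASK]", rng=None):
        """Mask one contiguous span of length 9 in every 64-token block
        (9/64 ~ 15% of tokens); masked positions follow the 80/10/10 rule."""
        rng = rng or np.random.default_rng()
        out, spans = list(tokens), []
        for start in range(0, len(tokens), block):
            hi = min(start + block, len(tokens)) - span_len
            if hi < start:     # block too short to hold a full span
                continue
            s = int(rng.integers(start, hi + 1))
            spans.append(tokens[s : s + span_len])   # gold span -> decoder target Y
            for i in range(s, s + span_len):
                r = rng.random()
                if r < 0.8:
                    out[i] = mask_token
                elif r < 0.9:
                    out[i] = vocab[int(rng.integers(len(vocab)))]
                # else: keep the original token unchanged
        return out, spans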
In the fine-tuning stage, we use the Adam optimizer (Kingma and Ba, 2014) and set the learning rate to 1e-4. We set the batch size to 64 and the warm-up steps to 1,000.

Main Result
Seven cross-lingual pre-trained models are evaluated and compared in Table 4: M-BERT, XLM, XLM-R base, Unicoder SC, Unicoder LC, Unicoder xDAE SC and Unicoder xFNP SC.
As XLM-R base (resp. Unicoder LC) performs consistently better than XLM (resp. Unicoder SC), we discard the results of XLM and Unicoder SC on all tasks except XNLI. Given a downstream task, each pre-trained model is fine-tuned using its English training set, and then applied to all test sets in different languages. Table 4 shows that: (1) Unicoder LC performs better than M-BERT and XLM-R base on almost all tasks, as it leverages a bilingual corpus in pre-training. (2) Unicoder LC performs better than Unicoder SC, which shows that a larger corpus leads to better models. (3) Unicoder xDAE SC and Unicoder xFNP SC perform better than M-BERT and XLM on generation tasks, as they include generation tasks in pre-training while M-BERT and XLM do not; XLM is unable to generate output in the correct language on the French and German QG test sets at all. (4) Unicoder xFNP SC performs worse than Unicoder xDAE SC, as the former is pre-trained for 100 languages while the latter is pre-trained for 15 languages only. We will add the results of Unicoder xDAE SC pre-trained for 100 languages, and the results of combining xDAE and xFNP into a unified pre-trained model, in an updated version.

Pivot-language Fine-tuning
We define pivot-language fine-tuning as follows: (1) fine-tune the pre-trained model for a downstream task using its labeled data in a pivot language (e.g. English); (2) apply the resulting fine-tuned model to all languages.
Table 7: Impact of multi-language fine-tuning on XNLI. pl and ml denote pivot-language fine-tuning (using English as the pivot) and multi-language fine-tuning, respectively. XLM-R base (ml) denotes the multi-language fine-tuning results based on XLM-R base.
Table 8: Impact of multi-language fine-tuning on NTG. pl and ml denote pivot-language fine-tuning (using English as the pivot) and multi-language fine-tuning, respectively. BLEU-4 is the metric.
Table 9: Impact of multi-task fine-tuning on XNLI, PAWS-X, NC, QAM and QADSM. pl and mt denote pivot-language fine-tuning (using English as the pivot) on each task and multi-task fine-tuning, respectively.
In the experiments above, we use English as the pivot language, as all tasks in XGLUE have labeled data in English. But is English the optimal choice? Would the results become better if we fine-tuned using other pivot languages?
To answer this question, we investigate the impact of using different pivot languages when fine-tuning on XNLI and NTG, and list the results for these 2 tasks in Table 5 and Table 6, respectively. Table 5 and Table 6 show that: (1) For each test set, the best result is often achieved when the pre-trained model is fine-tuned on the training set in the same language.
(2) For XNLI, the best pivot languages are Spanish (es), Greek (el) and Turkish (tr), rather than English (en), while for NTG, the best pivot language is English (en). This phenomenon shows the possibility of further improving the average performance of a cross-lingual pre-trained model on different downstream tasks by selecting different pivot languages in fine-tuning. We leave the exploration of pivot languages for future work.

Multi-language Fine-tuning
We investigate the impact of multi-language fine-tuning, which fine-tunes the pre-trained model for a downstream task using the available labeled data from different languages. We report results on the XNLI and NTG tasks, due to the availability of labeled data in multiple languages. Table 7 and Table 8 show that multi-language fine-tuning achieves better results than pivot-language fine-tuning on both XNLI and NTG. This means we can quickly improve the average performance of a cross-lingual pre-trained model on a specific task over multiple languages, based on the merged labeled data in these languages.
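Operationally, the two settings differ only in which labeled data feeds the fine-tuning loop. A minimal PyTorch sketch, where datasets maps a language code to that language's labeled Dataset for one task (names are ours).

    from torch.utils.data import ConcatDataset, DataLoader

    def build_loaders(datasets, batch_size=32):
        # Pivot-language fine-tuning: one language's labeled data only
        # (English in the experiments above).
        pivot_loader = DataLoader(datasets["en"], batch_size=batch_size,
                                  shuffle=True)
        # Multi-language fine-tuning: merge the labeled data from all
        # available languages into a single training set.
        merged = ConcatDataset(list(datasets.values()))
        multi_loader = DataLoader(merged, batch_size=batch_size, shuffle=True)
        return pivot_loader, multi_loader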

Multi-task Fine-tuning
We investigate the impact of multi-task fine-tuning on XGLUE. To reduce the experimental cost, we perform this experiment on 5 understanding tasks only: XNLI, PAWS-X, NC, QAM and QADSM. We first fine-tune on the merged English training sets of these 5 tasks, and then evaluate the fine-tuned model on their test sets. Evaluation results are listed in Table 9, which shows that PAWS-X, NC and QAM benefit from joint fine-tuning, while the results on XNLI and QADSM decrease. We leave discovering the relationships between different tasks for better pre-training and fine-tuning to future work.

Impacts of Noising Strategies
We investigate the impact of the different noising strategies (Section 4.1) in Unicoder xDAE SC, and list comparison results in Table 11, where (1)+(2)+(3) denotes using the first three strategies in pre-training, (4) denotes using the last strategy only, and (1)+(2)+(3)+(4) denotes using all strategies. We can see that (4) achieves the best average result on NTG, so all results of Unicoder xDAE SC reported in this paper are based on pre-training with (4) only.

Table 11: NTG results (BLEU-4) of Unicoder xDAE SC pre-trained with different noising strategies.
Strategies | en | fr | de | es | ru | AVG
(1)+(2)+(3) | 16.7 | 10.6 | 10.4 | 9.2 | 7.4 | 10.9
(4) | 16.7 | 11.1 | 10.6 | 9.1 | 7.5 | 11.0
(1)+(2)+(3)+(4) | 17.0 | 10.4 | 10.0 | 9.5 | 7.7 | 10.9

Table 10: Examples of news titles generated by Unicoder xDAE SC on the NTG test sets.
en Input news: if you 're planning a trip to europe , you probably want to check some famous landmarks off your list . but there are certain tourist traps you 're better off missing . susana victoria perez has more .
en Golden title: do yourself a favor and avoid these tourist traps in europe
en Generated title: tourist traps you should avoid in europe
fr Input news: alain juppe , candidat a la primaire de la droite , " ne se sent pas engage " par les investitures decidees par le parti les republicains preside par nicolas sarkozy , a affirme jeudi a l' afp son directeur de campagne , gilles boyer . " c' est un processus mene a la hussarde . il n' y a pas de volonte d' equilibre et de rassemblement " , a-t-il denonce , en affirmant que " l' accord politique " entre les differents candidats a la primaire " n' a pas ete respecte " .
fr Golden title: legislatives : juppe " ne se sent pas engage " par les investitures
fr Generated title: alain juppe : " ne se sent pas engage " par les investitures
de Input news: vermutlich zur verteidigung seines reviers hat ein aggressiver bussard in baden-wurttemberg einen radfahrer zu fall gebracht , der sich dabei schwer verletzte . wie die polizei in ludwigsburg am freitag mitteilte , attackierte der greifvogel den 51-jahrigen am vortag auf einem radweg entlang einer landesstraße . der bussard flog demnach so tief auf den radler zu , dass dieser ausweichen musste und sturzte . den angaben zufolge erlitt der mann schwere verletzungen und wurde von rettungskraften in ein krankenhaus gebracht . " aus luftiger hohe , von einem laternenmast aus , beobachtete der raubvogel anschließend die unfallaufnahme " , hieß es im polizeibericht .
de Golden title: aggressiver bussard bringt radfahrer zu fall
de Generated title: aggressiver bussard in ludwigsburg sturzes radler
es Input news: despues de la marcha de bruce willis por problemas de agenda , steve carrell le sustituira asi en la nueva pelicula que prepara woody allen . segun informa variety , el actor se une al reparto ya formado por blake lively , parker posey , kristen stewart , jesse eisenberg , jeannie berlin , corey stoll , anna camp , y ken stott , entre otros . como siempre , los detalles de la trama son aun un secreto aunque el rodaje se encuentre actualmente en marcha . por otro lado , aun no hay fecha de estreno ni distribuidora para la pelicula sin titulo de woody allen . sin embargo , el director tiene aun pendiente de estreno su ultimo filme con emma stone y joaquin phoenix titulada irrational man que se estrenara el proximo 25 de septiembre .
es Golden title: steve carrell sustituye a bruce willis en la nueva pelicula de woody allen
es Generated title: steve carrell sustituira a steve carrell en woody allen

Table 12: Comparison between Unicoder xDAE SC and XNLG; both models are fine-tuned using English labeled data. ROUGE-L is the metric.
We can see that, by using xDAE alone in pre-training, Unicoder xDAE SC outperforms XNLG significantly, even though XNLG is pre-trained using 4 tasks (MLM, DAE, XMLM and XAE). This verifies the effectiveness of noising strategy (4) described in Section 4.1 for generation tasks.

Updates in the Next Version
We will add 3 updates in the next version: (1) the results of a 24-layer Unicoder on understanding tasks; (2) the results of a 12-layer Unicoder on understanding tasks, pre-trained with new tasks beyond MLM and TLM; (3) comparison results of Unicoder xDAE SC and Unicoder xFNP SC on generation tasks, both pre-trained on the small corpus for 100 languages.