Cross-Lingual Machine Reading Comprehension

Though the community has made great progress on the Machine Reading Comprehension (MRC) task, most previous work addresses English-based MRC problems, and there have been few efforts on other languages, mainly due to the lack of large-scale training data. In this paper, we propose the Cross-Lingual Machine Reading Comprehension (CLMRC) task for languages other than English. First, we present several back-translation approaches for the CLMRC task, which are straightforward to adopt. However, exactly aligning the translated answer back into the target language is difficult and can introduce additional noise. In this context, we propose a novel model called Dual BERT, which takes advantage of the large-scale training data provided by a rich-resource language (such as English), learns the semantic relations between the passage and question in a bilingual context, and then utilizes the learned knowledge to improve reading comprehension performance in the low-resource language. We conduct experiments on two Chinese machine reading comprehension datasets, CMRC 2018 and DRCD. The results show consistent and significant improvements over various state-of-the-art systems by a large margin, which demonstrates the potential of the CLMRC task. Resources available: https://github.com/ymcui/Cross-Lingual-MRC


Introduction
Machine Reading Comprehension (MRC) has been a popular task for testing the reading ability of machines, which requires reading text material and answering questions based on it. Starting from cloze-style reading comprehension, various neural network approaches have been proposed, and massive progress has been made in creating large-scale datasets and neural models (Hermann et al., 2015; Hill et al., 2015; Kadlec et al., 2016; Cui et al., 2017; Rajpurkar et al., 2016; Dhingra et al., 2017). Though various types of contributions have been made, most work deals with English reading comprehension. Reading comprehension in languages other than English has not been well addressed, mainly due to the lack of large-scale training data.
To enrich the training data, there are two traditional approaches. First, we can have human experts annotate data, which yields ideal, high-quality data but is time-consuming and rather expensive. One can also obtain large-scale automatically generated data (Hermann et al., 2015; Hill et al., 2015; Liu et al., 2017), but its quality is far below the usable threshold. Another way is to exploit cross-lingual approaches that utilize the data in a rich-resource language to implicitly learn the relations among <passage, question, answer>.
In this paper, we propose the Cross-Lingual Machine Reading Comprehension (CLMRC) task, which aims to help reading comprehension in low-resource languages. First, we present several back-translation approaches for when there are no or only partially available resources in the target language. Then we propose a novel model called Dual BERT to further improve system performance when training data is available in the target language. We first translate the target-language training data into English to form pseudo bilingual parallel data. Then we use multilingual BERT (Devlin et al., 2019) to simultaneously model the <passage, question, answer> in both languages and fuse the two representations to generate final predictions. Experimental results on two Chinese reading comprehension datasets, CMRC 2018 (Cui et al., 2019) and DRCD (Shao et al., 2018), show that utilizing English resources substantially improves system performance, and that the proposed Dual BERT achieves state-of-the-art performance on both datasets and even surpasses human performance on some metrics. We also conduct experiments on the Japanese and French SQuAD (Asai et al., 2018) and achieve substantial improvements. Moreover, detailed ablations and analyses are carried out to demonstrate the effectiveness of exploiting knowledge from a rich-resource language. To the best of our knowledge, this is the first time that cross-lingual approaches have been applied and evaluated on realistic reading comprehension data. The main contributions of our paper can be summarized as follows.
• We present several back-translation based reading comprehension approaches that yield state-of-the-art performance on several reading comprehension datasets, including Chinese, Japanese, and French.
• We propose a model called Dual BERT to simultaneously model the <passage, question> in both the source and target languages to enrich the text representations.
• Experimental results on two public Chinese reading comprehension datasets show that the proposed cross-lingual approaches yield significant improvements over various baseline systems and set new state-of-the-art performances.

Related Works
Machine Reading Comprehension (MRC) has been a trending research topic in recent years. Among various types of MRC tasks, span-extraction reading comprehension, such as SQuAD (Rajpurkar et al., 2016), has been enormously popular, and we have seen great progress on related neural network approaches (Wang and Jiang, 2016; Seo et al., 2016; Xiong et al., 2016; Cui et al., 2017; Hu et al., 2019), especially those built on pre-trained language models such as BERT (Devlin et al., 2019). While massive achievements have been made by the community, reading comprehension in languages other than English has not been well studied, mainly due to the lack of large-scale training data. Asai et al. (2018) proposed to use runtime machine translation for multilingual extractive reading comprehension. They first translate the data from the target language to English and then obtain an answer using an English reading comprehension model. Finally, they recover the corresponding answer in the original language using soft-alignment attention scores from the NMT model. However, though this is an interesting attempt, the zero-shot results are quite low, and alignments between different languages, especially those with different word orders, differ significantly. Also, they only evaluate on a rather small dataset (hundreds of samples) translated from SQuAD (Rajpurkar et al., 2016), which is not that realistic.
To solve the issues above and better exploit large-scale rich-resource reading comprehension data, in this paper we propose several zero-shot approaches that yield state-of-the-art performance on Japanese and French SQuAD data. Moreover, we also propose a supervised approach for the condition in which training samples are available for the target language. To evaluate the effectiveness of our approach, we carried out experiments on two realistic public Chinese reading comprehension datasets: CMRC 2018 (simplified Chinese) (Cui et al., 2019) and DRCD (traditional Chinese) (Shao et al., 2018). Experimental results demonstrate the effectiveness of modeling training samples in a bilingual environment.

Back-Translation Approaches
In this section, we illustrate back-translation approaches for cross-lingual machine reading comprehension, which are natural and easy to implement. Before introducing these approaches in detail, we clarify the crucial terminology used in this paper.
• Source Language: Rich-resource language with sufficient large-scale training data that we aim to extract knowledge from. We use subscript S for variables in the source language.
• Target Language: Low-resource language with only a small amount of training data that we wish to optimize on. We use subscript T for variables in the target language.
In this paper, we aim to improve machine reading comprehension performance in Chinese (target language) by introducing English (source language) resources. The general idea of back-translation approaches is to translate the <passage, question> pair into the source language and generate an answer using a reading comprehension system in the source language. Finally, the generated answer is back-translated into the target language. In the following subsections, we introduce several back-translation approaches for the cross-lingual machine reading comprehension task. The architectures of the proposed back-translation approaches are depicted in Figure 1.

GNMT
To build a simple cross-lingual machine reading comprehension system, it is straightforward to use a translation system to bridge the source and target languages (Asai et al., 2018). Briefly, we first translate the target sample into the source language. Then we use a source reading comprehension system, such as BERT (Devlin et al., 2019), to generate an answer in the source language. Finally, we use back-translation to get the answer in the target language. As we do not exploit any training data in the target language, we can regard this approach as a zero-shot cross-lingual baseline system. Specifically, we use the Google Neural Machine Translation (GNMT) system for source-to-target and target-to-source translations. One may also use a more advanced or domain-specific neural machine translation system to achieve better translation performance, but this is beyond the scope of this paper.
However, for the span-extraction reading comprehension task, a major drawback of this approach is that the translated answer may not be an exact span in the target passage. To remedy this, we propose three simple approaches to improve the quality of the translated answer in the target language.
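The zero-shot pipeline described in this subsection can be sketched in a few lines. This is a minimal illustration only: `translate_to_source`, `translate_to_target`, and `source_reader` are hypothetical callables standing in for the translation system and the English reading comprehension model, and are not part of any real API.

```python
from typing import Callable


def back_translation_qa(
    passage_t: str,
    question_t: str,
    translate_to_source: Callable[[str], str],  # hypothetical: target -> source MT
    translate_to_target: Callable[[str], str],  # hypothetical: source -> target MT
    source_reader: Callable[[str, str], str],   # hypothetical: (passage, question) -> answer
) -> str:
    """Answer a target-language question via a source-language reader."""
    # 1) Translate the target <passage, question> pair into the source language.
    passage_s = translate_to_source(passage_t)
    question_s = translate_to_source(question_t)
    # 2) Generate an answer with the source-language reading comprehension system.
    answer_s = source_reader(passage_s, question_s)
    # 3) Back-translate the answer into the target language.
    return translate_to_target(answer_s)
```

Note that the returned string is a free translation, so nothing guarantees that it occurs verbatim in the target passage, which is the drawback the following approaches address.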

Simple Match
We propose a simple approach to align the translated answer to an exact span in the target passage. We calculate the character-level text overlap (for Chinese) between the translated answer $A_{trans}$ and an arbitrary sliding window $P_{T[i:j]}$ in the target passage. The length of the sliding window ranges over $len(A_{trans}) \pm \delta$, with a relax parameter $\delta$. Typically, $\delta \in [0, 5]$, as the ground truth and the translated answer do not differ much in length. In this way, we calculate the character-level F1-score between each candidate span $P_{T[i:j]}$ and the translated answer $A_{trans}$, and choose the best-matching one. Using SimpleMatch ensures that the predicted answer is an exact span in the target passage. As SimpleMatch does not use target training data either, it can also serve as a pipeline component in zero-shot settings.
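The procedure above can be sketched as follows. This is an illustrative version under stated assumptions: the F1-score is computed over bags of characters (as in SQuAD-style evaluation), which may differ from the exact overlap measure used in the experiments, and ties are broken in favor of the first candidate found. The function names are ours.

```python
from collections import Counter


def char_f1(candidate: str, reference: str) -> float:
    """Character-level F1 using multiset (bag-of-characters) overlap."""
    if not candidate or not reference:
        return 0.0
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)


def simple_match(passage: str, translated_answer: str, delta: int = 5) -> str:
    """Return the passage span whose character-level F1 with the answer is highest."""
    n = len(translated_answer)
    best_span, best_score = "", -1.0
    # Window lengths range over len(A_trans) - delta .. len(A_trans) + delta.
    for length in range(max(1, n - delta), n + delta + 1):
        for start in range(0, len(passage) - length + 1):
            span = passage[start:start + length]
            score = char_f1(span, translated_answer)
            if score > best_score:  # strict '>' keeps the first best candidate
                best_span, best_score = span, score
    return best_span
```

The search is quadratic in the worst case but cheap in practice, since only $2\delta + 1$ window lengths are scanned per passage.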

Answer Aligner
Though we could use unsupervised approaches for aligning the answer, such as the proposed SimpleMatch, it stops at the token level and lacks semantic awareness of the relation between the translated answer and the ground truth answer. In this paper, we also propose two supervised approaches for further improving the answer span when training data is available in the target language.
The first one is Answer Aligner, where we feed the translated answer $A_{trans}$ and the target passage $P_T$ into BERT and train it to output the ground truth answer span $A_T$. The model learns the semantic relations between them and generates an improved span for the target language.

Answer Verifier
In Answer Aligner, we do not exploit the question information in the target training data. One can also utilize question information to transform Answer Aligner into Answer Verifier, which uses the complete $\langle P_T, Q_T, A_T \rangle$ in the target language together with the translated answer $A_{trans}$ to verify its correctness and generate an improved span.

Dual BERT
One disadvantage of the back-translation approaches is that we have to recover the source answer into the target language. To remedy this issue, we propose a novel model called Dual BERT that simultaneously models the training data in both the source and target languages to better exploit the relations among <passage, question, answer>. The model can be used when training data is available for the target language, and it allows us to better utilize source language data to enhance the target reading comprehension system. The overall neural architecture of Dual BERT is shown in Figure 2.

Dual Encoder
Bidirectional Encoder Representation from Transformers (BERT) has shown strong performance on various NLP tasks, substantially outperforming non-pretrained models (Devlin et al., 2019). In this paper, we use multi-lingual BERT to better encode the text in both the source and target languages. Formally, given the target passage $P_T$ and question $Q_T$, we organize the input $X_T$ for BERT as follows.
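A minimal sketch of this input organization is given below, assuming the standard BERT convention for span extraction, i.e., `[CLS] question [SEP] passage [SEP]` with segment id 0 for the question part and 1 for the passage part. The function name `pack_input` is ours, and truncating an over-long passage is a simplification (a sliding window over the passage is a common alternative).

```python
from typing import List, Tuple


def pack_input(
    question_tokens: List[str],
    passage_tokens: List[str],
    max_len: int = 512,
) -> Tuple[List[str], List[int]]:
    """Build the token sequence and segment ids for one <passage, question> pair."""
    # Three positions are reserved for [CLS] and the two [SEP] tokens.
    budget = max_len - len(question_tokens) - 3
    if budget < 0:
        raise ValueError("question is too long for the given max_len")
    passage_tokens = passage_tokens[:budget]  # simplification: truncate the passage
    tokens = ["[CLS]"] + question_tokens + ["[SEP]"] + passage_tokens + ["[SEP]"]
    segment_ids = [0] * (len(question_tokens) + 2) + [1] * (len(passage_tokens) + 1)
    return tokens, segment_ids
```

The same routine applies to the translated sample, yielding the source-side input.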
Similarly, we can obtain the source training sample by translating the target sample with GNMT, forming the input $X_S$ for BERT. Then we feed $X_T$ and $X_S$ through a shared multi-lingual BERT to obtain deep contextualized representations $B_T \in \mathbb{R}^{L_T \times h}$ and $B_S \in \mathbb{R}^{L_S \times h}$, where $L$ represents the length of the input and $h$ is the hidden size (768 for multi-lingual BERT).

Bilingual Decoder
Typically, in the reading comprehension task, an attention mechanism is used to measure the relations between the passage and the question. Moreover, as Transformers are the fundamental components of BERT, the multi-head self-attention layer (Vaswani et al., 2017) is used to extract useful information within the input sequence.
Specifically, in our model, to enhance the target representation, we use a multi-head self-attention layer to extract useful information from the source BERT representation $B_S$. We aim to generate the target span by relying not only on the target representation but also on the source representation, so as to simultaneously consider the <passage, question> relations in both languages, which can be seen as a bilingual decoding process.
Briefly, we regard the target BERT representation $B_T$ as the query and the source BERT representation $B_S$ as the key and value in the multi-head attention mechanism. In the original multi-head attention, we calculate a raw dot attention as follows.
$$A_{TS} = B_T \cdot B_S^{\top} \quad (1)$$
This results in an attention matrix $A_{TS}$ that indicates the raw relations between each source and target token.
To combine the benefits of both inter-attention and self-attention, instead of using Equation 1 directly, we propose a simple modification of the multi-head attention mechanism, called Self-Adaptive Attention (SAA). First, we calculate the self-attention of $B_T$ and $B_S$ and apply the softmax function, as shown in Equations 2 and 3.
$$A_T = \mathrm{softmax}(B_T \cdot B_T^{\top}) \quad (2)$$
$$A_S = \mathrm{softmax}(B_S \cdot B_S^{\top}) \quad (3)$$
This is designed to use self-attention to first filter out the irrelevant part within each representation and to inform the raw dot attention to pay more attention to the self-attended part, making the attention more precise.
Then we use the self-attentions $A_T$ and $A_S$ and the inter-attention $A_{TS}$ to get the self-attentive attention $\tilde{A}_{TS}$. We calculate the dot product between $\tilde{A}_{TS}$ and $B_S$ to obtain the attended representation $R$. After obtaining the attended representation $R$, we use an additional fully connected layer with residual layer normalization, similar to the BERT implementation.
Finally, we calculate a weighted sum of $H_T$ to get the final span predictions $P_T^s$ and $P_T^e$ (superscript $s$ for start, $e$ for end). For example, the start position $P_T^s$ is calculated by the following equation.
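The Self-Adaptive Attention computation can be illustrated with a small single-head, pure-Python sketch. Two steps here are assumptions rather than something stated in the text: the self-attentive attention is formed as $A_T \cdot A_{TS} \cdot A_S^{\top}$, and a row-wise softmax is applied to it before multiplying with $B_S$. The subsequent fully connected layer and residual layer normalization are omitted.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of two row-major matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m: Matrix) -> Matrix:
    return [list(col) for col in zip(*m)]


def softmax_rows(m: Matrix) -> Matrix:
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        mx = max(row)
        exps = [math.exp(v - mx) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def self_adaptive_attention(b_t: Matrix, b_s: Matrix):
    """b_t: (L_T, h) target representation, b_s: (L_S, h) source representation."""
    a_ts = matmul(b_t, transpose(b_s))                # raw inter-attention, (L_T, L_S)
    a_t = softmax_rows(matmul(b_t, transpose(b_t)))   # target self-attention, (L_T, L_T)
    a_s = softmax_rows(matmul(b_s, transpose(b_s)))   # source self-attention, (L_S, L_S)
    # ASSUMPTION: self-attentive attention combines the three matrices as below.
    a_tilde = matmul(matmul(a_t, a_ts), transpose(a_s))  # (L_T, L_S)
    # ASSUMPTION: row-wise softmax before attending over the source representation.
    r = matmul(softmax_rows(a_tilde), b_s)            # attended representation, (L_T, h)
    return a_tilde, r
```

Whatever the exact combination, the shapes constrain the design: the result must stay $L_T \times L_S$ so that each target token attends over all source tokens.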
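We calculate the standard cross-entropy loss for the start and end predictions in the target language, averaged over the $N$ training samples.
$$\mathcal{L}_T = -\frac{1}{N} \sum_{i=1}^{N} \left( y_T^s \log(P_T^s) + y_T^e \log(P_T^e) \right) \quad (9)$$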
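For a single sample with one-hot labels, the loss above reduces to the negative log-probability of the gold start and end positions. The sketch below illustrates this, assuming the start and end distributions are obtained by a softmax over per-position logits; the function names are ours.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax over passage positions."""
    mx = max(logits)
    log_z = mx + math.log(sum(math.exp(v - mx) for v in logits))
    return [v - log_z for v in logits]


def span_loss(
    start_logits: List[float],
    end_logits: List[float],
    start_index: int,
    end_index: int,
) -> float:
    """Cross-entropy for one sample: -(log P^s[gold start] + log P^e[gold end])."""
    log_p_start = log_softmax(start_logits)
    log_p_end = log_softmax(end_logits)
    return -(log_p_start[start_index] + log_p_end[end_index])
```

Averaging `span_loss` over a batch gives the batch-level estimate of the training objective.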

Auxiliary Output
In order to evaluate how the translated sample behaves in the source language system, we also generate a span prediction for the source language using the BERT representation $B_S$ directly, without further calculation, resulting in the start and end predictions $P_S^s$ and $P_S^e$ (similar to Equation 8). Moreover, we also calculate the cross-entropy loss $\mathcal{L}_{aux}$ for the translated sample (similar to Equation 9), where a parameter $\lambda$ is applied to this loss.
Instead of setting $\lambda$ to a heuristic value, we propose a novel approach to adjust $\lambda$ automatically. As the sample is generated by the machine translation system, there is information loss during the translation process. Wrong or partially translated samples may harm the performance of the reading comprehension system. To measure how closely the translated samples resemble the real target samples, we calculate the cosine similarity between the ground truth span representations in the source and target languages (denoted as $H_S$ and $H_T$). When the ground truth span representation in the translated sample is similar to that of the real target sample, $\lambda$ increases; otherwise, $\lambda$ may decrease to zero, and we only use the target span loss.
The span representation is the concatenation of three parts: the BERT representation of the ground truth start $B^s \in \mathbb{R}^h$, the ground truth end $B^e \in \mathbb{R}^h$, and the self-attended span $B^{att} \in \mathbb{R}^h$, so that it considers both boundary information (start/end) and a mixed representation of the whole ground truth span. We use the BERT representation $B$ to get the self-attended span representation $B^{att}$ using a simple dot product with average pooling, to get a 2D-tensor.
$$\lambda = \max\{0, \cos\langle H_S, H_T \rangle\}$$
The overall loss for Dual BERT is composed of two parts: the target span loss $\mathcal{L}_T$ and the auxiliary span loss in the source language $\mathcal{L}_{aux}$.
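The dynamic weighting can be sketched as below. The clipping of the cosine similarity at zero follows the formula for $\lambda$; the combination $\mathcal{L} = \mathcal{L}_T + \lambda \mathcal{L}_{aux}$ is our reading of the description (with $\lambda$ applied to the auxiliary loss) and should be treated as an assumption. Span representations are passed in as plain vectors, and the function names are ours.

```python
import math
from typing import Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def aux_weight(h_s: Sequence[float], h_t: Sequence[float]) -> float:
    """lambda = max(0, cos<H_S, H_T>)."""
    return max(0.0, cosine(h_s, h_t))


def total_loss(loss_t: float, loss_aux: float,
               h_s: Sequence[float], h_t: Sequence[float]) -> float:
    """ASSUMPTION: overall loss = target loss + lambda * auxiliary loss."""
    return loss_t + aux_weight(h_s, h_t) * loss_aux
```

With this weighting, a translated sample whose gold span representation points away from the target one contributes nothing to the update, so poor translations are down-weighted rather than filtered by a fixed threshold.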

Experimental Setups
We evaluate our approaches on two public Chinese span-extraction machine reading comprehension datasets: CMRC 2018 (simplified Chinese) (Cui et al., 2019) and DRCD (traditional Chinese) (Shao et al., 2018). Note that, since the test and challenge sets are held out by the CMRC 2018 organizers to ensure the integrity of the evaluation process, we submitted our best-performing systems to the organizers to obtain these scores. The source-language resource is the SQuAD (Rajpurkar et al., 2016) training data. The settings of the proposed approaches are listed below in detail.
• Tokenization: Following the official BERT implementation, we use WordPiece tokenizer (Wu et al., 2016) for English and character-level tokenizer for Chinese.
• BERT: For back-translation approaches, we use English BERT pre-trained on SQuAD 1.1 (Rajpurkar et al., 2016) for initialization, denoted as SQ-$B_{en}$ (base) and SQ-$L_{en}$ (large). In other conditions, we use multi-lingual BERT by default, denoted as $B_{mul}$ (and SQ-$B_{mul}$ for models pre-trained on SQuAD).
• Translation: We use the Google Neural Machine Translation (GNMT) system for translation. We evaluated the GNMT system on the NIST MT02/03/04/05/06/08 Chinese-English sets and achieved an average BLEU score of 43.24, compared to the previous best work (43.20) (Cheng et al., 2018), yielding state-of-the-art performance.
• Optimization: Following the original BERT implementation, we use the ADAM optimizer with weight decay (Kingma and Ba, 2014) with an initial learning rate of 4e-5, and we use a cosine learning rate decay scheme instead of the original linear decay, which we found beneficial for stabilizing results. The training batch size is set to 64, and each model is trained for 2 epochs, which takes roughly 1 hour.
• Implementation: We modified the TensorFlow (Abadi et al., 2016) version of run_squad.py provided by BERT. All models are trained on a Cloud TPU v2 with 64GB HBM.
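As an illustration of the cosine learning rate decay mentioned in the optimization settings, the sketch below uses the common half-cosine form that decays from the initial rate to zero over training. The exact schedule (final learning rate, any warmup phase) is not specified here, so these details are assumptions.

```python
import math


def cosine_lr(step: int, total_steps: int, base_lr: float = 4e-5) -> float:
    """Half-cosine decay from base_lr at step 0 to 0 at total_steps (no warmup)."""
    # Clamp the step so that the schedule stays at 0 after training ends.
    progress = min(max(step, 0), total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

Compared with linear decay, this schedule keeps the learning rate higher early on and flattens out near the end of training.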

Overall Results
The overall results are shown in Table 2. As we can see, without any alignment approach, the zero-shot results are quite low regardless of whether we use English BERT-base (#1) or BERT-large (#2). When we apply SimpleMatch (#3), we observe significant improvements, demonstrating its effectiveness. The Answer Aligner (#4) further improves performance beyond the SimpleMatch approach, demonstrating that the machine learning approach can dynamically adjust the span output by learning the semantic relationship between the translated answer and the target passage. Also, the Answer Verifier (#5) further boosts performance and surpasses the multi-lingual BERT baseline (#7), which only uses target training data, demonstrating that it is beneficial to adopt a rich-resource language to improve machine reading comprehension in other languages.
When we do not use SQuAD pre-trained weights, the proposed Dual BERT (#8) yields significant improvements (all results are verified by p-test with p < 0.05) over both Chinese BERT (#6) and multi-lingual BERT (#7) by a large margin. If we only train BERT with SQuAD (#9), which is a zero-shot system, it achieves decent performance on the two Chinese reading comprehension datasets. Moreover, we can pursue further improvements by continuing training (#10) with Chinese data starting from system #9, or by mixing Chinese data with SQuAD and training from the initial multi-lingual BERT (#11). Under powerful SQuAD pre-trained baselines, Dual BERT (#12) still gives moderate and consistent improvements over the Cascade Training (#10) and Mixed Training (#11) baselines and sets new state-of-the-art performance on both datasets, demonstrating the effectiveness of using machine-translated samples to enhance Chinese reading comprehension performance.
Table 2: Experimental results on CMRC 2018 and DRCD. † indicates unpublished works (some of the systems use the development set for training, which makes the results not directly comparable). ♠ indicates a zero-shot approach. We mark each of our systems with an ID in the first column for ease of reference.

Results on Japanese and French SQuAD
In this paper, we propose a simple but effective approach called SimpleMatch to align the translated answer to a span of the original passage. One may argue that using neural machine translation attention to project the source answer onto the original target passage span, as in Asai et al. (2018), is ideal. However, extracting attention values from a neural machine translation system and applying them to extract the original passage span is cumbersome and computationally inefficient. To demonstrate the effectiveness of SimpleMatch over NMT attention for extracting the original passage span in the zero-shot condition, we applied SimpleMatch to the Japanese and French SQuAD (304 samples each), which is exactly the data used in Asai et al. (2018). The results are listed in Table 3.
From the results, we can see that, though our baseline (GNMT+BERT$_{L_{en}}$) is higher than previous work (Back-Translation (Asai et al., 2018)), using SimpleMatch to extract the original passage span still obtains competitive or even larger improvements. On Japanese SQuAD, the F1 score improved by 9.6 points in Asai et al. (2018) using NMT attention, while we obtain a larger improvement of 11.8 points, demonstrating the effectiveness of the proposed method. BERT with pre-trained SQuAD weights yields the best performance among these systems, as it does not require the machine translation process and has unified text representations for different languages.

Ablation Studies
In this section, we ablate important components of our model to explicitly demonstrate their effectiveness. The ablation results are depicted in Table 4. As we can see, removing SQuAD pre-trained weights (i.e., using randomly initialized BERT) hurts the performance most, suggesting that it is beneficial to use pre-trained weights even though the source and target languages are different. Removing the source BERT degenerates the model to cascade training, and the results show that this also harms overall performance, demonstrating that it is beneficial to utilize translated samples to better characterize the relations among <passage, question, answer>. The other modifications also consistently decrease performance to some extent, but not as markedly as the data-related components (last two lines), indicating that data-related approaches are important in the cross-lingual machine reading comprehension task.

Discussion
In our preliminary cross-lingual experiments, we adopt English as our source language data. However, one question remains unclear.
Is it better to pre-train with larger data in a distant language (such as English, as opposed to Simplified Chinese), or with smaller data in a closer language (such as Traditional Chinese)?
To investigate the problem, we plot the multi-lingual BERT performance on the CMRC 2018 development data using different languages and data sizes in the pre-training stage. The results are depicted in Figure 3, and we make several observations.
First, when the size of the pre-training data is under 25k (the training data size of DRCD), there is not much difference between using Chinese or English data for pre-training, and the English pre-trained models are even better than the Chinese pre-trained models most of the time, which is unexpected. We suspect that, with multi-lingual BERT, the model tends to provide universal representations for the text and to learn language-independent semantic relations among the inputs, which is ideal for cross-lingual tasks; thus the model is not that sensitive to the language used in the pre-training stage. Also, as the training data of SQuAD is larger than that of DRCD, we can use more data for pre-training. When we add more SQuAD data (>25k) in the pre-training stage, the performance on the downstream task (CMRC 2018) continues to improve significantly.
In this context, we conclude that:
• When the pre-training data is not abundant, there is no special preference in the selection of the source (pre-training) language.
• If large-scale training data is available for several languages, we should select the source language with the largest training data rather than by its linguistic similarity to the target language.
Furthermore, one could also take advantage of data in various languages, rather than only in a bilingual environment, to further exploit knowledge from various sources. This is beyond the scope of this paper, and we leave it for future work.

Conclusion
In this paper, we propose the Cross-Lingual Machine Reading Comprehension (CLMRC) task. When there is no training data available for the target language, we provide several zero-shot approaches that are initially trained on English and transferred to other languages, along with three methods that improve the translated answer span using unsupervised and supervised approaches. When there is training data available for the target language, we propose a novel model called Dual BERT to simultaneously model the <passage, question, answer> in the source and target languages using multi-lingual BERT. The proposed method takes advantage of the large-scale training data of a rich-resource language (such as SQuAD) and learns the semantic relations between the passage and question in both the source and target languages. Experiments on two Chinese machine reading comprehension datasets indicate that the proposed model gives consistent and significant improvements over various state-of-the-art systems by a large margin and sets baselines for future research on the CLMRC task.
Future studies on cross-lingual machine reading comprehension will focus on 1) how to utilize various types of English reading comprehension data; 2) cross-lingual machine reading comprehension without the translation process, etc.

Figure 2: System overview of the Dual BERT model for the cross-lingual machine reading comprehension task.

Figure 3: Performance (average of EM and F1) with different amounts of pre-training data from SQuAD (English) or DRCD (Traditional Chinese).

Table 4: Ablations of Dual BERT on the CMRC 2018 development set.