Enhancing Answer Boundary Detection for Multilingual Machine Reading Comprehension

Multilingual pre-trained models can leverage training data from a rich-resource source language (such as English) to improve performance on low-resource languages. However, the transfer quality for multilingual Machine Reading Comprehension (MRC) is significantly worse than for sentence classification tasks, mainly because MRC requires detecting word-level answer boundaries. In this paper, we propose two auxiliary tasks in the fine-tuning stage to create additional phrase boundary supervision: (1) a mixed MRC task, which translates the question or passage into other languages and builds cross-lingual question-passage pairs; and (2) a language-agnostic knowledge masking task that leverages knowledge phrases mined from the Web. Extensive experiments on two cross-lingual MRC datasets show the effectiveness of our proposed approach.


Introduction
Machine Reading Comprehension (MRC) plays a critical role in assessing how well a machine can understand natural language. Among the various types of MRC tasks, span-extractive reading comprehension (as in SQuAD (Rajpurkar et al., 2016)) has become very popular. Promising achievements have been made with neural network based approaches (Seo et al., 2017; Wang et al., 2017; Xiong et al., 2018; Yu et al., 2018; Hu et al., 2017), especially those built on pre-trained language models such as BERT (Devlin et al., 2018), due to the availability of large-scale annotated corpora (Hermann et al., 2015; Rajpurkar et al., 2016; Joshi et al., 2017). However, these large-scale annotated corpora are mostly exclusive to English, while research on MRC in languages other than English (i.e., multilingual MRC) has been limited by the absence of sufficient training data.
* Work is done during internship at STCA NLP Group, Microsoft.
† Correspondence author.
Table 1: [...] (Lewis et al., 2019) is significantly larger than that of sentence-level classification tasks such as Natural Language Inference (NLI) (Conneau et al., 2018). In this experiment, we fine-tune XLM on English and directly test on other languages.
To alleviate the scarcity of training data for multilingual MRC, translation-based data augmentation approaches were proposed first. For example, (question q, passage p, answer a) in English SQuAD can be translated into (q', p', a') in other languages (Asai et al., 2018) to enrich the non-English MRC training data. However, these approaches are limited by the quality of the translators, especially for low-resource languages.
Most recently, approaches based on multilingual/cross-lingual pre-trained models (Devlin et al., 2018; Huang et al., 2019; Yang et al., 2019) have proved very effective on several cross-lingual NLU tasks. These approaches learn language-agnostic features and align language representations in vector space during the multilingual pre-training process (Castellucci et al., 2019; Keung et al., 2019; Jing et al., 2019; Cui et al., 2019). On top of these cross-lingual pre-trained models, one can conduct zero-shot learning with English data only, or few-shot learning with an additional small set of non-English data derived from either translation or human annotation. Although these methods achieve significant improvements on sentence-level multilingual tasks (such as the XNLI task (Conneau et al., 2018)), their effectiveness on phrase-level multilingual tasks is still limited. As shown in Table 1, the gap between non-English languages and English is larger for MRC than for sentence-level classification tasks. Specifically, the EM scores for non-English languages are on average more than 20 points below those for English.
For extractive MRC, the EM metric is critical since it indicates the answer boundary detection capability, i.e., the accuracy of the extracted answer spans. Table 2 shows two multilingual MRC cases with wrong boundary detection. In real scenarios, such bad extractive answers harm the user experience. One interesting finding from our case study is that the multilingual MRC model can roughly locate the correct span but still fails to predict the precise boundary (e.g., missing or adding some words in the span, as in the cases in Table 2). For example, an error analysis of XLM on MLQA (Lewis et al., 2019) showed that about 49% of errors come from answers that partially overlap with the gold span. Another finding is that a large portion (∼70% according to MLQA) of the extractive spans are language-specific phrases (a kind of broad knowledge, such as entities or n-gram noun phrases). We call such phrases knowledge phrases in the rest of the paper, and leverage them as prior knowledge in our model. Motivated by the above observations, we propose two auxiliary tasks to enhance boundary detection for multilingual MRC, especially for low-resource languages.
First, we design a cross-lingual MRC task with mixed-language ⟨question, passage⟩ pairs to better align the language representations. We then propose a knowledge phrase masking task as well as a language-agnostic method to generate per-language knowledge phrases from the Web. Extensive experiments on two multilingual MRC datasets show that our proposed tasks substantially boost model performance on answer span boundary detection. The main contributions of our paper can be summarized as follows.
• We design two novel auxiliary tasks in multi-task fine-tuning to help improve the accuracy of answer span boundary detection for multilingual MRC models.
• We propose a language-agnostic method to mine language-specific knowledge phrases from search engines. This method is lightweight and easy to scale to any language.
• We conduct extensive experiments to demonstrate the effectiveness of our proposed approach. In addition to an open benchmark dataset, we also create a new multilingual MRC dataset from a real scenario, together with fine-grained answer type labels for in-depth impact analysis.
Figure 1: Overview of enhancing answer boundary detection for multilingual machine reading comprehension. Our approach consists of three tasks: (a) Main task: the multilingual MRC model is required to read text material and answer the question based on the given context; (b) mixMRC task: a cross-lingual MRC task with mixed-language ⟨question, passage⟩ pairs; (c) LAKM task: a language-agnostic knowledge masking task that leverages language-specific knowledge mined from the Web.
Another approach to multilingual NLU extracts language-independent features to address multilingual NLU tasks. Some works (Keung et al., 2019; Jia and Liang, 2017; Chen et al., 2019) apply adversarial techniques to learn language-invariant features and achieve significant performance gains. More recently, there has been an increasing trend to design cross-lingual pre-trained models, such as multilingual BERT (Devlin et al., 2018), XLM, and Unicoder (Huang et al., 2019), which show promising results thanks to cross-lingual representations in a shared contextual space (Pires et al., 2019). In this paper, we propose two novel sub-tasks for fine-tuning cross-lingual models for MRC.

Knowledge based MRC
Prior works (Yang and Mitchell, 2017; Mihaylov and Frank, 2018; Weissenborn et al., 2017; Sun et al., 2018) mostly focus on leveraging structured knowledge from knowledge bases (KBs) to enhance MRC models following a retrieve-then-encode paradigm, i.e., relevant knowledge is first retrieved from the KB and sequence modeling methods are then used to capture complex knowledge features. However, such a paradigm often suffers from the sparseness of knowledge graphs.
Recently, some works fuse knowledge into pre-trained models to obtain knowledge-enhanced language representations. Zhang et al. (2019) use both large-scale textual corpora and knowledge graphs to train an enhanced language representation. Another line of work constructs unsupervised pre-training tasks with large-scale data and prior knowledge to help the model efficiently learn lexical, syntactic and semantic representations, and significantly outperforms BERT on MRC.
Most previous works on knowledge-based MRC are limited to English. Meanwhile, the requirement of acquiring large-scale prior knowledge (e.g., through entity linking or NER models) may be hard to meet for non-English languages. In this work, we propose a lightweight language-agnostic knowledge phrase mining approach and design a knowledge phrase masking task to boost model performance for multilingual MRC.

Approach
In this section, we first introduce the overall training procedure, and then introduce two new tasks, namely, Mixed Machine Reading Comprehension (mixMRC) and Language-agnostic Knowledge Phrase Masking (LAKM), respectively.
An overview of our training procedure is shown in Figure 1. Our approach is built on top of popular multilingual pre-trained models (such as multilingual BERT and XLM). We concatenate the passage and the (optional) question together with the special tokens [Start] and [Delim] as the input sequence of our model, and transform the word embeddings into contextually encoded token representations using the transformer. Finally, this contextual representation is used for all three tasks introduced below.
The first task, which is also our main task, is multilingual MRC, which aims to extract answer spans from the context passage according to the question. In this task, each language has its own data. However, only English has human-labeled training data, and the other languages use training data machine-translated from English. During training, the MRC training data in all languages is used together for fine-tuning.
In the following, we introduce our newly proposed tasks, which are trained jointly with the main task to boost multilingual MRC performance.

Mixed Machine Reading Comprehension (mixMRC)
We propose a task, named mixMRC, to detect answer boundaries even when the ⟨question, passage⟩ are in different languages, as shown in Figure 1 (b). It is mainly motivated by the strategy of data augmentation (Singh et al., 2019). In detail, we use mixMRC to derive more accurate answer span boundaries from the constructed ⟨question, passage⟩ pairs. Obtaining these pairs consists of two steps: 1) translate the training data from English into non-English languages; 2) construct mixed-language training data for the mixMRC task. We show the entire data generation process in Figure 2.
Step 1: Data Translation. When using a machine translation system to translate paragraphs and questions from English into non-English languages, the key challenge is how to track the answer span through translation.
To solve this problem, we enclose the answer text of the source passage in the special token pair "([" and "])", a marking scheme also used in prior work. After translation, we discard the training instances where the translation model does not map the answer into a well-formed span. Some of the skipped instances can still be recovered by finding the translated answer in the translated passage. The statistics of the translated data are shown in Table 3.
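The marker-based span tracking described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `mark_answer`, `recover_answer` and `recall_by_search` are hypothetical, the translator itself is out of scope (the sketch only handles text before and after translation), and the acceptance rule (exactly one well-formed marker pair) is an assumption about what "maps the answer into a span well" means.

```python
from typing import Optional, Tuple

OPEN, CLOSE = "([", "])"  # marker pair named in the text


def mark_answer(passage: str, answer_start: int, answer_text: str) -> str:
    """Enclose the answer span of the source passage in the marker pair."""
    end = answer_start + len(answer_text)
    if passage[answer_start:end] != answer_text:
        raise ValueError("answer_start does not point at answer_text")
    return passage[:answer_start] + OPEN + answer_text + CLOSE + passage[end:]


def recover_answer(translated: str) -> Optional[Tuple[str, int, str]]:
    """Return (clean_passage, answer_start, answer_text), or None when the
    markers did not survive translation as exactly one well-formed pair."""
    if translated.count(OPEN) != 1 or translated.count(CLOSE) != 1:
        return None
    start = translated.index(OPEN)
    end = translated.index(CLOSE)
    if end <= start + len(OPEN):  # markers out of order or empty answer
        return None
    answer = translated[start + len(OPEN):end]
    clean = translated[:start] + answer + translated[end + len(CLOSE):]
    return clean, start, answer


def recall_by_search(translated_passage: str,
                     translated_answer: str) -> Optional[Tuple[str, int, str]]:
    """Fallback for skipped instances: look for the separately translated
    answer inside the (marker-free) translated passage."""
    if not translated_answer:
        return None
    idx = translated_passage.find(translated_answer)
    if idx < 0:
        return None
    return translated_passage, idx, translated_answer
```

An instance would be kept when `recover_answer` succeeds, retried with `recall_by_search` otherwise, and discarded when both return `None`.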
Formally, we are given a monolingual dataset $D = \{(q_i, p_i, a_i)\}$, where $q_i$, $p_i$ and $a_i$ denote the question, passage and answer in language $i$ respectively. We apply a public translator and create a translated dataset $\tilde{D} = \{(\tilde{q}_j, \tilde{p}_j, \tilde{a}_j)\}$, where $\tilde{q}_j$ is the translation of $q_i$ and $\tilde{a}_j$ is the answer span boundary in $\tilde{p}_j$.
Step 2: Mix Language. After translation, we create a mixed-language dataset $\hat{D} = \{(\tilde{q}_k, \tilde{p}_l, \tilde{a}_l)\}$ where $l \neq k$. This encourages the MRC model to distinguish phrase boundaries through answer span selection while keeping the underlying representations of the two languages aligned. In this task, we use the same fine-tuning framework as in the monolingual MRC task.
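As a rough sketch of the mixing step, the snippet below pairs the question in one language with the passage and answer in a different language. The `Example` record, the `build_mixed_pairs` helper and the choice to enumerate every ordered language pair are illustrative assumptions; the text only requires that the question and passage languages differ.

```python
from itertools import permutations
from typing import Dict, List, NamedTuple


class Example(NamedTuple):
    question: str
    passage: str
    answer_start: int
    answer_text: str


def build_mixed_pairs(parallel: Dict[str, Example]) -> List[dict]:
    """`parallel` maps a language code to the same instance in that language.
    Returns one mixed instance per ordered pair (k, l) with k != l."""
    mixed = []
    for k, l in permutations(sorted(parallel), 2):
        q_src, p_src = parallel[k], parallel[l]
        mixed.append({
            "question_lang": k,
            "passage_lang": l,
            "question": q_src.question,
            # The answer always comes from the passage language.
            "passage": p_src.passage,
            "answer_start": p_src.answer_start,
            "answer_text": p_src.answer_text,
        })
    return mixed
```

With three languages this yields six mixed instances per source instance; a real pipeline might sample from these pairs rather than keep all of them.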

Language-agnostic Knowledge Phrase Masking (LAKM)
In this section, we first introduce the approach for mining knowledge phrases from the Web. We then introduce the masking task created with these knowledge phrases.
Figure 3: The process to generate knowledge data.
Data Generation. In the following, we describe our data generation method to collect large-scale phrase knowledge for different languages. The source data comes from a search engine and consists of queries and their top N relevant documents. Take as a running example the query {when is the myth of George Washington cutting down cherry tree made}. As shown in Figure 3, our mining pipeline consists of two main steps:
1. Phrase Candidates Generation: This step targets high recall. We enumerate all the n-grams (n=2,3,4) of the given query as phrase candidates, such as when is, the myth, George Washington, cherry tree, is the myth, etc. We further filter the candidates with a stop word list. A manual analysis (asking humans to identify all meaningful n-gram phrases in the given queries) shows that recall reaches ∼83%.
2. Phrase Filtering: This step targets high precision by removing useless phrases. For each candidate, we count its frequency in the titles of the relevant documents and keep only the frequent candidates, which we call knowledge phrases. For example, the phrases George Washington and cherry tree appear in every title. Our empirical study suggests that a frequency threshold of 0.7 gives a good balance between precision and recall, and we use this threshold in our approach.
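The two mining steps can be sketched as below. Several details are assumptions made for illustration and are not specified in the text: whitespace tokenization, dropping n-grams that start or end with a stop word, substring matching against lowercased titles, and reading the 0.7 threshold as the fraction of retrieved titles that contain the phrase. The function names are hypothetical.

```python
from typing import Iterable, List, Sequence, Set


def candidate_ngrams(query: str, stop_words: Set[str],
                     sizes: Iterable[int] = (2, 3, 4)) -> List[str]:
    """Step 1: enumerate query n-grams, then filter with a stop word list."""
    tokens = query.lower().split()
    seen: Set[str] = set()
    candidates: List[str] = []
    for n in sizes:
        for i in range(len(tokens) - n + 1):
            gram = tokens[i:i + n]
            # Assumed filter: drop n-grams that start or end with a stop word.
            if gram[0] in stop_words or gram[-1] in stop_words:
                continue
            phrase = " ".join(gram)
            if phrase not in seen:
                seen.add(phrase)
                candidates.append(phrase)
    return candidates


def filter_by_title_frequency(candidates: Sequence[str],
                              titles: Sequence[str],
                              threshold: float = 0.7) -> List[str]:
    """Step 2: keep candidates found in at least `threshold` of the titles."""
    titles_lc = [t.lower() for t in titles]
    if not titles_lc:
        return []
    kept = []
    for phrase in candidates:
        hits = sum(phrase in title for title in titles_lc)
        if hits / len(titles_lc) >= threshold:
            kept.append(phrase)
    return kept
```

Because neither step relies on a parser, tagger or entity linker, the same code applies to any language for which a stop word list and search results are available, which is the sense in which the method is language-agnostic.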
Following this approach, a large number of meaningful phrases can be mined independently of language. We then extract from the documents the passages that contain the mined knowledge phrases (following a passage creation approach similar to that of Rajpurkar et al. (2016)); these passages are the input to LAKM. For fair comparison, the number of passages is the same for every language, and the total amount of training data in LAKM is the same as that of mixMRC. The statistics of the knowledge phrases are given in Table 4.
Model Structure. Given a ⟨passage, knowledge phrases⟩ pair, denoted as $(X, Y)$, $X = (x_1, x_2, \ldots, x_m)$ is a passage with $m$ tokens and $Y = (y_1, y_2, \ldots, y_n)$ is a set of language-specific knowledge phrases generated as described above, where $y_i = (x_j, x_{j+1}, \ldots, x_{j+(l-1)})$ $(1 \le j \le m)$ and $l$ is the number of tokens in $y_i$ $(1 \le i \le n)$. The representations $h_\theta$ are obtained from the transformer. To inject language-specific knowledge into the multilingual MRC model, we use the masked language model as the fine-tuning objective. This task-specific loss has an additional summation over the length of the sequence:
$$p_t = \mathrm{softmax}\left(W \cdot h_\theta(x)_t + b\right), \qquad L_{LAKM} = -\sum_{t=1}^{m} y_{kt} \log p_t$$
where $p_t$ is the prediction for the $t$-th word, $m$ is the number of tokens in the input passage, $y_{kt}$ is the target word, $W$ and $b$ are the output projections for the task-specific loss $L_{LAKM}$, and $h_\theta(x)_t$ refers to the pre-trained embedding of the $t$-th word.
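To make the masked-language-model objective described above concrete, here is a small pure-Python sketch: each token representation is projected to vocabulary logits with W and b, a softmax turns the logits into a distribution, and the negative log-probability of the target word is summed over the sequence. The function names and the toy shapes are assumptions; in practice the sum is usually restricted to masked positions and computed with a tensor library.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one logit vector."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def lakm_loss(hidden: Sequence[Sequence[float]],
              W: Sequence[Sequence[float]],
              b: Sequence[float],
              targets: Sequence[int]) -> float:
    """hidden: m vectors of size d (one per token), W: V x d, b: size V,
    targets: m vocabulary ids. Returns the summed negative log-likelihood."""
    loss = 0.0
    for h_t, y_t in zip(hidden, targets):
        logits = [sum(w * h for w, h in zip(row, h_t)) + bias
                  for row, bias in zip(W, b)]
        p_t = softmax(logits)
        loss -= math.log(p_t[y_t])
    return loss
```

With an all-zero projection the prediction is uniform over a vocabulary of size V, so a sequence of m tokens gives a loss of m·log V, which is a convenient sanity check.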

Experiments
In this section, we first describe the datasets and evaluation in Section 4.1, then introduce the baseline models in Section 4.2 and the experimental setting in Section 4.3, and finally present the experimental results in Section 4.4.

Dataset
To verify the effectiveness of our approach, we conduct experiments on two multilingual datasets: an open benchmark called MLQA (Lewis et al., 2019), and a newly constructed multilingual QA dataset with multiple fine-grained answer types (MTQA).

MLQA.
A multilingual question answering benchmark (Lewis et al., 2019). MLQA contains QA instances in 7 languages. Due to resource limitation, we evaluate our models on three languages (English, German, Spanish) of the dataset.

MTQA.
To further evaluate our approach in a real scenario and to conduct an in-depth analysis of its impact on different answer types (in Section 5.3), we construct a new QA dataset with fine-grained answer types. The construction process is as follows:
1. ⟨question, passage⟩ pairs come from the question answering system of a commercial search engine. Specifically, the questions are real user queries issued to the search engine, which are diverse and cover various answer types. For each question, a QA system ranks the passages from the top 10 URLs returned by the search engine, and only the best passage is selected.
2. To annotate the answer span in each passage, we rely on crowdsourcing annotators. Annotators are asked to first select the best shortest span in the passage that can answer the question (only a single span is considered) and then to assign an answer type according to the query and the answer span. Each case is labeled by three annotators, and only instances labeled with consensus (at least two annotators agree on the result) are kept. An English example is given in Table 5.
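The consensus rule in step 2 amounts to a simple majority check over the three annotations. The sketch below assumes each annotation is a hashable record such as `(span_start, span_end, answer_type)`; that representation and the `consensus` helper are hypothetical.

```python
from collections import Counter
from typing import Hashable, Optional, Sequence


def consensus(annotations: Sequence[Hashable],
              min_agree: int = 2) -> Optional[Hashable]:
    """Return the annotation shared by at least `min_agree` annotators,
    or None when no such agreement exists (the instance is dropped)."""
    if not annotations:
        return None
    value, count = Counter(annotations).most_common(1)[0]
    return value if count >= min_agree else None
```

For example, `consensus([(10, 14, "animal"), (10, 14, "animal"), (8, 14, "animal")])` keeps the first span, while three mutually different annotations return `None`.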
Detailed statistics of the MTQA dataset are given in Table 6, and the distribution of answer types in our dataset is shown in Figure 4.

Experimental Evaluation
We use the same evaluation metrics as the SQuAD dataset (Rajpurkar et al., 2016), i.e., F1 and Exact Match (EM), to evaluate model performance. The Exact Match score measures the percentage of predictions that exactly match any one of the ground truths. The F1 score measures the answer overlap between predictions and ground truth: we treat the prediction and the ground truth as bags of words and compute their F1 score. For a given question, we take the maximum F1 over all of the ground truths, and then average over all of the questions.
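The two metrics can be sketched as follows. The normalization here (lowercasing and whitespace splitting) is a deliberate simplification and an assumption; the official SQuAD and MLQA scripts apply additional, partly language-specific normalization such as punctuation and article removal. The function names are illustrative.

```python
from collections import Counter
from typing import Dict, List, Sequence


def normalize(text: str) -> List[str]:
    """Simplified normalization: lowercase and split on whitespace."""
    return text.lower().split()


def exact_match(prediction: str, truths: Sequence[str]) -> float:
    """1.0 if the prediction equals any ground truth after normalization."""
    pred = normalize(prediction)
    return float(any(pred == normalize(t) for t in truths))


def f1_single(prediction: str, truth: str) -> float:
    """Bag-of-words F1 between one prediction and one ground truth."""
    pred, gold = normalize(prediction), normalize(truth)
    overlap = sum((Counter(pred) & Counter(gold)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)


def best_f1(prediction: str, truths: Sequence[str]) -> float:
    """Maximum F1 over all ground truths of one question."""
    return max((f1_single(prediction, t) for t in truths), default=0.0)


def evaluate(predictions: Sequence[str],
             references: Sequence[Sequence[str]]) -> Dict[str, float]:
    """Average EM and F1 over all questions (as fractions in [0, 1])."""
    n = len(predictions)
    if n == 0:
        return {"em": 0.0, "f1": 0.0}
    em = sum(exact_match(p, r) for p, r in zip(predictions, references)) / n
    f1 = sum(best_f1(p, r) for p, r in zip(predictions, references)) / n
    return {"em": em, "f1": f1}
```

A prediction with one extra word, such as "the southern kingdom" against the ground truth "southern kingdom", scores 0 on EM but 0.8 on F1, which is why EM is the more sensitive indicator of boundary errors.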

Baseline Models
We use the following two multilingual pre-trained models to conduct experiments:
• M-BERT: The multilingual version of BERT released by Devlin et al. (2018), which is pre-trained on monolingual corpora in 104 languages. This model has proved very effective at zero-shot transfer between languages (Pires et al., 2019).
• XLM: A cross-lingual language model (15 languages) pre-trained with both monolingual and cross-lingual data, as well as cross-lingual tasks, to enhance the transfer capacity among different languages.
For baseline, we directly fine-tune the pretrained models using MRC training data only.

Experimental Setting
We use the Adam optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.999$. The learning rate is set to 3e-5 for the mixMRC, LAKM and multilingual MRC tasks. The pre-trained model is configured with its default settings. Each of the tasks is trained until the metric of the MRC task converges.
mixMRC. We jointly train the mixMRC and multilingual MRC tasks using multi-task training at the batch level to extract the answer boundary in the given context. For both tasks, the max sequence length is 384.
LAKM. LAKM and the multilingual MRC task are jointly trained using multi-task training. On the input side, we mask 15% of all WordPiece tokens in each sequence using a two-step approach. First, if the i-th token belongs to a knowledge phrase, we replace it with (1) the [MASK] token 80% of the time, (2) a random token 10% of the time, or (3) the unchanged i-th token 10% of the time. Second, if the knowledge phrase tokens account for less than 15% of the sequence, we further randomly mask other WordPiece tokens so that the total masked ratio reaches 15%. For LAKM, the max sequence length is set to 256.
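A minimal sketch of this two-step masking is given below. It assumes the positions of knowledge-phrase tokens are already known, and it selects all of them even when they exceed the 15% budget, a case the text does not specify. The `lakm_mask` name, the `vocab` argument and the use of `None` for unselected labels are illustrative choices.

```python
import random
from typing import List, Optional, Sequence, Set, Tuple

MASK = "[MASK]"


def lakm_mask(tokens: Sequence[str],
              phrase_positions: Set[int],
              vocab: Sequence[str],
              ratio: float = 0.15,
              seed: Optional[int] = None
              ) -> Tuple[List[str], List[Optional[str]]]:
    """Return (masked_tokens, labels); labels[i] is the original token at
    selected positions and None elsewhere."""
    rng = random.Random(seed)
    n = len(tokens)
    budget = round(n * ratio)

    # Step 1: every knowledge-phrase token is selected first.
    selected = {p for p in phrase_positions if 0 <= p < n}

    # Step 2: top up with random other tokens until the budget is reached.
    others = [i for i in range(n) if i not in selected]
    extra = max(0, budget - len(selected))
    selected.update(rng.sample(others, min(extra, len(others))))

    masked = list(tokens)
    labels: List[Optional[str]] = [None] * n
    for i in sorted(selected):
        labels[i] = tokens[i]
        draw = rng.random()
        if draw < 0.8:
            masked[i] = MASK               # 80%: replace with [MASK]
        elif draw < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels
```

Selecting phrase tokens before random ones is what distinguishes this scheme from plain random masking: the prediction targets concentrate on phrase boundaries while the overall masked ratio stays at the usual 15%.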
mixMRC + LAKM. We jointly train mixMRC, LAKM and multilingual MRC tasks, take the gradients with respect to the multilingual MRC loss, mixMRC loss and LAKM loss, and apply the gradient updates sequentially at batch level. During the training, the max sequence length is 384 for multilingual MRC model, 256 for LAKM and 384 for mixMRC.
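One way to read "apply the gradient updates sequentially at batch level" is a round-robin schedule in which each training round takes one batch per task and performs one update per batch. The sketch below shows only the scheduling, independent of any deep learning framework; the `round_robin` helper, the task keys and the fixed number of rounds are assumptions (the text instead trains until the MRC metric converges), and every task is assumed to have at least one batch.

```python
from itertools import cycle
from typing import Dict, Iterator, Sequence, Tuple

# Max sequence lengths per task, as stated in the text.
MAX_SEQ_LEN = {"mrc": 384, "mixmrc": 384, "lakm": 256}


def round_robin(task_batches: Dict[str, Sequence[object]],
                num_rounds: int) -> Iterator[Tuple[str, object]]:
    """Yield (task_name, batch) so that each round visits every task once.
    Smaller task datasets are cycled when they run out of batches."""
    iterators = {name: cycle(batches) for name, batches in task_batches.items()}
    for _ in range(num_rounds):
        for name in task_batches:
            yield name, next(iterators[name])
```

A training loop would consume this generator, compute the loss of the task named in each item on its batch, and apply the optimizer step immediately, so the shared encoder receives updates from all three tasks in turn.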

Experiment Results
The overall experimental results are shown in Table 7. Compared with the M-BERT and XLM baselines, both mixMRC and LAKM bring decent improvements in fr, es and de, and on-par performance in en, on both the MLQA and MTQA datasets. This demonstrates the effectiveness of our models.
The combination of the LAKM and mixMRC tasks gets the best results on both datasets. Taking M-BERT on the MLQA dataset as an example, mixMRC+LAKM has 1.7% and 4.7% EM improvements on es and de respectively, compared with the baseline. For the LAKM task, there are decent gains for all languages, including English. However, the gains are larger on low-resource languages than on English. Taking XLM on the MLQA dataset as an example, LAKM gets 1.8% and 3.2% EM improvements on es and de, while the improvement on en is about 0.5%. The intuition behind the en gains is that LAKM brings extra data with knowledge to en as well.
For the mixMRC task, there is a slight regression on en compared with decent gains on es, de and fr. Taking XLM on the MTQA dataset as an illustration, mixMRC has a 0.6% EM regression on en versus 1.4% and 0.5% EM gains on fr and de. This shows that mixMRC mainly improves the transfer capability from rich-resource languages to low-resource languages.

Analysis
In this section, we ablate important components in LAKM to explicitly demonstrate its effectiveness.

Random N-gram Masking vs LAKM
To study the effectiveness of LAKM, we compare LAKM with Random N-gram Masking† based on XLM and the MTQA dataset. LAKM and Random N-gram Masking refer to fine-tuning XLM with the language-specific knowledge masking strategy and the random n-gram masking strategy respectively. As shown in Table 8, without the language-agnostic knowledge masking strategy, the EM metric drops by 0.2% to 0.87%, which shows the necessity of LAKM.

Zero Shot Fine-tuning w/ vs w/o LAKM
To illustrate the effectiveness of the auxiliary tasks, we consider an extreme scenario in which only English training data is available and there is no translation data. In that case, we cannot use the mixMRC task to derive more accurate answer span boundaries. We therefore leverage only LAKM to enhance answer boundary detection and compare the performance of the M-BERT baseline with our model in Table 9. The results show that zero-shot fine-tuning with LAKM is significantly better than the M-BERT baseline. On MTQA, our model gets 2%, 3.3% and 3.8% EM improvements on English, French and German respectively. On MLQA, we get 1.6%, 1.4% and 1.2% EM improvements on English, Spanish and German.
† Random N-gram Masking shows gains in English SQuAD.
Table 9: Zero-shot experimental results on the MLQA and MTQA datasets (%). We only use English MRC training data and do not use translation data.

Extensive Analysis on Fine-grained Answer Types
To gain insight into how the new tasks (LAKM/mixMRC) affect the multilingual MRC task, we further analyze model performance on various answer types, as shown in Figure 5. The comparison with the baseline indicates that for most answer types (such as color, description and money), both LAKM and mixMRC enhance answer boundary detection for the multilingual MRC task.
One interesting finding is that for the animal and full name answer types, LAKM outperforms mixMRC by a large margin, 9.1% and 14.3% respectively. One possible explanation is that the knowledge phrases of LAKM cover entity-related phrases such as animals and names, leading to the significant EM boost.
For the numerical answer types (such as money, numeric and length), the performance of mixMRC and LAKM is similar. The intuition behind this is that numerical answers may be easier to transfer between languages, since answers such as lengths look similar across languages.

Conclusion
This paper proposes two auxiliary tasks (mixMRC and LAKM) in the multilingual MRC fine-tuning stage to enhance answer boundary detection, especially for low-resource languages. Extensive experiments on two multilingual MRC datasets demonstrate the effectiveness of our proposed approach. We further analyze model performance on fine-grained answer types, which yields interesting insights.