D-NET: A Pre-Training and Fine-Tuning Framework for Improving the Generalization of Machine Reading Comprehension

In this paper, we introduce the simple system that Baidu submitted to the MRQA (Machine Reading for Question Answering) 2019 Shared Task, which focuses on the generalization of machine reading comprehension (MRC) models. Our system is built on a framework of pre-training and fine-tuning, namely D-NET. We explore pre-trained language models and multi-task learning to improve the generalization of MRC models, and we conduct experiments to examine the effectiveness of these strategies. Our system is ranked first among all participants in terms of averaged F1 score. Our code and models will be released at PaddleNLP (https://github.com/PaddlePaddle/models/tree/develop/PaddleNLP/Research/MRQA2019-D-NET).


Introduction
Machine reading comprehension (MRC) requires machines to understand text and answer questions about the text, and it is an important task in natural language processing (NLP). With the increasing availability of large-scale datasets for MRC (Rajpurkar et al., 2016; Bajaj et al., 2016; Dunn et al., 2017; Joshi et al., 2017) and the development of deep learning techniques, MRC has achieved remarkable advancements in the last few years (Wang and Jiang, 2016; Seo et al., 2016; Xiong et al., 2016; Wang et al., 2017; Yu et al., 2018). Although a number of neural models achieve even human-parity performance on several datasets, these models may generalize poorly to other datasets (Talmor and Berant, 2019).
We expect that a truly effective question answering system works well both on examples drawn from the same distribution as the training data and on examples drawn from different distributions. Nevertheless, there has been relatively little work that explores the generalization of MRC models.
This year, the MRQA (Machine Reading for Question Answering) 2019 Shared Task tests whether question answering systems can generalize well beyond the datasets on which they are trained. Specifically, participants submit question answering systems trained on a training set pooled from six existing MRC datasets, and the systems are evaluated on twelve different test datasets without any additional training examples in the target domains (i.e. generalization).
As shown in Table 1, the major challenge of the shared task is that the train and test datasets differ in the following ways:

• Questions: They come from different sources, e.g. crowdsourcing workers, exam writers, search logs, synthetic generation, etc.

• Language Understanding Ability: They might require different language understanding abilities, e.g. matching, reasoning and arithmetic.
To address the above challenge, we introduce a simple framework of pre-training and fine-tuning, namely D-NET, for improving the generalization of MRC models by exploring the following techniques:

• Pre-trained Models: We leverage multiple pre-trained models, e.g. BERT (Devlin et al., 2019), XLNET (Yang et al., 2019) and ERNIE 2.0 (Sun et al., 2019). Since different pre-trained models are trained on various corpora with different pre-training tasks (e.g. masked language model, discourse relations, etc.), they may capture different aspects of linguistics. Hence, we expect that the combination of these pre-trained models can improve the generalization capability of MRC models.

• Multi-task Learning: Since pre-training is usually performed on corpora with restricted domains, it is expected that increasing the domain diversity by further pre-training on other corpora may improve the generalization capability. Hence, we incorporate masked language model on corpora from various domains as an auxiliary task in the fine-tuning phase, along with MRC. A side benefit of adding a language modeling objective to MRC is that it can avoid catastrophic forgetting and keep the most useful features learned from the pre-training task (Chronopoulou et al., 2019). Additionally, we explore multi-task learning (Liu et al., 2019) by incorporating supervised datasets from other NLP tasks (e.g. natural language inference and paragraph ranking) to learn better language representations.
Our system is ranked first among all participants in terms of averaged F1 score. We also conduct experiments to examine the effectiveness of multiple pre-trained models and multi-task learning. Our major observations are as follows:

• The pre-trained models are still the most important factors for improving the generalization of MRC models in our experiments. Moreover, the ensembles of MRC models based on different pre-trained models show better generalization on the out-of-domain set than the ensembles of MRC models based on the same pre-trained model.
• The auxiliary task of masked language model can help improve the generalization of MRC models.
• We do not observe much improvement from the auxiliary tasks of natural language inference and paragraph ranking.
The remainder of this paper is structured as follows: Section 2 gives a detailed overview of our system. Section 3 presents the experimental settings and results. Finally, we conclude our work in Section 4.

Figure 1 depicts D-NET, a simple framework of pre-training and fine-tuning for improving the generalization capability of MRC models. There are basically two stages in D-NET: (1) we incorporate multiple pre-trained language models, and (2) we fine-tune MRC models with multi-task learning. In this section, we introduce each stage in detail.

Pre-trained Models
Recently, pre-trained language models have achieved new state-of-the-art results in MRC. Since different pre-trained models are trained on various corpora with different pre-training tasks, they may capture different aspects of linguistics. Hence, we expect that the combination of these pre-trained models may generalize well on corpora from various domains. The pre-trained models used in our experiments are listed below:

BERT (Devlin et al., 2019) uses multi-layer Transformer encoding blocks as its encoder. The pre-training tasks include masked language model and next sentence prediction, which enable the model to capture bidirectional and global information. In our system, we use the BERT-large configuration that contains 24 Transformer encoding blocks, each with 16 self-attention heads and 1024 hidden units.
Note that we use this pre-trained model for experimental purposes only, and it is not included in the final submission. In our experiments, we initialize the parameters of the encoding layers from the checkpoint of the model named BERT + N-Gram Masking + Synthetic Self-Training (Alberti et al., 2019), which can be downloaded from https://worksheets.codalab.org/worksheets/0xd7b08560b5b24bd1874b9429d58e2df1. That model is initialized from Whole Word Masking BERT (BERT-wwm) and further fine-tuned on the SQuAD 2.0 task with synthetically generated question answering corpora. In our experiments, we find that this model performs consistently better than the original BERT-large and BERT-wwm without synthetic data augmentation, as officially released by Google.
XLNET (Yang et al., 2019) uses a novel pre-training task, i.e. permutation language modeling, by introducing two-stream self-attention. Besides BooksCorpus and Wikipedia, on which BERT is trained, XLNET uses more corpora in its pre-training, including Giga5, ClueWeb and Common Crawl. In our system, we use the 'large' configuration that contains 24 layers, each with 16 self-attention heads and 1024 hidden units.
We initialize the parameters of the XLNET encoding layers using the version released by the authors. In our experiments, we find that XLNET shows superior performance on the datasets that require reasoning and arithmetic, e.g. DROP and RACE.
ERNIE 2.0 (Sun et al., 2019) is a continual pre-training framework for language understanding in which pre-training tasks can be incrementally built and learned through multi-task learning. It designs multiple pre-training tasks, including named entity prediction, discourse relation recognition and sentence order prediction, to learn language representations.
ERNIE uses the same Transformer encoder as BERT. In our system, we use the 'large' configuration that contains 24 Transformer encoding blocks, each with 16 self-attention heads and 1024 hidden units. We initialize the parameters of the ERNIE encoding layers using the officially released checkpoint (https://github.com/PaddlePaddle/ERNIE).

Table 2: The configurations and hyper-parameters of the eleven models used in our experiments. The configurations include the pre-trained models, the corpora for the masked language model task and the types of supervised NLP tasks. The hyper-parameters include the max sequence length, batch size and the mix ratio λ used for the auxiliary tasks in multi-task learning.

Fine-tuning MRC Models with Multi-Task Learning
To fine-tune MRC models, we simply use a linear output layer for each pre-trained model, followed by a standard softmax operation, to predict answer boundaries. We further introduce multi-task learning in the fine-tuning stage to learn more general language representations. Specifically, we have the following auxiliary tasks:

Masked Language Model Since pre-training is usually performed on corpora with restricted domains, it is expected that further pre-training on more diverse domains may improve the generalization capability. Hence, we add an auxiliary task, masked language model (Chronopoulou et al., 2019), in the fine-tuning stage, along with the MRC task. Moreover, we use three corpora from different domains as the input for the masked language model: (1) the passages in the MRQA in-domain datasets, which include Wikipedia, news and search snippets; (2) the search snippets from Bing (http://www.msmarco.org/dataset.aspx); and (3) the science questions in Yahoo! Answers (http://goo.gl/JyCnZq). A side benefit of adding a language modeling objective to MRC is that it can avoid catastrophic forgetting and keep the most useful features learned from the pre-training task (Chronopoulou et al., 2019). A minimal sketch of the answer-boundary prediction layer is given below.
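To make the answer-boundary prediction concrete, the following is a minimal sketch of a linear output layer with softmax over token positions. It is written in PyTorch only for brevity (the actual system is implemented in PaddlePaddle), and the class name, hidden size and loss details are illustrative assumptions rather than the submitted implementation.

```python
import torch
import torch.nn as nn

class SpanHead(nn.Module):
    """Illustrative span-prediction head: one linear layer over the encoder's
    token representations, producing start/end logits (not the actual
    PaddlePaddle implementation)."""

    def __init__(self, hidden_size=1024):
        super().__init__()
        # Two scores (start, end) per token position.
        self.qa_outputs = nn.Linear(hidden_size, 2)

    def forward(self, sequence_output, start_positions=None, end_positions=None):
        # sequence_output: [batch, seq_len, hidden_size] from the pre-trained encoder.
        logits = self.qa_outputs(sequence_output)           # [batch, seq_len, 2]
        start_logits, end_logits = logits.split(1, dim=-1)
        start_logits = start_logits.squeeze(-1)             # [batch, seq_len]
        end_logits = end_logits.squeeze(-1)

        if start_positions is not None and end_positions is not None:
            # Softmax cross-entropy over token positions for the gold boundaries.
            loss_fct = nn.CrossEntropyLoss()
            loss = (loss_fct(start_logits, start_positions) +
                    loss_fct(end_logits, end_positions)) / 2
            return loss
        return start_logits, end_logits
```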
Supervised Tasks Motivated by Liu et al. (2019), we explore multi-task learning by incorporating supervised datasets from other NLP tasks to learn more general language representations.
Specifically, we incorporate natural language inference and paragraph ranking as auxiliary tasks for MRC. (1) Previous work (Clark et al., 2019; Liu et al., 2019) shows that MNLI (Williams et al., 2017), a popular natural language inference dataset, can help improve the performance of the major task in a multi-task setting. In our system, we also leverage MNLI as an auxiliary task. (2) Previous work (Tan et al., 2017) examines the effectiveness of jointly learning MRC and paragraph ranking. In our system, we also leverage paragraph ranking as an auxiliary task. We generate the paragraph ranking datasets from the MRQA in-domain datasets. The generated data and the details of data generation will be released at PaddleNLP.
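Since the data-generation details are to be released at PaddleNLP, the following is only one plausible sketch of how paragraph-ranking examples could be derived from MRQA-style data; the field names and the labeling rule (answer-bearing paragraph as positive, other paragraphs as negatives) are assumptions, not the released procedure.

```python
def build_paragraph_ranking_examples(mrqa_examples):
    """Hypothetical sketch: derive paragraph-ranking pairs from MRQA-style
    examples, assuming each example has a question, candidate paragraphs
    and a gold answer string."""
    ranking_examples = []
    for ex in mrqa_examples:
        for paragraph in ex["paragraphs"]:
            # Label a paragraph as relevant (1) if it contains the gold answer,
            # otherwise as irrelevant (0).
            label = 1 if ex["answer"] in paragraph else 0
            ranking_examples.append({
                "question": ex["question"],
                "paragraph": paragraph,
                "label": label,
            })
    return ranking_examples
```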

Experimental Settings
In our experiments, we train eleven single models (M0-M10) under the framework of D-NET. Table 2 lists the detailed configurations and hyper-parameters of these models. In the multi-task learning settings, we randomly sample batches from the different tasks according to the mix ratio λ listed in Table 2. When fine-tuning all pre-trained models, we use the Adam optimizer with a learning rate of 3 × 10^-5, learning rate warmup over the first 10% of steps, and linear decay of the learning rate (when fine-tuning XLNET, we additionally use layer-wise learning rate decay). All models are fine-tuned for two epochs. The experiments are conducted with the PaddlePaddle framework on NVIDIA Tesla V100 GPUs (with 32GB memory).
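As a rough illustration of how auxiliary-task batches can be interleaved with MRC batches according to a mix ratio (in the spirit of Liu et al., 2019), here is a small sketch; the function name, task names and example ratios are illustrative and do not reproduce the exact values in Table 2.

```python
import random

def build_multitask_schedule(num_mrc_batches, aux_mix_ratios, seed=42):
    """Build a shuffled per-epoch schedule of task names. For each auxiliary
    task, roughly (mix_ratio * num_mrc_batches) batches are interleaved with
    the MRC batches (sketch of the sampling strategy described in the text)."""
    schedule = ["mrc"] * num_mrc_batches
    for task_name, mix_ratio in aux_mix_ratios.items():
        schedule += [task_name] * int(mix_ratio * num_mrc_batches)
    random.Random(seed).shuffle(schedule)
    return schedule

# Example usage with illustrative ratios: interleave masked LM and MNLI
# batches with the MRC batches during fine-tuning.
schedule = build_multitask_schedule(
    num_mrc_batches=1000,
    aux_mix_ratios={"masked_lm": 0.4, "mnli": 0.4},
)
```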

The Main Results and the Effects of Pre-trained Models
Table 3 shows the main results and the results for the effects of pre-trained models. From Table 3, we have the following observations: (1) Our submitted system significantly outperforms the official baseline by about 10 F1 points, and it is ranked first among all participants in terms of averaged F1 score (see the official evaluation results on the test set at https://docs.google.com/spreadsheets/d/1vE-uK4aUKqSnTyflwCrE9R9XP_J2Is2uN72tcGPKeSM). The technique of model ensembling can improve the generalization of MRC models. In the shared task, the participants are required to submit a question answering system that is able to run on a single GPU (NVIDIA TITAN Xp) within a certain latency limit. Hence, we choose to submit a system that combines only one XLNET-based model with one ERNIE-based model.
(2) The pre-trained models are still the most important factors for improving the generalization of MRC models in our experiments. For example, pure XLNET-based models perform consistently better than BERT-based models with multi-task learning. Moreover, the ensembles of MRC models based on different pre-trained models show better generalization on the out-of-domain set than the ensembles of MRC models based on the same pre-trained model. For example, the ensemble of one BERT-based model and one XLNET-based model has better generalization than the ensemble of BERT-based models and the ensemble of four XLNET-based models. By incorporating one BERT-based model into our submitted system, the generalization capability of the system is further improved. One possible reason behind this observation is that different pre-trained models are trained on different corpora with different pre-training tasks (e.g. masked language model, discourse relations, etc.), and they may capture different aspects of linguistics.
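The paper does not spell out how the individual model scores are combined in the ensemble. One common scheme, shown below as a hypothetical sketch, is to average per-token start/end probabilities across models and then pick the highest-scoring span; this assumes the models share a tokenization so their per-token scores can be aligned, which may not hold across different pre-trained models without extra alignment.

```python
import numpy as np

def ensemble_best_span(per_model_start_logits, per_model_end_logits, max_answer_len=30):
    """Hypothetical ensembling sketch: average per-token start/end probabilities
    over models (each input is a list of 1-D arrays of length seq_len) and
    return the highest-scoring valid span."""
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()

    # Average probabilities across models.
    start_probs = np.mean([softmax(l) for l in per_model_start_logits], axis=0)
    end_probs = np.mean([softmax(l) for l in per_model_end_logits], axis=0)

    best_score, best_span = -1.0, (0, 0)
    for i in range(len(start_probs)):
        for j in range(i, min(i + max_answer_len, len(end_probs))):
            score = start_probs[i] * end_probs[j]
            if score > best_score:
                best_score, best_span = score, (i, j)
    return best_span, best_score
```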

The Effects of Multi-Task Learning
We conduct experiments to examine the effects of multi-task learning on BERT. Table 4 shows the experimental results: (1) From the first two rows in Table 4, we can observe that the auxiliary task of masked language model can improve the performance on both the in-domain and out-of-domain development sets, especially on the out-of-domain set. This means the masked language model task can help improve the generalization of MRC models on out-of-domain data.
(2) From the last two rows in Table 4, we do not observe that the auxiliary tasks of natural language inference and paragraph ranking bring further benefits in terms of generalization. Although paragraph ranking brings better performance on the in-domain development set, it performs worse on the out-of-domain development set. This observation differs from previous work (Tan et al., 2017; Clark et al., 2019; Liu et al., 2019), which finds that multi-task learning can improve system performance. One possible reason might be that the MRQA training data is already large. Hence, the auxiliary tasks do not bring further advantages in terms of learning more robust language representations from more supervised data.

Summary
In summary, we have the following major observations about generalization in our experiments: (1) The pre-trained models are still the most important factors for improving the generalization of MRC models in our experiments. The ensemble of MRC models based on different pre-trained models can improve the generalization of MRC models.
(2) The auxiliary task of masked language model can help improve the generalization of MRC models.
(3) We do not observe much improvement from the auxiliary tasks of natural language inference and paragraph ranking.

Analysis
In this section, we try to examine what properties may affect the generalization capability of the submitted system. Specifically, we analyze the performance of the submitted system on different subsets of the testing set. Since the testing set differs from the training set in terms of document sources (see Table 1), we divide the testing set into two subsets: (1) Wiki & Web & News and (2) Other. Please refer to Table 5 for the detailed partition. The document source of the first subset is similar to the training set and we expect that the system works better on the first subset; the corresponding results are reported in Table 5. We also divide the testing sets by the required language understanding ability: (1) Matching, (2) Reasoning and (3) Arithmetic. Please refer to Table 6 for the detailed partition. Since most of the questions in the training set (except HotpotQA) require only matching and little reasoning, we expect the system to perform better on the first subset. From Table 6, we observe that the system performs much worse on the subsets of Reasoning and Arithmetic. One reason might be that the current models are not well designed for reasoning or arithmetic; hence, they perform worse on these subsets.

Conclusions
In this paper, we describe the simple baseline system that Baidu submitted to the MRQA 2019 Shared Task. Our system is built on a framework of pre-training and fine-tuning, namely D-NET. D-NET employs the techniques of pre-trained language models and multi-task learning to improve the generalization of MRC models, and we conduct experiments to examine the effectiveness of these strategies.