Robust Machine Reading Comprehension by Learning Soft Labels

Neural models have achieved great success on the task of machine reading comprehension (MRC), but they are typically trained on hard labels. We argue that hard labels limit a model's ability to generalize due to the label sparseness problem. In this paper, we propose a robust training method for MRC models to address this problem. Our method consists of three strategies: 1) label smoothing, 2) word overlapping, and 3) distribution prediction. All of them help to train models on soft labels. We validate our approach on a representative architecture, ALBERT. Experimental results show that our method greatly boosts the baseline, with a 1% improvement on average, and achieves state-of-the-art performance on NewsQA and QUOREF.


Introduction
Extractive reading comprehension is a challenging task in the field of natural language processing. Its objective, as shown in Fig. 1, is to extract a fragment from a passage that answers a given question. A model needs to understand the context and locate the correct answer with exact word boundaries. Since multiple questions can be derived from a passage, with different fragments of words as the corresponding answers, the task is regarded as a benchmark for deep understanding of human language and has become a hot research topic in recent years.
Usually, there exist multiple answers (or answer instances) for a given question in the extractive reading comprehension scenario. These answers generally express the same meaning with different combinations of words, or are the same expression appearing repeatedly in different locations, e.g. the multiple correct answers shown in Fig. 1. However, the standard cross-entropy loss used in training considers only one correct answer, ignoring the other candidates.
A variety of methods have been proposed to address this issue (Xiong et al., 2017; Hu et al., 2018; Su et al., 2020). Most of them focus on the training algorithm of the MRC model, using reinforcement learning or modified loss functions to exploit the word overlap between a candidate answer and the correct answer. With this strategy, training can be very complex, and only the word overlap within a single training sample is taken into account, missing other Q&A patterns in the training set. To better resolve this problem, we take a training-data perspective and propose soft label based data augmentation, which modifies neither the model structure nor the training algorithm. As a model-independent data augmentation method, our approach permits flexibility both in how soft labels are constructed and in how the model itself is designed. Altogether we test three different methods to generate soft labels, including word overlapping, on the state-of-the-art ALBERT model. All of the methods boost ALBERT with notable improvements. In the best case, ALBERT is improved by about 2% through soft label based augmentation, proving our approach simple yet effective.

Soft Label based Data Augmentation
In this section, we introduce our soft label based data augmentation methods. We investigate three implementations of soft labels: label smoothing, word overlapping, and distribution prediction. A brief illustration and comparison of the three methods is shown in Fig. 2.
Label Smoothing (Szegedy et al., 2016). For a training sample (x, y), the probability of the correct category q(y|x) is defined as 1 and that of every other category q(¬y|x) as 0, which yields a golden one-hot distribution q. The loss function for the training sample is usually the cross-entropy loss shown in Eq. (1):
ℓ = − Σ_k q(k|x) log p(k|x)    (1)
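As a reference, the loss in Eq. (1) can be sketched in a few lines of NumPy (a minimal illustration; the function and variable names are our own, not from the paper):

```python
import numpy as np

def cross_entropy(q: np.ndarray, p: np.ndarray) -> float:
    """Cross-entropy between target distribution q and model prediction p (Eq. 1).
    With a one-hot q, this reduces to -log p(y|x)."""
    return float(-(q * np.log(p)).sum())

p = np.array([0.1, 0.7, 0.2])        # model's predicted distribution
q_onehot = np.array([0.0, 1.0, 0.0]) # golden one-hot distribution q
loss = cross_entropy(q_onehot, p)    # equals -log(0.7)
```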
where p(k|x) is the probability predicted by the model. Label smoothing mixes the original one-hot distribution q with a distribution u that is independent of the training sample to generate a new training target q′, as shown in Eq. (2):
q′(k|x) = (1 − ε) q(k|x) + ε u(k)    (2)
where ε is a weight that controls the relative importance of q and u in the final distribution, and u(k) is defined as the uniform distribution 1/K, where K is the total number of categories. In this paper, K denotes the length of the context for label smoothing.
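The mixing step in Eq. (2) can be sketched as follows for a start-position target (a minimal sketch; the function name and the toy context length are our own illustration):

```python
import numpy as np

def smooth_labels(answer_start: int, context_len: int, eps: float = 0.1) -> np.ndarray:
    """Mix the one-hot start-position target with a uniform distribution (Eq. 2)."""
    q = np.zeros(context_len)
    q[answer_start] = 1.0                        # golden one-hot distribution q
    u = np.full(context_len, 1.0 / context_len)  # uniform distribution u(k) = 1/K
    return (1.0 - eps) * q + eps * u

q_soft = smooth_labels(answer_start=3, context_len=5, eps=0.3)
# q_soft still sums to 1; the golden position keeps the largest mass
```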
Word Overlapping. Although label smoothing can weaken the golden answer, it fails to strengthen other possible correct answers. Therefore, most existing research softens the labels based on word overlapping. The word overlapping is measured by the token-level F1 score and normalized with the Softmax function over all possible answer spans, as shown in Eq. (3); the start/end label distributions are then obtained by marginalizing over all possible end/start positions:
q(s, e) = Softmax_(s,e) F1(span(s, e), a*),  q_start(s) = Σ_e q(s, e),  q_end(e) = Σ_s q(s, e)    (3)
where a* denotes the golden answer span.
Distribution Prediction. Word overlapping is more informative, but the information still comes from a single training sample. Moreover, it may generate an incorrect label distribution. In the previous example, [the 1050s] and [1050s .] have higher F1 scores than [in the 1050s], but [in the 1050s] is obviously more likely to be correct. Therefore, we build another kind of soft label in which the target distribution q′ is predicted by another model, following the idea of knowledge distillation (Hinton et al., 2015), though some details are implemented differently. Instead of controlling the temperature, we propose cross-decoding, which is similar to cross-validation. First, we randomly divide the data set T into smaller subsets {T1, T2, ..., Tn} and select {T2, T3, ..., Tn} to train a model F1, which then predicts the probabilities of the start/end positions p_s, p_e for each reserved sample x ∈ T1. Second, using the predicted distribution p as the new target distribution q′, we construct a softened example (x, y′) on T1. In subsequent iterations, we select different parts to decode, until every sample in T has a predicted label, i.e. T′ = {(x, y′) | y′ = F_i(x)}.
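The F1-based softening and the marginalization in Eq. (3) can be sketched as follows (a minimal sketch; the function names, the span-length cap, and the toy example are our own assumptions, not from the paper):

```python
import numpy as np
from collections import Counter

def token_f1(pred_tokens, gold_tokens):
    """Token-level F1 between a candidate span and the golden answer."""
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)

def span_soft_labels(context, gold_start, gold_end, max_span_len=8):
    """Softmax-normalize F1 scores over candidate spans (Eq. 3), then
    marginalize to start/end distributions."""
    gold = context[gold_start:gold_end + 1]
    n = len(context)
    scores = np.full((n, n), -np.inf)
    for s in range(n):
        for e in range(s, min(s + max_span_len, n)):
            scores[s, e] = token_f1(context[s:e + 1], gold)
    valid = np.isfinite(scores)
    weights = np.exp(scores[valid])          # Softmax over all candidate spans
    q = np.zeros((n, n))
    q[valid] = weights / weights.sum()
    return q.sum(axis=1), q.sum(axis=0)      # q_start(s), q_end(e)

q_start, q_end = span_soft_labels("in the 1050s .".split(), 0, 2)
# the golden start position receives the largest start-probability mass
```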
Since each F_i is trained on a large-scale subset {T2, T3, ..., Tn}, the predicted distribution q′ can be considered as composed of all the Q&A patterns in {T2, T3, ..., Tn}. Thus, the potentially correct candidate answers (more precisely, the Q&A patterns that appear in the training set) will have higher probability, which provides more informative guidance for the MRC model.
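The cross-decoding procedure can be sketched as a fold loop (a minimal sketch; `train_fn` and `predict_fn` are placeholder hooks of our own, standing in for ALBERT training and inference):

```python
import numpy as np

def cross_decode(dataset, n_folds, train_fn, predict_fn, seed=0):
    """Cross-decoding: for each fold T_i, train a model F_i on the other
    folds and use its prediction as the soft label for each sample in T_i."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(dataset)), n_folds)
    soft_labeled = [None] * len(dataset)
    for i, held_out in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        model = train_fn([dataset[j] for j in train_idx])   # train F_i on T \ T_i
        for j in held_out:
            soft_labeled[j] = (dataset[j], predict_fn(model, dataset[j]))
    return soft_labeled

# a toy run with placeholder train/predict functions:
demo = cross_decode(
    list(range(6)), n_folds=3,
    train_fn=lambda train_set: len(train_set),   # dummy "model": training-set size
    predict_fn=lambda model, x: (model, x))      # dummy "prediction"
```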

Experimental Settings
This paper uses SQuAD 2.0 (Rajpurkar et al., 2018), NewsQA (Trischler et al., 2017), and QUOREF (Dasigi et al., 2019) in the experiments. NewsQA is gathered from CNN articles, while the others are built from English Wikipedia; QUOREF additionally focuses on coreference resolution. SQuAD 2.0 and NewsQA both contain unanswerable questions, while QUOREF does not. The sizes of these datasets are shown in Table 1.
Without loss of representativeness, we use the state-of-the-art ALBERT as a baseline. The model used in the following experiments is ALBERT-xxlarge-v1, which performs best among all ALBERT models. We implement ALBERT by modifying HuggingFace's Transformers toolkit (Wolf et al., 2020). For label smoothing, we use different smoothing parameters on different datasets: 0.3 for SQuAD 2.0 and 0.1 for the others. For distribution prediction, we use an ensemble model as the model F for label prediction. The sub-models of the ensemble are selected from the top-performing runs of ALBERT-xlarge and ALBERT-xxlarge with different hyper-parameters; for the three datasets we use 4, 6, and 6 sub-models respectively. Each sub-model predicts a probability distribution over start/end positions; we average these distributions and use the result as the final probability of the ensemble model. The ensemble model improves every metric by 1.0-2.0 points. We also tried averaging the weights of the sub-models, but since this approach can only be applied to models of the same size and the improvement over the ensemble was not obvious, we did not adopt it.
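The ensemble's distribution averaging described above amounts to a simple mean over the sub-models' predictions (a minimal sketch; the function name and the toy distributions are our own illustration):

```python
import numpy as np

def ensemble_distribution(sub_model_probs):
    """Average the start/end probability distributions predicted by the
    sub-models; the mean serves as the ensemble's final distribution."""
    stacked = np.stack(sub_model_probs)   # shape: (n_models, context_len)
    return stacked.mean(axis=0)

p_ens = ensemble_distribution([np.array([0.7, 0.2, 0.1]),
                               np.array([0.5, 0.4, 0.1])])
# p_ens = [0.6, 0.3, 0.1]
```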

Results and Analysis
In this section, we present the experimental results and our analysis. The extractive reading comprehension task is usually evaluated with two metrics: exact match (EM) and F1; for both, higher is better. EM checks whether the answer extracted by the model is exactly the same as the correct answer. F1 does not require the predicted result to match the correct answer exactly; it measures the degree of word overlap at the token level. All results are shown in Table 2.
From Table 2, we can see that the performance of the reproduced ALBERT has a small gap with the original paper. To investigate this issue, we tried over 50 hyper-parameter sets on the ALBERT baseline, including the one given in the ALBERT paper, but none reached its reported results. Other PyTorch experiments from the Transformers community discussion 1 also reported similar results, so we believe the gap is most likely related to differences in implementation details between TensorFlow and PyTorch.
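The two metrics can be sketched as follows (a minimal sketch; the whitespace/case normalization is a common convention we assume here, not a detail stated in the paper):

```python
from collections import Counter

def exact_match(prediction: str, gold: str) -> bool:
    """EM: the prediction must equal the gold answer exactly
    (after simple whitespace/case normalization)."""
    return prediction.strip().lower() == gold.strip().lower()

def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1: harmonic mean of precision and recall over tokens."""
    pred_tokens = prediction.lower().split()
    gold_tokens = gold.lower().split()
    common = Counter(pred_tokens) & Counter(gold_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)
```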
On the NewsQA dataset, the reproduced ALBERT reaches 63.7/74.0. According to its leaderboard 2, the state-of-the-art model on NewsQA is SpanBERT (Joshi et al., 2020), whose F1 score is 73.6; ALBERT with no augmentation reaches 74.0 F1, which shows that ALBERT is a strong baseline. QUOREF also has an official leaderboard 3, on which the highest reported result is TASE-RoBERTa (Segal et al., 2019) at 79.4/85.0, also significantly lower than the ALBERT model we use.
After augmenting with soft labels, we observe steady improvement on all metrics. Label smoothing has a positive effect on each dataset. The increase on SQuAD and NewsQA is relatively small, about 0.2 on average; on QUOREF the increase is more obvious, about 0.9 on average. We think the reason the augmentation has the largest impact on QUOREF is that its Q&A patterns are different. The Q&A pairs in SQuAD and NewsQA cover a wide range of areas, including time, entities, locations, and so on. The design of QUOREF, however, focuses on pronoun resolution: the answers are mostly named entities, which repeat more frequently. Although identical in expression, they appear in different places in the context and can therefore be taken as multiple correct answers. The performance improvement of word overlapping is greater than that of label smoothing; we believe this is because word overlapping introduces less noise into the target distribution. Using the predicted distribution as an additional soft label further improves the model, with significant effects on all three datasets. This shows that distribution prediction provides more accurate labels than the other methods, which further enhances the model. Finally, with soft label based data augmentation, ALBERT is greatly improved, reaching state-of-the-art on NewsQA and QUOREF and surpassing human performance on SQuAD 2.0 and NewsQA, which demonstrates the effectiveness of our data augmentation methods.

Conclusion
In this paper, we propose a simple yet effective data augmentation strategy based on soft labels to capture multiple correct answers in the extractive reading comprehension task. We investigate three methods to generate soft labels, i.e. label smoothing, word overlapping, and distribution prediction, and validate them with ALBERT on the SQuAD 2.0, NewsQA, and QUOREF datasets. The experimental results indicate that all strategies are beneficial, with distribution prediction performing best, achieving state-of-the-art results on NewsQA and QUOREF and surpassing human performance on SQuAD 2.0 and NewsQA. Finally, we suggest that the phenomenon of multiple answers in MRC datasets is indeed widespread, and that more effective approaches are needed to address it.