Answer Span Correction in Machine Reading Comprehension

Answer validation in machine reading comprehension (MRC) consists of verifying an extracted answer against an input context and question pair. Previous work has looked at re-assessing the “answerability” of the question given the extracted answer. Here we address a different problem: the tendency of existing MRC systems to produce partially correct answers when presented with answerable questions. We explore the nature of such errors and propose a post-processing correction method that yields statistically significant performance improvements over state-of-the-art MRC systems in both monolingual and multilingual evaluation.


Introduction
Extractive machine reading comprehension (MRC) has seen unprecedented progress in recent years (Pan et al., 2019; Liu et al., 2020; Khashabi et al., 2020). Nevertheless, existing MRC systems (readers, henceforth) extract only partially correct answers in many cases. At the time of this writing, for example, the top systems on leaderboards like SQuAD (Rajpurkar et al., 2016), HotpotQA (Yang et al., 2018) and Quoref (Dasigi et al., 2019) all show a difference of 5-13 points between their exact match (EM) and F1 scores, which measure full and partial overlap with the ground truth answer(s), respectively. Figure 1 shows three examples of such errors that we observed in a state-of-the-art (SOTA) RoBERTa-large (Liu et al., 2019) model on the recently released Natural Questions (NQ) dataset (Kwiatkowski et al., 2019). In this paper, we investigate the nature of such partial match errors in MRC and their post hoc correction in context.

Recent work on answer validation (Peñas et al., 2007) has focused on improving the prediction of the answerability of a question given an already extracted answer. Hu et al. (2019) look for support for the extracted answer in local entailments between the answer sentence and the question. Back et al. (2020) propose an attention-based model that explicitly checks whether the candidate answer satisfies all the conditions in the question. Zhang et al. (2020) use a two-stage reading process: a sketchy reader produces a preliminary judgment on answerability, and an intensive reader extracts candidate answer spans to verify it.
Here we address the related problem of improving the answer span, and present a correction model that re-examines the extracted answer in context to suggest corrections. Specifically, we mark the extracted answer with special delimiter tokens and show that a corrector with architecture similar to that of the original reader can be trained to produce a new accurate prediction.

Partial Match in MRC
Short-answer extractive MRC extracts only short sub-sentence answer spans, but locating the best span can still be hard. For example, the answer may contain complex substructures, including multi-item lists or question-specific qualifications and contextualizations of the main answer entity. This section analyzes the distribution of broad categories of errors that neural readers make when they fail to pinpoint the exact ground truth span (GT) despite making a partially correct prediction. To investigate, we evaluate a RoBERTa-large reader (details in Section 3) on the NQ dev set and identify 587 examples where the predicted span has only a partial match (EM = 0, F1 > 0) with the GT. Since most existing MRC readers are trained to produce single spans, we discard examples where the NQ annotators provided multi-span answers consisting of multiple non-contiguous subsequences of the context; 67% of the 587 originally identified examples remain.
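The partial-match filter used to identify these examples can be sketched as follows, using SQuAD-style answer normalization and token-level F1 (function names are ours; the official NQ evaluation differs in some details):

```python
import re
import string

def normalize(text):
    """SQuAD-style normalization: lowercase, strip punctuation,
    drop articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in set(string.punctuation))
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def em_f1(prediction, ground_truth):
    """Exact match and token-level F1 between a predicted span and a gold span."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(ground_truth).split()
    em = float(pred_tokens == gold_tokens)
    # Count overlapping tokens with multiplicity.
    gold_counts = {}
    for t in gold_tokens:
        gold_counts[t] = gold_counts.get(t, 0) + 1
    common = 0
    for t in pred_tokens:
        if gold_counts.get(t, 0) > 0:
            common += 1
            gold_counts[t] -= 1
    if common == 0:
        return em, 0.0
    precision = common / len(pred_tokens)
    recall = common / len(gold_tokens)
    return em, 2 * precision * recall / (precision + recall)

def is_partial_match(prediction, ground_truth):
    """True for the error cases analyzed here: some overlap, but not exact."""
    em, f1 = em_f1(prediction, ground_truth)
    return em == 0 and f1 > 0
```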
There are three broad categories of partial match errors:

1. Prediction ⊂ GT: As the top example in Figure 1 shows, in these cases the reader extracts only part of the GT, dropping words or phrases such as items in a comma-separated list and qualifications or syntactic completions of the main answer entity.

2. GT ⊂ Prediction: Exemplified by the second example in Figure 1, this category comprises cases where the model's prediction subsumes the closest GT and is therefore not minimal. In many cases, these predictions lack syntactic structure and semantic coherence as a textual unit.

3. Prediction ∩ GT ≠ ∅, with neither span containing the other: This final category consists of cases similar to the last example of Figure 1, where the prediction only partially overlaps with the GT. (We slightly abuse set notation for conciseness.) Such predictions generally exhibit both verbosity and inadequacy.

Table 1 shows the distribution of errors over all categories.
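Assuming predictions and GT are available as character-offset spans, the three categories above can be distinguished mechanically (a sketch; the asserted precondition encodes "partial match, not exact"):

```python
def classify_partial_match(pred_span, gt_span):
    """Classify a partially matching prediction by its span relationship to the
    ground truth (GT). Spans are (start, end) character offsets, end-exclusive,
    and are assumed to overlap without being identical."""
    ps, pe = pred_span
    gs, ge = gt_span
    assert max(ps, gs) < min(pe, ge) and (ps, pe) != (gs, ge)
    if gs <= ps and pe <= ge:
        return "Prediction ⊂ GT"    # reader dropped part of the answer
    if ps <= gs and ge <= pe:
        return "GT ⊂ Prediction"    # prediction is not minimal
    return "Prediction ∩ GT ≠ ∅"    # overlap, but neither contains the other
```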

Method
In this section, we describe our approach to correcting partial-match predictions of the reader.

The Reader
We train a baseline reader for the standard MRC task of answer extraction from a passage given a question. The reader uses two classification heads on top of a pre-trained transformer-based language model, pointing to the start and end positions of the answer span. The entire network is then fine-tuned on the target MRC training data. For additional details on transformer-based readers, see Devlin et al. (2019).
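The paper does not spell out span decoding, but a standard approach scores each candidate span by the sum of its start and end logits, subject to a maximum answer length; a minimal sketch:

```python
def decode_span(start_logits, end_logits, max_answer_len=30):
    """Pick the highest-scoring (start, end) token pair with end >= start,
    scoring a span by start_logit[s] + end_logit[e]. Inputs are per-token
    logits from the reader's two classification heads."""
    best, best_score = (0, 0), float("-inf")
    for s in range(len(start_logits)):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = start_logits[s] + end_logits[e]
            if score > best_score:
                best_score, best = score, (s, e)
    return best
```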

The Corrector
Our correction model uses an architecture that is similar to the reader's, but takes a slightly different input. As shown in Figure 2, the input to the corrector contains special delimiter tokens marking the boundaries of the reader's prediction, while the rest is the same as the reader's input. Ideally, we want the model to keep answers that already match the GT intact and correct the rest.
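A minimal sketch of the corrector's input construction, assuming the reader's prediction is given as character offsets; the delimiter strings `[ANS]`/`[/ANS]` are illustrative, as the paper specifies only that special tokens mark the span boundaries:

```python
def build_corrector_input(question, context, pred_start, pred_end,
                          ans_open="[ANS]", ans_close="[/ANS]"):
    """Wrap the reader's predicted span (character offsets, end-exclusive)
    in delimiter tokens; the question is passed through unchanged."""
    marked_context = (context[:pred_start] + f"{ans_open} " +
                      context[pred_start:pred_end] + f" {ans_close}" +
                      context[pred_end:])
    return question, marked_context
```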
To generate training data for the corrector, we need a reader's predictions for the training set. To obtain these, we split the training set into five folds, train a reader on four of the folds and get predictions on the remaining fold. We repeat this process five times to produce predictions for all (question, answer) pairs in the training set. The training examples for the corrector are generated using these reader predictions and the original GT annotations.
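The five-fold scheme can be sketched as follows, with `train_fn` and `predict_fn` standing in for the actual reader training and inference routines:

```python
def cross_fold_predictions(examples, train_fn, predict_fn, n_folds=5):
    """Produce reader predictions for every training example without letting a
    reader predict on data it was trained on: train on four folds, predict on
    the held-out fifth, and rotate through all folds."""
    folds = [examples[i::n_folds] for i in range(n_folds)]
    predictions = {}
    for held_out in range(n_folds):
        train_data = [ex for i, fold in enumerate(folds)
                      if i != held_out for ex in fold]
        reader = train_fn(train_data)
        for ex in folds[held_out]:
            predictions[ex["id"]] = predict_fn(reader, ex)
    return predictions
```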
To create examples that do not require correction, we create a new example from each original example where we delimit the GT answer itself in the input, indicating no need for correction. For examples that need correction, we use the reader's top k incorrect predictions (k is a hyperparameter) to create an example for each, where the input is the reader's predicted span and the target is the GT. The presence of both GT (correct) and incorrect predictions in the input data ensures that the corrector learns both to detect errors in the reader's predictions and to correct them.
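Putting the two example types together, training-data generation might look like this (a sketch with our own field names; spans are offset pairs):

```python
def make_corrector_examples(question, context, gt_span, reader_preds, k=2):
    """Corrector training examples for one question: one 'no correction needed'
    example with the GT span delimited, plus up to k examples delimiting
    incorrect reader predictions, each targeting the GT."""
    examples = [{"question": question, "context": context,
                 "marked_span": gt_span, "target": gt_span}]
    incorrect = [p for p in reader_preds if p != gt_span][:k]
    for pred in incorrect:
        examples.append({"question": question, "context": context,
                         "marked_span": pred, "target": gt_span})
    return examples
```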

Datasets
We evaluate our answer correction model on two benchmark datasets.
Natural Questions (NQ) (Kwiatkowski et al., 2019) is an English MRC benchmark that contains questions from Google users and requires systems to read and comprehend entire Wikipedia articles. We evaluate our system only on the answerable questions in the dev and test sets. NQ contains 307,373 instances in the train set, 3,456 answerable questions in the dev set and 7,842 total questions in the blind test set, of which an undisclosed number is answerable. To compute exact match on answerable test set questions, we submitted a system that always outputs an answer and took the recall value from the leaderboard.

MLQA (Lewis et al., 2020) is a multilingual MRC benchmark covering seven languages: English, Arabic, German, Spanish, Hindi, Vietnamese and Chinese. Following standard practice for this benchmark, models are trained on English data and evaluated on question-paragraph pairs drawn from all language combinations.

Experimental Setup

For each dataset, the answer corrector uses the same underlying transformer language model as the corresponding reader. While creating training data for the corrector, to generate examples that need correction, we take the two (k = 2) highest-scoring incorrect reader predictions (the value of k was tuned on dev). Since our goal is to fully correct any inaccuracies in the reader's prediction, we use exact match (EM) as our evaluation metric. We train the corrector model for one epoch with a batch size of 32, a warmup rate of 0.1 and a maximum query length of 30. For NQ, we use a learning rate of 2e-5 and a maximum sequence length of 512; the corresponding values for MLQA are 3e-5 and 384, respectively.

Results
We report results obtained by averaging over three seeds. Table 2 shows the results on the answerable questions of NQ. Our answer corrector improves upon the reader by 1.6 points on the dev set and 1.3 points on the blind test set.
Results on MLQA are shown in Table 3. We compare performances in two settings: one with the paragraph in English and the question in any of the seven languages, and one with the question in English and the paragraph in any of the seven languages. Table 4 shows the differences in exact match scores for all 49 MLQA language pair combinations from using the answer corrector over the reader. On average, the corrector gives performance gains for paragraphs in all languages (last row). The highest gains are observed in English contexts, which is expected as the model was trained to correct English answers in context. However, we find that the approach generalizes well to the other languages in a zero-shot setting, as exact match improves in 40 of the 49 language pairs. We performed Fisher randomization tests (Fisher, 1936) on the exact match numbers to verify the statistical significance of our results. For MLQA, we found our reader + corrector pipeline to be significantly better than the baseline reader on the 158k-example test set at p < 0.01. For NQ, the p-value for the dev set results was approximately 0.05.
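A paired Fisher randomization test on per-example EM scores can be implemented as follows (a sketch; the exact trial count and pairing used in our experiments may differ):

```python
import random

def randomization_test(scores_a, scores_b, n_trials=10000, seed=0):
    """Paired Fisher randomization test on per-example scores (e.g. EM, 0/1).
    Under the null hypothesis the two systems are exchangeable, so we randomly
    swap each pair's scores and count how often the absolute mean difference
    meets or exceeds the observed one. Returns the estimated p-value."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    extreme = 0
    for _ in range(n_trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            diff += (a - b) if rng.random() < 0.5 else (b - a)
        if abs(diff) / n >= observed:
            extreme += 1
    return extreme / n_trials
```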

Model                  EM
Reader                 61.2
Ensemble of Readers    62.1
Reader + Corrector     62.8

Table 5: Error correction versus model ensembling.

Comparison with Equal Parameters
In our approach, the reader and the corrector have a common architecture, but their parameters are separate and independently learned. To compare with an equally sized baseline, we build an ensemble system for NQ which averages the output logits of two different RoBERTa readers. As Table 5 shows, the corrector on top of a single reader still outperforms this ensemble of readers. These results confirm that the proposed correction objective complements the reader's extraction objective well and is fundamental to our overall performance gain.
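The ensemble baseline averages the readers' output logits position-wise before span decoding; a minimal sketch:

```python
def ensemble_logits(readers_start_logits, readers_end_logits):
    """Equal-weight ensemble baseline: average the start/end logits produced
    by multiple readers for the same input. Each argument is a list of
    per-reader logit lists, aligned on token positions."""
    n = len(readers_start_logits)
    avg_start = [sum(vals) / n for vals in zip(*readers_start_logits)]
    avg_end = [sum(vals) / n for vals in zip(*readers_end_logits)]
    return avg_start, avg_end
```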

Changes in Answers
We inspect the changes made by the answer corrector to the reader's predictions on the NQ dev set. Overall, it altered 13% (450 out of 3,456) of the reader predictions. Of all changes, 24% resulted in the correction of an incorrect or a partially correct answer to a GT answer and 10% replaced the original correct answer with a new correct answer (due to multiple GT annotations in NQ). In 57% of the cases, the change did not correct the error. On a closer look, however, we observe that the F1 score went up in more of these cases (30%) compared to when it dropped (15%). Finally, 9% of the changes introduced an error in a correct reader prediction. These statistics are shown in Table 6.

R \ R+C      Correct       Incorrect
Correct      45 (10%)      43 (9%)
Incorrect    109 (24%)     253 (57%)

Table 6: Outcomes of the 450 changes made by the corrector on the NQ dev set. R and C refer to reader and corrector, respectively.

Table 8: Examples from the Natural Questions dev set wherein the answer corrector introduces an error in a previously correct reader output. The ground truth answer is marked in bold in each passage.

Table 9 shows the percentage of errors corrected in each error class. Corrections were made in all three categories, but more in GT ⊂ Prediction and Prediction ∩ GT ≠ ∅ than in Prediction ⊂ GT, indicating that the corrector learns the concepts of minimality and syntactic structure better than that of adequacy. We note that most existing MRC systems that only output a single contiguous span are not equipped to handle multi-span discontinuous GT.

Conclusion
We describe a novel method for answer span correction in machine reading comprehension. The proposed method operates by marking an original, possibly incorrect, answer prediction in context and then making a new prediction using a corrector model. We show that this method corrects the predictions of a state-of-the-art English-language reader in different error categories. In our experiments, the approach also generalizes well to multilingual and cross-lingual MRC in seven languages. Future work will explore joint answer span correction and answerability prediction.