No Answer is Better Than Wrong Answer: A Reflection Model for Document Level Machine Reading Comprehension

The Natural Questions (NQ) benchmark brings new challenges to Machine Reading Comprehension: the answers are not only at different levels of granularity (long and short), but also of richer types (including no-answer, yes/no, single-span and multi-span). In this paper, we target this challenge and handle all answer types systematically. In particular, we propose a novel approach called Reflection Net, which leverages a two-step training procedure to identify no-answer and wrong-answer cases. Extensive experiments are conducted to verify the effectiveness of our approach. At the time of writing (May 20, 2020), our approach ranked first on both the long and short answer leaderboards, with F1 scores of 77.2 and 64.1, respectively.


Introduction
Deep neural network models (Cui et al., 2017; Clark and Gardner, 2018; Wang et al., 2018) have greatly advanced the state of the art in machine reading comprehension (MRC). Natural Questions (NQ) (Kwiatkowski et al., 2019) is a Question Answering benchmark released by Google that brings new challenges to the MRC area. One challenge is that answers are provided at two levels of granularity, i.e., long answers (e.g., a paragraph in the document) and short answers (e.g., an entity or entities in a paragraph). The task therefore requires models to search for answers at both the document level and the passage level. Moreover, the NQ task has richer answer types. In addition to extracting textual answer spans (long and short), models need to handle no-answer (51%), multi-span short answer (3.5%), and yes/no (1%) cases. Table 1 shows several examples from the NQ challenge (e.g., the question "who made it to stage 3 in american ninja warrior season 9" over the Wikipedia page American Ninja Warrior (season 9)).

Several works have been proposed to address the challenge of providing both long and short answers. Kwiatkowski et al. (2019) adopt a pipeline approach, where a long answer is first identified from the document and a short answer is then extracted from the long answer. Although this approach is reasonable, it may lose the inherent correlation between the long and the short answer, since they are modeled separately. Other works model the context of the whole document and train the long and short answers jointly. For example, one line of work splits a document into multiple training instances using sliding windows and leverages the overlapping tokens between windows for context modeling; an MRC model based on BERT is then applied to model long and short answer spans jointly. While previous approaches have proved effective on the NQ task, few works focus on the challenge of its rich answer types. We note that 51% of the questions in NQ have no answer; it is therefore critical for a model to accurately predict when to output an answer. Other answer types, such as multi-span short answers and yes/no answers, account for only a small percentage of the NQ set, but they should not be ignored. Instead, a systematic design that handles all answer types well is preferable in practice.

‡ Corresponding author.
* https://ai.google.com/research/NaturalQuestions/leaderboard
In this paper, we target the challenge of rich answer types, particularly no-answer. Specifically, we first train an MRC model that handles all answer types. We then run the trained MRC model over all the training data and train a second model, called the Reflection model, which takes as inputs the predicted answer, its context, and MRC head features, and predicts a more accurate confidence score that distinguishes right answers from wrong ones. There are three reasons for applying a second-phase Reflection model. First, the common practice for MRC confidence computation is based on heuristics over logits, which are unnormalized and not very comparable across questions. Second, when training a long document MRC model, the negative instances are heavily down-sampled because they far outnumber the positive ones (see Section 2.1), but at prediction time the MRC model must run over all instances. This train/predict distribution discrepancy means the MRC model may be confused by some negative instances and predict a wrong answer with a high confidence score. Third, the MRC model learns representations of the relation between the question, its type, and the answer, and is not aware of the correctness of the predicted answer. Our second-phase model addresses these three issues and resembles a reflection process, which is the source of its name. To the best of our knowledge, this is the first work to model all answer types in the NQ task. We conducted extensive experiments to verify the effectiveness of our approach. Our model ranked first on both the long and short answer leaderboards of the NQ challenge at the time of writing (May 20, 2020). The F1 scores of our model were 77.2 and 64.1, respectively, improving over the previous best results by 1.1 and 2.7 points.

Our Approach
We propose Reflection Net (see Figure 1), which consists of an MRC model for answer prediction and a Reflection model for answer confidence estimation.

MRC Model
Our MRC model (see Figure 1(b)) is based on pre-trained transformers and is able to handle all answer types in the NQ challenge. We adopt a sliding window approach to deal with long documents, slicing the whole document into overlapping windows. We pair each window with the question to form one training instance, limiting the length to 512 tokens. The instances divide into positive ones, whose window contains the answer, and negative ones, whose window does not. Since the documents are usually very long, there are far more negative instances than positive ones. For efficient training, we down-sample the negative instances.
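The instance-generation procedure above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the helper name, the stride, and the keep probability for negatives are all assumed values.

```python
import random

def make_windows(doc_tokens, question_tokens, answer_span, stride=128,
                 max_len=512, neg_keep_prob=0.1, seed=0):
    """Slice a long document into overlapping windows paired with the question.

    Windows containing the gold answer span are positives; the rest are
    negatives and are kept only with probability `neg_keep_prob`
    (illustrative down-sampling, not the paper's exact scheme).
    """
    rng = random.Random(seed)
    # Reserve room for the question and [CLS]/[SEP] special tokens.
    window_len = max_len - len(question_tokens) - 3
    instances = []
    start = 0
    while start < len(doc_tokens):
        end = min(start + window_len, len(doc_tokens))
        is_positive = (answer_span is not None
                       and answer_span[0] >= start and answer_span[1] < end)
        if is_positive or rng.random() < neg_keep_prob:
            instances.append({"window": (start, end), "positive": is_positive})
        if end == len(doc_tokens):
            break
        start += stride
    return instances
```

In practice the windows would be tokenized question+window pairs; here integer token ids stand in for both.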
The targets of our MRC model include the answer type and answer spans, denoted as $l = (t, s, e, ms)$. Here $t$ is the answer type, which can be one of the answer types described above or the special "no-answer". $s$ and $e$ are the start and end positions of the minimum single span that contains the corresponding answer; all answer types in NQ have such a minimum single span. When the answer type is multi-span, $ms$ represents the sequence labels of the answer, and is null otherwise. We adopt the B, I, O scheme to indicate multi-span answers (Li et al., 2016), in which $ms = (n_1, \ldots, n_T)$ with $n_i \in \{B, I, O\}$.

The architecture of our MRC model is as follows. The input instance $x = (x_1, \ldots, x_T)$ has the embedding

$$E(x_i) = E_w(x_i) + E_p(x_i) + E_s(x_i),$$

where $E_w$, $E_p$ and $E_s$ are the word embedding, positional embedding and segment embedding operations, respectively. The contextual hidden representation of the input sequence is

$$h(x) = T_\theta(E(x)),$$

where $T_\theta$ is a pre-trained Transformer (Vaswani et al., 2017) with parameters $\theta$. Next, we describe the three types of model outputs.

Answer Type: Following the method of Kwiatkowski et al. (2019), we classify the hidden representation of the [cls] token, $h(x_1)$, into answer types:

$$p^{type} = \mathrm{softmax}(W_o\, h(x_1)),$$

where $p^{type} \in \mathbb{R}^K$ is the answer type probability, $K$ is the number of answer types, $h(x_1) \in \mathbb{R}^H$, $H$ is the hidden size of the Transformer, and $W_o \in \mathbb{R}^{K \times H}$ is a parameter matrix to be learned. The loss of answer type prediction is

$$L_{type} = -\log p^{type}_t,$$

where $t$ is the ground truth answer type.
Single Span: As described above, all answer types have a minimal single span. We model this target by predicting the start and end positions independently. For the no-answer case, we set both positions to point to the [cls] token. The position probabilities are

$$p^{start} = \mathrm{softmax}(S \cdot h(x)), \qquad p^{end} = \mathrm{softmax}(E \cdot h(x)),$$

where $S \in \mathbb{R}^H$ and $E \in \mathbb{R}^H$ are parameters to be learned. The single span loss is

$$L_{span} = -\log p^{start}_s - \log p^{end}_e.$$

Multi Spans: We formulate multi-span prediction as a sequence labeling problem. To make the loss comparable with those for answer type and single span, we do not use a CRF or other traditional sequence labeling loss; instead, we directly feed the hidden representation of each token to a linear transformation and classify it into B, I, O labels:

$$p^{label}_i = \mathrm{softmax}(W_l\, h(x_i)),$$

where $p^{label}_i \in \mathbb{R}^3$ is the B, I, O label probability of the $i$-th token and $W_l \in \mathbb{R}^{3 \times H}$ is a parameter matrix. The loss of multi spans is

$$L_{multi\text{-}span} = -\frac{1}{T}\sum_{i=1}^{T} \log p^{label}_{i, n_i}.$$

Combining the three losses above, the total MRC model loss is

$$L_{MRC} = L_{type} + L_{span} + L_{multi\text{-}span}.$$

For cases which do not have a multi-span answer, we simply set $L_{multi\text{-}span} = 0$.
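The combined loss can be illustrated with a small pure-Python sketch. Function names and the unbatched scalar formulation are our own; a real implementation would operate on batched tensors.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

def mrc_loss(type_logits, start_logits, end_logits, label_logits, target):
    """Total MRC loss: answer-type NLL + start/end span NLL + mean B/I/O NLL.

    `target` = (t, s, e, ms) as in the paper; `ms` is None when the answer
    is not multi-span, in which case the multi-span loss is 0.
    """
    t, s, e, ms = target
    l_type = -math.log(softmax(type_logits)[t])
    l_span = (-math.log(softmax(start_logits)[s])
              - math.log(softmax(end_logits)[e]))
    if ms is None:
        l_multi = 0.0
    else:
        l_multi = -sum(math.log(softmax(label_logits[i])[ms[i]])
                       for i in range(len(ms))) / len(ms)
    return l_type + l_span + l_multi
```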
Besides predicting the answer, the MRC model should also output a corresponding confidence score. In practice, we use the following heuristic to represent the confidence score of the predicted span:

$$score = S \cdot h(x_s) + E \cdot h(x_e) - S \cdot h(x_1) - E \cdot h(x_1), \qquad (12)$$

where $x_s$, $x_e$ and $x_1$ are the predicted start, end and [cls] tokens, respectively, and $S$ and $E$ are the parameters learned in Eqs. (6) and (7).

Table 2: Head features.
- score: heuristic answer confidence score based on MRC model predictions, e.g. Eq. (12).
- ans type: one-hot answer type feature; the type corresponding to the predicted answer is one, the others are zero.
- ans type probs: the probabilities of each answer type, e.g. Eq. (4).
- ans type prob: the probability of the answer type corresponding to the predicted answer.
- start logits: start logits of the predicted answer, the [cls] token, and the top n start logits.
- end logits: end logits of the predicted answer, the [cls] token, and the top n end logits.
- start probs: start probabilities of the predicted answer, the [cls] token, and the top n start probabilities.
- end probs: end probabilities of the predicted answer, the [cls] token, and the top n end probabilities.
To be specific about answer prediction and confidence score calculation: first, we use the MRC model to predict spans for all the sliding window instances of a document; second, we rank the predicted single spans by the score of Eq. (12), choose the top 1 as the predicted answer, and determine the answer type from the probabilities of Eq. (4); if the answer type is multi-span, we further decode its corresponding sequence labels; third, we select as the long answer the top-level DOM tree node containing the predicted top 1 span. The final confidence score of the predicted answer is its corresponding span score.
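Since Eq. (12) is a difference of logits, the document-level ranking step can be sketched directly on logit lists. Names are illustrative, and index 0 is assumed to be the [cls] position in each window.

```python
def span_score(start_logits, end_logits, s, e, cls_idx=0):
    """Heuristic span confidence in the spirit of Eq. (12): the span's
    start/end logits minus the [cls] (no-answer) start/end logits."""
    return (start_logits[s] + end_logits[e]
            - start_logits[cls_idx] - end_logits[cls_idx])

def pick_best_span(window_predictions):
    """Rank candidate spans from all sliding windows and keep the top 1.

    `window_predictions` is a list of ((s, e), start_logits, end_logits)
    triples, one candidate per window.
    """
    best, best_score = None, float("-inf")
    for (s, e), sl, el in window_predictions:
        sc = span_score(sl, el, s, e)
        if sc > best_score:
            best, best_score = (s, e), sc
    return best, best_score
```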

Reflection Model
The Reflection model targets a more precise confidence score that distinguishes right answers from two kinds of wrong ones (see Section 3.4): predicting a wrong answer for a has-ans question, and predicting any answer for a no-ans question.
Training Data Generation: To generate the Reflection model's training data, we run the trained MRC model above over its full training data (i.e., all the sliding window instances):

• For all the instances belonging to one question, we select only the instance with the top 1 predicted answer according to its confidence score.
• The selected instance, the MRC-predicted answer, its corresponding head features (described below) and a correctness label (1 if the predicted answer equals the ground-truth answer, 0 otherwise) together form a training case for the Reflection model†.

† When the MRC model predicts "no-answer", the Reflection model discards the question, since the final output of no-answer is already determined.

Reflection Model Training: As shown in Figure 1(a), we initialize the Reflection model with the parameters of the trained MRC model, and use a learning rate several times smaller than the one used for the MRC model. To directly receive important state information from the MRC model, we extract head features from its top layer while it predicts the answer. As detailed in Table 2, the score and ans type prob features are the two most straightforward; the probability and logit features correspond to the "soft targets" of knowledge distillation (Hinton et al., 2015), so-called "dark knowledge" whose distribution reflects the MRC model's inner state during answer prediction. Here we use only the top n = 5 logits/probs instead of all of them. The head features are concatenated with the hidden representation of the [cls] token, followed by a hidden layer for final confidence prediction.
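Assembling the Table 2 features into one vector can be sketched as follows. The function name, argument layout, and feature ordering are our own assumptions; for brevity the sketch uses logits only and skips the parallel probability features.

```python
def head_features(type_probs, pred_type, start_logits, end_logits,
                  score, top_n=5):
    """Build a flat head-feature vector for one predicted answer.

    Index 0 of the logit lists is assumed to be the [cls] position.
    Layout (illustrative): [score] + one-hot type + type probs
    + prob of predicted type + [cls] start logit + top-n start logits
    + [cls] end logit + top-n end logits.
    """
    one_hot = [1.0 if i == pred_type else 0.0 for i in range(len(type_probs))]
    top_starts = sorted(start_logits, reverse=True)[:top_n]
    top_ends = sorted(end_logits, reverse=True)[:top_n]
    return ([score] + one_hot + list(type_probs) + [type_probs[pred_type]]
            + [start_logits[0]] + top_starts
            + [end_logits[0]] + top_ends)
```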
Formulation: The Reflection model takes as inputs the selected instance x and the predicted answer. In detail, we create a dictionary Ans whose elements are answer types and answer position marks‡. We add the answer type mark to the [cls] token, the position marks to the corresponding tokens, and EMPTY to all other tokens. The embedding representation of the i-th token is

$$E^r(x_i) = E(x_i) + E_r(f_i),$$

where $r$ denotes the Reflection model, $E(x_i)$ is taken from Eq. (2), $f_i$ is the element of Ans corresponding to token $x_i$ as described above, and $E_r$ is its embedding operation, whose parameters are randomly initialized. We use the same Transformer architecture as the MRC model with parameters $\Phi$, denoted $T_\Phi$. The contextual hidden representations are

$$h^r(x) = T_\Phi(E^r(x)).$$

Then we concatenate the [cls] token representation $h^r(x_1)$ with the head features and send it through a linear transformation activated with GELU (Hendrycks and Gimpel, 2016) to get the aggregated representation

$$u = \mathrm{GELU}(W_r\, [h^r(x_1); head(x)]),$$

where $W_r \in \mathbb{R}^{H \times (H+h)}$ is a parameter matrix and $head(x) \in \mathbb{R}^h$ are the head features§. Finally, we get the confidence score as a probability

$$p = \mathrm{sigmoid}(A \cdot u),$$

where $A \in \mathbb{R}^H$ is a parameter vector. The loss is the binary cross entropy

$$L_{reflection} = -\big(y \log p + (1 - y)\log(1 - p)\big), \qquad (17)$$

where $y$ is the correctness label. For preprocessing, we use the sliding window approach to slice the document into instances as described in Section 2.1. For NQ, since the document is quite long, we add special atomic markup tokens to indicate which part of the document the model is reading.
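The confidence head at the end of this formulation (concatenate, GELU-activated linear layer, sigmoid) can be sketched with toy dimensions. All names and shapes here are illustrative; a real model uses H-dimensional hidden vectors and learned parameters.

```python
import math

def gelu(x):
    """tanh approximation of GELU (Hendrycks and Gimpel, 2016)."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi)
                                      * (x + 0.044715 * x ** 3)))

def reflection_confidence(cls_vec, head_feats, W, b, A, c):
    """Confidence head sketch: concatenate the [cls] representation with the
    head features, apply a GELU-activated linear layer, then a sigmoid.

    W is an (hidden x (len(cls_vec) + len(head_feats))) matrix as nested
    lists, b the hidden bias, A the output weights, c the output bias.
    """
    z = cls_vec + head_feats  # list concatenation = vector concatenation
    hidden = [gelu(sum(w_ij * z_j for w_ij, z_j in zip(row, z)) + b_i)
              for row, b_i in zip(W, b)]
    logit = sum(a_i * h_i for a_i, h_i in zip(A, hidden)) + c
    return 1.0 / (1.0 + math.exp(-logit))
```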

Implementation
Our implementation is based on Huggingface Transformers (Wolf et al., 2019).

Baselines
The first baseline is DocumentQA (Clark and Gardner, 2018), proposed to address the multi-paragraph reading comprehension task. The second baseline is DecAtt + DocReader, a pipeline approach (Kwiatkowski et al., 2019) that decomposes the full document reading comprehension task into first selecting the long answer and then extracting the short answer. The third baseline, BERT joint, is similar to our MRC model but omits yes/no and multi-span short answers, and it has no confidence prediction model like the Reflection model. The remaining two baselines are a single human annotator (Single-human) and an ensemble of human annotators (Super-annotator).

Results
The dev set results are shown in Table 3 (as of May 20, 2020). The top block of rows contains the baselines described in Section 3.2, the middle rows are the top 3 methods on the leaderboard, and the last row is ours, which ranked first on both the long and short answer leaderboards. Note that in terms of the R@P=90 metric, which is the one mostly used in real production scenarios, we surpass the top system by 12.8 and 6.8 absolute points for long and short answers, respectively.
Our all answer type models surpass BERT joint, which ignores yes/no and multi-span answers, by F1 scores of 4.8 and 1.8 points for long and short answers, respectively. This shows the effectiveness of addressing all answer types in NQ. Compared with BERT all type and RoBERTa all type, our Reflection model further boosts performance significantly by providing a more accurate answer confidence score. Taking RoBERTa all type as an example, our Reflection model improves the F1 scores of long and short answers by 2.9 and 3.1 points, respectively, outperforming the single human annotator on both long and short answers. For the ensemble, we train 3 RoBERTa all type models with different random seeds. When predicting, for each question we merge identical answers by summing their confidence scores and then select the final answer with the highest confidence score. For "+ Reflection", we use the same shared Reflection model to provide confidence scores for the answers predicted by these three MRC models and apply the same ensemble strategy. We see that the Reflection model can further boost the MRC ensemble thanks to a more precise and consistent score. Table 4 shows the leaderboard results on the sequestered test data. At the time of writing, there are 40+ submissions on each of the long and short answer leaderboards. We list the aforementioned two baselines, DecAtt + DocReader and BERT joint, the top 3 submissions, and our ensemble (Ensemble (3) + Reflection). We ranked first on both the long and short answer leaderboards. In real production scenarios, the most practical metric is recall at a fixed high precision, such as R@P=90; for example, in search engine scenarios, a question answering system should provide answers above a guaranteed high precision bar. In terms of R@P=90, our method surpasses the top submissions by a large margin: 12.8 and 6.8 points for long and short answers, respectively.
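Recall at a fixed precision such as R@P=90 can be computed by sweeping a confidence threshold from high to low. This is a minimal sketch with our own helper name; `total_gold` stands for the number of questions that have a gold answer.

```python
def recall_at_precision(scored, total_gold, target_precision=0.9):
    """Best recall achievable at or above `target_precision`.

    `scored` is a list of (confidence, is_correct) pairs, one per
    triggered prediction. Sweeping the threshold is equivalent to
    walking the predictions in decreasing confidence order.
    """
    best_recall, tp, fp = 0.0, 0, 0
    for conf, ok in sorted(scored, reverse=True):
        if ok:
            tp += 1
        else:
            fp += 1
        precision = tp / (tp + fp)
        if precision >= target_precision:
            best_recall = max(best_recall, tp / total_gold)
    return best_recall
```

A more precise, better-calibrated confidence score pushes correct answers above incorrect ones in this ordering, which is exactly what lifts R@P=90.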

Analysis
NQ contains questions that have an answer (has-ans) and questions that have no answer (no-ans).

Table 5: The count of model predictions categorized as right-ans, wrong-ans and no-ans. Compared with RoBERTa all type, the Reflection model decreases wrong-ans and increases no-ans and right-ans.

For has-ans questions, a well-performing model should predict right-ans as much as possible and wrong-ans as little as possible, or replace wrong-ans with no-ans to increase precision. For no-ans questions, the best behavior is to always predict no-ans, because predicting any answer amounts to wrong-ans. As shown in Table 5, about half the questions in NQ are no-ans (3222 for long, 4374 for short in the dev set), which is challenging. The MRC model (RoBERTa all type), though powerful, predicts many wrong-ans in each scenario. With our Reflection model providing a more accurate confidence score, which is used to decide answer triggering, the prediction count of wrong-ans decreases and that of no-ans increases saliently, leading to improved evaluation metrics. The overall trend agrees well with our title, "No answer is better than wrong answer". However, the no-answer and wrong-answer identification problem is hard and far from solved: ideally, every wrong-ans case would be assigned a low confidence score and thus identified as no-ans, which requires more powerful confidence models.

Table 6: Ablation study on answer types. We compare the all answer type model with ablations of multi-span, yes/no, and both.
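Returning to the answer-triggering analysis of Table 5, the decision rule can be sketched as follows. The threshold value and function name are assumptions for illustration; the paper tunes the triggering threshold on dev data.

```python
def triggered_outcome(pred_answer, confidence, gold_answer, threshold=0.5):
    """Categorize one prediction as in Table 5.

    Below the confidence threshold (or with no predicted span at all),
    the system outputs no-answer; otherwise the prediction is right-ans
    or wrong-ans depending on whether it matches the gold answer.
    """
    if pred_answer is None or confidence < threshold:
        return "no-ans"
    return "right-ans" if pred_answer == gold_answer else "wrong-ans"
```

Under this rule, a better-calibrated confidence score converts wrong-ans cases into no-ans without sacrificing right-ans cases, which is the trend Table 5 shows.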
As described in Section 2.1, our MRC model can deal with all answer types. We perform experiments to verify the effectiveness of handling these answer types for short answers, based on the same RoBERTa large MRC architecture. As shown in Table 6, not handling multi-span answers results in a 0.8 point F1 drop, and not handling yes/no answers leads to a 1.4 point F1 drop. When we handle neither multi-span nor yes/no answers but only single-span answers, we get a 56.0 F1 score, 2.2 points less than our all answer type model, RoBERTa all type. Note that the ratios of multi-span and yes/no answer types are only 3.5% and 1%, respectively, so a 2.2 point gain is quite decent considering the low coverage of these answer types.

Ablation and Variant of Reflection Model
For the ablation/variant experiments on the Reflection model, we use the same MRC model, RoBERTa all type, to predict answers, which means all variants have exactly the same answers but different confidence scores. The results are shown in Table 7.
Comparison with Verifier: To compare with answer verifiers (Tan et al., 2018; Hu et al., 2019), we build an analogue by taking the following steps on top of the Reflection model: remove the head features, keep the predicted answer input, and initialize the transformer with the original RoBERTa large parameters. This setting corresponds to a RoBERTa-based verifier. The result is shown in the "w/o head features & init." row; although it brings a 1.2 and 0.7 point F1 boost for long and short answers, respectively, it is less effective than our Reflection model. This demonstrates that the head features and the parameter initialization from the MRC model are very important for the Reflection model to perform well.
Effect of Head Features: The head features are manually crafted features based on the MRC model, as described in Section 2.2. We believe these features contain state information that can be leveraged to predict an accurate confidence score. To justify this assumption, we feed the head features alone into a feed-forward neural network (FNN) with one hidden layer of size 200 and one output neuron that produces the confidence score. To train this FNN, we use the same pipeline and training target as for our Reflection model. The results are shown in the "only head features" row. This configuration saves a lot of the memory and computation cost of prediction: all the data only needs to pass through one Transformer. The results show it can improve most of the metrics. However, since the [cls] representation in the MRC model targets answer type classification, which includes no-answer but is not aware of predicted wrong-ans, its performance is not as good as the Reflection model's.

Related Work
Machine Reading Comprehension: Machine reading comprehension (Hermann et al., 2015; Rajpurkar et al., 2016; Clark and Gardner, 2018) is mostly based on the attention mechanism (Bahdanau et al., 2015; Vaswani et al., 2017): models take the question and paragraph as input, compute an interactive representation of them, and predict the start and end positions of the answer. When dealing with no-answer cases, a popular method is to jointly model the answer position probability and the no-answer probability through a shared softmax normalizer (Kundu and Ng, 2018; Clark and Gardner, 2018), or to independently model answerability as a binary classification problem (Hu et al., 2019). For long document processing, there are pipeline approaches such as IR + span extraction and DecAtt + DocReader (Kwiatkowski et al., 2019), the sliding window approach, and recently proposed long-sequence Transformers (Kitaev et al., 2020; Guo et al., 2019; Beltagy et al., 2020).

Answer Verifier: Answer verifiers (Tan et al., 2018; Hu et al., 2019) are proposed to validate the legitimacy of the answer predicted by an MRC model. First, an MRC model is trained to predict the candidate answer; then a verification model takes the question and the answer sentence as input and further verifies the validity of the answer. Our method extends the ideas of this line of work, with some main differences. The primary one is that our model takes as inputs the answer, its context, and the MRC model's state at the point where the answer is generated. Another difference is that our model is based on a transformer and is initialized with the MRC model's parameters.

Conclusion
In this paper, we propose a systematic approach to handle rich answer types in MRC. In particular, we develop a Reflection model to address no-answer and wrong-answer cases. The key idea is to train a second-phase model that predicts the confidence score of a predicted answer based on its content, its context and the state of the MRC model. Experiments show that our approach achieves state-of-the-art results on the NQ set: measured by F1 and R@P=90, on both long and short answers, our method surpasses the previous top systems by a large margin. Ablation studies further confirm the effectiveness of our approach.