RECONSIDER: Improved Re-Ranking using Span-Focused Cross-Attention for Open Domain Question Answering

State-of-the-art Machine Reading Comprehension (MRC) models for Open-domain Question Answering (QA) are typically trained for span selection using distantly supervised positive examples and heuristically retrieved negative examples. This training scheme possibly explains empirical observations that these models achieve a high recall amongst their top few predictions, but a low overall accuracy, motivating the need for answer re-ranking. We develop a successful re-ranking approach (RECONSIDER) for span-extraction tasks that improves upon the performance of MRC models, even beyond large-scale pre-training. RECONSIDER is trained on positive and negative examples extracted from high confidence MRC model predictions, and uses in-passage span annotations to perform span-focused re-ranking over a smaller candidate set. As a result, RECONSIDER learns to eliminate close false positives, achieving a new extractive state of the art on four QA tasks, with 45.5% Exact Match accuracy on Natural Questions with real user questions, and 61.7% on TriviaQA. We will release all related data, models, and code.


Introduction
Open-domain Question Answering (Voorhees et al., 1999) (QA) involves answering questions by extracting correct answer spans from a large corpus of passages, and is typically accomplished by a light-weight passage retrieval model followed by a heavier Machine Reading Comprehension (MRC) model (Chen et al., 2017). The span selection components of MRC models are trained on distantly supervised positive examples (containing the answer string) together with heuristically chosen negative examples, typically from upstream retrieval models. This training scheme possibly explains empirical findings (Wang et al., 2018b,c) that while MRC models can confidently identify top-K answer candidates (high recall), they cannot effectively discriminate between top semantically similar false positive candidates (low accuracy). In this paper, we develop a general approach to make answer reranking successful for span-extraction tasks, even over large pretrained models, and improve the state of the art on four QA datasets.
Earlier work (Wang et al., 2018b,c) on open-domain QA has recognized the potential of answer re-ranking, which we continue to observe despite recent advances using large pre-trained models like BERT. Figure 1 shows the top-3 predictions of a BERT-based SOTA model on a question from Natural Questions (NQ) (Kwiatkowski et al., 2019), "Who was the head of the Soviet Union when it collapsed?" While all predictions are very relevant and refer to Soviet Union heads, Mikhail Gorbachev is correct and the rest are close false positives. Table 1 presents accuracies obtained by the same model on four QA datasets, if the answer exactly matches any of the top-k predictions for k = 1, 5, 10 and 25. We observe that an additional 10% and 20% of correct answers exist amongst the top-5 and top-25 candidates respectively, presenting an enormous opportunity for span re-ranking models.
Our re-ranking model is trained using positive and negative examples extracted from high-confidence MRC model predictions, and thus learns to eliminate hard false positives. This can be viewed as a coarse-to-fine approach to training span selectors, with the base MRC model trained on heuristically chosen negatives and the re-ranker trained on finer, more subtle negatives. This contrasts with multi-task training approaches (Wang et al., 2018c), whose re-scoring gains are limited by training on the same data, especially when coupled with large pre-trained models. Our approach also scales to any number of ranked candidates, unlike previous concatenation-based cross-passage re-ranking methods (Wang et al., 2018b) that do not transfer well to current length-bounded large pre-trained models. Similar to MRC models, our re-ranking approach uses cross-attention between the question and a candidate passage (Seo et al., 2016). However, we now demarcate a specific candidate answer span in each passage, to assist the model in performing span-focused reasoning, in contrast to MRC models, which must reason across all spans. Therefore, the re-ranker performs span ranking of carefully chosen candidates, rather than span selection like the MRC model. Similar focused cross-attention methods have recently proved effective for Entity Linking tasks, although they annotate the query rather than the passage.
We apply our broadly applicable span-focused re-ranking approach to state-of-the-art models and achieve a new extractive state of the art on four QA datasets, including 45.5% on the open-domain setting of NQ (real user queries, +1.6% on small models) and 61.1% on TriviaQA (Joshi et al., 2017) (+2.5% on small models). To our knowledge, we are the first to successfully leverage re-ranking to improve over large pre-trained models on open-domain QA.

Background
Open-domain Question Answering (QA) aims to answer factoid questions from a large corpus of passages (Voorhees et al., 1999) (such as Wikipedia), in contrast with single-passage MRC tasks (Rajpurkar et al., 2016). Prior work uses pipelined approaches that first retrieve candidate passages and subsequently use a neural MRC model to extract answer spans (Chen et al., 2017), with further improvements from joint learning (Wang et al., 2018a; Tan et al., 2018). Recent successes involve improving retrieval, thereby increasing the coverage of passages fed into the MRC model (Guu et al., 2020). In this paper, we significantly improve MRC model performance through span-focused re-ranking of its highly confident predictions. For open-domain QA, it is crucial to train MRC models to distinguish passage-span pairs containing the answer (positives) from those that do not (negatives). Using negatives that appear as close false positives can produce more robust MRC models. However, prior work relies on upstream retrieval models to supply distantly supervised positives (containing the answer string) and negatives (Asai et al., 2020), which are in turn trained using heuristically chosen positives and negatives. Our approach leverages positives and negatives from highly confident MRC predictions, which are hard to classify, and thus improves upon MRC model performance.
Jia and Liang (2017) motivate recent work on answer verification for QA by showing that MRC models are easily confused by similar passages. Wang et al. (2018b) use a weighted combination of three re-rankers and rescore a concatenation of all passages containing a particular answer using a sequential model, while Wang et al. (2018c) develop a multi-task, end-to-end answer scoring approach. Although the main idea is to consider multiple passage-span candidates collectively, such approaches either use concatenation, which is prohibitively expensive to couple with length-restricted models like BERT, or are trained on the same data without variation, realizing only marginal gains. Hu et al. (2019) use answer verification to predict the unanswerability of a question-passage pair for traditional MRC tasks. To our knowledge, our work is the first to (i) successfully demonstrate a re-ranking approach that significantly improves over large pre-trained models in an open-domain setting, and (ii) use annotated top model predictions as harder negatives to train more robust models for QA.

Model
We assume an extractive MRC model M coupled with a passage retrieval model that, given a question q and a passage corpus P, produces a list of N passage-span pairs, {(p_j, s_j)}_{j=1}^N, where p_j ∈ P and s_j is a span within p_j, ranked by the likelihood of s_j answering q. Note that {p_j}_{j=1}^N is not disjoint, as a passage can have multiple answer spans. In this section, we develop a span-focused re-ranking model R that learns a distribution p over the top-K pairs (p_j, s_j), 1 ≤ j ≤ K, given question q. Model R first scores every (q, p_j, s_j) triple using a scoring function r, and then normalizes over these scores to produce p:

p(p_j, s_j | q) = exp(r(q, p_j, s_j)) / Σ_{j'=1}^{K} exp(r(q, p_{j'}, s_{j'}))    (1)

Specifically, if E(q, p_j, s_j) ∈ R^H is a dense representation of (q, p_j, s_j), r is defined as r(q, p_j, s_j) = w · E(q, p_j, s_j), where w ∈ R^H is a learnable vector.
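As an illustration (not the authors' released code; array names and shapes are our own), the scoring and normalization above amount to a softmax over the K candidate scores:

```python
import numpy as np

def rerank_distribution(E, w):
    """Softmax re-ranking distribution over K candidates.

    E: (K, H) array; E[j] stands in for the dense representation
       E(q, p_j, s_j) (the paper uses BERT's [CLS] vector).
    w: (H,) learnable vector.
    Returns p, a length-K probability distribution over (p_j, s_j) pairs.
    """
    r = E @ w                        # r(q, p_j, s_j) = w . E(q, p_j, s_j)
    r = r - r.max()                  # subtract max for numerical stability
    p = np.exp(r) / np.exp(r).sum()  # normalize over the K candidates
    return p
```

In practice E and w would be produced and updated by the fine-tuned BERT model; here they are plain arrays to show only the scoring arithmetic.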
Span-focused tuple encoding We compute E using the representation of the [CLS] token of a BERT model applied to a span-focused encoding of (q, p_j, s_j). This encoding is generated by first marking the tokens of s_j within passage p_j with special start and end symbols [A] and [/A], to form p̃_j, followed by concatenating the [CLS] and question tokens with the annotated passage tokens p̃_j, using the separator token [SEP]. We find span marking to be a crucial ingredient for answer re-ranking, without which performance deteriorates (Section 5).
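A minimal sketch of this encoding step, operating on token lists (the example question and tokenization are hypothetical; the actual system uses BERT's WordPiece tokenizer):

```python
def span_focused_encoding(question_tokens, passage_tokens, span_start, span_end):
    """Build the re-ranker input: [CLS] question [SEP] passage-with-marked-span.

    The candidate answer span (inclusive token indices into passage_tokens)
    is demarcated with special markers [A] and [/A] so the model can perform
    span-focused reasoning over that candidate.
    """
    marked_passage = (
        passage_tokens[:span_start]
        + ["[A]"] + passage_tokens[span_start:span_end + 1] + ["[/A]"]
        + passage_tokens[span_end + 1:]
    )
    return ["[CLS]"] + question_tokens + ["[SEP]"] + marked_passage

# Hypothetical example: candidate span "william shakespeare" at indices 4-5.
enc = span_focused_encoding(
    ["who", "wrote", "hamlet"],
    ["hamlet", "was", "written", "by", "william", "shakespeare"],
    span_start=4, span_end=5,
)
```

The [A] and [/A] markers would be registered as special tokens in the model's vocabulary so they are never split during tokenization.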
Training We obtain the top-K predictions (p_j, s_j) of model M for each question q_i in its training set, and divide them into positives, where s_j exactly matches the ground-truth answer, and negatives otherwise. We train R using mini-batch gradient descent, where in each iteration, for question q, we include 1 randomly chosen positive and M − 1 randomly chosen negatives, and maximize the likelihood of the positive. Unlike the heuristically chosen negatives used to train M, R is trained using negatives from high-confidence predictions of M, which are harder to classify. Thus, this can be viewed as an effective coarse-to-fine negative selection strategy for span extraction models (Section 5).
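The sampling and objective can be sketched as follows (our own simplification, with candidates reduced to scalar scores; the real loss is computed over BERT outputs):

```python
import random
import numpy as np

def training_batch(positives, negatives, M):
    """Sample 1 positive and M-1 negatives from the model's high-confidence
    predictions for one question (coarse-to-fine negative selection)."""
    pos = random.choice(positives)
    negs = random.sample(negatives, M - 1)
    return [pos] + negs  # by convention, index 0 holds the positive

def nll_of_positive(scores):
    """Negative log-likelihood of the positive candidate, where scores[0]
    is the positive's score r; minimizing this maximizes its likelihood."""
    scores = np.asarray(scores, dtype=float)
    m = scores.max()
    log_z = np.log(np.exp(scores - m).sum()) + m  # log of softmax denominator
    return log_z - scores[0]
```

Raising the positive's score relative to the hard negatives drives this loss toward zero, which is exactly the behavior the re-ranker needs at test time.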

Baseline Model M
We use state-of-the-art models consisting of 1) a dense passage retriever (DPR), and 2) a span-extractive BERT reader, as our model M. The retriever uses a passage encoder f_p and a question encoder f_q to represent all passages and questions as dense vectors in the same space. During inference, it retrieves the top-100 passages most similar to question q based on their inner product, and passes them on to the MRC reader. The MRC reader is an extension of model R of Section 3 that performs span extraction. We briefly describe it here; the original work has complete details. Its input is a question q together with positive and negative passages p_j from its retrieval model. (q, p_j) tuples are encoded as before (enc(q, p_j) = q [SEP] p_j), but without spans being marked (as spans are unavailable). A distribution over passages, p_s, is computed as before using scoring function r and context encoder E. In addition, a start-span probability p_st(t_i | q, p_j) and an end-span probability p_e(t_i | q, p_j) are computed for every token t_i in enc(q, p_j). The model is trained to maximize the likelihood of p_s(p_j) × p_st(s | q, p_j) × p_e(t | q, p_j) for each correct answer span (s, t) in p_j, and outputs the top-K scoring passage-span pairs during inference.
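A hedged sketch of how the reader ranks candidate spans from these three probabilities (function names and the max-length cutoff are our own; the cited work defines the actual inference procedure):

```python
def candidate_score(p_passage, p_start, p_end, s, t):
    """Score span (s, t) in passage p_j as
    p_s(p_j) * p_st(s | q, p_j) * p_e(t | q, p_j)."""
    return p_passage * p_start[s] * p_end[t]

def top_k_spans(p_passage, p_start, p_end, k=5, max_len=10):
    """Enumerate spans of up to max_len tokens and return the k
    highest-scoring (start, end, score) triples for this passage."""
    n = len(p_start)
    cands = []
    for s in range(n):
        for t in range(s, min(s + max_len, n)):
            cands.append((s, t, candidate_score(p_passage, p_start, p_end, s, t)))
    return sorted(cands, key=lambda c: -c[2])[:k]
```

The top-K pairs collected across all retrieved passages are then what RECONSIDER re-ranks.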

Experiments
Datasets We use four benchmark open-domain QA datasets: Natural Questions (NQ) contains real user questions asked on Google searches; we consider questions with short answers up to 5 tokens. TRIVIAQA (Joshi et al., 2017) consists of questions collected from trivia and quiz-league websites; we take questions in the unfiltered setting and discard the provided web snippets. WebQuestions (WEBQ) (Berant et al., 2013) is a collection of questions extracted from the Google Suggest API, with answers being Freebase entities. CuratedTREC (Baudiš and Šedivỳ, 2015) is based on questions from TREC QA tracks.

Table 2 demonstrates the effectiveness of a coarse-to-fine approach for selecting negative passages, with dense-retrieval-based negatives (DPR) outperforming BM25, and in turn being improved upon by our re-ranking approach. We obtain gains despite R being not only very similar in architecture to the MRC reader M, but also trained on the same QA pairs, owing to (i) training using harder, false-positive-style negatives, and (ii) answer-span annotations that allow a re-allocation of modeling capacity from modeling all spans to reasoning about specific spans with respect to the question and the passage. Re-ranking performance suffers without these crucial methods. For example, replacing answer-span annotations with answer concatenation reduces accuracy by ∼1% on the dev set of NQ.
We train a large variant of RECONSIDER using BERT-large for model R, trained on predictions from a BERT-large model M. For a fair comparison, we re-evaluate DPR using BERT-large. RECONSIDER-large outperforms it by ∼1% on all datasets (+∼2% on TREC). This model is also comparable in size to RAG (which uses BART-large) but outperforms it on all tasks (+1 on NQ, +5.5 on TRIVIAQA, +3 on TREC), demonstrating that retrieve-and-extract architectures can perform better than answer generation models.
We find K = 5 (at test time) to be best for all datasets, and increasing K has little effect on accuracy, despite training on top-100 predictions. Although this contrasts with our expectations based on Table 1, it is not surprising, since very low-ranked predictions are less likely to be re-ranked highly; this also presents an opportunity for future work.
In Table 3, we present examples from the validation set of NQ of cases where 1) DPR-BERT-base produces an incorrect top answer, which is corrected after re-ranking with RECONSIDER (top 2 examples), and 2) DPR-BERT-base's answer is correct but is ranked lower after re-ranking. Of the 15.4% of validation examples that were amenable to correction by re-ranking the top-5 candidates from DPR-BERT-base, RECONSIDER was able to fix 6.1%. However, in this process, 4.3% of answers that were originally correct (top-ranked) lost their top rank after RECONSIDER, which presents an opportunity for further improving re-ranking.

Question: Where are zebra mussels found in the United States?
DPR-BERT-base: ... on the genetic algorithm for rule-set production (GARP), a group of researchers predicted that the southeastern United States is moderately to highly likely to be inhabited by zebra mussels ...
+RECONSIDER: ... of zebra mussel in the Great Lakes alone exceeds $500 million a year.

Question: Where do you find neurons in the brain?
DPR-BERT-base: ... there is strong evidence for generation of substantial numbers of new neurons in two brain areas, the hippocampus and olfactory bulb. A neuron is a specialized type of cell found in the bodies ...
+RECONSIDER: The brain is the most complex organ in a vertebrate's body. In a human, the cerebral cortex contains approximately 14-16 billion neurons.

Question: Who said if I have seen further it is by standing on the shoulders of giants?
DPR-BERT-base: Standing on the shoulders of giants ... this concept has been traced to the 12th century, attributed to Bernard of Chartres. Its most familiar expression in English is by Isaac Newton in 1675: "If I have seen further it is by standing on the shoulders of giants ...
+RECONSIDER: Standing on the shoulders of giants ... this concept has been traced to the 12th century, attributed to Bernard of Chartres. Its most familiar expression in English is by Isaac Newton in 1675: "If I have seen further it is by standing on the shoulders of giants ...

Table 3: Top passage with answer span (in bold) for example questions from the validation set of NQ, both with and without re-ranking using RECONSIDER. For the first two examples, RECONSIDER re-ranks to obtain the correct answer, while in the last example, re-ranking eliminates the already correct top answer.

Conclusion
We use a synergistic combination of two techniques, viz., re-training with harder negatives and span-focused cross-attention, to make re-ranking successful for span-extractive tasks over large pre-trained models. This method achieves state-of-the-art extractive results on four open-domain QA datasets, also outperforming recent generative pre-training approaches.

A Computing Infrastructure Used
All experiments were run on a machine with 2 chips of Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz with 20 cores (40 threads) each, equipped with 8 NVIDIA TESLA V100 GPUs, each with 32 GB of memory.

B Average Run-time and #Parameters
We report average run-times for training and inference on NQ (TRIVIAQA is similar), as well as the number of model parameters, in Table 4.


C Validation Performance

Table 8 presents validation-set performance for the experiments that we ran for this paper.

D Hyperparameters
For training RECONSIDER, we use the top-100 predictions of the baseline MRC model. This value was chosen based on validation-set accuracy; the other values we experimented with were 50 and 75. For training RECONSIDER, we use 1 positive and M − 1 negatives during each iteration. We tried values of M between 5 and 40 in increments of 5, and chose M = 30 based on validation-set accuracy (see Table 6). Similarly, we re-rank K = 5 candidates during inference; this value was chosen by experimenting with values 2, 3, 4, and values between 5 and 20 in increments of 5 (see Table 5).