Retrieve, Read, Rerank: Towards End-to-End Multi-Document Reading Comprehension

This paper considers the reading comprehension task in which multiple documents are given as input. Prior work has shown that a pipeline of retriever, reader, and reranker can improve the overall performance. However, the pipeline system is inefficient since the input is re-encoded within each module, and is unable to leverage upstream components to help downstream training. In this work, we present RE^3QA, a unified question answering model that combines context retrieving, reading comprehension, and answer reranking to predict the final answer. Unlike previous pipelined approaches, RE^3QA shares contextualized text representation across different components, and is carefully designed to use high-quality upstream outputs (e.g., retrieved context or candidate answers) for directly supervising downstream modules (e.g., the reader or the reranker). As a result, the whole network can be trained end-to-end to avoid the context inconsistency problem. Experiments show that our model outperforms the pipelined baseline and achieves state-of-the-art results on two versions of TriviaQA and two variants of SQuAD.


Introduction
Teaching machines to read and comprehend text is a long-term goal of natural language processing. Despite recent success in leveraging reading comprehension (RC) models to answer questions given a related paragraph (Wang et al., 2017; Hu et al., 2018), extracting answers from documents or even a large corpus of text (e.g., Wikipedia or the whole web) remains an open challenge. This paper considers the multi-document RC task (Joshi et al., 2017), where the system needs to identify the answer to a given question from multiple evidence documents. Unlike single-paragraph settings (Rajpurkar et al., 2016), this task typically involves a retriever for selecting a small amount of relevant document content (Chen et al., 2017), a reader for extracting answers from the retrieved context (Clark and Gardner, 2018), and possibly a reranker for rescoring multiple candidate answers (Bogdanova and Foster, 2016).
Previous approaches such as DS-QA (Lin et al., 2018) and R^3 (Wang et al., 2018a) consist of separate retriever and reader models that are jointly trained. Wang et al. (2018d) further propose to rerank multiple candidates to verify the final answer. Wang et al. (2018b) investigate the full retrieve-read-rerank process by constructing a pipeline system that combines an information retrieval (IR) engine, a neural reader, and two kinds of answer rerankers. Nevertheless, the pipeline system requires re-encoding inputs for each subtask, which is inefficient for large RC tasks. Moreover, since each model is trained independently, high-quality upstream outputs cannot benefit downstream modules. For example, as training proceeds, a neural retriever is able to provide more relevant context than an IR engine (Htut et al., 2018). However, the reader is still trained on the initial context retrieved with IR techniques. As a result, the reader can face a context inconsistency problem once the neural retriever is used. A similar observation has been made by Wang et al. (2018c), who find that integrating both the reader and the reranker into a unified network is more beneficial than a pipeline (see Table 1 for more details).
In this paper, we propose RE^3QA, a neural question answering model that conducts the full retrieve-read-rerank process for multi-document RC tasks. Unlike previous pipelined approaches that contain separate models, we integrate an early-stopped retriever, a distantly-supervised reader, and a span-level answer reranker into a unified network. Specifically, we encode segments of text with pre-trained Transformer blocks (Devlin et al., 2018), where earlier blocks are used to predict retrieving scores and later blocks are fed with a few top-ranked segments to produce multiple candidate answers. Redundant candidates are pruned and the rest are reranked using their span representations extracted from the shared contextualized representation. The final answer is chosen according to three factors: the retrieving, reading, and reranking scores. The whole network is trained end-to-end so that the context inconsistency problem can be alleviated. Besides, we avoid re-encoding input segments by sharing contextualized representations across different components, thus achieving better efficiency.

[Table 1: Comparison of RE^3QA with existing approaches. Our approach performs the full retrieve-read-rerank process with a unified network instead of a pipeline of separate models. *: R^3 and Extract-Select jointly train two models with reinforcement learning.]
We evaluate our approach on four datasets. On the TriviaQA-Wikipedia and TriviaQA-unfiltered datasets (Joshi et al., 2017), we achieve 75.2 F1 and 71.2 F1 respectively, outperforming previous best approaches. On the SQuAD-document and SQuAD-open datasets, both of which are modified versions of SQuAD (Rajpurkar et al., 2016), we obtain 14.8 and 4.1 absolute gains in F1 score over prior state-of-the-art results. Moreover, our approach surpasses the pipelined baseline with faster inference speed on both TriviaQA-Wikipedia and SQuAD-document. Source code is released for future research exploration at https://github.com/huminghao16/RE3QA.

Related Work
Recently, several large datasets have been proposed to facilitate research on document-level reading comprehension (RC) (Clark and Gardner, 2018) and even open-domain question answering (Chen et al., 2017). TriviaQA (Joshi et al., 2017) is a challenging dataset containing over 650K question-answer-document triples, in which the documents are either Wikipedia articles or web pages. Quasar-T (Dhingra et al., 2017) and SearchQA (Dunn et al., 2017), in contrast, pair each question-answer pair with a set of web page snippets that are more analogous to paragraphs. Since this paper considers the multi-document RC task, we choose to work on TriviaQA and two variants of SQuAD (Rajpurkar et al., 2016).
To tackle this task, previous approaches typically first retrieve relevant document content and then extract answers from the retrieved context. One line of work constructs a coarse-to-fine framework that answers the question from a retrieved document summary. Wang et al. (2018a) jointly train a ranker and a reader with reinforcement learning (Sutton and Barto, 2011). Lin et al. (2018) propose a pipeline system consisting of a paragraph selector and a paragraph reader. Other work combines BERT with an IR toolkit for open-domain question answering.
However, Jia and Liang (2017) show that RC models are easily fooled by adversarial examples. By only extracting an answer without verifying it, a model may predict a wrong answer and is unable to recover from such mistakes (Hu et al., 2019). In response, Wang et al. (2018d) present an extract-then-select framework that involves candidate extraction and answer selection. Wang et al. (2018c) introduce a unified network for cross-passage answer verification. Wang et al. (2018b) explore two kinds of answer rerankers in an existing retrieve-read pipeline system. Other works handle this task from different perspectives, such as using hierarchical answer span representations (Pang et al., 2019) and modeling the interaction between the retriever and the reader (Das et al., 2019).
Our model differs from these approaches in several ways: (a) we integrate the retriever, reader, and reranker components into a unified network instead of a pipeline of separate models, (b) we share contextualized representations across different components while pipelined approaches re-encode inputs for each model, and (c) we propose an end-to-end training strategy so that the context inconsistency problem can be alleviated.

[Figure 1: Overview of RE^3QA. The input documents are pruned and split into multiple segments of text, which are then fed into the model. A few top-ranked segments are retrieved and the rest are early-stopped. Multiple candidate answers are proposed for each segment, which are later pruned and reranked. RE^3QA has three outputs per candidate answer: the retrieving, reading, and reranking scores. The network is trained end-to-end with a multi-task objective. "T-Block" refers to a pre-trained Transformer block (Devlin et al., 2018).]
A cascaded approach was recently proposed by Yan et al. (2019), which also combines several components such as the retriever and the reader while sharing several sets of parameters. Our approach differs in that we skip the document retrieval step, motivated by the minimal context phenomenon observed by Min et al. (2018), and we additionally consider answer reranking.

Approach

Figure 1 gives an overview of our multi-document reading comprehension approach. Formally, given a question and a set of documents, we first filter out irrelevant document content to narrow the search space (§3.1). We then split the remaining context into multiple overlapping, fixed-length text segments. Next, we encode these segments along with the question using pre-trained Transformer blocks (Devlin et al., 2018) (§3.2). To maintain efficiency, the model computes a retrieving score based on shallow contextual representations with early summarization, and only returns a few top-ranked segments (§3.3). It then continues encoding these retrieved segments and outputs multiple candidate answers under the distant supervision setting (§3.4). Finally, redundant candidates are pruned and the rest are reranked using their span representations (§3.5). The final answer is chosen according to the retrieving, reading, and reranking scores. Our model is trained end-to-end by back-propagation (§3.6).

Document Pruning
The input to our model is a question q and a set of documents D = {d_1, ..., d_{N_D}}. Since the documents may be retrieved by a search engine (e.g., up to 50 web pages in the unfiltered version of TriviaQA (Joshi et al., 2017)) and Wikipedia articles can contain hundreds of paragraphs, we first discard irrelevant document content at the paragraph level. Following Clark and Gardner (2018), we select the top-K paragraphs that have the smallest TF-IDF cosine distance to the question. These paragraphs are then sorted according to their positions in the documents and concatenated to form a new pruned document d. As a result, a large amount of unrelated text can be filtered out while a high recall is maintained. For example, on TriviaQA-unfiltered, nearly 95% of the context is discarded while the chance that the selected paragraphs contain a correct answer is 84.3%.
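To make the pruning step concrete, the following is a minimal sketch of TF-IDF-based paragraph selection, assuming scikit-learn is available (the paper does not specify its exact TF-IDF configuration, so the vectorizer settings here are illustrative):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_distances

def prune_documents(question, paragraphs, top_k):
    """Keep the top-K paragraphs with the smallest TF-IDF cosine distance
    to the question, then restore their original document order."""
    vectorizer = TfidfVectorizer(strip_accents="unicode", stop_words="english")
    para_vecs = vectorizer.fit_transform(paragraphs)
    question_vec = vectorizer.transform([question])
    dists = cosine_distances(question_vec, para_vecs)[0]
    # Take the K closest paragraphs, re-sorted by position in the documents.
    kept = sorted(dists.argsort()[:top_k])
    return " ".join(paragraphs[i] for i in kept)
```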

Segment Encoding
Typically, existing approaches read the retrieved document either at the paragraph level (Clark and Gardner, 2018) or at the sentence level (Min et al., 2018). Instead, we slide a window of length l with a stride r over the pruned document d and produce a set of text segments C = {c_1, ..., c_n}, where n = (L_d − l)/r + 1 and L_d is the document length. Next, we encode these segments along with the question using pre-trained Transformer blocks (Devlin et al., 2018), a highly parallel encoding scheme, instead of recurrent approaches such as LSTMs.
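For illustration, the sliding-window segmentation can be sketched as follows (a token-level sketch; the helper name is ours):

```python
def split_into_segments(doc_tokens, window_len, stride):
    """Slide a window of `window_len` tokens with stride `stride` over the
    pruned document; consecutive segments overlap by window_len - stride."""
    segments = []
    for start in range(0, max(len(doc_tokens) - window_len, 0) + 1, stride):
        segments.append(doc_tokens[start:start + window_len])
    # A final window anchored at the document end could be appended to
    # guarantee that the tail of the document is covered.
    return segments
```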
The input to the network is a sequence of tokens x = ([CLS]; q; [SEP]; c; [SEP]) of length L_x, where [CLS] is a special classification token and [SEP] is another token for differentiating sentences. We refer to this sequence as a "segment" in the rest of this paper. For each token x_i in x, its input representation is the element-wise addition of word, type, and position embeddings. We thus obtain the input embeddings h^0 ∈ R^{L_x × D_h}, where D_h is the hidden size.
Next, a series of I pre-trained Transformer blocks is used to project the input embeddings into a sequence of contextualized vectors as:

h^i = TransformerBlock(h^{i−1}), ∀i ∈ [1, I]

Here, we omit a detailed introduction of the block architecture and refer readers to Vaswani et al. (2017) for more details.
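In code, the encoding stack is simply an iteration of the pre-trained blocks over the input embeddings (a minimal PyTorch-style sketch, where `blocks` stands for the I pre-trained Transformer blocks loaded from BERT; attention masks are omitted for brevity):

```python
def encode(h0, blocks):
    """Apply the I pre-trained Transformer blocks to the input embeddings h0
    and return every intermediate hidden state (h^1, ..., h^I)."""
    hidden_states = []
    h = h0
    for block in blocks:
        h = block(h)  # h^i = TransformerBlock(h^{i-1})
        hidden_states.append(h)
    return hidden_states
```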

Early-Stopped Retriever
While we find the above parallel encoding scheme very appealing, there is a crucial computational inefficiency if all segments are fully encoded. For example, the average number of segments per instance in TriviaQA-unfiltered is 20 even after pruning, while the total number of Transformer blocks is 12 or 24. Therefore, we propose to rank all segments using early-summarized hidden representations as a mechanism for efficiently retrieving a few top-ranked segments.
Specifically, let h^J denote the hidden states of the J-th block, where J < I. We compute a score score^r ∈ R^2 by summarizing h^J into a fixed-size vector with a weighted self-aligning layer followed by a multi-layer perceptron as:

α = softmax(h^J w_µ), µ = Σ_{i=1}^{L_x} α_i h^J_i, score^r = W_r tanh(w_r ⊙ µ)

where w_µ, w_r, and W_r are parameters to be learned. After obtaining the scores of all segments, we pass the top-N ranked segments per instance to the subsequent blocks and discard the rest. Here, N is relatively small so that the model can focus on reading the most relevant context.
To train the retrieving component, we normalize score^r and define the objective function as:

L_I = − Σ_j y^r_j log(softmax(score^r)_j)   (1)

where y^r is a one-hot label indicating whether the current segment contains at least one exactly-matched ground-truth answer text or not.
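A minimal PyTorch sketch of the early-stopped retriever under the formulation above (parameter shapes and the exact MLP form are assumptions, since only the parameter names are given):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EarlyStoppedRetriever(nn.Module):
    """Scores one segment from the J-th block's hidden states h_J."""
    def __init__(self, hidden_size):
        super().__init__()
        self.w_mu = nn.Linear(hidden_size, 1, bias=False)  # alignment weights
        self.w_r = nn.Linear(hidden_size, hidden_size)
        self.W_r = nn.Linear(hidden_size, 2)  # contains answer vs. not

    def forward(self, h_J):  # h_J: (L_x, D_h)
        alpha = F.softmax(self.w_mu(h_J).squeeze(-1), dim=0)  # (L_x,)
        mu = alpha @ h_J                                      # (D_h,)
        return self.W_r(torch.tanh(self.w_r(mu)))             # (2,)

# Training uses a cross-entropy loss against the segment label:
# loss = F.cross_entropy(score_r.unsqueeze(0), target)  # target: 0/1 index
```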

Distantly-Supervised Reader
Given the retrieved segments, the reading component aims to propose multiple candidate answers per segment. This is achieved by first projecting each position of the final hidden states h^I into two sets of scores:

score^s = h^I w_s ∈ R^{L_x}, score^e = h^I w_e ∈ R^{L_x}

where score^s and score^e are the scores for the start and end positions of answer spans, and w_s, w_e are trainable parameter vectors. Next, let α_i and β_i denote the start and end indices of candidate answer a_i. We compute a reading score s_i = score^s_{α_i} + score^e_{β_i}, and propose the top-M candidates in descending order of these scores, yielding a set of preliminary candidate answers A = {a_1, ..., a_M} along with their scores S = {s_1, ..., s_M}.
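A sketch of this proposal step follows (the maximum span length cap is our own simplification to keep the enumeration cheap):

```python
import torch

def propose_candidates(score_s, score_e, top_m=20, max_span_len=30):
    """Propose the top-M spans (start, end) ranked by
    score_s[start] + score_e[end]."""
    length = score_s.size(0)
    spans, scores = [], []
    for start in range(length):
        for end in range(start, min(start + max_span_len, length)):
            spans.append((start, end))
            scores.append(score_s[start] + score_e[end])
    scores = torch.stack(scores)
    top = scores.topk(min(top_m, len(spans))).indices
    return [spans[i] for i in top.tolist()], scores[top]
```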
Following previous work (Clark and Gardner, 2018), we label all text spans within a segment that match a gold answer as correct, yielding two label vectors y^s ∈ R^{L_x} and y^e ∈ R^{L_x}. Since there is a chance that a segment does not contain any answer string, in that case we label the first element of both y^s and y^e as 1 and set the rest to 0. Finally, we define the objective function as:

L_II = − Σ_j y^s_j log(softmax(score^s)_j) − Σ_j y^e_j log(softmax(score^e)_j)   (2)

Answer Reranker
The answer reranker aims to rerank the candidate answers proposed by the reader. We first introduce a span-level non-maximum suppression algorithm to prune redundant candidate spans, and then predict reranking scores for the remaining candidates using their span representations.
Span-level non-maximum suppression So far, the reader has proposed multiple candidate spans. However, since there is no constraint that forces a unique span to be predicted for an answer string, multiple candidates may refer to the same text. As a result, all spans on the same text other than the first correct one would be false positives. Figure 2 shows a qualitative example of this phenomenon.

[Figure 2: An example from TriviaQA showing that multiple candidate answers refer to the same text.]
Inspired by the non-maximum suppression (NMS) algorithm (Rosenfeld and Thurston, 1971) used to prune redundant bounding boxes in object detection (Ren et al., 2015), we present a span-level NMS (Algorithm 1) to alleviate this problem. Specifically, span-level NMS starts with a set of candidate answers A with scores S. After selecting the answer a_i that possesses the maximum score, we remove it from the set A and add it to B. We also delete any answer a_j in A that overlaps with a_i, where two candidates are defined to overlap if they share at least one boundary position. This process is repeated for the remaining answers in A until A is empty or the size of B reaches a maximum threshold M*.

Algorithm 1 Span-level NMS
Input: candidate answers A with scores S; threshold M*
Output: pruned candidate set B
1: B ← ∅
2: while A ≠ ∅ and |B| < M* do
3:   a_i ← the answer in A with the maximum score
4:   move a_i from A to B
5:   for a_j in A do
6:     if overlap(a_i, a_j) then
7:       remove a_j from A
8: return B
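A direct Python rendering of Algorithm 1 (a sketch; `overlap` implements the shared-boundary definition above):

```python
def span_nms(candidates, scores, max_kept):
    """Greedy span-level NMS: keep the highest-scoring candidate,
    drop every remaining candidate that overlaps with it, and repeat."""
    def overlap(a, b):
        # Two spans overlap if they share at least one boundary position.
        return bool({a[0], a[1]} & {b[0], b[1]})

    remaining = sorted(zip(candidates, scores), key=lambda pair: -pair[1])
    kept = []
    while remaining and len(kept) < max_kept:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [(c, s) for c, s in remaining if not overlap(best[0], c)]
    return kept  # list of (span, score), at most max_kept entries
```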
Candidate answer reranking Given a candidate answer a_i in B, we compute a reranking score based on its span representation, where the representation is a weighted self-aligned vector bounded by the span boundary of the answer, similar to Lee et al. (2017) and He et al. (2018):

γ = softmax(h^I_{α_i:β_i} w_v), v_i = Σ_j γ_j h^I_j, score^a_i = w_aᵀ tanh(W_a v_i)

Here, score^a ∈ R^{M*}, and h^I_{α_i:β_i} is a shorthand for stacking the list of vectors h^I_j (α_i ≤ j ≤ β_i). To train the reranker, we construct two kinds of labels for each candidate a_i. First, we define a hard label y^hard_i as the maximum exact match score between a_i and the ground-truth answers. Second, we also utilize a soft label y^soft_i, computed as the maximum F1 score between a_i and the gold answers, so that partially correct predictions still receive a supervision signal. These labels are annotated for each candidate in B, yielding y^hard ∈ R^{M*} and y^soft ∈ R^{M*}. If there is no correct prediction in B (i.e., all elements of y^hard are 0), we replace the least confident candidate with a gold answer. Finally, we define the following reranking objective:

L_III = − Σ_i y^hard_i log(softmax(score^a)_i) − Σ_i y^soft_i log(softmax(score^a)_i)   (3)
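For concreteness, the two labels can be computed from token-level overlap as below (a sketch; the F1 follows the standard SQuAD-style token-overlap definition, which we assume is what is used here):

```python
from collections import Counter

def hard_label(pred_tokens, gold_answers):
    """1 if the candidate exactly matches some gold answer, else 0."""
    return int(any(pred_tokens == gold for gold in gold_answers))

def soft_label(pred_tokens, gold_answers):
    """Maximum token-level F1 between the candidate and any gold answer."""
    best = 0.0
    for gold in gold_answers:
        common = Counter(pred_tokens) & Counter(gold)
        num_same = sum(common.values())
        if num_same == 0:
            continue
        precision = num_same / len(pred_tokens)
        recall = num_same / len(gold)
        best = max(best, 2 * precision * recall / (precision + recall))
    return best
```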

Training and Inference
Rather than separately training each component, we propose an end-to-end training strategy so that downstream components (e.g., the reader) can benefit from high-quality upstream outputs (e.g., the retrieved segments) during training. Specifically, we take a multi-task learning approach (Caruana, 1993; Ruder, 2017), sharing the parameters of earlier blocks with a joint objective function defined as:

J = L_I + L_II + L_III

Algorithm 2 details the training process. Before each epoch, we compute score^r for all segments in the training set X. Then, we retrieve the top-N segments per instance and construct a new training set X̂, which only contains retrieved segments. For each instance, if all of its top-ranked segments are negative examples, we replace the least confident one with a gold segment. During each epoch, we sample two sets of mini-batches from X and X̂, where the first batch is used to calculate L_I and the other one to compute L_II and L_III. Note that the contextualized vectors h^I are shared across the reader and the reranker to avoid repeated computation. The batch size for X is dynamically decided so that both X and X̂ can be traversed within the same number of steps.

During inference, we take the retrieving, reading, and reranking scores into account. We compare the scores across all segments from the same instance, and choose the final answer according to a weighted addition of these three scores.
Algorithm 2 End-to-end training of RE^3QA
Input: dataset X containing t instances, where the i-th instance X_i contains n segments; model M_Θ with parameters Θ; maximum number of epochs k
1: Initialize Θ from pre-trained parameters
2: for epoch in 1, ..., k do
3:   Compute score^r for all segments x in X
4:   Retrieve the top-N segments per instance
5:   Construct a new X̂ that includes the retrieved segments
6:   for batch_X, batch_X̂ in X, X̂ do
7:     Compute L_I using batch_X by Eq. 1
8:     Compute L_II using batch_X̂ by Eq. 2
9:     Reuse h^I to compute L_III by Eq. 3
10:    Update Θ with respect to J

Experimental Setup
Data preprocessing Following Clark and Gardner (2018), we merge small paragraphs into a single paragraph of up to a threshold length on TriviaQA and SQuAD-open. The threshold is set to 200 by default. We manually tune the number of retrieved paragraphs K for each dataset, and set the number of retrieved segments N to 8. Following Devlin et al. (2018), we set the window length l to 384 − L_q − 3 so that L_x is 384, and set the stride r to 128, where L_q is the question length. We also calculate the answer recall after document pruning, which indicates the performance upper bound.
Model settings We initialize our model using the two publicly available uncased versions of BERT: BERT_BASE and BERT_LARGE, and refer readers to Devlin et al. (2018) for details on model sizes. We use the Adam optimizer with a learning rate of 3e-5 and warmup over the first 10% of steps to fine-tune the network for 2 epochs. The batch size is 32 and a dropout probability of 0.1 is used. The number of blocks J used for the early-stopped retriever is 3 for the base model and 6 for the large model by default. The number of proposed answers M is 20, while the NMS threshold M* is 5. During inference, we tune the weights for the retrieving, reading, and reranking scores, and set them to 1.4, 1, and 1.4, respectively.
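At inference time, this amounts to a simple weighted vote over all candidates of an instance (a sketch using the tuned weights above):

```python
def choose_final_answer(candidates):
    """candidates: (answer_text, retrieve_score, read_score, rerank_score)
    tuples gathered from all retrieved segments of one instance."""
    w_retrieve, w_read, w_rerank = 1.4, 1.0, 1.4  # tuned weights from above
    best = max(
        candidates,
        key=lambda c: w_retrieve * c[1] + w_read * c[2] + w_rerank * c[3],
    )
    return best[0]
```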

Evaluation metrics
We use mean average precision (MAP) and top-N accuracy to evaluate the retrieving component. For evaluating the performance of reading and reranking, we measure the exact match (EM) accuracy and F1 score calculated between the final prediction and the gold answers.
Baselines We construct two pipelined baselines (denoted BERT_PIPE and BERT_PIPE*) to investigate the context inconsistency problem. Both systems contain exactly the same components (i.e., retriever, reader, and reranker) as ours, except that they are trained separately. For BERT_PIPE, the reader is trained on the context retrieved by an IR engine. For BERT_PIPE*, the reading context is obtained using the trained neural retriever.

[Table 6: Comparison between our approach and the pipelined method. "Speed" denotes the number of instances processed per second during inference.]

Main Results
However, the score of 80.3 EM on the verified set implies that there is still room for improvement. We also report the performance on document-level SQuAD in Table 4 to assess our approach in the single-document setting. We find our approach adapts well: the best model achieves 87.2 F1. Note that the BERT_LARGE model has obtained 90.9 F1 on the original SQuAD dataset (single-paragraph setting), which is only 3.7 F1 points ahead of us.
Finally, to validate our approach in open-domain scenarios, we run experiments on the TriviaQA-unfiltered and SQuAD-open datasets, as shown in Table 5. Again, RE^3QA surpasses prior work by a clear margin: our best model achieves 71.2 F1 on TriviaQA-unfiltered, and outperforms a BERT baseline by 4 F1 on SQuAD-open, indicating that our approach is effective for the challenging multi-document RC task.

Model Analysis
In this section, we analyze our approach along several questions: whether end-to-end training outperforms pipelined training, how each component contributes to the overall performance, and how the early-stopped retriever and the answer reranker behave.

Comparison with pipelined method First, we compare our approach with the pipelined baselines on the TriviaQA-Wikipedia and SQuAD-document development sets in Table 6. Our approach outperforms BERT_PIPE by 1.6/1.2 F1 on the two datasets respectively, and is also 2.3/2.1 times faster during inference. Moreover, RE^3QA beats the BERT_PIPE* baseline by 1.1/0.8 F1, even though the parameters of the retriever and reader are trained sequentially in BERT_PIPE*. These results confirm that end-to-end training can indeed mitigate the context inconsistency problem, likely due to multi-task learning and parameter sharing. Our approach also obtains inference speedups because it avoids re-encoding inputs by sharing contextualized representations.
Ablation study To show the effect of each individual component, we plot the F1 curve with respect to different numbers of retrieved segments in Figure 3. We notice that all curves become stable as more text is used, implying that our approach is robust across different amounts of context. Next, to evaluate the reranker, we consider only the retrieving and reading scores: the performance decreases by 2.8/0.8 F1 on the two datasets after the reranker is removed. To ablate the retriever, we select segments based on the TF-IDF distance instead. The results show that the F1 score drops by about 3.3 and 2.5 points on the two datasets after this ablation. Removing both the retriever and the reranker performs the worst, achieving only 68.1/81.0 F1 on the two datasets at peak. These results suggest that combining the retriever, reader, and reranker is crucial for achieving strong performance.
Effect of early-stopped retriever We assess whether the early-stopped retriever is sufficient for the segment retrieving task. Table 7 details the retrieving and reading results with different numbers of blocks J being used. As we can see, the model performs worst but maintains a high speed when J is 1. As J becomes larger, retrieving metrics such as MAP, Top-3, and Top-5 increase significantly on both datasets. On the other hand, the speed continues to decline since more computation is performed during retrieving. A J of 6 eventually leads to an out-of-memory issue on both datasets. As for the F1 score, the model achieves the best result when J reaches 3, and starts to degrade as J continues rising. We experiment with the RE^3QA_LARGE model and observe similar results, where the best J is 6. A likely reason for this observation is that sharing high-level features with the retriever could disturb the reading prediction. Therefore, the above results demonstrate that an early-stopped retriever with a relatively small J reaches a good trade-off between efficiency and effectiveness.
Effect of answer reranker Finally, we run our model under different reranking ablations and report the results in Table 8. As we can see, removing the non-maximum suppression (NMS) algorithm has a negative impact on performance, suggesting it is necessary to prune highly-overlapped candidate answers before reranking. Ablating the hard label leads to a drop of 0.81 and 0.64 F1 points on the two datasets respectively, while the F1 drops by 0.39 and 0.76 points after removing the soft label. This implies that the hard label has a larger impact than the soft label on the TriviaQA dataset, and vice versa on SQuAD.

Conclusion
We present RE^3QA, a unified network that answers questions from multiple documents by conducting the full retrieve-read-rerank process. We design a dedicated component for each subtask and show that an end-to-end training strategy brings additional benefits. RE^3QA outperforms the pipelined baseline with faster inference speed and achieves state-of-the-art results on four challenging reading comprehension datasets. Future work will concentrate on designing a fast neural pruner to replace the IR-based pruning component, developing better end-to-end training strategies, and adapting our approach to other datasets such as Natural Questions (Kwiatkowski et al., 2019).