Revealing the Importance of Semantic Retrieval for Machine Reading at Scale

Machine Reading at Scale (MRS) is a challenging task in which a system is given an input query and is asked to produce a precise output by “reading” information from a large knowledge base. The task has gained popularity with its natural combination of information retrieval (IR) and machine comprehension (MC). Advancements in representation learning have led to separate progress in both IR and MC; however, very few studies have examined the relationship and combined design of retrieval and comprehension at different levels of granularity for the development of MRS systems. In this work, we give general guidelines on system design for MRS by proposing a simple yet effective pipeline system with special consideration on hierarchical semantic retrieval at both paragraph and sentence level, and their potential effects on the downstream task. The system is evaluated on both fact verification and open-domain multihop QA, achieving state-of-the-art results on the leaderboard test sets of both FEVER and HOTPOTQA. To further demonstrate the importance of semantic retrieval, we present ablation and analysis studies to quantify the contribution of neural retrieval modules at both paragraph level and sentence level, and illustrate that intermediate semantic retrieval modules are vital not only for effectively filtering upstream information and thus saving downstream computation, but also for shaping upstream data distribution and providing better data for downstream modeling.


Introduction
Extracting external textual knowledge for machine comprehension systems has long been an important yet challenging problem. Success requires not only precise retrieval of the relevant information sparsely stored in a large knowledge source but also a deep understanding of both the selected knowledge and the input query to give the corresponding output. Initiated by Chen et al. (2017), the task was termed Machine Reading at Scale (MRS), seeking to provide a challenging situation where machines are required to do both semantic retrieval and comprehension at different levels of granularity for the final downstream task.
Progress on MRS has been made by improving individual IR or comprehension sub-modules with recent advancements in representation learning (Peters et al., 2018; Radford et al., 2018; Devlin et al., 2018). However, partially due to the lack of annotated data for intermediate retrieval in an MRS setting, the evaluations were done mainly on the final downstream task, with much less consideration of the intermediate retrieval performance. This led to the convention that upstream retrieval modules mostly focus on getting better coverage of the downstream information such that the upper bound of the downstream score can be improved, rather than finding more exact information. This convention is misaligned with the nature of MRS, where equal effort should be put into emphasizing the models' joint performance and optimizing the relationship between the semantic retrieval and the downstream comprehension subtasks.
Hence, to shed light on the importance of semantic retrieval for downstream comprehension tasks, we start by establishing a simple yet effective hierarchical pipeline system for MRS using Wikipedia as the external knowledge source. The system is composed of a term-based retrieval module, two neural modules for paragraph-level retrieval and sentence-level retrieval, and a neural downstream task module. We evaluated the system on two recent large-scale open-domain benchmarks for fact verification and multihop QA, namely FEVER and HOTPOTQA (Yang et al., 2018), in which retrieval performance can also be evaluated accurately since intermediate annotations on evidence are provided. Our system achieves state-of-the-art results with 45.32% answer EM and 25.14% joint EM on HOTPOTQA (8% absolute improvement on answer EM and doubling the joint EM over the previous best results) and with a 67.26% FEVER score (3% absolute improvement over previously published systems).
We then provide empirical studies to validate design decisions. Specifically, we demonstrate the necessity of both paragraph-level retrieval and sentence-level retrieval for maintaining good performance, and further illustrate that a better semantic retrieval module is not only beneficial for achieving high recall and keeping a high upper bound for the downstream task, but also plays an important role in shaping the downstream data distribution and providing more relevant and high-quality data for downstream sub-module training and inference. These mechanisms are vital for a good MRS system on both QA and fact verification.

Related Work
Machine Reading at Scale: First proposed and formalized in Chen et al. (2017), MRS has gained popularity with an increasing amount of work on both dataset collection (Joshi et al., 2017) and MRS model development (Wang et al., 2018; Clark and Gardner, 2017; Htut et al., 2018). In some previous work, paragraph-level retrieval modules were mainly for improving the recall of required information, while in some other works (Yang et al., 2018), sentence-level retrieval modules were merely for solving the auxiliary sentence selection task. In our work, we focus on revealing the relationship between semantic retrieval at different granularity levels and the downstream comprehension task. To the best of our knowledge, we are the first to apply and optimize neural semantic retrieval at both paragraph and sentence levels for MRS.
Automatic Fact Checking: Recent work formalized the task of automatic fact checking from the viewpoint of machine learning and NLP. The release of FEVER stimulated many recent developments (Nie et al., 2019; Yoneda et al., 2018; Hanselowski et al., 2018) on data-driven neural networks for automatic fact checking. We also consider the task as MRS because the two share almost the same setup, except that the downstream task is verification or natural language inference (NLI) rather than QA.
Information Retrieval: Success in deep neural networks inspires their application to information retrieval (IR) tasks (Huang et al., 2013; Guo et al., 2016; Mitra et al., 2017; Dehghani et al., 2017). In typical IR settings, systems are required to retrieve and rank (Nguyen et al., 2016) elements from a collection of documents based on their relevance to the query. This setting might be very different from the retrieval in MRS, where systems are asked to select facts needed to answer a question or verify a statement. We refer to the retrieval in MRS as Semantic Retrieval since it emphasizes semantic understanding.

Method
In previous works, an MRS system can be complicated with different sub-components processing different retrieval and comprehension sub-tasks at different levels of granularity, and with some subcomponents intertwined. For interpretability considerations, we used a unified pipeline setup. The overview of the system is in Fig. 1.
To be specific, we formulate the MRS system as a function that maps an input tuple $(q, K)$ to an output tuple $(\hat{y}, S)$, where $q$ indicates the input query, $K$ is the textual KB, $\hat{y}$ is the output prediction, and $S$ is the set of selected supporting sentences from Wikipedia. Let $E$ denote a set of necessary evidence or facts selected from $K$ for the prediction. For a QA task, $q$ is the input question and $\hat{y}$ is the predicted answer. For a verification task, $q$ is the input claim and $\hat{y}$ is the predicted truthfulness of the input claim. For all tasks, $K$ is Wikipedia.
The system procedure is listed below:
(1) Term-Based Retrieval: To begin with, we used a combination of the TF-IDF method and a rule-based keyword matching method to narrow the scope from the whole of Wikipedia down to a set of related paragraphs; this is a standard procedure in MRS (Chen et al., 2017; Nie et al., 2019). The focus of this step is to efficiently select a candidate set $P_I$ that covers as much of the needed information as possible ($P_I \subset K$) while keeping the size of the set small enough for downstream processing.
(2) Paragraph-Level Neural Retrieval: After obtaining the initial set, we compare each paragraph in $P_I$ with the input query $q$ using a neural model (which will be explained later in Sec 3.1). The outputs of the neural model are treated as the relatedness scores between the input query and the paragraphs. The scores are used to sort all the upstream paragraphs. Then, $P_I$ is narrowed to a new set $P_N$ ($P_N \subset P_I$) by selecting the top $k_p$ paragraphs having a relatedness score higher than some threshold value $h_p$ (going out from the P-Level grey box in Fig. 1). $k_p$ and $h_p$ are chosen to keep a good balance between the recall and precision of the paragraph retrieval.
(3) Sentence-Level Neural Retrieval: Next, we select the evidence at the sentence level by decomposing all the paragraphs in $P_N$ into sentences. Similarly, each sentence is compared with the query using a neural model (see details in Sec 3.1), and we obtain a set of sentences $S \subset P_N$ for the downstream task by choosing the top $k_s$ sentences with output scores higher than some threshold $h_s$ (S-Level grey box in Fig. 1). During evaluation, $S$ is often evaluated against some ground truth sentence set denoted as $E$.
(4) Downstream Modeling: At the final step, we simply applied task-specific neural models (e.g., QA and NLI) on the concatenation of all the sentences in $S$ and the query, obtaining the final output $\hat{y}$.
In some experiments, we modified the setup for certain analysis or ablation purposes which will be explained individually in Sec 6.
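The four steps above can be summarized in a minimal sketch. It assumes the term-based candidate set has already been computed, and the names `select_top_k`, `split_sentences`, `run_pipeline`, `score_fn`, and `downstream_fn` are hypothetical placeholders introduced only for illustration (for simplicity one scoring function stands in for both neural retrieval models); this is not the released implementation.

```python
from typing import Callable, List, Tuple


def select_top_k(items: List[str], scores: List[float], k: int, threshold: float) -> List[str]:
    """Keep at most k items whose score is above the threshold, best first."""
    ranked = sorted(zip(items, scores), key=lambda pair: pair[1], reverse=True)
    return [item for item, score in ranked[:k] if score > threshold]


def split_sentences(paragraph: str) -> List[str]:
    # Naive period-based splitter, used only to keep the sketch self-contained.
    return [s.strip() for s in paragraph.split(".") if s.strip()]


def run_pipeline(
    query: str,
    candidate_paragraphs: List[str],           # P_I from term-based retrieval
    score_fn: Callable[[str, str], float],     # stand-in for the neural retrieval models
    downstream_fn: Callable[[str, str], str],  # stand-in for the QA or NLI model
    k_p: int, h_p: float, k_s: int, h_s: float,
) -> Tuple[str, List[str]]:
    # (2) Paragraph-level retrieval: P_I -> P_N
    p_scores = [score_fn(query, p) for p in candidate_paragraphs]
    p_n = select_top_k(candidate_paragraphs, p_scores, k_p, h_p)
    # (3) Sentence-level retrieval: P_N -> S
    sentences = [s for p in p_n for s in split_sentences(p)]
    s_scores = [score_fn(query, s) for s in sentences]
    selected = select_top_k(sentences, s_scores, k_s, h_s)
    # (4) Downstream model on the query and the concatenated sentences
    prediction = downstream_fn(query, " ".join(selected))
    return prediction, selected
```

Because both retrieval stages share the same top-k-plus-threshold filter, the only stage-specific choices in this sketch are the values of `k_p`, `h_p`, `k_s`, and `h_s`.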

Modeling and Training
Throughout all our experiments, we used BERT-Base (Devlin et al., 2018) to provide state-of-the-art contextualized modeling of the input text.
Semantic Retrieval: We treated the neural semantic retrieval at both the paragraph and sentence level as binary classification problems, with the models' parameters updated by minimizing binary cross entropy loss. To be specific, we fed the query and context into BERT as:
[CLS] Query [SEP] Context [SEP]
We applied an affine layer and a sigmoid activation to the last-layer output of the [CLS] token, yielding a scalar value. The parameters were updated with the objective function:
$$J_{retri} = -\sum_{i \in T^{p/s}_{pos}} \log(\hat{p}_i) - \sum_{i \in T^{p/s}_{neg}} \log(1 - \hat{p}_i),$$
where $\hat{p}_i$ is the output of the model, $T^{p/s}_{pos}$ is the positive set and $T^{p/s}_{neg}$ is the negative set. As shown in Fig. 1, at the sentence level, ground-truth sentences served as positive examples while other sentences from the upstream retrieved set served as negative examples. Similarly, at the paragraph level, paragraphs containing any ground-truth sentence were used as positive examples and other paragraphs from the upstream term-based retrieval processes were used as negative examples.
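As a rough illustration of this training setup, the sketch below labels paragraphs as positive when they contain any ground-truth sentence and computes a summed binary cross entropy over model outputs. The helper names `label_paragraphs` and `retrieval_bce`, the `(title, index)` sentence identifiers, and the clamping constant are assumptions made for this example, not details taken from the paper.

```python
import math
from typing import Dict, Hashable, List, Set


def label_paragraphs(
    candidates: Dict[str, List[Hashable]],  # paragraph id -> ids of its sentences
    gold_sentence_ids: Set[Hashable],
) -> Dict[str, int]:
    """A paragraph is positive (1) if it contains any ground-truth sentence."""
    return {
        pid: int(any(sid in gold_sentence_ids for sid in sentence_ids))
        for pid, sentence_ids in candidates.items()
    }


def retrieval_bce(probs: List[float], labels: List[int], eps: float = 1e-12) -> float:
    """Summed binary cross entropy over positive and negative examples."""
    loss = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0); an implementation assumption
        loss -= math.log(p) if y == 1 else math.log(1.0 - p)
    return loss
```

The same loss applies at the sentence level; only the construction of the positive and negative sets differs.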
QA: We followed Devlin et al. (2018) for QA span prediction modeling. To correctly handle yes-or-no questions in HOTPOTQA, we fed two additional "yes" and "no" tokens between [CLS] and the Query as:
[CLS] yes no Query [SEP] Context [SEP]
where the supervision was given to the second or the third token when the answer is "yes" or "no", such that they can compete with all other predicted spans. The parameters of the neural QA model were trained to maximize the log probabilities of the true start and end indexes as:
$$J_{qa} = -\sum_i \left[ \log(\hat{y}^s_i) + \log(\hat{y}^e_i) \right],$$
where $\hat{y}^s_i$ and $\hat{y}^e_i$ are the predicted probabilities on the ground-truth start and end positions for the $i$-th example, respectively. It is worth noting that we used ground truth supporting sentences plus some other sentences sampled from the upstream retrieved set as the context for training the QA module, such that it will adapt to the upstream data distribution during inference.
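The yes/no handling can be illustrated with a small sketch at the token level. It assumes pre-tokenized input and assumes that, for yes/no answers, both the start and end targets point at the corresponding extra token; `build_qa_input`, `gold_span`, and `span_nll` are hypothetical names introduced here.

```python
import math
from typing import List, Optional, Tuple


def build_qa_input(query_tokens: List[str], context_tokens: List[str]) -> List[str]:
    """Place the extra "yes" and "no" tokens between [CLS] and the query."""
    return ["[CLS]", "yes", "no"] + query_tokens + ["[SEP]"] + context_tokens + ["[SEP]"]


def gold_span(answer: str, query_len: int,
              context_span: Optional[Tuple[int, int]] = None) -> Tuple[int, int]:
    """Return (start, end) token indexes in the sequence built by build_qa_input."""
    if answer == "yes":
        return (1, 1)  # second token
    if answer == "no":
        return (2, 2)  # third token
    if context_span is None:
        raise ValueError("a span answer needs its position inside the context")
    offset = 3 + query_len + 1  # [CLS], yes, no, the query tokens, [SEP]
    return (offset + context_span[0], offset + context_span[1])


def span_nll(start_probs: List[float], end_probs: List[float], start: int, end: int) -> float:
    """Negative log-likelihood of the true start and end positions for one example."""
    return -(math.log(start_probs[start]) + math.log(end_probs[end]))
```

With this layout, a yes/no answer is scored by the same span head as any extractive answer, which is what lets the two extra tokens compete with ordinary spans.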
Fact Verification: Following , we formulate downstream fact verification as a 3-way natural language inference (NLI) classification problem (MacCartney and Manning, 2009; Bowman et al., 2015) and train the model with 3-way cross entropy loss. The input format is the same as that of semantic retrieval and the objective is $J_{ver} = -\sum_i y_i \cdot \log(\hat{y}_i)$, where $\hat{y}_i \in \mathbb{R}^3$ denotes the model's output for the three verification labels, and $y_i$ is a one-hot embedding for the ground-truth label. For verifiable queries, we used ground truth evidential sentences plus some other sentences sampled from the upstream retrieved set as new evidential context for NLI. For non-verifiable queries, we only used sentences sampled from the upstream retrieved set as context because those queries are not associated with ground truth evidential sentences. This detail is important for the model to identify non-verifiable queries and will be explained more in Sec 6. Additional training details and hyper-parameter selections are in the Appendix (Sec. A; Table 6).
It is worth noting that each sub-module in the system relies on its preceding sub-module to provide data both for training and inference. This means that there will be upstream data distribution misalignment if we train a sub-module in isolation without considering the properties of its preceding upstream module. The problem is similar to the concept of internal covariate shift (Ioffe and Szegedy, 2015), where the distribution of each layer's inputs changes inside a neural network. Therefore, it makes sense to study this issue in a joint MRS setting rather than a typical supervised learning setting where training and test data tend to be fixed and modules are isolated. We release our code and the organized data both for reproducibility and to provide an off-the-shelf testbed to facilitate future research on MRS.
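A minimal sketch of how such NLI training contexts might be assembled, together with the per-example 3-way cross entropy, is given below. The function names, the `n_sampled` parameter, and the uniform sampling strategy are assumptions for illustration; the text does not specify how many sentences are sampled or how.

```python
import math
import random
from typing import List, Sequence

LABELS = ("SUPPORT", "REFUTE", "NOT ENOUGH INFO")


def build_nli_context(gold_sentences: Sequence[str], retrieved_sentences: Sequence[str],
                      n_sampled: int, rng: random.Random) -> List[str]:
    """Ground-truth evidence (empty for non-verifiable claims) plus sampled retrieved sentences."""
    pool = [s for s in retrieved_sentences if s not in gold_sentences]
    sampled = rng.sample(pool, min(n_sampled, len(pool)))
    return list(gold_sentences) + sampled


def cross_entropy(probs: Sequence[float], gold_label: str) -> float:
    """3-way cross entropy for one example with a one-hot ground-truth label."""
    return -math.log(probs[LABELS.index(gold_label)])
```

For a NOT ENOUGH INFO claim, `gold_sentences` is empty, so the context consists only of sentences drawn from the upstream retrieved set, mirroring what the model will see at inference time.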

Experimental Setup
MRS requires a system not only to retrieve relevant content from textual KBs but also to possess enough understanding ability to solve the downstream task. To understand the impact or importance of semantic retrieval on the downstream comprehension, we established a unified experimental setup that involves two different downstream tasks, i.e., multi-hop QA and fact verification.

Tasks and Datasets
HOTPOTQA: This dataset is a recent large-scale QA dataset that brings in new features: (1) the questions require finding and reasoning over multiple documents; (2) the questions are diverse and not limited to pre-existing KBs; (3) it offers a new comparison question type (Yang et al., 2018). We evaluated our system on HOTPOTQA in the fullwiki setting, where a system must find the answer to a question in the scope of the entire Wikipedia, an ideal MRS setup. The sizes of the train, dev and test splits are 90,564, 7,405, and 7,405. More importantly, HOTPOTQA also provides human-annotated sentence-level supporting facts that are needed to answer each question. Those intermediate annotations enable evaluation of models' joint ability on both fact retrieval and answer span prediction, facilitating our direct analysis of the explainable predictions and their relation to the upstream retrieval.
FEVER: The Fact Extraction and VERification dataset is a recent dataset collected to facilitate automatic fact checking. The work also proposes a benchmark task in which, given an arbitrary input claim, candidate systems are asked to select evidential sentences from Wikipedia and label the claim as either SUPPORT, REFUTE, or NOT ENOUGH INFO, if the claim can be verified to be true, false, or non-verifiable, respectively, based on the evidence. The sizes of the train, dev and test splits are 145,449, 19,998, and 9,998. Similar to HOTPOTQA, the dataset provides annotated sentence-level facts needed for the verification. These intermediate annotations provide an accurate evaluation of the results of semantic retrieval and are thus well suited to the analysis of the effects of the retrieval module on downstream verification.
As in Chen et al. (2017), we use Wikipedia as our unique knowledge base because it is a comprehensive and self-evolving information source often used to facilitate intelligent systems. Moreover, as Wikipedia is the source for both HOTPOTQA and FEVER, it helps standardize any further analysis of the effects of semantic retrieval on the two different downstream tasks.

Metrics
Following ; Yang et al. (2018), we used annotated sentence-level facts to calculate the F1, Precision and Recall scores for evaluating sentence-level retrieval. Similarly, we labeled all the paragraphs that contain any ground truth fact as ground truth paragraphs and used the same three metrics for paragraph-level retrieval evaluation. For HOTPOTQA, following Yang et al. (2018), we used exact match (EM) and F1 metrics for QA span prediction evaluation, and used the joint EM and F1 to evaluate models' joint performance on both retrieval and QA. The joint EM and F1 are calculated as:
$$P_j = P_a \cdot P_s; \quad R_j = R_a \cdot R_s; \quad F_j = \frac{2 P_j \cdot R_j}{P_j + R_j}; \quad EM_j = EM_a \cdot EM_s,$$
where $P$, $R$, and $EM$ denote precision, recall and EM; the subscripts $a$ and $s$ indicate that the scores are for answer span and supporting facts, respectively.
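The joint metrics follow directly from these formulas. The sketch below computes them for a single example; the function name `joint_scores`, the returned dictionary keys, and the convention of returning 0 for F1 when both joint precision and recall are 0 are assumptions made for illustration.

```python
from typing import Dict


def joint_scores(p_a: float, r_a: float, em_a: float,
                 p_s: float, r_s: float, em_s: float) -> Dict[str, float]:
    """Combine answer (a) and supporting-fact (s) scores into joint scores."""
    p_j = p_a * p_s
    r_j = r_a * r_s
    # Guard against division by zero when nothing is correct on either side.
    f_j = 2 * p_j * r_j / (p_j + r_j) if (p_j + r_j) > 0 else 0.0
    em_j = em_a * em_s
    return {"precision": p_j, "recall": r_j, "f1": f_j, "em": em_j}
```

Because the joint EM is a product, an example counts as a joint exact match only when both the answer and the full set of supporting facts are exactly right.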
For the FEVER task, following , we used the Label Accuracy for evaluating downstream verification and the FEVER Score for joint performance. The FEVER Score awards one point for each example with the correct predicted label only if all ground truth facts are contained in the predicted facts set with at most 5 elements. We also used Oracle Score for the two retrieval modules. The scores were proposed in Nie et al. (2019).

Table 2: Performance of systems on FEVER. "F1" indicates the sentence-level evidence F1 score. "LA" indicates Label Acc. without considering the evidence prediction. "FS"=FEVER Score.
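The FEVER Score rule described above can be sketched as follows. This is a simplified reading that assumes a single ground-truth fact set per example and truncates predictions to the first five facts; `fever_point`, `fever_score`, and the dictionary field names are hypothetical, and the official scorer may differ in details.

```python
from typing import Dict, Hashable, List, Sequence


def fever_point(pred_label: str, gold_label: str,
                pred_facts: Sequence[Hashable], gold_facts: Sequence[Hashable],
                max_facts: int = 5) -> int:
    """1 if the label is correct and all gold facts are among the first max_facts predictions."""
    kept = set(pred_facts[:max_facts])
    return int(pred_label == gold_label and set(gold_facts) <= kept)


def fever_score(examples: List[Dict]) -> float:
    """Average FEVER point over a list of examples (assumed field names)."""
    points = [
        fever_point(ex["pred_label"], ex["gold_label"], ex["pred_facts"], ex["gold_facts"])
        for ex in examples
    ]
    return sum(points) / len(points) if points else 0.0
```

Under this reading, an example with no ground-truth facts (a non-verifiable claim) is scored on the label alone, since the empty set is contained in any prediction.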

Results on Benchmarks
We chose the best system based on the dev set, and used that for submitting private test predictions on both FEVER and HOTPOTQA.
As can be seen in Table 1, with the proposed hierarchical system design, the whole pipeline system achieves a new state of the art on HOTPOTQA with large-margin improvements on all the metrics. More specifically, the biggest improvement comes from the EM for the supporting facts, which in turn leads to a doubling of the joint EM over the previous best results. The scores for answer predictions are also higher than all previous best results, with a ∼8 absolute point increase on EM and ∼9 absolute points on F1. All the improvements are consistent between test and dev set evaluation.
Similarly, for FEVER, we show F1 for evidence, the Label Accuracy, and the FEVER Score (same as benchmark evaluation) for models in Table 2. Our system obtained substantially higher scores than all previously published results, with ∼4 and ∼3 points absolute improvement on Label Accuracy and FEVER Score, respectively. In particular, the system obtains 74.62 on the evidence F1, 22 points higher than that of the second system, demonstrating its ability on semantic retrieval.
Previous systems (Ming Ding, 2019; Yang et al., 2018) on HOTPOTQA treat supporting fact retrieval (sentence-level retrieval) just as an auxiliary task for providing extra model explainability. In Nie et al. (2019), although they used a similar three-stage system for FEVER, they only applied one neural retrieval module at the sentence level, which potentially weakens its retrieval ability. Both of these previous best systems are different from our fully hierarchical pipeline approach. These observations lead to the assumption that the performance gain comes mainly from the hierarchical retrieval and its positive effects on the downstream modules. Therefore, to validate the system design decisions in Sec 3 and reveal the importance of semantic retrieval for the downstream task, we conducted a series of ablation and analysis experiments on all the modules. We started by examining the necessity of both paragraph and sentence retrieval and give insights on why both of them matter.

Analysis and Ablations
Intuitively, both the paragraph-level and sentence-level retrieval sub-modules help speed up the downstream processing. More importantly, since downstream modules were trained on data sampled from upstream modules, both neural retrieval sub-modules also play an implicit but important role in controlling the immediate retrieval distribution, i.e., the distribution of set $P_N$ and set $S$ (as shown in Fig. 1), and in providing better inference data and training data for downstream modules.

Ablation Studies
Setups: To reveal the importance of neural retrieval modules at both paragraph and sentence level for maintaining the performance of the overall system, we removed either of them and examined the consequences. Because the removal of a module in the pipeline might change the distribution of the input of the downstream modules, we re-trained all the downstream modules accordingly. To be specific, in the system without the paragraph-level neural retrieval module, we re-trained the sentence-level retrieval module with negative sentences directly sampled from the term-based retrieval set and then also re-trained the downstream QA or verification module. In the system without the sentence-level neural retrieval module, we re-trained the downstream QA or verification module by sampling data from both the ground truth set and the retrieved set directly from the paragraph-level module. We tested the simplified systems on both FEVER and HOTPOTQA.

Results: Tables 3 and 4 show the ablation results for the two neural retrieval modules at both paragraph and sentence level on HOTPOTQA and FEVER. To begin with, we can see that removing the paragraph-level retrieval module significantly reduces the precision for sentence-level retrieval and the corresponding F1 on both tasks. More importantly, this loss of retrieval precision also led to substantial decreases for all the downstream scores on both the QA and the verification task, in spite of their higher upper-bound and recall scores. This indicates that the negative effects on the downstream module induced by the omission of paragraph-level retrieval cannot be amended by the sentence-level retrieval module, and that focusing semantic retrieval merely on improving the recall or the upper bound of the final score risks jeopardizing the performance of the overall system.
Next, the removal of the sentence-level retrieval module induces a ∼2 point drop on EM and F1 score in the QA task, and a ∼15 point drop on FEVER Score in the verification task. This suggests that rather than just enhancing explainability for QA, the sentence-level retrieval module can also help pinpoint relevant information and reduce the noise in the evidence that might otherwise distract the downstream comprehension module. Another interesting finding is that without the sentence-level retrieval module, the QA module suffered much less than the verification module; conversely, the removal of the paragraph-level neural retrieval module induces an 11 point drop on answer EM compared to a ∼9 point drop on Label Accuracy in the verification task. This seems to indicate that the downstream QA module relies more on the upstream paragraph-level retrieval whereas the verification module relies more on the upstream sentence-level retrieval. Finally, we also evaluate the F1 score on FEVER for each classification label and observe a significant drop of F1 on the NOT ENOUGH INFO category without the retrieval module, meaning that semantic retrieval is vital for the downstream verification module's discriminative ability on the NOT ENOUGH INFO label.

Table 4: Ablation over the paragraph-level and sentence-level neural retrieval sub-modules on FEVER. "LA"=Label Accuracy; "FS"=FEVER Score; "Orcl." is the oracle upper bound of FEVER Score assuming all downstream modules are perfect. "L-F1 (S/R/N)" means the classification F1 scores on the three verification labels: SUPPORT, REFUTE, and NOT ENOUGH INFO.

Sub-Module Change Analysis
To further study the effects of upstream semantic retrieval towards downstream tasks, we change training or inference data between intermediate layers and then examine how this modification will affect the downstream performance.

Effects of Paragraph-level Retrieval
We fixed $h_p = 0$ (the value achieving the best performance), re-trained all the downstream parameters, and tracked their performance as $k_p$ (the number of selected paragraphs) was changed from 1 to 12. Increasing $k_p$ means potentially higher coverage of the answer but more noise in the retrieved facts. Fig. 2 shows the results.
As can be seen, the EM scores for supporting fact retrieval, answer prediction, and joint performance increase sharply when $k_p$ is changed from 1 to 2. This is consistent with the fact that at least two paragraphs are required to answer each question in HOTPOTQA. Then, after the peak, every score decreases as $k_p$ becomes larger, except the recall of supporting facts, which peaks when $k_p = 4$. This indicates that even though the neural sentence-level retrieval module possesses a certain level of ability to select correct facts from noisier upstream information, the final QA module is more sensitive to upstream data and fails to maintain the overall system performance. Moreover, the reduction in answer EM and joint EM suggests that it might be risky to give too much information to downstream modules at paragraph granularity.

Effects of Sentence-level Retrieval
Similarly, to study the effects of the neural sentence-level retrieval module on the downstream QA and verification modules, we fixed $k_s$ to be 5 and set $h_s$ ranging from 0.1 to 0.9 with a 0.1 interval. Then, we re-trained the downstream QA and verification modules with different $h_s$ values and experimented on both HOTPOTQA and FEVER.
Question Answering: Fig. 3 shows the trend of performance. Intuitively, the precision increases while the recall decreases as the system becomes more strict about the retrieved sentences. The EM scores for supporting fact retrieval and joint performance reach their highest values when $h_s = 0.5$, a natural balancing point between precision and recall. More interestingly, the EM score for answer prediction peaks when $h_s = 0.2$, where the recall is higher than the precision. This misalignment between answer prediction performance and retrieval performance indicates that, unlike the observation at the paragraph level, the downstream QA module is able to tolerate a certain amount of noise at the sentence level and benefit from a higher recall.
Fact Verification: Fig. 4 shows the trends for Label Accuracy, FEVER Score, and Evidence F1 when modifying the upstream sentence-level threshold $h_s$. We observed that the general trend is similar to that of the QA task, where both the label accuracy and FEVER score peak at $h_s = 0.2$ whereas the retrieval F1 peaks at $h_s = 0.5$. Note that, although the downstream verification could take advantage of a higher recall, the module is more sensitive to sentence-level retrieval compared to the QA module in HOTPOTQA. More detailed results are in the Appendix.
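The precision/recall trade-off behind this sweep can be reproduced on toy data with a short sketch. It only covers the retrieval side (the downstream modules are not re-trained here), and the names `precision_recall_f1` and `sweep_thresholds`, as well as the convention of reporting 0 precision for an empty selection, are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple


def precision_recall_f1(predicted: Iterable[str], gold: Set[str]) -> Tuple[float, float, float]:
    predicted = set(predicted)
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) > 0 else 0.0
    return precision, recall, f1


def sweep_thresholds(scores: Dict[str, float], gold: Set[str], k_s: int = 5,
                     thresholds: Optional[List[float]] = None
                     ) -> Dict[float, Tuple[float, float, float]]:
    """Evaluate sentence retrieval for each threshold h_s with k_s held fixed."""
    if thresholds is None:
        thresholds = [round(0.1 * i, 1) for i in range(1, 10)]  # 0.1, 0.2, ..., 0.9
    top_k = sorted(scores, key=scores.get, reverse=True)[:k_s]
    results = {}
    for h_s in thresholds:
        selected = [s for s in top_k if scores[s] > h_s]
        results[h_s] = precision_recall_f1(selected, gold)
    return results
```

On any fixed set of scores, raising the threshold can only shrink the selected set, so recall is non-increasing in the threshold, which matches the trend described above.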

Answer Breakdown
We further sample 200 examples from HOTPOTQA and manually tag them according to several common answer types (Yang et al., 2018). The proportion of different answer types is shown in Figure 5. The performance of the system on each answer type is shown in Table 5. The most frequent answer type is 'Person' (24%) and the least frequent answer type is 'Event' (2%). It is also interesting to note that the model performs the best on Yes/No questions as shown in Table 5, reaching an accuracy of 70.6%.

Examples
Fig. 6 shows an example that is correctly handled by the full pipeline system but not by the system without the paragraph-level retrieval module. We can see that it is very difficult to filter the distracting sentence once it reaches the sentence level, either by the sentence retrieval module or by the QA module.

Figure 6: An example with a distracting fact. P-Score and S-Score are the retrieval scores at paragraph and sentence level respectively. The full pipeline was able to filter the distracting fact and give the correct answer. The wrong answer in the figure was produced by the system without the paragraph-level retrieval module.

The above findings on both FEVER and HOTPOTQA bring us some important guidelines for MRS: (1) A paragraph-level retrieval module is imperative; (2) The downstream task module is able to tolerate a certain amount of noise from sentence-level retrieval; (3) Cascade effects on the downstream task might be caused by modifications at paragraph-level retrieval.

Conclusion
We proposed a simple yet effective hierarchical pipeline system that achieves state-of-the-art results on two MRS tasks. Ablation studies demonstrate the importance of semantic retrieval at both paragraph and sentence levels in the MRS system. The work can give general guidelines on MRS modeling and inspire future research on the relationship between semantic retrieval and downstream comprehension in a joint setting.

The hyper-parameters were chosen based on the performance of the system on the dev set. The hyper-parameter search space is shown in Table 6 and the learning rate was set to $10^{-5}$ in all experiments.

Table 6: Hyper-parameter selection for the full pipeline system. $h$ and $k$ are the retrieval filtering hyper-parameters mentioned in the main paper. P-level and S-level indicate paragraph-level and sentence-level respectively. "{}" means values enumerated from a set. "[]" means values enumerated from a range with interval=0.1. "BS."=Batch Size; "# E."=Number of Epochs.

B Term-Based Retrieval Details
FEVER: We used the same keyword matching method as in Nie et al. (2019) to get a candidate set for each query. We also used the TF-IDF method (Chen et al., 2017) to get the top-5 related documents for each query. Then, the two sets were combined to get the final term-based retrieval set for FEVER. The mean and standard deviation of the number of retrieved paragraphs in the merged set were 8.06 and 4.88.
HOTPOTQA: We first used the same procedure as on FEVER to get an initial candidate set for each query in HOTPOTQA. Because HOTPOTQA requires at least 2-hop reasoning for each query, we then extracted all the hyperlinked documents from the retrieved documents in the initial candidate set, ranked them by TF-IDF score (Chen et al., 2017), selected the top-5 most related documents, and added them to the candidate set. This gives the final term-based retrieval set for HOTPOTQA. The mean and standard deviation of the number of retrieved paragraphs for each query in HOTPOTQA were 39.43 and 16.05.
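The two candidate-set constructions can be sketched as below. The keyword matcher, TF-IDF ranker, TF-IDF scorer, and hyperlink table are passed in as hypothetical callables and data (`keyword_match`, `tfidf_top`, `tfidf_score`, `hyperlinks`), since their concrete implementations are not described here; tie-breaking by document id is likewise an assumption.

```python
from typing import Callable, Dict, List


def fever_candidates(query: str,
                     keyword_match: Callable[[str], List[str]],
                     tfidf_top: Callable[[str, int], List[str]],
                     top_n: int = 5) -> List[str]:
    """Union of keyword-matched documents and the top-n TF-IDF documents."""
    merged = keyword_match(query) + tfidf_top(query, top_n)
    return list(dict.fromkeys(merged))  # de-duplicate while keeping order


def hotpot_candidates(query: str,
                      keyword_match: Callable[[str], List[str]],
                      tfidf_top: Callable[[str, int], List[str]],
                      tfidf_score: Callable[[str, str], float],
                      hyperlinks: Dict[str, List[str]],
                      top_n: int = 5) -> List[str]:
    """Extend the initial set with the top-n hyperlinked documents ranked by TF-IDF score."""
    initial = fever_candidates(query, keyword_match, tfidf_top, top_n)
    linked = {t for d in initial for t in hyperlinks.get(d, []) if t not in initial}
    # Sort by descending score; ties are broken by document id for determinism.
    ranked = sorted(linked, key=lambda d: (-tfidf_score(query, d), d))
    return initial + ranked[:top_n]
```

The hyperlink expansion is what gives the second hop a chance to enter the candidate set even when the linked document shares few terms with the query.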

C Detailed Results
• The results of sentence-level retrieval and downstream QA with different values of $h_s$ on HOTPOTQA are in Table 7.
• The results of sentence-level retrieval and downstream verification with different values of $h_s$ on FEVER are in Table 8.
• The results of sentence-level retrieval and downstream QA with different values of $k_p$ on HOTPOTQA are in Table 9.

D Examples and Case Study
We further provide examples, case studies, and error analysis for the full pipeline system. The examples are shown in Tables 11, 12, 13, 14, and 15. The examples show high diversity at the semantic level, and errors often occur due to the system's failure to extract precise information from the KB (the extracted information is either wrong, surplus, or insufficient).

Ground Truth Facts: (D1NZ, 0) D1NZ is a production car drifting series in New Zealand.

Ground Truth Answer: Drifting
Predicted Facts: (D1NZ, 0) D1NZ is a production car drifting series in New Zealand.