Unsupervised Alignment-based Iterative Evidence Retrieval for Multi-hop Question Answering

Evidence retrieval is a critical stage of question answering (QA), necessary not only to improve performance, but also to explain the decisions of the QA method. We introduce a simple, fast, and unsupervised iterative evidence retrieval method, which relies on three ideas: (a) an unsupervised alignment approach to soft-align questions and answers with justification sentences using only GloVe embeddings, (b) an iterative process that reformulates queries focusing on terms that are not covered by existing justifications, which (c) stops when the terms in the given question and candidate answers are covered by the retrieved justifications. Despite its simplicity, our approach outperforms all the previous methods (including supervised methods) on the evidence selection task on two datasets: MultiRC and QASC. When these evidence sentences are fed into a RoBERTa answer classification component, we achieve state-of-the-art QA performance on these two datasets.


Introduction
Explainability in machine learning (ML) remains a critical unsolved challenge that slows the adoption of ML in real-world applications (Biran and Cotton, 2017;Gilpin et al., 2018;Alvarez-Melis and Jaakkola, 2017;Arras et al., 2017).
Question answering (QA) is one of the challenging natural language processing (NLP) tasks that benefits from explainability. In particular, multihop QA requires the aggregation of multiple evidence facts in order to answer complex natural language questions (Yang et al., 2018). Several multi-hop QA datasets have been proposed recently (Yang et al., 2018;Khashabi et al., 2018a;Welbl et al., 2018;Dua et al., 2019;Khot et al., 2019a;Jansen and Ustalov, 2019;Rajpurkar et al., 2018). While several neural methods have achieved state-of-theart results on these datasets (Devlin et al., 2019;Liu et al., 2019;Yang et al., 2019), we argue that many of these directions lack a human-understandable explanation of their inference process, which is necessary to transition these approaches into realworld applications. This is especially critical for multi-hop, multiple choice QA (MCQA) where: (a) the answer text may not come from an actual knowledge base passage, and (b) reasoning is required to link the candidate answers to the given question (Yadav et al., 2019b). Figure 1 shows one such multi-hop example from a MCQA dataset.
In this paper we introduce a simple alignmentbased iterative retriever (AIR) 1 , which retrieves high-quality evidence sentences from unstructured knowledge bases. We demonstrate that these evidence sentences are useful not only to explain the required reasoning steps that answer a question, but they also considerably improve the performance of the QA system itself.
Unlike several previous works that depend on supervised methods for the retrieval of justification sentences (deployed mostly in settings that rely on small sets of candidate texts, e.g., HotPotQA, MultiRC), AIR is completely unsupervised and scales easily from QA tasks that use small sets of candidate evidence texts to ones that rely on large knowledge bases (e.g., QASC (Khot et al., 2019a)). AIR retrieves justification sentences through a simple iterative process. In each iteration, AIR uses an alignment model to find justification sentences that are closest in embedding space to the current query (Kim et al., 2017;Yadav et al., 2018), which is initialized with the question and candidate answer text. After each iteration, AIR adjusts its query to focus on the missing information (Khot et al., 2019b) in the current set of justifications.
AIR also conditionally expands the query using the justifications retrieved in the previous steps.
In particular, our key contributions are: (1) We develop a simple, fast, and unsupervised iterative evidence retrieval method, which achieves state-of-the-art results on justification selection on two multi-hop QA datasets: MultiRC (Khashabi et al., 2018a) and QASC (Khot et al., 2019a). Notably, our simple unsupervised approach that relies solely on GloVe embeddings (Pennington et al., 2014) outperforms three transformer-based supervised state-of-the-art methods: BERT (Devlin et al., 2019), XLnet (Yang et al., 2019) and RoBERTa (Liu et al., 2019) on the justification selection task. Further, when the retrieved justifications are fed into a QA component based on RoBERTa (Liu et al., 2019), we obtain the best QA performance on the development sets of both MultiRC and QASC. 2 (2) AIR can be trivially extended to capture parallel evidence chains by running multiple instances of AIR in parallel starting from different initial evidence sentences. We show that aggregating multiple parallel evidences further improves the QA performance over the vanilla AIR by 3.7% EM0 on the MultiRC and 5.2% accuracy on QASC datasets (both absolute percentages on development sets). Thus, with 5 parallel evidences from AIR we obtain 36.3% EM0 on MultiRC and 81.0% accuracy on QASC hidden test sets (on their respective leaderboards). To our knowledge from published works, these results are the new state-of-the-art QA results on these two datasets. These scores are also accompanied by new state-of-the-art performance on evidence retrieval on both the datasets, which emphasizes the interpretability of AIR.
(3) We demonstrate that AIR's iterative process that focuses on missing information is more robust to semantic drift. We show that even the supervised RoBERTa-based retriever trained to retrieve evidences iteratively, suffers substantial drops in performance with retrieval from consecutive hops.

Related Work
Our work falls under the revitalized direction that focuses on the interpretability of QA systems, where the machine's inference process is explained to the end user in natural language evidence text (Qi et al., 2019;Yang et al., 2018;Wang et al., 2019b;Yadav et al., 2019b;Bauer et al., 2018). Several 2 In settings where external labeled resources are not used.  (Yang et al., 2018;Khashabi et al., 2018a;Khot et al., 2019a;Jansen and Ustalov, 2019) provide annotated evidence sentences enabling the automated evaluation of interpretability via evidence text selection.
QA approaches that focus on interpretability can be broadly classified into three main categories: supervised, which require annotated justifications at training time, latent, which extract justification sentences through latent variable methods driven by answer quality, and, lastly, unsupervised ones, which use unsupervised algorithms for evidence extraction.
In the first class of supervised approaches, a supervised classifier is normally trained to identify correct justification sentences driven by a query (Nie et al., 2019;Tu et al., 2019;Banerjee, 2019). Many systems tend to utilize a multi-task learning setting to learn both answer extraction and justification selection with the same network (Min et al., 2018;Gravina et al., 2018). Although these approaches have achieved impressive performance, they rely on annotated justification sentences, which may not be always available. Few approaches have used distant supervision methods (Lin et al., 2018;Wang et al., 2019b) to create noisy training data for evidence retrieval but these usually underperform due to noisy labels.
In the latent approaches for selecting justifica-tions, reinforcement learning (Geva and Berant, 2018;Choi et al., 2017) and PageRank (Surdeanu et al., 2008) have been widely used to select justification sentences without explicit training data. While these directions do not require annotated justifications, they tend to need large amounts of question/correct answer pairs to facilitate the identification of latent justifications.
In unsupervised approaches, many QA systems have relied on structured knowledge base (KB) QA. For example, several previous works have used ConceptNet (Speer et al., 2017) to keep the QA process interpretable (Khashabi et al., 2018b;Sydorova et al., 2019). However, the construction of such structured knowledge bases is expensive, and may need frequent updates. Instead, in this work we focus on justification selection from textual (or unstructured) KBs, which are inexpensive to build and can be applied in several domains. In the same category of unsupervised approaches, conventional information retrieval (IR) methods such as BM25 (Chen et al., 2017) have also been widely used to retrieve independent individual sentences. As shown by (Khot et al., 2019a;Qi et al., 2019), and our table 2, these techniques do not work well for complex multi-hop questions, which require knowledge aggregation from multiple related justifications. Some unsupervised methods extract groups of justification sentences Yadav et al., 2019b) but these methods are exponentially expensive in the retrieval step. Contrary to all of these, AIR proposes a simpler and more efficient method for chaining justification sentences.
Recently, many supervised iterative justification retrieval approaches for QA have been proposed (Qi et al., 2019;Feldman and El-Yaniv, 2019;Banerjee, 2019;Das et al., 2018). While these were shown to achieve good evidence selection performance for complex questions when compared to earlier approaches that relied on just the original query (Chen et al., 2017;Yang et al., 2018), they all require supervision.
As opposed to all these iterative-retrieval methods and previously discussed directions, our proposed approach AIR is completely unsupervised, i.e., it does not require annotated justifications. Further, unlike many of the supervised iterative approaches (Feldman and El-Yaniv, 2019;Sun et al., 2019a) that perform query reformulation in a continuous representation space, AIR employs a simpler and more interpretable query reformulation strategy that relies on explicit terms from the query and the previously retrieved justification. Lastly, none of the previous iterative retrieval approaches address the problem of semantic drift, whereas AIR accounts for drift by controlling the query reformulation as explained in section 3.1.

Approach
As shown in fig. 2, the proposed QA approach consists of two components: (a) an unsupervised, iterative component that retrieves chains of justification sentences given a query; and (b) an answer classification component that classifies a candidate answer as correct or not, given the original question and the previously retrieved justifications. We detail these components in the next two sub-sections.

Iterative Justification Retrieval
AIR iteratively builds justification chains given a query. AIR starts by initializing the query with the concatenated question and candidate answer text 3 . Then, AIR iteratively repeats the following two steps: (a) It retrieves the most salient justification sentence given the current query using an alignment-IR approach (Yadav et al., 2019a). The candidate justification sentences come from datasetspecific KBs. For example, in MultiRC, we use as candidates all the sentences from the paragraph associated with the given question. In QASC, which has a large KB 4 of 17.4 million sentences), similar to Khot et al. (2019a) candidates are retrieved using the Heuristic+IR method which returns 80 candidate sentences for each candidate answer from the provided QASC KB. (b) it adjusts the query to focus on the missing information, i.e., the keywords that are not covered by the current evidence chain. AIR also dynamically adds new terms to the query from the previously retrieved justifications to nudge multi-hop retrieval. These two iterative steps repeat until a parameter-free termination condition is reached.
We first detail the important components of AIR.
Alignment: To compute the similarity score between a given query and a sentence from KB, AIR 3 Note that this work can be trivially adapted to reading comprehension tasks. In such tasks (e.g., SQuAD (Rajpurkar et al., 2018)), the initial query would contain just the question text. 4 In large KB-based QA, AIR first uses an off-the-shelf Lucene BM25 (Robertson et al., 2009) to retrieve a pool of candidate justification sentences from which the evidence chains are constructed. "paragraph":{"text":"Sent 0: Chinese Influences: The Japanese were forced out of the Korean peninsula in the sixth century, but not before the Koreans had bequeathed to the Yamato court copies of the sacred images and scriptures of Chinese Buddhism.
……. ………… Sent 10: At this early stage in its history, Japan was already (for the most part) only nominally ruled by the emperor. .……………. Passage Query Figure 2: A walkthrough example showing the iterative retrieval of justification sentences by AIR on MultiRC. Each current query includes keywords from the original query (which consists of question + candidate answer) that are not covered by previously retrieved justifications (see 2 nd hop). If the number of uncovered keywords is too small, the query is expanded with keywords from the most recent justification (3 rd hop). The retrieval process terminates when all query terms are covered by existing justifications. Qc indicates the proportion of query terms covered in the justifications; Qr indicates the query terms which are still not covered by the justifications. AIR can retrieve parallel justification chains by running the retrieval process in parallel, starting from different candidates for the first justification sentence in a chain.
uses a vanilla unsupervised alignment method of Yadav et al. (2019a) which uses only GloVe embeddings (Pennington et al., 2014). 5 The alignment method computes the cosine similarity between the word embeddings of each token in the query and each token in the given KB sentence, resulting in a matrix of cosine similarity scores. For each query token, the algorithm select the most similar token in the evidence text using max-pooling. At the end, the element-wise dot product between this maxpooled vector of cosine-similarity scores and the vector containing the IDF values of the query tokens is calculated to produce the overall alignment score s for the given query Q and the supporting paragraph P j : 5 Alignment based on BERT embeddings marginally outperformed the one based on GloVe embeddings, but BERT embeddings were much more expensive to generate.
where q i and p k are the i th and k th terms of the query (Q) and evidence sentence (P j ) respectively.
Remainder terms (Q r ): Query reformulation in AIR is driven by the remainder terms, which are the set of query terms not yet covered in the justification set of i sentences (retrieved from the first i iterations of the retrieval process): where t(Q) represents the unique set of query terms, t(s k ) represents the unique terms of the k th justification, and S i represents the set of i justification sentences. Note that we use soft matching of alignment for the inclusion operation: we consider a query term to be included in the set of terms in the justifications if its cosine similarity with a justification term is larger than a similarity threshold M (we use M =0.95 for all our experiments -see section 5.2), thus ensuring that the two terms are similar in the embedding space.
Coverage (Q c ): measures the coverage of the query keywords by the retrieved chain of justifications S: where |t(Q)| denotes the size of unique query terms.

The AIR retrieval process
Query reformulation: In each iteration j, AIR reformulates the query Q(j) to include only the terms not yet covered by the current justification chain, Q r (j − 1). See, for example, the second hop in fig. 2. To mitigate ambiguous queries, the query is expanded with the terms from all the previously retrieved justification sentences only if the number of uncovered terms is less than T (we used T = 2 for MultiRC and T = 4 for QASC (see section 5.2). See, for example, the third hop in fig. 2, in which the query is expanded with the terms of all the previously retrieved justification sentences. Formally: where j is the current iteration index.
Stopping criteria: AIR stops its iterative evidence retrieval process when either of the following conditions is true: (a) no new query terms are discovered in the last justification retrieved, i.e., Q r (i−1) == Q r (i), or (b) all query terms are covered by justifications, i.e., Q c = 1.

Answer Classification
AIR's justification chains can be fed into any supervised answer classification method. For all experiments in this paper, we used RoBERTa (Liu et al., 2019), a state-of-the-art transformer-based method. In particular, for MultiRC, we concatenate the query (composed from question and candidate answer text) with the evidence text, with the [SEP] token between the two texts. A sigmoid is used over the [CLS] representation to train a binary classification task 6 (correct answer or not).
For QASC, we fine-tune RoBERTa as a multiplechoice QA 7 (MCQA) (Wolf et al., 2019) classifier with 8 choices using a softmax layer(similar to (Khot et al., 2019a)) instead of the sigmoid. The input text consists of eight queries (from eight candidate answers) and their corresponding eight evidence texts. Unlike the case of MultiRC, it is possible to train a MCQA classifier for QASC because every question has only 1 correct answer. We had also tried the binary classification approach for QASC but it resulted in nearly 5% lower performance for majority of the experiments in table 2.
In QA tasks that rely on large KBs there may exist multiple chains of evidence that support a correct answer. This is particularly relevant in QASC, whose KB contains 17.2M facts. 8 Figure 1 shows an example of this situation. To utilize this type of redundancy in answer classification, we extend AIR to extract parallel evidence chains. That is, to extract N parallel chains, we run AIR N times, ensuring that the first justification sentences in each chain are different (in practice, we start a new chain for each justification in the top N retrieved sentences in the first hop). After retrieving N parallel evidence chains, we take the union of all the individual justification sentences to create the supporting evidence text for that candidate answer.

Experiments
We evaluated our approach on two datasets: Multi-sentence reading comprehension (Mul-tiRC), which is a reading comprehension dataset provided in the form of multiple-choice QA task (Khashabi et al., 2018a). Every question is based on a paragraph, which contains the gold justification sentences for each question. We use every sentence of the paragraph as candidate justifications for a given question. Here we use the original 6 We used RoBERTa base with maximum sequence length of 512, batch size = 8, learning rate of 1e-5, and 5 number of epochs. RoBERTa-base always returned consistent performance on MultiRC experiments; many runs from RoBERTalarge failed to train (as explained by (Wolf et al., 2019)), and generated near random performance. 7 We used similar hyperparameters as in the MultiRC experiments, but instead used RoBERTa-large, with maximum sequence length of 128. 8 The dataset creators make a similar observation (Khot et al., 2019a).  Table 1: Results on the MultiRC development and test sets. The first column specifies the runtime overhead required for selection of evidence sentences, where N is the total number of sentences in the passage, and K is the selected number of sentences. The second column specifies if the retrieval system is a supervised method or not. The last three columns indicate evidence selection performance, whereas the previous three indicate overall QA performance. Only the last block of results report performance on the test set. The bold italic font highlights the best performance without using parallel evidences. denotes usage of external labeled data for pretraining.
MultiRC dataset, 9 which includes the gold annotations for evidence text, unlike the version available on SuperGlue (Wang et al., 2019a).
Question Answering using Sentence Composition (QASC), a large KB-based multiple-choice QA dataset (Khot et al., 2019a). Each question is provided with 8 answer candidates, out of which 4 candidates are hard adversarial choices. Every question is annotated with a fixed set of two justification sentences for answering the question. The  Table 2: QA and evidence selection performance on QASC. We also report recall@10 similar to Khot et al. (2019a). both found reports the recall scores when both the gold justifications are found in top 10 ranked sentences and similarly atleast one found reports the recall scores when either one or both the gold justifications are found in the top 10 ranked sentences. Recall@10 are not reported (row 8-17) when number of retrieved sentences are lesser than 10. Other notations are same as table 1.

Baselines
In addition to previously-reported results, we include in the tables several in-house baselines. For MultiRC, we considered three baselines. The first baseline is where we feed all passage sentences to the RoBERTa classifier (row 11 in table 1). The second baseline uses the alignment method of (Kim et al., 2017) to retrieve the top k sentences (k = 2, 5). Since AIR uses the same alignment approach for retrieving justifications in each iteration, the comparison to this second baseline highlights the gains from our iterative process with query reformulation. The third baseline uses a supervised RoBERTa classifier trained to select the gold justifications for every query (rows 16-21 in table 1). Lastly, we also developed a RoBERTa-based iterative retriever by concatenating the query with the retrieved justification in the previous step. We retrain the RoBERTa iterative retriever in every step, using the new query in each step. We considered two baselines for QASC. The first baseline does not include any justifications (row 7 in table 2). The second baseline uses the top k sentences retrieved by the alignment method (row (8-12 in table 2).

Evidence Selection Results
For evidence selection, we report precision, recall, and F1 scores on MultiRC (similar to (Wang et al., 2019b;Yadav et al., 2019b)). For QASC, we report Recall@10, similar to the dataset authors (Khot et al., 2019a). We draw several observation from the evidence selection results: (1) AIR vs. unsupervised methods -AIR outperforms all the unsupervised baselines and previous works in both MultiRC (row 9-15 vs. row 23 in table 1) and QASC(rows 0-6 vs. row 18). Thus, highlighting strengths of AIR over the standard IR baselines. AIR achieves 5.4% better F1 score compared to the best parametric alignment baseline (row 12 in table 1), which highlights the importance of the iterative approach over the vanilla alignment in AIR. Similarly, rows (4 and 5) of table 2 also highlight this importance in QASC.
(2) AIR vs. supervised methods -Surprisingly, AIR also outperforms the supervised RoBERTaretriver in every setting(rows 16-21 in table 1). Note that the performance of this supervised re-trieval method drops considerably when trained on passages from a specific domain (row 19 in table 1), which highlights the domain sensitivity of supervised retrieval methods. In contrast, AIR is unsupervised and generalize better as it is not tuned to any specific domain. AIR also achieves better performance than supervised RoBERTa-iterativeretriever (row 21 in table 1) which simply concatenates the retrieved justification to the query after every iteration and further trains to retrieve the next justification. The RoBERTa-iterative-retriever achieves similar performance as that of the simple RoBERTa-retriever (row 16 vs. 21) which suggests that supervised iterative retrievers marginally exploit the information from query expansion. On the other hand, controlled query reformulation of AIR leads to 5.4% improvement as explained in the previous point. All in all, AIR achieves state-of-theart results for evidence retrieval on both MultiRC (row 23 in table 1) and QASC (row 18 of table 2).
(3) Soft-matching of AIR -the alignment-based AIR is 10.7% F1 better than AIR that relies on lexical matching (rather than the soft matching) on MultiRC (row 22 vs. 23), which emphasizes the advantage of alignment methods over conventional lexical match approaches.

Question Answering Results
For overall QA performance, we report the standard performance measures (F 1 a , F 1 m , and EM 0) in MultiRC (Khashabi et al., 2018a), and accuracy for QASC (Khot et al., 2019a).
The results in tables 1 and 2 highlight: (1) State-of-the-art performance: Development set -On both MultiRC and QASC, RoBERTa fine-tuned using the AIR retrieved evidence chains (row 23 in table 1 and row 14 in table 2) outperforms all the previous approaches and the baseline methods. This indicates that the evidence texts retrieved by AIR not only provide better explanations, but also contribute considerably in achieving the best QA performance.
Test set -On the official hidden test set, RoBERTa fine-tuned on 5 parallel evidences from AIR achieves new state-of-the-art QA results, outperforming previous state-of-the-art methods by 7.8% accuracy on QASC (row 21 vs. 20), and 10.2% EM0 on MultiRC (row 35 vs. 34).
(2) Knowledge aggregation -The knowledge aggregation from multiple justification sentences  leads to substantial improvements, particularly in QASC (single justification (row 4 and 8) vs. evidence chains (row 5 and row 9) in table table 2). Overall, the chain of evidence text retrieved by AIR enables knowledge aggregation resulting in the improvement of QA performances.
(3) Gains from parallel evidences -Further, knowledge aggregation from parallel evidence chains lead to another 3.7% EM0 improvement on MultiRC (row 27), and 5.6% on QASC over the single AIR evidence chain (row 18). To our knowledge, these are new state-of-the-art results in both the datasets.

Analysis
To further understand the retrieval process of AIR we implemented several analyses.

Semantic Drift Analysis
To understand the importance of modeling missing information in query reformulation, we analyzed a simple variant of AIR in which, rather the focusing on missing information, we simply concatenate the complete justification sentence to the query after each hop. To expose semantic drift, we retrieve a specified number of justification sentences. As seen in table 3, now the AIR(lexical)-uncontrolled and AIR-uncontrolled perform worse than both BM25 and the alignment method. This highlights that the focus on missing information during query reformulation is an important deterrent of semantic drift. We repeated the same experiment with the supervised RoBERTa retriever (trained iteratively for 2 steps) and the original parameter-free AIR, which decides its number of hops using the stopping conditions. Again, we observe similar performance drops in both: the RoBERTa retriever drops from 62.3% to 57.6% and AIR drops to 55.4%.

Robustness to Hyper Parameters
We evaluate the sensitivity of AIR to the 2 hyper parameters: the threshold (Q r ) for query expansion, and the cosine similarity threshold M in computation of alignment. As shown in table 5, evidence selection performance of AIR drops with the lower values of M but the drops are small, suggesting that AIR is robust to different M values.
Similarly, there is a drop in performance for MultiRC with the increase in the Q r threshold used for query expansion, hinting to the occurrence of semantic drift for higher values of Q r (table 4). This is because the candidate justifications are coming from a relatively small numbers of paragraphs in MultiRC; thus even shorter queries (= 2 words) can retrieve relevant justifications. On the other hand, the number of candidate justifications in QASC is much higher, which requires longer queries for disambiguation (>= 4 words).

Saturation of Supervised Learning
To verify if the MultiRC training data is sufficient to train a supervised justification retrieval method, we trained justification selection classifiers based on BERT, XLNet, and RoBERTa on increasing proportions of the MultiRC training data (table 6). This analysis indicates that all three classifiers approach their best performance at around 5% of the training data. This indicates that, while these supervised methods converge quickly, they are unlikely to outperform AIR, an unsupervised method, even if more training data were available.

Conclusion
We introduced a simple, unsupervised approach for evidence retrieval for question answering. Our approach combines three ideas: (a) an unsupervised alignment approach to soft-align questions and answers with justification sentences using GloVe embeddings, (b) an iterative process that reformulates queries focusing on terms that are not covered by existing justifications, and (c) a simple stopping condition that concludes the iterative process when all terms in the given question and candidate answers are covered by the retrieved justifications. Overall, despite its simplicity, unsupervised nature, and its sole reliance on GloVe embeddings, our approach outperforms all previous methods (including supervised ones) on the evidence selection task on two datasets: MultiRC and QASC. When these evidence sentences are fed into a RoBERTa answer classification component, we achieve the best QA performance on these two datasets. Further, we show that considerable improvements can be obtained by aggregating knowledge from parallel evidence chains retrieved by our method. In addition of improving QA, we hypothesize that these simple unsupervised components of AIR will benefit future work on supervised neural iterative retrieval approaches by improving their query reformulation algorithms and termination criteria.