Quick and (not so) Dirty: Unsupervised Selection of Justification Sentences for Multi-hop Question Answering

We propose an unsupervised strategy for the selection of justification sentences for multi-hop question answering (QA) that (a) maximizes the relevance of the selected sentences, (b) minimizes the overlap between the selected facts, and (c) maximizes the coverage of both question and answer. This unsupervised sentence selection can be coupled with any supervised QA model. We show that the sentences selected by our method improve the performance of a state-of-the-art supervised QA model on two multi-hop QA datasets: AI2’s Reasoning Challenge (ARC) and Multi-Sentence Reading Comprehension (MultiRC). We obtain new state-of-the-art performance on both datasets among systems that do not use external resources for training the QA system: 56.82% F1 on ARC (41.24% on Challenge and 64.49% on Easy) and 26.1% EM0 on MultiRC. Our justification sentences have higher quality than the justifications selected by a strong information retrieval baseline, e.g., by 5.4% F1 in MultiRC. We also show that our unsupervised selection of justification sentences is more stable across domains than a state-of-the-art supervised sentence selection method.


Introduction
Interpretable machine learning (ML) models, where the end user can understand how a decision was reached, are a critical requirement for the wide adoption of ML solutions in many fields such as healthcare, finance, and law (Alvarez-Melis and Jaakkola, 2017; Arras et al., 2017; Gilpin et al., 2018; Biran and Cotton, 2017). For complex natural language processing (NLP) tasks such as question answering (QA), human-readable explanations of the inference process have been proposed as a way to interpret QA models.

Figure 1: An example question from ARC, "To which organ system do the esophagus, liver, pancreas, small intestine, and colon belong? (A) reproductive system (B) excretory system (C) digestive system (D) endocrine system", with the correct answer in bold, followed by justification sentences selected by our approach (ROCC) vs. sentences selected by a strong IR baseline (BM25). ROCC-selected justification sentences: (1) "vertebrate digestive system has oral cavity, teeth and pharynx, esophagus and stomach, small intestine, pancreas, liver and the large intestine"; (2) "digestive system consists liver, stomach, large intestine, small intestine, colon, rectum and anus". BM25-selected justification sentences: (1) "their digestive system consists of a stomach, liver, pancreas, small intestine, and a large intestine"; (2) "the liver pancreas and gallbladder are the solid organ of the digestive system". The ROCC justification sentences fully cover the five key terms in the question (shown in italic), whereas BM25 misses two: esophagus and colon. Further, the second BM25 sentence is largely redundant with the first, not covering other query terms.
Recently, multiple datasets have been proposed for multi-hop QA, in which questions can only be answered by considering information from multiple sentences and/or documents (Khashabi et al., 2018a; Welbl et al., 2018; Mihaylov et al., 2018; Bauer et al., 2018; Dunn et al., 2017; Dhingra et al., 2017; Lai et al., 2017; Rajpurkar et al., 2018). The task of selecting justification sentences is complex for multi-hop QA because of the additional knowledge aggregation requirement (examples of such questions and answers are shown in Figures 1 and 2). Although various neural QA methods have achieved high performance on some of these datasets (Trivedi et al., 2019; Tymoshenko et al., 2017; Seo et al., 2016; Wang and Jiang, 2016; De Cao et al., 2018; Back et al., 2018), we argue that more effort must be dedicated to explaining their inference process.
In this work we propose an unsupervised algorithm for the selection of multi-hop justifications from unstructured knowledge bases (KBs). Unlike other supervised selection methods (Dehghani et al., 2019; Bao et al., 2016; Wang et al., 2018b,a; Tran and Niedereée, 2018; Trivedi et al., 2019), our approach does not require any training data for justification selection. Unlike approaches that rely on structured KBs, which are expensive to create (Khashabi et al., 2016; Khot et al., 2017; Khashabi et al., 2018b; Cui et al., 2017; Bao et al., 2016), our method operates over KBs of only unstructured texts. We demonstrate that our approach has a bigger impact on downstream QA approaches that use these justification sentences as an additional signal than a strong baseline that relies on information retrieval (IR). In particular, the contributions of this work are:

(1) We propose an unsupervised, non-parametric strategy for the selection of justification sentences for multi-hop question answering (QA) that (a) maximizes the Relevance of the selected sentences; (b) minimizes the lexical Overlap between the selected facts; and (c) maximizes the lexical Coverage of both question and answer. We call our approach ROCC. ROCC operates by first creating C(n, k) justification sets, for each k from 2 to n, from the top n sentences selected by the BM25 information retrieval model (Robertson et al., 2009), and then ranking them all by a formula that combines the three criteria above. The set with the top score becomes the set of justifications output by ROCC for a given question and candidate answer. As shown in Figure 1, the justification sentences selected by ROCC perform more meaningful knowledge aggregation than a strong IR baseline (BM25), which does not account for overlap (or complementarity) and coverage.
(2) ROCC can be coupled with any supervised QA approach that can use the selected justification sentences as an additional signal. To demonstrate its effectiveness, we combine ROCC with a state-of-the-art QA method that relies on BERT (Devlin et al., 2018) to classify correct answers, using the text of the question, the answer, and (now) the justification sentences as input. On the Multi-Sentence Reading Comprehension (MultiRC) dataset (Khashabi et al., 2018a), we achieved a gain of 8.3% EM0 with ROCC justifications when compared to the case where the complete comprehension passage was provided to the BERT classifier. On AI2's Reasoning Challenge (ARC) dataset, the QA approach enhanced with ROCC justifications outperforms the QA method without justifications by 9.15% accuracy, and the approach that uses the top sentences provided by BM25 by 2.88%. Further, we show that the justification sentences selected by ROCC are considerably more correct on their own than justifications selected by BM25 (e.g., the justification score in MultiRC increased by 11.58% when compared to the best-performing BM25 justifications), which indicates that the interpretability of the overall QA system was also increased.
(3) Lastly, our analysis indicates that ROCC is more stable across the different domains in the MultiRC dataset than a supervised strategy for the selection of justification sentences that relies on a dedicated BERT-based classifier, with a difference of over 10% F1 score in some configurations.
The ROCC system and the code for generating all of the analyses in this paper are available at https://github.com/vikas95/AutoROCC.

Related Work
The body of QA work that addresses the selection of justification sentences can be classified into roughly four categories: (a) supervised approaches that require training data to learn how to select justification sentences (i.e., questions and answers coupled with correct justifications); (b) methods that treat justifications as latent variables and learn jointly how to answer questions and how to select justifications from questions and answers alone; (c) approaches that rely on information retrieval to select justification sentences; and, lastly, (d) methods that do not use justification sentences at all.
In the first category, previous works (e.g., Trivedi et al. (2019)) have used entailment resources, including labeled training datasets such as SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2017), to train components for selecting justification sentences for QA. Other works have explicitly focused on training sentence selection components for QA models (Min et al., 2018; Wang et al., 2019). In datasets where gold justification sentences are not provided, researchers have trained such components by retrieving justifications from structured KBs (Cui et al., 2017; Bao et al., 2016; Zhang et al., 2016; Hao et al., 2017) such as ConceptNet (Speer et al., 2017), or from IR systems coupled with denoising components (Wang et al., 2019). While these works offer exciting directions, they all rely on training data for justifications, which is expensive to generate and may not be available in real-world use cases.
The second group of methods tends to rely on reinforcement learning (Choi et al., 2017; Lai et al., 2018; Geva and Berant, 2018) or PageRank (Surdeanu et al., 2008) to learn how to select justification sentences without explicit training data. Other works have used end-to-end QA architectures (mostly RNNs with attention mechanisms) to learn to pay more attention to better justification sentences (Min et al., 2018; Seo et al., 2016; Yu et al., 2014; Gravina et al., 2018). While these approaches do not require annotated justifications, they need large numbers of question/answer pairs during training so they can discover the latent justifications. In contrast to these two directions, our approach requires no training data at all for the justification selection process.
The third category of methods utilize IR techniques to retrieve justifications from both unstructured (Yadav et al., 2019) and structured (Khashabi et al., 2016) KBs. Our approach is closer in spirit to this direction, but it is adjusted to account for more intentional knowledge aggregation. As we show in Section 4, this is important for both the quality of the justification sentences and the performance of the downstream QA system.
The last group of QA approaches learn how to classify answers without any justification sentences (Mihaylov et al., 2018;Devlin et al., 2018). While this has been shown to obtain good performance for answer classification, we do not focus on it in this work because these methods cannot easily explain their inference.
Note that some of the works discussed here transfer knowledge from external datasets into the QA task they address (Chung et al., 2017;Pan et al., 2019;Min et al., 2017;Qiu et al., 2018;Chen et al., 2017). In this work, we focus solely on the resources provided in the task itself because such compatible external resources may not be available in real-world applications of QA.

Approach
ROCC, coupled with a QA system, operates in the following steps (illustrated in Figure 2):

(1) Retrieval of candidate justification sentences: For datasets that rely on huge supporting KBs (e.g., ARC), we retrieve the top n sentences from this KB using an IR query that concatenates the question and the candidate answer, similar to Yadav et al. (2019). We implemented this step using the BM25 IR model with the default parameters in Lucene. For reading comprehension datasets, where the question is associated with a text passage (e.g., MultiRC), all the sentences in this passage become candidates.
(2) Generation of candidate justification sets: Since its focus is on knowledge aggregation, ROCC ranks sets of justification sentences (see below) rather than individual sentences. In this step we create candidate justification sets by generating all C(n, k) groups of k sentences from the previous n sentences, for multiple values of k.
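As a concrete sketch of this enumeration step (the function name and the use of sentence indices are ours), the candidate sets can be generated with the standard library:

```python
from itertools import combinations

def candidate_justification_sets(n, max_k):
    # All C(n, k) index subsets of the top-n retrieved sentences,
    # for k = 2 .. max_k (indices stand in for the sentences).
    sets = []
    for k in range(2, max_k + 1):
        sets.extend(combinations(range(n), k))
    return sets

# With n = 5 and k up to 3: C(5, 2) + C(5, 3) = 10 + 10 = 20 candidate sets.
cands = candidate_justification_sets(5, 3)
```

Note that the number of candidate sets grows combinatorially with n, which is why ROCC restricts the pool to the top n BM25 sentences before enumerating.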
(3) Ranking of candidate justification sets: For every candidate justification set, we calculate its ROCC score (see Section 3.1), which estimates the likelihood that this group of justifications explains the given answer. We then rank the justification sets in descending order of their ROCC scores, and choose the top set as the group of justifications output by ROCC for the given question and answer. In MultiRC, we rearrange the selected justification sentences according to their original indexes in the given passage, to bring coherence to the selected sequence of sentences.
(4) Answer classification: ROCC can be coupled with any supervised QA component for answer classification. In this work, we feed the question, answer, and justification texts into a state-of-the-art classifier that relies on BERT (see Section 3.2). Because the justification sentences in the reading comprehension use case (e.g., MultiRC) come from the same passage and their sequence is likely to be coherent, we concatenate them into a single passage and use a single BERT instance for classification. This approach is shown on the left side of the answer classification component in Figure 2. On the other hand, the justification sentences retrieved from an external KB (e.g., ARC) may not form a coherent passage when aggregated. For this reason, in the ARC use case, we classify each justification sentence separately (together with the question and candidate answer), and then average all these scores to produce a single score for the candidate answer (right-hand side of the figure).
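The two aggregation modes described above can be sketched as follows; the scorer here is a toy stand-in for the actual BERT classifier, included only to make the block runnable:

```python
def aggregate_answer_score(score_fn, question, answer, justifications, same_passage):
    # MultiRC-style: one coherent passage, score its concatenation once.
    if same_passage:
        return score_fn(question, answer, " ".join(justifications))
    # ARC-style: score each justification separately, then average.
    scores = [score_fn(question, answer, j) for j in justifications]
    return sum(scores) / len(scores)

# Toy stand-in for the BERT classifier: fraction of answer tokens
# found in the justification text.
def toy_score(question, answer, text):
    a, t = set(answer.split()), set(text.split())
    return len(a & t) / max(len(a), 1)
```

For example, averaging over the justifications `["the digestive system", "liver only"]` for the answer "digestive system" gives the mean of a perfect and a zero per-sentence score.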

Figure 2: An example of the ROCC process for a question from the MultiRC dataset. Here, ROCC correctly extracts the two justification sentences necessary to explain the correct answer.

Ranking of Candidate Justification Sets
Each set of justifications is ranked based on its ROCC score, which: (a) maximizes the Relevance (R) of the selected sentences; (b) minimizes the lexical Overlap (O) between the selected facts; and (c) maximizes the lexical Coverage of both question and answer (C_ques, C_ans). The overall score for a given justification set P_i is calculated as:

ROCC(P_i) = (R(P_i) × C_ques × C_ans) / O(P_i)    (1)

To avoid zeros, we add a small constant (ε = 1 here) to each component that can have a value of 0. We detail the components of this formula below.
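A minimal sketch of this scoring function, assuming the components have already been computed for a candidate set; applying the ε constant uniformly to all four components is our assumption, since the text only says it is added to components that can be zero:

```python
def rocc_score(relevance, overlap, cov_q, cov_a, eps=1.0):
    # Reward relevance and coverage, penalize pairwise overlap.
    # eps guards zero-valued components (eps = 1 in the text).
    return ((relevance + eps) * (cov_q + eps) * (cov_a + eps)) / (overlap + eps)
```

As intended, a candidate set with lower overlap scores strictly higher than an otherwise identical set with higher overlap.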
Relevance (R): We use the Lucene implementation of the BM25 IR model (Robertson et al., 2009) to estimate the relevance of each justification sentence to a given question and candidate answer. In particular, we form a query that concatenates the question and candidate answer, and use as the underlying document collection (necessary to compute document statistics such as inverse document frequency (IDF)) either all sentences in the entire KB (for ARC) or all sentences in the corresponding passage (for reading comprehension, i.e., MultiRC). The arithmetic mean of the BM25 scores over all sentences in a given justification set gives the value of R for the entire set.

Overlap (O)
To ensure diversity and complementarity between justification sentences, we compute the overlap between all sentence pairs in a given group. Minimizing this score reduces redundancy and encourages the aggregated sentences to address different parts of the question and answer:

O(S) = (1/|S|²) Σ_{s_i, s_j ∈ S, i ≠ j} |t(s_i) ∩ t(s_j)| / |t(s_i) ∪ t(s_j)|    (2)

where S is the given set of justification sentences; s_i is the i-th sentence in S; and t(s_i) denotes the set of unique terms in sentence s_i. Note that we divide by |S|² to normalize across different sizes of justification sets.
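This computation can be sketched as follows. The use of a Jaccard-style pairwise term is an assumption on our part; the source text specifies only that the overlap between the unique-term sets of all sentence pairs is averaged with a |S|² normalization:

```python
def overlap(term_sets):
    # Mean pairwise overlap of unique-term sets, normalized by |S|^2.
    sets = [set(t) for t in term_sets]
    n = len(sets)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if i != j and (sets[i] | sets[j]):
                total += len(sets[i] & sets[j]) / len(sets[i] | sets[j])
    return total / (n * n)
```

Two identical sentences score the maximum (0.5 under this normalization), while fully disjoint sentences score 0, which is the behavior the O component needs.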
Coverage (C): Complementing the overlap score, this component measures the lexical coverage of the question and answer texts by the given set of justifications S. The coverage is weighted by the IDF of the question and answer terms, so maximizing this value encourages the justifications to address more of the meaningful content mentioned in the question (X = Q) and the answer (X = A):

C_t(X) = t(X) ∩ (∪_{s_i ∈ S} t(s_i))    (3)

C(X) = Σ_{t ∈ C_t(X)} idf(t) / Σ_{t ∈ t(X)} idf(t)    (4)

where t(X) denotes the unique terms in X, and C_t(X) represents the set of all unique terms in X that are present in any of the sentences of the given justification set. That is, C(X) gives the IDF-weighted average of the coverage of the terms of X.
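A sketch of the coverage component, assuming precomputed IDF values (the dictionary-based `idf` lookup is our simplification of the document statistics that Lucene provides):

```python
def coverage(x_terms, justification_term_sets, idf):
    # IDF-weighted fraction of the unique terms of X (question or answer)
    # that appear in any sentence of the justification set.
    unique = set(x_terms)
    covered = {t for t in unique
               if any(t in s for s in justification_term_sets)}
    total = sum(idf.get(t, 0.0) for t in unique)
    return sum(idf.get(t, 0.0) for t in covered) / total if total else 0.0
```

Under this weighting, failing to cover a rare (high-IDF) term like "esophagus" costs more than missing a common one, which matches the intent of IDF weighting.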

Answer Classification
As indicated earlier, we propose two flavors for the answer classification component: if the sentences in a justification group come from the same passage and thus are likely to be coherent, they are concatenated into a single text before classification and handled by a single answer classifier. If the sentences come from different texts, they are handled by separate instances of the answer classifier; in the latter case, all scores are averaged to produce a single score for the candidate answer. In all situations we used BERT (Devlin et al., 2018) for answer classification. In particular, we employed BERT as a binary classifier operating over two texts: the first text consists of the concatenated question and answer, and the second text consists of the justification text. The classifier operates over the hidden state corresponding to the [CLS] token (Devlin et al., 2018).

We observed empirically that pre-training the BERT classifier on all n sentences retrieved by BM25, and then fine-tuning it on the ROCC justifications, improves performance on all datasets we experimented with. This resembles the transfer learning discussed by Howard and Ruder (2018), where the source domain would be the BM25 sentences and the target domain the ROCC justifications. However, one important distinction is that, in our case, all this knowledge comes solely from the resources provided within each dataset, and is retrieved using an unsupervised method (BM25). We conjecture that pre-training helped mainly because it exposed BERT to more data which, even if imperfect, is topically related to the corresponding question and answer. We used the following hyperparameters with BERT Large: learning rate of 1e-5, maximum sequence length of 128, batch size of 16, and 6 training epochs.
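As a minimal illustration of the two-segment input format described above (not the actual tokenization pipeline, which would go through a BERT tokenizer and its sentence-pair encoding), the pairing can be sketched as a plain string:

```python
def bert_input_string(question, answer, justification):
    # Segment A: concatenated question and answer.
    # Segment B: the justification text.
    # A real implementation would use a BERT tokenizer's pair-encoding
    # API rather than raw strings; this only shows the layout.
    return ("[CLS] " + question + " " + answer + " [SEP] "
            + justification + " [SEP]")
```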
Question + answer text: "Animal cells obtain energy by || absorbing nutrients"
Justification set: 1) "obtain water and nutrient by absorbing them directly into plant cell"; 2) "the animal obtain nourishment by absorbing nutrient released by symbiotic bacteria"

Table 1: Example of a justification set in ARC which was scored by the annotator with a precision of 1/2, because the first justification sentence is not relevant, and a coverage of 1/2, because the link between nourishment and energy is not covered.

Empirical Evaluation
We evaluated ROCC coupled with the proposed QA approach on two QA datasets. We use the standard train/development/test partitions for each dataset, as well as the standard evaluation measures: accuracy for ARC, and F1_m (macro-F1), F1_a (micro-F1), and EM0 (exact match) for MultiRC (Khashabi et al., 2018a).

Multi-Sentence Reading Comprehension (MultiRC): this is a reading comprehension dataset implemented as multiple-choice QA (Khashabi et al., 2018a). Each question is accompanied by a supporting passage, which contains the correct answer. We use all sentences from these passages as candidate justifications for the corresponding questions.
AI2's Reasoning Challenge (ARC): this is a multiple-choice question dataset containing questions from science exams from grade 3 to grade 9. The dataset is split into two partitions: Easy and Challenge, where the latter contains the more difficult questions that require reasoning. Most questions have 4 answer choices, with under 1% of all questions having either 3 or 5 answer choices. Importantly, ARC includes a supporting KB of 14.3M unstructured text passages. We use BM25 over this entire KB to retrieve candidate justification sentences for ROCC.

Justification Results
To demonstrate that ROCC has the capacity to select better justification sentences, we also report the quality of the extracted justification sentences. For MultiRC, we report precision/recall/F1 justification scores, computed against the gold justification sentences provided by the dataset. For ARC, where gold justifications are not provided, we used an external annotator to annotate the justifications for a random stratified sample of 70 questions, with 10 questions selected from each grade (3 to 9). The annotator reported two scores: precision and coverage. Precision was defined as the fraction of justification sentences that are relevant for the inference necessary to connect the corresponding question and candidate answer. Coverage was defined as 1 if the justification set completely covers the inference process for the given question and answer, 1/2 if the set of justifications partially addresses the inference, and 0 if the justification set is completely irrelevant. Table 1 illustrates these scores with an actual output from ARC.
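The MultiRC justification scores above are standard set-based precision/recall/F1 over sentence indices, which can be sketched as (function name ours):

```python
def justification_prf(predicted_idx, gold_idx):
    # Precision/recall/F1 of predicted justification sentence indices
    # against the gold indices provided by the dataset.
    pred, gold = set(predicted_idx), set(gold_idx)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```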

Question answering results
In addition to comparing ROCC with previously reported results, we include multiple baselines: (a) the BERT answer classifier trained on the entire passage of the given question (MultiRC), to demonstrate that ROCC has the capacity to filter out irrelevant content from these passages; (b) BERT trained without any justification sentences (ARC), to show that ROCC has the capacity to aggregate useful information from large unstructured KBs; and (c) BERT trained on sentences retrieved using BM25, to demonstrate that ROCC performs better than other unsupervised approaches. Note that the BM25 baseline has an additional hyperparameter: the number of sentences to be considered (k). Table 2 reports comprehensive results on MultiRC, including both overall QA performance, measured using F1_m, F1_a, and EM0, and justification quality, measured using standard precision (P), recall (R), and F1. Note that the bulk of the results are reported on the development partition. The last row of the table reports results on the test partition, computed using the official submission portal, which can be accessed only once per model (including its variants). To understand ROCC's behavior, the table includes both the parametric form of ROCC, where the size of the justification sets (k) is manually tuned, and the non-parametric ROCC, where k is automatically selected in the third step of the ROCC algorithm (see Figure 2) by sorting across all sizes of justification sets together, instead of sorting within each value of k. Table 3 lists equivalent results on ARC.
We draw several observations from these tables: (1) Despite its simplicity, ROCC combined with the BERT classifier obtains new state-of-the-art performance on both MultiRC and ARC for the class of approaches that do not use external resources to train either the justification sentence selection or the answer classifier. For example, ROCC outperforms the previous best result on MultiRC by 2.5 EM0 points on the development partition (row 24 vs. row 6), and by 1.6 EM0 points on test (row 30 vs. row 29). On ARC, ROCC outperforms the previous best approach by 5.8% accuracy on the Challenge partition, and by 2.9% overall (row 23 vs. row 9).
(2) On both datasets, the non-parametric form of ROCC (AutoROCC) slightly outperforms the parametric variant. Importantly, it always achieves higher justification scores compared to the parametric ROCC. In MultiRC, AutoROCC outperforms our baseline of BERT + entire passage (row 10 vs 22) by 8.3% EM0, indicating that AutoROCC can filter out irrelevant content. In ARC, AutoROCC outperforms the baseline with no justification sentences by 9.1% (row 21 vs row 11), demonstrating that ROCC aggregates useful knowledge.
(3) The results of the parametric forms of ROCC (rows 16-19 in Table 2 and rows 17-20 in Table 3) show how performance varies with the manually tuned justification set size k.

(4) The justification scores in both datasets are considerably higher than in the equivalent configuration that uses BM25 instead of ROCC (i.e., row 24 vs. row 23 in Table 2, and row 23 vs. row 22 in Table 3). This confirms that the joint scoring of sets of justifications performed by ROCC is better than the individual ranking of justification sentences performed by standard IR models such as BM25.

Domain Robustness Analysis
To understand ROCC's domain robustness, we compared it against a supervised BERT-based classifier for the selection of justification sentences, as well as against GPT-2 (Wang et al., 2019). For this experiment we used MultiRC, where gold justifications are provided, and trained a classifier for the selection of justification sentences on various domain-specific sections of MultiRC. The results of this experiment are shown in Table 4. Unsurprisingly, training and testing in the same domain (e.g., Fiction) leads to the best performance on sentence selection. However, ROCC is more stable across domains than the supervised sentence selection component, with a difference of over 10 F1 points in some configurations. This suggests that ROCC is a better solution for real-world use cases where the distribution of the test data may be very different from that of the training data. Compared to BERT, the unsupervised AutoROCC achieves almost the same or better performance in the majority of the domains, except Wiki articles and News. We conjecture this happens because the BERT language model was trained on a large text corpus drawn from these two domains. Importantly, however, AutoROCC is more robust across domains that differ from these two, since it is an unsupervised approach that is not tuned for any specific domain.

The ARC dataset does not provide justification sentences, so we instead ask how well our question-answering models do on a related inference task, the SciTail entailment dataset. We trained three QA classifiers on the ARC dataset: BERT with no justifications, BERT with BM25 (k = 4) justifications, and BERT with AutoROCC justifications. Tested on SciTail, these achieved 64.49%, 69.70%, and 73.46% accuracy, respectively, indicating that AutoROCC's knowledge aggregation is a valid proxy for entailment.


Ablation Analysis

Table 5 shows an ablation of the different components of ROCC. Row 0 reports the score of the full AutoROCC model. In row 1, we remove the IDF weights from the coverage calculations (see eq. (4)) for both question and answer text. In rows 2, 3, and 4, we remove the coverage of the answer, the coverage of the question, and the overlap from the ROCC formula (see eq. (1)), respectively. In all cases, we found small drops in both QA performance and justification scores across both datasets, with the removal of either C(A) or C(Q) having the largest impact.

Error Analysis
We analyzed ROCC's justification selection performance on three different types of questions in MultiRC: True/False/Yes/No, Verbatim, and Non-verbatim (Khashabi et al., 2018b). As shown in Table 6, AutoROCC achieves higher recall on Verbatim questions, where the answer text is likely to appear within the given justification passage, and lower recall on question types where such overlap does not exist, e.g., Non-verbatim and True/False. This suggests that the C(A) component of ROCC is important for the extraction of meaningful justifications.

Alignment ROCC
To understand the dependence of ROCC on exact lexical match, we compare the justification selection performance of ROCC when its score components are computed based on lexical match (the approach used throughout the paper up to this point) vs. the semantic alignment match of Yadav et al. (2018). The latter approach relaxes the requirement for exact lexical match: two tokens are considered to be matched when the cosine similarity of their embedding vectors is larger than 0.95. As shown in Table 7, the alignment-based ROCC indeed performs better than the ROCC that relies on lexical match. However, the improvements are not large, e.g., the maximum improvement is 1.6% (when k = 4), which indicates that ROCC is robust, to a certain extent, to lexical variation.
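The relaxed token-match criterion can be sketched as follows, using plain cosine similarity over embedding vectors (the vectors themselves would come from pretrained word embeddings, which we do not reproduce here):

```python
import math

def cosine(u, v):
    # Cosine similarity between two dense vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def tokens_aligned(u, v, threshold=0.95):
    # Two tokens count as matched when the cosine similarity of their
    # embedding vectors exceeds the threshold (0.95 in the text).
    return cosine(u, v) >= threshold
```

With this predicate substituted for exact string equality, the overlap and coverage components of ROCC can credit near-synonyms instead of requiring identical surface forms.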

Conclusion
We introduced ROCC, a simple unsupervised approach for selecting justification sentences for question answering, which balances the relevance of the selected sentences, the overlap between them, and their coverage of the question and answer. We coupled this method with a state-of-the-art BERT-based supervised question answering system, and achieved new state-of-the-art results on the MultiRC and ARC datasets among approaches that do not use external resources during training. We showed that ROCC-based QA approaches are more robust across domains, and generalize better to related tasks such as entailment. In the future, we envision that ROCC scores could be used as a distant supervision signal to train supervised justification selection methods.