Answering while Summarizing: Multi-task Learning for Multi-hop QA with Evidence Extraction

Question answering (QA) using textual sources for purposes such as reading comprehension (RC) has attracted much attention. This study focuses on the task of explainable multi-hop QA, which requires the system to return the answer along with evidence sentences by reasoning over and gathering disjoint pieces of the reference texts. It proposes the Query Focused Extractor (QFE) model for evidence extraction and trains it together with the QA model via multi-task learning. QFE is inspired by extractive summarization models; whereas the existing method extracts each evidence sentence independently, QFE sequentially extracts evidence sentences by using an RNN with an attention mechanism on the question sentence. This enables QFE to consider the dependency among the evidence sentences and to cover important information in the question sentence. Experimental results show that QFE with a simple RC baseline model achieves a state-of-the-art evidence extraction score on HotpotQA. Although designed for RC, it also achieves a state-of-the-art evidence extraction score on FEVER, which is a recognizing textual entailment task on a large textual database.


Introduction
Reading comprehension (RC) is a task that uses textual sources to answer any question. It has seen significant progress since the publication of numerous datasets such as SQuAD (Rajpurkar et al., 2016). To achieve the goal of RC, systems must be able to reason over disjoint pieces of information in the reference texts. Recently, multi-hop question answering (QA) datasets focusing on this capability, such as QAngaroo (Welbl et al., 2018) and HotpotQA (Yang et al., 2018), have been released.

Figure 1: Concept of explainable multi-hop QA. Given a question and multiple textual sources, the system extracts evidence sentences from the sources and returns the answer and the evidence.
Multi-hop QA faces two challenges. The first is the difficulty of reasoning: it is difficult for the system to find the disjoint pieces of information that serve as evidence and to reason using multiple such pieces. The second challenge is interpretability: the evidence used to reason is not necessarily located close to the answer, so it is difficult for users to verify the answer. Yang et al. (2018) released HotpotQA, an explainable multi-hop QA dataset, as shown in Figure 1. Explainable multi-hop QA provides the evidence sentences of the answer for supervised learning. The capability to explicitly extract evidence is an advance towards meeting the above two challenges.
Here, we propose Query Focused Extractor (QFE), which is based on a summarization model. We regard evidence extraction in explainable multi-hop QA as a query-focused summarization task. Query-focused summarization is the task of summarizing the source document with regard to the given query. QFE sequentially extracts the evidence sentences by using an RNN with an attention mechanism on the question sentence, whereas the existing method extracts each evidence sentence independently. This enables QFE to consider the dependency among the evidence sentences and to cover the important information in the question sentence. Our overall model uses multi-task learning with a QA model for answer selection and QFE for evidence extraction. The multi-task learning with QFE is general in that it can be combined with any QA model.
Moreover, we find that the recognizing textual entailment (RTE) task on a large textual database, FEVER (Thorne et al., 2018), can be regarded as an explainable multi-hop QA task. We confirm that QFE effectively extracts the evidence on both HotpotQA for RC and FEVER for RTE.
Our main contributions are as follows.
• We propose QFE for explainable multi-hop QA. We train the QA model for answer selection and QFE for evidence extraction via multi-task learning.
• QFE adaptively determines the number of evidence sentences by considering the dependency among the evidence sentences and the coverage of the question.
• QFE achieves state-of-the-art performance on both HotpotQA and FEVER in terms of the evidence extraction score and comparable performance to competitive models in terms of the answer selection score. QFE is the first model to outperform the baseline on HotpotQA.

Task Definition
Here, we re-define explainable multi-hop QA so that it covers both the RC and RTE tasks.

Def. 1. Explainable Multi-hop QA
Input: Context $C$ (multiple texts), Query $Q$ (text)
Output: Answer type $A_T$, Answer string $A_S$, Evidence $E$

The context $C$ is regarded as one connected text in the model. If the connected $C$ is too long (e.g., over 2000 words), it is truncated. The model answers the query $Q$ with an answer type $A_T$ or an answer string $A_S$. The answer type $A_T$ is selected from the answer candidates, such as 'Yes'; the answer candidates depend on the task setting. The answer string $A_S$ exists only if the answer candidates are not enough to answer $Q$, and it is a short span in $C$. The evidence $E$ consists of sentences in $C$ and is required to answer $Q$.
For RC, we tackle HotpotQA. In HotpotQA, the answer candidates are 'Yes', 'No', and 'Span'. The answer string $A_S$ exists if and only if $A_T$ is 'Span'. The context $C$ consists of ten Wikipedia paragraphs, and the evidence $E$ consists of two or more sentences in $C$.
For RTE, we tackle FEVER. In FEVER, the answer candidates are 'Supports', 'Refutes', and 'Not Enough Info'. The answer string $A_S$ does not exist. The context $C$ is the Wikipedia database. The evidence $E$ consists of sentences in $C$.
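As an illustration of this task format, the following minimal Python sketch shows one way an explainable multi-hop QA instance could be represented; the class and field names are illustrative and not part of either dataset's official format.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class MultiHopQAInstance:
    # Context C: the sentences of the connected context text.
    context_sentences: List[str]
    # Query Q.
    query: str
    # Answer type A_T, e.g., 'Yes'/'No'/'Span' (HotpotQA) or 'Supports'/'Refutes'/'Not Enough Info' (FEVER).
    answer_type: str
    # Answer string A_S: a short span of C; None when the answer candidates suffice (as in FEVER).
    answer_string: Optional[str]
    # Evidence E: indices of the sentences in C required to answer Q.
    evidence_sentence_ids: List[int]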

Proposed Method
This section first explains the overall model architecture, which contains QFE as a module, and then gives the details of QFE.

Model Architecture
Except for the evidence layer, our model is the same as the baseline (Clark and Gardner, 2018) used in HotpotQA (Yang et al., 2018). Figure 2 shows the model architecture. The input of the model is the context C and the query Q. The model has the following layers.
The Word Embedding Layer encodes $C$ and $Q$ into sequences of word vectors. A word vector is the concatenation of a pre-trained word embedding and a character-based embedding obtained using a CNN (Kim, 2014). The outputs are $C^1 \in \mathbb{R}^{l_w \times d_w}$ and $Q^1 \in \mathbb{R}^{m_w \times d_w}$, where $l_w$ is the length (in words) of $C$, $m_w$ is the length of $Q$, and $d_w$ is the size of a word vector.
The Context Layer encodes $C^1$ and $Q^1$ into contextual vectors $C^2 \in \mathbb{R}^{l_w \times 2d_c}$ and $Q^2 \in \mathbb{R}^{m_w \times 2d_c}$ by using a bi-directional RNN (Bi-RNN), where $d_c$ is the output size of a uni-directional RNN.
The Matching Layer encodes $C^2$ and $Q^2$ into matching vectors $C^3 \in \mathbb{R}^{l_w \times d_c}$ by using bi-directional attention (Seo et al., 2017), a Bi-RNN, and self-attention (Wang et al., 2017).
The Evidence Layer first encodes $C^3$ into word-level vectors by a Bi-RNN. Let $j_1(i)$ be the index of the first word of the $i$-th sentence in $C$ and $j_2(i)$ be the index of the last word. The vector of the $i$-th sentence is built from the Bi-RNN outputs at positions $j_1(i)$ and $j_2(i)$. Here, $X \in \mathbb{R}^{l_s \times 2d_c}$ denotes the sentence-level context vectors, where $l_s$ is the number of sentences in $C$.
QFE, described later, receives the sentence-level context vectors $X \in \mathbb{R}^{l_s \times 2d_c}$ and the contextual query vectors $Q^2 \in \mathbb{R}^{m_w \times 2d_c}$ as $Y$. QFE outputs the probability distribution that the $i$-th sentence is evidence, given by equation (1). Then, the evidence layer concatenates the word-level vectors and the sentence-level vectors into $C^5$, where the $j$-th word in $C$ is paired with the vector of the $i(j)$-th sentence that contains it.
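As a rough illustration of the evidence layer, the following PyTorch-style sketch builds sentence-level vectors from the word-level outputs and broadcasts them back to the word level; since the exact sentence-vector formula is not reproduced above, the sketch simply max-pools the word vectors within each sentence, which keeps the stated $2d_c$ dimensionality. All function names are illustrative.

import torch

def sentence_vectors(c4, first_idx, last_idx):
    # c4: word-level outputs of the evidence-layer Bi-RNN, shape (l_w, 2*d_c).
    # first_idx[i], last_idx[i]: the word indices j_1(i) and j_2(i) of the i-th sentence.
    # Assumption: max-pooling over the words of a sentence stands in for the paper's
    # exact sentence-vector definition, which is not shown here.
    xs = [c4[s:e + 1].max(dim=0).values for s, e in zip(first_idx, last_idx)]
    return torch.stack(xs)  # X, shape (l_s, 2*d_c)

def concat_word_and_sentence(c4, x, sent_of_word):
    # Evidence-layer output C^5: each word vector is concatenated with the vector
    # of the sentence i(j) that contains it.
    # sent_of_word: LongTensor of shape (l_w,) mapping word j -> sentence i(j).
    return torch.cat([c4, x[sent_of_word]], dim=-1)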

The Answer Layer predicts the answer type $A_T$ and the answer string $A_S$ from $C^5$. The layer has stacked Bi-RNNs, and the output of each Bi-RNN is mapped to a probability distribution by a fully connected layer and the softmax function.
For RC, the layer has three stacked Bi-RNNs. The resulting probabilities indicate the start of the answer string, $\hat{A}_{S1} \in \mathbb{R}^{l_w}$, the end of the answer string, $\hat{A}_{S2} \in \mathbb{R}^{l_w}$, and the answer type, $\hat{A}_T \in \mathbb{R}^{|A_T|}$. For RTE, the layer has one Bi-RNN, and the probability indicates the answer type.
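For concreteness, here is a minimal sketch of an RC answer layer of this shape: three stacked Bi-RNNs whose outputs feed the start, end, and answer-type heads. The GRU choice, the head wiring, and the max-pooling for the type head are assumptions, not the paper's exact design.

import torch
from torch import nn

class AnswerLayer(nn.Module):
    def __init__(self, d_in: int, d_c: int, n_types: int):
        super().__init__()
        self.rnns = nn.ModuleList([
            nn.GRU(d_in if i == 0 else 2 * d_c, d_c, bidirectional=True, batch_first=True)
            for i in range(3)
        ])
        self.start_head = nn.Linear(2 * d_c, 1)       # -> \hat{A}_{S1}
        self.end_head = nn.Linear(2 * d_c, 1)         # -> \hat{A}_{S2}
        self.type_head = nn.Linear(2 * d_c, n_types)  # -> \hat{A}_T

    def forward(self, c5):
        # c5: (batch, l_w, d_in) word-level vectors from the evidence layer.
        h1, _ = self.rnns[0](c5)
        h2, _ = self.rnns[1](h1)
        h3, _ = self.rnns[2](h2)
        start = torch.softmax(self.start_head(h1).squeeze(-1), dim=-1)
        end = torch.softmax(self.end_head(h2).squeeze(-1), dim=-1)
        # Assumption: the type head reads a max-pooled summary of the last Bi-RNN.
        a_type = torch.softmax(self.type_head(h3.max(dim=1).values), dim=-1)
        return start, end, a_type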
Loss Function: Our model uses multi-task learning with a loss function $L = L_A + L_E$, where $L_A$ is the loss of the answer and $L_E$ is the loss of the evidence. The answer loss $L_A$ is the sum of the cross-entropy losses for all probability distributions obtained by the answer layer. The evidence loss $L_E$ is defined below in the Training Phase subsection.
Figure 3: Overview of Query Focused Extractor at step $t$. $z^t$ is the current summarization vector, $g^t$ is the query vector considering the current summarization, $e_t$ is the extracted sentence, and $x_{e_t}$ updates the RNN state.

Query Focused Extractor
Query Focused Extractor (QFE) is shown as the red box in Figure 2. QFE is an extension of the extractive summarization model of Chen and Bansal (2018), which is not designed for query-focused settings. Chen and Bansal used an attention mechanism to extract sentences from the source document such that the summary covers the important information in the source document. To focus on the query, QFE extracts sentences from $C$ with attention on $Q$ such that the evidence covers the important information with respect to $Q$. Figure 3 shows an overview of QFE. The inputs of QFE are the sentence-level context vectors $X \in \mathbb{R}^{l_s \times 2d_c}$ and the contextual query vectors $Y \in \mathbb{R}^{m_w \times 2d_c}$. We define a timestep as the operation of extracting one sentence. At step $t$, QFE updates the state of the RNN (the dark blue box in Figure 3) with the vector $x_{e_t}$ of the extracted sentence, where $e_t \in \{1, \cdots, l_s\}$ is the index of the sentence extracted at step $t$. We define $E^t = \{e_1, \cdots, e_t\}$ to be the set of sentences extracted up to step $t$.
QFE extracts the $i$-th sentence according to the probability distribution $\Pr(i; E^{t-1})$ (the light blue box).
Then, QFE selects $e_t = \mathrm{argmax}_i \Pr(i; E^{t-1})$. Let $g^t$ be a query vector that takes the importance at step $t$ into account; we define $g^t$ as the glimpse vector (Vinyals et al., 2016) (the green box). The initial state of the RNN is the vector obtained via a fully connected layer and max pooling from $X$. All parameters $W_{\cdot} \in \mathbb{R}^{2d_c \times 2d_c}$ and $v_{\cdot} \in \mathbb{R}^{2d_c}$ are trainable.
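Since the glimpse and extraction formulas are not reproduced above, the following PyTorch-style sketch only illustrates the general shape of one QFE step: an additive attention (glimpse) over the query vectors conditioned on the RNN state, followed by an additive score over the sentence vectors conditioned on the state and the glimpse. The exact parameterization in the paper may differ.

import torch
from torch import nn

class QFEStep(nn.Module):
    # One extraction step of QFE (illustrative parameterization).
    def __init__(self, d: int):
        super().__init__()
        self.rnn = nn.GRUCell(d, d)                  # updates the state z^t
        self.wg1, self.wg2 = nn.Linear(d, d), nn.Linear(d, d)
        self.vg = nn.Linear(d, 1, bias=False)        # glimpse scores over Y
        self.wp1, self.wp2, self.wp3 = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.vp = nn.Linear(d, 1, bias=False)        # extraction scores over X

    def forward(self, z, x_prev, X, Y, extracted_mask):
        # z: (d,) previous RNN state; x_prev: (d,) vector of the last extracted sentence.
        z = self.rnn(x_prev.unsqueeze(0), z.unsqueeze(0)).squeeze(0)
        # Glimpse over the query vectors Y (m_w, d), conditioned on z.
        a = self.vg(torch.tanh(self.wg1(Y) + self.wg2(z))).squeeze(-1)
        alpha = torch.softmax(a, dim=0)
        g = (alpha.unsqueeze(-1) * Y).sum(dim=0)     # glimpse vector g^t
        # Extraction distribution Pr(i; E^{t-1}) over the sentences X (l_s, d),
        # masking out already-extracted sentences.
        s = self.vp(torch.tanh(self.wp1(X) + self.wp2(g) + self.wp3(z))).squeeze(-1)
        s = s.masked_fill(extracted_mask, float('-inf'))
        return torch.softmax(s, dim=0), z, alpha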

Training Phase
In the training phase, we use teacher forcing to compute the loss function. The evidence loss $L_E$ is the negative log-likelihood regularized by a coverage mechanism (See et al., 2017). The max operation in the first term of $L_E$ allows the sentence with the highest probability to be extracted, because the evidence does not have a prescribed order of extraction. The coverage vector is defined as $c^t = \sum_{\tau=1}^{t-1} \alpha^{\tau}$. In order to learn the terminal condition of the extraction, QFE adds a dummy sentence, called the EOE sentence, to the sentence set. When the EOE sentence is extracted, QFE terminates the extraction. The EOE sentence vector $x_{\mathrm{EOE}} \in \mathbb{R}^{2d_c}$ is a trainable parameter of the model, so $x_{\mathrm{EOE}}$ is independent of the samples. We train the model to extract the EOE sentence after all the evidence has been extracted.
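A minimal sketch of the evidence loss under teacher forcing, following the description above: at each step the still-unextracted gold sentence with the highest probability is credited, and the glimpse attention is regularized with a coverage penalty as in See et al. (2017). The unit weight on the coverage term and the EOE handling are assumptions.

import torch

def evidence_loss(step_probs, step_alphas, gold_targets):
    # step_probs: list over steps t of (l_s + 1,) extraction distributions (teacher forcing).
    # step_alphas: list over steps t of (m_w,) glimpse attention weights.
    # gold_targets: set of gold evidence sentence indices plus the EOE index as the final target.
    remaining = set(gold_targets)
    coverage = torch.zeros_like(step_alphas[0])      # c^t = sum of past alphas
    loss = step_probs[0].new_zeros(())
    for probs, alpha in zip(step_probs, step_alphas):
        # Credit the still-unextracted gold sentence with the highest probability,
        # because the evidence has no prescribed extraction order.
        idx = max(remaining, key=lambda i: probs[i].item())
        loss = loss - torch.log(probs[idx] + 1e-12)
        # Coverage penalty (See et al., 2017): sum_j min(alpha_j^t, c_j^t).
        loss = loss + torch.minimum(alpha, coverage).sum()
        coverage = coverage + alpha
        remaining.discard(idx)
    return loss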

Test Phase
In the test phase, QFE terminates the extraction when it reaches the EOE sentence. The predicted evidence is defined as $\hat{E}$, where $\hat{E}^t$ is the predicted evidence up to step $t$. QFE uses a beam search algorithm to find $\hat{E}$.
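The paper uses beam search at test time; the sketch below shows the simpler greedy variant of the same loop, which already illustrates the adaptive termination: extraction stops as soon as the appended EOE sentence is selected. Here, step is any callable with the interface of the QFEStep sketch above, and the initial input vector is an assumption.

import torch

def extract_evidence(step, z0, X, Y, x_eoe, max_steps=10):
    # step: one QFE extraction step; z0: initial RNN state (obtained from X via an FC layer
    # and max pooling); x_eoe: the trainable EOE sentence vector.
    X_ext = torch.cat([X, x_eoe.unsqueeze(0)], dim=0)   # append the EOE sentence
    eoe = X_ext.size(0) - 1
    mask = torch.zeros(X_ext.size(0), dtype=torch.bool)
    z, x_prev, evidence = z0, X_ext.mean(dim=0), []     # initial input: assumption
    for _ in range(max_steps):
        probs, z, _ = step(z, x_prev, X_ext, Y, mask)
        e_t = int(probs.argmax())
        if e_t == eoe:                                   # adaptive termination
            break
        evidence.append(e_t)
        mask[e_t] = True                                 # never extract a sentence twice
        x_prev = X_ext[e_t]
    return evidence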

HotpotQA Dataset
In HotpotQA, the query $Q$ is created by crowd workers, on the condition that answering $Q$ requires reasoning over two paragraphs in Wikipedia. The candidates of $A_T$ are 'Yes', 'No', and 'Span'. The answer string $A_S$, if it exists, is a span in the two paragraphs. The context $C$ consists of ten paragraphs, and its content depends on the setting. In the distractor setting, $C$ consists of the two gold paragraphs used to create $Q$ and eight paragraphs retrieved from Wikipedia by using TF-IDF with $Q$. In the fullwiki setting, all ten paragraphs of $C$ are retrieved paragraphs. Hence, $C$ may not include the two gold paragraphs, and in that case, $A_S$ and $E$ cannot be extracted; therefore, even the oracle model does not achieve 100% accuracy. Table 1 shows the statistics of the distractor setting.

Experimental Setup
Our baseline model is the same as the baseline in Yang et al. (2018) except as follows. Whereas we use equation (1), they score each sentence independently with trainable parameters $w \in \mathbb{R}^{2d_c}$ and $b \in \mathbb{R}$. Their evidence loss $L_E$ is the sum of binary cross-entropy losses on whether each sentence is evidence or not. In the test phase, the sentences with probabilities higher than a threshold are selected; we set the threshold to 0.4 because it gave the highest F1 score on the development set. The remaining parts of the implementations of our model and the baseline are the same. The details are in Appendix A.1.

We evaluated the prediction of $A_T$, $A_S$, and $E$ by using the official metrics of HotpotQA. Exact match (EM) and partial match (F1) were used to evaluate both the answer and the evidence. For the answer evaluation, the score was measured by the classification accuracy of $A_T$; only if $A_T$ is 'Span' was the score also measured by the word-level matching of $A_S$. For the evidence, the partial match was evaluated by sentence ids, so word-level partial matches were not considered. For metrics on both the answer and the evidence, we used Joint EM and Joint F1 (Yang et al., 2018).
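For contrast with QFE's sequential extraction, the following is a sketch of the baseline's independent per-sentence extraction as described above: a linear score per sentence turned into a probability and thresholded at 0.4. The use of a sigmoid is an assumption consistent with the binary cross-entropy loss.

import torch
from torch import nn

class IndependentEvidenceClassifier(nn.Module):
    # Baseline: scores each sentence independently of the others.
    def __init__(self, d: int):
        super().__init__()
        self.linear = nn.Linear(d, 1)   # parameters w in R^{2 d_c} and b in R

    def forward(self, X):
        # X: (l_s, d) sentence vectors -> per-sentence evidence probabilities.
        return torch.sigmoid(self.linear(X)).squeeze(-1)

def select_evidence(probs, threshold=0.4):
    # Test-time rule above: keep every sentence whose probability exceeds the
    # threshold tuned on the development set.
    return [i for i, p in enumerate(probs.tolist()) if p > threshold]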

Results
Does our model achieve state-of-the-art performance? Table 2 shows that, in the distractor setting, QFE performed the best in terms of the evidence extraction score among all competing models, including the most recent unpublished ones. It also achieved comparable performance in terms of the answer selection score and therefore achieved state-of-the-art performance on the joint EM and F1 metrics, which are the main metrics of the dataset. QFE outperformed the published baseline model in all metrics. Although our model does not use any pre-trained language model such as BERT (Devlin et al., 2019) for encoding, it outperformed the methods that used BERT. In particular, the improvement in the evidence EM score was +37.5 points against the baseline and +5.4 points against GRN.
In the fullwiki setting, Table 3 shows that QFE achieved state-of-the-art performance in all metrics among the published models. Among the unpublished models, Cognitive Graph outperformed our model. Our current QA and QFE models do not explicitly consider answering unanswerable questions; our future work will address such questions.
Does QFE contribute to the performance? Table 4 shows the results of the ablation study. QFE performed the best among the models compared. Although the only difference between QFE and the baseline is the evidence extraction model, the answer scores also improved. QFE also outperformed the model that used only RNN extraction without the glimpse operation.

Table 4: Performance of our models and the baseline models on the development set in the distractor setting.
QFE defines the terminal condition as reaching the EOE sentence, which we call adaptive termination. We confirmed that the adaptive termination of QFE contributed to its performance. We compared QFE with a baseline that extracts the two sentences with the highest scores, since the most frequent number of evidence sentences is two. QFE outperformed this baseline.
What are the characteristics of our evidence extraction? Table 5 shows the evidence extraction performance in the distractor setting. Our model improves both precision and recall, and the improvement in precision is larger. Figure 4 reveals the reason for the high EM and precision scores: QFE rarely extracts too much evidence. That is, it predicts the number of evidence sentences more accurately than the baseline. Table 5 also shows that the correlation between the predicted and gold numbers of evidence sentences is higher for our model than for the baseline.
We consider that the sequential extraction and the adaptive termination help to prevent over-extraction. In contrast, the baseline evaluates each sentence independently, so it often extracts too much evidence.

Table 6: Performance of our model in terms of the number of gold evidence sentences on the development set in the distractor setting. # sample, Num, P, and R denote the proportion in the dataset, the number of predicted evidence sentences, precision, and recall, respectively.
What questions in HotpotQA are difficult for QFE? We analyzed the difficulty of the questions for QFE from the perspectives of the number of evidence sentences and the reasoning type; the results are in Table 6 and Table 7. First, we classified the questions by the number of gold evidence sentences. Table 6 shows the model performance for each number. The answer scores were low for the questions answered with five evidence sentences, which indicates that questions requiring much evidence are difficult. However, the five-evidence questions amount to only 80 samples, so this observation needs to be confirmed with more analysis. QFE performed well when the number of gold evidence sentences was two. Even though QFE was relatively conservative about extracting many evidence sentences, it was able to extract more than two sentences adaptively.

Second, we examined the reasoning types shown in Table 7. HotpotQA has two reasoning types: entity bridge and entity comparison. Entity bridge means that the question mentions one entity and the article about that entity contains another entity required for the answer. Entity comparison means that the question compares two entities. Table 7 shows that QFE works well on both reasoning types.
Qualitative Analysis. Table 8 shows an example of the behavior of QFE. In it, the system must compare the number of members of Kitchens of Distinction with that of Royal Blood. The system extracted the two sentences describing the numbers of members. Then, the system extracted the EOE sentence.
We should note two sentences that were not extracted. The first sentence includes 'members' and 'Kitchens of Distinction', which are included in the query; however, it does not mention the number of members of Kitchens of Distinction. The second sentence also shows that Royal Blood is a duo; however, our model preferred the sentence whose subject is Royal Blood (the band) over the one whose subject is Royal Blood (the album).
Other examples are shown in Appendix A.2.

FEVER Dataset
In FEVER, the query $Q$ is created by crowd workers. Annotators are given a randomly sampled sentence and a corresponding dictionary. The given sentence is from Wikipedia. Each key-value pair of the dictionary consists of an entity and a description of the entity; the entities are those that have a hyperlink from the given sentence, and the description is the first sentence of the entity's Wikipedia page. Using only the information in the sentence and the dictionary, annotators create a claim as $Q$. The candidates of $A_T$ are 'Supports', 'Refutes', and 'Not Enough Info (NEI)'. Among the samples whose label is not 'NEI', the proportion with more than one evidence sentence is 27.3%. The context $C$ is the Wikipedia database, which is shared among all samples. Table 9 shows the statistics.

Experimental Setup
Because $C$ is large, we used the NSMN document retriever (Nie et al., 2019) and gave only the top five paragraphs to our model. Similar to NSMN, in order to capture semantic and numeric relationships, we used 30-dimensional WordNet features and five-dimensional number embeddings. The WordNet features are binary features reflecting the existence of hypernymy/antonymy words in the input. The number embedding is a real-valued embedding assigned to each unique number. Because the training data are imbalanced with respect to the answer type $A_T$, randomly selected samples were copied in order to equalize the class sizes. Our model used an ensemble of 11 randomly initialized models. For the evidence extraction, we used the union of the evidence predicted by each model. If the model predicts $A_T$ as 'Supports' or 'Refutes', it extracts at least one sentence. Details of the implementation are in Appendix A.1.
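A sketch of the ensemble evidence rule described above: the final evidence is the union of each member's prediction, and at least one sentence is output when the predicted label is 'Supports' or 'Refutes'. The fallback of taking each member's top-ranked sentence when the union is empty is an assumption.

def combine_ensemble_evidence(per_model_evidence, per_model_top_sentence, predicted_label):
    # per_model_evidence: one set of predicted sentence ids per ensemble member.
    # per_model_top_sentence: each member's highest-scoring sentence id.
    evidence = set().union(*per_model_evidence)
    if predicted_label in ('Supports', 'Refutes') and not evidence:
        # Verifiable claims must come with at least one evidence sentence;
        # this particular fallback is an assumption.
        evidence = set(per_model_top_sentence)
    return sorted(evidence)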
We evaluated the prediction of $A_T$ and the evidence $E$ by using the official metrics of FEVER. $A_T$ was evaluated in terms of label accuracy. $E$ was evaluated in terms of precision, recall, and F1, which were measured by sentence ids. The FEVER score was used as a metric accounting for both $A_T$ and $E$. The FEVER score of a sample is 1 if the predicted evidence includes all gold evidence and the answer is correct; that is, the FEVER score emphasizes the recall of evidence extraction over the precision.
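The FEVER score rule stated above can be written directly; this sketch scores a single sample under the simplifying assumption of one gold evidence set per sample.

def fever_score(pred_label, gold_label, pred_evidence, gold_evidence):
    # 1 iff the label is correct and the predicted evidence contains all gold
    # evidence sentences; this strict containment condition is why recall matters most.
    if pred_label != gold_label:
        return 0
    if gold_label != 'Not Enough Info' and not set(gold_evidence) <= set(pred_evidence):
        return 0
    return 1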

Results
Does our multi-task learning approach achieve state-of-the-art performance? Table 10 shows that QFE achieved state-of-the-art performance in terms of evidence F1 and comparable performance in terms of label accuracy to the competing models. The FEVER score of our model is lower than those of the other models because the FEVER score emphasizes recall. However, the relative importance of precision and recall depends on the use case; QFE is suited to situations where concise output is preferred.
What are the characteristics of our evidence extraction?


Related Work


RC
Open-domain QA has been studied on large-scale datasets such as MS MARCO (Nguyen et al., 2016) and TriviaQA (Joshi et al., 2017). For such datasets, a document retrieval model is combined with a context-query matching model (Wang et al., 2018a,b; Nishida et al., 2018). Some techniques have been proposed for understanding multiple texts. Clark and Gardner (2018) used simple methods, such as connecting texts; this observation is one of the motivations behind multi-hop QA. HotpotQA (Yang et al., 2018) is a task that includes supervised evidence extraction. QAngaroo (Welbl et al., 2018) is a task created by using Wikipedia entity links. The difference between QAngaroo and our focus is two-fold: (1) QAngaroo does not have supervised evidence, and (2) the questions in QAngaroo are inherently limited because the dataset is constructed using a knowledge base.


RTE
RTE on relatively small datasets (Bowman et al., 2015; Williams et al., 2018) is performed by sentence matching (Rocktäschel et al., 2016). FEVER (Thorne et al., 2018) has the aim of verification and fact checking for RTE on a large database. FEVER requires three subtasks: document retrieval, evidence extraction, and answer prediction. In previous work, these subtasks are performed by pipelined models (Nie et al., 2019; Yoneda et al., 2018). In contrast, our approach solves evidence extraction and answer prediction simultaneously by regarding FEVER as explainable multi-hop QA.

Summarization
A typical approach to sentence-level extractive summarization has an encoder-decoder architecture (Cheng and Lapata, 2016; Nallapati et al., 2017; Narayan et al., 2018). Sentence-level extractive summarization is also used for content selection in abstractive summarization (Chen and Bansal, 2018). Their model extracts sentences in the order of importance and edits them. We extend this model for evidence extraction because we consider that the evidence should be extracted in order of importance, rather than in the original document order as in the conventional models.

Conclusion
We consider that the main contributions of our study are (1) the QFE model, based on a summarization model, for explainable multi-hop QA; (2) the modeling of the dependency among the evidence sentences and the coverage of the question, enabled by the summarization approach; and (3) state-of-the-art performance in evidence extraction on both the RC and RTE tasks.
Regarding RC, we confirmed that the architecture with QFE, which is a simple replacement of the baseline's evidence extraction, achieved state-of-the-art performance in the distractor setting. The ablation study showed that replacing the evidence extraction model with QFE contributes to the performance. Our adaptive termination contributes to the EM and precision scores of the evidence extraction. The difficulty of the questions for QFE depends on the number of required evidence sentences. This study is the first to base its experimental discussion on HotpotQA.
Regarding RTE, we confirmed that, compared with competing models, the architecture with QFE has a higher evidence extraction score and a comparable label prediction score. This study is the first to show a joint approach for both RC and FEVER.