Multi-hop Inference for Question-driven Summarization

Question-driven summarization has been recently studied as an effective approach to summarizing the source document to produce concise but informative answers for non-factoid questions. In this work, we propose a novel question-driven abstractive summarization method, Multi-hop Selective Generator (MSG), to incorporate multi-hop reasoning into question-driven summarization and, meanwhile, provide justifications for the generated summaries. Specifically, we jointly model the relevance to the question and the interrelation among different sentences via a human-like multi-hop inference module, which captures important sentences for justifying the summarized answer. A gated selective pointer generator network with a multi-view coverage mechanism is designed to integrate diverse information from different perspectives. Experimental results show that the proposed method consistently outperforms state-of-the-art methods on two non-factoid QA datasets, namely WikiHow and PubMedQA.


Introduction
Recent years have witnessed several attempts to explore question-driven summarization, which aims at summarizing the source document with respect to a specific question to produce a concise but informative answer in non-factoid question answering (QA) (Tomasoni and Huang, 2010; Chan et al., 2012; Song et al., 2017). Unlike factoid QA (Rajpurkar et al., 2016), e.g., "Who is the author of Harry Potter?", whose answer is generally a single phrase or a short sentence with limited information, the answers to non-factoid questions are supposed to be more informative, involving detailed analysis to explain or justify the final answers, such as questions in community QA (Ishida et al., 2018; Deng et al., 2020a) or explainable QA (Fan et al., 2019; Nakatsuji and Okui, 2020). As shown by the example from PubMedQA (Jin et al., 2019) in Figure 1, the answer can be regarded as a summary of the document driven by the reasoning process for the given question.
Most related studies focus on query-based summarization approaches that summarize the query-related content from the source document (Shen and Li, 2011; Wang et al., 2013; Cao et al., 2016; Nema et al., 2017). However, these approaches fall short of tackling the question-driven summarization problem in the QA scenario, since query-based summarization is typically based on semantic relevance measurement without a careful reasoning or inference process, which is essential to question-driven summarization. Currently, question-driven summarization is mainly explored with traditional information retrieval methods that select sentences from the source document to construct the final answer (Wang et al., 2014; Song et al., 2017; Yulianti et al., 2018), and these methods rely heavily on hand-crafted features or tedious multi-stage pipelines. Besides, compared to extractive summarization (Cao et al., 2016), abstractive methods (Nema et al., 2017) can produce more coherent and logical summaries to answer the given question. To this end, we study question-driven abstractive summarization, which generates answers in natural form by summarizing the source document with respect to a specific question.
To tackle question-driven abstractive summarization, the content selection process for summarization is not only determined by the semantic relevance to the given question, but also requires a human-like reasoning and inference process that considers the content interrelationship comprehensively and carefully across the whole source text when generating the summary. For instance, in Figure 1, given the specific question, several highlighted sentences must be attended to when summarizing in order to generate the answer. This makes it necessary to measure the importance of each sentence, instead of regarding the source text as an undifferentiated whole. Among these highlighted sentences, only the italic sentences are directly related to the given question, while the other highlighted sentences need to be inferred from their interrelationships with other sentences. In other words, the generated summary is likely to lose important information if we only focus on the content that is semantically relevant to the given question. Moreover, it can be observed that one-time inference is sometimes insufficient for collecting all the information required to produce a summary. In this example, the answer is summarized from both the 1st-hop and 3rd-hop inference sentences in the document, indicating the importance of multi-hop reasoning for content selection in question-driven summarization.

Figure 1: An example from PubMedQA. The highlighted sentences illustrate the inference process when humans answer the given question. Italic represents sentences that directly match the question. Underlined and wavy-underlined represent sentences inferred by 2nd-hop and 3rd-hop reasoning, respectively, to justify the answer.
In this work, we propose a question-driven abstractive summarization model, namely Multi-hop Selective Generator (MSG), which incorporates multi-hop inference to summarize abstractive answers over the source document for non-factoid questions. Concretely, the document is regarded as a hierarchical text structure whose importance is assessed at both the word and sentence levels for content selection. Then we develop a multi-hop inference module to enable human-like multi-hop reasoning in question-driven summarization, which considers the semantic relevance to the question as well as the information consistency among different sentences. Finally, a gated selective pointer generator network with a multi-view coverage mechanism is proposed to generate a concise but informative summary as the answer to the given question.
The main contributions of this paper can be summarized as follows: (1) We propose a novel question-driven abstractive summarization model for generating answers in non-factoid QA, which incorporates multi-hop reasoning to infer the important content for facilitating answer generation; (2) We propose a multi-view coverage mechanism to address the repetition issue that arises with the multi-view pointer network and to generate informative answers; (3) Experimental results show that the proposed method achieves state-of-the-art performance on the WikiHow and PubMedQA datasets, and it is able to provide justification sentences as evidence for the answer.

Related Works
Query-based Summarization. Early works on query-based summarization focus on extracting query-related sentences to construct the summary (Lin et al., 2010; Shen and Li, 2011), and were later improved by applying sentence compression to the extracted sentences (Wang et al., 2013; Li and Li, 2014). Recently, some data-driven neural abstractive models have been proposed to generate summaries in natural form with respect to the given query (Nema et al., 2017; Hasselqvist et al., 2017). However, current studies on query-based abstractive summarization are restricted by the lack of large-scale datasets (Baumel et al., 2016; Nema et al., 2017). On the other hand, some researchers have opened a new direction of question-driven summarization in non-factoid QA (Song et al., 2017; Yulianti et al., 2018; Deng et al., 2020b), which requires the ability of reasoning or inference to support summarization, not merely relevance measurement, and which also offers large-scale datasets as testbeds.
Non-factoid Question Answering. Different from factoid QA, which can be tackled by extracting answer spans (Rajpurkar et al., 2016) or generating short sentences (Nguyen et al., 2016; Kociský et al., 2018), non-factoid QA aims at producing relatively informative and complete answers. In past studies, non-factoid QA focused on retrieval-based methods, such as answer sentence selection (Nakov et al., 2015) or answer ranking. Recently, several efforts have been made to tackle long-answer generative question answering over supporting documents, which targets questions that require detailed explanations (Fan et al., 2019). This kind of QA problem contains a large proportion of non-factoid questions, such as "how" or "why" type questions (Koupaee and Wang, 2018; Ishida et al., 2018; Deng et al., 2020a). Besides, some studies aim at generating a conclusion for the concerned question (Jin et al., 2019; Nakatsuji and Okui, 2020). Fan et al. (2019) propose a multi-task Seq2Seq model with the concatenation of the question and support documents to generate long-form answers. Iida et al. (2019) and Nakatsuji and Okui (2020) incorporate background knowledge into Seq2Seq models for why questions and conclusion-centric questions. Some recent works (Feldman and El-Yaniv, 2019; Yadav et al., 2019; Nishida et al., 2019a) attempt to provide evidence or justifications as a human-understandable explanation of the multi-hop inference process in factoid QA, where the inferred evidence is only treated as an intermediate step for finding the answer. However, in non-factoid QA, the intermediate output is also important for forming a complete answer, which requires a bridge between multi-hop inference and summarization.

Proposed Framework
We propose a question-driven abstractive summarization model, namely Multi-hop Selective Generator (MSG). The overview of MSG is depicted in Figure 2. It consists of three main components: (1) Co-attentive Encoder (Section 3.1), (2) Multi-hop Inference Module (Section 3.2), and (3) Gated Selective Generator (Section 3.3). Moreover, a Multi-view Coverage Loss is integrated into the overall training procedure (Section 3.4).

Co-attentive Encoder
Pre-trained word embeddings, $E_q$ and $E_{s_i}$, of the question $q$ and each sentence $s_i$ in the document $D = \{s_1, s_2, \ldots, s_n\}$ are input into the model. We first encode the question and each sentence in the document with a shared Bi-LSTM (Bidirectional Long Short-Term Memory) encoder to learn the word-level contextual information, $H_q, H_{s_i} \in \mathbb{R}^{l \times d_h}$, where $l$ and $d_h$ denote the sentence length and the dimension of the encoder output, respectively. The overall word-level representation $H_d$ for the document is the sequential concatenation $[H_{s_1}, H_{s_2}, \ldots, H_{s_n}]$.
We compute the attention weights to align the word-level information between the question and the document sentences, and obtain the attention-weighted vectors of each word for both the question and the document sentences. For the question $q$ and the $i$-th sentence $s_i$ in the document $D$, we have: where $U \in \mathbb{R}^{d_h \times d_h}$ is the attention matrix to be learned; $\alpha^q_i$ and $\alpha^s_i$ are co-attention weights for the question and the $i$-th sentence in the document. We conduct a dot product between the attention vectors and the word-level representations to generate the sentence representations for the question and the document: where $M_q$ and $M_s$ denote the sentence-level representations for the question and the document.
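As an illustration of the co-attention step, the following is a minimal sketch in plain Python. The scoring form is an assumption: the text only states that a learned matrix $U$ aligns question and sentence words, so the bilinear score $\tanh(H_q U H_{s_i}^\top)$ with max-pooling and softmax used here, and the function name `co_attention`, are illustrative choices rather than the exact formulation of MSG.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def co_attention(H_q, H_s, U):
    """H_q: l_q x d question words, H_s: l_s x d sentence words, U: d x d.

    Returns (alpha_q, alpha_s, m_q, m_s): word-level attention weights and
    the attention-weighted sentence-level vectors for both sides.
    """
    # Assumed bilinear affinity between every question word and sentence word.
    scores = matmul(matmul(H_q, U), transpose(H_s))            # l_q x l_s
    scores = [[math.tanh(v) for v in row] for row in scores]
    # Assumed pooling: each word keeps its best match on the other side.
    alpha_q = softmax([max(row) for row in scores])
    alpha_s = softmax([max(col) for col in zip(*scores)])
    # Dot product of attention weights and word representations.
    d = len(H_q[0])
    m_q = [sum(a * h[k] for a, h in zip(alpha_q, H_q)) for k in range(d)]
    m_s = [sum(a * h[k] for a, h in zip(alpha_s, H_s)) for k in range(d)]
    return alpha_q, alpha_s, m_q, m_s
```

In a full model, `H_q` and `H_s` would be the Bi-LSTM outputs and `U` a trained parameter; the sketch only shows how one question/sentence pair yields two attention vectors and two pooled representations.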

Multi-hop Inference Module
Multi-hop Inference Module measures the degree of importance for each sentence in the document to generate the answer, through a multi-hop reasoning procedure, which contains two kinds of inference units: Attentive Unit and MAR Unit.

Attentive Unit
The Attentive Unit measures the matching degree between each sentence in the document and the given question by the following vanilla attention mechanism: where $W_m$ and $\omega_m$ are the attention matrices to be learned. $\alpha_s$ is the sentence-level attention weight, which measures the matching degree of each document sentence with the given question. $\odot$ denotes the element-wise product for obtaining the attentive sentence-level representations for the document.
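A sentence-level attention of this kind can be sketched as follows. The additive scoring over the concatenation of a sentence vector and the question vector is an assumption made for illustration; `W` and `omega` stand in for $W_m$ and $\omega_m$, and `attentive_unit` is a hypothetical name.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attentive_unit(M_s, m_q, W, omega):
    """M_s: n sentence vectors (lists of size d); m_q: question vector (size d).

    W: k x 2d projection, omega: scoring vector of size k (both assumed shapes).
    Returns (alpha, weighted): sentence weights and re-weighted sentence vectors.
    """
    scores = []
    for m_i in M_s:
        x = m_i + m_q  # list concatenation [m_i; m_q]
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in W]
        scores.append(sum(o * h for o, h in zip(omega, hidden)))
    alpha = softmax(scores)
    # Scale every sentence vector by its matching weight.
    weighted = [[a * v for v in m_i] for a, m_i in zip(alpha, M_s)]
    return alpha, weighted
```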

MAR Unit
Maximal Marginal Relevance (MMR) is an IR model that can be adopted to measure query-relevancy and information-redundancy simultaneously for extractive summarization (Carbonell and Goldstein, 1998). However, for content selection in abstractive summarization, the relevance to both the question and the other sentences in the document should be taken into consideration to achieve a high recall of the necessary content. Thus, we propose Maximal Absolute Relevance (MAR) to select highly salient sentences for generating the summary, which is formulated as: where $\lambda$ is a hyper-parameter for balancing the question-relevancy and information-consistency measurements. The relevance to the question is calculated by: where $U_1$ is a similarity matrix to be learned. We apply an attention mechanism over the other sentences in the document to choose the highest relevance score, which can be regarded as the reasoning procedure in which the next-hop justification sentences are supposed to be highly related to the last-hop justification sentences.
where $U_1$ is a similarity matrix to be learned. Then the weighted sentence representations are computed by the element-wise product of the original sentence representations and the MAR scores gated by a sigmoid function denoted as $\sigma$: Overall, the MAR Unit assigns higher weights to sentences in two situations: (i) sentences that are correlated with the given question, due to the first term in Equation 9, and (ii) sentences that are consistent with the highly weighted justification sentences from the last hop, due to the second term.
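The contrast with MMR can be made concrete with a short sketch. Standard MMR subtracts the redundancy term, whereas MAR as described above rewards consistency with other sentences, so the two terms are added here. The bilinear similarities, the hard maximum over the other sentences, the use of the last hop's vectors `prev` in the second term, and the second matrix `U2` are all assumptions for illustration.

```python
import math

def bilinear(a, U, b):
    """Compute a^T U b for vectors a, b and a square matrix U."""
    return sum(a[i] * U[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))

def mar_unit(M_s, prev, m_q, U1, U2, lam=0.5):
    """M_s: current sentence vectors; prev: sentence vectors from the last hop.

    Returns (scores, gated): one MAR score per sentence and the sentence
    vectors scaled by sigmoid(score). Requires at least two sentences.
    """
    scores, gated = [], []
    for i, m_i in enumerate(M_s):
        # First term: relevance of sentence i to the question.
        rel_q = bilinear(m_i, U1, m_q)
        # Second term: best consistency with any *other* last-hop sentence.
        rel_s = max(bilinear(m_i, U2, p) for j, p in enumerate(prev) if j != i)
        score = lam * rel_q + (1.0 - lam) * rel_s
        gate = 1.0 / (1.0 + math.exp(-score))  # sigmoid gate
        scores.append(score)
        gated.append([gate * v for v in m_i])
    return scores, gated
```

With `lam = 1.0` the unit reduces to pure question relevance; lowering `lam` lets a sentence that does not match the question still receive a high weight when it agrees with a strongly weighted sentence from the previous hop.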

Reasoning Procedure
In accordance with the human-like multi-hop inference procedure, the first hop is supposed to capture the sentences that are semantically relevant to the given question. The subsequent hops should then consider not only the relevance to the question, but also the information consistency with the previously attended sentences. Hence, the Attentive Unit is adopted as the 1st-hop inference unit, while the MAR Unit serves as the $k$th-hop unit, where $k > 1$. Before each hop, a Bi-LSTM layer is employed to refine the input sentence representation. For instance, a 3-hop inference procedure is as follows: Then, we merge the 3-hop sentence representations via the following attention mechanism: where $W_h$ and $\omega_h$ are attention matrices to be learned. $Z$ is the final sentence-level document representation for justifying the importance degree of each sentence in the decoding phase.
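The merging of hop outputs can be sketched as an attention over hops. This is a simplified reading, assuming that each sentence scores its own representation from every hop with an additive attention and takes the weighted sum; `W` and `omega` stand in for $W_h$ and $\omega_h$, and the per-sentence formulation is an assumption.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_hops(hops, W, omega):
    """hops: K hop outputs, each a list of n sentence vectors of size d.

    W: k x d projection, omega: scoring vector of size k (assumed shapes).
    Returns Z: n fused sentence vectors of size d.
    """
    n, d = len(hops[0]), len(hops[0][0])
    Z = []
    for i in range(n):
        candidates = [hop[i] for hop in hops]  # sentence i as seen by each hop
        scores = [
            sum(o * math.tanh(sum(w * v for w, v in zip(row, m)))
                for o, row in zip(omega, W))
            for m in candidates
        ]
        gamma = softmax(scores)  # importance of each hop for sentence i
        Z.append([sum(g * m[k] for g, m in zip(gamma, candidates)) for k in range(d)])
    return Z
```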

Gated Selective Generator
We obtain the word-level representations $H_q$ and $H_d$ for the question and document, respectively, from the encoding phase, and the sentence-level document representation $Z$ via the multi-hop inference module. Figure 3 depicts the Gated Selective Pointer Generator Network in MSG.
A unidirectional LSTM is adopted as the decoder. At each step $t$, the decoder produces the hidden state $s_t$ with the previous word $w_{t-1}$ as input. The attention weights for each word in the question and the document, $\alpha^q_t$ and $\alpha^d_t$, are generated by: where $W_q$, $W_{qs}$, $W_d$, $W_{ds}$, $\omega^q_t$, $\omega^d_t$, $b_q$, $b_d$ are parameters to be learned.
Then, we incorporate the multi-hop inference results $Z$ to compute the gated attention weights $\beta_t$ for each sentence in the document: where $W_s$, $W_{ss}$, $\omega^s_t$, $b_s$ are parameters to be learned. We re-weight the word-level document attention scores $\alpha^d$, gated by the sentence-level document attention scores $\beta$, to attend to important justification sentences during decoding: Thus, the re-weighted word-level document attention $\hat{\alpha}^d$ naturally blends with the results from the multi-hop inference module to enhance the influence of those important justification sentences.
Finally, a multi-view pointer-generator architecture is designed to generate answers with the multi-hop inference results as well as to handle the multi-perspective out-of-vocabulary (OOV) issue. This approach enables MSG to copy words from the question and to be aware of the differing importance of different sentences in the document.
The attention weights $\alpha^q_t$ and $\hat{\alpha}^d_t$ are used to compute context vectors $c^q_t$ and $c^d_t$ as the probability distribution over the source words: The context vector aggregates the information from the source text for the current step. We concatenate the context vector with the decoder state $s_t$ and pass it through a linear layer to generate the answer representation $h^s_t$: where $W_1$ and $b_1$ are parameters to be learned. Then, the probability distribution $P_v$ over the fixed vocabulary is obtained by passing the answer representation $h^s_t$ through a softmax layer: where $W_2$ and $b_2$ are parameters to be learned. The final probability distribution of $y_t$ is obtained from three views of word distributions: where $W_\rho$ and $b_\rho$ are parameters to be learned, and $\rho$ is the multi-view pointer scalar that determines the weight of each view of the probability distribution.
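The gating of word-level attention by sentence-level weights can be illustrated with a small sketch. The mapping `sent_of_word` (the sentence index of every document word) and the optional re-normalization are assumptions added for the example; the text only states that $\alpha^d$ is gated by $\beta$.

```python
def reweight_word_attention(alpha_d, beta, sent_of_word, renormalize=True):
    """Scale each word's attention by the gate of the sentence it belongs to.

    alpha_d: attention weight per document word.
    beta: gate per sentence, e.g. sigmoid outputs in [0, 1].
    sent_of_word: sentence index for every document word (hypothetical helper input).
    """
    gated = [a * beta[s] for a, s in zip(alpha_d, sent_of_word)]
    total = sum(gated)
    if renormalize and total > 0:
        # Assumed step: keep the result a probability distribution.
        gated = [g / total for g in gated]
    return gated
```

For example, with two sentences of two words each, uniform word attention, and gates `[1.0, 0.0]`, all attention mass moves to the words of the first sentence.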
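The mixing of the three views can be sketched as follows, assuming the usual pointer-generator form in which the final probability of a word is a weighted sum of its vocabulary probability and the attention mass on its occurrences in the question and in the document. The tuple `rho` and the dictionary representation are illustrative choices.

```python
def final_distribution(p_vocab, alpha_q, q_tokens, alpha_d, d_tokens, rho):
    """Combine generation and two copy views into one word distribution.

    p_vocab: dict word -> probability over the fixed vocabulary.
    alpha_q / alpha_d: attention weights over question / document tokens.
    rho: (rho_v, rho_q, rho_d), assumed non-negative and summing to 1.
    """
    rho_v, rho_q, rho_d = rho
    dist = {w: rho_v * p for w, p in p_vocab.items()}
    # Copy view 1: words from the question (may be out-of-vocabulary).
    for a, w in zip(alpha_q, q_tokens):
        dist[w] = dist.get(w, 0.0) + rho_q * a
    # Copy view 2: words from the document, using the re-weighted attention.
    for a, w in zip(alpha_d, d_tokens):
        dist[w] = dist.get(w, 0.0) + rho_d * a
    return dist
```

Because copied words are added under their surface form, a term such as "EDTA" can receive probability even when it is absent from the fixed vocabulary.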

End-to-end Training
Multi-view Coverage Loss. The original coverage mechanism (See et al., 2017) could only prevent repeated attention from one certain source text. However, the repetition problem becomes more severe, as we leverage both the question and document as the source text. Besides, similar to multi-view pointer network, coverage losses of different sources are supposed to be weighted by their contribution. Therefore, we design a multi-view coverage mechanism to address this issue as well as balance the generating and copying processes.
In each decoder timestep $t$, the coverage vector $c_t = \sum_{t'=0}^{t-1} a_{t'}$ is used to represent the degree of coverage so far. The coverage vector $c_t$ will be applied to compute the attention weight $\alpha_t$ in Equations 19 and 21. The coverage loss is trained to penalize the repetition in the updated attention weight $\alpha_t$ from all views. The re-normalized pointer weights $\hat{\rho}_c = \rho_c / \sum_{c' \in \{q,d\}} \rho_{c'}$ are employed to balance the coverage loss of different views:
Overall Loss Function. The overall model is trained to minimize the negative log likelihood and the multi-view coverage loss: where $\lambda$ is a hyper-parameter to balance the losses.
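One decoding step of such a loss can be sketched as below. The per-view penalty $\sum_i \min(a_i, c_i)$ follows the original coverage loss of See et al. (2017); combining the question and document views with the re-normalized pointer weights is how the description above is read here, and the function names are hypothetical.

```python
def coverage_step(attn, coverage):
    """Return (loss, new_coverage) for one view at one decoding step."""
    # Penalize attention placed on positions that are already covered.
    loss = sum(min(a, c) for a, c in zip(attn, coverage))
    new_coverage = [a + c for a, c in zip(attn, coverage)]
    return loss, new_coverage

def multi_view_coverage_loss(attn_q, cov_q, attn_d, cov_d, rho_q, rho_d):
    """Weight the question and document coverage losses by re-normalized rho."""
    total = rho_q + rho_d
    w_q, w_d = rho_q / total, rho_d / total
    loss_q, cov_q = coverage_step(attn_q, cov_q)
    loss_d, cov_d = coverage_step(attn_d, cov_d)
    return w_q * loss_q + w_d * loss_d, cov_q, cov_d
```

Summing the returned loss over all decoding steps and adding it, scaled by $\lambda$, to the negative log likelihood gives the overall objective described above.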

Datasets and Evaluation Metrics
We evaluate on a large-scale summarization dataset with non-factoid questions, WikiHow (Koupaee and Wang, 2018), and a non-factoid QA dataset with abstractive answers, PubMedQA (Jin et al., 2019). WikiHow is an abstractive summarization dataset collected from a community-based QA website, WikiHow, in which each sample consists of a non-factoid question, a long article, and the corresponding summary as the answer to the given question. PubMedQA is a conclusion-based biomedical QA dataset collected from PubMed abstracts, in which each instance is composed of a question, a context, and an abstractive answer that is the summarized conclusion of the context corresponding to the question. The statistics of the WikiHow and PubMedQA datasets are shown in Table 1.

Baseline Methods and Implementations
To evaluate the proposed method, we compare with several baselines and state-of-the-art methods on query-based abstractive summarization and generative QA. We first employ four widely adopted summarization baseline methods, including two unsupervised extractive methods, LEAD3 and MMR, and two abstractive methods, S2SA (Bahdanau et al., 2015) and PGN (See et al., 2017). Then two popular query-based abstractive summarization methods are adopted for evaluation: (1) SD$_2$ (Nema et al., 2017), which is a sequence-to-sequence model with a query attention, and (2) QS (Hasselqvist et al., 2017), which incorporates question information into the pointer-generator network with the vanilla attention mechanism.
Finally, we implement two latest generative QA models for comparisons: (1) S2S-MT (Fan et al., 2019), which uses a multi-task Seq2Seq model with the concatenation of question and support document, and (2) QPGN (Deng et al., 2020a), which is a question-driven pointer-generator network with co-attention between the question and document.
We train all the models with pre-trained GloVe embeddings of 300 dimensions and set the vocabulary size to 50k. During training and testing, we restrict the length of generated summaries to 50 words. For the proposed method, we train with a learning rate of 0.15 and an initial accumulator value of 0.1. The dropout rate is set to 0.5. The hidden unit sizes of the BiLSTM encoder and the LSTM decoder are all set to 256. We train our models with a batch size of 32. All other parameters are randomly initialized from [-0.05, …].

Table 2 summarizes the experimental results on both datasets. On WikiHow, which is an abstractive summarization dataset with non-factoid questions, current query-based summarization (SD$_2$, QS) and generative QA approaches (S2S-MT, QPGN) barely improve over traditional summarization approaches. This indicates that the question information is not fully exploited for summarization by these methods, while MSG outperforms all of them by a noticeable margin of about 2%. Besides, since PubMedQA is a QA dataset with abstractive answers, we can observe that QPGN, which is specifically designed to model the interaction between the question and document, achieves relatively better performance than the other summarization methods. MSG further raises the state-of-the-art result by about 3%. Overall, MSG achieves promising improvements via multi-hop inference on both datasets.

Performance Comparison
We conduct human evaluation to assess the generated answers from four aspects: (1) Informativity: how rich is the generated answer in information? (2) Conciseness: how concise is the summary? (3) Readability: how fluent and coherent is the summary? (4) Correctness: how well does the generated answer respond to the given question? We randomly sample 50 questions from the two datasets and generate their answers with three query-based summarization methods, SD$_2$, QS, and QPGN, and with the proposed MSG. Three annotators are asked to score each generated answer from 1 to 5 (higher is better). Results are presented in Table 3. We observe that MSG consistently and substantially outperforms existing query-based summarization methods in all aspects, especially informativity and correctness.
The results show that MSG effectively generates concise but also informative answers, since MSG not only considers question-related information, but also captures logically necessary content for answering the given question via multi-hop reasoning. Consequently, it leads to a more precise answer.

Ablation Study
We conduct an ablation study to validate the effectiveness of different components in MSG as well as the detailed design of the multi-hop inference module. The upper part of Table 4 presents the ablation study on the multi-hop inference module. First of all, the model performance drops substantially on both datasets when the multi-hop inference module is discarded, showing the necessity of incorporating multi-hop reasoning into question-driven summarization. Specifically, the fusion of the selective sentence representations from all hops brings a performance improvement, both from aggregating all the hops and from applying attention to weight the importance of each hop. Besides, applying the proposed MAR Unit as the multi-hop unit achieves better performance than repeatedly using the Attentive Unit, indicating that it is not enough to consider only the question-related information, and that the interrelationship among different sentences is also of great importance.
The second part of Table 4 presents the ablation study in terms of discarding other model components in MSG. In general, all the components contribute to the final performance to a certain extent. In detail, there are several notable observations: (1) Some existing works (Hsu et al., 2018; Nishida et al., 2019b) apply a softmax function to normalize the weights of different sentences in the decoding phase, which falls short of differentiating the importance degree of each sentence. The result shows that MSG achieves better performance by employing gated attention to distinguish salient justification sentences for generating the summaries. (2) Discarding the question pointer causes a noticeably greater decrease on PubMedQA than on WikiHow. We conjecture that questions from PubMedQA contain more words that can be copied to generate precise summaries, as suggested by the question length statistics shown in Table 1. These results also validate the importance of the multi-view PGN for question-driven abstractive summarization, which is underutilized in current methods. (3) The multi-view coverage (MVC) loss makes a great contribution to the performance by alleviating the severe repetition problem that comes with the multi-view PGN.

Analysis of Multi-hop Reasoning
As shown by the results presented in Section 4.3, MSG (3-Hop) outperforms MSG (1-Hop) by 0.5% and 0.7% on WikiHow and PubMedQA, respectively, indicating the effectiveness of incorporating multi-hop reasoning in question-driven summarization. Figure 4(a) presents the model performance with different numbers of reasoning hops. As expected, the performance initially improves as the number of hops increases. However, the performance remains generally unchanged (e.g., WikiHow) or even slightly decreases (e.g., PubMedQA) when we further increase the number of hops. In practice, it is unnecessary to reason for too many hops, which may cause over-fitting. The use of 3 hops in our implementation can be regarded as a hyper-parameter tuned on the datasets.
In addition, we extract and normalize the sentence weights from Eqs. 7 and 9 to analyze some characteristics of the justification sentences in multi-hop inference. Figure 4(b) summarizes the statistics of the sentence importance degree in each hop. We observe that the most important sentences in the 1st hop of reasoning are likely to appear at the beginning of the document, while those in the 3rd hop are concentrated in the latter part of the document. Comparatively, the important sentences in the 2nd hop are spread evenly over all positions of the document. The results show that the proposed multi-hop inference procedure over justification sentences is generally in accordance with human-like reading habits.

Case Study
We present a case study in Figure 5 with answers generated by the proposed method and by the baseline methods QPGN, QS, and SD$_2$, to compare these methods intuitively. With the multi-hop reasoning process in MSG, we can obtain a clear clue of how to answer the given question. Since the reference answer is composed of information from the 1st-hop and 3rd-hop inference sentences, it is inadequate to simply summarize the question-related content when generating the answer. For the generated summaries, there are several observations: (1) MSG (3-hop) successfully summarizes the source document with all the necessary and correct information. (2) MSG (2-hop) also effectively summarizes the 1st-hop and 2nd-hop inference content in the document. However, in this case, 3-hop inference is required to answer the given question. (3) MSG (1-hop) only measures the semantic relevance to the given question, leading to an incomplete summary that lacks some necessary content and even introduces some general sentences due to the data-driven learning. (4) QPGN only considers the semantic relevance to the given question, leading to an incomplete summary that lacks some necessary content. (5) QS and SD$_2$ fail to capture the key information, generating summaries that are irrelevant to the given question or producing some general sentences due to the data-driven learning. This shows the capability of MSG to perform multi-hop reasoning and provide justification sentences.

Figure 5:
Question: Does high molecular weight hyaluronan decrease oxidative DNA damage induced by EDTA in human corneal epithelial cells?
Document: <S>To investigate the toxic effects of Ethylenediaminetetraacetic Acid Disodium Salt (EDTA), a corneal penetration enhancer in topical ophthalmic formulations, on DNA in human corneal epithelial cells (HCEs), and to investigate whether the effect induced by EDTA can be inhibited by high molecular weight hyaluronan (HA). <S>Cells were exposed to EDTA in concentrations ranging from 0.00001 to 0.01% for 60 min, or 30 min high molecular weight HA pretreatment followed by EDTA treatment. <S>…<S>EDTA exhibited no adverse effect on cell viability and did not induce cell apoptosis in human corneal epithelial cells at concentrations lower than 0.01%. <S>However, a significant increase of DNA single- and double-strand breaks was observed in a dose-dependent manner with all the concentrations of EDTA tested in HCEs. <S>In addition, EDTA treatment led to elevated ROS generation. <S>Moreover, 30 min preincubation with high molecular weight HA significantly decreased EDTA-induced ROS generation and DNA damage. <S>
Reference Answer: EDTA could induce DNA damage in HCEs, probably through oxidative stress. Furthermore, high molecular weight HA was an effective protective agent that had antioxidant properties and decreased DNA damage induced by EDTA.
MSG (3-Hop): High molecular weight HA reduces oxidative DNA damage induced by EDTA in human corneal epithelial cells. Moreover, high molecular weight HA significantly decreased EDTA-induced ROS generation and DNA damage.
MSG (2-Hop): High molecular weight HA pretreatment followed by EDTA reduces oxidative DNA damage induced by EDTA in human corneal epithelial cells. However, a significant increase of DNA single- and double-strand breaks was observed in a dose-dependent manner with high molecular weight HA.
MSG (1-Hop): High molecular weight HA significantly reduced EDTA-induced ROS generation and DNA damage in human corneal epithelial cells. These results suggest that EDTA may be a potential therapeutic agent for the treatment of corneal penetration.
QPGN: In addition to the corneal penetration of HCEs, EDTA can induce cell apoptosis in human corneal epithelial cells, and the effect induced by EDTA in human corneal epithelial cells can be inhibited by EDTA.
QS: EDTA induces cell apoptosis in human corneal epithelial cells, suggesting that EDTA may be a potential therapeutic agent for the treatment of corneal epithelial cells, in the prevention of DNA damage in the corneal epithelial cell population.
SD2: These results suggest that EDTA may be a potential therapeutic agent for the treatment of human corneal epithelial damage caused by EDTA in the topical ophthalmic formulation of topical ophthalmic formulations.
Additionally, we observe that many cases probably require more than 3-hop inference or only involve one or two hops. However, we can still evaluate how MSG works in these cases. Compared to the reference answer, MSG (3-hop) can still capture most of the useful information to generate a good summary for answering the question. Besides, MSG (2-hop) and MSG (1-hop) also manage to attend to some important content in the document. In general, our model is able to attend to only a single hop if one hop is enough, while it may treat several hops as one integral hop when more hops are required. In contrast, the baseline methods introduce much unnecessary or even incorrect information into the summarized answers.

Duplication Analysis in Answers
We adopt Distinct scores to analyze whether the multi-view coverage mechanism can alleviate the repetition issue in the generation procedure of the multi-view PGN. Figure 6 summarizes the percentage of n-gram duplication in the ground-truth answers and in the generated answers with or without the original (See et al., 2017) and multi-view coverage mechanisms. We observe that the original coverage mechanism can still reduce word repetition in the multi-view PGN. Moreover, multi-view coverage further reduces the ratio of duplication to a great extent, since it not only prevents repeatedly attending to the same element in both the question and the document, but also balances the weight of the penalty between them.

Figure 6: Duplication Analysis in Answers
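For reference, an n-gram duplication ratio and the corresponding Distinct-n score can be computed as in the sketch below. The exact definition used for Figure 6 is not given in the text, so counting every repeated occurrence of an n-gram within one answer as a duplicate is an assumption.

```python
from collections import Counter

def duplicate_ngram_ratio(tokens, n):
    """Fraction of n-grams in `tokens` that repeat an earlier n-gram."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    # Every occurrence beyond the first one counts as a duplicate.
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(ngrams)

def distinct_n(tokens, n):
    """Distinct-n: number of unique n-grams divided by total n-grams."""
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    return len(set(ngrams)) / len(ngrams)
```

Under this definition the two quantities sum to one for any non-empty answer, so a lower duplication ratio corresponds directly to a higher Distinct score.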

Conclusion
We propose a novel question-driven abstractive summarization method, Multi-hop Selective Generator (MSG), to summarize concise but informative answers for non-factoid QA. We incorporate multi-hop reasoning to infer justification sentences for abstractive summarization. Experimental results show that the proposed method achieves state-of-the-art performance on two benchmark non-factoid QA datasets, namely WikiHow and PubMedQA.