Cross-Task Knowledge Transfer for Query-Based Text Summarization

We demonstrate the viability of knowledge transfer between two related tasks: machine reading comprehension (MRC) and query-based text summarization. Using an MRC model trained on the SQuAD1.1 dataset as a core system component, we first build an extractive query-based summarizer. For better precision, this summarizer also compresses the output of the MRC model using a novel sentence compression technique. We further leverage pre-trained machine translation systems to abstract our extracted summaries. Our models achieve state-of-the-art results on the publicly available CNN/Daily Mail and Debatepedia datasets, and can serve as simple yet powerful baselines for future systems. We also hope that these results will encourage research on transfer learning from large MRC corpora to query-based summarization.


Introduction
Query-based single-document text summarization is the process of selecting the most relevant points in a document for a given query and arranging them into a concise and coherent snippet of text. The query can range from an individual word to a fully formed natural language question. Extractive summarizers select verbatim the most relevant span of text in the source, while abstractive summarizers further paraphrase the selected content for better clarity and brevity.
By and large, existing approaches train models on summarization corpora (Nema et al., 2017; Hasselqvist et al., 2017), which are of moderate size. At the same time, large corpora are available for related tasks, specifically machine reading comprehension (MRC) and machine translation (MT). To find out whether such corpora have utility for summarizers, we propose methods to directly produce extractive and abstractive query-based summaries from pretrained MRC and MT modules, requiring no further adaptation or transfer learning steps.

* Work done at IBM.
In our experiments, this approach outperforms existing methods, suggesting a novel route to query-based summarization: pre-training systems on such related tasks, where an abundance of training data is enabling extremely rapid progress (Sun et al., 2018; Vaswani et al., 2017), and using summarization-specific corpora for transfer learning.
The main contributions of this work are:

• We show that existing off-the-shelf components built for tasks other than query-based summarization are competitive with the state of the art in the field, even without model adaptation or transfer learning; we hope to encourage researchers to more closely examine transfer learning among these tasks.

• Specifically, we show how processing the output of an MRC system (trained on the SQuAD1.1 dataset (Rajpurkar et al., 2016)) with a simple rule-based sentence compression module that operates on the dependency parse (de Marneffe and Manning, 2008) of the answer sentence yields results that are better than those of query-based extractive summarizers trained for the specific dataset.

• We demonstrate how a sequence-to-sequence model (Sutskever et al., 2014) that chains two machine translation engines, from and to English respectively, applied to the output of the above, yields results that are better than those of query-based abstractive summarizers trained for the specific dataset.
Passage: people whether overweight or not are still people. you can not compare a person with a suitcase. suitcases don't live and breathe. this rule is the same with weight. excess weight in a suitcase is not comparable with a fat person.
Query: is it necessary to charge fat passengers extra when flying?
Reference Summary: there is no comparison between a person and a suitcase.
Our method (abstractive): The overweight in the bag can't be compared with the fat guy.
Diversity driven attention model: beings are definitely by the <unk> to illegal illegal.

Table 1: Example of an abstractive summary.

Task Definition
Given a document D = (S_1, ..., S_n) with n sentences comprising a set of words D_W = {d_1, ..., d_w}, and a query Q = (q_1, ..., q_m) with m words, one wishes to produce an extractive summary S_E or an abstractive summary S_A that provides information about the answer to Q, where S_E ⊆ D_W and S_A = {w_1, ..., w_s} with ∃ w_i ∈ D_W. Tables 1 and 2 show examples of abstractive and extractive summaries, respectively.

Method
Our proposed system comprises three modules for extractive summarization: retrieval of candidate answer phrases using a reading comprehension system, sentence extraction, and sentence compression. For abstractive summarization, we additionally use two MT modules (English to Spanish and back) as a paraphraser.

Machine Reading Comprehension
MRC requires the identification of a contiguous span of words in a passage that answers a given query (Rajpurkar et al., 2016; Hu et al., 2017). We use the MRC model by Wang et al. (2016b), trained on the SQuAD1.1 dataset (Rajpurkar et al., 2016), to identify the top n (empirically, n = 5) possibly overlapping candidate answer phrases, or chunks, for the given query. The chunks are typically short: 3.2 words on average in the training set. Obviously, chunks from MRC are not meant to be summaries, but in our system they help the summarizer focus on the regions of the input document that appear related to the query.
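For intuition, the decoding step of a typical extractive MRC model can be sketched as follows. This is a generic top-n span selection over per-token start/end scores, a common decoding scheme for SQuAD-style models; the scoring rule and parameter names here are our simplification, not the exact procedure of Wang et al. (2016b):

```python
def top_n_spans(start_scores, end_scores, n=5, max_len=10):
    """Return the n highest-scoring (start, end) token spans.

    A span (i, j) is scored as start_scores[i] + end_scores[j],
    with i <= j and a cap on span length. Spans may overlap,
    matching the possibly overlapping chunks used in the paper.
    """
    candidates = []
    for i, s in enumerate(start_scores):
        for j in range(i, min(i + max_len, len(end_scores))):
            candidates.append((s + end_scores[j], i, j))
    candidates.sort(reverse=True)
    return [(i, j) for _, i, j in candidates[:n]]
```

With n = 5, as in our experiments, the returned spans are the candidate chunks handed to the sentence extraction step.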

Sentence Extraction
Sentence extraction consists of selecting the sentences containing the top n chunks produced by MRC. This is in contrast to methods based on sentence ranking algorithms such as those used in (Boudin et al., 2015; Parveen and Strube, 2015; Nallapati et al., 2017; Cheng and Lapata, 2016).
For our experiments, we impose the constraint that the candidate answer chunks for each query be contained in a single sentence. Hence, starting from n = 5, we iteratively reduce n until the top n candidate chunks are all contained in one sentence.
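This selection loop can be sketched as follows. It is a minimal illustration under the assumption that each chunk is already mapped to the index of its containing sentence; the data layout is ours:

```python
def extract_sentence(sentences, chunks, n=5):
    """Pick the single sentence containing the top-n answer chunks.

    `sentences` maps sentence index -> sentence text; `chunks` is a
    list of (sentence_index, chunk_text) pairs ordered by MRC score.
    Starting from n = 5, we shrink the candidate list until all
    remaining chunks fall inside one sentence, then return it.
    """
    while n > 1 and len({idx for idx, _ in chunks[:n]}) > 1:
        n -= 1
    return sentences[chunks[0][0]]
```

In the worst case the loop degenerates to n = 1, i.e., the sentence containing the single best-scoring chunk.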

Sentence Compression
Sentence extraction often produces results that are much longer than the reference summaries; the training data (Table 4) suggest that 20 words is a good upper limit for summary length. We address this problem by introducing a novel sentence compression framework based on pruning the dependency parses of the sentences. Our approach is partially inspired by the work of Wang et al. (2016a), which performs sentence compression based on constituency parses. The intuition is that dependency parses capture the semantic relations between words better than constituency parses, which model syntactic structure.
Table 3 (examples of paraphrasing with back-translation):
Input Sentence: it is ridiculous to suggest governments should restrict their own ability to help their economies.
Paraphrase (with MT): It is absurd to suggest that governments impose limits on their ability to help their economies.
Input Sentence: this favoritism would only increase that of which the laws are trying to suppress.
Paraphrase (with MT): These nepotism will only increase the laws that you try to suppress.

Given a summary of length ≥ 20 words, we obtain the dependency parses of its sentences using the IBM Watson NLU toolkit. Next, we remove words in the sentences (starting from the rear) that are not in a dependency relationship with any of the candidate phrases, until the summary length limit is reached.
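A toy sketch of the pruning idea follows. Dependency edges are represented as head indices, and "in a dependency relationship with a chunk" is approximated by one step of head/dependent linkage; this simplification of the rule set is ours, not the paper's exact procedure:

```python
def compress(tokens, heads, chunk_indices, limit=20):
    """Drop trailing tokens not linked to any candidate chunk.

    `tokens` is the sentence; `heads[i]` is the index of token i's
    dependency head (-1 for the root); `chunk_indices` holds the
    positions of the MRC answer chunk. Working from the rear, we
    delete tokens with no dependency link to the chunk until at
    most `limit` tokens remain (or no prunable tokens are left).
    """
    keep = set(range(len(tokens)))
    linked = set(chunk_indices)
    # Mark tokens that head, or are headed by, an already-linked token.
    for i, h in enumerate(heads):
        if h in linked or i in linked:
            linked.add(i)
            if h >= 0:
                linked.add(h)
    for i in reversed(range(len(tokens))):
        if len(keep) <= limit:
            break
        if i not in linked:
            keep.discard(i)
    return [tokens[i] for i in sorted(keep)]
```

Note that, as in the paper, compression stops once the limit is reached, so tokens near the front are pruned only when necessary.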

Back Translation
Recent research has shown gains from leveraging the enormous corpora available for machine translation (MT) for paraphrasing (Mallinson et al., 2017; Wieting and Gimpel, 2017). Inspired by such research and by our fundamental goal of investigating the viability of cross-task knowledge transfer for query-based summarization, we paraphrase our extracts using an off-the-shelf MT system. The final English paraphrase of the input sentence is obtained by translating it into Spanish and back-translating the translation into English. We experimented with English-French-English and English-Italian-English as well as with multi-hop approaches before settling on the English-Spanish pair, based on subjective analysis of the results. Table 3 shows examples of sentences paraphrased using back-translation.
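Since the MT engines are external, off-the-shelf services, the round trip itself reduces to chaining translation calls; a sketch with the MT directions injected as callables (the function and parameter names are ours, not any engine's API):

```python
def back_translate(sentence, hops):
    """Paraphrase a sentence by chaining MT directions.

    `hops` is a list of callables, each wrapping one translation
    direction, e.g. [en_to_es, es_to_en] for the English-Spanish
    round trip the paper settled on; multi-hop pivots are simply
    longer lists.
    """
    for translate in hops:
        sentence = translate(sentence)
    return sentence
```

The same wrapper covers the English-French and English-Italian pivots we compared against.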

Experiments
We test our approach on two publicly available datasets for query-based summarization, Debatepedia (Nema et al., 2017) and CNN/Daily Mail. No training was involved; the test sets were simply passed through the modules discussed in Section 3.

Datasets
We processed the CNN/DM and Debatepedia datasets using the respective official Python scripts to yield corpora with passages, queries, and summaries tailored to the queries (Table 4). CNN/DM is much larger in terms of both the number of samples and the lengths of the passages, with short queries consisting of a few words, mostly entity names. Debatepedia is a smaller dataset, but its queries are fully formed natural language questions. Interestingly, although our MRC system was originally designed to answer full-length questions, as our results show later in this section, it identifies key regions of the document remarkably well on both test sets.

Evaluation
As is customary in summarization tasks, we evaluate our system using ROUGE (Lin, 2004), a family of metrics that compute the textual overlap between the system output and the reference summary. We use the publicly available ROUGE 2.0 toolkit as the implementation.
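For intuition, a bare-bones ROUGE-1 (clipped unigram overlap) computation looks like the following; the official toolkit additionally handles stemming, stopword removal options, the ROUGE-2/L/SU4 variants, and multiple references:

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Return ROUGE-1 precision, recall, and F1 for two token lists.

    Overlap is the sum of clipped unigram match counts, as in the
    original metric definition.
    """
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum(min(cand[w], ref[w]) for w in cand)
    p = overlap / max(len(candidate), 1)
    r = overlap / max(len(reference), 1)
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1
```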

Results
Tables 5 and 6 summarize the performance of our models and of other published models on Debatepedia and CNN/Daily Mail, respectively. Our models, both extractive and abstractive, outperform the published results on both test sets. The extractive performance on CNN/DM indicates that the combination of a reading comprehension system and a syntax-driven compression module can be highly effective in identifying regions in a document that contain key information with respect to a given query. Moreover, the abstractive performance on both test sets shows the effectiveness of machine translation as a paraphrasing component for abstractive summarization. In particular, on the CNN/DM test set the improvement over the baseline is greater in the abstractive case than in the extractive one, again suggesting that both text selection and MT-based paraphrasing contribute to the gain.

Extractive                         R-1    R-2    R-L    R-SU4
QSum (Hasselqvist et al., 2017)    33.81  18.19  29.22  17.49
Ours                               65.45  30.07  60.40  36.62
Abstractive
QSum (Hasselqvist et al., 2017)    18.25   5.04  16.17   6.13
Ours                               58.46  25.12  54.32  32.06

Table 6: ROUGE (%) scores of our models and the competing model on the CNN/Daily Mail dataset. Our proposed approach yields the best system for both extractive and abstractive summarization.

Related Work
Text summarization has long been an active area of research, and query-based summarization has gained momentum more recently. Classical summarization models usually identify salient parts of a text by encapsulating manually crafted rules into linear functions (Lin and Bilmes, 2011), which are solved using integer linear programming (ILP) (Nayeem and Chali, 2017; Boudin et al., 2015), conditional random fields (CRF) (Shen et al., 2007), or graph algorithms (Parveen and Strube, 2015; Erkan and Radev, 2004). More recently, neural networks, mostly in an encoder-decoder framework (Bahdanau et al., 2014), have been used to learn the underlying features (Jadhav and Rajan, 2018; Nallapati et al., 2016), trained by minimizing the cross-entropy loss (Nallapati et al., 2017) or with reinforcement learning (Narayan et al., 2018; Paulus et al., 2017). Our baseline models for query-based summarization (Nema et al., 2017; Hasselqvist et al., 2017) are both implemented in the encoder-decoder framework, with the former incorporating a diversity function aimed at mitigating the repetitive word generation inherent in encoder-decoder models. However, our approach is similar to neither, as our goal is not to train a query-based summarizer from scratch but rather to investigate the competitiveness of pre-trained models for closely related tasks, i.e., MRC and MT, on query-based summarization.

Conclusions
We described an approach to extractive and abstractive summarization that relies on components designed for different tasks: MRC, sentence compression, and MT. We have shown that retrieving the top n answer chunks from a passage with an MRC system and trimming the corresponding sentences using their dependency trees yields an extractive summarizer that outperforms published results on a publicly available dataset. We also showed that using MT to paraphrase the answers yields a high-performance abstractive summarization method. This work lays the foundations for transfer-learning approaches that use summarization data to adapt MRC models for summarization. We also envision: (i) using summarization data to learn to re-rank the top n candidates from back-translation; (ii) replacing the pruning system with a trained sequence-to-sequence model whose objective function incorporates readability; and (iii) computing the AMR parse (Banarescu et al., 2013) of the candidate answers followed by text generation (Song et al., 2018) instead of using MT.