Towards Improving Abstractive Summarization via Entailment Generation

Abstractive summarization, the task of rewriting and compressing a document into a short summary, has achieved considerable success with neural sequence-to-sequence models. However, these models can still benefit from stronger natural language inference skills, since a correct summary is logically entailed by the input document, i.e., it should not contain any contradictory or unrelated information. We incorporate such knowledge into an abstractive summarization model via multi-task learning, where we share its decoder parameters with those of an entailment generation model. We achieve promising initial improvements based on multiple metrics and datasets (including a test-only setting). The domain mismatch between the entailment (captions) and summarization (news) datasets suggests that the model is learning some domain-agnostic inference skills.


Introduction
Abstractive summarization, the task of rewriting a document into a short summary, is a significantly more challenging (and natural) task than extractive summarization, which only involves choosing which sentences from the original document to keep or discard in the output summary. Neural sequence-to-sequence models have led to substantial improvements on abstractive summarization, via machine-translation-inspired encoder-aligner-decoder approaches, further enhanced via convolutional encoders, pointer-copy mechanisms, and hierarchical attention (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017).
Figure 1: Motivating output example from our summarization+entailment multi-task model. Input Document: "may is a pivotal month for moving and storage companies." Ground-truth Summary: "moving companies hit bumps in economic road." Baseline Summary: "a month to move storage companies." Multi-task Summary: "pivotal month for storage firms."

Despite these promising recent improvements, there is still scope for better teaching summarization models the general natural language inference skill of logical entailment generation. This is because the task of abstractive summarization involves two subtasks: salient (important) event detection as well as logical compression, i.e., the summary should not contain any information that is contradictory to or unrelated to the original document. Current methods have to learn both of these skills from the same dataset with a single model. There is therefore benefit in learning the latter ability of logical compression via external knowledge from a separate entailment generation task, which specifically teaches the model how to rewrite and compress a sentence such that it logically follows from the original input. To achieve this, we employ the recent paradigm of sequence-to-sequence multi-task learning (Luong et al., 2016). We share the decoder parameters of the summarization model with those of the entailment generation model, so as to generate summaries that are good both at extracting important facts from the input document and at being logically entailed by it. Fig. 1 shows such an (actual) output example from our model, where it successfully learns both salient information extraction and entailment, unlike the strong baseline model.
Empirically, we report promising initial improvements over some solid baselines based on several metrics, and on multiple datasets: Gigaword and also a test-only setting of DUC. Importantly, these improvements are achieved despite the fact that the domain of the entailment dataset (image captions) is substantially different from the domain of the summarization datasets (general news), which suggests that the model is learning certain domain-independent inference skills. Our next steps for this workshop paper include incorporating stronger pointer-based models and employing the new multi-domain entailment corpus (Williams et al., 2017).

Related Work
Earlier summarization work focused more on extractive (and compression-based) summarization, i.e., selecting which sentences to keep vs. discard, and also compressing by choosing grammatically correct sub-sentences that retain the most important pieces of information (Jing, 2000; Knight and Marcu, 2002; Clarke and Lapata, 2008; Filippova et al., 2015). Larger datasets and neural models have made it possible to address the complex reasoning involved in abstractive summarization, i.e., rewriting and compressing the input document into a new summary. Several advances have been made in this direction using machine-translation-inspired encoder-aligner-decoder models, convolution-based encoders, switching pointer and copy mechanisms, and hierarchical attention models (Rush et al., 2015; Nallapati et al., 2016; See et al., 2017).
Recognizing textual entailment (RTE) is the classification task of predicting whether the relationship between a premise and a hypothesis sentence is that of entailment (i.e., the hypothesis logically follows), contradiction, or independence (Dagan et al., 2006). The SNLI corpus (Bowman et al., 2015) allows training accurate end-to-end neural networks for this task. Some previous work (Mehdad et al., 2013; Gupta et al., 2014) has explored the use of textual entailment recognition for redundancy detection in summarization: they label relationships between sentences so as to select the most informative and non-redundant sentences for summarization, via sentence connectivity and graph-based optimization and fusion. Our focus, on the other hand, is entailment generation rather than recognition, i.e., teaching summarization models the general natural language inference skill of generating a compressed sentence that is logically entailed by the original longer sentence, so as to produce more effective short summaries. We achieve this via multi-task learning with entailment generation.
Multi-task learning involves sharing parameters between related tasks, whereby each task benefits from the extra information in the training signals of the related tasks and improves its generalization performance. Luong et al. (2016) showed improvements on translation, captioning, and parsing in a shared multi-task setting. Recently, Pasunuru and Bansal (2017) extended this idea to video captioning with two related tasks: video completion and entailment generation. We demonstrate that abstractive text summarization models can also be improved by sharing parameters with an entailment generation task.

Models
First, we discuss our baseline model, which is similar to the machine translation encoder-aligner-decoder model of Luong et al. (2015), as presented by Chopra et al. (2016). Next, we introduce our multi-task learning approach of sharing parameters between the abstractive summarization and entailment generation models.

Baseline Model
Our baseline model is a strong, multi-layered encoder-attention-decoder model with bilinear attention, similar to Luong et al. (2015) and following the details in Chopra et al. (2016). Here, we encode the source document with a two-layered LSTM-RNN and generate the summary using another two-layered LSTM-RNN decoder. The word probability distribution at time step t of the decoder is defined as:

p(y_t | y_{1:t-1}, x) = g(c_t, s_t)

where g is a non-linear function and c_t and s_t are the context vector and LSTM-RNN decoder hidden state at time step t, respectively. The context vector c_t = Σ_i α_{t,i} h_i is a weighted combination of the encoder hidden states h_i, where the attention weights α_{t,i} are learned through the bilinear attention mechanism proposed in Luong et al. (2015). We use the same notation for the rest of the paper.
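One decoder step of the bilinear attention above can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names and the identity initialization of the bilinear matrix W are ours, and the real model learns W jointly with the LSTMs.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(x - np.max(x))
    return e / e.sum()

def bilinear_attention(s_t, H, W):
    """One decoder step of Luong-style 'general' (bilinear) attention.

    s_t : decoder hidden state at time step t, shape (d,)
    H   : encoder hidden states h_i stacked as rows, shape (n, d)
    W   : learned bilinear weight matrix, shape (d, d)

    Returns the attention weights alpha_{t,i} and the context vector
    c_t = sum_i alpha_{t,i} * h_i.
    """
    scores = H @ W @ s_t   # bilinear score for each encoder position i
    alpha = softmax(scores)  # normalize scores into attention weights
    c_t = alpha @ H          # weighted combination of encoder states
    return alpha, c_t

# Tiny example with random states and an identity W (illustrative only).
rng = np.random.default_rng(0)
H = rng.normal(size=(5, 4))   # 5 encoder time steps, hidden size 4
s_t = rng.normal(size=4)
alpha, c_t = bilinear_attention(s_t, H, np.eye(4))
```

The context vector c_t would then be fed, together with s_t, into the output layer g that produces the word distribution.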
We also use the same model architecture for the entailment generation task, i.e., a sequence-to-sequence model encoding the premise and decoding the entailed hypothesis, via bilinear attention between them.

Figure 2: Multi-task learning of the summarization task (left) with the entailment generation task (right).

Multi-Task Learning
Multi-task learning helps share knowledge between related tasks across domains (Luong et al., 2016). In this work, we show improvements on the task of abstractive summarization by sharing its parameters with the task of entailment generation. Since a summary is entailed by the input document, sharing parameters with the entailment generation task improves the logically-directed aspect of the summarization model while maintaining the salient information extraction aspect. In our multi-task setup, we share the decoder parameters of the two tasks (along with the word embeddings), as shown in Fig. 2, and we optimize the two loss functions (one for summarization and one for entailment generation) in alternating mini-batches of training. Let α_s be the number of mini-batches of summarization training after which we switch to training α_e mini-batches of entailment generation. The mixing ratio is then defined as α_s/(α_s+α_e) : α_e/(α_s+α_e).
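The alternating training schedule above can be sketched as a simple generator. This is an illustrative sketch of the scheduling logic only (the function name is ours); the actual gradient updates on the shared decoder are omitted.

```python
def alternating_schedule(alpha_s, alpha_e, total_batches):
    """Yield which task each mini-batch trains on.

    Trains alpha_s consecutive mini-batches of summarization ('summ'),
    then alpha_e mini-batches of entailment generation ('ent'), and
    repeats. The resulting mixing ratio is
    alpha_s/(alpha_s+alpha_e) : alpha_e/(alpha_s+alpha_e).
    """
    cycle = ['summ'] * alpha_s + ['ent'] * alpha_e
    for i in range(total_batches):
        yield cycle[i % len(cycle)]

# With the 1 : 0.05 ratio used later in the paper (100 summarization
# mini-batches alternating with 5 entailment mini-batches), roughly 95%
# of updates go to the summarization loss.
schedule = list(alternating_schedule(100, 5, 210))
```

Because the decoder parameters are shared, every mini-batch of either task updates the same decoder weights, which is how the entailment signal reaches the summarizer.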

Training Details
We use the following simple settings for all models, unless otherwise specified. We unroll the encoder RNNs to a maximum of 50 time steps and the decoder RNNs to a maximum of 30 time steps.

Table 1: Summarization results on Gigaword. Rouge scores are full-length F-1, following previous work.
We use an RNN hidden state dimension of 512 and a word embedding dimension of 256. We do not initialize our word embeddings with any pre-trained models, i.e., we learn them from scratch. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001. During training, to handle the large vocabulary, we use the sampled-loss trick of Jean et al. (2014). We always tune hyperparameters on the validation set of the corresponding dataset, where applicable. For multi-task learning, we tried a few mixing ratios and found 1 : 0.05 to work best, i.e., 100 mini-batches of summarization alternating with 5 mini-batches of entailment generation in each training round.

Summarization Results: Gigaword
Baseline Results and Previous Work. Our baseline is a strong encoder-attention-decoder model based on Luong et al. (2015) and presented by Chopra et al. (2016). As shown in Table 1, it is reasonably close to some of the state-of-the-art (comparable) results in previous work, though further strengthening this baseline (e.g., with a pointer-copy mechanism) is our next step.

Multi-Task Results
We show promising initial multi-task improvements on top of our baseline, based on several metrics. This suggests that the entailment generation model is teaching the summarization model some skills about how to choose a logical subset of the events in the full input document. This is especially promising given that the domain of the entailment dataset (image captions) is very different from the domain of the summarization datasets (news), suggesting that the model might be learning some domain-agnostic inference skills.

On the test-only DUC setting, our Luong et al. (2015) baseline model achieves competitive performance with previous work, especially on Rouge-2 and Rouge-L. Next, we show promising multi-task improvements over this baseline of around 0.4% across all metrics, despite this being a test-only setting and despite the mismatch between the summarization and entailment domains. Figure 3 shows some additional interesting output examples from our multi-task model, illustrating how it generates summaries that are better entailed by the input document, whereas the baseline model's summaries contain some crucial contradictory or unrelated information.
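For reference, the Rouge-1 F-1 score reported in the tables can be sketched as clipped unigram overlap between a candidate summary and a reference. This is a simplified illustration, not the official Rouge package: real Rouge evaluation additionally handles stemming, stopword options, multiple references, and the Rouge-2/Rouge-L variants.

```python
from collections import Counter

def rouge1_f1(candidate, reference):
    """Simplified Rouge-1 F-1 between two token lists.

    Counts clipped unigram matches (Counter intersection takes the
    minimum count per token), then combines unigram precision and
    recall into an F-1 score.
    """
    cand, ref = Counter(candidate), Counter(reference)
    overlap = sum((cand & ref).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    p = overlap / sum(cand.values())      # unigram precision
    r = overlap / sum(ref.values())       # unigram recall
    return 2 * p * r / (p + r)

score = rouge1_f1("pivotal month for storage firms".split(),
                  "moving companies hit bumps in economic road".split())
```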

Conclusion and Next Steps
We presented a multi-task learning approach to incorporate entailment generation knowledge into summarization models. We demonstrated promising initial improvements based on multiple datasets and metrics, even when the entailment knowledge came from a domain different from the summarization domain. Our next steps for this workshop paper include: (1) stronger summarization baselines, e.g., using a pointer-copy mechanism (See et al., 2017; Nallapati et al., 2016), and also adding this capability to the entailment generation model; (2) results on the CNN/Daily Mail corpora (Nallapati et al., 2016); (3) incorporating entailment knowledge from other news-style domains such as the new Multi-NLI corpus (Williams et al., 2017); and (4) demonstrating mutual improvements on the entailment generation task.