Soft Layer-Specific Multi-Task Summarization with Entailment and Question Generation

An accurate abstractive summary of a document should contain all its salient information and should be logically entailed by the input document. We improve these important aspects of abstractive summarization via multi-task learning with the auxiliary tasks of question generation and entailment generation, where the former teaches the summarization model how to look for salient questioning-worthy details, and the latter teaches the model how to rewrite a summary which is a directed-logical subset of the input document. We also propose novel multi-task architectures with high-level (semantic) layer-specific sharing across multiple encoder and decoder layers of the three tasks, as well as soft-sharing mechanisms (and show performance ablations and analysis examples of each contribution). Overall, we achieve statistically significant improvements over the state-of-the-art on both the CNN/DailyMail and Gigaword datasets, as well as on the DUC-2002 transfer setup. We also present several quantitative and qualitative analysis studies of our model’s learned saliency and entailment skills.


Introduction
Abstractive summarization is the challenging NLG task of compressing and rewriting a document into a short, relevant, salient, and coherent summary. It has numerous applications such as summarizing storylines, event understanding, etc. As compared to extractive or compressive summarization (Jing and McKeown, 2000; Knight and Marcu, 2002; Clarke and Lapata, 2008; Filippova et al., 2015; Henß et al., 2015), abstractive summaries are based on rewriting as opposed to selecting. Recent end-to-end, neural sequence-to-sequence models and larger datasets have allowed substantial progress on the abstractive task, with ideas ranging from copy-pointer mechanism and redundancy coverage, to metric reward based reinforcement learning (Rush et al., 2015; Chopra et al., 2016; Nallapati et al., 2016; See et al., 2017).
* Equal contribution (published at ACL 2018).
Despite these strong recent advancements, there is still a lot of scope for improving the summary quality generated by these models. A good rewritten summary is one that contains all the salient information from the document, is logically followed (entailed) by it, and avoids redundant information. The redundancy aspect was addressed by coverage models (Suzuki and Nagata, 2016; Chen et al., 2016; Nallapati et al., 2016; See et al., 2017), but we still need to teach these models about how to better detect salient information from the input document, as well as about better logically-directed natural language inference skills.
In this work, we improve abstractive text summarization via soft, high-level (semantic) layer-specific multi-task learning with two relevant auxiliary tasks. The first is that of document-to-question generation, which teaches the summarization model about what are the right questions to ask, which in turn is directly related to what the salient information in the input document is. The second auxiliary task is a premise-to-entailment generation task to teach it how to rewrite a summary which is a directed-logical subset of (i.e., logically follows from) the input document, and contains no contradictory or unrelated information. For the question generation task, we use the SQuAD dataset (Rajpurkar et al., 2016), where we learn to generate a question given a sentence containing the answer, similar to the recent work by Du et al. (2017). Our entailment generation task is based on the recent SNLI classification dataset and task (Bowman et al., 2015), converted to a generation task (Pasunuru and Bansal, 2017).
Further, we also present novel multi-task learning architectures based on multi-layered encoder and decoder models, where we empirically show that it is substantially better to share the higher-level semantic layers between the three aforementioned tasks, while keeping the lower-level (lexico-syntactic) layers unshared. We also explore different ways to optimize the shared parameters and show that 'soft' parameter sharing achieves higher performance than hard sharing.
Empirically, our soft, layer-specific sharing model with the question and entailment generation auxiliary tasks achieves statistically significant improvements over the state-of-the-art on both the CNN/DailyMail and Gigaword datasets. It also performs significantly better on the DUC-2002 transfer setup, demonstrating its strong generalizability as well as the importance of auxiliary knowledge in low-resource scenarios. We also report improvements on our auxiliary question and entailment generation tasks over their respective previous state-of-the-art. Moreover, we significantly decrease the training time of the multitask models by initializing the individual tasks from their pretrained baseline models. Finally, we present human evaluation studies as well as detailed quantitative and qualitative analysis studies of the improved saliency detection and logical inference skills learned by our multi-task model.
Multi-task learning (MTL) is a useful paradigm to improve the generalization performance of a task with related tasks while sharing some common parameters/representations (Caruana, 1998; Argyriou et al., 2007; Kumar and Daumé III, 2012). Several recent works have adopted MTL in neural models (Luong et al., 2016; Misra et al., 2016; Hashimoto et al., 2017; Pasunuru and Bansal, 2017; Ruder et al., 2017; Kaiser et al., 2017). Moreover, some of the above works have investigated the use of shared vs. unshared sets of parameters. On the other hand, we investigate the importance of soft parameter sharing and high-level versus low-level layer-specific sharing.
Our previous workshop paper (Pasunuru et al., 2017) presented some preliminary results for multi-task learning of textual summarization with entailment generation. This current paper has several major differences: (1) We present question generation as an additional effective auxiliary task to enhance the important complementary aspect of saliency detection; (2) Our new high-level layer-specific sharing approach is significantly better than alternative layer-sharing approaches (including the decoder-only sharing by Pasunuru et al. (2017)); (3) Our new soft parameter sharing approach gives statistically significant improvements over hard sharing; (4) We propose a useful idea of starting multi-task models from their pretrained baselines, which significantly speeds up our experiment cycle; (5) For evaluation, we show diverse improvements of our soft, layer-specific MTL model (over state-of-the-art pointer+coverage baselines) on the CNN/DailyMail, Gigaword, and DUC datasets; we also report human evaluation plus analysis examples of learned saliency and entailment skills, as well as improvements on the auxiliary question and entailment generation tasks over their respective previous state-of-the-art.
In our work, we use a question generation task to improve the saliency of abstractive summarization in a multi-task setting. Using the SQuAD dataset (Rajpurkar et al., 2016), we learn to generate a question given the sentence containing the answer span in the comprehension (similar to Du et al. (2017)). For the second auxiliary task of entailment generation, we use the generation version of the RTE classification task (Dagan et al., 2006;Lai and Hockenmaier, 2014;Jimenez et al., 2014;Bowman et al., 2015). Some previous work has explored the use of RTE for redundancy detection in summarization by modeling graph-based relationships between sentences to select the most non-redundant sentences (Mehdad et al., 2013;Gupta et al., 2014), whereas our approach is based on multi-task learning.

Models
First, we introduce our pointer+coverage baseline model and then our two auxiliary tasks: question generation and entailment generation (and finally the multi-task learning models in Sec. 4).

Baseline Pointer+Coverage Model
We use a sequence-attention-sequence model with a 2-layer bidirectional LSTM-RNN encoder and a 2-layer uni-directional LSTM-RNN decoder, along with Bahdanau et al. (2015) style attention. Let $x = \{x_1, x_2, ..., x_m\}$ be the source document and $y = \{y_1, y_2, ..., y_n\}$ be the target summary. The output summary generation vocabulary distribution conditioned over the input source document is $P_v(y|x;\theta) = \prod_{t=1}^{n} p_v(y_t | y_{1:t-1}, x; \theta)$. Let the decoder hidden state be $s_t$ at time step $t$ and let $c_t$ be the context vector which is defined as a weighted combination of encoder hidden states. We concatenate the decoder's (last) RNN layer hidden state $s_t$ and context vector $c_t$ and apply a linear transformation, and then project to the vocabulary space by another linear transformation. Finally, the conditional vocabulary distribution at each time step $t$ of the decoder is defined as:
$$p_v(y_t | y_{1:t-1}, x; \theta) = \mathrm{sfm}(V_p(W_f[s_t; c_t] + b_f) + b_p) \quad (1)$$
where $W_f$, $V_p$, $b_f$, $b_p$ are trainable parameters, and sfm(·) is the softmax function.
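The projection described above (concatenate the decoder state and the context vector, apply one linear map, then a second linear map into vocabulary space, then a softmax) can be sketched in a few lines. This is only an illustrative sketch in plain Python: the function names `affine`, `softmax`, and `vocab_distribution` are hypothetical, vectors are ordinary lists, and the LSTM and attention computations that would produce `s_t` and `c_t` are assumed to exist elsewhere.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def affine(matrix, vector, bias):
    """Compute matrix @ vector + bias for list-of-rows matrices."""
    return [sum(w * v for w, v in zip(row, vector)) + b
            for row, b in zip(matrix, bias)]


def vocab_distribution(s_t, c_t, W_f, b_f, V_p, b_p):
    """Vocabulary distribution for one decoder step.

    s_t: decoder hidden state, c_t: attention context vector.
    W_f/b_f: first linear map over the concatenation [s_t; c_t].
    V_p/b_p: projection of the resulting features to vocabulary size.
    """
    features = affine(W_f, s_t + c_t, b_f)  # list "+" concatenates
    logits = affine(V_p, features, b_p)
    return softmax(logits)
```

In a real model these would be batched tensor operations; the sketch only makes the order of the two linear maps and the softmax explicit.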
Pointer-Generator Networks The pointer mechanism (Vinyals et al., 2015) helps in directly copying the words from the source sequence during target sequence generation, which is a good fit for a task like summarization. Our pointer mechanism approach is similar to See et al. (2017), who use a soft switch based on the generation probability $p_g = \sigma(W_g c_t + U_g s_t + V_g e_{w_{t-1}} + b_g)$, where $\sigma(\cdot)$ is a sigmoid function, and $W_g$, $U_g$, $V_g$ and $b_g$ are parameters learned during training. $e_{w_{t-1}}$ is the previous time step output word embedding. The final word distribution is $P_f(y) = p_g \cdot P_v(y) + (1 - p_g) \cdot P_c(y)$, where the vocabulary distribution $P_v$ is as shown in Eq. 1, and the copy distribution $P_c$ is based on the attention distribution over source document words.
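As a small worked illustration of the soft switch, the sketch below mixes a vocabulary distribution with a copy distribution built from attention weights. It is a toy under stated assumptions: distributions are plain dictionaries keyed by word, the name `final_distribution` is hypothetical, and the copy distribution is taken to be the attention mass summed per source token (the paragraph above only says it is based on the attention distribution).

```python
def final_distribution(p_gen, vocab_dist, attention, source_tokens):
    """Mix generation and copy distributions for one decoder step.

    p_gen: generation probability in [0, 1] (the soft switch).
    vocab_dist: dict word -> probability under the vocabulary softmax.
    attention: attention weights, one per source token (sums to 1).
    source_tokens: the source words aligned with `attention`.
    """
    # Generation part: p_g * P_v(y)
    final = {word: p_gen * prob for word, prob in vocab_dist.items()}
    # Copy part: (1 - p_g) * P_c(y), accumulating repeated source words
    for token, weight in zip(source_tokens, attention):
        final[token] = final.get(token, 0.0) + (1.0 - p_gen) * weight
    return final
```

Note that a source word absent from the vocabulary (an out-of-vocabulary token) still receives probability mass through the copy term, which is the main practical benefit of the pointer.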
Coverage Mechanism Following previous work (See et al., 2017), coverage helps alleviate the issue of word repetition while generating long summaries. We maintain a coverage vector $\hat{c}_t = \sum_{t'=0}^{t-1} \alpha_{t'}$ that sums over the attention distributions $\alpha_{t'}$ of all previous time steps, and this is added as input to the attention mechanism. The coverage loss is $L_{cov}(\theta) = \sum_t \sum_i \min(\alpha_{t,i}, \hat{c}_{t,i})$. Finally, the total loss is a weighted combination of cross-entropy loss and coverage loss:
$$L(\theta) = -\log P_f(y|x;\theta) + \lambda L_{cov}(\theta) \quad (2)$$
where $\lambda$ is a tunable hyperparameter.
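The coverage computation can be checked by hand on a tiny example. The sketch below is illustrative only: `coverage_loss` and `total_loss` are hypothetical names, attention distributions are plain lists, and the weighted combination is assumed to take the simple form "negative log-likelihood plus λ times coverage loss".

```python
def coverage_loss(attention_steps):
    """Sum over steps t and source positions i of min(alpha[t][i], coverage[t][i]).

    attention_steps: list of attention distributions, one per decoder step.
    The coverage vector at step t is the sum of all earlier distributions.
    """
    if not attention_steps:
        return 0.0
    coverage = [0.0] * len(attention_steps[0])
    loss = 0.0
    for alpha in attention_steps:
        loss += sum(min(a, c) for a, c in zip(alpha, coverage))
        # Update coverage only after scoring the current step
        coverage = [c + a for c, a in zip(coverage, alpha)]
    return loss


def total_loss(negative_log_likelihood, cov_loss, lam):
    """Assumed form of the weighted combination: NLL + lambda * coverage."""
    return negative_log_likelihood + lam * cov_loss
```

Attending to the same source position twice is penalized, while moving attention to a fresh position costs nothing, which is how the loss discourages repetition.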

Two Auxiliary Tasks
Despite the strengths of the baseline model described above with attention, pointer, and coverage, a good summary should also contain maximal salient information and be a directed logical entailment of the source document. We teach these skills to the abstractive summarization model via multi-task training with two related auxiliary tasks: question generation task and entailment generation.

Question Generation
The task of question generation is to generate a question from a given input sentence, which in turn is related to the skill of being able to find the important salient information to ask questions about. First the model has to identify the important information present in the given sentence, then it has to frame (generate) a question based on this salient information, such that, given the sentence and the question, one has to be able to predict the correct answer (salient information in this case). A good summary should also be able to find and extract all the salient information in the given source document, and hence we incorporate such capabilities into our abstractive text summarization model by multi-task learning it with a question generation task, sharing some common parameters/representations (see more details in Sec. 4). For setting up the question generation task, we follow Du et al. (2017) and use the SQuAD dataset to extract sentence-question pairs. Next, we use the same sequence-to-sequence model architecture as our summarization model. Note that even though our question generation task is generating one question at a time (footnote 2), our multi-task framework (see Sec. 4) is set up in such a way that the sentence-level knowledge from this auxiliary task can help the document-level primary (summarization) task to generate multiple salient facts, by sharing high-level semantic layer representations. See Sec. 7 and Table 10 for a quantitative evaluation showing that the multi-task model can find multiple (and more) salient phrases in the source document. Also see Sec. 7 (and supp) for challenging qualitative examples where baseline and SotA models only recover a small subset of salient information but our multi-task model with question generation is able to detect more of the important information.
Entailment Generation
The task of entailment generation is to generate a hypothesis which is entailed by (or logically follows from) the given premise as input. In summarization, the generation decoder also needs to generate a summary that is entailed by the source document, i.e., does not contain any contradictory or unrelated/extraneous information as compared to the input document. We again incorporate such inference capabilities into the summarization model via multi-task learning, sharing some common representations/parameters between our summarization and entailment generation model (more details in Sec. 4). For this task, we use the entailment-labeled pairs from the SNLI dataset (Bowman et al., 2015) and set it up as a generation task (using the same strong model architecture as our abstractive summarization model). See Sec. 7 and Table 9 for a quantitative evaluation showing that the multi-task model is better entailed by the source document and has fewer extraneous facts. Also see Sec. 7 and supplementary for qualitative examples of how our multi-task model with the entailment auxiliary task is able to generate more logically-entailed summaries than the baseline and SotA models, which instead produce extraneous, unrelated words not present (in any paraphrased form) in the source document.
Footnote 2: We also tried to generate all the questions at once from the full document, but we obtained low accuracy because of this task's challenging nature and overall less training data.
Figure 1: Overview of our multi-task model with parallel training of three tasks: abstractive summary generation (SG), question generation (QG), and entailment generation (EG). We share the 'blue' color representations across all the three tasks, i.e., second layer of encoder, attention parameters, and first layer of decoder.

Multi-Task Learning
We employ multi-task learning for parallel training of our three tasks: abstractive summarization, question generation, and entailment generation. In this section, we describe our novel layer-specific, soft-sharing approaches and other multi-task learning details.

Layer-Specific Sharing Mechanism
Simply sharing all parameters across the related tasks is not optimal, because models for different tasks have different input and output distributions, esp. for low-level vs. high-level parameters. Therefore, related tasks should share some common representations (e.g., high-level information), as well as need their own individual task-specific representations (esp. low-level information). To this end, we allow different components of model parameters of related tasks to be shared vs. unshared, as described next.
Encoder Layer Sharing: Belinkov et al. (2017) observed that lower layers (i.e., the layers closer to the input words) of RNN cells in a seq2seq machine translation model learn to represent word structure, while higher layers (farther from input) are more focused on high-level semantic meanings (similar to findings in the computer vision community for image features (Zeiler and Fergus, 2014)). We believe that while textual summarization, question generation, and entailment generation have different training data distributions and low-level representations, they can still benefit from sharing their models' high-level components (e.g., those that capture the skills of saliency and inference). Thus, we keep the lower-level layer (i.e., first layer closer to input words) of the 2-layer encoder of all three tasks unshared, while we share the higher layer (second layer in our model as shown in Fig. 1) across the three tasks.
Decoder Layer Sharing: Similarly for the decoder, lower layers (i.e., the layers closer to the output words) learn to represent word structure for generation, while higher layers (farther from output) are more focused on high-level semantic meaning. Hence, we again share the higher level components (first layer in the decoder far from output as shown in Fig. 1), while keeping the lower layer (i.e., second layer) of decoders of all three tasks unshared.
Attention Sharing: As described in Sec. 3.1, the attention mechanism defines an attention distribution over high-level layer encoder hidden states and since we share the second, high-level (semantic) layer of all the encoders, it is intuitive to share the attention parameters as well.
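The sharing scheme above amounts to a partition of each task's parameters into a shared group (encoder layer 2, attention, decoder layer 1) and a task-specific group (everything else). A minimal sketch of such a partition follows; the dotted parameter-name prefixes and the function names are hypothetical conventions chosen for illustration, not names from the paper.

```python
# Hypothetical name prefixes for the high-level components that are shared
# across summarization, question generation, and entailment generation.
SHARED_PREFIXES = ("encoder.layer2.", "attention.", "decoder.layer1.")


def is_shared(param_name):
    """True if the parameter belongs to a shared high-level component."""
    return param_name.startswith(SHARED_PREFIXES)


def partition_parameters(param_names):
    """Split parameter names into (shared, task_specific) lists, keeping order."""
    shared = [name for name in param_names if is_shared(name)]
    task_specific = [name for name in param_names if not is_shared(name)]
    return shared, task_specific
```

With such a partition in hand, hard sharing would tie the shared group across tasks, while soft sharing would keep separate copies and only penalize the distance between them.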

Soft vs. Hard Parameter Sharing
Hard-sharing: In the most common multi-task learning hard-sharing approach, the parameters to be shared are forced to be the same. As a result, gradient information from multiple tasks will directly pass through shared parameters, hence forcing a common space representation for all the related tasks.
Soft-sharing: In our soft-sharing approach, we encourage shared parameters to be close in representation space by penalizing their $\ell_2$ distances. Unlike hard sharing, this approach gives more flexibility for the tasks by only loosely coupling the shared space representations. We minimize the following loss function for the primary task in the soft-sharing approach:
$$L(\theta) = -\log P_f(y|x;\theta) + \lambda L_{cov}(\theta) + \gamma \|\theta_s - \psi_s\| \quad (3)$$
where $\gamma$ is a hyperparameter, $\theta$ represents the primary summarization task's full parameters, while $\theta_s$ and $\psi_s$ represent the shared parameter subset between the primary and auxiliary tasks.
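To make the soft-sharing penalty concrete, the sketch below adds a γ-weighted ℓ2 distance between the primary task's shared parameters and an auxiliary task's corresponding parameters to an already-computed task loss. The names `l2_distance` and `soft_sharing_loss` are hypothetical, parameters are flattened into plain lists, and only a single auxiliary task is shown; how the penalty is combined across several auxiliary tasks is not specified here.

```python
import math


def l2_distance(theta_s, psi_s):
    """Euclidean distance between two flattened parameter vectors."""
    if len(theta_s) != len(psi_s):
        raise ValueError("shared parameter subsets must have the same size")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(theta_s, psi_s)))


def soft_sharing_loss(task_loss, theta_s, psi_s, gamma):
    """Primary-task loss plus a soft-sharing penalty.

    task_loss: the primary task's own loss (e.g., NLL plus coverage term).
    theta_s: primary task's copy of the shared parameters.
    psi_s: auxiliary task's copy of the same parameters.
    gamma: penalty weight; larger values couple the tasks more tightly.
    """
    return task_loss + gamma * l2_distance(theta_s, psi_s)
```

Setting `gamma` to zero recovers fully independent models, while a very large `gamma` approaches hard sharing, so the hyperparameter interpolates between the two regimes.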

Fast Multi-Task Training
During multi-task learning, we alternate the mini-batch optimization of the three tasks, based on a tunable 'mixing ratio' $\alpha_s : \alpha_q : \alpha_e$; i.e., optimizing the summarization task for $\alpha_s$ mini-batches followed by optimizing the question generation task for $\alpha_q$ mini-batches, followed by the entailment generation task for $\alpha_e$ mini-batches (and for 2-way versions of this, we only add one auxiliary task at a time). We continue this process until all the models converge. Also, importantly, instead of training from scratch, we start the primary task (summarization) from a 90%-converged model of its baseline to make the training process faster. We observe that starting from a fully-converged baseline makes the model stuck in a local minimum.
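The alternating schedule driven by the mixing ratio can be written down directly. The following sketch only produces the order in which tasks receive mini-batches; the function name `task_schedule`, the task labels, and the example ratio are hypothetical, and the actual optimizer step and convergence check are left out.

```python
def task_schedule(mixing_ratio, total_steps):
    """Return the task to optimize at each of `total_steps` mini-batch steps.

    mixing_ratio: ordered list of (task_name, num_minibatches) pairs, e.g.
        [("SG", 3), ("QG", 2), ("EG", 1)] means 3 summarization batches,
        then 2 question generation batches, then 1 entailment batch, repeated.
    """
    one_round = [task for task, count in mixing_ratio for _ in range(count)]
    if not one_round:
        raise ValueError("mixing ratio must schedule at least one mini-batch")
    return [one_round[step % len(one_round)] for step in range(total_steps)]
```

A 2-way variant is obtained by passing only two entries, for example the summarization task and a single auxiliary task.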
In addition, we also start all auxiliary models from their 90%-converged baselines, as we found that starting the auxiliary models from scratch has a chance to pull the primary model's shared parameters towards the randomly-initialized auxiliary model's shared parameters.
Table 1: CNN/DailyMail summarization results. ROUGE scores are full length F-1 (as previous work). All the multi-task improvements are statistically significant over the state-of-the-art baseline.

Experimental Setup
[...] approval rate greater than 95%, and had at least 10,000 approved HITs. For the pairwise model comparisons discussed in Sec. 6.2, we showed the annotators the input article, the ground truth summary, and the two model summaries (randomly shuffled to anonymize model identities); we then asked them to choose the better among the two model summaries or choose 'Not-Distinguishable' if both summaries are equally good/bad. Instructions for relevance were defined based on the summary containing salient/important information from the given article, being correct (i.e., avoiding contradictory/unrelated information), and avoiding redundancy. Instructions for readability were based on the summary's fluency, grammaticality, and coherence.
ROUGE scores are full length F-1. All the multi-task improvements are statistically significant over the state-of-the-art baseline.

Multi-Task with Entailment Generation
We first perform multi-task learning between abstractive summarization and entailment generation with soft-sharing of parameters as discussed in Sec. 4. Table 1 and Table 2 show that this multi-task setting is better than our strong baseline models and the improvements are statistically significant on all metrics on both CNN/DailyMail (p < 0.01 in ROUGE-1/ROUGE-L/METEOR and p < 0.05 in ROUGE-2) and Gigaword (p < 0.01 on all metrics) datasets, showing that the entailment generation task is inducing useful inference skills to the summarization task (also see analysis examples in Sec. 7).

Multi-Task with Entailment and Question Generation
Finally, we perform multi-task learning with all three tasks together, achieving the best of both worlds (inference skills and saliency). Table 1 and Table 2 show that our full multi-task model achieves the best scores on CNN/DailyMail and Gigaword datasets, and the improvements are statistically significant on all metrics on both CNN/DailyMail (p < 0.01 in ROUGE-1/ROUGE-L/METEOR and p < 0.02 in ROUGE-2) and Gigaword (p < 0.01 on all metrics). Finally, our 3-way multi-task model (with both entailment and question generation) outperforms the publicly-available pretrained result (†) of the previous SotA (See et al., 2017) with statistical significance (p < 0.01), as well as the higher-reported results ( ) on ROUGE-1/ROUGE-2 (p < 0.01).

Human Evaluation
We also conducted a blind human evaluation on Amazon MTurk for relevance and readability, based on 100 samples, for both CNN/DailyMail and Gigaword (see instructions in Sec. 5). Table 3 shows the CNN/DM results where we do pairwise comparison between our 3-way multi-task model's output summaries w.r.t. our baseline summaries and w.r.t. See et al. (2017) summaries. As shown, our 3-way multi-task model achieves both higher [...] See et al. (2017), our MTL model is higher in relevance scores but a bit lower in readability scores (and is higher in terms of total aggregate scores). One potential reason for this lower readability score is that our entailment generation auxiliary task encourages our summarization model to rewrite more and to be more abstractive than See et al. (2017) (see abstractiveness results in Table 11).
[...] due to adding more data, we separately trained word embeddings on each auxiliary dataset (i.e., SNLI and SQuAD) and incorporated them into the summarization model. We found that both our 2-way multi-task models perform significantly better than these models using the auxiliary word embeddings, suggesting that merely adding more data is not enough.
We also show human evaluation results on the Gigaword dataset in Table 4 (again based on pairwise comparisons for 100 samples), where we see that our MTL model is better than our state-of-the-art baseline on both relevance and readability (footnote 7).

Generalizability Results (DUC-2002)
Next, we also tested our model's generalizability/transfer skills, where we take the models trained on CNN/DailyMail and directly test them on DUC-2002. We take our baseline and 3-way multi-task models, plus the pointer-coverage model from See et al. (2017) (footnote 8). We only retune the beam-size for each of these three models separately (based on DUC-2003 as the validation set) (footnote 9). As shown in Table 5, our multi-task model achieves statistically significant improvements over the strong baseline (p < 0.01 in ROUGE-1 and ROUGE-L) and the pointer-coverage model from See et al. (2017) (p < 0.01 in all metrics). This demonstrates that our model is able to generalize well and that the auxiliary knowledge helps more in low-resource scenarios.

Auxiliary Task Results
In this section, we discuss the individual/separated performance of our auxiliary tasks.
Footnote 7: Note that we did not have output files of any previous work's model on Gigaword; however, our baseline is already a strong state-of-the-art model as shown in Table 2.
Footnote 8: We use the publicly-available pretrained model from See et al. (2017)'s github for these DUC transfer results, which produces the † results in Table 1. All other comparisons and analysis in our paper are based on their higher results.
Footnote 9: We follow previous work which has shown that larger beam values are better and feasible for DUC corpora. However, our MTL model still achieves statistically significant improvements (p < 0.01 in all metrics) over See et al. (2017)
Table 7: Performance of our pointer-based question generation (QG) model w.r.t. previous work.

Entailment Generation
We use the same architecture as described in Sec. 3.1 with pointer mechanism, and Table 6 compares our model's performance to Pasunuru and Bansal (2017). Our pointer mechanism gives a performance boost, since the entailment generation task involves copying from the given premise sentence, whereas the 2-layer model seems comparable to the 1-layer model. Also, the supplementary shows some output examples from our entailment generation model.

Question Generation
Again, we use the same architecture as described in Sec. 3.1 along with the pointer mechanism for the task of question generation. Table 7 compares the performance of our model w.r.t. the state-of-the-art Du et al. (2017). Also, the supplementary shows some output examples from our question generation model.

Ablation and Analysis Studies
Soft-sharing vs. Hard-sharing
As described in Sec. 4.2, we choose soft-sharing over hard-sharing because of the more expressive parameter sharing it provides to the model. Empirical results in Table 8 show that the soft-sharing method is statistically significantly better than hard-sharing with p < 0.001 in all metrics.

Comparison of Different Layer-Sharing Methods
We also conducted ablation studies among various layer-sharing approaches. Table 8 shows results for soft-sharing models with decoder-only sharing (D1+D2; similar to Pasunuru et al. (2017)) as well as lower-layer sharing (encoder layer 1 + decoder layer 2, with and without attention shared). As shown, our final model (high-level semantic layer sharing E2+Attn+D1) outperforms these alternate sharing methods in all metrics with statistical significance (p < 0.05).
Table 8: Ablation studies comparing our final multi-task model with hard-sharing and different alternative layer-sharing methods. Here E1, E2, D1, D2, Attn refer to parameters of the first/second layer of encoder/decoder, and attention parameters. Improvements of the final model upon ablation experiments are all statistically significant with p < 0.05.

Quantitative Improvements in Entailment
We employ a state-of-the-art entailment classifier (Chen et al., 2017), and calculate the average of the entailment probability of each of the output summary's sentences being entailed by the input source document. We do this for output summaries of our baseline and 2-way-EG multi-task model (with entailment generation). As can be seen in Table 9, our multi-task model improves upon the baseline in the aspect of being entailed by the source document (with statistical significance p < 0.001). Further, we use the Named Entity Recognition (NER) module from CoreNLP (Manning et al., 2014) to compute the number of times the output summary contains extraneous facts (i.e., named entities as detected by the NER system) that are not present in the source documents, based on the intuition that a well-entailed summary should not contain unrelated information not followed from the input premise. We found that our 2-way MTL model with entailment generation reduces this extraneous count by 17.2% w.r.t. the baseline.
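Both measurements in this analysis reduce to simple aggregations once the external tools have run. The sketch below is a hedged illustration: `entail_prob` stands in for any entailment classifier returning a probability (it is not the classifier of Chen et al. (2017)), the entity lists are assumed to come from some NER system, and matching entities by lowercased string equality is a simplifying assumption.

```python
def average_entailment(document, summary_sentences, entail_prob):
    """Mean probability that each summary sentence is entailed by the document.

    entail_prob: callable (premise, hypothesis) -> probability in [0, 1];
    a stand-in for an external entailment classifier.
    """
    if not summary_sentences:
        return 0.0
    probs = [entail_prob(document, sentence) for sentence in summary_sentences]
    return sum(probs) / len(probs)


def extraneous_entities(summary_entities, document_entities):
    """Named entities in the summary that never occur in the source document."""
    seen_in_document = {entity.lower() for entity in document_entities}
    return [entity for entity in summary_entities
            if entity.lower() not in seen_in_document]
```

For instance, a summary mentioning "john hartson" for a document whose entities do not include that name would contribute one extraneous entity to the count.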
The qualitative examples below further discuss this issue of generating unrelated information.

We then annotate the ground-truth and model summaries with this keyword classifier and compute the % match, i.e., how many salient words from the ground-truth summary were also generated in the model summary. The results are shown in Table 10, where the 2-way-QG MTL model (with question generation) versus baseline improvement is statistically significant (p < 0.01). Moreover, we found 93 more cases where our 2-way-QG MTL model detects 2 or more additional salient keywords than the pointer baseline model (as opposed to vice versa), showing that the sentence-level question generation task is helping the document-level summarization task in finding more salient terms.
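The % match statistic can be illustrated with a few lines of set arithmetic. This is a sketch under assumptions: the salient keywords are taken as already identified (the keyword classifier itself is not reproduced), matching is exact on token strings, and the function name `keyword_match_percent` is hypothetical.

```python
def keyword_match_percent(reference_keywords, summary_tokens):
    """Percentage of ground-truth salient keywords present in a model summary.

    reference_keywords: salient words found in the ground-truth summary.
    summary_tokens: tokens of the model-generated summary.
    """
    reference = set(reference_keywords)
    if not reference:
        return 0.0
    matched = reference & set(summary_tokens)
    return 100.0 * len(matched) / len(reference)
```

Comparing this value between two models on the same reference gives a per-example view of which model recovers more salient terms.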

Qualitative Examples on Entailment and Saliency Improvements
Fig. 2 presents an example of output summaries generated by See et al. (2017), our baseline, and our 3-way multi-task model. See et al. (2017) and our baseline models generate phrases like "john hartson" and "hampden injustice" that don't appear in the input document, hence they are not entailed by the input (footnote 12). Moreover, both models missed salient information like "josh meekings", "leigh griffiths", and "hoops", that our multi-task model recovers (footnote 13). Hence, our 3-way multi-task model generates summaries that are both better at logical entailment and contain more salient information. We refer to supplementary Fig. 5 for more details and similar examples for separated 2-way multi-task models (supplementary Fig. 3, Fig. 4).
Footnote 12: This extra, non-entailed unrelated/contradictory information is not present at all in any paraphrased form in the input document.
Footnote 13: We consider the fill-in-the-blank highlights annotated by humans on the CNN/DailyMail dataset as salient information.
Input Document: celtic have written to the scottish football association in order to gain an ' understanding ' of the refereeing decisions during their scottish cup semi-final defeat by inverness on sunday . the hoops were left outraged by referee steven mclean 's failure to award a penalty or red card for a clear handball in the box by josh meekings to deny leigh griffith 's goal-bound shot during the first-half . caley thistle went on to win the game 3-2 after extra-time and denied rory delia 's men the chance to secure a domestic treble this season . celtic striker leigh griffiths has a goal-bound shot blocked by the outstretched arm of josh meekings . celtic 's adam matthews -lrb- right -rrb- slides in with a strong challenge on nick ross in the scottish cup semi-final . ' given the level of reaction from our supporters and across football , we are duty bound to seek an understanding of what actually happened , ' celtic said in a statement . they added , ' we have not been given any other specific explanation so far and this is simply to understand the circumstances of what went on and why such an obvious error was made . however , the parkhead outfit made a point of congratulating their opponents , who have reached the first-ever scottish cup final in their history , describing caley as a ' fantastic club and saying ' reaching the final is a great achievement . ' celtic had taken the lead in the semi-final through defender virgil van dijk 's curling free-kick on 18 minutes , but were unable to double that lead thanks to the meekings controversy . it allowed inverness a route back into the game and celtic had goalkeeper craig gordon sent off after the restart for scything down marley watkins in the area . greg tansey duly converted the resulting penalty . edward ofere then put caley thistle ahead , only for john guidetti to draw level for the bhoys . with the game seemingly heading for penalties , david raven scored the winner on 117 minutes , breaking thousands of celtic hearts . celtic captain scott brown -lrb- left -rrb- protests to referee steven mclean but the handball goes unpunished . griffiths shows off his acrobatic skills during celtic 's eventual surprise defeat by inverness . celtic pair aleksandar tonev -lrb- left -rrb- and john guidetti look dejected as their hopes of a domestic treble end .
Ground-truth: celtic were defeated 3-2 after extra-time in the scottish cup semi-final . leigh griffiths had a goal-bound shot blocked by a clear handball. however, no action was taken against offender josh meekings . the hoops have written the sfa for an 'understanding' of the decision .
See et al. (2017): john hartson was once on the end of a major hampden injustice while playing for celtic . but he can not see any point in his old club writing to the scottish football association over the latest controversy at the national stadium . hartson had a goal wrongly disallowed for offside while celtic were leading 1-0 at the time but went on to lose 3-2 .
Our Baseline: john hartson scored the late winner in 3-2 win against celtic . celtic were leading 1-0 at the time but went on to lose 3-2 . some fans have questioned how referee steven mclean and additional assistant alan muir could have missed the infringement .
Multi-task: celtic have written to the scottish football association in order to gain an ' understanding ' of the refereeing decisions . the hoops were left outraged by referee steven mclean 's failure to award a penalty or red card for a clear handball in the box by josh meekings . celtic striker leigh griffiths has a goal-bound shot blocked by the outstretched arm of josh meekings .
Figure 2: [...] See et al. (2017) and the baseline both include non-entailed words/phrases (e.g., "john hartson"), and they also miss salient information ("hoops", "josh meekings", "leigh griffiths") in their output summaries. Our multi-task model, however, manages to accomplish both, i.e., cover more salient information and also avoid unrelated information.

Conclusion
We presented a multi-task learning approach to improve abstractive summarization by incorporating the ability to detect salient information and to be logically entailed by the document, via question generation and entailment generation auxiliary tasks. We propose effective soft and high-level (semantic) layer-specific parameter sharing and achieve significant improvements over the state-of-the-art on two popular datasets, as well as on a generalizability/transfer DUC-2002 setup.

A Supplementary Material
A.1 Dataset Details

CNN/DailyMail Dataset
The CNN/DailyMail dataset (Hermann et al., 2015; Nallapati et al., 2016) is a large collection of online news articles and their multi-sentence summaries. We use the original, non-anonymized version of the dataset provided by See et al. (2017). Overall, the dataset has 287,226 training pairs, 13,368 validation pairs, and 11,490 test pairs. On average, a source document has 781 tokens and a target summary has 56 tokens.

Gigaword Corpus
Gigaword is based on a large collection of news articles, where the article's first sentence is considered as the input document and the headline of the article as the output summary. We use the annotated corpus provided by Rush et al. (2015). It has around 3.8 million training samples. For validation, we use 2,000 samples, and for test evaluation we use the standard test set provided by Rush et al. (2015). Following previous work, we limit our vocabulary to 50,000 frequent words.
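As a rough illustration of the vocabulary cap mentioned above, the following minimal sketch keeps only the most frequent tokens and maps everything else to an unknown-word symbol. The function names (`build_vocab`, `encode`), the special symbols, and the choice not to count the special symbols toward the cap are assumptions made for this example, not details taken from our implementation.

```python
from collections import Counter

def build_vocab(tokenized_texts, max_size=50000, specials=("<pad>", "<unk>")):
    """Map the max_size most frequent tokens (plus special symbols) to integer ids."""
    counts = Counter(tok for text in tokenized_texts for tok in text)
    frequent = [tok for tok, _ in counts.most_common(max_size)]
    # Assumption: special symbols are added on top of the frequency-based cap.
    return {tok: idx for idx, tok in enumerate(list(specials) + frequent)}

def encode(tokens, vocab):
    """Replace out-of-vocabulary tokens with the <unk> id."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in tokens]

# Toy corpus: "a" occurs 3 times, "b" twice, "c" once.
toy_corpus = [["a", "b", "a"], ["a", "c", "b"]]
toy_vocab = build_vocab(toy_corpus, max_size=2)
toy_ids = encode(["a", "c"], toy_vocab)  # "c" falls outside the cap
```

With `max_size=2` only "a" and "b" survive, so "c" is encoded with the `<unk>` id.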

DUC Corpus
We use the DUC-2002 document summarization dataset to check our model's generalizability. The DUC-2002 corpus consists of 567 documents with one or two human-annotated reference summaries. We also tried beam retuning using DUC-2003 as a validation set, which consists of 624 documents with single human-annotated reference summaries.

SNLI corpus
We use the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015) for our entailment generation task. We use the re-splits provided by Pasunuru and Bansal (2017) to ensure zero train-test overlap and a multi-reference setup. This dataset has a total of 145,822 unique premise pairs out of 190,113 pairs, which are used for training, and the rest of them are divided equally into validation and test sets.

SQuAD Dataset
We use the Stanford Question Answering Dataset (SQuAD) for our question generation task (Rajpurkar et al., 2016). In the SQuAD dataset, given the comprehension passage and a question, the task is to predict the answer span in the passage. In our question generation task, however, we extract the sentence containing the answer span from the passage and create a sentence-question pair, similar to Du et al. (2017). The dataset has around 100K sentence-question pairs from 536 articles.
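The pair construction described above can be sketched as follows. This is only an illustrative sketch: the helper names, the regex-based sentence splitter, and the toy passage are assumptions for the example, and a real pipeline would likely use a proper sentence tokenizer.

```python
import re

def sentence_with_answer(context, answer_start):
    """Return the sentence of `context` that contains character offset `answer_start`.

    Sentences are split naively at whitespace that follows '.', '!' or '?'.
    """
    start = 0
    for boundary in re.finditer(r"(?<=[.!?])\s+", context):
        if answer_start < boundary.start():
            return context[start:boundary.start()]
        start = boundary.end()
    return context[start:]  # the answer lies in the final sentence

def make_pair(context, question, answer_start):
    """Build a (sentence, question) training pair for question generation."""
    return sentence_with_answer(context, answer_start), question

# Hypothetical toy passage in the style of a SQuAD entry.
toy_context = "The Eiffel Tower is in Paris. It was completed in 1889. Many people visit it."
toy_question = "When was the Eiffel Tower completed?"
toy_pair = make_pair(toy_context, toy_question, toy_context.index("1889"))
```

Here the answer offset points at "1889", so the extracted source sentence is the second one in the passage.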

A.2 Training Details
The following training details are common across all models and datasets. We use LSTM-RNNs in our sequence models with a hidden state size of 256 dimensions, and 128-dimensional word embeddings. We do not use dropout or any other regularization technique, but we clip gradients to a maximum norm of 2.0. We use the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 0.001. We also share the word embedding representations between the encoder and decoder in our models. All our tuning decisions (including soft/hard and layer-specific sharing decisions) were made on the appropriate validation/development set.
CNN/DailyMail: For all the models involving the CNN/DailyMail dataset, we use a maximum encoder RNN step size of 400 and a maximum decoder RNN step size of 100. We use a mini-batch size of 16. We initialize the LSTM-RNNs with uniform random initialization in the range [−0.02, 0.02]. We set λ to 1.0 in the joint cross-entropy and coverage loss. Also, we only add coverage to the converged model with attention and pointer mechanism, and lower the learning rate from 0.001 to 0.0001. During multi-task learning, we use the coverage mechanism for the primary (CNN/DailyMail summarization) task but not for the auxiliary tasks (because they do not have traditional redundancy issues). The penalty coefficient γ for soft-sharing is set to 5 × 10^-5 and 1 × 10^-5 for the 2-way and 3-way multi-task models, respectively (the range of the penalty value is intuitively chosen such that we balance the cross-entropy and regularization losses). At inference time, we use a beam size of 4, following previous work (See et al., 2017).
Gigaword: For all the models involving the Gigaword dataset, we use a maximum encoder RNN step size of 50 and a maximum decoder RNN step size of 20. We use a mini-batch size of 256. We initialize the LSTM-RNNs with uniform random initialization in the range [−0.01, 0.01].
We do not use cov-
Input Document: john hughes has revealed how he came within a heartbeat of stepping down from his job at inverness as the josh meekings controversy went into overdrive this week . the caley thistle boss says he felt so repulsed by the gut-wrenching predicament being endured by his young defender - before he was dramatically cleared - that he was ready to walk away from his post and the games he loves , just weeks before an historic scottish cup final date . keen cyclist hughes set off on a lonely bike ride after hearing meekings had been cited for the handball missed by officials in the semi-final against celtic , and admits his head was in a spin over an affair that has dominated the back-page headlines since last sunday . inverness defender josh meekings will be allowed to appear in scottish cup final after his ban was dismissed . only messages of support awaiting him on his return from footballing friends brought him back from the brink of quitting . hughes , who lives in the black isle just north of inverness , said : ' i came in here this morning after a day off . i turned my phone off and was away myself , away out on the bike with plenty of thinking time : a great freedom of mind . ' i was that sick of what has been going on in scottish football i was seriously contemplating my own future . i 'm serious when i say that . ' i had just had it up to here and was ready to just give it up . if it was n't for what happened when i turned my phone back on , with the phone calls and texts i received from people i really value in football , that my spirits picked up again . ' the calls and texts came in from all over the place , from some of the highest levels across the game . i 've had phone calls that have really got me back on my feet . ' i would n't like to name them all , but there were a lot of good people and a good few close friends in the football fraternity . meekings was not sent off and no penalty was given as inverness went on to beat celtic 3-2 after extra-time . ' they were saying : " you need to lead from the front , you need to fight it . " that restored and galvanised that focus and drive in me . and , if that was how i was feeling , how was the boy josh meekings feeling ? it should never have come to this . ' meekings was cleared to play in the final by the judicial panel yesterday , but hughes insists this ' unprecedented ' sfa wrangle must be the catalyst for change in scottish football 's governance . although those who sit on the panel are drawn from many walks of life , ranging from former players and coaches to ex-refs and members of the legal profession , hughes said he wants ' real football people ' drafted in instead of the ' suits ' he claims lack understanding of the nuances and spirit of the professional game . and he seemed to point a thinly-veiled finger of accusation at sfa chief executive stewart regan by alleging that compliance officer tony mcglennan was a mere ' patsy ' in the process . (...)
Ground-truth: Inverness defender josh meekings has won appeal against one-match ban . the 22-year-old was offered one-game suspension following incident . however , an independent judicial panel tribunal overturned decision . inverness reached the scottish cup final with 3-2 win over celtic .
See et al. (2017): Josh meekings has been cleared to play in the scottish cup final . The englishman admitted he was fortunate not to have conceded a penalty and been sent off by referee steven mclean for stopping leigh griffiths net-bound effort on his goal-line . Meekings was not sent off and no penalty was given as inverness went on to beat celtic 3-2 .
Our Baseline: Josh meekings cleared to play in the scottish cup final on may 30 . Inverness defender josh meekings will be allowed to appear in scottish cup final . Meekings was not sent off and no penalty was given as inverness went on to beat celtic 3-2 .
Multi-task: Josh meekings has been cleared to play in the scottish cup final . Inverness defender josh meekings will be allowed to appear in scottish cup final after his ban was dismissed . Inverness went on to beat celtic 3-2 after extra-time .
See et al. (2017) and the baseline generate extraneous information that is not entailed by the source document ("referee steven mclean" and "may 30"), but our multi-task model avoids such unrelated information and generates summaries that logically follow from the source document.
Input Document: bending and rising in spectacular fashion , these stunning pictures capture the paddy fields of south east asia and the arduous life of the farmers who cultivate them . in a photo album that spans over china , thailand , vietnam , laos and cambodia , extraordinary images portray the crop 's full cycle from the primitive sowing of seeds to the distribution of millions of tonnes for consumption . the pictures were taken by professional photographer scott gable , 39 , who spent four months travelling across the region documenting the labour and threadbare equipment used to harvest the carbohydrate-rich food . scroll down for video . majestic : a farmer wades through the mud with a stick as late morning rain falls on top of dragonsbone terraces in longsheng county , china . rice is a staple food for more than one-half the world 's population , but for many consumers , its origin remains somewhat of a mystery . the crop accounts for one fifth of all calories consumed by humans and 87 per cent of it is produced in asia . it is also the thirstiest crop there is -according to the un , farmers need at least 2,000 litres of water to make one kilogram of rice . mr gable said he was determined to capture every stage of production with his rice project -from the planting to the harvesting all the way down to the shipping of the food . after acquiring some contacts from experts at cornell university in new york and conducting his own research , he left for china last may and spent the next four months traveling . he said : ' the images were taken over a four month period from april to july last year across asia . i visited china , thailand , vietnam , laos and cambodia as part of my rice project . video courtesy of www.scottgable.com . breathtaking : a paddy field worker toils on the beautiful landscape of dragonsbone terraces in longsheng county , china . 
farmers ' procession : a rice planting festival parade takes place near the village of pingan in guangxi province , china . ' the project is one part of a larger three part project on global food staples -rice , corn and wheat . i am currently in the process of shooting the corn segment . ' the industrialisation of our food and mono-culture food staples have interested me for some time so that 's probably what inspired me to do this project . ' i shot the whole project using a canon slr and gopros . the actual shooting took four months and then post production took another four more months . ' the reaction to my work has been incredibly positive -i was able to secure a solo gallery show and create quite a bit of interest online which has been great . ' family crop : a hani woman in traditional clothing sits on top of her family 's rice store in yunnan province , china . arduous labour : employees of taiwan 's state-run rice experimental station are pictured beating rice husks by hand as the sun shines on them . mr gable spent months learning mandarin chinese in preparation for his trip , but the language barrier was still his greatest challenge . (...) Ground-truth: the spectacular photos were taken at paddy fields in china , thailand , vietnam , laos and cambodia . photographer scott gable spent four months travelling region to document the process of harvesting the crop . rice accounts for one fifth of all calories consumed by humans but crop is often still cultivated in primitive way . See et al. (2017): the pictures were taken by professional photographer scott gable , 39 , who spent four months travelling across the region documenting the labour and the arduous life of the farmers who cultivate them . the images were taken over a four month period from april to july last year across asia . mr gable said he was determined to capture every stage of production with his rice project . 
Our Baseline: rice is a staple food for more than one-half the world 's population . crop accounts for one fifth of all calories consumed by humans and 87 per cent of it is produced in asia .
Multi-task: in a photo album that spans over china , thailand , vietnam , laos and cambodia , extraordinary images portray the crop 's full cycle from the primitive sowing of seeds to the distribution of millions of tonnes for consumption . the crop accounts for one fifth of all calories consumed by humans and 87 per cent of it is produced in asia .
Outputs of See et al. (2017), our baseline, and the 3-way multi-task model with summarization and both entailment generation and question generation. The boxed-red highlighted words/phrases are not present in the input source document in any paraphrased form. All the unboxed-green highlighted words/phrases correspond to the salient information. See detailed discussion in Fig. 3 and Fig. 4 above.
Premise: People walk down a paved street that has red lanterns hung from the buildings.
Entailment: People walk down the street.
Premise: A young woman on a boat in a light colored bikini kicks a man wearing a straw cowboy hat.
Entailment: A young woman strikes a man with her feet.

A.2.1 Multi-Task Learning Details
Multi-Task Learning with Question Generation: Two important hyperparameters tuned are the mixing ratio between summarization and question generation, as well as the soft-sharing coefficient. Here, we choose the mixing ratios 3:2 between CNN/DailyMail and SQuAD, and 100:1 between Gigaword and SQuAD. Intuitively, these mixing ratios are close to the ratio of their dataset sizes. We set the soft-sharing coefficient γ to 5 × 10^-5 and 1 × 10^-5 for CNN/DailyMail and Gigaword, respectively.
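To make the two hyperparameters concrete, here is a minimal sketch of how a mixing ratio and a soft-sharing coefficient might be applied during training. It assumes that a ratio such as 3:2 means alternating between three summarization mini-batches and two question generation mini-batches, and that the soft-sharing penalty is γ times the L2 distance between corresponding parameters of the two models; both readings, as well as the function names, are assumptions for illustration only.

```python
import itertools
import math

def mixing_schedule(ratio):
    """Yield task names forever, cycling through each task for its share of mini-batches.

    `ratio` maps a task name to its number of consecutive mini-batches per round.
    """
    while True:
        for task, n_batches in ratio.items():
            for _ in range(n_batches):
                yield task

def soft_sharing_penalty(params_a, params_b, gamma):
    """Return gamma * ||params_a - params_b||_2 for two flat parameter lists.

    Assumption: soft sharing is modelled as an L2-distance penalty that is
    added to the cross-entropy loss of the task being optimized.
    """
    squared = sum((a - b) ** 2 for a, b in zip(params_a, params_b))
    return gamma * math.sqrt(squared)

# Two rounds of the 3:2 CNN/DailyMail : SQuAD schedule (10 mini-batches).
schedule = list(itertools.islice(
    mixing_schedule({"summarization": 3, "question_generation": 2}), 10))

# Toy parameters at L2 distance 5.0, with the CNN/DailyMail coefficient.
penalty = soft_sharing_penalty([0.0, 0.0], [3.0, 4.0], gamma=5e-5)
```

In a training loop, each name drawn from the schedule would select which task's mini-batch to optimize next, with the penalty added to that task's loss for the softly shared layers.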

A.3.1 Entailment Generation Examples
See Fig. 6 for interesting output examples by our entailment generation model.

A.3.2 Question Generation Examples
See Fig. 7 for interesting output examples by our question generation model.