A Unified Model for Extractive and Abstractive Summarization using Inconsistency Loss

We propose a unified model combining the strengths of extractive and abstractive summarization. On the one hand, a simple extractive model can obtain sentence-level attention with high ROUGE scores, but its output is less readable. On the other hand, a more complicated abstractive model can obtain word-level dynamic attention to generate a more readable paragraph. In our model, sentence-level attention is used to modulate the word-level attention such that words in less attended sentences are less likely to be generated. Moreover, a novel inconsistency loss function is introduced to penalize the inconsistency between the two levels of attention. By training our model end-to-end with the inconsistency loss and the original losses of the extractive and abstractive models, we achieve state-of-the-art ROUGE scores while producing the most informative and readable summaries on the CNN/Daily Mail dataset in a solid human evaluation.


Introduction
Text summarization is the task of automatically condensing a piece of text to a shorter version while maintaining the important points. The ability to condense text information can aid many applications such as creating news digests, presenting search results, and generating reports. There are mainly two types of approaches: extractive and abstractive. Extractive approaches assemble summaries directly from the source text typically selecting one whole sentence at a time. In contrast, abstractive approaches can generate novel words and phrases not copied from the source text.
Original Article: McDonald's says...... The company says it expects the new 'Artisan Grilled Chicken' to be in its more than 14,300 U.S. stores by the end of next week, in products including a new sandwich, as well as existing sandwiches, wraps and salads. It says the biggest change is the removal of sodium phosphates, which it said was used to keep the chicken moist, in favor of vegetable starch. The new recipe also does not use maltodextrin, which McDonald's said is generally used as a sugar to increase browning or as a carrier for seasoning. Jessica Foust, director of culinary innovation at McDonald's, said the changes were made because customers said they want 'simple, clean ingredients' they are familiar with...... And Panera Bread has said it plans to purge artificial colors, flavors and preservatives from its food by 2016......
Extractive Approach: The company says it expects the new 'Artisan Grilled Chicken' to be in its more than 14,300 U.S. stores by the end of next week, in products including a new sandwich, as well as existing sandwiches, wraps and salads. It says the biggest change is the removal of sodium phosphates, which it said was used to keep the chicken moist, in favor of vegetable starch. The new recipe also does not use maltodextrin, which McDonald's said is generally used as a sugar to increase browning or as a carrier for seasoning.
Abstractive Approach: McDonald's says it expects the new 'Artisan Grilled Chicken' to be in its more than 14,300 U.S. stores by the end of next week. The company says the changes were made because customers said they want 'simple, clean ingredients' they are familiar with. McDonald's said it plans to purge artificial colors, flavors and preservatives from its food by 2016.
Unified Approach: McDonald's says it expects the new 'Artisan Grilled Chicken' to be in its more than 14,300 U.S. stores by the end of next week, in products including a new sandwich, as well as existing sandwiches, wraps and salads. It says the biggest change is the removal of sodium phosphates. The new recipe also does not use maltodextrin, which McDonald's said is generally used as a sugar to increase browning or as a carrier for seasoning.
Figure 1: Comparison of extractive, abstractive, and our unified summaries on a news article. The extractive model picks the most important sentences, but they are incoherent or not concise (see blue bold font). The abstractive summary is readable and concise but still loses or mistakes some facts (see red italics font). The final summary, rewritten from fragments (see underline font), has the advantages of both the extractive (importance) and abstractive (coherence, see green bold font) approaches.
Hence, abstractive summaries can be more coherent and concise than extractive summaries.
Extractive approaches are typically simpler. They output the probability of each sentence to be selected into the summary. Many earlier works on summarization (Cheng and Lapata, 2016; Nallapati et al., 2016a, 2017; Narayan et al., 2017; Yasunaga et al., 2017) focus on extractive summarization. Abstractive approaches (Nallapati et al., 2016b; See et al., 2017; Paulus et al., 2017; Fan et al., 2017) typically involve sophisticated mechanisms in order to paraphrase, generate words not seen in the source text, or even incorporate external knowledge. Neural networks (Nallapati et al., 2017; See et al., 2017) based on the attentional encoder-decoder model (Bahdanau et al., 2014) were able to generate abstractive summaries with high ROUGE scores but suffer from inaccurately reproducing factual details and an inability to deal with out-of-vocabulary (OOV) words. Recently, See et al. (2017) proposed a pointer-generator model which can copy words from the source text as well as generate unseen words. Despite recent progress in abstractive summarization, extractive approaches (Nallapati et al., 2017; Yasunaga et al., 2017) and the lead-3 baseline (i.e., selecting the first 3 sentences) still achieve strong performance in ROUGE scores.
We propose to explicitly take advantage of the strengths of state-of-the-art extractive and abstractive summarization and introduce the following unified model. Firstly, we treat the probability output of each sentence from the extractive model (Nallapati et al., 2017) as sentence-level attention. Then, we modulate the word-level dynamic attention from the abstractive model (See et al., 2017) with sentence-level attention such that words in less attended sentences are less likely to be generated. In this way, extractive summarization mostly benefits abstractive summarization by mitigating spurious word-level attention. Secondly, we introduce a novel inconsistency loss function to encourage consistency between the two levels of attention. The loss function can be computed without additional human annotation and is shown to make our unified model mutually beneficial to both extractive and abstractive summarization. On the CNN/Daily Mail dataset, our unified model achieves state-of-the-art ROUGE scores and outperforms a strong extractive baseline (i.e., lead-3). Finally, to ensure the quality of our unified model, we conduct a solid human evaluation and confirm that our method significantly outperforms recent state-of-the-art methods in informativity and readability.
To summarize, our contributions are twofold:
• We propose a unified model combining sentence-level and word-level attentions to take advantage of both extractive and abstractive summarization approaches.
• We propose a novel inconsistency loss function to ensure our unified model to be mutually beneficial to both extractive and abstractive summarization. The unified model with inconsistency loss achieves the best ROUGE scores on the CNN/Daily Mail dataset and outperforms recent state-of-the-art methods in informativity and readability on human evaluation.

Related Work
Text summarization has been widely studied in recent years. We first introduce the related works of neural-network-based extractive and abstractive summarization. Finally, we introduce a few related works with hierarchical attention mechanism.
Extractive summarization. Kågebäck et al. (2014) and Yin and Pei (2015) use neural networks to map sentences into vectors and select sentences based on those vectors. Cheng (...)
Figure 2: Our unified model combines the word-level and sentence-level attentions. Inconsistency occurs when word attention is high but sentence attention is low (see red arrow).
(...) (Vinyals et al., 2015) into their models to deal with out-of-vocabulary (OOV) words. Chen et al. (2016) and See et al. (2017) restrain their models from attending to the same word to decrease repeated phrases in the generated summary. Paulus et al. (2017) use policy gradient on summarization and point out that high ROUGE scores might still lead to low human evaluation scores. Fan et al. (2017) apply a convolutional sequence-to-sequence model and design several new tasks for summarization. (...) achieve a high readability score on human evaluation using generative adversarial networks.
Hierarchical attention. The attention mechanism was first proposed by Bahdanau et al. (2014). Yang et al. (2016) proposed a hierarchical attention mechanism for document classification. We adopt the method of combining sentence-level and word-level attention in Nallapati et al. (2016b). However, their sentence attention is dynamic, which means it is different for each generated word, whereas our sentence attention is fixed for all generated words. Inspired by the high performance of extractive summarization, we propose to use fixed sentence attention.
Our model combines the state-of-the-art extractive model (Nallapati et al., 2017) and abstractive model (See et al., 2017) by combining sentence-level attention from the former and word-level attention from the latter. Furthermore, we design an inconsistency loss to enhance the cooperation between the extractive and abstractive models.

Our Unified Model
We propose a unified model to combine the strengths of both a state-of-the-art extractor (Nallapati et al., 2017) and abstracter (See et al., 2017). Before going into details of our model, we first define the tasks of the extractor and abstracter.
Problem definition. The input of both extractor and abstracter is a sequence of words $w = [w_1, w_2, \dots, w_m, \dots]$, where $m$ is the word index. The sequence of words also forms a sequence of sentences $s = [s_1, s_2, \dots, s_n, \dots]$, where $n$ is the sentence index. The $m$-th word is mapped into the $n(m)$-th sentence, where $n(\cdot)$ is the mapping function. The output of the extractor is the sentence-level attention $\beta = [\beta_1, \beta_2, \dots, \beta_n, \dots]$, where $\beta_n$ is the probability of the $n$-th sentence being extracted into the summary. On the other hand, our attention-based abstracter computes word-level attention $\alpha^t = [\alpha^t_1, \alpha^t_2, \dots, \alpha^t_m, \dots]$ dynamically while generating the $t$-th word in the summary. The output of the abstracter is the summary text $y = [y_1, y_2, \dots, y_t, \dots]$, where $y_t$ is the $t$-th word in the summary.
In the following, we introduce the mechanism to combine sentence-level and word-level attentions in Sec. 3.1. Next, we define the novel inconsistency loss that ensures extractor and abstracter to be mutually beneficial in Sec. 3.2. We also give the details of our extractor in Sec. 3.3 and our abstracter in Sec. 3.4. Finally, our training procedure is described in Sec. 3.5.

Combining Attentions
Pieces of evidence (e.g., Vaswani et al. (2017)) show that the attention mechanism is very important for NLP tasks. Hence, we propose to explicitly combine the sentence-level $\beta_n$ and word-level $\alpha^t_m$ attentions by simple scalar multiplication and renormalization. The updated word attention $\hat{\alpha}^t_m$ is
$$\hat{\alpha}^t_m = \frac{\alpha^t_m \times \beta_{n(m)}}{\sum_{m} \alpha^t_m \times \beta_{n(m)}}. \tag{1}$$
The multiplication ensures that the updated word attention $\hat{\alpha}^t_m$ can be high only when both the word-level $\alpha^t_m$ and sentence-level $\beta_n$ attentions are high. Since the sentence-level attention $\beta_n$ from the extractor already achieves high ROUGE scores, $\beta_n$ intuitively modulates the word-level attention $\alpha^t_m$ to mitigate spurious word-level attention such that words in less attended sentences are less likely to be generated (see Fig. 2). As highlighted in Sec. 3.4, the word-level attention $\hat{\alpha}^t_m$ significantly affects the decoding process of the abstracter. Hence, an updated word-level attention is our key to improving abstractive summarization.
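The multiplication and renormalization described above can be illustrated with a minimal sketch in plain Python. The function name `combine_attention`, the list `sent_of_word` (standing in for the mapping n(m)), the zero-sum fallback, and the toy numbers are assumptions made for illustration; they are not taken from the authors' implementation.

```python
def combine_attention(alpha_t, beta, sent_of_word):
    """Rescale word attention by sentence attention and renormalize.

    alpha_t      -- word-level attention for one decoder step (sums to 1)
    beta         -- sentence-level attention, one probability per sentence
    sent_of_word -- sent_of_word[m] is the sentence index n(m) of word m
    """
    scaled = [a * beta[sent_of_word[m]] for m, a in enumerate(alpha_t)]
    total = sum(scaled)
    if total == 0.0:
        # Degenerate case (our assumption): fall back to the original attention.
        return list(alpha_t)
    return [s / total for s in scaled]


# Toy example: 4 words in 2 sentences; the second sentence is weakly attended.
alpha_t = [0.1, 0.2, 0.3, 0.4]
beta = [0.9, 0.1]
sent_of_word = [0, 0, 1, 1]
alpha_hat = combine_attention(alpha_t, beta, sent_of_word)
```

In this toy case the last word has the highest raw attention, but because its sentence has low attention, its updated attention falls below that of the words in the first sentence.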

Inconsistency Loss
Instead of only leveraging the complementary nature of sentence-level and word-level attentions, we would like to encourage these two levels of attention to be mostly consistent with each other during training, as an intrinsic learning target for free (i.e., without additional human annotation). Explicitly, we would like the sentence-level attention to be high when the word-level attention is high. Hence, we design the following inconsistency loss,
$$L_{inc} = -\frac{1}{T} \sum_{t=1}^{T} \log\left(\frac{1}{|\mathcal{K}|} \sum_{m \in \mathcal{K}} \alpha^t_m \times \beta_{n(m)}\right), \tag{2}$$
where $\mathcal{K}$ is the set of top $K$ attended words and $T$ is the number of words in the summary. This implicitly encourages the distribution of the word-level attentions to be sharp and the sentence-level attention to be high. To avoid the degenerate solution in which the word attention distribution is one-hot and the sentence attention is high, we include the original loss functions for training the extractor ($L_{ext}$ in Sec. 3.3) and abstracter ($L_{abs}$ and $L_{cov}$ in Sec. 3.4). Note that Eq. 1 is the only part where the extractor interacts with the abstracter. Our proposed inconsistency loss makes our end-to-end trained unified model mutually beneficial to both the extractor and abstracter.
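A small sketch may help make the loss concrete. It assumes the loss takes the form of a negative log of the mean product of word attention and sentence attention over the top K attended words, averaged over the T decoder steps. The function name, the `eps` term for numerical safety, and the toy inputs are our assumptions.

```python
import math


def inconsistency_loss(alphas, beta, sent_of_word, k=3, eps=1e-12):
    """alphas: list of T word-attention vectors, one per decoder step."""
    total = 0.0
    for alpha_t in alphas:
        # Indices of the k most attended words at this step.
        top = sorted(range(len(alpha_t)), key=lambda m: alpha_t[m], reverse=True)[:k]
        mean_prod = sum(alpha_t[m] * beta[sent_of_word[m]] for m in top) / len(top)
        total += -math.log(mean_prod + eps)  # eps is an assumption, avoids log(0)
    return total / len(alphas)


# One decoder step attending mostly to words of sentence 0.
alphas = [[0.7, 0.2, 0.05, 0.05]]
sent_of_word = [0, 0, 1, 1]
loss_consistent = inconsistency_loss(alphas, [0.9, 0.1], sent_of_word, k=2)
loss_inconsistent = inconsistency_loss(alphas, [0.1, 0.9], sent_of_word, k=2)
```

When the highly attended words sit in a highly attended sentence, the loss is small; swapping the sentence attentions makes the same word attention inconsistent and the loss grows.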

Extractor
Our extractor is inspired by Nallapati et al. (2017). The main difference is that our extractor does not need to obtain the final summary. It mainly needs to obtain a short list of important sentences with high recall to further facilitate the abstracter. We first introduce the network architecture and the loss function. Finally, we define our ground-truth important sentences to encourage high recall.
Architecture. The model consists of a hierarchical bidirectional GRU which extracts sentence representations and a classification layer for predicting the sentence-level attention $\beta_n$ for each sentence (see Fig. 3).
Extractor loss. The following sigmoid cross entropy loss is used,
$$L_{ext} = -\frac{1}{N} \sum_{n=1}^{N} \left( g_n \log \beta_n + (1 - g_n) \log(1 - \beta_n) \right), \tag{3}$$
where $g_n \in \{0, 1\}$ is the ground-truth label for the $n$-th sentence and $N$ is the number of sentences. When $g_n = 1$, it indicates that the $n$-th sentence should be attended to facilitate abstractive summarization.
Ground-truth label. The goal of our extractor is to extract sentences with high informativity, which means the extracted sentences should contain as much as possible of the information needed to generate an abstractive summary. To obtain the ground-truth labels $g = \{g_n\}_n$, first, we measure the informativity of each sentence $s_n$ in the article by computing the ROUGE-L recall score (Lin, 2004) between the sentence $s_n$ and the reference abstractive summary $\hat{y} = \{\hat{y}_t\}_t$. Second, we sort the sentences by their informativity and select sentences in order from high to low informativity. We add one sentence at a time if the new sentence can increase the informativity of all the selected sentences. Finally, we obtain the ground-truth labels $g$ and train our extractor by minimizing Eq. 3. Note that our method is different from Nallapati et al. (2017), who aim to extract a final summary for an article and therefore use the ROUGE F-1 score to select ground-truth sentences, while we focus on high informativity and hence use the ROUGE recall score to obtain as much information as possible with respect to the reference summary $\hat{y}$.
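The greedy labelling procedure above can be sketched as follows. This is a simplified illustration, not the authors' code: informativity is approximated by a longest-common-subsequence recall against the reference treated as one token sequence, selected sentences are concatenated in article order, and all function names are hypothetical.

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def recall(tokens, reference):
    """LCS-based recall, a simplified stand-in for ROUGE-L recall."""
    return lcs_len(tokens, reference) / len(reference) if reference else 0.0


def ground_truth_labels(sentences, reference):
    """sentences: list of token lists; reference: token list. Returns 0/1 labels."""
    # Visit sentences from most to least informative.
    order = sorted(range(len(sentences)),
                   key=lambda n: recall(sentences[n], reference), reverse=True)
    chosen, best = [], 0.0
    for n in order:
        trial = sorted(chosen + [n])  # keep article order when concatenating
        tokens = [tok for i in trial for tok in sentences[i]]
        score = recall(tokens, reference)
        if score > best:  # keep the sentence only if recall improves
            chosen, best = trial, score
    return [1 if n in chosen else 0 for n in range(len(sentences))]


sentences = [["the", "cat", "sat"], ["dogs", "bark", "loudly"], ["on", "the", "mat"]]
reference = ["the", "cat", "sat", "on", "the", "mat"]
labels = ground_truth_labels(sentences, reference)
```

Here the first and third sentences each add recall and are labelled 1, while the second sentence adds nothing and is labelled 0.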

Abstracter
The second part of our model is an abstracter that reads the article and then generates a summary word-by-word. We use the pointer-generator network proposed by See et al. (2017) and combine it with the extractor by combining sentence-level and word-level attentions (Sec. 3.1).
Figure 4: In decoder step $t$, our updated word attention $\hat{\alpha}^t$ is used to generate the context vector $h^*(\hat{\alpha}^t)$. Hence, it updates the final word distribution $P^{final}$.
The pointer-generator network (See et al., 2017) is a specially designed sequence-to-sequence attentional model that can generate the summary by copying words in the article or generating words from a fixed vocabulary at the same time. The model contains a bidirectional LSTM which serves as an encoder to encode the input words $w$ and a unidirectional LSTM which serves as a decoder to generate the summary $y$. For details of the network architecture, please refer to See et al. (2017). In the following, we describe how the updated word attention $\hat{\alpha}^t$ affects the decoding process.
Notations. We first define some notations. $h^e_m$ is the encoder hidden state for the $m$-th word. $h^d_t$ is the decoder hidden state in step $t$. $h^*(\hat{\alpha}^t) = \sum_{m}^{M} \hat{\alpha}^t_m \times h^e_m$ is the context vector, which is a function of the updated word attention $\hat{\alpha}^t$. $P^{vocab}(h^*(\hat{\alpha}^t))$ is the probability distribution over the fixed vocabulary before applying the copying mechanism, and $P^{vocab}_w(h^*(\hat{\alpha}^t))$ is the probability of word $w$ being decoded. $p^{gen}(h^*(\hat{\alpha}^t)) \in [0, 1]$ is the generating probability (see Eq. 8 in See et al. (2017)) and $1 - p^{gen}(h^*(\hat{\alpha}^t))$ is the copying probability.
Final word distribution. $P^{final}_w(\hat{\alpha}^t)$ is the final probability of word $w$ being decoded (i.e., $y_t = w$). It is related to the updated word attention $\hat{\alpha}^t$ as follows (see Fig. 4),
$$P^{final}_w(\hat{\alpha}^t) = p^{gen}(h^*(\hat{\alpha}^t)) \, P^{vocab}_w(h^*(\hat{\alpha}^t)) + \left(1 - p^{gen}(h^*(\hat{\alpha}^t))\right) \sum_{m : w_m = w} \hat{\alpha}^t_m. \tag{5}$$
Note that $P^{final} = \{P^{final}_w\}_w$ is the probability distribution over the fixed vocabulary and out-of-vocabulary (OOV) words. Hence, OOV words can be decoded. Most importantly, it is clear from Eq. 5 that $P^{final}_w(\hat{\alpha}^t)$ is a function of the updated word attention $\hat{\alpha}^t$. Finally, we train the abstracter to minimize the negative log-likelihood:
$$L_{abs} = -\frac{1}{T} \sum_{t=1}^{T} \log P^{final}_{\hat{y}_t}(\hat{\alpha}^t), \tag{6}$$
where $\hat{y}_t$ is the $t$-th token in the reference abstractive summary.
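To illustrate how the updated word attention enters the final word distribution, the sketch below mixes a vocabulary distribution with copy probabilities taken from the attention, in the style of a pointer-generator. It assumes the generating probability and the vocabulary distribution are already computed; the function name, the dictionary representation, and the toy values are assumptions for illustration only.

```python
def final_distribution(p_gen, p_vocab, alpha_hat, source_words):
    """Mix generation and copy distributions for one decoder step.

    p_gen        -- generating probability in [0, 1]
    p_vocab      -- dict mapping word -> probability over the fixed vocabulary
    alpha_hat    -- updated word attention over source positions
    source_words -- source tokens, aligned with alpha_hat
    """
    p_final = {w: p_gen * p for w, p in p_vocab.items()}
    for word, a in zip(source_words, alpha_hat):
        # Copy mass accumulates over every position where the word occurs.
        p_final[word] = p_final.get(word, 0.0) + (1.0 - p_gen) * a
    return p_final


p_vocab = {"the": 0.6, "chicken": 0.4}
source_words = ["mcdonald's", "chicken", "mcdonald's"]  # "mcdonald's" is OOV here
alpha_hat = [0.5, 0.2, 0.3]
p_final = final_distribution(0.75, p_vocab, alpha_hat, source_words)
```

The out-of-vocabulary word receives probability purely through the copy term, which is why lowering its attention (for example through a low sentence-level attention) directly lowers its chance of being decoded.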
Coverage mechanism. We also apply the coverage mechanism (See et al., 2017) to prevent the abstracter from repeatedly attending to the same place. In each decoder step $t$, we calculate the coverage vector $c^t = \sum_{t'=1}^{t-1} \hat{\alpha}^{t'}$, which indicates how much attention has been paid so far to every input word. The coverage vector $c^t$ is used to calculate the word attention $\hat{\alpha}^t$ (see Eq. 11 in See et al. (2017)). Moreover, the coverage loss $L_{cov}$ is calculated to directly penalize the repetition in the updated word attention $\hat{\alpha}^t$:
$$L_{cov} = \frac{1}{T} \sum_{t=1}^{T} \sum_{m=1}^{M} \min(\hat{\alpha}^t_m, c^t_m). \tag{7}$$
The objective function for training the abstracter with the coverage mechanism is the weighted sum of the negative log-likelihood and the coverage loss.
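The coverage vector and coverage loss can be sketched in a few lines. This assumes the loss follows the form used by See et al. (2017), summing the element-wise minimum of the current attention and the coverage accumulated so far, and that it is averaged over decoder steps; the averaging and the function name are our assumptions.

```python
def coverage_loss(alpha_hats):
    """alpha_hats: list of T updated word-attention vectors."""
    coverage = [0.0] * len(alpha_hats[0])  # nothing attended before step 1
    total = 0.0
    for alpha_hat in alpha_hats:
        # Penalize attention that lands where coverage is already high.
        total += sum(min(a, c) for a, c in zip(alpha_hat, coverage))
        # Accumulate this step's attention into the coverage vector.
        coverage = [c + a for c, a in zip(coverage, alpha_hat)]
    return total / len(alpha_hats)


repeated = coverage_loss([[1.0, 0.0], [1.0, 0.0]])  # attends to word 0 twice
spread = coverage_loss([[1.0, 0.0], [0.0, 1.0]])    # attends to each word once
```

Attending to the same word in both steps is penalized, while spreading attention over different words incurs no coverage loss.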

Training Procedure
We first pre-train the extractor by minimizing $L_{ext}$ in Eq. 3 and the abstracter by minimizing $L_{abs}$ and $L_{cov}$ in Eq. 6 and Eq. 7, respectively. When pre-training, the abstracter takes ground-truth extracted sentences (i.e., sentences with $g_n = 1$) as input. To combine the extractor and abstracter, we propose two training settings: (1) two-stages training and (2) end-to-end training.
Two-stages training. In this setting, we view the sentence-level attention $\beta$ from the pre-trained extractor as hard attention. The extractor becomes a classifier that selects sentences with high attention (i.e., $\beta_n >$ threshold). We simply combine the extractor and abstracter by feeding the extracted sentences to the abstracter. Note that we finetune the abstracter since the input text becomes the extractive summary obtained from the extractor.
End-to-end training. For end-to-end training, the sentence-level attention $\beta$ is soft attention and is combined with the word-level attention $\alpha^t$ as described in Sec. 3.1. We train the extractor and abstracter end-to-end by minimizing four loss functions: $L_{ext}$, $L_{abs}$, $L_{cov}$, as well as $L_{inc}$ in Eq. 2. The final loss is as below:
$$L_{e2e} = \lambda_1 L_{ext} + \lambda_2 L_{abs} + \lambda_3 L_{cov} + \lambda_4 L_{inc}, \tag{8}$$
where $\lambda_1, \lambda_2, \lambda_3, \lambda_4$ are hyper-parameters. In our experiment, we give $L_{ext}$ a bigger weight (e.g., $\lambda_1 = 5$) when training end-to-end with $L_{inc}$, since we found that $L_{inc}$ is relatively large such that the extractor tends to ignore $L_{ext}$.
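As a small illustration of the weighted objective, the sketch below combines the four losses with the weights reported for the end-to-end setting. The function name and the loss values in the example are made up for illustration; only the weighting scheme comes from the text.

```python
def end_to_end_loss(l_ext, l_abs, l_cov, l_inc, lambdas=(5.0, 1.0, 1.0, 1.0)):
    """Weighted sum of the extractor, abstracter, coverage and inconsistency losses."""
    lam1, lam2, lam3, lam4 = lambdas
    return lam1 * l_ext + lam2 * l_abs + lam3 * l_cov + lam4 * l_inc


# Hypothetical loss values: the inconsistency loss is much larger than the
# extractor loss, so the extractor term is up-weighted.
total = end_to_end_loss(l_ext=0.5, l_abs=2.0, l_cov=0.25, l_inc=4.0)
unweighted = end_to_end_loss(0.5, 2.0, 0.25, 4.0, lambdas=(1.0, 1.0, 1.0, 1.0))
```

With equal weights the extractor term contributes only a small share of the total in this made-up case, which matches the stated motivation for giving it a larger weight.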

Experiments
We introduce the dataset and implementation details of our method evaluated in our experiments.

Dataset
We evaluate our models on the CNN/Daily Mail dataset (Hermann et al., 2015; Nallapati et al., 2016b; See et al., 2017), which contains news stories from the CNN and Daily Mail websites. Each article in this dataset is paired with one human-written multi-sentence summary. This dataset has two versions: anonymized and non-anonymized. The former contains the news stories with all the named entities replaced by special tokens (e.g., @entity2), while the latter contains the raw text of each news story. We follow See et al. (2017) and obtain the non-anonymized version of this dataset, which has 287,113 training pairs, 13,368 validation pairs and 11,490 test pairs.

Implementation Details
We train our extractor and abstracter with 128-dimension word embeddings and set the vocabulary size to 50k for both source and target text. We follow Nallapati et al. (2017) and See et al. (2017) and set the hidden dimension to 200 and 256 for the extractor and abstracter, respectively. We use the Adagrad optimizer (Duchi et al., 2011) and apply early stopping based on the validation set. In the testing phase, we limit the length of the summary to 120.
Pre-training. We use a learning rate of 0.15 when pre-training the extractor and abstracter. For the extractor, we limit both the maximum number of sentences per article and the maximum number of tokens per sentence to 50 and train the model for 27k iterations with a batch size of 64. The abstracter takes ground-truth extracted sentences (i.e., sentences with $g_n = 1$) as input. We limit the length of the source text to 400 and the length of the summary to 100 and use a batch size of 16. We train the abstracter without the coverage mechanism for 88k iterations and continue training for 1k iterations with the coverage mechanism ($L_{abs} : L_{cov} = 1 : 1$).
Two-stages training. During two-stages training, the abstracter takes as input the extracted sentences with $\beta_n > 0.5$, where $\beta$ is obtained from the pre-trained extractor. We finetune the abstracter for 10k iterations.
End-to-end training. During end-to-end training, we minimize four loss functions (Eq. 8) with $\lambda_1 = 5$ and $\lambda_2 = \lambda_3 = \lambda_4 = 1$. We set $K$ to 3 for computing $L_{inc}$. Due to memory limitations, we reduce the batch size to 8 and thus use a smaller learning rate of 0.01 for stability. The abstracter here reads the whole article. Hence, we increase the maximum length of the source text to 600. We train the model end-to-end for 50k iterations.

Results
Our unified model not only generates an abstractive summary but also extracts the important sentences in an article. Our goal is that both of the two types of outputs can help people to read and understand an article faster. Hence, in this section, we evaluate the results of our extractor in Sec. 5.1 and unified model in Sec. 5.2. Furthermore, in Sec. 5.3, we perform human evaluation and show that our model can provide a better abstractive summary than other baselines.

Results of Extracted Sentences
To evaluate whether our extractor obtains enough information for the abstracter, we use full-length ROUGE recall scores between the extracted sentences and the reference abstractive summary. High ROUGE recall scores can be obtained if the extracted sentences include more words or sequences overlapping with the reference abstractive summary. For each article, we select sentences with sentence probabilities $\beta$ greater than 0.5. We show the results of the ground-truth sentence labels (Sec. 3.3) and our models on the test set of the CNN/Daily Mail dataset in Table 1. Note that the ground-truth extracted sentences cannot get ROUGE recall scores of 100 because the reference summary is abstractive and may contain some words and sequences that are not in the article. Our extractor performs the best when end-to-end trained with inconsistency loss.
Table caption (truncated): (...) In addition, our model trained end-to-end with inconsistency loss exceeds the lead-3 baseline. All our ROUGE scores have a 95% confidence interval with at most ±0.24. '*' indicates the model is trained and evaluated on the anonymized dataset and thus is not strictly comparable with ours.

Results of Abstractive Summarization
We use full-length ROUGE-1, ROUGE-2 and ROUGE-L F-1 scores to evaluate the generated summaries. We compare our models (two-stage and end-to-end) with state-of-the-art abstractive summarization models (Nallapati et al., 2016b; Paulus et al., 2017; See et al., 2017) and a strong lead-3 baseline which directly uses the first three article sentences as the summary. Due to the writing style of news articles, the most important information is often written at the beginning of an article, which makes lead-3 a strong baseline. The results of ROUGE F-1 scores are shown in Table 2. We show that, with the help of the extractor, our unified model can outperform pointer-generator (the third row in Table 2) even with two-stages training (the fifth row in Table 2). After end-to-end training without inconsistency loss, our method already achieves better ROUGE scores because the extractor and abstracter cooperate with each other. Moreover, our model end-to-end trained with inconsistency loss achieves state-of-the-art ROUGE scores and exceeds the lead-3 baseline.
In order to quantify the effect of inconsistency loss, we design a metric, the inconsistency rate $R_{inc}$, to measure the inconsistency for each generated summary. For each decoder step $t$, if the word with maximum attention belongs to a sentence with low attention (i.e., $\beta_{n(\mathrm{argmax}(\alpha^t))} < \mathrm{mean}(\beta)$), we define this step as an inconsistent step $t_{inc}$. The inconsistency rate $R_{inc}$ is then defined as the percentage of the inconsistent steps in the summary,
$$R_{inc} = \frac{\mathrm{Count}(t_{inc})}{T},$$
where $T$ is the length of the summary. The average inconsistency rates on the test set are shown in Table 4. Our inconsistency loss significantly decreases $R_{inc}$ from about 20% to 4%. An example of inconsistency improvement is shown in Fig. 5.
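The inconsistency rate can be computed with a short routine such as the following sketch, which counts the decoder steps whose most attended word lies in a sentence with below-average sentence attention. The function name, the `sent_of_word` list standing in for n(m), and the toy inputs are assumptions for illustration.

```python
def inconsistency_rate(alphas, beta, sent_of_word):
    """Fraction of decoder steps whose top-attended word is in a low-attention sentence."""
    mean_beta = sum(beta) / len(beta)
    inconsistent = 0
    for alpha_t in alphas:
        top_word = max(range(len(alpha_t)), key=lambda m: alpha_t[m])
        if beta[sent_of_word[top_word]] < mean_beta:
            inconsistent += 1
    return inconsistent / len(alphas)


beta = [0.9, 0.1]            # mean is 0.5, so sentence 1 counts as low attention
sent_of_word = [0, 0, 1, 1]
alphas = [
    [0.7, 0.1, 0.1, 0.1],    # top word in sentence 0: consistent
    [0.1, 0.1, 0.7, 0.1],    # top word in sentence 1: inconsistent
    [0.1, 0.6, 0.2, 0.1],    # top word in sentence 0: consistent
    [0.1, 0.1, 0.1, 0.7],    # top word in sentence 1: inconsistent
]
rate = inconsistency_rate(alphas, beta, sent_of_word)
```

Two of the four toy steps attend most strongly to a word in the weakly attended sentence, giving a rate of one half.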
Table 3:
Method                                 informativity  conciseness  readability
DeepRL (Paulus et al., 2017)           3.23           2.97         2.85
pointer-generator (See et al., 2017)   3.18           3.36         3.47
GAN                                    3 (...)
Figure 5: Visualizing the consistency between sentence and word attentions on the original article. We highlight word (bold font) and sentence (underline font) attentions. We compare our methods trained with and without inconsistency loss. Inconsistent fragments (see red bold font) occur when trained without the inconsistency loss.

Human Evaluation
We perform human evaluation on Amazon Mechanical Turk (MTurk) to evaluate the informativity, conciseness and readability of the summaries. We compare our best model (end2end with inconsistency loss) with pointer-generator (See et al., 2017), a generative adversarial network, and a deep reinforcement model (Paulus et al., 2017). For these three models, we use the test set outputs provided by the authors.
We randomly pick 100 examples in the test set. All generated summaries are re-capitalized and de-tokenized. Since Paulus et al. (2017) trained their model on anonymized data, we also recover the anonymized entities and numbers of their outputs.
We show the article and 6 summaries (reference summary, 4 generated summaries and a random summary) to each human evaluator. The random summary is a reference summary randomly picked from other articles and is used as a trap. We give instructions for the three aspects as follows: (1) Informativity: how well does the summary capture the important parts of the article? (2) Conciseness: is the summary clear enough to explain everything without being redundant? (3) Readability: how well-written (fluent and grammatical) is the summary? The user interface of our human evaluation is shown in the supplementary material.
We ask the human evaluator to evaluate each summary by scoring the three aspects on a scale of 1 to 5 (higher is better). We reject all the evaluations that score the informativity of the random summary as 3, 4 or 5. By using this trap mechanism, we can ensure a much better quality of our human evaluation. For each example, we first ask 5 human evaluators to evaluate. However, articles that are too long are always skipped by the evaluators, so it is hard to collect 5 reliable evaluations for them. Hence, we collect at least 3 evaluations for every example. For each summary, we average the scores over different human evaluators.
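The trap-based filtering and averaging described above can be sketched as follows. The data layout (one dictionary per evaluator, keyed by summary name), the key `"random"` for the trap summary, and all function names are hypothetical choices made for this illustration; the scores are invented.

```python
ASPECTS = ("informativity", "conciseness", "readability")


def score(informativity, conciseness, readability):
    """Build one set of 1-5 scores for a summary."""
    return dict(zip(ASPECTS, (informativity, conciseness, readability)))


def aggregate_scores(evaluations, trap_key="random", min_evals=3):
    """Drop evaluations that rate the trap summary as informative, then average."""
    kept = [e for e in evaluations if e[trap_key]["informativity"] < 3]
    if len(kept) < min_evals:
        return None  # not enough reliable evaluations for this example
    return {
        name: {a: sum(e[name][a] for e in kept) / len(kept) for a in ASPECTS}
        for name in kept[0] if name != trap_key
    }


evaluations = [
    {"ours": score(4, 3, 4), "random": score(1, 3, 4)},
    {"ours": score(3, 3, 5), "random": score(2, 2, 4)},
    {"ours": score(5, 3, 3), "random": score(1, 4, 5)},
    {"ours": score(1, 1, 1), "random": score(4, 4, 4)},  # fails the trap, rejected
]
averaged = aggregate_scores(evaluations)
```

The fourth evaluator rates the unrelated random summary as informative, so that evaluation is discarded and the averages are taken over the remaining three.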
The results are shown in Table 3. The reference summaries get the best score on conciseness since the recent abstractive models tend to copy sentences from the input articles. However, our model learns well to select important information and form complete sentences, so we even get slightly better scores on informativity and readability than the reference summaries. We show a typical example of our model compared with other state-of-the-art methods in Fig. 6. More examples (5 using CNN/Daily Mail news articles and 3 using non-news articles as inputs) are provided in the supplementary material.
Original article (truncated): A chameleon balances carefully on a branch, waiting calmly for its prey... except that if you look closely, you will see that this picture is not all that it seems. For the 'creature' poised to pounce is not a colourful species of lizard but something altogether more human. Featuring two carefully painted female models, it is a clever piece of sculpture designed to create an amazing illusion. It is the work of Italian artist Johannes Stoetter. Scroll down for video. Can you see us? Italian artist Johannes Stoetter has painted two naked women to look like a chameleon. The 37-year-old has previously transformed his models into frogs and parrots but this may be his most intricate and impressive piece to date. Stoetter daubed water-based body paint on the naked models to create the multicoloured effect, then intertwined them to form the shape of a chameleon. To complete the deception, the models rested on a bench painted to match their skin and held the green branch in the air beneath them. Stoetter can take weeks to plan one of his pieces and hours to paint it. Speaking about The Chameleon, he said: 'I worked about four days to design the motif bigger and paint it with colours. The body painting took me about six hours with the help of an assistant. I covered the hair with natural clay to make the heads look bald.' Camouflage job: A few finishing touches are applied to the two naked models to complete the transformation. 'There are different difficulties on different levels as in every work, but I think that my passion and love to my work is so big, that I figure out a way to deal with difficulties. My main inspirations are nature, my personal life-philosophy, every-day-life and people themselves.' However, the finished result existed only briefly before the models were able to get up and wash the paint off with just a video and some photographs to record it. (...)
Figure 6: Typical Comparison. Our model attends to the most important information (blue bold font), matching well with the reference summary, while other state-of-the-art methods generate repeated or less important information (red italic font).

Conclusion
We propose a unified model combining the strength of extractive and abstractive summarization. Most importantly, a novel inconsistency loss function is introduced to penalize the inconsistency between two levels of attentions. The inconsistency loss enables extractive and abstractive summarization to be mutually beneficial. By end-to-end training of our model, we achieve the best ROUGE-recall and ROUGE while being the most informative and readable summarization on the CNN/Daily Mail dataset in a solid human evaluation.