Improving Abstraction in Text Summarization

Abstractive text summarization aims to shorten long text documents into a human readable form that contains the most important facts from the original document. However, the level of actual abstraction as measured by novel phrases that do not appear in the source document remains low in existing approaches. We propose two techniques to improve the level of abstraction of generated summaries. First, we decompose the decoder into a contextual network that retrieves relevant parts of the source document, and a pretrained language model that incorporates prior knowledge about language generation. Second, we propose a novelty metric that is optimized directly through policy learning to encourage the generation of novel phrases. Our model achieves results comparable to state-of-the-art models, as determined by ROUGE scores and human evaluations, while achieving a significantly higher level of abstraction as measured by n-gram overlap with the source document.


Introduction
Text summarization concerns the task of compressing a long sequence of text into a more concise form.The two most common approaches to summarization are extractive (Dorr et al., 2003;Nallapati et al., 2017), where the model extracts salient parts of the source document, and abstractive (Paulus et al., 2017;See et al., 2017), where the model not only extracts but also concisely paraphrases the important parts of the document via generation.We focus on developing a summarization model that produces an increased level of abstraction.That is, the model produces concise summaries without only copying long passages from the source document.* Work performed while at Salesforce Research.
A high quality summary is shorter than the original document, conveys only the most important and no extraneous information, and is semantically and syntactically correct.Because it is difficult to gauge the correctness of the summary, evaluation metrics for summarization models use word overlap with the ground-truth summary in the form of ROUGE (Lin, 2004) scores.However, word overlap metrics do not capture the abstractive nature of high quality human-written summaries: the use of paraphrases with words that do not necessarily appear in the source document.
The state-of-the-art abstractive text summarization models have high word overlap performance, however they tend to copy long passages of the source document directly into the summary, thereby producing summaries that are not abstractive (See et al., 2017).
We propose two general extensions to summarization models that improve the level of abstraction of the summary while preserving word overlap with the ground-truth summary.Our first contribution decouples the extraction and generation responsibilities of the decoder by factoring it into a contextual network and a language model.The contextual network has the sole responsibility of extracting and compacting the source document whereas the language model is responsible for the generation of concise paraphrases.Our second contribution is a mixed objective that jointly optimizes the n-gram overlap with the ground-truth summary while encouraging abstraction.This is done by combining maximum likelihood estimation with policy gradient.We reward the policy with the ROUGE metric, which measures word overlap with the ground-truth summary, as well as a novel abstraction reward that encourages the generation of words not in the source document.
We demonstrate the effectiveness of our contributions on a encoder-decoder summarization Article (cnn) to allay possible concerns, boston prosecutors released video friday of the shooting of a police officer last month that resulted in the killing of the gunman.the officer wounded, john moynihan, is white.angelo west, the gunman shot to death by officers, was black.after the shooting, community leaders in the predominantly african-american neighborhood of (...) Human-written summary boston police officer john moynihan is released from the hospital.video shows that the man later shot dead by police in boston opened fire first.moynihan was shot in the face during a traffic stop.Generated summary (See et al., 2017) boston prosecutors released video friday of the shooting of a police officer last month.the gunman shot to death by officers , was black .one said the officers were forced to return fire. he was placed in a medically induced coma at a boston hospital.Generated summary (Liu et al., 2018) boston prosecutors released video of the shooting of a police officer last month .the shooting occurred in the wake of the boston marathon bombing.the video shows west sprang out and fired a shot with a pistol at officer's face.
Our summary (ML+RL ROUGE+Novel, with LM) new: boston police release video of shooting of officer , john moynihan.new: angelo west had several prior gun convictions , police say.boston police officer john moynihan, 34, survived with a bullet wound .he was in a medically induced coma at a boston hospital , a police officer says.

model.
Our model obtains state-of-the-art ROUGE-L scores, and ROUGE-1 and ROUGE-2 performance comparable to state-of-the-art methods on the CNN/DailyMail dataset.Moreover, we significantly outperform all previous abstractive approaches in our abstraction metrics.Table 1 shows a comparison of summaries generated by our model and previous abstractive models, showing less copying and more abstraction in our model.

Base Model and Training Objective
The base model follows the encoder-decoder architecture with temporal attention and intraattention proposed by Paulus et al. (2017).Let E ∈ R n×d emb denote the matrix of d emb dimensional word embeddings of the n words in the source document.The encoding of the source document h enc is computed via a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) whose output has dimension d hid .
The decoder uses temporal attention over the encoded sequence that penalizes input tokens that previously had high attention scores.Let h dec t denote the decoder state at time t.The temporal at-tention context at time t, c tmp t , is computed as where we set q tmp ti to exp(s tmp ti ) for t = 1.The decoder also attends to its previous states via intra-attention over the decoded sequence.The intra-attention context at time t, c int t , is computed as The decoder generates tokens by interpolating between selecting words from the source document via a pointer network as well as selecting words from a fixed output vocabulary.Let z t denote the ground-truth label as to whether the tth output word should be generated by the selecting from the output vocabulary as opposed to from the source document.We compute p(z t ), the probability that the decoder generates from the output vocabulary, as The probability of selecting the word y t from a fixed vocabulary at time step t is defined as We set p cp (y t ), the probability of copying the word y t from the source document, to the temporal attention distribution α tmp t .The joint probability of using the generator and generating the word y t at time step t, p(z t , y t ), is then the likelihood of which is The objective function combines maximum likelihood estimation with policy learning.Let m denote the length of the ground-truth summary, The maximum likelihood loss L ml is computed as Policy learning uses ROUGE-L as its reward function and a self-critical baseline using the greedy decoding policy (Rennie et al., 2016).Let y sam denote the summary obtained by sampling from the current policy p, y gre and z gre the summary and generator choice obtained by greedily choosing from p(z t , y t ), R(y) the ROUGE-L score of the summary y, and Θ the model parameters.The policy learning loss is where we use greedy predictions by the model according to eq. ( 13) as a baseline for variance reduction.The policy gradient, as per Schulman et al. (2015), is The final loss is a mixture between the maximum likelihood loss and the policy learning loss, weighted by a hyperparameter γ.

Language Model Fusion
The decoder is an essential component of the base model.Given the source document and the previously generated summary tokens, the decoder both extracts relevant parts of the source document through the pointer network as well as composes paraphrases from the fixed vocabulary.We decouple these two responsibilities by augmenting the decoder with an external language model.The language model assumes responsibility of generating from the fixed vocabulary, and allows the decoder to focus on attention and extraction.This decomposition has the added benefit of easily incorporating external knowledge about fluency or domain specific styles via pre-training the language model on a large scale text corpora.
The architecture of our language model is based on Merity et al. (2018).We use a 3-layer unidirectional LSTM with weight-dropped LSTM units.
Let e t denote the embedding of the word generated during time step t.The hidden state of the language model at the l-th layer is At each time step t, we combine the hidden state of the last language model LSTM layer, h lm 3,t , with r t defined in eq. ( 8) in a fashion similar to Sriram et al. (2017).Let denote element-wise multiplication.We use a gating function whose output g t filters the content of the language model hidden state.
We then replace the output distribution of the language model p gen (y t ) in eq. 10 with

Abstractive Reward
In order to produce an abstractive summary, the model cannot exclusively copy from the source document.In particular, the model needs to parse large chunks of the source document and create concise summaries using phrases not in the source document.To encourage this behavior, we propose a novelty metric that promotes the generation of novel words.We define a novel phrase in the summary as one that is not in the source document.Let ng (x, n) denote the function that computes the set of unique n-grams in a document x, x gen the generated summary, x src the source document, and s the number of words in s.The unnormalized novelty metric N is defined as the fraction of unique n-grams in the summary that are novel.
To prevent the model for receiving high novelty rewards by outputting very short summaries, we normalize the metric by the length ratio of the generated and ground-truth summaries.Let x gt denote the ground-truth summary.We define the novelty metric as We incorporate the novelty metric as a reward into the policy gradient objective in eq. ( 15), alongside the original ROUGE-L metric.In doing so, we encourage the model to generate summaries that both overlap with human written ground-truth summaries as well as incorporate novel words not in the source document: where λ rou and λ nov are hyperparameters that control the weighting of each reward.

Datasets
We train our model on the CNN/Daily Mail dataset (Hermann et al., 2015;Nallapati et al., 2016).Previous works on abstractive summarization either use an anonymized version of this dataset or the original article and summary texts.Due to these different formats, it is difficult to compare the overall ROUGE scores and performance between each version.In order to compare against previous results, we train and evaluate on both versions of this dataset.For the anonymized version, we follow the pre-processing steps described in Nallapati et al. (2016), and the pre-processing steps of See et al. (2017) for the the full-text version.
We use named entities and the source document to supervise the model regarding when to use the pointer and when to use the generator (e.g.z t in eq. ( 13).Namely, during training, we teach the model to point from the source document if the word in the ground-truth summary is a named entity, an out-of-vocabulary word, or a numerical value that is in the source document.We obtain the list of named entities from Hermann et al. (2015).

Language Models
For each dataset version, we train a language model consisting of a 400-dimensional word embedding layer and a 3-layer LSTM with each layer having a hidden size of 800 dimensions, except the last input layer which has an output size of 400.The final decoding layer shares weights with the embedding layer (Inan et al., 2017;Press and Wolf, 2016).We also use DropConnect (Wan et al., 2013) in the hidden-to-hidden connections, as well as the non-monotonically triggered asynchronous gradient descent optimizer from Merity et al. (2018).
We train this language model on the CNN/Daily Mail ground-truth summaries only, following the same training, validation, and test splits as our main experiments.

Training details
The two LSTMs of our bidirectional encoder are 200-dimensional, and out decoder LSTM is 400dimensional.We restrict the input vocabulary for the embedding matrix to 150,000 tokens, and the output decoding layer to 50,000 tokens.We limit the size of input articles to the first 400 tokens, and the summaries to 100 tokens.We use scheduled sampling (Bengio et al., 2015) with a probability of 0.25 when calculating the maximum-likelihood training loss.We also set n = 3 when computing our novelty reward R nov (x gen , n).For our final training loss using reinforcement learning, we set γ = 0.9984, λ rou = 0.9, and λ nov = 0.1.Finally, we use the trigram repetition avoidance heuristic defined by Paulus et al. (2017) during beam search decoding to ensure that the model does not output twice the same trigram in a given summary, reducing the amount of repetitions.

Novelty baseline
We also create a novelty baseline by taking the outputs of our base model, without RL training and without the language model, and inserting random words not present in the article after each summary token with a probability r = 0.0005.This baseline will intuitively have a higher percentage of novel n-grams than our base model outputs while being very similar to these original outputs, hence keeping the ROUGE score difference relatively small.

Quantitative analysis
We obtain a validation and test perplexity of 65.80 and 66.61 respectively on the anonymized dataset, and 81.13 and 82.98 on the full-text dataset with the language models described in Section 3.2.
The ROUGE scores and novelty scores of our final summarization model on both versions of the CNN/Daily Mail dataset are shown in Table 2.We report the ROUGE-1, ROUGE-2, and ROUGE-L F-scores as well as the percentage of novel ngrams, marked NN-n, in the generated summaries, with n from 1 to 4. Results are omitted in cases where they have not been made available by previous authors.We also include the novel n-gram scores for the ground-truth summaries as a comparison to indicate the level of abstraction of human written summaries.Even though our model outputs significantly fewer novel n-grams than human written summaries, it has a much higher percentage of novel n-grams than all the previous abstractive approaches.It also achieves state-of-the-art ROUGE-L performance on both dataset versions, and obtains ROUGE-1 and ROUGE-2 scores close to state-of-the-art results.

Ablation study
In order to evaluate the relative impact of each of our individual contributions, we run ablation studies comparing our model ablations against each other and against the novelty baseline.The results of these different models on the validation set of the anonymized CNN/Daily Mail dataset are shown in Table 3. Results show that our base model trained with the maximum-likelihood loss only and using the language model in the decoder (ML, with LM) has higher ROUGE scores, novel unigrams, and novel bigrams scores than our base model without the language model (ML).ML with LM also beats the novelty baseline for these metrics.When training these models with reinforcement learning using the ROUGE reward (ML+RL ROUGE and ML+RL ROUGE with LM), the model with language model obtains higher ROUGE-1 and ROUGE-2 scores.However, it also loses its novel unigrams and bigrams advantage.Finally, using the mixed ROUGE and novelty rewards (ML+RL ROUGE+Novel) produces both higher ROUGE scores and more novel unigrams with the language model than without it.This indicates that the combination of the language model in the decoder and the novelty reward during training makes our model produce more novel unigrams while maintaining high ROUGE scores.

ROUGE vs novelty trade-off
In order to understand the correlation between ROUGE and novel n-gram scores across different architectures, and to find the model type that gives the best trade-off between each of these metrics, we plot the ROUGE-1 and novel unigram scores for the five best iterations of each model type on the anonymized dataset, as well as the ROUGE-2 and novel bigram scores on a separate plot.We also include the novelty baseline described in Section 4.2 for values of r between 0.005 and 0.035.For each model type, we indicate the Pareto frontier by a line plot (Ben-Tal, 1980), illustrating which models of a given type give the best combination of ROUGE and novelty scores.These plots are shown in Figure 2.
These plots show that there exist an inverse correlation between ROUGE and novelty scores in all model types, illustrating the challenge of choosing a model that performs well in both.Given that, our final model (ML+RL ROUGE+Novel, with LM) provides the best trade-off of ROUGE-1 scores compared to novel unigrams, indicated by the higher Pareto frontier in the first plot.Similarly, our final model gives one of the best trade-offs of ROUGE-2 scores to novel bigrams, even though the same model without LM produces more novel   bigrams with a lower ROUGE-2 score.

Qualitative evaluation
In order to ensure the quality of our model outputs, we ask 5 human evaluators to rate 100 randomly selected full-text test summaries, giving them two scores from 1 to 10 respectively for readability and relevance given the original article.We also include the full-text test outputs from See et al. (2017) and Liu et al. (2018) for comparison.Evaluators are shown different summaries corresponding to the same article side by side without being told which models have generated them.The mean score and confidence interval at 95% for each model and each evaluation criterion are reported in Table 4.These results show that our model matches the relevance score of See et al. (2017) and Liu et al. (2018), but is slightly inferior to them in terms of readability.

Related work
Text summarization.Existing summarization approaches are usually either extractive or abstrac-tive.In extractive summarization, the model selects passages from the input document and combines them to form a shorter summary, sometimes with a post-processing step to ensure final coherence of the output (Neto et al., 2002;Dorr et al., 2003;Filippova and Altun, 2013;Colmenares et al., 2015;Nallapati et al., 2017).While extractive models are usually robust and produce coherent summaries, they cannot create concise summaries that paraphrase the source document using new phrases.
Abstractive summarization allows the model to paraphrase the source document and create concise summaries with phrases not in the source document.The state-of-the-art abstractive summarization models are based on sequence-tosequence models with attention (Bahdanau et al., 2015).Extensions to this model include a selfattention mechanism (Paulus et al., 2017) and an article coverage vector (See et al., 2017) to prevent repeated phrases in the output summary.Different training procedures have also been used improve the ROUGE score (Paulus et al., 2017) or textual
entailment (Pasunuru and Bansal, 2018) with reinforcement learning; as well as generative adversarial networks to generate more natural summaries (Liu et al., 2018).
Several datasets have been used to train and evaluate summarization models.The Gigaword (Graff and Cieri, 2003) and some DUC datasets (Over et al., 2007) have been used for headline generation models (Rush et al., 2015;Nallapati et al., 2016), where the generated summary is shorter than 75 characters.However, generating longer summaries is a more challenging task, especially for abstractive models.Nallapati et al. (2016) have proposed using the CNN/Daily Mail dataset (Hermann et al., 2015) to train models for generating longer, multi-sentence summaries up to 100 words.The New York Times dataset (Sandhaus, 2008) has also been used as a benchmark for the generation of long summaries (Durrett et al., 2016;Paulus et al., 2017).
Training strategies for sequential models.The common approach to training models for sequence generation is maximum likelihood estimation with teacher forcing.At each time step, the model is given the previous ground-truth output and predicts the current output.The sequence objective is the accumulation of cross entropy losses from each time step.
Despite its popularity, this approach for sequence generation is suboptimal due to exposure bias (Huszar, 2015) and loss-evaluation mismatch (Wiseman and Rush, 2016).Goyal et al. (2016) propose one way to reduce exposure bias by explicitly forcing the hidden representations of the model to be similar during training and inference.Bengio et al. (2015) and Wiseman and Rush (2016) propose an alternate method that exposes the network to the test dynamics during training.Reinforcement learning methods (Sutton and Barto, 1998), such as policy learning (Sutton et al., 1999), mitigate the mismatch between the optimization objective and the evaluation metrics by directly optimizing evaluation metrics.This approach has led to consistent improvements in domains such as image captioning (Zhang et al., 2017) and abstractive text summarization (Paulus et al., 2017).
A recent approach to training sequential models utilizes generative adversarial networks to improving the human perceived quality of generated outputs (Fedus et al., 2018;Guimaraes et al., 2017;Liu et al., 2018).Such models use an additional discriminator network that distinguishes between natural and generated output to guide the generative model towards outputs akin to human-written text.

Conclusions
We introduced a new abstractive summarization model which uses an external language model in the decoder, as well as a new reinforcement learning reward to encourage summary abstraction.Experiments on the CNN/Daily Mail dataset show that our model generates summaries that are much more abstractive that previous approaches, while maintaining high ROUGE scores close to or above the state of the art.Future work could be done on closing the gap to match human levels of abstraction, which is still very far ahead from our model in terms of novel n-grams.Including mechanisms to promote paraphrase generation in the summary generator could be an interesting direction.

Figure 1 :
Figure1: The network architecture with the decoder factorized into separate contextual and language models.The reference vector, composed of context vectors c tmp t , c int t , and the hidden state of the contextual model h dec t , is fused with the hidden state of the language model and then used to compute the distribution over the output vocabulary.

Figure 2 :
Figure 2: ROUGE and novel n-grams results on the anonymized validation set for different runs of each model type.Lines indicates the Pareto frontier for each model type.

Table 1 :
Summaries generated by different models for the same CNN/Daily Mail article.The highlighted spans indicate phrases of 3 tokens or more that are copied word-by-word from the original article.

Table 2 :
Comparison of ROUGE (R-) and novel n-gram (NN-) test results for our model and other abstractive summarization models on the CNN/Daily Mail dataset.

Table 3 :
Ablation study on the validation set of the anonymized CNN/Daily Mail dataset.