Abstractive Text Summarization Using Sequence-to-Sequence RNNs and Beyond

In this work, we cast abstractive text summarization as a sequence-to-sequence problem and apply the framework of Attentional Encoder-Decoder Recurrent Neural Networks to it, outperforming the state-of-the-art model of Rush et al. (2015) on two different corpora. We also move beyond the basic architecture and propose several novel models that address important problems in summarization, including modeling keywords, capturing the hierarchy of sentence-to-word structure, and handling words that are key to a document but rare elsewhere. Our work shows that many of our proposed solutions contribute to further improvement in performance. In addition, we propose a new dataset consisting of multi-sentence summaries, and establish performance benchmarks for further research.


Introduction
Deep learning based sequence-to-sequence models have been successful in many problems such as machine translation (Bahdanau et al., 2014), speech recognition (Bahdanau et al., 2015) and video captioning (Venugopalan et al., 2015). In this work, we focus on the task of abstractive text summarization, which consists of generating a short summary or a headline that captures the salient ideas of a long article. This task can also be naturally thought of as mapping an input sequence of words in a source document to a target sequence of words called the summary. In the framework of sequence-to-sequence models, a very relevant model for our task is the attentional Recurrent Neural Network (RNN) encoder-decoder model proposed in (Bahdanau et al., 2014), which has produced state-of-the-art performance in machine translation (MT).
Despite the similarities, summarization is a very different problem from MT. Unlike in MT, the target (summary) in summarization is typically very short and does not depend very much on the length of the source (document). Additionally, a key challenge in summarization is to optimally compress the original document in a lossy manner such that its key concepts are preserved, whereas in MT the translation is expected to be loss-less. In translation, there is a strong notion of almost one-to-one word-level alignment between source and target, but in summarization this alignment is less obvious. Hence, it remains unclear whether the models that succeeded in machine translation would perform equally well here. In this work, we aim to answer this question. More importantly, motivated by the ways in which summarization differs from MT, we move beyond the basic architecture and propose new models that help solve problems specific to summarization.

Related Work
A vast majority of the past work in summarization has been extractive, which consists of identifying key sentences or passages in the source document and reproducing them as the summary (Neto et al., 2002; Erkan and Radev, 2004; Wong et al., 2008; Filippova and Altun, 2013; Colmenares et al., 2015).
Humans, on the other hand, tend to paraphrase the original story in their own words. As such, human summaries are abstractive in nature and seldom consist of reproduction of original sentences from the document. The task of abstractive summarization has been standardized using the DUC-2003 and DUC-2004 competitions (footnote 1). The data for these tasks consists of news stories from various topics with multiple reference summaries per story generated by humans. The best performing system on the DUC-2004 task, called TOPIARY (Zajic et al., 2004), used a combination of linguistically motivated compression techniques and an unsupervised topic detection algorithm that appends keywords extracted from the article onto the compressed output. Other notable work on abstractive summarization includes traditional phrase-table based machine translation approaches (Banko et al., 2000), compression using weighted tree-transformation rules (Cohn and Lapata, 2008) and quasi-synchronous grammar approaches (Woodsend et al., 2010).
With the emergence of deep learning as a viable alternative for many NLP tasks (Collobert et al., 2011), researchers have started considering this framework as an attractive, fully data-driven approach to abstractive summarization. In (Rush et al., 2015), the authors use convolutional models to encode the source, and a context-sensitive attentional feed-forward neural network to generate the summary, producing state-of-the-art results on the Gigaword and DUC datasets. In another paper that is closely related to our work, (Hu et al., 2015) introduce a large dataset for Chinese short text summarization. They show experimental results on this dataset using an encoder-decoder RNN.
Our work employs the same framework as (Hu et al., 2015), but we make the following contributions: (i) We show that the attentional encoder-decoder architecture outperforms state-of-the-art systems on two different English corpora. (ii) We propose novel models that are motivated by problems specific to summarization and show that they boost performance. (iii) We propose a new dataset for the task of abstractive summarization of a document into multiple sentences and establish benchmarks.

Encoder-Decoder with Attention
This is our baseline model and corresponds to the neural machine translation model used in (Bahdanau et al., 2014). The encoder consists of a bidirectional GRU-RNN (Chung et al., 2014), while the decoder consists of a uni-directional GRU-RNN with the same hidden-state size as that of the encoder, an attention mechanism over the source hidden states, and a soft-max layer over the target vocabulary to generate words. In the interest of space, we refer the reader to the original paper for a detailed treatment of this model.
1 http://duc.nist.gov/

Large Vocabulary Trick
In this variant, we adapted the large vocabulary 'trick' (LVT) described in (Jean et al., 2014) to the summarization problem. In our approach, the decoder vocabulary of each mini-batch is restricted to words in the source documents of that batch. In addition, the most frequent words in the target dictionary are added until the vocabulary reaches a fixed size. The aim of this model is to save computation in the soft-max layer of the decoder, which is a bottleneck owing to the large target vocabulary size. In addition, this technique may also speed up convergence by focusing the modeling effort only on the words that really matter to a given example. This technique is particularly well suited to summarization since a large proportion of the words in the summary come from the source document anyway.
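As a rough illustration of how such a per-batch decoder vocabulary could be assembled, the following is a minimal sketch. The function name `build_lvt_vocab` and its arguments are hypothetical, and the choice to keep every source word even when the source words alone exceed the size limit is an assumption, since the text does not specify that case.

```python
from collections import Counter


def build_lvt_vocab(batch_sources, target_freqs, vocab_size):
    """Build the decoder vocabulary for one mini-batch (illustrative sketch).

    batch_sources: list of tokenized source documents in the batch.
    target_freqs:  Counter of word frequencies in the target dictionary.
    vocab_size:    fixed size the vocabulary is topped up to.
    """
    vocab, seen = [], set()
    # Step 1: every word that occurs in a source document of this batch.
    for doc in batch_sources:
        for word in doc:
            if word not in seen:
                seen.add(word)
                vocab.append(word)
    # Step 2: add the most frequent target words until the size is reached.
    # Assumption: source words are never dropped, even above vocab_size.
    for word, _ in target_freqs.most_common():
        if len(vocab) >= vocab_size:
            break
        if word not in seen:
            seen.add(word)
            vocab.append(word)
    return vocab


batch = [["stocks", "fell", "sharply"], ["oil", "prices", "fell"]]
freqs = Counter({"the": 50, "in": 40, "fell": 30, "rise": 20, "on": 10})
lvt_vocab = build_lvt_vocab(batch, freqs, vocab_size=7)
print(lvt_vocab)
```

The soft-max would then be computed only over the entries of `lvt_vocab` for this batch rather than over the full target dictionary.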

Vocabulary expansion
LVT addresses the computational bottleneck of a large softmax layer, but at the same time it sacrifices some of the model's ability to produce novel but meaningful words, as it now relies heavily on the source vocabulary. To overcome this shortcoming, we propose to expand the LVT vocabulary by adding the 1-nearest-neighbor of every word in the source document, as measured by cosine similarity in the word-embedding space. We hope this technique will help achieve a balance between the mutually conflicting goals of a more focused model and better generalizability.
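The expansion step can be sketched as follows. This is only an illustration under stated assumptions: the embeddings are a toy dictionary, the helper names (`cosine`, `nearest_neighbor`, `expand_vocab`) are hypothetical, and a brute-force search stands in for whatever nearest-neighbor lookup is used in practice.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def nearest_neighbor(word, embeddings):
    """Return the word (other than `word`) with the highest cosine similarity."""
    best, best_sim = None, -2.0
    for other, vec in embeddings.items():
        if other == word:
            continue
        sim = cosine(embeddings[word], vec)
        if sim > best_sim:
            best, best_sim = other, sim
    return best


def expand_vocab(source_words, embeddings):
    """Add the 1-nearest-neighbor of each source word to the vocabulary."""
    expanded = list(dict.fromkeys(source_words))  # de-duplicate, keep order
    present = set(expanded)
    for word in list(expanded):
        if word not in embeddings:
            continue  # no embedding available, nothing to add
        neighbor = nearest_neighbor(word, embeddings)
        if neighbor is not None and neighbor not in present:
            present.add(neighbor)
            expanded.append(neighbor)
    return expanded


toy_embeddings = {
    "car": [1.0, 0.0],
    "automobile": [0.9, 0.1],
    "banana": [0.0, 1.0],
    "fruit": [0.1, 0.9],
}
expanded_vocab = expand_vocab(["car", "banana"], toy_embeddings)
print(expanded_vocab)
```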

Feature-rich Encoder
In summarization, one of the key challenges is to identify the key concepts and key entities in the document, around which the story revolves. In order to accomplish this goal, we may need to go beyond the word-embeddings-based representation of the input document and capture additional features such as parts-of-speech tags, named-entity tags, and TF and IDF statistics of the words. We therefore propose the following simple change to the architecture of the model: we create additional look-up based embedding matrices for the vocabulary of each of the tag-types, similar to the embeddings for words. For continuous features such as TF and IDF, we convert them into categorical values by discretizing them into a fixed number of bins, and use one-hot representations to approximate the real values. This allows us to map them into an embedding matrix like any other tag-type. Finally, for each word in the source document, we simply look up its embeddings from all of its associated tags and concatenate them into a single long vector. This vector replaces the word-based embeddings in the original model. On the target side, we continue to use only word-based embeddings as the representation.
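The discretization and concatenation described above can be sketched as follows. The equal-width binning, the value ranges for TF and IDF, the number of bins, and the tiny embedding tables are all assumptions made for illustration; the text only states that continuous features are discretized into a fixed number of bins and that the per-feature embeddings are concatenated.

```python
def bin_index(value, low, high, num_bins):
    """Map a real value in [low, high] to one of num_bins equal-width bins."""
    if high <= low:
        raise ValueError("high must be greater than low")
    clipped = min(max(value, low), high)
    idx = int((clipped - low) / (high - low) * num_bins)
    return min(idx, num_bins - 1)  # the upper edge falls into the last bin


def featurize(word, pos, ner, tf, idf, tables, num_bins=5):
    """Concatenate the embeddings of a word and of its associated features."""
    parts = [
        tables["word"][word],
        tables["pos"][pos],
        tables["ner"][ner],
        tables["tf"][bin_index(tf, 0.0, 1.0, num_bins)],    # assumed TF range
        tables["idf"][bin_index(idf, 0.0, 10.0, num_bins)],  # assumed IDF range
    ]
    vector = []
    for part in parts:
        vector.extend(part)
    return vector


# Toy look-up tables; real tables would be learned embedding matrices.
tables = {
    "word": {"london": [0.1, 0.2, 0.3]},
    "pos": {"NNP": [1.0, 0.0]},
    "ner": {"LOCATION": [0.0, 1.0]},
    "tf": [[float(i)] for i in range(5)],
    "idf": [[10.0 + i] for i in range(5)],
}
vec = featurize("london", "NNP", "LOCATION", tf=0.5, idf=7.3, tables=tables)
print(len(vec), vec)
```

The length of the concatenated vector is the sum of the individual embedding sizes, which is why the input dimension grows when features are added.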

Switching Generator/Pointer
Often-times in summarization, the keywords in a document that are central to the summary may actually be very rare at the corpus level; e.g., in the domain of news stories, the keywords are usually named entities and may rarely occur in other documents. Since our neural architecture depends on embedding representations of words, its performance may degrade on such rare words due to sparse training data. To address this problem, we propose a novel switching decoder/pointer architecture, which is graphically represented in Figure 1. In this model, the decoder is equipped with a 'switch' that decides between using the generator or a pointer at every time-step. If the switch turns the generator on, the decoder produces a word from its target vocabulary in the normal fashion. However, if the switch turns the generator off, the decoder instead generates a pointer to one of the word-positions in the source. The word at the pointer-location is then 'copied back' into the summary. The pointer mechanism may be more robust in handling rare words because it uses as input the encoder's hidden-state representation of rare words, which depends on the entire context of the word and not just its embedding. Our hope is that, given a large number of training examples, the switch learns to pick the pointer over the generator for the production of rare words. Following (Vinyals et al., 2015), we use the attention distribution over source word positions as the generative distribution for pointers. The switch is modeled as a sigmoid activation function over a linear layer based on the hidden state of the decoder, the embedding vector from the previous emission and the context vector, as shown below:

$$P(s_i) = \sigma\big(v \cdot (W_h h_i + W_e E[o_{i-1}] + W_c c_i + b)\big) \qquad (1)$$

where $P(s_i)$ is the probability of the switch turning on, $h_i$ is the hidden state at the $i$-th time-step of the decoder, $E[o_{i-1}]$ is the embedding vector of the emission from the previous time step, $c_i$ is the attention-weighted context vector, and $W_h$, $W_e$, $W_c$, $b$ and $v$ are model parameters. For a more detailed description of this model and experiments on multiple problems, please refer to the parallel work published by some of the coauthors of this paper (Gulcehre et al., 2016).
Pointer networks have been used earlier for the problem of rare words in the context of machine translation (Luong et al., 2015). In their model, for each out-of-vocabulary (OOV) word in the target sentence, the system emits the position of its corresponding word in the source sentence. Our model is novel in that it combines emission and pointers into a joint model using the switch mechanism.
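To make the switch concrete, the sketch below computes a switch probability as a sigmoid over a linear layer of the decoder hidden state, the previous emission's embedding and the context vector, and then either generates or points. Several things here are assumptions rather than details from the text: the exact parameterization `sigmoid(v . (W_h h + W_e e + W_c c + b))`, the 0.5 decision threshold at decode time, the greedy choice of word or position, and all function names.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def switch_probability(h, e_prev, c, W_h, W_e, W_c, b, v):
    """Probability that the switch turns the generator on (assumed form)."""
    pre = [
        a + bb + cc + dd
        for a, bb, cc, dd in zip(matvec(W_h, h), matvec(W_e, e_prev), matvec(W_c, c), b)
    ]
    score = sum(vi * pi for vi, pi in zip(v, pre))
    return 1.0 / (1.0 + math.exp(-score))


def decode_step(p_switch, generator_probs, attention, source_words):
    """Generate from the vocabulary, or copy the most-attended source word."""
    if p_switch >= 0.5:  # assumed threshold: switch on, use the generator
        return "G", max(generator_probs, key=generator_probs.get)
    # Switch off: the attention distribution serves as the pointer distribution.
    position = max(range(len(attention)), key=attention.__getitem__)
    return "P", source_words[position]


identity = [[1.0, 0.0], [0.0, 1.0]]
p = switch_probability(
    h=[1.0, 0.0], e_prev=[0.0, 1.0], c=[1.0, 1.0],
    W_h=identity, W_e=identity, W_c=identity, b=[0.0, 0.0], v=[1.0, 1.0],
)
source = ["obama", "visits", "paris"]
print(decode_step(p, {"the": 0.6, "a": 0.4}, [0.1, 0.7, 0.2], source))
print(decode_step(0.2, {"the": 0.6, "a": 0.4}, [0.1, 0.7, 0.2], source))
```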


Hierarchical Encoder with Hierarchical Attention
In datasets where the source document is very long, in addition to identifying the keywords in the document, it is also important to identify the key sentences from which the summary can be drawn. This model aims to capture this notion of two levels of importance by using two bi-directional RNNs on the source side, one at the word level and the other at the sentence level, similar to the model described in (Li et al., 2015). In contrast to their work, where the attention mechanism operates only at the sentence level, in our model it operates at both levels simultaneously. The word-level attention is re-weighted by the corresponding sentence-level attention and re-normalized as shown below:

$$a(i) = \frac{a_w(i)\, a_s(s(i))}{\sum_{k=1}^{N} a_w(k)\, a_s(s(k))}$$

where $a_w(i)$ is the word-level attention weight at the $i$-th position of the source, $s(i)$ is the id of the sentence at the $i$-th word position, $a_s(j)$ is the sentence-level attention weight for the $j$-th sentence in the source, $N$ is the number of words in the source document, and $a(i)$ is the re-scaled attention at the $i$-th word position. The re-scaled attention is then used to compute the attention-weighted context vector that goes as input to the hidden state of the decoder. This model therefore aims to model key sentences as well as keywords within those sentences jointly. A graphical representation of this model is displayed in Figure 2.

Experiments and Results

Gigaword Corpus
In this series of experiments, we used the annotated Gigaword corpus as described in (Rush et al., 2015). We used the scripts made available by the authors of that work to preprocess the data, which resulted in about 3.8M training examples. The script also produces about 400K validation and test examples, but we created a randomly sampled subset of 2,000 examples each for validation and testing purposes, on which we report our performance. Further, we also acquired the exact test sample used in (Rush et al., 2015) to make a precise comparison of our models with theirs. We also made small modifications to the script to extract not only the tokenized words, but also system-generated parts-of-speech tags and named-entity tags.
Training: For all the models we discuss below, we used word2vec vectors (Mikolov et al., 2013) trained on the same corpus to initialize the model embeddings, but we allowed the embeddings to be updated during training. When we used only the first sentence of the document as the source, as done in (Rush et al., 2015), the encoder vocabulary size was 119,505 and that of the decoder stood at 68,885. We used Adadelta (Zeiler, 2012) for training, with an initial learning rate of 0.001. We used a batch size of 50 and randomly shuffled the training data at every epoch, while sorting every 10 batches according to their lengths to speed up training. We did not use any dropout or regularization, but we applied gradient clipping. We used early stopping based on the validation set and used the best model on the validation set to report all performance numbers.
Decoding: At decode time, we used beam search of size 5 to generate the summary, and limited the size of the summary to a maximum of 30 words, since this is the maximum size we noticed in the sampled validation set. We found that the average system summary length from all our models (7.8 to 8.3 words) agrees very closely with that of the ground truth on the validation set (about 8.7 words), without any specific tuning.
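The batching scheme mentioned under Training (shuffle every epoch, then sort every 10 batches by length) can be sketched as below. This is an illustrative reading of that sentence, not the authors' code: sorting by source length, the `(source_tokens, summary_tokens)` example format, and the function name and `seed` parameter are assumptions.

```python
import random


def make_batches(examples, batch_size=50, group=10, seed=0):
    """Shuffle the data, then sort each block of `group` batches by length.

    examples: list of (source_tokens, summary_tokens) pairs.
    Sorting within a block keeps similar-length examples in the same batch,
    which reduces padding while preserving randomness across blocks.
    """
    rng = random.Random(seed)
    data = list(examples)
    rng.shuffle(data)
    block_size = batch_size * group
    batches = []
    for start in range(0, len(data), block_size):
        # Assumption: examples are ordered by the length of the source side.
        block = sorted(data[start:start + block_size], key=lambda ex: len(ex[0]))
        for i in range(0, len(block), batch_size):
            batches.append(block[i:i + batch_size])
    return batches


toy_examples = [(["w"] * (1 + (7 * i) % 11), ["s"]) for i in range(25)]
toy_batches = make_batches(toy_examples, batch_size=5, group=2, seed=1)
print([len(batch) for batch in toy_batches])
```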
Evaluation metrics: In (Rush et al., 2015), the authors used the full-length version of Rouge recall to evaluate their systems on the Gigaword corpus. The reason is that the official 75-byte ceiling in the limited-length Rouge recall metric is too long for Gigaword summaries, which tend to be much shorter than DUC ones. However, full-length recall favors longer summaries, so it may not be fair to use it to compare two systems that differ in summary length. Full-length F1 solves this problem since it can penalize longer summaries that are noisy. Therefore, we use full-length F1 scores from Rouge-1, Rouge-2 and Rouge-L, computed with the official script, to evaluate our systems on this corpus. However, in the interest of fair comparison with previous work, we also report full-length recall scores where necessary. In addition, we also report the percentage of tokens in the system summary that occur in the source (which we call 'src. copy rate' in Table 1). We describe all our experiments below.
words-1sent: This is our baseline model and corresponds to the encoder-decoder model described in Sec. 3.1, where we used only the first sentence of the document as the source. The size of the embedding used in this model is 100 and the hidden state size is 200. The training time for this model on a single GPU was 13 hours per epoch, with convergence reached in 19 epochs. We report the performance of this model on both our internal validation and test samples in Table 1, in rows #1 and #11.
words-lvt2k-1sent: In this model, we employ the large-vocabulary trick described in Sec. 3.2, where we restrict the decoder vocabulary size to 2,000.
The validation results in row #2 of the table show that this model achieves Rouge numbers similar to those of words-1sent but, not surprisingly, relies more on the source vocabulary, as indicated in the last column of Table 1. In the rest of the models described below, we persist with this technique because it cuts down the training time per epoch by nearly three times, and helps this and all subsequent models converge in only 50%-75% of the epochs needed for words-1sent.
words-lvt2k-(2|5)sent: These models are exactly the same as the model above, except that they are trained on the first 2 and 5 sentences of the source document respectively. The table shows that the model trained on 2 sentences (row #3) improves over the one trained on 1 sentence (row #2), while the one trained on 5 sentences (row #4) is not as good as the one trained on two. This may be attributed to the fact that, on the Gigaword corpus, most of the information needed for the summary may be found in the first one or at most two sentences, and the rest contain extraneous information. We therefore fixed the source length at 2 sentences for subsequent models.
words-lvt2k-2sent-exp: This model corresponds to Sec. 3.3. We add the nearest neighbors to the LVT vocabulary before topping it off with the most frequent words from the target dictionary, to make sure that the LVT vocabulary size is still maintained at 2,000. This model produced modest gains as well, as shown in row #5, but did not improve when used in the context of bigger models in the following experiments.
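To illustrate the point made in the discussion of evaluation metrics, namely that full-length recall favors longer summaries while F1 penalizes them, here is a simplified unigram-overlap computation. It is only a sketch: it is not the official Rouge script (no stemming, stop-word handling or multiple references), and the function name `rouge1` is hypothetical.

```python
from collections import Counter


def rouge1(reference, candidate):
    """Simplified Rouge-1: returns (recall, precision, f1) on token lists."""
    ref_counts, cand_counts = Counter(reference), Counter(candidate)
    # Clipped unigram overlap (multiset intersection).
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / len(reference) if reference else 0.0
    precision = overlap / len(candidate) if candidate else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1


reference = "transactions dip at nigerian stock exchange".split()
short = "transactions at nigerian stock exchange down".split()
# A longer, noisier candidate: same overlap with the reference, more tokens.
long = short + "since last week a nse official said thursday".split()

recall_short, _, f1_short = rouge1(reference, short)
recall_long, _, f1_long = rouge1(reference, long)
print(recall_short, f1_short)
print(recall_long, f1_long)
```

Both candidates obtain the same recall because they cover the same reference words, but the longer candidate has lower precision and therefore a lower F1.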
words-lvt2k-2sent-hieratt: Since we used two sentences from the source document, we implemented the hierarchical attention model proposed in Sec. 3.6. Confirming our intuition, this model improves performance compared to its flatter counterpart, as shown in row #6. However, since this model is computationally expensive, we did not train it in our subsequent experiments using larger models.
big-words-lvt2k-(1|2)sent: These are identical to the models words-lvt2k-(1|2)sent, but we increase the embedding size to 200 and the hidden state size to 400. As shown in rows #7 and #8 of the table, while both models improve performance compared to their smaller counterparts, big-words-lvt2k-2sent, trained over 2 source sentences, achieves the best performance.
big-feats-lvt2k-2sent: Here, we exploit the parts-of-speech and named-entity tags in the annotated Gigaword corpus, as well as TF and IDF values, to augment the input embeddings on the source side as described in Sec. 3.4. In total, our embedding vector grew from the original 100 to 155, and produced a small gain compared to its counterpart big-words-lvt2k-2sent, as shown in row #9 of Table 1, demonstrating the utility of syntax-based features in this task.
feats-lvt2k-2sent-ptr: This is the switching generator/pointer model described in Sec. 3.5. We created ground truth data for pointers as follows. In the training data, we created a pointer from each 'UNK' token in the summary to the position in the corresponding source document where the matching word occurs, as seen in the original data. The resulting data has 2.7 pointers per 100 examples in the training set and 9.1 in the validation set. When there are multiple matches in the source, we resolved the conflict in favor of the first occurrence of the matching word in the source document. We modeled the 'UNK' tokens on the source side using a single placeholder embedding that is shared across all documents. We trained this model using feature embeddings and pointer data as input.
Comparison with state-of-the-art: (Rush et al., 2015) reported only recall from the full-length version of Rouge, but the authors kindly provided us with their F1 numbers, as well as their test sample. We compared the performance of our models words-1sent and big-words-lvt2k-2sent with their best system on their sample, on both recall and F1, as displayed in rows #14 through #19 in Table 1. We did not evaluate our best models here because this test set did not include the NLP annotations that those models need. The table shows that both of our models outperform the state-of-the-art model of (Rush et al., 2015) on both recall and F1, with statistical significance. In addition, our models exhibit better abstractive ability, as shown by the src. copy rate metric in the last column of the table.
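Returning to the pointer ground-truth construction described for feats-lvt2k-2sent-ptr, the procedure can be sketched as follows. The function name, the `None` marker for non-pointer positions, and the handling of out-of-vocabulary summary words that never occur in the source are assumptions made for illustration.

```python
def make_pointers(source, summary, vocab, unk="UNK"):
    """Create decoder targets and pointer positions for one training pair.

    Each summary word outside the decoder vocabulary becomes 'UNK' and is
    paired with the position of its first occurrence in the source.
    """
    first_position = {}
    for index, word in enumerate(source):
        first_position.setdefault(word, index)  # keep the first occurrence only
    targets, pointers = [], []
    for word in summary:
        if word in vocab:
            targets.append(word)
            pointers.append(None)  # generated normally, no pointer
        else:
            targets.append(unk)
            # Assumption: None when the word does not occur in the source.
            pointers.append(first_position.get(word))
    return targets, pointers


source = "rafsanjani met putin in moscow ; putin praised talks".split()
summary = "putin hosts rafsanjani".split()
vocab = {"met", "in", "hosts", "talks", "praised", ";"}
targets, pointers = make_pointers(source, summary, vocab)
print(targets, pointers)
```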

DUC Corpus
The DUC corpus (http://duc.nist.gov/duc2004/tasks.html) comes in two parts: the 2003 corpus consisting of 624 (document, summary) pairs and the 2004 corpus consisting of 500 pairs. Since these corpora are too small to train large neural networks on, (Rush et al., 2015) trained their models on the Gigaword corpus, but combined it with an additional log-linear extractive summarization model with handcrafted features that is trained on the DUC 2003 corpus. They call the original neural attention model the ABS model, and the combined model ABS+. The latter model is state-of-the-art since it outperforms all previously published baselines, as measured by the official DUC metric of limited-length recall.
In these experiments, we use the same metric to evaluate our models too.
In our work, the only tuning we do on the validation set (DUC 2003 corpus) is to select a model from Table 1. As shown in Table 2, the best model on at least two variants of Rouge is big-words-lvt2k-1sent. The performance of this model is compared with the ABS and ABS+ models, as well as TOPIARY, the top performing system on DUC-2004, on the test set in the same table. We note that although our model consistently outperforms ABS+ on all three variants of Rouge, the differences are not statistically significant. However, when the comparison is made with the ABS model, which is really the true un-tuned counterpart of our model, the results are indeed statistically significant.

CNN/DailyMail Corpus
The existing abstractive text summarization corpora, including Gigaword and DUC, have only one sentence in each summary. In this section, we present a new corpus that consists of multi-sentence summaries ordered according to the events described in the source document. To produce this corpus, we modify an existing corpus that has been used for the task of passage-based question answering (Hermann et al., 2015). In that work, the authors used the human-generated abstractive summary bullets from news stories on the CNN and Daily Mail websites as questions (with one of the entities hidden), and the stories as the corresponding passages from which to answer the fill-in-the-blank question. The authors released the scripts that crawl these websites, extract and generate pairs of passages and questions. With a simple modification of the script, we were able to restore all the summary bullets of each story in the original order to obtain a multi-sentence summary, treating each bullet as a sentence. In all, this corpus has 286,817 training pairs, 13,368 validation pairs and 11,487 test pairs, as defined by their scripts. The source documents in the training set have 766 words spanning 29.74 sentences on average, while the summaries consist of 53 words and 3.72 sentences. The unique characteristics of this dataset, such as long documents and ordered multi-sentence summaries, present interesting challenges, and we hope they will attract future researchers to build and test novel models on it. The dataset is released in two versions: one that has actual entity names, and another in which entity occurrences are replaced with document-specific integer ids beginning from 0. Since the vocabulary size is smaller in the anonymized version, we used it in all our experiments below. We limited the source vocabulary size to 150K and the target vocabulary to 60K, and the source and target lengths to at most 800 and 100 words respectively. We used 100-dimension word2vec embeddings trained on this dataset as input, and we fixed the model hidden state size at 200. We also created pointers in the training data by matching only the anonymized entity ids between source and target, along similar lines to what we did for the UNK tokens in the Gigaword corpus.

Table 1: Performance comparison of various models. '*' indicates statistical significance of the corresponding model with respect to the baseline model on its dataset as given by the 95% confidence interval in the official Rouge script. We report statistical significance only for the best performing models. 'src. copy rate' for the reference data on our validation sample is 45%. Please refer to Section 4 for explanation of notation.
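As a rough picture of the corpus construction described in this section, the sketch below anonymizes entities with document-specific integer ids starting from 0, joins the summary bullets in their original order, and truncates source and target to the stated length limits. Everything beyond those three ideas is an assumption: the `@entityN` token format, the restriction to single-token entities, and the function names are illustrative and not taken from the released scripts.

```python
def anonymize(tokens, entities, ids=None):
    """Replace entity tokens with document-specific ids (0, 1, 2, ...)."""
    ids = {} if ids is None else ids
    output = []
    for token in tokens:
        if token in entities:
            if token not in ids:
                ids[token] = len(ids)  # next unused id within this document
            output.append("@entity%d" % ids[token])  # assumed token format
        else:
            output.append(token)
    return output, ids


def build_pair(story, bullets, entities, max_source=800, max_target=100):
    """Build one (source, multi-sentence summary) pair from a story."""
    source, ids = anonymize(story, entities)
    summary = []
    for bullet in bullets:  # bullets stay in their original order
        tokens, ids = anonymize(bullet, entities, ids)
        summary.extend(tokens)
    return source[:max_source], summary[:max_target]


story = "obama met merkel in berlin".split()
bullets = [["merkel", "hosts", "obama"], ["talks", "held", "in", "berlin"]]
pair_source, pair_summary = build_pair(story, bullets, {"obama", "merkel", "berlin"})
print(pair_source)
print(pair_summary)
```

Because the same id map is shared between the story and its bullets, an entity id in the summary can be matched directly to its positions in the source when creating pointers.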
Results from three models we ran on this corpus are displayed in Table 3. Although this dataset is smaller and more complex than the Gigaword corpus, it is interesting to note that the numbers are in the same range. Our switching pointer/generator model slightly improves over the baseline, but we expect the hierarchical attention model to outperform the other models owing to the large number of sentences in the source. However, the convergence rate of these models is quite slow on this dataset, and the hierarchical model had not yet reached convergence at the time of submission.

Qualitative Analysis
Table 4 presents a few high quality and poor quality outputs from big-feats-lvt2k-2sent, one of our best performing models on the validation set. Even when the model differs from the target summary, its summaries tend to be very meaningful and relevant, a phenomenon not captured by word/phrase matching evaluation metrics such as Rouge. On the other hand, the model sometimes 'misinterprets' the semantics of the text and generates a summary with what would be a laughable interpretation for humans, as shown in the poor quality examples in the table. Clearly, capturing the 'meaning' of complexly worded sentences remains a weakness of these models.
Good quality summary output
S: a man charged with the murder last year of a british backpacker confessed to the slaying on the night he was charged with her killing , according to police evidence presented at a court hearing tuesday .
T: man charged with british backpacker 's death confessed to crime police officer claims
O: man charged with murdering british backpacker confessed to murder

S: volume of transactions at the nigerian stock exchange has continued its decline since last week , a nse official said thursday .
T: transactions dip at nigerian stock exchange
O: transactions at nigerian stock exchange down

Poor quality summary output
S: broccoli and broccoli sprouts contain a chemical that kills the bacteria responsible for most stomach cancer , say researchers , confirming the dietary advice that moms have been handing out for years
T: for release at #### <unk> mom was right broccoli is good for you say cancer researchers
O: broccoli sprouts contain deadly bacteria

S: j.p. morgan chase 's ability to recover from a slew of recent losses rests largely in the hands of two men , who are both looking to restore tarnished reputations and may be considered for the top job someday .
T: # executives to lead j.p. morgan chase on road to recovery
O: j.p. morgan chase may be considered for top job

Table 4: Examples of generated summaries from our best model on the validation set of Gigaword corpus. S: source document, T: target summary, O: system output.
Our next example, presented in Figure 3, displays sample output from the switching generator/pointer model on the Gigaword corpus. It is apparent from the examples that the model learns to use pointers very accurately, not only for named entities but also for multi-word phrases. Despite this accuracy, the performance improvement of the overall model is not significant. We believe the impact of this model may be more pronounced in other settings with a heavier-tailed distribution of rare words. We intend to carry out more experiments with this model in the future.
On CNN/DailyMail data, although our models are able to produce good quality multi-sentence summaries, we notice that the same sentence or phrase often gets repeated in the summary. We believe models that incorporate intra-attention, such as (Cheng et al., 2016), can fix this problem by encouraging the model to 'remember' the words it has already produced in the past.

Conclusion
In this work, we apply the attentional encoder-decoder architecture to the task of summarization with very promising results, significantly outperforming state-of-the-art results on two different datasets. Each of our proposed models, such as the features-based embeddings, the hierarchical attention model and the switching pointer/generator network, addresses a specific problem in abstractive summarization, yielding further improvement in performance. We also propose a new dataset for multi-sentence summarization and establish benchmark numbers on it. As part of our future work, we plan to focus our efforts on this data and build more robust models for summaries consisting of multiple sentences.

Figure 1: Switching generator/pointer model: When the switch shows 'G', the traditional generator consisting of the softmax layer is used to produce a word, and when it shows 'P', the pointer network is activated to copy the word from one of the source document positions. When the pointer is activated, the embedding from the source is used as input for the next time-step, as shown by the arrow from the encoder to the decoder at the bottom.

Figure 2: Hierarchical encoder with hierarchical attention: the attention weights at the word level, represented by the dashed arrows, are re-scaled by the corresponding sentence-level attention weights, represented by the dotted arrows.
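The re-scaling summarized in this caption, and described in the hierarchical attention section, multiplies each word-level attention weight by the attention weight of the sentence containing that word and then re-normalizes over all source words. A minimal sketch, with a hypothetical function name and toy attention values, is:

```python
def rescale_attention(word_attention, sentence_attention, sentence_ids):
    """Re-weight word-level attention by sentence-level attention.

    word_attention:     one weight per source word position.
    sentence_attention: one weight per source sentence.
    sentence_ids:       sentence index of each source word position.
    """
    scores = [
        a_w * sentence_attention[s_id]
        for a_w, s_id in zip(word_attention, sentence_ids)
    ]
    total = sum(scores)
    if total == 0:
        raise ValueError("attention mass is zero, cannot re-normalize")
    return [score / total for score in scores]


# Four source words: the first two in sentence 0, the last two in sentence 1.
rescaled = rescale_attention(
    word_attention=[0.1, 0.4, 0.3, 0.2],
    sentence_attention=[0.25, 0.75],
    sentence_ids=[0, 0, 1, 1],
)
print(rescaled)
```

In this toy case the words of the second sentence gain attention mass because their sentence receives the larger sentence-level weight.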

Figure 3: Sample output from switching generator/pointer networks. An arrow indicates that a pointer to the source position was used to generate the corresponding summary word.

3.1, where we used only the first sentence of the document as the source. The size of the embedding used in this model is 100 and the hidden state size is 200. The training time for this model on a single GPU was 13 hours per epoch, with convergence reached in 19 epochs. We report the performance of this model on both our internal validation and test samples in Table 1, in rows #1 and #11. words-lvt2k-1sent: In this model, we employ the large-vocabulary trick described in Sec. 3.2,

Table 2: Evaluation of our models using the limited-length Rouge Recall on DUC validation and test sets.

Table 1: Performance comparison of various models. '*' indicates statistical significance of the corresponding model with

Table 3: Performance of various models on CNN/Daily