Abstractive Sentence Summarization with Attentive Recurrent Neural Networks

Abstractive Sentence Summarization generates a shorter version of a given sentence while attempting to preserve its meaning. We introduce a conditional recurrent neural network (RNN) which generates a summary of an input sentence. The conditioning is provided by a novel convolutional attention-based encoder which ensures that the decoder focuses on the appropriate input words at each step of generation. Our model relies only on learned features and is easy to train in an end-to-end fashion on large data sets. Our experiments show that the model signiﬁcantly outperforms the recently proposed state-of-the-art method on the Giga-word corpus while performing competitively on the DUC-2004 shared task.


Introduction
Generating a condensed version of a passage while preserving its meaning is known as text summarization. Tackling this task is an important step towards natural language understanding. Summarization systems can be broadly classified into two categories. Extractive models generate summaries by cropping important segments from the original text and putting them together to form a coherent summary. Abstractive models generate summaries from scratch without being constrained to reuse phrases from the original text.
In this paper we propose a novel recurrent neural network for the problem of abstractive sentence summarization. Inspired by the recently proposed architectures for machine translation , our model consists of a conditional recurrent neural network, which acts as a decoder to generate the summary of an input sentence, much like a standard recurrent language model. In addition, at every time step the decoder also takes a conditioning input which is the output of an encoder module. Depending on the current state of the RNN, the encoder computes scores over the words in the input sentence. These scores can be interpreted as a soft alignment over the input text, informing the decoder which part of the input sentence it should focus on to generate the next word. Both the decoder and encoder are jointly trained on a data set consisting of sentence-summary pairs. Our model can be seen as an extension of the recently proposed model for the same problem by Rush et al. (2015). While they use a feed-forward neural language model for generation, we use a recurrent neural network. Furthermore, our encoder is more sophisticated, in that it explicitly encodes the position information of the input words. Lastly, our encoder uses a convolutional network to encode input words. These extensions result in improved performance.
The main contribution of this paper is a novel convolutional attention-based conditional recurrent neural network model for the problem of abstractive sentence summarization. Empirically we show that our model beats the state-of-the-art systems of Rush et al. (2015) on multiple data sets. Particularly notable is the fact that even with a simple generation module, which does not use any extractive feature tuning, our model manages to significantly outperform their ABS+ system on the Gigaword data set and is comparable on the DUC-2004 task.

93
While there is a large body of work for generating extractive summaries of sentences (Jing, 2000;Knight and Marcu, 2002;McDonald, 2006;Clarke and Lapata, 2008;Filippova and Altun, 2013;Filippova et al., 2015), there has been much less research on abstractive summarization. A count-based noisy-channel machine translation model was proposed for the problem in Banko et al. (2000). The task of abstractive sentence summarization was later formalized around the DUC-2003and DUC-2004competitions (Over et al., 2007, where the TOP-IARY system (Zajic et al., 2004) was the state-ofthe-art. More recently Cohn and Lapata (2008) and later Woodsend et al. (2010) proposed systems which made heavy use of the syntactic features of the sentence-summary pairs. Later, along the lines of Banko et al. (2000), MOSES was used directly as a method for text simplification by Wubben et al. (2012). Other works which have recently been proposed for the problem of sentence summarization include (Galanis and Androutsopoulos, 2010; Napoles et al., 2011;Cohn and Lapata, 2013). Very recently Rush et al. (2015) proposed a neural attention model for this problem using a new data set for training and showing state-of-the-art performance on the DUC tasks. Our model can be seen as an extension of their model.

Attentive Recurrent Architecture
Let x denote the input sentence consisting of a sequence of M words x = [x 1 , . . . , x M ], where each word x i is part of vocabulary V, of size |V| = V . Our task is to generate a target sequence y = [y 1 , . . . , y N ], of N words, where N < M , such that the meaning of x is preserved: y = argmax y P (y|x), where y is a random variable denoting a sequence of N words.
Typically the conditional probability is modeled by a parametric function with parameters θ: P (y|x) = P (y|x; θ). Training involves finding the θ which maximizes the conditional probability of sentence-summary pairs in the training corpus. If the model is trained to generate the next word of the summary, given the previous words, then the above conditional can be factorized into a product of indi-vidual conditional probabilities: In this work we model this conditional probability using an RNN Encoder-Decoder architecture, inspired by  and subsequently extended in . We call our model RAS (Recurrent Attentive Summarizer).

Recurrent Decoder
The above conditional is modeled using an RNN: where h t is the hidden state of the RNN: Here c t is the output of the encoder module (detailed in §3.2). It can be seen as a context vector which is computed as a function of the current state h t−1 and the input sequence x.

Attentive Encoder
We now give the details of the encoder which computes the context vector c t for every time step t of the decoder above. With a slight overload of notation, for an input sentence x we denote by x i the d dimensional learnable embedding of the i-th word (x i ∈ R d ). In addition the position i of the word x i is also associated with a learnable embedding l i of size d (l i ∈ R d ). Then the full embedding for i-th word in x is given by a i = x i + l i . Let us denote by B k ∈ R q×d a learnable weight matrix which is used to convolve over the full embeddings of consecutive words. Let there be d such matrices (k ∈ {1, . . . , d}). The output of convolution is given by: where b k j is the j-th column of the matrix B k . Thus the d dimensional aggregate embedding vector z i is defined as z i = [z i1 , . . . , z id ]. Note that each word x i in the input sequence is associated with one aggregate embedding vector z i . The vectors z i can be seen as a representation of the word which captures the position in which it occurs in the sentence and also the context in which it appears in the sentence. In our experiments the width q of the convolution matrix B k was set to 5. To account for words at the boundaries of x we first pad the sequence on both sides with dummy words before computing the aggregate vectors z i 's.
Given these aggregate vectors of words, we compute the context vector c t (the encoder output) as: where the weights α j,t−1 are computed as

Training and Generation
Given a training corpus S = {(x i , y i )} S i=1 of S sentence-summary pairs, the above model can be trained end-to-end using stochastic gradient descent by minimizing the negative conditional log likelihood of the training data with respect to θ: where the parameters θ constitute the parameters of the decoder and the encoder.
Once the parametric model is trained we generate a summary for a new sentence x through a wordbased beam search such that P (y|x) is maximized, argmax P (y t |{y 1 , . . . , y t−1 }, x). The search is parameterized by the number of paths k that are pursued at each time step.

Datasets and Evaluation
Our models are trained on the annotated version of the Gigaword corpus (Graff et al., 2003;Napoles et al., 2012) and we use only the annotations for tokenization and sentence separation while discarding other annotations such as tags and parses. We pair the first sentence of each article with its headline to form sentence-summary pairs. The data is pre-processed in the same way as Rush et al.
(2015) and we use the same splits for training, validation, and testing. For Gigaword we report results on the same randomly held-out test set of 2000 sentence-summary pairs as (Rush et al., 2015). 1 We also evaluate our models on the DUC-2004 evaluation data set comprising 500 pairs (Over et al., 2007). Our evaluation is based on three variants of ROUGE (Lin, 2004), namely, ROUGE-1 (unigrams), ROUGE-2 (bigrams), and ROUGE-L (longest-common substring).

Architectural Choices
We implemented our models in the Torch library (http://torch.ch/) 2 . To optimize our loss (Equation 5) we used stochastic gradient descent with mini-batches of size 32. During training we measure the perplexity of the summaries in the validation set and adjust our hyper-parameters, such as the learning rate, based on this number. For the decoder we experimented with both the Elman RNN and the Long-Short Term Memory (LSTM) architecture (as discussed in § 3.1). We chose hyper-parameters based on a grid search and picked the one which gave the best perplexity on the validation set. In particular we searched over the number of hidden units H of the recurrent layer, the learning rate η, the learning rate annealing schedule γ (the factor by which to decrease η if the validation perplexity increases), and the gradient clipping threshold κ. Our final Elman architecture (RAS-Elman) uses a single layer with H = 512, η = 0.5, γ = 2, and κ = 10. The LSTM model (RAS-LSTM) also has a single layer with H = 512, η = 0.1, γ = 2, and κ = 10.

Results
On the Gigaword corpus we evaluate our models in terms of perplexity on a held-out set. We then pick the model with best perplexity on the held-out set and use it to compute the F1-score of ROUGE-1, ROUGE-2, and ROUGE-L on the test sets, all of which we report. For the DUC corpus however, inline with the standard, we report the recall-only ROUGE. As baseline we use the state-of-the-art attention-based system (ABS) of Rush et al. (2015) which relies on a feed-forward network decoder. Additionally, we compare to an enhanced version of their system (ABS+), which relies on a range of separate extractive summarization features that are added as log-linear features in a secondary learning step with minimum error rate training (Och, 2003). Table 1 shows that both our RAS-Elman and RAS-LSTM models achieve lower perplexity than   ABS as well as other models reported in Rush et al. (2015). The RAS-LSTM performs slightly worse than RAS-Elman, most likely due to over-fitting. We attribute this to the relatively simple nature of this task which can be framed as English-to-English translation with few long-term dependencies. The ROUGE results (Table 2) show that our models comfortably outperform both ABS and ABS+ by a wide margin on all metrics. This is even the case when we rely only on very fast greedy search (k = 1), while as ABS uses a much wider beam of size k = 50; the stronger ABS+ system also uses additional extractive features which our model does not. These features cause ABS+ to copy 92% of words from the input into the summary, whereas our model copies only 74% of the words leading to more abstractive summaries. On DUC-2004 we report recall ROUGE as is customary on this dataset. The results (Table 3) show that our models are better than ABS+. However the improvements are smaller than for Gi-gaword which is likely due to two reasons: First, tokenization of DUC-2004 differs slightly from our training corpus. Second, headlines in Gigaword are much shorter than in DUC-2004. For the sake of completeness we also compare our models to the recently proposed standard Neural Machine Translation (NMT) systems. In particular, we compare to a smaller re-implementation of the attentive stacked LSTM encoder-decoder of Luong et al. (2015). Our implementation uses two-layer LSTMs for the encoder-decoder with 500 hidden units in each layer. Tables 2 and 3 report ROUGE scores on the two data sets. From the tables we observe that the proposed RAS-Elman model is able to match the performance of the NMT model of Luong at al. (2015). This is noteworthy because RAS-Elman is significantly simpler than the NMT model at multiple levels. First, the encoder used by RAS-Elman is extremely light-weight (attention over the convolutional representation of the input words), compared to Luong's (a 2 hidden layer LSTM). Second, the decoder used by RAS-Elman is a single layer standard (Elman) RNN as opposed to a multi-layer LSTM. In an independent work, Nallapati et. al (2016) also trained a collection of standard NMT models and report numbers in the same ballpark as RAS-Elman on both datasets.
In order to better understand which component of the proposed architecture is responsible for the improvements, we trained the recurrent model with Rush et. al., (2015)'s ABS encoder on a subset of the Gigaword dataset. The ABS encoder, which does not have the position features, achieves a final validation perplexity of 38 compared to 29 for the proposed encoder, which uses position features as well as context information. This clearly shows the benefits of using the position feature in the proposed encoder.
Finally in Figure 1 we highlight anecdotal examples of summaries produced by the RAS-Elman system on the Gigaword dataset. The first two examples highlight typical improvements in the RAS model over ABS+. Generally the model produces more fluent summaries and is better able to capture the main actors of the input. For instance in Sentence 1, RAS-Elman correctly distinguishes the actions of "pepe" from "ferreira", and in Sentence 2 it identifies the correct role of the "think tank". The final two ex-I(1): brazilian defender pepe is out for the rest of the season with a knee injury , his porto coach jesualdo ferreira said saturday . G: football : pepe out for season A+: ferreira out for rest of season with knee injury R: brazilian defender pepe out for rest of season with knee injury I(2): economic growth in toronto will suffer this year because of sars , a think tank said friday as health authorities insisted the illness was under control in canada 's largest city . G: sars toll on toronto economy estimated at c$ # billion A+: think tank under control in canada 's largest city R: think tank says economic growth in toronto will suffer this year I(3): colin l. powell said nothing -a silence that spoke volumes to many in the white house on thursday morning . G: in meeting with former officials bush defends iraq policy A+: colin powell speaks volumes about silence in white house R: powell speaks volumes on the white house I(4): an international terror suspect who had been under a controversial loose form of house arrest is on the run , british home secretary john reid said tuesday . G: international terror suspect slips net in britain A+: reid under house arrest terror suspect on the run R: international terror suspect under house arrest Figure 1: Example sentence summaries produced on Gigaword. I is the input, G is the true headline, A is ABS+, and R is RAS-ELMAN.
amples highlight typical mistakes of the models. In Sentence 3 both models take literally the figurative use of the idiom "a silence that spoke volumes," and produce fluent but nonsensical summaries. In Sentence 4 the RAS model mistakes the content of a relative clause for the main verb, leading to a summary with the opposite meaning of the input. These difficult cases are somewhat rare in the Gigaword, but they highlight future challenges for obtaining human-level sentence summary.

Conclusion
We extend the state-of-the-art model for abstractive sentence summarization (Rush et al., 2015) to a recurrent neural network architecture. Our model is a simplified version of the encoder-decoder framework for machine translation . The model is trained on the Gigaword corpus to generate headlines based on the first line of each news article. We comfortably outperform the previous state-of-the-art on both Gigaword data and the DUC-2004 challenge even though our model does not rely on additional extractive features.