Selective Encoding for Abstractive Sentence Summarization

We propose a selective encoding model that extends the sequence-to-sequence framework for abstractive sentence summarization. It consists of a sentence encoder, a selective gate network, and an attention-equipped decoder. The sentence encoder and decoder are built with recurrent neural networks. The selective gate network constructs a second-level sentence representation by controlling the information flow from encoder to decoder. The second-level representation is tailored to the sentence summarization task, which leads to better performance. We evaluate our model on the English Gigaword, DUC 2004 and MSR abstractive sentence summarization datasets. The experimental results show that the proposed selective encoding model outperforms the state-of-the-art baseline models.


Introduction
Sentence summarization aims to shorten a given sentence and produce a brief summary of it. It differs from document-level summarization because extractive techniques, such as extracting sentence-level features and ranking sentences, are hard to apply to a single sentence. Early works propose rule-based methods (Zajic et al., 2007), syntactic tree pruning methods (Knight and Marcu, 2002), statistical machine translation techniques (Banko et al., 2000) and so on for this task. We focus on the abstractive sentence summarization task in this paper.
Recently, neural network models have been applied to this task. Rush et al. (2015) use auto-constructed sentence-headline pairs to train a neural network summarization model. They use a Convolutional Neural Network (CNN) encoder and a feed-forward neural network language model decoder for this task. Chopra et al. (2016) extend their work by replacing the decoder with a Recurrent Neural Network (RNN). Subsequent work follows this line and changes the encoder to an RNN as well, which yields a fully RNN-based sequence-to-sequence model.
* Contribution during internship at Microsoft Research.
Input: the sri lankan government on wednesday announced the closure of government schools with immediate effect as a military campaign against tamil separatists escalated in the north of the country .
Output: sri lanka closes schools as war escalates
Figure 1: An abstractive sentence summarization system may produce the output summary by distilling the salient information from the highlight to generate a fluent sentence. We model the distilling process with selective encoding.
All the above works fall into the encoding-decoding paradigm, which first encodes the input sentence to an abstract representation and then decodes the intended output sentence based on the encoded information. As an extension of the encoding-decoding framework, the attention-based approach (Bahdanau et al., 2015) has been broadly used: the encoder produces a list of vectors for all tokens in the input, and the decoder uses an attention mechanism to dynamically extract encoded information and align with the output tokens. This approach achieves huge success in tasks like machine translation, where alignment between all parts of the input and output is required. However, in abstractive sentence summarization, there is no explicit alignment relationship between the input sentence and the summary except for the extracted common words. The challenge here is not to infer the alignment, but to select the highlights while filtering out secondary information in the input. A desired workflow for abstractive sentence summarization is encoding, selection, and decoding. After selecting the important information from an encoded sentence, the decoder produces the output summary using the selected information. For example, in Figure 1, given the input sentence, the summarization system first selects the important information, and then rephrases or paraphrases it to produce a well-organized summary. Although this is implicitly modeled in the encoding-decoding framework, we argue that abstractive sentence summarization should benefit from explicitly modeling this selection process.
In this paper we propose Selective Encoding for Abstractive Sentence Summarization (SEASS). We treat sentence summarization as a three-phase task: encoding, selection, and decoding. The model consists of a sentence encoder, a selective gate network, and a summary decoder. First, the sentence encoder reads the input words through an RNN unit to construct the first-level sentence representation. Then the selective gate network selects the encoded information to construct the second-level sentence representation. The selective mechanism controls the information flow from encoder to decoder by applying a gate network according to the sentence information, which helps improve encoding effectiveness and relieve the burden on the decoder. Finally, the attention-equipped decoder generates the summary using the second-level sentence representation. We conduct experiments on the English Gigaword, DUC 2004 and Microsoft Research Abstractive Text Compression test sets. Our SEASS model achieves 17.54 ROUGE-2 F1, 9.56 ROUGE-2 recall and 10.63 ROUGE-2 F1 on these test sets respectively, which improves performance compared to the state-of-the-art methods.

Related Work
Abstractive sentence summarization, also known as sentence compression and similar to headline generation, is used to help compress or fuse the selected sentences in extractive document summarization systems, since they may inadvertently include unnecessary information. The sentence summarization task has long been connected to the headline generation task. Several earlier methods address this task, such as the linguistic rule-based method of Dorr et al. (2003). Among statistical machine learning methods, Banko et al. (2000) apply statistical machine translation techniques by modeling headline generation as a translation task and use 8000 article-headline pairs to train the system.
Rush et al. (2015) propose leveraging news data in the Annotated English Gigaword (Napoles et al., 2012) corpus to construct large-scale parallel data for the sentence summarization task. They propose an ABS model, which consists of an attentive Convolutional Neural Network encoder and a neural network language model (Bengio et al., 2003) decoder. On this Gigaword test set and the DUC 2004 test set, the ABS model produces state-of-the-art results. Chopra et al. (2016) extend this work by keeping the CNN encoder but replacing the decoder with recurrent neural networks. Their experiments show that the CNN encoder with an RNN decoder performs better than the model of Rush et al. (2015). Subsequent work further changes the encoder to an RNN encoder, which leads to a full RNN sequence-to-sequence model. Besides, it enriches the encoder with lexical and statistical features that play important roles in traditional feature-based summarization systems, such as NER and POS tags, to improve performance. Experiments on the Gigaword and DUC 2004 test sets show that the above models achieve state-of-the-art results.
Gu et al. (2016) and Gulcehre et al. (2016) come up with the similar idea that the summarization task can benefit from copying words from input sentences. Gu et al. (2016) propose CopyNet to model the copying action in response generation, which also applies to the summarization task. Gulcehre et al. (2016) propose a switch gate to control whether to copy from the source or generate from the decoder vocabulary. Zeng et al. (2016) also propose using a copy mechanism and add a scalar weight on the gate of the GRU/LSTM for this task. Cheng and Lapata (2016) use an RNN-based encoder-decoder for extractive summarization of documents. Yu et al. (2016) propose a segment-to-segment neural transduction model for the sequence-to-sequence framework. The model introduces a latent segmentation which determines correspondences between tokens of the input sequence and the output sequence. Experiments on this task show that the proposed transduction model performs comparably to the ABS model. Minimum Risk Training (MRT) has been proposed for neural machine translation to directly optimize the evaluation metrics. Ayana et al. (2016) apply MRT to the abstractive sentence summarization task, and the results show that optimizing for ROUGE improves the test performance.

Problem Formulation
For sentence summarization, given an input sentence $x = (x_1, x_2, \ldots, x_n)$, where $n$ is the sentence length, $x_i \in V_s$ and $V_s$ is the source vocabulary, the system summarizes $x$ by producing $y = (y_1, y_2, \ldots, y_l)$, where $l \le n$ is the summary length, $y_i \in V_t$ and $V_t$ is the target vocabulary.
If $|y| \subseteq |x|$, which means all words in summary $y$ must appear in the given input, we denote this as extractive sentence summarization. If $|y| \nsubseteq |x|$, which means not all words in the summary come from the input sentence, we denote this as abstractive sentence summarization. Table 1 provides an example. We focus on the abstractive sentence summarization task in this paper.

Input: South Korean President Kim Young-Sam left here Wednesday on a week -long state visit to Russia and Uzbekistan for talks on North Korea 's nuclear confrontation and ways to strengthen bilateral ties .
Output: Kim leaves for Russia for talks on NKorea nuclear standoff
Table 1: An abstractive sentence summarization example.
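
The distinction between extractive and abstractive summaries can be checked mechanically. The following is a minimal sketch, assuming whitespace tokenization, lower-casing, and an interpretation of the subset condition as set inclusion over word types. The function names are illustrative and are not part of the original work.

```python
from typing import List


def is_extractive(source: List[str], summary: List[str]) -> bool:
    """Return True if every summary word also appears in the source sentence."""
    return set(summary) <= set(source)


def novel_words(source: List[str], summary: List[str]) -> List[str]:
    """List the summary words, in order, that do not occur in the source."""
    seen = set(source)
    return [word for word in summary if word not in seen]


# Lower-cased, whitespace-tokenized version of the Table 1 example.
source = ("south korean president kim young-sam left here wednesday on a "
          "week -long state visit to russia and uzbekistan for talks on "
          "north korea 's nuclear confrontation and ways to strengthen "
          "bilateral ties .").split()
summary = "kim leaves for russia for talks on nkorea nuclear standoff".split()

print(is_extractive(source, summary))  # False
print(novel_words(source, summary))    # ['leaves', 'nkorea', 'standoff']
```

Under these assumptions the Table 1 summary counts as abstractive, because three of its words never occur in the input.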

Model
As shown in Figure 2, our model consists of a sentence encoder using the Gated Recurrent Unit (GRU) (Cho et al., 2014), a selective gate network and an attention-equipped GRU decoder. First, the bidirectional GRU encoder reads the input words $x = (x_1, x_2, \ldots, x_n)$ and builds its representation $(h_1, h_2, \ldots, h_n)$. Then the selective gate selects and filters the word representations according to the sentence meaning representation to produce a tailored sentence word representation for the abstractive sentence summarization task. Lastly, the GRU decoder produces the output summary with attention to the tailored representation. In the following sections, we introduce the sentence encoder, the selective mechanism, and the summary decoder respectively.

Sentence Encoder
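The role of the sentence encoder is to read the input sentence and construct the basic sentence representation. Here we employ a bidirectional GRU (BiGRU) as the recurrent unit, where the GRU is defined as:
$$z_i = \sigma(W_z [x_i, h_{i-1}])$$
$$r_i = \sigma(W_r [x_i, h_{i-1}])$$
$$\tilde{h}_i = \tanh(W_h [x_i, r_i \odot h_{i-1}])$$
$$h_i = (1 - z_i) \odot h_{i-1} + z_i \odot \tilde{h}_i$$
where $W_z$, $W_r$ and $W_h$ are weight matrices. The BiGRU consists of a forward GRU and a backward GRU. The forward GRU reads the input sentence word embeddings from left to right and gets a sequence of hidden states, $(\overrightarrow{h}_1, \overrightarrow{h}_2, \ldots, \overrightarrow{h}_n)$. The backward GRU reads the input sentence embeddings in reverse, from right to left, and results in another sequence of hidden states, $(\overleftarrow{h}_1, \overleftarrow{h}_2, \ldots, \overleftarrow{h}_n)$:
$$\overrightarrow{h}_i = \mathrm{GRU}(x_i, \overrightarrow{h}_{i-1})$$
$$\overleftarrow{h}_i = \mathrm{GRU}(x_i, \overleftarrow{h}_{i+1})$$
The initial states of the BiGRU are set to zero vectors, i.e., $\overrightarrow{h}_1 = 0$ and $\overleftarrow{h}_n = 0$. After reading the sentence, the forward and backward hidden states are concatenated, i.e., $h_i = [\overrightarrow{h}_i; \overleftarrow{h}_i]$, to get the basic sentence representation.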
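
To make the encoder concrete, the following is a minimal pure-Python sketch of a bidirectional GRU. It assumes a standard bias-free GRU formulation with three weight matrices applied to the concatenation of the input and the previous state, and it assumes the zero vector is the state each direction holds before reading its first word. The toy dimensions, the random weights, and all function names are illustrative and are not taken from the original implementation.

```python
import math
import random
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]
GRUWeights = Tuple[Matrix, Matrix, Matrix]  # (W_z, W_r, W_h)


def matvec(W: Matrix, v: Vector) -> Vector:
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def sigmoid(v: Vector) -> Vector:
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def gru_step(x: Vector, h_prev: Vector, weights: GRUWeights) -> Vector:
    """One GRU update on the concatenation [x; h_prev], without bias terms."""
    W_z, W_r, W_h = weights
    z = sigmoid(matvec(W_z, x + h_prev))  # update gate
    r = sigmoid(matvec(W_r, x + h_prev))  # reset gate
    reset_h = [ri * hi for ri, hi in zip(r, h_prev)]
    h_tilde = [math.tanh(a) for a in matvec(W_h, x + reset_h)]
    return [(1 - zi) * hi + zi * ti for zi, hi, ti in zip(z, h_prev, h_tilde)]


def bigru(xs: List[Vector], fwd: GRUWeights, bwd: GRUWeights, hidden: int) -> List[Vector]:
    """Concatenate forward and backward GRU states for every input position."""
    forward, h = [], [0.0] * hidden
    for x in xs:  # left to right
        h = gru_step(x, h, fwd)
        forward.append(h)
    backward, h = [], [0.0] * hidden
    for x in reversed(xs):  # right to left
        h = gru_step(x, h, bwd)
        backward.append(h)
    backward.reverse()  # realign with input positions
    return [f + b for f, b in zip(forward, backward)]


def random_weights(embed: int, hidden: int, rng: random.Random) -> GRUWeights:
    def mat() -> Matrix:
        return [[rng.uniform(-0.1, 0.1) for _ in range(embed + hidden)]
                for _ in range(hidden)]
    return mat(), mat(), mat()


rng = random.Random(0)
embed, hidden, n = 4, 3, 5
xs = [[rng.uniform(-1, 1) for _ in range(embed)] for _ in range(n)]
states = bigru(xs, random_weights(embed, hidden, rng),
               random_weights(embed, hidden, rng), hidden)
print(len(states), len(states[0]))  # 5 6
```

Each of the five positions receives a vector of twice the hidden size, with the forward half summarizing the left context and the backward half the right context.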

Selective Mechanism
In the sequence-to-sequence machine translation (MT) model, the encoder and decoder are responsible for mapping input sentence information to a list of vectors and decoding the sentence representation vectors to generate an output sentence (Bahdanau et al., 2015). Some previous works apply this framework to summarization generation tasks (Gu et al., 2016; Gulcehre et al., 2016). However, abstractive sentence summarization is different from MT in two ways. First, there is no explicit alignment relationship between the input sentence and the output summary except for the common words. Second, the summarization task needs to keep the highlights and remove the unnecessary information, while MT needs to keep all information literally.
Herein, we propose a selective mechanism to model the selection process for abstractive sentence summarization. The selective mechanism extends the sequence-to-sequence model by constructing a tailored representation for the abstractive sentence summarization task. Concretely, the selective gate network in our model takes two vector inputs, the sentence word vector $h_i$ and the sentence representation vector $s$. The sentence word vector $h_i$ is the output of the BiGRU encoder and represents the meaning and context information of word $x_i$. The sentence vector $s$ is used to represent the meaning of the sentence. For each word $x_i$, the selective gate network generates a gate vector $\mathrm{sGate}_i$ using $h_i$ and $s$, then the tailored representation $h'_i$ is constructed. In detail, we concatenate the last forward hidden state $\overrightarrow{h}_n$ and backward hidden state $\overleftarrow{h}_1$ as the sentence representation $s$:
$$s = [\overleftarrow{h}_1; \overrightarrow{h}_n]$$
For each time step $i$, the selective gate takes the sentence representation $s$ and the BiGRU hidden state $h_i$ as inputs to compute the gate vector $\mathrm{sGate}_i$:
$$\mathrm{sGate}_i = \sigma(W_s h_i + U_s s + b)$$
$$h'_i = h_i \odot \mathrm{sGate}_i$$
where $W_s$ and $U_s$ are weight matrices, $b$ is the bias vector, $\sigma$ denotes the sigmoid activation function, and $\odot$ is element-wise multiplication. After the selective gate network, we obtain another sequence of vectors $(h'_1, h'_2, \ldots, h'_n)$. This new sequence is then used as the input sentence representation for the decoder to generate the summary.
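
The gating step can be sketched in a few lines of pure Python. This is an illustration under stated assumptions, not the original implementation: each encoder state is assumed to be laid out as the forward half followed by the backward half, the sentence vector is assumed to concatenate the backward state at position 1 with the forward state at position n in that order, and the weights in the toy example are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, v: Vector) -> Vector:
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def selective_gate(hs: List[Vector], d: int, W_s: Matrix, U_s: Matrix, b: Vector) -> List[Vector]:
    """Compute sGate_i = sigmoid(W_s h_i + U_s s + b) and return h_i * sGate_i.

    Each h_i is assumed to be [forward_i; backward_i], with halves of size d.
    """
    # Sentence vector: backward state at position 1, forward state at position n.
    s = hs[0][d:] + hs[-1][:d]
    sentence_term = matvec(U_s, s)  # identical for every position
    tailored = []
    for h in hs:
        pre = [a + c + bias for a, c, bias in zip(matvec(W_s, h), sentence_term, b)]
        gate = [1.0 / (1.0 + math.exp(-p)) for p in pre]
        tailored.append([hi * gi for hi, gi in zip(h, gate)])
    return tailored


# Toy example with d = 1, so every encoder state has two dimensions.
hs = [[0.5, -0.2], [0.1, 0.4], [-0.3, 0.9]]
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [[0.0, 0.0], [0.0, 0.0]]
tailored = selective_gate(hs, d=1, W_s=identity, U_s=zeros, b=[0.0, 0.0])
print(tailored[0])
```

Because every gate value lies strictly between 0 and 1, each tailored vector is an element-wise attenuated copy of the encoder state; the gate can only shrink a dimension, never amplify it or flip its sign.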

Summary Decoder
On top of the sentence encoder and the selective gate network, we use a GRU with attention as the decoder to produce the output summary.
At each decoding time step $t$, the GRU reads the previous word embedding $w_{t-1}$ and the previous context vector $c_{t-1}$ as inputs to compute the new hidden state $s_t$. To initialize the GRU hidden state, we use a linear layer with the last backward encoder hidden state $\overleftarrow{h}_1$ as input:
$$s_t = \mathrm{GRU}(w_{t-1}, c_{t-1}, s_{t-1})$$
$$s_0 = \tanh(W_d \overleftarrow{h}_1 + b)$$
where $W_d$ is the weight matrix and $b$ is the bias vector.
The context vector $c_t$ for the current time step $t$ is computed through the concatenate attention mechanism (Luong et al., 2015), which matches the current decoder state $s_t$ with each encoder hidden state $h'_i$ to get an importance score. The importance scores are then normalized to get the current context vector by weighted sum:
$$e_{t,i} = v_a^\top \tanh(W_a s_t + U_a h'_i)$$
$$\alpha_{t,i} = \frac{\exp(e_{t,i})}{\sum_{k=1}^{n} \exp(e_{t,k})}$$
$$c_t = \sum_{i=1}^{n} \alpha_{t,i} h'_i$$
We then combine the previous word embedding $w_{t-1}$, the current context vector $c_t$, and the decoder state $s_t$ to construct the readout state $r_t$. The readout state is then passed through a maxout hidden layer (Goodfellow et al., 2013) to predict the next word with a softmax layer over the decoder vocabulary:
$$r_t = W_r w_{t-1} + U_r c_t + V_r s_t$$
$$m_t = [\max\{r_{t,2j-1}, r_{t,2j}\}]^\top_{j=1,\ldots,d} \quad (16)$$
$$p(y_t \mid y_1, \ldots, y_{t-1}) = \mathrm{softmax}(W_o m_t)$$
where $W_a$, $U_a$, $W_r$, $U_r$, $V_r$ and $W_o$ are weight matrices and $v_a$ is a weight vector. The readout state $r_t$ is a $2d$-dimensional vector, and the maxout layer (Equation 16) picks the max value for every two numbers in $r_t$ and produces a $d$-dimensional vector $m_t$.
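
The attention step and the maxout readout can be illustrated with a small pure-Python sketch. It assumes an additive score of the form v_a · tanh(W_a s + U_a h), where the projection vector `v_a` is a hypothetical parameter introduced here to reduce the hidden vector to a scalar. All names and toy values are illustrative.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, v: Vector) -> Vector:
    return [sum(w * a for w, a in zip(row, v)) for row in W]


def softmax(scores: Vector) -> Vector:
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(state: Vector, encoder_states: List[Vector],
           W_a: Matrix, U_a: Matrix, v_a: Vector) -> Tuple[Vector, Vector]:
    """Score each encoder state against the decoder state, normalize, and sum."""
    projected_state = matvec(W_a, state)
    scores = []
    for h in encoder_states:
        hidden = [math.tanh(p + q) for p, q in zip(projected_state, matvec(U_a, h))]
        scores.append(sum(v * a for v, a in zip(v_a, hidden)))
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context


def maxout(r: Vector) -> Vector:
    """Keep the larger value of each consecutive pair (expects an even length)."""
    return [max(r[j], r[j + 1]) for j in range(0, len(r), 2)]


identity = [[1.0, 0.0], [0.0, 1.0]]
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([1.0, 0.0], encoder_states, identity, identity, [1.0, 1.0])
print(weights)
print(maxout([1.0, 3.0, 2.0, 0.0]))  # [3.0, 2.0]
```

In the toy example the third encoder state receives the largest weight, and the maxout call halves a four-dimensional readout into two dimensions.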

Objective Function
Our goal is to maximize the output summary probability given the input sentence. Therefore, we minimize the negative log-likelihood loss function:
$$J(\theta) = -\frac{1}{|D|} \sum_{(x,y) \in D} \log p(y \mid x)$$
where $D$ denotes a set of parallel sentence-summary pairs and $\theta$ is the model parameter. We use Stochastic Gradient Descent (SGD) with mini-batches to learn the model parameter $\theta$.
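
As a small worked illustration of this objective, the sketch below computes a negative log-likelihood from per-token probabilities. It assumes that the probability of a summary factorizes over its tokens and that the loss is averaged over the sentence-summary pairs in a batch; the averaging and the function name are assumptions made for illustration.

```python
import math
from typing import List


def negative_log_likelihood(batch_token_probs: List[List[float]]) -> float:
    """Average over pairs of -sum_t log p(y_t | y_<t, x).

    Each inner list holds the model probability of every reference token.
    """
    total = 0.0
    for token_probs in batch_token_probs:
        total -= sum(math.log(p) for p in token_probs)
    return total / len(batch_token_probs)


# Two toy summaries: a confident prediction and a less confident one.
loss = negative_log_likelihood([[0.9, 0.8], [0.5, 0.25, 0.5]])
print(round(loss, 4))
```

A model that assigns probability 1 to every reference token reaches the minimum loss of 0, so minimizing this quantity is equivalent to maximizing the summary probability.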

Experiments
In this section we introduce the dataset we use, the evaluation metric, the implementation details, the baselines we compare to, and the performance of our system.

Dataset
Training Set For our training set, we use a parallel corpus constructed from the Annotated English Gigaword dataset (Napoles et al., 2012), as described in Rush et al. (2015). The parallel corpus is produced by pairing the first sentence of each news article with its headline, subject to some heuristic rules. We use the script released by Rush et al. (2015) to pre-process and extract the training and development datasets. For testing, we use the test set of Rush et al. (2015) (see footnote 2). We also find that, besides the empty titles, this test set has some invalid lines, such as input sentences containing only one word. Therefore, we further sample 2000 pairs as our internal test set and release it for future work.

DUC 2004 Test Set
We employ DUC 2004 data for tasks 1 & 2 (Over et al., 2007) in our experiments as one of the test sets since it is too small to train a neural network model on. The dataset pairs each document with 4 different human-written reference summaries, which are capped at 75 bytes. It has 500 input sentences, each paired with 4 summaries.
Toutanova et al. (2016) release a new dataset for the sentence summarization task, created by crowdsourcing. This dataset contains approximately 6,000 source text sentences with multiple manually created summaries (about 26,000 sentence-summary pairs in total). Toutanova et al. (2016) provide a standard split of the data into training, development, and test sets, with 4,936, 448 and 785 input sentences respectively. Since the training set is too small, we only use the test set as one of our test sets. We denote this dataset as the MSR-ATC (Microsoft Research Abstractive Text Compression) test set in the following.
Table 2 summarizes the statistics of the three datasets we used.

Evaluation Metric
We employ ROUGE (Lin, 2004) as our evaluation metric. ROUGE measures the quality of a summary by computing overlapping lexical units, such as unigrams, bigrams, trigrams, and the longest common subsequence (LCS). It has become the standard evaluation metric for DUC shared tasks and is popular for summarization evaluation. Following previous work, we use ROUGE-1 (unigram), ROUGE-2 (bigram) and ROUGE-L (LCS) as the evaluation metrics in the reported experimental results.
Footnote 2: Thanks to Rush et al. (2015), we acquired the test set they used. Following Chopra et al. (2016), we remove pairs with empty titles, resulting in slightly different accuracy compared to Rush et al. (2015) for their systems. The cleaned test set contains 1951 sentence-summary pairs.
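
To show what the n-gram overlap measures, here is a simplified sketch of ROUGE-N against a single reference. It is not the official ROUGE-1.5.5 script: it assumes whitespace tokenization, applies no stemming or stopword handling, and does not support multiple references. The function names are illustrative. The example strings are the system output and the reference quoted in the Figure 3 caption.

```python
from collections import Counter
from typing import Dict, List


def ngram_counts(tokens: List[str], n: int) -> Counter:
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate: List[str], reference: List[str], n: int) -> Dict[str, float]:
    """Clipped n-gram overlap reported as recall, precision and F1."""
    cand, ref = ngram_counts(candidate, n), ngram_counts(reference, n)
    overlap = sum((cand & ref).values())  # the intersection clips repeated n-grams
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"recall": recall, "precision": precision, "f1": f1}


system = "council of europe slams french prison conditions".split()
reference = "council of europe again slams french prison conditions".split()
rouge1 = rouge_n(system, reference, 1)
rouge2 = rouge_n(system, reference, 2)
print(rouge1["recall"], rouge1["precision"])  # 0.875 1.0
print(round(rouge2["recall"], 4), round(rouge2["precision"], 4))  # 0.7143 0.8333
```

Dropping the single word "again" costs one unigram of recall but two bigrams, which is why ROUGE-2 is more sensitive to local word order than ROUGE-1.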

Implementation Details
Model Parameters The input and output vocabularies are collected from the training data, which have 119,504 and 68,883 word types respectively. We set the word embedding size to 300 and all GRU hidden state sizes to 512. We use dropout (Srivastava et al., 2014) with probability p = 0.5.
Model Training We initialize model parameters randomly using a Gaussian distribution with the Xavier scheme (Glorot and Bengio, 2010). We use Adam (Kingma and Ba, 2015) as our optimization algorithm. For the hyperparameters of the Adam optimizer, we set the learning rate $\alpha = 0.001$, the two momentum parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. During training, we test the model performance (ROUGE-2 F1) on the development set every 2,000 batches. We halve the Adam learning rate $\alpha$ if the ROUGE-2 F1 score drops for twelve consecutive tests on the development set. We also apply gradient clipping (Pascanu et al., 2013) with range $[-5, 5]$ during training. To speed up training and converge quickly, we use a mini-batch size of 64, chosen by grid search.
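
The learning-rate rule and the clipping step can be sketched as follows. This is one possible reading, stated as an assumption: a "drop" is taken to mean a development score lower than the score of the immediately preceding test, and clipping is applied element-wise to gradient values. The class and function names are hypothetical.

```python
from typing import List, Optional


class HalvingSchedule:
    """Halve the learning rate after `patience` consecutive drops of the dev score."""

    def __init__(self, lr: float = 0.001, patience: int = 12) -> None:
        self.lr = lr
        self.patience = patience
        self.previous: Optional[float] = None
        self.drops = 0

    def update(self, dev_score: float) -> float:
        """Record one development-set evaluation and return the current rate."""
        if self.previous is not None and dev_score < self.previous:
            self.drops += 1
        else:
            self.drops = 0
        self.previous = dev_score
        if self.drops >= self.patience:
            self.lr /= 2.0
            self.drops = 0
        return self.lr


def clip_gradients(grads: List[float], bound: float = 5.0) -> List[float]:
    """Clamp every gradient value into [-bound, bound]."""
    return [max(-bound, min(bound, g)) for g in grads]


schedule = HalvingSchedule()
scores = [24.0 - 0.1 * k for k in range(13)]  # one starting score, then 12 drops
rates = [schedule.update(score) for score in scores]
print(rates[-2], rates[-1])
print(clip_gradients([-7.0, 0.3, 9.0]))  # [-5.0, 0.3, 5.0]
```

With a development test every 2,000 batches, twelve consecutive drops correspond to 24,000 batches without improvement before the rate is halved under this reading.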
Beam Search We use beam search to generate multiple summary candidates to get better results. To avoid favoring shorter outputs, we average the ranking score along the beam path by dividing it by the number of generated words. To both decode fast and get better results, we set the beam size to 12 in our experiments.
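
The length normalization can be illustrated with a short sketch. It assumes the ranking score of a beam path is the sum of token log-probabilities, divided by the number of generated words; the function names and the toy values are illustrative.

```python
from typing import List


def length_normalized_score(token_log_probs: List[float]) -> float:
    """Average log-probability per generated word along a beam path."""
    return sum(token_log_probs) / len(token_log_probs)


def rank_candidates(candidates: List[List[float]]) -> List[int]:
    """Return candidate indices sorted from best to worst normalized score."""
    return sorted(range(len(candidates)),
                  key=lambda i: length_normalized_score(candidates[i]),
                  reverse=True)


short = [-1.0, -1.0]               # 2 words, total log-probability -2.0
longer = [-0.6, -0.6, -0.6, -0.6]  # 4 words, total log-probability -2.4
print(rank_candidates([short, longer]))  # [1, 0]
```

Ranked by the raw sum, the shorter candidate would win simply because it accumulates fewer negative terms; after normalization the longer candidate, whose words are individually more probable, ranks first.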

Baseline
We compare the SEASS model with the following state-of-the-art baselines:
ABS Rush et al. (2015) use an attentive CNN encoder and an NNLM decoder for the sentence summarization task. We trained this baseline model with the released code and evaluated it on our internal English Gigaword test set and the MSR-ATC test set.
ABS+ Based on the ABS model, Rush et al. (2015) further tune their model using the DUC 2003 dataset, which leads to improvements on the DUC 2004 test set.
… RNN sequence-to-sequence encoder-decoder model and add some features to enhance the encoder, such as POS tags, NER tags, and so on.
Luong-NMT The neural machine translation model of Luong et al. (2015), with two-layer LSTMs for the encoder-decoder and 500 hidden units in each layer, as implemented in Chopra et al. (2016).
s2s+att We also implement a sequence-to-sequence model with attention as our baseline and denote it as "s2s+att".

Results
We report ROUGE F1, ROUGE recall and ROUGE F1 for the English Gigaword, DUC 2004 and MSR-ATC test sets respectively. We use the official ROUGE script (version 1.5.5) to evaluate the summarization quality in our experiments. For the English Gigaword and MSR-ATC test sets, the outputs have different lengths, so we evaluate the system with the F1 metric. As for the DUC 2004 test set, the task requires the system to produce a fixed-length summary (75 bytes), therefore we employ ROUGE recall as the evaluation metric. To satisfy the length requirement, we decode the output summary to a roughly expected length following Rush et al. (2015).
English Gigaword We acquire the test set from Rush et al. (2015) so we can make fair comparisons to the baselines. In Table 3, we report the ROUGE F1 score of our model and the baseline methods. Our SEASS model with beam search outperforms all baseline models by a large margin. Even with greedy search, our model still performs better than other methods that use beam search. For the popular ROUGE-2 metric, our SEASS model achieves a 17.54 F1 score and performs better than the previous works. Compared to the ABS model, our model has a 6.22-point ROUGE-2 F1 gain. Compared to the highest CAs2s baseline, our model achieves a 1.57-point ROUGE-2 F1 improvement and passes the significance test according to the official ROUGE script. Table 4 summarizes our results on our internal test set using ROUGE F1 evaluation metrics. The performance on our internal test set is comparable to that on our development set, which achieves 24.58 ROUGE-2 F1 and outperforms the baselines.

DUC 2004
We evaluate our model using the ROUGE recall score since the reference summaries of the DUC 2004 test set are capped at 75 bytes. Therefore, we decode the summary to a fixed length of 18 to ensure that the generated summary satisfies the minimum length requirement. As summarized in Table 5, our SEASS outperforms all the baseline methods and achieves 29.21, 9.56 and 25.51 for ROUGE-1, ROUGE-2 and ROUGE-L recall. Compared to the ABS+ model, which is tuned using DUC 2003 data, our model performs significantly better by 1.07 ROUGE-2 recall, even though it is trained only with English Gigaword sentence-summary data and is not tuned using DUC data.
Figure 3: First derivative heat map of the output with respect to the selective gate. The important words are selected in the input sentence, such as "europe", "slammed" and "unacceptable". The output summary of our system is "council of europe slams french prison conditions" and the true summary is "council of europe again slams french prison conditions".

Discussion
In this section, we first compare the performance of SEASS with the s2s+att baseline model to illustrate that the proposed method succeeds in selecting information and building tailored representation for abstractive sentence summarization. We then analyze selective encoding by visualizing the heat map.

Effectiveness of Selective Encoding
We further test the SEASS model with different sentence lengths on the English Gigaword test sets, which are merged from the Rush et al. (2015) test set and our internal test set. The length of sentences in the test sets ranges from 10 to 80. We group the sentences with an interval of 4, which yields 18 groups, and plot the first 14. We find that the performance curve of our SEASS model is always above that of s2s+att by a certain margin. For the groups of 16, 20, 24, 32, 56 and 60, the SEASS model obtains big improvements over the s2s+att model. Overall, these improvements on all groups indicate that the selective encoding method benefits the abstractive sentence summarization task.

Saliency Heat Map of Selective Gate
Since the output of the selective gate network is a high-dimensional vector, it is hard to visualize all the gate values. We use an existing first-derivative saliency method to visualize the contribution of the selective gate to the final output. Given sentence words $x$ with associated output summary $y$, the trained model associates the pair $(x, y)$ with a score $S_y(x)$. The goal is to decide which gate $g$ associated with a specific word makes the most significant contribution to $S_y(x)$. Since the score $S_y(x)$ is a highly non-linear function in deep neural network models, we approximate $S_y(g)$ by its first-order Taylor expansion:
$$S_y(g) \approx w(g)^\top g + b$$
where $w(g)$ is the first derivative of $S_y$ with respect to the gate $g$:
$$w(g) = \left. \frac{\partial S_y}{\partial g} \right|_{g}$$
We then draw the Euclidean norm of the first derivative of the output $y$ with respect to the selective gate $g$ associated with each input word. Figure 3 shows an example of the first derivative heat map, in which most of the important words are selected by the selective gate, such as "europe", "slammed", "unacceptable", "conditions", and "france".
We can observe that the selective gate determines the importance of each word before the decoder, which relieves the decoder's burden by providing a tailored sentence encoding.
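
The saliency computation can be illustrated on a toy score function. The sketch below estimates the derivative of a score with respect to each gate vector by central finite differences and reports its Euclidean norm per word. It is only an illustration: a trained model would obtain these derivatives by backpropagation, and the linear `toy_score` function, its weights `u`, and the encoder states `hs` are hypothetical stand-ins for the real score.

```python
import math
from typing import Callable, List

Vector = List[float]


def gate_saliency(score: Callable[[List[Vector]], float],
                  gates: List[Vector], eps: float = 1e-5) -> List[float]:
    """Per word, the Euclidean norm of d(score)/d(gate) via central differences."""
    norms = []
    for gate in gates:
        squared = 0.0
        for k in range(len(gate)):
            original = gate[k]
            gate[k] = original + eps
            up = score(gates)
            gate[k] = original - eps
            down = score(gates)
            gate[k] = original  # restore the gate value
            squared += ((up - down) / (2 * eps)) ** 2
        norms.append(math.sqrt(squared))
    return norms


# Toy stand-in for the score: a weighted sum of gated encoder states.
hs = [[0.5, -1.0], [2.0, 1.0], [0.1, 0.0]]
u = [1.0, 2.0]


def toy_score(gates: List[Vector]) -> float:
    return sum(u[k] * h[k] * g[k] for h, g in zip(hs, gates) for k in range(len(u)))


gates = [[0.5, 0.5] for _ in hs]
saliency = gate_saliency(toy_score, gates)
print([round(value, 3) for value in saliency])
```

In this toy setting the second word has the largest gradient norm, so it would appear as the darkest cell in a heat map built the same way.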

Conclusion
This paper proposes a selective encoding model which extends the sequence-to-sequence model for the abstractive sentence summarization task. The selective mechanism mimics one of the behaviors of human summarizers: selecting important information before writing down the summary. With the proposed selective mechanism, we build an end-to-end neural network summarization model which consists of three phases: encoding, selection, and decoding. Experimental results show that the selective encoding model greatly improves performance over the state-of-the-art methods on the English Gigaword, DUC 2004 and MSR-ATC test sets.