Extractive Summarization with SWAP-NET: Sentences and Words from Alternating Pointer Networks

We present a new neural sequence-to-sequence model for extractive summarization called SWAP-NET (Sentences and Words from Alternating Pointer Networks). Extractive summaries, which comprise a salient subset of input sentences, often also contain important key words. Guided by this principle, we design SWAP-NET to model the interaction of key words and salient sentences using a new two-level pointer-network-based architecture. SWAP-NET identifies both salient sentences and key words in an input document, and then combines them to form the extractive summary. Experiments on large-scale benchmark corpora demonstrate the efficacy of SWAP-NET, which outperforms state-of-the-art extractive summarizers.


Introduction
Automatic summarization aims to shorten a text document while maintaining the salient information of the original text. The practical need for such systems is growing with the rapid and continuous increase in textual information sources in multiple domains.
Summarization tools can be broadly classified into two categories: extractive and abstractive. Extractive summarization selects parts of the input document to create its summary, while abstractive summarization generates summaries that may have words or phrases not present in the input document. Abstractive summarization is clearly harder, as methods have to address factual and grammatical errors that may be introduced, as well as problems in utilizing external knowledge sources to obtain paraphrasing or generalization. Extractive summarizers obviate the need to solve these problems by selecting the most salient textual units (usually sentences) from the input documents. As a result, they generate summaries that are grammatically and semantically more accurate than those from abstractive methods. While they may have problems like incorrect or unclear referring expressions or lack of coherence, they are computationally simpler and more efficient to generate. Indeed, state-of-the-art extractive summarizers are comparable to, and often better than, competitive abstractive summarizers in performance (see Nallapati et al. (2017) for a recent empirical comparison).
Classical approaches to extractive summarization have relied on human-engineered features from the text that are used to score sentences in the input document and select the highest-scoring sentences.
These include graph or constraint-optimization based approaches as well as classifier-based methods. A review of these approaches can be found in Nenkova et al. (2011). Some of these methods generate summaries from multiple documents. In this paper, we focus on single document summarization.
Modern approaches that show the best performance are based on end-to-end deep learning models that do not require human-crafted features. Neural models have tremendously improved performance in several difficult problems in NLP such as machine translation (Chen et al., 2017) and question-answering (Hao et al., 2017). Deep models with thousands of parameters require large, labeled datasets and for summarization this hurdle of labeled data was surmounted by Cheng and Lapata (2016), through the creation of a labeled dataset of news stories from CNN and Daily Mail consisting of around 280,000 documents and human-generated summaries.
Recurrent neural networks with an encoder-decoder architecture (Sutskever et al., 2014) have been successful in a variety of NLP tasks where an encoder obtains representations of input sequences and a decoder generates target sequences. Attention mechanisms (Bahdanau et al., 2015) are used to model the effects of different loci in the input sequence during decoding. Pointer networks (Vinyals et al., 2015) use this mechanism to obtain target sequences wherein each decoding step is used to point to elements of the input sequence. This pointing ability has been effectively utilized by state-of-the-art extractive and abstractive summarizers (Cheng and Lapata, 2016; See et al., 2017).
In this work, we design SWAP-NET, a new deep learning model for extractive summarization. Similar to previous models, we use an encoder-decoder architecture with an attention mechanism to select important sentences. Our key contribution is to design an architecture that utilizes key words in the selection process. Salient sentences of a document that are useful in summaries often contain key words and, to our knowledge, none of the previous models have explicitly modeled this interaction. We model this interaction through a two-level encoder and decoder, one for words and the other for sentences. An attention-based mechanism, similar to that of Pointer Networks, is used to learn important words and sentences from labeled data. A switch mechanism is used to select between words and sentences during decoding, and the final summary is generated using a combination of selected sentences and words. We demonstrate the efficacy of our model on the CNN/Daily Mail corpus, where it outperforms state-of-the-art extractive summarizers. Our experiments also suggest that the semantic redundancy in SWAP-NET generated summaries is comparable to that of human-generated summaries.

Problem Formulation
Let $D$ denote an input document, comprising a sequence of $N$ sentences: $s_1, \ldots, s_N$. Ignoring sentence boundaries, let $w_1, \ldots, w_n$ be the sequence of $n$ words in document $D$. An extractive summary aims to obtain a subset of the input sentences that forms a salient summary.
We use the interaction between words and sentences in a document to predict important words and sentences. Let the target sequence of indices of important words and sentences be $V = v_1, \ldots, v_m$, where each index $v_j$ can point to either a sentence or a word in an input document. We design a supervised sequence-to-sequence recurrent neural network model, SWAP-NET, that uses these target sequences (of sentences and words) to learn salient sentences and key words. Our objective is to find SWAP-NET model parameters $M$ that maximize the probability $p(V \mid M, D) = \prod_j p(v_j \mid v_1, \ldots, v_{j-1}, M, D) = \prod_j p(v_j \mid v_{<j}, M, D)$. We omit $M$ in the following to simplify notation. SWAP-NET predicts both key words and salient sentences, which are subsequently used for extractive summary generation.

Background
We briefly describe Pointer Networks (Vinyals et al., 2015). Our approach, detailed in the following sections, uses a similar attention mechanism.
Given a sequence of $n$ vectors $X = x_1, \ldots, x_n$ and a sequence of indices $R = r_1, \ldots, r_m$, each between 1 and $n$, the Pointer Network is an encoder-decoder architecture trained to maximize $p(R \mid X; \theta) = \prod_{j=1}^{m} p_\theta(r_j \mid r_1, \ldots, r_{j-1}, X; \theta)$, where $\theta$ denotes the model parameters. Let the encoder and decoder hidden states be $(e_1, \ldots, e_n)$ and $(d_1, \ldots, d_m)$ respectively. The attention vector at each output step $j$ is computed as follows:
$$u^j_i = v^\top \tanh(W_e e_i + W_d d_j), \quad i \in \{1, \ldots, n\}$$
$$a^j_i = \mathrm{softmax}(u^j_i) = \frac{\exp(u^j_i)}{\sum_{k=1}^{n} \exp(u^j_k)}, \quad i \in \{1, \ldots, n\}$$
The softmax normalizes vector $u^j$ to be an attention mask over inputs. In a pointer network, the same attention mechanism is used to select one of the $n$ input vectors with the highest probability, at each decoding step, thus effectively pointing to an input:
$$p(r_j \mid r_1, \ldots, r_{j-1}, X) = \mathrm{softmax}(u^j)$$
Here, $v$, $W_d$, and $W_e$ are learnable parameters of the model.
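The pointing step can be illustrated with a minimal sketch in plain Python. It scores every encoder state against the current decoder state with the additive attention of Vinyals et al. (2015), normalizes the scores with a softmax, and points to the highest-probability input. The function names, the toy vectors and the identity weight matrices are illustrative assumptions, not values from our model.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def pointer_attention(enc_states, dec_state, v, W_e, W_d):
    """Attention over inputs: softmax over i of v^T tanh(W_e e_i + W_d d_j)."""
    proj_d = matvec(W_d, dec_state)
    scores = []
    for e_i in enc_states:
        proj_e = matvec(W_e, e_i)
        hidden = [math.tanh(a + b) for a, b in zip(proj_e, proj_d)]
        scores.append(sum(vk * hk for vk, hk in zip(v, hidden)))
    return softmax(scores)

# Toy values (hypothetical): three 2-d encoder states and one decoder state.
enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
dec = [0.5, -0.5]
identity = [[1.0, 0.0], [0.0, 1.0]]
attn = pointer_attention(enc, dec, v=[1.0, 1.0], W_e=identity, W_d=identity)
pointed = max(range(len(attn)), key=attn.__getitem__)  # index the decoder points to
```

In a trained network the encoder and decoder states come from LSTMs and the parameters are learned; only the scoring and pointing logic is shown here.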

SWAP-NET
We use an encoder-decoder architecture with an attention mechanism similar to that of Pointer Networks. To model the interaction between words and sentences in a document we use two encoders and decoders, one at the word level and the other at the sentence level. The sentence-level decoder learns to point to important sentences while the word-level decoder learns to point to important words. A switch mechanism is trained to select either a word or a sentence at each decoding step. The final summary is created using the output words and sentences. We now describe the details of the architecture.

Encoder
We use two encoders: a bi-directional LSTM at the word level and an LSTM at the sentence level. Each word $w_i$ is represented by a $K$-dimensional embedding (e.g., via word2vec), denoted by $x_i$. The word embedding $x_i$ is encoded as $e_i$ using the bi-directional LSTM for $i = 1, \ldots, n$. The vector output of the BiLSTM at the end of a sentence is used to represent that entire sentence, which is further encoded by the sentence-level LSTM as $E_k = \mathrm{LSTM}(e_{k_l}, E_{k-1})$, where $k_l$ is the index of the last word in the $k$-th sentence in $D$ and $E_k$ is the hidden state at the $k$-th step of the LSTM, for $k = 1, \ldots, N$. See Figure 1.

Decoder
We use two decoders, a sentence-level and a word-level decoder, both LSTMs, which point to sentences and words respectively (similar to a pointer network). Thus, we can consider the output of each decoder step to be an index in the input sequence to the encoder. Let $m$ be the number of steps in each decoder. Let $T_1, \ldots, T_m$ be the sequence of indices generated by the sentence-level decoder, where each index $T_j \in \{1, \ldots, N\}$; and let $t_1, \ldots, t_m$ be the sequence of indices generated by the word-level decoder, where each index $t_j \in \{1, \ldots, n\}$.

Network Details
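At the $j$-th decoding step, we have to select a sentence or a word, which is done through a binary switch $Q_j$ that has two states, $Q_j = 0$ and $Q_j = 1$, to denote word and sentence selection respectively. So, we first determine the switch probability $p(Q_j \mid v_{<j}, D)$. Let $\alpha^s_{kj}$ denote the probability of selecting the $k$-th input sentence at the $j$-th decoding step of the sentence decoder:
$$\alpha^s_{kj} = p(T_j = k \mid v_{<j}, Q_j = 1, D),$$
and let $\alpha^w_{ij}$ denote the probability of selecting the $i$-th input word at the $j$-th decoding step of the word decoder:
$$\alpha^w_{ij} = p(t_j = i \mid v_{<j}, Q_j = 0, D),$$
both conditional on the corresponding switch selection. We set $v_j$ based on the probability values:
$$v_j = \begin{cases} \arg\max_k p^s_{kj} & \text{if } \max_k p^s_{kj} > \max_i p^w_{ij}, \\ \arg\max_i p^w_{ij} & \text{if } \max_i p^w_{ij} > \max_k p^s_{kj}, \end{cases}$$
$$p^s_{kj} = \alpha^s_{kj}\, p(Q_j = 1 \mid v_{<j}, D), \qquad p^w_{ij} = \alpha^w_{ij}\, p(Q_j = 0 \mid v_{<j}, D).$$
These probabilities are obtained through the attention weight vectors at the word and sentence levels and the switch probabilities:
$$\alpha^w_{ij} = \mathrm{softmax}\left(v_t^\top \phi(w_h h_j + w_t e_i)\right), \quad i \in \{1, \ldots, n\},$$
$$\alpha^s_{kj} = \mathrm{softmax}\left(V_T^\top \phi(W_H H_j + W_T E_k)\right), \quad k \in \{1, \ldots, N\}.$$
Here $v_t$, $w_h$, $w_t$, $V_T$, $W_H$ and $W_T$ are trainable parameters, and $h_j$ and $H_j$ are the hidden vectors at the $j$-th step of the word-level and sentence-level decoders respectively. The non-linear transformation $\phi$ is used to connect the word-level encodings to the sentence decoder and the sentence-level encodings to the word decoder. Specifically, the word-level decoder updates its state by considering a sum of sentence encodings, weighted by the attentions from the previous state, and mutatis mutandis for the sentence-level decoder.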
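The selection of $v_j$ at one decoding step can be sketched as follows. The sketch assumes that the quantities being compared are the attention probabilities multiplied by the corresponding switch probability, which is one reading of "obtained through the attention weight vectors and the switch probabilities". The function name, the tuple it returns and the tie-breaking rule (ties go to the word) are illustrative assumptions.

```python
def select_output(alpha_s, alpha_w, p_switch_sentence):
    """Pick v_j by comparing switch-weighted sentence and word probabilities.

    alpha_s: attention over the N sentences at this step (sums to 1)
    alpha_w: attention over the n words at this step (sums to 1)
    p_switch_sentence: p(Q_j = 1 | v_<j, D)
    """
    # Assumption: joint probability = attention * switch probability.
    p_s = [a * p_switch_sentence for a in alpha_s]
    p_w = [a * (1.0 - p_switch_sentence) for a in alpha_w]
    best_s = max(range(len(p_s)), key=p_s.__getitem__)
    best_w = max(range(len(p_w)), key=p_w.__getitem__)
    if p_s[best_s] > p_w[best_w]:
        return ("sentence", best_s, p_s[best_s])
    return ("word", best_w, p_w[best_w])  # ties fall through to the word

# Toy step: three sentences, four words, switch favouring sentences.
choice = select_output([0.1, 0.7, 0.2], [0.25, 0.25, 0.4, 0.1], 0.6)
```

With these toy numbers the best sentence has weighted probability 0.7 × 0.6 and the best word 0.4 × 0.4, so the step emits a sentence index.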
The switch probability $p(Q_j \mid v_{<j}, D)$ at the $j$-th decoding step is computed with a sigmoid unit, where $w_Q$ is a trainable parameter, $\sigma$ denotes the sigmoid function and $\phi$ is the chosen non-linear transformation (tanh). During training, the loss function $l_j$ at the $j$-th step is set to $l_j = -\log(p^s_{kj} q^s_j + p^w_{ij} q^w_j) - \log p(Q_j \mid v_{<j}, D)$. Note that at each decoding step, the switch is either $q^w_j = 1, q^s_j = 0$ if the $j$-th output is a word, or $q^w_j = 0, q^s_j = 1$ if the $j$-th output is a sentence. The switch probability is thus also considered in the loss function.
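The per-step loss can be written out directly from the formula above. In this minimal sketch the arguments are the model's probabilities for the gold sentence and gold word at step $j$ and the probability the switch assigns to the gold switch state; reading $p(Q_j \mid v_{<j}, D)$ in the loss as the probability of the gold state is an assumption, as are the function and argument names.

```python
import math

def step_loss(p_s_target, p_w_target, p_switch_target, is_sentence):
    """l_j = -log(p_s * q_s + p_w * q_w) - log p(Q_j | v_<j, D).

    p_s_target: switch-weighted probability of the gold sentence
    p_w_target: switch-weighted probability of the gold word
    p_switch_target: probability assigned to the gold switch state (assumed)
    is_sentence: True if the j-th target is a sentence, False if it is a word
    """
    q_s, q_w = (1.0, 0.0) if is_sentence else (0.0, 1.0)
    selected = p_s_target * q_s + p_w_target * q_w  # only one term is non-zero
    return -math.log(selected) - math.log(p_switch_target)

# Toy step whose target is a sentence.
loss = step_loss(p_s_target=0.42, p_w_target=0.16, p_switch_target=0.6, is_sentence=True)
```

Because exactly one of $q^s_j$ and $q^w_j$ is 1, the first term reduces to the negative log-probability of whichever unit the target sequence contains at that step.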

Summary Generation
Given a document whose summary is to be generated, its sentences and words are given as input to the trained encoder. At the $j$-th decoding step, either a sentence or a word is chosen based on the probability values $\alpha^s_{kj}$ and $\alpha^w_{ij}$ and the switch probability $p(Q_j \mid v_{<j}, D)$. We assign importance scores to the selected sentences based on their probability values during decoding as well as the probabilities of the selected words that are present in the selected sentences. Thus sentences with words selected by the decoder are given higher importance. Let the $k$-th input sentence $s_k$ be selected at the $j$-th decoding step and the $i$-th input word $w_i$ be selected at the $l$-th decoding step. Then the importance of $s_k$ is defined as
$$I(s_k) = \alpha^s_{kj} + \lambda \sum_{w_i \in s_k} \alpha^w_{il}. \tag{3}$$
In our experiments we choose $\lambda = 1$. The final summary consists of the three sentences with the highest importance scores.
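A minimal sketch of this scoring step follows, assuming the importance of a selected sentence is its own selection probability plus $\lambda$ times the probabilities of the selected words it contains. The dictionary-based inputs, the function name and the choice to return the chosen sentences in document order are illustrative assumptions.

```python
def rank_sentences(sent_probs, word_probs, word_to_sentence, lam=1.0, top_k=3):
    """Score selected sentences and return the top_k indices in document order.

    sent_probs: {sentence index k: alpha^s at the step that selected s_k}
    word_probs: {word index i: alpha^w at the step that selected w_i}
    word_to_sentence: {word index i: index of the sentence containing w_i}
    """
    scores = dict(sent_probs)
    for i, alpha_w in word_probs.items():
        k = word_to_sentence[i]
        if k in scores:  # only sentences the decoder selected receive word credit
            scores[k] += lam * alpha_w
    best = sorted(scores, key=lambda k: scores[k], reverse=True)[:top_k]
    return sorted(best), scores

# Toy decoding result: four selected sentences and three selected words.
sent_probs = {0: 0.5, 2: 0.4, 4: 0.45, 5: 0.3}
word_probs = {3: 0.2, 17: 0.3, 30: 0.1}
word_to_sentence = {3: 0, 17: 2, 30: 7}  # word 30 lies in an unselected sentence
summary, scores = rank_sentences(sent_probs, word_probs, word_to_sentence)
```

In the toy example, sentence 2 rises from 0.4 to 0.7 because it contains a selected word, which is the intended effect of the word term.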

Related Work
Traditional approaches to extractive summarization rely on human-engineered features based on, for example, part of speech (Erkan and Radev, 2004) and term frequency (Nenkova et al., 2006). Sentences in the input document are scored using these features, ranked and then selected for the final summary. Methods used for extractive summarization include graph-based approaches (Mihalcea, 2005) and Integer Linear Programming (Gillick and Favre, 2009). There are many classifier-based approaches that select sentences for the extractive summary using methods such as Conditional Random Fields (Shen et al., 2007) and Hidden Markov Models (Conroy and O'Leary, 2001). A review of these classical approaches can be found in Nenkova et al. (2011). End-to-end deep learning based neural models that can effectively learn from text data, without human-crafted features, have witnessed rapid development, resulting in improved performance in multiple areas such as machine translation (Chen et al., 2017) and question-answering (Hao et al., 2017), to name a few. Large labelled corpora based on news stories from CNN and Daily Mail, with human-generated summaries, have become available (Cheng and Lapata, 2016) and have spurred the use of deep learning models in summarization. Recurrent neural network based architectures have been designed for both extractive (Cheng and Lapata, 2016; Nallapati et al., 2017) and abstractive (See et al., 2017; Tan et al., 2017) summarization problems. Among these, the works of Cheng and Lapata (2016) and Nallapati et al. (2017) are closest to our work on extractive single-document summarization.
An encoder-decoder architecture with an attention mechanism similar to that of a pointer network is used by Cheng and Lapata (2016). Their hierarchical encoder uses a CNN at the word level leading to sentence representations that are used in an RNN to obtain document representations. They use a hierarchical attention model where the first level decoder predicts salient sentences used for an extractive summary and based on this output, the second step predicts keywords which are used for abstractive summarization. Thus they do not use key words for extractive summarization and for abstractive summarization they generate key words based on sentences predicted independently of key words. SWAP-NET, in contrast, is simpler using only two-level RNNs for word and sentence level representations in both the encoder and decoder. In our model we predict both words and sentences in such a way that their attentions interact with each other and generate extractive summaries considering both the attentions. By modeling the interaction between these key words and important sentences in our decoder architecture, we are able to extract sentences that are closer to the gold summaries.
SummaRuNNer, the method developed by Nallapati et al. (2017), is similar to our method only in its aim of extractive summary generation, not in its architecture. It does not use an encoder-decoder architecture; instead, it is an RNN-based binary classifier that decides whether or not to include a sentence in the summary. The RNN is multi-layered, representing inputs, words, sentences and the final sentence labels. The decision of selecting a sentence at each step of the RNN is based on the content of the sentence, salience in the document, novelty with respect to previously selected sentences and other positional features. Their approach is considerably simpler than that of Cheng and Lapata (2016) but obtains summaries closer to the gold summaries and, additionally, facilitates interpretable visualization and training from abstractive summaries. Their experiments show improved performance over both abstractive and extractive summarizers from several previous models (Nallapati et al., 2017).
We note that several elements of our architecture have been introduced and used in earlier work. Pointer networks (Vinyals et al., 2015) used the attention mechanism of Bahdanau et al. (2015) to solve combinatorial optimization problems. They have also been used to point to sentences in extractive (Cheng and Lapata, 2016) and abstractive (See et al., 2017) summarizers. The switch mechanism was introduced to incorporate rare or out-of-vocabulary words and is used in several summarizers. However, we use it to select between word-level and sentence-level decoders in our model.
The importance of all three interactions for summarization, (i) sentence-sentence, (ii) word-word and (iii) sentence-word, has been studied by Wan et al. (2007) using graph-based approaches. In particular, they show that methods that account for saliency using both of the following considerations perform better than methods that consider either one alone; SWAP-NET is based on the same principles:
• A sentence should be salient if it is heavily linked with other salient sentences, and a word should be salient if it is heavily linked with other salient words.
• A sentence should be salient if it contains many salient words, and a word should be salient if it appears in many salient sentences.
Data and Experiments

Experimental Settings
In our experiments the maximum number of words per document is limited to 800, and the maximum number of sentences per document to 50 (padding is used to maintain the length of word sequences). We also use the symbols <GO> and <EOS> to indicate start and end of prediction by decoders. The total vocabulary size is 150,000 words. We use word embeddings of dimension 100 pretrained using word2vec (Mikolov et al., 2013) on the training dataset. We fix the LSTM hidden state size at 200. We use a batch size of 16 and the ADAM optimizer (Kingma and Ba, 2015) with parameters: learning rate = 0.001, β 1 = 0.9, β 2 = 0.999 to train SWAP-NET. We employ gradient clipping to regularize our model and an early stopping criterion based on the validation loss.
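The length limits can be illustrated with a short preprocessing sketch. It keeps at most 50 sentences and 800 words per document and pads the word sequence to a fixed length. The padding symbol, the function name and the rule that truncates the last kept sentence when the word budget runs out are assumptions made for illustration.

```python
MAX_WORDS, MAX_SENTS = 800, 50
PAD = "<PAD>"  # hypothetical padding symbol; the text does not name one

def prepare_document(sentences):
    """Truncate to 50 sentences / 800 words and pad the word sequence to 800.

    sentences: list of token lists, one per sentence.
    Returns the kept sentences and the padded flat word sequence.
    """
    kept, words = [], []
    for sent in sentences[:MAX_SENTS]:
        room = MAX_WORDS - len(words)
        if room <= 0:
            break
        tokens = sent[:room]  # assumed: cut the sentence that crosses the limit
        kept.append(tokens)
        words.extend(tokens)
    padded = words + [PAD] * (MAX_WORDS - len(words))
    return kept, padded

# A long toy document: 60 sentences of 30 tokens each.
long_kept, long_padded = prepare_document([["w"] * 30 for _ in range(60)])
short_kept, short_padded = prepare_document([["a", "b"], ["c"]])
```

For the long toy document the word limit is reached first, so fewer than 50 sentences survive; for the short one the sequence is padded up to 800 positions.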
During training we find that SWAP-NET learns to predict important sentences faster than to predict words. To speed up learning of word probabilities, we add the term − log α w ij to our loss function l j in the final iterations of training. It is possible to get the same sentence or word in multiple (usually consecutive) decoding steps. In that case, in Eq. 3 we consider the maximum value of alpha obtained across these steps and calculate maximum scores of distinct sentences and words.
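Collapsing repeated selections can be sketched as below: when the same sentence or word is emitted at several decoding steps, only its largest probability is kept before importance scores are computed. The representation of a decoding step as a `(kind, index, alpha)` tuple and the function name are illustrative assumptions.

```python
def max_alpha_per_item(decoded_steps):
    """Collapse repeated selections, keeping the largest alpha for each item.

    decoded_steps: list of (kind, index, alpha), kind in {"sentence", "word"}.
    Returns {(kind, index): max alpha over the steps that selected it}.
    """
    best = {}
    for kind, index, alpha in decoded_steps:
        key = (kind, index)
        if alpha > best.get(key, float("-inf")):
            best[key] = alpha
    return best

# Toy output with a sentence and a word each selected twice in a row.
steps = [("sentence", 2, 0.6), ("sentence", 2, 0.8), ("word", 14, 0.3), ("word", 14, 0.1)]
collapsed = max_alpha_per_item(steps)
```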
We select the 3 top-scoring sentences for the summary, as there are 3.11 sentences on average in the gold summaries of the training set (similar to settings used by others, e.g., Narayan et al. (2017)).

Baselines
Two state-of-the-art methods for extractive summarization are SummaRuNNer (Nallapati et al., 2017) and NN, the neural summarizer of Cheng and Lapata (2016). SummaRuNNer can also provide extractive summaries while being trained abstractively (Nallapati et al., 2017); we denote this method by SummaRuNNer-abs. In addition, we compare our method with the Lead-3 summary, which consists of the first three sentences from each document. We also compare our method with an abstractive summarizer that uses a similar attention-based encoder-decoder architecture, denoted by ABS.

Benchmark Datasets
For our experiments, we use the CNN/DailyMail corpus (Hermann et al., 2015). We use the anonymized version of this dataset, from Cheng and Lapata (2016), which has labels for important sentences, that are used for training. To obtain labels for words, we extract keywords from each gold summary using RAKE, an unsupervised keyword extraction method (Rose et al., 2010). These keywords are used to label words in the corresponding input document during training. We replace numerical values in the documents by zeros to limit the vocabulary size.
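The two preprocessing steps described here, labeling document words with summary key words and replacing numerical values by zeros, can be sketched as follows. The keyword list is assumed to be already extracted (for example by RAKE) and split into single words; the function names, the regular expression and the case-insensitive matching are illustrative assumptions rather than the exact procedure used.

```python
import re

def zero_numbers(text):
    """Replace each numerical value with a single 0 (assumed granularity)."""
    return re.sub(r"\d+(?:[.,]\d+)*", "0", text)

def label_words(doc_tokens, keywords):
    """Return 1 for document tokens that match a summary key word, else 0."""
    keyset = {k.lower() for k in keywords}
    return [1 if tok.lower() in keyset else 0 for tok in doc_tokens]

cleaned = zero_numbers("wounded in 1969 and paid 1,250.50 dollars")
labels = label_words(["the", "License", "plate", "request"], ["license", "plate"])
```

The resulting 0/1 labels mark which input positions the word-level decoder is trained to point to.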
We have 193,986 training documents, 12,147 validation documents and 10,346 test documents from the Daily Mail corpus, and 83,568 training documents, 1,220 validation documents and 1,093 test documents from the CNN subset, with labels for sentences and words.

Evaluation Metrics
We use the ROUGE toolkit (Lin and Hovy, 2003) for evaluation of the generated summaries in comparison to the gold summaries. We use three variants of this metric: ROUGE-1 (R1), ROUGE-2 (R2) and ROUGE-L (RL), which are computed by matching unigrams, bigrams and longest common subsequences respectively between the two summaries. To compare with Cheng and Lapata (2016) and Nallapati et al. (2017), we use limited-length ROUGE recall at 75 and 275 bytes for the Daily Mail test set, and the full-length ROUGE-F1 score, as reported by them.
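To make the three variants concrete, the following simplified sketch computes n-gram overlap (for R1 and R2) and the longest common subsequence length (the basis of RL) on tokenized summaries. It is not the ROUGE toolkit: stemming, byte-length limits and multi-reference handling are omitted, and all function names are illustrative.

```python
from collections import Counter

def ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def rouge_n(candidate, reference, n):
    """Return (recall, precision, f1) from clipped n-gram overlap."""
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((cand & ref).values())
    recall = overlap / max(sum(ref.values()), 1)
    precision = overlap / max(sum(cand.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return recall, precision, f1

def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, start=1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]

reference = "the plate request was denied".split()
candidate = "the request was denied today".split()
r1 = rouge_n(candidate, reference, 1)
r2 = rouge_n(candidate, reference, 2)
lcs = lcs_length(candidate, reference)
```

ROUGE-L recall and precision follow by dividing the LCS length by the reference and candidate lengths respectively.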

Results on Benchmark Datasets
Performance on Daily Mail Data
SWAP-NET outperforms the previous best reported F-score by SummaRuNNer, as seen in Table 3, with a consistent improvement of over 2 ROUGE points in all three metrics.

Discussion
SWAP-NET outperforms state-of-the-art extractive summarizers SummaRuNNer (Nallapati et al., 2017) and NN (Cheng and Lapata, 2016) on benchmark datasets. Our model is similar to, although simpler than, that of NN, and the main difference between SWAP-NET and these baselines is its explicit modeling of the interaction between key words and salient sentences. Automatic keyword extraction has been studied extensively (Hasan and Ng, 2014). We use a popular and well-tested method, RAKE (Rose et al., 2010), to obtain key words in the training documents. A disadvantage of such methods is that they do not guarantee representation, via extracted keywords, of all the topics in the text (Hasan and Ng, 2014). So, if RAKE key words are directly applied to the input test document (without using a word decoder trained on RAKE words obtained from the gold summary, as done in SWAP-NET), then there is a possibility of missing sentences from the missed topics. So, we train SWAP-NET to predict key words and also model their interactions with sentences.

Statistics | Lead-3 | SWAP-NET
KW coverage | 61.6% | 73.8%
Sentences with KW | 92.2% | 98%
Table 4: Key word (KW) statistics.

Title: @entity19 vet surprised reason license plate denial
Gold Summary: @entity9 of @entity10 , @entity1 , wanted to get ' @entity11 -0 ' put on a license plate . that would have commemorated both @entity9 getting the @entity8 in 0 and his @entity16 . the @entity1 @entity21 denied his request , citing state regulations prohibiting the use of the number 0 because of its indecent connotations .
SWAP-NET Summary: @entity9 of @entity10 wanted to get ' @entity11 ' put on a license plate , the @entity14 newspaper of @entity10 reported . that would have commemorated both @entity9 getting the @entity8 in 0 and his @entity16 , according to the newspaper . the @entity1 @entity21 denied his request , citing state regulations prohibiting the use of the number 0 because of its indecent connotations @entity9 had been an armored personnel carrier 's gunner during his time in the @entity29 .
SWAP-NET Key words: @entity1, @entity9, @entity8, citing, number, year, indecent, personalized, war, surprised, plate, @entity14, @entity11, @entity10, regulations, reported, wanted, connotations, license, request, according, @entity21, armored, @entity16
Lead-3 Summary: a @entity19 war veteran in @entity1 has said he 's surprised over the reason for the denial of his request for a personalized license plate commemorating the year he was wounded and awarded a @entity8 . @entity9 of @entity10 wanted to get ' @entity11 ' put on a license plate , the @entity14 newspaper of @entity10 reported . that would have commemorated both @entity9 getting the @entity8 in 0 and his @entity16 , according to the newspaper .
Table 5: Sample gold summary and summaries generated by SWAP-NET and Lead-3.

We investigate the importance of modeling this interaction and the role of key words in the final summary. Table 4 shows statistics that reflect the importance of key words in extractive summaries. Key word coverage measures the proportion of key words from those in the gold summary present in the generated summary. SWAP-NET obtains nearly 74% of the key words. In comparison, Lead-3 has only about 62% of the key words from the gold summary.
Sentences with key words measures the proportion of sentences containing at least one key word. It is not surprising that in SWAP-NET summaries 98% of the sentences, on average, contain at least one key word: this is by design of SWAP-NET. However, note that Lead-3, which has poorer performance on all the benchmark datasets, has far fewer sentences with key words. This highlights the importance of key words in finding salient sentences for extractive summaries.
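Both statistics are simple set computations, sketched below for a single document. The sketch assumes that both measures are taken with respect to the key words of the gold summary and that summaries are given as lists of tokenized sentences; the function names and toy data are illustrative.

```python
def keyword_coverage(gold_keywords, summary_sentences):
    """Fraction of gold-summary key words that occur in the generated summary."""
    summary_words = {w for sent in summary_sentences for w in sent}
    gold = set(gold_keywords)
    return len(gold & summary_words) / len(gold) if gold else 0.0

def sentences_with_keyword(gold_keywords, summary_sentences):
    """Fraction of summary sentences containing at least one key word."""
    gold = set(gold_keywords)
    hits = sum(1 for sent in summary_sentences if gold & set(sent))
    return hits / len(summary_sentences) if summary_sentences else 0.0

gold_kw = ["license", "plate", "denied", "regulations"]
generated = [
    ["he", "wanted", "a", "license", "plate"],
    ["the", "request", "was", "denied"],
    ["he", "served", "abroad"],
]
coverage = keyword_coverage(gold_kw, generated)
with_kw = sentences_with_keyword(gold_kw, generated)
```

Averaging these per-document values over a test set gives corpus-level figures of the kind reported in Table 4.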
Gold summary | Lead-3 | SWAP-NET
0.81 | 0.553 | 0.8
Table 6: Average pairwise cosine distance between paragraph vector representations of sentences in summaries.
We also find that SWAP-NET obtains summaries that have less semantic redundancy. Table 6 shows the average distance between pairs of sentences from the gold summary, and from summaries generated by SWAP-NET and Lead-3. Distances are measured using the cosine distance between paragraph vectors of each sentence (Le and Mikolov, 2014) for 500 randomly selected documents of the Daily Mail test set. Paragraph vectors have been found to be effective semantic representations of sentences (Le and Mikolov, 2014), and experiments in Dai et al. (2015) also show that paragraph vectors can be effectively used to measure semantic similarity using cosine distance. For training we use GENSIM (Řehůřek and Sojka, 2010) with embedding size 200 and initial learning rate 0.025. The model is trained on 500 documents from the Daily Mail dataset for 10 epochs, and the learning rate is decreased by 0.002 at each epoch.
The average pairwise distance of SWAP-NET summaries is very close to that of the gold summaries, both nearly 0.8. In contrast, the average pairwise distance in Lead-3 summaries is 0.553, indicating higher redundancy. This highly desirable feature of SWAP-NET is likely due to the use of key words, which affects the choice of sentences in the final summary. Table 5 shows a sample gold summary from the Daily Mail dataset and the generated summary from SWAP-NET and, for comparison, from Lead-3. We observe the presence of key words in all the overlapping segments of text with the gold summary, indicating the importance of key words in finding salient sentences. Modeling this interaction, we believe, is the reason for the superior performance of SWAP-NET in our experiments.
An implementation of SWAP-NET and all the generated summaries from the test sets are available online in a GitHub repository.
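The redundancy measure itself is the mean cosine distance over all sentence pairs in a summary, sketched below. The sentence vectors are assumed to be given (in the experiments they would be paragraph vectors from a trained model); the function names and the toy 2-d vectors are illustrative, and the sketch assumes non-zero vectors and at least two sentences per summary.

```python
import math
from itertools import combinations

def cosine_distance(u, v):
    """1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def mean_pairwise_distance(sentence_vectors):
    """Average cosine distance over all unordered pairs of sentence vectors."""
    pairs = list(combinations(sentence_vectors, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)

# Toy summary with three sentence vectors (hypothetical values).
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
redundancy_score = mean_pairwise_distance(toy_vectors)
```

A larger mean distance indicates that the sentences of a summary are semantically further apart, that is, less redundant.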

Conclusion
We present SWAP-NET, a neural sequence-to-sequence model for extractive summarization that outperforms state-of-the-art extractive summarizers SummaRuNNer (Nallapati et al., 2017) and NN (Cheng and Lapata, 2016) on large-scale benchmark datasets. The architecture of SWAP-NET is simpler than that of NN but, due to its effective modeling of the interaction between salient sentences and key words in a document, SWAP-NET achieves superior performance. SWAP-NET models this interaction using a new two-level pointer-network-based architecture with a switching mechanism. Our experiments also suggest that modeling sentence-keyword interaction has the desirable property of less semantic redundancy in summaries generated by SWAP-NET.