Neural Document Summarization by Jointly Learning to Score and Select Sentences

Sentence scoring and sentence selection are two main steps in extractive document summarization systems. However, previous works treat them as two separated subtasks. In this paper, we present a novel end-to-end neural network framework for extractive document summarization by jointly learning to score and select sentences. It first reads the document sentences with a hierarchical encoder to obtain the representation of sentences. Then it builds the output summary by extracting sentences one by one. Different from previous methods, our approach integrates the selection strategy into the scoring model, which directly predicts the relative importance given previously selected sentences. Experiments on the CNN/Daily Mail dataset show that the proposed framework significantly outperforms the state-of-the-art extractive summarization models.


Introduction
Traditional approaches to automatic text summarization focus on identifying important content, usually at sentence level (Nenkova and McKeown, 2011). With the identified important sentences, a summarization system can extract them to form an output summary. In recent years, extractive methods for summarization have proven effective in many systems (Carbonell and Goldstein, 1998; Mihalcea and Tarau, 2004; McDonald, 2007; Cao et al., 2015a). In previous works that use extractive methods, text summarization is decomposed into two subtasks, i.e., sentence scoring and sentence selection.
* Contribution during internship at Microsoft Research.
Sentence scoring aims to assign an importance score to each sentence, and has been broadly studied in many previous works. Feature-based methods are popular and have proven effective, such as word probability, TF*IDF weights, sentence position and sentence length features (Luhn, 1958; Hovy and Lin, 1998; Ren et al., 2017). Graph-based methods such as TextRank (Mihalcea and Tarau, 2004) and LexRank (Erkan and Radev, 2004) measure sentence importance using weighted graphs. In recent years, neural networks have also been applied to sentence modeling and scoring (Cao et al., 2015a; Ren et al., 2017).
For the second step, sentence selection adopts a particular strategy to choose content sentence by sentence. Maximal Marginal Relevance (Carbonell and Goldstein, 1998) based methods select the sentence that has the maximal score and is minimally redundant with sentences already included in the summary. Integer Linear Programming based methods (McDonald, 2007) treat sentence selection as an optimization problem under some constraints such as summary length. Submodular functions (Lin and Bilmes, 2011) have also been applied to solving the optimization problem of finding the optimal subset of sentences in a document. Ren et al. (2016) train two neural networks with handcrafted features. One is used to rank sentences, and the other one is used to model redundancy during sentence selection.
In this paper, we present a neural extractive document summarization (NEUSUM) framework which jointly learns to score and select sentences. Different from previous methods that treat sentence scoring and sentence selection as two tasks, our method integrates the two steps into one end-to-end trainable model. Specifically, NEUSUM is a neural network model without any handcrafted features that learns to identify the relative importance of sentences. The relative importance is measured as the gain over previously selected sentences. Therefore, each time the proposed model selects one sentence, it scores the sentences considering both sentence saliency and previously selected sentences. Through the joint learning process, the model learns to predict the relative gain given the sentence extraction state and the partial output summary.
The proposed model consists of two parts, i.e., the document encoder and the sentence extractor. The document encoder has a hierarchical architecture, which suits the compositionality of documents. The sentence extractor is built with a recurrent neural network (RNN), which serves two main functions. On one hand, the RNN is used to remember the partial output summary by feeding the selected sentence into it. On the other hand, it is used to provide a sentence extraction state that can be used to score sentences with their representations. At each step during extraction, the sentence extractor reads the representation of the last extracted sentence. It then produces a new sentence extraction state and uses it to score the relative importance of the remaining sentences.
We conduct experiments on the CNN/Daily Mail dataset. The experimental results demonstrate that NEUSUM, by jointly scoring and selecting sentences, achieves significant improvements over separated methods. Our contributions are as follows:
• We propose a joint sentence scoring and selection model for extractive document summarization.
• The proposed model can be trained end-to-end without handcrafted features.
• The proposed model significantly outperforms state-of-the-art methods and achieves the best result on the CNN/Daily Mail dataset.

Related Work
Extractive document summarization has been extensively studied for years. As an effective approach, extractive methods are popular and dominate the summarization research. Traditional extractive summarization systems use two key techniques to form the summary, sentence scoring and sentence selection. Sentence scoring is critical since it is used to measure the saliency of a sentence. Sentence selection is based on the scores of sentences to determine which sentence should be extracted, which is usually done heuristically. Many techniques have been proposed to model and score sentences. Unsupervised methods do not require model training or data annotation. In these methods, many surface features are useful, such as term frequency (Luhn, 1958), TF*IDF weights (Erkan and Radev, 2004), sentence length (Cao et al., 2015a) and sentence positions (Ren et al., 2017). These features can be used alone or combined with weights.
Graph-based methods (Erkan and Radev, 2004; Mihalcea and Tarau, 2004; Wan and Yang, 2006) are also applied broadly to ranking sentences. In these methods, the input document is represented as a connected graph. The vertices represent the sentences, and the edges between vertices have attached weights that show the similarity of the two sentences. The score of a sentence is the importance of its corresponding vertex, which can be computed using graph algorithms.
Machine learning techniques are also widely used for better sentence modeling and importance estimation. Kupiec et al. (1995) use a Naive Bayes classifier to learn feature combinations. Conroy and O'Leary (2001) further use a Hidden Markov Model in document summarization. Gillick and Favre (2009) find that using bigram features consistently yields better performance than unigrams or trigrams for ROUGE (Lin, 2004) measures. Carbonell and Goldstein (1998) proposed the Maximal Marginal Relevance (MMR) method as a heuristic in sentence selection. Systems using MMR select the sentence which has the maximal score and is minimally redundant with previously selected sentences. McDonald (2007) treats sentence selection as an optimization problem under some constraints such as summary length, and uses Integer Linear Programming (ILP) to solve this optimization problem. Sentence selection can also be seen as finding the optimal subset of sentences in a document. Lin and Bilmes (2011) propose using submodular functions to find the subset.
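To make the MMR selection heuristic concrete, the following is a minimal sketch of greedy selection that trades a sentence's score against its redundancy with sentences already chosen. It is an illustration under assumptions, not the formulation of Carbonell and Goldstein (1998): the function name `mmr_select`, the `trade_off` weight and the precomputed `similarity` matrix are hypothetical.

```python
def mmr_select(scores, similarity, limit, trade_off=0.7):
    """Greedy MMR-style selection (illustrative sketch).

    scores:     saliency score of each sentence
    similarity: square matrix, similarity[i][j] between sentences i and j
    limit:      maximum number of sentences to select
    trade_off:  hypothetical weight between saliency and redundancy
    """
    selected = []
    candidates = set(range(len(scores)))
    while candidates and len(selected) < limit:
        def mmr(i):
            # Redundancy is the highest similarity to any sentence chosen so far.
            redundancy = max((similarity[i][j] for j in selected), default=0.0)
            return trade_off * scores[i] - (1.0 - trade_off) * redundancy
        best = max(sorted(candidates), key=mmr)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy example: sentences 0 and 1 are near-duplicates, sentence 2 is distinct.
scores = [0.9, 0.85, 0.6]
similarity = [
    [1.0, 0.95, 0.1],
    [0.95, 1.0, 0.1],
    [0.1, 0.1, 1.0],
]
chosen = mmr_select(scores, similarity, limit=2)
```

In this toy input the second pick is sentence 2 rather than the higher-scoring sentence 1, because sentence 1 is penalized for overlapping with the first pick. Note that the scores themselves are fixed in advance; only the selection step looks at the history.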
Recently, deep neural network based approaches have become popular for extractive document summarization. Cao et al. (2015b) develop a novel summary system called PriorSum, which applies enhanced convolutional neural networks to capture the summary prior features derived from length-variable phrases. Ren et al. (2017) use a two-level attention mechanism to measure the contextual relations of sentences. Cheng and Lapata (2016) propose treating document summarization as a sequence labeling task. They first encode the sentences in the document and then classify each sentence into two classes, i.e., extraction or not. Nallapati et al. (2017) propose a system called SummaRuNNer with more features, which also treats extractive document summarization as a sequence labeling task. Both works follow the separated paradigm, as they first assign a probability of being extracted to each sentence, and then select sentences according to the probability until reaching the length limit. Ren et al. (2016) train two neural networks with handcrafted features. One is used to rank the sentences to select the first sentence, and the other one is used to model the redundancy during sentence selection. However, their redundancy model only considers the redundancy with the sentence that has the maximal score, and does not model the full selection history.

Problem Formulation
Extractive document summarization aims to extract informative sentences to represent the important meanings of a document. Given a document $D = (S_1, S_2, \dots, S_L)$ containing $L$ sentences, an extractive summarization system should select a subset of $D$ to form the output summary $\mathbb{S} = \{\hat{S}_i \mid \hat{S}_i \in D\}$. During the training phase, the reference summary $\mathbb{S}^*$ and the score of an output summary $\mathbb{S}$ under a given evaluation function $r(\mathbb{S} \mid \mathbb{S}^*)$ are available. The goal of training is to learn a scoring function $f(\mathbb{S})$ which can be used to find the best summary during testing:
$$\arg\max_{\mathbb{S}} f(\mathbb{S}) \quad \text{s.t. } \mathbb{S} = \{\hat{S}_i \mid \hat{S}_i \in D\},\ |\mathbb{S}| \le l$$
where $l$ is the length limit of the output summary. In this paper, $l$ is the sentence number limit. Previous state-of-the-art summarization systems search for the best solution using the learned scoring function $f(\cdot)$ with one of two methods, MMR and ILP. In this paper, we adopt the MMR method. Since MMR tries to maximize the relative gain given previously extracted sentences, we let the model learn to score this gain. Previous works adopt ROUGE recall as the evaluation function $r(\cdot)$, considering that the DUC tasks have a byte length limit for summaries. In this work, we adopt the CNN/Daily Mail dataset to train the neural network model, which does not have this length limit. To prevent the tendency of choosing longer sentences, we use ROUGE F1 as the evaluation function $r(\cdot)$, and set the length limit $l$ as a fixed number of sentences.
Therefore, the proposed model is trained to learn a scoring function $g(\cdot)$ of the ROUGE F1 gain, specifically:
$$g(S_t \mid \mathbb{S}_{t-1}) = r(\mathbb{S}_{t-1} \cup \{S_t\}) - r(\mathbb{S}_{t-1})$$
where $\mathbb{S}_{t-1}$ is the set of previously selected sentences, and we omit the condition $\mathbb{S}^*$ of $r(\cdot)$ for simplicity. At each time $t$, the summarization system chooses the sentence with maximal ROUGE F1 gain until reaching the sentence number limit.
Figure 1 gives the overview of NEUSUM, which consists of a hierarchical document encoder and a sentence extractor. Considering the intrinsically hierarchical nature of documents, in which words form a sentence and sentences form a document, we employ a hierarchical document encoder to reflect this hierarchical structure. The sentence extractor scores the encoded sentences and extracts one of them at each step until reaching the output sentence number limit. In this section, we will first introduce the hierarchical document encoder, and then describe how the model produces the summary by joint sentence scoring and selection.
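The gain-based selection rule can be illustrated with a small sketch. It assumes a toy evaluation function, `unigram_f1`, as a stand-in for ROUGE F1 (it is not the official ROUGE script), and it computes the gain directly from the reference summary, as is possible only at training time. The function names and the example sentences are hypothetical.

```python
from collections import Counter


def unigram_f1(summary_sents, reference):
    """Toy stand-in for r(S | S*): unigram-overlap F1, not the official ROUGE."""
    cand = Counter(w for s in summary_sents for w in s.split())
    ref = Counter(reference.split())
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def greedy_by_gain(doc, reference, limit):
    """Pick, at each step, the sentence with the largest gain over the partial summary."""
    selected = []
    for _ in range(min(limit, len(doc))):
        base = unigram_f1([doc[j] for j in selected], reference)
        gains = {
            i: unigram_f1([doc[j] for j in selected] + [doc[i]], reference) - base
            for i in range(len(doc))
            if i not in selected
        }
        selected.append(max(gains, key=gains.get))
    return selected


doc = ["the cat sat on the mat", "a dog barked loudly", "the cat sat"]
reference = "the cat sat on the mat while a dog barked"
picked = greedy_by_gain(doc, reference, limit=2)
```

In this example the third sentence scores higher than the second when each is judged alone, but once the first sentence is in the summary its gain is negative, so the second sentence is picked. This is the dependence on the selection history that a fixed per-sentence score cannot express.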

Document Encoding
We employ a hierarchical document encoder to represent the sentences in the input document. We encode the document at two levels, i.e., sentence level encoding and document level encoding. Given a document $D = (S_1, S_2, \dots, S_L)$ containing $L$ sentences, the sentence level encoder reads the $j$-th input sentence $S_j = (x^{(j)}_1, x^{(j)}_2, \dots, x^{(j)}_{n_j})$ and constructs the basic sentence representation $\widetilde{s}_j$. Here we employ a bidirectional GRU (BiGRU) (Cho et al., 2014) as the recurrent unit, where GRU is defined as:
$$z_i = \sigma(W_z [x_i, h_{i-1}])$$
$$r_i = \sigma(W_r [x_i, h_{i-1}])$$
$$\widetilde{h}_i = \tanh(W_h [x_i, r_i \odot h_{i-1}])$$
$$h_i = (1 - z_i) \odot h_{i-1} + z_i \odot \widetilde{h}_i$$
where $W_z$, $W_r$ and $W_h$ are weight matrices. The BiGRU consists of a forward GRU and a backward GRU. The forward GRU reads the word embeddings in sentence $S_j$ from left to right and gets a sequence of hidden states, $(\overrightarrow{h}^{(j)}_1, \overrightarrow{h}^{(j)}_2, \dots, \overrightarrow{h}^{(j)}_{n_j})$. The backward GRU reads the word embeddings in reverse order, from right to left, and results in another sequence of hidden states, $(\overleftarrow{h}^{(j)}_1, \overleftarrow{h}^{(j)}_2, \dots, \overleftarrow{h}^{(j)}_{n_j})$, where the initial states of the BiGRU are set to zero vectors. After reading the words of the sentence $S_j$, we construct its sentence level representation $\widetilde{s}_j$ by concatenating the last forward and backward GRU hidden vectors:
$$\widetilde{s}_j = [\overleftarrow{h}^{(j)}_1 ; \overrightarrow{h}^{(j)}_{n_j}]$$
We use another BiGRU as the document level encoder to read the sentences. With the sentence level encoded vectors $(\widetilde{s}_1, \widetilde{s}_2, \dots, \widetilde{s}_L)$ as inputs, the document level encoder does forward and backward GRU encoding and produces two lists of hidden vectors: $(\overrightarrow{s}_1, \overrightarrow{s}_2, \dots, \overrightarrow{s}_L)$ and $(\overleftarrow{s}_1, \overleftarrow{s}_2, \dots, \overleftarrow{s}_L)$. The document level representation $s_i$ of sentence $S_i$ is the concatenation of the forward and backward hidden vectors:
$$s_i = [\overrightarrow{s}_i ; \overleftarrow{s}_i]$$
We then get the final sentence vectors in the given document: $D = (s_1, s_2, \dots, s_L)$. We use sentence $S_i$ and its representative vector $s_i$ interchangeably in this paper.
Figure 1: Overview of the NEUSUM model. The model extracts $S_5$ and $S_1$ at the first two steps. At the first step, we feed the model a zero vector $\mathbf{0}$ to represent the empty partial output summary. At the second and third steps, the representations of previously selected sentences $S_5$ and $S_1$, i.e., $s_5$ and $s_1$, are fed into the extractor RNN. At the second step, the model only scores the first 4 sentences since the 5th one is already included in the partial output summary.
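The two-level encoding can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than the authors' implementation: it assumes the common bias-free GRU formulation with update rule h_i = (1 - z_i) * h_{i-1} + z_i * h~_i, uses small uniformly random weights, and requires every sentence to be non-empty. The names `GRUCell`, `bigru` and `encode_document` are hypothetical.

```python
import math
import random


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class GRUCell:
    """Bias-free GRU cell with weights W_z, W_r, W_h acting on [x, h]."""

    def __init__(self, input_size, hidden_size, rng):
        cols = input_size + hidden_size
        self.hidden_size = hidden_size
        self.W_z = random_matrix(hidden_size, cols, rng)
        self.W_r = random_matrix(hidden_size, cols, rng)
        self.W_h = random_matrix(hidden_size, cols, rng)

    def step(self, x, h_prev):
        xh = x + h_prev  # list concatenation: [x_i, h_{i-1}]
        z = [sigmoid(a) for a in matvec(self.W_z, xh)]
        r = [sigmoid(a) for a in matvec(self.W_r, xh)]
        gated = [ri * hi for ri, hi in zip(r, h_prev)]
        h_tilde = [math.tanh(a) for a in matvec(self.W_h, x + gated)]
        return [(1 - zi) * hp + zi * ht for zi, hp, ht in zip(z, h_prev, h_tilde)]

    def run(self, inputs):
        h = [0.0] * self.hidden_size  # zero initial state
        states = []
        for x in inputs:
            h = self.step(x, h)
            states.append(h)
        return states


def bigru(fwd, bwd, inputs):
    forward = fwd.run(inputs)
    backward = bwd.run(inputs[::-1])[::-1]  # re-align with input order
    return forward, backward


def encode_document(doc, word_fwd, word_bwd, sent_fwd, sent_bwd):
    """doc: list of sentences, each a non-empty list of word-embedding vectors."""
    basic = []
    for sent in doc:
        f, b = bigru(word_fwd, word_bwd, sent)
        # Final state of each direction: backward at word 1, forward at word n.
        basic.append(b[0] + f[-1])
    f, b = bigru(sent_fwd, sent_bwd, basic)
    return [fi + bi for fi, bi in zip(f, b)]
```

With word-level hidden size H1 and document-level hidden size H2, the basic sentence vectors have 2·H1 dimensions and the final sentence vectors have 2·H2 dimensions, so the document-level cells must be built with input size 2·H1.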

Joint Sentence Scoring and Selection
Since separated sentence scoring and selection cannot utilize each other's information, the goal of our model is to make them benefit each other. We couple these two steps together so that: a) sentence scoring can be aware of previously selected sentences; b) sentence selection can be simplified since the scoring function is learned to be the ROUGE score gain as described in section 3. Given the last extracted sentence $\hat{S}_{t-1}$, the sentence extractor decides the next sentence $\hat{S}_t$ by scoring the remaining document sentences. To score the document sentences considering both their importance and the partial output summary, the model should have two key abilities: 1) remembering the information of previously selected sentences; 2) scoring the remaining document sentences based on both the previously selected sentences and the importance of the remaining sentences. Therefore, we employ another GRU as the recurrent unit to remember the partial output summary, and use a Multi-Layer Perceptron (MLP) to score the document sentences. Specifically, the GRU takes the document level representation $s_{t-1}$ of the last extracted sentence $\hat{S}_{t-1}$ as input to produce its current hidden state $h_t$. The sentence scorer, which is a two-layer MLP, takes two input vectors, namely the current hidden state $h_t$ and the sentence representation vector $s_i$, to calculate the score $\delta(S_i)$ of sentence $S_i$:
$$h_t = \mathrm{GRU}(s_{t-1}, h_{t-1})$$
$$\delta(S_i) = W_s \tanh(W_q h_t + W_d s_i)$$
where $W_s$, $W_q$ and $W_d$ are learnable parameters, and we omit the bias parameters for simplicity.
When extracting the first sentence, we initialize the GRU hidden state $h_0$ with a linear layer with tanh activation function:
$$h_0 = \tanh(W_m \overleftarrow{s}_1 + b_m)$$
where $W_m$ and $b_m$ are learnable parameters, and $\overleftarrow{s}_1$ is the last backward state of the document level encoder BiGRU. Since we do not have any sentences extracted yet, we use a zero vector to represent the previously extracted sentence, i.e., $s_0 = \mathbf{0}$.
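With the scores of all sentences at time $t$, we choose the sentence with maximal gain score among the sentences not yet selected:
$$\hat{S}_t = \arg\max_{S_i \in D \setminus \mathbb{S}_{t-1}} \delta(S_i)$$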
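The extraction loop described in this section can be sketched as follows. The sketch assumes the scorer has the form delta(S_i) = w_s · tanh(W_q h_t + W_d s_i), as suggested by the parameter names W_s, W_q and W_d, and it takes the recurrent update as a callable so that any cell can be plugged in. In the toy example the recurrent update is a plain additive stand-in, not a GRU, and all weights are hand-picked for illustration.

```python
import math


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def mlp_score(h_t, s_i, W_q, W_d, w_s):
    """Assumed scorer: w_s . tanh(W_q h_t + W_d s_i), biases omitted."""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(W_q, h_t), matvec(W_d, s_i))]
    return sum(w * x for w, x in zip(w_s, hidden))


def extract(sent_vecs, h0, rnn_step, scorer, limit):
    """Select up to `limit` sentence indices, re-scoring the rest at every step."""
    h = h0
    prev = [0.0] * len(sent_vecs[0])  # s_0 = 0: empty partial summary
    selected = []
    for _ in range(min(limit, len(sent_vecs))):
        h = rnn_step(prev, h)  # new extraction state h_t
        remaining = [i for i in range(len(sent_vecs)) if i not in selected]
        best = max(remaining, key=lambda i: scorer(h, sent_vecs[i]))
        selected.append(best)
        prev = sent_vecs[best]  # feed the chosen sentence back in
    return selected


# Toy setup (all values hypothetical): sentences 0 and 1 point the same way.
sent_vecs = [[1.2, 0.0], [1.1, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
neg_identity = [[-1.0, 0.0], [0.0, -1.0]]


def additive_step(prev, h):
    # Stand-in for the extractor GRU: simply accumulates selected sentences.
    return [a + b for a, b in zip(prev, h)]


def toy_scorer(h, s):
    return mlp_score(h, s, W_q=neg_identity, W_d=identity, w_s=[1.0, 1.0])


order = extract(sent_vecs, [0.0, 0.0], additive_step, toy_scorer, limit=2)
```

With these weights the first pick is sentence 0, and at the second step sentence 2 outscores sentence 1, because the state now carries sentence 0 and lowers the score of the vector that resembles it. The same sentence therefore receives different scores at different steps, which is the point of coupling scoring with selection.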

Objective Function
Inspired by Inan et al. (2017), we optimize the Kullback-Leibler (KL) divergence between the model prediction $P$ and the labeled training data distribution $Q$. We normalize the predicted sentence score $\delta(S_i)$ with a softmax function to get the model prediction distribution $P$:
$$P(\hat{S}_t = S_i) = \frac{\exp(\delta(S_i))}{\sum_{k=1}^{L} \exp(\delta(S_k))}$$
During training, the model is expected to learn the relative ROUGE F1 gain at time step $t$ with previously selected sentences $\mathbb{S}_{t-1}$. Considering that the F1 gain value might be negative in the labeled data, we follow previous works (Ren et al., 2017) and use Min-Max Normalization to rescale the gain value to $[0, 1]$, writing $g(S_i)$ for the gain of sentence $S_i$ over $\mathbb{S}_{t-1}$:
$$\widetilde{g}(S_i) = \frac{g(S_i) - \min_k g(S_k)}{\max_k g(S_k) - \min_k g(S_k)}$$
We then apply a softmax operation with temperature $\tau$ (Hinton et al., 2015), used as a smoothing factor, to produce the smoothed label distribution $Q$ as the training target:
$$Q(S_i) = \frac{\exp(\tau\, \widetilde{g}(S_i))}{\sum_{k=1}^{L} \exp(\tau\, \widetilde{g}(S_k))}$$
Therefore, we minimize the KL loss function $J$:
$$J = D_{KL}(P \,\|\, Q)$$
We create an extractive summarization training set based on the CNN/Daily Mail corpus. To determine the sentences to be extracted, we design a rule-based system to label the sentences in a given document, similar to Nallapati et al. (2017). Specifically, we construct training data by maximizing the ROUGE-2 F1 score. Since it is computationally expensive to find the globally optimal combination of sentences, we employ a greedy approach. Given a document with $n$ sentences, we enumerate the candidates from the 1-combinations $\binom{n}{1}$ to the $n$-combinations $\binom{n}{n}$. We stop searching if the highest ROUGE-2 F1 score in $\binom{n}{k}$ is less than the best one in $\binom{n}{k-1}$. Table 1 shows the data statistics of the CNN/Daily Mail dataset.
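The training objective of this section, for a single extraction step, can be sketched as below. Two points are assumptions rather than facts taken from the text: the temperature is applied by multiplying the normalized gain by `tau`, and the divergence is taken in the direction KL(P || Q). The value of `tau` in the example is arbitrary.

```python
import math


def softmax(xs, scale=1.0):
    """Numerically stable softmax of scale * x."""
    m = max(x * scale for x in xs)
    exps = [math.exp(x * scale - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def min_max(gains):
    """Rescale gains (which may be negative) to [0, 1]."""
    lo, hi = min(gains), max(gains)
    if hi == lo:
        return [0.0 for _ in gains]
    return [(g - lo) / (hi - lo) for g in gains]


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def step_loss(scores, gains, tau):
    """KL(P || Q) for one step: P from model scores, Q from labeled gains."""
    p = softmax(scores)
    q = softmax(min_max(gains), scale=tau)  # assumed: tau multiplies the gain
    return kl_divergence(p, q)


# Toy step with three candidate sentences; the first has a negative gain.
gains = [-0.1, 0.0, 0.3]
loss_uniform = step_loss([0.0, 0.0, 0.0], gains, tau=5.0)
```

The loss is zero exactly when the model scores reproduce the scaled, normalized gains up to an additive constant, and positive otherwise, so the model is trained toward the full gain distribution rather than toward a single correct sentence.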
We conduct data preprocessing using the same method as See et al. (2017), including sentence splitting and word tokenization. Nallapati et al. (2016, 2017) both use the anonymized version of the data, where the named entities are replaced by identifiers such as entity4. Following See et al. (2017), we use the non-anonymized version so we can directly operate on the original text.

Model Training
We initialize the model parameters randomly using a Gaussian distribution with the Xavier scheme (Glorot and Bengio, 2010). The word embedding matrix is initialized using pre-trained 50-dimension GloVe vectors (Pennington et al., 2014). We found that larger GloVe vectors do not lead to improvement. Therefore, we use 50-dimension word embeddings for fast training. The pre-trained GloVe vectors contain 400,000 words and cover 90.39% of our model vocabulary. We initialize the rest of the word embeddings randomly using a Gaussian distribution with the Xavier scheme. The word embedding matrix is not updated during training. We use Adam (Kingma and Ba, 2015) as our optimizing algorithm. For the hyperparameters of the Adam optimizer, we set the learning rate $\alpha = 0.001$, the two momentum parameters $\beta_1 = 0.9$ and $\beta_2 = 0.999$, and $\epsilon = 10^{-8}$. We also apply gradient clipping (Pascanu et al., 2013) with range $[-5, 5]$ during training. We use dropout (Srivastava et al., 2014) as regularization with probability $p = 0.3$ after the sentence level encoder and $p = 0.2$ after the document level encoder. We truncate each article to 80 sentences and each sentence to 100 words during both training and testing. The model is implemented with PyTorch (Paszke et al., 2017).

Model Testing
At test time, considering that LEAD3 is a commonly used and strong extractive baseline, we make NEUSUM and the baselines extract 3 sentences to make them all comparable.
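For readers who want to see what the stated optimizer settings amount to, here is a minimal sketch of one Adam update with the hyperparameters given in the training setup. It assumes that the clipping range [-5, 5] is applied element-wise to gradient values, which the text does not spell out, and it follows the standard bias-corrected Adam update. The function names are hypothetical; a real implementation would rely on the optimizer of the deep learning framework.

```python
import math


def clip(grads, low=-5.0, high=5.0):
    """Element-wise gradient clipping (assumed interpretation of the range)."""
    return [min(max(g, low), high) for g in grads]


def adam_step(params, grads, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are updated in place, t is the 1-based step count."""
    grads = clip(grads)
    new_params = []
    for i, g in enumerate(grads):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        m_hat = m[i] / (1 - beta1 ** t)  # bias-corrected first moment
        v_hat = v[i] / (1 - beta2 ** t)  # bias-corrected second moment
        new_params.append(params[i] - lr * m_hat / (math.sqrt(v_hat) + eps))
    return new_params


# First step on two parameters; the gradient 10.0 is clipped to 5.0.
m, v = [0.0, 0.0], [0.0, 0.0]
updated = adam_step([1.0, -2.0], [10.0, -0.5], m, v, t=1)
```

On the first step the bias correction makes the update magnitude close to the learning rate for every parameter with a non-zero gradient, regardless of the gradient's scale.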

Baseline
We compare the NEUSUM model with the following state-of-the-art baselines:
LEAD3 The commonly used baseline that selects the first three sentences as the summary.
CRSUM Ren et al. (2017) propose an extractive summarization system which considers the contextual information of a sentence. We train this baseline model with the same training data as our approach.
NN-SE Cheng and Lapata (2016) propose an extractive system which models document summarization as a sequence labeling task. We train this baseline model with the same training data as our approach.
SUMMARUNNER Nallapati et al. (2017) propose to add some interpretable features such as sentence absolute and relative positions.
PGN Pointer-Generator Network (PGN). A state-of-the-art abstractive document summarization system proposed by See et al. (2017), which incorporates copying and coverage mechanisms.

Evaluation Metric
We employ ROUGE (Lin, 2004) as our evaluation metric. ROUGE measures the quality of summary by computing overlapping lexical units, such as unigram, bigram, trigram, and longest common subsequence (LCS). It has become the standard evaluation metric for DUC shared tasks and popular for summarization evaluation. Following previous work, we use ROUGE-1 (unigram), ROUGE-2 (bigram) and ROUGE-L (LCS) as the evaluation metrics in the reported experimental results.
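As a rough illustration of what these metrics measure, the following sketch computes n-gram overlap F1 and an LCS-based F1 between two token lists. It is a simplification and not the official ROUGE script: it handles a single reference, applies no stemming or stopword handling, and treats the whole summary as one token sequence for the LCS. The function names are hypothetical.

```python
from collections import Counter


def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def f1(overlap, cand_total, ref_total):
    if overlap == 0:
        return 0.0
    precision, recall = overlap / cand_total, overlap / ref_total
    return 2 * precision * recall / (precision + recall)


def rouge_n_f1(candidate, reference, n):
    """Clipped n-gram overlap F1 between two token lists."""
    c, r = ngrams(candidate, n), ngrams(reference, n)
    overlap = sum((c & r).values())
    return f1(overlap, sum(c.values()), sum(r.values()))


def lcs_length(a, b):
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        cur = [0]
        for j, y in enumerate(b, 1):
            cur.append(prev[j - 1] + 1 if x == y else max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def rouge_l_f1(candidate, reference):
    return f1(lcs_length(candidate, reference), len(candidate), len(reference))


cand = "the cat sat on the mat".split()
ref = "the cat lay on the mat".split()
r1, r2, rl = rouge_n_f1(cand, ref, 1), rouge_n_f1(cand, ref, 2), rouge_l_f1(cand, ref)
```

For this pair, five of six unigrams and three of five bigrams match, and the longest common subsequence has length five, so the bigram score is noticeably lower than the unigram and LCS scores.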

Results
We use the official ROUGE script (version 1.5.5) to evaluate the summarization output. Table 2 summarizes the results on the CNN/Daily Mail dataset using full-length ROUGE F1 evaluation. It includes two unsupervised baselines, LEAD3 and TEXTRANK. The table also includes three state-of-the-art neural network based extractive models, i.e., CRSUM, NN-SE and SUMMARUNNER. In addition, we report the state-of-the-art abstractive PGN model. The result of SUMMARUNNER is on the anonymized dataset and not strictly comparable to our results on the non-anonymized version of the dataset. Therefore, we also include the result of LEAD3 on the anonymized dataset as a reference. NEUSUM achieves a ROUGE-2 F1 score of 19.01 on the CNN/Daily Mail dataset. Compared to the unsupervised baseline methods, NEUSUM performs better by a large margin. In terms of ROUGE-2 F1, NEUSUM outperforms the strong baseline LEAD3 by 1.31 points. NEUSUM also outperforms the neural network based models. Compared to the state-of-the-art extractive model NN-SE (Cheng and Lapata, 2016), NEUSUM performs significantly better in terms of ROUGE-1, ROUGE-2 and ROUGE-L F1 scores. Shallow features, such as sentence position, have proven effective in document summarization (Ren et al., 2017; Nallapati et al., 2017). Without any handcrafted features, NEUSUM performs better than the CRSUM and SUMMARUNNER baseline models, which use such features. As given by the 95% confidence interval in the official ROUGE script, our model achieves statistically significant improvements over all the baseline models. To the best of our knowledge, the proposed NEUSUM model achieves the best results on the CNN/Daily Mail dataset.
We also provide human evaluation results on a sample of the test set. We randomly sample 50 documents and ask three volunteers to evaluate the output of NEUSUM and the NN-SE baseline model. They are asked to rank the output summaries from best to worst (with ties allowed) regarding informativeness, redundancy and overall quality. Table 3 shows the human evaluation results. NEUSUM performs better than the NN-SE baseline on all three aspects, especially in redundancy. This indicates that by jointly scoring and selecting sentences, NEUSUM can produce summaries with less content overlap since it re-estimates the saliency of the remaining sentences considering both their contents and previously selected sentences.

Precision at Step-t
We analyze the accuracy of sentence selection at each step. Since we extract 3 sentences at test time, we show how NEUSUM performs when extracting each sentence. Given a document $D$ in the test set $T$, the summary $\mathbb{S}$ predicted by NEUSUM, its reference summary $\mathbb{S}^*$, and the extractive oracle summary $O$ with respect to $D$ and $\mathbb{S}^*$ (we use the method described in section 5.1 to construct $O$), we define the precision at step $t$ as $p(@t)$:
$$p(@t) = \frac{1}{|T|} \sum_{D \in T} \mathbb{1}_O(\mathbb{S}[t])$$
where $\mathbb{S}[t]$ is the sentence extracted at step $t$, and $\mathbb{1}_O$ is the indicator function defined as:
$$\mathbb{1}_O(x) = \begin{cases} 1 & \text{if } x \in O \\ 0 & \text{otherwise} \end{cases}$$
Figure 3 shows the precision at step $t$ of the NN-SE baseline and our NEUSUM. It can be observed that NEUSUM achieves better precision than the NN-SE baseline at each step. For the first sentence, both NEUSUM and NN-SE achieve good performance. The NN-SE baseline has 39.18% precision at the first step, and NEUSUM outperforms it by 1.2 points. At the second step, NEUSUM outperforms NN-SE by a large margin. In this step, the NEUSUM model extracts 31.52% of sentences correctly, which is 3.24 percentage points higher than the 28.28% of NN-SE. We think the second step selection benefits from the first step in NEUSUM since it can remember the selection history, while the separated models lack this ability.
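The precision at step t can be computed with a few lines of code. The sketch below assumes that each prediction is an ordered list of extracted sentence indices and each oracle is a set of indices; the function name and the toy data are hypothetical.

```python
def precision_at(t, predictions, oracles):
    """Fraction of documents whose t-th extracted sentence (1-based) is in the oracle.

    predictions: one ordered list of extracted sentence indices per document
    oracles:     one set of oracle sentence indices per document
    """
    hits = sum(
        1
        for pred, oracle in zip(predictions, oracles)
        if len(pred) >= t and pred[t - 1] in oracle
    )
    return hits / len(predictions)


# Toy test set of four documents with three extracted sentences each.
predictions = [[0, 3, 5], [1, 0, 2], [4, 2, 0], [0, 1, 2]]
oracles = [{0, 5}, {0, 2}, {2, 7}, {1, 2}]
p_at = [precision_at(t, predictions, oracles) for t in (1, 2, 3)]
```

Because the measure looks only at the sentence chosen at position t, it separates how well the first pick is made from how well later picks are made given the history.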
However, the precision drops quickly after each selection. We attribute this to two main reasons. First, error propagation leads to worse choices at the third selection. As shown in Figure 2, p(@1) and p(@2) are 40.38% and 31.52% respectively, so the history is less reliable for the third selection. Second, intuitively, later selections are more difficult than earlier ones since the most important sentences have already been selected.

Position of Selected Sentences
Earlier works (Ren et al., 2017; Nallapati et al., 2017) have shown that sentence position is an important feature in extractive document summarization. Figure 2 shows the position distributions of the NN-SE baseline, our NEUSUM model and the oracle on the CNN/Daily Mail test set. It can be seen that the NN-SE baseline model tends to extract a large number of leading sentences, especially the leading three sentences. According to the statistics, about 80.91% of the sentences selected by the NN-SE baseline are among the leading three sentences.
Meanwhile, 58.64% of the sentences selected by our NEUSUM model are among the leading three sentences. We can notice that in the oracle, the percentage of selecting leading sentences (sentences 1 to 5) is moderate, at around 10%. Compared to NN-SE, the position of selected sentences in NEUSUM is closer to the oracle. Although NEUSUM also extracts more leading sentences than the oracle, it selects more sentences from the tail than NN-SE does. For example, more than 30% of the sentences extracted by our NEUSUM model are in the range of sentences 4 to 6. In the range of sentences 7 to 13, NN-SE barely extracts any sentences, but our NEUSUM model still extracts sentences in this range. Therefore, we think this is one of the reasons why NEUSUM performs better than NN-SE.
We analyze the sentence position distribution and offer an explanation for these observations. Intuitively, leading sentences are important for a well-organized article, especially for newswire articles. It is also well known that LEAD3 is a very strong baseline. In the training data, we found that 50.98% of the sentences labeled as "should be extracted" belong to the first 5 sentences, which may cause the trained model to select more leading sentences. One possible situation is that a sentence in the tail of a document is more important than the leading sentences, but the margin between them is not large enough. The models which separately score and select sentences might not select sentences in the tail whose scores are not higher than the leading ones. These methods may choose the safer leading sentences as a fallback in such an ambiguous situation because there is no direct competition between the leading and tail candidates. In our NEUSUM model, the scoring and selection are jointly learned, and at each step the tail candidates can compete directly with the leading ones. Therefore, NEUSUM can be more discriminating when dealing with this situation.

Conclusion
Conventional approaches to extractive document summarization consist of two separate steps: sentence scoring and sentence selection. In this paper, we present a novel neural network framework for extractive document summarization that jointly learns to score and select sentences. The most distinguishing feature of our approach compared with previous methods is that it combines sentence scoring and selection into one phase. Every time it selects a sentence, it scores the sentences according to the partial output summary and the current extraction state. ROUGE evaluation results show that the proposed joint sentence scoring and selection approach significantly outperforms previous separated methods.