Reinforced Extractive Summarization with Question-Focused Rewards

We investigate a new training paradigm for extractive summarization. Traditionally, human abstracts are used to derive gold-standard labels for extraction units. However, the labels are often inaccurate, because human abstracts and source documents cannot be easily aligned at the word level. In this paper we convert human abstracts to a set of Cloze-style comprehension questions. System summaries are encouraged to preserve salient source content useful for answering questions and share common words with the abstracts. We use reinforcement learning to explore the space of possible extractive summaries and introduce a question-focused reward function to promote concise, fluent, and informative summaries. Our experiments show that the proposed method is effective. It surpasses state-of-the-art systems on the standard summarization dataset.


Introduction
We study extractive summarization in this work, where salient word sequences are extracted from the source document and concatenated to form a summary (Nenkova and McKeown, 2011). Existing supervised approaches to extractive summarization frequently use human abstracts to create annotations for extraction units (Gillick and Favre, 2009; Li et al., 2013; Cheng and Lapata, 2016). For example, a source word is labelled 1 if it appears in the abstract, and 0 otherwise. Despite its usefulness, there are two issues with this scheme. First, the vast majority of source words are tagged 0 and only a small portion are tagged 1. This is because human abstracts are short and concise, and they often contain words not present in the source. Second, not all labels are accurate. Source words that are labelled 0 may be paraphrases, generalizations, or otherwise related to words in the abstracts. These source words are often mislabelled. Consequently, leveraging human abstracts to provide supervision for extractive summarization remains a challenge.
Table 1: Example source document, the top sentence of the abstract, and system-generated Cloze-style questions. Source content related to the abstract is italicized.
Neural abstractive summarization can alleviate this issue by allowing the system to either copy words from the source texts or generate new words from a vocabulary (Rush et al., 2015;Nallapati et al., 2016;See et al., 2017). While the techniques are promising, they face other challenges, such as ensuring the summaries remain faithful to the original. Failing to reproduce factual details has been revealed as one of the main obstacles for neural abstractive summarization (Cao et al., 2018;Song et al., 2018). This study thus chooses to focus on neural extractive summarization.
We explore a new training paradigm for extractive summarization. We convert human abstracts to a set of Cloze-style comprehension questions, where the question body is a sentence of the abstract with a blank, and the answer is an entity or a keyword. Table 1 shows an example. Because the questions cannot be answered by applying general world knowledge, system summaries are encouraged to preserve salient source content that is relevant to the questions (≈ human abstract) such that the summaries can work as a document surrogate to predict correct answers. We use an attention mechanism to locate segments of a summary that are relevant to a given question so that the summary can be used to answer multiple questions.
This study extends the work of Lei et al. (2016) by using reinforcement learning to explore the space of extractive summaries. While the original work focuses on generating rationales to support supervised classification, the goal of our study is to produce fluent, generic document summaries. The question-answering (QA) task is designed to fulfill this goal, and the QA performance is only secondary. Our research contributions can be summarized as follows:
• we investigate an alternative training scheme for extractive summarization where the summaries are encouraged to be semantically close to human abstracts in addition to sharing common words;
• we compare two methods to convert human abstracts to Cloze-style questions and investigate their impact on QA and summarization performance. Our results surpass those of previous systems on a standard summarization dataset.

Related Work
This study focuses on generic summarization. It differs from query-based summarization (Daumé III and Marcu, 2006; Dang and Owczarzak, 2008), where systems are trained to select text pieces related to predefined queries. In this work we have no predefined queries; instead, the system generates questions from human abstracts and learns to produce generic summaries that are capable of answering all of them. Cloze questions have been used in reading comprehension (Richardson et al., 2013; Weston et al., 2016; Mostafazadeh et al., 2016; Rajpurkar et al., 2016) to test a system's ability to perform reasoning and language understanding. Hermann et al. (2015) describe an approach to extract (context, question, answer) triples from news articles. Our work draws on this approach to automatically create questions from human abstracts.
Reinforcement learning (RL) has recently been applied to a number of NLP applications, including dialog generation (Li et al., 2017), machine translation (MT) (Ranzato et al., 2016; Gu et al., 2018), question answering (Choi et al., 2017), and summarization and sentence simplification (Zhang and Lapata, 2017; Paulus et al., 2017; Chen and Bansal, 2018; Narayan et al., 2018). This study leverages RL to explore the space of possible extractive summaries. The summaries are encouraged to preserve salient source content useful for answering questions as well as to share common words with the abstracts.

Our Approach
Given a source document $X$, our system generates a summary $Y = (y_1, y_2, \cdots, y_{|Y|})$ by identifying consecutive sequences of words: $y_t$ is 1 if the $t$-th source word is included in the summary, 0 otherwise. In this section we investigate a question-oriented reward $R(Y)$ that encourages summaries to contain sufficient content useful for answering key questions about the document (§3.1); we then use reinforcement learning to explore the space of possible extractive summaries (§3.2).

Question-Focused Reward
We reward a summary if it can be used as a document surrogate to answer important questions. Let $\{(Q_k, e^*_k)\}_{k=1}^{K}$ be a set of question-answer pairs for a source document, where $e^*_k$ is the ground-truth answer corresponding to an entity or a keyword. We encode the question $Q_k$ into a vector $q_k = \text{Bi-LSTM}(Q_k) \in \mathbb{R}^d$ using a bidirectional LSTM, where the last outputs of the forward and backward passes are concatenated to form a question vector. We use the same Bi-LSTM to encode the summary $Y$ into a sequence of vectors $(h^S_1, h^S_2, \cdots, h^S_{|S|}) = \text{Bi-LSTM}(Y)$, where $|S|$ is the number of words in the summary and $h^S_t \in \mathbb{R}^d$ is the concatenation of forward and backward hidden states at time step $t$. Figure 1 provides an illustration of the system framework.
An attention mechanism is used to locate parts of the summary that are relevant to $Q_k$. We define $\alpha_{k,i} \propto \exp(q_k W^a h^S_i)$ to represent the importance of the $i$-th summary word ($h^S_i$) to answering the $k$-th question ($q_k$), characterized by a bilinear term (Chen et al., 2016a). A context vector $c_k$ is constructed as a weighted sum of all summary words relevant to the $k$-th question, $c_k = \sum_{i=1}^{|S|} \alpha_{k,i} h^S_i$, and it is used to predict the answer. We define the QA reward $R_a(Y)$ as the log-likelihood of correctly predicting all answers. $\{W^a, W^c\}$ are learnable model parameters.
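The attention and QA reward described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes the answer is predicted by a softmax over an answer vocabulary computed from the context vector with a matrix `W_c`, and that the reward averages the log-likelihood over the K questions. The function names and the averaging are our assumptions.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def bilinear(q, W, h):
    """Bilinear score q^T W h for vectors q, h and matrix W (list of rows)."""
    return sum(q[i] * sum(W[i][j] * h[j] for j in range(len(h)))
               for i in range(len(q)))

def context_vector(q, W_a, H):
    """Attention weights over summary word vectors H and their weighted sum."""
    alpha = softmax([bilinear(q, W_a, h) for h in H])
    dim = len(H[0])
    c = [sum(a * h[j] for a, h in zip(alpha, H)) for j in range(dim)]
    return alpha, c

def answer_log_prob(c, W_c, answer_id):
    """Log-probability of the gold answer (assumed softmax output layer)."""
    logits = [sum(w * x for w, x in zip(row, c)) for row in W_c]
    return math.log(softmax(logits)[answer_id])

def qa_reward(questions, answers, H, W_a, W_c):
    """Mean log-likelihood of the gold answers (averaging is an assumption)."""
    total = 0.0
    for q, a in zip(questions, answers):
        _, c = context_vector(q, W_a, H)
        total += answer_log_prob(c, W_c, a)
    return total / len(questions)
```

Because the reward is a log-likelihood, it is never positive; a summary that keeps the words needed to answer a question moves it closer to zero.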
In the following we describe approaches to obtain a set of question-answer pairs $\{(Q_k, e^*_k)\}_{k=1}^{K}$ from a human abstract. In fact, this formulation has the potential to make use of multiple human abstracts (subject to availability) in a unified framework; in that case, the QA pairs will be extracted from all abstracts. According to Eq. (4), the system is optimized to generate summaries that preserve salient source content sufficient to answer all questions (≈ human abstract). We expect to harvest one question-answer pair from each sentence of the abstract. More are possible, but the QA pairs will contain duplicate content. There are a few other noteworthy issues. If we do not collect any QA pairs from a sentence of the abstract, its content will be left out of the system summary. It is thus crucial for the system to extract at least one QA pair from every sentence in an automatic manner. Further, the questions must not be answerable by simply applying general world knowledge. We expect the adequacy of the summary to have a direct influence on whether or not the questions will be correctly answered.
Motivated by these considerations, we perform the following steps. We split a human abstract into a set of sentences, identify an answer token from each sentence, then convert the sentence to a question by replacing the token with a placeholder, yielding a Cloze question. We explore two approaches to extract answer tokens:
• Entities. We extract four types of named entities {PER, LOC, ORG, MISC} from sentences and treat them as possible answer tokens.
• Keywords. This approach identifies the ROOT word of a sentence dependency parse tree and treats it as a keyword-based answer token. Not all sentences contain entities, but every sentence has a root word; it is often the main verb of the sentence.
We obtain K question-answer pairs from each human abstract, one pair per sentence. If there are fewer than K sentences in the abstract, the QA pairs of the top sentences are duplicated, under the assumption that the top sentences are more important than the others. If multiple entities reside in a sentence, we randomly pick one as the answer token; if there are no entities, we use the root word instead.
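The construction of Cloze questions described above can be sketched as follows. The sketch assumes each abstract sentence has already been tokenized and annotated by an external named-entity tagger and dependency parser, so the `entities` and `root` fields, the `@placeholder` marker, and the function names are illustrative assumptions. Entities are treated as single tokens for simplicity.

```python
import random

PLACEHOLDER = "@placeholder"  # assumed marker; any reserved token would do

def make_cloze(tokens, entities, root, rng):
    """Turn one abstract sentence into a (question, answer) pair.

    A random entity is the answer if the sentence has any; otherwise the
    root word of the dependency parse is used.
    """
    answer = rng.choice(entities) if entities else root
    question = [PLACEHOLDER if tok == answer else tok for tok in tokens]
    return question, answer

def build_qa_pairs(sentences, k, rng=None):
    """Build exactly k QA pairs, one per sentence, from the top sentences.

    If the abstract has fewer than k sentences, pairs from the top
    sentences are repeated in order until k pairs are available.
    """
    rng = rng or random.Random(0)
    pairs = [make_cloze(s["tokens"], s["entities"], s["root"], rng)
             for s in sentences[:k]]
    n = len(pairs)
    i = 0
    while n > 0 and len(pairs) < k:
        pairs.append(pairs[i % n])
        i += 1
    return pairs
```

For an abstract with two sentences and K = 3, the third pair is a copy of the first, reflecting the assumption that top sentences matter most.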
To ensure that the extractive summaries are concise, fluent, and close to the original wording, we add further components to the reward function: (i) we use the term $\left|\frac{1}{|Y|}\sum_{t=1}^{|Y|} y_t - \delta\right|$ to restrict the summary size. We require the percentage of selected source words to be close to a predefined threshold $\delta$. This constraint works well at restricting length, with the average summary size adhering to this percentage; (ii) we further introduce $R_f(Y) = \sum_{t=2}^{|Y|} |y_t - y_{t-1}|$ to encourage the summaries to be fluent. This component is adopted from Lei et al. (2016), where few 0/1 switches between $y_{t-1}$ and $y_t$ indicate that the system is selecting consecutive word sequences; (iii) we encourage system and reference summaries to share common bigrams. This practice has shown success in earlier studies (Gillick and Favre, 2009). $R_b(Y)$ is defined as the percentage of reference bigrams successfully covered by the system summary. These three components together ensure the well-formedness of extractive summaries. The final reward function $R(Y)$ is a linear interpolation of all the components; $\gamma$, $\alpha$, $\beta$ are coefficients and we describe their parameter tuning in §4.
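The three auxiliary components can be computed directly from the binary selection vector and the token sequences. The sketch below is illustrative only: the function names are ours, and counting unique reference bigrams (rather than bigram occurrences) is an assumption. How the components are weighted and signed in the final reward is left to the caller.

```python
def size_penalty(y, delta):
    """Absolute gap between the fraction of selected words and delta."""
    return abs(sum(y) / len(y) - delta)

def switch_count(y):
    """Number of 0/1 switches between adjacent selection decisions.

    Fewer switches mean the selected words form longer consecutive runs.
    """
    return sum(abs(y[t] - y[t - 1]) for t in range(1, len(y)))

def bigram_coverage(summary_tokens, reference_tokens):
    """Fraction of unique reference bigrams that appear in the summary."""
    reference = set(zip(reference_tokens, reference_tokens[1:]))
    if not reference:
        return 0.0
    system = set(zip(summary_tokens, summary_tokens[1:]))
    return len(reference & system) / len(reference)
```

For example, the selection `[1, 1, 0, 0, 1]` keeps 60% of the words and contains two switches, so with a threshold of 0.4 the size gap is 0.2.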

Reinforcement Learning
In the following we seek to optimize a policy $P(Y|X)$ for generating extractive summaries so that the expected reward $\mathbb{E}_{P(Y|X)}[R(Y)]$ is maximized. Taking derivatives of this objective with respect to the model parameters $\theta$ involves repeatedly sampling summaries $\hat{Y} = (\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_{|Y|})$ (illustrated in Eq. (6)). In this way reinforcement learning explores the space of extractive summaries of a source document.
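For readers who want the reasoning behind the sampling step, the gradient of an expected reward is commonly estimated with the log-derivative (REINFORCE) identity. The derivation below is a standard sketch under the assumption that the reward does not itself depend on the policy parameters; the number of samples $N$ is our notation, and the form may differ in detail from the authors' equation.

```latex
\nabla_\theta \, \mathbb{E}_{P(Y|X)}\left[ R(Y) \right]
  = \sum_{Y} R(Y) \, \nabla_\theta P(Y|X)
  = \mathbb{E}_{P(Y|X)}\left[ R(Y) \, \nabla_\theta \log P(Y|X) \right]
  \approx \frac{1}{N} \sum_{n=1}^{N} R(\hat{Y}^{(n)}) \,
      \nabla_\theta \log P(\hat{Y}^{(n)}|X),
  \qquad \hat{Y}^{(n)} \sim P(Y|X)
```

The second equality uses $\nabla_\theta P = P \, \nabla_\theta \log P$, which is what turns the sum over all possible summaries into an expectation that can be approximated from samples.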
To calculate $P(Y|X)$ and then sample $\hat{Y}$ from it, we use a bidirectional LSTM to encode a source document into a sequence of vectors: $(h^D_1, h^D_2, \cdots, h^D_{|X|}) = \text{Bi-LSTM}(X)$. Whether to include the $t$-th source word in the summary ($\hat{y}_t$) can thus be decided based on $h^D_t$. However, we also want to accommodate the previous $t-1$ sampling decisions ($\hat{y}_{1:t-1}$) to improve the fluency of the extractive summary. Following Lei et al. (2016), we introduce a single-direction LSTM encoder whose hidden state $s_t$ tracks the sampling decisions up to time step $t$ (Eq. 8). It represents the semantic meaning encoded in the current summary. To sample the $t$-th word, we concatenate the two vectors $[h^D_t \,\|\, s_{t-1}]$ and use the result as input to a feedforward layer with sigmoid activation to estimate $\hat{y}_t \sim P(y_t \mid \hat{y}_{1:t-1}, X)$ (Eq. 7).
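The sequential sampling procedure can be sketched as a loop over source positions. This is a simplified illustration under stated assumptions: the document vectors `H` are taken as given, the feedforward layer is a single weight vector `w` with bias `b`, and the LSTM that tracks past decisions is replaced by a caller-supplied `update_state` function (a stand-in, not the model's actual recurrence).

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sample_summary(H, w, b, update_state, state_size, rng):
    """Sample a 0/1 decision per source word, left to right.

    H            : list of per-word document vectors (lists of floats)
    w, b         : weights and bias of a one-layer sigmoid classifier over
                   the concatenation [h_t || s_{t-1}]
    update_state : stand-in for the decision-tracking encoder; maps
                   (previous state, h_t, y_t) to the next state
    Returns the sampled decisions and their total log-probability.
    """
    state = [0.0] * state_size
    decisions, log_prob = [], 0.0
    for h in H:
        features = h + state  # list concatenation: [h_t || s_{t-1}]
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, features)) + b)
        y_t = 1 if rng.random() < p else 0
        log_prob += math.log(p if y_t == 1 else 1.0 - p)
        state = update_state(state, h, y_t)
        decisions.append(y_t)
    return decisions, log_prob
```

The returned log-probability is the quantity whose gradient a policy-gradient update would scale by the reward of the sampled summary.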

Experiments
All training, validation, and testing was performed using the CNN dataset (Hermann et al., 2015;Nallapati et al., 2016) containing news articles paired with human-written highlights (i.e., abstracts). We observe that a source article contains 29.8 sentences and an abstract contains 3.54 sentences on average. The train/valid/test splits contain 90,266, 1,220, 1,093 articles respectively.

Hyperparameters
The hyperparameters, tuned on the validation set, include the following: the hidden state size of the Bi-LSTM is 256; the hidden state size of the single-direction LSTM encoder is 30. Dropout rate (Srivastava, 2013), used twice in the sampling component, is set to 0.2. The minibatch size is set to 256. We apply early stopping on the validation set, where the maximum number of epochs is set to 50. Our source vocabulary contains 150K words; words not in the vocabulary are replaced by the unk token. We use 100-dimensional word embeddings, which are initialized with GloVe (Pennington et al., 2014) and remain trainable. We set β = 2α and select the best α ∈ {10, 20, 50} and γ ∈ {5, 6, 7, 8} using the validation set (best value underlined). The maximum length of input is set to 100 words; δ is set to 0.4 (≈40 words). We use the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 1e-4 and halve the learning rate if the objective worsens beyond a threshold (> 10%). As mentioned, we utilized a bigram-based pretraining method. We found that this stabilized the training of the full model.
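The learning-rate rule can be read in more than one way, so the following is only one plausible interpretation. It assumes the objective is a positive loss (lower is better) and that "worsens" is measured relative to the previous evaluation; the function name and both assumptions are ours.

```python
def next_learning_rate(lr, prev_objective, curr_objective, tolerance=0.10):
    """Halve the learning rate when the loss rises by more than `tolerance`.

    Assumes a positive loss where lower is better, compared against the
    previous evaluation.
    """
    if prev_objective > 0:
        relative_change = (curr_objective - prev_objective) / prev_objective
        if relative_change > tolerance:
            return lr / 2.0
    return lr
```

With the initial rate of 1e-4, a loss that rises from 1.0 to 1.2 would trigger a halving, while a rise to 1.05 would not.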

Results
We compare our methods with state-of-the-art published systems, including both extractive and abstractive approaches (their details are summarized below). We experiment with two variants of our approach. "EntityQ" uses QA pairs whose answers are named entities. "KeywordQ" uses pairs whose answers are sentence root words. According to the R-1, R-2, and R-L scores (Lin, 2004) presented in Table 2, both methods are superior to the baseline systems on the benchmark dataset, yielding 11.5 and 11.6 R-2 F-scores, respectively.
• LSA (Steinberger and Jezek, 2004) uses the latent semantic analysis technique to identify semantically important sentences.
• LexRank (Erkan and Radev, 2004) is a graph-based approach that computes sentence importance based on the concept of eigenvector centrality in a graph representation of source sentences.
• TextRank (Mihalcea and Tarau, 2004) is an unsupervised graph-based ranking algorithm inspired by the PageRank and HITS algorithms.
• SumBasic (Vanderwende et al., 2007) is an extractive approach that assumes words occurring frequently in a document cluster have a higher chance of being included in the summary.
• KL-Sum (Haghighi and Vanderwende, 2009) is a method that greedily adds sentences to the summary so long as doing so decreases the KL divergence.
• Distraction-M3 (Chen et al., 2016b) trains the summarization model to not only attend to specific regions of input documents, but also distract the attention to traverse different content of the source document.
• Pointer-Generator (See et al., 2017) allows the system to not only copy words from the source text via pointing but also generate novel words through the generator.
• Graph-based Attention (Tan et al., 2017) introduces a graph-based attention mechanism to enhance the encoder-decoder framework.
Table 3: Train/valid accuracy and R-2 F-scores when using varying numbers of QA pairs (K=1 to 5) in the reward function.
In Table 3, we vary the number of QA pairs used per article in the reward function (K=1 to 5). The summaries are encouraged to contain comprehensive content useful for answering all questions. When more QA pairs are used (K1→K5), the number of answer tokens roughly doubles: 23.7K (K1)→50.3K (K5) for entities as answers, and 7.3K→13.7K for keywords. The enlarged answer space has an impact on QA accuracies. When using entities as answers, the training accuracy is 34.8% (K5) and the validation accuracy is 15.4% (K5), a considerable gap between the two. In contrast, the gap is quite small when using keywords as answers (27.5% and 21.9% for K5), suggesting that using sentence root words as answers is a more viable strategy to create QA pairs.
Compared with QA studies (Chen et al., 2016a), we remove the constraint that requires answer entities (or keywords) to reside in the source documents. Adding this constraint improves the QA accuracy for a standard QA system. However, because our system does not perform QA during testing (the question-answer pairs are not available for the test set) but only generates generic summaries, we do not enforce this requirement and report no testing accuracies. We observe that the R-2 scores show only minor changes from K1 to K5. We conjecture that more question-answer pairs do not make the summaries contain more comprehensive content because the input and the summary are relatively short; K=1 yields the best results.
In Table 4, we present example system and reference summaries. Our extractive summaries can be overlaid with the source documents to assist people with browsing through the documents. In this way the summaries stay true to the original and do not contain information that was not in the source documents.

Source Document
It was all set for a fairytale ending for record breaking jockey AP McCoy. In the end it was a different but familiar name who won the Grand National on Saturday.
25-1 outsider Many Clouds, who had shown little form going into the race, won by a length and a half, ridden by jockey Leighton Aspell.
Aspell won last year's Grand National too, making him the first jockey since the 1950s to ride back-to-back winners on different horses.

Reference Summary
25-1 shot Many Clouds wins Grand National
Second win a row for jockey Leighton Aspell
First jockey to win two in a row on different horses since 1950s

... approaches that automatically group selected summary segments into clusters. Each cluster can capture a unique aspect of the document, and clusters of text segments can be color-highlighted. Inspired by the recent work of Narayan et al. (2018), we are also interested in conducting a usability study to test how well the summary highlights can help users quickly answer key questions about the documents. This will provide an alternative strategy for evaluating our proposed method against both extractive and abstractive baselines.

Conclusion
In this paper we explore a new training paradigm for extractive summarization. Our system converts human abstracts to a set of question-answer pairs. We use reinforcement learning to explore the space of extractive summaries and promote summaries that are concise, fluent, and adequate for answering questions. Results show that our approach is effective, surpassing state-of-the-art systems.