An Entity-Driven Framework for Abstractive Summarization

Abstractive summarization systems aim to produce more coherent and concise summaries than their extractive counterparts. Popular neural models have achieved impressive results for single-document summarization, yet their outputs are often incoherent and unfaithful to the input. In this paper, we introduce SENECA, a novel System for ENtity-drivEn Coherent Abstractive summarization framework that leverages entity information to generate informative and coherent abstracts. Our framework takes a two-step approach: (1) an entity-aware content selection module first identifies salient sentences from the input, then (2) an abstract generation module conducts cross-sentence information compression and abstraction to generate the final summary, which is trained with rewards to promote coherence, conciseness, and clarity. The two components are further connected using reinforcement learning. Automatic evaluation shows that our model significantly outperforms previous state-of-the-art based on ROUGE and our proposed coherence measures on New York Times and CNN/Daily Mail datasets. Human judges further rate our system summaries as more informative and coherent than those by popular summarization models.


Introduction
Automatic abstractive summarization carries strong promise for producing concise and coherent summaries to facilitate quick information consumption (Luhn, 1958). Recent progress in neural abstractive summarization has shown end-to-end trained models (Nallapati et al., 2016; Tan et al., 2017a; Celikyilmaz et al., 2018; Kryściński et al., 2018) excelling at producing fluent summaries. Though encouraging, their outputs are frequently found to be unfaithful to the input and lack inter-sentence coherence (Cao et al., 2018; See et al., 2017; Wiseman et al., 2017). These observations suggest that existing methods have difficulty in identifying salient entities and related events in the article (Fan et al., 2018), and that existing model training objectives fail to guide the generation of coherent summaries.

Input Article: . . . Prime Minister Bertie Ahern of Ireland called Sunday for a general election on May 24. [Mr. Ahern] and his centrist party have governed in a coalition government since 1997 . . . Under Irish law, which requires legislative elections every five years, Mr. Ahern had to call elections by midsummer. On Sunday, {he} said he would base his campaign for reelection on his work to strengthen the economy and efforts to revive Northern Ireland's stalled peace process this year. Political analysts said they expected Mr. Ahern's work in Northern Ireland to be an asset . . .

Human Summary: (1) Prime Min Bertie Ahern of Ireland calls for general election on May 24. (2) [He] is required by law to call elections by midsummer. (3) Opinion polls suggest his centrist government is in danger of losing its majority in Parliament because of public disgruntlement about overburdened public services. (4) {Ahern} says he would base his campaign for reelection on his work to strengthen economy and his efforts to revive Northern Ireland's stalled peace process. (5) Analysts expect his work in Northern Ireland to be asset.

Figure 1: Sample summary of an article from the New York Times corpus (Sandhaus, 2008). Mentions of the same entity are colored. The underlined sentence in the article occurs at a relatively earlier position in the summary (2) to improve topical coherence. Mentions in brackets ("[]", "{}") show different ways in which the same entity is referred to in the article and the summary. Detailed explanation is given in §1.
In this paper, we present SENECA, a System for ENtity-drivEn Coherent Abstractive summarization. We argue that entity-based modeling enables enhanced input text interpretation, salient content selection, and coherent summary generation, three major challenges that need to be addressed by single-document summarization systems (Jones et al., 1999). We use a sample summary in Fig. 1 to show entity usage in summarization. Firstly, frequently mentioned entities from the input, along with their contextual information, underscore the salient content of the article (Nenkova, 2008). Secondly, as also discussed in prior work (Barzilay and Lapata, 2008; Siddharthan et al., 2011), patterns of entity distributions and how they are referred to contribute to the coherence and conciseness of the text. For instance, a human writer places the underlined sentence in the input article next to the first sentence in the summary to improve topical coherence, as they are about the same topic ("elections"). Moreover, the human often optimizes for conciseness by referring to entities with pronouns (e.g., "he") or last names (e.g., "Ahern") without losing clarity.

[Figure 2: overview of the two-step framework, illustrated with the input article and summary of Fig. 1.]

We therefore propose a two-step neural abstractive summarization framework to emulate the way humans construct summaries, with the goal of improving both informativeness and coherence of the generated abstracts. As shown in Fig. 2, an entity-aware content selection component first selects important sentences from the input that include references to salient entities. An abstract generation component then produces coherent summaries by conducting cross-sentence information ordering, compression, and revision. Our abstract generator is trained using reinforcement learning with rewards that promote informativeness and optionally boost coherence, conciseness, and clarity of the summary. To the best of our knowledge, we are the first to study coherent abstractive summarization with the inclusion of linguistically-informed rewards. We conduct both automatic and human evaluation on popular news summarization datasets. Experimental results show that our model yields significantly better ROUGE scores than previous state-of-the-art (Gehrmann et al., 2018; Celikyilmaz et al., 2018) as well as higher coherence scores on the New York Times and CNN/Daily Mail datasets. Human subjects also rate our system-generated summaries as more informative and coherent than those of other popular summarization models.

Summarization Framework
In this section, we describe our entity-driven abstractive summarization framework, which follows a two-step approach as shown in Fig. 2. It comprises (1) an entity-aware content selection component that leverages entity guidance to select salient sentences (§2.1), and (2) an abstract generation component (§2.2) that is trained with reinforcement learning to generate coherent and concise summaries (§2.3). Finally, we describe how the two components are connected to further improve the generated summaries (§2.4).

Entity-Aware Content Selection
We design our content selection component to capture the interaction between entity mentions and the input article. Our model learns to identify salient content by aligning entity mentions and their contexts with human summaries. Concretely, we employ two encoders: one learns entity representations by encoding their mention clusters and the other learns sentence representations. A pointer-network-based decoder (Vinyals et al., 2015b) selects a sequence of important sentences by jointly attending to the entities and the input, as depicted in Fig. 3.
Entity Encoder. We run the off-the-shelf coreference resolution system from Stanford CoreNLP (Manning et al., 2014) on the input articles to extract entities, each represented as a cluster of mentions. Specifically, from each input article, we extract the coreferenced entities and construct a mention cluster from all the mentions of each entity in that article. We also treat non-coreferenced entity mentions as singleton mention clusters. Among all these mention clusters, we only consider the salient ones in our experiments. We label clusters as "salient" based on two rules: (1) mention clusters with entities appearing in the first three sentences of the article, and (2) the top k clusters containing the most mentions. We experimented with different values of k and found that k = 6 gives the best set of salient mention clusters, i.e., the one with the highest overlap with entity mentions in the ground-truth summary.
For each mention cluster, we concatenate mentions of the same entity as they occur in the input into one sequence, segmented with special tokens (<MENT>). Finally, we get the entity representation $e_i$ for the i-th entity by encoding each cluster via a temporal convolutional model (Kim, 2014).
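The cluster filtering and mention concatenation described above can be sketched as follows. This is a minimal illustration, not the authors' code: the `(sentence_index, mention_text)` data layout, the function names, and the reading of the two salience rules as a union are all assumptions made for the example.

```python
from typing import List, Tuple

# Assumed layout: a mention is (sentence_index, mention_text); a cluster is
# the list of mentions of one entity. Coreference output is taken as given.
Mention = Tuple[int, str]
Cluster = List[Mention]

MENT = "<MENT>"  # separator token named in the text


def select_salient_clusters(clusters: List[Cluster], k: int = 6,
                            lead_sentences: int = 3) -> List[Cluster]:
    """Keep clusters satisfying either salience rule (union is an assumption)."""
    # Rule 1: the entity is mentioned within the first `lead_sentences` sentences.
    in_lead = {i for i, c in enumerate(clusters)
               if any(sent < lead_sentences for sent, _ in c)}
    # Rule 2: the k clusters with the most mentions (ties keep input order).
    by_size = sorted(range(len(clusters)), key=lambda i: -len(clusters[i]))
    top_k = set(by_size[:k])
    return [clusters[i] for i in sorted(in_lead | top_k)]


def cluster_to_sequence(cluster: Cluster) -> List[str]:
    """Concatenate mentions in document order, separated by <MENT> tokens."""
    tokens: List[str] = []
    for n, (_, text) in enumerate(sorted(cluster, key=lambda m: m[0])):
        if n > 0:
            tokens.append(MENT)
        tokens.extend(text.split())
    return tokens
```

The resulting token sequence is what a convolutional entity encoder would consume; the encoder itself is omitted here.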
Input Article Encoder. For article encoding, we first learn sentence representations $r_j$ by encoding words in the j-th sentence with another temporal convolutional model. Then, we utilize a bidirectional LSTM (biLSTM) to aggregate sentences into a sequence of hidden states $h_j$. Both encoders use a shared word embedding matrix to allow better alignment.
Sentence Selection Decoder. We employ a single-layer unidirectional LSTM with hidden states $s_t$ to recurrently extract salient sentences. At each time step $t$, we first compute an entity context vector $c^e_t$ with an attention mechanism (Bahdanau et al., 2014) over the entity representations $e_i$, where $a^e_t$ are the attention weights. Throughout the paper, $v_*$ and $W_{**}$ denote trainable parameters, and bias terms are omitted for simplicity. We further use a glimpse operation (Vinyals et al., 2015a) to compute a sentence context vector $c_t$ over the sentence hidden states $h_j$, with attention weights $a^h_t$. Finally, sentence extraction probabilities are calculated from both the entity context and the input context, and the sentence $y^l_t$ with the highest probability is selected. The process stops when the model picks the end-of-selection token.
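The decoding step above can be written out in one standard additive-attention parameterization. This is a sketch consistent with the symbols defined in the text ($s_t$, $e_i$, $h_j$, $a^e_t$, $a^h_t$, $c^e_t$, $c_t$); the parameter indices on $v$ and $W$ and the exact inputs to the final scoring function are assumptions, not the paper's confirmed equations.

```latex
\begin{align}
a^e_{t,i} &= \operatorname{softmax}_i\!\big(v_1^{\top}\tanh(W_{11}\, s_t + W_{12}\, e_i)\big),
  & c^e_t &= \sum_i a^e_{t,i}\, e_i \\
a^h_{t,j} &= \operatorname{softmax}_j\!\big(v_2^{\top}\tanh(W_{21}\, s_t + W_{22}\, h_j)\big),
  & c_t &= \sum_j a^h_{t,j}\, h_j \\
p\big(y^l_t = j \mid y^l_{1:t-1}\big) &=
  \operatorname{softmax}_j\!\big(v_3^{\top}\tanh(W_{31}\, c_t + W_{32}\, c^e_t + W_{33}\, h_j)\big)
\end{align}
```

In this form the entity context $c^e_t$ enters only the final scoring step, so the sentence distribution depends jointly on the entities and on the article.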
Selection Label Construction. We train our content selection component with a cross-entropy loss, $-\sum_{(y^l, x) \in D} \log p(y^l \mid x; \theta)$, where $y^l$ are the ground-truth sentence selection labels, $x$ is the input article, and $\theta$ denotes all model parameters.
To acquire training labels for sentence selection, we collect positive sentences in the following way. First, we employ greedy search to select the combination of sentences that maximizes ROUGE-2 F1 (Lin and Hovy, 2003) against the human summary, as described by Zhou et al. (2018). We further include sentences whose ROUGE-L recall is above 0.5 when each is compared with its best-aligned summary sentence. In cases where no sentence is selected, we label the first two sentences of the article as positive. Our combined construction strategy selects an average of 2.96 and 3.18 sentences from New York Times and CNN/Daily Mail articles, respectively.
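The greedy step of the label construction can be sketched as below. This is an illustrative approximation: it uses a simple bigram-overlap ROUGE-2 F1 rather than the official ROUGE toolkit, it omits the additional ROUGE-L recall rule, and the function names are hypothetical.

```python
from collections import Counter
from typing import List


def bigrams(tokens: List[str]) -> Counter:
    return Counter(zip(tokens, tokens[1:]))


def rouge2_f1(candidate: List[str], reference: List[str]) -> float:
    """Simplified ROUGE-2 F1 from clipped bigram overlap (no stemming)."""
    cand, ref = bigrams(candidate), bigrams(reference)
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


def greedy_select(sentences: List[List[str]], summary: List[str]) -> List[int]:
    """Add the sentence that most improves ROUGE-2 F1 until no gain remains."""
    selected: List[int] = []
    best = 0.0
    while True:
        scored = []
        for i in range(len(sentences)):
            if i in selected:
                continue
            chosen = sorted(selected + [i])  # keep article order
            tokens = [tok for j in chosen for tok in sentences[j]]
            scored.append((rouge2_f1(tokens, summary), i))
        if not scored:
            break
        score, idx = max(scored, key=lambda s: (s[0], -s[1]))
        if score <= best:
            break
        best = score
        selected.append(idx)
    if not selected:
        # Fallback from the text: label the first two sentences as positive.
        selected = list(range(min(2, len(sentences))))
    return sorted(selected)
```

Greedy search is a heuristic: it does not guarantee the globally best subset, but it avoids enumerating all sentence combinations.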

Abstract Generation with Reinforcement Learning
Our abstract generation component takes the selected sentences as input and produces the final summary. This abstract generator is a sequence-to-sequence network with attention over input (Bahdanau et al., 2014). The copying mechanism from See et al. (2017) is adopted to allow out-of-vocabulary words to appear in the output.
The abstract generator is first trained with a maximum likelihood (ML) loss, followed by additional training with policy-based reinforcement learning (RL). For ML training, we use the teacher forcing algorithm (Williams and Zipser, 1995) to minimize the negative log-likelihood $-\log p(y \mid x^{ext}; \theta)$ of each reference summary $y$, accumulated over the training set $D$, where $x^{ext}$ are the sentences extracted by our label construction.
Self-Critical Learning. Following Paulus et al. (2017), we use the self-critical training algorithm based on policy gradients to use discrete metrics as RL rewards. At each training step, we generate two summaries: a sampled summary $y^s$, obtained by sampling words from the probability distribution $p(y^s \mid x^{ext}; \theta)$ at each decoding step, and a self-critical baseline summary $\hat{y}$, yielded by greedily selecting words that maximize the output probability at each time step (Rennie et al., 2017). We then calculate the reward of each of the two summaries as the average of its ROUGE-L F1 and ROUGE-2 F1 against the ground-truth summary, and define a policy-gradient loss that weights the log-likelihood of the sampled summary by the reward difference between the sampled and baseline summaries, where $D$ represents the set of sampled summaries paired with extracted input sentences and $N$ represents the total number of sampled summaries.
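A minimal sketch of the self-critical objective follows, assuming the standard form from Rennie et al. (2017), $L = -\frac{1}{N}\sum (R(y^s) - R(\hat{y})) \log p(y^s \mid x^{ext}; \theta)$. The function names and the plain-float interface are illustrative assumptions; a real implementation would operate on differentiable tensors.

```python
from typing import Sequence


def rouge_reward(rouge_l_f1: float, rouge_2_f1: float) -> float:
    """Reward used in the text: average of ROUGE-L F1 and ROUGE-2 F1."""
    return 0.5 * (rouge_l_f1 + rouge_2_f1)


def self_critical_loss(sample_log_probs: Sequence[Sequence[float]],
                       sample_rewards: Sequence[float],
                       greedy_rewards: Sequence[float]) -> float:
    """Average of -(R(y_s) - R(y_hat)) * log p(y_s) over N sampled summaries.

    sample_log_probs[n] holds the per-token log-probabilities of the n-th
    sampled summary; the greedy summary only contributes its reward.
    """
    n = len(sample_rewards)
    total = 0.0
    for token_lps, r_s, r_hat in zip(sample_log_probs, sample_rewards, greedy_rewards):
        seq_log_prob = sum(token_lps)          # log p(y_s | x_ext)
        total += -(r_s - r_hat) * seq_log_prob
    return total / n
```

When the sampled summary beats the greedy baseline, minimizing this loss raises the sample's likelihood; when it does worse, the likelihood is lowered.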

Rewards with Coherence and Linguistic Quality
So far, we have described the two basic components of our SENECA framework. As noted in prior work (Liu et al., 2016), optimizing for an n-gram-based metric like ROUGE does not guarantee improvement in the readability of the generations. We thus augment our framework with additional rewards based on coherence and linguistic quality, as described below.
Entity-Based Coherence Reward ($R_{Coh}$). We use a separately trained coherence model to score summaries and guide our abstract generator to produce more coherent outputs by adding a reward $R_{Coh}$ in the aforementioned RL training process.
The new reward takes the following form: $R(y) = R_{Rouge}(y) + \gamma_{Coh} R_{Coh}(y)$. Here we show how to calculate $R_{Coh}$ to capture both entity distribution patterns and topical continuity. Since summaries are short (e.g., 2.0 sentences on average per summary in the New York Times data), we build our coherence model on top of local coherence estimation for pairwise sentences. We adopt the architecture of the neural coherence model developed by Wu and Hu (2018), but train it with samples that enable coherence modeling based on entity presence and context. Here we briefly describe the model, and refer the readers to the original paper for details.
Given a pair of sentences $S_A$ and $S_B$, convolution layers first transform them into hidden representations, from which a multi-layer perceptron yields a coherence score bounded above by 1. We train the model with a hinge loss over coherent positive samples and incoherent negative samples, where $S_A$ is a target sentence, ($S_A$, $S_B^+$) is a positive pair, and ($S_A$, $S_B^-$) is a negative pair. Note that Wu and Hu (2018) only consider position information for training data construction, i.e., $S_A$ and $S_B^+$ must be adjacent, and $S_A$ and $S_B^-$ are at most 9 sentences apart, with $S_B^-$ randomly picked. We instead introduce two notable features to construct our training data. First, in addition to being adjacent, we further constrain $S_A$ and $S_B^+$ to share at least one coreferred entity, and require that $S_B^-$ does not. Second, since our initial experiments show that a coherence model trained in this manner cannot discern pure repetition of sentences (e.g., simply duplicating words leads to higher coherence), we also pair each target sentence with itself as a negative sample.
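The training-data construction above can be sketched as follows. The sketch assumes each sentence comes with a set of coreference cluster identifiers; the function names, the fixed random seed, and the hinge margin of 1.0 are assumptions for illustration and are not stated in the text.

```python
import random
from typing import List, Optional, Set, Tuple

Triple = Tuple[int, int, int]  # sentence indices of (S_A, S_B_plus, S_B_minus)


def build_triples(entities: List[Set[str]], max_dist: int = 9,
                  rng: Optional[random.Random] = None) -> List[Triple]:
    """entities[i] is the set of coreferred entity ids mentioned in sentence i."""
    rng = rng or random.Random(0)
    triples: List[Triple] = []
    for a in range(len(entities) - 1):
        pos = a + 1
        if not entities[a] & entities[pos]:
            continue  # positive pair must be adjacent AND share an entity
        lo, hi = max(0, a - max_dist), min(len(entities) - 1, a + max_dist)
        # Negative candidates: nearby sentences sharing no entity with S_A.
        candidates = [b for b in range(lo, hi + 1)
                      if b not in (a, pos) and not entities[a] & entities[b]]
        if candidates:
            triples.append((a, pos, rng.choice(candidates)))
        # Repetition negative: the target sentence paired with itself.
        triples.append((a, pos, a))
    return triples


def hinge_loss(score_pos: float, score_neg: float, margin: float = 1.0) -> float:
    """Zero once the positive pair outscores the negative by `margin` (assumed)."""
    return max(0.0, margin - score_pos + score_neg)
```

The self-paired negative is what teaches the scorer that verbatim repetition is not coherence, which entity overlap alone would otherwise reward.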
Finally, since this model outputs pairwise coherence scores, for a summary containing more than two sentences, we use the average of all adjacent sentence pairs' scores as the final summary coherence score. Summaries containing only one sentence receive a coherence score of 0. We also conduct a correlation study to show that average aggregation works reasonably well (details in the Supplementary).
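The aggregation rule is simple enough to state in code. In this sketch `pair_score` stands in for the trained pairwise coherence model; its signature is an assumption.

```python
from typing import Callable, List


def summary_coherence(sentences: List[str],
                      pair_score: Callable[[str, str], float]) -> float:
    """Average the pairwise score over adjacent sentence pairs.

    A summary with fewer than two sentences gets 0, as stated in the text.
    """
    if len(sentences) < 2:
        return 0.0
    scores = [pair_score(a, b) for a, b in zip(sentences, sentences[1:])]
    return sum(scores) / len(scores)
```

Averaging keeps the score comparable across summaries of different lengths, which a sum over pairs would not.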

Linguistic Quality Rewards
We further consider two linguistically-informed rewards to improve summary clarity and conciseness by penalizing (1) improper usage of referential pronouns, and (2) redundancy introduced by non-restrictive appositives and relative clauses.
Pronominal Referential Clarity. Referential pronouns occurring without their antecedents in a summary decrease its readability. For instance, a text in which the pronoun "they" occurs before the entity it refers to is introduced would be less comprehensible. Therefore, at the RL step, we either penalize a summary with a reward of −1 for such improper usage, or give 0 otherwise. In our implementation, we define improper usage as the presence of a third-person pronoun or a possessive pronoun before any noun phrase occurs. The new reward is written as $R(y) = R_{Rouge}(y) + \gamma_{Ref} R_{Ref}(y)$.

Apposition. Next, we consider a reward to teach the model to use apposition and relative clauses minimally, which improves summary conciseness. For this, we focus on non-restrictive appositives and relative clauses, which often carry non-critical information (Conroy et al., 2006; Wang et al., 2013) and can be automatically detected based on comma usage patterns. Specifically, a sentence contains a non-restrictive appositive if (i) it contains two commas, and (ii) the word after the first comma is a possessive pronoun or a determiner (Geva et al., 2019). We penalize a summary with −1 for using non-restrictive appositives and relative clauses, henceforth referred to as apposition, or give 0 otherwise. Similarly, we have the total reward as $R(y) = R_{Rouge}(y) + \gamma_{App} R_{App}(y)$.
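Both linguistic rewards are rule-based, so they can be sketched directly. The sketch assumes the summary is already tokenized and tagged with Penn Treebank part-of-speech tags, approximates "noun phrase" by the first noun tag, and reads "two commas" as "at least two"; these choices and all names are assumptions rather than the authors' implementation.

```python
from typing import List, Tuple

Tagged = List[Tuple[str, str]]  # (token, Penn Treebank POS tag)

THIRD_PERSON = {"he", "she", "it", "they", "him", "her", "them"}


def ref_reward(summary: Tagged) -> int:
    """-1 if a third-person or possessive pronoun precedes the first noun."""
    for token, tag in summary:
        if tag.startswith("NN"):
            return 0  # a noun occurred first, so later pronouns have context
        if tag == "PRP$" or (tag == "PRP" and token.lower() in THIRD_PERSON):
            return -1
    return 0


def has_apposition(sentence: Tagged) -> bool:
    """Comma heuristic: >= 2 commas and a PRP$/DT right after the first one."""
    commas = [i for i, (tok, _) in enumerate(sentence) if tok == ","]
    if len(commas) < 2:
        return False
    nxt = commas[0] + 1
    return nxt < len(sentence) and sentence[nxt][1] in ("PRP$", "DT")


def app_reward(sentences: List[Tagged]) -> int:
    """-1 if any sentence of the summary matches the apposition heuristic."""
    return -1 if any(has_apposition(s) for s in sentences) else 0


def total_reward(r_rouge: float, r_extra: int, gamma: float = 0.005) -> float:
    """R(y) = R_Rouge(y) + gamma * R_extra(y), for R_Ref or R_App."""
    return r_rouge + gamma * r_extra
```

Because each reward is either −1 or 0 and is scaled by a small weight, it nudges the generator away from the flagged patterns without overwhelming the ROUGE signal.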

Connecting Selection and Abstraction
Our entity-aware content selection component extracts salient sentences whereas our abstract generation component compresses and paraphrases them. Until this point, they are trained separately without any form of parameter sharing. We add an additional step to connect these two networks by training them together via the self-critical learning algorithm based on policy gradient (the same methodology as in §2.2).
Following the Markov Decision Process formulation, at each time step t, our content selector generates a set of extracted sentences ($\hat{x}^{ext}$) from an input article. Our abstract generator uses $\hat{x}^{ext}$ to generate a summary. This summary, evaluated against the respective human summary, receives ROUGE-1 as reward (see Eq. (7)). Note that the abstract generator, which has been previously trained with the average of ROUGE-L and ROUGE-2 as reward to promote fluency, is not updated during this step. In this extra stage, if our content selector accurately selects salient sentences, the abstract generator is more likely to produce a high-quality summary, and such an action will be encouraged. Conversely, actions resulting in inferior selections will be discouraged.
For training our coherence model for CNN/DM, we used 890,419 triples constructed from summaries and input articles sampled from the CNN/DM training set. Similarly, for NYT, we sampled 884,494 triples from the NYT training set. For the validation and test sets of the two models, we sampled roughly 10% from the validation and test sets of the respective datasets. Our coherence model achieves 86% accuracy for CNN/DM and 84% for NYT. Additional evaluation for this model is reported in §4.1.
Training Details and Parameters. We used a vocabulary of the 50K most common words in the training set (See et al., 2017), with 128-dimensional word embeddings randomly initialized and updated during training. In the content selection component, for both entity and sentence encoders, we implemented a one-layer convolutional network with 100 dimensions and used a shared embedding matrix between the two. We employed LSTM models with 256-dimensional hidden states for the input article encoder (per direction) and the content selection decoder (Chen and Bansal, 2018). We used a similar setup for the abstract generator encoder and decoder. During ML training of both components, Adam (Kingma and Ba, 2015) is applied with a learning rate of 0.001, gradient clipping of 2.0, and a batch size of 32. During the RL stage, we reduced the learning rate to 0.0001 (Paulus et al., 2017) and set the batch size to 50. For our abstract generator, to reduce variance during RL training, we sampled 5 sequences for each data point and took an average over these samples using a batch size of 10. We set $\gamma_{Coh} = 0.01$, $\gamma_{Ref} = 0.005$, and $\gamma_{App} = 0.005$ for NYT and CNN/DM with grid search on the validation set.
During inference, we adopted the trigram repetition avoidance strategy (Paulus et al., 2017; Chen and Bansal, 2018), with additional length normalization to encourage the generation of longer sequences as in Gehrmann et al. (2018). We show comparison models' results as reported in the original publications. For significance tests, we run the code released by the authors of POINTGEN+COV and SENTREWRITE, and our own implementation of DEEPREINFORCE, on both datasets for training and testing. Since we do not have access to model outputs from Paulus et al. (2017), we re-implement their model and achieve comparable ROUGE scores (e.g., on NYT, our ROUGE-1, 2, and L are 46.61, 29.76, and 43.46). For BOTTOMUP, we obtain model outputs from the authors for both CNN/DM and NYT datasets, whereas for DCA, we acquire summaries only for the CNN/DM dataset.
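The trigram repetition avoidance mentioned above amounts to a check applied to each candidate token during decoding. A minimal sketch, with a hypothetical function name and the length normalization left out:

```python
from typing import List


def repeats_trigram(prefix: List[str], candidate: str) -> bool:
    """True if appending `candidate` would repeat a trigram already in `prefix`."""
    if len(prefix) < 2:
        return False
    new_trigram = (prefix[-2], prefix[-1], candidate)
    seen = set(zip(prefix, prefix[1:], prefix[2:]))
    return new_trigram in seen
```

During beam search, a hypothesis whose next token makes this check return True would be discarded or have its score set to negative infinity.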
In addition to SENECA base model, which combines entity-aware content selection and RL-based abstract generation (with average of ROUGE-L F1 and ROUGE-2 F1 as reward), we also report results on its variants with additional rewards during abstract generator training. We further consider SENECA (i) without entities, and (ii) end-to-end trained but with sentence selection only, i.e., the abstract generator simply repeats the input.

Results
In §4.1, we first evaluate our entity-based coherence model, which produces the coherence reward ($R_{Coh}$). We then present automatic evaluation for summarization models on content, coherence, and linguistic quality (§4.2). We further discuss human evaluation results in §4.3.

Coherence Model Evaluation
We evaluate our entity-based coherence model on two tasks constructed from the NYT test set: PAIRWISE and SHUFFLE. PAIRWISE is constructed as described in §2.3, with an equal number of positive and negative pairs. SHUFFLE comprises full summaries, where half are original summaries and the other half are their shuffled versions.
In Fig. 4, we show a comparison of our model against a version trained based on the same amount of samples constructed as done by Wu and Hu (2018). Also shown are two baselines: ECHAIN, that labels a pair of sentences as more coherent if they have one or more entity mentions coreferred, and COSSIM, that labels a pair of sentences with higher cosine similarity as more coherent. Our model yields significantly higher accuracy (greater than 80%) on both tasks than the comparisons, which is due to the difference in training data construction. Wu and Hu (2018) only consider position information, whereas we capture entity-based coherence. The improvement in performance of the coherence model indicates the effectiveness of our training data construction in capturing multiple aspects of coherence.
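The two baselines can be sketched in a few lines. The text does not state how COSSIM represents sentences, so the bag-of-words vectors below are an assumption, as are the function names.

```python
import math
from collections import Counter
from typing import List, Set


def cosine_similarity(a: List[str], b: List[str]) -> float:
    """Cosine similarity between bag-of-words count vectors (assumed)."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def cossim_prefers(target: List[str], cand_1: List[str], cand_2: List[str]) -> int:
    """COSSIM: pick the candidate (1 or 2) more similar to the target."""
    return 1 if cosine_similarity(target, cand_1) >= cosine_similarity(target, cand_2) else 2


def echain_coherent(entities_a: Set[str], entities_b: Set[str]) -> bool:
    """ECHAIN: a pair is coherent if the sentences share a coreferred entity."""
    return bool(entities_a & entities_b)
```

Note that COSSIM scores a sentence paired with itself as maximally similar, which is the repetition failure the self-paired negatives in the training data are meant to address.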

Automatic Summary Evaluation
Results on NYT. We first report the new state-of-the-art results for ROUGE-2 and ROUGE-L (Lin and Hovy, 2003), where our models outperform the previous best performing model DCA. Our SENECA models also outperform all comparisons on coherence score. This indicates that our models' summaries not only contain more salient information but are also more coherent.

Table 1: Results on NYT. Best results per metric are in bold. Best of our models are in italics. SENECA yields significantly higher R-1, R-2, and R-L than all comparisons except for BOTTOMUP and DCA (see footnote 2) (approximate randomization test, p < 0.0005; Edgington, 1969). For COH., our best model is also significantly better (Welch's t-test, p < 0.05). *: scores taken from original paper. †: significance test done on outputs by running code released by their authors or our implementation.
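For readers unfamiliar with the approximate randomization test cited in the table caption, the following sketch shows its common paired, swap-based form. The test statistic (difference in mean per-document score), the number of trials, and the seed are assumptions; the text does not specify which variant was used.

```python
import random
from typing import Sequence


def approximate_randomization(scores_a: Sequence[float],
                              scores_b: Sequence[float],
                              trials: int = 10000, seed: int = 0) -> float:
    """Two-sided p-value for the difference in mean per-document scores."""
    rng = random.Random(seed)
    n = len(scores_a)
    observed = abs(sum(scores_a) - sum(scores_b)) / n
    at_least = 0
    for _ in range(trials):
        diff = 0.0
        for a, b in zip(scores_a, scores_b):
            if rng.random() < 0.5:   # randomly swap the two systems' outputs
                a, b = b, a
            diff += a - b
        if abs(diff) / n >= observed:
            at_least += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least + 1) / (trials + 1)
```

A small p-value means that randomly relabeling which system produced which score rarely yields a gap as large as the observed one.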
Amongst our models, the base SENECA model reports higher ROUGE and coherence scores (0.73) than the version without entities as input. This demonstrates the importance of entity guidance during content selection as well as abstract generation. The SENECA model trained with the Apposition reward ($R_{App}$) reports the highest ROUGE-L (44.60), but shows a drop in the coherence score to 0.69. In contrast, SENECA with all rewards ($R_{Coh} + R_{Ref} + R_{App}$) reports the highest coherence score of 0.76 and a drop in ROUGE-L (44.01).
Results on CNN/DM. Since CNN/DM summaries are more extractive than those of NYT (Grusky et al., 2018), all SENECA models produce higher ROUGE-1 scores, with the SENECA base model achieving the highest, outperforming the previous best performing models on ROUGE-1. We also report the highest coherence score (0.63), which is even higher than that reported on human summaries (0.55). Since CNN/DM gold summaries consist of concatenated human-written highlights for input articles, they are about the same topic and are cohesive, but lack entity-based coherent structure; for instance, fewer entities get coreferred in subsequent sentences. Therefore, our coherence evaluation, which tests for entity-based coherence, gives a lower coherence score to CNN/DM gold summaries. Additionally, for CNN/DM, the generated summaries sometimes contain less relevant words, e.g., stopwords, at the end of the summaries. This effect improves the ROUGE scores while adversely affecting the coherence scores of the summaries. Training with the additional coherence reward alleviates the issue by appending additional sentences, thereby improving overall coherence.

Footnote 2: Significance test was not performed with BOTTOMUP, as they use a different test split than Paulus et al. (2017), nor with DCA, since their outputs are unavailable on NYT.

Table 2: Results on CNN/Daily Mail. Our best results are achieved by the variant with sentence selection only, yielding significantly better R-1 and R-2 than all comparisons except BOTTOMUP and DCA (Welch's t-test, p < 0.05). For COH., our best model is also significantly better (Welch's t-test, p < 0.05). We reevaluate DCA output against full-length human summaries. *: scores taken from original paper. †: significance test done on outputs by running code released by their authors or our implementation.
Amongst our models, we observe the overall trend to be similar to that in NYT results. Our base SENECA model reports higher ROUGE and coherence score compared to SENECA without entity input, again, indicating the usefulness of entity information. SENECA model with all rewards also yields the highest coherence score of 0.63, whereas SENECA model with coherence reward performs considerably better on ROUGE-L with a drop in coherence score.
Linguistic Quality Evaluation. We further conduct a preliminary evaluation of system summaries on two important linguistic qualities: clarity and conciseness. To measure clarity, we focus on the percentage of summaries that improperly use referential pronouns (Ref.), defined as third-person pronouns or possessive pronouns being used before any noun phrase. Similarly, to measure conciseness, we report how often summaries contain at least one non-restrictive relative clause (RelCl.) or non-restrictive appositive (App.). For model summaries, we report these measures in reference to the respective gold summaries. Lower values are better.
As can be seen from Table 3, our models report the lowest percentage of improper usage of referential pronouns. Particularly on NYT, the model trained with the $R_{Ref}$ reward makes far fewer mistakes in this category. Similarly, the model trained with $R_{App}$ uses relative clauses and apposition the least, making summaries easier to read.

Human Evaluation
Human evaluation is conducted to analyze the informativeness and readability of the summaries generated by our models. We randomly select 30 articles from the NYT test set and ask three proficient English speakers to rate summaries generated by POINTGEN+COV, DEEPREINFORCE (Paulus et al., 2017), our SENECA, and SENECA + $R_{Coh}$, along with HUMAN-written summaries. Each rater reads the article and scores the summaries against each other on a Likert scale of 1 (worst) to 5 (best) on the following three aspects: informativeness (whether the summary covers salient points from the input), grammaticality, and coherence (whether the summary presents content and entity mentions in a coherent order). Detailed guidelines with sample ratings and explanations are shown in the Supplementary. Table 4 shows that our model SENECA + $R_{Coh}$ ranks significantly higher on informativeness as well as coherence, reaffirming our observations from automatic evaluation. Surprisingly, SENECA + $R_{Coh}$ ranks higher on informativeness even when compared to SENECA, which scores higher on ROUGE (see Table 1). Through manual inspection, we find that the SENECA model often generates summaries whose second or third sentence misses the subject, whereas SENECA + $R_{Coh}$ tends to avoid this problem. Though the difference is not significant, POINTGEN+COV ranks higher on grammaticality than SENECA + $R_{Coh}$. We believe this is because SENECA + $R_{Coh}$ learns to merge sentences from the input article while making some grammatical errors. We further show sample summaries in Figure 5.

Related Work
Neural Abstractive Summarization. Two-step abstractive summarization approaches have become popular in recent years, where the two steps, content selection and abstraction, are conveniently separated from each other. In these approaches, salient content is first identified, usually at sentence-level (Hsu et al., 2018; Tan et al., 2017b; Chen and Bansal, 2018) or phrase-level (Gehrmann et al., 2018), followed by abstract generation. However, prior work mainly focuses on improving the informativeness of abstractive summaries, e.g. copying and coverage mechanisms (See et al., 2017), and reinforcement learning methods optimizing on ROUGE scores (Paulus et al., 2017). Coherence and other aspects of linguistic quality that capture the overall readability of summaries are largely ignored. In this work, besides informativeness, we also aim to improve upon these aspects of summaries.

H: New Jersey Legislature recommends 0 ways to overhaul system that has produced highest property taxes in nation; plan includes 0 percent reduction in property taxes to most homeowners through direct tax credits; will place annual limit on property tax increases; will revise financing of education and end special financing of state's poor districts
PointGen+cov: new jersey legislature, after more than three decades of complaints about soaring property taxes and three months of hearings about ways to reduce them, designed to overhaul system that has produced highest property taxes in nation. recommendations included 0 percent reduction in property taxes to most of state's homeowners in form of direct tax credits
DeepReinforce: new jersey legislature, after more than three decades of complaints about soaring property taxes and three months of hearings about ways to reduce them. unveils 0 proposals designed to overhaul system that has highest property taxes in nation. recommendations include 0 percent reduction in property taxes to most of state's and of direct tax credits
SENECA: new jersey legislature unveils 0 proposals designed to overhaul system that has produced highest property taxes in nation. recommendations included 0 percent reduction in property taxes to most of state's homeowners in form of direct tax credits
SENECA + R_Coh: new jersey legislature, unveiled 0 proposals designed to overhaul system that has highest property taxes in nation. recommendations included 0 percent reduction in property taxes to most of state's homeowners in form of direct tax credits. other parts of plan would place limit on annual property tax increases and revise way state pays for public education, ending special financing given to state's poor districts
Figure 5: Sample summaries for an NYT article. Our model with coherence reward overlaps the most with human summary (green is ours, blue denotes human).
Role of Entities and Coherence in Summarization. Entities in a text carry useful contextual information (Nenkova, 2008) and therefore play an important role in multi-document summarization (Li et al., 2006) and event summarization for selecting salient sentences (Li et al., 2015). Moreover, entity mentions connecting sentences have also been used to extract non-adjacent yet coherent sentences (Siddharthan et al., 2011; Parveen et al., 2016). For abstractive summarization, Amplayo et al. (2018) find it beneficial to leverage entities that are linked to existing knowledge bases. Unfortunately, this approach fails to capture entities that do not exist in these knowledge bases.
Grammar role-based entity transitions have been widely employed to model coherence in text generation tasks (Barzilay and Lee, 2004; Lapata and Barzilay, 2005; Barzilay and Lapata, 2008; Guinaudeau and Strube, 2013; Tien Nguyen and Joty, 2017), which often require intensive feature engineering. Neural coherence models (Mesgar and Strube, 2018; Li and Hovy, 2014) have, therefore, gained popularity due to their end-to-end nature. However, coherence has mainly been investigated in extractive summarization systems (Alonso i Alemany and Fuentes Fort, 2003; Christensen et al., 2013; Parveen et al., 2015; Wu and Hu, 2018). To the best of our knowledge, we are the first to leverage entity information to improve coherence for neural abstractive summarization along with other important linguistic qualities.

Conclusion
We present an entity-driven summarization framework to generate informative and coherent abstractive summaries. An entity-aware content selector chooses salient sentences from the input article and an abstract generator produces a coherent abstract. Linguistically-informed guidance further enhances conciseness and clarity, thus improving the summary quality. Our model obtains the new state-of-the-art ROUGE scores on the NYT and CNN/DM datasets. Human evaluation further indicates that our system produces more coherent summaries than other popular methods.