Learning-Based Single-Document Summarization with Compression and Anaphoricity Constraints

We present a discriminative model for single-document summarization that integrally combines compression and anaphoricity constraints. Our model selects textual units to include in the summary based on a rich set of sparse features whose weights are learned on a large corpus. We allow for the deletion of content within a sentence when that deletion is licensed by compression rules; in our framework, these are implemented as dependencies between subsentential units of text. Anaphoricity constraints then improve cross-sentence coherence by guaranteeing that, for each pronoun included in the summary, the pronoun's antecedent is included as well or the pronoun is rewritten as a full mention. When trained end-to-end, our final system outperforms prior work both on ROUGE and on human judgments of linguistic quality.


Introduction
While multi-document summarization is well-studied in the NLP literature (Carbonell and Goldstein, 1998; Gillick and Favre, 2009; Lin and Bilmes, 2011; Nenkova and McKeown, 2011), single-document summarization (McKeown et al., 1995; Marcu, 1998; Mani, 2001; Hirao et al., 2013) has received less attention in recent years and is generally viewed as more difficult. Content selection is tricky without redundancy across multiple input documents as a guide, and simple positional information is often hard to beat (Penn and Zhu, 2008). In this work, we tackle the single-document problem by training an expressive summarization model on a large naturally occurring corpus, the New York Times Annotated Corpus (Sandhaus, 2008), which contains around 100,000 news articles with abstractive summaries, and learning to select important content with lexical features. This corpus has been explored in related contexts (Dunietz and Gillick, 2014; Hong and Nenkova, 2014), but to our knowledge it has not been directly used for single-document summarization.
To increase the expressive capacity of our model, we allow more aggressive compression of individual sentences by combining two different formalisms, one syntactic and the other discursive. Additionally, we incorporate a model of anaphora resolution and give our system the ability to rewrite pronominal mentions, further increasing expressivity. In order to guide the model, we incorporate (1) constraints from coreference ensuring that critical pronoun references are clear in the final summary and (2) constraints from syntactic and discourse parsers ensuring that sentence realizations are well-formed. Despite the complexity of these additional constraints, we demonstrate an efficient inference procedure using an ILP-based approach. By training our full system end-to-end on a large-scale dataset, we are able to learn a high-capacity structured model of the summarization process, contrasting with past approaches to the single-document task, which have typically been heuristic in nature (Daumé and Marcu, 2002; Hirao et al., 2013).
We focus our evaluation on the New York Times Annotated corpus (Sandhaus, 2008). According to ROUGE, our system outperforms a document prefix baseline, a bigram coverage baseline adapted from a strong multi-document system (Gillick and Favre, 2009), and a discourse-informed method from prior work (Yoshida et al., 2014). Imposing discursive and referential constraints improves human judgments of linguistic clarity and referential structure, outperforming the method of Yoshida et al. (2014) and approaching the clarity of a sentence-extractive baseline, while still achieving a substantially higher ROUGE score than either method. These results indicate that our model has the expressive capacity to extract important content, but is sufficiently constrained to ensure fluency is not sacrificed as a result.
Figure 1: [...] $x^{\mathrm{UNIT}}$ subject to a length constraint. These textual units $u$ are scored with weights $w$ and features $f$. Next, we add constraints derived from both syntactic parses and Rhetorical Structure Theory (RST) to enforce grammaticality. Finally, we add anaphora constraints derived from coreference in order to improve summary coherence. We introduce additional binary variables $x^{\mathrm{REF}}$ that control whether each pronoun is replaced with its antecedent using a candidate replacement $r_{ij}$. These are also scored in the objective and are incorporated into the length constraint.
Past work has explored various kinds of structure for summarization. Some work has focused on improving content selection using discourse structure (Louis et al., 2010; Hirao et al., 2013), topical structure (Barzilay and Lee, 2004), or related techniques (Mithun and Kosseim, 2011). Other work has used structure primarily to reorder summaries and ensure coherence (Barzilay et al., 2001; Barzilay and Lapata, 2008; Louis and Nenkova, 2012; Christensen et al., 2013) or to represent content for sentence fusion or abstraction (Thadani and McKeown, 2013; Pighin et al., 2014).
Similar to these approaches, we appeal to structures from upstream NLP tasks (syntactic parsing, RST parsing, and coreference) to restrict our model's capacity to generate. However, we go further by optimizing for ROUGE subject to these constraints with end-to-end learning.

Model
Our model is shown in Figure 1. Broadly, our ILP takes a set of textual units $u = (u_1, \ldots, u_n)$ from a document and finds the highest-scoring extractive summary by optimizing over variables $x^{\mathrm{UNIT}} = (x^{\mathrm{UNIT}}_1, \ldots, x^{\mathrm{UNIT}}_n)$, which are binary indicators of whether each unit is included. Textual units are contiguous parts of sentences that serve as the fundamental units of extraction in our model. For a sentence-extractive model, these would be entire sentences, but for our compressive models we will have more fine-grained units, as shown in Figure 2 and described in Section 2.1. Textual units are scored according to features $f$ and model parameters $w$ learned on training data. Finally, the extraction process is subject to a length constraint of $k$ words. This approach is similar in spirit to ILP formulations of multi-document summarization systems, though in those systems content is typically modeled in terms of bigrams (Gillick and Favre, 2009; Berg-Kirkpatrick et al., 2011; Hong and Nenkova, 2014; Li et al., 2015). For our model, type-level n-gram scoring only arises when we compute our loss function in max-margin training (see Section 3).
In Section 2.1, we discuss grammaticality constraints, which take the form of introducing dependencies between textual units, as shown in Figure 2. If one textual unit requires another, it cannot be included unless its prerequisite is. We will show that different sets of requirements can capture both syntactic and discourse-based compression schemes.
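To make the interaction between the length budget and the requirement relations concrete, the following is a minimal sketch that solves a toy instance by exhaustive enumeration. It is not the ILP solver setup used in this work; the function name `best_summary`, the toy scores, and the toy lengths are all hypothetical, and the scores stand in for the learned quantities $w^\top f(u_i)$.

```python
from itertools import product

def best_summary(scores, lengths, requires, budget):
    """Exhaustively search inclusion vectors x in {0,1}^n (toy sizes only).

    scores[i]   -- stand-in for w . f(u_i), the learned score of unit i
    lengths[i]  -- number of words in unit i
    requires    -- set of (i, j) pairs meaning "unit i requires unit j"
    budget      -- length limit k in words
    """
    best_x, best_score = None, float("-inf")
    for x in product((0, 1), repeat=len(scores)):
        # Length constraint: included units must fit within the budget.
        if sum(l for l, xi in zip(lengths, x) if xi) > budget:
            continue
        # Requirement constraint: x_i <= x_j whenever u_i requires u_j.
        if any(x[i] > x[j] for i, j in requires):
            continue
        score = sum(s for s, xi in zip(scores, x) if xi)
        if score > best_score:
            best_x, best_score = x, score
    return best_x, best_score

# Hypothetical instance: unit 1 requires unit 0; units 0 and 2 require each other.
scores = [1.0, 2.5, -0.5, 2.0]
lengths = [4, 3, 2, 6]
requires = {(1, 0), (0, 2), (2, 0)}
x, total = best_summary(scores, lengths, requires, budget=10)
```

In this toy instance the high-scoring unit 1 can only be selected together with units 0 and 2, so the search has to weigh the cost of carrying the negatively scored unit 2 against selecting unit 3 alone. Enumeration is exponential in the number of units; a real system would hand the same objective and constraints to an ILP solver.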
Furthermore, we introduce anaphora constraints (Section 2.2) via a new set of variables that capture the process of rewriting pronouns to make them explicit mentions. That is, $x^{\mathrm{REF}}_{ij} = 1$ if we should rewrite the $j$th pronoun in the $i$th unit with its antecedent. These pronoun rewrites are scored in the objective and introduced into the length constraint to make sure they do not cause our summary to be too long. Finally, constraints on these variables control when they are used and also require the model to include antecedents of pronouns when the model is not confident enough to rewrite them.
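Putting these pieces together, the optimization problem can be written compactly. The following is a sketch reconstructed from the prose description in this section, not a transcription of Figure 1. Two details are assumptions introduced here: the length contribution $|r_{ij}| - 1$ of a replacement (on the reasoning that it swaps out a single pronoun token) and the index $h$ used for a required unit. The constraints that tie $x^{\mathrm{REF}}$ to the surrounding context are described in Section 2.2 and are omitted.

```latex
\begin{align*}
\max_{x^{\mathrm{UNIT}},\, x^{\mathrm{REF}}} \quad
  & \sum_{i} x^{\mathrm{UNIT}}_{i}\, w^{\top} f(u_i)
    + \sum_{(i,j)} x^{\mathrm{REF}}_{ij}\, w^{\top} f(r_{ij}) \\
\text{subject to} \quad
  & \sum_{i} x^{\mathrm{UNIT}}_{i}\, |u_i|
    + \sum_{(i,j)} x^{\mathrm{REF}}_{ij}\, \bigl(|r_{ij}| - 1\bigr) \le k
    && \text{(length)} \\
  & x^{\mathrm{UNIT}}_{i} \le x^{\mathrm{UNIT}}_{h}
    && \text{whenever } u_i \text{ requires } u_h \\
  & x^{\mathrm{UNIT}}_{i},\; x^{\mathrm{REF}}_{ij} \in \{0, 1\}
\end{align*}
```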

Grammaticality Constraints
Following work on isolated sentence compression (McDonald, 2006; Clarke and Lapata, 2008) and compressive summarization (Lin, 2003; Martins and Smith, 2009; Berg-Kirkpatrick et al., 2011; Woodsend and Lapata, 2012; Almeida and Martins, 2013), we wish to be able to compress sentences so we can pack more information into a summary. During training, our model learns how to take advantage of available compression options and select content to match human-generated summaries as closely as possible. (The features in our model are actually rich enough to learn a sophisticated compression model, but the data we have, namely abstractive summaries, does not directly provide examples of correct compressions; past work has gotten around this with multi-task learning (Almeida and Martins, 2013), but we simply treat grammaticality as a constraint from upstream models.) We explore two ways of deriving units for compression: the RST-based compressions of Hirao et al. (2013) and the syntactic compressions of Berg-Kirkpatrick et al. (2011).

RST compressions
Figure 2a shows how to derive compressions from Rhetorical Structure Theory (Mann and Thompson, 1988; Carlson et al., 2001). We show a sentence broken into elementary discourse units (EDUs) with RST relations between them. Units marked as SAME-UNIT must both be kept or both be deleted, but other nodes in the tree structure can be deleted as long as we do not delete the parent of an included node. For example, we can delete the ELABORATION clause, but we can delete neither the first nor the last EDU. Arrows depict the constraints this gives rise to in the ILP (see Figure 1): $u_2$ requires $u_1$, and $u_1$ and $u_3$ mutually require each other. This is a more constrained form of compression than was used in past work (Hirao et al., 2013), but we find that it improves human judgments of fluency (Section 4.3).
Figure 2b shows two examples of compressions arising from syntactic patterns (Berg-Kirkpatrick et al., 2011): deletion of the second part of a coordinated NP and deletion of a PP modifier to an NP. These patterns were curated to leave sentences grammatical after being compressed, though perhaps with damaged semantic content.
Figure 2c shows the textual units and requirement relations yielded by combining these two types of compression. On this example, the two schemes capture orthogonal compressions, and more generally we find that they stack to give better results for our final system (see Section 4.3). To actually synthesize textual units and the constraints between them, we start from the set of RST textual units and introduce syntactic compressions as new children when they don't cross existing brackets; because syntactic compressions are typically narrower in scope, they are usually completely contained in EDUs. Figure 2d shows an example of this process: the possible deletion of "with Aetna" is grafted onto the textual unit and appropriate requirement relations are introduced. The net effect is that the textual unit is wholly included, partially included (with "with Aetna" removed), or not included at all.
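The RST-derived requirement relations described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the function names and the dictionary-based tree encoding are hypothetical, and units are referred to by their integer index.

```python
def rst_requirements(parent, same_unit):
    """Build the set of (i, j) pairs meaning "unit i requires unit j".

    parent     -- dict mapping each non-root unit to its parent unit
    same_unit  -- iterable of (a, b) pairs that must be kept or dropped together
    """
    requires = set()
    for child, par in parent.items():
        # A child may not appear in the summary without its parent.
        requires.add((child, par))
    for a, b in same_unit:
        # SAME-UNIT is a requirement in both directions.
        requires.add((a, b))
        requires.add((b, a))
    return requires

def is_valid(selection, requires):
    """True if every included unit has all of its prerequisites included."""
    return all(j in selection for i, j in requires if i in selection)

# Example from the text: u2 (the ELABORATION) attaches to u1,
# and u1 and u3 are marked SAME-UNIT.
reqs = rst_requirements(parent={2: 1}, same_unit=[(1, 3)])
```

With these relations, keeping units 1 and 3 while dropping the elaboration is valid, but keeping the elaboration alone, or the first EDU without the last, is not.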

Combined compressions
Formally, we define an RST tree as $T_{rst} = (S_{rst}, \pi_{rst})$, where $S_{rst}$ is a set of EDU spans $(i, j)$ and $\pi_{rst} : S_{rst} \to 2^{S_{rst}}$ is a mapping from each EDU span to the EDU spans it depends on. Syntactic compressions can be expressed in a similar way with trees $T_{syn}$. These compressions are typically smaller-scale than EDU-based compressions, so we use the following modification scheme. Denote by $T_{syn(kl)}$ a nontrivial subtree of $T_{syn}$ (one that supports some compression) that is completely contained in an EDU $(i, j)$. We build a combined compression tree, which we refer to as the augmentation of $T_{rst}$ with $T_{syn(kl)}$, as follows: we maintain the existing tree structure except for the EDU $(i, j)$, which is broken into three parts. The outer two depend on each other ("is a claims adjuster" and "." from Figure 2d), and the inner one depends on the others and preserves the tree structure from $T_{syn}$. We augment $T_{rst}$ with all maximal subtrees of $T_{syn}$, i.e. all trees that are not contained in other trees that are used in the augmentation process. This is broadly similar to the combined compression scheme in Kikuchi et al. (2014), but we use a different set of constraints that more strictly enforce grammaticality.
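The augmentation step can be sketched as a small function over spans and dependency maps. This is a hedged illustration of the prose description, not the paper's formal definition. Several choices are assumptions made here: spans are half-open token ranges, the syntactic subtree lies strictly inside the EDU ($i < k < l < j$), and any unit that previously depended on the EDU is re-pointed at its left piece. The name `augment` and its signature are hypothetical.

```python
def augment(spans, deps, edu, syn_spans, syn_deps, syn_root):
    """Graft one syntactic compression subtree into the EDU that contains it.

    spans, deps         -- RST spans and a dict: span -> set of spans it depends on
    edu                 -- the EDU span (i, j) containing the subtree
    syn_spans, syn_deps -- spans and dependencies of the syntactic subtree
    syn_root            -- the subtree's outermost span (k, l), with i < k < l < j
    """
    (i, j), (k, l) = edu, syn_root
    left, right = (i, k), (l, j)  # the two outer pieces of the old EDU
    new_spans = (set(spans) - {edu}) | {left, right} | set(syn_spans)

    def remap(span):
        # Assumption: anything that pointed at the EDU now points at its left piece.
        return left if span == edu else span

    new_deps = {}
    for span, required in deps.items():
        if span != edu:
            new_deps[span] = {remap(r) for r in required}
    # The outer pieces inherit the EDU's dependencies and require each other.
    new_deps[left] = {remap(r) for r in deps.get(edu, set())} | {right}
    new_deps[right] = {left}
    # The inner subtree keeps its own structure and hangs off both outer pieces.
    for span in syn_spans:
        new_deps[span] = set(syn_deps.get(span, set()))
    new_deps[syn_root] = new_deps[syn_root] | {left, right}
    return new_spans, new_deps

# Hypothetical token offsets for one EDU with a deletable PP at positions 5-7.
new_spans, new_deps = augment(
    spans={(0, 8)}, deps={(0, 8): set()}, edu=(0, 8),
    syn_spans={(5, 7)}, syn_deps={}, syn_root=(5, 7),
)
```

The result has three units in place of the original EDU: the two outer pieces are tied together, and the inner piece can be dropped independently, which gives the wholly included, partially included, and excluded outcomes described in the text.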

Anaphora Constraints
What kind of cross-sentential coherence do we need to ensure for the kinds of summaries our system produces? Many notions of coherence are useful, including centering theory (Grosz et al., 1995) and lexical cohesion (Nishikawa et al., 2014), but one of the most pressing phenomena to deal with is pronoun anaphora (Clarke and Lapata, 2010). Cases of pronouns being "orphaned" during extraction (their antecedents are deleted) are relatively common: they occur in roughly 60% of examples produced by our summarizer when no anaphora constraints are enforced. This kind of error is particularly concerning for summary interpretation and impedes the ability of summaries to convey information effectively (Grice, 1975). Our solution is to explicitly impose constraints on the model based on pronoun anaphora resolution. Figure 3 shows an example of a problem case. If we extract only the second textual unit shown, the pronoun "it" will lose its antecedent, which in this case is "Kellogg". We explore two types of constraints for dealing with this: rewriting the pronoun explicitly, or constraining the summary to include the pronoun's antecedent.
Figure 3 example text: "This hasn't been Kellogg's year." "The oat-bran craze has cost it market share."
Figure 3: Modifications to the ILP to capture pronoun coherence. "It", which refers to Kellogg, has several possible antecedents from the standpoint of an automatic coreference system (Durrett and Klein, 2014). If the coreference system is confident about its selection (above a threshold $\alpha$ on the posterior probability), we allow for the model to explicitly replace the pronoun if its antecedent would be deleted (Section 2.2.1). Otherwise, we merely constrain one or more probable antecedents to be included (Section 2.2.2); even if the coreference system is incorrect, a human can often correctly interpret the pronoun with this additional context.

Pronoun Replacement
One way of dealing with these pronoun reference issues is to explicitly replace the pronoun with what it refers to. This replacement allows us to maintain maximal extraction flexibility, since we can make an isolated textual unit meaningful even if it contains a pronoun. Figure 3 shows how this process works. We run the Berkeley Entity Resolution System (Durrett and Klein, 2014) and compute posteriors over possible links for the pronoun. If the coreference system is sufficiently confident in its prediction (i.e. $\max_i p_i > \alpha$ for a specified threshold $\alpha > \frac{1}{2}$), we allow ourselves to replace the pronoun with the first mention of the entity corresponding to the pronoun's most likely antecedent. In Figure 3, if the system correctly determines that Kellogg is the correct antecedent with high probability, we enable the first replacement shown there, which is used if $u_2$ is included in the summary without $u_1$. As shown in the ILP in Figure 1, we instantiate corresponding pronoun replacement variables $x^{\mathrm{REF}}$, where $x^{\mathrm{REF}}_{ij} = 1$ implies that the $j$th pronoun in the $i$th sentence should be replaced in the summary. We use a candidate pronoun replacement if and only if the pronoun's corresponding (predicted) entity hasn't been mentioned previously in the summary. Because we are generally replacing pronouns with longer mentions, we also need to modify the length constraint to take this into account. Finally, we incorporate features on pronoun replacements in the objective, which helps the model learn to prefer pronoun replacements that help it to more closely match the human summaries.

Pronoun Antecedent Constraints
Explicitly replacing pronouns is risky: if the coreference system makes an incorrect prediction, the intended meaning of the summary may be damaged. Fortunately, the coreference model's posterior probabilities have been shown to be well-calibrated (Nguyen and O'Connor, 2015), meaning that cases where it is likely to make errors are signaled by flatter posterior distributions. In this case, we enable a more conservative set of constraints that include additional content in the summary to make the pronoun reference clear without explicitly replacing it. This is done by requiring the inclusion of any textual unit which contains possible pronoun references whose posteriors sum to at least a threshold parameter $\beta$. Figure 3 shows that this constraint can force the inclusion of $u_1$ to provide additional context. Although this could still lead to unclear pronouns if text is stitched together in an ambiguous or even misleading way, in practice we observe that the textual units we force to be added almost always occur very recently before the pronoun, giving enough additional context for a human reader to figure out the pronoun's antecedent unambiguously.
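The choice between the two kinds of anaphora handling can be sketched as a single decision function. This is one reading of the rules above, not the system's actual code: the function name and input format are hypothetical, the defaults are the threshold values reported in Section 3, and the sketch returns the antecedent text directly rather than looking up the first mention of the predicted entity. It also ignores the condition that a replacement is only used when the entity has not already been mentioned in the summary.

```python
def anaphora_decision(candidates, alpha=0.8, beta=0.6):
    """Choose how to handle one pronoun.

    candidates -- list of (antecedent_text, unit_index, posterior) triples
                  from a coreference system
    Returns ("replace", antecedent_text) when the top link is confident,
    otherwise ("require", sorted unit indices) for the conservative constraint.
    """
    best_text, _, best_p = max(candidates, key=lambda c: c[2])
    if best_p > alpha:
        return ("replace", best_text)
    # Sum posterior mass per textual unit; require units that reach beta.
    mass = {}
    for _, unit, p in candidates:
        mass[unit] = mass.get(unit, 0.0) + p
    return ("require", sorted(u for u, m in mass.items() if m >= beta))
```

For example, a pronoun whose candidates are `("Kellogg", 0, 0.5)`, `("year", 0, 0.2)`, and `("craze", 1, 0.3)` has no single confident link, but unit 0 carries most of the posterior mass, so the sketch requires unit 0 to be included whenever the pronoun's unit is.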

Features
The features in our model (see Figure 1) consist of a set of surface indicators capturing mostly lexical and configurational information. Their primary role is to identify important document content. The first three types of features fire over textual units, the last over pronoun replacements.
Lexical: These include indicator features on non-stopwords in the textual unit that appear at least five times in the training set and analogous POS features. We also use lexical features on the first, last, preceding, and following words for each textual unit. Finally, we conjoin each of these features with an indicator of bucketed position in the document (the index of the sentence containing the textual unit).
Structural: These features include various conjunctions of the position of the textual unit in the document, its length, the length of its corresponding sentence, the index of the paragraph it occurs in, and whether it starts a new paragraph (all values are bucketed).
Centrality: These features capture rough information about the centrality of content: they consist of bucketed word counts conjoined with bucketed sentence index in the document. We also fire features on the number of times each entity mentioned in the sentence is mentioned in the rest of the document (according to a coreference system), the number of entities mentioned in the sentence, and surface properties of mentions including type and length.
Pronoun replacement: These target properties of the pronoun replacement such as its length, its sentence distance from the current mention, its type (nominal or proper), and the identity of the pronoun being replaced.
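As a rough illustration of how sparse indicator features of this kind can be built and scored, consider the sketch below. The bucket boundaries, feature-name strings, and function names are all hypothetical; the text does not specify the bucketing scheme, and the sketch emits both the plain and the position-conjoined word indicators, which may differ from the actual feature set.

```python
def bucket(n, edges=(0, 1, 2, 4, 8, 16)):
    """Map a non-negative integer to the largest bucket edge not exceeding it."""
    return max(e for e in edges if e <= n)

def lexical_features(words, sent_index, vocab):
    """Indicator features for in-vocabulary words, alone and with position."""
    pos = bucket(sent_index)
    feats = set()
    for w in words:
        if w in vocab:  # vocab: non-stopwords seen often enough in training
            feats.add(f"word={w}")
            feats.add(f"word={w}&sent_bucket={pos}")
    if words:
        feats.add(f"first={words[0]}&sent_bucket={pos}")
        feats.add(f"last={words[-1]}&sent_bucket={pos}")
    return feats

def score(feats, weights):
    """Sparse dot product w . f(u) for binary features."""
    return sum(weights.get(name, 0.0) for name in feats)

feats = lexical_features(["oat", "bran", "craze"], sent_index=3, vocab={"craze"})
```

Because every feature is a binary indicator, scoring a textual unit reduces to summing the weights of the features that fire, which keeps inference cheap even with a very large feature space.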

Learning
We learn weights $w$ for our model by training on a large corpus of documents $u$ paired with reference summaries $y$. We formulate our learning problem as a standard instance of structured SVM (see Smith (2011) for an introduction). Because we want to optimize explicitly for ROUGE-1, we define a ROUGE-based loss function that accommodates the nature of our supervision, which is in terms of abstractive summaries $y$ that in general cannot be produced by our model. Specifically, we take $\ell(x^{\mathrm{NGRAM}}, y) = \max_{x^*} \text{ROUGE-1}(x^*, y) - \text{ROUGE-1}(x^{\mathrm{NGRAM}}, y)$, i.e. the gap between the hypothesis's ROUGE score and the oracle ROUGE score achievable under the model (including constraints). Here $x^{\mathrm{NGRAM}}$ are indicator variables that track, for each n-gram type in the reference summary, whether that n-gram is present in the system summary. These are the sufficient statistics for computing ROUGE.
We train the model via stochastic subgradient descent on the primal form of the structured SVM objective (Ratliff et al., 2007; Kummerfeld et al., 2015). In order to compute the subgradient for a given training example, we need to find the most violated constraint on the given instance through a loss-augmented decode, which for a linear model takes the form $\arg\max_x w^\top f(x) + \ell(x, y)$. To do this decode at training time in the context of our model, we use an extended version of our ILP in Figure 1 that is augmented to explicitly track type-level n-grams: we maximize $w^\top f(x^{\mathrm{UNIT}}, x^{\mathrm{REF}}) + \ell(x^{\mathrm{NGRAM}}, y)$ over $x^{\mathrm{UNIT}}$, $x^{\mathrm{REF}}$, and $x^{\mathrm{NGRAM}}$, subject to all constraints from Figure 1 and to the condition that $x^{\mathrm{NGRAM}}_i = 1$ iff an included textual unit or replacement contains the $i$th reference n-gram. These kinds of variables and constraints are common in multi-document summarization systems that score bigrams (Gillick and Favre, 2009, inter alia). Note that since ROUGE is only computed over non-stopword n-grams and pronoun replacements only replace pronouns, pronoun replacement can never remove an n-gram that would otherwise be included.
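The loss-augmented decode and the resulting subgradient can be illustrated on an explicit list of candidate summaries. This is a simplified sketch, not the training code: it replaces the ILP with enumeration over pre-computed feasible candidates, uses a type-level unigram recall as a stand-in for ROUGE-1, and treats the oracle candidate as the target derivation. All function names are hypothetical.

```python
def rouge1_recall(summary_words, reference_words, stopwords=frozenset()):
    """Fraction of distinct non-stopword reference unigrams found in the summary."""
    ref = set(reference_words) - stopwords
    if not ref:
        return 0.0
    return len(ref & set(summary_words)) / len(ref)

def subgradient(candidates, reference, weights):
    """One structured-SVM subgradient from an explicit candidate list.

    candidates -- list of (feature_dict, summary_words) for feasible summaries
    weights    -- dict: feature name -> weight
    """
    def model_score(feats):
        return sum(weights.get(k, 0.0) * v for k, v in feats.items())

    rouges = [rouge1_recall(words, reference) for _, words in candidates]
    oracle = max(rouges)
    # Loss-augmented decode: maximize model score plus the ROUGE gap to the oracle.
    aug = [model_score(f) + (oracle - r) for (f, _), r in zip(candidates, rouges)]
    pred_feats = candidates[aug.index(max(aug))][0]
    gold_feats = candidates[rouges.index(oracle)][0]
    # Subgradient of the hinge term: f(prediction) - f(oracle).
    grad = dict(pred_feats)
    for k, v in gold_feats.items():
        grad[k] = grad.get(k, 0.0) - v
    return {k: v for k, v in grad.items() if v != 0.0}
```

When the loss-augmented prediction coincides with the oracle, the returned subgradient is empty and the example contributes no update, which is the expected behavior once the margin is satisfied.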
For all experiments, we optimize our objective using AdaGrad (Duchi et al., 2011) with $\ell_1$ regularization ($\lambda = 10^{-8}$, chosen by grid search), with a step size of 0.1 and a minibatch size of 1. We train for 10 iterations on the training data, at which point held-out model performance no longer improves. Finally, we set the anaphora thresholds $\alpha = 0.8$ and $\beta = 0.6$ (see Section 2.2). The values of these and other hyperparameters were determined on a held-out development set from our New York Times training data. All ILPs are solved using GLPK version 4.55.
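For readers unfamiliar with the optimizer, a single sparse AdaGrad update with an $\ell_1$ penalty might look like the sketch below. This uses a simple soft-thresholding (proximal) step, which is one common way to combine AdaGrad with $\ell_1$ regularization and is not necessarily the variant used in this work. The function name is hypothetical; the default step size and $\lambda$ follow the values reported above, and `eps` is an added numerical-stability constant.

```python
import math

def adagrad_l1_step(weights, sq_sums, grad, eta=0.1, lam=1e-8, eps=1e-8):
    """Apply one AdaGrad step followed by per-coordinate soft-thresholding.

    weights -- dict: feature -> weight (updated in place)
    sq_sums -- dict: feature -> running sum of squared gradients (updated in place)
    grad    -- dict: feature -> subgradient value for this example
    """
    for k, g in grad.items():
        sq_sums[k] = sq_sums.get(k, 0.0) + g * g
        rate = eta / (math.sqrt(sq_sums[k]) + eps)  # per-feature learning rate
        w = weights.get(k, 0.0) - rate * g
        # Shrink toward zero by rate * lam; clip at zero (proximal L1 step).
        shrink = rate * lam
        weights[k] = math.copysign(max(abs(w) - shrink, 0.0), w)
    return weights
```

Only features with a nonzero subgradient are touched, so each update costs time proportional to the number of active features rather than the size of the full weight vector.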

Experiments
We primarily evaluate our model on a roughly 3000-document evaluation set from the New York Times Annotated Corpus (Sandhaus, 2008). We also investigate its performance on the RST Discourse Treebank (Carlson et al., 2001), but because this dataset is only 30 documents it provides much less robust estimates of performance. Throughout this section, when we decode a document, we set the word budget for our summarizer to be the same as the number of words in the corresponding reference summary, following previous work (Hirao et al., 2013; Yoshida et al., 2014).

Preprocessing
We preprocess all data using the Berkeley Parser (Petrov et al., 2006), specifically the GPU-accelerated version of the parser from Hall et al. (2014), and the Berkeley Entity Resolution System (Durrett and Klein, 2014). For RST discourse analysis, we segment text into EDUs using a semi-Markov CRF trained on the RST treebank with features on boundaries similar to those of Hernault et al. (2010), plus novel features on spans including span length and span identity for short spans.
To follow the conditions of Yoshida et al. (2014) as closely as possible, we also build a discourse parser in the style of Hirao et al. (2013), since their parser is not publicly available. Specifically, we use the first-order projective parsing model of McDonald et al. (2005) and features from Soricut and Marcu (2003), Hernault et al. (2010), and Joty et al. (2013). When using the same head annotation scheme as Yoshida et al. (2014), we outperform their discourse dependency parser on unlabeled dependency accuracy, getting 56% as opposed to 53%.
Figure 4 (top), summary: National Center for Education Statistics reports students in 4th, 8th and 12th grades scored modestly higher on American history test than five years earlier. Says more than half of high school seniors still show poor command of basic facts. Only 4th graders made any progress in civics test. New exam results are another ingredient in debate over renewing Pres Bush's signature No Child Left Behind Act.
Figure 4 (top), article: Federal officials reported yesterday that students in 4th, 8th and 12th grades had scored modestly higher on an American history test than five years earlier, although more than half of high school seniors still showed poor command of basic facts like the effect of the cotton gin on the slave economy or the causes of the Korean War. Federal officials said they considered the results encouraging because at each level tested, student performance had improved since the last time the exam was administered, in 2001. "In U.S. history there were higher scores in 2006 for all three grades," said Mark Schneider, commissioner of the National Center for Education Statistics, which administers the test, at a Boston news conference that the Education Department carried by Webcast. The results were less encouraging on a national civics test, on which only fourth graders made any progress. The best results in the history test were also in fourth grade, where 70 percent of students attained the basic level of achievement or better. The test results in the two subjects are likely to be closely studied, because Congress is considering the renewal of President Bush's signature education law, the No Child Left Behind Act. A number of studies have shown that because No Child Left Behind requires states…
Figure 4 (bottom), summary: Article on Speak-Up, program begun by Westchester County Office for the Aging to bring together elderly and college students.
Figure 4 (bottom), article: Long before President Bush's proposal to rethink Social Security became part of the national conversation, Westchester County came up with its own dialogue to bring issues of aging to the forefront. Before the White House Conference on Aging scheduled in October, the county's Office for the Aging a year ago started Speak-Up, which stands for Student Participants Embrace Aging Issues of Key Concern, to reach students in the county's 13 colleges and universities. Through a variety of events to bring together the elderly and college students, organizers said they hoped to have by this spring a series of recommendations that could be given to Washington…
Figure 4: Examples of an article kept in the NYT50 dataset (top) and an article removed because the summary is too short (bottom). The top summary has a rich structure to it, corresponding to various parts of the document (bolded) and including some text that is essentially a direct extraction.
Figure 5: Counts on a 1000-document sample of how frequently both a document prefix baseline and a ROUGE oracle summary contain sentences at various indices in the document (plotted series: oracle sentences and first k sentences). There is a long tail of useful sentences later in the document, as seen by the fact that the oracle sentence counts drop off relatively slowly. Smart selection of content therefore has room to improve over taking a prefix of the document.

New York Times Corpus
We now provide some details about the New York Times Annotated corpus. This dataset contains 110,540 articles with abstractive summaries; we split these into 100,834 training and 9,706 test examples, based on date of publication (test is all articles published on January 1, 2007 or later). Examples of two documents from this dataset are shown in Figure 4. The bottom example demonstrates that some summaries are extremely short and formulaic (especially those for obituaries and editorials). To counter this, we filter the raw dataset by removing all documents with summaries that are shorter than 50 words. One benefit of filtering is that the length distribution of our resulting dataset is more in line with standard summarization evaluations like DUC; it also ensures a sufficient number of tokens in the budget to produce nontrivial summaries. The filtered test set, which we call NYT50, includes 3,452 test examples out of the original 9,706.
Interestingly, this dataset is one where the classic document prefix baseline can be substantially outperformed, unlike in some other summarization settings (Penn and Zhu, 2008). We show this fact explicitly in Section 4.3, but Figure 5 provides additional analysis in this regard. We compute oracle ROUGE-1 sentence-extractive summaries on a 1000-document subset of the training set and look at where the extracted sentences lie in the document. While they certainly skew earlier in the document, they do not all fall within the document prefix summary. One reason for this is that many of the articles are longer-form pieces that begin with a relatively content-free lede of several sentences, which should be identifiable with lexicosyntactic indicators as are used in our discriminative model.
Table 1: [...] (Sandhaus, 2008). We report ROUGE-1 (R-1), ROUGE-2 (R-2), clarity/grammaticality (CG), and number of unclear pronouns (UP) (lower is better). On content selection, our system substantially outperforms all baselines, our implementation of the tree knapsack system (Yoshida et al., 2014), and learned extractive systems with less compression, even an EDU-extractive system that sacrifices grammaticality. On clarity metrics, our final system performs nearly as well as sentence-extractive systems. The symbols * and † indicate statistically significant gains compared to No Anaphoricity and Tree Knapsack (respectively) with p < 0.05 according to a bootstrap resampling test. We also see that removing either syntactic or EDU-based compressions decreases ROUGE.
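The date-based split and the summary-length filter described in this section are simple to express in code. The sketch below assumes a hypothetical record format (a dict with `date` and `summary` keys) and counts words by whitespace splitting, which may not match the tokenization actually used to build NYT50.

```python
from datetime import date

def split_and_filter(articles, min_summary_words=50, test_start=date(2007, 1, 1)):
    """Drop articles with short summaries, then split by publication date.

    articles -- iterable of dicts with "date" (datetime.date) and "summary" (str)
    Returns (train, test); test holds articles published on or after test_start.
    """
    train, test = [], []
    for art in articles:
        # Whitespace tokenization is an assumption made for this sketch.
        if len(art["summary"].split()) < min_summary_words:
            continue
        (test if art["date"] >= test_start else train).append(art)
    return train, test
```

Splitting by date rather than at random keeps all test articles later than all training articles, so the evaluation reflects performance on text published after the training period.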

New York Times Results
We evaluate our system along two axes: first, on content selection, using ROUGE (Lin and Hovy, 2003), and second, on clarity of language and referential structure, using annotators from Amazon Mechanical Turk. We follow the method of Gillick and Liu (2010) for this evaluation and ask Turkers to rate a summary on how grammatical it is using a 10-point Likert scale. Furthermore, we ask how many unclear pronoun references there were in the text. The Turkers do not see the original document or the reference summary, and rate each summary in isolation. Gillick and Liu (2010) showed that for linguistic quality judgments (as opposed to content judgments), Turkers reproduced the ranking of systems according to expert judgments.
To speed up preprocessing and training time on this corpus, we further restrict our training set to only contain documents with fewer than 100 EDUs. All told, the final system takes roughly 20 hours to make 10 passes through the subsampled training data (22,000 documents) on a single core of an Amazon EC2 r3.4xlarge instance. Table 1 shows the results on the NYT50 corpus. We compare several variants of our system and baselines. For baselines, we use two variants of first k: one which must stop on a sentence boundary (which gives better linguistic quality) and one which always consumes k tokens (which gives better ROUGE). We also use a heuristic sentence-extractive baseline that maximizes the document counts (term frequency) of bigrams covered by the summary, similar in spirit to the multi-document method of Gillick and Favre (2009). We also compare to our implementation of the Tree Knapsack method of Yoshida et al. (2014), which matches their results very closely on the RST Discourse Treebank when discourse trees are controlled for. Finally, we compare several variants of our system: purely extractive systems operating over sentences and EDUs respectively, our full system, and ablations removing either the anaphoricity component or parts of the compression module.
In terms of content selection, we see that all of the systems that incorporate end-to-end learning (under "This work") substantially outperform our various heuristic baselines. Our full system using the full compression scheme is substantially better on ROUGE than ablations where the syntactic or discourse compressions are removed. These improvements reflect the fact that more compression options give the system more flexibility to include key content words. Removing the anaphora resolution constraints actually causes ROUGE to increase slightly (as a result of granting the model flexibility), but has a negative impact on the linguistic quality metrics.
On our linguistic quality metrics, it is no surprise that the sentence prefix baseline performs the best. Our sentence-extractive system also does well on these metrics. Compared to the EDU-extractive system with no constraints, our constrained compression method both improves substantially on linguistic quality and reduces the number of unclear pronouns, and adding the pronoun anaphora constraints gives further improvement. Our final system approaches the sentence-extractive baseline, particularly on unclear pronouns, and achieves a substantially higher ROUGE score.
Table 2: [...] (Carlson et al., 2001). Differences between our system and the Tree Knapsack system of Yoshida et al. (2014) are not statistically significant, reflecting the high variance in this small (20 document) test set.

RST Treebank
We also evaluate on the RST Discourse Treebank, of which 30 documents have abstractive summaries. Following Hirao et al. (2013), we use the gold EDU segmentation from the RST corpus but automatic RST trees. We break this into a 10-document development set and a 20-document test set. Table 2 shows the results on the RST corpus. Our system is roughly comparable to Tree Knapsack here, and we note that none of the differences in the table are statistically significant. We also observed significant variation between multiple runs on this corpus, with scores changing by 1-2 ROUGE points for slightly different system variants.

Conclusion
We presented a single-document summarization system trained end-to-end on a large corpus. We integrate a compression model that enforces grammaticality as well as pronoun anaphoricity constraints that enforce coherence. Our system improves substantially over baseline systems on ROUGE while still maintaining good linguistic quality.
Our system and models are publicly available at http://nlp.cs.berkeley.edu