Generating Coherent Summaries of Scientific Articles Using Coherence Patterns

Introduction
The growth in the scientific output of many different fields makes automatic summarization imperative. Automatic summarizers help researchers obtain an informative and coherent gist of long scientific articles. An automatic summarizer produces summaries considering three properties. Importance: the summary should contain the important information of the input document. Non-redundancy: the summary should contain non-redundant information; the information in the summary should be diverse. Coherence: though the summary should comprise diverse and important information from the input document, its sentences should be connected to one another so that the summary is coherent and easy to read.
If we do not ensure that a summary is coherent, its sentences may not be properly connected, resulting in an obscure summary. Coherence has not been thoroughly considered in previous work, which uses single-sentence connectivity in the input document as a coherence measure, computed as the outdegree of a sentence in a graph representation of the input document. This has two disadvantages: first, since it is computed based on one sentence only, it is not sufficient to generate coherent summaries; second, it is based on sentence connectivity in the input document rather than in the summary.
In this work, we focus on the coherence aspect of summarization. We use discourse entities as the unit of information that relate sentences. Here, discourse entities are referred to as head nouns of noun phrases (see Section 2). The main goal is to extract sentences which refer to those entities which are important and unique, and also to entities which connect the extracted sentences in a coherent manner. Entities in connected sentences can be used to create linguistically motivated coherence patterns (Daneš, 1974). Recently, Mesgar and Strube (2015) modeled these coherence patterns by subgraphs of the graph representation (nodes represent sentences and edges represent entity connections among sentences) of documents. They show that the frequency of coherence patterns can be used as features for coherence.
The key idea of this paper is to apply coherence patterns to long scientific articles to extract (possibly) non-adjacent sentences which, however, are already coherent.

[Figure 1: (i) an example coherence pattern; (ii) sentences S1-S5 of an input document on cardiometabolic diseases in sub-Saharan Africa. Sentences S1, S3 and S5 constitute the pattern in the input document.]

Based on the assumption that abstracts of scientific articles are similar in style to coherent summaries, we obtain coherence patterns by analyzing a corpus of abstracts of articles from biomedicine (the PubMed corpus). We then apply the most frequent coherence patterns to input documents, i.e., long scientific articles from biomedicine (the PLOS Medicine dataset), extract the corresponding sentences to generate coherent summaries, and evaluate them by comparing with summaries written by a PLOS Medicine editor. Figure 1 illustrates the extraction of sentences from an input document (Figure 1, (ii)) which constitute a coherence pattern (Figure 1, (i)).
If we overlay the input document with coherence patterns and extract the sentences which constitute those patterns, then the extracted sentences are already coherent. We also take into account importance and non-redundancy. We capture all three factors in an objective function maximized by Mixed Integer Programming (MIP) (Section 2).
We evaluate our method on two different datasets: PLOS Medicine and DUC 2002. We extract frequent coherence patterns from all abstracts in the PubMed corpus and generate summaries of unseen scientific articles from the PLOS Medicine dataset (Section 3.1). For DUC 2002 we extract coherence patterns from the human summaries of DUC 2005 (Dang, 2005). We evaluate our model on DUC 2002 to compare with state-of-the-art systems.
Our experimental results show that using coherence patterns for summarization produces summaries that are more informative (yet non-redundant) and more coherent than those of several baseline and state-of-the-art methods, according to ROUGE scores and human judgements.

Method
We solve the task of creating coherent summaries by employing coherence patterns. We tightly integrate determining importance, non-redundancy and coherence by applying global optimization, i.e., MIP.

Document Representation
We use the entity graph (Guinaudeau and Strube, 2013) to represent scientific articles. The entity graph is a bipartite graph which consists of entities and sentences as two disjoint sets of nodes (Figure 2, ii). Entity nodes are connected only with sentence nodes and not among each other. An entity node is connected with a sentence node if and only if the entity is present in the sentence. Entities are the head nouns of noun phrases.
We perform a one-mode projection on sentence nodes to create a directed one-mode projection graph (Figure 2, iii). Two sentence nodes in the one-mode projection graph are connected if they share at least one entity in the entity graph. Edge directions encode the sentence order in the input document.
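The projection step above can be sketched in a few lines. The entity sets below are toy stand-ins; in the paper, entities are head nouns of noun phrases extracted by a parser:

```python
# Minimal sketch of the one-mode projection of the entity graph.
def one_mode_projection(sentence_entities):
    """Return directed edges (i, j), i < j, between sentences that share
    at least one entity; the direction encodes sentence order."""
    edges = []
    n = len(sentence_entities)
    for i in range(n):
        for j in range(i + 1, n):
            if sentence_entities[i] & sentence_entities[j]:
                edges.append((i, j))
    return edges

sents = [{"disease", "africa"}, {"diabetes", "africa"}, {"disease", "diet"}]
print(one_mode_projection(sents))  # [(0, 1), (0, 2)]
```

Sentences 0 and 1 share "africa", sentences 0 and 2 share "disease", while sentences 1 and 2 share nothing, so only the first two pairs are connected.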

Mining Coherence Patterns
We use the one-mode projection graphs of abstracts in the PubMed corpus (see Section 3.1) to mine coherence patterns. The weight of a coherence pattern, weight(pat u ), is its frequency in the PubMed corpus normalized by the maximum number of its occurrences in abstracts in the PubMed corpus (Equation 1), where q is the number of graphs associated with abstracts in the corpus and g k represents the graph of the k th abstract in the PubMed corpus. The weights of the coherence patterns are not on the same scale. We therefore normalize the weights using the standard score (x − µ)/σ, where µ is the mean and σ is the standard deviation; a sigmoid function then scales the weights to the interval [0, 1].
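The rescaling step (standard score followed by a sigmoid) can be sketched as follows; the raw frequency-based weights of Equation 1 are assumed to be already computed:

```python
import math

def normalize_weights(raw):
    """Standard-score the raw pattern weights, then squash with a sigmoid."""
    mu = sum(raw) / len(raw)
    sigma = math.sqrt(sum((w - mu) ** 2 for w in raw) / len(raw))
    z = [(w - mu) / sigma for w in raw]          # standard score (x - mu) / sigma
    return [1.0 / (1.0 + math.exp(-v)) for v in z]  # sigmoid into (0, 1)

print(normalize_weights([0.1, 0.5, 0.9]))  # the middle weight maps to 0.5
```

A weight equal to the mean always maps to 0.5, and the ordering of the weights is preserved.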

Summary Generation
We maximize importance, non-redundancy and pattern-based coherence, each with its respective weight λ, to generate coherent summaries. The objective function (Equation 2) combines these terms, where S is a set of binary variables for sentences in an article, E is a set of binary variables for entities, and P is a set of binary variables for coherence patterns.
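The equation itself did not survive extraction; given the weights λ I , λ R and λ c named in Section 3.2, the objective plausibly has the linear form below (a reconstruction, not the paper's verbatim equation):

```latex
\max_{S,\,E,\,P}\;\; \lambda_I\, f_I(S) \;+\; \lambda_R\, f_R(E) \;+\; \lambda_C\, f_C(P)
```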
Importance (f I (S)): The importance function quantifies the overall importance of information in the summary, calculated from the ranks of the sentences selected for the summary. In Equation 3, Rank (sent i ) represents the rank of sentence sent i , s i is the binary variable of sentence sent i , and n is the number of sentences. Kleinberg (1999) develops the Hubs and Authorities algorithm (HITS) to rank web pages. He divides web pages into two sets: Hubs, pages which contain links to informative web pages, and Authorities, informative web pages. Here, Hubs are entities and Authorities are sentences. We calculate the rank of sentences using the HITS algorithm. Initial ranks for sentences and entities in an entity graph are computed by Equations 4 and 5. In Equation 4, sim(sent i , title) is the cosine similarity between the scientific article's title and sentence sent i . In Equation 5, ent j refers to the j th entity in the entity graph. After applying the HITS algorithm to the entity graph with this initialization, the final rank of a sentence is its importance.
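A sketch of HITS-style sentence ranking on the entity graph. Sentences play the role of authorities and entities the role of hubs; the uniform initialization values below merely stand in for Equations 4 and 5, whose exact form did not survive extraction:

```python
def hits_rank(sent_entities, sent_init, ent_init, iters=50):
    """sent_entities[i] is the set of entity indices occurring in sentence i.
    Returns the final authority (importance) score per sentence."""
    auth, hub = list(sent_init), list(ent_init)
    n, m = len(auth), len(hub)
    for _ in range(iters):
        # authority update: sum of hub scores of the sentence's entities
        auth = [sum(hub[j] for j in sent_entities[i]) for i in range(n)]
        # hub update: sum of authority scores of sentences containing the entity
        hub = [sum(auth[i] for i in range(n) if j in sent_entities[i])
               for j in range(m)]
        # L2-normalize to keep scores bounded
        na = sum(a * a for a in auth) ** 0.5 or 1.0
        nh = sum(h * h for h in hub) ** 0.5 or 1.0
        auth = [a / na for a in auth]
        hub = [h / nh for h in hub]
    return auth

ranks = hits_rank([{0, 1}, {0}, {1, 2}], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0])
print(ranks.index(max(ranks)))  # sentence 0, connected to the most-shared entities
```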
Non-redundancy (f R (E)): In the objective function, f R (E) represents the non-redundancy of information in the summary. Intuitively, if every sentence of the summary carries unique information, then the summary is non-redundant. We measure non-redundancy as follows: where m is the number of entities and e j is a binary variable for each entity. The summary becomes non-redundant if we include only unique entities.
On the basis of f I (S) and f R (E) we define the following optimization constraints: The constraint in Equation 7 limits the length of the summary. l max is the maximal length of the summary and |Sent i | is the length of sentence sent i .
In Equation 8, the constraint ensures that if sentence sent i is selected (s i = 1), then all entities E i present in sentence sent i must also be selected. In Equation 9, S j represents the set of binary variables of sentences which contain entity ent j . This constraint prescribes that if entity ent j is selected (e j = 1), then at least one of the sentences in S j must be selected, too.
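The interaction of importance, non-redundancy and the length constraint can be illustrated with a brute-force stand-in for the MIP (the paper uses Gurobi; the λ values here are illustrative, not the tuned ones). Entity variables follow deterministically from the chosen sentences, which realizes the linking constraints of Equations 8 and 9:

```python
from itertools import combinations

def best_summary(ranks, sent_entities, sent_lens, l_max, lam_i=0.5, lam_r=0.5):
    """Score every sentence subset by importance (Eq. 3) plus
    non-redundancy (Eq. 6) under the length constraint (Eq. 7)."""
    n = len(ranks)
    best, best_score = (), float("-inf")
    for r in range(1, n + 1):
        for subset in combinations(range(n), r):
            if sum(sent_lens[i] for i in subset) > l_max:
                continue  # Eq. 7: summary length limit
            covered = set().union(*(sent_entities[i] for i in subset))
            score = (lam_i * sum(ranks[i] for i in subset)
                     + lam_r * len(covered))  # unique entities only
            if score > best_score:
                best, best_score = subset, score
    return best

print(best_summary([0.9, 0.5, 0.4],
                   [{"a", "b"}, {"a"}, {"c", "d"}],
                   [10, 5, 8], l_max=20))  # (0, 2)
```

Sentences 0 and 2 win because together they cover four distinct entities within the length budget, whereas adding sentence 1 would exceed it while contributing no new entity.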
Coherence (f C (P)): We use the mined patterns to extract sentences from the input PLOS Medicine document to create a coherent summary. We extract sentences if the connectivity among nodes in their projection graph matches the connectivity among nodes in a coherence pattern. In Figure 3 we overlay the projection graph from Figure 2 with the coherence pattern from Figure 1, i. This results in three instances of this coherence pattern. However, we select only one, since we simultaneously optimize for importance and non-redundancy. In the objective function, f C (P) measures the coherence of the summary based on the weights of the coherence patterns occurring in it (Section 2.2), where p u is a boolean variable associated with coherence pattern pat u . The optimization considers pattern pat u for summarizing the input article if pat u is a subgraph of the projection graph of the article. To find a coherence pattern in a projection graph we apply a graph matching algorithm (Lerouge et al., 2015). To model the graph matching problem between projection graph g = (V g , E g ) and pattern pat u = (V patu , E patu ), two kinds of binary mapping variables are used: x i,k for the node map and y ij,kl for the edge map. x i,k = 1 if vertices i ∈ V patu and k ∈ V g match; y ij,kl = 1 if edges ij ∈ E patu and kl ∈ E g match (Figure 4).
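The paper encodes the matching inside the MIP (Lerouge et al., 2015); the brute-force sketch below finds the same node-induced occurrences. A mapping is valid iff both the edges and the non-edges of the pattern carry over to the graph, which is what the induced-pattern constraint enforces:

```python
from itertools import combinations

def find_pattern_instances(graph_edges, n_graph, pat_edges, n_pat):
    """Enumerate node-induced occurrences of a pattern in a projection graph.
    Order-preserving mappings suffice, since edges follow sentence order."""
    g, p = set(graph_edges), set(pat_edges)
    hits = []
    for nodes in combinations(range(n_graph), n_pat):
        if all(((nodes[i], nodes[j]) in g) == ((i, j) in p)
               for i in range(n_pat) for j in range(n_pat) if i != j):
            hits.append(nodes)
    return hits

# Pattern "two connected sentences" in a 4-sentence projection graph:
print(find_pattern_instances([(0, 1), (0, 2), (1, 3)], 4, [(0, 1)], 2))
# [(0, 1), (0, 2), (1, 3)]
```

This mirrors the situation in Figure 3, where one pattern has several instances in the projection graph and the optimization then picks among them.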
Constraints for graph matching are as follows:
• Every node of the pattern matches at most one unique node of the graph.
• Every edge of the pattern matches at most one unique edge of the graph.
• Every node of the graph matches at most one node of the pattern.
• A node of pattern pat u matches a node of graph g only if an edge originating from the node of pat u matches an edge originating from the node of g.
• A node of pattern pat u matches a node of graph g only if an edge targeting the node of pat u matches an edge targeting the node of g.
• Finally, we need a constraint to extract induced patterns 1 .
The constraints in Equations 11−16 find pattern pat u in projection graph g of the input article. However, these constraints do not ensure that the pattern is in the summary. For this, we define constraints in Equations 17−19 to ensure that an existing pattern in an article is selected if some sentences in the summary constitute the pattern.
• The constraint in Equation 17 ensures that if sentences s k and s l are selected for the summary then the edge between them is selected (z kl = 1), too: • Pattern pat u is present in the summary (p u = 1) if and only if one of its instances in the projection graph is included in the summary, i.e., some of the selected sentence nodes must be present in an instance of pattern pat u . |V patu | is the number of nodes in pattern pat u , and |E patu | is the number of edges in pattern pat u . This constraint is shown below: • If a sentence is selected then it has to match a node of at least one of the patterns:

Experiments
In this section we discuss the datasets and the experimental setup. We evaluate our model using ROUGE scores and human judgements.

Datasets
PLOS Medicine: This dataset contains 50 scientific articles. In this dataset every scientific article is accompanied by a summary written by an editor of the month. This editor's summary has a broader perspective than the authors' abstract. We use the editor's summary as a gold summary for calculating the ROUGE scores. We use 700 different PLOS Medicine articles from the PubMed 2 corpus to mine coherence patterns from their abstracts and to calculate patterns' weights.

Experimental Setup
First, we extract the text of an article. We remove figures, tables, references and non-alphabetical characters. Then we use the Stanford parser (Klein and Manning, 2003) to determine sentence boundaries. We apply the Brown coherence toolkit (Elsner and Charniak, 2011) to convert the articles into entity grids (Barzilay and Lapata, 2008) which then are transformed into entity graphs. We use gSpan (Yan and Han, 2002) to extract all subgraphs from the projection graphs of the abstracts of the PubMed corpus. It is possible that patterns with a large number of nodes are not at all present in the projection graph. Hence, we use coherence patterns with 3 and 4 nodes, referred to as CP 3 and CP 4 , respectively. We use Gurobi (Gurobi Optimization, Inc., 2014) to solve the MIP problem. We use a pronoun resolution system (Martschat, 2013) to replace all pronouns in the summary with their antecedents.
We determine the best values for λ I , λ R , and λ c on the development sets. λ I = 0.4, λ R = 0.3, and λ c = 0.3 are the best weights for the PLOS Medicine development set. Weights for the DUC 2002 development set are λ I = 0.5, λ R = 0.2 and λ c = 0.3.

Results
We evaluate our model in two ways. First, we use ROUGE scores to compare our model with other models. Second, we explicitly evaluate the coherence of the summaries by human judgements.

ROUGE Assessment
The ROUGE score (Lin, 2004) is a standard evaluation score in automatic text summarization. It calculates the overlap between a gold summary and a system summary. In automatic text summarization, ROUGE 1, ROUGE 2 and ROUGE SU4 are usually reported (see Graham (2015) for an assessment of evaluation metrics for summarization). We compare our system (CP 3 and CP 4 ) with four baselines: Lead, Random, Maximal Marginal Relevance (MMR) and TextRank. Lead selects adjacent sentences from the beginning of an input article. Random selects sentences randomly. MMR (Carbonell and Goldstein, 1998) uses a trade-off between relevance and redundancy. TextRank is a graph-based system using sentences as nodes and edges weighted by cosine similarity between sentences (Mihalcea and Tarau, 2004). We compare our system with three state-of-the-art systems: E Coh , T Coh and Mead (Radev et al., 2004). E Coh uses entity graphs which consist of entities and sentences, and T Coh uses topical graphs where entities are replaced by topics. They use the outdegree of sentence nodes in the unweighted and the weighted projection graph, respectively, as the coherence measure of each sentence.
Mead employs a linear combination of three features: centroid score, position score and overlap score. The linear combination is used to add sentences to the summary up to the required length. The centroid score gives the highest score to the most central sentence in the cluster of sentences, the position score gives a higher score to sentences at the beginning of the document, and the overlap score computes the similarity between the sentences of a document. None of these features accounts for the coherence of a summary, as they have no notion of the order and structure of a summary.
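For intuition, the unigram variant of the overlap computation behind ROUGE can be sketched as follows; the official toolkit (Lin, 2004) additionally handles stemming, multiple references and the other reported variants (ROUGE 2, ROUGE SU4):

```python
from collections import Counter

def rouge1_recall(gold, system):
    """Simplified ROUGE-1 recall: clipped unigram overlap divided by the
    number of unigrams in the gold summary."""
    g = Counter(gold.lower().split())
    s = Counter(system.lower().split())
    overlap = sum(min(count, s[w]) for w, count in g.items())  # clipped counts
    return overlap / sum(g.values())

print(round(rouge1_recall("the cat sat on the mat", "the cat lay on a mat"), 3))
# 0.667
```

Four of the six gold unigrams ("the", "cat", "on", "mat") appear in the system summary, giving a recall of 4/6.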
To compare with the state-of-the-art systems on PLOS Medicine, E Coh and T Coh , we limit the length of summaries to 5 sentences. Table 1 reports the ROUGE scores of the different systems. Our system outperforms the baselines and the state-of-the-art systems.
Since a word length limit for a summary is more meaningful than a sentence length limit, we also limit the length of a summary to the average length of the editors' summaries in the dataset (750 words). Table 2 shows the performance of the different systems with a 750 word limit. We compute statistical significance between E Coh and CP 3 on both scores: ROUGE SU4 is significantly different at the 95% level, and ROUGE 2 at the 99% level.
Upper Bound in Table 2 represents maximum ROUGE scores that can be achieved in extractive summarization on the PLOS Medicine dataset. It is calculated by considering the whole scientific article as a summary and the corresponding editor's summary as the gold standard. The Upper Bound scores are not very high showing that a significant improvement in ROUGE scores on the PLOS Medicine dataset is difficult. Thus, the performance achieved by our systems, CP 3 and CP 4 , is a considerable improvement on the PLOS Medicine dataset.
Furthermore, we apply CP 3 to the dataset introduced by Liakata et al. (2013), which consists of 28 scientific articles from the chemistry domain. The state-of-the-art system on this dataset is CoreSC, developed by Liakata et al. (2013), which considers discourse information while summarizing a scientific article. The ROUGE-1 score of CP 3 (0.96) is significantly better than that of CoreSC (0.75) and Microsoft Office Word 2007 AutoSummarize (0.73) (García-Hernández et al., 2009), evaluated against the abstracts. This shows that our system performs well in other domains.
We further calculate the average number of sentences per summary produced by Mead and CP 3 : on average Mead produces 17.5 sentences per summary whereas CP 3 produces 27.2. Longer sentences are more likely than shorter ones to contain topic-irrelevant entities (Jin et al., 2010).
We calculate the average percentage of sentences selected from the sections Introduction, Method, Results and Discussion by the different systems. CP 3 extracts sentences mainly from Introduction (32.5%) and Method (38.5%), but also a considerable number from Results (17.67%) and Discussion (11.33%). This distribution is quite similar to TextRank and MMR. Lead, obviously, extracts only from Introduction (80.59%) and Method (19.41%). Mead extracts most sentences from the beginning of the document owing to its positional feature. The sentences in a summary extracted by CP 3 are spread across all sections, indicating that the selection is not biased towards any section. This shows that coherence patterns seek not only nearby sentences but also distant sentences of a scientific article.

Table 3 shows the results on DUC 2002 to compare with state-of-the-art systems. There is no significant difference between the ROUGE scores of CP 3 and CP 4 on DUC 2002; thus, we only report the results of CP 3 on DUC 2002.
In Table 3, LREG is a baseline system using logistic regression and hand-crafted features (Cheng and Lapata, 2016). We compare our model to previously published state-of-the-art systems, which show reasonable performance on the DUC 2002 summarization task. ILP phrase is a phrase-based extraction model, which selects important phrases and combines them via integer linear programming (Woodsend and Lapata, 2010). URANK utilizes a unified ranking process for single-document and multi-document summarization tasks (Wan, 2010). UniformLink (k=10) considers similar documents for document expansion in the single-document summarization task (Wan and Xiao, 2010). The more recent system, NN-SE, utilizes a neural network hierarchical document encoder and an attention-based extractor to extract sentences from a document for a summary (Cheng and Lapata, 2016). The ROUGE scores of our approach on this dataset are better than those of the baselines and state-of-the-art systems. This shows that our system performs well even in a different genre (robust) and with considerably shorter input documents (scalable).

Coherence Assessment
ROUGE scores do not evaluate summary coherence, since ROUGE only calculates overlapping recall scores and does not consider the structure of the summary. Haghighi and Vanderwende (2009), Celikyilmaz and Hakkani-Tür (2010) and Christensen et al. (2013) evaluate overall summary quality by asking human subjects to rank system generated summaries; prior work likewise assesses coherence by asking human assessors to rank system generated summaries against baseline systems. We perform summary coherence assessment by asking one postdoc, two PhD students and one Masters student from the field of natural language processing. We provide them with the output summaries of four different systems for ten articles and ask them to rank the summaries: the best summary gets rank 1, the second best rank 2, the third best rank 3, and the worst rank 4.
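Agreement among such rankings is commonly quantified with Kendall's coefficient of concordance; a minimal sketch, assuming complete rankings without ties:

```python
def kendalls_w(rankings):
    """Kendall's W for m assessors ranking n items;
    rankings[a][i] is the rank assessor a assigns to item i."""
    m, n = len(rankings), len(rankings[0])
    totals = [sum(r[i] for r in rankings) for i in range(n)]  # rank sums per item
    mean = sum(totals) / n
    s = sum((t - mean) ** 2 for t in totals)  # squared deviations of rank sums
    return 12 * s / (m ** 2 * (n ** 3 - n))

# Three assessors in perfect agreement over four systems:
print(kendalls_w([[1, 2, 3, 4]] * 3))  # 1.0
```

W ranges from 0 (no agreement) to 1 (perfect agreement); values near 1 indicate that the assessors rank the systems consistently.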
The four systems assessed are CP 3 , E Coh , TextRank, and Lead. We apply the Kendall concordance coefficient (W) (Siegel and Castellan, 1988) to measure whether the human assessors agree in ranking the four systems. With W = 0.6725 the correlation between the human assessors is high. A χ 2 test shows that W is significant at least at the 99% level, indicating that the ranks provided by the human assessors are reliable and informative. Table 4 shows the overall average rank of each system given by the four human assessors; the lower the average rank, the more coherent the summary. Unsurprisingly, Lead gets the best overall average rank: Lead extracts adjacent sentences from the beginning of the document, so these summaries are as coherent as the author intends them to be, but they are not informative. However, CP 3 is very close in coherence to Lead, indicating that our strategy is successful. It also performs substantially better than TextRank and E Coh . This confirms that using coherence patterns for sentence extraction yields more coherent summaries.

Related Work
In scientific articles, information is spread all over the article, unlike information in news articles (Teufel and Moens, 2002). There are various approaches for summarizing scientific articles. Citations have been used by many researchers for summarization in this domain (Elkiss et al., 2008; Mohammad et al., 2009; Qazvinian and Radev, 2008; Abu-Jbara and Radev, 2011). Nanba and Okumura (2000) develop rules for categorizing citations by analyzing citation sentences. Newman (2001) analyzes the structure using a citation network. Similarly, Siddharthan and Teufel (2007) discover scientific attributions using citations. Discourse structure (but not necessarily coherence) has been used by Teufel and Moens (2002), Liakata et al. (2013) and others for summarizing scientific articles.

Several state-of-the-art extractive summarization systems implement summarization as maximizing an objective function under constraints. McDonald (2007) interprets text summarization as a global inference problem, maximizing the importance score of a summary under a length constraint. Similarly, various approaches to summarization are based on optimization using ILP (Gillick et al., 2009; Nishikawa et al., 2010; Galanis et al., 2012). Until now, only a few works have considered coherence while summarizing scientific articles. Abu-Jbara and Radev (2011) work on citation based summarization. They preprocess the citation sentences to filter out irrelevant sentences or sentence fragments, then extract sentences for the summary. Finally, they refine the summary sentences to improve readability. Jha et al. (2015) consider Minimum Independent Discourse Contexts (MIDC) to address non-coherence in extractive summarization. However, none of these works deals with coherence within the task of sentence selection: sentence selection and ensuring the coherence of summaries are not tightly integrated in their techniques, and they model coherence only over adjacent sentences.
There are few methods (Hirao et al., 2013;Gorinski and Lapata, 2015) which integrate coherence in optimization. These methods do not take into account the overall structure of the summary. Unlike earlier methods, we incorporate coherence patterns in optimization.

Conclusion
We introduce a novel graph-based approach to generate coherent summaries of scientific articles. Our approach handles coherence explicitly via coherence patterns. We have experimented with PLOS Medicine and DUC 2002. The results show that the approach is robust: it works on both scientific and news documents and on input documents of different lengths. It considerably outperforms state-of-the-art systems on both datasets. We collected human assessments to evaluate the coherence of summaries. Our system substantially outperforms baselines and state-of-the-art systems, i.e., incorporating coherence patterns produces more coherent summaries. The results show that our approach performs well in both human summary coherence assessment and relevance evaluation (ROUGE scores).