Toward Abstractive Summarization Using Semantic Representations

We present a novel abstractive summarization framework that draws on the recent development of a treebank for the Abstract Meaning Representation (AMR). In this framework, the source text is parsed to a set of AMR graphs, the graphs are transformed into a summary graph, and then text is generated from the summary graph. We focus on the graph-to-graph transformation that reduces the source semantic graph into a summary graph, making use of an existing AMR parser and assuming the eventual availability of an AMR-to-text generator. The framework is data-driven, trainable, and not specifically designed for a particular domain. Experiments on gold-standard AMR annotations and system parses show promising results. Code is available at: https://github.com/summarization


Introduction
Abstractive summarization is an elusive technological capability in which textual summaries of content are generated de novo. Demand is on the rise for high-quality summaries not just for lengthy texts (e.g., books; Bamman and Smith, 2013) and texts known to be prohibitively difficult for people to understand (e.g., website privacy policies; Sadeh et al., 2013), but also for non-textual media (e.g., videos and image collections; Kim et al., 2014; Kuznetsova et al., 2014; Zhao and Xing, 2014), where extractive and compressive summarization techniques simply do not suffice. We believe that the challenge of abstractive summarization deserves renewed attention and propose that recent developments in semantic analysis have an important role to play.
We conduct the first study exploring the feasibility of an abstractive summarization system based on transformations of semantic representations such as the Abstract Meaning Representation (AMR; Banarescu et al., 2013). Example sentences and their AMR graphs are shown in Fig. 1. AMR has much in common with earlier formalisms (Kasper, 1989; Dorr et al., 1998); today an annotated corpus comprising over 20,000 AMR-analyzed English sentences (Knight et al., 2014) and an automatic AMR parser (JAMR; Flanigan et al., 2014) are available.
In our framework, summarization consists of three steps illustrated in Fig. 1: (1) parsing the input sentences to individual AMR graphs, (2) combining and transforming those graphs into a single summary AMR graph, and (3) generating text from the summary graph. This paper focuses on step 2, treating it as a structured prediction problem. We assume text documents as input and use JAMR for step 1. We use a simple method to read a bag of words off the summary graph, allowing evaluation with ROUGE-1, and leave full text generation from AMR (step 3) to future work.
The graph summarizer, described in §4, first merges AMR graphs for each input sentence through a concept merging step, in which coreferent nodes of the graphs are merged; a sentence conjunction step, which connects the root of each sentence's AMR graph to a dummy "ROOT" node; and an optional graph expansion step, where additional edges are added to create a fully dense graph on the sentence level. These steps result in a single connected source graph. A subset of the nodes and arcs from the source graph is then selected for inclusion in the summary graph. Ideally this is a condensed representation of the most salient semantic content from the source.
Figure 1: A toy example. Sentences are parsed into individual AMR graphs in step 1; step 2 conducts graph transformation that produces a single summary AMR graph; text is generated from the summary graph in step 3.
We briefly review AMR and JAMR (§2), then present the dataset used in this paper (§3). The main algorithm is presented in §4, and we discuss our simple generation step in §5. Our experiments (§6) measure the intrinsic quality of the graph transformation algorithm as well as the quality of the terms selected for the summary (using ROUGE-1). We explore variations on the transformation and the learning algorithm, and show oracle upper bounds of various kinds.

Background: Abstract Meaning Representation and JAMR
AMR provides a whole-sentence semantic representation, represented as a rooted, directed, acyclic graph (Fig. 1). Nodes of an AMR graph are labeled with concepts, and edges are labeled with relations.
Step 1 of our framework converts input document sentences into AMR graphs. We use a statistical semantic parser, JAMR (Flanigan et al., 2014), which was trained on the AMR Bank. JAMR's current performance on our test dataset is 63% F-score. We will analyze the effect of AMR parsing errors by comparing JAMR output with gold-standard annotations of input sentences in the experiments (§6).
In addition to predicting AMR graphs for each sentence, JAMR provides alignments between spans of words in the source sentence and fragments of its predicted graph. For example, a graph fragment headed by "date-entity" could be aligned to the tokens "April 8, 2002." We use these alignments in our simple text generation module (step 3; §5).

Dataset
To build and evaluate our framework, we require a dataset that includes inputs and summaries, each with gold-standard AMR annotations. This allows us to use a statistical model for step 2 (graph summarization) and to separate its errors from those in step 1 (AMR parsing), which is important in determining whether this approach is worth further investment.
Fortunately, the "proxy report" section of the AMR Bank (Knight et al., 2014) suits our needs. A proxy report is created by annotators based on a single newswire article, selected from the English Gigaword corpus. The report header contains metadata about date, country, topic, and a short summary. The report body is generated by editing or rewriting the content of the newswire article to approximate the style of an analyst report. Hence this is a single document summarization task. All sentences are paired with gold-standard AMR annotations. Table 1 provides an overview of our dataset.

Graph Summarization
Given AMR graphs for all of the sentences in the input (step 1), graph summarization transforms them into a single summary AMR graph (step 2). This is accomplished in two stages: source graph construction (§4.1) and subgraph prediction (§4.2).

Source Graph Construction
The "source graph" is a single graph constructed using the individual sentences' AMR graphs by merging identical concepts. In the AMR formalism, an entity or event is canonicalized and represented by a single graph fragment, regardless of how many times it is referred to in the sentence. This principle can be extended to multiple sentences, ideally resulting in a source graph with no redundancy. Because repeated mentions of a concept in the input can signal its importance, we will later encode the frequency of mentions as a feature used in subgraph prediction.
Concept merging involves collapsing certain graph fragments into a single concept, then merging all concepts that have the same label. We collapse the graph fragments that are headed by either a date entity ("date-entity") or a named entity ("name"), if the fragment is a flat structure. A collapsed named entity is further combined with its parent (e.g., "person") into one concept node if it is the only child of the parent. Two such graph fragments are illustrated in Fig. 2. We choose named and date entity concepts since they appear frequently, but most often refer to different entities (e.g., "April 8, 2002" vs. "Nov. 17"). No further collapsing is done. A collapsed graph fragment is assigned a new label by concatenating its constituent concept and edge labels. Each fragment that is collapsed into a new concept node can then only be merged with other identical fragments. This process will not recognize coreferent concepts like "Barack Obama" = "Obama" and "say-01" = "report-01," but future work may incorporate both entity coreference resolution and event coreference resolution, as concept nodes can represent either.
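The merging of identically labeled concepts can be sketched as follows. This is a minimal illustration, not the implementation used in the experiments: the tuple-based graph encoding, the function name `merge_concepts`, and the frame labels in the toy input are all assumptions made for the example, and the collapsing of date and named entity fragments is not modeled.

```python
from collections import Counter

def merge_concepts(sentence_graphs):
    """Merge nodes that share a concept label across sentence graphs.

    Each graph is a pair (nodes, edges): nodes maps a local node id to a
    concept label, edges is a list of (source_id, relation, target_id).
    Returns the mention frequency of each merged concept and the edge
    list rewritten in terms of concept labels.
    """
    concept_freq = Counter()
    merged_edges = []
    for nodes, edges in sentence_graphs:
        # Every mention counts once toward the concept's frequency.
        concept_freq.update(nodes.values())
        for src, rel, tgt in edges:
            merged_edges.append((nodes[src], rel, nodes[tgt]))
    return concept_freq, merged_edges

# Hypothetical toy graphs loosely following the two sentences of Fig. 3.
graph_a = ({"s": "see-01", "i": "i", "d": "dog", "r": "run-02"},
           [("s", "ARG0", "i"), ("s", "ARG1", "d"), ("r", "ARG0", "d")])
graph_b = ({"c": "chase-01", "d": "dog", "k": "cat"},
           [("c", "ARG0", "d"), ("c", "ARG1", "k")])
concept_freq, merged_edges = merge_concepts([graph_a, graph_b])
```

In this toy input the two "dog" nodes become one concept with frequency 2, which is the kind of count the frequency feature can draw on.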
Due to the concept merging step, a pair of concepts may now have multiple labeled edges between them. We merge all such edges between a given pair of concepts into a single unlabeled edge. We remember the two most common labels in such a group, which are used in the edge "Label" feature (Table 3).
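A small sketch of this edge merging step is given below. It assumes that a "pair of concepts" is an ordered (source, target) pair and that ties between equally frequent labels may be broken arbitrarily; neither detail is specified above, and the name `merge_edges` is hypothetical.

```python
from collections import Counter, defaultdict

def merge_edges(labeled_edges):
    """Collapse parallel labeled edges into one unlabeled edge per concept pair.

    labeled_edges: iterable of (source_concept, relation, target_concept).
    Returns {(source, target): [up to two most common relations]}.
    """
    labels = defaultdict(Counter)
    for src, rel, tgt in labeled_edges:
        labels[(src, tgt)][rel] += 1
    return {pair: [rel for rel, _ in counts.most_common(2)]
            for pair, counts in labels.items()}

# Toy input with three parallel edges between the same pair of concepts.
edges = [("chase-01", "ARG0", "dog"), ("chase-01", "ARG0", "dog"),
         ("chase-01", "ARG1", "dog"), ("chase-01", "ARG1", "cat")]
top_labels = merge_edges(edges)
```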
To ensure that the source graph is connected, we add a new "ROOT" node and connect it to every concept that was originally the root of a sentence graph (see Fig. 3). When we apply this procedure to the documents in our dataset ( §3), source graphs contain 144 nodes and 221 edges on average.
Figure 3: A source graph formed from two sentence AMR graphs (Sentence A: "I saw Joe's dog, which was running in the garden." Sentence B: "The dog was chasing a cat."). Concept collapsing, merging, and graph expansion are demonstrated. Edges are unlabeled. A "ROOT" node is added to ensure connectivity. (1) and (2) are among edges added through the optional expansion step, corresponding to sentence- and document-level expansion, respectively. Concept nodes included in the summary graph are shaded.
Table 2: Summary edge coverage (%). Columns: Labeled, Unlabeled, Expand (Sent.), Expand (Doc.).
We investigated how well these automatically constructed source graphs cover the gold-standard summary graphs produced by AMR annotators. Ideally, a source graph should cover all of the gold-standard edges, so that summarization can be accomplished by selecting a subgraph of the source graph (§4.2). In Table 2, columns one and two report labeled and unlabeled edge coverage. 'Unlabeled' counts edges as matching if both the source and destination concepts have identical labels, but ignores the edge label.
In order to improve edge coverage, we explore expanding the source graph by adding every possible edge between every pair of concepts within the same sentence. We also explored adding every possible edge between every pair of concepts in the entire source graph. Edges that are newly introduced during expansion receive a default label 'null'. We report unlabeled edge coverage in Table 2, columns three and four, respectively. Subgraph prediction became infeasible with the document-level expansion, so we conducted our experiments using only sentence-level expansion. Sentence-level graph expansion increases the average number of edges by a factor of 15, to 3,292.
Fig. 3 illustrates the motivation. Document-level expansion covers the gold-standard summary edge "chase-01" → "garden," yet the expansion is computationally prohibitive; sentence-level expansion adds an edge "dog" → "garden," which enables the prediction of a structure with similar semantic meaning: "Joe's dog was in the garden chasing a cat."
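Sentence-level expansion and unlabeled edge coverage can be illustrated with the short sketch below. It assumes that "every possible edge" means both directions between each pair of distinct concepts, and the toy edges only loosely follow Fig. 3; the function names are hypothetical.

```python
from itertools import permutations

def expand_sentence(concepts):
    """Return every directed edge between distinct concepts of one sentence."""
    return set(permutations(sorted(set(concepts)), 2))

def unlabeled_coverage(source_edges, summary_edges):
    """Fraction of summary edges whose (source, target) pair occurs in the source graph."""
    summary_edges = set(summary_edges)
    if not summary_edges:
        return 0.0
    return len(summary_edges & set(source_edges)) / len(summary_edges)

# Toy unlabeled edges; the summary contains "dog" -> "garden", absent from the source.
source = {("run-02", "dog"), ("run-02", "garden"),
          ("chase-01", "dog"), ("chase-01", "cat")}
summary = {("chase-01", "dog"), ("chase-01", "cat"), ("dog", "garden")}

before = unlabeled_coverage(source, summary)
expanded = source | expand_sentence(["run-02", "dog", "garden"])
after = unlabeled_coverage(expanded, summary)
```

Here coverage rises from two of three summary edges to all three, at the price of a denser source graph.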

Subgraph Prediction
We pose the selection of a summary subgraph from the source graph as a structured prediction problem that trades off among including important information without altering its meaning, maintaining brevity, and producing fluent language (Nenkova and McKeown, 2011). We incorporate these concerns in the form of features and constraints in the statistical model for subgraph selection.
Let $G = (V, E)$ denote the merged source graph, where each node $v \in V$ represents a unique concept and each directed edge $e \in E$ connects two concepts. $G$ is a connected, directed, node-labeled graph. Edges in this graph are unlabeled, and edge labels are not predicted during subgraph selection. We seek to maximize a score that factorizes over graph nodes and edges that are included in the summary graph. For subgraph $(V', E')$:
$$\mathrm{score}(V', E'; \theta, \psi) = \sum_{v \in V'} \theta^\top f(v) + \sum_{e \in E'} \psi^\top g(e) \qquad (1)$$
where $f(v)$ and $g(e)$ are the feature representations of node $v$ and edge $e$, respectively. We describe node and edge features in Table 3. $\theta$ and $\psi$ are vectors of empirically estimated coefficients in a linear model. We next formulate the selection of the subgraph using integer linear programming (ILP; §4.2.1) and describe supervised learning for the parameters (coefficients) from a collection of source graphs paired with summary graphs (§4.2.2).
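As an illustration of how such a factorized linear score can be computed, the sketch below stores features and coefficients as sparse dictionaries. The feature names and weight values are invented for the example and are not the learned parameters.

```python
def dot(weights, features):
    """Sparse dot product between a weight dict and a feature dict."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def score(nodes, edges, theta, psi, node_feats, edge_feats):
    """Sum of node scores (theta . f(v)) plus sum of edge scores (psi . g(e))."""
    return (sum(dot(theta, node_feats[v]) for v in nodes)
            + sum(dot(psi, edge_feats[e]) for e in edges))

# Hypothetical coefficients and binary features.
theta = {"bias": -1.0, "freq>=2": 2.5}
psi = {"bias": -0.5, "is_expanded": -1.0}
node_feats = {"dog": {"bias": 1.0, "freq>=2": 1.0}, "cat": {"bias": 1.0}}
edge_feats = {("dog", "cat"): {"bias": 1.0, "is_expanded": 1.0}}

total = score(["dog", "cat"], [("dog", "cat")], theta, psi, node_feats, edge_feats)
```

Because the score is a sum of independent node and edge terms, each term can be precomputed once and treated as a constant by the decoder.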

Decoding
We cast decoding as an ILP whose constraints ensure that the output forms a connected subcomponent of the source graph. We index source graph concept nodes by $i$ and $j$, giving the "ROOT" node index 0. Let $N$ be the number of nodes in the graph.
Let $v_i$ and $e_{i,j}$ be binary variables. $v_i$ is 1 iff source node $i$ is included; $e_{i,j}$ is 1 iff the directed edge from node $i$ to node $j$ is included.
Table 3: Node and edge features.
Node features
  Concept: Identity feature for concept label
  Freq: Concept frequency in the input sentence set; one binary feature defined for each frequency threshold τ = 0/1/2/5/10
  Depth: Average and smallest depth of node to the root of the sentence graph; binarized using 5 depth thresholds
  Position: Average and foremost position of sentences containing the concept; binarized using 5 position thresholds
  Span: Average and longest word span of concept; binarized using 5 length thresholds; word spans obtained from JAMR
  Entity: Two binary features indicating whether the concept is a named entity/date entity or not
  Bias: Bias term, 1 for any node
Edge features
  Label: First and second most frequent edge labels between concepts; relative frequency of each label, binarized by 3 thresholds
  Freq: Edge frequency (without label, non-expanded edges) in the document sentences; binarized using 5 frequency thresholds
  Position: Average and foremost position of sentences containing the edge (without label); binarized using 5 position thresholds
  Nodes: Node features extracted from the source and target nodes (all above node features except the bias term)
  IsExpanded: A binary feature indicating whether the edge is due to graph expansion or not; edge frequency (without label, all occurrences)
  Bias: Bias term, 1 for any edge
The ILP objective to be maximized is Equation 1, rewritten here in the present notation:
$$\sum_{i} v_i \, \theta^\top f(i) + \sum_{(i,j) \in E} e_{i,j} \, \psi^\top g(i,j) \qquad (2)$$
Note that this objective is linear in $\{v_i, e_{i,j}\}_{i,j}$ and that features and coefficients can be folded into node and edge scores and treated as constants during decoding.
Constraints are required to ensure that the selected nodes and edges form a valid graph. In particular, if an edge $(i, j)$ is selected ($e_{i,j}$ takes value of 1), then both its endpoints $i$, $j$ must be included:
$$v_i - e_{i,j} \ge 0, \quad v_j - e_{i,j} \ge 0, \quad \forall i, j \qquad (3)$$
Connectivity is enforced using a set of single-commodity flow variables $f_{i,j}$, each taking a nonnegative integral value, representing the flow from node $i$ to $j$. The root node sends out up to $N$ units of flow, one to reach each included node (Equation 4). Each included node consumes one unit of flow, reflected as the difference between incoming and outgoing flow (Equation 5). Flow may only be sent over an edge if the edge is included (Equation 6).
$$\sum_{i} f_{0,i} - \sum_{i \neq 0} v_i = 0 \qquad (4)$$
$$\sum_{i} f_{i,j} - \sum_{k} f_{j,k} - v_j = 0, \quad \forall j \neq 0 \qquad (5)$$
$$N \cdot e_{i,j} - f_{i,j} \ge 0, \quad \forall i, j \qquad (6)$$
The AMR representation allows graph reentrancies (concept nodes having multiple parents), yet reentrancies are rare; about 5% of edges are reentrancies in our dataset. In this preliminary study we force the summary graph to be tree-structured, requiring that there is at most one incoming edge for each node:
$$\sum_{i} e_{i,j} \le 1, \quad \forall j \qquad (7)$$
Interestingly, the formulation so far equates to an ILP for solving the prize-collecting Steiner tree problem (PCST; Segev, 1987), which is known to be NP-complete (Karp, 1972). Our ILP formulation is modified from that of Ljubić et al. (2006). Flow-based constraints for tree structures have also previously been used in NLP for dependency parsing and sentence compression (Thadani and McKeown, 2013). In our experiments, we use an exact ILP solver, though many approximate methods are available.
Finally, an optional constraint can be used to fix the size of the summary graph (measured by the number of edges) to $L$:
$$\sum_{i} \sum_{j} e_{i,j} = L \qquad (8)$$
The performance of summarization systems depends strongly on their compression rate, so systems are only directly comparable when their compression rates are similar (Napoles et al., 2011). $L$ is supplied to the system to control summary graph size.
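To make the structural constraints concrete, the sketch below replaces the ILP with an exhaustive search over edge subsets. This is a deliberate substitution for illustration only: it is exponential in the number of edges and usable only on toy graphs, whereas the experiments rely on an exact ILP solver. The function names, scores, and graph are hypothetical. A subset is accepted when every node has at most one incoming edge and every selected edge is connected to "ROOT", which is the condition the endpoint, flow, and tree constraints jointly describe.

```python
from itertools import combinations

def is_valid_tree(edge_subset, root="ROOT"):
    """Check one parent per node and reachability of every selected edge from root."""
    parent = {}
    for src, tgt in edge_subset:
        if tgt == root or tgt in parent:  # root has no parent; one incoming edge at most
            return False
        parent[tgt] = src
    for start in parent:
        seen, cur = set(), start
        while cur != root:                # walk up toward the root
            if cur in seen or cur not in parent:
                return False              # cycle, or a source node that is not connected
            seen.add(cur)
            cur = parent[cur]
    return True

def decode(edges, node_score, edge_score, size=None, root="ROOT"):
    """Brute-force search; returns (best_score, best_edges). size plays the role of L."""
    best = (float("-inf"), ())
    sizes = [size] if size is not None else range(len(edges) + 1)
    for k in sizes:
        for subset in combinations(edges, k):
            if not is_valid_tree(subset, root):
                continue
            # In a valid tree each included non-root node is the target of exactly one edge.
            total = (sum(edge_score[e] for e in subset)
                     + sum(node_score[tgt] for _, tgt in subset))
            best = max(best, (total, subset))
    return best

edges = [("ROOT", "chase-01"), ("chase-01", "dog"), ("chase-01", "cat"),
         ("ROOT", "see-01"), ("see-01", "dog")]
node_score = {"chase-01": 2.0, "dog": 1.5, "cat": 1.0, "see-01": -1.0}
edge_score = {e: -0.5 for e in edges}

best_score, best_edges = decode(edges, node_score, edge_score)
fixed_score, fixed_edges = decode(edges, node_score, edge_score, size=2)
```

Without a size limit the search keeps the three edges under "chase-01"; with `size=2` it must drop the lowest-value one.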

Parameter Estimation
Given a collection of input and output pairs (here, source graphs and summary graphs), a natural starting place for learning the coefficients θ and ψ is the structured perceptron (Collins, 2002), which is easy to implement and often performs well. Alternatively, incorporating factored cost functions through a structured hinge loss leads to a structured support vector machine (SVM; Taskar et al., 2004) which can be learned with a very similar stochastic optimization algorithm. In our scenario, however, the gold-standard summary graph may not actually be a subset of the source graph. In machine translation, ramp loss has been found to work well in situations where the gold-standard output may not even be in the hypothesis space of the model (Gimpel and Smith, 2012). The structured perceptron, hinge, and ramp losses are compared in Table 4.
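The three losses can be compared on a toy hypothesis space with the sketch below. It makes two simplifying assumptions that are not part of the method description: the max operations range over an explicit candidate list rather than being solved by cost-augmented decoding, and the cost charges one unit for each node or edge that appears in exactly one of the two graphs. Graphs are represented as sets of elements, and all names and weights are hypothetical.

```python
def cost(candidate, gold):
    """Unit penalty for each element found in exactly one of the two graphs."""
    return len(set(candidate) ^ set(gold))

def perceptron_loss(score, candidates, gold):
    return -score(gold) + max(score(g) for g in candidates)

def hinge_loss(score, candidates, gold):
    return -score(gold) + max(score(g) + cost(g, gold) for g in candidates)

def ramp_loss(score, candidates, gold):
    return (-max(score(g) - cost(g, gold) for g in candidates)
            + max(score(g) + cost(g, gold) for g in candidates))

# Toy model: a graph's score is the sum of its elements' weights.
weights = {"dog": 1.0, "cat": 0.5, "garden": 2.0}

def toy_score(graph):
    return sum(weights[x] for x in graph)

gold = frozenset({"dog", "cat"})
# The gold graph is deliberately absent from the hypothesis space.
candidates = [frozenset({"dog"}), frozenset({"dog", "garden"})]

losses = (perceptron_loss(toy_score, candidates, gold),
          hinge_loss(toy_score, candidates, gold),
          ramp_loss(toy_score, candidates, gold))
```

Note that the ramp loss never scores the gold graph directly; it contrasts two reachable hypotheses, which is why it remains well defined when the gold graph lies outside the hypothesis space.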
We explore learning by minimizing each of the perceptron, hinge, and ramp losses, each optimized using Adagrad (Duchi et al., 2011), a stochastic optimization procedure. Let $\beta$ be one model parameter (coefficient from $\theta$ or $\psi$). Let $g^{(t)}$ be the subgradient of the loss on the instance considered on the $t$th iteration with respect to $\beta$. Given an initial step size $\eta$, the update for $\beta$ on iteration $t$ is:
$$\beta^{(t+1)} \leftarrow \beta^{(t)} - \frac{\eta}{\sqrt{\sum_{\tau=1}^{t} \left(g^{(\tau)}\right)^2}} \, g^{(t)}$$
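A per-coordinate Adagrad step can be written in a few lines. The sketch below follows the standard Adagrad rule of Duchi et al. (2011), in which the step size is divided by the square root of the accumulated squared subgradients; the zero-gradient guard and the function name are choices made for this example.

```python
import math

def adagrad_update(beta, grad, sum_sq, eta):
    """One Adagrad step for a single coefficient; returns (new_beta, new_sum_sq)."""
    sum_sq += grad * grad
    if sum_sq == 0.0:  # no gradient signal so far, leave the parameter unchanged
        return beta, sum_sq
    return beta - eta * grad / math.sqrt(sum_sq), sum_sq

# Two iterations with subgradients 3 and 4 and initial step size 1.
beta, sum_sq = 0.0, 0.0
for grad in [3.0, 4.0]:
    beta, sum_sq = adagrad_update(beta, grad, sum_sq, eta=1.0)
```

Coefficients that receive large or frequent subgradients thus take progressively smaller steps, while rarely updated ones keep a larger effective step size.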

Generation
Generation from AMR-like representations has received some attention, e.g., by Langkilde and Knight (1998) who described a statistical method. Though we know of work in progress driven by the goal of machine translation using AMR, there is currently no system available. We therefore use a heuristic approach to generate a bag of words. Given a predicted subgraph, a system summary is created by finding the most frequently aligned word span for each concept node. (Recall that the JAMR parser provides these alignments; §2). The words in the resulting spans are generated in no particular order. While this is not a natural language summary, it is suitable for unigram-based summarization evaluation methods like ROUGE-1.
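The heuristic can be sketched as follows, assuming the alignments have already been gathered into a mapping from concept label to the list of word spans it was aligned to. That data layout, the whitespace tokenization, and the function name are assumptions for illustration and do not reflect JAMR's actual output format.

```python
from collections import Counter

def bag_of_words(selected_concepts, alignments):
    """Emit the most frequently aligned word span of each selected concept."""
    words = []
    for concept in selected_concepts:
        spans = alignments.get(concept)
        if not spans:
            continue  # a concept without any aligned span contributes no words
        best_span, _ = Counter(spans).most_common(1)[0]
        words.extend(best_span.split())
    return words

# Hypothetical alignments collected over the input sentences.
alignments = {
    "dog": ["dog", "dog", "puppy"],
    "chase-01": ["chasing", "chased", "chasing"],
    "date-entity": ["April 8, 2002"],
}
summary = bag_of_words(["chase-01", "dog", "date-entity"], alignments)
```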

Experiments
In Table 5, we report the performance of subgraph prediction and end-to-end summarization on the test set, using gold-standard and automatic AMR parses for the input. Gold-standard AMR annotations are used for model training in all conditions. During testing, we apply the trained model to source graphs constructed using either gold-standard or JAMR parses. In all of these experiments, we use the number of edges in the gold-standard summary graph to fix the number of edges in the predicted subgraph, allowing direct comparison across conditions.
Subgraph prediction is evaluated against the gold-standard AMR graphs on summaries. We report precision, recall, and $F_1$ for nodes, and $F_1$ for edges.
Oracle results for the subgraph prediction stage are obtained using the ILP decoder to minimize the cost of the output graph, given the gold standard. We assign wrong nodes and edges a score of −1, correct nodes and edges a score of 0, then decode with the same structural constraints as in subgraph prediction. The resulting graph is the best summary graph in the hypothesis space of our model, and provides an upper bound on performance achievable within our framework. Oracle performance on node prediction is in the range of 80% when using gold-standard AMR annotations, and 70% when using JAMR output. Edge prediction has lower performance, yielding 52.2% for gold-standard and 31.1% for JAMR parses. When graph expansion was applied, the numbers increased to 64% and 46.7%, respectively. Uncovered summary edges (i.e., those not covered by the source graph) are a major source of low recall on edge prediction (see Table 2); graph expansion slightly alleviates this issue.
Summarization is evaluated by comparing system summaries against reference summaries, using ROUGE-1 scores (Lin, 2004). System summaries are generated using the heuristic approach presented in §5: given a predicted subgraph, the approach finds the most frequently aligned word span for each concept node, and then puts them together as a bag of words. ROUGE-1 is particularly useful for evaluating such less well-formed summaries, such as those generated from speech transcripts. Oracle summaries are produced by taking the gold-standard AMR parses of the reference summary, obtaining the most frequently aligned word span for each unique concept node using the JAMR aligner (§2), and then generating a bag-of-words summary. Evaluation of oracle summaries is performed in the same manner as for system summaries. The above process does not involve graph expansion, so summarization performance is the same for the two conditions "Oracle" and "Oracle + Expand."
Structured perceptron loss: $-\mathrm{score}(G^*) + \max_G \mathrm{score}(G)$
Structured hinge loss: $-\mathrm{score}(G^*) + \max_G \left(\mathrm{score}(G) + \mathrm{cost}(G; G^*)\right)$
Structured ramp loss: $-\max_G \left(\mathrm{score}(G) - \mathrm{cost}(G; G^*)\right) + \max_G \left(\mathrm{score}(G) + \mathrm{cost}(G; G^*)\right)$
Table 4: Loss functions minimized in parameter estimation. $G^*$ denotes the gold-standard summary graph. score(·) is as defined in Equation 1. $\mathrm{cost}(G; G^*)$ penalizes each vertex or edge in $G \cup G^* \setminus (G \cap G^*)$. Since cost factors just like the scoring function, each max operation can be accomplished using a variant of ILP decoding (§4.2.1) in which the cost is incorporated into the linear objective while the constraints remain the same.
Table 5: Subgraph prediction and summarization (to bag of words) results on test set. Column groups: Subgraph Prediction (Nodes, Edges) and Summarization. Gold-standard AMR annotations are used for model training in all conditions. "+ Expand" means the result is obtained using source graph with expansion; edge performance is measured ignoring labels.
We find that JAMR parses are a large source of degradation of edge prediction performance, and a smaller but still significant source of degradation for concept prediction. Surprisingly, using JAMR parses leads to slightly improved ROUGE-1 scores. Keep in mind, though, that under our bag-of-words generator, ROUGE-1 scores only depend on concept prediction and are unaffected by edge prediction. The oracle summarization results, 65.8% and 57.8% $F_1$ scores for gold-standard and JAMR parses, respectively, further suggest that improved graph summarization models (step 2) might benefit from future improvements in AMR parsing (step 1).
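For intuition about why a bag of words suffices for this metric, the sketch below computes a simplified unigram-overlap score with clipped counts. It is not the official ROUGE toolkit, whose preprocessing options are not modeled here, and the example word lists are invented.

```python
from collections import Counter

def rouge_1(system_words, reference_words):
    """Unigram precision, recall, and F1 using clipped (multiset) overlap."""
    sys_counts, ref_counts = Counter(system_words), Counter(reference_words)
    overlap = sum((sys_counts & ref_counts).values())
    precision = overlap / len(system_words) if system_words else 0.0
    recall = overlap / len(reference_words) if reference_words else 0.0
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return precision, recall, f1

# Word order plays no role, so an unordered bag of words can be scored directly.
p, r, f = rouge_1(["dog", "chasing", "cat", "garden"],
                  ["the", "dog", "chased", "the", "cat"])
```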
Across all conditions and both evaluations, we find that incorporating a cost-aware loss function (hinge vs. perceptron) has little effect, but that using ramp loss leads to substantial gains.
In Table 5, we show detailed results with and without graph expansion. "+ Expand" means the results are obtained using the expanded source graph. We find that graph expansion only marginally affects system performance, and that it slightly hurts edge prediction. For example, using ramp loss with JAMR parses as input, we obtained 50.7% and 19.0% for node and edge prediction with graph expansion, versus 51.5% and 20.0% without it. On the other hand, expansion increases the oracle performance by a large margin. This suggests that with more training data, or a more sophisticated model that is able to better discriminate among the enlarged output space, graph expansion still has promise to be helpful.

Related and Future Work
According to Dang and Owczarzak (2008), the majority of competitive summarization systems are extractive, selecting representative sentences from input documents and concatenating them to form a summary. This is often combined with sentence compression, allowing more sentences to be included within a budget. ILPs and approximations have been used to encode compression and extraction (McDonald, 2007; Gillick and Favre, 2009; Berg-Kirkpatrick et al., 2011; Almeida and Martins, 2013; Li et al., 2014). Other decoding approaches have included a greedy method exploiting submodularity (Lin and Bilmes, 2010), document reconstruction (He et al., 2012), and graph cuts (Qian and Liu, 2013), among others.
Previous work on abstractive summarization has included user studies that compare extractive with NLG-based abstractive summarization (Carenini and Cheung, 2008). Ganesan et al. (2010) propose to construct summary sentences by repeatedly searching for the highest-scoring graph paths. Gerani et al. (2014) generate abstractive summaries by modifying discourse parse trees. Our work is similar in spirit to Cheung and Penn (2014), which splices and recombines dependency parse trees to produce abstractive summaries. In contrast, our work operates on semantic graphs, taking advantage of the recently developed AMR Bank.
Also related to our work are graph-based summarization methods (Vanderwende et al., 2004; Erkan and Radev, 2004; Mihalcea and Tarau, 2004). Vanderwende et al. (2004) transform input to logical forms, score nodes using PageRank, and grow the graph from high-value nodes using heuristics. In Erkan and Radev (2004) and Mihalcea and Tarau (2004), the graph connects surface terms that co-occur. In both cases, the graphs are constructed based on surface text; they are not representations of propositional semantics like AMR. However, future work might explore similar graph-based calculations to contribute features for subgraph selection in our framework.
Our constructed source graph can easily reach ten times or more the size of a sentence dependency graph. Thus more efficient graph decoding algorithms, e.g., based on Lagrangian relaxation or approximate algorithms, may be explored in future work. Other future directions may include jointly performing subgraph and edge label prediction; exploring a full-fledged pipeline that consists of an automatic AMR parser, a graph-to-graph summarizer, and an AMR-to-text generator; and devising an evaluation metric that is better suited to abstractive summarization.
Many domains stand to eventually benefit from summarization. These include books, audio/video segments, and legal texts.

Conclusion
We have introduced a statistical abstractive summarization framework driven by the Abstract Meaning Representation. The centerpiece of the approach is a structured prediction algorithm that transforms semantic graphs of the input into a single summary semantic graph. Experiments show the approach to be promising and suggest directions for future research.