Structure-Aware Abstractive Conversation Summarization via Discourse and Action Graphs

Abstractive conversation summarization has received much attention recently. However, these generated summaries often suffer from insufficient, redundant, or incorrect content, largely due to the unstructured and complex characteristics of human-human interactions. To this end, we propose to explicitly model the rich structures in conversations for more precise and accurate conversation summarization, by first incorporating discourse relations between utterances and action triples (“who-doing-what”) in utterances through structured graphs to better encode conversations, and then designing a multi-granularity decoder to generate summaries by combining all levels of information. Experiments show that our proposed models outperform state-of-the-art methods and generalize well in other domains in terms of both automatic evaluations and human judgments. We have publicly released our code at https://github.com/GT-SALT/Structure-Aware-BART.


Introduction
Online interaction has become an indispensable component of everyday life, and people are increasingly using textual conversations to exchange ideas, make plans, and share information. However, it is time-consuming to recap and grasp all the core content within every complex conversation (Gao et al., 2020; Feng et al., 2020). As a result, how to organize massive everyday interactions into natural, concise, and informative text, i.e., abstractive conversation summarization, is gaining importance.
Significant progress has been made on abstractive summarization for structured documents via pointer generators (See et al., 2017), reinforcement methods (Paulus et al., 2018), and pre-trained models (Liu and Lapata, 2019; Lewis et al., 2020). Despite the huge success, it is challenging to directly apply document models to summarize conversations, due to a set of inherent differences between conversations and documents (Gliwa et al., 2019). First, speaker interruptions like repetitions, false-starts, and hesitations are frequent in conversations (Sacks et al., 1978), and key information resides in different portions of a conversation. These unstructured properties make it hard for models to focus on the salient content that is necessary for generating both abstractive and informative summaries. Second, there is more than one speaker in conversations and people interact with each other in different language styles. The complex interactions among multiple speakers make it harder for models to identify and associate speakers with correct actions so as to generate factual summaries.
Figure 1: An example of discourse relation graph (a) and action graph (b) from one conversation in SAMSum (Gliwa et al., 2019). The annotated summary is "Simon was on the phone before, so he didn't here Helen calling. Simon will fetch Helen some tissues."
In order to summarize unstructured and complex conversations, a growing body of research has been conducted, such as transferring document summarization methods to conversation settings (Shang et al., 2018; Gliwa et al., 2019), adopting hierarchical models, or incorporating conversation structures like topic segmentation (Liu et al., 2019b; Chen and Yang, 2020), dialogue acts (Goo and Chen, 2018), and conversation stages (Chen and Yang, 2020). However, current approaches still face challenges in terms of succinctness and faithfulness, as most prior studies (i) fail to explicitly model dependencies between utterances, which can help identify salient portions of conversations (Bui et al., 2009), and (ii) lack structured representations to learn the associations between speakers, actions and events. We argue that these rich linguistic structures associated with conversations are key components towards generating abstractive and factual conversation summaries.
To this end, we present a structure-aware sequence-to-sequence model, in which we equip abstractive conversation summarization models with rich conversation structures through two types of graphs: discourse relation graphs and action graphs. Discourse relation graphs are constructed based on dependency-based discourse relations (Kirschner et al., 2012; Stone et al., 2013; Asher et al., 2016; Qin et al., 2017) between intertwined utterances, where each Elementary Discourse Unit (EDU) is one single utterance and they are linked through 16 different types of relations (Asher et al., 2016). As shown in Figure 1(a), highly related utterances are linked based on discourse relations like Question Answer Pairs, Comment and Explanation. Explicitly modeling these utterance relations in conversations can aid models in recognizing key content for succinct and informative summarization. Action graphs are constructed from the "WHO-DOING-WHAT" triplets in conversations, which express socially situated identities and activities (Gee, 2014). For instance, in Figure 1(b), the action graph provides explicit information between Simon, fetch, and tissues for the utterance it is Simon who will fetch the tissues, making models less likely to generate summaries with wrong references (e.g., Helen will fetch the tissues).
To sum up, our contributions are: (1) We propose to utilize discourse relation graphs and action graphs to better encode conversations for conversation summarization. (2) We design structure-aware sequence-to-sequence models to combine these structured graphs and generate summaries with the help of a novel multi-granularity decoder. (3) We demonstrate the effectiveness of our proposed methods through experiments on a large-scale conversation summarization dataset, SAMSum (Gliwa et al., 2019). (4) We further show that our structure-aware models can generalize well in new domains such as debate summarization.

Related Work
Document Summarization Compared to extractive document summarization (Gupta and Lehal, 2010; Narayan et al., 2018; Liu and Lapata, 2019), abstractive document summarization is generally considered more challenging and has received more attention. Various methods have been designed to tackle abstractive document summarization, like sequence-to-sequence models (Rush et al., 2015), pointer generators (See et al., 2017), reinforcement learning methods (Paulus et al., 2018), and pre-trained models (Lewis et al., 2020). To generate faithful abstractive document summaries (Maynez et al., 2020), graph-based models were introduced recently, such as extracting entity types (Fernandes et al., 2018; Fan et al., 2019), leveraging knowledge graphs (Zhu et al., 2020a), or designing extra fact correction modules (Dong et al., 2020). Inspired by these graph-based methods, we also construct action graphs for generating more factual conversation summaries.
Conversation Summarization Extractive dialogue summarization (Murray et al., 2005) has been studied extensively via statistical machine learning methods such as skip-chain CRFs (Galley, 2006), SVM with LDA models (Wang and Cardie, 2013), and multi-sentence compression algorithms (Shang et al., 2018). Such methods struggled with generating succinct, fluent, and natural summaries, especially when the key information needs to be aggregated from multiple first-person point-of-view utterances (Song et al., 2020). Abstractive conversation summarization overcomes these issues by designing hierarchical models, incorporating commonsense knowledge (Feng et al., 2020), or leveraging conversational structures like dialogue acts (Goo and Chen, 2018), key point sequences (Liu et al., 2019a), topic segments (Liu et al., 2019b), and stage developments (Chen and Yang, 2020). Some recent research has also utilized discourse relations as input features in classifiers to detect important content in conversations (Murray et al., 2006; Bui et al., 2009; Qin et al., 2017). However, current models still have not explicitly utilized the dependencies between different utterances, making it hard for models to leverage long-range dependencies and utilize these salient utterances. Moreover, less attention has been paid to identifying the actions of different speakers and how they interact with or refer to each other, leading to unfaithful summarization with incorrect references or wrong reasoning (Gliwa et al., 2019).
Figure 2: Model architecture. Each utterance is encoded via transformer encoder; discourse relation graphs and action graphs are encoded through Graph Attention Networks (a). The multi-granularity decoder (b) then generates summaries based on all levels of encoded information including utterances, action graphs, and discourse graphs.
To fill these gaps, we propose to explicitly model actions within utterances, and relations between utterances in conversations in a structured way, by using discourse relation graphs and action graphs and further combining these through relational graph encoders and multi-granularity decoders for abstractive conversation summarization.

Methods
To generate abstractive and factual summaries from unstructured conversations, we propose to model structural signals in conversations by first constructing discourse relation graphs and action graphs (Section 3.1), and then encoding the graphs together with conversations (Section 3.2) as well as incorporating these different levels of information in the decoding stage through a multi-granularity decoder (Section 3.3) to summarize given conversations. The overall architecture is shown in Figure 2.

Structured Graph Construction
This section describes how to construct the discourse relation graphs and action graphs. Formally, for a given conversation $C = \{u_0, ..., u_m\}$ with $m$ utterances, we construct a discourse relation graph $G_D = (V_D, E_D)$, where $V_D$ is the set of nodes representing Elementary Discourse Units (EDUs) and $E_D$ is the adjacency matrix that describes the relations between EDUs, and an action graph $G_A = (V_A, E_A)$, where $V_A$ is the set of nodes representing "WHO", "DOING" and "WHAT" arguments and $E_A$ is the adjacency matrix that links "WHO-DOING-WHAT" triples.
Discourse Relation Graph Utterances from different speakers do not occur in isolation; instead, they are related within the context of discourse (Murray et al., 2006;Qin et al., 2017), which has been shown effective for dialogue understanding like identifying the decisions in multi-party dialogues (Bui et al., 2009) and detecting salient content in email conversations (McKeown et al., 2007). Although current attention-based neural models are supposed to, or might implicitly, learn certain relations between utterances, they often struggle to focus on many informative utterances (Chen and Yang, 2020; Song et al., 2020) and fail to address long-range dependencies (Xu et al., 2020), especially when there are frequent interruptions. As a result, explicitly incorporating the discourse relations will help neural summarization models better encode the unstructured conversations and concentrate on the most salient utterances to generate more informative and less redundant summaries.
To do so, we view each utterance as an EDU and use the discourse relation types defined in Asher et al. (2016). We first pre-train a discourse parsing model (Shi and Huang, 2019) on a human-annotated multiparty dialogue corpus (Asher et al., 2016), with 0.775 F1 score on link predictions and 0.557 F1 score on relation classifications, which are comparable to the state-of-the-art results (Shi and Huang, 2019). We then utilize this pre-trained parser to predict the discourse relations within conversations in our SAMSum corpus (Gliwa et al., 2019).
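As a minimal sketch of the resulting data structure, the parser output can be stored as a relation-labelled adjacency matrix. The `(i, j, relation)` tuple format, the function name, and the toy links below are assumptions made for illustration; only the relation names come from the text.

```python
def build_discourse_graph(num_utterances, links):
    """Return E_D, where E_D[i][j] holds the relation label between
    utterance i and utterance j, or None if the two are not linked."""
    E_D = [[None] * num_utterances for _ in range(num_utterances)]
    for i, j, relation in links:
        if not (0 <= i < num_utterances and 0 <= j < num_utterances):
            raise ValueError(f"link ({i}, {j}) is out of range")
        E_D[i][j] = relation
    return E_D


# Hypothetical parser output for a four-utterance conversation.
links = [
    (0, 1, "Question Answer Pair"),
    (1, 2, "Comment"),
    (1, 3, "Explanation"),
]
E_D = build_discourse_graph(4, links)
num_edges = sum(label is not None for row in E_D for label in row)
```

Storing the label rather than a 0/1 flag keeps the relation type available for the relation embedding used later by the graph encoder.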

Action Graph
The "who-doing-what" triples from utterances can provide explicit visualizations of speakers and their actions, the key to understanding the concrete details of what happened in conversations (Moser, 2001; Gee, 2014; Sacks et al., 1978). Simply relying on neural models to identify this information from conversations often fails to produce factual characterizations of these concrete details (Cao et al., 2018). To this end, we extract "WHO-DOING-WHAT" triples from utterances and construct action graphs for conversation summarization (Chen et al., 2019; Huang et al., 2020b,a). Specifically, we first transform the first-person point-of-view utterances into their third-person point-of-view forms based on simple rules: (i) substituting first/second-person pronouns with the names of the current speaker or surrounding speakers and (ii) replacing third-person pronouns based on coreference clusters in conversations detected by the Stanford CoreNLP (Manning et al., 2014). For example, an utterance "I'll bring it to you tomorrow" from Amanda to Jerry will be transformed into "Amanda'll bring cakes to Jerry tomorrow". Then we extract "WHO-DOING-WHAT" (subject-predicate-object) triples from transformed conversations using the open information extraction (OpenIE) systems (Angeli et al., 2015). We then construct the Action Graph $G_A = (V_A, E_A)$ from the extracted triples by taking arguments ("WHO", "DOING", or "WHAT") as nodes in $V_A$, and connecting them with edge $E_A[i][j] = 1$ if they are adjacent in one "WHO-DOING-WHAT" triple.
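The last step, turning extracted triples into $(V_A, E_A)$, can be sketched as follows. This is an illustrative sketch and not the released implementation: it assumes that arguments with identical surface strings share one node and that edges are undirected, and the second triple is a made-up example.

```python
def build_action_graph(triples):
    """Build (V_A, E_A) from WHO-DOING-WHAT triples.

    Nodes are the distinct arguments. E_A[i][j] = 1 when two arguments are
    adjacent inside a triple, i.e. WHO-DOING or DOING-WHAT.
    """
    nodes = []
    index = {}
    for triple in triples:
        for arg in triple:
            if arg not in index:  # assumption: identical strings share a node
                index[arg] = len(nodes)
                nodes.append(arg)
    n = len(nodes)
    E_A = [[0] * n for _ in range(n)]
    for who, doing, what in triples:
        for a, b in ((who, doing), (doing, what)):
            i, j = index[a], index[b]
            E_A[i][j] = 1
            E_A[j][i] = 1  # assumption: edges are undirected
    return nodes, E_A


# The first triple follows the Figure 1 example; the second is hypothetical.
triples = [("Simon", "fetch", "tissues"), ("Helen", "call", "Simon")]
V_A, E_A = build_action_graph(triples)
```

Note that "WHO" and "WHAT" are not linked directly, so `Simon` and `tissues` are connected only through `fetch`.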

Encoder
Given a conversation and its corresponding discourse relation graph and action graph, we utilize an utterance encoder and two graph encoders, to obtain its hidden representations shown in Figure 2(a).

Utterance Encoder
We initialize our utterance encoder $F_U(\cdot)$ with a pre-trained encoder, i.e., BART-base (Lewis et al., 2020), and encode tokens $\{x_{i,0}, ..., x_{i,l}\}$ in an utterance $u_i$ into its hidden representation:
$$\{h^U_{i,0}, ..., h^U_{i,l}\} = F_U(\{x_{i,0}, ..., x_{i,l}\})$$
Here we add a special token $x_{i,0} = \texttt{<S>}$ at the beginning of each utterance to represent it.

Graph Encoder
Node Initialization For the discourse relation graph, we employ the output embeddings of the special tokens $x_{i,0}$ from the utterance encoder, i.e., $h^U_{i,0}$, to initialize the $i$-th node $v^D_i$ in $G_D$. We use a one-hot embedding layer to encode the relations $E_D[i][j] = e^D_{i,j}$ between utterances $i$ and $j$. For the action graph, we first utilize $F_U(\cdot)$ to encode each token in nodes $v^A_i$ and then average their output embeddings as their initial representations.
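Structured Graph Attention Network Based on Graph Attention Network (Veličković et al., 2018), we utilize these relations between nodes to encode each node in the graphs. Through two graph encoders $F_D(\cdot, \cdot)$ and $F_A(\cdot, \cdot)$, we then obtain the hidden representations of these nodes as:
$$\{h^D_0, ..., h^D_m\} = F_D(\{v^D_0, ..., v^D_m\}, E_D)$$
$$\{h^A_0, ..., h^A_n\} = F_A(\{v^A_0, ..., v^A_n\}, E_A)$$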
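For intuition about the attention update, here is a minimal single-head sketch of the standard graph attention layer of Veličković et al. (2018). It is not necessarily the structured variant used in this work: the relation embeddings are left out, the linear projection is taken to be the identity, and all names and toy values are assumptions.

```python
import math


def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x


def gat_layer(h, adj, a_src, a_dst):
    """One single-head graph attention step.

    h:            list of node vectors
    adj:          adj[i][j] is truthy when node j is a neighbour of node i
    a_src, a_dst: the two halves of the attention vector
    The projection matrix W is assumed to be the identity to keep this short.
    """
    out = []
    for i, h_i in enumerate(h):
        # Each node attends to its neighbours and to itself (self-loop).
        neighbours = [j for j in range(len(h)) if adj[i][j] or j == i]
        scores = [
            leaky_relu(
                sum(a * x for a, x in zip(a_src, h_i))
                + sum(a * x for a, x in zip(a_dst, h[j]))
            )
            for j in neighbours
        ]
        # Numerically stable softmax over the neighbourhood.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        out.append(
            [
                sum(w * h[j][d] for w, j in zip(alpha, neighbours))
                for d in range(len(h_i))
            ]
        )
    return out


h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
# With a zero attention vector every neighbour gets the same weight,
# so each node becomes the mean of its neighbourhood.
new_h = gat_layer(h, adj, a_src=[0.0, 0.0], a_dst=[0.0, 0.0])
```

In a trained model the attention vector is learned, so more relevant neighbours receive larger weights than in this uniform toy case.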

Multi-Granularity Decoder
Different levels of encoded representations are then aggregated via our multi-granularity decoder to generate summaries, as shown in Figure 2(b). With $s-1$ previously generated tokens $y_1, ..., y_{s-1}$, our decoder $G(\cdot)$ predicts the $s$-th token via:
$$P(y_s \mid y_{1:s-1}, C, G_D, G_A) = G\big(y_{1:s-1}, \{h^U_{i,0:l}\}, \{h^D_{0:m}\}, \{h^A_{0:n}\}\big)$$
To better incorporate the information in the constructed graphs, unlike the original pre-trained BART model (Lewis et al., 2020), we extend the BART transformer decoder with two extra cross attentions (Discourse Attention and Action Attention) added to each decoder layer, which attend to the encoded node representations in discourse relation graphs and action graphs.
In each decoder layer, after performing the original cross attentions over every token in utterances $\{h^U_{i,0:l}\}$ and getting the utterance-attended representation $x^U$, the multi-granularity decoder then conducts cross attentions over nodes $\{h^D_{0:m}\}$ and $\{h^A_{0:n}\}$ that are encoded from graph encoders in parallel, to obtain the discourse-attended representation $x^D$ and action-attended representation $x^A$. These two attended vectors are then combined into a structure-aware representation $x^S$ through a feed-forward network for further forward passing in the decoder.
To alleviate the negative impact of randomly initialized graph encoders and cross attentions over graphs on pre-trained BART decoders at early stages and accelerate the learning of newly introduced modules during training, we apply ReZero (Bachlechner et al., 2020) to the residual connection after attending to graphs in each decoder layer:
$$\tilde{x}^S = x^U + \alpha x^S$$
where $\alpha$ is one trainable parameter instead of a fixed value 1, which modulates updates from cross attentions over graphs.
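A minimal sketch of the combination step and the gated residual is given below. It assumes the residual takes the standard ReZero form $x^U + \alpha x^S$ and uses a single linear layer over the concatenation $[x^D; x^A]$ as a stand-in for the feed-forward network; the weights and vectors are toy values chosen for illustration.

```python
def feed_forward(x_d, x_a, weight, bias):
    """Linear layer over the concatenation [x_d; x_a] (stand-in for the FFN)."""
    concat = x_d + x_a
    return [
        sum(w * v for w, v in zip(row, concat)) + b
        for row, b in zip(weight, bias)
    ]


def rezero_residual(x_u, x_s, alpha):
    """Gated residual: x_u + alpha * x_s, with alpha a trainable scalar."""
    return [u + alpha * s for u, s in zip(x_u, x_s)]


x_u = [1.0, 2.0]   # utterance-attended representation
x_d = [0.5, 0.5]   # discourse-attended representation
x_a = [1.0, -1.0]  # action-attended representation

# Toy weights that simply add x_d and x_a dimension by dimension.
weight = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
bias = [0.0, 0.0]

x_s = feed_forward(x_d, x_a, weight, bias)
out_alpha_0 = rezero_residual(x_u, x_s, alpha=0.0)  # graphs are ignored
out_alpha_1 = rezero_residual(x_u, x_s, alpha=1.0)  # plain residual
```

With $\alpha = 0$ the layer reduces to the original BART decoder output, while $\alpha = 1$ recovers an ordinary residual connection; training moves $\alpha$ between and beyond these two cases.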
Training During training, we seek to minimize the cross entropy and use the teacher-forcing strategy (Bengio et al., 2015):
$$\mathcal{L} = -\sum_{s} \log P(y_s \mid y_{1:s-1}, C, G_D, G_A)$$

Experiments

Datasets
We trained and evaluated our models on a conversation summarization dataset, SAMSum (Gliwa et al., 2019), covering messenger-like conversations about daily topics, such as arranging meetings and discussing events. We also showed the generalizability of our models on the Argumentative Dialogue Summary Corpus (ADSC) (Misra et al., 2015), a debate summarization corpus. The data statistics of the two datasets are shown in Table 1, with the distributions of discourse relation types in the Appendix.

Baselines
We compare our methods with several baselines:
• BART (Lewis et al., 2020): We utilized BART, and separated utterances by a special token.
• Multi-View Seq2Seq (Chen and Yang, 2020) utilized topic and stage views on top of BART for summarizing conversations. Here we implemented it based on BART-base models.

Implementation Details
We used the BART-base model to initialize our sequence-to-sequence model for training in all experiments. For parameters in the original BART encoder/decoder, we followed the default settings and set the learning rate to 3e-5 with 120 warm-up steps. For graph encoders, we set the number of hidden dimensions to 768, the number of attention heads to 2, the number of layers to 2, and the dropout rate to 0.2. For graph cross attentions added to BART decoder layers, we set the number of attention heads to 2. The weights α in ReZero residual connections were initialized with 1. The learning rate for parameters in newly added modules was 3e-4 with 60 warm-up steps. All experiments were performed on GeForce RTX 2080Ti (11GB memory).
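The two learning-rate settings can be pictured as two parameter groups with separate warm-up schedules. The sketch below is illustrative only: the linear warm-up shape, the constant rate afterwards, and the group names are assumptions, since the text only gives the peak rates and the warm-up lengths.

```python
def warmup_lr(step, base_lr, warmup_steps):
    """Linear warm-up to base_lr, then constant.

    Any decay after warm-up is not modelled here (assumption).
    """
    if step >= warmup_steps:
        return base_lr
    return base_lr * step / warmup_steps


# Values from the text; the group names are hypothetical.
PARAM_GROUPS = {
    "bart": {"lr": 3e-5, "warmup_steps": 120},
    "graph_modules": {"lr": 3e-4, "warmup_steps": 60},
}


def lrs_at(step):
    """Learning rate of every parameter group at a given training step."""
    return {
        name: warmup_lr(step, group["lr"], group["warmup_steps"])
        for name, group in PARAM_GROUPS.items()
    }


halfway = lrs_at(60)
```

At step 60 the newly added modules have already reached their full rate, while the pre-trained BART parameters are still halfway through warm-up, so the new modules are updated more aggressively early on.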

Results on In-Domain Corpus
Automatic Evaluation We evaluated all the models with the widely used automatic metric, ROUGE scores (Lin and Och, 2004), and reported ROUGE-1, ROUGE-2, and ROUGE-L in Table 2. We found that, compared to simple sequence-to-sequence models (Pointer Generator and Transformer), incorporating extra information such as commonsense knowledge from ConceptNet (D-HGN) increased the ROUGE metrics. When equipped with pre-trained models and simple conversation structures such as topics and conversation stages, Multi-View Seq2Seq boosted ROUGE scores. Incorporating discourse relation graphs or action graphs improved summarization performance, suggesting the effectiveness of explicitly modeling relations between utterances and the associations between speakers and actions within utterances. Combining the two different structured graphs produced better ROUGE scores compared to previous state-of-the-art methods and our base models, with an increase of 2.0% on ROUGE-1, 4.3% on ROUGE-2, and 1.2% on ROUGE-L compared to our base model, BART. This indicates that our structure-aware models with discourse and action graphs could help abstractive conversation summarization, and that these two graphs complemented each other in generating better summaries.
Table 4: Human evaluation on Factualness, Succinctness, Informativeness. All model variants of S-BART received significantly higher ratings than BART (Student's t-test, p < 0.05).
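To make the metric concrete, here is a simplified ROUGE-1 F1 computed as clipped unigram overlap. It is a sketch only, without the stemming and tokenization of the official ROUGE package, so its values will not match the reported scores; the two candidate summaries are made up, based on the Figure 1 example.

```python
from collections import Counter


def rouge_1_f1(reference, candidate):
    """Unigram-overlap F1 between a reference and a candidate summary."""
    ref = Counter(reference.lower().split())
    cand = Counter(candidate.lower().split())
    overlap = sum((ref & cand).values())  # clipped unigram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


reference = "Simon will fetch Helen some tissues"
right_actor = "Simon will fetch tissues"
wrong_actor = "Helen will fetch tissues"

score_right = rouge_1_f1(reference, right_actor)
score_wrong = rouge_1_f1(reference, wrong_actor)
```

Both candidates receive the same score here, although only the first assigns the action to the right person: unigram overlap by itself does not separate the two.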

Human Evaluation
We conducted human evaluation to qualitatively evaluate the generated summaries. Specifically, we asked annotators from Amazon Mechanical Turk to score a set of 100 randomly sampled summaries from the ground truth, BART, and our structured models, using a Likert scale from 1 (worst) to 5 (best) in terms of factualness (e.g., associates actions with the right actors), succinctness (e.g., does not contain redundant information), and informativeness (e.g., covers the most important content) (Feng et al., 2020). To increase annotation quality, we required turkers to have a 98% approval rate and at least 10,000 approved tasks for their previous work. Each summary was rated by three workers, and the scores for each summary were averaged. The Intra-Class Correlation was 0.543, showing moderate agreement (Koo and Li, 2016).
As shown in Table 4, S-BART, which utilized structured information from discourse relation graphs and action graphs, generated significantly better summaries with respect to factualness, succinctness, and informativeness. This might be because the incorporation of structured information such as discourse relations helped S-BART recognize the salient parts of conversations, and thus improved succinctness and informativeness over BART. Modeling the connections between speakers and actions greatly helped generate more factual summaries than the baselines, e.g., with an increase of 0.27 from BART to S-BART w. Action.

Results on Out-Of-Domain Corpus
To investigate the generalizability of our structure-aware models, we then tested the S-BART model trained on the SAMSum corpus directly on the debate summarization domain (ADSC Corpus (Misra et al., 2015)) in a zero-shot setting. Besides the differences in topics, utterances in debate conversations were generally longer and included more action triples (37.20 vs 6.81 as shown in Table 1) and fewer participants. The distribution of discourse relation types also differed a lot across different domains (e.g., more Contrast in debates (19.5%) than in daily conversations (1.0%)).
Table 6: ROUGE-1, ROUGE-2 and ROUGE-L scores of S-BART models using different ways to combine discourse relation graphs and action graphs. Results are averaged over three random runs.
As shown in Table 3, our single graph models S-BART w. Discourse and S-BART w. Action boosted ROUGE scores compared to BART, suggesting that utilizing structures can also increase the generalizability of conversation summarization methods. However, contrary to the in-domain results in Table 2, action graphs led to much larger gains than discourse graphs. This indicated that when the domain shifts, action triples were most robust in zero-shot setups, whereas differences in discourse relation distributions could limit such generalization. Consistent with in-domain scenarios, our S-BART w. Discourse&Action achieved better results, with an increase of 66.2% on ROUGE-1, 373.4% on ROUGE-2, and 82.2% on ROUGE-L over BART.

Ablation Studies
This part conducted ablation studies to show the effectiveness of structured graphs in our S-BART.

The Quality of Discourse Relation Graphs
We showed how the quality of discourse relation graphs affected the performance of conversation summarization in Table 5. Specifically, we compared the ROUGE scores of S-BART using our constructed discourse relation graphs (S-BART w. Discourse Graph) and S-BART using randomly generated discourse relation graphs (S-BART w. Random Graph), where both connections between nodes and relation types were randomized. The number of edges in the two graphs was kept the same. We found that S-BART with our discourse graphs outperformed models with random graphs, indicating the effectiveness of the constructed discourse relation graphs and the importance of their quality.
Figure 3: Averaged α over decoder layers in the trained S-BART models using different graphs.
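The random-graph control can be sketched as follows. Only two properties come from the text: the edge count matches the parsed graph, and both endpoints and relation types are randomized. The sampling scheme, the exclusion of self-links, the function name, and the relation subset are assumptions.

```python
import random


def random_discourse_graph(num_utterances, num_edges, relation_types, seed=0):
    """Sample num_edges distinct links with random relation labels."""
    rng = random.Random(seed)
    # Assumption: any ordered pair of distinct utterances may be linked.
    pairs = [
        (i, j)
        for i in range(num_utterances)
        for j in range(num_utterances)
        if i != j
    ]
    if num_edges > len(pairs):
        raise ValueError("more edges requested than possible links")
    chosen = rng.sample(pairs, num_edges)
    return [(i, j, rng.choice(relation_types)) for i, j in chosen]


# Hypothetical subset of the relation inventory.
relations = ["Comment", "Contrast", "Explanation"]
random_links = random_discourse_graph(5, 4, relations, seed=1)
```

Keeping the number of edges fixed means any performance gap reflects which utterances are linked and how they are labelled, rather than graph density.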

Different Ways to Combine Graphs
We experimented with different ways to combine discourse relation graphs and action graphs in our S-BART w. Discourse & Action, and presented the results in Table 6. Here, the parallel strategy performed cross attentions on different graphs separately and then combined the attended results with feed-forward networks as discussed in Section 3.3; the sequential strategy performed cross attentions on the two graphs in a specific order (from discourse relation graphs to action graphs, or vice versa). We found that the parallel strategy showed better performance and the sequential ones did not introduce gains compared to S-BART with single graphs. This demonstrates that discourse relation graphs and action graphs were both important and provided different signals for abstractive conversation summarization.
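The difference between the two strategies can be illustrated with a toy single-head dot-product attention. This is a sketch under simplifying assumptions: no learned projections, no residual connections, and an averaging function standing in for the feed-forward combination; all vectors are made up.

```python
import math


def attend(query, nodes):
    """Single-head dot-product attention of one query over node vectors."""
    scores = [sum(q * v for q, v in zip(query, node)) for node in nodes]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * node[d] for w, node in zip(weights, nodes))
        for d in range(len(query))
    ]


def parallel(x_u, discourse_nodes, action_nodes, combine):
    """Attend to both graphs from the same query, then combine the results."""
    x_d = attend(x_u, discourse_nodes)
    x_a = attend(x_u, action_nodes)
    return combine(x_d, x_a)


def sequential(x_u, first_nodes, second_nodes):
    """Attend to one graph, then use that output to query the other graph."""
    return attend(attend(x_u, first_nodes), second_nodes)


def average(x_d, x_a):
    return [(d + a) / 2 for d, a in zip(x_d, x_a)]


x_u = [1.0, 1.0]
discourse_nodes = [[1.0, 0.0]]
action_nodes = [[0.0, 1.0]]

out_parallel = parallel(x_u, discourse_nodes, action_nodes, average)
out_d_then_a = sequential(x_u, discourse_nodes, action_nodes)
out_a_then_d = sequential(x_u, action_nodes, discourse_nodes)
```

In this toy case the parallel output retains information from both graphs, whereas each sequential output is determined by whichever graph is attended last.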
Visualizing ReZero Weights We further tested our structure-aware BART with two ReZero settings: (i) initializing α from 0, (ii) initializing α from 1, and found that initializing α from 1 brought more performance gains (see Appendix). We then visualized the average α over different decoder layers after training in Figure 3, and observed that (i) when α was initialized with 1, the final α was much larger than in the setting where α was initialized with 0, which might be because randomly initialized modules barely received supervision at early stages and therefore contributed less to BART. (ii) Compared to discourse graphs, action graphs received higher α weights after training in both initialization settings, suggesting that the information from structured action graphs might be harder for the end-to-end BART models to capture. (iii) Utilizing both graphs simultaneously led to higher ReZero weights, further validating the effectiveness of combining discourse relation graphs and action graphs and their complementary properties.

Error Analyses
To inspect when our summarization models could help conversation summarization, we visualized the average number of discourse edges and the average number of action triples in three sets of conversations in Table 7. When the structures in conversations were simpler (fewer discourse edges and fewer action triples than the average), BART showed similar performance to S-BART. As the structures of conversations became more complex, with more discourse relations and more action mentions, S-BART outperformed BART as it explicitly incorporated these structured graphs. However, both BART and S-BART struggled when there were many more interactions beyond certain thresholds, calling for better mechanisms to model structures in conversations for generating better summaries.

Conclusion
In this work, we introduced a structure-aware sequence-to-sequence model for abstractive conversation summarization by incorporating discourse relations between utterances, and the connections between speakers and actions within utterances. Experiments and ablation studies on the SAMSum corpus showed the effectiveness of these structured graphs in aiding the task of conversation summarization via both quantitative and qualitative evaluation metrics. Results in zero-shot settings on the ADSC Corpus further demonstrated the generalizability of our structure-aware models. In the future, we plan to extend our current conversation summarization models to various application domains such as emails, debates, and podcasts, and to conversations that might involve longer utterances and more participants in an unsynchronized way.

(Asher et al., 2016) with default settings to get the link prediction and relation classification models to label discourse relations in the SAMSum and ADSC corpora. The distribution of the relation types in the two datasets is shown in Table 9. The major discourse relations in daily conversations are Comment, Clarification and QA pairs, while the main discourse relations in debate are Comment, Contrast, Clarification and QA pairs.

B Impact of Different ReZero Weight Initializations
We tested our structure-aware BART (S-BART w. Discourse/Action) with two ReZero settings: (i) initializing α from 0, (ii) initializing α from 1. The results are shown in Table 8. S-BART with 1 as the initialized ReZero weight outperformed