Multi-View Sequence-to-Sequence Models with Conversational Structure for Abstractive Dialogue Summarization

Text summarization is one of the most challenging and interesting problems in NLP. Although much attention has been paid to summarizing structured text like news reports or encyclopedia articles, summarizing conversations---an essential part of human-human/machine interaction where the most important pieces of information are scattered across various utterances of different speakers---remains relatively under-investigated. This work proposes a multi-view sequence-to-sequence model that first extracts conversational structures of unstructured daily chats from different views to represent conversations, and then utilizes a multi-view decoder to incorporate the different views when generating dialogue summaries. Experiments on a large-scale dialogue summarization corpus demonstrate that our methods significantly outperform previous state-of-the-art models under both automatic evaluation and human judgment. We also discuss specific challenges that current approaches face in this task. We have publicly released our code at https://github.com/GT-SALT/Multi-View-Seq2Seq.


Introduction
We live in an information age where communication between humans and between humans and machines is growing exponentially in the form of textual dialogues between users and between users and agents (Kester, 2004). Reviewing all prior content before resuming a conversation is challenging and time-consuming, especially when the chat history becomes very long (Gao et al., 2020). How to process and organize these interactions into concise and structured form, i.e., conversation summarization, thus becomes technically and socially important.
Most existing research on text summarization has focused on single-speaker documents like news reports (Nallapati et al., 2016; See et al., 2017), scientific publications (Nikolov et al., 2018) or encyclopedia articles (Liu* et al., 2018), where structured text is used to elaborate a core idea from a third-person point of view and the information flow is clear through paragraphs or sections. Different from these structured documents, conversations are often informal, verbose and repetitive, sprinkled with false starts, backchanneling, reconfirmations, hesitations, and speaker interruptions (Sacks et al., 1978), and the salient information is scattered across the whole chat, making it hard for current summarization models to focus on the informative utterances. Take the conversation in Table 1 as an example: multiple turns, informal words, abbreviations, and emoticons all introduce new challenges for summarization. This calls for the design and development of new methods for dialogue summarization instead of directly applying current document summarization models.
There has been some recent research on conversation summarization, such as directly deploying existing document summarization models (Gliwa et al., 2019) and exploring multi-sentence compression (Shang et al., 2018). However, most of this work has not utilized specific conversational structures in dialogues, i.e., the way utterances are organized to make a conversation meaningful, enjoyable, and understandable (Sacks et al., 1978), which is a key factor that differentiates dialogues from structured documents. As a way of using language socially, of "doing things with words" together with other persons, conversation has its own dynamic structures that organize utterances in certain orders (Sacks et al., 1978). Although there are a few exceptions, such as utilizing topic segmentation (Liu et al., 2019b), dialogue acts (Goo and Chen, 2018) or key point sequences (Liu et al., 2019a), they either need extensive expert annotations of discourse acts (Goo and Chen, 2018; Liu et al., 2019a) or only encode conversations based on their topics (Liu et al., 2019b), which fails to capture the rich conversational structures in dialogues.
Even a single conversation can be viewed from different perspectives, resulting in multiple conversational or discourse patterns. For instance, in Table 1, based on what topics were discussed (topic view) (Galley et al., 2003; Liu et al., 2019b), the conversation can be segmented into greetings, today's plan, plan for tomorrow, plan for Saturday, and pick-up time; from a conversation-progression perspective (stage view) (Ritter et al., 2010; Paul, 2012; Althoff et al., 2016), the same dialogue can be categorized into openings, intention, discussion, and conclusion. From a coarser perspective (global view), a conversation can be treated as a whole, or each utterance can serve as one segment (discrete view). Models that only utilize a fixed topic view of the conversation (Joty et al., 2010; Liu et al., 2019b) may fail to capture its comprehensive and nuanced conversational structures, and any information loss introduced by the conversation encoder may lead to larger error cascades in the decoding stage. To fill these gaps, we propose to combine these multiple, diverse views of conversations in order to generate more precise summaries.
To sum up, our contributions are: (1) we propose to utilize rich conversational structures, i.e., structured views (topic view and stage view) and generic views (global view and discrete view), for abstractive conversation summarization; (2) we design a multi-view sequence-to-sequence model that consists of a conversation encoder to encode the different views and a multi-view decoder with multi-view attention to generate dialogue summaries; (3) we perform experiments on a large-scale conversation summarization dataset, SAMSum (Gliwa et al., 2019), and demonstrate the effectiveness of our proposed methods; (4) we conduct thorough error analyses and discuss specific challenges that current approaches face in this task.

Related Work
Document Summarization Document summarization has received extensive research attention, especially for abstractive summarization. For instance, Rush et al. (2015) introduced sequence-to-sequence models for abstractive text summarization. See et al. (2017) proposed a pointer-generator network to allow copying words from the source text, handling the OOV issue and avoiding repeated content. Paulus et al. (2018) and Chen and Bansal (2018) further utilized reinforcement learning to select the content needed for summarization. Large-scale pre-trained language models (Liu and Lapata, 2019; Raffel et al., 2019; Lewis et al., 2019) have also been introduced to further improve summarization performance. Another line of work explored long-document summarization by utilizing discourse structures in text (Cohan et al., 2018), introducing hierarchical models (Fabbri et al., 2019) or modifying attention mechanisms (Beltagy et al., 2020). There are also recent studies on faithfulness in document summarization (Cao et al., 2018; Zhu et al., 2020a), aiming to enhance the information consistency between summaries and the input.

[Figure 1: Model architecture. Different views of a conversation are first extracted automatically, then encoded through the conversation encoder (a) and combined in the multi-view decoder to generate summaries (b). In the conversation encoder, each view (consisting of blocks) is encoded separately, and the block representations S_j are passed through an LSTM to represent the view. In the multi-view decoder, the model decides attention weights over the different views and then attends to each token in the different views through the multi-view attention.]
Dialogue Summarization When it comes to the summarization of dialogues, Shang et al. (2018) proposed a simple multi-sentence compression technique to summarize meetings. Zhao et al. (2019) and Zhu et al. (2020b) introduced turn-based hierarchical models that first encode each turn of utterance and then use the aggregated representation to generate summaries. A few studies have also utilized conversational analysis for generating dialogue summaries, such as leveraging dialogue acts (Goo and Chen, 2018), key point sequences (Liu et al., 2019a) or topics (Liu et al., 2019b). However, they either needed a large amount of human annotation for dialogue acts, key points or visual focus (Goo and Chen, 2018; Liu et al., 2019a), or only utilized topical information in conversations (Liu et al., 2019b).
These prior works also largely ignored the diverse conversational structures in dialogues, for instance, reply relations among participants (Mayfield et al., 2012; Zhu et al., 2019), dialogue acts (Ritter et al., 2010; Paul, 2012), and conversation stages (Althoff et al., 2016). Models that only utilize a fixed topic view of the conversation (Galley et al., 2003; Joty et al., 2010) may fail to capture its comprehensive and nuanced conversational structures, and any information loss introduced by the conversation encoder may lead to larger error cascades in the decoding stage. To fill these gaps, we propose to leverage diverse conversational structures including topic segments, conversational stages, the dialogue overview, and individual utterances to design a multi-view model for dialogue summarization.

Method
Conversations can be interpreted from different views, and every single view enables the model to focus on a specific aspect of the conversation. To take advantage of these rich conversation views, we design a Multi-view Sequence-to-Sequence Model (see Figure 1) that first extracts different views of conversations (Section 3.1) and then encodes them to generate summaries (Section 3.2).

Conversation View Extraction
Conversation summarization models may easily stray among all sorts of information across various speakers and utterances, especially when conversations become long. Naturally, if informative structures in the form of small blocks can be explicitly extracted from long conversations, models may be able to understand them in a more organized way. Thus, we first extract different views of structures from conversations.
Topic View Although conversations are often less structured than documents, they are mostly organized around topics in a coarse-grained structure (Honneth et al., 1988). For instance, a telephone chat could follow the pattern "greetings → invitation → party details → rejection" from a topical perspective. Such an explicit view and topic flow could help models interpret conversations more precisely and generate summaries that cover the important topics. Here we combine the classic topic segmentation algorithm C99 (Choi, 2000), which segments conversations based on inter-sentence similarities, with recent sentence representations from Sentence-BERT (Reimers and Gurevych, 2019) to extract the topic view. Specifically, each utterance $u_i$ in a conversation $C = \{u_1, u_2, ..., u_m\}$ is first encoded into a hidden vector via Sentence-BERT. Then the conversation $C$ is divided into blocks $C^{topic} = \{b_1, ..., b_n\}$ through C99, where $b_i$ is one block that contains several consecutive utterances, such as the topic view shown in Table 1.
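As an illustration of this extraction step, the sketch below segments a conversation by thresholding the cosine similarity of adjacent utterance embeddings. This is a deliberate simplification of C99 (which rank-transforms the full similarity matrix and clusters divisively), and random toy vectors stand in for Sentence-BERT embeddings; the function name and threshold are ours, not the paper's.

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def segment_topics(embeddings, threshold=0.3):
    """Place a topic boundary wherever the similarity between adjacent
    utterance embeddings drops below `threshold`; returns blocks of
    consecutive utterance indices (a simplified stand-in for C99)."""
    blocks, current = [], [0]
    for i in range(1, len(embeddings)):
        if cosine(embeddings[i - 1], embeddings[i]) < threshold:
            blocks.append(current)
            current = []
        current.append(i)
    blocks.append(current)
    return blocks

# Toy conversation: utterances 0-1 share one topic, 2-3 another.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
print(segment_topics(emb))  # → [[0, 1], [2, 3]]
```

In the full pipeline, `embeddings` would come from Sentence-BERT, and the resulting blocks form the topic view $C^{topic}$.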
Stage View As a way of doing things with words socially together with other people, conversation organizes utterances in certain orders to make it meaningful, enjoyable, and understandable. (Sacks et al., 1978;Althoff et al., 2016) For example, counseling conversations are found to follow a common pattern of "introductions → problem exploration → problem solving → wrap up" (Althoff et al., 2016). Such conversation stage view provides high-level sketches about the functions or goals of different parts in conversations, which could help models focus on the stages with key information.
We follow Althoff et al. (2016) to extract stages through a Hidden Markov Model (HMM). We impose a fixed ordering on the stages and only allow transitions from the current stage to the next one. The observations in the HMM are the encoded representations $h_i$ from Sentence-BERT, and we set the number of hidden stages to 4. Similar to the topic view extraction, we segment the conversations into blocks $C^{stage} = \{b_1, ..., b_n\}$, where $b_i$ is one block that contains several consecutive utterances. We interpret the inferred stages qualitatively and further visualize the top 6 frequent words appearing in each stage in Table 2. We found that conversations around daily chats usually start with openings, introduce the goals/focus of the conversation followed by discussions of the details, and finally conclude with certain endings. Table 1 shows an example of the stage view.
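Under the stated monotonicity constraint (a stage can only persist or advance to the next one), stage inference reduces to a Viterbi decoding. The sketch below is an illustrative reimplementation, not the authors' code: emission log-likelihoods are supplied directly rather than estimated from Sentence-BERT representations, and `monotonic_stage_viterbi` is a helper name we introduce.

```python
import numpy as np

def monotonic_stage_viterbi(log_lik):
    """Most likely stage sequence for T utterances given emission
    log-likelihoods `log_lik` (T x K), with transitions restricted to
    stay-or-advance (stage j -> j or j+1) and a forced start in stage 0,
    following the ordered-stage HMM of Althoff et al. (2016)."""
    T, K = log_lik.shape
    dp = np.full((T, K), -np.inf)
    back = np.zeros((T, K), dtype=int)
    dp[0, 0] = log_lik[0, 0]
    for t in range(1, T):
        for j in range(K):
            prev, src = dp[t - 1, j], j            # stay in stage j
            if j > 0 and dp[t - 1, j - 1] > prev:  # or advance from j-1
                prev, src = dp[t - 1, j - 1], j - 1
            dp[t, j] = prev + log_lik[t, j]
            back[t, j] = src
    path = [int(np.argmax(dp[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return path[::-1]

# Five utterances, four stages; each row favours one stage.
ll = np.log(np.array([[.90, .05, .03, .02],
                      [.80, .10, .05, .05],
                      [.10, .80, .05, .05],
                      [.05, .10, .80, .05],
                      [.02, .03, .05, .90]]))
print(monotonic_stage_viterbi(ll))  # → [0, 0, 1, 2, 3]
```

Consecutive utterances assigned the same stage then form the blocks of the stage view $C^{stage}$.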
Global View and Discrete View In addition to the two structured views above, conversations can also be viewed from a relatively coarse perspective, i.e., a global view that concatenates all utterances into one giant block (Gliwa et al., 2019), and a discrete view that separates each utterance into a distinct block (Gliwa et al., 2019).

Multi-view Sequence-to-Sequence Model
We extend generic sequence-to-sequence models to encode and combine different conversation views.
To better utilize the semantic information in recent pre-trained models, we implement our base encoders and decoders with a transformer-based pre-trained model, BART (Lewis et al., 2019). Note that our multi-view sequence-to-sequence model is agnostic to the pre-trained model with which it is initialized.
Conversation Encoder Given a conversation under a specific view $k$ with $n$ blocks $C^k = \{b^k_1, ..., b^k_n\}$, each block $b^k_j = \{x^k_{0,j}, x^k_{1,j}, ..., x^k_{m,j}\}$ is first encoded through the conversation encoder $E$, e.g., the BART encoder as shown in Figure 1(a), into hidden representations:

$$h^k_{0,j}, ..., h^k_{m,j} = E(x^k_{0,j}, ..., x^k_{m,j}) \quad (1)$$

Note that we add a special token $x^k_{0,j}$ at the beginning of each block and use this token's representation to describe the block, i.e., $S^k_j = h^k_{0,j}$. To depict different views using hidden vectors, we aggregate the information from all blocks in one conversation through LSTM layers (Hochreiter and Schmidhuber, 1997):

$$\bar{S}^k_1, ..., \bar{S}^k_n = \mathrm{LSTM}(S^k_1, ..., S^k_n) \quad (2)$$

We use the last hidden state $\bar{S}^k_n$ to represent the current view $k$, denoted as $V^k$.
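The block-to-view aggregation can be sketched in plain numpy, assuming the block vectors $S^k_j$ are already available (in the paper these are the BART encodings of each block's leading special token; here both they and the LSTM weights are random stand-ins rather than learned parameters):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_state(S, Wx, Wh, b):
    """Run one LSTM layer over block vectors S (n_blocks x d) and
    return the final hidden state, i.e. the view representation V^k.
    Gate weights are stacked in [input; forget; cell; output] order."""
    d = S.shape[1]
    h, c = np.zeros(d), np.zeros(d)
    for s in S:
        z = Wx @ s + Wh @ h + b
        i, f = sigmoid(z[:d]), sigmoid(z[d:2 * d])
        g, o = np.tanh(z[2 * d:3 * d]), sigmoid(z[3 * d:])
        c = f * c + i * g          # cell state update
        h = o * np.tanh(c)         # hidden state update
    return h

rng = np.random.default_rng(0)
d, n_blocks = 8, 3
S = rng.normal(size=(n_blocks, d))        # stand-in block encodings S^k_j
V = lstm_last_state(S, rng.normal(size=(4 * d, d)),
                    rng.normal(size=(4 * d, d)), np.zeros(4 * d))
print(V.shape)  # → (8,)
```

One such vector $V^k$ is produced per view, and these are what the multi-view decoder attends over.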
Multi-view Decoder Different views could provide different types of conversational aspects for models to learn and further determine which set of utterances should deserve more attention in order to generate better dialogue summaries. As a result, the ability to strategically combine different views is essential. To this end, we propose a transformer based multi-view decoder to integrate encoded representations from different views and generate summaries as shown in Figure 1(b).
The input to the decoder contains the $l-1$ previously generated tokens $t_1, ..., t_{l-1}$. Via our multi-view decoder $D$, the $l$-th token is predicted by:

$$p(t_l \mid t_{<l}, C) = \mathrm{softmax}(W_p \, o_l) \quad (3)$$

where $o_l$ is the decoder's output state at step $l$ and $W_p$ is a parameter to be learned.
Different from a generic transformer decoder, we introduce a multi-view attention layer in each transformer block. The multi-view attention layer first decides the importance $\alpha_k$ of each view $V^k$ through:

$$\alpha_k = \mathrm{softmax}_k\left(v^\top \tanh(W V^k + b)\right) \quad (4)$$

where $v$ is a randomly initialized context vector, and $W$ and $b$ are parameters. To avoid the attention weights being too similar to each other, as the views are encoded from similar context, we utilize a sharpening function with temperature $T$ over $\alpha_k$:

$$\hat{\alpha}_k = \frac{\alpha_k^{1/T}}{\sum_{k'} \alpha_{k'}^{1/T}} \quad (5)$$

When $T \to 0$, the attention weights behave like a one-hot vector.
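The two steps above amount to a softmax over view scores followed by a temperature-powered renormalization; a minimal numpy sketch (parameter shapes and random initialization are illustrative, not the trained model's):

```python
import numpy as np

def view_attention(V, v, W, b, T=0.2):
    """Sharpened attention weights over view vectors V (K x d):
    alpha_k = softmax(v . tanh(W V_k + b)), then alpha^(1/T)
    renormalized; T < 1 pushes the weights toward one-hot."""
    scores = np.tanh(V @ W.T + b) @ v
    alpha = np.exp(scores - scores.max())
    alpha /= alpha.sum()                 # softmax over views
    sharp = alpha ** (1.0 / T)           # temperature sharpening
    return sharp / sharp.sum()

rng = np.random.default_rng(1)
K, d = 2, 4                              # e.g. topic view and stage view
V = rng.normal(size=(K, d))
w = view_attention(V, rng.normal(size=d), rng.normal(size=(d, d)),
                   np.zeros(d), T=0.2)
print(w.sum())  # weights still sum to 1, but are more peaked than softmax
```

With the paper's setting T = 0.2, each score gap is effectively raised to the fifth power, so the dominant view receives most of the mass while both views still contribute.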
Then multi-head attention is performed over the conversation tokens $h^k_{i,j}$ of each view $k$ to form $A^k$ separately. The attended results are combined based on the view-attention weights $\hat{\alpha}_k$ and passed forward:

$$A = \sum_k \hat{\alpha}_k A^k \quad (6)$$

Training We minimize the cross-entropy loss during training:

$$\mathcal{L} = -\sum_l \log p(t_l \mid t_{<l}, C) \quad (7)$$

Specifically, we apply the teacher forcing strategy: at training time, the inputs are the previous tokens from the ground truth; at test time, the inputs are the previous tokens predicted by the decoder.
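The training objective is a standard token-level cross-entropy under teacher forcing; a toy numpy version (the logits here stand in for the decoder outputs $W_p o_l$ over a tiny vocabulary):

```python
import numpy as np

def xent_loss(logits, targets):
    """Teacher-forced cross-entropy: the decoder is fed the gold prefix
    t_1..t_{l-1} at each step, and the loss is -sum_l log p(t_l).
    `logits` is (L x vocab); `targets` holds the gold token ids."""
    shifted = logits - logits.max(axis=1, keepdims=True)   # for stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].sum()

# Near-perfect logits on a 3-word vocabulary give a near-zero loss.
logits = np.array([[10.0, 0.0, 0.0], [0.0, 10.0, 0.0]])
print(xent_loss(logits, [0, 1]))  # small positive value
```

At test time no gold prefix is available, so the decoder instead conditions on its own previous predictions (here beam search with size 4, per the model settings).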

Dataset and Baselines
We evaluate our model on a large-scale dialogue summarization dataset, SAMSum (Gliwa et al., 2019), which has 14,732 dialogues with human-written summaries. The data statistics are shown in Table 3. SAMSum contains messenger-like conversations about daily topics, such as chit-chat, arranging meetings, discussing events, etc. We compare our Multi-view Sequence-to-Sequence Model (Multi-view BART) with several baseline models:
• BART + Generic views (Lewis et al., 2019) utilizes BART, a denoising autoencoder for pre-training sequence-to-sequence models, together with the generic views (global view and discrete view). We used the BART-large model with its default settings 1.

Model Settings 2
We loaded the pre-trained "bert-base-nli-stsb-mean-tokens" 3 model for Sentence-BERT to get representations for each utterance. For extracting the topic view via C99, we set the window size to 4 and the std coefficient to 1. For extracting the stage view, we set the number of hidden states in the HMM to 4. These hyper-parameters were set with a grid search. For Multi-View BART, we experimented with different view combinations: (1) the best generic view (global view) was combined with each of the two structured views (stage and topic view) separately; (2) the two best structured views were also combined (topic + stage). The settings for the BART encoder/decoder were kept identical to the baselines. We used a one-layer LSTM for encoding sections. The learning rate for the section encoder and multi-view attention was set to 3e-3, the temperature T was 0.2, and the beam search size during inference for all models was 4.

Results
Quantitative Results We evaluated models with the standard ROUGE score (with stemming) (Lin and Och, 2004), and report ROUGE-1, ROUGE-2 and ROUGE-L 4. Results on the test set for the different models are shown in Table 4. Compared to Pointer Generator, using reinforcement learning to first select important sentences (Fast Abs RL Enhanced) slightly increased F scores. Adding pre-trained embeddings or extra document training data to lightweight convolution models (DynamicConv + GPT-2/News) led to even better ROUGE scores. When using the pre-trained transformer-based model BART with generic views, all ROUGE scores improved significantly, and BART + Global outperformed BART + Discrete, especially in terms of ROUGE-L F scores. Segmenting conversations into blocks from structured views (stage view and topic view) further boosted the performance, suggesting that our extracted conversation structures help conversation encoders capture nuanced and informative aspects of dialogues.

4 Here we followed BART and used https://github.com/pltrdy/rouge. Note that different tools may generate different ROUGE scores.
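For reference, the n-gram overlap behind ROUGE-N can be sketched as follows. This is an illustrative implementation, not the pltrdy/rouge package used in the paper; it omits stemming, so, as the footnote notes, scores will differ across tools.

```python
from collections import Counter

def rouge_n(candidate, reference, n=1):
    """ROUGE-N precision, recall and F1 over whitespace tokens."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    c, r = ngrams(candidate), ngrams(reference)
    overlap = sum((c & r).values())         # clipped n-gram matches
    p = overlap / max(sum(c.values()), 1)
    rec = overlap / max(sum(r.values()), 1)
    f1 = 2 * p * rec / (p + rec) if p + rec else 0.0
    return p, rec, f1

# All candidate unigrams appear in the reference: precision 1.0, recall 3/5.
print(rouge_n("amanda baked cookies", "amanda baked cookies for jerry"))
```

ROUGE-L, also reported in Table 4, replaces the n-gram overlap with the longest common subsequence between candidate and reference.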
We did not see any performance boost when combining the generic global view with either the topic or the conversational stage view, partially because the coarse granularity of the global view does not complement the structured views well. In contrast, utilizing both structured views (topic view + stage view) further increased ROUGE scores consistently, indicating the effectiveness of synthesizing the informative conversation blocks introduced by both views.
We visualized the attention weight distributions over the stage view and topic view in our best model (see Appendix) and found that the contributions of the topic view are slightly more prominent than those of the stage view. This also indicates that the two structured views can complement each other well despite sharing the same dialogue content. Note that the gains from Multi-view BART (Topic + Stage) come mainly from precision scores while recall scores remain comparable, suggesting that our proposed model produces fewer irrelevant tokens while preserving the necessary information in its generated summaries.

Impact of Participants and Turns
We visualized the impact of two essential components of conversations, the number of participants and the number of turns, on ROUGE scores for our best-performing model (Multi-view BART with topic view + stage view) in Figure 3. As the number of participants/turns increases, ROUGE scores decrease, indicating that conversation summarization becomes more difficult with more participants and more utterances.
Qualitative/Human Evaluation In addition to ROUGE scores, we conducted human annotations to evaluate the generated dialogue summaries. Similar to Gliwa et al. (2019), we asked human annotators on Amazon Mechanical Turk 5 to rate each summary (200 randomly sampled summaries in total) on the scale of {-2, 0, 2}, where -2 means a summary was poor, extracted irrelevant information, or did not make sense at all; 2 means it was understandable and gave a concise overview of the text; and 0 means the summary extracted only a part of the relevant information or made some mistakes. The score for each summary was averaged over three different annotators. The intra-class correlation was 0.583, indicating moderate agreement (Koo and Li, 2016). As shown in Figure 4, consistent with the ROUGE scores in Table 4, our multi-view model achieved the best human evaluation scores.

Model Analysis and Discussion
So far, we have achieved reasonable summarization performance. To further study why dialogue summarization is challenging and how future research could advance this direction, we take a closer look at this dialogue summarization dataset (SAMSum), our models' generation errors, and certain challenges that existing approaches struggle with.

Challenges in Dialog Summarization
We conducted a thorough examination of the challenges in conversation summarization and organized them into 7 categories:

1. Informal language use Many conversations, especially in online contexts such as Twitter/Reddit (Jackson and Moulinier, 2007), contain typos, word abbreviations, slang, or emoticons/emojis, making them hard to represent and summarize.

2. Multiple participants As shown in Figure 3, conversations with more speakers are harder to summarize, since models must accurately differentiate both the language styles and the content of different speakers, similar to the multiple-characters issue in story summarization.

3. Multiple turns Similar to long-document summarization (Xiao and Carenini, 2019), conversations with many utterances contain more information to process and are thus harder to summarize.

4. Referral and coreference People usually refer to each other, mention others' names, or use coreference in their messages, which introduces extra difficulty to dialogue summarization; this challenge also exists in reading comprehension and document summarization (Falke et al., 2017).

5. Repetition and interruption Information is generally scattered throughout the whole conversation, and speakers may interrupt each other, reconfirm, backchannel, or repeat themselves, a unique discourse challenge for dialogue summarization.

[Table 5: The breakdown of challenges in dialogue summarization based on our analyses of 100 sampled conversations, and the ROUGE scores per challenge.]
6. Negations and rhetorical questions As a long-standing problem in NLP, negation-related issues are even more frequent in conversations, as there are more question-answer exchanges between speakers.
7. Role and language change Conversations usually involve more than one speaker, and the role of a speaker may shift from questioner to answerer, requiring the summarization model to dynamically deal with speaker roles and the associated language (e.g., first-person pronouns).

We randomly sampled 100 examples 6 from our test set and classified them using the above challenge taxonomy. A conversation might have more than one category label, and if it had none of the aforementioned challenges, we labeled it as (0) Generic. Usually, the conversations marked as Generic were shorter or had a simpler structure. Table 5 presents the percentage of each type of challenge and the per-category performance of our best model (Multi-view BART with Topic view + Stage view). We observed that: (i) Referral & coreference (33%) and Role & language change (30%) were the two most frequent challenges faced in the dialogue summarization task; (ii) as expected, Generic conversations were relatively easier to summarize.
(iii) Our best model performed relatively worse when it came to Repetition & interruption, Multiple turns, and Referral & coreference, calling for more intelligent summarization methods to tackle these challenges. 6 The full analyzed set of examples is shown in the Appendix.

Error Analysis 7
We examined summaries generated by our best-performing model against the ground-truth summaries and observed several major error types: 1. Missing information: content mentioned in the references is missing from the generated summaries.
2. Redundancy: content that occurs in the generated summaries is not mentioned by the references.
3. Wrong references: generated summaries contain information that is not faithful to the original dialogue, associating one's actions/locations with the wrong speaker.
4. Incorrect reasoning: generated summaries reason about relations in dialogues incorrectly and thus come to wrong conclusions.
We annotated the same set of 100 randomly sampled summaries with the above error type taxonomy. A summary might have more than one category label, and we categorized a summary as (0) Other if it did not exhibit any of these error types. Table 6 presents the breakdown of error types and the per-category ROUGE scores. We found that: (i) missing information (37%) was the most frequent error type, indicating that current summarization models struggle with identifying key information; (ii) incorrect reasoning had a percentage of 24% with the worst ROUGE-2, and despite being a minor type (6%), improper gendered pronouns seemed to severely decrease both ROUGE-1 and ROUGE-2; (iii) the relatively low ROUGE scores associated with incorrect reasoning and wrong references call for summarization models that better handle faithfulness in dialogue summarization.

Relation between Challenges and Errors
To figure out the relations between challenges and the errors made by our models, i.e., how different types of errors correlate with different types of challenges, we visualized their co-occurrence heat map in Figure 5. We found that: (i) our model generated good summaries for generic, simple conversations; (ii) all kinds of challenges had high correlations with, or could lead to, the missing information error; (iii) wrong references were highly associated with referral & coreference, as expected, since coreferences in conversations naturally make it harder for models to associate the correct speakers with the correct actions; (iv) the high correlations between role & language change, referral & coreference, and incorrect reasoning indicated that interactions between multiple participants with frequent coreferences might easily lead current summarization models to reason incorrectly.

Conclusion
In this work, we proposed a multi-view sequence-to-sequence model that leverages both structured views (topic view and stage view) and generic views (global view and discrete view) of conversations, strategically combining them through multi-view attention to generate dialogue summaries. Experiments demonstrated the effectiveness of our proposed models in terms of both quantitative and qualitative evaluations. Via thorough error analyses, we identified a set of challenges that current models struggle with, which can further facilitate future research on conversation summarization. Due to the lack of annotations, we only adopted simple unsupervised segmentation methods to extract the different views. In the future, we plan to annotate some of the data, explore supervised segmentation models, and introduce more conversation structures such as dialogue acts (Oya and Carenini, 2014; Joty and Hoque, 2016) into abstractive dialogue summarization.

A Model Settings
We load the pre-trained "bert-base-nli-stsb-mean-tokens" 8 model for Sentence-BERT to get representations for each utterance. When extracting the topic view, we set the window size to 4 and the std coefficient to 1 in C99. When extracting the stage view, we set the number of hidden states in the HMM to 4. These hyper-parameters were set after a grid search, with human evaluation of randomly sampled segmentation results. BART + Structured views (stage and topic views) followed the same parameters as BART + Generic views. For Multi-View BART, we selected different views to combine: (1) generic view + structured view: the best generic view (global view) was combined with each of the two structured views (stage and topic view); (2) structured view + structured view: the two best single views were combined (topic + stage). The settings for the BART encoder/decoder were kept the same as the baselines. We used a one-layer LSTM for encoding sections. The learning rate for the section encoder and multi-view attention was set to 3e-3, the temperature T was 0.2, and the beam search size during inference for all models was 4.
Experiments were performed on two Tesla P100 (16GB memory).

B View Attention Visualization
We visualized the attention weight distributions over the stage view and topic view in our best multi-view model to explore the importance of stage versus topic in Figure 6. We found that the topic view was more prominent than the stage view, consistent with the relative performance of BART + topic view and BART + stage view. This indicates that discourse structure about topics might be more important, while both topic and stage improve conversation summarization. It also shows that the two structured views can complement each other well despite sharing the same dialogue content.

8 https://github.com/UKPLab/sentence-transformers
We display two examples in Table 8 with the gold references, each single view's generated summary, and the combined views' generated summary. The combined view balances the advantages of each single view and generates more precise summaries. The attention weights the model learned were also consistent with the single views' performances.

C Supplementary Examples for Model Analysis and Discussion
For the analysis in the Model Analysis and Discussion section of our paper, we randomly sampled 100 examples from the test set of the SAMSum dataset, which can be downloaded here 9. Table 7 provides a full index list of the samples. Table 9 shows the error analysis for the BART-Discrete, BART-Global, BART-Stage, BART-Topic and BART-Multi-view models. It can be observed that: (i) without any explicit structures, the discrete-view and global-view models generated summaries with more redundancies compared to the gold reference summaries, as the models may easily lose focus given massive amounts of information; (ii) once we introduced certain conversation structures such as the topic view and stage view, the models behaved better in terms of redundancy and incorrect reasoning, indicating that the structured views help models better understand the conversations; (iii) our multi-view models, which combined both the stage view and topic view, made the fewest errors compared to all single-view models, suggesting the effectiveness of combining different views for conversation summarization.