SAMSum Corpus: A Human-annotated Dialogue Dataset for Abstractive Summarization

This paper introduces the SAMSum Corpus, a new dataset with abstractive dialogue summaries. We investigate the challenges it poses for automated summarization by testing several models and comparing their results with those obtained on a corpus of news articles. We show that model-generated summaries of dialogues achieve higher ROUGE scores than model-generated summaries of news, in contrast with human evaluators' judgement. This suggests that the challenging task of abstractive dialogue summarization requires dedicated models and non-standard quality measures. To our knowledge, our study is the first attempt to introduce a high-quality chat-dialogue corpus, manually annotated with abstractive summaries, which can be used by the research community for further studies.


Introduction and related work
The goal of the summarization task is to condense a piece of text into a shorter version that covers the main points succinctly. In the abstractive approach, important pieces of information are presented using words and phrases not necessarily appearing in the source text. This requires natural language generation techniques with a high level of semantic understanding (Chopra et al., 2016; Rush et al., 2015; Khandelwal et al., 2019; Zhang et al., 2019; See et al., 2017; Chen and Bansal, 2018; Gehrmann et al., 2018).
Major research efforts have so far focused on the summarization of single-speaker documents such as news (e.g., Nallapati et al. (2016)) or scientific publications (e.g., Nikolov et al. (2018)). One reason is the availability of large, high-quality news datasets with annotated summaries, e.g., CNN/Daily Mail (Hermann et al., 2015; Nallapati et al., 2016). Such a comprehensive dataset for dialogues is lacking.
The challenges posed by the abstractive dialogue summarization task have been discussed in the literature with regard to the AMI meeting corpus (McCowan et al., 2005), e.g., Banerjee et al. (2015), Mehdad et al. (2014), Goo and Chen (2018). Since the corpus has a low number of summaries (for 141 dialogues), Goo and Chen (2018) proposed to use the assigned topic descriptions as gold references. These are short, label-like goals of the meeting, e.g., costing evaluation of project process; components, materials and energy sources; chitchat. Such descriptions, however, are very general, lacking the messenger-like structure and any information about the speakers.
To benefit from large news corpora, Ganesh and Dingliwal (2019) built a dialogue summarization model that first converts a conversation into a structured text document and then applies an attention-based pointer network to create an abstractive summary. Their model, trained on structured text documents from the CNN/Daily Mail dataset, was evaluated on the Argumentative Dialogue Summary Corpus (Misra et al., 2015), which, however, contains only 45 dialogues.
In the present paper, we further investigate the problem of abstractive dialogue summarization. With the growing popularity of online conversations via applications like Messenger, WhatsApp and WeChat, summarization of chats between a few participants is an interesting new direction of summarization research. For this purpose we have created the SAMSum Corpus. The paper is structured as follows: in Section 2 we present details about the new corpus and describe how it was created, validated and cleaned. A brief description of the baselines used in the summarization task can be found in Section 3. In Section 4, we describe our experimental setup and model parameters. The evaluations of the summarization models, the automatic one with the ROUGE metric and the linguistic one, are reported in Section 5 and Section 6, respectively. Examples of models' outputs and some errors they make are described in Section 7. Finally, discussion, conclusions and ideas for further research are presented in Sections 8 and 9.

SAMSum Corpus
Initial approach. Since there was no available corpus of messenger conversations, we considered two approaches to build one: (1) using existing datasets of documents that have a form similar to chat conversations, (2) creating such a dataset with linguists.
In the first approach, we reviewed datasets from the following categories: chatbot dialogues, SMS corpora, IRC/chat data, movie dialogues, tweets, comments data (conversations formed by replies to comments), transcriptions of meetings, written discussions, phone dialogues and daily communication data. Unfortunately, they all differed in some respect from the conversations typically written in messenger apps, e.g., they were too technical (IRC data), too long (comments data, transcriptions of meetings), lacked context (movie dialogues) or were more of a spoken type, such as a dialogue between a petrol station assistant and a client buying petrol.
As a consequence, we decided to create a chat dialogue dataset by constructing such conversations that would epitomize the style of a messenger app.
Process of building the dataset. Our dialogue summarization dataset contains natural messenger-like conversations created and written down by linguists fluent in English. The style and register of the conversations are diversified: dialogues may be informal, semi-formal or formal, and they may contain slang phrases, emoticons and typos. We asked the linguists to create conversations similar to those they write on a daily basis, reflecting the topic distribution of their real-life messenger conversations. This includes chit-chat, gossiping about friends, arranging meetings, discussing politics, consulting university assignments with colleagues, etc. Therefore, the dataset does not contain any sensitive data or fragments of other corpora.
Each dialogue was created by one person. After collecting all of the conversations, we asked language experts to annotate them with summaries, assuming that the summaries should (1) be rather short, (2) extract important pieces of information, (3) include the names of the interlocutors, (4) be written in the third person. Each dialogue contains only one reference summary.
Validation. Since the SAMSum corpus contains dialogues created by linguists, the question arises whether such conversations are really similar to those typically written via messenger apps.
To find the answer, we performed a validation task. We asked two linguists to doubly annotate 50 conversations in order to verify whether the dialogues could appear in a messenger app and could be summarized (i.e., the dialogue is not too general or unintelligible) or not (e.g., a dialogue between two people in a shop). The results revealed that 94% of the examined dialogues were classified by both annotators as good, i.e., they do look like conversations from a messenger app and could be condensed in a reasonable way. In a similar validation task, conducted for the existing dialogue-type datasets (described in the Initial approach paragraph), the annotators agreed that only 28% of the dialogues resembled conversations from a messenger app.
Cleaning data. After preparing the dataset, we cleaned it in a semi-automatic way. Beforehand, we specified a format for written dialogues with summaries: a colon should separate the author of an utterance from its content, and each utterance is expected to be on a separate line. We could therefore easily find all deviations from the agreed structure; some of them could be fixed automatically (e.g., when someone used a semicolon instead of a colon right after the interlocutor's name at the beginning of an utterance), while others were passed to linguists for verification. We also tried to correct typos in the interlocutors' names (if one person has several utterances, it happens that there is a typo in his/her name before one of them): we used the Levenshtein distance to find very similar names (possibly with typos, e.g., 'George' and 'Goerge') in a single conversation, and those cases were passed to linguists for verification.
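The name-typo check described above can be sketched as follows. This is a minimal illustration, not our exact cleaning script; the edit-distance threshold of 2 and the example names are assumptions chosen for the sketch.

```python
# Sketch of the semi-automatic name-typo detection used during cleaning.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # deletion
                            curr[j - 1] + 1,              # insertion
                            prev[j - 1] + (ca != cb)))    # substitution
        prev = curr
    return prev[-1]

def suspicious_name_pairs(speakers, max_dist=2):
    """Flag pairs of distinct speaker names within a small edit distance;
    such pairs (possible typos) are passed to linguists for verification."""
    names = sorted(set(speakers))
    return [(a, b)
            for i, a in enumerate(names)
            for b in names[i + 1:]
            if 0 < levenshtein(a.lower(), b.lower()) <= max_dist]

print(suspicious_name_pairs(["George", "Goerge", "Blair"]))
# → [('George', 'Goerge')]
```

Flagged pairs are only candidates: 'Goerge' may be a genuine nickname, which is why the final decision is left to a human annotator.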
Description. The created dataset consists of 16,369 conversations distributed uniformly into 4 groups based on the number of utterances in a conversation: 3-6, 7-12, 13-18 and 19-30. Each utterance contains the name of the speaker. Most conversations are dialogues between two interlocutors (about 75% of all conversations); the rest are between three or more people. Table 1 presents the size of the dataset split used in our experiments. An example of a dialogue from this corpus is shown in Table 2.
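The grouping by conversation length can be expressed as a simple bucketing function; a minimal sketch using the four utterance-count ranges given above:

```python
# Utterance-count groups into which the corpus is distributed uniformly.
BUCKETS = ((3, 6), (7, 12), (13, 18), (19, 30))

def length_bucket(n_utterances: int):
    """Return the group label a conversation falls into, or None if the
    utterance count lies outside all four ranges."""
    for lo, hi in BUCKETS:
        if lo <= n_utterances <= hi:
            return f"{lo}-{hi}"
    return None

print(length_bucket(5))   # → 3-6
print(length_bucket(25))  # → 19-30
```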

Dialogue
Blair: Remember we are seeing the wedding planner after work
Chuck: Sure, where are we meeting her?
Blair: At Nonna Rita's
Chuck: Can I order their seafood tagliatelle or are we just having coffee with her? I've been dreaming about it since we went there last month
Blair: Haha sure why not
Chuck: Well we both remmber the spaghetti pomodoro disaster from our last meeting with Diane
Blair: Omg hahaha it was all over her white blouse
Chuck: :D
Blair: :P

Summary
Blair and Chuck are going to meet the wedding planner after work at Nonna Rita's. The tagliatelle served at Nonna Rita's are very good.

Dialogues baselines
The baseline commonly used in the news summarization task is Lead-3 (See et al., 2017), which takes the three leading sentences of the document as the summary. The underlying assumption is that the beginning of the article contains the most significant information. Inspired by the Lead-n model, we propose a few different simple models:
• MIDDLE-n, which takes n utterances from the middle of the dialogue,
• LONGEST-n, treating the n longest utterances, in order of length, as a summary,
• LONGER-THAN-n, taking only utterances longer than n characters, in order of length (if there is no such utterance in the dialogue, it takes the longest one),
• MOST-ACTIVE-PERSON, which treats all utterances of the most active person in the dialogue as a summary.
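These baselines can be sketched directly. Representing a dialogue as a list of (speaker, utterance) pairs is an illustrative assumption of this sketch, not a format prescribed by the corpus:

```python
from collections import Counter

def middle_n(dialogue, n):
    """MIDDLE-n: n utterances from the middle of the dialogue."""
    start = max(0, (len(dialogue) - n) // 2)
    return [u for _, u in dialogue[start:start + n]]

def longest_n(dialogue, n):
    """LONGEST-n: the n longest utterances, in order of length."""
    return sorted((u for _, u in dialogue), key=len, reverse=True)[:n]

def longer_than_n(dialogue, n):
    """LONGER-THAN-n: utterances longer than n characters, in order of
    length; falls back to the single longest utterance if none qualify."""
    ranked = sorted((u for _, u in dialogue), key=len, reverse=True)
    return [u for u in ranked if len(u) > n] or ranked[:1]

def most_active_person(dialogue):
    """MOST-ACTIVE-PERSON: all utterances of the speaker with most turns."""
    top = Counter(s for s, _ in dialogue).most_common(1)[0][0]
    return [u for s, u in dialogue if s == top]
```

The selected utterances are then concatenated and scored against the reference summary exactly like any model output.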
Results of the evaluation of the above models are reported in Table 3. There is no obvious baseline for the task of dialogue summarization. We expected rather low results for Lead-3, as the beginnings of conversations usually contain greetings, not the main part of the discourse. However, it seems that in our dataset greetings are frequently combined with question-asking or information passing (sometimes they are even omitted), and such a baseline works even better than the MIDDLE baseline (taking utterances from the middle of a dialogue). Nevertheless, the best dialogue baseline turns out to be the LONGEST-3 model.

Experimental setup
This section describes the settings used in the experiments carried out.

Data preparation
In order to build a dialogue summarization model, we adopt the following strategies: (1) each candidate architecture is trained and evaluated on the dialogue dataset; (2) each architecture is trained on the CNN/Daily Mail train set joined with the dialogue train set, and evaluated on the dialogue test set.
In addition, we prepare a version of the dialogue data in which utterances are separated by a special separator token (an artificially added token, e.g., '<EOU>' for models using word embeddings, '|' for models using subword embeddings). In all our experiments, news articles and dialogues are truncated to 400 tokens, and summaries to 100 tokens. The maximum length of the generated summaries was not limited.
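A minimal sketch of this preparation step; the '<EOU>' token matches the word-embedding case above, while whitespace tokenization merely stands in for the real tokenizers used by the models:

```python
SEP = "<EOU>"  # artificial utterance separator (word-embedding variant)

def flatten_dialogue(utterances, max_tokens=400):
    """Join the utterances of one dialogue with the separator token and
    truncate the resulting input to the 400-token budget."""
    tokens = f" {SEP} ".join(utterances).split()  # whitespace tokenization
    return " ".join(tokens[:max_tokens])

def truncate_summary(summary, max_tokens=100):
    """Truncate a reference summary to the 100-token budget."""
    return " ".join(summary.split()[:max_tokens])

print(flatten_dialogue(["A: hi", "B: hello"]))
# → A: hi <EOU> B: hello
```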

Models
We carry out experiments with the following summarization models (for all architectures we set the beam size for beam search decoding to 5):
• Pointer generator network (See et al., 2017). In the case of the Pointer Generator, we use the default configuration, changing only the minimum length of the generated summary from 35 (used for news) to 15 (used for dialogues).
• Transformer (Vaswani et al., 2017). The model is trained using the OpenNMT library. We use the same parameters for training both on news and on dialogues, changing only the minimum length of the generated summary: 35 for news and 15 for dialogues.
• Fast Abs RL (Chen and Bansal, 2018). It is trained using its default parameters. For dialogues, we change the convolutional word-level sentence encoder (used in the extractor part) to use only a kernel of size 3 instead of the 3-5 range, because some utterances are very short and the default setting is unable to handle them.
• Fast Abs RL Enhanced. An additional variant of the Fast Abs RL model with slightly changed utterances: at the end of each utterance, after an artificial separator, we add the names of all the other interlocutors. The reason is that Fast Abs RL requires the text to be split into sentences (it selects sentences and then paraphrases each of them). For dialogues, we divide the text into utterances (a natural unit in conversations), so a single utterance may sometimes contain more than one sentence. Given how this model works, it may select an utterance of a single person (each utterance starts with the name of its author) and have no information about the other interlocutors (if their names do not appear in the selected utterances), so it may have no chance to use the right people's names in the generated summaries.

Evaluation metrics
We evaluate the models with the standard ROUGE metric (Lin, 2004), reporting the F1 scores (with stemming) for ROUGE-1, ROUGE-2 and ROUGE-L, following previous work (Chen and Bansal, 2018; See et al., 2017). We obtain the scores using the py-rouge package.
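For illustration, ROUGE-N F1 reduces to n-gram overlap between candidate and reference. This from-scratch sketch uses plain whitespace tokens and omits the stemming that py-rouge applies, so its numbers will differ slightly from the reported scores:

```python
from collections import Counter

def rouge_n_f1(candidate: str, reference: str, n: int = 1) -> float:
    """Simplified ROUGE-N F1: n-gram overlap, lowercased whitespace
    tokens, no stemming (py-rouge also applies stemming)."""
    def ngrams(text):
        toks = text.lower().split()
        return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))

    c, r = ngrams(candidate), ngrams(reference)
    overlap = sum((c & r).values())  # clipped n-gram matches
    if overlap == 0:
        return 0.0
    precision = overlap / sum(c.values())
    recall = overlap / sum(r.values())
    return 2 * precision * recall / (precision + recall)
```

For example, `rouge_n_f1("the cat sat", "the cat ran")` gives 2/3: two of three unigrams match, so precision and recall are both 2/3.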

Results
The results for the news summarization task are shown in Table 4, and for dialogue summarization in Table 5. In both domains, the best models exceed 39 ROUGE-1, 17 ROUGE-2 and 36 ROUGE-L points. Note that the strong baseline for news (Lead-3) is outperformed on all three metrics by only one model. In the case of dialogues, all tested models perform better than the baseline (LONGEST-3).
In general, the Transformer-based architectures benefit from training on the joint news+dialogues dataset, even though news and dialogue documents have very different structures. Interestingly, this does not seem to be the case for the Pointer Generator or Fast Abs RL models.
The inclusion of a separation token between dialogue utterances is advantageous for most models -presumably because it improves the discourse structure.The improvement is most visible when training is performed on the joint dataset.
Having compared two variants of the Fast Abs RL model -with original utterances and with enhanced ones (see Section 4.2), we conclude that enhancing utterances with information about the other interlocutors helps achieve higher ROUGE values.
The largest improvement in model performance is observed for the LightConv and DynamicConv models when they are complemented with pretrained embeddings from the GPT-2 language model, trained on enormous corpora.
It is also worth noting that some models (Pointer Generator, Fast Abs RL), trained only on the dialogue corpus (16k dialogues), reach a similar or better level in terms of ROUGE metrics than models trained on the CNN/DM news dataset (more than 300k articles). Adding pretrained embeddings and training on the joint dataset helps achieve significantly higher ROUGE values for dialogues than the best models achieve on the CNN/DM news dataset.
According to the ROUGE metrics, the best performing model is DynamicConv with GPT-2 embeddings, trained on the joint news and dialogue data with an utterance separation token.

Linguistic verification of summaries
ROUGE is the standard way of evaluating the quality of machine-generated summaries by comparing them with reference ones. The metric, based on n-gram overlap, may however not be very informative for abstractive summarization, where paraphrasing is key to producing high-quality sentences. To quantify this conjecture, we manually evaluated summaries generated by the models for 150 news articles and 100 dialogues. We asked two linguists to mark the quality of every summary on a scale of −1, 0, 1, where −1 means that the summary is poor, extracts irrelevant information or does not make sense at all; 1 means that it is understandable and gives a brief overview of the text; and 0 stands for a summary that extracts only part of the relevant information or makes some mistakes.
We noticed a few annotations (7 for news and 4 for dialogues) with opposite marks (i.e., one annotator's judgement was −1 whereas the other's was 1) and had them annotated once again by another annotator, who had to resolve the conflicts. For the rest, we calculated the linearly weighted Cohen's kappa coefficient (McHugh, 2012) between the annotators' scores, obtaining an agreement of 0.371 for news and 0.506 for dialogues. The annotators' agreement is higher on dialogues than on news, probably because of the structure of the data: articles are often long and it is difficult to decide what the key point of the text is, whereas dialogues are rather short and focused mainly on one topic.
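The linearly weighted kappa can be computed directly from the two annotators' score lists. The sketch below is self-contained for the −1/0/1 scale used above; the sample scores in the test are made up for illustration:

```python
def weighted_kappa(a, b, categories=(-1, 0, 1)):
    """Linearly weighted Cohen's kappa between two annotators' scores:
    kappa = 1 - (observed weighted disagreement / expected one)."""
    k, n = len(categories), len(a)
    idx = {c: i for i, c in enumerate(categories)}
    # linear disagreement weights: |i - j| / (k - 1)
    w = [[abs(i - j) / (k - 1) for j in range(k)] for i in range(k)]
    # observed joint distribution of the two annotators' marks
    obs = [[0.0] * k for _ in range(k)]
    for x, y in zip(a, b):
        obs[idx[x]][idx[y]] += 1 / n
    # marginal distributions (chance agreement model)
    pa = [sum(1 for x in a if x == c) / n for c in categories]
    pb = [sum(1 for y in b if y == c) / n for c in categories]
    d_obs = sum(w[i][j] * obs[i][j] for i in range(k) for j in range(k))
    d_exp = sum(w[i][j] * pa[i] * pb[j] for i in range(k) for j in range(k))
    return 1 - d_obs / d_exp
```

Perfect agreement gives kappa = 1, chance-level agreement gives 0, and systematic disagreement yields negative values.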
For the manually evaluated samples, we calculated the ROUGE metrics and the mean of the two human ratings; the statistics are presented in Table 6. As we can see, models generating dialogue summaries can obtain high ROUGE results while their outputs are marked as poor by human annotators. Our conclusion is that the ROUGE metric corresponds with the quality of generated summaries much better for news than for dialogues, which is confirmed by Pearson's correlation between human evaluation and the ROUGE metric, shown in Table 7.
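Pearson's correlation between the ROUGE scores and the mean human ratings is the usual normalized covariance; a minimal sketch (the score lists in the example are made up):

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient between two paired score lists,
    e.g. per-summary ROUGE scores vs. mean human ratings."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

A value near 1 means ROUGE tracks the human judgement well; the low correlation observed for dialogues indicates the opposite.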


Difficulties in dialogue summarization
In a structured text, such as a news article, the information flow is very clear. In a dialogue, however, which contains discussions (e.g., when people try to agree on the date of a meeting), questions (one person asks about something and the answer may appear a few utterances later) and greetings, the most important pieces of information are scattered across the utterances of different speakers. What is more, articles are written from the third-person point of view, whereas in a chat everyone talks about themselves, using a variety of pronouns, which further complicates the structure. Additionally, people talking on messengers are often in a hurry, so they shorten words, use slang phrases (e.g., 'u r gr8' means 'you are great') and make typos. These phenomena increase the difficulty of dialogue summarization. Table 8 presents example dialogues together with summaries produced by the best tested models:
• DynamicConv + GPT-2 embeddings with a separator (trained on news + dialogues),
• DynamicConv + GPT-2 embeddings (trained on news + dialogues),
• Fast Abs RL (trained on dialogues),
• Fast Abs RL Enhanced (trained on dialogues),
• Transformer (trained on news + dialogues).
One can easily notice problematic issues. Firstly, the models frequently have difficulty associating names with actions, often repeating the same name; e.g., for Dialogue 1 in Table 8, Fast Abs RL generates the following summary: 'lilly and lilly are going to eat salmon'. To help the model deal with names, the utterances are enhanced by adding information about the other interlocutors, as in the Fast Abs RL Enhanced variant described in Section 4.2. In this case, after enhancement, the model generates a summary containing both interlocutors' names: 'lily and gabriel are going to pasta...'. Sometimes the models correctly choose speakers' names when generating a summary, but make a mistake in deciding who performs the action (the subject) and who receives it (the object); e.g., for Dialogue 4, the DynamicConv + GPT-2 emb. w/o sep. model generates the summary 'randolph will buy some earplugs for maya', while the correct form is 'maya will buy some earplugs for randolph'.
A closely related problem is capturing the context and extracting information about the arrangements made in the discussion. For instance, for Dialogue 4, the Fast Abs RL model draws a wrong conclusion from the agreed arrangement. This issue is quite frequently visible in summaries generated by Fast Abs RL, which may be a consequence of the way it is constructed: it first chooses important utterances and then summarizes each of them separately. This narrows the context and loses important pieces of information.
One more aspect of summary generation is deciding which information in the dialogue is important. For instance, for Dialogue 3, DynamicConv + GPT-2 emb. with sep. generates a correct summary, but focuses on a different piece of information than the one included in the reference summary. In contrast, some other models, like Fast Abs RL Enhanced, select both of the pieces of information appearing in the discussion. On the other hand, when summarizing Dialogue 5, the models seem to focus too much on the phrase 'it's the best place', intuitively not the most important one to summarize.

Discussion
This paper is a step towards abstractive summarization of dialogues: (1) it introduces a new dataset created for this task, and (2) it compares dialogue summarization with news summarization by means of automated (ROUGE) and human evaluation.
Most of the tools and the metrics measuring the quality of text summarization have been developed for a single-speaker document, such as news; as such, they are not necessarily the best choice for conversations with several speakers.
We test a few general-purpose summarization models. In terms of human evaluation, the results of dialogue summarization are worse than the results of news summarization. This is connected with the fact that the dialogue structure is more complex: information is spread over multiple utterances, discussions and questions, and more typos and slang words appear, posing new challenges for summarization. On the other hand, dialogues are divided into utterances, and each utterance has an assigned author. We demonstrate in our experiments that the models benefit from the introduction of separators that mark the utterances of each person. This suggests that dedicated models with architectural changes that systematically take into account the assignment of a person to an utterance could improve the quality of dialogue summarization.
We show that the most popular summarization metric, ROUGE, does not reflect the quality of a summary. Looking at the ROUGE scores, one would conclude that the dialogue summarization models perform better than the ones for news summarization. In fact, this hypothesis is not true: we performed an independent, manual analysis of the summaries and demonstrated that the high ROUGE results obtained for automatically-generated dialogue summaries correspond with lower evaluation marks given by human annotators. An interesting example of this misleading behavior of the ROUGE metrics is presented in Table 9 for Dialogue 4, where a wrong summary, 'paul and cindy don't like red roses.', obtained higher values on all ROUGE metrics than a correct summary, 'paul asks cindy what color flowers should buy.'. Despite their lower ROUGE values, news summaries were scored higher by human evaluators. We conclude that when measuring the quality of model-generated summaries, the ROUGE metrics are more indicative for news than for dialogues, and a new metric should be designed to measure the quality of abstractive dialogue summaries.

Conclusions
In our paper we have studied the challenges of abstractive dialogue summarization. We have addressed a major factor that prevents researchers from engaging with this problem: the lack of a proper dataset. To the best of our knowledge, this is the first attempt to create a comprehensive resource of this type that can be used in future research. The next step could be creating an even more challenging dataset with longer dialogues that not only cover one topic, but span numerous different ones. As shown, summarization of dialogues is much more challenging than summarization of news. Performing it well may require designing dedicated tools, as well as new, non-standard measures that capture the quality of abstractive dialogue summaries in a relevant way. We hope to tackle these issues in future work.

Table 2 :
Example of a dialogue from the collected corpus

Table 3 :
Baselines for dialogue summarization

Table 4 :
Model evaluation on the news corpus test set

Table 5 :
Model evaluation on the dialogues corpus test set

Table 6 :
Statistics of human evaluation of summaries' quality and ROUGE evaluation of those summaries

Table 7 :
Pearson's correlations between human judgement and the ROUGE metric