A Dataset for Document Grounded Conversations

This paper introduces a document grounded dataset for conversations. We define "Document Grounded Conversations" as conversations that are about the contents of a specified document. In this dataset the specified documents are Wikipedia articles about popular movies. The dataset contains 4112 conversations with an average of 21.43 turns per conversation. The dataset therefore provides not only a relevant chat history for response generation but also a source of information that models can use. We describe two neural architectures that provide benchmark performance on the task of generating the next response. We also evaluate our models for engagement and fluency, and find that the information from the document helps in generating more engaging and fluent responses.


Introduction
At present, dialog systems are considered to be either task-oriented, where a specific task is the goal of the conversation (e.g. getting bus information or weather for a particular location), or non-task-oriented, where conversations are more for the sake of themselves, be it entertainment or passing the time. Ultimately, we want our agents to smoothly interleave task-related information flow and casual chat for the given situation. There is a pressing need for a dataset that caters to both these objectives. Prior work provides a comprehensive list of available datasets for building end-to-end conversational agents. Datasets based on movie scripts (Lison and Tiedemann, 2016; Danescu-Niculescu-Mizil and Lee, 2011a) contain artificial conversations. The Ubuntu Dialogue Corpus is based on technical support logs from the Ubuntu forum. The Frames dataset (Asri et al., 2017) was collected to solve the problem of frame tracking. These datasets do not provide grounding of the information presented in the conversations. Zhang et al. (2018) focus on personas in dialogues: each worker has a set of predefined facts about the persona that they can talk about. Most of these datasets lack conversations with a large number of on-topic turns.
We introduce a new dataset which addresses the concerns of grounding, context and coherence in conversation responses. The dataset contains real human conversations grounded in a document. Although our examples use Wikipedia articles about movies, we expect the same techniques to be valid for other external documents such as manuals, instruction booklets, and other informational documents. We build a generative model with and without the document information and find that the responses generated by the model with the document information are more engaging (+7.5% preference) and more fluent (+0.96 MOS). The perplexity also shows an 11.69 point improvement.

The Document Grounded Dataset
To create a dataset for document grounded conversations, we need two things: (1) a set of documents, and (2) two humans chatting about the content of a document for more than 12 turns. We collected conversations about the documents through Amazon Mechanical Turk (AMT). We restrict the topic of the documents to movie-related articles to facilitate the conversations. We initially experimented with different potential domains. Since movies are engaging and widely known, people stay on task when discussing them. To make the task interesting, we offered the participants a choice of movies so that they would be invested in the task.

Document Set Creation
We choose Wikipedia (Wiki) articles to create a set of documents $D = \{d_1, \ldots, d_{30}\}$ for grounding of conversations. We randomly select 30 movies, covering various genres such as thriller, super-hero, animation, romantic, and biopic. We extract the key information provided in the Wiki article and divide it into four separate sections. This was done to reduce the load on the users, who must read, absorb and discuss the information in the document. Hence, each movie document $d_i$ consists of four sections $\{s_1, s_2, s_3, s_4\}$ corresponding to basic information and three key scenes of the movie. The basic information section $s_1$ contains data from the Wikipedia article in a standard form, such as year, genre, and director. It also includes a short introduction about the movie, ratings from major review websites, and some critical responses. Each of the key scene sections $\{s_2, s_3, s_4\}$ contains one short paragraph from the plot of the movie. Each paragraph contains on average 7 sentences and 143 words. These paragraphs were extracted automatically from the original articles, and were then lightly edited by hand to make them consistent in size and detail. An example of the document is attached in the Appendix.

Dataset Creation
Creating a dataset of conversations that use the information from the document requires the participation of two workers. Hence, we explore two scenarios: (1) only one worker has access to the document and the other worker does not, and (2) both workers have access to the document. In both settings, the workers are given the common instruction to chat for at least 12 turns.
Scenario 1: One worker has document. In this scenario, only one worker has access to the document; the other worker cannot see it. The instruction to the worker with the document is: "Tell the other user what the movie is, and try to persuade the other user to watch/not to watch the movie using the information in the document". The instruction to the worker without the document is: "After you are told the name of the movie, pretend you are interested in watching the movie, and try to gather all the information you need to make a decision whether to watch the movie in the end". An example of part of the dialogue for this scenario:
user2: Hey have you seen the inception?
user1: No, I have not but have heard of it. What is it about
user2: It's about extractors that perform experiments using military technology on people to retrieve info about their targets.
Table 2 (example dialogue):
User 2: I thought The Shape of Water was one of Del Toro's best works. What about you?
User 1: Did you like the movie?
User 1: Yes, his style really extended the story.
User 2: I agree. He has a way with fantasy elements that really helped this story be truly beautiful.
Workflow: When the two workers enter the chat-room, they are initially given only the first section, the basic information $s_1$ of the document $d_i$. After the two workers complete 3 turns (for the first section 6 turns are needed due to initial greetings), they are shown the next section. The workers are encouraged to discuss information in the new section, but are not constrained to do so.
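The section schedule described in the workflow can be sketched as follows. This is only an illustration under stated assumptions: the function name `current_section` and its parameters are hypothetical, and we assume that the turn counter is cumulative over the whole conversation.

```python
def current_section(completed_turns, first_section_turns=6,
                    later_section_turns=3, total_sections=4):
    """Return the 1-based index of the newest section shown to the workers.

    Assumption: the first section needs 6 completed turns (greetings included)
    and every later section needs 3 more turns before the next one appears.
    """
    if completed_turns < first_section_turns:
        return 1
    # Number of additional sections unlocked after the first threshold.
    extra = (completed_turns - first_section_turns) // later_section_turns
    return min(total_sections, 2 + extra)
```

Under these assumptions the fourth section appears once 12 turns are complete, which matches the minimum conversation length requested from the workers.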

Dataset Statistics
The dataset consists of a total of 4112 conversations with an average of 21.43 turns. The number of conversations is 2128 for scenario 1 and 1984 for scenario 2. We consider a turn to be an exchange between two workers (say $w_1$ and $w_2$). Hence an exchange of $w_1, w_2, w_1$ has 2 turns, $(w_1, w_2)$ and $(w_2, w_1)$. We compare our dataset, CMU Document Grounded Conversations (CMU DoG), with other datasets in Table 3. One of the salient features of the CMU DoG dataset is that it maps the conversation turns to each section of the document, which can then be used to model conversation responses. Another useful aspect is that we report the quality of the conversations in terms of how closely the conversation adheres to the information in the document.
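The turn definition used in these statistics can be sketched as follows. The helper name `count_turns` is hypothetical, and we assume that two consecutive utterances by the same worker do not form a turn, a case the definition above does not cover.

```python
def count_turns(speakers):
    """Count turns as adjacent utterance pairs spoken by different workers.

    `speakers` is the sequence of worker ids, one per utterance.
    Assumption: consecutive utterances by the same worker add no turn.
    """
    return sum(1 for a, b in zip(speakers, speakers[1:]) if a != b)
```

For the exchange $w_1, w_2, w_1$ this counts the pairs $(w_1, w_2)$ and $(w_2, w_1)$, giving 2 turns.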
Split Criteria: We automatically measure the quality of the conversations using the BLEU (Papineni et al., 2002) score. We use BLEU because we want to measure the overlap of the turns of the conversation with the sections of the document. A good quality conversation should use more information from the document than a low quality conversation. The BLEU score is calculated between all the utterances $\{x_1, \ldots, x_n\}$ of a conversation $C_i$ and the document $d_i$ corresponding to $C_i$. We eliminate incomplete conversations that have fewer than 10 turns. The percentiles for the remaining conversations are shown in Table 5. We split the dataset into three ratings based on the BLEU score.
Rating 1: Conversations are given a rating of 1 if their BLEU score is less than or equal to 0.1. We consider these conversations to be of low-quality.
Rating 2: All the conversations that do not fit in rating 1 or rating 3 are marked with a rating of 2.
Rating 3: Conversations are labeled with a rating of 3 only if the conversation has more than 12 turns and a BLEU score larger than 0.587. This threshold was calculated by summing the mean (0.385) and the standard deviation (0.202) of the BLEU scores of the conversations that do not belong to rating 1. The average BLEU score for workers who have access to the document is 0.22, whereas the average BLEU score for workers without access to the document is 0.03. This suggests that even if the workers had external knowledge about the movie, they did not use it extensively in the conversation. It also suggests that the workers with the document did not use the information from the document verbatim in the conversation. Table 4 shows the total number of conversations, the number of utterances, the average number of utterances per conversation and the average length of utterances for all three ratings.
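The three rating rules can be summarized in a short sketch. The function and constant names are hypothetical; the thresholds are the ones stated above, and the BLEU score is assumed to be computed beforehand.

```python
LOW_BLEU = 0.1             # rating 1 upper bound (inclusive)
HIGH_BLEU = 0.385 + 0.202  # mean + standard deviation, about 0.587
MIN_TURNS_RATING_3 = 12    # rating 3 needs strictly more turns than this


def rate_conversation(bleu, num_turns):
    """Assign rating 1, 2 or 3 to a conversation from its BLEU score and length."""
    if bleu <= LOW_BLEU:
        return 1
    if num_turns > MIN_TURNS_RATING_3 and bleu > HIGH_BLEU:
        return 3
    # Everything that is neither rating 1 nor rating 3.
    return 2
```

A conversation with a high BLEU score but only 12 turns therefore receives rating 2, since rating 3 requires both conditions.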

Models
In this section we discuss models which can leverage the information from the document for generating responses. We explore generative models for this purpose. Given a dataset $X = \{x_0, \ldots, x_n\}$ of utterances in a conversation $C_i$, we consider two settings: (1) to generate a response $x_{i+1}$ when given only the current utterance $x_i$, and (2) to generate a response $x_{i+1}$ when given the corresponding section $s_i$ and the previous utterance $x_i$.

Without section:
We use the sequence-to-sequence model (Sutskever et al., 2014) to build our baseline model. Formally, let $\theta_E$ represent the parameters of the encoder. Then the representation $h_{x_i}$ of the current utterance $x_i$ is given by:
$$h_{x_i} = \mathrm{Encoder}(x_i; \theta_E) \tag{1}$$
Samples of $x_{i+1}$ are generated as follows:
$$p(\hat{x} \mid h_{x_i}) = \prod_t p(\hat{x}_t \mid \hat{x}_{<t}, h_{x_i}) \tag{2}$$
where $\hat{x}_{<t}$ are the tokens generated before $\hat{x}_t$. We also use global attention (Luong et al., 2015) with a copy mechanism (See et al., 2017) to guide our generators to replace the unknown (UNK) tokens. We call this model SEQ.
With section: We extend the sequence-to-sequence framework to include the section $s_i$ corresponding to the current turn. We use the same encoder to encode both the utterance and the section. We get the representation $h_{x_i}$ of the current utterance $x_i$ using Eq. 1. The representation of the section is given by:
$$h_{s_i} = \mathrm{Encoder}(s_i; \theta_E) \tag{3}$$
The input at each time step $t$ to the generative model is given by $[e_{t-1}; h_{s_i}]$, where $e_{t-1}$ is the embedding of the word at the previous time step. We call this model SEQS.
Experimental Setup: For both the SEQ and SEQS models, we use a two-layer bidirectional LSTM as the encoder and an LSTM as the decoder. The dropout rate of the LSTM output is set to 0.3. The size of the hidden units for both LSTMs is 300. We set the word embedding size to 100, since the size of the vocabulary is relatively small. The models are trained with the Adam (Kingma and Ba, 2014) optimizer with learning rate 0.001 until they converge on the validation set under the perplexity criterion. We use beam search with size 5 for response generation. We use all the data (i.e., all the conversations regardless of rating and scenario) for training and testing. The proportion of the train/validation/test split is 0.8/0.05/0.15.

Evaluation
In what follows, we first present an analysis of the dataset, then provide an automatic metric for the evaluation of our models, perplexity, and finally present the results of the human evaluation of the generated responses for engagement and fluency.
Dataset analysis: We perform two kinds of automated evaluation to investigate the usefulness of the document in the conversation. The first investigates whether the workers use the information from the document in the conversation. The second shows that the document adds value to the conversation. Let the set of tokens in the current utterance $x_i$ be $N$, the set of tokens in the current section $s_i$ be $M$, the set of tokens in the previous three utterances be $H$, and the set of stop words be $S$. In scenario 1, we calculate the set operation (NW) as $|((N \cap M) \setminus H) \setminus S|$.
Let the tokens that appear in all the utterances $(x_i, \ldots, x_{i+k})$ corresponding to the current section $s_i$ be $K$, and the tokens that appear in all the utterances $(x_i, \ldots, x_{i+p})$ corresponding to the previous section $s_{i-1}$ be $P$. In scenario 2, we calculate the set operation (NW) as $|((K \cap M) \setminus P) \setminus S|$.
The results in Table 6 show that people use the information in the new sections and are not fixated on old sections. They also show that people use the information to construct their responses.
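The NW set operation defined above can be sketched with Python sets. The function name `new_words` is hypothetical, and the inputs are assumed to be already tokenized; the same function covers both scenarios, since $|((N \cap M) \setminus H) \setminus S|$ and $|((K \cap M) \setminus P) \setminus S|$ share one form.

```python
def new_words(current_tokens, section_tokens, previous_tokens, stop_words):
    """Count section tokens that are newly used in the current text.

    Scenario 1: current = utterance tokens (N), previous = last three utterances (H).
    Scenario 2: current = tokens of utterances for the section (K),
                previous = tokens of utterances for the previous section (P).
    """
    current = set(current_tokens)
    section = set(section_tokens)    # M
    previous = set(previous_tokens)
    stop = set(stop_words)           # S
    return len(((current & section) - previous) - stop)
```

For illustration, with a hypothetical utterance "the movie was directed by nolan", a section "inception is directed by christopher nolan", a history "have you seen the movie" and the stop words {the, by, is, was}, the count is 2 (the tokens "directed" and "nolan").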
Perplexity: To automatically evaluate the fluency of the models, we use the perplexity measure. We build a language model on the train set of responses using n-grams up to an order of 3. The generated test responses achieve a perplexity of 21.8 for the SEQ model and 10.11 for the SEQS model. This indicates that including the sections of the document helps in the generation process.
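To make the perplexity measure concrete, the following is a minimal sketch of an n-gram language model evaluated on tokenized responses. It is not the toolkit used for the reported numbers: it uses a single n-gram order with add-one smoothing (an assumption made here for simplicity, with no backoff to lower orders), and all names are hypothetical.

```python
import math
from collections import Counter

BOS, EOS = "<s>", "</s>"


def _pad(tokens, order):
    # Add start symbols so the first word has a full context, plus an end symbol.
    return [BOS] * (order - 1) + list(tokens) + [EOS]


def train_ngram_counts(sentences, order=3):
    """Collect n-gram and context counts from tokenized training responses."""
    ngrams, contexts, vocab = Counter(), Counter(), {EOS}
    for tokens in sentences:
        padded = _pad(tokens, order)
        vocab.update(tokens)
        for i in range(order - 1, len(padded)):
            context = tuple(padded[i - order + 1:i])
            ngrams[context + (padded[i],)] += 1
            contexts[context] += 1
    return ngrams, contexts, vocab


def perplexity(sentences, ngrams, contexts, vocab, order=3):
    """Perplexity of tokenized responses under the add-one smoothed model."""
    v = len(vocab) + 1  # one extra slot shared by all unseen words
    log_prob, n = 0.0, 0
    for tokens in sentences:
        padded = _pad(tokens, order)
        for i in range(order - 1, len(padded)):
            context = tuple(padded[i - order + 1:i])
            p = (ngrams[context + (padded[i],)] + 1) / (contexts[context] + v)
            log_prob += math.log(p)
            n += 1
    return math.exp(-log_prob / n)
```

Lower perplexity means the generated responses look more like the training responses to the language model, which is why it is used here as an automatic proxy for fluency.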

Human Evaluation
We also perform two kinds of human evaluation to assess the quality of the predicted utterances: engagement and fluency. These experiments are performed on Amazon Mechanical Turk.

Engagement:
We set up a pairwise comparison following Bennett (2005) to evaluate the engagement of the generated responses. The test presents the chat history (1 utterance) and then, in random order, the corresponding responses produced by the SEQ and SEQS models. A third option, "No Preference", was given to participants to mark no preference for either of the generated responses. The instruction given to the participants is "Given the above chat history as context, you have to pick the one which can be best used as the response based on the engagingness." We randomly sample 90 responses from each model. Each response was annotated by 3 unique workers and we take the majority vote as the final label. SEQ generated responses were chosen only 36.4% of the time, as opposed to SEQS generated responses, which were chosen 43.9% of the time; the "No Preference" option was chosen 19.6% of the time. This result shows that the information from the sections improves the engagement of the generated responses.
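The majority vote over the 3 annotations of a response pair can be sketched as follows. The function name `majority_label` is hypothetical, and returning `None` when all three workers disagree is an assumption, since the handling of such ties is not specified above.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by most annotators, or None on a tie.

    `votes` holds one label per worker, e.g. "SEQ", "SEQS" or "No Preference".
    Assumption: a tie for first place yields no final label.
    """
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None
    return counts[0][0]
```

With 3 annotators and 3 options, a tie can only occur when each worker picks a different option.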
Fluency: The workers were asked to evaluate the fluency of the generated response on a scale of 1 to 4, where 1 is unreadable and 4 is perfectly readable. We randomly select 120 generated responses from each model, and each response was annotated by 3 unique workers. The SEQ model received a low score of 2.88, in contrast to the SEQS score of 3.84. This outcome demonstrates that the information in the section also helps in guiding the generator to produce fluent responses.

Conclusion
In this paper we introduce a crowd-sourced conversations dataset that is grounded in a predefined set of documents and is available for download. We perform multiple automatic and human judgment based analyses to understand the value that the information from the document provides to the generation of responses. The SEQS model, which uses the information from the section to generate responses, outperforms the SEQ model in the evaluation tasks of engagement, fluency and perplexity.