MuTual: A Dataset for Multi-Turn Dialogue Reasoning

Non-task-oriented dialogue systems have achieved great success in recent years due to largely accessible conversation data and the development of deep learning techniques. Given a context, current systems are able to yield a relevant and fluent response, but sometimes make logical mistakes because of weak reasoning capabilities. To facilitate research on conversational reasoning, we introduce MuTual, a novel dataset for Multi-Turn dialogue Reasoning, consisting of 8,860 manually annotated dialogues based on Chinese student English listening comprehension exams. Compared to previous benchmarks for non-task-oriented dialogue systems, MuTual is much more challenging since it requires a model that can handle various reasoning problems. Empirical results show that state-of-the-art methods only reach 71%, which is far behind human performance of 94%, indicating that there is ample room for improving reasoning ability.


Introduction
Building an intelligent conversational agent is one of the longest-running goals in AI. Existing conversational agents can be categorized into task-oriented dialogue systems (Kannan et al., 2016) and non-task-oriented chatbot systems (Shum et al., 2018). Owing to the rise of deep learning techniques and the large amount of conversation data for training (Lowe et al., 2015; Wu et al., 2017; Zhang et al., 2018b), we are now witnessing promising results of chatbots both in academia and industry (Pan et al., 2019; Tao et al., 2019).
Neural dialogue systems are trained over a large dialogue corpus and used to predict responses given a context. There are two lines of methods: retrieval-based methods and generation-based methods, which rely on matching scores and perplexity scores, respectively. Due to the development of text matching and pre-training models (Devlin et al., 2019), a machine is able to achieve highly competitive results on these datasets, even close to human performance. For instance, ESIM achieves 88% on the Dialogue NLI (Welleck et al., 2019), and BERT achieves 85.8%, 93.1% and 98.5% in terms of R10@1, R10@2 and R10@5 on the Ubuntu Corpus (Whang et al., 2019).
Figure 1: B is incorrect because there is no reason to apologize. C and D can be excluded because, based on the context, the relationship between the two speakers is that of waiter and customer.
However, there is still a huge gap between high performance on the leaderboard and poor practical user experience. Chatbot engines often generate responses that are logically incorrect or violate commonsense knowledge (Shum et al., 2018). A likely reason is that current dialogue systems do not have strong reasoning skills, and most of the cases in previous benchmarks can be tackled by matching linguistic information. Previous work has demonstrated that neural encoders capture a rich hierarchy of syntactic and semantic information (Jawahar et al., 2019; Clark et al., 2019). However, reasoning capability and commonsense knowledge are not captured sufficiently (Young et al., 2018).
One important research question is how we can evaluate reasoning ability in chatbots, which can potentially allow us to bridge the gap between high performance on the leaderboard and unsatisfactory practical performance. To this end, we develop an open-domain Multi-Turn dialogue reasoning dataset (MuTual) to facilitate conversation model reasoning capabilities. In particular, given a context, we prepare four response candidates, each of which is relevant to the context, but only one of them is logically correct. As shown in Figure 1, all responses follow the same topic, but only the first one is appropriate. Making the correct choice requires reasoning about social etiquette and the relationship between the speakers, which existing dialogue benchmarks do not consider. We build our dataset based on Chinese high school English listening comprehension test data, where students are expected to select the best answer from three candidate options, given a multi-turn dialogue and a question. The original data is formatted as <dialogue, question, answer> triples, which is not directly suitable for our goal, since chatbots are concerned only with how to respond to contexts rather than with answering an additional question. Therefore, we ask human annotators to rewrite the question and answer candidates as response candidates. Our dataset then follows the traditional response selection setting (Lowe et al., 2015), where a model should recognize the correct response among others for a multi-turn dialogue.
Table 1: Comparison between our dataset and other datasets. "Manually" indicates that human writing of the question or answers is involved in the data annotation process, rather than mere manual selection of data. (Table body lost in extraction; surviving cell fragments: "Narrative", "MuTual", "Next Utterances Prediction", "Open".)
The resulting dataset, MuTual, consists of 8,860 challenge questions, almost all of which involve reasoning; they are designed by linguist experts and high-quality annotators. We evaluate state-of-the-art retrieval-based models and pre-training models on MuTual. The best method gives an R@1 of 71%, which is significantly below human performance (94%). To the best of our knowledge, MuTual is the first human-labeled reasoning-based dataset for multi-turn dialogue. We provide a detailed analysis to offer insights into developing reasoning-based chitchat dialogue systems. Table 1 compares our dataset with prior dialogue and reasoning related benchmarks.

Related work
Dialogue: The Ubuntu Dialogue Corpus is a large retrieval-based dataset (Lowe et al., 2015), extracted from Ubuntu chat logs. PERSONA-CHAT (Zhang et al., 2018a) considers consistent personality in dialogue. Crowd workers are required to act the part of a given persona and chat naturally. Dialogue NLI (Welleck et al., 2019) is a natural language inference dataset modified from PERSONA-CHAT. It demonstrates that NLI can be used to improve the consistency of dialogue models. CoQA (Reddy et al., 2019) is collected by pairing two annotators to chat about a passage in the form of questions and answers. Each question is dependent on the conversation history. There are also several large-scale datasets in Chinese, such as Sina Weibo (Shang et al., 2015), Douban Conversation Corpus (Wu et al., 2017) and E-commerce Dialogue Corpus (Zhang et al., 2018b).
As shown in Table 1, most of the existing conversation benchmarks do not focus on testing reasoning ability. One exception is CoQA, which considers pragmatic reasoning. The difference is that CoQA is a machine comprehension dataset, in which conversations are based on a given passage. Another related reading comprehension dataset is DREAM (Sun et al., 2019), which is designed specifically for challenging dialogue-based reading comprehension. It relies on an external question to test the model's understanding capability. In contrast to the above datasets, our dataset is a next utterance prediction task, which is the fundamental problem in retrieval-based chatbots. In addition, our dataset requires various specific reasoning abilities, such as algebraic reasoning and intention prediction, which is its main characteristic.
Figure 2 (dialogue text recovered from the figure):
Ma'am, you forgot your phone.
Oh, thanks, I couldn't live without this little thing. I know what you mean. It is of great significance to you. So did you enjoy your dinner?
Oh yes, everything was just perfect. It's so hard to take the whole family out to eat, but your restaurant was perfect. Johnny had his own place to play in and I had time to talk with my sisters and their husbands.
I'm glad to hear it. Our kids area is always popular.
Well, you can be sure we'll be back.
Reasoning: Recently, efforts have been made to develop benchmarks and tasks to address reasoning for language understanding. Winograd Schema Challenge (Levesque et al., 2012) is a reasoning-based coreference resolution task. Each pair of sentences differs by only one phrase. SWAG (Zellers et al., 2018) is derived from pairs of consecutive video captions and includes 113k short contexts, each with four candidate endings. CommonsenseQA (Talmor et al., 2019) is a question answering dataset extracted from CONCEPTNET (Speer et al., 2016). Utilizing CONCEPTNET to construct the dataset ensures that questions directly target commonsense reasoning. RACE is a machine reading comprehension dataset collected from English exams for Chinese students. AI2 Reasoning Challenge (Clark et al., 2018) contains 7,787 genuine grade-school level science questions with a corpus of 14M science reference sentences. DROP (Dua et al., 2019) and COSMOS (Huang et al., 2019) focus on factual understanding and commonsense comprehension, respectively.
Despite their success, these datasets can hardly help chatbots directly. Following the traditional dialogue response selection setting, we substantially modify English listening comprehension conversations to form an utterance prediction task.

Collection
The original listening comprehension materials and question-answer pairs are designed by linguist experts. Students are required to choose the best answer from three options for a question based on a piece of audio. To ensure students fully understand the audio, most of the questions need to be answered with reasoning capability.
We crawled the listening exams from public websites. Since the audio is either a conversation between two people or a simple passage, we only crawled data in the conversation format. The raw data is formatted as triples <Conversation (audio), Question and Choices (text), Answer (image)>. The following data pre-processing methods are applied to convert the raw data into the form shown in Figure 2.
Step 1 Pre-processing: If question and candidate choices in two problems are the same, we consider them as duplicates and delete one of them. If there are more than three candidate options in one problem, we randomly drop incorrect options until three candidates are left.
The answers are stored as images. We apply a commercial OCR system to convert the images to text. The printed letter answers are easy for the OCR system to recognize. We manually correct all OCR outputs to ensure quality. In the original listening comprehension test, the conversation is stored as audio. We adopt a commercial ASR system to convert speech to text, and further recruit experienced annotators to correct the transcription errors. To further ensure the quality of the transcripts, they are double-checked by annotators in the next step.
Step 2 Candidate Response Creation: Figure 2 illustrates the process of modifying the listening comprehension problem. First, an annotator is required to segment the original conversation at a point after the clues needed to answer the question have appeared. Then, they construct a positive response (Response A in Figure 2) and negative responses (Response C and Response D) by consulting the correct choice (Choice A) and the incorrect choices (Choice B and Choice C), respectively. To make MuTual more challenging, we further ask the annotator to construct one more negative response (Response B) based on the correct choice. Through these steps, MuTual not only keeps the reasoning test designed by experts, but also introduces one additional type of reasoning for each instance. As shown in Figure 2, Responses C and D can be excluded based on the relationship between the two speakers, whereas B is incorrect due to attitude reasoning.
It is worth noting that all negative responses are logically correct if the context is not considered, but they are not appropriate responses once the context is taken into account. Therefore, our dataset focuses on multi-turn conversation reasoning rather than the logic of a single sentence. When framing a negative response, we encourage annotators to copy some phrases from the context, so that a model cannot solve the problem by text matching alone. We further calculate the lexical overlap between response and context: 9.98% (10.63%) of the words in the positive (negative) responses occur in the corresponding context, suggesting that MuTual is hard to solve by plain text matching.
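The lexical overlap statistic can be illustrated with a minimal sketch. The tokenizer, the choice to count every response token (rather than unique words), and the function names `tokenize` and `overlap_ratio` are assumptions made for illustration; the paper does not specify how the figures were computed.

```python
import re

def tokenize(text):
    """Lowercase and split into word tokens (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def overlap_ratio(response, context):
    """Fraction of response tokens that also occur in the context."""
    response_tokens = tokenize(response)
    if not response_tokens:
        return 0.0
    context_vocab = set(tokenize(context))
    hits = sum(1 for tok in response_tokens if tok in context_vocab)
    return hits / len(response_tokens)

# Illustrative pair: one of eight response tokens ("phone") occurs in the context.
context = "Ma'am, you forgot your phone."
response = "Oh, thanks, I couldn't live without my phone."
print(f"{overlap_ratio(response, context):.2%}")
```

Averaging this ratio separately over positive and negative responses would give a pair of figures comparable in kind to the 9.98% and 10.63% reported above.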
Annotators in Step 2 are all English-major graduate students in China, who are familiar with English language exams in China and fluent in English (they have passed the TEM-8, the highest-level test for English majors as a foreign language in China). Annotators are required to repeatedly draft-annotate 170 instances until their labeling is sufficiently accurate to provide useful annotation. Because not all conversations are suitable for constructing a reasoning-based response problem, the annotator has the right to skip a conversation. We employ five annotators to construct the responses, and two quality inspectors to check them. We discard an instance when the inspectors doubt the uniqueness or correctness of the answer.

Analysis
The detailed statistics of MuTual are summarized in Table 2. MuTual has an average of 4.73 turns. The vocabulary size is 11,343, which is smaller than that of other dialogue datasets (Lowe et al., 2015; Wu et al., 2017). Because MuTual is modified from listening tests of English as a foreign language, its morphology and grammar are much simpler than those of other datasets. For human-annotated datasets, there is always a trade-off between the number of instances being annotated and the quality of annotations (Kryściński et al., 2019). Our dataset is smaller than previous crawling-based dialogue datasets (Lowe et al., 2015; Wu et al., 2017) due to the collection method, but it is comparable with high-quality reasoning-based datasets (Clark et al., 2018; Khashabi et al., 2018; Talmor et al., 2019) and human-designed dialogue datasets (Zhang et al., 2018a). Moreover, around 10k instances are sufficient to train a discriminative model (Nivre et al., 2019) or to fine-tune a pre-trained model.
To assess the distribution of different reasoning types, we annotate the specific types of reasoning involved in instances sampled from the test set and categorize them into six groups. The definition and ratio of each group are as follows.
Attitude Reasoning: This type of instance tests whether a model knows the speaker's attitude towards an object.
Algebraic Reasoning: This type of instance tests whether a model is equipped with algebraic abilities when it chooses a response.
Intention Prediction: This type of instance tests whether a model can predict what the speaker is going to do next.
Situational Reasoning: Situation information (e.g., location, relationship between two speakers) is considered in this type of instance. A model should mine the implicit information from the previous context.
Multi-fact Reasoning: In this type of instance, the correct response is related to multiple facts in the context, which requires the model to deeply understand the context rather than simply perform text matching.
Others: 9% of instances require other commonsense knowledge. For example, at the bottom of Figure 3, the model should know that a fully reserved restaurant is usually very popular.
The six types of reasoning are considered the most relevant to real chatbots. For example, knowing the user's attitude enables a chatbot to make personal recommendations, and the ability to predict intentions allows a chatbot to respond more intelligently in a long conversation session.

MuTual plus
To further increase the difficulty, we use a safe response to replace one of the candidate responses for each instance in MuTual. To guarantee diversity, the safe response is sampled from a list including "I'm afraid I didn't quite catch what you were saying.", "Could you repeat that?", "I'm really sorry, I didn't catch that.", etc. In particular, once the instance is chosen, we randomly select a response to replace. If the positive response is replaced, the correct one is the safe response. If a negative response is replaced, the original positive response is still the best one.
The motivation to build MuTual plus is to evaluate whether a model is able to select a safe response when the other candidates are inappropriate. When we replace the positive response with a safe response, it simulates a scenario in which all the other candidates are incorrect. The phenomenon is common in retrieval-based chatbots, because limited candidate responses cannot handle all cases in practice. Similarly, we can evaluate if the model can choose the correct response instead of a safe response when a correct response exists.
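The replacement procedure described above can be sketched as follows. This is an illustrative reading of the text, not the authors' script: the function name `make_plus_instance`, the uniform choice of slot and the three-item safe list (the paper's list is longer, ending in "etc.") are assumptions.

```python
import random

# Only the three safe responses quoted in the text; the full list is not given.
SAFE_RESPONSES = [
    "I'm afraid I didn't quite catch what you were saying.",
    "Could you repeat that?",
    "I'm really sorry, I didn't catch that.",
]

def make_plus_instance(candidates, answer_index, rng):
    """Replace one randomly chosen candidate with a sampled safe response.

    Returns (new_candidates, new_answer_index, positive_replaced).
    """
    slot = rng.randrange(len(candidates))
    new_candidates = list(candidates)
    new_candidates[slot] = rng.choice(SAFE_RESPONSES)
    # If the positive response was replaced, the safe response now sitting in
    # that slot is the answer; otherwise the original positive is untouched.
    # Either way the index of the correct candidate does not change.
    return new_candidates, answer_index, slot == answer_index

rng = random.Random(0)
original = ["positive", "negative 1", "negative 2", "negative 3"]
plus_candidates, plus_answer, positive_replaced = make_plus_instance(original, 0, rng)
print(plus_candidates, plus_answer, positive_replaced)
```

Because the safe response takes over the slot it replaces, the label index stays the same in both cases; only the text of the correct candidate differs.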

Experiments
We split the data into training, development and test sets with an 80%, 10% and 10% ratio. We keep instances constructed from the same conversation in the same split to avoid data leakage. Following the standard dialogue setting (Lowe et al., 2015; Wu et al., 2017), we consider our task as a response selection task and employ traditional information retrieval evaluation methods, including recall at position 1 in 4 candidates (R@1), recall at position 2 in 4 candidates (R@2) and Mean Reciprocal Rank (MRR) (Voorhees, 2000). We compare the performance of several response selection models as well as pre-training models, which we briefly introduce below.
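For concreteness, the three metrics named above can be computed from per-candidate scores as in the sketch below. The function names and the toy scores are illustrative, and resolving ties in favour of the positive response is an assumption, not something the paper states.

```python
def rank_of_positive(scores, positive_index):
    """1-based rank of the positive candidate, sorting scores in descending order."""
    positive_score = scores[positive_index]
    # Count candidates scored strictly higher; ties favour the positive (assumption).
    return 1 + sum(1 for s in scores if s > positive_score)

def evaluate(all_scores, positive_indices):
    """Return R@1, R@2 and MRR over a list of four-candidate instances."""
    ranks = [rank_of_positive(s, p) for s, p in zip(all_scores, positive_indices)]
    n = len(ranks)
    return {
        "R@1": sum(r <= 1 for r in ranks) / n,
        "R@2": sum(r <= 2 for r in ranks) / n,
        "MRR": sum(1.0 / r for r in ranks) / n,
    }

# Toy scores for three instances (made up for illustration).
scores = [
    [0.9, 0.1, 0.3, 0.2],  # positive at index 0 is ranked 1st
    [0.2, 0.8, 0.5, 0.1],  # positive at index 2 is ranked 2nd
    [0.4, 0.3, 0.2, 0.1],  # positive at index 3 is ranked 4th
]
positives = [0, 2, 3]
metrics = evaluate(scores, positives)
print(metrics)
```

With four candidates, a random ranker has an expected R@1 of 0.25 and R@2 of 0.5, which is the reference point for the TF-IDF comparison in the results.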

Baselines
We evaluate individual scoring methods, multi-choice methods and human performance in our experiment. Given a context c and four candidates (r1, r2, r3, r4), the individual scoring method computes a score g(c, ri) for each choice independently and selects the candidate with the highest score. In contrast, the multi-choice method selects the best one by classification over all choices, formulated as h(c, r1, r2, r3, r4).
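The difference between the two families is mainly one of interface, which the following sketch tries to make explicit. The scorers `toy_g` and `toy_h` are hypothetical word-overlap stand-ins for trained models and are not part of the paper.

```python
def select_individual(context, candidates, g):
    """Individual scoring: score each (context, candidate) pair alone, then take the argmax."""
    scores = [g(context, r) for r in candidates]
    return max(range(len(scores)), key=scores.__getitem__)

def select_multi_choice(context, candidates, h):
    """Multi-choice: a single call sees all candidates and returns one score per candidate."""
    probs = h(context, *candidates)
    return max(range(len(probs)), key=probs.__getitem__)

def toy_g(context, response):
    """Hypothetical scorer: number of distinct words shared with the context."""
    return len(set(context.split()) & set(response.split()))

def toy_h(context, r1, r2, r3, r4):
    """Hypothetical joint scorer: overlap counts normalised across the four candidates."""
    raw = [toy_g(context, r) for r in (r1, r2, r3, r4)]
    total = sum(raw) or 1
    return [x / total for x in raw]

context = "your restaurant was perfect"
candidates = [
    "glad you liked our restaurant",
    "sorry about that",
    "the bus is late",
    "see you in class",
]
print(select_individual(context, candidates, toy_g))
print(select_multi_choice(context, candidates, toy_h))
```

In a trained multi-choice model the joint function can let candidates influence one another's scores, which independent scoring cannot do.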
TF-IDF: The correct response tends to share more words with the context than the incorrect ones. Following Lowe et al. (2015), we calculate the TF-IDF vectors for the context and each of the candidate responses, and then select the candidate response with the highest cosine similarity to the context as the model output. The "IDF" is calculated only on the training set.
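A minimal, dependency-free sketch of such a TF-IDF selector is given below. The tokenizer, the smoothed IDF formula and the toy texts are assumptions made for illustration; the exact weighting used by Lowe et al. (2015) may differ.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase word tokenizer (a simplifying assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())

def fit_idf(training_texts):
    """Smoothed IDF estimated from the training texts only."""
    n_docs = len(training_texts)
    doc_freq = Counter()
    for text in training_texts:
        doc_freq.update(set(tokenize(text)))
    return {w: math.log((1 + n_docs) / (1 + df)) + 1.0 for w, df in doc_freq.items()}

def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; words unseen in training are ignored."""
    counts = Counter(tokenize(text))
    return {w: c * idf[w] for w, c in counts.items() if w in idf}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def tfidf_select(context, candidates, idf):
    """Return the index of the candidate most similar to the context, plus all similarities."""
    ctx_vec = tfidf_vector(context, idf)
    sims = [cosine(ctx_vec, tfidf_vector(r, idf)) for r in candidates]
    return max(range(len(sims)), key=sims.__getitem__), sims

# Toy training texts and instance (made up for illustration).
idf = fit_idf(["where is my phone", "the dinner was perfect", "see you at the restaurant"])
best, sims = tfidf_select(
    "you forgot your phone",
    ["thanks for the phone", "the dinner was perfect", "at the restaurant", "where is dinner"],
    idf,
)
print(best, sims)
```

Since annotators were encouraged to reuse context phrases in negative responses, a selector of this kind has little signal to exploit on MuTual.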
Dual LSTM (Lowe et al., 2015): Two LSTMs are used to encode context and response, respectively. The relevance between context and response is calculated by the similarity of the final hidden state from both LSTMs.
Sequential Matching Network (Wu et al., 2017): To avoid losing information in the context, SMN constructs a word-word and a sequence-sequence similarity matrix, instead of utilizing the last hidden state only, and then aggregates the similarity matrices into a matching score.
Deep Attention Matching Network: DAM adopts a self-attention module (Vaswani et al., 2017) to encode the response and each utterance, respectively. To match utterance and response, DAM further applies a cross-attention module and 3D matching to obtain the final score.
BERT (Devlin et al., 2019): Pre-training models have shown promising results on various multi-choice and reasoning tasks (Whang et al., 2019). Following Devlin et al. (2019), we concatenate the context (sentence A) and a candidate response (sentence B) as BERT input. On top of BERT, a fully-connected layer transforms the [CLS] token representation into the matching score.
RoBERTa: RoBERTa re-establishes BERT's masked language model training objective by using more data and different hyper-parameters. We fine-tune RoBERTa in the same way as BERT.
GPT-2 (Radford et al., 2019): Given a context, the positive response is expected to have a higher probability than the negative responses. Motivated by this, we concatenate context and response as a sequence, and calculate the joint probability of the entire sequence. The response in the sequence with the lowest perplexity is considered the positive response. Moreover, we fine-tune GPT-2 on [Context, Positive Response] pairs in the MuTual training set, denoted as GPT-2-FT.
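The selection rule can be sketched independently of any particular language model, given per-token log-probabilities for each concatenated sequence. The log-probabilities below are made up, and the per-token normalisation in `perplexity` is an assumption; the paper does not state how sequence length is handled.

```python
import math

def perplexity(token_log_probs):
    """Perplexity as exp of the negative mean natural-log probability per token."""
    return math.exp(-sum(token_log_probs) / len(token_log_probs))

def select_by_perplexity(sequences_log_probs):
    """Return the index of the lowest-perplexity sequence, plus all perplexities."""
    ppl = [perplexity(lp) for lp in sequences_log_probs]
    return min(range(len(ppl)), key=ppl.__getitem__), ppl

# Hypothetical token log-probabilities for four [context; response] sequences.
log_probs = [
    [math.log(0.5)] * 4,    # every token has probability 0.5
    [math.log(0.25)] * 4,   # every token has probability 0.25
    [math.log(0.1)] * 6,
    [math.log(0.2)] * 5,
]
best, ppl = select_by_perplexity(log_probs)
print(best, ppl)
```

In practice the log-probabilities would come from the language model's output distribution over each concatenated sequence.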
Multi-choice Method: Inspired by BERT for multiple choice (Devlin et al., 2019), we treat the task as picking the most suitable response by comparing the four candidate responses. In particular, we concatenate each candidate response with the corresponding context. Each input sequence is subsequently encoded to produce a [CLS] representation. The positive response is predicted based on the concatenation of all [CLS] representations, on which a fully connected layer with softmax is used. The method is denoted as BERT-MC. Similarly, we implement RoBERTa-MC as another multi-choice method.
Human Performance: To obtain the human performance, we employ 3 NLP experts to measure the ceiling performance on the test set.

Experiment Results
We report the performance of approaches introduced in 4.1, and human performance. Implementation details are shown in Appendix B.

Results on MuTual
All models perform significantly worse than on other popular conversation datasets, such as the Ubuntu Corpus (Lowe et al., 2015) and the Dialogue NLI dataset (Welleck et al., 2019), while humans can address the reasoning problems easily. For example, BERT gives 85.8% R10@1 on the Ubuntu Corpus, but RoBERTa only gives 71.3% R4@1 on MuTual.
TF-IDF is only slightly better than random guessing, which indicates that there is no obvious statistical clue between context and positive response. In contrast, TF-IDF achieves 54.98% R@1 on the Ubuntu Corpus, showing that it is harder to reach the correct answer in our dataset by text overlap. We evaluate the performance of typical retrieval-based dialogue models on MuTual. From Table 3, we can see that well-designed matching models do not give better performance than a simple dual LSTM. Moreover, they drop by more than 50 absolute R@1 points compared to their performance on the Ubuntu Corpus, indicating that text matching models cannot handle reasoning problems well.
Table 3: R@1, R@2 and MRR on the Dev and Test sets, by baseline category and baseline method. (Table body lost in extraction; surviving fragment: (Wu et al., 2017) 0.264 0.524 0.578 0.265 0.516 0.627.)
Both BERT and RoBERTa outperform other models on MuTual, which is consistent with results in the literature (Talmor et al., 2019). This is mainly because models learn reasoning capability during pre-training on a large corpus. Although RoBERTa only gets 71.3% on R@1, it achieves a surprising 89.2% on R@2, indicating that the model is able to rank the correct response within the top-2 positions. BERT-MC and RoBERTa-MC obtain results similar to BERT and RoBERTa, respectively. However, even RoBERTa is 23 points behind human performance on R@1, indicating that MuTual is indeed a challenging dataset, which opens the door for tackling new and complex reasoning problems in multi-turn conversations.
GPT-2 and GPT-2-FT also perform poorly on MuTual, even though the average perplexity on the MuTual test set is 10.40. This phenomenon illustrates that 1) sentences in MuTual are fluent; and 2) current generative models still have plenty of room to improve their reasoning ability.

Results on MuTual plus
As shown in Table 4, all models perform worse on MuTual plus, indicating that the dataset is more difficult than MuTual, which is consistent with our assumption. We find that the performance of multi-choice methods is significantly better than that of individual scoring methods. One possible explanation is that multi-choice methods consider the candidates together, so they can distinguish whether or not the safe response is the best one. In contrast, individual scoring methods are not robust, and safe responses easily confuse them during training. Moreover, RoBERTa-MC outperforms the others by a large margin, showing its outstanding performance on reasoning problems.
Furthermore, we conduct a transfer experiment, in which models are trained on MuTual but tested on MuTual plus without fine-tuning. The experiment investigates whether models handle safe responses well if they have never seen them in the training corpus. As shown in Table 4, RoBERTa-MC and RoBERTa drop by 24.1% and 6.8%, respectively, in the transfer setting, demonstrating the benefit of seeing safe responses during training. Moreover, the individual scoring RoBERTa outperforms RoBERTa-MC, showing that the individual scoring method is more robust when safe responses are not seen during training.

Discussion
Performance across different reasoning types: To analyze model performance across different reasoning types, we calculate BERT-MC and RoBERTa-MC performance on the various question types introduced in Section 3.2. As shown in Figure 4, the trends of BERT-MC and RoBERTa-MC are similar across different categories. RoBERTa-MC significantly outperforms BERT-MC in attitude reasoning and multi-fact reasoning. One potential reason is that RoBERTa-MC captures some regular patterns between action and attitude, such as "play football" and "excited". However, instances that involve algebraic and situational reasoning show poor performance. These two reasoning types heavily depend on commonsense reasoning. Taking Figure 5 as an example, it takes a simple subtraction step to derive the time difference (5:00 pm - 6 h = 11:00 am), but this turns out to be a significant challenge for RoBERTa-MC. In the second case, RoBERTa-MC fails to infer the dialogue situation, where the goal is to find a flat to rent.
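The subtraction in the first case can be checked mechanically; the short sketch below only verifies the arithmetic of the example and uses an arbitrary calendar date as a placeholder.

```python
from datetime import datetime, timedelta

# 5:00 pm local time; the date itself is arbitrary and only needed by datetime.
local_time = datetime(2020, 1, 1, 17, 0)
# New York is stated to be 6 hours behind.
new_york_time = local_time - timedelta(hours=6)
print(new_york_time.strftime("%H:%M"))
```

The step is trivial for a program, which underlines that the difficulty for the model lies in recognising that a subtraction is needed at all.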
Figure 5 (text recovered from the figure):
Case 1:
F: Do you know what time it is right now in New York?
M: Let me see. It's 5:00 pm now, in New York is 6 hours behind.
F: Let me see, 7 hours behind. It is 11:00 am now in New York.
F: 5 hours ahead. It is 11:00 pm now in New York.
✘ F: Is it 5:00 pm as well?
✓ F: It is 11:00 am now in New York.
Case 2:
F: Good morning. What can I do for you?
M: I am looking for a flat for 2 people near the university.
F: Well. There are several places available and the rent ranges from 80 to $150 a month. What are your requirements?
M: I think of flat for no more than $100 a month is good. I prefer to live in a quiet street and I need at least 2 bedrooms.
✘ F: If you have any questions about enrollment, do not hesitate to ask me.
✓ F: How about this flat? If you are satisfied, we can sign the contract tomorrow.
F: We have 2 floors in our supermarket.
F: You want only 1 bedroom, so we have three flats that meet your requirement.
Performance across different context lengths: It is interesting that the performance of RoBERTa does not decrease significantly as the number of turns increases, which differs from the phenomenon observed on other datasets. As shown in Table 5, the performance drops by only 1.9 points R@1 from 2 turns to long contexts (>6 turns), and the performance on 5 turns is higher than that on 4 turns, indicating that the reasoning problems do not become much harder when the context becomes longer. The results also show that the difficulty of MuTual is attributable to reasoning rather than to complex conversation history.
Context ablation study: We further verify whether our dataset requires multi-turn understanding rather than degenerating into a single-turn reasoning problem. We evaluate RoBERTa and RoBERTa-MC performance when some utterances are manually removed. Figure 6 shows the performance when the earliest n utterances are removed in testing. As the number of ablated utterances increases, the performance of RoBERTa and RoBERTa-MC decreases significantly, which conforms to intuition. RoBERTa and RoBERTa-MC achieve only 43.7% and 47.7%, respectively, after ablating all utterances in the context, indicating the importance of each utterance and the quality of the dataset. Moreover, if we shuffle the order of the utterances, the performance of RoBERTa-MC drops by only 3.8%, showing that it is insensitive to utterance order.
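The two context perturbations used in the ablation study (removing the earliest n utterances, and shuffling utterance order) can be sketched as below. The function names and the list-of-strings representation of a context are assumptions made for illustration.

```python
import random

def ablate_earliest(utterances, n):
    """Drop the earliest n utterances; n >= len(utterances) leaves an empty context."""
    return utterances[n:]

def shuffle_utterances(utterances, rng):
    """Return a shuffled copy of the context, leaving the original list untouched."""
    shuffled = list(utterances)
    rng.shuffle(shuffled)
    return shuffled

# Context of the flat-rental example from Figure 5.
context = [
    "F: Good morning. What can I do for you?",
    "M: I am looking for a flat for 2 people near the university.",
    "F: Well. There are several places available and the rent ranges from 80 to $150 a month. What are your requirements?",
    "M: I think of flat for no more than $100 a month is good. I prefer to live in a quiet street and I need at least 2 bedrooms.",
]
print(ablate_earliest(context, 2))
print(shuffle_utterances(context, random.Random(0)))
```

Evaluating a trained model on the perturbed contexts, with the candidate responses unchanged, would give curves of the kind the ablation study reports.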

Conclusion
We introduced MuTual, a high-quality manually annotated multi-turn dialogue reasoning dataset, which contains 8,860 dialogues and aims to test the reasoning ability of dialogue models. We describe the process for generating MuTual and perform a detailed analysis. We find that various state-of-the-art models show poor performance on MuTual. The best model, RoBERTa, only obtains 71.3% R@1, leaving a large gap between model performance and human performance. We hope that this dataset facilitates future research on multi-turn conversation reasoning problems.