MovieChats: Chat like Humans in a Closed Domain

Being able to perform in-depth chat with humans in a closed domain is a precondition before an open-domain chatbot can ever be claimed. In this work, we take a close look at the movie domain and present a large-scale, high-quality corpus with fine-grained annotations in the hope of pushing the limits of movie-domain chatbots. We propose a unified, readily scalable neural approach which reconciles all subtasks such as intent prediction and knowledge retrieval. The model is first pretrained on a large general-domain corpus, then finetuned on our corpus. We show that this simple neural approach, trained on high-quality data, is able to outperform commercial systems relying on complex rules. In both static and interactive tests, we find that responses generated by our system exhibit engagement and sensibleness remarkably close to human-written responses. We further analyze the limits of our work and point out potential directions for future research.


Introduction
Being able to converse like humans in a closed domain is a precondition for an intelligent open-domain chatbot, which further requires transiting among various domains (Su et al., 2020). Nonetheless, even when constrained to a specific domain, current chatbots are still far from satisfactory. Unlike task-oriented systems, which can be relatively well resolved with handcrafted templates, human conversations feature a complex mixture of QA, chitchat, recommendation, etc., without pre-specified goals or conversational patterns (Dodge et al., 2016; Akasaki and Kaji, 2017). Selecting the proper domain knowledge to support response generation in all these different situations is challenging (Milward and Beveridge, 2003; Shen et al., 2019). In this work, we direct our focus to the movie domain and present a large-scale, crowd-sourced Chinese dataset with fine-grained annotations in the hope of boosting the study towards a human-like closed-domain chatbot.
A variety of dialogue datasets grounded in domain knowledge have already been proposed. However, they are collected either (1) by crawling online forums (Dodge et al., 2016; Ghazvininejad et al., 2018; Zhou et al., 2018a), which yields noisy, multi-party data mostly containing single-exchange QA, or (2) by crowd-sourcing (Zhu et al., 2017; Zhou et al., 2018b; Moon et al., 2019), which is small-scale and often conducted in an over-constrained setting like teacher-student (Moghe et al., 2018). Even for datasets crowd-sourced in unconstrained scenarios, suggestive domain knowledge is shown to humans before each utterance is written. This inevitably prompts humans to use the knowledge deliberately, yielding unnatural conversations that simply string the knowledge together (Dinan et al., 2019; Zhou et al., 2020). We show examples from other datasets in Appendix Table 10. In comparison, our dataset has the following advantages:

1. Natural: Crowd-workers chat in a free environment without further constraints or prompts, mimicking daily human conversations to the largest extent.
2. Large-scale: It covers 270k human dialogues with over 3M utterances, which is at least one order of magnitude larger than all the other crowd-sourced datasets.
3. Annotated: Utterances are labeled with entity information and dialogue acts classified into 15 fine-grained aspects, based on which they are linked to different types of knowledge.
Different from previous crowd-sourced work, our annotation process is conducted a posteriori so that it does not interfere with the human conversations, e.g., by prompting workers to overuse suggested knowledge.
Built upon our dataset, we propose a simple unified language model approach to push the limits of movie-domain chatbots. The model is first pretrained on 2.2B words collected from various general-domain conversational resources, then finetuned on the movie dataset with additional knowledge and dialogue acts incorporated. We cast all components, such as intent prediction and knowledge retrieval, as a sequence prediction task and solve them with a unified language model architecture. This avoids designing complex systems for individual components separately, and all subtasks can easily be trained simultaneously (Hosseini-Asl et al., 2020; Peng et al., 2020). We show that our simple unified approach outperforms strong baselines for each separate subtask. Knowledge retrieval, dialogue act prediction and general-domain pretraining benefit from each other and together improve generation quality. In the online interactive test, our best model chats with humans for an average of 11.4 turns without being detected as a machine, outperforming even the commercial chatbots Mitsuku and Microsoft XiaoIce, which rely on complex rules. By analyzing the limitations of our model, we find that it especially struggles with in-depth discussions over long turns. Future research can consider employing a larger knowledge base or explicit state tracking.
In summary, our main contributions are (1) presenting a high-quality, large-scale Chinese conversational corpus with fine-grained annotations in the movie domain to benefit future study, (2) showing that a simple unified neural model trained on this high-quality dataset can approach human performance and even outperform commercial systems relying on complex rules, and (3) studying the shortcomings of current techniques, providing suggestive directions for future research.

Dataset Construction
The dataset construction consists of (1) crowd-sourcing the dialogues, (2) annotating dialogue acts and entities, and (3) linking utterances to grounded knowledge. We explain these three steps in order and present the dataset statistics at the end.
Dialogue Crowd-sourcing We obtain the dialogue dataset through a two-phase Wizard-of-Oz-style collection (Kelley, 1984; Dahlbäck et al., 1993). In the first phase, we run small-scale pilot studies and examine the quality of the collected conversations. Based on this examination, we create tutorials and qualification tests, which are used to train and qualify crowd-workers for the second phase. During the second phase, we continuously monitor the collected dialogues and perform periodic quality checks on samples from every individual worker pair. If more than 5% of a pair's dialogues are considered invalid, their collections are removed. Before a conversation starts, two workers are paired and a movie agreed on by both is chosen. We require at least one of them to have watched the movie to make sure the conversation is contentful. The annotators are specifically instructed to (1) behave naturally as in daily life, (2) avoid dirty words, and (3) talk differently in each conversation. Conversations are removed as duplicates if more than 70% of their contents overlap. To encourage movie diversity, we further set an upper limit forbidding any single movie from being talked about more than 100 times.
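The 70% overlap check above can be sketched as follows. The overlap measure here, a character-level similarity ratio, is an assumption for illustration; the paper does not specify how overlap is computed.

```python
from difflib import SequenceMatcher

def is_duplicate(conv_a, conv_b, threshold=0.7):
    """Flag two conversations (lists of utterances) as duplicates when
    more than `threshold` of their content overlaps, measured here by a
    character-level similarity ratio (an illustrative choice)."""
    text_a = " ".join(conv_a)
    text_b = " ".join(conv_b)
    return SequenceMatcher(None, text_a, text_b).ratio() > threshold

def deduplicate(conversations, threshold=0.7):
    """Keep the first occurrence of each near-duplicate group."""
    kept = []
    for conv in conversations:
        if not any(is_duplicate(conv, k, threshold) for k in kept):
            kept.append(conv)
    return kept
```

The quadratic pairwise comparison is acceptable at this sketch's scale; a production pipeline over 270k dialogues would likely bucket conversations (e.g., by movie) first.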
The whole collection process lasts two months. In the end, 245 participants are involved, with 66,424 movies talked about in total.
Dialogue Act and Entity Annotation Following prior work, we base our annotation schema on the ISO 24617-2 standard (Bunt et al., 2010, 2012). Table 1 shows our annotation schema, counts, descriptions, and brief examples. The dialogue acts (DAs) are organized in a hierarchical structure. The first layer distinguishes three concepts: objective facts, recommendations, and subjective feelings. Each concept can be either requested or informed during the conversation. We further define an "Other" class to cover actions that do not belong to any of the three concepts, such as general non-contentful greetings or echoes. The second layer includes 15 finer-grained aspects covering the most popular topics being discussed. Every first-layer DA (except Other) is further grouped into one of these 15 aspects, e.g., the detailed DA of the first example in Table 1 is request fact director. If one utterance contains multiple dialogue acts, we order them by their order of appearance in the utterance. For named entity recognition, we label 5 kinds of entities: movie name, director, actor, type and role (the first 5 aspects).
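The two-layer labels and the appearance-order rule above can be sketched as below; the underscore-joined label format and the offset-based ordering are illustrative assumptions, not the paper's exact representation.

```python
CONCEPTS = {"fact", "recommendation", "feeling"}

def detailed_da(action, concept, aspect):
    """Combine a first-layer DA (request/inform + concept) with a
    second-layer aspect into a detailed label, e.g.
    'request_fact_director'. 'Other' carries no aspect."""
    if concept == "other":
        return "other"
    assert action in {"request", "inform"} and concept in CONCEPTS
    return f"{action}_{concept}_{aspect}"

def order_das(candidate_das):
    """When an utterance holds multiple DAs, order them by the start
    offset of their supporting span in the utterance (the paper orders
    DAs by order of appearance). `candidate_das` maps label -> offset."""
    return [da for da, _ in sorted(candidate_das.items(), key=lambda kv: kv[1])]
```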
To speed up the annotation process, we first define a set of handcrafted regular expressions, covering the most frequent patterns of each class, to train a DA and NER classifier. The annotators are instructed to post-correct the auto-labeled dialogues instead of labeling everything from scratch. The classifiers are trained with online learning (Sahoo et al., 2018) to keep improving their accuracy and, in consequence, to lower the frequency of post-correction. As we observe, this semi-automated approach significantly speeds up the labeling process: the whole dataset is labeled within three weeks by 188 annotators.
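The regex-based pre-labeling stage can be sketched as follows. The patterns and label names below are English stand-ins invented for illustration (the original patterns are Chinese); annotators would then post-correct the output.

```python
import re

# Illustrative patterns only; the real set covers the most frequent
# phrasings of each DA class.
DA_PATTERNS = [
    (re.compile(r"who (directed|is the director)"), "request_fact_director"),
    (re.compile(r"would you recommend|anything to recommend"), "request_recommendation_movie"),
    (re.compile(r"i (loved|hated|enjoyed)"), "inform_feeling_overall"),
]

def pre_label(utterance):
    """Return regex-matched DA labels for an utterance, ordered by
    where their evidence appears; humans post-correct these labels."""
    hits = []
    for pattern, label in DA_PATTERNS:
        match = pattern.search(utterance.lower())
        if match:
            hits.append((match.start(), label))
    return [label for _, label in sorted(hits)]
```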
Knowledge Linkage We extract factual knowledge from the structured tables of Douban Movie, a popular Chinese movie platform. The knowledge is organized as key-value pairs, where the keys correspond to the 15 aspects we define. Some aspects, like lines or music, are not directly available from the structured tables; we extract this missing information from other sources and merge it into our knowledge base. Utterances labeled as inform/request fact are linked to the key-value pairs of the same aspect. Apart from objective knowledge, we also crawl movie comments from Douban Movie to support the generation of responses expressing subjective feelings. These comments are a good supplement, providing knowledge that can hardly be organized in structured form (Moghe et al., 2018). For utterances labeled as inform/request feeling, we compare them with Douban comments on the same movie and compute a similarity score based on a weighted average of edit distance, Jaccard distance, tf-idf, sentence-vector cosine similarity, and common words and entities. Each utterance is linked to its most similar comment with a threshold cutoff. In the end, 51.7% of the utterances about feelings have grounded comments. Utterances about recommendations are simply grounded to the mentioned movie entities, and no grounded knowledge is linked for utterances labeled as Other. An example of our annotation is presented in Table 1.
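The comment-linkage score can be sketched with a subset of the listed features. This toy version mixes only an edit-distance-based ratio and Jaccard overlap; the weights, the threshold, and the omission of tf-idf, sentence-vector cosine, and entity overlap are all simplifying assumptions.

```python
from difflib import SequenceMatcher

def jaccard(a_tokens, b_tokens):
    """Jaccard overlap between two token sets."""
    a, b = set(a_tokens), set(b_tokens)
    return len(a & b) / len(a | b) if a | b else 0.0

def similarity(utterance, comment, w_edit=0.5, w_jaccard=0.5):
    """Toy linkage score: weighted average of an edit-distance ratio
    and Jaccard overlap. Weights are placeholders."""
    edit_sim = SequenceMatcher(None, utterance, comment).ratio()
    return w_edit * edit_sim + w_jaccard * jaccard(utterance.split(), comment.split())

def link_comment(utterance, comments, threshold=0.3):
    """Link the utterance to its most similar comment, or to None when
    no comment passes the cutoff (the paper's threshold is unspecified)."""
    best = max(comments, key=lambda c: similarity(utterance, c), default=None)
    if best is not None and similarity(utterance, best) >= threshold:
        return best
    return None
```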

Model Architecture
Language models have demonstrated impressive performance as universal learners across NLP tasks (Shen et al., 2017; Peters et al., 2018; Radford et al., 2019; Brown et al., 2020). Inspired by this, our dialogue generation model is implemented as a Transformer-based language model like GPT-2 (Radford et al., 2019). It contains a pipelined process of movie tracking, intent prediction, knowledge retrieval and text generation. (For the recommendation DA, we only consider recommending movies; recommending other aspects would require assembling recommendation systems for different domains, which is beyond the scope of this paper.) Unlike traditional task-oriented systems, where subtasks are decomposed and handled separately, we opt for a simple, unified approach that casts all subtasks into sequence prediction. A special token is injected at the beginning to indicate which subtask to perform (Hosseini-Asl et al., 2020; Peng et al., 2020). Table 3 shows the schema representation for the different components. The condition and the target are concatenated into a single sequence and then fed into the language model for training. For example, the task of predicting the intent given the dialogue context is transformed into "[context] dialogue context [intent] DA sequence", where the DA sequence is predicted conditioned on the preceding context.
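The serialization into a single sequence can be sketched as below. Only the [context], [intent] and [response] tokens are quoted in the text; the [knowledge] token and the exact spacing are assumptions about the Table 3 schema.

```python
# Special tokens marking which subtask the language model performs.
# [context]/[intent]/[response] follow the paper's examples; the
# [knowledge] token is an assumed name for the retrieval subtask.
SPECIAL = {"intent": "[intent]", "knowledge": "[knowledge]",
           "response": "[response]"}

def serialize(task, context, target=None):
    """Concatenate condition and target into one training sequence;
    the special token after the context tells the model which subtask
    to perform. At inference time, `target` is omitted and the model
    continues the sequence."""
    seq = f"[context] {context} {SPECIAL[task]}"
    if target is not None:
        seq += f" {target}"
    return seq
```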

General-domain Pretrain
The model is first pretrained on a mixed general-domain conversational corpus crawled from various sources such as Douban, Tieba, Zhihu and Weibo. The pretraining corpus covers 468M conversations with 2.2B words. Each training instance is processed in the form "[context] dialogue context [response] response", where the response is predicted given "[context] dialogue context [response]". The objective is a mixture of maximum likelihood and unlikelihood training (He and Glass, 2019), which we find helps reduce the repeated and incoherent generations observed in Adiwardana et al. (2020). The unlikelihood training minimizes the likelihood of (1) responses randomly sampled from the corpus and (2) bigrams repeated from previously generated tokens.
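The mixed objective can be illustrated with a scalar toy: maximize the log-probability of gold tokens while minimizing the log-probability of negative tokens (repeated bigrams, randomly sampled responses). The real objective operates over model logits per time step, and the mixing weight `alpha` is an assumption.

```python
import math

def mixed_loss(token_probs, negative_probs, alpha=1.0):
    """Toy mixture of maximum likelihood and unlikelihood training.
    `token_probs`: model probabilities of the gold tokens.
    `negative_probs`: model probabilities of negative tokens, whose
    likelihood is pushed down via -log(1 - p)."""
    mle = -sum(math.log(p) for p in token_probs) / len(token_probs)
    unlikelihood = -sum(math.log(1.0 - p) for p in negative_probs) / max(len(negative_probs), 1)
    return mle + alpha * unlikelihood
```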

Movie Tracker
The movie tracker is like the belief state tracker in task-oriented systems (Henderson et al., 2013). It is used to track which movie will be talked about in the next utterance.

Intent Prediction
The intent prediction is also cast as a sequence prediction task. Compared with the traditional formulation as multi-label classification, sequence prediction is better at handling the coexistence of multiple DAs and at capturing the sequential dependencies within the hierarchy (Raffel et al., 2019; Vedula et al., 2020). For example, to predict the DAs of the 4th utterance in Figure 1, the sequence fed to the language model is "[context] dialogue context [intent] inform, feeling, plot, request, fact, plot". In this way, before predicting a DA, the model conditions on both the dialogue context and its previous DAs, improving accuracy.
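The flat DA sequence the model emits can be grouped back into structured intents. A minimal sketch, assuming every DA is emitted as exactly three comma-separated tokens (action, concept, aspect), as in the example above:

```python
def parse_da_sequence(da_sequence):
    """Group a flat predicted DA token sequence into
    (action, concept, aspect) triples, e.g.
    'inform, feeling, plot, request, fact, plot' ->
    [('inform', 'feeling', 'plot'), ('request', 'fact', 'plot')]."""
    tokens = [t.strip() for t in da_sequence.split(",")]
    assert len(tokens) % 3 == 0, "malformed DA sequence"
    return [tuple(tokens[i:i + 3]) for i in range(0, len(tokens), 3)]
```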
Knowledge Retrieval The knowledge retrieval component is similar to the classical DSSM model (Huang et al., 2013), with the MLP replaced by our language model encoder to obtain the knowledge embeddings. Note that we only select knowledge from the current movie, which can be obtained from the movie tracker. The full sequence of dialogue context, intent and retrieved knowledge is then fed to the language model to generate the response. To keep consistency with the general-domain pretraining, the position embeddings of the decoded response skip the concatenated intent and knowledge and directly follow the dialogue context; we find this beneficial when combined with pretrained models. The objective follows the pretraining setup, mixing maximum likelihood and unlikelihood training.
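The retrieval step, restricted to the tracked movie, can be sketched as below. The character-count "encoder" is a deliberately crude stand-in for the language model encoder, and the cosine ranking is an assumed scoring choice.

```python
import math
from collections import Counter

def embed(text):
    """Stand-in encoder: a lowercased character-count vector. The
    paper uses the language-model encoder here instead."""
    return Counter(text.lower())

def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def retrieve(context, knowledge_by_movie, current_movie, k=1):
    """Rank only the knowledge entries of the movie returned by the
    movie tracker, then return the top-k by embedding similarity."""
    candidates = knowledge_by_movie[current_movie]
    ctx_vec = embed(context)
    ranked = sorted(candidates, key=lambda c: cosine(ctx_vec, embed(c)), reverse=True)
    return ranked[:k]
```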

Experiment Setting
We tokenize the text into Chinese characters and keep all unique non-Chinese tokens appearing more than 5 times. The whole vocabulary contains 13,317 tokens. We train our model on 24 Nvidia V100 GPUs (32GB) with three different model sizes.
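The vocabulary construction can be sketched as follows; treating whitespace-free non-Chinese spans as single tokens is an assumption about how non-Chinese text is segmented.

```python
import re
from collections import Counter

# One Chinese character per token, or a maximal run of non-Chinese,
# non-whitespace characters as one token.
TOKEN_RE = re.compile(r"[\u4e00-\u9fff]|[^\u4e00-\u9fff\s]+")

def build_vocab(corpus, min_freq=5):
    """Keep every Chinese character; keep a non-Chinese token only
    when it appears more than `min_freq` times."""
    counts = Counter()
    for line in corpus:
        counts.update(TOKEN_RE.findall(line))
    return {tok for tok, c in counts.items()
            if (len(tok) == 1 and re.match(r"[\u4e00-\u9fff]", tok)) or c > min_freq}
```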

Results and Analysis
Automatic Evaluation In Table 5, we report perplexity, BLEU scores and distinct uni/bigrams for three model sizes. To investigate the effects of incorporating annotations and pretraining, we start from a basic model trained from scratch on our movie corpus, then add one condition at a time to see its influence. The results show a clear tendency of gradual improvement as more conditions are added. Adding knowledge especially boosts performance, which is understandable considering that movie-domain chats usually contain many movie-specific rare names; without knowledge grounding, the model can hardly predict the correct tokens. Pretraining on general-domain conversations improves both the overlap with the ground truth and the diversity: the distinct uni/bigrams consistently increase, implying the model learns useful patterns from the pretraining corpus to enrich its generations in the movie domain. On the unseen test set, performance generally drops for all models, especially for those without knowledge grounding, as they have to make up facts and comments for movies entirely unseen during training.

Table 7 measures the performance of retrieving fact knowledge, movie comments and recommendations, respectively. We report hit@1 and hit@5 scores (Zhang et al., 2018). We compare our model with a random baseline, bag-of-words (BOW) and BERT (Devlin et al., 2019) (we pass sentences through BERT and derive a fixed-sized vector by averaging the outputs of the second-to-last layer (May et al., 2019)). The BOW and BERT models are finetuned with our knowledge linkage annotations. Our unified model again outperforms all baselines, and adding the DA as a condition helps further. Fact retrieval has the highest hit rate, as facts are well structured and easy to match. Recommendation, on the other hand, is very hard to predict.
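The hit@k scores reported here admit a one-function definition, sketched below for clarity:

```python
def hit_at_k(ranked_lists, gold_items, k):
    """Fraction of queries whose gold item appears in the top-k of
    the model's ranked candidate list."""
    hits = sum(1 for ranked, gold in zip(ranked_lists, gold_items)
               if gold in ranked[:k])
    return hits / len(gold_items)
```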
As an accurate recommendation system is clearly beyond the scope of this paper, it is understandable that our simple approach fails to provide satisfying recommendations.

Human Evaluation Automatically evaluating dialogue systems is known to be extremely hard (Liu et al., 2016). We therefore conduct a set of static and interactive human evaluations, focusing on four perspectives of the machine-generated responses. Apart from the oft-used metrics of (1) Sensibleness (Sens) and (2) Engagement (Enga) for open-domain chatbots, we further evaluate (3) Factuality (Fact) and (4) Informativeness (Info) to see whether models can actively provide informative responses based on movie facts. Details are in Appendix B. As evaluating factuality requires specific movie knowledge, this metric is evaluated only by the same person who produced the dialogue; the other metrics are each evaluated by 3 workers. Table 8 shows the agreement scores, which are reasonable considering that the evaluations are subjective. The results are majority votes of the binary scores.
In the static evaluation, we sample 300 responses for each model from the test set (mixing seen and unseen); the responses can come from any turn of a conversation. We show the results in Figure 2. Our largest model, with 762M parameters, is clearly preferred by human evaluators on almost all metrics and approaches human performance. By training a larger model and increasing the training data, the gap might be closed further.
In the interactive evaluation, humans can chat about any topic within the movie domain. We conduct an online Turing test where one side is always a human participant unaware of whom they are talking to; the other side can be Mitsuku, XiaoIce, our model (762M with pretraining) or a real human. Mitsuku interacts in English, so we hire only native English speakers for that experiment. For XiaoIce, we use its chat service through Weibo; it sometimes generates responses containing keywords like "XiaoIce", which we manually replace to prevent disclosing its identity. We collect 100 conversations for each model. Humans can stop interacting once they (1) find the other side is a machine or (2) reach the maximum of 20 turns. Responses from all models are later passed to a third party for scoring. The results are shown on the right of Figure 2. Our model outperforms Mitsuku and XiaoIce by a large margin. As Mitsuku and XiaoIce are designed as open-domain chatbots, restricting the conversation to the movie domain gives our model some natural advantage. We also notice that Mitsuku and XiaoIce almost never produce fake facts, but at the cost of an extremely low ratio of informative responses: they tend to behave over-safely and only answer when they are completely sure. Our model is closer to humans in that sense: it converses actively at some risk of making factual errors.

Distance from Human Performance In the interactive evaluation, compared with human performance, our model loses a bit on sensibleness and factuality but wins on the other two metrics. To investigate where our model fails, Figure 3 visualizes the change of SEA (Sensibleness-Engagement Average) and FIA (Factuality-Informativeness Average) as the conversation proceeds. A good chatbot should balance these skills well (Adiwardana et al., 2020): SEA reflects how it behaves as a general chatbot, while FIA better tests its capability at incorporating domain knowledge.
We can see a clear decreasing trend for all models. Human performance, however, is quite consistent across turns, implying large room for improvement in how current models handle multi-turn context.
In Figure 4, we further show the "dying distribution" of our model, i.e., on which DA our model fails the Turing test and thereby "dies". Unsurprisingly, the system fails mostly when informing facts or feelings; only a small portion of failures come from non-grounded chitchat (Other). This suggests the most crucial bottleneck lies in interacting with movie-specific knowledge and seamlessly incorporating it into response generation. We show some snippets of interactions with our model in Table 9. The first two are failure cases labeled by humans as not factual or not sensible. The model struggles to reply to overly specific facts, which is understandable since our knowledge base provides only short introductions and cannot cover everything that happens in a movie. The second case shows its shortcoming at handling long-range consistency: it still recommends the current movie when the user asks about "which other movie". Employing larger knowledge bases and explicitly tracking states with a checklist (Kiddon et al., 2016) might alleviate both issues. We also provide examples of controllable generation where the DA and aspect are manually assigned. The model shows decent performance at fitting both the dialogue context and the specified conditions, which can be helpful when finer-grained control is needed.

Conclusion
We present MovieChats: a movie-domain chatbot built upon a large-scale, high-quality conversational corpus with fine-grained annotations. The model can be trained end-to-end with a simple unified language model architecture. We show that our model, powered by well-defined knowledge grounding, is able to approach human performance in some respects, though it still lags behind when dealing with detailed knowledge or long-turn consistency.


A Dataset Collection

Table 10 shows examples comparing our dataset with others. As observed, forum conversations are mostly single-turn QA or comments. Current crowd-sourced datasets are either collected in constrained scenarios (e.g., one scenario fixed the roles in a conversation as one introducer and one listener) or unconstrained but prompting people to deliberately connect knowledge. Our dataset simulates real-life conversations to the largest extent.
We classify the utterances into one of 15 aspects; their definitions, counts, and examples are shown in Table 11. When annotating the corpus, tutorials and examples are provided to the annotators; we show some of these examples in the following tables. All examples are provided only in Chinese, as that is the annotators' native language.

B Human Evaluation
We evaluate machine-generated responses along four metrics. The first two focus only on the conversational backbone without considering domain knowledge; the last two check whether the responses are informative and correct, powered by domain knowledge. Their detailed definitions are: