Towards Conversational Recommendation over Multi-Type Dialogs

We focus on the study of conversational recommendation in the context of multi-type dialogs, where the bot can proactively and naturally lead a conversation from a non-recommendation dialog (e.g., QA) to a recommendation dialog, taking into account the user's interests and feedback. To facilitate the study of this task, we create a human-to-human Chinese dialog dataset, DuRecDial (about 10k dialogs, 156k utterances), which contains multiple sequential dialogs for each pair of a recommendation seeker (user) and a recommender (bot). In each dialog, the recommender proactively leads a multi-type dialog to approach recommendation targets and then makes multiple recommendations with rich interaction behavior. This dataset allows us to systematically investigate different parts of the overall problem, e.g., how to naturally lead a dialog and how to interact with users for recommendation. Finally, we establish baseline results on DuRecDial for future studies.


Introduction
In recent years, there has been a significant increase in work on conversational recommendation due to the rise of voice-based bots (Christakopoulou et al., 2016; Reschke et al., 2013; Warnestal, 2005). These systems focus on how to provide high-quality recommendations through dialog-based interactions with users. Existing studies fall into two categories: (1) task-oriented dialog-modeling approaches (Christakopoulou et al., 2016; Sun and Zhang, 2018; Warnestal, 2005); (2) non-task dialog-modeling approaches with more free-form interactions (Kang et al., 2019). Almost all of these studies focus on a single dialog type, either task-oriented dialogs for recommendation or recommendation-oriented open-domain conversation. Moreover, they assume that both sides in the dialog (especially the user) are aware of the conversational goal from the beginning.
In many real-world applications, human-bot conversations contain multiple dialog types (called multi-type dialogs), such as chit-chat, task-oriented dialogs, recommendation dialogs, and even question answering (Ram et al., 2018; Wang et al., 2014; Zhou et al., 2018b). It is therefore crucial to study how a bot can proactively and naturally make conversational recommendations in the context of multi-type human-bot communication. For example, a bot could proactively make recommendations after question answering or a task dialog to improve user experience, or it could lead a dialog from chitchat toward a given product as commercial advertisement. However, to our knowledge, there is little previous work on this problem.
To address this challenge, we present a novel task, conversational recommendation over multi-type dialogs, where we want the bot to proactively and naturally lead a conversation from a non-recommendation dialog to a recommendation dialog. For example, in Figure 1, given a starting dialog such as question answering, the bot takes the user's interests into account to determine a recommendation target (the movie <The message>) as a long-term goal, drives the conversation in a natural way by following short-term goals, and completes each goal in turn. Here each goal specifies a dialog type and a dialog topic. Our task setting differs from previous work (Christakopoulou et al., 2016). First, the overall dialog in our task contains multiple dialog types, instead of a single dialog type as in previous work. Second, we emphasize the initiative of the recommender, i.e., the bot proactively plans a goal sequence to lead the dialog, and the goals are unknown to the users. This task raises two difficulties: (1) how to proactively and naturally lead a conversation to approach the recommendation target, and (2) how to iterate upon the initial recommendation with the user.

Figure 1: A sample of conversational recommendation over multi-type dialogs. The whole dialog is grounded on a knowledge graph and a goal sequence, while the goal sequence is planned by the bot with consideration of the user's interests and topic transition naturalness. Each goal specifies a dialog type and a dialog topic (an entity). We use different colors to indicate different goals and underlines to indicate knowledge texts.
To facilitate the study of this task, we create a human-to-human recommendation-oriented multi-type Chinese dialog dataset at Baidu (DuRecDial). In DuRecDial, every dialog contains multiple dialog types with natural topic transitions, which corresponds to the first difficulty. Moreover, there is rich interaction variability for recommendation, corresponding to the second difficulty. In addition, each seeker has an explicit profile for the modeling of personalized recommendation, and multiple dialogs with the recommender to mimic real-world application scenarios.
To address this task, inspired by the work of Xu et al. (2020), we present a multi-goal driven conversation generation framework (MGCG) to handle multi-type dialogs (QA, chitchat, recommendation, task dialogs, etc.) simultaneously. It consists of a goal-planning module and a goal-guided responding module. The goal-planning module conducts dialog management to control the dialog flow: it determines a recommendation target as the final goal with consideration of the user's interests and online feedback, and plans appropriate short-term goals for natural topic transitions. To our knowledge, this goal-driven dialog policy mechanism for multi-type dialog modeling has not been studied in previous work. The responding module produces responses for the completion of each goal, e.g., chatting about a topic or making a recommendation to the user. We conduct an empirical study of this framework on DuRecDial.
This work makes the following contributions: • We identify the task of conversational recommendation over multi-type dialogs.
• To facilitate the study of this task, we create a novel dialog dataset DuRecDial, with rich variability of dialog types and domains as shown in Table 1.
• We propose a conversation generation framework with a novel mixed-goal driven dialog policy mechanism.

Datasets for Conversational Recommendation
Several dialog datasets have been created for conversational recommendation, where each dialog contains in-depth discussions on multiple topics. In comparison with them, our dataset contains multiple dialog types, clear goals to achieve during each conversation, and user profiles for personalized conversation.

Models for Conversational Recommendation
Previous work on conversational recommender systems falls into two categories: (1) task-oriented dialog-modeling approaches, in which the systems ask questions about user preference over predefined slots to select items for recommendation (Christakopoulou et al., 2016, 2018; Reschke et al., 2013; Sun and Zhang, 2018; Warnestal, 2005; Zhang et al., 2018b); (2) non-task dialog-modeling approaches, in which the models learn dialog strategies from the dataset without predefined task slots and then make recommendations without slot filling (Chen et al., 2019; Kang et al., 2019; Moon et al., 2019; Zhou et al., 2018a). Our work is closer to the second category, and differs from it in that we conduct multi-goal planning to make proactive conversational recommendations over multi-type dialogs.

Goal-Driven Open-domain Conversation Generation

Recently, imposing goals on open-domain conversation generation models has attracted much research interest (Moon et al., 2019; Tang et al., 2019b; Wu et al., 2019), since it provides more controllability to conversation generation and enables many practical applications, e.g., recommendation of engaging entities. However, these models can only produce a dialog towards a single goal, instead of a goal sequence as done in this work. We notice that the model by Xu et al. (2020) can conduct multi-goal planning for conversation generation, but their goals are limited to in-depth chitchat about related topics, while our goals are not limited to in-depth chitchat.

Figure 2: We collect multiple sequential dialogs {d_i^{s_k}} for each seeker s_k. For the annotation of every dialog, the recommender makes personalized recommendations according to the task templates, the knowledge graph and the seeker profile built so far. The seeker must accept/reject the recommendations.

Task Design
We define one person in the dialog as the recommendation seeker (the role of the user) and the other as the recommender (the role of the bot). We ask the recommender to proactively lead the dialog and then make recommendations with consideration of the seeker's interests, rather than having the seeker ask the recommender for recommendations. Figure 2 shows our task design. The data collection consists of three steps: (1) collection of seeker profiles and a knowledge graph; (2) collection of task templates; (3) annotation of dialog data. Next we provide details of each step.
Explicit seeker profiles Each seeker is equipped with an explicit unique profile (a ground-truth profile), which contains the seeker's name, gender, age, residence city, occupation, and his/her preferences over domains and entities. We automatically generate the ground-truth profile for each seeker; it is known to the seeker but unknown to the recommender. We require that the utterances of each seeker be consistent with his/her profile. We expect that this setting encourages the seeker to clearly and self-consistently explain what he/she likes and dislikes. In addition, the recommender can acquire seeker profile information only through dialogs with the seekers.
Knowledge graph Inspired by the work on document grounded conversation (Ghazvininejad et al., 2018; Moghe et al., 2018), we provide a knowledge graph to support the annotation of more informative dialogs. We build it by crawling data from the Baidu Wiki and Douban websites. Table 3 presents the statistics of this knowledge graph.
Multi-type dialogs for multiple domains We expect that the dialog between the two task-workers starts from a non-recommendation scenario, e.g., question answering or social chitchat, and the recommender should proactively and naturally guide the dialog to a recommendation target (an entity). The targets usually fall into the seeker's interests, e.g., the movies of the star that the seeker likes.
Moreover, to be close to the setting of practical applications, we ask each seeker to conduct multiple sequential dialogs with the recommender. In the first dialog, the recommender asks questions about the seeker profile. Then, in each of the remaining dialogs, the recommender makes recommendations based on the seeker's preferences collected so far, and the seeker profile is automatically updated at the end of each dialog. We require that changes to the seeker profile be reflected in later dialogs. The difference between these dialogs lies in sub-dialog types and recommended entities.
Rich variability of interaction How to iterate upon initial recommendation plays a key role in the interaction procedure for recommendation. To provide better supervision for this capability, we expect that the task workers can introduce diverse interaction behaviors in dialogs to better mimic the decision-making process of the seeker. For example, the seeker may reject the initial recommendation, or mention a new topic, or ask a question about an entity, or simply accept the recommendation. The recommender is required to respond appropriately and follow the seeker's new topic.
Task templates as annotation guidance Due to the complexity of our task design, it is very hard to conduct data annotation with only the high-level instructions mentioned above. Inspired by the work of MultiWOZ (Budzianowski et al., 2018), we provide a task template for each dialog to be annotated, which guides the workers to annotate in the way we expect. As shown in Table 2, each template contains the following information: (1) a goal sequence, where each goal consists of two elements, a dialog type and a dialog topic, corresponding to a sub-dialog; (2) a detailed description of each goal. We create these templates by (1) first automatically enumerating appropriate goal sequences that are consistent with the seeker's interests and have natural topic transitions, and (2) then generating goal descriptions with the use of some rules and human annotation.

Goal1: QA about the movie <Stolen life>. The seeker takes the initiative and asks for information about the movie <Stolen life>; the recommender replies according to the given knowledge graph; finally the seeker provides feedback.
Goal2: Chitchat about the movie star Xun Zhou. The recommender proactively changes the topic to the movie star Xun Zhou as a short-term goal, and conducts an in-depth conversation.
Goal3: Recommendation of the movie <The message>. The recommender proactively changes the topic from the movie star to the related movie <The message> and recommends it with movie comments; the seeker changes the topic to Rene Liu's movies.
Goal4: Recommendation of the movie <Don't cry, Nanking!>. The recommender proactively recommends Rene Liu's movie <Don't cry, Nanking!> with movie comments. The seeker asks questions about this movie, and the recommender replies with related knowledge. Finally the user accepts the recommended movie.

Table 2: One of our task templates, used to guide the workers to annotate the dialog in Figure 1. We require that the recommendation target (the long-term goal) be consistent with the user's interests and the topics mentioned by the user, and that short-term goals provide natural topic transitions to approach the long-term goal.

Data Collection
To obtain this data, we develop an interface and a pairing mechanism. We pair up task workers and give each of them a role of seeker or recommender. Then the two workers conduct data annotation with the help of task templates, seeker profiles and knowledge graph. In addition, we ask that the goals in templates must be tagged in every dialog.
Data structure We organize DuRecDial by seeker IDs. In DuRecDial, there are multiple seekers (each with a different profile) and only one recommender. Each seeker s_k has multiple dialogs {d_i^{s_k}} with the recommender. For each dialog d_i^{s_k}, we provide a knowledge graph and a goal sequence for data annotation, and a seeker profile updated with this dialog.
Data statistics Table 3 presents the statistics of DuRecDial, which demonstrate rich variability of dialog types and domains.
Data quality We conduct human evaluations for data quality. A dialog will be rated "1" if it follows the instruction in task templates and the utterances are fluent and grammatical, otherwise "0". Then we ask three persons to judge the quality of 200 randomly sampled dialogs. Finally we obtain an average score of 0.89 on this evaluation set.

Problem Definition and Framework
Overview Let {d_i^{s_k}}_{i=0}^{N_D^{s_k}-1} denote the set of dialogs by seeker s_k (0 ≤ k < N_s), where N_D^{s_k} is the number of dialogs by seeker s_k, and N_s is the number of seekers. Recall that we attach to each dialog d_i^{s_k} an updated seeker profile (denoted as P_i^{s_k}), a knowledge graph K, and a goal sequence G = {g_t}_{t=0}^{T_g-1}. Given a context X with utterances {u_j}_{j=0}^{m-1} from the dialog d_i^{s_k}, a goal history G' = {g_0, ..., g_{t-1}} (with g_{t-1} the goal for u_{m-1}), P_{i-1}^{s_k} and K, the aim is to produce an appropriate goal g_c to determine where the dialog goes, and then produce a proper response Y = {y_0, y_1, ..., y_n} for the completion of g_c.
Framework overview The overview of our framework MGCG is shown in Figure 3. The goal-planning module outputs goals to proactively and naturally lead the conversation: it takes X, G', K and P_{i-1}^{s_k} as input, and outputs g_c. The responding module is responsible for the completion of each goal by producing responses conditioned on X, g_c, and K. For the implementation of the responding module, we adopt a retrieval model and a generation model proposed by Wu et al. (2019), and modify them to suit our task.
For model training, each [context, response] pair in d_i^{s_k} is paired with its ground-truth goal, P_i^{s_k} and K. The goals serve as labels for training the goal-planning module, while the tuples [context, ground-truth goal, K, response] are used for training the responding module.
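As a rough sketch, one annotated dialog can be split into training examples for the two modules as follows (the field names and turn structure here are illustrative assumptions, not the dataset's released schema):

```python
def build_training_examples(dialog):
    """Split one annotated dialog into training examples for the two modules.

    `dialog` is a list of turns, each a dict with keys "utterance" and "goal"
    (the ground-truth goal tagged during annotation).
    """
    goal_examples, response_examples = [], []
    for t in range(1, len(dialog)):
        context = [turn["utterance"] for turn in dialog[:t]]
        goal_history = [turn["goal"] for turn in dialog[:t]]
        turn = dialog[t]
        # goal-planning module: predict the current goal from context + goal history
        goal_examples.append((context, goal_history, turn["goal"]))
        # responding module: generate the response given the ground-truth goal
        response_examples.append((context, turn["goal"], turn["utterance"]))
    return goal_examples, response_examples
```

In this sketch the knowledge graph K and the profile P_i^{s_k} would simply be attached to every example as additional fields.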

Goal-planning Model
As shown in Figure 3(a), we divide the task of goal planning into two sub-tasks, goal completion estimation, and current goal prediction.
Goal completion estimation For this subtask, we use a convolutional neural network (CNN) (Kim, 2014) to estimate the probability that the last goal has been completed:

P_GC(l = 1 | X, g_{t-1}).   (1)

Current goal prediction If g_{t-1} is not completed (P_GC < 0.5), then g_c = g_{t-1}, where g_c is the goal for Y. Otherwise, we use CNN-based multi-task classification to predict the current goal by maximizing the probability

P(g_c = (g_ty, g_tp) | X, G'),   (2)

where g_ty is a candidate dialog type and g_tp is a candidate dialog topic.
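The decision logic above can be sketched as follows. This is a minimal illustration of the control flow only; the completion probability and the per-class scores would come from the CNN classifiers, which are stubbed out here as plain inputs:

```python
def plan_goal(p_completion, prev_goal, type_scores, topic_scores, threshold=0.5):
    """Pick the current goal g_c.

    If the previous goal is judged incomplete (P_GC < threshold), keep it;
    otherwise take the highest-scoring (dialog type, dialog topic) pair
    from the two classification heads.
    """
    if p_completion < threshold:
        return prev_goal
    dialog_type = max(type_scores, key=type_scores.get)
    dialog_topic = max(topic_scores, key=topic_scores.get)
    return (dialog_type, dialog_topic)
```

For example, with p_completion = 0.3 the planner keeps the previous goal; with p_completion = 0.8 it switches to the argmax type/topic pair.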

Retrieval-based Response Model
In this work, conversational goal is an important guidance signal for response ranking. Therefore, we modify the original retrieval model to suit our task by emphasizing the use of goals. As shown in Figure 3(b), our response ranker consists of five components: a context-response representation module (C-R Encoder), a knowledge representation module (Knowledge Encoder), a goal representation module (Goal Encoder), a knowledge selection module (Knowledge Selector), and a matching module (Matcher).
The C-R Encoder has the same architecture as BERT (Devlin et al., 2018): it takes a context X and a candidate response Y as segment a and segment b of BERT, and leverages stacked self-attention to produce the joint representation of X and Y, denoted as xy.
Each related knowledge item knowledge_i is also encoded as a vector k_i by the Knowledge Encoder using a bi-directional GRU (Chung et al., 2014): k_i is the concatenation of the last hidden state of the forward GRU and the last hidden state of the backward GRU (i.e., its state at the first token).
The Goal Encoder uses bi-directional GRUs to encode a dialog type and a dialog topic for goal representation (denoted as g c ).
For knowledge selection, we let the context-response representation xy attend to all knowledge vectors k_i, yielding an attention distribution p(k_i | x, y, g_c) over the knowledge. We then fuse all related knowledge information into a single vector k_c = Σ_i p(k_i | x, y, g_c) · k_i. We view k_c, g_c and xy as the information from the knowledge source, goal source and dialog source respectively, and fuse the three into a single vector via concatenation. Finally we calculate a matching probability for each Y by:

p(l = 1 | X, Y, K, g_c) = softmax(MLP([xy; k_c; g_c])).   (5)
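The attention-based knowledge fusion can be sketched in plain Python as below. This is a simplified stand-in: the attention score is taken as a raw dot product between xy and each k_i (the paper does not spell out the exact scoring function here), and the final MLP is omitted, so the function returns the attention weights and the concatenated [xy; k_c; g_c] vector that would be fed into it:

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def fuse_for_matching(xy, knowledge_vecs, goal_vec):
    """Attend xy over the knowledge vectors, fuse them into k_c,
    and concatenate the three information sources [xy; k_c; g_c]."""
    attn = softmax([dot(xy, k) for k in knowledge_vecs])
    dim = len(knowledge_vecs[0])
    k_c = [sum(p * k[d] for p, k in zip(attn, knowledge_vecs))
           for d in range(dim)]
    fused = xy + k_c + goal_vec  # list concatenation = vector concatenation
    return attn, fused
```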

Generation-based Response Model
To highlight the importance of conversational goals, we also modify the original generation model by introducing an independent encoder for goal representation. As shown in Figure 3(c), our generator consists of five components: a Context Encoder, a Knowledge Encoder, a Goal Encoder, a Knowledge Selector, and a Decoder. Given a context X, a conversational goal g_c and a knowledge graph K, our generator first encodes them as vectors using the above encoders (based on bi-directional GRUs).
For knowledge selection, the model learns a knowledge-selection strategy by minimizing the KL divergence loss (KLDivLoss) between two distributions, a prior distribution p(k_i | x, g_c) and a posterior distribution p(k_i | x, y, g_c). We assume that access to the correct response is conducive to knowledge selection; minimizing the KLDivLoss makes knowledge selection at prediction time (without the correct response) close to knowledge selection with the correct response. The loss is formulated as:

L_KL(θ) = Σ_i p(k_i | x, y, g_c) log [ p(k_i | x, y, g_c) / p(k_i | x, g_c) ].

During training, we fuse all related knowledge information into a vector k_c = Σ_i p(k_i | x, y, g_c) · k_i, the same as in the retrieval-based method, and feed it to the decoder for response generation. During testing, the fused knowledge is estimated as k_c = Σ_i p(k_i | x, g_c) · k_i, without ground-truth responses. The decoder is implemented with the Hierarchical Gated Fusion Unit described in (Yao et al., 2017), a standard GRU-based decoder enhanced with external knowledge gates. In addition to the loss L_KL(θ), the generator uses the following losses:

NLL loss: the negative log-likelihood of the ground-truth response, L_NLL(θ).
BOW loss: We use the BOW loss to ensure the accuracy of the fused knowledge k_c by enforcing the relevancy between the knowledge and the true response. Specifically, let w = MLP(k_c) ∈ R^|V|, where |V| is the vocabulary size. We define p(y_t | k_c) = softmax(w)_{y_t}, i.e., the probability the fused knowledge assigns to the t-th word of the gold response. The BOW loss is then defined to minimize:

L_BOW(θ) = -Σ_t log p(y_t | k_c).

Finally, we minimize the following loss function:

L(θ) = L_KL(θ) + L_NLL(θ) + α · L_BOW(θ),

where α is a trainable parameter.
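A minimal sketch of the KL and BOW losses for a single example is given below, assuming the prior/posterior knowledge distributions and the bag-of-words logits w = MLP(k_c) are already computed (all inputs here are plain Python lists; in practice these would be tensor operations):

```python
import math

def kl_div(posterior, prior, eps=1e-12):
    """L_KL = sum_i q_i * log(q_i / p_i),
    with q = p(k | x, y, g_c) and p = p(k | x, g_c)."""
    return sum(q * math.log((q + eps) / (p + eps))
               for q, p in zip(posterior, prior))

def bow_loss(word_logits, target_ids):
    """L_BOW = -sum_t log softmax(w)[y_t]: the fused knowledge vector
    must be predictive of every word of the gold response."""
    m = max(word_logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in word_logits))
    return -sum(word_logits[t] - log_z for t in target_ids)
```

When the posterior and prior agree, the KL term vanishes; logits that put more mass on the gold-response words yield a lower BOW loss.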

Experimental Setting
We split DuRecDial into train/dev/test sets by randomly sampling 65%/10%/25% of the data at the level of seekers, rather than individual dialogs. To evaluate the contribution of goals, we conduct an ablation study in which the input goals are replaced with "UNK" for the responding models. For knowledge usage, we conduct another ablation study in which the input knowledge is replaced with "UNK".
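A seeker-level split keeps all dialogs of one seeker inside a single partition, so no seeker profile leaks from train to test. A minimal sketch (our own helper, not the released preprocessing script):

```python
import random

def split_by_seeker(dialogs_by_seeker, ratios=(0.65, 0.10, 0.25), seed=0):
    """Split at the seeker level: every dialog of a given seeker lands in
    the same partition, so no seeker appears in both train and test."""
    seekers = sorted(dialogs_by_seeker)
    random.Random(seed).shuffle(seekers)
    n = len(seekers)
    n_train = int(n * ratios[0])
    n_dev = int(n * ratios[1])
    parts = (seekers[:n_train],
             seekers[n_train:n_train + n_dev],
             seekers[n_train + n_dev:])
    return [[d for s in part for d in dialogs_by_seeker[s]] for part in parts]
```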

Methods
S2S: We implement a vanilla sequence-to-sequence model (Sutskever et al., 2014), which is widely used for open-domain conversation generation.
MGCG_R: Our system with automatic goal planning and a retrieval-based responding model.
MGCG_G: Our system with automatic goal planning and a generation-based responding model.

Table 4: Automatic evaluation results. "+(-)gl." means "with (without) conversational goals"; "+(-)kg." means "with (without) knowledge". For "S2S +gl.+kg.", we simply concatenate the goal predicted by our model, all the related knowledge, and the dialog context as its input.

Automatic Evaluations
Metrics For automatic evaluation, we use several common metrics such as BLEU (Papineni et al., 2002), F1, perplexity (PPL), and DISTINCT (DIST-2) (Li et al., 2016) to measure the relevance, fluency, and diversity of generated responses. Following the setting in previous work (Wu et al., 2019; Zhang et al., 2018a), we also measure the performance of all models using Hits@1 and Hits@3. 6 Here we let each model select the best response from 10 candidates: the ground-truth response generated by humans and nine randomly sampled from the training set. Moreover, we evaluate the knowledge-selection capability of each model by calculating knowledge precision/recall/F1 scores as done in Wu et al. (2019). 7 In addition, we report the performance of our goal-planning module, including the accuracy of goal completion estimation, dialog type prediction, and dialog topic prediction.
Results Our goal-planning model achieves accuracy scores of 94.13%, 91.22%, and 42.31% for goal completion estimation, dialog type prediction, and dialog topic prediction, respectively. The accuracy of dialog topic prediction is relatively low since the number of topic candidates is very large (around 1000), making topic prediction difficult. As shown in Table 4, for response generation, both MGCG_R and MGCG_G outperform S2S by a large margin in terms of all metrics under the same model setting (without gl.+kg., with gl., or with gl.+kg.). Moreover, MGCG_R performs better in terms of Hits@k and DIST-2, but worse in terms of knowledge F1, when compared to MGCG_G. 8 This might be explained by the fact that they are optimized for different metrics. We also find that the methods using goals and knowledge outperform those without, confirming the benefits of goals and knowledge as guidance information.

6 Candidates (including the golden response) are scored by PPL using the generation-based model; the candidates are then sorted by score, and Hits@1 and Hits@3 are calculated.
7 When calculating the knowledge precision/recall/F1, we compare the generated results with the correct knowledge.
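The PPL-based Hits@k computation described in footnote 6 reduces to ranking the candidates by perplexity and checking whether the gold response is in the top k; a sketch:

```python
def hits_at_k(ppl_scores, gold_index, k):
    """Rank candidates by perplexity (lower is better) and check whether
    the gold response is among the top-k ranked candidates."""
    ranked = sorted(range(len(ppl_scores)), key=lambda i: ppl_scores[i])
    return gold_index in ranked[:k]
```

Averaging this indicator over the test set yields the reported Hits@1 and Hits@3.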

Human Evaluations
Metrics: The human evaluation is conducted at the level of both turns and dialogs.
For turn-level human evaluation, we ask each model to produce a response conditioned on a given context, the predicted goal and the related knowledge. 9 The generated responses are evaluated by three annotators in terms of fluency, appropriateness, informativeness, and proactivity. Appropriateness measures whether the response completes the current goal and is relevant to the context. Informativeness measures whether the model makes full use of knowledge in the response. Proactivity measures whether the model can successfully introduce new topics with good fluency and coherence.
For dialogue-level human evaluation, we let each model converse with a human and proactively make recommendations when given the predicted goals and related knowledge. 10 For each model, we collect 100 dialogs. These dialogs are then evaluated by three persons in terms of two metrics: (1) goal success rate, which measures how well the conversation goals are achieved, and (2) coherence, which measures the relevance and fluency of a dialog as a whole.
All the metrics have three grades: good (2), fair (1), bad (0). For proactivity, "2" indicates that the model introduces new topics relevant to the context, "1" means that no new topics are introduced but knowledge is used, and "0" means that the model introduces new but irrelevant topics. For goal success rate, "2" means that the system completes more than half of the goals from the goal-planning module, "0" means that the system completes no more than one goal, and "1" otherwise. For coherence, "2"/"1"/"0" means that two-thirds/one-third/very few of the utterance pairs are coherent and fluent.
Results All human evaluations are conducted by three persons. As shown in Table 5, our two systems outperform S2S by a large margin, especially in terms of appropriateness, informativeness, goal success rate and coherence. In particular, S2S tends to generate safe and uninformative responses, failing to complete the goals in most dialogs. Our two systems produce more appropriate and informative responses and achieve a higher goal success rate by making full use of goal information and knowledge. Moreover, the retrieval-based model performs better in terms of fluency, since its responses are selected from original human utterances rather than generated automatically, but it performs worse on all the other metrics when compared to the generation-based model, which might be caused by the limited number of retrieval candidates. Finally, there is still much room for improvement in terms of appropriateness and goal success rate, which we leave as future work.

Result Analysis
In order to further analyze the relationship between knowledge usage and goal completion, we report the number of failed goals, completed goals, and used knowledge for each method over different dialog types in Table 6. We see that the amount of used knowledge is proportional to the goal success rate across dialog types and methods, indicating that knowledge-selection capability is crucial to goal completion through dialogs. Moreover, the goal of a chitchat dialog is easier to complete than the others, while QA and recommendation dialogs are more challenging. How to strengthen knowledge-selection capability in the context of multi-type dialogs, especially for QA and recommendation, is very important, and we leave it as future work.

10 Please see Appendix 4 for more details.

Conclusion
We identify the task of conversational recommendation over multi-type dialogs, and create a dataset, DuRecDial, with multiple dialog types and multi-domain use cases. We demonstrate the usability of this dataset and provide results of state-of-the-art models for future studies. The complexity of DuRecDial makes it a good testbed for more tasks, such as knowledge-grounded conversation (Ghazvininejad et al., 2018), domain transfer for dialog modeling, target-guided conversation (Tang et al., 2019a) and multi-type dialog modeling (Yu et al., 2017). The study of these tasks is left as future work.
• Weather: historical weather of 55 cities from July 2017 to August 2019

Collection of task templates First, we manually annotate a list of around 20 high-level goal sequences as candidates. Most of these goal sequences include 3 to 5 high-level goals, where each high-level goal contains a dialog type and a domain (not an entity or chatting topic). Then, for each seeker, we select from the above list the high-level goal sequences whose domains fall into the seeker's preferred domain list.
To collect goal sequences at the entity level, we first use the seed entities of the seeker to enrich the information of the above high-level goal sequences. If the seed entities are not enough, or there are no seeds for some domains in the high-level goal sequences, we select entities from the KG for each goal domain based on embedding-based similarity scores between the seed entities (of the current seeker) and the candidate entity. We thereby obtain goal sequences at the entity level. Finally, we use some rules to generate a description for each goal (e.g., which side, the seeker or the recommender, starts the dialog, and how to complete the goal). Thus we have task templates to guide data annotation.
To introduce diverse interaction behaviors for recommendation, we design some fine-grained interaction operations; e.g., the seeker may reject the initial recommendation, mention a new topic, ask a question about an entity, or simply accept the recommendation. Each interaction operation corresponds to a goal. We randomly sample one of the above operations and insert it into the entity-level goal sequences to diversify recommendation dialogs. The entities associated with the above interaction operations are selected from the KG based on their similarity scores with the current seeker's seed entities: if the entity is to be accepted by the seeker as described in the task template (including the entity-level goal sequence and its description), its similarity score with the seeker's seed entities should be relatively high; if it is to be rejected, the score should be relatively low.
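The similarity-based candidate scoring can be sketched as below. We assume embeddings are available for seed and candidate entities and use mean cosine similarity as the score (the paper does not specify the exact aggregation, so this is an illustrative choice):

```python
import math

def cosine(a, b):
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / (na * nb)

def score_candidates(seed_embs, cand_embs):
    """Score each candidate entity by its mean cosine similarity to the
    seeker's seed entities; high scorers become to-be-accepted
    recommendations, low scorers to-be-rejected ones."""
    return {name: sum(cosine(emb, s) for s in seed_embs) / len(seed_embs)
            for name, emb in cand_embs.items()}
```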

Dataset annotation process
We first release a small amount of data for training the workers, and then provide video training to address annotation problems. After that, a small amount of data is released again to select the final task workers. To ensure that at least two workers enter the task at the same time, we arrange for multiple workers to log in to the annotation platform. During annotation, each conversation is randomly assigned to two workers, one of whom plays the role of the bot and the other the role of the user. The two workers conduct annotation based on the seeker profile, knowledge graph and task templates.
Due to the complexity of our task design, the quality of data annotation may have high variance. To address this problem, we provide a strict data annotation standard to guide the workers to annotate in the way we expect. After data annotation, multiple data specialists review it. As long as one specialist thinks a dialog does not meet the requirements, it is sent back for re-annotation until all specialists agree that it fully meets the requirements.

Model Parameter Settings
All models are implemented using PaddlePaddle. 11 The parameters of all the modules are shown in Table 7. 12

Turn-level Human Evaluation Guideline
Fluency measures if the produced response itself is fluent: • score 0 (bad): not fluent and difficult to understand.
• score 1 (fair): there are some errors in the response text but still can be understood.
• score 2 (good): fluent and easy to understand.
Appropriateness measures if the response can respond to the context: • score 0 (bad): For recommendation and chitchat sub-dialogs, not semantically relevant to the context or logically contradictory to it; for task-oriented sub-dialogs, no necessary slot value is involved in the conversation; for QA sub-dialogs, an incorrect answer.
• score 1 (fair): relevant to the context as a whole, but using some irrelevant knowledge, or not answering questions asked by the users.
• score 2 (good): otherwise.

Informativeness measures if the model makes full use of knowledge in the response: • score 0 (bad): no knowledge is mentioned at all.
• score 1 (fair): only one knowledge triple is mentioned in the response.
• score 2 (good): more than one knowledge triple is mentioned in the response.
Proactivity measures if the model can introduce new knowledge/topics in conversation: • score -1 (bad): some new topics are introduced but irrelevant to the context.
• score 0 (fair): no new topics/knowledge are used.
• score 1(good): some new topics relevant to the context are introduced.

Dialogue-level Human Evaluation Guideline
Goal Completion measures how well the given conversation goals are achieved: • score 0 (bad): less than half of the goals are achieved.
• score 1 (fair): less than half of the goals are achieved, with minor use of knowledge or goal information.
• score 2 (good): more than half goals are achieved with full use of knowledge and goal information.
Coherence measures the overall fluency of the whole dialogue: • score 0 (bad): two-thirds of the responses are irrelevant or logically contradictory to the previous context.
• score 1 (fair): less than one-third of the responses are irrelevant or logically contradictory to the previous context.
• score 2 (good): very few responses are irrelevant or logically contradictory to the previous context.

Figure 4 shows the conversations generated by the models via conversing with humans, given the conversation goal and the related knowledge. It can be seen that our knowledge-aware generator can use more correct knowledge for diverse conversation generation. Although the retrieval-based method can also produce knowledge-grounded responses, the knowledge it uses is relatively sparse and sometimes inappropriate. The seq2seq model cannot successfully complete the given goal, as it does not use knowledge as fully as our knowledge-aware generator, making the generated conversation less diverse and sometimes dull.