DyKgChat: Benchmarking Dialogue Generation Grounding on Dynamic Knowledge Graphs

Data-driven, knowledge-grounded neural conversation models are capable of generating more informative responses. However, these models have not yet demonstrated that they can zero-shot adapt to updated, unseen knowledge graphs. This paper proposes a new task about how to apply dynamic knowledge graphs in neural conversation models and presents a novel TV series conversation corpus (DyKgChat) for the task. Our new task and corpus aid in understanding the influence of dynamic knowledge graphs on response generation. To support dynamic knowledge graphs, we also propose a preliminary model that selects an output from two networks at each time step: a sequence-to-sequence model (Seq2Seq) and a multi-hop reasoning model. To benchmark this new task and evaluate the capability of adaptation, we introduce several evaluation metrics, and the experiments show that our proposed approach outperforms previous knowledge-grounded conversation models. The proposed corpus and model can motivate future research directions.


Introduction
In chit-chat dialogue generation, neural conversation models (Sutskever et al., 2014; Sordoni et al., 2015; Vinyals and Le, 2015) have emerged for their capability to be fully data-driven and end-to-end trained. Because the generated responses are often reasonable but general (without useful information), recent work proposed knowledge-grounded models (Eric et al., 2017; Ghazvininejad et al., 2018; Zhou et al., 2018b; Qian et al., 2018) to incorporate external facts in an end-to-end fashion without hand-crafted slot filling. Effectively combining text and external knowledge graphs has also been a crucial topic in question answering (Yin et al., 2016; Hao et al., 2017; Levy et al., 2017; Sun et al., 2018; Das et al., 2019). Nonetheless, prior work rarely analyzed the model capability of zero-shot adaptation to dynamic knowledge graphs, where the states/entities and their relations are temporal and evolve as a single time scale process. For example, as shown in Figure 1, the entity Jin-Xi was originally related to the entity Feng, Ruozhao with the type EnemyOf, but then evolved to be related to the entity Nian, Shilan.
The goal of this paper is to facilitate knowledge-grounded neural conversation models to learn and zero-shot adapt with dynamic knowledge graphs. To our observation, however, there is no existing conversational data paired with dynamic knowledge graphs. Therefore, we collect a TV series corpus, DyKgChat, with facts of the fictitious life of characters. DyKgChat includes a Chinese palace drama Hou Gong Zhen Huan Zhuan (HGZHZ) and an English sitcom Friends, which contain dialogues, speakers, scenes (e.g., the places and listeners), and the corresponding knowledge graphs including explicit information such as the relations FriendOf, EnemyOf, and ResidenceOf as well as the linked entities. Table 1 shows some examples from DyKgChat.

Table 1: Examples from DyKgChat.
HGZHZ
Zhen-Huan: She must be frightened. It should blame me. I should not ask her to play chess.
甄嬛: 姐姐定是嚇壞了。都怪臣妾不好，好端端的來叫梅姐姐下棋做什麼。
Doctor-Wen: Relax, Concubine-Huan madame. Lady Shen is just injured but is fine.
溫太醫: 莞嬪娘娘請放心，惠貴人的精神倒是沒有大礙，只是傷口燒得有些厲害。
Friends
Joey: C'mon, you're going out with the guy! There's gotta be something wrong with him!
Chandler: Alright Joey, be nice. So does he have a hump? A hump and a hairpiece?
Prior graph embedding based knowledge-grounded conversation models (Sutskever et al., 2014; Ghazvininejad et al., 2018; Zhu et al., 2017) did not directly use the graph structure, so it is unknown how a changed graph will influence the generated responses. In addition, key-value retrieval-based models (Yin et al., 2016; Eric et al., 2017; Levy et al., 2017; Qian et al., 2018) retrieve only one-hop relation paths. As with fictitious life in dramas, realistic responses often use knowledge entities existing in multi-hop relational paths, e.g., the residence of a friend of mine. Therefore, we propose a model that incorporates multi-hop reasoning (Lao et al., 2011; Neelakantan et al., 2015) on the graph structure into a neural conversation generation model. Our proposed model, called quick adaptive dynamic knowledge-grounded neural conversation model (Qadpt), is based on a Seq2Seq model (Sutskever et al., 2014) with a widely-used copy mechanism (Gu et al., 2016; Merity et al., 2017; Xing et al., 2017; Zhu et al., 2017; Eric et al., 2017; Ke et al., 2018). To enable multi-hop reasoning, the model factorizes a transition matrix for a Markov chain.
In order to focus on the capability of producing reasonable knowledge entities and adapting with dynamic knowledge graphs, we propose two types of automatic metrics. First, given the provided knowledge graphs, we examine if the models can generate responses with proper usage of multi-hop reasoning over knowledge graphs. Second, after randomly replacing some crucial entities in knowledge graphs, we test if the models can generate correspondingly updated responses. The empirical results show that our proposed model has the desired advantage of zero-shot adaptation with dynamic knowledge graphs, and can serve as a preliminary baseline for this new task. To sum up, the contributions of this paper are three-fold:
• A new task, dynamic knowledge-grounded conversation generation, is proposed.
• A newly-collected TV series conversation corpus DyKgChat is presented for the target task.
• We benchmark the task by comparing many prior models and the proposed quick adaptive dynamic knowledge-grounded neural conversation model (Qadpt), which may benefit future research.

Task Description
For each single-turn conversation, the input message and response are respectively denoted as $x = \{x_t\}_{t=1}^{m}$ and $y = \{y_t\}_{t=1}^{n}$, where $m$ and $n$ are their lengths. Each turn $(x, y)$ is paired with a knowledge graph $\mathcal{K}$, which is composed of a collection of triplets $(h, r, t)$, where $h, t \in \mathcal{V}$ (the set of entities) and $r \in \mathcal{L}$ (the set of relationships). Each word $y_t$ in a response belongs to either generic words $\mathcal{W}$ or knowledge graph entities $\mathcal{V}$. The task is two-fold:
1. Given an input message $x$ and a knowledge graph $\mathcal{K}$, the goal is to generate a sequence $\{\hat{y}_t\}_{t=1}^{n}$ that is not only as similar as possible to the ground-truth $\{y_t\}_{t=1}^{n}$, but also contains correct knowledge graph entities to reflect the information.
2. After a knowledge graph is updated to $\mathcal{K}'$, where some triplets are revised to $(h, r, t')$ or $(h, r', t)$, the generated sequence should contain the corresponding knowledge graph entities in $\mathcal{K}'$ to reflect the updated information.
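As a minimal sketch of the data model in this task description, a knowledge graph can be held as a set of triplets and a revised graph obtained by swapping the tail of one triplet. The class `Triplet` and the helper `revise_tail` are hypothetical names introduced only for illustration; the EnemyOf example follows Figure 1, while the ResidenceOf triplet is made up.

```python
from typing import NamedTuple, Set


class Triplet(NamedTuple):
    head: str
    relation: str
    tail: str


def revise_tail(graph: Set[Triplet], head: str, relation: str, new_tail: str) -> Set[Triplet]:
    """Return an updated graph in which (head, relation, t) becomes (head, relation, new_tail)."""
    kept = {tr for tr in graph if not (tr.head == head and tr.relation == relation)}
    return kept | {Triplet(head, relation, new_tail)}


# Original graph K (the second triplet is an illustrative assumption).
K = {
    Triplet("Jin-Xi", "EnemyOf", "Feng, Ruozhao"),
    Triplet("Jin-Xi", "ResidenceOf", "Yongshou Palace"),
}

# Updated graph K': the EnemyOf relation of Jin-Xi now points to a new tail.
K_prime = revise_tail(K, "Jin-Xi", "EnemyOf", "Nian, Shilan")
```

A model that satisfies the second part of the task should, given `K_prime` instead of `K`, mention the new tail entity without being retrained.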

Evaluation Metrics
To evaluate dynamic knowledge-grounded conversation models, we propose two types of evaluation metrics for validating the two aspects described above.

Knowledge Entity Modeling
There are three metrics focusing on the knowledge-related capability.
Knowledge word accuracy (KW-Acc). Given the ground-truth sentence as the decoder inputs, at each time step, it evaluates how many knowledge graph entities are correctly predicted.
For example, after perceiving the partial ground-truth response "If Jin-Xi not in" and knowing the next word should be a knowledge graph entity, KW-Acc measures if the model can predict the correct word "Yongshou Palace".
Knowledge and generic word classification (KW/Generic). Given the ground-truth sentence as the decoder inputs, at each time step, it measures the capability of predicting the correct class (a knowledge graph entity or a generic word) and adopts micro-averaging. The true positives, false negatives, and false positives are counted over the time steps, where the prediction at each step is $\hat{y}_t \sim P(\cdot \mid y_1 y_2 \ldots y_{t-1})$.
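The two teacher-forced metrics above can be sketched as follows. This is an illustration under stated assumptions rather than the evaluation script of the paper: the knowledge entity class is treated as the positive class, KW-Acc is normalized by the number of reference positions that hold an entity, and the function name `kw_metrics` and the model output are hypothetical.

```python
from typing import Dict, List, Sequence, Set, Tuple


def kw_metrics(corpus: Sequence[Tuple[List[str], List[str]]],
               entities: Set[str]) -> Dict[str, float]:
    """corpus holds (predicted, reference) token lists of equal length, where
    predicted[t] is the model's top choice given the reference prefix reference[:t]."""
    tp = fn = fp = kw_correct = kw_total = 0
    for predicted, reference in corpus:
        if len(predicted) != len(reference):
            raise ValueError("teacher-forced predictions must align with the reference")
        for pred, gold in zip(predicted, reference):
            pred_is_kw, gold_is_kw = pred in entities, gold in entities
            if gold_is_kw:
                kw_total += 1
                kw_correct += int(pred == gold)
            # Counts are pooled over the whole corpus, i.e. micro-averaging.
            tp += int(pred_is_kw and gold_is_kw)
            fn += int(not pred_is_kw and gold_is_kw)
            fp += int(pred_is_kw and not gold_is_kw)
    return {
        "kw_acc": kw_correct / kw_total if kw_total else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


entities = {"Jin-Xi", "Yongshou Palace"}
reference = ["If", "Jin-Xi", "not", "in", "Yongshou Palace"]
predicted = ["If", "she", "not", "in", "Yongshou Palace"]  # hypothetical model output
scores = kw_metrics([(predicted, reference)], entities)
```

In this toy case one of the two entity positions is predicted correctly, and the model never predicts an entity where the reference has a generic word.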
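For example, given the input sentence "Where's JinXi?", suppose a model generates "Hi, Zhen-Huan, JinXi is in Yangxin-Palace." when the reference is "JinXi is in Yongshou-Palace." The knowledge entities are Zhen-Huan, JinXi, and Yangxin-Palace in the generated response, and JinXi and Yongshou-Palace in the reference. Only JinXi appears in both, so recall is 1/2 and precision is 1/3.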
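The arithmetic of this example can be checked with a short sketch. It assumes that entities are compared as sets (repeated mentions count once) and that tokenization keeps each entity as one token; the function name `generated_entity_scores` is hypothetical.

```python
from fractions import Fraction
from typing import List, Set, Tuple


def generated_entity_scores(generated: List[str], reference: List[str],
                            entities: Set[str]) -> Tuple[Fraction, Fraction]:
    """Return (precision, recall) of knowledge entities in a freely generated response."""
    gen = {tok for tok in generated if tok in entities}
    ref = {tok for tok in reference if tok in entities}
    tp = len(gen & ref)
    precision = Fraction(tp, len(gen)) if gen else Fraction(0)
    recall = Fraction(tp, len(ref)) if ref else Fraction(0)
    return precision, recall


entities = {"Zhen-Huan", "JinXi", "Yangxin-Palace", "Yongshou-Palace"}
generated = ["Hi", ",", "Zhen-Huan", ",", "JinXi", "is", "in", "Yangxin-Palace", "."]
reference = ["JinXi", "is", "in", "Yongshou-Palace", "."]
precision, recall = generated_entity_scores(generated, reference, entities)
# precision is 1/3 and recall is 1/2, matching the example above.
```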

Adaptation of Changed Knowledge Graphs
Each knowledge graph is randomly changed by (1) shuffling a batch (All), (2) replacing the predicted entities (Last1), or (3) replacing the last two steps of paths predicting the generated entities (Last2). We have two metrics focusing on the capability of adaptation.
Change rate. It measures whether the responses differ from the original ones (generated with the original knowledge graphs). A higher rate indicates that the model is more sensitive to a changed knowledge graph. However, a higher rate is not necessarily better, because some changes make the responses worse. The following metric is proposed to deal with this issue, but the change rate is still reported.
Accurate change rate. This measures whether the originally predicted entities are replaced with entities from the hypothesis set, which ensures that the updated responses generate knowledge graph entities according to the updated knowledge graphs.
(1) In All, the hypothesis set is the collection of all entities in the new knowledge graph.
(2) In Last1 and Last2, the hypothesis set is the randomly-selected substitutes.
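The two adaptation metrics can be sketched as below. The exact matching rule is not spelled out in the text, so this sketch assumes that a change counts as accurate when the updated response contains at least one newly introduced token that belongs to the hypothesis set; the function name `change_rates` and the toy responses are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def change_rates(pairs: Sequence[Tuple[List[str], List[str]]],
                 hypotheses: Sequence[Set[str]]) -> Tuple[float, float]:
    """pairs[i] = (original response, response after the graph change);
    hypotheses[i] = hypothesis set of entities for example i."""
    total = len(pairs)
    if total == 0:
        return 0.0, 0.0
    changed = accurate = 0
    for (original, updated), hypothesis in zip(pairs, hypotheses):
        if original != updated:
            changed += 1
            # Assumed rule: a newly generated token must come from the hypothesis set.
            if (set(updated) - set(original)) & hypothesis:
                accurate += 1
    return changed / total, accurate / total


pairs = [
    (["JinXi", "is", "in", "Yongshou-Palace"], ["JinXi", "is", "in", "Yangxin-Palace"]),
    (["JinXi", "is", "in", "Yongshou-Palace"], ["I", "do", "not", "know"]),
    (["Hello"], ["Hello"]),
    (["Thank", "you"], ["Thank", "you"]),
]
hypotheses = [{"Yangxin-Palace"}] * len(pairs)
change_rate, accurate_change_rate = change_rates(pairs, hypotheses)
```

Here two of the four responses change, but only the first change uses the substitute entity, which illustrates why the change rate alone can overstate adaptation.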

DyKgChat Corpus
This section introduces the collected DyKgChat corpus for the target knowledge-grounded conversation generation task.

Data Collection
To build a corpus where the knowledge graphs naturally evolve, we collect TV series conversations, considering that TV series often contain complex relationship evolution, such as friends, jobs, and residences. We choose TV series in different languages and with longer episodes. We download the scripts of a Chinese palace drama "Hou Gong Zhen Huan Zhuan" (HGZHZ; with 76 episodes and hundreds of characters) from Baidu Tieba, and the scripts of an English sitcom "Friends" (with 236 episodes and six main characters)². Their paired knowledge graphs are manually constructed according to their wikis written by fans³ ⁴. Note that the knowledge graph of HGZHZ is mainly built upon the twenty-five most frequently appearing characters.
The datasets are split with 5% as validation data and 10% as testing data, where the split is based on multi-turn dialogues and balanced on speakers. The boundaries of dialogues are annotated in the original scripts. The tokenization of HGZHZ considers Chinese characters and knowledge entities; the tokenization of Friends considers space-separated tokens and knowledge entities. The data statistics after preprocessing are detailed in Table 2. The relation types $r \in \mathcal{L}$ of each knowledge graph and their percentages are listed in Table 3, and the knowledge graph entities are plotted as word clouds in Figure 2.

Subgraph Sampling
Due to the excessive labor of building dynamic knowledge graphs aligned with all episodes, we currently collect a fixed knowledge graph $\mathcal{G}$ containing all information that once exists for each TV series. To build the aligned dynamic knowledge graphs, we sample the top-five shortest paths on knowledge graphs from each source to each target, where the sources are knowledge entities in the input message and the scene $\{x_t \in \mathcal{V}\}$, and the targets are knowledge entities in the ground-truth response $\{y_t \in \mathcal{V}\}$. We manually check whether the selected number of shortest paths is able to cover most of the used relational paths. The dynamic knowledge graphs are built based on an ensemble of the following possible subgraphs:
• The sample for each single-turn dialogue.
• The sample for each multi-turn dialogue.
• The manually-annotated subgraph for each period.
The first rule is adopted for simplicity; preliminary models should at least work on this type of subgraph. The subgraphs are defined as the dynamic knowledge graphs $\{\mathcal{K}\}$, which are updated every single-turn dialogue.

² https://github.com/npow/friends-chatbot
³ https://zh.wikipedia.org/wiki/後宮甄嬛傳_(電視劇)
⁴ https://friends.fandom.com/wiki/Friends_Wiki
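The subgraph sampling for a single-turn dialogue can be sketched as a breadth-first enumeration of simple paths, which yields paths in order of increasing length. This is an illustrative assumption about the procedure, not the authors' code: edges are treated as directed, the hop limit of 6 mirrors the maximum shortest-path length reported for Figure 3, and the function names and the toy graph are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List, Set, Tuple

Triplet = Tuple[str, str, str]


def top_k_shortest_paths(triplets: Iterable[Triplet], source: str, target: str,
                         k: int = 5, max_hops: int = 6) -> List[List[Triplet]]:
    """Enumerate up to k simple paths from source to target, shortest first."""
    if source == target:
        return [[]]  # zero-hop path
    out_edges: Dict[str, List[Tuple[str, str]]] = {}
    for h, r, t in triplets:
        out_edges.setdefault(h, []).append((r, t))
    found: List[List[Triplet]] = []
    queue = deque([(source, [])])  # (current entity, path so far)
    while queue and len(found) < k:
        node, path = queue.popleft()
        if node == target:
            found.append(path)
            continue
        if len(path) == max_hops:
            continue
        visited = {source} | {t for _, _, t in path}
        for r, t in out_edges.get(node, []):
            if t not in visited:
                queue.append((t, path + [(node, r, t)]))
    return found


def sample_subgraph(triplets: Set[Triplet], sources: Set[str], targets: Set[str],
                    k: int = 5) -> Set[Triplet]:
    """Union of the triplets on the top-k shortest paths of every source-target pair."""
    subgraph: Set[Triplet] = set()
    for s in sources:
        for t in targets:
            for path in top_k_shortest_paths(triplets, s, t, k):
                subgraph.update(path)
    return subgraph


# Toy fixed graph G; all three triplets are made up for illustration.
G = {
    ("Zhen-Huan", "FriendOf", "JinXi"),
    ("JinXi", "ResidenceOf", "Yongshou-Palace"),
    ("Zhen-Huan", "ResidenceOf", "Yangxin-Palace"),
}
K = sample_subgraph(G, sources={"Zhen-Huan"}, targets={"Yongshou-Palace"})
```

The sampled subgraph keeps only the two triplets on the path from the source to the target, so the unrelated residence triplet is dropped.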

Data Analysis
Data imbalance. As shown in Table 2, the turns with knowledge graph entities account for about 58.9% and 15.93% of HGZHZ and Friends respectively. In Friends, the training data with knowledge graph entities is therefore small, so fine-tuning on the subset with knowledge graph entities might be required.
Shortest paths. The lengths of shortest paths from sources to targets are shown in Figure 3. Most of the probability mass lies on two and three hops rather than zero and one hop, so key-value extraction based text generative models (Eric et al., 2017; Levy et al., 2017; Qian et al., 2018) are not suitable for this task. On the other hand, multi-hop reasoning might be useful for better retrieving correct knowledge graph entities.
Dynamics. The graph edit distances among dynamic knowledge graphs are 57.24 ± 24.34 and 38.16 ± 15.99 for HGZHZ and Friends respectively, revealing that the graph dynamics are spread out: some graphs are slightly changed while some are largely changed, which matches our intuition.

Qadpt: Quick Adaptive Dynamic Knowledge-Grounded Neural Conversation Model
To the best of our knowledge, no prior work has focused on dynamic knowledge-grounded conversation; thus we propose Qadpt as the preliminary model. As illustrated in Figure 4, the model is composed of (1) a Seq2Seq model with a controller, which decides to predict knowledge graph entities $k \in \mathcal{V}$ or generic words $w \in \mathcal{W}$, and (2) a reasoning model, which retrieves the relational paths in the knowledge graph.

Sequence-to-Sequence Model
Qadpt is based on a Seq2Seq model (Sutskever et al., 2014; Vinyals and Le, 2015), where the encoder encodes an input message $x$ into a vector $e(x)$ as the initial state of the decoder. At each time step $t$, the decoder produces a vector $d_t$ conditioned on the ground-truth or predicted $y_1 y_2 \ldots y_{t-1}$. Note that we use gated recurrent units (GRU) in our experiments.
Each predicted $d_t$ is used for three parts: output projection, controller, and reasoning. For output projection, the predicted $d_t$ is transformed into a distribution $w_t$ over generic words $\mathcal{W}$ by a projection layer.

Controller
To decide which vocabulary set (knowledge graph entities $\mathcal{V}$ or generic words $\mathcal{W}$) to use, the vector $d_t$ is transformed to a controller $c_t$, which is a widely-used component (Eric et al., 2017; Zhu et al., 2017; Ke et al., 2018; Zhou et al., 2018b; Xing et al., 2017) similar to the copy mechanism (Gu et al., 2016; Merity et al., 2017). The controller $c_t$ is the probability of choosing from knowledge graph entities $\mathcal{V}$, while $1 - c_t$ is the probability of choosing from generic words $\mathcal{W}$. Note that we take the controller as a special symbol KB in generic words, so the term $1 - c_t$ is already multiplied to $w_t$. The controller here can be flexibly replaced with any other model.

$$w_t = P(\mathcal{W} \mid y_1 y_2 \ldots y_{t-1}, e(x)), \tag{4}$$
$$c_t = P(\mathrm{KB} \mid y_1 y_2 \ldots y_{t-1}, e(x)),$$

where $\phi$ is the output projection layer that produces $w_t$ from $d_t$, $k_t$ is the predicted distribution over knowledge graph entities $\mathcal{V}$ (detailed in subsection 4.3), and $o_t$ is the produced distribution over all vocabularies.
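One way to read the controller description is sketched below: a softmax over the generic vocabulary (which includes the special symbol KB) gives $w_t$, the probability of KB is taken as $c_t$, and the entity distribution $k_t$ is scaled by $c_t$ to form $o_t$. The softmax and the exact way of combining the parts are assumptions of this sketch, and `output_distribution`, the logits, and the entity probabilities are hypothetical.

```python
import math
from typing import Dict


def output_distribution(generic_logits: Dict[str, float], k_t: Dict[str, float],
                        kb_symbol: str = "KB") -> Dict[str, float]:
    """Combine generic-word logits (including the KB symbol) with an entity distribution."""
    top = max(generic_logits.values())
    exps = {w: math.exp(v - top) for w, v in generic_logits.items()}
    total = sum(exps.values())
    w_t = {w: e / total for w, e in exps.items()}
    # The probability of the special symbol acts as the controller c_t.
    c_t = w_t.pop(kb_symbol)
    # The remaining generic words already sum to 1 - c_t.
    o_t = dict(w_t)
    for entity, prob in k_t.items():
        o_t[entity] = c_t * prob
    return o_t


o_t = output_distribution(
    {"is": 0.0, "in": 0.0, "KB": 0.0},                   # made-up logits
    {"Yongshou-Palace": 0.75, "Yangxin-Palace": 0.25},  # made-up entity distribution k_t
)
```

Because the KB symbol shares the softmax with the generic words, the combined $o_t$ sums to one without an explicit $1 - c_t$ factor.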

Reasoning Model
To ensure that Qadpt can zero-shot adapt to dynamic knowledge graphs, instead of using an attention mechanism on graph embeddings (Ghazvininejad et al., 2018; Zhou et al., 2018b), we leverage the concept of multi-hop reasoning (Lao et al., 2011). The reasoning procedure can be divided into two stages: (1) forming a transition matrix and (2) reasoning multiple hops by a Markov chain. In the first stage, a transition matrix $T_t$ is viewed as the multiplication of a path matrix $R_t$ and the adjacency matrix $A$ of a knowledge graph $\mathcal{K}$. The adjacency matrix is a binary matrix indicating if the relations between two entities exist. The path matrix is a linear transformation $\theta$ of $d_t$, and represents the probability distribution of each head $h \in \mathcal{V}$ choosing each relation type $r \in \mathcal{L}$. Note that a self-loop relation type is added. The transition matrix is thus

$$T_t = R_t A,$$

where $R_t \in \mathbb{R}^{|\mathcal{V}| \times 1 \times |\mathcal{L}|}$, $A \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{L}| \times |\mathcal{V}|}$, and $T_t \in \mathbb{R}^{|\mathcal{V}| \times |\mathcal{V}|}$. In the second stage, a binary vector $s \in \mathbb{R}^{|\mathcal{V}|}$ is computed to indicate whether each knowledge entity exists in the input message $x$. First, the vector $s$ is multiplied by the transition matrix. A new vector $s^\top T_t$ is then produced to denote the new probability distribution over knowledge entities after one-hop reasoning. After $N$ times reasoning⁵, the final probability distribution over knowledge entities is taken as the generated knowledge entity distribution $k_t = s^\top (T_t)^N$. The loss function is the cross-entropy between the predicted distribution $o_t$ and the ground-truth distribution, with $\psi$ denoting the parameters of the GRU layers. Compared to prior work, the proposed reasoning approach explicitly models the knowledge reasoning path, so an updated knowledge graph directly changes the results without retraining.

⁵ We choose $N = 6$ because of the maximum length of shortest paths in Figure 3.
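The two reasoning stages can be illustrated with a small pure-Python sketch. It assumes a fixed path matrix `R` (in the model it would be derived from $d_t$), a self-loop relation named "Self", and made-up triplets; the function names are hypothetical. The point of the example is that swapping the graph changes the entity distribution while `R` stays untouched.

```python
from typing import List, Set, Tuple

Triplet = Tuple[str, str, str]


def build_adjacency(entities: List[str], relations: List[str],
                    triplets: Set[Triplet]) -> List[List[List[float]]]:
    """A[h][r][t] = 1 if (h, r, t) is in the graph; a self-loop is added for every entity."""
    index = {e: i for i, e in enumerate(entities)}
    A = [[[0.0] * len(entities) for _ in relations] for _ in entities]
    for h, r, t in triplets:
        A[index[h]][relations.index(r)][index[t]] = 1.0
    self_r = relations.index("Self")
    for i in range(len(entities)):
        A[i][self_r][i] = 1.0
    return A


def transition(R: List[List[float]], A: List[List[List[float]]]) -> List[List[float]]:
    """T[h][t] = sum over r of R[h][r] * A[h][r][t]."""
    n = len(A)
    return [[sum(R[h][r] * A[h][r][t] for r in range(len(R[h]))) for t in range(n)]
            for h in range(n)]


def reason(s: List[float], T: List[List[float]], hops: int) -> List[float]:
    """Apply s <- s^T T for the given number of hops."""
    for _ in range(hops):
        s = [sum(s[h] * T[h][t] for h in range(len(s))) for t in range(len(s))]
    return s


entities = ["JinXi", "Yongshou-Palace", "Yangxin-Palace"]
relations = ["ResidenceOf", "Self"]
# Assumed path matrix: JinXi follows ResidenceOf, the palaces stay in place.
R = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
s = [1.0, 0.0, 0.0]  # JinXi appears in the input message

K_old = {("JinXi", "ResidenceOf", "Yongshou-Palace")}
K_new = {("JinXi", "ResidenceOf", "Yangxin-Palace")}
k_old = reason(s, transition(R, build_adjacency(entities, relations, K_old)), hops=6)
k_new = reason(s, transition(R, build_adjacency(entities, relations, K_new)), hops=6)
```

Note that this sketch does not renormalize the rows of the transition matrix, so probability mass is lost whenever a head selects a relation that has no tail in the graph.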

Inferring Reasoning Paths
Because this reasoning method is stochastic, we compute the probabilities of the possible reasoning paths by the reasoning model, and infer the one with the largest probability as the retrieved reasoning path.
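A hedged sketch of this inference step is given below. It uses a Viterbi-style dynamic program, which is a choice made for this illustration and not necessarily how the authors enumerate paths; the probability of a path is taken as the product of $R[h][r] \cdot A[h][r][t]$ over its hops. Entities and relations are referred to by index, and all numbers are made up.

```python
from typing import Dict, List, Tuple

Step = Tuple[int, int, int]  # (head index, relation index, tail index)


def most_probable_path(start: int, target: int, R: List[List[float]],
                       A: List[List[List[float]]], hops: int) -> Tuple[float, List[Step]]:
    """Return the highest-probability path of exactly `hops` steps from start to target."""
    n = len(A)
    # best[e] = (probability, path) of the best partial path that ends at entity e
    best: Dict[int, Tuple[float, List[Step]]] = {start: (1.0, [])}
    for _ in range(hops):
        nxt: Dict[int, Tuple[float, List[Step]]] = {}
        for h, (p, path) in best.items():
            for r in range(len(R[h])):
                for t in range(n):
                    q = p * R[h][r] * A[h][r][t]
                    if q > nxt.get(t, (0.0, []))[0]:
                        nxt[t] = (q, path + [(h, r, t)])
        best = nxt
    return best.get(target, (0.0, []))


# Entities: 0 = Zhen-Huan, 1 = JinXi, 2 = Yongshou-Palace (toy graph).
# Relations: 0 = FriendOf, 1 = ResidenceOf, 2 = Self.
A = [[[0.0] * 3 for _ in range(3)] for _ in range(3)]
A[0][0][1] = 1.0          # (Zhen-Huan, FriendOf, JinXi)
A[1][1][2] = 1.0          # (JinXi, ResidenceOf, Yongshou-Palace)
for e in range(3):
    A[e][2][e] = 1.0      # self-loops
R = [[0.8, 0.0, 0.2], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]]  # assumed path matrix

prob, path = most_probable_path(start=0, target=2, R=R, A=A, hops=3)
```

In this toy setting the best three-hop path follows FriendOf, then ResidenceOf, then the self-loop, and the self-loop is what lets a short relational path fill a fixed number of hops.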

Related Work
The proposed task is motivated by prior knowledge-grounded conversation tasks (Ghazvininejad et al., 2018;Zhou et al., 2018b), but further requires the capability to adapt to dynamic knowledge graphs.
For knowledge from unstructured texts, Ghazvininejad et al. (2018) used bag-of-words representations and Long et al. (2017) applied a convolutional neural network to encode the whole texts. With structured knowledge graphs, Zhu et al. (2017) and Zhou et al. (2018b) utilized graph embedding methods (e.g., TransE (Bordes et al., 2013)) to encode each triplet.
The above methods generated responses without an explicit relationship to each external knowledge triplet. Therefore, when a triplet is added or deleted, it is unknown whether their generated responses can change accordingly. Moon et al. (2019) recently presented a similar concept, walking on the knowledge graph, for response generation. Nonetheless, their purpose is to find explainable paths on a large-scale knowledge graph instead of adapting to changed knowledge graphs. Hence, their attention-based graph walker may suffer from the same issue as previous embedding-based methods.

Multi-Hop Reasoning
We leverage multi-hop reasoning (Lao et al., 2011) to allow our model to quickly adapt to dynamic knowledge graphs. Recently, prior work used convolutional neural networks (Toutanova et al., 2015), recurrent neural networks (Neelakantan et al., 2015; Das et al., 2017), and reinforcement learning (Das et al., 2018; Shen et al., 2018) to model multi-hop reasoning on knowledge graphs, and has proved this concept useful in link prediction. These reasoning models, however, have not yet been explored for dialogue generation. The proposed model is the first attempt at adapting conversations via a reasoning procedure.

Experiments
For all models, we use gated recurrent unit (GRU) based Seq2Seq models (Chung et al., 2014; Vinyals and Le, 2015). Both the encoder and decoder for HGZHZ have 256 dimensions with 1 layer; those for Friends have 128 dimensions with 1 layer. We benchmark the task, dynamic knowledge-grounded dialogue generation, and the corpus DyKgChat by providing a detailed comparison between the prior conversational models and our proposed model as the preliminary experiments. We evaluate their capability of quick adaptation by randomizing the whole, last 1, and last 2 reasoning paths as described in Section 2.1.2. We evaluate the produced responses by sentence-level BLEU-2 (Papineni et al., 2002; Liu et al., 2016), perplexity, distinct-n, and our proposed metrics for predicting knowledge entities described in Section 2.1.1.
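Among these measures, distinct-n is simple enough to sketch directly. The version below assumes the commonly used corpus-level definition, the number of unique n-grams divided by the total number of n-grams over all generated responses; whether the paper pools n-grams in exactly this way is an assumption, and the toy responses are made up.

```python
from typing import List, Sequence


def distinct_n(responses: Sequence[List[str]], n: int) -> float:
    """Ratio of unique n-grams to all n-grams across the generated responses."""
    ngrams = [tuple(tokens[i:i + n])
              for tokens in responses
              for i in range(len(tokens) - n + 1)]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


responses = [["i", "do", "not", "know"], ["i", "do", "not", "care"]]
distinct_1 = distinct_n(responses, 1)  # 5 unique unigrams out of 8
distinct_2 = distinct_n(responses, 2)  # 4 unique bigrams out of 6
```

A model that keeps producing the same generic response scores low on this measure, which is why it complements BLEU-2 and perplexity.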
Because of the significant data imbalance of Friends, we first train on whole training data, and then fine-tune the models using the subset containing knowledge entities. Early stopping is adopted in all experiments.

Baselines
We compare our model with prior knowledge-grounded conversation models: the memory network (MemNet) (Ghazvininejad et al., 2018) and the knowledge-aware model (KAware) (Zhu et al., 2017; Zhou et al., 2018b). We also leverage the topic-aware model (TAware) (Xing et al., 2017; Wu et al., 2018; Zhou et al., 2018a) by attending on knowledge graphs and using two separate output projection layers (generic words and all knowledge graph entities). In our experiments, MemNet is modified for fair comparison, where the memory pool of MemNet stores TransE embeddings of knowledge triples (Zhou et al., 2018b). The maximum number of stored triplets is set to the maximum size of all knowledge graphs. A multi-hop version of MemNet (Weston et al., 2015b) is also implemented (MemNet+multi)⁶. To empirically achieve better performance, we also utilize the attention of MemNet for TAware and KAware. Moreover, we empirically find that multi-hop MemNet deteriorates the performance of KAware (compared to one-hop), while it could enhance the performance of TAware. A standard Seq2Seq model (Vinyals and Le, 2015) is also shown as a baseline without using external knowledge. We also leverage multi-hop MemNet and the attention of TAware to strengthen Qadpt (+multi and +TAware).

Results
As shown in Table 4, MemNet, TAware and KAware significantly change when the knowledge graphs are largely updated (All) and can also achieve a good accurate change rate. For them, the more parts are updated (All >> Last2 > Last1), the more changes and accurate changes occur. However, when the knowledge graphs are slightly updated (Last1 and Last2), the portion of accurate changes over total changes (e.g., the Last1 score 1.17/31.78 for HGZHZ with the MemNet model) is very low. Among the baselines, KAware has better performance on Last1. On the other hand, Qadpt outperforms all baselines when the knowledge graphs slightly change (Last1 and Last2) in terms of accurate change rate. The proportion of accurate changes over total changes also shows significantly better performance than the prior models. Figure 5 shows the distribution of lengths of the inferred relation paths for Qadpt models. After combining TAware or MemNet, the distribution becomes more similar to the test data.
Table 5 shows the results of the proposed metrics for correctly predicting knowledge graph entities. On both HGZHZ and Friends, TAware+multi and Qadpt significantly outperform MemNet for KW-Acc and KW/Generic, and MemNet outperforms all other models by KW/Generic precision (100%). This demonstrates that these models can better predict knowledge graph entities, but are slightly worse at making good choices of when to predict generic words (KW/Generic).
Table 6 presents the BLEU-2 scores (as recommended in the prior work (Liu et al., 2016)), perplexity (PPL), and distinct scores. The results show that all models have similar levels of BLEU-2 and PPL, while Qadpt+multi has slightly better distinct scores. The results suggest the same claim as Liu et al. (2016)

Human Evaluation
To perform human evaluation, we randomly select examples from the knowledge-related outputs of all models, because it is difficult for humans to distinguish which generic response is better. We recruit fifteen annotators to judge the results. Each annotator was randomly assigned 20 examples and was guided to rank the results of five models: Seq2Seq, MemNet, TAware, KAware, and Qadpt. They were asked to rank all results according to two criteria: (1) fluency and (2) information. Fluency measures which output is more proper as a response to a given input message. Information measures which output contains more correct information (in terms of knowledge words here) according to a given input message and a reference response. The evaluation results are classified into "win", "tie", and "lose" for comparison. The human evaluation results and the annotator agreement in the form of Cohen's kappa (Cohen, 1960) are reported in Table 7. According to a magnitude guideline (Landis and Koch, 1977), most agreements are substantial (0.6-0.8), while some agreements on Friends are moderate (0.4-0.6). In most cases of Table 7, Qadpt outperforms the other four models. However, in Friends, Qadpt, MemNet, and TAware tie closely. The reason might be the lower agreement on Friends, or it may simply reflect the similar trend in the automatic evaluation metrics. There are two additional observations. First, Qadpt beats MemNet and TAware fewer times than it beats Seq2Seq and KAware, which aligns with Table 5 and Table 6. Second, Qadpt beats the baselines more often in fluency than in information, and many more ties happen in the information fields than in the fluency fields. This is probably because the examples were selected to contain knowledge, so the models differ little in the amount of information they convey. Overall, the human evaluation results can be considered as a reference because of the substantial agreement among annotators and the similar trend with the automatic evaluation.
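For reference, Cohen's kappa compares the observed agreement of two raters with the agreement expected by chance. The sketch below assumes two annotators who label the same items with "win", "tie", or "lose"; how the fifteen annotators are paired in the paper is not stated here, and the labels are made up.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa for two raters who labeled the same items."""
    if len(a) != len(b) or not a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always use the same single label
    return (observed - expected) / (1.0 - expected)


rater_1 = ["win", "win", "tie", "lose", "win", "lose", "tie", "win"]
rater_2 = ["win", "win", "tie", "lose", "tie", "lose", "tie", "win"]
kappa = cohens_kappa(rater_1, rater_2)
```

With seven agreements out of eight items, the observed agreement is 0.875 and the chance agreement is 22/64, giving a kappa of 17/21, about 0.81.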

Discussion
The results demonstrate that MemNet, TAware and Qadpt generally perform better than the other two baselines, and they excel at different aspects. MemNet can successfully incorporate knowledge graphs and generate sentences with both appropriate knowledge entities and generic words. In contrast, TAware and Qadpt predict more correct knowledge entities but tend to diminish generic words.
For the scenario of zero-shot adaptation, MemNet and TAware show their ability to update responses when the knowledge graphs are largely changed. On the other hand, Qadpt is better at capturing minor dynamic changes (Last1 and Last2) and updates the responses according to the new knowledge graphs. Some examples are given in the Appendix. This demonstrates that MemNet and TAware attend to the whole graph instead of focusing on the most influential part.

Conclusion
This paper presents a new task, dynamic knowledge-grounded conversation generation, and a new dataset DyKgChat for evaluation. The dataset is currently provided with a Chinese and an English TV series as well as their corresponding knowledge graphs. This paper also benchmarks the task and dataset by proposing automatic evaluation metrics and baseline models, which can motivate future research directions.