Knowledge Diffusion for Neural Dialogue Generation

End-to-end neural dialogue generation has shown promising results recently, but it does not employ knowledge to guide the generation and hence tends to generate short, general, and meaningless responses. In this paper, we propose a neural knowledge diffusion (NKD) model to introduce knowledge into dialogue generation. This method can not only match the relevant facts for the input utterance but also diffuse them to similar entities. With the help of facts matching and entity diffusion, the neural dialogue generation is augmented with the ability of convergent and divergent thinking over the knowledge base. Our empirical study on a real-world dataset shows that our model is capable of generating meaningful, diverse and natural responses for both factoid questions and knowledge grounded chit-chats. The experimental results also show that our model outperforms competitive baseline models significantly.


Introduction
Dialogue systems have received increasing attention in recent years. Given previous utterances, a dialogue system aims to generate a proper response in a natural way. Compared with the traditional pipeline-based dialogue system, the new method based on the sequence-to-sequence model (Shang et al., 2015; Vinyals and Le, 2015; Cho et al., 2014) impressed the research community with its elegant simplicity. Such methods usually work in an end-to-end manner: utterances are encoded by a recurrent neural network, while responses are generated sequentially by another (sometimes identical) recurrent neural network. However, due to the lack of universal background knowledge and common sense, the end-to-end data-driven structure inherently tends to generate meaningless and short responses, such as "haha" or "I don't know." To bridge the gap in common knowledge between humans and computers, different kinds of knowledge bases (e.g., Freebase (Google, 2013) and DBpedia (Lehmann et al., 2017)) are leveraged. A related application of knowledge bases is question answering, where the given questions are first analyzed, related facts are then retrieved from knowledge bases (KBs), and finally the answers are generated. The facts are usually presented in the form of "subject-relation-object" triplets, where the subject and object are entities. With the aid of knowledge triplets, neural generative question answering systems are capable of answering fact-related inquiries (Yin et al., 2016; Zhu et al., 2017; He et al., 2017a), WH questions in particular, like "who is Yao Ming's wife?".
* Work done when the first author was an intern at Data Science Lab, JD.com.
Although answering inquiries is essential for dialogue systems, especially for task-oriented dialogue systems (Eric et al., 2017), it still falls far short of a natural knowledge grounded dialogue system, which should be able to understand the facts involved in the current dialogue session (so-called facts matching), as well as diffuse them to other similar entities for knowledge-based chit-chats (i.e., entity diffusion):
1) Facts matching: in dialogue systems, matching utterances to exact facts is much harder than answering explicit factoid inquiries. Though some utterances are fact-related inquiries whose subjects and relations can be easily recognized, for other utterances the subjects and relations are elusive, which makes exact facts matching difficult. Table 1 shows an example: items 1 and 2 both talk about the film "Titanic". Unlike item 1, which is a typical question answering conversation, item 2 is a knowledge related chit-chat without any explicit relation. It is difficult to define the exact fact match for item 2.
2) Entity diffusion: another noticeable phenomenon is that the conversation usually drifts from one entity to another. In Table 1, the utterances in items 3 and 4 are about the entity "Titanic"; however, the entities in the responses are other similar films. Such entity diffusion relations are rarely captured by the current knowledge triplets. The response in item 3 shows that the two entities "Titanic" and "Waterloo Bridge" are related through "love stories". Item 4 suggests another shipwreck film similar to "Titanic".
Table 1: Examples of knowledge grounded conversations. Knowledge entities are underlined.
To deal with the aforementioned challenges, in this paper, we propose a neural knowledge diffusion (NKD) dialogue system to benefit the neural dialogue generation with the ability of both convergent and divergent thinking over the knowledge base, and handle factoid QA and knowledge grounded chit-chats simultaneously. NKD learns to match utterances to relevant facts; the matched facts are then diffused to similar entities; and finally, the model generates the responses with respect to all the retrieved knowledge items.
In general, our contributions are as follows:
• We identify the problem of incorporating knowledge bases into dialogue systems as facts matching and entity diffusion.
• We manage both facts matching and entity diffusion by introducing a novel knowledge diffusion mechanism and generate the responses with the retrieved knowledge items, which enables convergent and divergent thinking over the knowledge base.
• The experimental results show that the proposed model effectively generates more diverse and meaningful responses involving more accurate relevant entities compared with the state-of-the-art baselines.
The corpus will be released upon publication.

Model
Given the input utterance $X = (x_1, x_2, ..., x_{N_X})$, NKD produces a response $Y = (y_1, y_2, ..., y_{N_Y})$ containing the entities from the knowledge base $K$. $N_X$ and $N_Y$ are the numbers of tokens in the utterance and response respectively. The knowledge base $K$ is a collection of knowledge facts in the form of triplets (subject, relation, object). In particular, both subjects and objects are entities in this work. As illustrated in Figure 1, the model mainly consists of four components:
1. An encoder encodes the input utterance $X$ into a vector representation.
Our work is built on hierarchical recurrent encoder-decoder architecture (Sordoni et al., 2015a), and a knowledge retriever network integrates the structured knowledge base into the dialogue system.
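To make the triplet structure of the knowledge base concrete, the following is a minimal sketch of an in-memory store with a naive string-matching lookup. The class names `Fact` and `KnowledgeBase`, the relation labels, and the case-insensitive substring match are illustrative assumptions, not the data structures used in this work.

```python
from collections import defaultdict
from typing import NamedTuple


class Fact(NamedTuple):
    """One knowledge triplet; subject and object are both entities."""
    subject: str
    relation: str
    obj: str


class KnowledgeBase:
    """Toy in-memory store for (subject, relation, object) triplets (hypothetical)."""

    def __init__(self, facts):
        self.facts = list(facts)
        self.by_entity = defaultdict(list)
        for fact in self.facts:
            # Index each fact under both of its entities.
            self.by_entity[fact.subject].append(fact)
            self.by_entity[fact.obj].append(fact)

    def match(self, utterance):
        """Return facts whose subject or object appears verbatim in the utterance."""
        text = utterance.lower()
        found = []
        for entity in sorted(self.by_entity):
            if entity.lower() in text:
                for fact in self.by_entity[entity]:
                    if fact not in found:
                        found.append(fact)
        return found


# Relation labels below are made up for illustration.
kb = KnowledgeBase([
    Fact("Titanic", "directed_by", "James Cameron"),
    Fact("Titanic", "genre", "romance"),
    Fact("Waterloo Bridge", "genre", "romance"),
])
facts = kb.match("Who is the director of Titanic?")
```

Here `facts` holds the two triplets whose subject is "Titanic". A real system would replace the substring test with entity linking or named entity recognition.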

Encoder
The encoder transforms discrete tokens into vector representations. To capture different aspects of the information, we learn utterance representations with two independent RNNs, resulting in two hidden state sequences. One final hidden state, $h^C_{N_X}$, is used as the input of the context RNN to track the dialogue state. The other final hidden state, $h^K_{N_X}$, is used in the knowledge retriever and is designed to encode the knowledge entities and relations within the input utterance. For instance, in Figure 1, "director" and "Titanic" in $X_1$ are knowledge elements.

Knowledge Retriever
The knowledge retriever extracts a certain number of facts from the knowledge base and specifies their importance. It equips the knowledge grounded neural dialogue system with convergent and divergent thinking ability through facts matching and entity diffusion. Figure 2 illustrates the process.

Facts Matching
Given the input utterance $X$ and $h^K_{N_X}$, relevant facts are extracted from both the knowledge base and the dialogue history. A predefined number of relevant facts $F = \{f_1, f_2, ..., f_{N_f}\}$ are obtained through string matching, entity linking or named entity recognition. As shown in Figure 2, in the first sentence, "Titanic" is recognized as an entity, and all the relevant knowledge triplets are extracted. Then, these entities and knowledge triplets are transformed into fact representations $h_f = \{h_{f_1}, h_{f_2}, ..., h_{f_{N_f}}\}$ by averaging the entity embedding and relation embedding. The relevance coefficient $r^f$ between a fact and the input utterance, ranging from 0 to 1, is calculated by a nonlinear function or a sub neural network. Here, we apply a multi-layer perceptron (MLP):
$$r^f_k = \mathrm{MLP}([h^K_{N_X}, h_{f_k}])$$
For multi-turn conversations, entities in previous utterances are also inherited and retained, as depicted by the dotted lines in Figure 2. For instance, in the second sentence of Figure 2 (the right one), no new fact is extracted from the input utterance. Thus it is necessary to record the history entities "Titanic" and "James Cameron". We summarize the facts as the relevant fact representation $C_f$ through a weighted average of the fact representations $h_f$:
$$C_f = \frac{\sum_{k=1}^{N_f} r^f_k h_{f_k}}{\sum_{k=1}^{N_f} r^f_k}$$
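The facts matching step can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the trained model: the MLP has one hidden layer with random weights, its input is the concatenation of the utterance state and a fact representation, the vectors are random stand-ins for learned embeddings, and the helper names (`make_mlp`, `fact_representation`, `weighted_average`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def make_mlp(in_dim, hidden_dim, rng):
    """Randomly initialised one-hidden-layer MLP with a sigmoid output in (0, 1)."""
    w1 = [[rng.uniform(-0.5, 0.5) for _ in range(in_dim)] for _ in range(hidden_dim)]
    w2 = [rng.uniform(-0.5, 0.5) for _ in range(hidden_dim)]

    def mlp(x):
        hidden = [math.tanh(sum(w * v for w, v in zip(row, x))) for row in w1]
        return sigmoid(sum(w * h for w, h in zip(w2, hidden)))

    return mlp


def fact_representation(entity_emb, relation_emb):
    """Average an entity embedding and a relation embedding."""
    return [(e + r) / 2.0 for e, r in zip(entity_emb, relation_emb)]


def weighted_average(weights, vectors):
    """Average of equally sized vectors, weighted by the given coefficients."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) / total for i in range(dim)]


rng = random.Random(0)
dim = 4
h_k = [rng.uniform(-1, 1) for _ in range(dim)]  # stand-in for the utterance state
facts = [
    fact_representation([rng.uniform(-1, 1) for _ in range(dim)],
                        [rng.uniform(-1, 1) for _ in range(dim)])
    for _ in range(3)
]
relevance_mlp = make_mlp(2 * dim, 8, rng)
r_f = [relevance_mlp(h_k + h_f) for h_f in facts]  # one relevance coefficient per fact
c_f = weighted_average(r_f, facts)                 # relevant fact representation
```

Because the output layer is a sigmoid, every coefficient stays strictly between 0 and 1, and `c_f` is a convex combination of the fact representations.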

Entity Diffusion
To retrieve other relevant entities, which are typically not mentioned in the dialogue utterance, we diffuse the matched facts. We calculate the similarity between the entities in the knowledge base (except those that have occurred in previous utterances) and the relevant fact representation through a multi-layer perceptron, resulting in a similarity coefficient $r^e$, ranging from 0 to 1:
$$r^e_k = \mathrm{MLP}([h^K_{N_X}, C_f, e_k])$$
where $e_k$ is the entity embedding. The top $N_e$ entities $E = \{e_1, e_2, ..., e_{N_e}\}$ are selected as similar entities. Then, the similar entity representation $C_s$ is formalized as:
$$C_s = \frac{\sum_{k=1}^{N_e} r^e_k e_k}{\sum_{k=1}^{N_e} r^e_k}$$
Back to the example in Figure 2: in the first turn, the matched fact of the input utterance (Titanic, direct by, James Cameron) has a high relevance coefficient in "facts matching", as expected. When a fact is matched, entity diffusion is intuitively unnecessary. In this case, we observe from Figure 2 that the entities in "entity diffusion" have low similarities. In the second turn, no triplet matches the utterance, while the entity "Titanic" achieves a much higher relevance score. Then, in "entity diffusion", the similar entities "Waterloo Bridge" and "Poseidon" get relatively higher similarity weights than in the first turn.
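The selection step of entity diffusion can be illustrated with a short sketch. Several things here are assumptions for illustration only: the scoring function `dot_sigmoid` is a simple stand-in for the learned MLP, the two-dimensional embeddings are made-up toy values, the entity "Toy Story" is added purely as a dissimilar distractor, and the similar entity representation is taken to be a similarity-weighted average of the selected embeddings.

```python
import math


def dot_sigmoid(c_f):
    """Stand-in similarity: sigmoid of the dot product with the fact representation."""
    def score(emb):
        return 1.0 / (1.0 + math.exp(-sum(a * b for a, b in zip(c_f, emb))))
    return score


def diffuse(similarity, candidates, seen, top_n):
    """Score unseen entities and keep the top_n most similar ones.

    similarity: callable mapping an entity embedding to a score in [0, 1]
    candidates: dict of entity name -> embedding
    seen: names of entities already mentioned in previous utterances (excluded)
    """
    scored = [(similarity(emb), name) for name, emb in candidates.items() if name not in seen]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = scored[:top_n]
    names = [name for _, name in top]
    r_e = [score for score, _ in top]
    total = sum(r_e)
    dim = len(next(iter(candidates.values())))
    # Similarity-weighted average of the selected entity embeddings.
    c_s = [sum(s * candidates[n][i] for s, n in zip(r_e, names)) / total for i in range(dim)]
    return names, r_e, c_s


c_f = [1.0, 0.0]
entities = {
    "Titanic": [1.0, 0.0],
    "Waterloo Bridge": [0.9, 0.1],
    "Poseidon": [0.8, -0.2],
    "Toy Story": [-1.0, 0.5],
}
names, r_e, c_s = diffuse(dot_sigmoid(c_f), entities, seen={"Titanic"}, top_n=2)
```

With these toy values, "Titanic" is excluded because it already occurred in the dialogue, and the two films closest to the fact representation are retained.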

Context RNN
The context RNN records the utterance-level dialogue state. It takes in the utterance representation and the knowledge representations. The hidden state of the context RNN is updated as:
$$h^T_t = \mathrm{RNN}(h^C_{N_X}, [C_f, C_s], h^T_{t-1})$$
$h^T_t$ is then conveyed to the decoder to guide the response generation.

Decoder
The decoder generates the response sequentially through a word generator conditioned on $h^T_t$, $C_f$ and $C_s$. Let $C$ denote the concatenation of $h^T_t$, $C_f$ and $C_s$. The knowledge items coefficient $R$ is the concatenation of the relevance coefficient $r^f$ and the similarity coefficient $r^e$. We introduce two variants of the word generator:
Vanilla decoder simply generates the response $Y = (y_1, y_2, ..., y_{N_Y})$ according to $C$ and $R$. The probability of the response is
$$p(y_1, ..., y_{N_Y} | C, R; \theta) = p(y_1 | C, R; \theta) \prod_{t=2}^{N_Y} p(y_t | y_1, ..., y_{t-1}, C, R; \theta)$$
where $\theta$ denotes the model parameters. The conditional probability of $y_t$ is specified by $p(y_t | y_1, ..., y_{t-1}, C, R; \theta) = p(y_t | y_{t-1}, s_t, C, R; \theta)$, where $y_t$ is the embedding of the vocabulary or object entities of retrieved knowledge items, and $s_t$ is the decoder RNN hidden state.
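Probabilistic gated decoder utilizes a gating variable $z_t$ (Yin et al., 2016) to indicate whether the $t$-th word is generated from the common vocabulary or from knowledge entities. The probability of generating the $t$-th word is given by:
$$p(y_t | y_{t-1}, s_t, C, R; \theta) = p(z_t = 0 | s_t; \theta)\, p(y_t | y_{t-1}, s_t, C, z_t = 0; \theta) + p(z_t = 1 | s_t; \theta)\, p(y_t | R, z_t = 1; \theta)$$
where $p(z_t | s_t; \theta)$ is computed by a logistic regression, $p(y_t | R, z_t = 1; \theta)$ is approximated with the knowledge items coefficient $R$, and $\theta$ denotes the model parameters.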
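The gating idea can be illustrated by mixing two word distributions. The sketch below is an assumption-laden toy: the gate probability and the vocabulary distribution are given as plain numbers rather than computed from the decoder state, the function name `gated_word_distribution` is hypothetical, and normalising the coefficients in R into a distribution is one possible reading of "approximated with the knowledge items coefficient".

```python
def gated_word_distribution(p_gate, vocab_probs, knowledge_coeffs):
    """Mix a vocabulary distribution with a knowledge-entity distribution.

    p_gate: probability that the next word is a knowledge entity (z_t = 1)
    vocab_probs: dict word -> probability under z_t = 0, summing to 1
    knowledge_coeffs: dict entity -> non-negative coefficient from R,
        with at least one positive value
    """
    total = sum(knowledge_coeffs.values())
    mixed = {word: (1.0 - p_gate) * p for word, p in vocab_probs.items()}
    for entity, coeff in knowledge_coeffs.items():
        # Assumption of this sketch: R is normalised to sum to 1.
        mixed[entity] = mixed.get(entity, 0.0) + p_gate * coeff / total
    return mixed


dist = gated_word_distribution(
    p_gate=0.8,
    vocab_probs={"the": 0.5, "director": 0.3, "is": 0.2},
    knowledge_coeffs={"James Cameron": 0.9, "Poseidon": 0.1},
)
best = max(dist, key=dist.get)
```

With a gate probability of 0.8 and a dominant coefficient for "James Cameron", the entity receives most of the probability mass, while common words share the remaining 0.2.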
During response generation, if an entity is overused, the response diversity will be reduced. Therefore, once a knowledge item has occurred in the response, its coefficient should be reduced to prevent the item from occurring multiple times. To keep track of the coverage of knowledge items, we update the knowledge items coefficient $R$ at each time step. We explore two coverage tracking mechanisms: 1) Mask coefficient tracker directly reduces the coefficient of the chosen knowledge item to 0 to ensure it can never be selected as the response word again. 2) Coefficient attenuation tracker calculates an attenuation score $i_t$ based on $s_t$, $R_0$, $R_{t-1}$ and $y_{t-1}$:
$$i_t = \mathrm{DNN}(s_t, y_{t-1}, R_0, R_{t-1})$$
and then updates the coefficient as:
$$R_t = i_t \cdot R_{t-1}$$
where $i_t$ ranges from 0 to 1 to gradually decrease the coefficient.
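The two coverage trackers differ only in how they shrink the coefficient of a used knowledge item, which a small sketch makes explicit. The function names are hypothetical, and the attenuation scores are passed in directly here, whereas in the model they are predicted from the decoder state.

```python
def mask_tracker(coeffs, chosen):
    """Set the chosen item's coefficient to 0 so it cannot be generated again."""
    updated = dict(coeffs)
    if chosen in updated:
        updated[chosen] = 0.0
    return updated


def attenuation_tracker(coeffs, attenuation):
    """Scale each coefficient by its attenuation score in [0, 1].

    Items without a score are left unchanged (score 1.0).
    """
    return {item: value * attenuation.get(item, 1.0) for item, value in coeffs.items()}


r_prev = {"James Cameron": 0.9, "Poseidon": 0.3}
r_masked = mask_tracker(r_prev, "James Cameron")
r_attenuated = attenuation_tracker(r_prev, {"James Cameron": 0.5})
```

Masking removes an item from consideration entirely after one use, while attenuation only lowers its weight, so the item may still be generated again if it remains the best candidate.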

Training
The model parameters include the embeddings of the vocabulary, entities and relations, and the parameters of all the model components. The model is differentiable and can be optimized in an end-to-end manner using back-propagation. Given the training data, where $N_d$ is the maximum number of turns of a dialogue, $F$ denotes the set of relevant knowledge and $E$ denotes the set of similar knowledge in the response, the objective is to minimize the negative log-likelihood.

Experiment

Dataset
Most existing knowledge related datasets focus on single-turn factoid question answering (Yin et al., 2016; He et al., 2017b). Here we collect a multi-turn conversation corpus grounded on the knowledge base, which includes not only fact-related inquiries but also knowledge-based chit-chats. The data is publicly available online.
We first obtain the element information of each movie, including the movie's title, publication time, directors, actors and other attributes from https://movie.douban.com/, a popular Chinese social network for movies. Then, entities and relations are extracted as triplets to build the knowledge base K.
To collect the question-answering dialogues, we crawled the corpus from a question-answering forum https://zhidao.baidu.com/.
To gather the knowledge related chit-chat corpus, we mined the dataset from the social forum https://www.douban.com/group/, where users post their comments, feedback, and impressions of films and television shows.
The conversations are grounded on the knowledge using NER, string matching, and hand-crafted scoring and filtering rules. The statistical information of the dataset is shown in Table 2. We observed that the conversations follow a long-tail distribution, where famous films and television shows are discussed repeatedly and the low-rated ones are rarely mentioned.

Experiment Detail
The total of 32977 conversations, consisting of 104567 utterances, is divided into a training set (32177) and a testing set (800). A bi-directional LSTM (Schuster and Paliwal, 1997) is used for the encoder, and the dimension of the LSTM hidden layer is set to 512. For the context RNN, the dimension of the LSTM unit is set to 1024. The dimension of the embeddings shared by the vocabulary, entities and relations is also set to 512 empirically. We use the Adam optimizer (Kingma and Ba, 2014) to update the parameters and clip the gradients at 5.0. It takes 140 to 150 epochs to train the model with a batch size of 80.

Baselines
We compare our neural knowledge diffusion model with three state-of-the-art baselines:
• Seq2Seq: a sequence-to-sequence model with a vanilla RNN encoder-decoder (Shang et al., 2015; Vinyals and Le, 2015).
• GenDS: a neural generative dialogue system that is capable of generating responses based on the input message and the related knowledge base (KB) (Zhu et al., 2017).
Three variants of the neural diffusion dialogue generation model are implemented to verify different configurations of decoders.
• NKD-ori is the original model with a vanilla decoder and a mask coefficient tracker.
• NKD-gated is augmented with a probabilistic gated decoder and a mask coefficient tracker.
• NKD-atte utilizes a vanilla decoder and the coefficient attenuation tracker.

Evaluation Metric
Both automatic and human evaluation metrics are used to analyze the model's performance. To validate the effectiveness of facts matching and diffusion, we calculate entity accuracy and recall on the factoid QA data set as well as on the whole data set. Human evaluation rates the model in three aspects: fluency, knowledge relevance and correctness of the response. All these metrics range from 0 to 3, where 0 represents completely wrong, 1 partially correct, 2 almost correct, and 3 absolutely correct.

Experiment Result
Table 3 displays the accuracy and recall of entities on factoid question answering dialogues. The performance of NKD is slightly better than that of the specific QA solution GenDS, while LSTM and HRED, which are designed for chit-chat, almost fail in this task. All the variants of NKD models are capable of generating entities with an accuracy of 60% to 70%, and NKD-gated achieves the best performance with an accuracy of 77.6% and a recall of 77.3%. Table 4 lists the accuracy and recall of entities on the entire dataset, including both the factoid QA and knowledge grounded chit-chats. Not surprisingly, both NKD-ori and NKD-gated outperform GenDS on the entire dataset, and the relative improvement over GenDS is even higher than the improvement in QA dialogues. This confirms that although NKD and GenDS are comparable in answering factoid questions, NKD is better at introducing the knowledge entities for knowledge grounded chit-chats.
All the NKD variants in Table 4 generate more entities than GenDS. LSTM and HRED also produce a certain number of entities, but with low accuracy and recall. We also noticed that NKD-gated achieves the highest accuracy and recall, but generates fewer entities than NKD-ori and NKD-atte, whereas NKD-atte generates more entities but with relatively low accuracy and recall. This demonstrates that NKD-gated not only learns to generate more entities but also maintains the quality (with a relatively high accuracy and recall).
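The exact definition of entity accuracy and recall is not spelled out in the text, so the following sketch shows one plausible way to compute them. It assumes that accuracy is the fraction of generated entities that also appear in the reference response, that recall is the fraction of reference entities that were generated, and that both are micro-averaged over the test set. The function name and the toy data are illustrative.

```python
def entity_accuracy_recall(generated, gold):
    """Micro-averaged entity accuracy and recall over a set of responses.

    generated, gold: lists of entity sets, one pair per response.
    """
    correct = sum(len(g & r) for g, r in zip(generated, gold))
    n_generated = sum(len(g) for g in generated)
    n_gold = sum(len(r) for r in gold)
    accuracy = correct / n_generated if n_generated else 0.0
    recall = correct / n_gold if n_gold else 0.0
    return accuracy, recall


generated = [{"James Cameron"}, {"Poseidon", "Waterloo Bridge"}]
gold = [{"James Cameron"}, {"Poseidon"}]
accuracy, recall = entity_accuracy_recall(generated, gold)
```

In this toy case two of the three generated entities are correct and both reference entities are covered, which illustrates how a model that generates more entities can raise recall at the cost of accuracy.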
The results of human evaluation in Table 5 also validate the superiority of the proposed model, especially on appropriateness. Responses generated by LSTM and HRED are of high fluency, but are simply repetitions or dull responses such as "I don't know." and "Good.". NKD-gated is more adept at incorporating the knowledge base with respect to appropriateness and correctness, while NKD-atte generates more fluent responses. NKD-ori is a compromise, and obtains the best correctness in completing an entire dialogue. Four evaluators rated the scores independently. The pairwise Cohen's Kappa agreement scores are 0.67 on fluency, 0.54 on appropriateness, and 0.60 on entire correctness, which indicates moderate to substantial annotator agreement.
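For reference, Cohen's Kappa for two raters compares the observed agreement with the agreement expected by chance. The sketch below computes it for categorical ratings and averages it over all rater pairs. Averaging the pairwise scores is an assumption, since the text does not say how the pairwise values were aggregated, and the ratings shown are made-up examples on the 0 to 3 scale.

```python
from collections import Counter
from itertools import combinations


def cohens_kappa(rater_a, rater_b):
    """Cohen's Kappa for two equally long lists of categorical ratings.

    Assumes chance agreement is below 1, i.e. the raters do not both
    use a single identical label for every item.
    """
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1.0 - expected)


def mean_pairwise_kappa(ratings):
    """Average Cohen's Kappa over all pairs of raters."""
    pairs = list(combinations(ratings, 2))
    return sum(cohens_kappa(a, b) for a, b in pairs) / len(pairs)


rater_1 = [3, 3, 2, 1, 0, 3]
rater_2 = [3, 2, 2, 1, 0, 3]
kappa = cohens_kappa(rater_1, rater_2)
```

The two raters agree on five of six items, and the chance-corrected score is lower than the raw agreement of 5/6 because some of that agreement would be expected from the label frequencies alone.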
To our surprise, one variant of NKD, which uses both the probabilistic gated decoder and the coefficient attenuation tracker, does not perform well on the entire dataset. The accuracy of the model is quite high, but the recall is very low compared to the others. We speculate that this is due to minimizing the negative log-likelihood during training, which makes the model tend to generate completely correct answers and therefore reduces the number of generated entities. Table 6 shows typical examples of the generated responses. Both items 1 and 2 are based on fact-relevant utterances. NKD handles these questions by facts matching. Item 3 asks for a recommendation. NKD obtains similar entities by diffusing the entities. For items 4, 5 and 6, no explicit entity appears in the utterances. NKD is able to output appropriate recommendations through entity diffusion. The entities are recorded during the whole dialogue session, so NKD keeps recommending for several turns. Item 7 fails to generate an appropriate response because the entity in the golden response does not appear in the training set, which suggests future work on out-of-vocabulary cases.

Related Work
The success of the sequence-to-sequence architecture (Cho et al., 2014; Sutskever et al., 2014) motivated investigation into dialogue systems that can effectively learn to generate a response sequence given the previous utterance sequence (Shang et al., 2015; Sordoni et al., 2015b; Vinyals and Le, 2015). The model is trained to minimize the negative log-likelihood of the training data. Despite the current progress, the lack of response diversity is a notorious problem, where the model inherently tends to generate short, general responses regardless of the input. Li et al. (2016a), Serban et al. (2017) and Cao and Clark (2017) suggested that these boring responses are common in training data and that shorter responses are more likely to be given a higher likelihood. To tackle the problem, Li et al. (2016a) introduced a maximum mutual information training objective. Serban et al. (2017), Cao and Clark (2017) and Chen et al. (2018) used latent variables to introduce stochasticity to enhance the response diversity. Vijayakumar et al. (2016), Shao et al. (2017) and Li et al. (2016b) recognized that the greedy search decoding process, especially beam search with a wide beam size, causes short responses to receive higher likelihoods. They preserved more diverse candidates during beam-search decoding. In this paper, we argue that the absence of background knowledge and common sense is another source of the lack of diversity. We augment end-to-end dialogue generation with a knowledge base.
Another line of research comes from the utilization of knowledge bases. A typical application is question-answering (QA) systems. End-to-end QA also resorts to the encoder-decoder framework (Yin et al., 2016; He et al., 2017a). Yin et al. (2016) queried the knowledge base to retrieve one fact and answered simple factoid questions by referring to the fact. He et al. (2017a) extended this approach by adding a copying mechanism that enables the output words to be copied from the original input sequence. Eric et al. (2017) noticed that neural task-oriented dialogue systems often struggle to smoothly interface with a knowledge base, and they addressed the problem by augmenting the end-to-end structure with a key-value retrieval mechanism where a separate attention is performed over the key of each entry in the KB. Ghazvininejad et al. (2017) represented the unstructured text as a bag-of-words representation and also performed soft attention over the facts to retrieve a facts vector. Zhu et al. (2017) generated responses with any number of answer entities in the structured KB, even when these entities never appear in the training set. Dhingra et al. (2017) proposed a multi-turn dialogue agent which helps users search a knowledge base by soft KB lookup. In our model, we perform not only facts matching to answer factoid inquiries, but also entity diffusion to infer similar entities. Given previous utterances, we retrieve the relevant facts, diffuse them, and generate responses based on diversified relevant knowledge items.

Conclusion
In this paper, we identify the knowledge diffusion in conversations and propose an end-to-end neural knowledge diffusion model to deal with the problem. The model integrates the dialogue system with the knowledge base through both facts matching and entity diffusion, which enables convergent and divergent thinking over the knowledge base. Under this mechanism, factoid question answering and knowledge grounded chit-chats can be tackled together. Empirical results show that the proposed model is able to generate more meaningful and diverse responses compared with the state-of-the-art baselines. In future work, we plan to introduce reinforcement learning and knowledge base reasoning mechanisms to improve the performance.