Diverse and Informative Dialogue Generation with Context-Specific Commonsense Knowledge Awareness

Generative dialogue systems tend to produce generic responses, which often leads to boring conversations. To alleviate this issue, recent studies have proposed retrieving knowledge facts from knowledge graphs and introducing them into generation. While this paradigm works to a certain extent, it usually retrieves knowledge facts based only on the entity word itself, without considering the specific dialogue context. Thus, the introduction of context-irrelevant knowledge facts can harm the quality of the generated responses. To this end, this paper proposes a novel commonsense knowledge-aware dialogue generation model, ConKADI. We design a Felicitous Fact mechanism to help the model focus on the knowledge facts that are highly relevant to the context; furthermore, two techniques, Context-Knowledge Fusion and Flexible Mode Fusion, are proposed to facilitate the integration of knowledge in ConKADI. We collect and build a large-scale Chinese dataset aligned with commonsense knowledge for dialogue generation. Extensive evaluations on both a publicly released English dataset and our Chinese dataset demonstrate that ConKADI outperforms the state-of-the-art approach CCM in most experiments.


Introduction
Nowadays, open-domain dialogue response generation systems have shown impressive potential to endow a machine with the ability to converse with a human in natural language (Chen et al., 2017). Although such models have achieved promising performance, they still tend to generate generic and boring responses, such as "I don't know." Such low-quality responses reduce the attractiveness of generative dialogue systems to end-users. Researchers have tried to tackle this problem from multiple aspects, for example, by using an enhanced objective function (Li et al., 2016a) or by introducing additional content (Xu et al., 2019). However, these methods have not solved the issue thoroughly. Unlike a human being, who can associate the dialogue with the background knowledge in his or her mind, a machine can capture only limited information from the surface text of the query message (Ghazvininejad et al., 2018). Consequently, it is difficult for a machine to understand the query fully and then to generate diverse and informative responses.
* Corresponding author: Ying Li, li.ying@pku.edu.cn
To bridge the knowledge gap between humans and machines, researchers have begun to introduce large-scale knowledge graphs to enhance dialogue generation (Zhu et al., 2017; Liu et al., 2018), and they have obtained many impressive results. Generally, the retrieval of knowledge facts is based on the entity name: the first step is to recognize entity words in the given query message, and then the facts that contain the mentioned entities are retrieved as candidates. Subsequently, a knowledge-aware response is generated based on the query message and the previously retrieved facts. Although such a straightforward paradigm works to a certain extent, some challenges in knowledge-aware dialogue generation remain unsolved.
1) An entity word can usually refer to different concepts, i.e., an entity has multiple meanings, but only one specific concept is involved in a particular context. Without considering this, some pre-fetched knowledge fact candidates can be irrelevant to the context.
2) Even if we only consider a particular entity meaning, the related knowledge facts may cover various target topics. However, some of those topics do not contribute to the dialogue generation.
Figure 1 presents an illustrative example of these two issues. Here, a subgraph is retrieved based on the entity word "Apple" in the query. In general, "Apple" can be interpreted as either a type of fruit or a brand name. In this context, it is evident that "Apple" refers to a brand name. However, some knowledge facts concerning a type of fruit are retrieved too. If a model makes an inappropriate choice of irrelevant facts, the generated response will make no sense with respect to the query message. In our example, even among the entities in the blue circle related to the brand name "Apple", only some have a positive effect on the dialogue generation; e.g., "Jobs" should not make any contribution to response "#1".
Figure 1: An illustrative example. #1 shows the response generated with a highly relevant fact; #2 shows the response generated with irrelevant facts.
3) The integration of the knowledge and the dialogue generation in previous approaches is insufficient, including the way of integration, as well as the types of knowledge.
To tackle these challenges, this paper proposes a Context Knowledge-Aware Diverse and Informative conversation generation model, ConKADI. First, we design a Felicitous Fact mechanism to help the model highlight the knowledge facts that are highly relevant to the context, that is, the "Felicitous Facts". The Felicitous Fact mechanism generates a felicitous fact probability distribution over the retrieved facts. To improve the selection of felicitous facts, human-generated answers (i.e., the ground-truth responses) are used as the posterior context knowledge to supervise the training of the prior felicitous fact probability distribution. Next, Context-Knowledge Fusion is proposed to strengthen the role of knowledge facts in dialogue generation by fusing the context and the felicitous knowledge before decoding. Last, ConKADI can generate three types of words owing to the Flexible Mode Fusion module, which aims at simultaneously fusing multiple types of knowledge. To summarize, the Felicitous Fact mechanism alleviates the first two issues, and the other two techniques address the last one. Consequently, our approach improves the utilization rate of knowledge graphs and promotes the diversity and informativeness of the generated responses.
In the experiments, a large-scale Chinese Weibo dataset is collected and aligned with commonsense knowledge for dialogue generation. We perform extensive evaluations on two large-scale datasets: a publicly released English Reddit dataset and our proposed Chinese Weibo dataset. The experimental results demonstrate that ConKADI significantly outperforms representative methods in knowledge utilization, diversity, and informativeness. In particular, ConKADI exceeds the latest knowledge-aware dialogue generation model, CCM, in most experiments.

Related Work
Seq2Seq (Sutskever et al., 2014; Vinyals and Le, 2015) has been widely used in open-domain dialogue generation. However, such models tend to generate generic responses. To tackle this issue, researchers have proposed new objectives (Li et al., 2016a), enhanced decoding algorithms (Li et al., 2016b), and latent-variable based methods (Gao et al., 2019). Introducing additional content into dialogue generation is also helpful: Xu et al. (2019) use meta-words, and Zhu et al. (2019) use retrieved existing dialogues. However, the leading cause of generic responses is that the model cannot obtain enough background knowledge from the query message (Ghazvininejad et al., 2018).
Recently, to alleviate the lack of background knowledge, researchers have begun to introduce knowledge into generation. The knowledge can be unstructured knowledge texts (Ghazvininejad et al., 2018), structured knowledge graphs, or a hybrid of the two. Structured knowledge has the best quality, because it is generally extracted and summarized by humans. A structured knowledge graph can be either domain-specific (Zhu et al., 2017; Liu et al., 2018) or open-domain. ConceptNet (Speer et al., 2017) is a multilingual open-domain commonsense knowledge graph, which is designed to represent general knowledge and to improve the understanding of the meanings behind the words people use. Two previous studies have shown the feasibility of introducing commonsense knowledge into dialogue systems. The first is designed for retrieval-based systems; therefore, only the current state-of-the-art CCM is our direct competitor. In comparison with CCM, 1) ConKADI is aware of the context when using the knowledge, and 2) ConKADI uses human responses as posterior knowledge in training.
In addition, our Felicitous Fact mechanism differs from the word/knowledge selection mechanisms previously proposed in related tasks, for example, selecting a cue word (Mou et al., 2016; Yao et al., 2017) or selecting a piece of knowledge. First, ConKADI can access more contextual information because our model is fully end-to-end, while previous works use independent, external modules. Second, our Felicitous Fact mechanism outputs a probability distribution instead of the hard singleton value output by previous works.

Task Formulation and Model Overview
Formally, the training data $D$ consists of triplets, where each triplet includes a query message $X = (x_1, \ldots, x_n)$, a response $Y = (y_1, \ldots, y_m)$, and a set of commonsense knowledge facts $F = \{f_1, \ldots, f_l\}$. The training goal of knowledge-aware dialogue generation is to maximize the probability of the response $Y$ given $X$ and $F$ over the triplets in $D$.
An overview of ConKADI is shown in Figure 2. The knowledge fact set $F$ is retrieved by the Knowledge Retriever given the query message $X$. The Context Encoder summarizes an utterance into contextual representations. The Felicitous Fact Recognizer calculates the felicitous fact probability distribution $z$ over $F$, which is used to initialize the Decoder and guide the generation. With the Flexible Mode Fusion, the Triple Knowledge Decoder can generate three types of words: vocabulary words, entity words, and copied words.

Felicitous Fact mechanism
Knowledge Retriever: Given a query message $X$, if a word $x_i \in X$ is recognized as an entity word and can be matched to a vertex $e_{src}$ in the knowledge graph $G$, then each neighbour $e_{tgt} \in Neighbour(e_{src})$ and the corresponding directional relation $r$ are retrieved as a candidate fact $f$. $e_{src}$ and $e_{tgt}$ are called the source and target entities, respectively. If a word cannot be matched to any vertex, a special fact $f_{NAF}$ is used.
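The retrieval step described above can be illustrated with a minimal sketch. The toy graph, the `retrieve_facts` helper, the `NAF` placeholder triple, and the choice to add the special fact only once per query are all assumptions made for illustration; they are not taken from the ConKADI implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical toy graph: source entity -> list of (relation, target entity).
Graph = Dict[str, List[Tuple[str, str]]]
Fact = Tuple[str, str, str]  # (source entity, relation, target entity)

# Stand-in for the special fact f_NAF (assumed representation).
NAF: Fact = ("<naf>", "<naf>", "<naf>")


def retrieve_facts(query_tokens: List[str], graph: Graph) -> List[Fact]:
    """Collect one candidate fact per (matched word, neighbour) pair."""
    facts: List[Fact] = []
    for token in query_tokens:
        neighbours = graph.get(token)
        if not neighbours:
            # The word matches no vertex: fall back to the special fact,
            # added at most once here (an assumption of this sketch).
            if NAF not in facts:
                facts.append(NAF)
            continue
        for relation, target in neighbours:
            facts.append((token, relation, target))
    return facts


toy_graph: Graph = {
    "apple": [("IsA", "fruit"), ("RelatedTo", "iphone")],
}
candidates = retrieve_facts(["my", "apple", "broke"], toy_graph)
```

Note that both senses of "apple" are returned here: the retriever itself is context-blind, which is exactly the situation the Felicitous Fact mechanism is meant to handle.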
Context Encoder: The Context Encoder is a bidirectional GRU network (Cho et al., 2014), which reads $X$ or $Y$ and outputs a contextual state sequence. For simplicity, we take $X$ as an example. At time step $t$, the Encoder outputs a forward state and a backward state, and the concatenation of these two states is the contextual state $h^x_t$. The input at each step is $\mathbf{x}_t$, the word embedding of $x_t$. To enhance the semantic information, the matched entity embedding $\mathbf{e}_{x_t}$ of $x_t$ is also fed to the Encoder. Finally, the contextual state sequences of $X$ and $Y$ are denoted as $H^x = (h^x_1, \ldots, h^x_n)$ and $H^y = (h^y_1, \ldots, h^y_m)$. Specifically, $H^x$ is the prior context; $H^y$ is the posterior context, which is only available in the training stage.
Felicitous Fact Recognizer: Recall the example illustrated in Figure 1: some of the preliminarily retrieved knowledge facts may be inappropriate in the dialogue context. The Felicitous Fact Recognizer is designed to detect the facts that highly coincide with the dialogue context, i.e., the Felicitous Facts. It reads the contextual information and then outputs a probability distribution $z \in \mathbb{R}^{l \times 1}$ over $F$; the $i$-th value $z[i]$ indicates the weight of $f_i$. In the training stage, the high-quality human-generated response $Y$ serves as the posterior knowledge; hence, the posterior $z_{post}$ is adopted in training, and the prior $z_{prior}$ is adopted in inference (Eq. 2). Here $\mathbf{F} \in \mathbb{R}^{l \times (d_e + d_r + d_e)}$ is the embedding matrix of the retrieved facts $F$; $W_{ft}$, $W_{post}$, and $W_{prior}$ are trainable parameters; $\eta$ is the softmax activation; and $\varphi$ is the tanh activation. The Kullback-Leibler Divergence (KLD) (Kullback and Leibler, 1951) is used to force the two distributions to become as close as possible; the corresponding loss is denoted as $L_k$.
Context-Knowledge Fusion: To enhance the Decoder's understanding of the background knowledge, the Decoder is initialized based on the fused knowledge $f_z = z^{\top} \mathbf{F}$ and the query context, with a trainable parameter $W_{init}$. Following previous work, we adopt the Bag-of-Words Loss to ensure the accuracy of the input of the Context-Knowledge Fusion, namely, $h^x_n$ and $f_z$; the Bag-of-Words predictor is activated by softmax and outputs a probability distribution over the vocabulary $V$. Meanwhile, we construct a 0-1 indicator vector $I_f \in \mathbb{R}^{l \times 1}$ to supervise the training of $z_{post}$, where $I_f[i]$ is set to 1 if the target entity of the $i$-th fact $f_i$ appears in $Y$, and 0 otherwise. The objective is to minimize the loss $L_f$, which covers both terms.
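To make the prior/posterior idea concrete, the following pure-Python sketch computes two fact distributions, their KL divergence, the fused knowledge vector, and the 0-1 indicator. The exact scoring function is not recoverable from the text above, so the form used here (a dot product between a tanh-projected fact and a tanh-projected context, followed by softmax) is only an assumption that is consistent with the listed parameters $W_{ft}$, $W_{post}$, $W_{prior}$, $\eta$, and $\varphi$. The direction of the KL term, the use of the last encoder states as context, and all toy sizes and values are likewise assumptions.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def project(v: Vector, W: Matrix) -> Vector:
    """Row vector (d_in) times matrix (d_in x d_out), followed by tanh."""
    d_out = len(W[0])
    return [math.tanh(sum(v[i] * W[i][j] for i in range(len(v)))) for j in range(d_out)]


def softmax(scores: Vector) -> Vector:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fact_distribution(F: Matrix, context: Vector, W_ft: Matrix, W_ctx: Matrix) -> Vector:
    """Assumed scoring: dot product of projected fact and projected context."""
    ctx = project(context, W_ctx)
    scores = [sum(a * b for a, b in zip(project(f, W_ft), ctx)) for f in F]
    return softmax(scores)


def kl_divergence(p: Vector, q: Vector) -> float:
    """KL(p || q) for two strictly positive distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def rand_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
l, d_fact, d_ctx, d_hid = 3, 4, 2, 5               # toy sizes (hypothetical)
F = rand_matrix(l, d_fact, rng)                    # fact embedding matrix
h_x = [rng.uniform(-1, 1) for _ in range(d_ctx)]   # last query state
h_y = [rng.uniform(-1, 1) for _ in range(d_ctx)]   # last response state
W_ft = rand_matrix(d_fact, d_hid, rng)
W_prior = rand_matrix(d_ctx, d_hid, rng)
W_post = rand_matrix(2 * d_ctx, d_hid, rng)

z_prior = fact_distribution(F, h_x, W_ft, W_prior)       # query only (inference)
z_post = fact_distribution(F, h_x + h_y, W_ft, W_post)   # query + response (training)
L_k = kl_divergence(z_post, z_prior)                     # assumed direction

# Fused knowledge f_z: fact embeddings averaged under the weights z.
f_z = [sum(z_post[i] * F[i][j] for i in range(l)) for j in range(d_fact)]

# 0-1 indicator: does the target entity of fact i occur in the response?
targets = ["iphone", "fruit", "jobs"]
response = ["my", "iphone", "is", "new"]
I_f = [1 if t in response else 0 for t in targets]
```

The design point the sketch tries to convey is that the posterior sees the response while the prior does not, so minimizing the divergence pushes the prior toward fact choices that a human response would justify.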

Triple Knowledge Decoder
The Decoder is another GRU network. At each time step, the Decoder can generate one of three types of words: vocabulary words, knowledgeable entity words, and copied words. ConKADI first updates the internal state $h^y_t$. The inputs to this update are $\mathbf{y}_{t-1}$, $\mathbf{e}_{y_{t-1}}$, and $h^x_{y_{t-1}}$, which are the word embedding, the entity embedding, and the pointed-then-copied source state of the last predicted token $y_{t-1}$, respectively, together with the attention context $c_{t-1}$.
Vocabulary Words: The probability distribution $p_{w,t} \in \mathbb{R}^{|V| \times 1}$ over $V$ is computed with trainable parameters $W_{v1}$ and $W_{v2}$ and the non-linear activation elu (Clevert et al., 2016).
Knowledgeable Entity Words: An entity word can be generated by extracting the target entity of the best-matched fact $f$ at each time step. The corresponding probability distribution $p_{k,t} \in \mathbb{R}^{l \times 1}$ over $F$ combines two distributions: the previously computed $z$, which serves here as a static global distribution (denoted as GlFact), and a dynamic distribution $z_{d,t}$. A gate $\gamma_t$ controls the contribution of each distribution.
Copied Words: The Decoder can further point to a word $x$ in $X$ and then copy it. The corresponding probability distribution over the query message $X$ is $p_{c,t} \in \mathbb{R}^{n \times 1}$.
Flexible Mode Fusion: The previous three distributions are fused by $MF(h^y_t, u_{t-1}, c_t)$, a 2-layer MLP activated by softmax. $MF$ outputs a probability distribution $(\gamma_{w,t}, \gamma_{k,t}, \gamma_{c,t})$ over the three modes at each time step:
$$p_{out,t} = \gamma_{w,t} \times p_{w,t} + \gamma_{k,t} \times p_{k,t} + \gamma_{c,t} \times p_{c,t} \tag{10}$$
The proposed $MF$ can be regarded as a multi-class classifier; its advantage is therefore flexibility: we can integrate more modes or remove existing modes by simply changing the number of classes. For a more reasonable fusion, the Cross-Entropy between the ground-truth mode and the distribution predicted by $MF$ is used to supervise the training; the corresponding Cross-Entropy loss is denoted as $L_m$. Next, we optimize the fused output distribution $p_{out}(Y|X,F)$ by minimizing $L_n$, which is given by:
$$L_n = -\sum_t \lambda_t \log p_{out,t}(y_t \mid y_{t-1:1}, X, F) + \frac{L_m}{2}$$
where $\lambda_t$ is a normalization term to penalize out-of-vocabulary words: $\lambda_t = \frac{1}{\#(unk \in Y)}$ if $y_t$ is an unk, where $\#(\cdot)$ denotes the count of $\cdot$; otherwise $\lambda_t = 1$.
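The fusion in Eq. (10) and the $\lambda_t$ weighting can be sketched as follows. Because the three distributions are defined over different supports (the vocabulary, the retrieved facts, and the query positions), this sketch assumes they are mapped onto a shared word space before being added: a fact contributes to its target entity word, and a query position contributes to the token at that position. That mapping, the helper names `fuse_modes` and `weighted_nll`, and the toy numbers are assumptions; the sketch also omits the $L_m$ term.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple


def fuse_modes(
    gammas: Tuple[float, float, float],  # (gamma_w, gamma_k, gamma_c), sums to 1
    p_w: Dict[str, float],               # distribution over vocabulary words
    p_k: List[float],                    # distribution over retrieved facts
    fact_targets: List[str],             # target entity word of each fact
    p_c: List[float],                    # distribution over query positions
    query_tokens: List[str],
) -> Dict[str, float]:
    """Mix the three mode distributions into one distribution over words."""
    gamma_w, gamma_k, gamma_c = gammas
    p_out: Dict[str, float] = defaultdict(float)
    for word, prob in p_w.items():
        p_out[word] += gamma_w * prob
    for prob, target in zip(p_k, fact_targets):
        p_out[target] += gamma_k * prob
    for prob, token in zip(p_c, query_tokens):
        p_out[token] += gamma_c * prob
    return dict(p_out)


def weighted_nll(step_probs: List[float], is_unk: List[bool]) -> float:
    """-sum_t lambda_t * log p_out,t(y_t); lambda_t = 1/#unk for unk tokens."""
    n_unk = sum(is_unk)
    loss = 0.0
    for prob, unk in zip(step_probs, is_unk):
        lam = 1.0 / n_unk if unk else 1.0
        loss -= lam * math.log(prob)
    return loss


p_out = fuse_modes(
    gammas=(0.5, 0.3, 0.2),
    p_w={"i": 0.6, "like": 0.4},
    p_k=[0.9, 0.1],
    fact_targets=["iphone", "fruit"],
    p_c=[0.5, 0.5],
    query_tokens=["apple", "phone"],
)
loss = weighted_nll([0.5, 0.25, 0.25], [False, True, True])
```

Since the mode weights and each mode distribution sum to one, the fused result is again a proper distribution, which is what lets a mode be added or removed by changing only the number of classes in the mode classifier.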
Training Objective: Finally, ConKADI is trained by minimizing a joint objective over the losses $L_n$, $L_k$, and $L_f$ defined above.

Experiments

Dataset
To verify the generalization across different languages, we evaluate models not only on a public English Reddit dataset but also on a Chinese Weibo dataset that we collect and construct. Both datasets are aligned with the commonsense knowledge graph ConceptNet (conceptnet.io); the statistics are reported in Table 1. In comparison with the English Reddit dataset, our dataset has more facts, but the relation types are quite limited; hence, we set the limit that a message can be associated with at most 150 fact triplets. For both datasets, the embeddings of entities and relations are learned with TransE (Bordes et al., 2013) and then kept fixed in training. Our experimental resources are available on the web.
3 #(·) is the count of ·.

Settings
Baselines: We compare with the widely used S2S (Sutskever et al., 2014) and its attentive version ATS2S (Luong et al., 2015). We further add the bidi-MMI (Li et al., 2016a) or the diverse decoding (Li et al., 2016b) to improve the diversity of ATS2S; the resulting models are denoted as ATS2S$_{MMI}$ and ATS2S$_{DD}$. The Copy mechanism (Gu et al., 2016) allows the Decoder to point to and then copy a source word. GenDS (Zhu et al., 2017) is a knowledge-aware model that generates responses by utilizing entity words. CCM is the current state-of-the-art approach to response generation with commonsense knowledge.

Implementation: We implemented all models except CCM, which was tested using its official code. Most hyper-parameters are kept the same as in CCM, and hyper-parameters are kept as consistent as possible across models. In detail, the word embedding dimension is 300, the Encoder is a 2-layer bidirectional GRU with 512 units, and the Decoder is a 2-layer GRU with 512 units. Adam is used to optimize the model with an initial learning rate lr = 0.0001; if the perplexity begins to increase, lr is halved, and if the perplexity increases in two consecutive epochs, training is stopped. Following CCM, the maximum number of epochs is 20.
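The learning-rate rule described above can be sketched as a small control loop. This is one reading of the text, not the authors' training script: it assumes that "increase" means an increase relative to the previous epoch's perplexity, and that training stops on the second consecutive increase without a further halving. The function name `run_schedule` and the sample perplexities are hypothetical.

```python
from typing import List, Tuple


def run_schedule(
    perplexities: List[float], lr: float = 1e-4, max_epochs: int = 20
) -> Tuple[int, float]:
    """Return (number of epochs run, final learning rate)."""
    previous = float("inf")
    consecutive_increases = 0
    epochs_run = 0
    for ppl in perplexities[:max_epochs]:
        epochs_run += 1
        if ppl > previous:
            consecutive_increases += 1
            if consecutive_increases >= 2:
                break          # two consecutive increases: stop training
            lr /= 2.0          # first increase: halve the learning rate
        else:
            consecutive_increases = 0
        previous = ppl
    return epochs_run, lr


# Perplexity rises at epoch 3 (halve), recovers, then rises at epochs 5 and 6.
epochs_run, final_lr = run_schedule([50.0, 40.0, 42.0, 41.0, 43.0, 44.0])
```

In this trace the learning rate is halved twice (at epochs 3 and 5) and training stops at epoch 6.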
Objective Metrics: We evaluate the generated responses from four aspects. Knowledge Utilization ($A_1$): $E_{match}$ is the averaged number of matched target entities per generation; $E_{use}$ further counts the source entities; $E_{recall}$ is the ratio of recalled entities. Embedding-based Relevance ($A_{2a}$): Following Liu et al. (2016), we use $Emb_{avg}$, which considers the averaged word embedding, and $Emb_{ex}$, which considers each dimension's extreme value. Overlapping-based Relevance ($A_{2b}$): BLEU-2/3 (Tian et al., 2017; Wu et al., 2017). Diversity ($A_3$): We report the ratio of distinct uni/bi-grams, i.e., Distinct-1/2, in all generated texts (Li et al., 2016a; Wu et al., 2018). Informativeness ($A_4$): We report the word-level Entropy (Mou et al., 2016).
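As an illustration of the diversity and informativeness metrics, the sketch below computes Distinct-n as the ratio of distinct n-grams over all generated texts, and a unigram entropy in bits over the generated words. The entropy variant is an assumption: the cited work may estimate word probabilities differently (for example, from a reference corpus), so this should be read as one plausible instantiation rather than the exact metric used in the experiments.

```python
import math
from collections import Counter
from typing import List


def distinct_n(responses: List[List[str]], n: int) -> float:
    """Ratio of distinct n-grams to all n-grams across every response."""
    ngrams = [
        tuple(tokens[i:i + n])
        for tokens in responses
        for i in range(len(tokens) - n + 1)
    ]
    return len(set(ngrams)) / len(ngrams) if ngrams else 0.0


def word_entropy(responses: List[List[str]]) -> float:
    """Shannon entropy (bits) of the unigram distribution of generated words."""
    counts = Counter(token for tokens in responses for token in tokens)
    total = sum(counts.values())
    return -sum(c / total * math.log2(c / total) for c in counts.values())


generated = [["i", "do", "not", "know"], ["i", "like", "apple"]]
d1 = distinct_n(generated, 1)   # 6 distinct unigrams out of 7
d2 = distinct_n(generated, 2)   # 5 distinct bigrams out of 5
ent = word_entropy(generated)
```

A model that keeps repeating "I don't know" drives both numbers down, which is why these metrics are used as proxies for generic responses.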
Relative Score: To illustrate the comprehensive performance of models, we first compute the average score of the 7 baselines metric by metric (AVG); then, relative to AVG, we report the arithmetic mean score and the geometric mean score.
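One simple way to realize such a relative score is sketched below: each metric of a model is divided by the baseline average for that metric, and the resulting ratios are aggregated by an arithmetic and a geometric mean. The exact formula is not recoverable from the text above (it may, for instance, first average within each aspect $A_1$ to $A_4$), so the flat aggregation, the function name, and the toy numbers here are assumptions.

```python
import math
from typing import Dict, List, Tuple


def relative_scores(
    model: Dict[str, float], baselines: List[Dict[str, float]]
) -> Tuple[float, float]:
    """Return (arithmetic mean, geometric mean) of per-metric ratios to AVG."""
    ratios = []
    for metric, value in model.items():
        avg = sum(b[metric] for b in baselines) / len(baselines)
        ratios.append(value / avg)
    arithmetic = sum(ratios) / len(ratios)
    geometric = math.exp(sum(math.log(r) for r in ratios) / len(ratios))
    return arithmetic, geometric


# Hypothetical metric values, for illustration only.
toy_baselines = [
    {"distinct_1": 0.02, "entropy": 8.0},
    {"distinct_1": 0.04, "entropy": 10.0},
]
toy_model = {"distinct_1": 0.06, "entropy": 9.0}
r_arith, r_geo = relative_scores(toy_model, toy_baselines)
```

The geometric mean is less sensitive than the arithmetic mean to a single metric with a very large ratio, which is presumably why both are reported.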

Experimental Results
The objective evaluation results on the two datasets are reported in Table 2. The Relative Score shows that ConKADI outperforms the baseline models overall. More specifically, ConKADI outperforms the baseline models on all metrics except BLEU-3 on the Chinese Weibo dataset, and on almost all metrics on the English Reddit dataset. In comparison with the state-of-the-art method CCM, ConKADI increases the overall performance by 153%/95% (arithmetic/geometric mean) on the Chinese dataset and by 48%/25% on the English dataset.
Knowledge Utilization: By accessing the knowledge, the three knowledge-aware models, i.e., GenDS, CCM, and ConKADI, significantly outperform the other models. In comparison with GenDS and CCM, the advantages of ConKADI can be summarized as follows. 1) ConKADI has a higher utilization of the knowledge, as shown by $E_{match}$. 2) By using the point-then-copy mechanism (ConKADI vs. ConKADI$_{-cp}$), ConKADI further expands the total number of generated entities ($E_{use}$). After adding the point-then-copy mechanism, $E_{match}$ drops by 7.5%, while the overall $E_{use}$ increases by 10%. This means ConKADI can reasonably decide whether to use a knowledge fact or copy a source word. 3) ConKADI is better able to find the accurate knowledge; hence, our $E_{recall}$ is much higher than that of GenDS and CCM. These results demonstrate that the proposed Felicitous Fact mechanism helps the model focus on the facts that are relevant to the dialogue context, and increases both the utilization rate of the knowledge graph and the accuracy of the knowledge selection.
Diversity and Informativeness: Generative models have long suffered from generating responses without enough diversity and informativeness. Although GenDS and CCM can utilize the knowledge, they fail to solve this challenge; they can even be beaten by other baselines. By contrast, ConKADI significantly alleviates this issue. According to our ablation experiments, this notable improvement can be attributed to the proposed Context-Knowledge Fusion. More details are discussed in the ablation study.
Relevance: On the Chinese dataset, ConKADI has the best overall performance, but its performance is not ideal on the English dataset. First, we attribute this to the inherent difference between the datasets; the two datasets are collected from different sources and have varying densities of entity-relations (see Table 1). Next, we must emphasize that these metrics can only evaluate the relevance to the given reference. Dialogue is undoubtedly a 1-to-n mapping rather than a 1-to-1 mapping; therefore, these results cannot show that the generation is inconsistent with the query. ConKADI is a very diverse model, and judging it with only one reference is unfair. This limitation has also been found and explained in a recent work (Gao et al., 2019).

Human Annotation
Following previous work, we randomly sample 200 query messages from the test set and then conduct a pair-wise comparison. Among the variations of S2S, we retain the two most representative models, ATS2S and ATS2S$_{MMI}$. Thus, we have 1,000 pairs in total. For each pair, we invite three well-educated volunteers to judge which response is better in terms of the following two metrics: 1) Appropriateness, which mainly considers fluency and logical relevance; 2) Informativeness, which considers whether the model provides new information/knowledge. Ties are allowed, but volunteers are asked to avoid them where possible. The model names are masked, and the A-B order is random.
For appropriateness, the 2/3 agreement (i.e., the percentage of cases in which at least 2 volunteers give the same label) is 95%, and the 3/3 agreement is 67.1%. For informativeness, the 2/3 agreement is 97%, and the 3/3 agreement is 79.1%.
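The two agreement figures can be computed as in the following sketch, which follows the definition given above (the share of cases where at least two, or all three, volunteers give the same label). The label names and the toy votes are hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def agreement_rates(votes_per_pair: List[List[str]]) -> Tuple[float, float]:
    """Return (2/3 agreement, 3/3 agreement) over pairs judged by 3 annotators."""
    at_least_two = sum(
        1 for votes in votes_per_pair
        if Counter(votes).most_common(1)[0][1] >= 2
    )
    all_three = sum(1 for votes in votes_per_pair if len(set(votes)) == 1)
    total = len(votes_per_pair)
    return at_least_two / total, all_three / total


# Each inner list holds the three volunteers' labels for one response pair.
toy_votes = [
    ["A", "A", "B"],
    ["A", "A", "A"],
    ["A", "B", "tie"],
    ["B", "B", "B"],
]
two_of_three, three_of_three = agreement_rates(toy_votes)
```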
The results are reported in Table 3. ATS2S$_{MMI}$ is the strongest baseline owing to the beam search and the MMI re-ranking, especially in terms of appropriateness. Although the generation of ATS2S$_{MMI}$ is more generic, it is easy for humans to read; hence, it tends to receive higher scores. GenDS and CCM are far behind our model. We find that their generations are usually not fluent, even though many entities are generated. Comparing the two metrics, ConKADI has more notable advantages in terms of informativeness.

Ablation Study
We focus on the ablation of the Felicitous Fact mechanism. There are 3 factors: GlFact (using the distribution $z$ to guide the entity word generation), CKF (Context-Knowledge Fusion), and CKF's loss $L_f$. Copy fully removes the Felicitous Fact mechanism (i.e., the above 3 factors); Base is reduced further.
The results are reported in Table 5. 1) The performance drops significantly without using the context-knowledge fused result to initialize the Decoder (#5 → #7), indicating that CKF is very important for the Decoder. 2) If GlFact is adopted alone, it can in turn affect performance. 3) $L_f$ is essential to Copy in comparison with Base.
Analysis of KL Divergence: The training stage introduces posterior knowledge, which is absent during inference. Therefore, reducing the difference between the two distributions is necessary. Here we examine the curve of the KLD between $z_{prior}$ and $z_{post}$, i.e., $L_k$. A lower $L_k$ means the two distributions are closer. As shown in Figure 3: 1) KLD is strongly related to the overall performance. 2) The importance of using the fused knowledge to initialize the Decoder (CKF) is confirmed once again (#5 vs. #6).

Case Study
Three cases are sampled in Table 4. In case 1, all models except ATS2S$_{MMI}$ and ConKADI have generated weird responses. ATS2S$_{MMI}$ generated a fluent response, but this response is not very logically relevant to the query. In case 2, although GenDS and CCM have generated entity words, they also generate some redundant generic patterns, namely, "I'm not sure ...". This is perhaps because their understanding of background knowledge is still insufficient. ConKADI generates a fluent and informative response. The last, challenging case is sampled from the Chinese dataset. "Taylor Swift" is a female singer, but it is an unknown word for the models. None of the generated responses is perfect. Only the generations of ATS2S$_{MMI}$ and ConKADI are fluent. In comparison with ATS2S$_{MMI}$, the generation of ConKADI provides more information; the only small flaw is that ConKADI wrongly treats "Taylor Swift" as a male singer.

Conclusion and Future Work
To bridge the knowledge gap between machines and human beings in dialogue generation, this paper proposes a novel knowledge-aware model, ConKADI. The proposed Felicitous Fact mechanism helps ConKADI focus on the facts that are highly relevant to the dialogue context by generating a felicitous fact probability distribution over the retrieved facts. In addition, the proposed Context-Knowledge Fusion and Flexible Mode Fusion facilitate the integration of knowledge in ConKADI. Extensive evaluations on both a publicly released English dataset and our constructed Chinese dataset demonstrate that ConKADI significantly outperforms the state-of-the-art model CCM and other baselines in most experiments.
Although ConKADI has achieved notable performance, there is still much room for improvement. 1) While ATS2S$_{MMI}$ is behind ConKADI, we find that MMI can effectively enhance ATS2S; hence, in the future, we plan to verify the feasibility of the re-ranking technique for knowledge-aware models. 2) We will continue to promote the integration of high-quality knowledge, including more types of knowledge and a more natural integration method.