Improving Knowledge-Aware Dialogue Response Generation by Using Human-Written Prototype Dialogues

Incorporating commonsense knowledge can alleviate the issue of generating generic responses in open-domain generative dialogue systems. However, selecting knowledge facts for a given dialogue context remains a challenge. The widely used Entity Name Matching approach retrieves facts based on local entity words and thus often returns irrelevant ones. This paper proposes a novel knowledge selection approach, Prototype-KR, and a knowledge-aware generative model, Prototype-KRG. Given a query, our approach first retrieves a set of prototype dialogues that are relevant to the query. We find that knowledge facts used in prototype dialogues are usually highly relevant to the current query; Prototype-KR therefore ranks such facts by semantic similarity and selects the most appropriate ones. Prototype-KRG then generates an informative response using the selected knowledge facts. Experiments demonstrate that our approach achieves notable improvements on most metrics compared to generative baselines. Meanwhile, compared to IR (Information Retrieval)-based baselines, responses generated by our approach are more relevant to the context and comparably informative.


Introduction
Unlike human beings, generative dialogue systems tend to produce generic responses such as 'I don't know.'. One possible reason is a gap in utilizing background knowledge. Humans naturally frame their dialogue understanding and responses with various learned background knowledge during a conversation, whereas traditional dialogue systems can merely access the surface knowledge in the given query (Ghazvininejad et al., 2018). To tackle this issue, a feasible scheme is incorporating external knowledge into dialogue generation (Qin et al., 2019; Wu et al., 2020b). This paper focuses on introducing a structured open-domain commonsense knowledge graph into single-turn dialogue response generation. Commonsense knowledge refers to widely shared everyday knowledge, for example, 'lemon tastes sour'.
In general, a knowledge graph can be regarded as a set of (e_head, r, e_tail) fact triplets. For knowledge-aware dialogue generation, the first step is knowledge selection, which aims to select appropriate knowledge facts for the current dialogue context. Traditional works (Zhou et al., 2018) typically adopt Entity Name Matching (ENM), i.e., knowledge facts are retrieved based on the entity words that appear in the given query. For example, the fact triplet (apple, IsATypeOf, fruit) can be selected for the query 'What's your favourite fruit?'. Although this widely used method works to some extent, it has several flaws. First, only 1-hop knowledge can be retrieved. Second, it retrieves with local words instead of utterance-level (global) features, so irrelevant knowledge facts may be selected. Third, vertex (entity) degrees in a graph are unequal; once an entity in the query corresponds to a hot vertex, the number of retrieved facts can be tremendous. For time efficiency, practical dialogue generation usually imposes an upper bound on the number of involved facts. Consequently, facts may be discarded at random, regardless of whether they are highly relevant, because ENM cannot judge the relevance of a retrieved fact.
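A minimal sketch may make the flaws of Entity Name Matching concrete; the toy triplets, query, and `limit` cutoff here are illustrative assumptions, not the paper's data:

```python
# Illustrative sketch of Entity Name Matching (ENM) over a toy knowledge
# graph; the triplets and query are invented examples.

def enm_retrieve(query_tokens, facts, limit=100):
    """Retrieve 1-hop facts whose head entity appears in the query.

    Mirrors the flaws discussed above: matching is purely lexical and
    local, and a 'hot' head entity can flood the result, so the blind
    cutoff (`limit`) may discard relevant facts at random.
    """
    tokens = set(query_tokens)
    matched = [f for f in facts if f[0] in tokens]
    return matched[:limit]  # ENM cannot rank relevance before truncating

facts = [
    ("apple", "IsATypeOf", "fruit"),
    ("apple", "RelatedTo", "pie"),
    ("fruit", "HasProperty", "sweet"),
]
hits = enm_retrieve("what is your favourite fruit".split(), facts)
# only facts whose head entity literally occurs in the query are returned
```

Note that a query mentioning a high-degree entity like 'apple' would pull in every attached triplet, relevant or not, which is exactly the third flaw above.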
To address these issues, this paper proposes a novel knowledge selection approach, Prototype-KR, which retrieves high-quality knowledge facts from prototype dialogues, as shown in Table 1. Prototype dialogues are a set of diverse, informative, and knowledgeable human-written dialogues, which can be retrieved from a large-scale dialogue repository. Previous studies (Cai et al., 2019) have shown that prototype dialogues are usually highly relevant to the current dialogue context; Prototype-KR therefore assumes that knowledge facts used in the prototype dialogues are similarly relevant to the current dialogue context. The methodology can be summarized as follows: 1) Prototype-KR first retrieves prototype dialogues that are semantically relevant to the given query using an IR (Information Retrieval) system; 2) it extracts all used facts from the prototype dialogues; 3) it selects the most appropriate knowledge facts by ranking; 4) finally, Prototype-KRG generates a response using the knowledge facts retrieved by both Entity Name Matching and Prototype-KR. Our experiments are conducted on a large-scale Chinese conversation dataset (Li and Yan, 2018) and the widely used commonsense knowledge graph ConceptNet. The experimental results demonstrate that our approach outperforms both generative and IR-based baselines. We also conduct a series of extensive experiments to analyze Prototype-KR and find that it retrieves higher-quality knowledge facts than traditional Entity Name Matching.
Our contributions can be summarized as follows: 1) We propose a new knowledge selection approach, Prototype-KR, which uses prototype dialogues to effectively alleviate the flaws of traditional Entity Name Matching; 2) we propose a knowledge-aware dialogue model, Prototype-KRG, to improve knowledge-aware dialogue generation; 3) extensive experiments empirically verify the effectiveness of our approaches.

Related Work
Dialogue Systems: Roughly, dialogue systems can be classified as either retrieval-based or generative (Chen et al., 2017). For generative systems, dialogue generation is usually modeled as a Seq2Seq problem (Sutskever et al., 2014; Vinyals and Le, 2015). Generally, an Encoder summarizes the given query into intermediate representations, and a Decoder uses them to generate a response. Traditional methods suffer from generating generic responses, which decreases the interest of end-users. To make dialogue more diverse and informative, previous studies have explored multiple directions, for example, new training objectives, latent variables (Gao et al., 2019), and the introduction of content words (Yao et al., 2017).
Knowledge-Aware Methods: One crucial factor behind boring responses is insufficient background knowledge. Traditional models can merely access the surface knowledge in the plain text of the query (Ghazvininejad et al., 2018). Researchers have shown that generated responses can be more diverse and informative by introducing external knowledge, such as unstructured background documents (Meng et al., 2019), structured knowledge graphs (Zhou et al., 2018; Wu et al., 2020a), knowledge tables (Qin et al., 2019), or a hybrid of them.
Knowledge Selection: For knowledge-aware dialogue generation, selecting appropriate knowledge facts from a knowledge graph for a specific dialogue context remains a challenge. As mentioned, traditional Entity Name Matching has many flaws, so many efforts have been devoted to enhancing this knowledge selection process. One line of work adopts a neural knowledge reasoning network to select an appropriate fact; another transfers question representation and knowledge matching abilities from KBQA systems. Although such works have achieved promising results, they are often not wise choices in practical scenarios. First, they adopt complicated external networks to select knowledge, which significantly increases the number of parameters and makes training/inference more time-consuming. Second, the external networks require a large amount of additional labeled data, which may not be easy to obtain in practice. Our work differs in that: 1) our approach does not require any additional data, since prototype dialogues can be retrieved from the training corpus; 2) our IR-based knowledge selection is fast and requires no pre-training.
Prototypes: Recently, prototype dialogues have received much attention in dialogue generation research owing to their high quality. Weston et al. (2018) encode a retrieved prototype dialogue into vectors and regard them as additional features to help dialogue generation. Other work generates a response by editing a prototype response. Cai et al. (2019) propose a two-step skeleton-based dialogue generation. A notable shortcoming of such methods is that they often use only one prototype dialogue (Tian et al., 2019); thus, if the given prototype dialogue is irrelevant to the context, the generation quality sharply decreases. Besides, such methods sometimes degenerate into directly copying the prototype response rather than selectively extracting useful information. In contrast to these works: 1) Prototype-KR can utilize multiple prototypes at the same time; 2) Prototype-KRG is a fully generative approach whose generation process does not rely on copying or editing prototype dialogues.

Problem Formulation and Overview
Let D = {(X_i, Y_i)}_{i=1}^{|D|} be a dialogue corpus and K = {f_j}_{j=1}^{|K|} be a knowledge graph, where X is a query, Y is a response, and f = (e_head, r, e_tail) is a knowledge fact triplet. The prototype dialogue repository D' can be either D itself or a new corpus. Prototype-KR retrieves a set of prototype dialogues S = {(X_i, Y_i)} from D' and extracts all used facts from S (denoted as F_p^raw). Next, Prototype-KR ranks the facts in F_p^raw and selects the top-k facts (denoted as F_p). Meanwhile, Entity Name Matching is also used to retrieve a set of facts (denoted as F_n). Finally, Prototype-KRG uses F_p, F_n, and X to generate the target response: p(Y | X, F_p, F_n).

Prototype-KR
Prototype-KR is a 3-stage method to retrieve top-k relevant knowledge facts from prototype dialogues.
Prototype Retrieval: Prototype dialogues S are first retrieved from the repository D'. We adopt Lucene to construct an index and use its built-in engine to retrieve 5k prototype dialogues. Following prior work, we use different strategies in training and inference: in training, we retrieve prototype dialogues based on response similarity; at test time, we retrieve them based on query similarity. For each prototype dialogue, we extract all its knowledge facts into a subset F_i; afterwards, all subsets are merged into F_ALL.
Coarse-Grained Ranking: For each knowledge fact f_j ∈ F_ALL, the corresponding coarse-grained ranking score s_j^c (Eq. (1)) combines a bag-of-words Jaccard similarity between the current context and the prototype dialogue that contributed the fact with the inverse document frequency IDF(·), which penalizes high-frequency generic knowledge facts. We keep the top-2k ranked knowledge facts (denoted as F_CGR). It is worth noting that, for each unique target entity e_tail, we keep only the fact with the highest score.
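Since the body of Equation (1) did not survive in the text, the following is only a hedged sketch of the two ingredients it names, a bag-of-words Jaccard similarity and an IDF penalty; the exact combination (a product here) and the `fact_freq` bookkeeping are assumptions:

```python
import math

def jaccard(a, b):
    """Bag-of-words Jaccard similarity between two token sequences."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

def coarse_score(query, prototype, fact, fact_freq, n_dialogues):
    """Hypothetical coarse-grained score for one fact.

    `fact_freq[fact]` counts dialogues in the repository using the fact,
    so generic facts receive a small IDF weight.
    """
    sim = jaccard(query.split(), prototype.split())
    idf = math.log(n_dialogues / (1 + fact_freq[fact]))
    return sim * idf  # assumed combination; Eq. (1) may differ
```

Deduplication by e_tail then amounts to grouping scored facts by their tail entity and keeping the argmax in each group.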
Fine-Grained Ranking: For each f_j ∈ F_CGR, we use an embedding-based metric to measure the semantic relevance to the current query/response P_X/P_Y (denoted as P below), and we keep the top-k ranked facts (i.e., F_p). The fine-grained score s_j^f (Eq. (2)) is based on θ, the cosine similarity, and E_x, the sentence-level extrema embedding: for each dimension of the word embedding vectors, take the most extreme value among all vectors in the sentence (Liu et al., 2016). E_w(e_head/tail) is the pre-trained word embedding of the head/tail entity of f_j.
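The extrema embedding and cosine similarity in Eq. (2) can be sketched as follows; how the head- and tail-entity similarities are combined is not recoverable from the text, so the mean used here is an assumption:

```python
import numpy as np

def extrema_embedding(word_vecs):
    """Sentence-level extrema embedding: per dimension, keep the value of
    largest magnitude across the sentence's word vectors (Liu et al., 2016)."""
    m = np.stack(word_vecs)                  # (n_words, dim)
    idx = np.abs(m).argmax(axis=0)           # most extreme entry per dim
    return m[idx, np.arange(m.shape[1])]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def fine_score(sentence_vecs, head_vec, tail_vec):
    """Hypothetical fine-grained score: cosine similarity of the context's
    extrema embedding against both entity embeddings, averaged."""
    ex = extrema_embedding(sentence_vecs)
    return 0.5 * (cosine(ex, head_vec) + cosine(ex, tail_vec))
```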

Context Encoder
Context Encoder is a bi-directional GRU network that encodes the query X into intermediate representations. The forward GRU reads X from beginning to end; the backward GRU reads X from end to beginning, and at each time step t the two outputs are combined.

Figure 1: The architecture of our approach. K denotes a knowledge graph, D' is a dialogue repository used to retrieve prototype dialogues, X is a query, and Y is a generated response.

Each input token x is represented by a learnable word embedding together with the corresponding fixed entity embedding.
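A toy numpy sketch of the bi-directional encoder may help; the GRU cell below follows the standard update equations, with illustrative dimensions and random weights rather than the paper's trained parameters:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(h, x, Wz, Wr, Wh):
    """One standard GRU update; each W* acts on the concatenated [h; x]."""
    hx = np.concatenate([h, x])
    z = sigmoid(Wz @ hx)                         # update gate
    r = sigmoid(Wr @ hx)                         # reset gate
    h_tilde = np.tanh(Wh @ np.concatenate([r * h, x]))
    return (1 - z) * h + z * h_tilde

def bigru_encode(xs, dim, rng):
    """Read the embedded query forward and backward, then concatenate the
    two final states. Weight shapes assume input dim == hidden dim."""
    Wf = [rng.standard_normal((dim, 2 * dim)) * 0.1 for _ in range(3)]
    Wb = [rng.standard_normal((dim, 2 * dim)) * 0.1 for _ in range(3)]
    hf = np.zeros(dim)
    hb = np.zeros(dim)
    for x in xs:                                 # forward read
        hf = gru_step(hf, x, *Wf)
    for x in reversed(xs):                       # backward read
        hb = gru_step(hb, x, *Wb)
    return np.concatenate([hf, hb])              # [forward; backward]
```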

Knowledge Bridge Fusion
Context Encoder only summarizes the surface text of X. To access knowledge before generation, we propose Knowledge Bridge Fusion, which uses both the intermediate representations H and the knowledge facts to initialize the Decoder. Given the last context state h_n, we first obtain the attention a_p over F_p and the attention a_n over F_n: a_{p/n} = KA(F_{p/n}, h_n), where F_{p/n} is the corresponding embedding of F_p/F_n and KA is an attention function. The learnable parameters W_k, W_q, W_o are not shared between KA(F_p, h_n) and KA(F_n, h_n).
Subsequently, a_p, a_n, and h_n are fused by an MLP, and the result is taken as the initial state of the Decoder:

z_0 = γ_c h_n + γ_p a_p + γ_n a_n
(γ_c, γ_n, γ_p) = softmax(W_bridge [a_n; a_p; h_n; ita])

where [·; ·] is the concatenation operation, and the vector ita is the concatenation of interactions between a_{p/n} and h_n, namely h_n + a_{p/n}, h_n − a_{p/n}, and abs(h_n − a_{p/n}).
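The fusion step can be sketched as below; the ordering of the three gates produced by `W_bridge`, and the random stand-in for `W_bridge` itself, are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def bridge_fuse(a_p, a_n, h_n, W_bridge):
    """Fuse the two knowledge attentions and the last context state into
    the decoder's initial state z_0 via three softmax gates."""
    ita = np.concatenate([
        h_n + a_p, h_n - a_p, np.abs(h_n - a_p),  # interactions with a_p
        h_n + a_n, h_n - a_n, np.abs(h_n - a_n),  # interactions with a_n
    ])
    feats = np.concatenate([a_n, a_p, h_n, ita])  # [a_n; a_p; h_n; ita]
    gamma = softmax(W_bridge @ feats)             # assumed (γ_c, γ_p, γ_n)
    return gamma[0] * h_n + gamma[1] * a_p + gamma[2] * a_n
```

With a hidden size d, `W_bridge` has shape (3, 9d): three vectors of size d plus six interaction vectors.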

Response Generation
The Decoder is another GRU network. At each decoding time step t, the hidden state z_t is updated from the previous state z_{t−1}, the last predicted token y_{t−1}, the attention c_t over H (see Luong et al. (2015) for details), and the attentions c_{p/n,t} over F_{p/n}, computed analogously.
The tokens to be generated can be one of the following four types: words from the fixed vocabulary V , words copied from X, entity words from F p , and entity words from F n .
Vocabulary Words: The probability distribution p_{v,t} over the fixed vocabulary V is computed from the decoder state.
Copied Words: The Decoder can copy a word from X; the corresponding probability distribution p_{c,t} over the query X is computed with a copy attention.
Entity Words: The Decoder can select the best-matched knowledge fact from F_p and F_n and copy the corresponding entity word. For F_p and F_n, we apply the same method, but with different parameters, to compute the distributions p_{p,t} and p_{n,t}.
Mode Fusion: Following (Wu et al., 2020a), the four distributions are fused using mode gates:

p_t = π_{v,t} p_{v,t} + π_{c,t} p_{c,t} + π_{p,t} p_{p,t} + π_{n,t} p_{n,t}
(π_{v,t}, π_{c,t}, π_{p,t}, π_{n,t}) = softmax(W_mode [z_t; u_t])
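The mode-fusion step can be sketched as follows; in the real model the four distributions live over different index sets and must be scattered into one shared space, which is assumed to have already happened here:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mode_fuse(p_v, p_c, p_p, p_n, z_t, u_t, W_mode):
    """Mix the four per-mode distributions (vocabulary, copy, prototype
    facts, ENM facts) with gates softmax(W_mode [z_t; u_t])."""
    pi = softmax(W_mode @ np.concatenate([z_t, u_t]))   # 4 mode gates
    p_t = pi[0] * p_v + pi[1] * p_c + pi[2] * p_p + pi[3] * p_n
    return p_t   # a valid distribution: convex mix of distributions
```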

Training
Finally, the training objective combines three losses, where L_Gen = −Σ_t log p_t(y_t | y_{t−1:1}, X, F_p, F_n) is the negative log-likelihood; L_BOW is the bag-of-words loss that encourages fluency, with our BOW prediction taking z_0 as input; and L_Mode is the teacher-forcing mode loss, i.e., the cross-entropy between π_{v/c/p/n} and the ground-truth 0-1 mode indicator, which helps Prototype-KRG select a target word from the four types more accurately (Zhou et al., 2018).
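As a hedged sketch, assuming the three terms are summed without weights (the combined formula did not survive extraction), the objective can be written as:

```python
import numpy as np

def nll(probs, target_idx):
    """Negative log-likelihood of one target under a distribution."""
    return -float(np.log(probs[target_idx]))

def total_loss(step_probs, targets, bow_probs, bow_targets,
               mode_probs, mode_targets):
    """Assumed combination L = L_Gen + L_BOW + L_Mode over toy inputs.

    L_Gen sums per-step token NLL; L_BOW scores every reference token
    against a single order-free distribution predicted from z_0; L_Mode
    is cross-entropy against the ground-truth mode indicator.
    """
    l_gen = sum(nll(p, t) for p, t in zip(step_probs, targets))
    l_bow = sum(nll(bow_probs, t) for t in bow_targets)
    l_mode = sum(nll(p, t) for p, t in zip(mode_probs, mode_targets))
    return l_gen + l_bow + l_mode
```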

Models:
We select generative baselines and IR-based baselines. S2S: the RNN-based Seq2Seq with attention (Sutskever et al., 2014; Luong et al., 2015). Copy: a copy mechanism additionally allows Seq2Seq to copy a word directly from the query (See et al., 2017). Transformer: both the Encoder and the Decoder are 6-layer Transformers (Vaswani et al., 2017) instead of RNNs. GenDS: a strong knowledge-aware dialogue generation baseline (Zhu et al., 2017). CCM: a SOTA commonsense knowledge-aware dialogue generation model that proposes static and dynamic graph attention mechanisms (Zhou et al., 2018). ProtoEdit: a SOTA IR-augmented dialogue generation model that edits the prototype response. IR: a predefined index is used to retrieve a response from the dialogue repository; IR-Rerank further adds a Jaccard-based reranking step. In particular, our Prototype-KRG (denoted Ours) and ProtoEdit have variants. As mentioned, during training the original Ours_R and ProtoEdit_R retrieve prototype dialogues based on response similarity, whereas the variants Ours_Q and ProtoEdit_Q retrieve prototype dialogues based on query similarity.
Implementations: For S2S, Copy, Transformer, and our approach, we use the PyTorch Seq2Seq framework OpenNMT (Klein et al., 2017). For GenDS, we use a TensorFlow implementation; for CCM and ProtoEdit, we use their official code. In experiments, the vocabulary size is 30,000; the word embedding dimension is 300; the entity/relation embeddings are initialized from 100-dimensional pre-trained embeddings learned by TransE (Bordes et al., 2013); RNNs are 1024-dimensional GRUs; and Adam is used to optimize the model with an initial learning rate of 0.0001 and a batch size of 64. The learning rate is halved if the perplexity on the validation set starts to increase, and training is stopped if the validation perplexity increases in two successive epochs. At inference time, the beam width is 10. For a fair comparison, these settings are applied to the other implementations as far as possible. Under these settings, our approach has 193M parameters (including embeddings), and training takes about 1.5 days on an Nvidia 2080Ti.

Metrics:
We use multiple criteria. The first metric, EntN, measures knowledge utilization, i.e., the number of knowledgeable entities per generated response (Zhou et al., 2018). For relevance, we employ two embedding-based metrics, Embedding-Greedy (EmG) and Embedding-Extrema (EmX), and two word-overlap-based metrics, ROUGE and BLEU-4 (Liu et al., 2016). Next, we measure diversity by the ratio of distinct uni/bi-grams (DIST1/2) among all generated words. Lastly, Entropy is used to measure informativeness (Mou et al., 2016). To illustrate overall performance, we design two auxiliary metrics, Overall+DI and Overall. For a model, we take S2S as the standard, calculate its relative score to S2S metric by metric, and average the relative scores to obtain Overall+DI. IR-based methods can access human-written dialogues, which gives them additional advantages in diversity and informativeness; it is therefore better to exclude such metrics from the overall score when comparing generative and IR-based methods. Hence, the calculation of Overall excludes them.

Table 2: Automatic evaluation results. Considering the difference between IR-based and generative systems, we compare different types of models separately: scores in bold indicate the best among generative models; underlined scores indicate the best among our models and IR-based models.
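The two Overall scores can be sketched as a per-metric normalization against S2S followed by an average; the metric names and the simple ratio used here are assumptions about the exact computation:

```python
# Hypothetical sketch of the Overall+DI and Overall auxiliary metrics:
# each metric is divided by the S2S baseline's value, then averaged.
# Overall drops the diversity/informativeness metrics (DIST1/2, Entropy).

DI_METRICS = {"DIST1", "DIST2", "Entropy"}

def overall(model_scores, s2s_scores, include_di=True):
    keys = [k for k in model_scores if include_di or k not in DI_METRICS]
    rel = [model_scores[k] / s2s_scores[k] for k in keys]
    return sum(rel) / len(rel)
```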

Experimental Results
Experimental results have been reported in Table 2.
vs. Generative Baselines: Prototype-KRG outperforms the generative baselines on most metrics and on the overall scores, only slightly trailing the Transformer in terms of Entropy. The advantages of the previous SOTA CCM and our Prototype-KRG show that knowledge can indeed help dialogue generation. Compared with the two knowledge-aware baselines, GenDS and CCM, Prototype-KRG is notably better in knowledge utilization, diversity, and informativeness. This can be attributed to 1) Prototype-KR selecting higher-quality knowledge facts and 2) the effectiveness of Prototype-KRG itself.
vs. IR-based Baselines: Generative approaches are not directly comparable with IR-based approaches because of their different characteristics. The latter naturally have higher diversity and informativeness since they directly return human-written dialogues; thus, IR and IR-Rerank significantly outperform other models on DIST-1/2 and Entropy. However, they suffer from low relevance and are notably beaten by generative approaches on the relevance metrics, because they mechanically return existing, unmodified dialogues even when the retrieved responses are irrelevant to the query. ProtoEdit tries to address this flaw by editing the retrieved dialogue; its diversity and informativeness decrease significantly, but the improvement in relevance and overall performance (see Overall) is still limited. Compared to ProtoEdit, Prototype-KRG has comparable diversity and notably better performance in the remaining aspects and overall.
How to Select Prototypes: As mentioned, there are two strategies for retrieving prototype dialogues during training. The authors of ProtoEdit noted that ProtoEdit_Q tends to generate nonsense responses. As reported in Table 2 and confirmed by our manual review, compared to ProtoEdit_R, responses generated by ProtoEdit_Q are indeed boring and nonsensical, though more relevant to the query. Unlike ProtoEdit, although Ours_R similarly outperforms Ours_Q on DIST1/2 and Entropy, the two implementations are comparable in relevance (see Overall). This suggests our approach is much more robust to different prototype dialogues.
Human Annotation: We employed three annotators and sampled 200 queries from the test set. Six baselines (1,200 pairs) are involved in our pairwise comparisons, under two criteria: (1) Appropriateness (fluency and relevance) and (2) Informativeness (how much relevant knowledge is provided). Inter-annotator agreement is as follows: for appropriateness, 2/3 agreement is 97.3% and 3/3 agreement is 54.7%; for informativeness, 2/3 agreement is 97.6% and 3/3 agreement is 55.0%. As shown in Table 3, our approach outperforms all baselines, indicating the advantage of our approaches. In terms of appropriateness, S2S and Copy are the two best baselines because they tend to generate generic responses, which are fluent and sometimes easily accepted by humans. CCM performs poorly because it sometimes generates long but disfluent responses. The two IR-based methods are unsatisfactory: responses given by ProtoEdit and IR-Rerank are fluent but sometimes irrelevant to the query. Turning to informativeness, CCM is the best generative baseline, which indicates the importance of using knowledge. Benefiting from access to human-written dialogues, the IR-based ProtoEdit and IR-Rerank outperform the generative baselines. If we ignore the dialogue context and only check the informativeness of responses, IR-Rerank can outperform ProtoEdit, and ProtoEdit is comparable with our approach. However, the context should be considered, so we penalized irrelevant information; as a result, ProtoEdit is comparable with IR-Rerank, and our approach is better than ProtoEdit.

Table 3: Human annotation results. A/I is Appropriateness/Informativeness. +/0/− means Ours_R wins/ties/loses the comparison. Our approach is better than the baselines (sign test, p-value < 0.005).

Analysis of Prototype-KR
Ablation Study: Following (Zhou et al., 2018) and many other works, in the above experiments at least one golden fact used by the reference response is given in the test set. To clearly illustrate the differences among variants, we construct a new test set in line with practice: instead of manually ensuring such a golden knowledge fact exists, we do not add any additional fact. As shown in Table 4, compared to 'Full', the variant '-Dual' similarly uses the knowledge facts retrieved by PKR and ENM and has similar relevance, but its EntN and DIST2 decrease significantly, indicating the necessity of distinguishing the two sources in the model. Next, we compare PKR and ENM: '-PKR' hurts performance more than '-ENM', which illustrates that the knowledge quality of PKR is better than ENM's. '-PKR-ENM' removes the use of knowledge entirely, and most metrics drop dramatically, which indicates the importance of introducing knowledge. In summary: 1) knowledge is indispensable in dialogue generation; 2) our Prototype-KR selects more appropriate knowledge facts than traditional Entity Name Matching.
Knowledge Selection: Figure 2 reports the statistical evaluations of the knowledge facts retrieved by PKR and ENM. We make two observations: 1) both metrics indicate that PKR notably outperforms ENM in knowledge selection; 2) PKR's advantage is more notable when k is small, which means PKR's ranking is accurate: highly relevant facts consistently receive higher ranks.
Case Study: Table 5 reports three examples. In the first example, although all approaches generate fluent responses, they differ in both appropriateness and informativeness. S2S and Copy generate generic responses. The knowledge-aware GenDS and CCM detect a specific topic (self-confidence), but their responses are somewhat irrelevant to the query; similarly, ProtoEdit and IR return two generic responses. In the second example, S2S and Copy repeat words, and GenDS and CCM use wrong knowledge (a 'flower' is not a 'leaf'); ProtoEdit and IR give two weird responses. The last example first shows the top-3 knowledge facts retrieved and ranked by Prototype-KR, and then the response generated by Prototype-KRG. The three facts are highly relevant to the query, and the generated response uses the second fact. In short, compared to the generative approaches, our approach generates more informative and relevant responses; compared to the IR-based approaches, it generates more relevant responses.
Error Analysis: We further labeled the error types of the 200 responses sampled in the above human annotation. A response can be labeled as perfect (beyond expectation), good (acceptable), or bad. Each bad case is given one or more fine-grained error types, out of five: being irrelevant to the dialogue context, containing illusory errors, containing grammar errors, repeating words, and nonsense. About 64.5% of the generated responses are labeled perfect or good; the remaining 35.5% contain mistakes of some kind. The most notable error type is 'nonsense', meaning the generated response is meaningless even though it is usually fluent and relevant to the context, for example, wrongly rephrasing the query. Responding with an irrelevant topic, making grammar errors, and repeating words are three common error types among generative models, but their rates in our approach are well controlled. Knowledge-aware models are more prone to generating 'illusory' responses that violate commonsense knowledge, for example, 'What disease do you drink?'. We are glad to find this rarely happens in our approach.

Figure 3: The statistics of error types. Red bars are exclusive labels; each response can only be labeled as one type. Blue bars are fine-grained error-type labels; each bad case is given at least one label.

Conclusion and Future Work
We propose a novel knowledge selection method, Prototype-KR, and a knowledge-aware model, Prototype-KRG, for open-domain knowledge-aware dialogue generation. Prototype-KR retrieves knowledge facts from human-written prototype dialogues; it is fast, accurate, and requires no additional labeled data. Extensive experiments on a large-scale Chinese dataset show that our approach outperforms generative baselines on most metrics. Compared to IR-based approaches, our approach generates responses with higher relevance and comparable informativeness.
In the future, we will continue to strengthen the use of prototype dialogues without complicating the dialogue generation process. Meanwhile, we plan to explore the use of different kinds of knowledge in a dialogue system.