Augmenting Knowledge-grounded Conversations with Sequential Knowledge Transition

Knowledge data are massive and widespread in the real world and can serve as good external sources to enrich conversations. However, in knowledge-grounded conversations, current models still lack fine-grained control over knowledge selection and its integration with dialogues, which leads to knowledge-irrelevant response generation: 1) knowledge selection relies merely on the dialogue context, ignoring the inherent knowledge transitions along the conversation flow; 2) models often over-fit during training, producing incoherent responses at test time by referring to unrelated tokens from specific knowledge content; 3) although responses are generated from both the dialogue history and the knowledge, models often overlook the selected knowledge and hence generate knowledge-irrelevant responses. To address these problems, we propose to explicitly model the knowledge transition in sequential multi-turn conversations by abstracting knowledge into topic tags. Moreover, to fully utilize the selected knowledge in the generation process, we propose pre-training a knowledge-aware response generator that pays more attention to the selected knowledge. In particular, a sequential knowledge transition model equipped with a pre-trained knowledge-aware response generator (SKT-KG) formulates the high-level knowledge transition and fully exploits the limited knowledge data. Experimental results on both structured and unstructured knowledge-grounded dialogue benchmarks show that our model outperforms baseline models.


Introduction
Knowledge-grounded conversation (Long et al., 2017; Liu et al., 2018; Niu et al., 2019; Xu et al., 2020), which aims at improving the informativeness and specificity of dialogue generation by exploiting external knowledge sources, has attracted much attention as a potential solution to the common response problem (Li et al., 2015; Zhang et al., 2018a; Ren et al., 2020) in dialogue generation, i.e., responses such as 'I don't know.' and 'What do you mean?'. Typically, knowledge-grounded conversation is decomposed into two sub-processes (Dinan et al., 2018): knowledge selection (KS) based on the dialogue context, and response generation with reference to the selected knowledge. Selecting relevant knowledge and then incorporating it efficiently is therefore of great significance for multi-turn knowledge-grounded dialogue generation. Although external knowledge sources are widespread in the real world, current knowledge-grounded conversation models still lack fine-grained control over knowledge selection and its integration with dialogues. Most existing works (Liu et al., 2018; Niu et al., 2019; Kim et al., 2020) select knowledge according to the given dialogue context. However, the sequential transition characteristic of knowledge (also known as knowledge shift) across multiple conversation turns is neglected. As shown in Figure 1, two people talking about an actor move from the knowledge 'astrological sign' to the knowledge 'blood typology', which is a natural transition in human personality chat (Mayo et al., 1978; Miller, 2014). Taking this sequential knowledge transition into account is thus of tremendous benefit to knowledge-grounded conversations.
Furthermore, the knowledge-irrelevant response generation problem also hampers the performance of existing models, for two reasons. The first is that current models often over-fit during training, producing incoherent responses at test time by referring to unrelated tokens from specific knowledge content. To resolve this problem, we propose to calculate the knowledge transition probability across turns on a high-level representation, i.e., the knowledge topic tag. With such a concise high-level knowledge representation, our model is not limited to conventional structured knowledge-grounded conversation but can be easily adapted to unstructured knowledge-based conversations. For example, in structured triple data, i.e., {obj, relation, content}, we can use the 'relation' as the high-level topic tag to model the sequential knowledge transition in conversations. As shown in Figure 1, the topic migrates from the 'astrological sign' tag to the 'blood typology' tag, and then moves to 'masterpiece'. For an unstructured dataset like Wizard of Wikipedia (Dinan et al., 2018), we can use topic models such as LDA (Blei et al., 2003) to obtain a knowledge tag for each turn, and then calculate the sequential transition probabilities among these tags. Since the number of tag categories is limited, the tags can be readily employed to model the knowledge transition.
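To make the unstructured case concrete, the following minimal sketch assigns an LDA topic id as the knowledge tag of each knowledge sentence using gensim; the tokenization, the stub corpus, and the training settings are illustrative assumptions, not the paper's exact preprocessing.

```python
# A minimal sketch of abstracting unstructured knowledge into topic tags
# with LDA. The corpus below is a stub; in practice it would contain one
# tokenized knowledge sentence per dialogue turn.
from gensim import corpora
from gensim.models import LdaModel

knowledge_sentences = [
    ["chao", "wu", "astrological", "sign", "aries"],
    ["chao", "wu", "blood", "type", "ab"],
    # ... more tokenized knowledge sentences
]

dictionary = corpora.Dictionary(knowledge_sentences)
bow_corpus = [dictionary.doc2bow(sent) for sent in knowledge_sentences]
lda = LdaModel(bow_corpus, num_topics=50, id2word=dictionary, passes=10)

def knowledge_tag(sentence_tokens):
    """Assign the highest-probability LDA topic id as the knowledge tag."""
    bow = dictionary.doc2bow(sentence_tokens)
    topics = lda.get_document_topics(bow)
    return max(topics, key=lambda pair: pair[1])[0]
```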
The second reason is that models often tend to overlook the selected knowledge and hence generate knowledge-irrelevant responses. To address this problem, we propose pre-training a knowledge-aware response generator that learns to generate a natural sentence from a given piece of knowledge, so as to make full use of the limited knowledge data. For example, in Figure 1, given the triple '{Chao Wu, astrological sign, Aries}', the knowledge-aware generator is optimized to generate the sentence 'Chao Wu's astrological sign is Aries.'. The generator should likewise be able to generate 'Zhiling Lin's astrological sign is Virgo.' when given '{Zhiling Lin, astrological sign, Virgo}'. In effect, the knowledge-aware response generator learns how to generate a natural sentence based on a relation tag rather than on specific knowledge content, much as a student learns grammar rules rather than memorizing specific examples when learning a foreign language. Therefore, even with limited data, the generator can produce sentences relevant to the given knowledge.
In this paper, we propose a sequential knowledge transition model equipped with a pre-trained knowledge-aware response generator (SKT-KG), which models the high-level knowledge transition in conversation and makes full use of the limited knowledge data. Specifically, we first pre-train a transformer-based response generator on the knowledge. We then use a BiLSTM-CRF (Huang et al., 2015) network to model the knowledge transition process, selecting the knowledge tag with the maximum score and its corresponding knowledge content. Finally, we feed the dialogue utterances and the selected knowledge content together into the pre-trained knowledge-aware response generator to produce the final response.
In our experiments, we evaluate the proposed model on two public knowledge-grounded dialogue datasets: the structured DuConv corpus and the unstructured Wizard of Wikipedia (WoW) corpus. The results show that SKT-KG produces more diverse and suitable responses than traditional knowledge-grounded models. In addition, an analysis of knowledge selection shows that SKT-KG obtains higher ranking measures than the baselines, indicating that the knowledge selected by our model is reasonable.

Related Work
Recently, dialogue systems have gained increasing attention in both the research community (Vougiouklis et al., 2016; Liu et al., 2018; Shen et al., 2019; Shen and Feng, 2020) and industry, owing to their practicality in real applications such as chatbots and customer service (Shen et al., 2021; Zhang et al., 2020). With external knowledge sources, dialogue systems can generate more specific and informative responses, which has great potential to resolve the common response problem (Zhang et al., 2018b; Ren et al., 2020). The majority of previous works decompose the knowledge-grounded dialogue generation task into two sub-problems: knowledge selection and response generation.
For knowledge selection, previous works proposed keyword matching (Ghazvininejad et al., 2018; Liu et al., 2018), information retrieval, and entity diffusion (Liu et al., 2018) methods to detect relevant knowledge based on the dialogue context, and then feed both the dialogue utterances and the selected knowledge into generative models. One line of work employs a graph attention mechanism to encode the retrieved relevant knowledge graph, which augments the semantic understanding of the dialogue context. Another uses prior and posterior distributions over knowledge to facilitate knowledge selection. Although these works can model the relationship between context and knowledge, they still ignore the knowledge transition characteristic, which is important for knowledge selection.
Human dialogue depends on both local and global information. Prior work has also pointed out that natural language understanding requires a coherent understanding of a series of events or actions: not only what events have occurred, but also what is likely to happen next. It is therefore critical to obtain natural and relevant knowledge for knowledge-grounded dialogue generation. Sun et al. (2020) proposed to recurrently update the knowledge based on the conversation history and progressively incorporate it step by step, but they only consider the relationship from history to knowledge. Moreover, these models may suffer from a knowledge sparsity problem due to low-resource limitations in practice.
In reality, sufficient knowledge-grounded dialogue data are difficult to obtain. To tackle this practical challenge, Su et al. (2020) proposed to augment dialogue generation with external non-conversational text, which may however introduce considerable noise. Li et al. (2020) proposed to pre-train the knowledge encoder on unstructured knowledge and fine-tune the model with the limited knowledge-grounded training examples. In our work, we make full use of the training data and model the high-level knowledge transition process, which alleviates the sparsity problem in knowledge-grounded dialogue data.

Approach
In this section, we present our sequential knowledge transition model with a pre-trained knowledge-aware response generator (SKT-KG), shown in Figure 2. The model contains three major parts: a pre-trained knowledge-aware response generator, a sequential knowledge transition module, and a transformer decoder. Specifically, we first pre-train a transformer-based knowledge-aware response generator on knowledge items and their corresponding natural sentences. We then use a BiLSTM-CRF (Huang et al., 2015) network to model the knowledge transition process, selecting the knowledge tag with the maximum score and its corresponding knowledge content. Finally, we feed the context utterances and the selected knowledge content into the knowledge-aware response generator to fine-tune it. After fine-tuning, a response can be generated given the selected knowledge tag, its corresponding content, and the dialogue history.
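The following sketch outlines how the three parts could fit together at inference time; all class and method names here are hypothetical, since the paper does not prescribe an interface.

```python
# A high-level sketch of the SKT-KG pipeline described above.
class SKTKG:
    def __init__(self, generator, transition_model, matcher):
        self.generator = generator          # pre-trained knowledge-aware generator
        self.transition = transition_model  # BiLSTM-CRF over knowledge tags
        self.matcher = matcher              # coarse-to-fine content matcher (e.g., BM25)

    def respond(self, history_utterances, history_tags, candidate_knowledge):
        # 1) Predict the next knowledge tag from the tag/utterance sequence.
        next_tag = self.transition.predict(history_utterances, history_tags)
        # 2) Pick the knowledge content carrying that tag (ties broken by matching score).
        candidates = [ck for ck in candidate_knowledge if ck.tag == next_tag]
        content = self.matcher.best(candidates, history_utterances)
        # 3) Generate the response conditioned on tag, content, and history.
        return self.generator.generate(next_tag, content, history_utterances)
```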
Figure 3: An example of the input representation. In the pre-training phase, we mask the context utterance part. In the fine-tuning response generation phase, we concatenate the selected knowledge tag, the selected knowledge content, and the history utterances as the input.

Input Representation
First, we introduce the data formulation of our model. Given the history knowledge contents $K = \{k_1, \cdots, k_n\}$, the history context $C = \{c_1, \cdots, c_n\}$, and the candidate knowledge set for the response $CK = \{ck_1, \cdots, ck_m\}$, the goal of our model is to select the most relevant and natural knowledge $ck_t \in CK$ based on the sequential $K$ and $C$, and then generate the response $Y = \{y_1, \cdots, y_{|Y|}\}$ based on the selected knowledge $ck_t$ and context $C$. Note that each history utterance $c_i$ is related to a history knowledge $k_i$, and each knowledge $k_i$ has a knowledge tag $t_i \in T$. This tag is explicit in structured knowledge (e.g., the 'relation' in triple knowledge, as shown in Figure 1) and implicit in unstructured knowledge, where it is abstracted by a topic model, i.e., LDA (Blei et al., 2003). The tag category set $T = \{t_1, ..., t_N\}$ contains $N$ different knowledge tags.

We use classical transformer blocks as the backbone framework. To generate the response $Y$, the original input is the concatenation of the selected knowledge tag $s_t$, the selected knowledge content $ck_t$, and the history context utterances $\{c_1, \cdots, c_n\}$. We apply three embeddings to this input, as shown in Figure 3: token embedding, role embedding, and position embedding. For knowledge content and dialogue utterances, the token embedding is the word embedding of each token; for the knowledge tag, we map each tag category to its own token embedding. A special end-of-knowledge token [EOK] is inserted between the knowledge and the utterance context to mark the border, and an end-of-utterance token [EOU] is appended to each history dialogue utterance. Role embeddings differentiate knowledge content from dialogue utterances: the role embedding $E_K$ is added to the knowledge content, and dialogue utterances carry the role embedding $E_C$. Position embeddings are added according to the token position within each utterance. For the special knowledge tag token, both the role and position embeddings are set to zero.
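As one possible realization of this input scheme, the sketch below assembles token, role, and position id sequences with the [EOK] and [EOU] conventions described above; the concrete id assignments (role 1 for knowledge, role 2 for utterances) are our assumptions.

```python
# A minimal sketch of assembling the three input id sequences.
def build_input(tag_id, knowledge_tokens, utterances, vocab):
    tokens, roles, positions = [], [], []
    # Knowledge tag: a single category token; role and position are zero.
    tokens.append(tag_id); roles.append(0); positions.append(0)
    # Knowledge content with role embedding E_K (role id 1 here).
    for pos, tok in enumerate(knowledge_tokens):
        tokens.append(vocab[tok]); roles.append(1); positions.append(pos)
    tokens.append(vocab["[EOK]"]); roles.append(1); positions.append(len(knowledge_tokens))
    # Dialogue utterances with role embedding E_C (role id 2); positions
    # restart inside each utterance, and [EOU] closes every utterance.
    for utt in utterances:
        for pos, tok in enumerate(utt):
            tokens.append(vocab[tok]); roles.append(2); positions.append(pos)
        tokens.append(vocab["[EOU]"]); roles.append(2); positions.append(len(utt))
    return tokens, roles, positions
```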

Pre-trained Knowledge-aware Response Generator
Our pre-trained knowledge-aware response generator involves two phases: a pre-training phase and a fine-tuning response generation phase. In the pre-training phase, given the knowledge tag and knowledge content, the generator learns to generate the corresponding sentence, as shown on the left of Figure 2. In the fine-tuning response generation phase, given the context utterances, the knowledge tag, and the selected knowledge content, the generator learns to produce a natural and relevant response, as shown at the top-right of Figure 2. To unify the two phases, we employ a flexible self-attention mask mechanism to distinguish the input representations of these two phases, as shown in Figure 3.
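A minimal sketch of one way such a flexible mask could be realized is given below, assuming a UniLM-style 0/1 mask matrix over the concatenated [knowledge | context | response] input; the exact layout is our assumption rather than the paper's specification.

```python
import torch

# 1 = may attend, 0 = blocked. In pre-training the context block is masked
# out entirely; in fine-tuning it joins the bidirectional source span.
def attention_mask(k_len: int, c_len: int, r_len: int, pretraining: bool = True):
    total = k_len + c_len + r_len
    r_start = k_len + c_len
    # Source span: knowledge only in pre-training; knowledge + context otherwise.
    src = torch.zeros(total)
    src[:k_len] = 1.0
    if not pretraining:
        src[k_len:r_start] = 1.0
    # Source tokens attend bidirectionally among themselves (outer product).
    mask = src.unsqueeze(1) * src.unsqueeze(0)
    # Response tokens attend to the full source...
    mask[r_start:] = src
    # ...and causally to earlier response tokens.
    mask[r_start:, r_start:] = torch.tril(torch.ones(r_len, r_len))
    return mask
```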
In the pre-training phase, we apply a self-attention mask to the history dialogue utterances, so that the knowledge-aware response generator is trained independently of the context. Given the knowledge content $k_i \in K$, its knowledge tag $t_i \in T$, and its corresponding sentence $c_i = \{x^i_1, \cdots, x^i_N\}$, we adopt the negative log-likelihood loss as the training objective:

$$\mathcal{L}_{pre}(\theta) = -\sum_{t=1}^{N} \log p(x^i_t \mid x^i_{<t}, k_i, t_i; \theta),$$

where $\theta$ denotes the model parameters and $x^i_{<t}$ denotes the previously generated words.
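As an illustration, a teacher-forcing training step for this objective might look as follows; the generator interface and the index bookkeeping are assumptions.

```python
import torch
import torch.nn.functional as F

# Negative log-likelihood of the target sentence tokens under teacher
# forcing. `generator` is assumed to return per-position vocabulary logits
# for the concatenated input; the target span starts at `target_start`.
def pretrain_step(generator, input_ids, target_ids, target_start):
    logits = generator(input_ids)                # (seq_len, vocab)
    # Shift by one: position t-1 predicts token t within the target span.
    pred = logits[target_start - 1:-1]
    # Summed cross-entropy = -sum_t log p(x_t | x_<t, k_i, t_i).
    return F.cross_entropy(pred, target_ids, reduction="sum")
```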

Sequential Knowledge Transition
In this section, we introduce the knowledge selection process, which consists of an utterance encoding module and a knowledge transition module. To predict the next knowledge tag, we consider both the sequential knowledge tags and the sequential context utterances, as shown in Figure 4.

Utterance Encoding. To build the sequential context representation, we use a standard base BERT model with average pooling (Cer et al., 2018) followed by a BiLSTM. Given the context utterances $C = \{c_1, \cdots, c_n\}$, where $c_i$ is composed of a group of words $\{x^i_1, ..., x^i_N\}$, we use BERT to encode each utterance $c_i$ into a sentence embedding $u^c_i$, and then apply a BiLSTM over these sentence embeddings to obtain the sequential context representation:

$$h^c_i = \mathrm{BiLSTM}(u^c_i, h^c_{i-1}).$$

Knowledge Transition. We model the knowledge tag transition process with a Conditional Random Field (CRF) (Lafferty et al., 2001). Concretely, we combine a BiLSTM network and a CRF layer into a BiLSTM-CRF model, as shown in Figure 4. This network efficiently exploits past input features via the BiLSTM layer and sequence-level tag information via the CRF layer. Each BiLSTM cell outputs a score for every tag. Given a context representation $h^c_i$, the corresponding tag scores are

$$\mathrm{score}_{i+1} = W_1 h^c_i + b_1,$$

where $W_1$ and $b_1$ are trainable parameters, and $\mathrm{score}_{i+1}[t_{i+1}]$ denotes the score of knowledge tag $t_{i+1}$ at the $(i+1)$-th step. The CRF layer models the sequential tag relationship by maximizing a global score $C(t_1, t_2, ..., t_n; \theta)$, which combines a transition score $T[i, j]$ with the matrix of emission scores: $T[i, j]$ models the transition probability from the $i$-th tag to the $j$-th tag for a pair of consecutive steps, while the matrix of scores records the tag transition path along the context sentences.
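A compact sketch of this BiLSTM-CRF module is shown below, assuming the third-party pytorch-crf package for the CRF layer; the hidden sizes and tag count are placeholders.

```python
import torch.nn as nn
from torchcrf import CRF  # assumes the pytorch-crf package

# BERT sentence embeddings -> BiLSTM -> per-tag emission scores -> CRF
# over the tag sequence.
class KnowledgeTransition(nn.Module):
    def __init__(self, sent_dim=768, hidden=256, num_tags=50):
        super().__init__()
        self.bilstm = nn.LSTM(sent_dim, hidden, bidirectional=True, batch_first=True)
        self.emission = nn.Linear(2 * hidden, num_tags)  # score_i = W1 h_i^c + b1
        self.crf = CRF(num_tags, batch_first=True)

    def forward(self, utt_embs, tags=None):
        h, _ = self.bilstm(utt_embs)      # (batch, turns, 2*hidden)
        scores = self.emission(h)         # per-turn tag scores
        if tags is not None:              # training: negative log-likelihood
            return -self.crf(scores, tags)
        return self.crf.decode(scores)    # inference: best tag path (Viterbi)
```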

Figure 4: Sequential knowledge transition phase.
Once we obtain the knowledge tag $s_t$, we pick the corresponding knowledge content $ck_t$ from the candidate knowledge set $CK$. If multiple knowledge contents share the tag $s_t$, we apply a coarse-to-fine knowledge matching module and select the content with the maximum score as $ck_t$.
Coarse-to-fine Knowledge Matching. To select the final knowledge content from multiple candidates with the same knowledge tag, we adopt BM25 (Robertson and Zaragoza, 2009) as the coarse-to-fine matching model. Given a pair of knowledge content and dialogue context $(ck_i, c)$, the matching model outputs a matching score, and we choose the knowledge content with the highest score as the final knowledge content.

Knowledge Transition Loss. In the training phase, we adopt a two-level knowledge loss to optimize the sequential selection process. The knowledge tag loss $\mathcal{L}^{tag}_{kg}(\theta)$ is a log-likelihood loss that minimizes the difference between the true tag label and the predicted tag label. The knowledge content loss $\mathcal{L}^{cont}_{kg}(\theta)$ is a cross-entropy loss that minimizes the divergence between the true knowledge sentence and the predicted one. The total knowledge transition loss is defined as

$$\mathcal{L}_{kg}(\theta) = \mathcal{L}^{tag}_{kg}(\theta) + \mathcal{L}^{cont}_{kg}(\theta).$$
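As an illustration of the coarse-to-fine step, the sketch below ranks same-tag candidates against the dialogue context with BM25, assuming the rank-bm25 package and a hypothetical `.tokens` field on candidate objects.

```python
from rank_bm25 import BM25Okapi  # assumes the rank-bm25 package

# Among candidates sharing the predicted tag, keep the knowledge content
# that scores highest against the (tokenized) dialogue context.
def match_content(candidates, context_tokens):
    corpus = [ck.tokens for ck in candidates]
    bm25 = BM25Okapi(corpus)
    scores = bm25.get_scores(context_tokens)
    return candidates[int(scores.argmax())]
```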

Fine-tuning and Response Generation
The flexible self-attention mask mechanism enables the pre-trained generator to take the dialogue history into account in the response generation phase. Given the predicted knowledge tag $s_t$, its corresponding knowledge content $ck_t$, and the dialogue contexts $\{c_1, \cdots, c_n\}$, fine-tuning generates the response $y = \{y_1, \cdots, y_N\}$ by optimizing

$$\mathcal{L}_{ft}(\theta) = -\sum_{t=1}^{N} \log p(y_t \mid y_{<t}, ck_t, s_t, c_1, \cdots, c_n; \theta).$$

The process is shown on the right of Figure 2. After the fine-tuning phase, a response can be generated given the selected knowledge tag, its corresponding knowledge content, and the dialogue history.

Experiments

Baselines. We compare our SKT-KG model with several state-of-the-art models: (i) Transformer: a fully self-attention-based model (Vaswani et al., 2017); (ii) MemNet: the End-to-End Transformer with a memory mechanism (Dinan et al., 2018), which uses a Transformer memory network for knowledge selection and a Transformer decoder for utterance prediction; (iii) PostKS: Posterior Knowledge Selection, which uses the posterior knowledge distribution as a pseudo-label for knowledge selection; (iv) SLKS: the sequential latent knowledge selection model (Kim et al., 2020), which keeps track of prior and posterior distributions over knowledge and sequentially updates them considering the contexts of previous turns. We also evaluate an ablated variant to investigate the effect of our pre-trained knowledge-aware response generator: SKT is the model without the pre-trained generator, which only uses the knowledge transition to select the knowledge and then generates the response with a plain transformer decoder.

Parameter Settings. For WoW, we set the vocabulary size to 30,522, following the default BERT setting. For DuConv, we set the vocabulary size to 21,128.
For a fair comparison with all baselines, the number of hidden units is set to 512 and the batch size to 128. The maximum sentence length is set to 30 and the maximum number of dialogue turns to 8. The topic number of LDA on the WoW dataset is set to 50. We use Adam (Kingma and Ba, 2014) for gradient optimization, with a learning rate of 0.001. All models are run on Tesla P40 GPUs.

Evaluation Measures. We use both quantitative evaluation and human judgments. Specifically, we report BLEU-1/2, distinct-1/2, and embedding metrics (average, extrema, and greedy). We also measure the knowledge selection precision and F1 score between the predicted and ground-truth knowledge. For human evaluation, we randomly sampled 300 generated responses and invited six annotators (all CS-majored students) to rate the relevance, informativeness, and naturalness of each generated response with respect to its context, each on a scale from 0 to 3.
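For reference, distinct-n, as used above, is the ratio of unique n-grams to all n-grams over the generated responses; a minimal sketch:

```python
# Distinct-n diversity metric over a list of tokenized responses.
def distinct_n(responses, n):
    total, unique = 0, set()
    for tokens in responses:
        ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        total += len(ngrams)
        unique.update(ngrams)
    return len(unique) / max(total, 1)

# Example: distinct_n([["i", "like", "it"], ["i", "like", "movies"]], 2) -> 0.75
```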

Metric-based Evaluation
The metric-based evaluation results are shown in Tables 1 and 2. The sequential knowledge models, i.e., SLKS and our SKT models, perform better than the traditional knowledge-grounded dialogue models, i.e., MemNet and PostKS, in terms of the BLEU and distinct measures, because the sequential characteristic of knowledge is significant and beneficial to the knowledge selection process. Our SKT-KG model obtains good results: taking BLEU-2 on the DuConv dataset as an example, SKT-KG reaches 26.31, better than all baseline models. The distinct-2 value of our model is also higher than those of the baselines, indicating that our model generates more diverse responses. For the unigram F1 score of knowledge selection in Table 3, SKT-KG achieves 19.26, better than the other models, showing that our model extracts more relevant and natural knowledge than the baselines. Compared with the ablated SKT model, the pre-trained knowledge-aware response generator improves the distinct measures and the unigram F1 score, indicating that the pre-trained generator helps produce more diverse responses. A significance test shows that the improvement of our model is significant on both datasets (p-value < 0.01). In summary, our SKT-KG model generates more relevant and more diverse responses than the baselines.

Human Evaluation
The results of human evaluation on the generated responses are shown in Table 4. The relevance (Rel), informativeness (Info), and naturalness (Nat) scores of our model are higher than those of MemNet, PostKS, and SLKS, indicating that SKT-KG outperforms the baseline methods. Taking DuConv as an example, SKT-KG scores 2.3 on relevance and 2.6 on informativeness, while SLKS scores 2.2 and 2.1, indicating that our model generates more informative responses than SLKS. For naturalness, SKT-KG scores 2.3 versus 2.1 for SLKS, showing that the high-level knowledge transition is effective for knowledge-grounded dialogue generation and that SKT-KG generates more natural responses with more information. The Kappa (Fleiss, 1971) values demonstrate the consistency among annotators. A significance test again shows that the improvement of our model is significant on both datasets (p-value < 0.01).

Case Study
To facilitate a better understanding of our model, we present some examples in Figure 5. In the multi-turn dialogues, the knowledge topic moves from 'reviews of Mengyao Xi' to her 'master work', and then to the 'master work of Sui He'; the ground-truth knowledge tag is 'reviews of Sui He'. From the generation results, the sequential models perform better than the pure selection models, i.e., MemNet and PostKS. In the example in Figure 5, MemNet and PostKS generate unnatural responses such as 'Area of Sui He' and 'Height of Sui He', whereas the sequential models generate more natural and relevant responses, such as 'Yes, she is the angel of China' and 'He Sui was the girl named as the angel of China'. This is mainly because the sequential models can locate the 'reviews' knowledge, which is more natural given the context. Moreover, our high-level transition model with the pre-trained knowledge-aware response generator generates more informative responses than SLKS, as shown in Figure 5.

Analysis on Knowledge Selection
To verify whether the performance improvements are owing to the knowledge transition module, we conduct a further analysis. Specifically, we randomly sample 300 examples from the DuConv and WoW datasets to evaluate the knowledge selection performance of the baselines and our model. Since knowledge-grounded dialogue models select the relevant knowledge from the candidate knowledge set based on the dialogue contexts, we can treat knowledge selection as a ranking task and use ranking measures, i.e., precision, recall, and F1 score, for quantitative evaluation. We calculate the precision, recall, and F1 score at top 1, 2, and 5 for PostKS, SLKS, and our SKT-KG model. The results are shown in Table 5. The sequential knowledge selection models, SLKS and SKT-KG, perform better than the traditional selection model, PostKS, validating the effectiveness of sequential knowledge modeling. These results indicate that our sequential knowledge transition module is capable of selecting more relevant knowledge content than the baseline models.
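A minimal sketch of this top-k ranking evaluation, with hypothetical inputs (a ranked candidate list and a gold knowledge set):

```python
# Precision, recall, and F1 at top k for knowledge selection as ranking.
def prf_at_k(ranked_candidates, gold_set, k):
    top_k = ranked_candidates[:k]
    hits = sum(1 for c in top_k if c in gold_set)
    precision = hits / k
    recall = hits / max(len(gold_set), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1
```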

Conclusion and Future Work
In this paper, we propose a sequential knowledge transition model with a knowledge-aware response generator (SKT-KG) to model the high-level knowledge transition and fully utilize low-resource knowledge data. SKT-KG abstracts knowledge into topic tags, which allows the model to be applied easily to both structured and unstructured knowledge-grounded conversations. In addition, we propose a pre-trained knowledge-aware response generator that learns to generate a natural sentence from a given piece of knowledge, making full use of the limited data. Experimental results on both structured and unstructured knowledge-grounded dialogue datasets show that SKT-KG outperforms baseline models. In future work, we intend to apply variational autoencoders to unstructured datasets, so that models can learn the knowledge topics by themselves.