Difference-aware Knowledge Selection for Knowledge-grounded Conversation Generation

In a multi-turn knowledge-grounded dialog, the difference between the knowledge selected at different turns usually provides potential clues for knowledge selection, which has been largely neglected in previous research. In this paper, we propose a difference-aware knowledge selection method. It first computes the difference between the candidate knowledge sentences provided at the current turn and those chosen in previous turns. Then, the differential information is fused with or disentangled from the contextual information to facilitate final knowledge selection. Automatic, human observational, and human interactive evaluations show that our method selects knowledge more accurately and generates more informative responses, significantly outperforming the state-of-the-art baselines.


Introduction
Knowledge-grounded conversation generation aims at generating informative responses based on both the discourse context and external knowledge (Ghazvininejad et al., 2018; Zhou et al., 2018a), where selecting appropriate knowledge is critical to the success of the task. Existing knowledge selection models generally fall into two types. One type is based solely on the context (Meng et al., 2020), which we call non-sequential selection because knowledge selection at different turns is independent. The other type sequentially selects knowledge additionally conditioned on previously selected knowledge (Kim et al., 2020), which we call sequential selection. As shown in Kim et al. (2020), such a sequential approach can better simulate a multi-turn dialog and facilitate knowledge selection in later turns. However, the difference between the knowledge selected at different turns has been largely neglected in prior studies, even though it usually provides potential clues for knowledge selection. Figure 1 illustrates an example, where the dialog system selects one of the candidate knowledge sentences (all relevant to the context) at the 2nd turn. Selecting knowledge that differs little from, or is even identical to, the previously selected knowledge (like the 1st sentence) may lead to repetitive responses, while too large a difference (like the 3rd sentence) would make the response incoherent with the context. As a result, the dialog system should avoid knowledge that differs from the previously selected knowledge either too little or too much, and instead select an appropriate knowledge sentence (the 2nd one) that makes the conversation flow smoothly and naturally.

Figure 1: An example of knowledge selection at the 2nd turn. One mark denotes that the corresponding knowledge has little difference from or is identical to the previously selected one, so selecting it may lead to repetitive responses; the red × denotes that the difference is too large, so selecting it could make the response incoherent with the context.
We thus propose DiffKS, a novel Difference-aware Knowledge Selection method for knowledge-grounded conversation generation. It first computes the difference between the candidate knowledge sentences provided at the current turn and the previously selected knowledge. Then, in the two models we devise, the differential information is fused with or disentangled from the contextual information to facilitate final knowledge selection. Automatic and human evaluation on two widely used benchmarks shows that our method significantly outperforms the state-of-the-art baselines: it selects knowledge more accurately and generates more informative responses.
Our contributions are summarized as follows:

• We propose to explicitly model and utilize the differential information between the knowledge selected at different turns of a multi-turn knowledge-grounded conversation. We further devise two variants in which the differential information is fused with or disentangled from the contextual information during knowledge selection.
• Automatic, human observational, and human interactive evaluations show that our method significantly outperforms strong baselines in terms of knowledge selection and can generate more informative responses.
Related Work

Knowledge-grounded Dialog Generation
Recently, a variety of neural models have been proposed for knowledge-grounded conversation generation (Zhu et al., 2017; Young et al., 2018; Zhou et al., 2018a; Liu et al., 2018). The research topic has also been greatly advanced by many corpora (Zhou et al., 2018b; Moghe et al., 2018; Dinan et al., 2019; Gopalakrishnan et al., 2019; Moon et al., 2019; Tuan et al., 2019; Zhou et al., 2020). As surveyed in prior work, existing studies have mainly addressed two research problems: (1) knowledge selection: selecting appropriate knowledge given the dialog context and previously selected knowledge (Meng et al., 2020; Kim et al., 2020); and (2) knowledge-aware generation: injecting the selected knowledge to generate meaningful and informative responses (Ghazvininejad et al., 2018; Zhou et al., 2018a; Li et al., 2019; Qin et al., 2019; Yavuz et al., 2019; Zhao et al., 2020). Since selecting appropriate knowledge is a precursor to the success of knowledge-grounded dialog systems, we focus on the knowledge selection problem in this paper.

Non-sequential Knowledge Selection
Non-sequential selection models capture the relationship between the current context and the background knowledge (Meng et al., 2020). For instance, PostKS (Lian et al., 2019) estimates a posterior distribution over candidate knowledge sentences based on both the context and the gold response, and during inference uses only the context to estimate a prior distribution as an approximation of the posterior. Several other models, such as that of Meng et al. (2020), also belong to the non-sequential type. Different from our work and Kim et al. (2020), which select knowledge from candidate knowledge sentences, some of these methods are devised to select important text spans or fragments from the background knowledge document for use in generation, and therefore have a different task setting from ours.

Sequential Knowledge Selection
The sequential selection models additionally make use of previously selected knowledge to facilitate knowledge selection (Kim et al., 2020). For instance, Kim et al. (2020) propose a Sequential Latent Knowledge Selection (SLKS) model. It keeps track of the hidden states of dialog history and previously selected knowledge sentences. Our method is parallel to SLKS because we also utilize the previously selected knowledge. However, we explicitly compute the difference between knowledge selected at different turns, while SLKS only encodes the already selected knowledge in an implicit way.
In addition, a number of recent works propose RL-based models that select a path in a structured knowledge graph (KG) (Xu et al., 2020a,b), which also select knowledge in a sequential way. Since our method is designed to ground the conversation in unstructured knowledge text, we leave the application of our method to such KG-grounded dialog generation tasks (Moon et al., 2019; Zhou et al., 2020) as future work.

Task Formulation
In a multi-turn dialogue, given a post and a sequence of knowledge sentences at each turn, our goal is to select appropriate knowledge and generate a proper response to the current context.
Formally, the post at the $\tau$-th turn is a sequence of tokens $x^\tau = x^\tau_1, \ldots, x^\tau_{|x^\tau|}$, and the response to be generated is $y^\tau = y^\tau_1, \ldots, y^\tau_{|y^\tau|}$. The background knowledge $k^\tau = k^\tau_1, \ldots, k^\tau_{|k^\tau|}$ contains the knowledge sentences provided at the $\tau$-th turn, where each $k^\tau_i = k^\tau_{i,1}, \ldots, k^\tau_{i,|k^\tau_i|}$ is the sequence of tokens of the $i$-th sentence.
Note that under the multi-turn dialogue setting, we use $c^\tau = [x^{\tau-1}; y^{\tau-1}; x^\tau]$ as the given context at the $\tau$-th turn, where $[\cdot;\cdot]$ denotes concatenation. In the Encoders and Decoder sections below, we omit the superscript $\tau$ for simplicity.

Encoders
Figure 2: An overview of the model structure.

The context is encoded with a bidirectional GRU (Cho et al., 2014), and we take the concatenation of the final forward and backward hidden states, $h_c = [\overrightarrow{h}_{c,|c|}; \overleftarrow{h}_{c,1}]$, as the context representation. Similarly, the knowledge sentences are encoded with another BiGRU, yielding $h_{k,i}$ as the representation of $k_i$. Specifically, we add an empty sentence $k_0$ that indicates no knowledge being used.
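For concreteness, the following is a minimal PyTorch sketch of the two encoders; the module names, tensor shapes, and padding convention are our own illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class Encoders(nn.Module):
    """Minimal sketch of the context and knowledge BiGRU encoders."""
    def __init__(self, vocab_size, emb_size=300, hidden_size=200):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size, padding_idx=0)
        # Separate BiGRUs for the context and the knowledge sentences.
        self.ctx_gru = nn.GRU(emb_size, hidden_size, bidirectional=True, batch_first=True)
        self.kno_gru = nn.GRU(emb_size, hidden_size, bidirectional=True, batch_first=True)

    @staticmethod
    def _last_states(h_n):
        # h_n: (2, batch, hidden) -> concat final forward and backward states.
        return torch.cat([h_n[0], h_n[1]], dim=-1)        # (batch, 2*hidden)

    def forward(self, ctx_ids, kno_ids):
        # ctx_ids: (batch, ctx_len); kno_ids: (batch, n_sent, sent_len),
        # where sentence index 0 is the empty "no knowledge" sentence k_0.
        _, h_c = self.ctx_gru(self.emb(ctx_ids))
        h_c = self._last_states(h_c)                      # context repr. h_c
        b, n, l = kno_ids.size()
        _, h_k = self.kno_gru(self.emb(kno_ids.view(b * n, l)))
        h_k = self._last_states(h_k).view(b, n, -1)       # h_{k,i} per sentence
        return h_c, h_k
```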

Difference-aware Knowledge Selection
To select proper knowledge, our model is made aware of the difference between the current candidate knowledge sentences and the previously selected knowledge.
To make full use of the contextual dependency and relevance between the knowledge sentences, our model first compares the candidate knowledge sentences to explore their correlations, where the comparison is conducted with a BiGRU over the sentence representations:
$$r^\tau_0, \ldots, r^\tau_{|k^\tau|} = \mathrm{BiGRU}\big(h^\tau_{k,0}, \ldots, h^\tau_{k,|k^\tau|}\big).$$
Then, the model computes the difference of each knowledge sentence $r^\tau_i$ from the knowledge selected in the previous $M$ turns, $\{h^{\tau-m}_k\}_{m=1}^{M}$:
$$o^\tau_i = \sum_{m=1}^{M} \lambda_m \cdot \mathrm{Diff}\big(h^{\tau-m}_k, r^\tau_i\big),$$
where, inspired by Wang et al. (2018), $\mathrm{Diff}(\cdot,\cdot)$ compares the two representations through a fully connected layer $F$ activated with tanh, and the $\lambda_m$ weight the differences from the $M$ previous turns. Note that at the first turn, we set $o^1_i$ to a zero vector because there is no differential information to be obtained.
Since, intuitively, the knowledge selected in the previous turn has the largest impact on and provides the most clues for the current selection, we study the simplest case $M = 1$, i.e., $o^\tau_i = \mathrm{Diff}\big(h^{\tau-1}_k, r^\tau_i\big)$, in the main experiments for simplicity.
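The sketch below illustrates this difference computation. The internal feature composition of Diff (here the common comparison vector [x; y; x − y; x ⊙ y] fed into a tanh-activated layer F) is an assumption, since the text only specifies that F is a tanh-activated fully connected layer; all names are illustrative.

```python
import torch
import torch.nn as nn

class DifferenceModule(nn.Module):
    """Sketch of the cross-sentence comparison and the Diff computation."""
    def __init__(self, hidden_size=400):
        super().__init__()
        # Cross-sentence BiGRU over sentence representations h_{k,i} -> r_i.
        self.cross_gru = nn.GRU(hidden_size, hidden_size // 2,
                                bidirectional=True, batch_first=True)
        # Assumed comparison features [x; y; x - y; x * y] through F (tanh).
        self.F = nn.Sequential(nn.Linear(4 * hidden_size, hidden_size), nn.Tanh())

    def diff(self, prev_k, r):
        # prev_k: (batch, hidden) previously selected knowledge h_k^{tau-m}
        # r:      (batch, n_sent, hidden) correlated candidate representations
        x = prev_k.unsqueeze(1).expand_as(r)
        return self.F(torch.cat([x, r, x - r, x * r], dim=-1))

    def forward(self, h_k, prev_selected, lambdas=None):
        # h_k: (batch, n_sent, hidden); prev_selected: list of M (batch, hidden)
        r, _ = self.cross_gru(h_k)               # r_i over the candidate sentences
        if not prev_selected:                    # first turn: o_i is a zero vector
            return r, torch.zeros_like(r)
        M = len(prev_selected)
        lambdas = lambdas or [1.0 / M] * M       # arithmetic average by default
        o = sum(l * self.diff(p, r) for l, p in zip(lambdas, prev_selected))
        return r, o
```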
Next, we introduce two variants in which the differential information $\{o^\tau_i\}_{i=0}^{|k^\tau|}$ is fused with or disentangled from the contextual information during knowledge selection.

Fused Selection
In the fused variant, a single attention-based selector fuses the differential information with the contextual information: the context $h^\tau_c$ serves as the query, and each knowledge sentence with its differential information, $[r^\tau_i; o^\tau_i]$, serves as the key:
$$\beta^\tau_i = \mathrm{softmax}_i\Big(v^\top \tanh\big(W_{que}\, h^\tau_c + W_{key}\, [r^\tau_i; o^\tau_i]\big)\Big),$$
where $v$, $W_{que}$ and $W_{key}$ are trainable parameters.
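A minimal sketch of such a fused selector is given below; the exact query/key composition follows the reconstruction above, and all parameter shapes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class FusedSelector(nn.Module):
    """Sketch of the fused variant: one attention scorer whose keys fuse
    each candidate with its differential information."""
    def __init__(self, hidden_size=400):
        super().__init__()
        self.W_que = nn.Linear(hidden_size, hidden_size, bias=False)
        self.W_key = nn.Linear(2 * hidden_size, hidden_size, bias=False)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, h_c, r, o):
        # h_c: (batch, hidden); r, o: (batch, n_sent, hidden)
        query = self.W_que(h_c).unsqueeze(1)              # broadcast over sentences
        key = self.W_key(torch.cat([r, o], dim=-1))
        scores = self.v(torch.tanh(query + key)).squeeze(-1)
        return torch.softmax(scores, dim=-1)              # beta_i over candidates
```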

Disentangled Selection
However, in the fused variant it is difficult to distinguish the respective contributions of the contextual and differential information to knowledge selection. We thus devise the disentangled variant, in which the roles of the two types of information are separated, making it feasible to conduct ablation studies.

Figure 4: Disentangled Selection module. The contextual information and the differential information are disentangled to calculate two separate knowledge selection distributions in two independent selectors.

Figure 4 gives an overview of the Disentangled Selection module. It has two independent selectors. The Contextual Selector simply looks for the knowledge sentence that is highly relevant to the context, as most existing knowledge selection models do. It uses only the context $h^\tau_c$ to match each knowledge sentence representation $h^\tau_{k,i}$, obtaining a context-aware selection distribution:
$$\beta^\tau_{Ctx,i} = \mathrm{softmax}_i\Big(v_C^\top \tanh\big(W^C_{que}\, h^\tau_c + W^C_{key}\, h^\tau_{k,i}\big)\Big).$$
In contrast, the Differential Selector focuses on predicting the next knowledge to be selected conditioned on the previously selected knowledge and the differential information, which reveals the process of knowledge transition. Without access to the contextual information, the Differential Selector views the previously selected knowledge $h^{\tau-1}_k$ as the query, and each knowledge sentence $r^\tau_i$ with its differential information $o^\tau_i$ as the key, to estimate a difference-aware selection distribution:
$$\beta^\tau_{Diff,i} = \mathrm{softmax}_i\Big(v_D^\top \tanh\big(W^D_{que}\, h^{\tau-1}_k + W^D_{key}\, [r^\tau_i; o^\tau_i]\big)\Big),$$
where $v_D$, $W^D_{que}$ and $W^D_{key}$ are trainable parameters. The final selection distribution is the summation of the two selectors' distributions:
$$\beta^\tau_i = \beta^\tau_{Ctx,i} + \beta^\tau_{Diff,i}.$$
Note that the Differential Selector relies on the previously selected knowledge; thus at the first turn we set $\beta^\tau_{Diff,i}$ to 0 for each $i$.
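A corresponding sketch of the disentangled variant follows, again with assumed parameter shapes and illustrative names.

```python
import torch
import torch.nn as nn

class DisentangledSelector(nn.Module):
    """Sketch of the disentangled variant: a Contextual Selector and a
    Differential Selector whose distributions are summed."""
    def __init__(self, hidden_size=400):
        super().__init__()
        def scorer(que_in, key_in):
            return (nn.Linear(que_in, hidden_size, bias=False),
                    nn.Linear(key_in, hidden_size, bias=False),
                    nn.Linear(hidden_size, 1, bias=False))
        self.ctx_que, self.ctx_key, self.ctx_v = scorer(hidden_size, hidden_size)
        self.dif_que, self.dif_key, self.dif_v = scorer(hidden_size, 2 * hidden_size)

    @staticmethod
    def _attend(que, key, v, q, k):
        scores = v(torch.tanh(que(q).unsqueeze(1) + key(k))).squeeze(-1)
        return torch.softmax(scores, dim=-1)

    def forward(self, h_c, h_k, r, o, prev_k, first_turn):
        # Contextual Selector: match context h_c against each sentence h_{k,i}.
        beta_ctx = self._attend(self.ctx_que, self.ctx_key, self.ctx_v, h_c, h_k)
        if first_turn:                      # no previous knowledge: beta_diff = 0
            return beta_ctx
        # Differential Selector: previous knowledge as query, [r_i; o_i] as key.
        beta_dif = self._attend(self.dif_que, self.dif_key, self.dif_v,
                                prev_k, torch.cat([r, o], dim=-1))
        return beta_ctx + beta_dif          # summed final selection distribution
```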

Selecting Knowledge
Finally, with either the Fused or the Disentangled Selection module, the model selects the knowledge sentence with the highest attention score, $i^\tau = \arg\max_i \beta^\tau_i$, and uses its representation $h^\tau_{k,i^\tau}$ for further generation.
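As a toy illustration of this greedy selection step (all tensors below are dummy values):

```python
import torch

# Pick the candidate with the highest attention score and hand its
# representation to the decoder, as in greedy inference.
beta = torch.tensor([[0.1, 0.6, 0.3]])           # (batch, n_sent) toy scores
idx = beta.argmax(dim=-1)                        # selected sentence index i^tau
h_k = torch.randn(1, 3, 400)                     # (batch, n_sent, hidden) dummy
selected = h_k[torch.arange(h_k.size(0)), idx]   # (batch, hidden) for the decoder
```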

Decoder
The decoding state is updated by a GRU:
$$s_t = \mathrm{GRU}\big(s_{t-1},\, W_D\, [e(y_{t-1});\, h_{k,i}] + b_D\big),$$
where $W_D$ and $b_D$ are trainable parameters, and $e(y_{t-1})$ denotes the embedding of the word $y_{t-1}$ generated at the last time step. Then, the decoder outputs the generation probability over the vocabulary (without normalization):
$$P_G(y_t = w) = \exp\big(\mathbf{w}^\top (W_G\, s_t + b_G)\big),$$
where $W_G$ and $b_G$ are trainable parameters, and $\mathbf{w}$ is the one-hot vector of the word $w$. Meanwhile, a copy mechanism (Gu et al., 2016) is adopted to output an additional copy probability for the words in the selected knowledge sentence $k_i$ (without normalization):
$$P_C(y_t = w) = \sum_{j:\, k_{i,j} = w} \exp\big(H(h_{k,i,j})^\top s_t\big),$$
where $H$ is a fully connected layer activated with tanh and $h_{k,i,j}$ is the encoder state of the $j$-th token in $k_i$. The final probability distribution is computed as
$$P(y_t = w) = \frac{1}{Z}\big(P_G(y_t = w) + P_C(y_t = w)\big),$$
where $Z$ is the normalization term. We then select the word from the vocabulary with the highest probability, i.e., $y_t = \arg\max_w P(y_t = w)$.
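Below is a hedged sketch of one decoding step with the copy mechanism; the score functions follow our reconstruction above rather than the released code, and the unnormalized exp scores are kept as in the paper (a log-space implementation would be numerically safer).

```python
import torch
import torch.nn as nn

class CopyDecoder(nn.Module):
    """Sketch of the GRU decoder with a copy mechanism over the selected
    knowledge sentence; all shapes and names are illustrative."""
    def __init__(self, vocab_size, emb_size=300, hidden_size=400):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_size)
        self.cell = nn.GRUCell(emb_size + hidden_size, hidden_size)
        self.W_G = nn.Linear(hidden_size, vocab_size)          # generation scores
        self.H = nn.Sequential(nn.Linear(hidden_size, hidden_size), nn.Tanh())

    def step(self, y_prev, s_prev, k_repr, k_tok_states, k_tok_ids):
        # y_prev: (batch,) last word; k_repr: (batch, hidden) selected knowledge
        # k_tok_states: (batch, k_len, hidden); k_tok_ids: (batch, k_len)
        s = self.cell(torch.cat([self.emb(y_prev), k_repr], dim=-1), s_prev)
        gen = self.W_G(s).exp()                                # unnormalized P_G
        copy = (self.H(k_tok_states) @ s.unsqueeze(-1)).squeeze(-1).exp()
        # Scatter-add copy scores of knowledge tokens onto the vocab axis.
        gen = gen.scatter_add(1, k_tok_ids, copy)
        probs = gen / gen.sum(dim=-1, keepdim=True)            # normalize by Z
        return probs.argmax(dim=-1), s                         # greedy decoding
```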

Loss
The negative log-likelihood loss is adopted:
$$\mathcal{L}_{NLL} = -\sum_{\tau=1}^{T}\sum_{t} \log P\big(y^\tau_t = y^{\tau*}_t\big),$$
where $y^{\tau*}_t$ denotes the $t$-th word of the gold response at the $\tau$-th turn and $T$ is the number of turns in the whole dialogue. We also add supervision on the final knowledge selection distribution:
$$\mathcal{L}_{KS} = -\sum_{\tau=1}^{T} \log \beta^\tau_{i^{\tau*}},$$
where $i^{\tau*}$ denotes the index of the gold knowledge sentence at the $\tau$-th turn. The total loss is their summation:
$$\mathcal{L} = \mathcal{L}_{NLL} + \lambda\, \mathcal{L}_{KS},$$
where we set $\lambda = 1$ in our experiments.
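A compact sketch of the total loss, assuming the decoder outputs normalized log-probabilities and beta is the final selection distribution; function and argument names are illustrative.

```python
import torch
import torch.nn.functional as F

def diffks_loss(word_logprobs, gold_words, beta, gold_kno_idx, lam=1.0):
    """Sketch of the training loss: NLL over gold response words plus
    supervision on the knowledge selection distribution (lambda = 1)."""
    # word_logprobs: (batch, T_words, vocab) log P(y_t = w), already normalized
    nll = F.nll_loss(word_logprobs.transpose(1, 2), gold_words)
    # beta: (batch, n_sent) selection distribution; gold_kno_idx: (batch,)
    ks = -torch.log(beta.gather(1, gold_kno_idx.unsqueeze(1)) + 1e-12).mean()
    return nll + lam * ks
```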
Datasets

WoW (Dinan et al., 2019) contains multi-turn knowledge-grounded conversations collected in a wizard-apprentice mode. Each utterance of the wizard is grounded in a selected knowledge sentence or annotated as using no knowledge. The dialogues are split into 18,430/1,948/965/968 for Train/Dev/Test Seen/Test Unseen respectively, with 4 turns per dialogue and 61 provided knowledge sentences per turn on average. Note that the test data is split into Test Seen (in-domain) and Test Unseen (out-of-domain), where Test Unseen contains topics never seen in Train or Dev.
Holl-E (Moghe et al., 2018) contains conversations in which one speaker is strictly instructed to produce utterances by copying or modifying sentences from a given background document. Each utterance is similarly annotated with the selected knowledge. Following Kim et al. (2020), we tokenized the background document into sentences and ensured that each annotated span is contained in a whole sentence. The dialogues are split into 7,211/930/913 for Train/Dev/Test respectively, with 5 turns per dialogue and 60 provided knowledge sentences per turn on average.

Implementation Details
All the models were implemented with PyTorch (Paszke et al., 2017). The sentences were tokenized with NLTK (Bird and Loper, 2004). We set the vocabulary size to 20K for WoW and 16K for Holl-E, and used 300-dimensional word embeddings initialized with GloVe (Pennington et al., 2014) or sampled from a standard normal distribution N(0, 1). We applied a dropout rate of 0.5 on word embeddings. The hidden sizes were set to 200 for the encoders (400 in total for the two directions) and 400 for the decoder. We adopted the Adam (Kingma and Ba, 2015) optimizer with an initial learning rate of 0.0005. The batch size was set to 8 dialogues. All the models share the same hyperparameter settings and were trained for 20 epochs on one NVIDIA Titan Xp GPU. The reported checkpoints were selected according to BLEU-4 on the Dev sets.

Automatic Evaluation
We used several automatic metrics: ACC, the accuracy of knowledge selection over the whole test set; corpus-level BLEU-2/4 (Papineni et al., 2002); and ROUGE-2 (Lin, 2004). As shown in Table 1, our method significantly outperforms all the baselines on all the metrics across the three test sets (except for BLEU and ROUGE on WoW Seen compared with SLKS), which indicates its superiority in selecting proper knowledge and generating informative responses. Compared to the baseline models, our models also generalize better from in-domain (WoW Seen) to out-of-domain data (WoW Unseen). It is worth noting that on WoW Unseen, our DiffKS Fus obtains an even higher knowledge selection accuracy (19.7) than the BERT-enhanced SLKS reported in the original paper (18.3). We also observed that DiffKS Fus performs slightly better on WoW while DiffKS Dis performs slightly better on Holl-E. We conjecture that this is because in Holl-E, the gold knowledge selected at different turns usually has high contextual dependency (for example, the sentences may be contiguous in the document), which makes it feasible to predict the next selected knowledge from the differential information alone.
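For reference, corpus-level BLEU can be computed with NLTK as in the snippet below; the token lists are toy values, purely illustrative.

```python
from nltk.translate.bleu_score import corpus_bleu

# One toy hypothesis with one reference; real evaluation aggregates the
# whole test set into these lists before computing corpus-level BLEU.
refs = [[["the", "peach", "is", "the", "state", "fruit"]]]
hyps = [["the", "peach", "is", "georgia's", "state", "fruit"]]
bleu2 = corpus_bleu(refs, hyps, weights=(0.5, 0.5))
bleu4 = corpus_bleu(refs, hyps, weights=(0.25, 0.25, 0.25, 0.25))
```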

Human Observational Evaluation
We conducted human observational evaluation with pair-wise comparison, where our two models were compared with PostKS++ and SLKS. 100 dialogues were sampled from each of WoW Seen/Unseen. For each pair of dialogues generated by two models (suppose with T turns), annotators hired from Amazon Mechanical Turk gave preferences (win, lose, or tie) for each of the T response pairs in terms of the following metrics, and each pair-wise comparison of dialogues was judged by 3 annotators. We adopted two metrics: Naturalness evaluates the fluency and readability of a response; Appropriateness evaluates the relevance to the context and whether the knowledge is used appropriately. Results are shown in Table 2, where the Fleiss' Kappa (Fleiss, 1971) values show almost moderate agreement (0.4 < κ < 0.6). Our models significantly outperform PostKS++ on both metrics, and also generally outperform SLKS in terms of Appropriateness. Again, the advantage of our models is more evident on WoW Unseen than on WoW Seen.

Human Interactive Evaluation
We further conducted human interactive evaluation, where real humans converse with a model about a specific topic. We compared PostKS++ and SLKS with our two models. Workers from Amazon Mechanical Turk were asked to first select one topic from 2-3 provided candidate topics and then converse with one of the models for 3-5 dialogue turns. After the conversation, they rated the dialog model on a 5-star scale in terms of the fluency and informativeness of the utterances and the coherence of the whole dialog. Following Dinan et al. (2019) and Kim et al. (2020), the interactive evaluation was implemented with ParlAI (Miller et al., 2017). For each model, we averaged the scores of 150 collected conversations on each test set of WoW. We also report the results of human-human dialog from Dinan et al. (2019) and Kim et al. (2020), where each worker converses with another human who has access to the knowledge sentences just as the models do. Results are shown in Table 3, where DiffKS Fus obtains the highest scores and both our models outperform the two state-of-the-art baselines, indicating that our models are preferred by human annotators.

Ablation Test
To verify the effectiveness of the differential information in knowledge selection, we conducted ablation tests based on the disentangled variant DiffKS Dis: we removed either the Differential Selector (DiffSel) or the Contextual Selector (CtxSel) and trained the model with only the remaining selector.
Results are shown in Table 4. Without the Differential Selector, the model performance degrades remarkably on all the metrics across the three test sets, indicating the importance of utilizing differential information. In comparison, removing the Contextual Selector is less influential (with a smaller performance drop). We conjecture that this results from the characteristics of the datasets. For instance, in WoW, the apprentice (without access to knowledge) usually reacts passively to the wizard (who has access to knowledge). Thus the apprentice's posts (the contextual information) have limited influence in driving the conversation, which is instead affected or controlled by the wizard. In this case, the differential information, which can predict the process of knowledge transition, has more influence than the contextual information. In addition, as in Kim et al. (2020), the knowledge sentences in Holl-E are obtained by segmenting a long document into single sentences, which implies relevance and contextual dependency between knowledge sentences. Consequently, the differential information can still provide considerable clues for knowledge selection even without access to the new user post (the context).
Furthermore, after removing DiffSel, DiffKS Dis reduces to a vanilla knowledge selection model in which the supervision $\mathcal{L}_{KS}$ is applied directly on the 'prior' selection distribution. Nevertheless, the performance of this ablated model is sometimes competitive with the baselines (for instance, in terms of ACC, DiffKS Dis w/o DiffSel obtains 22.3/15.5/29.1 vs. 21.9/14.9/28.0 for PostKS++). This may result from the gap between training and inference caused by the prior-posterior framework adopted in PostKS and SLKS, which may not be superior to directly training the prior selection distribution.

Difference From More Turns
To investigate the impact of increasing the number of turns of differential information (the $M$ in the definition of $o^\tau_i$), we additionally experimented with $M = 2, 3$, and for simplicity took the arithmetic average over the $M$ differences, i.e., $\lambda_m = 1/M$ for all $m$.
Results are shown in Table 5. $M = 2$ generally achieves the best performance compared with $M = 1, 3$ for both DiffKS Fus and DiffKS Dis (while $M = 3$ still outperforms $M = 1$). This further demonstrates the effectiveness of explicitly modeling differential information. We also conjecture that model performance could be further improved by assigning the nearest/farthest difference the largest/smallest weight, i.e., $\lambda_1 > \lambda_2 > \cdots > \lambda_M$, which is more reasonable than the simplified arithmetic average.

Accuracy Over Turns
To verify whether sequential knowledge selection facilitates knowledge selection in later turns, we evaluated the accuracy of knowledge selection at different turns. The statistics are shown in Table 6. Our two models achieve the highest accuracy from the 2nd to the 5th turn and outperform SLKS and PostKS++ (and SLKS also generally outperforms PostKS++). The results show that our models select more accurate knowledge consistently over different turns.

Case Study
Figure 5: Case study. The selected knowledge sentence is marked in parentheses before each response. The knowledge sentences k1-k5 are about the topic Georgia (U.S. state), while k6 is about History of Australia. Blue denotes duplicate responses resulting from repetitive knowledge selection; the red × denotes incoherent responses resulting from selecting knowledge far different from that of previous turns.
We show a case from WoW Seen in Figure 5, which compares the responses generated by PostKS++, SLKS, and our two models.

At the 2nd turn, PostKS++ generates almost the same response as at the 1st turn due to repetitive knowledge selection. Similar cases occur for SLKS at the 2nd and 3rd turns. Moreover, at the 3rd turn PostKS++ selects a knowledge sentence quite different from those of the previous turns, which is about the topic History of Australia rather than Georgia (U.S. state). As a result, the response PostKS++ generates at the 3rd turn is not coherent with the previous context. In contrast, our two models select diverse yet appropriate knowledge sentences at all turns, thereby generating informative responses and keeping the dialog coherent and natural.

Conclusion
We present a novel difference-aware knowledge selection method for multi-turn knowledge-grounded conversation generation. Our method first compares the candidate knowledge provided at the current turn with the previously selected knowledge, and then selects the appropriate knowledge to be used in generation. Experimental results show that our method selects knowledge more accurately and generates more informative responses, significantly outperforming the state-of-the-art baselines.