Filtering before Iteratively Referring for Knowledge-Grounded Response Selection in Retrieval-Based Chatbots

The challenges of building knowledge-grounded retrieval-based chatbots lie in how to ground a conversation on its background knowledge and how to match response candidates with both context and knowledge simultaneously. This paper proposes a method named Filtering before Iteratively REferring (FIRE) for this task. In this method, a context filter and a knowledge filter are first built, which derive knowledge-aware context representations and context-aware knowledge representations respectively by global and bidirectional attention. Besides, the entries irrelevant to the conversation are discarded by the knowledge filter. After that, iteratively referring is performed between context and response representations as well as between knowledge and response representations, in order to collect deep matching features for scoring response candidates. Experimental results show that FIRE outperforms previous methods by margins larger than 2.8% and 4.1% on the PERSONA-CHAT dataset with original and revised personas respectively, and margins larger than 3.1% on the CMU_DoG dataset in terms of top-1 accuracy. We also show that FIRE is more interpretable by visualizing the knowledge grounding process.


Introduction
Building a conversational agent with intelligence has received significant attention with the emergence of personal assistants such as Apple Siri, Google Now and Microsoft Cortana. One approach is building retrieval-based chatbots, which aim to select a proper response from a set of candidates given the conversation context (Lowe et al., 2015; Wu et al., 2017; Zhou et al., 2018b; Gu et al., 2019a; Gu et al., 2020a).

[Figure 1: An example dialogue from the CMU_DoG dataset grounded on background knowledge about the movie Inception, including entries such as Year, Director, Cast, Critical Response and Introduction. Some utterances refer directly to specific knowledge entries, while others (e.g., greetings and small talk) do not.]
However, real human conversations are often grounded on external knowledge. People may associate relevant background knowledge with the current conversation, and then make their replies based on both the context and the knowledge. Recently, tasks of knowledge-grounded response selection (Zhang et al., 2018a; Zhou et al., 2018a) have been set up to simulate this scenario. In these tasks, agents should respond according to not only the given context but also the relevant knowledge, and the knowledge is usually represented as unstructured entries, which are common in practice. An example is shown in Figure 1.
Some methods have been proposed for solving these tasks (Mazaré et al., 2018; Gu et al., 2019b). In these methods, the semantic representations of the context, knowledge and response candidates are usually derived by encoding models at first. Then, the matching degree between a response candidate and a {context, knowledge} pair is calculated by neural networks. Although these methods are capable of utilizing external knowledge when selecting responses, they still have several deficiencies. First, most of them encode the context and the knowledge separately, and neglect to ground the conversation on the knowledge and to comprehend the knowledge based on the conversation. Zhao et al. (2019) proposed to alleviate this issue by fusing the local matching information between each {context utterance, knowledge entry} pair into their representations. However, each utterance or entry plays a different function in the conversation. As shown by the example in Figure 1, some utterances are closely related to the background knowledge, while others are irrelevant to the knowledge but play the role of connection, such as greetings. Besides, some entries are redundant and are not mentioned in the conversation at all, such as Year, Director and Critical Response. Such global functions of utterances and entries were ignored in all existing methods. Second, the model structures used by previous methods to calculate the matching degree between a response candidate and a {context, knowledge} pair were usually shallow ones, which constrained the models from learning the deep matching relationship between them. Therefore, this paper proposes a method named Filtering before Iteratively REferring (FIRE) to address these issues. First, this method designs a context filter and a knowledge filter at the encoding stage. Different from Zhao et al. (2019), these filters collect the global matching information between all context utterances and all knowledge entries bidirectionally.
Specifically, the context filter makes the context refer to the knowledge and derives knowledge-aware context representations. On the other hand, the knowledge filter derives context-aware knowledge representations utilizing the same global attention mechanism. Considering that the knowledge entries are independent of each other and that redundant entries may increase the difficulty of response matching, the knowledge filter discards irrelevant entries, which are determined by calculating the similarity between each entry and the whole context. Second, this method designs an iteratively referring network for calculating the matching degree between a response candidate and a {context, knowledge} pair. This network follows the dual matching framework (Gu et al., 2019b) in which the response refers to the context and the knowledge simultaneously. Motivated by previous studies on attention-over-attention (AoA) (Cui et al., 2017) and interaction-over-interaction (IoI) (Tao et al., 2019) models, this network performs the referring operation iteratively in order to derive deep matching information. Specifically, the outputs of each iteration are utilized as the inputs of the next iteration. Then, the outputs of all iterations are aggregated into a set of matching feature vectors for scoring.
We evaluate our proposed method on the PERSONA-CHAT (Zhang et al., 2018a) and CMU DoG (Zhou et al., 2018a) datasets. Experimental results show that FIRE outperforms previous methods by margins larger than 2.8% and 4.1% on the PERSONA-CHAT dataset with original and revised personas respectively, and margins larger than 3.1% on the CMU DoG dataset in terms of top-1 accuracy, achieving a new state-of-the-art performance on both tasks.
In summary, the contributions of this paper are three-fold. (1) A Filtering before Iteratively REferring (FIRE) method is proposed, which employs two filtering structures based on global and cross attentions for representing contexts and knowledge, together with an iteratively referring network for scoring response candidates. (2) Experimental results on two datasets demonstrate that our proposed model outperforms state-of-the-art models on the accuracy of response selection. (3) Empirical analysis further verifies the effectiveness of our proposed method.

Response Selection
Response selection is an important problem in building retrieval-based chatbots. Existing work on response selection can be categorized according to whether it processes single-turn dialogues (Wang et al., 2013) or multi-turn ones (Lowe et al., 2015; Wu et al., 2017; Zhang et al., 2018b; Zhou et al., 2018b; Gu et al., 2019a; Gu et al., 2020a,b). Recent studies focused on multi-turn conversations, a more practical setup for real applications. Wu et al. (2017) proposed the sequential matching network (SMN) which accumulated the utterance-response matching information by a recurrent neural network. Zhou et al. (2018b) proposed the deep attention matching network (DAM) to construct representations at different granularities with stacked self-attention. Gu et al. (2019a) proposed the interactive matching network (IMN) to perform bidirectional and global interactions between the context and the response. Tao et al. (2019) proposed the interaction-over-interaction (IoI) model which performed matching by stacking multiple interaction blocks. Gu et al. (2020a) proposed the speaker-aware BERT (SA-BERT) to model the speaker change information in pre-trained language models.

Knowledge-Grounded Chatbots
Chit-chat models suffer from the lack of explicit long-term memory as they are typically trained to produce an utterance given only a very recent dialogue history. Recently, some studies have shown that chit-chat models can be more diverse and engaging when conditioned on background knowledge. Zhang et al. (2018a) released the PERSONA-CHAT dataset which employs the speakers' profile information as the background knowledge. Zhou et al. (2018a) built the CMU_DoG dataset which adopts Wikipedia articles about popular movies as the background knowledge. Mazaré et al. (2018) proposed to pretrain a model using a large-scale corpus based on Reddit. Zhao et al. (2019) proposed the document-grounded matching network (DGMN) which fused each context utterance with each knowledge entry for representing them. Gu et al. (2019b) proposed the dually interactive matching network (DIM) which performed interactive matching between responses and contexts and between responses and knowledge respectively.
The FIRE model proposed in this paper makes two major improvements to the state-of-the-art DIM model (Gu et al., 2019b). First, a context filter and a knowledge filter are built to make the representations of context and knowledge aware of each other. Second, an iteratively referring network is designed to collect deep and comprehensive matching information for scoring responses.

Task Definition
Given a dataset D, an example is represented as (c, k, r, y). Specifically, c = {u_1, u_2, ..., u_{n_c}} represents a context with {u_m}_{m=1}^{n_c} as its utterances and n_c as the number of utterances. k = {e_1, e_2, ..., e_{n_k}} represents a knowledge description with {e_n}_{n=1}^{n_k} as its entries and n_k as the number of entries. r represents a response candidate. y ∈ {0, 1} denotes a label: y = 1 indicates that r is a proper response for (c, k); otherwise, y = 0. Our goal is to learn a matching model g(c, k, r) from D. For any context-knowledge-response triple (c, k, r), g(c, k, r) measures the matching degree between (c, k) and r.

Figure 2 shows the overall architecture of our proposed model. The context utterances, knowledge entries and responses are first encoded by a sentence encoder. Then the context and the knowledge are co-filtered by referring to each other. Next, the response refers to the filtered context and knowledge representations iteratively. The outputs of each iteration are aggregated into a matching feature vector, and are utilized as the inputs of the next iteration at the same time. Finally, the matching features of all iterations are accumulated for scoring response candidates. Details are provided in the following subsections.

Word Representation
We follow the settings used in DIM (Gu et al., 2019b), which constructs word representations by combining general pre-trained word embeddings, embeddings estimated on the task-specific training set, and character-level embeddings, in order to deal with the out-of-vocabulary issue.
Formally, the embeddings of the m-th utterance in a context, the n-th entry in a knowledge description and a response candidate are denoted as U_m = {u_{m,i}}_{i=1}^{l_{u_m}}, E_n = {e_{n,j}}_{j=1}^{l_{e_n}} and R = {r_k}_{k=1}^{l_r} respectively, where l_{u_m}, l_{e_n} and l_r are the numbers of words in U_m, E_n and R respectively. Each u_{m,i}, e_{n,j} or r_k is an embedding vector.

Sentence Encoder
Note that the encoder can be any existing encoding model. In this paper, the context utterances, knowledge entries and response candidate are encoded by bidirectional long short-term memories (BiLSTMs) (Hochreiter and Schmidhuber, 1997). Detailed calculations are omitted due to limited space. After that, we can obtain the encoded representations for utterances, entries and response, denoted as Ū_m, Ē_n and R̄ respectively.

Context and Knowledge Filters
As illustrated in Figure 1, not every context utterance refers to the knowledge, and not every knowledge entry is mentioned in the conversation. In order to ground the conversation on the knowledge and to comprehend the knowledge based on the conversation, we build a context filter and a knowledge filter in the FIRE model. These two filters obtain the knowledge-aware context representation C̄^0 and the context-aware knowledge representation K̄^0, which are further utilized to match with the response.
Context Filter This filter first determines the knowledge that each context token refers to by a global attention between the whole context and all knowledge entries. Then, it enhances the representation of each context token with the representations of its relevant knowledge. Given the set of utterance representations {Ū_m}_{m=1}^{n_c} encoded by the sentence encoder, we concatenate them to form the context representation C̄ = {c̄_i}_{i=1}^{l_c}; similarly, the entry representations are concatenated to form the knowledge representation K̄ = {k̄_j}_{j=1}^{l_k}. Then, a soft alignment is performed by computing the attention weight between each tuple {c̄_i, k̄_j} as

e_ij = c̄_i · k̄_j.    (1)

After that, the global relevance between the context and the knowledge can be obtained using these attention weights. For a word in the context, its relevant representation carried by the knowledge is identified and composed using e_ij as

c̃_i = Σ_{j=1}^{l_k} softmax_j(e_ij) · k̄_j,    (2)

where the contents in {k̄_j}_{j=1}^{l_k} that are relevant to c̄_i are selected to form c̃_i, and we define C̃ = {c̃_i}_{i=1}^{l_c}. To enhance the context representation C̄ with the relevance representation C̃, the element-wise difference and multiplication between {C̄, C̃} are computed, and are then concatenated with the original vectors. This enhancement operation can be written as

ĉ_i = [c̄_i; c̃_i; c̄_i − c̃_i; c̄_i ⊙ c̃_i],    (3)

where Ĉ = {ĉ_i}_{i=1}^{l_c} and ĉ_i ∈ R^{4d}. Finally, we compress Ĉ and obtain the knowledge-aware context representation C̄^0 as

C̄^0 = ReLU(Ĉ W_c + b_c),    (4)

where W_c and b_c are parameters to be estimated. Here, we define a referring function to summarize the above operations in the context filter as

C̄^0 = Refer(C̄, K̄),    (5)

where C̄ acts as the query, and K̄ acts as the key and value of the referring function respectively.
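As an illustration, the referring function above can be sketched in NumPy. This is a minimal sketch, not the authors' implementation: the compression weights `W` are random placeholders for learned parameters, and the bias term is omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def refer(query, key, rng=None):
    """Sketch of Refer(query, key): attention, enhancement, compression.

    query: [lq, d] token representations (e.g. the context C-bar)
    key:   [lk, d] token representations (e.g. the knowledge K-bar)
    Returns [lq, d] query-side representations enhanced with attended key content.
    """
    rng = np.random.default_rng(0) if rng is None else rng
    d = query.shape[-1]
    # Dot-product attention from every query token to every key token.
    scores = query @ key.T                       # e_ij, shape [lq, lk]
    attended = softmax(scores, axis=-1) @ key    # relevant key content per query token
    # Enhancement: original, attended, difference, element-wise product -> [lq, 4d].
    enhanced = np.concatenate(
        [query, attended, query - attended, query * attended], axis=-1)
    # Compression back to d dimensions (random placeholder weights, no bias).
    W = rng.standard_normal((4 * d, d)) * 0.1
    return np.maximum(enhanced @ W, 0.0)         # ReLU

# Example: 6 context tokens refer to 4 knowledge tokens, d = 8.
C = np.random.default_rng(1).standard_normal((6, 8))
K = np.random.default_rng(2).standard_normal((4, 8))
C0 = refer(C, K)
print(C0.shape)  # (6, 8)
```

The same function with arguments swapped yields the context-aware knowledge representation used by the knowledge filter.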
Knowledge Filter Similarly, this filter enhances the representation of each knowledge token with the representations of its relevant context. Different from the context filter, an additional selection operation is conducted to directly filter out the knowledge entries with low relevance to the context, since the entries are independent of each other. First, the referring function introduced above is performed as

K̄^0 = Refer(K̄, C̄),    (6)

where K̄^0 is the context-aware knowledge representation. Furthermore, the relevance between each entry and the whole conversation is computed in order to determine whether to filter out this entry. We first perform last-hidden-state pooling over the representations of utterances and entries given by the sentence encoder in Section 4.2. Then, the utterance embeddings {ū_m}_{m=1}^{n_c} and the entry embeddings {ē_n}_{n=1}^{n_k} are obtained. Next, we compute the relevance score for each utterance-entry pair as

s_mn = ū_m^T M ē_n,    (7)

where M ∈ R^{d×d} is a matrix that needs to be estimated.
In order to obtain the overall relevance score between each entry and the whole conversation, an aggregation operation is required. Here, we make the assumption that one entry is mentioned only once in the conversation. Thus, for a given entry, its relevance score with the conversation is defined as the maximum relevance score between it and all utterances. Mathematically, we have

s_n = max_{1≤m≤n_c} s_mn.    (8)

Those entries whose scores are below a threshold γ are considered uninformative for the conversation and are directly filtered out before matching with responses. Mathematically, we have

Ē^0_n = max(sgn(σ(s_n) − γ), 0) · Ē^0_n,    (9)

where σ is the sigmoid function and sgn is the sign function. The final filtered knowledge representation is defined as K̄^0 = {Ē^0_n}_{n=1}^{n_k}.
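The entry-selection step can be sketched as follows. The bilinear matrix `M` is a random placeholder standing in for the learned parameter, and the hard 0/1 gate is a simplified reading of the sgn-based masking.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def entry_mask(utt_emb, ent_emb, M, gamma=0.3):
    """Sketch of the entry-selection step in the knowledge filter.

    utt_emb: [n_c, d] pooled utterance embeddings
    ent_emb: [n_k, d] pooled entry embeddings
    M:       [d, d]  bilinear matrix (random placeholder here)
    Returns a 0/1 mask over entries: 1 keeps the entry, 0 discards it.
    """
    s = utt_emb @ M @ ent_emb.T          # s_mn for every utterance-entry pair, [n_c, n_k]
    s_n = s.max(axis=0)                  # each entry is assumed mentioned at most once
    return (sigmoid(s_n) >= gamma).astype(float)  # hard gate with threshold gamma

rng = np.random.default_rng(0)
mask = entry_mask(rng.standard_normal((3, 8)),   # 3 utterances
                  rng.standard_normal((5, 8)),   # 5 knowledge entries
                  rng.standard_normal((8, 8)),
                  gamma=0.3)
print(mask.shape)  # (5,)
```

In the model, this mask multiplies the context-aware entry representations, zeroing out the discarded entries.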

Iteratively Referring
Zhao et al. (2019) and Gu et al. (2019b) showed that the referring operation between contexts and responses and that between knowledge and responses can both provide useful matching information for response selection. However, the matching information collected by these methods was shallow and limited, as each response candidate referred to the context or the knowledge only once in their models. In this paper, we design an iteratively referring network which makes the response refer to the filtered context and knowledge iteratively. Each iteration is capable of capturing additional matching information based on the previous ones. Accumulating these iterations helps to derive deep and comprehensive matching features for response selection. Take the context-response matching as an example. The matching strategy adopted here considers the global and bidirectional matching between two sequences. Let C̄^l be the outputs of the l-th iteration, i.e., the inputs of the (l+1)-th iteration, where l ∈ {0, 1, ..., L−1} and L is the number of iterations. For response representations, we have R̄^0 = R̄.
First, the context refers to the response by performing the referring function, and the response-aware context representation C̄^{l+1} is obtained as

C̄^{l+1} = Refer(C̄^l, R̄^l).    (10)

Bidirectionally, the response refers to the context, and the context-aware response representation R̄^{l+1} is obtained as

R̄^{l+1} = Refer(R̄^l, C̄^l).    (11)

C̄^{l+1} and R̄^{l+1} are utilized as the inputs of the next iteration. Finally, {C̄^l}_{l=1}^{L} and {R̄^l}_{l=1}^{L} are obtained after L iterations.
On the other hand, the knowledge-response matching is conducted identically to the context-response matching process introduced above. The response-aware knowledge representation K̄^l and the knowledge-aware response representation R̄^l_* are iteratively updated as

K̄^{l+1} = Refer(K̄^l, R̄^l_*),    (12)
R̄^{l+1}_* = Refer(R̄^l_*, K̄^l),    (13)

where R̄^0_* = R̄. Similarly, we obtain {K̄^l}_{l=1}^{L} and {R̄^l_*}_{l=1}^{L} after L iterations.
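The iterative referring loop can be sketched as below. The `refer` helper here is a simplified stand-in for the full Refer operation (attention plus a residual mix, omitting the enhancement and compression layers), so this illustrates the control flow rather than the exact computation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def refer(query, key):
    # Simplified stand-in for Refer: cross-attention plus a residual mix.
    attended = softmax(query @ key.T, axis=-1) @ key
    return 0.5 * (query + attended)

def iteratively_refer(C, R, L=3):
    """L referring iterations between context C and response R; the outputs of
    each iteration feed the next one, and all iterations are kept for aggregation."""
    C_l, R_l = C, R
    C_outs, R_outs = [], []
    for _ in range(L):
        C_next = refer(C_l, R_l)   # context refers to the response
        R_next = refer(R_l, C_l)   # response refers to the context (bidirectional)
        C_l, R_l = C_next, R_next
        C_outs.append(C_l)
        R_outs.append(R_l)
    return C_outs, R_outs

rng = np.random.default_rng(0)
C_outs, R_outs = iteratively_refer(rng.standard_normal((10, 8)),  # 10 context tokens
                                   rng.standard_normal((7, 8)),   # 7 response tokens
                                   L=3)
print(len(C_outs), C_outs[0].shape, R_outs[0].shape)  # 3 (10, 8) (7, 8)
```

The knowledge-response branch runs the same loop with the knowledge representation in place of the context.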

Aggregation
These sets of matching matrices {C̄^l}_{l=1}^{L}, {R̄^l}_{l=1}^{L}, {K̄^l}_{l=1}^{L}, and {R̄^l_*}_{l=1}^{L} are finally aggregated into a set of matching feature vectors. As shown in Figure 2, we perform the same aggregation operation after each referring iteration. The aggregation strategy in DIM (Gu et al., 2019b) is adopted here.
Let us take the l-th aggregation as an example. The matching matrices C̄^l, R̄^l, K̄^l and R̄^l_* are first converted by combined max pooling and mean pooling operations to derive their embedding vectors ū^l_m, r̄^l, ē^l_n and r̄^l_* respectively. Next, the sequences of {ū^l_m}_{m=1}^{n_c} and {ē^l_n}_{n=1}^{n_k} are further aggregated to get the embedding vectors for the context and the knowledge respectively.
As the utterances in a context are chronologically ordered, the utterance embeddings {ū l m } nc m=1 are sent into another BiLSTM following the chronological order of utterances in the context. Combined max pooling and last-hidden-state pooling operations are then performed to derive the context embeddingsc l . On the other hand, as the knowledge entries are independent of each other, an attention-based aggregation is designed to derive the knowledge embeddingsk l . Readers can refer to Gu et al. (2019b) for more details.
The matching feature vector of the l-th iteration is the concatenation of the context, knowledge and response embeddings as

m^l = [c̄^l; r̄^l; k̄^l; r̄^l_*],    (14)

which combines the outputs of both context-response matching and knowledge-response matching. Last, we obtain a set of matching feature vectors {m^l}_{l=1}^{L} for all iterations.
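Under the simplifying assumption that the utterance/entry-level BiLSTM and attention aggregation of DIM are collapsed into plain pooling, the l-th aggregation might look like:

```python
import numpy as np

def pool(seq):
    # Combined max pooling and mean pooling over the token axis.
    return np.concatenate([seq.max(axis=0), seq.mean(axis=0)])

def matching_feature(C_l, R_l, K_l, R_l_star):
    """Sketch of the l-th aggregation: pool each matching matrix, then
    concatenate into one matching feature vector m_l. The utterance/entry-level
    BiLSTM and attention aggregation are collapsed into pooling here."""
    return np.concatenate([pool(C_l), pool(R_l), pool(K_l), pool(R_l_star)])

rng = np.random.default_rng(0)
m_l = matching_feature(rng.standard_normal((10, 8)),   # context tokens
                       rng.standard_normal((7, 8)),    # response tokens
                       rng.standard_normal((12, 8)),   # knowledge tokens
                       rng.standard_normal((7, 8)))    # response tokens (knowledge branch)
print(m_l.shape)  # (64,)
```

Each pooled sequence contributes 2d = 16 dimensions here, so the concatenated feature has 4 × 16 = 64 dimensions.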

Prediction
Each matching feature vector m^l is sent into a multi-layer perceptron (MLP) classifier. Here, the MLP is designed to predict the matching degree g_l(c, k, r) between r and (c, k) at the l-th iteration. A softmax output layer is adopted in the MLP to return a probability distribution over all response candidates. The probability distributions calculated from all L matching feature vectors are averaged to derive the final distribution for ranking.
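A minimal sketch of the prediction step, assuming the per-iteration MLP scores are already computed:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def rank_candidates(logits_per_iteration):
    """One softmax distribution over the candidate set per iteration,
    averaged into the final ranking distribution.

    logits_per_iteration: [L, n_candidates] MLP scores for each candidate.
    """
    probs = np.stack([softmax(l) for l in logits_per_iteration])  # [L, n_cand]
    return probs.mean(axis=0)                                     # final distribution

# Three iterations, three candidates; candidate 0 scores highest throughout.
final = rank_candidates(np.array([[2.0, 0.5, 0.1],
                                  [1.5, 1.0, 0.2],
                                  [2.2, 0.3, 0.4]]))
print(final.argmax())  # 0 -- the top-ranked candidate
```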

Model Learning
Inspired by Tao et al. (2019), the model parameters of FIRE are learnt by minimizing the summation of the cross-entropy losses of the MLPs at all iterations. By this means, each matching feature vector can be directly supervised by the labels in the training set. Furthermore, inspired by Szegedy et al. (2016), we employ the strategy of label smoothing by assigning a small additional confidence ε to all candidates, in order to prevent the model from being overconfident. Let Θ denote the parameters of FIRE. The learning objective L(D, Θ) is formulated as

L(D, Θ) = −Σ_{l=1}^{L} Σ_{(c,k,r,y)∈D} (y + ε) log(g_l(c, k, r)).    (15)
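A sketch of this objective, assuming the model already outputs a probability g_l over the candidate set at each iteration:

```python
import numpy as np

def fire_loss(probs_per_iteration, labels, eps=0.05):
    """Sketch of the learning objective: cross-entropy losses of all L
    iterations are summed, with label smoothing adding a small confidence
    eps to every candidate.

    probs_per_iteration: [L, n_candidates] predicted distributions g_l.
    labels:              [n_candidates] one-hot ground truth y.
    """
    smoothed = labels + eps                 # (y + eps) in the objective
    logp = np.log(probs_per_iteration)      # log g_l(c, k, r)
    return -(smoothed * logp).sum()         # summed over candidates and iterations

# Two iterations, three candidates; the true response is candidate 0.
probs = np.array([[0.7, 0.2, 0.1],
                  [0.6, 0.3, 0.1]])
labels = np.array([1.0, 0.0, 0.0])
loss = fire_loss(probs, labels, eps=0.05)
print(loss > 0)  # True
```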

Datasets
We tested our proposed method on the PERSONA-CHAT (Zhang et al., 2018a) and CMU DoG (Zhou et al., 2018a) datasets which both contain dialogues grounded on background knowledge. The PERSONA-CHAT dataset consists of 8939 complete dialogues for training, 1000 for validation, and 968 for testing. Response selection is performed at every turn of a complete dialogue, which results in 65719 dialogues for training, 7801 for validation, and 7512 for testing in total. Positive responses are true responses from humans and negative ones are randomly sampled by the dataset publishers. The ratio between positive and negative responses is 1:19 in the training, validation, and testing sets. There are 955 personas for training, 100 for validation, and 100 for testing, each consisting of 3 to 5 profile sentences. To make this task more challenging, a version of revised persona descriptions are provided by rephrasing, generalizing, or specializing the original ones.
The CMU DoG dataset consists of 2881 complete dialogues for training, 196 for validation, and 537 for testing. Response selection is also performed at every turn of a complete dialogue, which results in 36159 dialogues for training, 2425 for validation, and 6637 for testing in total. Since this dataset did not contain negative examples, we adopted the version shared by Zhao et al. (2019), in which 19 negative candidates were randomly sampled for each utterance from the same set.

Evaluation Metrics
We used the same evaluation metrics as in previous work (Zhang et al., 2018a). Each model aimed to select the k best-matched responses from the available candidates for the given context and knowledge. Then, the recall of true positive replies, denoted as R@k, is calculated as the measurement.
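A minimal sketch of how R@k can be computed for a single example (the candidate layout and scores below are illustrative):

```python
import numpy as np

def recall_at_k(scores, true_index, k):
    """1 if the true response is among the k highest-scored candidates, else 0;
    the metric is this value averaged over the whole test set."""
    topk = np.argsort(scores)[::-1][:k]
    return float(true_index in topk)

# One example with 20 candidates: the true response (index 0) scores highest.
scores = np.array([0.9] + [0.1] * 19)
print(recall_at_k(scores, true_index=0, k=1))  # 1.0
```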

Training Details
For training FIRE on both the PERSONA-CHAT and CMU DoG datasets, some common configurations were set as follows. The Adam method (Kingma and Ba, 2015) was employed for optimization. The learning rate was initialized as 0.00025 and was exponentially decayed by 0.96 every 5000 steps. Dropout (Srivastava et al., 2014) with a rate of 0.2 was applied to the word embeddings and all hidden layers. The word representation was the concatenation of a 300-dimensional GloVe embedding (Pennington et al., 2014), a 100-dimensional embedding estimated on the training set using the Word2Vec algorithm (Mikolov et al., 2013), and a 150-dimensional character-level embedding estimated by a CNN with 50 filters for each of the window sizes {3, 4, 5}. The word embeddings were not updated during training. All hidden states of LSTMs had 200 dimensions. The MLP at the prediction layer had 256 hidden units with ReLU (Nair and Hinton, 2010) activation. The ε used in label smoothing was set to 0.05. The validation set was used to select the best model for testing. Some configurations differed according to the characteristics of the two datasets. For the PERSONA-CHAT dataset, the maximum number of characters in a word, that of words in a context utterance, of utterances in a context, of words in a response, of words in a knowledge entry, and of entries in a knowledge description were set as 18, 20, 15, 20, 15, and 5 respectively. For the CMU DoG dataset, these parameters were set as 18, 40, 15, 40, 40, and 20 respectively. Zero-padding was adopted if the number of utterances in a context or the number of entries in a knowledge description was less than the maximum. Otherwise, we kept the last context utterances or the last knowledge entries. The batch size was set to 16 for PERSONA-CHAT and 4 for CMU DoG.
The hyper-parameter γ was set to 0.3 for original personas and 0.2 for revised personas on the PERSONA-CHAT dataset, as well as 0.2 on the CMU DoG dataset, which were tuned on the validation sets as shown in Figure 4. The number of iterations L was set to 3 for original and revised personas on the PERSONA-CHAT dataset, as well as 3 on the CMU DoG dataset, which were tuned on the validation sets as shown in Figure 5.
All code was implemented in the TensorFlow framework (Abadi et al., 2016) and is published to help replicate our results. 1

Experimental Results
Table 1 presents the evaluation results of FIRE and previous methods on the PERSONA-CHAT dataset with original or revised personas and on the CMU DoG dataset. Because the paper proposing DIM (Gu et al., 2019b) only studied the PERSONA-CHAT dataset, we ran its released code to get the performance of DIM on the CMU DoG dataset.
From Table 1, we can see that FIRE achieved a new state-of-the-art performance. On the PERSONA-CHAT dataset, the margins over previous methods were larger than 2.8% and 4.1% when original and revised personas were used respectively. On the CMU DoG dataset, the margin was larger than 3.1%.

Analysis
Ablation tests We conducted ablation tests as follows. First, we removed iteratively referring by setting the number of iterations L to one. Then, we removed the two filters. The results on the validation sets are shown in Table 2. We can see the drop of R@1 after each step, which demonstrated the effectiveness of both components in FIRE.
To further verify the effectiveness of the context filter, we built three models as follows: (1) a model that only performed the context-response matching without using any knowledge, i.e., the IMN model in Gu et al. (2019b), to which readers can refer for more details; (2) a model that performed the context-response matching first and then fused the knowledge, i.e., the IMN utr model in Gu et al. (2019b); and (3) a model that filtered the context first and then performed the context-response matching, i.e., our FIRE model with only the upper branch in Figure 2. The evaluation results of these three models on the validation set are shown in Table 2. Since these three models adopted a similar context-response matching strategy, we can see that both fusion after matching and filtering before matching improved the performance of response selection after introducing knowledge. Furthermore, filtering before matching outperformed fusion after matching by a large margin, which demonstrated the effectiveness of the context filter. On the other hand, we also built similar models to further verify the effectiveness of the knowledge filter. The same comparison results were observed from the last three rows of Table 2, which demonstrated its effectiveness.
Case Study A case study was conducted to visualize the attention weights in both the context and knowledge filters of the FIRE model. A sample is shown in Table 3. The similarity scores s_mn in Eq. (7) for each utterance-entry pair are visualized in Figure 3 (a). The final scores s_n in Eq. (8) for each entry are visualized in Figure 3 (b). We can see that U2 and U4 obtained large attention weights with E2 and E4 respectively. Meanwhile, the irrelevant entries E1 and E3 obtained small similarity scores with the conversation, so they can be filtered out with an appropriate threshold. These results verified the effectiveness of the filtering process and the interpretability of the knowledge grounding process.

Knowledge Selection Figure 4 illustrates the validation set performance of FIRE with different thresholds γ in the knowledge filter. Here, the number of iterations L was set to 1 to save computation. When γ = 0, no knowledge entries were filtered out. From this figure, we can observe a consistent trend that the model performance improved when increasing γ at the beginning, which indicates that filtering out irrelevant entries indeed helped response selection. Then, the performance started to drop when γ became too large, since some relevant entries may be filtered out by mistake.
Iteratively Referring Figure 5 illustrates how the validation set performance of FIRE changed with respect to the number of iterations in iteratively referring. From it, we can see that three iterations led to the best performance on both datasets.
Complexity We analysed the time complexity difference between FIRE and DIM. We recorded their inference time over the validation set of PERSONA-CHAT under the configuration of original personas using a GeForce GTX 1080 Ti GPU. It took FIRE 109.5s and DIM 160.4s to finish inference, which shows that FIRE is more time-efficient. The reason is that we designed a lighter aggregation method in FIRE by replacing the recurrent neural network in the aggregation part of DIM with a single-layer non-linear transformation.

Conclusion
In this paper, we propose a method named Filtering before Iteratively REferring (FIRE) for utilizing the background knowledge of dialogue agents in retrieval-based chatbots. In this method, a context filter and a knowledge filter are first designed to make the representations of context and knowledge aware of each other. Second, an iteratively referring network is built to collect deep and comprehensive matching information for scoring response candidates. Experimental results show that FIRE achieves a new state-of-the-art performance on two datasets. In the future, we will explore better ways of integrating pre-trained language models into our proposed method for knowledge-grounded response selection.