Improving Response Selection in Multi-Turn Dialogue Systems by Incorporating Domain Knowledge

Building systems that can communicate with humans is a core problem in Artificial Intelligence. This work proposes a novel neural network architecture for response selection in an end-to-end multi-turn conversational dialogue setting. The architecture applies context level attention and incorporates additional external knowledge provided by descriptions of domain-specific words. It uses a bi-directional Gated Recurrent Unit (GRU) for encoding context and responses and learns to attend over the context words given the latent response representation and vice versa. In addition, it incorporates external domain specific information using another GRU for encoding the domain keyword descriptions. This allows better representation of domain-specific keywords in responses and hence improves the overall performance. Experimental results show that our model outperforms all other state-of-the-art methods for response selection in multi-turn conversations.


Introduction
In a conversation scenario, a dialogue system can be applied to the task of freely generating a new response or to the task of selecting a response from a set of candidate responses based on the previous utterances, i.e. the context of the dialogue. The former is known as a generative dialogue system, while the latter is called a retrieval-based (or response selection) dialogue system.
Both approaches can be realized using a modular architecture, where each module is responsible for a certain task such as natural language understanding, dialogue state-tracking, natural language generation, etc., or can be trained in an end-to-end manner optimized on a single objective function.

Table 1
Context
Utterance 1: My networking card is not working on my Ubuntu, can somebody help me?
Utterance 2: What's your kernel version? Run uname -r or sudo dpkg -l |grep linux-headers |grep ii |awk '{print $3}' and paste the output here.
Utterance 3: It's 2.8.0-30-generic.
Utterance 4: Your card is not supported in that kernel. You need to upgrade, that's like decade old kernel!
Utterance 5: Ok how do I install the new kernel??
Response
Just do sudo apt-get upgrade, that's it.
Previous work belonging to the latter category, by Lowe et al. (2015a), applied neural networks to multi-turn response selection in conversations by encoding the utterances in the context as well as the possible responses with a Long Short-term Memory (LSTM) (Hochreiter and Schmidhuber, 1997). Based on the context and response encodings, the neural network then estimates the probability for each response to be the correct one given the context. More recently, many enhanced architectures have been proposed that build on the general idea of encoding response and context first and performing some embedding-based matching afterwards (Dong and Huang, 2018).
Although such approaches result in efficient text-pair matching capabilities, they fail to attend over logical consistencies for longer utterances in the context, given the response. Moreover, in domain specific scenarios, a system's ability to incorporate additional domain knowledge can be very beneficial, e.g. for the example shown in Table 1.
In this paper, we propose a novel neural network architecture for multi-turn response-selection that extends the model proposed by Lowe et al. (2015a). Our major contributions are: (1) a neural network paradigm that is able to attend over important words in a context utterance given the response encoding (and vice versa), (2) an approach to incorporate additional domain knowledge into the neural network by encoding the description of domain specific words with a GRU and using a bilinear operation to merge the resulting domain specific representations with the vanilla word embeddings, and (3) an empirical evaluation on a publicly available multi-turn dialogue corpus showing that our system outperforms all other state-of-the-art methods for response selection in a multi-turn setting.

Related work
Recently, human-computer conversations have attracted increasing attention in the research community and dialogue systems have become a field of research in their own right. The conversation models proposed in early studies (Walker et al., 2001; Oliver and White, 2004; Stent et al., 2002) were designed for catering to specific domains only, e.g. for performing restaurant bookings, and required substantial rule-based strategy building and human effort in the building process. With the advancements in machine learning, there have been more and more studies on conversational agents which are based on data-driven approaches. Data-driven dialogue systems can chiefly be realized by two types of architectures: (1) pipeline architectures, which follow a modular pattern for modelling the dialogues, where each component is trained/created separately to perform a specific sub-task, and (2) end-to-end architectures, which consist of a single trainable module for modelling the conversations.
Task-oriented dialogue systems, which are designed to assist users in achieving specific goals, were mainly realized by pipeline architectures. Recently however, there have been more and more works on end-to-end dialogue systems because of the limitations of the former modular architectures, namely, the credit assignment problem and inter-component dependency, as for example described by Zhao and Eskenazi (2016). Wen et al. (2017) and Bordes et al. (2017) proposed encoder-decoder-based neural networks for modeling task oriented dialogues. Moreover, Zhao and Eskenazi (2016) proposed an end-to-end reinforcement learning-based system for jointly learning to perform dialogue state-tracking (Williams et al., 2013) and policy learning (Baird, 1995).
Since task-oriented systems primarily focus on completing a specific task, they usually do not allow free-flowing, articulate conversations with the user. Therefore, there has been considerable effort to develop non-goal-driven dialogue systems, which are able to converse with humans on an open domain (Ritter et al., 2011). Such systems can be modeled using either generative architectures, which are able to freely generate responses to user queries, or retrieval-based systems, which pick a response suitable to a context utterance out of a provided set of responses. Retrieval-based systems are therefore more limited in their output while having the advantage of producing more informative, constrained, and grammatically correct responses (Ji et al., 2014). Ritter et al. (2011) were the first to formulate the task of automatic response generation as phrase-based statistical machine translation, which they tackled with n-gram-based language models. Later approaches (Shang et al., 2015; Luong et al., 2015) applied Recurrent Neural Network (RNN)-based encoder-decoder architectures. However, dialogue generation is considerably more difficult than language translation because of the wide range of possible responses in interactions. Also, for dialogues, in order to generate a suitable response at a certain time step, knowing only the previous utterance is often not enough and the ability to leverage the context from the sequence of previous utterances is required. To overcome such challenges, a hierarchical RNN encoder-decoder-based system has been proposed by Serban et al. (2016) for leveraging contextual information in conversations.

Retrieval-based models
Earlier works on retrieval-based systems focused on modeling short-text, single-turn dialogues. Hao et al. (2013) introduced a data set for this task and proposed a response selection system which is based on information retrieval techniques like the vector space model and semantic matching. Ji et al. (2014) suggested applying a deep neural network for matching contexts and responses, while a topic-aware convolutional neural tensor network has been proposed for answer retrieval in short-text scenarios.
More recently, there has been a lot of focus on developing retrieval-based models for multi-turn dialogues, which is more challenging as the models need to take into account long-term dependencies in the context. Lowe et al. (2015a) introduced the Ubuntu Dialogue Corpus (UDC), which is the largest freely available multi-turn dialogue data set. Moreover, the authors proposed to leverage RNNs, e.g. LSTMs, to encode both the context and the response, before computing the score of the pair based on the similarity of the encodings (w.r.t. a certain measure). This class of methods is referred to as dual encoder architectures. Shortly after, Kadlec et al. (2015) investigated the performance of dual encoders with different kinds of encoder networks, such as convolutional neural networks (CNNs) and bi-directional LSTMs. A different approach trains a single CNN to map a context-response pair to the corresponding matching score.
Later on, various extensions of the dual encoder architecture have been proposed. One of them employs two encoders in parallel, one working on the word level and the other on the utterance level. Wu et al. (2017) proposed the Sequential Matching Network (SMN), where the candidate response is matched with every utterance in the context separately, based on which a final score is computed. The Cross Convolution Network (CCN) extends the dual encoder with a cross convolution operation, which is a dot product between the embeddings of the context and response followed by a max-pooling operation. Both of the outputs are concatenated and fed into a fully-connected layer for similarity matching. Moreover, the representation of rare words has been improved by learning different embeddings for them from the data. Handling rare words has also been studied by Dong and Huang (2018), who proposed to handle Out-of-Vocabulary (OOV) words by using both pre-trained word embeddings and embeddings from task-specific data.
Furthermore, many models targeting response selection along with other sentence pair scoring tasks such as paraphrasing, semantic text scoring, and recognizing textual entailment have been proposed. Baudiš et al. (2016) investigated a stacked RNN-CNN architecture and attention-based models for sentence-pair scoring. Match-SRNN (Wan et al., 2016) employs a spatial RNN to capture local interactions between sentence pairs. Match-LSTM (Wang and Jiang, 2016) improves its matching performance by using LSTM-based, attention-weighted sentence representations. QA-LSTM (Tan et al., 2016) uses a simple attention mechanism and combines the LSTM encoder with a CNN.
Incorporating unstructured domain knowledge into dialogue systems was initially studied by Lowe et al. (2015b). Follow-up work incorporated a loosely-structured knowledge base into a neural network using a special gating mechanism. There, the knowledge base was created from domain-specific data; however, the model is not able to leverage any external domain knowledge.

Background
In this section, we will explain the task at hand and give a brief introduction to the neural network architectures our proposed model is based on.

Problem definition
Let the data set $D = \{(c_i, r_i, y_i)\}_{i=1}^N$ be a set of $N$ triples, each consisting of a context $c_i$, a response $r_i$, and a ground-truth label $y_i$. Each context is a sequence of utterances, that is, $c_i = \{u_{il}\}_{l=1}^L$, where $L$ is the maximum context length. We define an utterance as a sequence of words $\{w_t\}_{t=1}^T$. Thus, $c_i$ can also be viewed as a sequence of words by concatenating all utterances in $c_i$. Each response $r_i$ is an utterance and $y_i \in \{0, 1\}$ is the corresponding label of the given triple, which takes a value of 1 if $r_i$ is the correct response for $c_i$ and 0 otherwise. The goal of retrieval-based dialogue systems is then to learn a predictive distribution $p(y \mid c, r, \theta)$ parameterized by $\theta$. That is, given a context $c$ and response $r$, we would like to infer the probability of $r$ being a response to context $c$.

RNNs, BiRNNs and GRUs
Recurrent neural networks are one of the most popular classes of models for processing sequences of words $W = \{w_t\}_{t=1}^T$ of arbitrary length $T \in \mathbb{N}$, e.g. utterances or sentences. Each word $w_t$ is first mapped onto its vector representation $\mathbf{w}_t$ (also referred to as word embedding), which serves as input to the RNN at time step $t$. The central element of RNNs is the recurrence relation of its hidden units, described by
$$\overrightarrow{h}_t = f(\overrightarrow{h}_{t-1}, \mathbf{w}_t; \phi), \tag{1}$$
where $\phi$ are the parameters of the RNN and $f$ is some nonlinear function. Accordingly, the state $\overrightarrow{h}_t$ of the hidden units at time step $t$ depends on the state $\overrightarrow{h}_{t-1}$ in the previous time step and the $t$-th word in the sequence. This way, the hidden state $\overrightarrow{h}_T$ obtained after $T$ updates contains information about the whole sequence $W$, and can thus be regarded as an embedding of the sequence.
The RNN architecture can also be altered to take into account dependencies coming from both the past and the future by adding an additional sub-RNN that moves backward in time, giving rise to the name bi-directional RNN (biRNN). To achieve this, the network architecture is extended by an additional set of hidden units. The states $\overleftarrow{h}_t$ of those hidden units are updated based on the current input word and the hidden state from the next time step. That is, for $t = 1, \dots, T-1$:
$$\overleftarrow{h}_t = f(\overleftarrow{h}_{t+1}, \mathbf{w}_t; \phi). \tag{2}$$
Here, the words are processed in reverse order, i.e. $w_T, \dots, w_1$, such that $\overleftarrow{h}_1$ (analogous to $\overrightarrow{h}_T$ in the forward directed RNN) contains information about the whole sequence. At the $t$-th time step, the model's hidden representation of the sequence is then usually obtained by the concatenation of the hidden states from the forward and the backward RNN, i.e. by
$$h_t = [\overrightarrow{h}_t, \overleftarrow{h}_t], \tag{3}$$
and the embedding of the whole sequence $W$ is given by
$$h_W = [\overrightarrow{h}_T, \overleftarrow{h}_1]. \tag{4}$$

Modeling very long sequences with RNNs is hard: Bengio et al. (1994) showed that RNNs suffer from vanishing and exploding gradients, which makes training over long-term dependencies difficult. Such problems can be addressed by augmenting the RNN with additional gating mechanisms, as is done in LSTMs and the Gated Recurrent Unit (GRU) (Cho et al., 2014). These mechanisms allow the RNN to learn flexibly how much to update the hidden state in each step and help the RNN to deal with the vanishing gradient problem in long sequences better than vanilla RNNs. The gating mechanism of GRUs is motivated by that of LSTMs, but is much simpler to compute and implement. It contains two gates, namely the update and reset gate, whose states at time $t$ are denoted by $z_t$ and $r_t$, respectively. Formally, a GRU is defined by the following update equations
$$\begin{aligned}
z_t &= \sigma(W_z x_t + U_z h_{t-1}), \\
r_t &= \sigma(W_r x_t + U_r h_{t-1}), \\
\tilde{h}_t &= \tanh(W_h x_t + U_h (r_t \odot h_{t-1})), \\
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t,
\end{aligned} \tag{5}$$
where $x_t$ is the input (corresponding to $\mathbf{w}_t$ in our setting), $\sigma$ is the sigmoid function, $\odot$ denotes element-wise multiplication, and the set of weight matrices $\phi = \{W_z, U_z, W_r, U_r, W_h, U_h\}$ constitutes the learnable model parameters.
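As an illustration, a single GRU step in the standard formulation of Cho et al. (2014) can be sketched in plain Python, with lists standing in for vectors and matrices. This is a minimal sketch, not the authors' implementation: the helper names (`matvec`, `gru_step`, `encode`) are hypothetical, bias terms are omitted, and the convention that the update gate `z` weights the candidate state is one of two common choices.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def sigmoid(v):
    """Element-wise logistic sigmoid."""
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def gru_step(x_t, h_prev, params):
    """One GRU update: returns h_t given input x_t and previous state h_prev.

    params = (W_z, U_z, W_r, U_r, W_h, U_h); W_* map the input to the hidden
    size, U_* map the previous hidden state to the hidden size.
    """
    W_z, U_z, W_r, U_r, W_h, U_h = params
    # Update gate z_t and reset gate r_t.
    z = sigmoid([a + b for a, b in zip(matvec(W_z, x_t), matvec(U_z, h_prev))])
    r = sigmoid([a + b for a, b in zip(matvec(W_r, x_t), matvec(U_r, h_prev))])
    # Candidate state uses the reset-gated previous state.
    gated_prev = [r_i * h_i for r_i, h_i in zip(r, h_prev)]
    h_cand = [math.tanh(a + b)
              for a, b in zip(matvec(W_h, x_t), matvec(U_h, gated_prev))]
    # Interpolate between the old state and the candidate state.
    return [(1.0 - z_i) * h_i + z_i * c_i
            for z_i, h_i, c_i in zip(z, h_prev, h_cand)]


def encode(sequence, params, hidden_size):
    """Run the GRU over a sequence of input vectors; return the final state."""
    h = [0.0] * hidden_size
    for x_t in sequence:
        h = gru_step(x_t, h, params)
    return h
```

A bi-directional encoder would run `encode` once over the sequence and once over its reverse, with separate parameters, and concatenate the two final states.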

Dual Encoder
Recurrent neural networks and their variants have been used in many applications in the field of natural language processing, including retrieval-based dialogue systems. In this area, the dual encoder (DE) (Lowe et al., 2015a) became a popular model. It uses a single RNN encoder to transform both context and response into low-dimensional vectors and computes their similarity. More formally, let $h_c$ and $h_r$ be the encoded context and response, respectively. The probability of $r$ being the correct response for $c$ is then computed by the DE as
$$p(y = 1 \mid c, r, \theta) = \sigma(h_c^\top M h_r + b), \tag{6}$$
where $\theta = \{\phi, M, b\}$ (recall that $\phi$ is the set of parameters of the encoder RNN that outputs $h_c$ and $h_r$) is the set of parameters of the full model and $\sigma$ is the sigmoid function. Note that the same RNN is used to encode both context and response. In summary, this approach can be described as first creating latent representations of context and response in the same vector space and then using the similarity between these latent embeddings (as induced by matrix $M$ and bias $b$) for estimating the probability of the response being the correct one for the given context.
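The scoring step of the dual encoder, a sigmoid applied to the bilinear similarity of the two encodings plus a bias, can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and the encodings `h_c` and `h_r` are assumed to have been computed already by the shared RNN encoder.

```python
import math


def de_probability(h_c, h_r, M, b):
    """Return sigmoid(h_c^T M h_r + b) for context/response encodings."""
    M_h_r = [sum(m_ij * r_j for m_ij, r_j in zip(row, h_r)) for row in M]
    score = sum(c_i * v_i for c_i, v_i in zip(h_c, M_h_r)) + b
    return 1.0 / (1.0 + math.exp(-score))


def rank_responses(h_c, candidate_encodings, M, b):
    """Return candidate indices sorted from most to least probable."""
    probs = [de_probability(h_c, h_r, M, b) for h_r in candidate_encodings]
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
```

For response selection, the candidate with the highest probability under the model is returned, which is the first index produced by `rank_responses`.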

Model description
Our model extends the DE described in Section 3.3 by two attention mechanisms which make the context encoding response-aware and vice versa. Furthermore, we augment the model with a mechanism for incorporating external knowledge to improve the handling of rare words. Both extensions are described in detail in the following subsections.

Attention augmented encoding
As described above, in the DE, context and response are encoded independently of each other by the same RNN. Instead of simply taking the final hidden state $h_c$ (and $h_r$) of the RNN as context (and response) encoding, we propose to use a response-aware attention mechanism to calculate the context embedding and vice versa.
In the following, we describe this mechanism formally. Recall that a context $c$ can be seen as a sequence of words $\{w_t^c\}_{t=1}^T$ where all utterances are concatenated and $T$ is the total number of words in the context. Given this sequence, the RNN (in our experiments a bi-directional GRU) produces a sequence of hidden states $h_1^c, \dots, h_T^c$ and an encoding of the whole context sequence $h_c$ as described in Section 3.2. Analogously, we get $h_1^r, \dots, h_T^r$ and $h_r$ for a response consisting of a sequence of words $\{w_t^r\}_{t=1}^T$, where $T$ is the total number of words in the response.
For calculating the response-aware context encoding, we first estimate attention weights $\alpha_t^c$ for the hidden state $h_t^c$ in each time step, depending on the response encoding $h_r$:
$$\alpha_t^c = \frac{\exp\big((h_t^c)^\top W_c h_r\big)}{\sum_{i=1}^{T} \exp\big((h_i^c)^\top W_c h_r\big)}, \tag{7}$$
where $W_c$ is a learnable parameter matrix. The response-aware context embedding is then given by
$$\hat{h}_c = \sum_{t=1}^{T} \alpha_t^c h_t^c. \tag{8}$$
Intuitively, this means that, depending on the response, we focus on different parts of the context sequence when judging how well the response matches the context. This loosely resembles how a human would focus on the relevant parts of the context. Similarly, we calculate the context-aware response embedding by attending over the hidden states of the response given the context encoding.

Figure 1: Our proposed way to incorporate domain knowledge into the model. $\beta_t$ and $1 - \beta_t$ represent the (multiplicative) weights for the description embedding and the word embedding, respectively. The resulting combination, $\hat{\mathbf{w}}_t^r$, acts as an input of the encoder.
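The response-aware context encoding described in this subsection can be sketched in plain Python. The sketch assumes that the attention scores are bilinear in the hidden state and the query encoding and are normalised with a softmax; the function names are hypothetical, and `W` plays the role of the learnable matrix $W_c$.

```python
import math


def attention_weights(hidden_states, query, W):
    """Softmax over bilinear scores h_t^T W query, one weight per time step."""
    W_query = [sum(w_ij * q_j for w_ij, q_j in zip(row, query)) for row in W]
    scores = [sum(h_i * v_i for h_i, v_i in zip(h_t, W_query))
              for h_t in hidden_states]
    # Subtract the maximum score for numerical stability.
    max_score = max(scores)
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attended_encoding(hidden_states, query, W):
    """Attention-weighted sum of the hidden states."""
    alphas = attention_weights(hidden_states, query, W)
    dim = len(hidden_states[0])
    return [sum(a * h_t[d] for a, h_t in zip(alphas, hidden_states))
            for d in range(dim)]
```

Calling `attended_encoding(context_states, h_r, W_c)` yields the response-aware context embedding; swapping the roles of context and response (with a separate weight matrix) gives the context-aware response embedding.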

Incorporating domain keyword descriptions
Bahdanau et al. (2018) proposed a method for learning embeddings for OOV words based on external dictionary definitions. They learn these description embeddings of words using an LSTM for encoding the corresponding definition. If a particular word included in the dictionary also appears in the corpus' vocabulary (for which vanilla word embeddings are given), they add the word embedding and the description embedding together. Otherwise, in the case of OOV words, they use solely the description embedding in place of the missing word embedding. Inspired by this approach, we use a similar technique to incorporate domain keyword descriptions into word embeddings. If a word $w_t^r$ in the response utterance is in the set of domain keywords $\mathcal{K}$, we first extract its description. The description of $w_t^r$ is a sequence of words $\{w_{tk}^d\}_{k=1}^K$, which is projected onto a sequence of embeddings $\{\mathbf{w}_{tk}^d\}_{k=1}^K$. This sequence is encoded using another bi-directional GRU to obtain a vector representation $h_t^d$ of the same dimension as the vanilla word embeddings. If $w_t^r$ is not in $\mathcal{K}$, we simply set $h_t^d$ to zero. We call $h_t^d$ the description embedding.
Some domain-specific words might also happen to be common words. For instance, in the case of the UDC's vocabulary, there exist tokens such as shutdown¹ or who², which are ambiguous, i.e., although they are valid UNIX commands, they are also common words in natural language. The description embeddings of domain-specific words can be simply added to the vanilla word embeddings as suggested by Bahdanau et al. (2018). However, it might be advantageous if the model can determine itself whether to treat the current word as a domain-specific word, a common word, or something in between, depending on the context. For instance, if the context is mainly talking about system users, then who is most likely a UNIX keyword.

Therefore, we propose a more flexible way to combine the description embedding $h_t^d$ and the word embedding $\mathbf{w}_t^r$, that is, we define the final word embedding to be a convex combination of both, and let the combination coefficients be given by a function of $h_t^d$ and the context embedding $\hat{h}_c$. Intuitively, this allows the model to flexibly focus on the description or the vanilla embedding, depending on the context and the description. Formally, the combination coefficients $\beta_t$ of the $t$-th word in the response are given by
$$\beta_t = \sigma(U h_t^d + V \hat{h}_c), \tag{9}$$
where $U$ and $V$ are learnable parameter matrices. Note that $\beta_t$ is a vector of the same dimension as the embeddings. The final embedding of $w_t^r$ (which serves as input to the response encoder) is then the weighted sum
$$\hat{\mathbf{w}}_t^r = \beta_t \odot h_t^d + (1 - \beta_t) \odot \mathbf{w}_t^r, \tag{10}$$
where $\odot$ denotes element-wise multiplication.
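The context-dependent convex combination of description embedding and word embedding can be sketched as below. The sketch assumes that the coefficient vector is obtained by applying a sigmoid to a linear function of the description embedding and the context embedding, which keeps every coefficient in (0, 1); the function and argument names are hypothetical.

```python
import math


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]


def combine_embeddings(word_emb, desc_emb, context_emb, U, V):
    """Gate between description and word embedding, element by element.

    Returns (combined_embedding, beta), where beta weights the description
    embedding and (1 - beta) weights the vanilla word embedding.
    """
    pre_activation = [a + b for a, b in
                      zip(matvec(U, desc_emb), matvec(V, context_emb))]
    beta = [1.0 / (1.0 + math.exp(-a)) for a in pre_activation]
    combined = [b * d + (1.0 - b) * w
                for b, d, w in zip(beta, desc_emb, word_emb)]
    return combined, beta
```

Compared with simple addition of the two embeddings, the gate lets the same token (e.g. who) receive a different input representation depending on the context in which it appears.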

Ubuntu multi-turn dialogue corpus
Extending the work of Uthus and Aha (2013), Lowe et al. (2015a) introduced a version of the Ubuntu chat log conversations which is the largest publicly available multi-turn, dyadic, and domain-specific dialogue data set. The chats are extracted from Ubuntu-related, topic-specific chat rooms in the Freenode Internet Relay Chat (IRC) network. Usually, experienced users address a problem of someone by suggesting a potential solution and a name mention of the addressed user. A conversation between a pair of users often stops when the problem has been solved. However, they might continue having a discussion which is not related to the topic. A preprocessed version of the above corpus and the needed vocabulary are provided by Wu et al. (2017). The preprocessing consisted of replacing numbers, URLs, and system paths with special placeholders. No additional preprocessing is performed by us. The data set consists of 1 million training triples, 500k validation triples, and 500k test triples. One half of the 1 million training triples are positive (triples with $y = 1$, i.e. the provided response fits the context), the other half negative (triples with $y = 0$). In contrast, in the validation and test sets, for every context $c_i$, there exists one positive triple providing the ground-truth response to $c_i$ and nine negative triples with unbefitting responses. Thus, in these sets, the ratio between positive and negative triples per context is 1:9, which makes evaluating the model with information retrieval metrics such as Recall@k possible (see Section 6).

¹ UNIX command for system shutdown.
² UNIX command to get a list of currently logged-in users.

Model hyperparameters
We chose a word embedding dimension of 200 as done by Wu et al. (2017). We use fastText (Bojanowski et al., 2016) to pre-train the word embeddings using the training set instead of using off-the-shelf word embeddings, following Wu et al. (2017). We set the hidden dimension of our GRU to be 300, as in the work of Lowe et al. (2015a). We restricted the sequence length of a context to a maximum of 320 words, and that of the response to 160. Because of the resulting size of the model and limited GPU memory, we had to use a smaller batch size of 32. We optimize the binary cross entropy loss of our model with respect to the training data using Adam (Kingma and Ba, 2015) with an initial learning rate of 0.0001. We train our model for a maximum of 20 epochs since, in our experience, this is more than enough to achieve convergence. The training is stopped when the validation recall does not increase after three subsequent epochs.
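The stopping rule can be sketched as a small training loop. This is a sketch under assumptions: the truncated sentence above is read as "three consecutive epochs without an increase in validation recall", and `run_epoch` is a hypothetical callback that trains for one epoch and returns the validation recall.

```python
def train_with_early_stopping(run_epoch, max_epochs=20, patience=3):
    """Train until max_epochs or until recall stalls for `patience` epochs.

    run_epoch(epoch) is assumed to train one epoch and return the validation
    recall. Returns (best_epoch, best_recall, epochs_run).
    """
    best_recall = float("-inf")
    best_epoch = 0
    epochs_without_gain = 0
    epochs_run = 0
    for epoch in range(1, max_epochs + 1):
        recall = run_epoch(epoch)
        epochs_run = epoch
        if recall > best_recall:
            best_recall, best_epoch = recall, epoch
            epochs_without_gain = 0
        else:
            epochs_without_gain += 1
            if epochs_without_gain >= patience:
                break
    return best_epoch, best_recall, epochs_run
```

In practice, the model parameters from `best_epoch` would be kept for evaluation on the test set.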

Results
Following Lowe et al. (2015a) and Kadlec et al. (2015), we use the Recall@k evaluation metric, where $R_n@k$ corresponds to the fraction of examples for which the correct response is among the $k$ best out of a set of $n$ candidate responses, ranked according to their probabilities under the model.
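The Recall@k metric can be computed as in the following sketch. It assumes that, for every test context, the model scores of all n candidates are collected in one list and that the ground-truth response sits at a known position (index 0 by default); both the function name and this layout are illustrative assumptions.

```python
def recall_at_k(score_lists, k, correct_index=0):
    """Fraction of examples whose correct response is among the top k.

    score_lists holds one list of n candidate scores per context;
    correct_index is the position of the ground-truth response in each list.
    """
    hits = 0
    for scores in score_lists:
        ranked = sorted(range(len(scores)),
                        key=lambda i: scores[i], reverse=True)
        if correct_index in ranked[:k]:
            hits += 1
    return hits / len(score_lists)
```

With one positive and nine negative candidates per context, each score list has n = 10 entries, and calling the function with k = 1, 3, and 5 gives the R_10@1, R_10@3, and R_10@5 values.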

Comparison against baselines
We compare our model, which we refer to as Attention and external Knowledge augmented DE with bi-directional GRU (AK-DE-biGRU), against models previously tested on the same data set: the basic DE models analyzed by Lowe et al. (2015a) and Kadlec et al. (2015); the sentence-pair scoring models Match-LSTM (Wang and Jiang, 2016) and QA-LSTM (Tan et al., 2016); architectures processing the context utterances individually, namely $\text{SMN}_{dyn}$ (Wu et al., 2017) and CCN; and the recently proposed ESIM (Dong and Huang, 2018).
The results are reported in Table 2. Our model outperforms all other models used as baselines. The largest improvements of our model over the best baseline (ESIM in general, and $\text{SMN}_{dyn}$ for the $R_2@1$ metric) are on the $R_{10}@1$ and $R_{10}@3$ metrics, where we observed absolute improvements of 0.013 and 0.014, corresponding to relative improvements of 1.8% and 1.6%, respectively. For $R_2@1$ and $R_{10}@5$ we observed more modest improvements of 0.007 (0.8%) and 0.005 (0.5%), respectively. Our results are significantly better than the best baseline (ESIM), with $p < 10^{-6}$ for a one-sample one-tailed t-test on the $R_{10}@1$, $R_{10}@3$, and $R_{10}@5$ metrics, using the outcome of 15 independent experiments. The variance between different trials is smaller than 0.001 for all evaluation metrics.

Ablation study
Our model differs in various ways from the vanilla DE: it uses a GRU instead of an LSTM for the encoding, introduces an attention mechanism for the encoding of the context and another for the encoding of the response, and incorporates additional knowledge in the response encoding process.
To analyze the effect of these components on the overall performance, we analyzed different model variants: a DE using a GRU or a bi-directional GRU as encoder (DE-GRU and DE-biGRU, respectively) and both of these models with attention augmented encoding for embedding both context and response (A-DE-GRU and A-DE-biGRU, respectively). We also tested the effects of using a simple addition instead of the weighted summation given in equation (10) for merging the word embedding with the description embedding ($\text{AK}^{+}$-DE-biGRU). Finally, we investigated a version of our model (AK-DE-biGRU$_{w2v}$) where we used pre-trained word2vec embeddings, as done by Wu et al. (2017), instead of learning our own word embeddings from the data set.
The results of the study are presented in Table 3. With the basic models, i.e. DE-GRU and DE-biGRU, as baselines, we observed around 4% and 9% improvement on $R_{10}@1$ when incorporating the attention mechanism (A-DE-GRU and A-DE-biGRU, respectively).
When domain knowledge is incorporated by simple addition (as in the work of Bahdanau et al. (2018)), i.e. in $\text{AK}^{+}$-DE-biGRU, we noticed a further improvement of 0.5%. Note, however, that the results are not as good as when using the proposed weighted addition. Finally, using our method of incorporating domain knowledge in combination with embeddings trained from scratch with fastText (Bojanowski et al., 2016), the performance gets 0.3% better than when using pre-trained word2vec embeddings. In total, compared to the DE-biGRU baseline, our model (AK-DE-biGRU) achieves an improvement of 10% in terms of the $R_{10}@1$ metric. Thus, the results clearly suggest that both the attention mechanism and the incorporation of domain knowledge are effective approaches for improving the dual encoder architecture.

Curiously, we noticed that for the baseline models, using a GRU as the encoder is better than using a biGRU. This finding is in line with the results from Kadlec et al. (2015) reported in Table 2. However, the tables turn when augmenting the models with an attention mechanism, where the biGRU-based model outperforms the one with the GRU. This observation motivates us to use a biGRU instead of a GRU in our final model.

Table 4: Example response utterances.
gui for shutdown try typing sudo shutdown -h now
sudo apt-get install qt4-designer there could be some qt dev packages too but i think the above will install them as dependencies
certainly won n't make a difference i m sure but maybe try sudo shutdown -r

Visualizing response attentions
To further investigate the results given by our model, we qualitatively inspected several samples of response utterances and their attention weights, as shown in Table 4. We noticed that our model learned to focus on technical terms, such as lspci, shutdown, and traceroute. We also observed that the model is able to capture contextual importance, i.e. it is able to focus on context-relevant words. For example, given the context in Table 5 and the correct response in the first row of Table 4, one can see the attention on the word shutdown, where it gets a lower weight when used as a common word in the first occurrence than as a UNIX command in the second.

Table 5: Sample context utterances from UDC's test set whose correct response is the first utterance in Table 4.
Utterance 1: Ubuntu <version>
Utterance 2: hi all sony vaio fx120 will not turn off when shutting down, any ideas? btw acpi=off in boot parameters anything else i should be trying?
Utterance 3: how are you shutting down i.e. terminal or gui?

Error analysis
We qualitatively analyzed the errors our method made. We observed that our model's predictions are biased toward high-information utterances. That is, for some examples in which the correct response is generic (i.e. has low information), our model chooses a non-generic response, as shown in Table 6. Furthermore, we computed the average utterance information content (the entropy) for both the correct and the predicted responses, based on Xu and Reitter (2018), and obtained 9.25 bits and 9.34 bits, respectively. This quantitatively indicates that our model is slightly biased toward high-information responses.
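To make the notion of utterance information content concrete, the following sketch scores an utterance by its average per-word surprisal in bits. It is a deliberate simplification and not the procedure of Xu and Reitter (2018): it assumes a unigram language model with add-one smoothing estimated from a list of whitespace-tokenised utterances, and all function names are hypothetical.

```python
import math
from collections import Counter


def unigram_model(corpus):
    """Build an add-one smoothed unigram probability function from utterances."""
    counts = Counter(token for utterance in corpus for token in utterance.split())
    total = sum(counts.values())
    # One extra slot in the denominator reserves mass for unseen tokens.
    denominator = total + len(counts) + 1

    def prob(token):
        return (counts[token] + 1) / denominator

    return prob


def utterance_information(utterance, prob):
    """Average surprisal -log2 p(w) over the words of an utterance, in bits."""
    tokens = utterance.split()
    if not tokens:
        return 0.0
    return sum(-math.log2(prob(token)) for token in tokens) / len(tokens)
```

Under such a model, generic replies built from frequent words receive fewer bits per word than replies containing rare technical terms, which is the kind of difference the comparison above measures.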

Conclusion and future work
We presented a novel model which extends the dual encoder architecture for multi-turn response selection by incorporating external domain knowledge and attention augmented encoding. Our experimental results demonstrate that our model outperformed other state-of-the-art methods for response selection in a multi-turn dialogue setting, and that the attention mechanism and incorporating additional domain knowledge are indeed effective approaches for improving the response selection performance of the dual encoder architecture. Further improvement might be made by also considering domain knowledge in the context and by improving the handling of OOV words, e.g. by widening our domain-specific word vocabulary and handling generic OOV words such as typos.

Table 6: Examples of errors our model made. We observed that our model's predictions are biased towards non-generic responses.
Correct: ok will do :), nope.
Predicted: url if you go down to the bottom of that tutorial i also have a post there that is a bit more detailed about my problem poster name is trent
Correct: hmm! ok
Predicted: as did i w/ fbsd ... just check out the livecd for a bit
Correct: okay thank you a thread i hope :)
Predicted: hmm ok because im not sure about iwconfig and wpa but we can give it a try do gksudo gedit path then add a record like this url
Correct: right .. it is, it exists i verified
Predicted: i want to connect to your computer remotely if you allow me to so i can fix the problem for you just follow the following procedure.
Correct: roger .. lemme check, got it ... thanks dude :)
Predicted: just click the partition and then click the blue text next to mount point or you can simply navigate to that path