Exemplar Encoder-Decoder for Neural Conversation Generation

In this paper we present the Exemplar Encoder-Decoder network (EED), a novel conversation model that learns to utilize similar examples from training data to generate responses. Similar conversation examples (context-response pairs) from training data are retrieved using a traditional TF-IDF based retrieval model, and the corresponding responses are used by our decoder to generate the ground truth response. The contribution of each retrieved response is weighted by the similarity of its context to the input context. As a result, our model learns to assign higher similarity scores to those retrieved contexts whose responses are crucial for generating the final response. We present detailed experiments on two large data sets and find that our method outperforms state-of-the-art sequence-to-sequence generative models on several recently proposed evaluation metrics.


Introduction
With the availability of large datasets and the recent progress made by neural methods, variants of sequence to sequence learning (seq2seq) (Sutskever et al., 2014) architectures have been successfully applied for building conversational systems (Serban et al., 2017b). However, despite these methods being the state-of-the-art frameworks for conversation generation, they suffer from problems such as lack of diversity in responses and generation of short, repetitive and uninteresting responses (Liu et al., 2016; Serban et al., 2017b). A large body of recent literature has focused on overcoming such challenges (Li et al., 2016a; Lowe et al., 2017).
In part, such problems arise because all the information required to generate responses needs to be captured in the model parameters learnt from the training data. These model parameters alone may not be sufficient for generating natural conversations. Therefore, even when trained on enormous amounts of data, neural generative systems have been found to be ineffective for use in real-world applications (Liu et al., 2016).
In this paper, we focus our attention on closed domain conversations. A characteristic feature of such conversations is that, over a period of time, some conversation contexts are likely to have occurred previously (Lu et al., 2017b). For instance, Table 1 shows some contexts from the Ubuntu dialog corpus. Each row presents an input dialog context with its corresponding gold response, followed by a similar context and response seen in training data. As can be seen, contexts for "installing dms", "sharing files", and "blocking ufw ports" have all occurred in training data. We hypothesize that being able to refer to training responses for previously seen similar contexts could be a helpful signal to use while generating responses.
In order to exploit this aspect of closed domain conversations, we build a neural encoder-decoder architecture called the Exemplar Encoder-Decoder (EED), which learns to generate a response for a given context by exploiting similar contexts from training conversations. Thus, instead of having the seq2seq model learn patterns of language only from aligned parallel corpora, we assist the model by providing it with closely related (similar) samples from the training data that it can refer to while generating text.
Specifically, given a context $c$, we retrieve a set of context-response pairs $(c^{(k)}, r^{(k)})$, $1 \le k \le K$, using an inverted index of training data. We create an exemplar vector $e^{(k)}$ by encoding the response $r^{(k)}$ (also referred to as the exemplar response) along with an encoded representation of the current context $c$. We then learn the importance of each exemplar vector $e^{(k)}$ based on the likelihood of it being able to generate the ground truth response. We believe that $e^{(k)}$ may contain information that is helpful in generating the response. Table 1 highlights the words in exemplar responses that also appear in the ground truth response.

Contributions:
We present a novel Exemplar Encoder-Decoder (EED) architecture that makes use of similar conversations fetched from an index of training data. The retrieved context-response pairs are used to create exemplar vectors, which the decoder in the EED model uses to learn the importance of training context-response pairs while generating responses. We present detailed experiments on the publicly benchmarked Ubuntu dialog corpus data set (Lowe et al., 2015) as well as a large collection of more than 127,000 technical support conversations. We compare the performance of the EED model with existing state-of-the-art generative models such as HRED and VHRED (Serban et al., 2017b). We find that our model outperforms these models on a wide variety of metrics, such as the recently proposed Activity and Entity metrics (Serban et al., 2017a) and Embedding-based metrics (Lowe et al., 2015). In addition, we present qualitative insights into our results and find that exemplar-based responses are more informative and diverse.
The rest of the paper is organized as follows. Section 2 briefly describes recent work in neural dialogue generation. The proposed EED model for dialogue generation is described in detail in Section 3. In Section 4, we describe the datasets as well as the details of the models used during training. We present quantitative and qualitative results of the EED model in Section 5.

Related Work
In this section, we compare our work against other data-driven end-to-end conversation models. End-to-end conversation models can be further classified into two broad categories: generation based models and retrieval based models.
Generation based models cast the problem of dialogue generation as a sequence to sequence learning problem. Initial works treat the entire context as a single long sentence and learn an encoder-decoder framework to generate the response word by word (Shang et al., 2015; Vinyals and Le, 2015). This was followed by work that models context better by breaking it into conversation history and last utterance (Sordoni et al., 2015b). Context was further modeled effectively by using a hierarchical encoder-decoder (HRED) model, which first learns a vector representation of each utterance and then combines these representations to learn a vector representation of the context. Later, an alternative hierarchical model called VHRED (Serban et al., 2017b) was proposed, where generated responses were conditioned on latent variables. This leads to more informative responses and adds diversity to response generation. Models that explicitly incorporate diversity in response generation have also been studied in the literature (Li et al., 2016b; Vijayakumar et al., 2016; Cao and Clark, 2017).
Our work differs from the above in that none of these approaches explicitly utilizes similar conversation contexts observed in the training data.
Retrieval based models, on the other hand, treat the conversation context as a query and obtain a set of responses from the conversation logs using information retrieval (IR) techniques (Ji et al., 2014). There has been further work where the retrieved responses are re-ranked using a deep learning based model (Yan et al., 2016a,b; Qiu et al., 2017). At the other end of the spectrum, end-to-end deep learning based rankers have also been employed to generate responses (Wu et al., 2017; Henderson et al., 2017). Recently, a framework has also been proposed that uses a discriminative dialog network to rank the candidate responses received from a response generator network and trains both networks in an end-to-end manner (Lu et al., 2017a).
In contrast to the above models, we use the input contexts as well as the retrieved responses for generating the final responses. Contemporaneous to our work, a generative model for machine translation that employs retrieved translation pairs has also been proposed (Gu et al., 2017). We note that while the underlying premise of both the papers remains the same, the difference lies in the mechanism of incorporating the retrieved data.

Overview
A conversation consists of a sequence of utterances. At a given point in the conversation, the utterances expressed prior to it are jointly referred to as the context. The utterance that immediately follows the context is referred to as the response. As discussed in Section 1, given a conversational context, we wish to generate a response by utilizing similar context-response pairs from the training data. We retrieve a set of K exemplar context-response pairs from an inverted index created off-line from the training data. The input and the retrieved context-response pairs are then fed to the Exemplar Encoder-Decoder (EED) network. A schematic illustration of the EED network is presented in Figure 1. The EED encoder combines the input context and the retrieved responses to create a set of exemplar vectors. The EED decoder then uses the exemplar vectors, weighted by the similarity between the input context and the retrieved contexts, to generate a response. We now provide details of each of these modules.

Retrieval of Similar Context-Response Pairs
Given a large collection of conversations as (context, response) pairs, we index each response and its corresponding context in TF-IDF vector space. We further extract the last turn of a conversation and index it as an additional attribute of the context-response document pairs so as to allow directed queries based on it.
Given an input context $c$, we construct a query that weighs the last utterance in the context twice as much as the rest of the context, and use it to retrieve the top-K similar context-response pairs from the index based on a BM25 (Robertson et al., 2009) retrieval model. These retrieved pairs form our exemplar context-response pairs $(c^{(k)}, r^{(k)})$, $1 \le k \le K$.
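The retrieval step above can be illustrated with a minimal, dependency-free sketch. It is not the indexing setup used in the experiments: it scores only the stored contexts (the separate last-turn attribute is omitted), uses whitespace tokenization, and takes the common BM25 defaults `k1=1.2` and `b=0.75`, all of which are assumptions. The names `BM25Index`, `build_query`, and `top_k` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    # Simplistic whitespace tokenizer (an assumption of this sketch).
    return text.lower().split()


def build_query(context_utterances):
    """Bag-of-words query in which the last utterance counts twice."""
    tokens = []
    for utterance in context_utterances[:-1]:
        tokens += tokenize(utterance)
    tokens += 2 * tokenize(context_utterances[-1])
    return Counter(tokens)


class BM25Index:
    def __init__(self, pairs, k1=1.2, b=0.75):
        # pairs: list of (context_text, response_text) from training data.
        self.pairs = pairs
        self.k1, self.b = k1, b
        self.docs = [Counter(tokenize(context)) for context, _ in pairs]
        self.lengths = [sum(doc.values()) for doc in self.docs]
        self.avg_len = sum(self.lengths) / len(self.docs)
        # Document frequency: number of contexts containing each term.
        self.df = Counter(term for doc in self.docs for term in doc)

    def idf(self, term):
        n, df = len(self.docs), self.df.get(term, 0)
        return math.log(1 + (n - df + 0.5) / (df + 0.5))

    def score(self, query, i):
        doc, length = self.docs[i], self.lengths[i]
        total = 0.0
        for term, query_tf in query.items():
            tf = doc.get(term, 0)
            if tf == 0:
                continue
            denom = tf + self.k1 * (1 - self.b + self.b * length / self.avg_len)
            total += query_tf * self.idf(term) * tf * (self.k1 + 1) / denom
        return total

    def top_k(self, context_utterances, k):
        query = build_query(context_utterances)
        ranked = sorted(range(len(self.pairs)),
                        key=lambda i: self.score(query, i), reverse=True)
        return [self.pairs[i] for i in ranked[:k]]


# Toy training pairs (invented for illustration).
pairs = [
    ("how do i install gdm", "sudo apt-get install gdm"),
    ("how can i share files between machines", "try samba"),
    ("how do i block a port with ufw", "sudo ufw deny 22"),
]
index = BM25Index(pairs)
exemplars = index.top_k(["i use ubuntu", "how do i block a port"], k=1)
```

Doubling the term counts of the last utterance is one simple way to realize the "twice as much" weighting; a production index would more likely apply a field boost at query time.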

Exemplar Encoder Network
Given the exemplar pairs $(c^{(k)}, r^{(k)})$, $1 \le k \le K$, and an input context-response pair $(c, r)$, we feed the input context $c$ and the exemplar contexts $c^{(1)}, \ldots, c^{(K)}$ through an encoder to generate the embeddings as given below:

$$c_e = \mathrm{Encode}_c(c), \qquad c_e^{(k)} = \mathrm{Encode}_c(c^{(k)}), \quad 1 \le k \le K.$$

Note that we do not constrain our choice of encoder; any parametrized differentiable architecture can be used as the encoder to generate the above embeddings. Similarly, we feed the exemplar responses $r^{(1)}, \ldots, r^{(K)}$ through a response encoder to generate the response embeddings:

$$r_e^{(k)} = \mathrm{Encode}_r(r^{(k)}), \quad 1 \le k \le K.$$

Next, we concatenate the exemplar response encoding $r_e^{(k)}$ with the input context encoding $c_e$ to form the exemplar vector $e^{(k)} = [c_e; r_e^{(k)}]$. The exemplar vector thus carries information about similar responses along with the encoded input context representation.
The exemplar vectors $e^{(k)}$, $1 \le k \le K$, are further used by the decoder for generating the ground truth response, as described in the next section.

Exemplar Decoder Network
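Recall that we want the exemplar responses to help generate the response according to how similar the corresponding contexts are to the input context. The more similar an exemplar context is to the input context, the greater its effect on the generated response should be. To this end, we compute the similarity scores $s^{(k)}$, $1 \le k \le K$, using the encodings computed in Section 3.3 as shown below, where $c_e$ and $c_e^{(k)}$ denote the encodings of the input context and of the $k$-th exemplar context:

$$s^{(k)} = \frac{\exp\left(c_e^{\top} c_e^{(k)}\right)}{\sum_{l=1}^{K} \exp\left(c_e^{\top} c_e^{(l)}\right)}.$$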
Next, each exemplar vector $e^{(k)}$ computed in Section 3.3 is fed to a decoder, which is responsible for predicting the ground truth response from the exemplar vector. Let $p_{dec}(r \mid e^{(k)})$ be the probability of generating the ground truth response given the exemplar vector. The objective function to be maximized is expressed as a function of the scores $s^{(k)}$, the decoding distribution $p_{dec}$, and the exemplar vectors $e^{(k)}$ as shown below:

$$ll = \log \sum_{k=1}^{K} s^{(k)} \, p_{dec}(r \mid e^{(k)}).$$

Note that we weight the contribution of each exemplar vector to the final objective based on how similar the corresponding context is to the input context. Moreover, the similarities are differentiable functions of the input and hence trainable by backpropagation. The model should learn to assign higher similarities to the exemplar contexts whose responses are helpful for generating the correct response. The model description uses encoder and decoder networks that can be implemented using any differentiable parametrized architecture. We discuss our choices for the encoders and decoder in the next section.
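As a numerical illustration of the similarity-weighted objective, the sketch below evaluates it for given encodings and decoder log-probabilities. It assumes the similarity scores are a softmax over dot products between the input context encoding and each exemplar context encoding, and it works in log space for stability. The function names are hypothetical, and the decoder log-probabilities are supplied as plain numbers rather than produced by a trained decoder.

```python
import math


def log_sum_exp(values):
    # Numerically stable log(sum(exp(v) for v in values)).
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def log_similarity_scores(context_enc, exemplar_context_encs):
    """log s_k, assuming s is a softmax over dot products of encodings."""
    logits = [sum(a * b for a, b in zip(context_enc, c_k))
              for c_k in exemplar_context_encs]
    norm = log_sum_exp(logits)
    return [logit - norm for logit in logits]


def exemplar_log_likelihood(log_scores, decoder_log_probs):
    """log sum_k s_k * p_dec(r | e_k), evaluated in log space."""
    return log_sum_exp([ls + lp for ls, lp in zip(log_scores, decoder_log_probs)])


# Toy values: the first exemplar context is closer to the input context,
# and its response also explains the ground truth better.
c_e = [1.0, 0.0]
exemplar_contexts = [[1.0, 0.0], [0.0, 1.0]]
decoder_log_probs = [math.log(0.5), math.log(0.1)]

log_s = log_similarity_scores(c_e, exemplar_contexts)
ll = exemplar_log_likelihood(log_s, decoder_log_probs)
```

Because the scores enter the objective multiplicatively, the gradient pushes up the score of any exemplar whose vector makes the ground truth response likely, which is the behaviour described above.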

The Encoders and Decoder
In this section, we discuss the various encoders and the decoder used by our model. The conversation context consists of an ordered sequence of utterances, and each utterance can be further viewed as a sequence of words. Thus, the context can be viewed as having multiple levels of hierarchy: the word level and then the utterance (sentence) level. We use a hierarchical recurrent encoder, popularly employed as part of the HRED framework for generating responses and query suggestions (Sordoni et al., 2015a; Serban et al., 2017b). The word-level encoder encodes the vector representations of the words of an utterance into an utterance vector. The utterance-level encoder then encodes the utterance vectors into a context vector. Let $(u_1, \ldots, u_N)$ be the utterances present in the context. Furthermore, let $(w_{n1}, \ldots, w_{nM_n})$ be the words present in the $n$-th utterance for $1 \le n \le N$. For each word in the utterance, we retrieve its corresponding embedding from an embedding matrix. The word embedding for $w_{nm}$ is denoted $w_{enm}$. The encoding of the $n$-th utterance can be computed iteratively as follows:

$$h_{nm} = f_1(h_{n,m-1}, w_{enm}), \quad 1 \le m \le M_n.$$

We use an LSTM (Hochreiter and Schmidhuber, 1997) to model the above equation. The last hidden state $h_{nM_n}$ is referred to as the utterance encoding and is denoted $h_n$.
The utterance-level encoder takes the utterance encodings $h_1, \ldots, h_N$ as input and generates the encoding for the context as follows:

$$c_{en} = f_2(c_{e,n-1}, h_n), \quad 1 \le n \le N.$$

Again, we use an LSTM to model the above equation. The last hidden state $c_{eN}$ is referred to as the context embedding and is denoted $c_e$.
A single-level LSTM is used for embedding the response. In particular, let $(w_1, \ldots, w_M)$ be the sequence of words present in the response. For each word $w_m$, we retrieve the corresponding word embedding $w_{em}$ from a word embedding matrix. The response embedding is computed from the word embeddings iteratively as follows:

$$r_{em} = f_3(r_{e,m-1}, w_{em}), \quad 1 \le m \le M.$$

Again, we use an LSTM to model the above equation. The last hidden state $r_{eM}$ is referred to as the response embedding and is denoted $r_e$.

[Table caption, truncated: "... (Lowe et al., 2015), where $|V|$ represents the size of vocabulary."]
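The two-level encoding described in this section can be sketched structurally as follows. To stay dependency-free, the sketch swaps the LSTM for a plain tanh recurrence with random weights, so it shows only how word-level states feed the utterance-level recurrence and not the trained model. The names `make_cell`, `run`, and `encode_context`, the toy vocabulary, and the dimensions are all assumptions made for illustration.

```python
import math
import random


def make_cell(input_dim, hidden_dim, rng):
    """Toy recurrent cell h_t = tanh(W x_t + U h_{t-1}), standing in for an LSTM."""
    W = [[rng.gauss(0, 0.1) for _ in range(input_dim)] for _ in range(hidden_dim)]
    U = [[rng.gauss(0, 0.1) for _ in range(hidden_dim)] for _ in range(hidden_dim)]

    def step(h_prev, x):
        return [
            math.tanh(sum(w * xi for w, xi in zip(W[j], x))
                      + sum(u * hi for u, hi in zip(U[j], h_prev)))
            for j in range(hidden_dim)
        ]

    return step


def run(cell, inputs, hidden_dim):
    """Apply the cell over a sequence and return the last hidden state."""
    h = [0.0] * hidden_dim
    for x in inputs:
        h = cell(h, x)
    return h


def encode_context(utterances, embed, word_cell, utt_cell, hidden_dim):
    # Word level: one vector per utterance (last hidden state over its words).
    utterance_vecs = [
        run(word_cell, [embed[w] for w in utterance], hidden_dim)
        for utterance in utterances
    ]
    # Utterance level: one vector for the whole context.
    return run(utt_cell, utterance_vecs, hidden_dim)


rng = random.Random(0)
vocab = ["how", "do", "i", "install", "gdm", "use", "apt"]
embed = {w: [rng.gauss(0, 0.1) for _ in range(4)] for w in vocab}
word_cell = make_cell(4, 3, rng)
utt_cell = make_cell(3, 3, rng)

context = [["how", "do", "i", "install", "gdm"], ["use", "apt"]]
c_e = encode_context(context, embed, word_cell, utt_cell, 3)
```

The response encoder corresponds to a single call to `run` with its own cell over the response's word embeddings.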

Tech Support Dataset
We also conduct experiments on a large technical support dataset with more than 127K conversations. We refer to this dataset as the Tech Support dataset in the rest of the paper. The Tech Support dataset contains conversations in which an employee seeks assistance from an agent (technical support) to resolve problems such as password reset, software installation/licensing, and wireless access. In contrast to the Ubuntu dataset, this dataset has two clearly distinct users: employee and agent. In our experiments we model the agent responses only. For each conversation in the Tech Support data, we sample context-response pairs to create a dataset similar in format to the Ubuntu dataset. Note that multiple context-response pairs can be generated from a single conversation. For each conversation, we sample 25% of the possible context-response pairs. We create validation pairs by selecting 5000 conversations at random and sampling context-response pairs from them. Similarly, we create test pairs from a different subset of 5000 conversations. The remaining conversations are used to create training context-response pairs.
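The pair-sampling procedure above can be sketched as follows. The sketch assumes a conversation is a list of `(speaker, utterance)` tuples, that "possible pairs" means every agent turn with at least one preceding utterance, and that at least one pair is kept per conversation; these details, along with the name `sample_pairs`, are assumptions rather than the exact preprocessing used.

```python
import random


def sample_pairs(conversation, fraction=0.25, agent="agent", rng=None):
    """Sample (context, response) pairs in which the response is an agent turn.

    conversation: list of (speaker, utterance) tuples in order.
    Returns a list of (list_of_previous_utterances, response_utterance).
    """
    rng = rng or random.Random(0)
    candidates = [
        ([utterance for _, utterance in conversation[:i]], response)
        for i, (speaker, response) in enumerate(conversation)
        if i > 0 and speaker == agent
    ]
    if not candidates:
        return []
    # Keep roughly the requested fraction, but never fewer than one pair.
    n = max(1, round(fraction * len(candidates)))
    return rng.sample(candidates, n)


conversation = [
    ("employee", "i forgot my password"),
    ("agent", "which account is affected"),
    ("employee", "my email account"),
    ("agent", "i have sent a reset link"),
    ("employee", "got it thanks"),
    ("agent", "anything else i can help with"),
    ("employee", "no that is all"),
    ("agent", "have a nice day"),
]
sampled = sample_pairs(conversation, rng=random.Random(1))
```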

Model and Training Details
The EED and HRED models were implemented using the PyTorch framework (Paszke et al., 2017). We initialize the word embedding matrix as well as the weights of the context and response encoders from a normal distribution with mean 0 and variance 0.01. The biases of the encoders and decoder are initialized to 0. The word embedding matrix is shared by the context and response encoders. For the Ubuntu dataset, we use a word embedding size of 600, and the size of the hidden layers of the LSTMs in the context and response encoders and the decoder is fixed at 1200. For the Tech Support dataset, we use a word embedding size of 128, and the size of the hidden layers of the LSTMs in the context and response encoders and the decoder is fixed at 256. A smaller embedding size was chosen for the Tech Support dataset since we observed much less diversity in its responses than in those of the Ubuntu dataset.
Two different encoders are used for encoding the input context (not shown in Figure 1 for simplicity). The output of the first context encoder is concatenated with the exemplar response vectors to generate exemplar vectors, as detailed in Section 3.3. The output of the second context encoder is used to compute the scoring function, as detailed in Section 3.4. For each input context, we retrieve 5 similar context-response pairs for the Ubuntu dataset and 3 context-response pairs for the Tech Support dataset using the retrieval mechanism discussed in Section 3.2.
We use the Adam optimizer (Kingma and Ba, 2014) with a learning rate of 1e-4 for training the model. A batch size of 20 samples was used during training. In order to prevent overfitting, we use early stopping with the log-likelihood on the validation set as the stopping criterion. To generate samples using the proposed EED model, we identify the exemplar context that is most similar to the input context based on the learnt scoring function discussed in Section 3.4. The corresponding exemplar vector is fed to the decoder to generate the response. The samples are generated using beam search with width 5. The average per-word log-likelihood is used to score the beams.
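The two selection rules used at generation time (picking the highest-scoring exemplar, and ranking beams by average per-word log-likelihood) can be sketched as below. The beam expansion itself is omitted; beams are given as already completed hypotheses with per-token log-probabilities, and the function names are hypothetical.

```python
def select_exemplar(similarity_scores):
    """Index of the exemplar whose context scores highest against the input."""
    return max(range(len(similarity_scores)), key=similarity_scores.__getitem__)


def average_log_likelihood(token_log_probs):
    # Length-normalized score, so longer hypotheses are not penalized
    # merely for containing more tokens.
    return sum(token_log_probs) / len(token_log_probs)


def best_beam(beams):
    """beams: list of (tokens, per_token_log_probs); return the top hypothesis."""
    return max(beams, key=lambda beam: average_log_likelihood(beam[1]))


beams = [
    (["ok"], [-1.0]),
    (["try", "sudo", "apt"], [-0.5, -0.6, -0.7]),
]
chosen_tokens, _ = best_beam(beams)
chosen_exemplar = select_exemplar([0.1, 0.7, 0.2])
```

In this toy case the three-word hypothesis wins on the average score even though its total log-likelihood is lower than that of the one-word hypothesis, which illustrates why normalizing by length counteracts the bias towards short responses.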

Activity and Entity Metrics
A traditional and popular metric for comparing a generated sentence with a ground truth sentence is BLEU (Papineni et al., 2002), which is frequently used to evaluate machine translation. The metric has also been applied to score predicted responses in conversations, but it has been found to be less indicative of actual performance (Liu et al., 2016; Sordoni et al., 2015a; Serban et al., 2017a), as it is extremely sensitive to the exact words in the ground truth response and gives equal importance to stop words/phrases and informative words.
Serban et al. (2017a) recently proposed a new set of metrics for evaluating dialogue responses for the Ubuntu corpus. It is important to highlight that these metrics have been specifically designed for the Ubuntu corpus; they evaluate a generated response against the ground truth response by comparing coarse-level representations of the utterances (such as entities, activities, and Ubuntu OS commands). Here is a brief description of each metric:
• Activity: This metric compares the activities present in a predicted response with those in the ground truth response. An activity can be thought of as a verb: all the verbs in a response are mapped to a manually identified list of 192 verbs.
• Entity: This metric compares the technical entities in a predicted response with those in the ground truth response. A total of 3115 technical entities were identified using public resources such as the Debian package manager APT.
• Tense: This metric compares the tense of the predicted response with that of the ground truth response.
• Cmd: This metric computes accuracy by comparing the commands identified in the ground truth utterance with those in the predicted response.
Table 4 compares our model with other recent generative models (Serban et al., 2017a): LSTM (Shang et al., 2015), HRED, and VHRED (Serban et al., 2017b). We do not compare our model with the Multi-Resolution RNN (MRNN) (Serban et al., 2017a), as MRNN explicitly utilizes the activities and entities during the generation process. In contrast, the proposed EED model and the other models used for comparison are agnostic to the activity and entity information. We use the standard script to compute the metrics.
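The overlap-based comparison behind the activity and entity metrics can be illustrated with a set-based precision, recall, and F1 computation. This is only a sketch: the five-verb lexicon stands in for the manually curated list of 192 verbs, matching is by exact token rather than by the verb mapping of the published script, and how scores are aggregated over a test set is not shown. All names are hypothetical.

```python
# Hypothetical miniature activity lexicon (the real list has 192 verbs).
ACTIVITIES = {"install", "remove", "run", "mount", "reboot"}


def extract(tokens, lexicon):
    """Set of lexicon items mentioned in a tokenized response."""
    return {token for token in tokens if token in lexicon}


def precision_recall_f1(predicted, gold):
    overlap = len(predicted & gold)
    precision = overlap / len(predicted) if predicted else 0.0
    recall = overlap / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


predicted_response = "you should install it and reboot".split()
gold_response = "install it then run it".split()

p, r, f1 = precision_recall_f1(
    extract(predicted_response, ACTIVITIES),
    extract(gold_response, ACTIVITIES),
)
```

The entity metric follows the same pattern with an entity lexicon in place of the activity list.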
The EED model scores better than the generative models on almost all of the metrics, indicating that we generate more informative responses than other state-of-the-art generative approaches for the Ubuntu corpus. The results show that responses associated with similar contexts may contain the activities and entities present in the ground truth response, and thus help in response generation. This is discussed further in Section 5.2. Additionally, we compared our proposed EED model with a retrieval-only baseline. On the Ubuntu corpus, the retrieval baseline achieves an activity F1 score of 4.23 and an entity F1 score of 2.72, compared to 4.87 and 2.99, respectively, achieved by our method.
The Tech Support dataset is not evaluated using the above metrics, since activity and entity information is not available for this dataset.

Embedding Metrics
Embedding metrics (Lowe et al., 2017) were proposed as an alternative to word-by-word comparison metrics such as BLEU. Since these metrics are sensitive to the word embeddings used, we use pre-trained Google News word embeddings, as in Serban et al. (2017b), for easy reproducibility. The three metrics of interest use the word vectors of the ground truth response and the predicted response, and are discussed below:
• Average: The word embedding vectors are averaged separately for the candidate response and the ground truth, and the cosine similarity between the two averaged embeddings is computed. A high similarity indicates that the ground truth and the predicted response contain similar words.
• Greedy: The greedy matching score pairs each word in one response with its most similar word in the other response using cosine similarity, and averages the resulting similarities.
• Extrema: The vector extrema score takes the most extreme (maximum or minimum) value in each dimension over the word vectors of a response, and compares the resulting vectors for the candidate response and the ground truth using cosine similarity.
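The three embedding metrics listed above can be sketched in plain Python over lists of word vectors. This is a simplified illustration, not the publicly available evaluation script: responses are passed as already looked-up word vectors, the greedy score is averaged over both matching directions, and the extrema vector keeps the value of largest magnitude in each dimension. Those choices follow the common formulation of these metrics and should be read as assumptions; the function names are hypothetical.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def average_vector(vectors):
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def extrema_vector(vectors):
    # Per dimension, keep the value with the largest absolute magnitude.
    return [max(column, key=abs) for column in zip(*vectors)]


def embedding_average(predicted, gold):
    return cosine(average_vector(predicted), average_vector(gold))


def vector_extrema(predicted, gold):
    return cosine(extrema_vector(predicted), extrema_vector(gold))


def greedy_one_way(source, target):
    # Match every source word to its most similar target word.
    return sum(max(cosine(s, t) for t in target) for s in source) / len(source)


def greedy_matching(predicted, gold):
    return 0.5 * (greedy_one_way(predicted, gold) + greedy_one_way(gold, predicted))


# Toy 2-d "word vectors" for a predicted and a ground truth response.
predicted = [[1.0, 0.0], [0.0, 1.0]]
gold = [[0.0, 1.0], [1.0, 0.0]]
scores = (
    embedding_average(predicted, gold),
    greedy_matching(predicted, gold),
    vector_extrema(predicted, gold),
)
```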
Of these, the embedding average metric is the most reflective of performance for our setup. The extrema representation, for instance, is very sensitive to text length and becomes ineffective beyond single sentences (Forgues et al., 2014). We use the publicly available script for all our computations. As the test outputs for HRED are not available for the Technical Support dataset, we use our own implementation of HRED. Table 5 compares our model with HRED and shows that our model scores better on all metrics for the Technical Support dataset and on the majority of the metrics for the Ubuntu dataset.

Table 7: Contexts, exemplar responses and responses generated by HRED, VHRED and the proposed EED model. We use the published responses for HRED and VHRED. GT indicates the ground truth response. The change of turn is indicated by →. The highlighted words in bold are common between the exemplar response and the response predicted by EED.
We note that the improvements achieved by the EED model on the activity and entity metrics are much larger than those on the embedding metrics. This suggests that the EED model is better able to capture the specific information (objects and actions) present in the conversations.
Finally, we compare the diversity of the responses generated by EED and HRED by counting the number of unique tokens, token-pairs and token-triplets present in the generated responses on the Ubuntu and Tech Support datasets. The results are shown in Table 6. As can be observed, the responses of EED contain a larger number of distinct tokens, token-pairs and token-triplets than those of HRED, and hence are arguably more diverse.
Table 7 presents the responses generated by HRED, VHRED and the proposed EED for a few selected contexts, along with the corresponding similar exemplar responses. As can be observed from the table, the responses generated by EED tend to be more specific to the input context than the responses of HRED and VHRED. For example, in conversations 1 and 2 we find that both HRED and VHRED generate simple generic responses, whereas EED generates responses with additional information such as the type of disk partition used or a command not working. This is also confirmed by the quantitative results obtained using the activity and entity metrics in the previous section. We further observe that the exemplar responses contain informative words that are utilized by the EED model for generating the responses, as highlighted in Table 7.
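The diversity counts described above (unique tokens, token-pairs and token-triplets) amount to counting distinct n-grams over all generated responses. A minimal sketch, assuming responses are already tokenized and using the hypothetical helper name `distinct_ngrams`:

```python
def distinct_ngrams(responses, n):
    """Number of distinct n-grams across a collection of tokenized responses."""
    seen = set()
    for tokens in responses:
        seen.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(seen)


generated = [
    ["i", "do", "not", "know"],
    ["i", "do", "not", "care"],
]
unique_tokens = distinct_ngrams(generated, 1)
unique_pairs = distinct_ngrams(generated, 2)
unique_triplets = distinct_ngrams(generated, 3)
```

Raw counts such as these depend on the number and length of the generated responses, so they are comparable only between models evaluated on the same test contexts.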

Conclusions
In this work, we propose a deep learning method, the Exemplar Encoder-Decoder (EED), that, given a conversation context, uses similar contexts and their corresponding responses from training data to generate a response. We show that by utilizing this information the system is able to outperform state-of-the-art generative models on the publicly available Ubuntu dataset. We further show improvements achieved by the proposed method on a large collection of technical support conversations.
While in this work we apply the exemplar encoder-decoder network to a conversational task, the method is generic and could be used for other tasks such as question answering and machine translation. In future work we plan to extend the proposed method to these other applications.