Retrieval-Enhanced Adversarial Training for Neural Response Generation

Dialogue systems are usually built on either generation-based or retrieval-based approaches, yet they do not benefit from the advantages of different models. In this paper, we propose a Retrieval-Enhanced Adversarial Training (REAT) method for neural response generation. Distinct from existing approaches, the REAT method leverages an encoder-decoder framework within an adversarial training paradigm, while taking advantage of N-best response candidates from a retrieval-based system to construct the discriminator. An empirical study on a large-scale, publicly available benchmark dataset shows that the REAT method significantly outperforms the vanilla Seq2Seq model as well as the conventional adversarial training approach.


Introduction
Dialogue systems intend to converse with humans with a coherent structure. They have been widely used in real-world applications, including customer service systems, personal assistants, and chatbots. Early dialogue systems are often built using the rule-based method (Weizenbaum, 1966) or learning-based method (Litman et al., 2000; Schatzmann et al., 2006; Williams and Young, 2007). These systems require extensive human effort for making rules, which is labor intensive and difficult to scale up. Recently, with the development of social networking, conversational data have accumulated to a considerable scale, which is helpful for data-driven methods to build systems. These methods can be divided into two categories: generation-based methods (Shang, Lu, and Li, 2015; Sordoni et al., 2015; Vinyals and Le, 2015; Wen et al., 2017) and retrieval-based methods (Leuski et al., 2009; Ji, Lu, and Li, 2014; Yan, Song, and Wu, 2016).
Generation-based methods produce a response by computing a posterior probability distribution over possible responses (Serban et al., 2015). The most common method of this category in recent years is the sequence to sequence (Seq2Seq) based model (Sutskever, Vinyals, and Le, 2014; Shang, Lu, and Li, 2015; Vinyals and Le, 2015). In practice, it usually suffers from the generic response problem (Li et al., 2016a; Serban et al., 2016). This issue appears to lie in the skewed distribution of words in the conversational dataset, where nouns and verbs are more sparse than pronouns and punctuation tokens (Serban et al., 2015). In this way, a generic response has a higher probability of being generated than a diverse response under the maximum-likelihood estimation (MLE) objective function (Li et al., 2016a). Retrieval-based methods reply to users by searching and ranking N-best response candidates from a response set. Collected in advance and written mainly by humans, the N-best candidates are diverse and informative, consistent with real-world responses. However, some candidates may not adapt well to the message because they are prepared in advance and cannot be customized (Shang, Lu, and Li, 2015). Therefore, it is indispensable to combine the two kinds of models and take advantage of both response generation methods.
In this paper, we propose a Retrieval-Enhanced Adversarial Training (REAT) approach to make use of both methods. The main idea is to utilize the diversity and informative content of N-best response candidates to enlighten the generation-based method. To achieve this goal, we cast the response generation as a reinforcement learning problem, and the N-best candidates are used as evidence to compute the reward. Rather than relying on manually defined reward functions, a discriminator is used to calculate the reward and is trained synchronously with the response generator using the adversarial training method (Goodfellow et al., 2014). To pass back the loss of the discriminator, the generator is optimized using the policy gradient algorithm (Li et al., 2017; Yu et al., 2017).
We conduct extensive experiments on a publicly available Twitter corpus to verify the effectiveness of the proposed method, comparing it with both retrieval-based and generation-based methods. The results show that the REAT approach significantly outperforms the baselines in both automatic and human evaluations.
The contributions of this paper are summarized as follows:
• We propose a new training method, REAT, which combines retrieval-based and generation-based methods by enhancing generation with response candidates.
• Extensive experiments show that the proposed approach outperforms state-of-the-art baselines in both the automatic and human evaluations.
• The code of the retrieval-enhanced adversarial training will be available for public use.

Figure 1: An overview of our proposed approach. The discriminator is enhanced by the N-best response candidates (evidence list) returned by a retrieval-based method. It reads a response, which can be either a machine-generated hypothesis or a ground-truth, and outputs the probability that the response is human-generated. The output is then regarded as a reward to guide the generator. [Diagram labels: Training Data, Index, Retrieval-based Method, Evidence, LSTM, Generator, Discriminator. Example shown: message "I made strawberry shortcake."; hypothesis "It looks so delicious."; ground-truth "Could you tell me how this thing is cooked?"]

Method
In this section, we define our notations and introduce the proposed Retrieval-Enhanced Adversarial Training approach. As Figure 1 shows, the proposed approach consists of a retrieval-based method and a generation-based method.
The retrieval-based method extends the training set with retrieved response candidates (denoted as the evidence list). The generation-based method includes two main components: a discriminator $D$ and a generator $G$, trained under the adversarial learning framework.

Notations
We denote an input and a target in a training sample as the message $m$ and the ground-truth $t$, respectively. The evidence list is denoted as $\{e\}$, whose i-th element $e_i$ is the i-th response candidate. The generated response of the model is denoted as the hypothesis $p$. We use the variable $y$ to represent a response, which can be either a human-generated ground-truth $t$ or a machine-generated hypothesis $p$.

Retrieval-Enhanced Adversarial Training
We cast the retrieval-enhanced response generation as a reinforcement learning problem. We define the policy $\pi$ as the parameters of the generator $\theta$, and the action $a$ as the next word to be generated, $y_k$. The state $s_{k-1}$ is defined as a tuple consisting of a message, an evidence list, and a partial response: $(m, \{e\}, y_{1:k-1})$, where $y_{1:k-1} = \{y_1, y_2, \ldots, y_{k-1}\}$ can be either a prefix of a ground-truth or a partially generated hypothesis. In this way, the retrieval-enhanced response generation becomes a reinforcement learning problem of selecting an action $y_k$ under the state $(m, \{e\}, y_{1:k-1})$ following the policy $\theta$.
Under the reinforcement learning framework, the goal of the generator is to maximize the expected reward of generated responses (Equation 1), where $V$ is the vocabulary table and $G(y_k \mid y_{1:k-1})$ represents the probability of predicting $y_k$ given $y_{1:k-1}$. The action-value function represents the reward of taking the action $y_k$ under the state $s_{k-1}$ and predicting subsequent actions following the policy $\theta$. Here, we define the reward as the probability that $y_{1:k}$ is human-generated and introduce a discriminator to estimate it. A common implementation of the estimation, for example, is the Monte Carlo search. It repeatedly rolls out $y_{1:k}$ into $K$ complete sequences. The discriminator evaluates the $K$ sequences in turn and returns their average score as the reward. In our proposed approach, we employ a more time-efficient method introduced by Li et al. (2017). In this method, the discriminator is trained to directly evaluate a partial response $y_{1:k}$ rather than a complete response $y$. In this way, the reward represented by the action-value function is the discriminator output $D_\phi(m, \{e\}, y_{1:k})$ (Equation 2), where $\phi$ are the parameters of the discriminator (the detailed structure of the discriminator will be introduced in the next section). The reward is then passed back to the generator using the policy gradient method. With the likelihood ratio trick (Williams, 1992), the gradient of $\theta$ can be derived (Equation 3).
Both the generator and the discriminator are pre-trained before adversarial training. The generator is pre-trained on (message, ground-truth) pairs using MLE loss and subsequently used to generate a hypothesis for each message in the training set. The discriminator is then pre-trained using a ground-truth as a positive sample and a hypothesis as a negative sample. Its objective function is defined in Equation 4, where $t_{1:k}$ is a prefix of the ground-truth and $p_{1:k}$ is a partially generated hypothesis.
Algorithm 1 Retrieval-Enhanced Adversarial Training
Require: The training set {m, y};
Ensure: The generator parameters θ; the discriminator parameters φ;
1: Get the evidence list by retrieving N-best response candidates with the retrieval-based method;
2: Randomly initialize θ and φ;
3: Pre-train G with MLE loss;
4: Generate hypotheses using the pre-trained G;
5: Pre-train D using hypotheses as negative samples and ground-truths as positive samples;
6: for epoch in number of epochs do
7: for g in g-steps do
...
14: end for
15: end for
16: return θ, φ;
Given the pre-trained generator and discriminator, the adversarial training is a min-max game played between them (Equation 5). The discriminator tries to distinguish a human-generated ground-truth from a machine-generated hypothesis, while the generator tries to fool the discriminator by producing human-like hypotheses. The overall algorithm of the retrieval-enhanced adversarial training is summarized as Algorithm 1.
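The two training signals described in this section can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the paper's exact Equations 3 and 4: it uses the standard likelihood-ratio (REINFORCE) surrogate loss with one discriminator score per generated prefix, and a plain binary cross-entropy for the discriminator. The function names and the toy numbers are hypothetical.

```python
import math

def generator_loss(step_log_probs, step_rewards):
    """Likelihood-ratio surrogate loss for one sampled response.

    step_log_probs[k] is log G(y_k | y_1:k-1) for the sampled word;
    step_rewards[k] is the discriminator score for the prefix y_1:k.
    Rewards are treated as constants, so minimizing this value raises
    the probability of words whose prefixes look human-generated.
    """
    assert len(step_log_probs) == len(step_rewards)
    return -sum(r * lp for lp, r in zip(step_log_probs, step_rewards))

def discriminator_loss(d_truth_prefix, d_hypothesis_prefix, eps=1e-12):
    """Binary cross-entropy for one positive and one negative prefix.

    d_truth_prefix is D's score for a ground-truth prefix t_1:k,
    d_hypothesis_prefix is D's score for a generated prefix p_1:k.
    """
    return -(math.log(d_truth_prefix + eps)
             + math.log(1.0 - d_hypothesis_prefix + eps))

# Toy values (assumed): a two-word hypothesis and its per-step rewards.
log_probs = [math.log(0.5), math.log(0.25)]
rewards = [0.8, 0.6]
g_loss = generator_loss(log_probs, rewards)
d_loss = discriminator_loss(0.9, 0.2)
```

In an actual system both losses would be computed on tensors inside an automatic differentiation framework, with the generator and discriminator updated in alternating g-steps and d-steps as in Algorithm 1.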

Discriminator
The discriminator is a binary classifier. It takes a tuple $(m, \{e\}, y_{1:k})$ as the input and outputs the probability that $y_{1:k}$ is human-generated given $m$ and $\{e\}$. Intuitively, the message is conditioned on to take the relevance between a message and a response into consideration, and the evidence list provides references to the discriminator.
Concretely, a message is first encoded into a distributed representation $\mathrm{rep}_m$ by the discriminator using a recurrent neural network (RNN) (Mikolov et al., 2010) (Equations 6 and 7), where $h^{msg}_i$ is the i-th hidden state, $L_m$ is the length of the message, and $f$ is a recurrent unit function. Here, we use the LSTM (Hochreiter and Schmidhuber, 1997) for all the RNNs of the proposed approach.
We denote the process of Equations 6 and 7 as $\mathrm{rep}_m = \mathrm{LSTM}_m(m)$, where $\mathrm{LSTM}_m$ represents the message LSTM. Similarly, the distributed representations of a partial response and a piece of evidence can also be computed as follows:
$$\mathrm{rep}_y = \mathrm{LSTM}_y(y_{1:k}), \quad (8)$$
$$\mathrm{rep}_{e_j} = \mathrm{LSTM}_e(e_j), \quad (9)$$
where $\mathrm{LSTM}_y$ and $\mathrm{LSTM}_e$ are a response LSTM and an evidence LSTM, respectively. $\mathrm{rep}_y$ is the representation of $y_{1:k}$. Since there is more than one piece of evidence for a message, each piece of evidence is encoded in turn, and $\mathrm{rep}_{e_j}$ is the representation of the j-th piece of evidence. The probability that $y_{1:k}$ is human-generated can then be computed by applying a Highway network (Srivastava, Greff, and Schmidhuber, 2015) and a Multilayer Perceptron (MLP) to the concatenation of these representations (Equations 10 and 11).
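The scoring step (concatenate the message, response, and evidence representations, apply a Highway layer and an MLP, then a sigmoid) can be sketched in plain Python. This is an illustrative sketch only: the LSTM encoders are replaced by random vectors, the Highway transform uses a ReLU, and the MLP is reduced to a single linear layer. All of these choices, the parameter layout, and the function names are assumptions rather than details taken from the paper.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def linear(weights, bias, x):
    """Dense layer; weights holds one row per output unit."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def highway(x, w_h, b_h, w_t, b_t):
    """One Highway layer: gate * transform(x) + (1 - gate) * x."""
    transform = [max(0.0, v) for v in linear(w_h, b_h, x)]  # ReLU (assumed)
    gate = [sigmoid(v) for v in linear(w_t, b_t, x)]
    return [g * h + (1.0 - g) * xi for g, h, xi in zip(gate, transform, x)]

def discriminator_score(rep_m, rep_y, rep_evidence, params):
    """Probability that the partial response is human-generated."""
    x = list(rep_m) + list(rep_y)
    for rep_e in rep_evidence:          # concatenate every evidence vector
        x += list(rep_e)
    rep = highway(x, *params["highway"])
    logit = linear(params["mlp_w"], params["mlp_b"], rep)[0]
    return sigmoid(logit)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def random_vector(size, rng):
    return [rng.uniform(-1.0, 1.0) for _ in range(size)]

rng = random.Random(0)
hidden, n_evidence = 4, 5               # N = 5 evidence items, toy hidden size
dim = hidden * (2 + n_evidence)
params = {
    "highway": (random_matrix(dim, dim, rng), [0.0] * dim,
                random_matrix(dim, dim, rng), [0.0] * dim),
    "mlp_w": random_matrix(1, dim, rng),
    "mlp_b": [0.0],
}
score = discriminator_score(
    random_vector(hidden, rng),
    random_vector(hidden, rng),
    [random_vector(hidden, rng) for _ in range(n_evidence)],
    params,
)
```

Because the input is a fixed-order concatenation, the evidence list must always contain the same number of items for a given set of parameters.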

Generator
The generator $G$ is a Seq2Seq model, which consists of an encoder and a decoder. The encoder reads a message and summarizes it into a context vector $c$. The decoder is a language model that produces a response word by word, conditioned on the context vector.
Concretely, the encoder is modeled using a bidirectional LSTM. It consists of a forward LSTM and a backward LSTM, reading the message from two directions. The i-th forward hidden state and the i-th backward hidden state are then concatenated. After that, we use the Attention mechanism (Bahdanau, Cho, and Bengio, 2014) to summarize the encoder hidden states into a dynamic context vector $c_j$ for the j-th decoding time step.
The decoder is also modeled using an LSTM. The j-th decoding time step produces the j-th hidden state of the decoder, $h^{dec}_j$, and the j-th word of the response, $y_j$, where $o$ is an output function that projects a decoder hidden state into a probability distribution over the vocabulary table.

Message: Happy birthday let's get the band back together
Ground-truth: Thank you! Sarah and the oysters reunite!!!
Evidence#1: Bring back jam and Daxter though
Evidence#2: Tune up that music, boys! We are ready to sing for
Table 1: An example of a message, a ground-truth, and the top two pieces of evidence in the evidence list.

Retrieval System & Evidence Generation
To get the evidence list for each training sample, a retrieval system is built using the Lucene library. First, each sample in the training set is defined as a document consisting of a message field and a ground-truth field and subsequently added to the index. Second, we use each message as a query to search for documents whose messages are similar to the query. Finally, the ground-truths in the top N retrieved documents are extracted as the evidence list.
It should be noted that when we retrieve a message of a training sample, the top document returned is always the training sample itself. We thus remove it from the retrieved result to make sure that the N pieces of evidence are different from the ground-truth. Table 1 shows an example of a message, a ground-truth, and the top two pieces of evidence.
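The evidence construction described above can be sketched as follows. This is a minimal stand-in rather than the Lucene pipeline: Jaccard token overlap replaces Lucene's ranking function, the query sample is excluded by index instead of by dropping the top hit, and the function names and toy samples are hypothetical.

```python
def tokenize(text):
    return text.lower().split()

def similarity(query_tokens, doc_tokens):
    """Jaccard overlap between token sets (assumed stand-in for Lucene scoring)."""
    a, b = set(query_tokens), set(doc_tokens)
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def build_evidence_lists(samples, n_best):
    """samples is a list of (message, ground_truth) pairs.

    Returns, for each sample, the ground-truths of the n_best other
    samples whose messages are most similar to its message.
    """
    tokenized = [tokenize(message) for message, _ in samples]
    evidence = []
    for i, query in enumerate(tokenized):
        scored = [(similarity(query, doc), j)
                  for j, doc in enumerate(tokenized)
                  if j != i]                      # never retrieve the sample itself
        scored.sort(key=lambda pair: (-pair[0], pair[1]))
        evidence.append([samples[j][1] for _, j in scored[:n_best]])
    return evidence

# Toy training set (invented for illustration).
samples = [
    ("i made strawberry shortcake", "it looks so delicious"),
    ("i made chocolate cake", "save me a slice"),
    ("happy birthday to you", "thank you"),
    ("i made strawberry jam", "could you share the recipe"),
]
evidence = build_evidence_lists(samples, n_best=2)
```

Excluding the query's own document is what keeps the ground-truth out of its own evidence list, which would otherwise let the discriminator trivially recognize positive samples.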

Experiments

Data
We use a preprocessed Twitter corpus in our experiments. Collected from Twitter, each sample in the corpus consists of a tweet and a responded tweet, corresponding to the message and the ground-truth in the response generation task. The corpus has already been lowercased. We further tokenize each sentence and filter some special characters, such as emoji and non-English characters. The corpus is then split into training, validation, and test sets. Table 2 shows some statistics of these sets.

Baselines
We compare the proposed approach with retrieval-based methods, generation-based methods, as well as their combinations.
• S2S: The Seq2Seq model with the Attention mechanism (Bahdanau, Cho, and Bengio, 2014).
• REGS: The Reward for Every Generation Step (REGS) is also an adversarial training method, where the generator is trained using the reinforcement learning algorithm and the discriminator is used to provide a reward for each generation step (Li et al., 2017). Of its settings, the second directly trains the discriminator on partially generated responses. Similar to the proposed approach, we choose the second setting here since it is more time-efficient.
• Retrieval: The retrieval system built on the training set. Concretely, for each test message, we search the index for documents whose messages are similar to the test message. We then return the ground-truth in the top document as the response.
• Re-rank: A hybrid method that combines retrieval-based methods and generation-based methods. First, the response candidates returned by the retrieval-based method are re-ranked according to their generation probability under the Seq2Seq model (Sordoni et al., 2015). Then the candidate with the highest probability is returned as the response.
• Multi-seq2seq: The "multi sequence to sequence" (Multi-seq2seq) model encodes N response candidates using N encoders and subsequently incorporates the results into the decoding process by the Attention mechanism (Song et al., 2018).
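The Re-rank baseline in the list above reduces to an argmax over retrieved candidates under a generation score. A minimal sketch follows; the scoring function is a hypothetical stand-in (word overlap with length normalization), not a trained Seq2Seq model, and all names are assumptions.

```python
import math

def rerank(message, candidates, log_prob):
    """Return the retrieved candidate with the highest generation score."""
    return max(candidates, key=lambda candidate: log_prob(message, candidate))

def toy_log_prob(message, response):
    """Stand-in scorer (assumed): words shared with the message are likelier.

    A real system would return the Seq2Seq log-probability of the
    response given the message.
    """
    message_words = set(message.split())
    words = response.split()
    total = sum(math.log(0.5 if w in message_words else 0.1) for w in words)
    return total / max(len(words), 1)   # length normalization (assumed)

best = rerank(
    "i love this band",
    ["see you tomorrow", "this band is great"],
    toy_log_prob,
)
```

Since the output is always one of the retrieved candidates, the quality of this baseline is bounded by the quality of the retrieval step.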

Experiment Settings
The proposed approach and all generation-based baselines are implemented based on an open source framework: OpenNMT (Klein et al., 2017). The word embeddings are pre-trained on the training set using the Word2Vec toolkit. The dimensionality of the embeddings is set to 500. The vocabulary table consists of the most frequent 50,000 words. Any word not included in the table is mapped to a special symbol "UNK", which represents the unknown word.
The number of hidden units for all LSTMs in the generator and the discriminator is 500. Stochastic gradient descent is used to update all the parameters. The learning rate is initialized to 1.0 and decays with a rate of 0.5 if the perplexity no longer decreases on the validation set. The batch size is set to 64 and the maximum sentence length is set to 100. In addition, we use early stopping with respect to perplexity on the validation set. In the inference process, we generate responses using beam search with the beam size set to 5.
In the pre-training process of the discriminator, the training set is constructed by sampling one prefix of a ground-truth and a hypothesis rather than enumerating all prefixes, because earlier actions are shared among multiple prefixes in the enumeration method, which will lead to overfitting (

Evaluation Metrics
The evaluation of response generation is still an open question. We thus employ different automatic evaluation metrics to validate the effectiveness of the proposed approach, including the distinct metric (Li et al., 2016a), perplexity, and BLEU (Papineni et al., 2002). In addition, human evaluation is also employed. Significance tests are performed using a two-tailed Student's t-test between two models.

Automatic Evaluation
We employ the distinct metric (Li et al., 2016a) to evaluate the diversity of the responses. It contains two detailed metrics, distinct-1 and distinct-2, where distinct-k computes the number of distinct k-grams and subsequently normalizes the number by the total number of words of responses. We conduct BLEU evaluation in two settings: single reference and multi-reference.
The corpus we use is a one-to-one corpus, where each message corresponds to a single ground-truth. In this way, we use the ground-truth as the reference and denote this setting as the single reference. Considering the diversity of different responses, a single reference is not always sufficient for comprehensive coverage. We thus expand the single reference with the N-best response candidates as extra references (Sordoni et al., 2015) and denote this setting as the multi-reference. In addition, we also report the perplexity result, which evaluates responses from the perspective of a language model.
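The distinct-k metric described above can be computed directly from its definition: count the distinct k-grams over all responses and divide by the total number of words. The sketch below assumes whitespace tokenization; the function name and the toy responses are hypothetical.

```python
def distinct_k(responses, k):
    """Number of distinct k-grams divided by the total number of words."""
    ngrams = set()
    total_words = 0
    for response in responses:
        tokens = response.split()
        total_words += len(tokens)
        for i in range(len(tokens) - k + 1):
            ngrams.add(tuple(tokens[i:i + k]))
    return len(ngrams) / total_words if total_words else 0.0

# Toy responses (invented): 9 words, 6 distinct unigrams, 5 distinct bigrams.
responses = ["i do not know", "i do not think so"]
distinct_1 = distinct_k(responses, 1)
distinct_2 = distinct_k(responses, 2)
```

A system that repeats the same generic reply for every message drives both values toward zero as the number of responses grows, which is why the metric is used as a diversity signal.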

Human Evaluation
The criteria for human evaluation follow Li et al. (2017). 200 messages are randomly sampled from the test set and then sent to Re-rank, REGS, Multi-seq2seq, and the proposed approach. Three annotators are recruited to choose the better response between the proposed approach and each baseline; all annotators are well-educated students holding a Bachelor's or higher degree. Ties are also allowed if the two responses are of the same quality.

Analysis
The results of the distinct metric are shown in Table 3. Retrieval and Re-rank are much better than the others. This is because their responses are retrieved, human-written text. (Multi-seq2seq is also a hybrid method, but it uses the retrieval-based method only to enhance the generation process, so its responses are still generation-based.) In this way, all the vocabulary in the training set contributes to the diversity, including some rare words, e.g., the "->" in the second example of Table 6. Generation-based methods, in contrast, only produce high-frequency words from a limited vocabulary table, which limits their diversity.
For the distinct metric of generation-based methods, the proposed approach is significantly better than S2S, REGS, and Multi-seq2seq (p < 0.01). The difference between REGS and S2S is also statistically significant at the same level. This indicates that the adversarial training algorithm is effective in promoting diversity, and that incorporating response candidates improves diversity further.
As for perplexity in Table 3, after adversarial training, REGS slightly reduces perplexity on the validation set. By introducing response candidates, the proposed approach can further achieve small gains. Figure 2 shows the perplexity on the validation set w.r.t. the training epochs. We can see that both REGS and the proposed approach can further decrease the perplexity based on S2S, demonstrating the effectiveness of the adversarial training. In addition, the proposed approach outperforms REGS, indicating that response candidates are helpful in forming the discriminative signal to guide the generator.
The evaluation results of BLEU are shown in Table 4. For the single reference setting, the proposed approach outperforms all the baselines from BLEU-1 to BLEU-3 and achieves a very competitive BLEU-4 score. Multi-seq2seq does not provide a significant improvement in BLEU-3 and BLEU-4 compared with S2S. We believe this is due to the limited references in this setting. A single reference is not sufficient for comprehensive coverage of n-grams, especially for long n-grams. This can be demonstrated by the results of the multi-reference setting, where the performance of the proposed approach and Multi-seq2seq exceeds S2S in both BLEU-3 and BLEU-4, and the difference is significant (p < 0.01). This indicates that incorporating response candidates is helpful for generating long n-grams that occur in references. In addition, in the multi-reference setting the proposed approach significantly outperforms all baselines from BLEU-1 to BLEU-4 (p < 0.05). This indicates that the responses of the proposed approach are closer to the references. Note that Retrieval and Re-rank are not evaluated in the multi-reference setting because the responses of the two methods are also selected from the N-best response candidates, which are used as references in this setting.
Table 5 shows the human evaluation results. Two responses are judged as a tie mainly in the following two situations. First, the two responses have similar content, like "Happy Thanksgiving" vs. "Happy Thanksgiving to you", and "Yes, I like it" vs. "No, I do not like it". Second, they are both generic or irrelevant. Besides the ties, the proposed approach is clearly preferred in most cases. Agreements among different annotators are calculated by Fleiss' Kappa (Fleiss, 1971). The values of Ours vs. REGS and Ours vs. Multi-seq2seq are in a range of 0.2 to 0.4, which can be seen as "Fair agreement". Ours vs. Re-rank has a relatively higher Kappa value of 0.42, which is "Moderate agreement". This is because the pre-existing responses of Re-rank are sometimes irrelevant to the message, and annotators can easily reach an agreement on these cases.
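For reference, Fleiss' Kappa for the pairwise judgments can be computed from a table of per-item category counts. The sketch below follows the standard definition; the three categories (ours wins, baseline wins, tie), the function name, and the toy counts are assumptions for illustration and are not the paper's annotation data.

```python
def fleiss_kappa(ratings):
    """ratings[i][j] = number of annotators assigning item i to category j.

    Every item must be rated by the same number of annotators (at least 2).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_categories = len(ratings[0])
    # Mean observed agreement over items.
    p_bar = sum(
        (sum(count * count for count in row) - n_raters)
        / (n_raters * (n_raters - 1))
        for row in ratings
    ) / n_items
    # Expected agreement from the overall category proportions.
    proportions = [
        sum(row[j] for row in ratings) / (n_items * n_raters)
        for j in range(n_categories)
    ]
    p_expected = sum(p * p for p in proportions)
    return (p_bar - p_expected) / (1.0 - p_expected)

# Toy counts (invented): 4 messages, 3 annotators,
# columns = (ours wins, baseline wins, tie).
toy_ratings = [[3, 0, 0], [2, 1, 0], [0, 3, 0], [1, 1, 1]]
kappa = fleiss_kappa(toy_ratings)
```

On this toy table the value falls in the 0.2 to 0.4 band that the text labels "Fair agreement".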

Case Study
Some cases are shown in Table 6. As the second and third examples show, Retrieval tends to return long and specific results. This increases the risk of containing some irrelevant content. Re-rank can alleviate this problem by re-ranking response candidates, but the improvement is limited by the quality of these candidates. As the first example shows, Re-rank selects the same candidate as Retrieval, but it is still not very suitable for its message. In the second and third examples, Re-rank returns more suitable responses than Retrieval, but both of them contain very specific information, such as "NYC", "Denver" and "LA", which strongly depend on particular scenarios. For generation-based methods, S2S tends to generate generic responses, like "It is so good" in the first example. In comparison, both REGS and Multi-seq2seq can reply with more informative content. Compared with these baselines, responses generated by the proposed approach are more relevant and diverse.

Related Work
Retrieval-based Methods
The methods for building a data-driven dialogue system can be roughly divided into two categories: retrieval-based and generation-based. For retrieval-based methods, Leuski et al. (2009) match a response with a message using a statistical language model in cross-lingual information retrieval. Ji, Lu, and Li (2014) employ information retrieval techniques including matching and ranking methods to search responses. The ranking method can also be implemented using neural networks (Yan, Song, and Wu, 2016; Qiu et al., 2017). Similarly, Wu et al. propose an end-to-end method to implement the matching model.

Generation-based Methods
For generation-based methods, Ritter, Cherry, and Dolan (2011) cast the response generation as a machine translation problem. The response generation task can also be seen as a sequence to sequence learning problem (Sordoni et al., 2015; Vinyals and Le, 2015). Shang, Lu, and Li (2015) introduce the Attention-based Seq2Seq model (Bahdanau, Cho, and Bengio, 2014) into the response generation, and further propose a hybrid model based on it. Despite the success of these models in generating grammatically correct responses, the majority of the responses are generic (Serban et al., 2016; Li et al., 2016a). Approaches to address this issue can be divided into two categories. The first category directly optimizes the Seq2Seq model, such as the objective function and the training process. Li et al. (2016a) introduce the Maximum Mutual Information as the objective function. They also frame the response generation as a reinforcement learning problem, where the reward is calculated either by a manually defined reward function (Li et al., 2016b) or by an introduced discriminator (Li et al., 2017).

Combinations of Retrieval-based and Generation-based Methods
There is also work that combines the two categories, including integrating a generation-based method into a retrieval-based method as a ranker (Sordoni et al., 2015) and enhancing a generation-based method with retrieved response candidates. Our proposed approach falls into the second category. Contemporaneous to our work, Song et al. (2018) apply an encoder to every response candidate and integrate the results into the decoding process via the Attention mechanism. Similarly, Pandey et al. (2018) also incorporate response candidates using the attentive encoder-decoder framework, but the attention weight for each response candidate is computed by its context utterances. Different from the two models, the proposed approach employs the adversarial training framework to incorporate the response candidates. Rather than being sent to the encoder and trained with MLE loss, response candidates in the proposed approach are fed to the discriminator to form the discriminative signal that guides the generator. The proposed approach is also related to the work of Lin et al. (2017). They propose an unconditional GAN whose discriminator is augmented with references randomly sampled from the training set for the task of language generation. In contrast, the proposed approach focuses on response generation and leverages the message as prior knowledge. In addition, rather than sampling references from the training set, the evidence list in the proposed approach is retrieved according to the relevance between messages using a retrieval-based system.

Conclusion and Future Work
We propose a Retrieval-Enhanced Adversarial Training method for neural response generation in dialogue systems.
In contrast to existing approaches, our REAT method directly uses response candidates from retrieval-based systems to improve the discriminator in adversarial training. Therefore, it can benefit from the advantages of retrieval-based response candidates as well as neural responses from generation-based systems. Experiments show that the REAT method significantly improves the quality of the generated responses, which demonstrates the effectiveness of this approach.
In future research, we will further investigate how to better leverage larger training data to improve the REAT method.In addition, we will also explore how to integrate the retrieval-based response candidates into the generator in adversarial training so that the quality could be further improved.
Algorithm 1, steps 11 to 13:
11: [...] $y_{1:k}$ from $G$ as a negative sample;
12: Sample $y_{1:k}$ from the ground-truth data as a positive sample;
13: Update $\phi$ according to Equation 4;

Equations 10 and 11: the discriminator applies a Highway network and a Multilayer Perceptron (MLP) on the concatenation of these representations:
$$\mathrm{rep} = \mathrm{Highway}([\mathrm{rep}_m, \mathrm{rep}_y, \mathrm{rep}_{e_1}, \ldots, \mathrm{rep}_{e_N}]), \quad (10)$$
$$D(m, \{e\}, y_{1:k}) = \sigma(\mathrm{MLP}(\mathrm{rep})), \quad (11)$$
where $N$ is the length of the evidence list ($N$ is set to 5 in our experiments), $\sigma$ is the sigmoid function, and $[a, b]$ denotes the concatenation of $a$ and $b$.

Figure 2: Perplexity of the validation set w.r.t. the training epochs. The adversarial training begins at the sixth epoch.
The dashed line represents S2S. Trained with the MLE objective function, it can be seen as the pre-training process of the adversarial training. The adversarial training starts from the sixth epoch, where the perplexity of S2S stops decreasing.

Table 2: Some statistics of the corpus in our experiments. The last column is the average sentence length.

Table 4: Results of BLEU evaluation in two settings. The single reference setting uses a ground-truth as a reference. The multi-reference setting uses a ground-truth and five-best response candidates as references.

Table 5: Pairwise human evaluation results. The last column shows the Fleiss' Kappa among different annotators.
The proposed approach is clearly preferred in most cases. Agreements among different annotators are calculated by Fleiss' Kappa.

Table 6: Examples of responses generated by baselines and the proposed method. The first line of each example is a message, followed by five responses of different models.

Serban et al. (2017a) address the generic response problem by introducing external knowledge. Mou et al. (2016) divide the generation process into two steps, including predicting a keyword and predicting the rest of the response starting with the keyword. Similarly, Xing et al. (2017) condition the decoder with multiple topic words. Serban et al. (2017a) introduce a keyword encoder into the Seq2Seq model to encode keywords that are extracted from the message using external knowledge.