Deep Active Learning for Dialogue Generation

We propose an online, end-to-end, neural generative conversational model for open-domain dialogue. It is trained using a unique combination of offline two-phase supervised learning and online human-in-the-loop active learning. While most existing research relies on offline supervision or hand-crafted reward functions for online reinforcement, we devise a novel interactive learning mechanism based on Hamming-diverse beam search for response generation and single-character user feedback at each turn. Experiments show that our model inherently promotes the generation of semantically relevant and interesting responses, and can be used to train agents with customized personas, moods and conversational styles.


Introduction
Several recent works propose neural generative conversational agents (CAs) for open-domain and task-oriented dialogue (Shang et al., 2015; Sordoni et al., 2015; Vinyals and Le, 2015; Serban et al., 2016, 2017; Shen et al., 2017; Eric and Manning, 2017a,b). These models typically use LSTM encoder-decoder architectures (e.g. the sequence-to-sequence (Seq2Seq) framework (Sutskever et al., 2014)), which are linguistically robust but often generate short, dull and inconsistent responses (Serban et al., 2016; Li et al., 2016a). Researchers are now exploring Deep Reinforcement Learning (DRL) to address the hard problems of NLU and NLG in dialogue generation. In most existing works, the reward function is hand-crafted, and is either specific to the task to be completed or based on a few desirable developer-defined conversational properties.
In this work we demonstrate how online Deep Active Learning can be integrated with standard neural-network-based dialogue systems to enhance their open-domain conversational skills. The architectural backbone of our model is the Seq2Seq framework, which first undergoes offline supervised learning on two different types of conversational datasets. We then initiate an online active learning phase in which the model interacts with human users for incremental improvement, using a unique single-character user-feedback mechanism as a form of reinforcement at each turn of the dialogue. The intuition is to rely on this all-encompassing, human-centric 'reinforcement' signal instead of defining hand-crafted reward functions that individually try to capture each of the many subtle conversational properties. By drawing on humans' far superior conversational prowess, this mechanism inherently promotes interesting and relevant responses.

Related Work & Contributions
DRL-based dialogue generation is a relatively new research paradigm that is most relevant to our work. For task-specific dialogue (Zhao and Eskenazi, 2016; Cuayáhuitl et al., 2016; Williams and Zweig, 2016; Li et al., 2017b,c; Peng et al., 2017), the reward function is usually based on the task completion rate, and is thus easy to define. For the much harder problem of open-domain dialogue generation (Li et al., 2016e; Weston, 2016), hand-crafted reward functions are used to capture desirable conversational properties. Li et al. (2016d) propose a DRL-based diversity-promoting variant of beam search (Koehn et al., 2003) for response generation.
Very recently, new approaches have been proposed to incorporate online human feedback into neural conversation models (Li et al., 2016c; Abel et al., 2017; Li et al., 2017a). Our work falls in this line of research, and is distinguished from existing approaches in the following key ways.
1. We use online deep active learning as a form of reinforcement in a novel way, which eliminates the need for hand-crafted reward criteria. We use a diversity-promoting decoding heuristic (Vijayakumar et al., 2016) to facilitate this process.
2. Unlike existing CAs, our model can be tuned for one-shot learning. It also eliminates the need to explicitly incorporate coherence, relevance or interestingness in the responses.

Model Overview
The architectural backbone of our model is the Seq2Seq framework, consisting of one encoder layer and one decoder layer, each containing 300 LSTM units.
The end-to-end training consists of offline supervised learning (SL) with mini-batches of size 10, followed by online active learning (AL).

Offline Two-Phase Supervised Learning
To establish an offline baseline, we train our network sequentially on two datasets, one for generic dialogue, and the other specially curated for short-text conversation.
Phase 1: We use the Cornell Movie Dialogs Corpus (Danescu-Niculescu-Mizil and Lee, 2011), consisting of 300K message-response pairs. Each pair is treated as an input and target sequence during training with the joint cross-entropy (XENT) loss function, which maximizes the likelihood of generating the target sequence given its input.
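The joint XENT objective can be sketched as follows. This is an illustrative computation, not the authors' code: the per-token probabilities are assumed to come from the decoder's softmax over the vocabulary at each target position.

```python
import math

def xent_loss(target_token_probs):
    """Joint cross-entropy over a target sequence: the negative sum of the
    log-probabilities the decoder assigns to each gold token. Minimizing
    this loss maximizes the likelihood of the target given the input."""
    return -sum(math.log(p) for p in target_token_probs)
```

For example, a decoder that assigns probability 0.5 to each of two gold tokens incurs a loss of 2 ln 2, roughly 1.386; the loss is zero only when every gold token receives probability 1.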
Phase 2: Phase 1 enables our CA to learn language syntax and semantics reasonably well, but it has difficulty carrying out short-text conversations, which are remarkably different from movie conversations. To combat this issue, we curate a dataset from JabberWacky's chatlogs available online. The network is initialized with the weights obtained in the first phase, and then trained on this curated dataset (Figure 2a).

Online Active Learning
After offline SL, our CA is equipped with basic conversational ability, but its responses are still short and dull. To tackle this issue, we initiate an online AL process in which our model interacts with real users and learns incrementally from their feedback at each turn of dialogue. The CA-human interaction for online AL is set up as follows (pseudocode in Algorithm 1, example interaction in Figure 1).
1. The user sends a message u_i at time step i.
2. The CA generates K responses c_{i,1}, c_{i,2}, ..., c_{i,K} using Hamming-diverse beam search. These are displayed to the user in order of decreasing generation likelihood.
3. The user provides feedback by selecting one of the K responses as the 'best' one, or by suggesting a (K+1)'th response; the chosen response is denoted c*_{i,j}. The selection criterion is subjective and entirely up to the user.
4. The message-response pair (u_i, c*_{i,j}) is propagated through the network using the XENT loss, with a learning rate optimized for one-shot learning.
5. The user responds to c*_{i,j} with a message u_{i+1}, and the process repeats.
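One turn of this loop can be sketched as follows. The `DummySeq2Seq` class and its method names are stand-ins invented for illustration: where the stub stores a string, the real model runs Hamming-diverse beam search and a single high-learning-rate XENT gradient update.

```python
class DummySeq2Seq:
    """Toy stand-in for the Seq2Seq model (illustration only)."""
    def __init__(self):
        self.memory = {}  # prompt -> preferred response

    def generate(self, message, num_beams=5):
        # A real model would decode num_beams diverse candidates; the stub
        # returns the remembered response (if any) padded with fillers.
        best = self.memory.get(message, "i don't know")
        return [best] + [f"candidate {k}" for k in range(1, num_beams)]

    def train_step(self, message, response):
        # One-shot learning: after a single high-lr XENT update, `response`
        # becomes the top prediction for `message`; the stub just stores it.
        self.memory[message] = response


def active_learning_turn(model, user_message, feedback, K=5):
    candidates = model.generate(user_message, num_beams=K)  # step 2
    chosen = feedback(candidates)                           # step 3: user picks or suggests
    model.train_step(user_message, chosen)                  # step 4: one XENT update
    return chosen
```

After one turn with feedback "great, thanks!", the stub's next call to `generate` on the same prompt ranks that response first, mirroring the one-shot behavior in Figure 1.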
Heuristic Response Generation: We use the recently proposed Diverse Beam Search (DBS) algorithm (Vijayakumar et al., 2016) to generate the K CA responses at each turn in the dialogue. DBS has been shown to outperform standard beam search (BS) and other diverse decoding techniques on several NLP tasks, including image captioning, machine translation and visual question generation. DBS induces diversity between the beams by maximizing an objective that combines a standard sequence-likelihood term with a dissimilarity metric between the beams. We use the Hamming diversity metric, which at each decoding time step penalizes the selection of words that have already been chosen in other beams (Algorithm 1). In particular, the weight λ associated with this metric is tuned to aggressively promote diversity between the first tokens of each of the K generated sequences, thereby avoiding similar beams like 'I don't know' and 'I really don't know'. We refer the reader to the original paper by Vijayakumar et al. for the complete DBS algorithm and derivation. K is a tunable hyper-parameter; we used K = 5 in all our experiments, based on our observation that a smaller response set usually misses a good contender, while more than five responses become too cumbersome for the user to read at each turn.
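A minimal sketch of the Hamming diversity term: when scoring a candidate token for one beam group, its log-likelihood is reduced by λ times the number of earlier groups that already chose that token at the current step. The function names and the default λ value here are ours, not from the paper.

```python
def hamming_penalty(token, tokens_in_earlier_groups, lam):
    # λ times the number of earlier beam groups that already emitted
    # this token at the current decoding step.
    return lam * sum(1 for t in tokens_in_earlier_groups if t == token)

def diverse_score(logprob, token, tokens_in_earlier_groups, lam=0.8):
    # DBS objective for one candidate token: sequence log-likelihood
    # minus the Hamming dissimilarity penalty.
    return logprob - hamming_penalty(token, tokens_in_earlier_groups, lam)
```

With a sufficiently large λ at the first step, a slightly less likely opener such as 'Sure' outranks a repeated 'I', pushing the K groups toward distinct first tokens.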
It is possible that displaying the K responses in decreasing order of generation likelihood introduces a bias to the user's response, since users typically prefer to pick items located at the top of the screen. If this is a cause for concern for an application, the problem can be resolved simply by tweaking Algorithm 1 such that the K responses are displayed to the user in a random order. In our experiments, we assume that the users are unbiased and do not take into consideration the display order or the generation likelihood of the responses.
One-shot Learning: We control how quickly the model learns from user feedback by tuning the initial learning rate (lr in Algorithm 1) of Adam, the stochastic optimizer (Kingma and Ba, 2014). An appropriately high lr results in one-shot learning, where the user's feedback immediately becomes the model's most likely prediction for that prompt. This scenario is depicted in Figure 1. A low lr leads to smaller gradient-descent steps, so the model requires several 'nudges' to adapt to each new data point. We experiment with different lr values to determine a suitable one (Figure 2b).
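The effect of lr on one-shot adaptation can be seen in a toy one-parameter example. This uses plain gradient descent on a single logistic unit; the paper uses Adam on a full Seq2Seq model, and the learning-rate values below are illustrative for this toy setting only, not the paper's.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def one_step(w, x, y, lr):
    # One cross-entropy gradient step for a logistic unit on example (x, y).
    pred = sigmoid(w * x)
    grad = (pred - y) * x          # d(XENT)/dw for a single example
    return w - lr * grad

# Starting from w = 0 (prediction 0.5 for x = 1), a single step with a
# high lr drives the prediction close to the target y = 1 (one-shot),
# while a small lr barely moves it, so several 'nudges' are needed.
w_high = one_step(0.0, 1.0, 1.0, lr=10.0)    # w becomes 5.0
w_low  = one_step(0.0, 1.0, 1.0, lr=0.01)    # w becomes 0.005
```

The same trade-off appears at full scale: too high an lr destabilizes previously learned parameters, which is why Section "Quantitative Evaluation" searches for a balanced value.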

Experimental Evaluation
We evaluate our model via qualitative comparison with offline SL, as well as quantitative evaluation on four axes: syntactical coherence, relevance to prompts, interestingness and user engagement.

Quantitative Evaluation
We begin by presenting the results of the quantitative evaluation of our CA's conversational abilities when trained via one-phase SL, two-phase SL and online AL (denoted SL1, SL2 and SL2+oAL, respectively).
We first asked a human trainer to actively train SL2+oAL using 200 prompts of his choice. We then created a test set of 100 prompts by randomly choosing 100 of the 200 training prompts and linguistically rephrasing each of them to convey the same semantics. For instance, the AL training prompts 'How's it going?', 'I hate you' and 'What are your favorite pizza toppings?' were altered to the following test prompts: 'How are you doing?', 'I don't like you!' and 'What do you like on your pizza?'. Next, we recorded SL1's, SL2's and SL2+oAL's responses to these test prompts. Finally, we asked five human judges (not including the human trainer) to subjectively evaluate the responses of the three models on the test set. The evaluation of each response was done on four axes: syntactical coherence, relevance to the prompt, interestingness and user engagement. Each judge was asked to assign each response an integer score of 0 (label = bad) or 1 (label = good). Their averaged scores for the three models, SL1, SL2 and SL2+oAL, are shown in Figure 2a. We see that SL2+oAL outperforms the other models on three of the four axes by 14-21%.
Next, we asked the human trainer to train SL2+oAL with the same 200 prompts and responses for different values of Adam's initial learning rate (lr in Algorithm 1). We then asked the five human judges to subjectively rate each model's syntactical coherence, response relevance, interestingness and user engagement, recording each model's percentage success on the test prompts along the four axes. The averaged scores are given in Figure 2b. We see that response quality drops significantly for higher learning rates. This is due to the instability in the parameters induced by a high learning rate on new data, which causes the model to forget what it learned previously. Our experiments suggest that a learning rate of 0.005 strikes the right balance between stability and one-shot learning.
Finally, we asked the human trainer to train SL2+oAL with lr = 0.005 and different numbers of training interactions. The results in Figure 2c confirm that the model improves slowly as it continues to converse with humans. This is an appropriate reflection of how humans learn language: gradually but effectively. Although the curves seem to plateau after 300 training interactions, suggesting that learning has stopped, this is not the case: the gradient is small but nonzero, which is expected behavior for reinforcement learning algorithms in general.

Qualitative Comparison
We illustrate the qualitative differences between the responses generated by SL1, SL2 and SL2+oAL. Table 1 shows results on a small subset of the 100 test prompts. We see that SL2 generates more relevant and appropriate responses than SL1 in many cases. This illustrates that a small short-text conversational dataset is a useful fine-tuning add-on to a large, generic dialogue dataset for offline Seq2Seq training. We also see that SL2+oAL generates more interesting, relevant and engaging responses than SL2. These results imply that the model learns to make connections between semantically similar prompts that are syntactically different. While this may be a slow process (spanning thousands of interactions), it effectively emulates the way humans learn a new language.

Table 2 illustrates how SL2+oAL can be trained to adopt a wide variety of moods and conversational styles. Here, we trained three copies of SL2 separately to adopt three different emotional personas: cheerful, gloomy and rude. Each model underwent 100 training interactions with one human trainer, who was instructed to adopt the corresponding conversational style while training each SL2+oAL model. The test prompts shown in Table 2 were syntactic variations of the training prompts, as before. The results illustrate that SL2+oAL was able to modify the mood of its responses appropriately, based on the way it was trained. Similar experiments can be done to create agents with customized backgrounds and characters, akin to the persona-based CA of Li et al. (2016b).

Conclusion & Future Work
We have developed an end-to-end neural model for open-domain dialogue generation. Our model augments the Seq2Seq framework with online Deep Active Learning to overcome some of its known shortcomings with respect to dialogue generation. Experiments show that the model promotes semantically coherent, relevant and interesting responses, and can be trained to adopt diverse moods, personas and conversational styles.
In the future, we will explore context-sensitive active learning for encoder-decoder conversation models. We will also investigate whether existing Affective Computing techniques (e.g. (Asghar and Hoey, 2015)) can be leveraged to develop emotionally cognizant neural conversational agents.