Learning from Dialogue after Deployment: Feed Yourself, Chatbot!

The majority of conversations a dialogue agent sees over its lifetime occur after it has already been trained and deployed, leaving a vast store of potential training signal untapped. In this work, we propose the self-feeding chatbot, a dialogue agent with the ability to extract new training examples from the conversations it participates in. As our agent engages in conversation, it also estimates user satisfaction in its responses. When the conversation appears to be going well, the user’s responses become new training examples to imitate. When the agent believes it has made a mistake, it asks for feedback; learning to predict the feedback that will be given improves the chatbot’s dialogue abilities further. On the PersonaChat chit-chat dataset with over 131k training examples, we find that learning from dialogue with a self-feeding chatbot significantly improves performance, regardless of the amount of traditional supervision.


Introduction
Training a dialogue agent to converse like a human requires extensive supervision. The most common approach is to train models to imitate humans in large corpora of crowdsourced or scraped conversations (Serban et al., 2015). These fully-supervised conversations tend to be expensive to collect in sufficient quantity and/or occur in settings with significant differences from the deployment environment (Ross et al., 2009). Instead, dialogue agents would ideally learn directly from dialogue, the conversations they participate in after deployment, which are usually abundant, task-specific, dynamic, and cheap. This corresponds to the way humans learn to converse: not merely observing others engaging in "expert-level" conversations, but actively adjusting and correcting our speech based on feedback woven throughout our own conversations (Bassiri, 2011; Werts et al., 1995). Giving a dialogue agent this ability would enable it to continuously improve and adapt over its lifetime, rather than requiring additional annotation costs for each and every improvement.

* BH completed most of this work at Facebook (FAIR).
However, naively training a dialogue agent on its own conversations yields poor results. For example, training a model on its own output can simply reinforce its existing failure modes, and mistakes by the agent can lead to absurd conversations that no longer resemble the target domain (Hashimoto and Sassano, 2018). To combat this, one approach is to allow the agent to request feedback during conversations (Zhang et al., 2018a; Li et al., 2017b), e.g., when it believes it is about to make a mistake. This approach, however, falls victim to the Dunning-Kruger effect (Kruger and Dunning, 1999), which in this case suggests that a bad model will also be bad at knowing when it is doing a bad job. Regardless of when feedback is requested, existing methods typically require accompanying scalar rewards or adherence to particular templates or structure to ensure that the feedback is usable by the model (Rieser and Lemon, 2011). These requirements may be acceptable for paid annotators, but they impose unnatural workflows on unpaid conversation partners in a standard dialogue environment. Humans are able to request and provide feedback using only natural language; ideally, dialogue agents would be able to do the same.

In this work we propose the self-feeding chatbot, a dialogue agent with the ability to extract new examples from the conversations it participates in after deployment (Figure 1). Concretely, in addition to being trained on the primary DIALOGUE task, the agent is trained to predict its speaking partner's satisfaction with its responses. When the conversation seems to be going well, the user's responses (but not the bot's own utterances) become the targets in new training examples for the DIALOGUE task. When the agent believes it has made a mistake, it instead requests feedback on what it could have said instead. Predicting the feedback that will be provided in a given context becomes an auxiliary task (FEEDBACK) on which the model is also trained.
Importantly, these new examples improve the agent's dialogue abilities while using only natural responses from the user that do not require special structure, accompanying numerical feedback, or additional human intervention in order to be used.
With this approach, the conversations the chatbot participates in are sliced into two complementary datasets: one largely protected from the chatbot's mistakes (DIALOGUE examples), and one which directly addresses them (FEEDBACK examples). We validate our approach on the PERSONACHAT (Zhang et al., 2018b) dialogue dataset, finding empirically that regardless of the number of available supervised examples, the dialogue ability of the chatbot is always improved by adding the automatically extracted examples of either type, and improves the most by adding both.
The main contributions of this work include the following:

• We propose the self-feeding chatbot, a dialogue agent with the ability to extract new training examples for itself from the conversations it participates in during deployment.
• We show that dialogue ability improves by imitating human responses when the human is satisfied and by asking for feedback when they are not, with predicting that feedback serving as an auxiliary task.
• We demonstrate that classifying user satisfaction is a learnable task important for the self-feeding process, significantly outperforming an approach based on model uncertainty.
• We release three new datasets to further research in this direction: (1) deployment chat logs (513k messages); (2) ratings of user satisfaction (42k); (3) textual feedback on what a bot could have said in a given context (62k).
The datasets and models described in this paper are available via the ParlAI platform (Miller et al., 2017), along with training code. Hyperparameter values are included in Appendix G.

Related Work
The general concepts of lifelong learning (Silver et al., 2013) and never-ending (language) learning (Carlson et al., 2010) are related to the topics discussed in this work, as are active learning (Tong and Koller, 2001) and predictive modeling (Schmidhuber and Huber, 1991). The specific case of learning actively from dialogue during deployment was explored for the question answering (QA) setting by Weston (2016) and Li et al. (2017a), who examined multiple learning strategies on a suite of dialogue tasks with varying types of feedback, such as verbal cues (e.g., "Yes, that's right!") and scalar rewards. Most relevant to our work was their use of forward prediction, where the learner improved in quality by trying to predict the teacher's responses without an explicit reward signal. Our work extends this idea, adding the ability for the model to recognize its mistakes and request feedback explicitly, and moving beyond QA to the more general chit-chat setting where there may be many valid responses in a given context.
Learning to ask questions is another area that has been studied (Strub et al., 2017; Wang et al., 2018; Rao and Daumé III, 2018). While those works focused on identifying which question to ask in a given context, in this work we are more interested in first learning when to ask a question. Li et al. (2017b) considered this question as well, but again in the context of a QA setting rather than dialogue. Hashimoto and Sassano (2018) used user responses to detect mistakes made by a deployed virtual assistant, showing that model mistakes can be identified in chit-chat, weather, or web search domains. However, they did not explore how to use these identified mistakes to improve the model further; their agent was not equipped to feed itself. Eskenazi et al. (2018) also found that correctly assessing the appropriateness of chatbot responses is highly dependent on user responses and not preceding context alone.

Figure 2: (1) The chatbot is first trained with any available supervised data (boxed in red) on the Human-Human (HH) DIALOGUE (x, y)_HH and SATISFACTION (x, s) tasks. (2) During deployment, whenever the predicted satisfaction score of the current conversation x is above the threshold (ŝ > t), a new Human-Bot (HB) DIALOGUE example (x, y)_HB is extracted and the bot continues the conversation with its own response ŷ. Otherwise, the chatbot requests feedback with question q and extracts a new FEEDBACK example (x, f).
There are other, somewhat less related, ways to use feedback during dialogue for learning, notably for collecting knowledge to answer questions (Mazumder et al., 2018; Hixon et al., 2015; Pappu and Rudnicky, 2013), and more commonly in reinforcement learning settings, where the feedback is a scalar rather than the dialogue messages themselves (Levin et al., 2000; Schatzmann et al., 2006; Rieser and Lemon, 2011; Hong et al., 2019). In particular, Serban et al. (2017) employ user sentiment detection for reward shaping in their Alexa prize entry.
Finally, our work improves dialogue quality by utilizing larger datasets with noisier labels than traditional supervision provides. Other applications of weak supervision to dialogue (Mallinar et al., 2019) and to relation extraction (Bunescu and Mooney, 2007; Hancock et al., 2018; Ratner et al., 2017) have observed similar results.

The Self-Feeding Chatbot
The lifecycle of a self-feeding chatbot is outlined in Figure 2. In the initial training phase, the dialogue agent is trained on two tasks, DIALOGUE (next utterance prediction: what should I say next?) and SATISFACTION (how satisfied is my speaking partner with my responses?), using whatever supervised training data is available. We refer to these initial DIALOGUE examples as Human-Human (HH) examples, since they were generated in conversations between two humans.
In the deployment phase, the agent engages in multi-turn conversations with users, extracting new deployment examples of two types. Each turn, the agent observes the context x (i.e., the conversation history) and uses it to predict its next utterance ŷ and its partner's satisfaction ŝ. If the satisfaction score is above a specified threshold t, the agent extracts a new Human-Bot (HB) DIALOGUE example using the previous context x and the human's response y and continues the conversation.
If, however, the user seems unsatisfied with its previous response (ŝ < t), the agent requests feedback with a question q, and the resulting feedback response f is used to create a new example for the FEEDBACK task (what feedback am I about to receive?). The agent acknowledges receipt of the feedback and the conversation continues. The rate at which new DIALOGUE or FEEDBACK examples are collected can be adjusted by raising or lowering the satisfaction threshold t (we use t = 0.5). Periodically, the agent is retrained using all available data, thereby improving performance on the primary DIALOGUE task.
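The turn-level decision above can be sketched as follows. This is a minimal illustration, not the authors' actual implementation; `predict_satisfaction` and `predict_utterance` are assumed stand-ins for the trained SATISFACTION and DIALOGUE models.

```python
FEEDBACK_REQUEST = "Oops! Sorry. What should I have said instead?"

def deployment_turn(predict_satisfaction, predict_utterance,
                    history, human_msg, t=0.5):
    """Process one human turn: either extract a new HB DIALOGUE example
    and keep chatting, or request feedback (the user's next message then
    becomes the target f of a new FEEDBACK example)."""
    context = history + [human_msg]
    s_hat = predict_satisfaction(context)  # partner's estimated satisfaction
    if s_hat > t:
        new_example = (history, human_msg)  # (x, y) for the DIALOGUE task
        return predict_utterance(context), new_example, "hb_dialogue"
    # Likely mistake: ask for feedback instead of replying normally.
    return FEEDBACK_REQUEST, None, "awaiting_feedback"
```

Raising or lowering `t` trades off how many HB DIALOGUE versus FEEDBACK examples are collected, as described above.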
It is important to note that the user's responses are always in the form of natural dialogue. In particular, at no point are the new FEEDBACK examples inspected, post-processed, or cleaned. Instead, we rely on the fact that the feedback is not random: regardless of whether it is a verbatim response, a description of a response, or a list of possible responses (see Table 2 for examples), there is a learnable relationship between conversation contexts and their corresponding feedback which requires many of the same language understanding skills to master as does carrying on a normal conversation.
The experiments in this paper are limited to the setting where the numbers of supervised and deployment examples are on the same order of magnitude; however, we envision scenarios in which the number of deployment examples can easily grow to 100× the number of supervised examples or more over the chatbot's deployment lifetime, effectively providing a massive task-specific corpus at minimal cost. Table 1 reports the sizes of each dataset, all of which are available via ParlAI.

Task 1: DIALOGUE
The chatbot's primary task (DIALOGUE) is to carry on a coherent and engaging conversation with a speaking partner. Training examples take the form of (x, y) pairs, where x is the context of the conversation (the concatenation of all responses so far up to some history length, delimited with tokens marking the speaker), and y is the appropriate response given by the human.
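As a concrete illustration, a context of this form might be assembled as below. This is a sketch; the `__p1__`/`__p2__` speaker tokens are assumed placeholders, not necessarily the delimiters used in the released code.

```python
def build_context(turns, history_len=None):
    """Concatenate conversation turns into a single context string,
    tagging each turn with a token marking the speaker. Speakers are
    assumed to alternate, with the partner speaking last."""
    if history_len is not None:
        turns = turns[-history_len:]  # truncate to the history length
    tagged = []
    for i, turn in enumerate(reversed(turns)):
        speaker = "__p2__" if i % 2 == 0 else "__p1__"  # partner spoke last
        tagged.append((speaker, turn))
    return " ".join(f"{s} {t}" for s, t in reversed(tagged))
```

With a history length of 2 (the setting used later in this paper), only the agent's previous utterance and the partner's response survive the truncation.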
The Human-Human (HH) portion of the DIALOGUE dataset comes from the PERSONACHAT dataset (Zhang et al., 2018b), which consists of short dialogs (6-8 turns) between two crowdworkers (humans) who have been assigned short text profiles and are instructed to "chat with the other person naturally and try to get to know each other." We chose this dataset because of its size (over 145k total examples), the breadth of topics it covers, and its focus on promoting engaging conversations, which we anticipate being a necessary property of a chatbot that people will be willing to chat with voluntarily and repeatedly. We use the standard splits of the dataset made available in ParlAI as a part of the ConvAI2 challenge (Burtsev et al., 2018). Since how to incorporate external knowledge (such as profiles) in dialogue is an open research question of its own (Li et al., 2016; Luan et al., 2017; Luo et al., 2018) and we are primarily interested in learning from dialogue, we discard the profiles and simply train and test on the conversations themselves, making the dataset more challenging in terms of raw performance scores.
The Human-Bot (HB) portion of the DIALOGUE dataset is extracted during deployment as described earlier. The context may contain responses from both the human and the bot, but the target response is always from the human, as we will see experimentally that targeting bot responses degrades performance. Because the chit-chat domain is symmetric, both the HH and HB DIALOGUE examples are used for the same task. In an asymmetric setting where the bot has a different role than the human, it is unclear whether HB examples could still be used as an auxiliary task, but FEEDBACK examples would remain usable.

Task 2: SATISFACTION
The objective of the SATISFACTION auxiliary task is to predict whether or not a speaking partner is satisfied with the quality of the current conversation. Examples take the form of (x, s) pairs, where x is the same context as in the DIALOGUE task, and s ∈ [0, 1] ranges from dissatisfied to satisfied. Crucially, it is hard to estimate from the bot's utterance itself whether the user will be satisfied, but much easier using the human's response to that utterance, as they may explicitly say something to that effect, e.g., "What are you talking about?". The dataset for this task was collected via crowdsourcing. Workers chatted with our baseline dialogue agent and assigned a rating of 1-5 for the quality of each of the agent's responses. Contexts with a rating of 1 were mapped to the negative class (dissatisfied) and ratings of 3-5 were mapped to the positive class (satisfied). Contexts with a rating of 2 were discarded to increase the separation between classes for a cleaner training set. Note that these numeric ratings were requested only when collecting the initial training data, not during deployment, where only natural dialogue is used.
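The rating-to-label mapping described above is simple enough to state directly. This is a sketch of the labeling rule, not the authors' released preprocessing code.

```python
def rating_to_label(rating):
    """Map a 1-5 response-quality rating to a binary SATISFACTION label.

    Rating 1 -> 0 (dissatisfied); ratings 3-5 -> 1 (satisfied);
    rating 2 is discarded (None) to widen the gap between classes.
    """
    if rating == 1:
        return 0
    if rating in (3, 4, 5):
        return 1
    return None  # rating 2: discarded
```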

Task 3: FEEDBACK
The objective of the FEEDBACK auxiliary task is to predict the feedback that will be given by the speaking partner when the agent believes it has made a mistake and asks for help. Examples take the form of (x, f ) pairs, where x is the same context as the other two tasks and f is the feedback utterance.
Training data for this task is collected during deployment. Whenever the user's estimated satisfaction is below a specified threshold, the chatbot responds "Oops! Sorry. What should I have said instead?". A new example for the FEEDBACK task is then extracted, using as x the context up to but not including the turn where the agent made the poor response, and as f the user's response (as shown in Figure 1). To continue the conversation during deployment, the bot's history is then reset and the bot instructs the user to continue, asking for a new topic. Examples of FEEDBACK responses are shown in Table 2. (A snapshot of the data collection interface and sample conversations are included in the Appendix.)

Model Architecture
The self-feeding chatbot has two primary components: an interface component and a model component. The interface component is shared by all tasks, and includes input/output processing (tokenization, vectorization, etc.), conversation history storage, candidate preparation, and control flow (e.g., when to ask a question vs. when to give a normal dialogue response). The model component contains a neural network for each task, with embeddings, a network body, and a task head, some of which can be shared. In our case, we obtained maximum performance by sharing all parameters between the FEEDBACK and DIALOGUE tasks (prepending FEEDBACK responses with a special token), and using separate model parameters for the SATISFACTION task. Identifying optimal task structure in multi-task learning (MTL) architectures is an open research problem (Ruder, 2017). Regardless of what parameters are shared, each training batch contains examples from only one task at a time, candidate sets remain separate, and each task's cross-entropy loss is multiplied by a task-specific scaling factor tuned on the validation set to help account for discrepancies in dataset size, loss magnitude, dataset relevance, etc.
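Two details from this setup can be sketched compactly: prepending a special token to FEEDBACK targets so the shared DIALOGUE/FEEDBACK model can tell the tasks apart, and multiplying each task's loss by a tuned scaling factor. The token string and the scale values used below are illustrative assumptions, not the tuned values from the paper.

```python
FEEDBACK_TOKEN = "__feedback__"  # assumed name for the special prefix token

def prepare_target(task, response):
    """FEEDBACK and DIALOGUE share all model parameters; FEEDBACK
    targets are marked with a special token so the shared model can
    distinguish the two tasks."""
    return f"{FEEDBACK_TOKEN} {response}" if task == "feedback" else response

def scaled_loss(task, raw_loss, scales):
    """Each task's cross-entropy loss is multiplied by a task-specific
    factor tuned on the validation set (e.g., in the 0.5-2.0 range)."""
    return scales[task] * raw_loss
```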
Our dialogue agent's models are built on the Transformer architecture (Vaswani et al., 2017), which has been shown to perform well on a variety of NLP tasks (Devlin et al., 2018; Radford et al., 2018), including multiple persona-based chat applications (Shuster et al., 2018a,b; Rashkin et al., 2018). For the SATISFACTION task, the context x is encoded with a Transformer and converted to the scalar satisfaction prediction ŝ by a final linear layer in the task head. The DIALOGUE and FEEDBACK tasks are set up as ranking problems, as in (Zhang et al., 2018b; Mazaré et al., 2018), where the model ranks a collection of candidate responses and returns the top-ranked one as its response. The context x is encoded with one Transformer and the ŷ and f̂ candidates are encoded with another. The score for each candidate is calculated as the dot product of the encoded context and the encoded candidate.
During training, negative candidates are pulled from the correct responses for the other examples in the mini-batch. During evaluation, however, to remain independent of batch size and data shuffling, each example is assigned a static set of 19 other candidates sampled at random from its split of the data. During deployment, all 127,712 unique HH DIALOGUE candidates from the train split are encoded once with the trained model and each turn the model selects the top-ranked one for the given context.
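The ranking loss with in-batch negatives can be sketched as follows. Pure Python is used for clarity; the actual model computes these dot products over Transformer encodings in batched tensor form.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def in_batch_ranking_loss(ctx_vecs, cand_vecs):
    """Cross-entropy over in-batch negatives: candidate j is the gold
    response for context i iff i == j, so every other candidate in the
    mini-batch serves as a negative. Scores are dot products of the
    encoded context and the encoded candidate."""
    total = 0.0
    for i, c in enumerate(ctx_vecs):
        scores = [dot(c, y) for y in cand_vecs]
        log_z = math.log(sum(math.exp(s) for s in scores))
        total += log_z - scores[i]  # -log softmax probability of the gold
    return total / len(ctx_vecs)
```

At evaluation time the same scoring is applied to a static set of 19 random negatives per example, and at deployment to all 127,712 encoded training candidates.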

Model Settings
Contexts and candidates are tokenized using the default whitespace and punctuation tokenizer in ParlAI. We use a maximum dialogue history length of 2 (i.e., when making a prediction, the dialogue agent has access to its previous utterance and its partner's response). Tokens are embedded with fastText (Bojanowski et al., 2017) 300-dimensional embeddings. We do not limit the vocabulary size, which varies from 11.5k to 23.5k words in our experiments, depending on the training set. The Transformer is implemented in PyTorch (Paszke et al., 2017) within the ParlAI framework. We use the AdaMax (Kingma and Ba, 2014) optimizer with a learning rate schedule that decays based on the inverse square root of the step number after 500 steps of warmup from 1e-5. We use proportional sampling (Sanh et al., 2018) to select batches from each task for training, with batch size 128. Each Transformer layer has two attention heads and FFN size 32. The initial learning rate (0.001-0.005), number of Transformer layers (1-2), and task-specific loss factors (0.5-2.0) are selected on a per-experiment basis based on a grid search over the validation set averaged over three runs (we use the DIALOGUE validation set whenever multiple tasks are involved). We use early stopping based on the validation set to decide when to stop training. The hyperparameter values for the experiments in Section 5 are included in Appendix G.
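The learning-rate schedule described above (linear warmup from 1e-5 for 500 steps, then inverse-square-root decay) can be sketched as below. This is a plausible reading of the schedule; ParlAI's exact implementation may differ in details.

```python
import math

def learning_rate(step, base_lr=0.001, warmup_steps=500, warmup_start=1e-5):
    """Linear warmup from `warmup_start` to `base_lr`, then decay
    proportional to the inverse square root of the step number."""
    if step < warmup_steps:
        frac = step / warmup_steps  # fraction of warmup completed
        return warmup_start + frac * (base_lr - warmup_start)
    return base_lr * math.sqrt(warmup_steps / step)
```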
Note that throughout development, a portion of the DIALOGUE validation split was used as an informal test set. The official hidden test set for the DIALOGUE task was used only to produce the final numbers included in this paper.

Experimental Results
Throughout this section, we use the ranking metric hits@X/Y, or the fraction of the time that the correct candidate response was ranked in the top X out of Y available candidates; accuracy is another name for hits@1/Y. Statistical significance for improvement over baselines is assessed with a two-sample one-tailed T-test.
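Concretely, the metric can be computed as in this minimal sketch:

```python
def hits_at(x, scores, gold_index):
    """hits@X/Y: 1 if the gold candidate ranks in the top X of the Y
    scored candidates (higher score = better), else 0. Averaged over a
    test set this gives the reported hits@X/Y; accuracy is hits@1/Y."""
    ranked = sorted(range(len(scores)), key=lambda i: -scores[i])
    return 1 if gold_index in ranked[:x] else 0
```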

Benefiting from Deployment Examples
Our main result, reported in Table 3, is that utilizing the deployment examples improves accuracy on the DIALOGUE task regardless of the number of available supervised (HH) DIALOGUE examples. The boost in quality is naturally most pronounced when the HH DIALOGUE training set is small (i.e., where the learning curve is steepest), yielding an increase of up to 9.4 accuracy points, a 31% improvement. However, even when the entire PERSONACHAT dataset of 131k examples is used, a much larger dataset than what is available for most dialogue tasks, adding deployment examples is still able to provide an additional 1.6 points of accuracy on what is otherwise a very flat region of the learning curve. Whereas HB DIALOGUE examples come from conversations where the user appears to already be satisfied with the agent's responses, each FEEDBACK example corresponds to a mistake made by the model, giving the latter dataset a more active role in improving quality. Interestingly, our best-performing model, which achieves 46.3 accuracy on DIALOGUE, scores 68.4 on FEEDBACK, suggesting that the auxiliary task is a simpler task overall.
When extracting HB DIALOGUE examples, we ignore human responses that the agent classifies as expressing dissatisfaction, since these turns do not represent typical conversation flow. Including these responses in the 60k HB dataset decreases hits@1/20 by 1.2 points and 0.6 points when added to 20k and 131k HH DIALOGUE examples, respectively. We also explored using chatbot responses with favorable satisfaction scores (ŝ > t) as new training examples, but found that our models performed better without them (see Appendix D for details).
We also found that "fresher" feedback results in bigger gains.

Predicting User Satisfaction
For maximum efficiency, we aim to ask for feedback when it will most benefit our model. The approach we chose (classifying the tone of partner responses) takes advantage of the fact that it is easier to recognize that a mistake has already been made than it is to avoid making one; in other words, sentiment classification is generally an easier task than next utterance prediction.
We compare this to the approach of asking for feedback whenever the model is most uncertain what to say next. This approach acts on the assumption that the model will be least confident when it is about to make a mistake, which we frequently find not to be the case. Not only is it difficult to recognize one's own mistakes, but there are often multiple valid responses to a given context (e.g., "Yes, I love seafood!" or "Yuck, fish is gross."); a lack of certainty about which to use does not necessarily indicate a poor model. Table 4 reports the maximum F1 scores achieved by each method on the SATISFACTION test set. For the model uncertainty approach, we tested two variants: (a) predict a mistake when the confidence in the top-rated response is below some threshold t, and (b) predict a mistake when the gap between the top two rated responses is below the threshold t. We used the best-performing standalone DIALOGUE model (one trained on the full 131k training examples) for assessing uncertainty and tuned the thresholds to achieve maximum F1 score. For the user satisfaction approach, we trained our dialogue agent on just the SATISFACTION task. Finally, we also report the performance of a regular-expression-based method which we used during development, based on common ways of expressing dissatisfaction that we observed in our pilot studies; see Appendix F for details.
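The two uncertainty heuristics can be sketched as below (illustrative only; the thresholds were tuned on the validation set):

```python
def flags_mistake(scores, threshold, method="margin"):
    """Uncertainty-based mistake detection: variant (a) ("top") flags a
    mistake when the top-ranked candidate's score falls below the
    threshold; variant (b) ("margin") flags one when the gap between
    the two top-ranked candidates falls below the threshold."""
    ranked = sorted(scores, reverse=True)
    if method == "top":
        return ranked[0] < threshold
    return ranked[0] - ranked[1] < threshold
```

As the section argues, a low top score or a small margin often just reflects several equally valid replies, which is one reason the trained SATISFACTION classifier outperforms both variants.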
As shown by Table 4, even with only 1k training examples (the amount we used for the experiments in Section 5.1), the trained classifier significantly outperforms both the uncertainty-based methods and our original regular expression, by as much as 0.28 and 0.42 F1 points, respectively.

Future Work
In this work we achieved learning from dialogue using two types of self-feeding: imitation of satisfied user messages, and learning from the feedback of unsatisfied users. In actuality, there are even more ways a model could learn to improve itself, for example by learning which question to ask in a given context to receive the most valuable feedback. One could even use the flexible nature of dialogue to intermix data collection of more than one type, sometimes requesting new FEEDBACK examples as in this work, and other times requesting new SATISFACTION examples (e.g., by asking "Did my last response make sense?"). In this way, a dialogue agent could simultaneously increase its dialogue ability and increase its ability to improve further. We leave exploration of this meta-learning theme to future work.

A Data Collection Protocol
Here we report in greater detail the protocol we followed to collect the SATISFACTION, FEEDBACK, and HB DIALOGUE examples used in this work, including the SATISFACTION examples used in Table 4 to investigate the learning curve for that task.
No filtering was performed on the crowdworker conversations. Upon inspection after the fact, some workers did indeed give poor responses, make typographical mistakes, misunderstand the instructions, try to use the chatbot as a question answering interface, etc. We assume however that similar types of noise will be present in most chatbot deployment environments and opted to maintain a workflow that truly does not require developer intervention to use the newly collected examples.

C PERSONACHAT Comparisons and Baselines
Our experiments use the PERSONACHAT distribution that was released as a part of the ConvAI2 (Burtsev et al., 2018) challenge. This distribution is slightly cleaner than the original PERSONACHAT release and comes with a new crowdsourced test set. In order to compare with the models and baselines used in the original PERSONACHAT paper (Zhang et al., 2018b), we report in this section the performance of our models on the original PERSONACHAT test set, not the ConvAI2 test set. Note that all numbers reported here are for models that do not have access to the profiles that were used in the creation of the conversations; models that do have access to this additional information tend to perform even better.

Model                 hits@1/20
Transformer (ours)    49.6
Self-Feeding (ours)   51.7

D Using Chatbot Responses as New Training Examples

We also considered whether it is possible to consistently identify really good responses by the chatbot, rather than the really bad ones. These could potentially be used as DIALOGUE examples along with the ones that have human responses as targets (what we refer to as HH and HB in the paper). To explore this question, we modified our SATISFACTION dataset so that contexts with a rating of 5 formed the positive class and ones with ratings 1-3 formed the negative class (discarding ratings of 4 to increase the separation between classes). The results were negative: even with a training set of over 34k examples, the maximum precision we were able to achieve while maintaining at least 10% recall was 0.70, which is insufficient to improve performance on the DIALOGUE task. Upon inspection, it appears that really good responses are hard to identify because most of the time they look like a normal human-to-human conversation, and recognizing an appropriate next utterance is precisely the DIALOGUE task that we are trying to solve! Negative responses, however, are much more semantically similar to one another, since most express one of a few common ideas such as asking for clarification or conveying confusion.