Learning Improvised Chatbots from Adversarial Modifications of Natural Language Feedback

The ubiquitous nature of chatbots and their interaction with users generate an enormous amount of data. Can we improve chatbots using this data? A self-feeding chatbot improves itself by asking for natural language feedback when a user is dissatisfied with its response and uses this feedback as an additional training sample. However, user feedback in most cases contains extraneous sequences hindering its usefulness as a training sample. In this work, we propose a generative adversarial model that converts noisy feedback into a plausible natural response in a conversation. The generator's goal is to convert the feedback into a response that answers the user's previous utterance and fools the discriminator, which distinguishes feedback from natural responses. We show that augmenting the original training data with these modified feedback responses improves the original chatbot's performance from 69.94% to 75.96% in ranking correct responses on the PersonaChat dataset, a large improvement given that the original model is already trained on 131k samples.


Introduction
Enabling chatbots to indulge in engaging conversations requires massive datasets of human-human conversations (Ritter et al., 2011; Sordoni et al., 2015; Vinyals and Le, 2015; Zhang et al., 2019). Training such dialog agents requires substantial time and effort spent collecting an adequate number of high-quality conversation samples. Hancock et al. (2019) alleviate this problem by introducing a self-feeding chatbot which can learn directly from user interactions. This chatbot requests users to provide natural language feedback when the users are dissatisfied with its response. Hancock et al. (2019) treat this feedback as a gold response to the wrong turn and use it as an additional training sample to improve the chatbot.

Figure 1: When the bot provides a poor response to the question posed by the user, the bot requests natural language feedback. We use the conversation context and the feedback to construct a plausible response to the user query and use it as an additional training sample to improve the chatbot.

1 Our code is released at https://github.com/ekunnii/adversarial-feedback-chatbot/
Although natural language feedback is cheap to collect from a chatbot's end-users, it can rarely be used directly as a training sample: feedback is usually not the answer itself, but merely contains hints to the answer. Table 1 shows some feedback samples. Naive modification of feedback using heuristics like regular expressions would lead to generic responses that are ineffective in improving the dialog ability of chatbots (Li et al., 2016). Additionally, writing an exhaustive set of regular expression rules is time consuming and requires extensive analysis of the data. Annotating data to convert feedback text to natural responses is also expensive and defeats the purpose of learning from feedback.

you could say hey, i'm 30. how old are you?
yes, i play battlefield would be a great answer
tell me what your favorite breakfast food is
answer the question about having children!

Table 1: Samples of feedback to the chatbot. These contain hints to the answer but are not the answers themselves.
In this work, we propose a generative adversarial setup for converting such noisy feedback instances into natural, human-like responses that provide better training signals for the dialog agents. Figure 1 gives a bird's-eye view of our problem. We frame this problem as a variant of text style transfer where the generator is tasked with making the feedback resemble the optimal response to the user's previous utterance, and the discriminator is a classifier that distinguishes whether a given response is feedback or a natural response.
Our main contributions are the following:
• We introduce FEED2RESP, a text style transfer system that converts feedback to natural responses without full supervision, thus generating additional training samples (Section 2).
• We show that training on FEED2RESP-modified responses leads to improved accuracy of chatbots (Section 4). Our results also reveal that training naively on feedback doesn't help when the original chatbot is already a strong model, whereas FEED2RESP helps strong models too.
Feedback to Natural Response Model

Hancock et al. (2019) introduce a novel variant of a self-feeding chatbot in which the dialogue agent is equipped with the capability of extracting new training samples while in conversation with humans after deployment (Figure 1). The agent also employs a satisfaction module which is trained to predict how satisfied the partner is with the responses it provides. When the chatbot is engaged in a conversation where the predicted satisfaction is below a defined threshold (usually 0.5), it requests natural language feedback from the user. However, using the collected feedback directly as training samples is not necessarily a good technique; instead, we propose an approach to better utilize the collected feedback. We pose the problem of converting feedback to resemble a natural response as a text style transfer problem. We observe that feedback is more instructional and judgemental, whereas a natural response is direct (answering questions) and engaging (asking questions, containing humor). We naturalize the feedback into a response and use it as an additional training sample to improve the chatbot.
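The satisfaction-gated collection loop can be sketched as a simple decision rule. This is an illustrative sketch: the function names and the exact threshold handling are our assumptions, not the authors' implementation.

```python
def should_request_feedback(predicted_satisfaction, threshold=0.5):
    """Return True when the satisfaction module's score falls below the
    threshold, i.e. the bot should ask 'What should I have said?'."""
    return predicted_satisfaction < threshold

def collect_feedback_example(context, bot_response, user_feedback):
    """Package a deployment-time interaction as a raw training candidate.
    The raw feedback is what Feed2Resp later rewrites into a natural
    response."""
    return {"context": context,
            "wrong_response": bot_response,
            "feedback": user_feedback}
```

In deployment, examples accumulate only on low-satisfaction turns, so the collected data is biased toward the bot's failure modes.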
A fully supervised approach to convert feedback to natural response is infeasible as we do not have paired (feedback ↔ response) examples and thus we adopt an adversarial setup. We utilize a GAN (Goodfellow et al., 2014) formulation where the generator modifies the feedback's style to make it seem part of a natural conversation, and in turn fool the discriminator which knows how to distinguish natural responses and feedback. Our model, FEED2RESP, is shown in Figure 2.

Adversarial Setup
Given an input sentence x (feedback or natural response) with source style s, conversation history h, and target style s̄, the generator performs the mapping

y = g_θ(x, h, s̄)

Here y is the rewrite of x into style s̄. It is often the case that feedback and desired responses share many words (see Table 9). We use a BART encoder-decoder initialized with pretrained weights as our generator, since its denoising objective helps in copying from the input while also producing realistic sentences (Lewis et al., 2019).
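One common way to feed (input, history, target style) to a sequence-to-sequence generator is to serialize them into a single source string with a style control token prepended. The token names and joining scheme below are illustrative assumptions, not the paper's exact input format.

```python
# Hypothetical control tokens marking the target style of the rewrite.
STYLE_TOKENS = {"feedback": "<FEEDBACK>", "response": "<RESPONSE>"}

def build_generator_input(x, history, target_style):
    """Serialize the generator's inputs into one source string: the
    target-style token, then the conversation history, then x."""
    style_tok = STYLE_TOKENS[target_style]
    return " ".join([style_tok] + history + [x])

src = build_generator_input(
    "You could have talked about your age.",
    ["How old are you? I like hiking."],
    "response")
```

A seq2seq model conditioned on such a string can learn to emit the rewrite in the requested style.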
We additionally pretrain our model under the summarization setting to extract only the response when presented with conversation history and re-sponse. This helps maintain brevity while still integrating details from the context in the response.
The discriminator is a transformer encoder network that learns to distinguish the style of feedback and natural responses. Given an input text x and conversation history h, it predicts the style class c of x. Formally, it is defined as

c = d_φ(x, h)

FEED2RESP Learning
We train FEED2RESP on three main objectives that help the model to reconstruct sentences when the style is not changed, change its style meaningfully and distinguish different styles. These objectives are shown to work well in other style transfer scenarios (Dai et al., 2019).
Self reconstruction objective For the scenario where the target style is the same as the source style, we train the generator to reconstruct the input sentence. Considering the input sentence x with source and target style s, we minimize the negative log-likelihood loss of generating the same sentence x as output:

L_self(θ) = −log p_θ(x | x, h, s)

Cycle consistency objective Taking inspiration from CycleGAN (Zhu et al., 2017), we introduce a cycle consistency constraint to ensure that the model learns to preserve the meaning when it modifies the style of the original sentence. We first transform x to style s̄ to produce y = g_θ(x, h, s̄). Subsequently, we feed y as input with the target style s, and the model is trained to reconstruct the original sentence x. We minimize the negative log-likelihood loss, given by

L_cycle(θ) = −log p_θ(x | y, h, s)

Style modification objective To ensure that the style of an input sentence x is changed to match the target style s̄, we use the discriminator's confidence as a training signal. The generator wants to maximize the probability of the discriminator classifying the transformed input as the target style, and therefore we use the negative log-likelihood of the discriminator as our loss:

L_style(θ) = −log p_φ(s̄ | y, h)
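Each objective is a negative log-likelihood term, and the total training loss is their (possibly weighted) sum. The toy probabilities below stand in for the likelihoods the generator and discriminator assign; the equal default weights are our assumption, since the weighting is not stated here.

```python
import math

def nll(prob):
    """Negative log-likelihood of assigning probability `prob` to the
    reference sequence (or style label)."""
    return -math.log(prob)

def feed2resp_loss(p_self, p_cycle, p_style,
                   w_self=1.0, w_cycle=1.0, w_style=1.0):
    """Weighted sum of the three objectives.

    p_self  : generator likelihood of reconstructing x from (x, h, s)
    p_cycle : generator likelihood of recovering x from the rewrite y
    p_style : discriminator probability that y carries the target style
    """
    return (w_self * nll(p_self)
            + w_cycle * nll(p_cycle)
            + w_style * nll(p_style))
```

When all three likelihoods approach 1, the loss approaches 0; any confidently wrong component drives the loss up sharply.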

End-to-end training
The discrete nature of sampling and the non-differentiability of the argmax operator prevent gradient backpropagation. Following Dai et al. (2019), we consider the softmax distribution produced by the generator g_θ as the 'soft' generated sentence and use it as input to further downstream networks to maintain differentiability.
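The 'soft' sentence trick replaces the embedding of a hard argmax token with the expected embedding under the softmax distribution, which is differentiable in the logits. A minimal sketch over a toy vocabulary, in pure Python in place of a tensor library:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    z = sum(exps)
    return [e / z for e in exps]

def soft_embedding(logits, embedding_table):
    """Expected embedding under softmax(logits): a differentiable
    stand-in for embedding_table[argmax(logits)]."""
    probs = softmax(logits)
    dim = len(embedding_table[0])
    return [sum(p * emb[d] for p, emb in zip(probs, embedding_table))
            for d in range(dim)]
```

When one logit dominates, the soft embedding approaches the hard one-hot lookup, so the downstream discriminator sees nearly the same input while gradients still flow to the generator.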

Experimental Setup
In FEED2RESP, the optimizer for both the generator and discriminator is AdamW. The learning rate of the generator is 5e-6, while the learning rate of the discriminator is 1e-4. The discriminator uses 4 stacked transformer layers and 4 attention heads. The token embedding size, style embedding size, positional embedding size and hidden size are all 256. For the BART (Lewis et al., 2019) generator, we use the implementation from HuggingFace (Wolf et al., 2019) and initialize the model with pretrained weights from the CNN/Daily Mail summarization task. Due to the characteristics of human responses (see Appendix A), we limit the length of text generation to a maximum of 50 words and impose a repetition penalty of 2.0 to improve the diversity of the output.
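The hyperparameters above can be gathered into a single configuration; the values are copied from the text, while the key names are our own.

```python
# Feed2Resp training configuration (values as reported in the text;
# key names are illustrative, not from the released code).
FEED2RESP_CONFIG = {
    "optimizer": "AdamW",
    "lr_generator": 5e-6,
    "lr_discriminator": 1e-4,
    "discriminator_layers": 4,
    "discriminator_heads": 4,
    "embedding_size": 256,       # token, style, and positional embeddings
    "hidden_size": 256,
    "generator_init": "BART pretrained on CNN/Daily Mail summarization",
    "max_generation_words": 50,
    "repetition_penalty": 2.0,
}
```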
While evaluating the effectiveness of the modified feedback responses, we use two implementations of dialog agents provided by ParlAI (Miller et al., 2017), BIENCODER and POLYENCODER. BIENCODER has two transformer layers and 2 attention heads. The optimizer is Adamax with learning rate of 0.0025. POLYENCODER uses 12 transformer layers and 12 attentions heads. The optimizer is Adamax with learning rate of 5e-05.
The hyperparameters for the best performing model were found by random sampling, with human evaluation used to verify the quality of the style transfer outputs. The full list of hyperparameters is given in Table 8.

Experiments
Our goal is to test whether feedback helps improve the chatbot. To do this, we compare models trained on conversational data with and without feedback data. Below we describe the chatbot evaluation setting, our datasets, the main models and different settings of these models with and without feedback.

Chatbot evaluation task and metrics
Following Hancock et al. (2019), we choose PersonaChat as the main evaluation dataset. This dataset consists of human-human conversations collected using crowdsourcing, where each crowdworker assumes a persona. Since persona representation is a challenging research problem on its own, Hancock et al. ignore the persona and just use the conversations to train chatbots, and we follow the same approach. At test time, the model is presented with the conversation history and 20 candidate responses, and has to pick the correct response. Thus, we use the HITS@1/20 metric for evaluation.
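HITS@1/20 simply measures how often the gold response receives the highest score among the 20 candidates; a minimal sketch:

```python
def hits_at_1(scores, gold_index):
    """Return 1 if the gold candidate has the highest score, else 0."""
    best = max(range(len(scores)), key=scores.__getitem__)
    return int(best == gold_index)

def hits_at_1_over_20(batch):
    """batch: list of (scores, gold_index) pairs, one per test example,
    where scores has one entry per candidate. Returns the metric in %."""
    return 100.0 * sum(hits_at_1(s, g) for s, g in batch) / len(batch)
```

A random ranker scores around 5% on 20 candidates, which is why even a few points of improvement on this metric is substantial.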

Feedback data
We use the feedback data collected by Hancock et al. (2019), as this removes orthogonal factors such as differences in chatbot interfaces and annotation frameworks, which are not the focus of this work. Hancock et al. collected this feedback by deploying bi-encoder chatbots (Section 4.3) trained on varying amounts of training data and having them converse with crowdworkers. Whenever the bot's response is not satisfactory, natural language feedback is collected from the crowdworker.
The data thus collected contains 60k human-bot turns, of which the last turn is always the feedback.

Chatbot Models
Given the conversation history and several candidate responses, the chatbot is trained to rank the correct candidate on top. We use the following models as our chatbots. BIENCODER (Hancock et al., 2019; Humeau et al., 2020) contains two transformers, one to summarize the conversation history and the other to summarize candidate responses into embeddings. The response with the highest similarity is taken as the best candidate response. POLYENCODER (Humeau et al., 2020) summarizes the context and candidate responses into several embeddings. In order to contextualize context and candidates together, it performs cross-encoder attention on the summary embeddings and scores each candidate.
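The bi-encoder's ranking step reduces to a similarity score between the context embedding and each independently encoded candidate embedding; a sketch with plain lists standing in for the transformer outputs:

```python
def dot(u, v):
    """Dot-product similarity between two embedding vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_candidates(context_emb, candidate_embs):
    """Return candidate indices sorted by similarity to the context,
    best first; the bi-encoder's prediction is the first index."""
    scores = [dot(context_emb, c) for c in candidate_embs]
    return sorted(range(len(candidate_embs)), key=lambda i: -scores[i])
```

Because candidates are encoded independently of the context, their embeddings can be precomputed, which is the bi-encoder's main efficiency advantage over cross-attention scoring.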

Feedback-based Models
We train and test the above models in the following settings.

Results and Discussion
The experimental details of the model variants are described in Section 3.

Feed2Resp analysis We randomly sample 200 feedback responses from FEED2RESP to determine the kind of modifications the model performs (Table 3). We observe three main types of modifications: Rewrite, Retain and Remove. REWRITE is when the feedback implies a hint to the answer but is not the answer itself. REMOVE is when the feedback contains the answer along with extraneous words that have to be removed. RETAIN covers cases where the model copies or paraphrases the feedback. Among these, REMOVE has the lowest modification accuracy. Upon inspection, we find that these are the cases which require multiple removals. For example, for You should reply with either yes or no, the model predicts yes or no together instead of either one of them.

Additionally, we visualize the attention maps of the discriminator to observe which words contribute most to its classification decision (Figure 3). The discriminator learns to distinguish feedback from normal dialog responses due to the presence of sequences like you could have, you should have, tell me, etc. Thus the generator learns to remove such extraneous sequences and make the feedback seem like plausible responses. We present a sample of modified outputs of FEED2RESP in Appendix C.

Conclusion
In this work, we show that while chatbots can be improved using natural language feedback, converting feedback to natural responses that fit in the conversation outperforms the naive usage of feedback. We presented FEED2RESP, a generative adversarial model that converts feedback to natural responses without requiring manually annotated parallel data. Our results show that FEED2RESP yields a 6 point improvement for the POLYENCODER chatbot, an already powerful dialog ranking agent. This is a strong result, as HITS@1/20 is a tough metric to improve upon (Hancock et al., 2019).
Our work joins the class of models that use natural language feedback to improve different tasks, e.g., image captioning (Ling and Fidler, 2017) and classification (Srivastava et al., 2017; Hancock et al., 2018; Murty et al., 2020). While these methods use feedback for reward shaping or feature extraction, we use feedback to produce a correct response using adversarial learning. We pose this problem as a style transfer problem, inspired by the style transfer literature (Shen et al., 2017; Xu et al., 2018; Conneau and Lample, 2019; Dai et al., 2019). While these works focus on studying the stylistic attributes of sentences, e.g., sentiment, we explore this problem in the context of improving chatbots.

A Dataset Statistics
We validate our approach to improving chatbot performance using the PERSONACHAT dialogue dataset and the Human-Bot feedback dataset (Hancock et al., 2019). Table 6 reports statistics for these datasets. To train the FEED2RESP model, we take the entire FEEDBACK dataset and an equal number of randomly chosen samples from the DIALOGUE dataset. We then use a train-dev-test split of 0.8:0.1:0.1 for training and evaluation of the model.

Task            Train   Valid   Test    Total
Style Transfer  96000   12000   12000   120000

Statistic                    Human-Human   Feedback
#Words in context (MEAN)     79            13
#Words in context (MEDIAN)   77            6
#Words per turn (MEAN)       11            6
#Words per turn (MEDIAN)     10.7          7.1
#Turns (MEAN)                4             1.5

We examine the average number of turns and words in dialogues from the feedback and human-human conversation distributions. We see that, on average, dialogues in the feedback distribution have fewer turns than human-human conversations, and also fewer words per turn.

2 https://parl.ai/projects/self feeding/

B Preparation of Training Data
We use the dataset provided by Hancock et al. (2019), which is a cleaner version of the PERSONACHAT dataset and comes with a new crowdsourced test set. We sample an equal number of examples from the DIALOGUE dataset, giving them label 0, and the FEEDBACK dataset, giving them label 1. The final response is combined with the last n turns using the delimiter [RES]. Typically, n = 2 turns are used for each conversation example. Conversation turns are separated with the delimiter tokens [P1] and [P2].
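The serialization described above can be sketched as follows. The speaker-alternation rule (first of the last n turns gets [P1]) and the whitespace handling are our assumptions; the delimiter tokens themselves are from the text.

```python
def build_example(turns, final_response, n=2):
    """Join the last n conversation turns with speaker delimiters
    [P1]/[P2] and append the response after the [RES] delimiter."""
    last = turns[-n:]
    parts = []
    for i, turn in enumerate(last):
        speaker = "[P1]" if i % 2 == 0 else "[P2]"
        parts.append(f"{speaker} {turn}")
    return " ".join(parts) + f" [RES] {final_response}"

example = build_example(
    ["what is your favourite movie?", "my favourite movie is toy story."],
    "how old are you?")
```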

C FEED2RESP examples
Here we include several examples of predictions from different models in Table 9.