Deep Reinforcement Learning For Modeling Chit-Chat Dialog With Discrete Attributes

Open domain dialog systems face the challenge of being repetitive and producing generic responses. In this paper, we demonstrate that conditioning the response generation on interpretable discrete dialog attributes and composed attributes improves model perplexity and yields diverse, interesting, non-redundant responses. We propose to formulate the dialog attribute prediction as a reinforcement learning (RL) problem and use policy gradient methods to optimize utterance generation using long-term rewards. Unlike existing RL approaches, which formulate the token prediction as a policy, our method reduces the complexity of the policy optimization by limiting the action space to dialog attributes, thereby making the policy optimization more practical and sample efficient. We demonstrate this with experimental and human evaluations.


Introduction
Following the success of neural machine translation systems (Bahdanau et al., 2015), there has been a growing interest in adapting encoder-decoder models to model open-domain conversations (Sordoni et al., 2015; Serban et al., 2016a,b; Vinyals and Le, 2015). This is done by framing the next utterance generation as a machine translation problem, treating the dialog history as the source sequence and the next utterance as the target sequence. The models are then trained end-to-end with a Maximum Likelihood (MLE) objective, without any hand-crafted structures such as slot-value pairs or a dialog manager used in conventional dialog modeling (Lagus and Kuusisto, 2002). Such data driven approaches are worth pursuing in the context of open-domain conversations since the next utterance distribution in open-domain conversations exhibits high entropy, which makes it impractical to manually craft good features.
* Work done during internship at Google
While the encoder-decoder approaches are promising, lack of specificity has been one of the many challenges (Wei et al., 2017) in modelling non-goal oriented dialogs. Recent encoder-decoder based models tend to generate generic or dull responses like "I don't know." One of the main causes is the implicit imbalance present in dialog datasets, which can handicap the models into generating uninteresting responses.
Imbalances in a dialog dataset can be broadly divided into two categories: many-to-one and one-to-many. Many-to-one imbalance occurs when the dataset contains very similar responses to several different dialog contexts. In such scenarios, the decoder learns to ignore the context (considering it as noise) and behaves like a regular language model. Such a decoder would not generalize to new contexts and will end up predicting generic responses for all contexts. In the one-to-many case, the dataset may exhibit a different type of imbalance, where a certain type of generic response is present in abundance compared to other plausible interesting responses for the same dialog context (Wei et al., 2017). When trained with a maximum-likelihood (MLE) objective, generative models tend to place more probability mass around the most commonly observed responses for a given context. As a result, we observe little variance in the generated responses in such cases. While these two imbalances are problematic for training a dialog model, they are also inherent characteristics of a dialog dataset and cannot be removed.
Several approaches have been proposed in the literature to address the generic response generation issue. One line of work proposes modifying the loss function to increase the diversity of the generated responses. The Multi-resolution RNN addresses the issue by additionally conditioning on entity information in the previous utterances. Alternatively, Song et al. (2016) use external knowledge from a retrieval model to condition the response generation. Latent variable models inspired by Conditional Variational Autoencoders (CVAEs) are explored in (Shen et al., 2017). While models with continuous latent variables tend to be uninterpretable, discrete latent variable models exhibit high variance during inference. Shen et al. (2017) append discrete attributes such as sentiment to the latent representation to generate the next utterance.

Contributions
New Conditional Dialog Generation Model. Drawing insights from (Shen et al., 2017), we propose a conditional utterance generation model in which the next utterance is conditioned on the dialog attributes corresponding to the next utterance. To do this, we first predict the higher level dialog attributes corresponding to the next response. Then we generate the next utterance conditioned on the dialog context and the predicted attributes. A dialog attribute of an utterance refers to a discrete feature or aspect associated with the utterance. Example attributes include dialog-acts, sentiment, emotion, speaker id, speaker personality or other user defined discrete features of an utterance. While previous research works lack a framework to learn to predict the attributes of the next utterance and mainly view the next utterance's attribute as a control variable in their models, our method learns to predict the attributes in an end-to-end manner. This alleviates the need to have utterances annotated with attributes during inference.
RL for Dialog Attribute Selection. Further, it also enables us to formulate the dialog attribute selection as a reinforcement learning (RL) problem and optimize the policy initialized by the supervised training using REINFORCE (Williams, 1992). While the supervised pre-training helps the model generate utterances coherent with the dialog history, the RL formulation encourages the model to generate utterances optimized for long term rewards such as diversity and user-satisfaction scores. Optimizing the policy over the discrete dialog attribute space is more practical because the action space is low dimensional rather than the entire vocabulary (as is common in policies that predict the next token to generate). By using REINFORCE (Williams, 1992) to further optimize the dialog attribute selection process, we show improvements in the specificity of the generated responses both qualitatively (based on human evaluations) and quantitatively (with respect to the diversity measures). The diversity scores, distinct-1 and distinct-2, are computed as the fraction of distinct uni-grams and bi-grams in the generated responses, as described in Li et al. (2015).
Improvements on Dialog Datasets Demonstrated through Quantitative and Qualitative Evaluations. Additionally, we annotate an existing open domain dialog dataset using dialog attribute classifiers trained with tagged datasets like Switchboard (Godfrey et al., 1992; Jurafsky et al., 1997) and Frames (Schulz et al., 2017), and demonstrate both quantitative improvements (in terms of token perplexity and embedding metrics (Rus and Lintean, 2012; Mitchell and Lapata, 2008)) and qualitative improvements (based on human evaluations) in generating interesting responses. In this work, we show results with two types of dialog attributes: sentiment and dialog-acts. This approach is worth investigating because we need not invest much in training classifiers for very high accuracy; we show empirically that annotations from classifiers with low accuracy are able to improve token perplexity. We conjecture that the irregularities in the auto-annotated dialog attributes induce a regularization effect while training deep neural networks, analogous to the dropout mechanism. Also, annotating utterances with many types of dialog attributes could increase the regularization effect and potentially tip the utterance generation in favor of certain low frequency but interesting responses.
In this work, we are mainly interested in exploring the impact of jointly modelling extra discrete dialog attributes along with the dialog history for next utterance generation. Although our approach is flexible enough to additionally include latent variables, we focus on the contribution of dialog attributes to addressing the "generic" response issue.

Attribute Conditional HRED
In this paper, we extend the HRED (Serban et al., 2016a) model (elaborated in the Appendix section) by jointly modelling the utterances with the dialog attributes of each utterance. HRED is an encoder-decoder model consisting of a token-level RNN encoder and an utterance-level RNN encoder to summarize the dialog context, followed by a token-level RNN decoder to generate the next utterance. The joint probability can be factorized into dialog attribute prediction, followed by next utterance generation conditioned on the predicted dialog attributes, as shown in equation 1.
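One way to write this factorization, consistent with the description in this section, is sketched below. The product form over the $K$ attributes reflects the conditional-independence assumption stated in the text; the dependence of attribute prediction on the attributes of earlier utterances is left implicit in the context here, which is a simplifying assumption of this sketch.

```latex
P(U_m, DA_{1:K} \mid U_{1:m-1})
  = P(DA_{1:K} \mid U_{1:m-1}) \; P(U_m \mid U_{1:m-1}, DA_{1:K}),
\qquad
P(DA_{1:K} \mid U_{1:m-1}) = \prod_{k=1}^{K} P(DA_k \mid U_{1:m-1})
```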
Here, $DA_{1:K}$ denotes the $K$ different dialog attributes corresponding to the utterance $U_m$, $U_m$ is the $m$-th utterance, and $U_{1:m-1}$ are the past utterances. For instance, if we condition on three dialog attributes (sentiment, dialog-acts and emotion), we would have $K = 3$. Further, we assume that the dialog attributes are conditionally independent given the dialog context. More simply, we predict the attributes of the next utterance and then condition on the previous context and the predicted attributes to generate the next utterance.

Figure 1: Dialog attribute classification. We predict the dialog attribute of the next utterance based on the previous context and the attributes corresponding to the previous utterances. Only a single attribute is depicted for convenience.

Dialog Attribute Prediction
We predict the dialog attribute of the next utterance conditioned on the context vector (i.e., the summary of the previous utterances) and the dialog attributes of the previous utterances. We first pass the attributes of all the previous utterances through an RNN. We then combine only the last hidden state of this RNN with the context vector to predict the dialog attribute of the next utterance, as shown in Figure 1.
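The prediction step described above can be sketched in plain Python as follows. This is a minimal illustration and not the authors' implementation: it assumes a simple tanh RNN in place of a GRU, randomly initialized weights, combination by concatenation, and a single attribute type; the class and method names (`AttributePredictor`, `encode_history`, `predict`) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


class AttributePredictor:
    """Toy predictor: tanh RNN over past attribute ids plus a softmax head.

    Assumption: a plain tanh RNN stands in for the recurrent cell, and the
    weights are random, so the outputs are illustrative only.
    """

    def __init__(self, num_attrs, attr_dim, ctx_dim, seed=0):
        rng = random.Random(seed)
        self.attr_dim = attr_dim
        self.embed = rand_matrix(num_attrs, attr_dim, rng)
        self.W_in = rand_matrix(attr_dim, attr_dim, rng)
        self.W_hh = rand_matrix(attr_dim, attr_dim, rng)
        self.W_out = rand_matrix(num_attrs, attr_dim + ctx_dim, rng)

    def encode_history(self, past_attr_ids):
        """Run the RNN over past attributes and return its last hidden state."""
        h = [0.0] * self.attr_dim
        for attr_id in past_attr_ids:
            x = self.embed[attr_id]
            pre = [a + b for a, b in zip(matvec(self.W_in, x), matvec(self.W_hh, h))]
            h = [math.tanh(p) for p in pre]
        return h

    def predict(self, context_vector, past_attr_ids):
        """Concatenate the last RNN state with the context vector and classify."""
        features = self.encode_history(past_attr_ids) + list(context_vector)
        return softmax(matvec(self.W_out, features))


predictor = AttributePredictor(num_attrs=10, attr_dim=4, ctx_dim=6)
probs = predictor.predict(context_vector=[0.1] * 6, past_attr_ids=[3, 1])
```

The returned list is a probability distribution over the attribute classes for the next utterance; in the RL setting discussed later, the same distribution would serve as the policy from which an attribute is sampled.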
If the dialog dataset is not annotated with dialog attributes, we build a classifier (trained on a manually tagged dataset) to annotate the dialog attributes. This classifier is a simple MLP. We empirically show that this classifier need not have high accuracy to improve the dialog modeling. We hypothesize that a few misclassified attributes could provide a regularization effect similar to the dropout mechanism (Srivastava et al., 2014).

Conditional Response Generation
After the dialog attributes prediction, we generate the next utterance conditioned on the dialog context and the predicted attributes as shown in Figure 2. Token generation of the next utterance is modelled as in equation 2. The context and attributes are combined by concatenating their corresponding hidden states.
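A decoder recurrence consistent with this description can be written as below. The symbol $w_{m,n-1}$ for the previously generated token is an assumption of this sketch; the remaining symbols follow the definitions given in this section.

```latex
h^{dec}_{m,n} = f_{dec}\left(h^{dec}_{m,n-1},\; w_{m,n-1},\; s_{m-1},\; da^{1}_{m}, da^{2}_{m}, \dots, da^{K}_{m}\right)
```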
Here, $h^{dec}_{m,n}$ is the recurrent hidden state of the decoder after seeing $n-1$ words in the $m$-th utterance, $f_{dec}$ is the token-level response decoder, $s_{m-1}$ is the summary of the previous $m-1$ utterances (the recurrent hidden state of the utterance-level encoder), and $da^{1}_{m}, da^{2}_{m}, \dots, da^{K}_{m}$ are the $K$ dialog attribute embeddings corresponding to the $m$-th utterance.
During inference, we first predict the dialog attributes of the dialog context. We then predict the dialog attribute of the next utterance conditioned on the predicted attribute and the hierarchical utterance representations. We combine the predicted attribute's embedding vector with the context representation to generate the next utterance. Looking from another perspective, we could formulate the conditional utterance generation problem as a multi-task problem where we jointly learn to predict the dialog attributes and tokens of the next utterance.

RL for Dialog Attribute Prediction
Often the MLE objective does not capture the true goal of the conversation and lacks a framework that can take developer-defined rewards into account for modelling such goals. Also, MLE-based seq2seq models fail to model the long term influence of utterances on the dialog flow, causing coherency issues. This calls for a Reinforcement Learning (RL) based framework that can optimize policies for maximizing long term rewards. At its core, the MLE objective tries to increase the conditional utterance probabilities and influences the model to place higher probabilities on commonly occurring utterances. RL based methods, on the other hand, circumvent this issue by shifting the optimization problem to maximizing long term rewards, which could promote diversity, coherency, etc.
Previous approaches (Kottur et al., 2017; Lewis et al., 2017) propose to model the token prediction of the next utterance as a reinforcement learning problem and optimize the models to maximize hand-crafted rewards for improving diversity, coherency, and ease of answering. These approaches involve pre-training the encoder-decoder models with supervised training and then refining the utterance generation further with RL using the hand-engineered rewards. Their state space consists of the dialog context representation (encoder hidden states). Their action space at a given time step includes all possible words that the decoder can generate, which is very large.
While this approach is appealing, policy gradient methods are known to suffer from high variance when using large action spaces. This makes training extremely unstable and requires significant engineering efforts to train successfully.
Another potential drawback of acting directly over the vocabulary space is that the RL optimization procedure tends to strip away the linguistic and natural language aspects learned during the supervised pre-training step, as observed in (Kottur et al., 2017; Lewis et al., 2017). Since the primary focus of the RL objective function is to improve the final reward (which may not emphasize the linguistic aspects of the generated responses, e.g., diversity scores), the optimization algorithm could lead the decoder into generating unnatural responses. We propose to avoid both issues by reducing the action space to a higher level abstraction space, i.e., the dialog attributes. Our action space comprises the discrete dialog attributes and the state space is the dialog context. Intuitively, this enables the RL policy to view the dialog attributes as control variables for improving dialog flow and modelling long term influence. For instance, if the input utterance was "how old are you?", an RL policy optimized to maximize conversation length and engagement could choose to set one of the next utterance attributes to a question type to generate a response like "why do you ask?" instead of a straightforward answer, to keep the conversation engaging. Thus, we believe that this approach enables the model to predict such rare but interesting utterances, to which the MLE objective fails to give attention.
Our policy network comprises the encoders and the attribute prediction network. Given the previous utterances $U_{1:m-1}$, the policy network first encodes them using the encoders. This encoded representation is then passed to the attribute prediction network, whose output is the action. While there are many ways to design the reward function, we adopt the ease-of-answering reward introduced by Li et al. (2016): the negative log-likelihood of a set of manually constructed dull utterances (usually the most commonly occurring phrases in the dataset) in response to the next generated utterance. Let $S$ be the set of dull utterances. With the dialog-acts $DA_{1:K}$ sampled from the policy network, we generate the next utterance $U_m$ using the decoder. Then we add this generated utterance to the context and predict the probability of seeing one of the dull utterances at step $m+1$. This is used to compute the reward (equation 4), in which the log-likelihood of each dull utterance $s$ is normalized by $N_s$, the number of tokens in $s$. The normalization prevents the reward function from attending only to the longer dull responses. We use REINFORCE (Williams, 1992) to optimize our policy, $P_{RL}(DA_{1:K} \mid U_{1:m-1})$. The expected reward is given by equation 5.
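A form of the reward and its expectation that matches the verbal description is sketched below. The averaging over $|S|$ follows the ease-of-answering reward of Li et al. (2016) and is an assumption here, as are the symbols $R$ (reward), $J$ (expected reward) and $\theta$ (policy parameters).

```latex
R = -\frac{1}{|S|} \sum_{s \in S} \frac{1}{N_s} \log P\left(s \mid U_{1:m}\right),
\qquad
J(\theta) = \mathbb{E}_{DA_{1:K} \sim P_{RL}(\cdot \mid U_{1:m-1})}\left[ R \right]
```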
The gradient is estimated as in equation 6.
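The standard single-sample REINFORCE estimate with a baseline takes the following form, written with the policy notation used in this section; $R$ denotes the reward and $\theta$ the policy parameters (symbols assumed for this sketch).

```latex
\nabla_{\theta} J(\theta) \approx (R - b)\, \nabla_{\theta} \log P_{RL}\left(DA_{1:K} \mid U_{1:m-1}\right)
```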
Here, $b$ is the reward baseline (computed as the running average of the rewards during training). We initialize the policy with the supervised training and add an L2 loss that penalizes the network weights for moving away from the supervised network weights.
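The update described above can be illustrated with a small self-contained sketch. It is not the authors' training code: the policy is reduced to a state-independent softmax over three hypothetical attributes, the reward table `toy_rewards` is invented for illustration, the running average is implemented as an exponential moving average, and the L2 term pulls the logits toward their supervised initialization `sup_logits`. All function names and default values are assumptions.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ease_of_answering_reward(dull_log_probs, dull_lengths):
    """Negative length-normalized log-likelihood of dull replies, averaged over S."""
    terms = [lp / n for lp, n in zip(dull_log_probs, dull_lengths)]
    return -sum(terms) / len(terms)


def reinforce_step(logits, sup_logits, action, reward, baseline, lr=0.1, l2=0.01):
    """One gradient-ascent step on (R - b) * log pi(action) minus an L2 anchor.

    For a softmax policy, d log pi(action) / d logit_i = 1[i == action] - pi_i.
    The L2 term is (l2 / 2) * ||logits - sup_logits||^2.
    """
    probs = softmax(logits)
    advantage = reward - baseline
    updated = []
    for i, (theta, theta_sup, p) in enumerate(zip(logits, sup_logits, probs)):
        grad_log_prob = (1.0 if i == action else 0.0) - p
        grad = advantage * grad_log_prob - l2 * (theta - theta_sup)
        updated.append(theta + lr * grad)
    return updated


def train_toy_policy(steps=500, seed=0):
    """Fine-tune a 3-way attribute policy on a fixed, made-up reward table."""
    rng = random.Random(seed)
    sup_logits = [0.0, 0.0, 0.0]      # stands in for the supervised policy
    logits = list(sup_logits)
    toy_rewards = [0.5, 2.0, 0.5]     # hypothetical: attribute 1 avoids dull replies
    baseline = 0.0
    for _ in range(steps):
        probs = softmax(logits)
        action = rng.choices(range(len(probs)), weights=probs)[0]
        reward = toy_rewards[action]
        logits = reinforce_step(logits, sup_logits, action, reward, baseline)
        baseline = 0.9 * baseline + 0.1 * reward  # running-average baseline
    return softmax(logits)


final_probs = train_toy_policy()
```

In the full model the reward would come from `ease_of_answering_reward` applied to the decoder's log-likelihoods of the dull utterances in $S$, and the gradient would flow through the encoders and the attribute prediction network rather than a bare logit vector.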

Training Setup
Datasets: We start with the Reddit-discourse dataset for training dialog attribute classifiers and modelling utterance generation.
Reddit: The Reddit discourse dataset is manually pre-annotated with dialog-acts via crowd sourcing. The dialog-acts comprise answer, question, humor, agreement, disagreement, appreciation, negative reaction, elaboration, and announcement. The dataset comprises conversations from around 9000 randomly sampled Reddit threads with over 100000 comments and an average of 12 turns per thread.
Open-Subtitles: Additionally, we show results with the unannotated Open-Subtitles dataset (Tiedemann, 2009) (we randomly sample up to 2 million dialogs for training and validation). We tag the dataset with dialog attributes using pre-trained classifiers.
We experiment with two types of dialog attributes in this paper: sentiment and dialog-acts. We annotate the utterances with sentiment tags (positive, negative, neutral) using the Stanford Core-NLP tool (Manning et al., 2014). We adopt the dialog-acts from two annotated dialog corpora: Switchboard (Godfrey et al., 1992) and Frames (Schulz et al., 2017).
Switchboard: The Switchboard corpus (Godfrey et al., 1992) is a collection of 1155 chit-chat style telephone conversations based on 70 topics. Jurafsky et al. (1997) revised the original tags to 42 dialog-acts. In our experiments, we restrict the dialog-acts to the top 10 most frequently annotated tags in the corpus: Statement-non-opinion, Acknowledge, Statement-opinion, Agree/Accept, Abandoned or Turn-Exit, Appreciation, Yes-No-Question, Non-verbal, Yes answers, and Conventional-closing. Keeping only the top 10 most frequent tags is a simple way to avoid the class imbalance issue when training the dialog attribute classifiers (the Statement-non-opinion act is tagged 72824 times, while Thanking is tagged only 67 times).
Frames: Frames (Schulz et al., 2017) is a task oriented dialog corpus collected in the Wizard-of-Oz fashion. It comprises 1369 human-human dialogues with an average of 15 turns per dialog. The wizards had access to a database of hotel and flight information and had to converse with users to help finalize vacation plans. The dataset has 20 different types of dialog-act annotations. As with the Switchboard corpus, we adopt the top 10 most frequently occurring acts in the dataset for our experiments: inform, offer, request, suggest, switch-frame, no result, thank you, sorry, greeting, and affirm.
Model Details: We use two-layer GRUs (Chung et al., 2014) for both the encoders and the decoders, with hidden sizes of 512. We restrict the vocabulary for both datasets to the 25000 most frequently occurring tokens. The dialog attribute classifier is a simple 2-layer MLP with layer sizes of 256 and 10, respectively. We use the rectified linear unit (ReLU) as the non-linear activation function for the MLPs and a dropout rate of 0.3 for the token embeddings and the hidden-hidden transition matrices of the encoder and decoder GRUs.
Training Details: We ran our experiments on Nvidia Tesla K80 GPUs and optimized using the ADAM optimizer with the default hyperparameters used in (Merity et al., 2017, 2018). All models are trained with a batch size of 128 and a learning rate of 0.0001.
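For reference, the hyperparameters listed in this section can be collected into a small configuration sketch. The class and field names are hypothetical and chosen only for illustration; the values are the ones stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelConfig:
    """Model hyperparameters as stated in the text (field names are assumptions)."""
    rnn_cell: str = "gru"
    num_layers: int = 2
    hidden_size: int = 512
    vocab_size: int = 25000
    classifier_layer_sizes: tuple = (256, 10)
    activation: str = "relu"
    dropout: float = 0.3


@dataclass(frozen=True)
class TrainConfig:
    """Training hyperparameters as stated in the text (field names are assumptions)."""
    optimizer: str = "adam"
    batch_size: int = 128
    learning_rate: float = 1e-4


model_cfg = ModelConfig()
train_cfg = TrainConfig()
```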

Experimental Results
In this section, we present the experimental results along with qualitative analysis.
In Section 4.1, we discuss the dialog attribute classification results for different model architectures trained on the Reddit, Switchboard and Frames datasets.
In Section 4.2, we first demonstrate quantitative improvements (token perplexity/embedding based metrics) for the Attribute conditional HRED model with the manually annotated Reddit dataset. Further, we discuss the model perplexity improvements along with sample conversations and human evaluation results on the Open-Subtitles dataset. We annotate it with sentiment and dialog-acts (from Switchboard/Frames datasets) using pre-trained classifiers described in Section 4.1.
Finally, in Section 4.3, we analyze the quality of the generated responses after RL fine-tuning using diversity scores (distinct-1, distinct-2), sample conversations and human evaluation results for diversity and relevance.

Dialog Attribute Prediction
In this section, we present experiments with the model architectures for dialog attribute prediction, using dialog-acts from the Reddit, Switchboard and Frames datasets. First, we report the performance of the dialog-acts classifiers on the Reddit dataset, as shown in Table 1.

[Table 1: Model | Acc(%)]

The model $F(U_t)$ refers to the architecture which predicts the dialog-acts based on the current utterance $U_t$ alone. The tokens in the current utterance $U_t$ are fed through a two-layer GRU and the final hidden state is used to predict the dialog-acts. The model $F(DA_{t-1,t-2})$ predicts the current utterance's dialog-acts $DA_t$ based on the dialog-acts corresponding to the previous two utterances. We treat the dialog-acts prediction problem as a sequence modelling problem, where we feed the dialog-acts into a single-layer GRU and predict the current dialog-acts conditioned on the previous dialog-acts. We settled on conditioning on the dialog-acts of the previous two utterances alone, as we did not observe any boost in classifier performance from older dialog-acts. As seen in Table 1, conditioning additionally on the dialog attributes helps improve classifier performance.
Next, we train classifiers to predict the dialog-acts of utterances in the Switchboard and Frames corpora. In our experiments, the number of act types is 11: the top 10 most frequently occurring acts in the corpus and an "others" category covering the rest of the tags.
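The label construction described here (the 10 most frequent acts plus a catch-all class) can be sketched as follows. The function names and the label string `"others"` are illustrative assumptions, not part of any released code.

```python
from collections import Counter


def build_label_map(tags, top_k=10, other_label="others"):
    """Map the top_k most frequent tags to ids 0..top_k-1 and reserve one id for the rest."""
    counts = Counter(tags)
    kept = [tag for tag, _ in counts.most_common(top_k)]
    label_to_id = {tag: i for i, tag in enumerate(kept)}
    label_to_id[other_label] = len(kept)
    return label_to_id


def encode_tag(tag, label_to_id, other_label="others"):
    """Return the class id of a tag, falling back to the catch-all class."""
    return label_to_id.get(tag, label_to_id[other_label])


# Tiny usage example with made-up tag counts.
example_tags = ["statement"] * 5 + ["question"] * 3 + ["thanking"]
label_map = build_label_map(example_tags, top_k=2)
thanking_id = encode_tag("thanking", label_map)  # falls into the "others" class
```

With `top_k=10`, this yields the 11 classes used for the Switchboard and Frames classifiers.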
As seen from Table 2, classifier performance is not high, and yet the resulting annotations contribute to improvements in perplexity for the conditional Seq2Seq models (discussed in Section 4.2). While we aim for better classifier performance, it is important to note that the primary objective of such dialog attribute classifiers is to tag unannotated open-domain dialog datasets. As future work, we will study how the classification errors influence response generation.

[Table 2: Corpus | Num Acts | Acc(%)]

[Table 3: Metric | LM | Seq2Seq | Seq2Seq+Attr; Perplexity: 176 ...]

Reddit: First, we evaluate Seq2Seq models trained on the manually annotated Reddit corpus, as shown in Table 3. Seq2Seq+Attr refers to our model, where we additionally condition on the dialog-acts. We use the notation "Attr" here to maintain generality, as it may refer to other dialog attributes like sentiment later in this section. For both the baseline and conditional Seq2Seq models, we consider a dialog context involving the previous two turns, as we did not observe significant performance improvement with three or more turns. We use a 2-layer GRU language model as a baseline for comparison. As seen from Table 3, Seq2Seq+Attr fares well both in terms of perplexity and embedding metrics. The higher perplexity observed on the Reddit corpus could be due to the presence of several topics in the dataset (which therefore exhibits high entropy) and to the smaller number of dialogs compared to other open domain dialog datasets.
Open-Subtitles: With promising results on the manually tagged Reddit corpus, we now evaluate our attribute conditional HRED model on the unannotated Open-Subtitles dataset. We tag the Open-Subtitles dataset with sentiment tags using the Stanford Core-NLP tool (Manning et al., 2014) and with dialog-acts from the Frames and Switchboard corpora using the pre-trained classifiers described in Section 4.1.
In Table 4, we compare the model perplexity when training on dialog corpora of varying size. In most cases, we observe that conditioning on acts from both Frames and Switchboard yields the lowest perplexity. The perplexity improvement is substantial for smaller datasets, which is also corroborated by the experiments with the Reddit dataset.
Human Evaluation: Following the human evaluation setting of prior work, we randomly sample 200 input messages and the generated outputs from the Seq2Seq+Attr and Seq2Seq models. We present each of them to 3 judges and ask them to decide which of the two outputs is 1) more relevant and 2) more diverse or interesting. Ties are permitted. Results for human evaluation are shown in Table 8. We observe that Seq2Seq+Attr performs better than the Seq2Seq model both in terms of diversity and relevance. Note that the gain of the Seq2Seq+Attr model is larger for diversity than for relevance. This is in line with our expectations, as the purpose of dialog attribute annotations is to help the model focus better on less-frequent responses.
Additionally, we present a few sample conversations in Table 6, where we observe that the Seq2Seq+Attr model generates more interesting responses.

RL For Dialog Attribute Prediction
For the RL fine-tuning, we report the diversity scores of the generated responses with the models trained on the Open-Subtitles dataset in Table 7.
The diversity scores, distinct-1 and distinct-2, are computed as the fraction of distinct uni-grams and bi-grams in the generated responses, following the previous work by Li et al. (2015). We use the model conditioned on acts from both Switchboard and Frames for the Seq2Seq+Attr and RL cases. The action space for the policy in this case covers the 10 acts each from Switchboard and Frames. We choose a collection of commonly occurring phrases in the Open-Subtitles dataset as the set of dull responses, $S$, for the reward computation in equation 4. We observe that the RL fine-tuning improves over the conditional Seq2Seq model in terms of the diversity scores.
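As an illustration of the diversity scores, the sketch below computes distinct-n as the number of distinct n-grams divided by the total number of generated tokens, which is the scaling used by Li et al. (2015). Whitespace tokenization and the function name `distinct_n` are assumptions made for this example.

```python
def distinct_n(responses, n):
    """Number of distinct n-grams across all responses, scaled by total token count."""
    total_tokens = 0
    ngrams = set()
    for response in responses:
        tokens = response.split()  # assumption: whitespace tokenization
        total_tokens += len(tokens)
        ngrams.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(ngrams) / total_tokens if total_tokens else 0.0


generated = ["i dont know", "i dont think so"]
distinct_1 = distinct_n(generated, 1)  # 5 distinct uni-grams over 7 tokens
distinct_2 = distinct_n(generated, 2)  # 4 distinct bi-grams over 7 tokens
```

A model that repeats the same generic phrases produces few distinct n-grams relative to the number of generated tokens, and therefore a low score.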
Human Evaluation: As described in Section 4.2, we present each of the 200 randomly sampled input-response pairs of the Seq2Seq+Attr and RL models to 3 judges and ask them to rate each sample for diversity and relevance. From Table 8, we can see that the RL model performs significantly better both in terms of diversity and relevance.
Qualitative Analysis: In Table 9, we present the percentage of commonly occurring generic responses from the Open-Subtitles dataset in the validation set samples corresponding to the RL and Seq2Seq+Attr models. We observe very low percentages of such generic responses in the samples after RL fine-tuning. It is interesting to note that the RL model has successfully learned to minimize the generation of other dull responses like "I would love to be", "I would love to see" and "I dont want to", apart from the expected dull responses in $S$ (used in the reward computation). At the same time, the RL model scores higher in terms of the relevancy metric, as seen in the human evaluation results (Table 8). Additionally, we present a few sample conversations in Table 10, where we observe that the RL model generates more diverse and relevant responses.
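A percentage of this kind could be computed along the following lines. This is only a sketch: the case-insensitive substring match and the function name are assumptions, and the example responses are illustrative rather than actual model outputs from the evaluation.

```python
def generic_response_percentage(responses, dull_phrases):
    """Percentage of responses that contain at least one dull phrase."""
    if not responses:
        return 0.0
    dull = [phrase.lower() for phrase in dull_phrases]
    hits = sum(1 for r in responses if any(p in r.lower() for p in dull))
    return 100.0 * hits / len(responses)


samples = [
    "i dont know what to say",
    "this is the best thing ive ever seen",
    "I would love to see it",
    "why do you ask",
]
pct = generic_response_percentage(samples, ["i dont know", "i would love to see"])
```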

Related Work
There are several works focusing on dialog-acts classification and clustering based analysis (Reithinger and Klesen, 1997; Liu, 2006; Khanpour et al., 2016; Ang et al., 2005; Crook et al., 2009; Stolcke et al., 2000; Ezen-Can and Boyer, 2013). Shen et al. (2017) additionally add a sentiment feature to the latent variables in the VAE setting for utterance generation. In our work, we use dialog attributes from different sources, the Switchboard and Frames corpora, to model utterance generation in a more realistic setting. As for the RL setting, existing research efforts include (Dhingra et al., 2016; Jaques et al., 2016), which formulate the token prediction as an RL policy in Seq2Seq models. However, searching over a huge vocabulary space typically involves training with a huge number of samples and careful fine-tuning of the policy optimization algorithms. Additionally, as discussed in Section 2.3, it requires precautionary measures to prevent the RL algorithm from removing the linguistic aspects of the generated utterances. In another related research work, Serban et al. (2017) use dialog-acts as one of their hand-crafted features to select responses from an ensemble of dialog systems. They use dialog-acts in their RL policy; however, their action space comprises responses from an ensemble of dialog models. They include dialog-acts in the features for their distributed state representation.

Sample conversations (Seq2Seq+Attr vs. RL):
Input: i'm honestly a bit confused why no one has brought me or my books any cake
Seq2Seq+Attr: i dont think i dont think anything
RL: i cant wait to see you in the city.
Input: ive been playing spaceship with my year old niece for the past few days
Seq2Seq+Attr: i dont even know what i was talking about.
RL: this is the best thing ive ever seen.
Input: it makes me so happy that you like them
Seq2Seq+Attr: i dont know what i was thinking about it
RL: i was just thinking about the same thing

Conclusion
In this work, we address the dialog utterance generation problem by jointly modeling the previous dialog context and discrete dialog attributes. We analyze the model both quantitatively (model perplexity and embedding based metrics) and qualitatively (human evaluation, sample conversations) to validate that composed dialog attributes help generate interesting responses. Further, we formulate the dialog attribute prediction problem as a reinforcement learning problem. We fine-tune the attribute selection policy network trained with supervised learning using REINFORCE and demonstrate improvements in diversity scores compared to the Seq2Seq model. In the future, we plan to extend the model to additional dialog attributes such as emotion and speaker persona, and to evaluate the controllability of the responses based on the dialog attributes.