Towards Coherent and Engaging Spoken Dialog Response Generation Using Automatic Conversation Evaluators

Encoder-decoder based neural architectures serve as the basis of state-of-the-art approaches in end-to-end open domain dialog systems. Since most such systems are trained with a maximum likelihood estimation (MLE) objective, they suffer from issues such as lack of generalizability and the generic response problem, i.e., a system response that can be an answer to a large number of user utterances, e.g., “Maybe, I don’t know.” Having explicit feedback on the relevance and interestingness of a system response at each turn can be a useful signal for mitigating such issues and improving system quality by selecting responses from different approaches. Towards this goal, we present a system that evaluates chatbot responses at each dialog turn for coherence and engagement. Our system provides explicit turn-level dialog quality feedback, which we show to be highly correlated with human evaluation. To show that incorporating this feedback in the neural response generation models improves dialog quality, we present two different and complementary mechanisms to incorporate explicit feedback into a neural response generation model: reranking and direct modification of the loss function during training. Our studies show that a response generation model that incorporates these combined feedback mechanisms produces more engaging and coherent responses in an open-domain spoken dialog setting, significantly improving the response quality using both automatic and human evaluation.


Introduction
Due to recent advances in spoken language understanding and automatic speech recognition, conversational interfaces such as Alexa, Cortana, and Siri have become increasingly common. While these interfaces focus on completing specific tasks, there is an increasing interest in building conversational systems that can engage in more social and natural conversations. Building systems that can have a general conversation in an open domain setting is a challenging problem, but it is an important step towards more natural human-machine interactions. Recently, there has been significant interest in building chatbots (Sordoni et al., 2015; Wen et al., 2015) fueled by the availability of popular dialog data sets such as Ubuntu, Twitter, and Movie dialogs (Lowe et al., 2015; Ritter et al., 2011; Danescu-Niculescu-Mizil and Lee, 2011). However, work on human-machine spoken dialog is relatively under-explored. Spoken dialog poses additional challenges such as automatic speech recognition errors and divergence between spoken and written language.
Sequence-to-sequence (seq2seq) models (Sutskever et al., 2014) and their extensions (Luong et al., 2015; Sordoni et al., 2015; Li et al., 2015), which are used for neural machine translation (MT), have been widely adopted for dialog generation systems. In MT, given a source sentence, the correctness of the target sentence can be measured by semantic similarity to the source sentence. However, in open-domain conversations, a generic utterance such as "sounds good" could be a valid response to a large variety of statements. These seq2seq models are commonly trained on a maximum likelihood objective, which leads the models to place uniform importance on all user utterance and system response pairs. Thus, these models usually choose "safe" responses as they frequently appear in the dialog training data. This phenomenon is known as the generic response problem. These responses, while arguably correct, are bland and convey little information, leading to short conversations and low user satisfaction.
Since response generation systems are trained by maximizing the average likelihood of the training data, they do not have a clear signal on how well the current conversation is going. We hypothesize that having a way to measure conversational success at every turn could be valuable information that can guide system response generation and help improve system quality. Such a measurement may also be useful for combining responses from various competing systems. To this end, we build a supervised conversational evaluator to assess two aspects of responses: engagement and coherence. The inputs to our evaluators are encoded conversations represented as fixed-length vectors as well as hand-crafted dialog and turn level features. The system outputs explicit scores on coherence and engagement of the system response.
We experiment with two ways to incorporate these explicit signals in response generation systems. First, we use the evaluator outputs as input to a reranking model, which is used to rescore the n-best outputs obtained after beam search decoding. Second, we propose a technique to incorporate the evaluator loss directly into the conversational model as an additional discriminatory loss term. Using both human and automatic evaluations, we show that both of these methods significantly improve the system response quality. The combined model utilizing reranking and the composite loss outperforms models using either mechanism alone.
The contributions of this work are two-fold. First, we experiment with various hand-crafted features and conversational encoding schemes to build a conversational evaluation system that can provide explicit turn-level feedback to a response generation system on this highly subjective task. This system can be used independently to compare various response generation systems or as a signal to improve response generation. Second, we experiment with two complementary ways to incorporate explicit feedback into response generation systems and show improvement in dialog quality using automatic metrics as well as human evaluation.

Related Work
There are two major themes in this work. The first is building evaluators that allow us to estimate human perceptions of coherence, topicality, and interestingness of responses in a conversational context. The second is the use of evaluators to guide the generation process. As a result, this work is related to two distinct bodies of work.
Automatic Evaluation of Conversations: Learning automatic evaluation of conversation quality has a long history (Walker et al., 1997). However, we still do not have widely accepted solutions. Due to the similarity between conversational response generation and MT, automatic MT metrics such as BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) are widely adopted for evaluating dialog generation. ROUGE (Lin and Hovy, 2003), which is also used for chatbot evaluation, is a popular metric for text summarization. These metrics primarily rely on token-level overlap over a corpus (also synonymy in the case of METEOR), and therefore are not well-suited for dialog generation since a valid conversational response may not have any token-level or even semantic-level overlap with the ground truths. While the shortcomings of these metrics are well known for MT (Graham, 2015; Espinosa et al., 2010; Cahill, 2009), the problem is aggravated for dialog generation evaluation because of the much larger output space (Liu et al., 2016). However, due to the lack of clear alternatives, these metrics are still widely used for evaluating response generation (Ritter et al., 2011; Lowe et al., 2017). To ensure comparability with other approaches, we report results on these metrics for our models.
To tackle the shortcomings of these automatic metrics, there has been an effort to build models to score conversations. Lowe et al. (2017) train a model to predict the score of a system response given dialog context. However, they work with tiny data sets (around 4000 sentences) in a non-spoken setting. Tao et al. (2017) try to address the expensive annotation process by adding in unsupervised data. However, their metric is not interpretable, and the results are also not shown in a spoken setting. Our work differs from the aforementioned works as the output of our system is interpretable at each dialog turn.
There has also been work on building evaluation systems that focus on specific aspects of dialog. Li et al. (2016c) use features for information flow, and Yu et al. (2016) use features for turn-level appropriateness. Guo et al. (2017) use topic information to define conversational depth and breadth, which are shown to correlate well with human judgment. However, these metrics are based on a narrow aspect of the conversation and fail to capture the broad range of phenomena that lead to a good dialog.
Improving System Response Generation: Seq2Seq models have allowed researchers to train dialog models without relying on handcrafted dialog acts and slot values. Using maximum mutual information (MMI) (Li et al., 2015) was one of the earlier attempts to make conversational responses more diverse (Serban et al., 2016b,a). Shao et al. (2017) use a segment ranking beam search to produce more diverse responses. Our method extends the strategy employed by Shao et al. (2017) by utilizing a trained model as the reranking function.
More recently, there have been works which aim to alleviate this problem by incorporating conversation-specific rewards in the learning process. Yao et al. (2016) use the IDF value of generated sentences as a reward signal. Xing et al. (2017) use topics as an additional input while decoding to produce more specific responses. Li et al. (2016b) add personal information to make system responses more user specific. Li et al. (2017) use distillation to train different models at different levels of specificity and use reinforcement learning to pick the appropriate system response. Zhou et al. (2017) and Zhang et al. (2018) introduce latent factors in the seq2seq models that control specificity in neural response generation. There has been recent work which combines responses from multiple sub-systems (Serban et al., 2017; Papaioannou et al., 2017) and ranks them to output the final system response. Our method complements these approaches by introducing a novel learned-estimator model as the additional reward signal.

Data
The data used in this study was collected during the Alexa Prize (Ram et al., 2017) competition and shared with the teams who were participating in the competition. Upon initiating the conversation, users were paired with a randomly selected chatbot built by the participants. At the end of the conversation, the users were prompted to rate the chatbot quality, from 1-5, with 5 being the highest.
We randomly sampled more than 15K conversations (approximately 160K turns) collected during the competition. These were annotated for coherence and engagement (see Section 3.1) and used to train the conversation evaluators. For training the response generators, we selected highly-rated user conversations, which resulted in around 370K conversations containing 4M user utterances and their corresponding system responses. One notable statistic is that user utterances are typically very short (mean: 3.6 tokens) while the system responses generally are much longer (mean: 23.2 tokens).

Annotations
Asking annotators to measure coherence and engagement directly is a time-consuming task. We observed that we could collect data much faster if we asked direct "yes" or "no" questions to our annotators. Hence, upon reviewing a user-chatbot interaction along with the entire conversation up to the current turn, annotators rated each chatbot response as "yes" or "no" on the following criteria:
• The system response is comprehensible: The information provided by the chatbot made sense with respect to the user utterance.
• The system response is on topic: The chatbot response was on the same topic as the user utterance or was relevant to the user utterance. For example, if a user asks about a baseball player on the LA Dodgers, then the chatbot mentions something about the baseball team.
• The system response is interesting: The chatbot response contains information which is novel and relevant. For example, the chatbot would provide an answer about a baseball player and give some additional information to create a fleshed-out response.
• I want to continue the conversation: Given the current state of the conversation and the system response, there is a natural way to continue the conversation. For example, this could be due to the system asking a question about the current conversation subject.
We use these questions as proxies for measuring coherence and engagement of responses. The answers to the first two questions ("comprehensible" and "on topic") are used as a proxy for coherence. Similarly, the answers to the last two questions ("interesting" and "continue the conversation") are used as a proxy for engagement.

Conversation Evaluators
We train conversational response evaluators to assess the state of a given conversation. Our models are trained on a combination of utterance and response pairs combined with context (past turn user utterances and system responses) along with other features, e.g., dialog acts and topics as described in Section 4.3. We experiment with different ways to encode the responses (Section 4.1) as well as with different feature combinations (Figure 1).

Sentence Embeddings
We pretrained models that produce sentence embeddings using the ParlAI chitchat data set (Miller et al., 2017). We use the Quick-Thought (QT) loss (Logeswaran and Lee, 2018) to train the embeddings. Our word embeddings are initialized with FastText (Bojanowski et al., 2016) to capture the sub-word features and then fine-tuned. We encode sentences into embeddings using the following methods:
a) Average of word embeddings (300 dim)
b) The Transformer Network (1 layer, 600 dim) (Vaswani et al., 2017)
c) Concatenated last states of a BiLSTM (1 layer, 600 dim)
All models were trained with a batch size of 400 and a learning rate of 0.0005.
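As a minimal sketch of encoding method (a), the snippet below averages word vectors into a sentence vector. The function name `average_embedding`, the toy 2-dimensional vectors, and the choice to skip tokens without a vector are assumptions made for illustration; FastText can compose vectors for unseen words from sub-word units, so the skipping step is a simplification.

```python
def average_embedding(tokens, word_vectors, dim):
    """Mean of the word vectors of the known tokens; zero vector if none are known."""
    # Assumption: tokens without a vector are skipped rather than backed off to sub-words.
    known = [word_vectors[t] for t in tokens if t in word_vectors]
    if not known:
        return [0.0] * dim
    return [sum(vec[d] for vec in known) / len(known) for d in range(dim)]


# Toy 2-dimensional vectors (the paper uses 300 dimensions for this method).
toy_vectors = {"good": [1.0, 3.0], "day": [3.0, 1.0]}
sentence_vector = average_embedding(["good", "day", "unseen"], toy_vectors, dim=2)
```

The Transformer and BiLSTM encoders replace this mean with a learned function of the token sequence.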
To measure the sentence embedding quality, we evaluate our models on a few standard classification tasks. We used the same datasets as previous works. The models are used to get sentence embeddings, which are then passed through feedforward networks that are trained for the following classification tasks: (i) Semantic Textual Similarity (STS) (Marelli et al., 2014), (ii) Question Type Classification (TREC) (Voorhees and Dang, 2003), (iii) Subjectivity Classification (SUBJ) (Pang and Lee, 2004). Table 1 shows the different models' performances on these tasks. Based on this, we choose the Transformer as our sentence encoder as it was overall the best performing while being fast.

Context
Given the contextual nature of the problem (i.e., utterances and responses in the current turn may refer to past turns), we extracted the user utterances and responses for the past five turns and used an LSTM to encode conversational context. The last state of the LSTM is used to obtain the encoded representation, which is then concatenated with other features in a fully-connected neural network. We used a 1-layer LSTM with 256 hidden units to encode context.

Features
Sentence embedding representations of user utterances and system responses are used as input to the evaluator models. Sentence embeddings are encoded using the Transformer as mentioned. Apart from these, the following features are used:
• Dialog Act: Serban et al. (2017) show that dialog act (DA) features could be useful for response selection rankers. Following this, we use model-predicted DAs (Stolcke et al., 1998) of user utterances and system responses as an indicator feature.
• Entity Grid: Cervone et al. (2018) and Barzilay and Lapata (2008) show that entities and DA transitions across turns can be strong features for assessing dialog coherence. Starting from a grid representation of the turns of the conversation as a matrix (DAs × entities), these features are designed to capture the patterns of topic and intent shift distribution of a dialog. We employ the same strategy for our models.
• Named Entity (NE) Overlap: We use named entity overlap between user utterances and their corresponding system responses as a feature. Our named entities are obtained using SpaCy. Papaioannou et al. (2017) have also used similar NE features in their ranker.
• Topic: We use a one-hot representation of the topic of a dialog turn (a user utterance and the system response), predicted by a conversational topic model (Guo et al., 2017) that classifies a given dialog turn into one of 26 pre-defined classes such as Sports and Movies.
• Response Similarity: Cosine similarity between user utterance embedding and system response embedding is used as a feature.
• Length: We use the token-level lengths of the user utterance and system response as features.
The above features were selected from a large pool of features through significance testing on our development set. The effect of adding these features can be seen in Table 2. Some of the features, such as Topic, do not take previous dialog context into account and could be extended to include it. We leave this extension for future work.
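The three simplest features in the list above (NE overlap, response similarity, and length) can be sketched as follows. This is an illustrative sketch rather than the authors' feature extractor: the function names are hypothetical, the overlap is assumed to be a case-insensitive count of shared entities (the text does not say whether a count or a ratio is used), and entities and embeddings are assumed to be precomputed.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def entity_overlap(user_entities, response_entities):
    """Assumed definition: number of entities shared by both sides, ignoring case."""
    user = {e.lower() for e in user_entities}
    response = {e.lower() for e in response_entities}
    return len(user & response)


def turn_features(user_tokens, response_tokens, user_vec, response_vec,
                  user_entities, response_entities):
    """Feature vector: [response similarity, NE overlap, user length, response length]."""
    return [
        cosine_similarity(user_vec, response_vec),
        float(entity_overlap(user_entities, response_entities)),
        float(len(user_tokens)),
        float(len(response_tokens)),
    ]


features = turn_features(
    ["who", "plays", "for", "the", "dodgers"],
    ["the", "dodgers", "are", "a", "baseball", "team"],
    [1.0, 0.0], [1.0, 0.0],          # toy sentence embeddings
    ["Dodgers"], ["dodgers"],        # entities, e.g. from an NER tagger
)
```

In the full system these values would be concatenated with the dialog act, entity grid, and topic features before being passed to the evaluator.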

Models
Given the large number of features and their non-sequential nature, we train four binary classifiers using feedforward neural networks (FFNN). The input to these models is a dialog turn. Each output layer is a softmax function corresponding to a binary decision for one evaluation metric, and together the decisions form a four-dimensional vector. Each vector dimension corresponds to an evaluation metric (see Section 3.1). For example, one possible reference output would be [0,1,1,0], which corresponds to "not comprehensible," "on topic," "interesting," and "I don't want to continue." We experimented with training the evaluators jointly and separately and found that training them jointly led to better performance. We suspect this is due to the objectives of all evaluators being closely related. The features described in Section 4.3 are concatenated with the user utterance and system response embeddings and fed to the 3-layer FFNN with 256 hidden units. Figure 1 depicts the architecture of the conversation evaluators.
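To make the output format concrete, the sketch below turns four two-way softmax heads into the four-dimensional decision vector described above. The head logits are made-up numbers, the `[no, yes]` ordering and the 0.5 decision threshold are assumptions, and the network that would produce the logits is omitted.

```python
import math

CRITERIA = ("comprehensible", "on topic", "interesting", "continue the conversation")


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def yes_probabilities(head_logits):
    """head_logits: four [no_logit, yes_logit] pairs, one per criterion (assumed order)."""
    return [softmax(pair)[1] for pair in head_logits]


def decode(head_logits):
    """Binary decision per criterion, e.g. [0, 1, 1, 0]."""
    return [1 if p > 0.5 else 0 for p in yes_probabilities(head_logits)]


# Hypothetical logits for one dialog turn.
example_logits = [[2.0, -1.0], [-1.0, 2.0], [0.0, 3.0], [1.0, 0.0]]
decision = decode(example_logits)
```

With these logits the decoded vector matches the reference example in the text: not comprehensible, on topic, interesting, and no wish to continue.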

Response Generation System
To incorporate the explicit turn-level feedback provided by the conversation evaluators, we augment our baseline response generation system with the softmax scores provided by the conversation evaluators. Our baseline response generation system is described in Section 5.1. We evaluate our generation systems to see if the additional information produces more coherent and engaging system responses. We then incorporate the evaluators' outputs using two techniques: reranking and fine-tuning.
Figure 2(c): Fine-tuning Using Evaluators. We minimize the cross-entropy loss and maximize the discriminator score. The output of the softmax, i.e., the likelihood over the vocabulary for the length of the output, is passed to the evaluator along with the input (x and context). The evaluator generates the discriminative score over the |V| × len generator output, which is subtracted from the loss. The updated loss is back-propagated to update the encoder-decoder.

Base Model (S2S)
We extended the approach of Yao et al. (2016), where the authors used Luong attention (Luong et al., 2015). In our experiments, the decoder uses the same attention (Figure 2a). As we want to observe the full impact of conversational evaluators, we do not incorporate inverse document frequency (IDF) or conversation topics into our objective. Extending the objective to include these terms can be a good direction for future work.
In open-domain conversations, it is possible that a system response for a user utterance in the current turn refers to the context in past turns. To make the response generation system more robust, we added user utterances and system responses from the previous turn as context. The input to the response generation model is the previous-turn user utterance, previous-turn system response, and current-turn user utterance concatenated sequentially. We insert a special transition token (Serban et al., 2016c) between turns. We then use a single RNN to encode these sentences. Our word embeddings are randomly initialized and then fine-tuned during training. We used a 1-layer Gated Recurrent Unit (GRU) network with 512 hidden units for both the encoder and the decoder to train the seq2seq model and use MLE as our training objective.
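The encoder input described above can be sketched as a simple concatenation. The token string `<eot>`, the whitespace tokenization, and the function name `build_encoder_input` are assumptions for illustration; the text only states that a special transition token separates the turns.

```python
TRANSITION_TOKEN = "<eot>"  # hypothetical name for the special transition token


def build_encoder_input(prev_user, prev_system, curr_user, sep=TRANSITION_TOKEN):
    """Concatenate the three turns in order, inserting `sep` between consecutive turns."""
    tokens = []
    for index, turn in enumerate((prev_user, prev_system, curr_user)):
        if index > 0:
            tokens.append(sep)
        tokens.extend(turn.split())  # assumption: whitespace tokenization
    return tokens


encoder_input = build_encoder_input("hi", "hello there", "tell me a joke")
```

The resulting token sequence is what a single RNN encoder would consume in this setup.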

Reranking (S2S RR)
In this approach, we do not update the underlying encoder-decoder model. We maintain a beam to get the 15-best candidates from the decoder. The top candidate out of the 15 is equivalent to the output of the baseline model. Here, instead of selecting the top output, the final output response is chosen using a reranking model.
For our reranking model, we calculate BLEU scores for each of the 15 candidate responses against the ground-truth response from the chatbot. We then sample two responses from the 15-best list and train a pairwise response reranker. The response with the higher BLEU is placed in the positive class (+1) and the one with the lower BLEU is placed in the negative class (-1). We do this for all possible candidate pairs from the 15-best responses. We use the max-margin ranking loss to train the model. The model is an FFNN with three layers and 16 hidden units in each layer.
The input to the pairwise reranker is the softmax output of the 4 evaluators as shown in Figure 1. The inputs to the evaluators are described in Section 4. The output of the reranker is a scalar, which, if trained properly, gives a higher value for responses with higher BLEU scores. Figure 2b depicts the architecture of this model.
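The pair construction and ranking loss described above can be sketched as follows. This is a sketch under stated assumptions, not the authors' training code: candidates with equal BLEU are skipped, the margin is set to 1.0, and a plain sum of evaluator scores stands in for the trained three-layer FFNN scorer.

```python
from itertools import combinations


def make_pairs(candidates):
    """candidates: list of (evaluator_scores, bleu). Returns (preferred, other) score pairs."""
    pairs = []
    for (scores_a, bleu_a), (scores_b, bleu_b) in combinations(candidates, 2):
        if bleu_a == bleu_b:
            continue  # assumption: ties carry no preference and are skipped
        if bleu_a > bleu_b:
            pairs.append((scores_a, scores_b))
        else:
            pairs.append((scores_b, scores_a))
    return pairs


def margin_ranking_loss(score_pos, score_neg, margin=1.0):
    """Max-margin loss: zero once the preferred candidate wins by at least `margin`."""
    return max(0.0, margin - (score_pos - score_neg))


def rerank(candidate_scores, scorer):
    """Index of the candidate with the highest reranker score."""
    return max(range(len(candidate_scores)), key=lambda i: scorer(candidate_scores[i]))


# Three toy candidates: four evaluator softmax outputs each, plus a BLEU score.
s0, s1, s2 = [0.9, 0.8, 0.7, 0.6], [0.2, 0.1, 0.3, 0.4], [0.5, 0.5, 0.5, 0.5]
pairs = make_pairs([(s0, 0.30), (s1, 0.10), (s2, 0.30)])
best = rerank([s0, s1, s2], scorer=sum)  # `sum` is a stand-in for the trained FFNN
```

At inference time only `rerank` is needed, since BLEU against a reference is unavailable and the learned scorer replaces it.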

Fine-tuning (S2S FT)
In this approach, we fine-tune the baseline encoder-decoder response generation system using the evaluators. We first train the baseline model and then fine-tune it using the evaluator outputs in the hope of generating more coherent and engaging responses. One issue with MLE is that the learned models are not optimized for the final metric (e.g., BLEU). To combat this problem, we add a discriminatory loss in addition to the generative loss to the overall loss term as shown in Equation 1.
(1)
where $z_n = x_n, y_{n-1}, \ldots, x_0, y_0$ is the conversational context and $n$ is the context length. In the first term, $q \in \mathbb{R}^{|V| \times len}$ corresponds to the softmax output generated by the response generation model. The term $\hat{y}_{ni}$ refers to its corresponding decoder response at the $n$-th conversation turn and the $i$-th word generated. In the second term, the function Eval refers to the evaluator score produced for a user utterance, $x_n$, and decoder softmax output, $q$.
In Equation 1, the first term corresponds to the cross-entropy loss from the encoder-decoder while the second term corresponds to the discriminative loss from the evaluator. In a standalone evaluation setting, the evaluator takes a one-hot representation of the user utterance as input, i.e., the input is len tokens long and is passed through an embedding lookup layer, which produces an $\mathbb{R}^{D \times len}$ input to the rest of the network, where $D$ is the size of the word embeddings. To make the loss differentiable, instead of performing argmax to get a decoded token, we use the output of the softmax layer (the distribution of likelihood across the entire vocabulary for the output length, i.e., $\mathbb{R}^{|V| \times len}$) and use this to do a weighted embedding lookup across the entire vocabulary to get the same $\mathbb{R}^{D \times len}$ matrix as input to the rest of the evaluator network. Our updated evaluator input becomes the following:
(2)
The evaluator score is defined as the sum of the softmax outputs of all 4 models. We keep the rest of the input (context and features) for the evaluator as is.
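The weighted embedding lookup can be illustrated with a small sketch. The names `hard_lookup` and `soft_lookup` and the toy embedding table are assumptions; the matrices are stored position-major (len rows of D values), i.e., transposed relative to the $\mathbb{R}^{D \times len}$ notation in the text. Each output row is the expectation of the embedding under that position's softmax distribution, so a one-hot distribution reproduces the ordinary lookup.

```python
def hard_lookup(token_ids, embeddings):
    """Ordinary lookup: one embedding row per token id."""
    return [list(embeddings[t]) for t in token_ids]


def soft_lookup(q, embeddings):
    """q: one distribution over the vocabulary per position; returns expected embeddings."""
    dim = len(embeddings[0])
    return [
        [sum(p * embeddings[v][d] for v, p in enumerate(dist)) for d in range(dim)]
        for dist in q
    ]


# Toy table: vocabulary of 3 words, embedding size D = 2.
table = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
one_hot = [[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
mixed = [[0.5, 0.5, 0.0]]
```

Because the soft lookup is a weighted sum rather than an argmax, gradients from the evaluator can flow back into the generator's softmax output.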
We weight the discriminator score by λ, which is a hyperparameter. We select λ using grid search to optimize for final BLEU on our development set. Figure 2c depicts the architecture of this approach. The decoder is fine-tuned to maximize evaluator scores while minimizing the cross-entropy loss. The evaluator model is trained on the original annotated corpus and its parameters are frozen.
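One way to write a composite objective consistent with the description in this section is sketched below. It is an illustrative form under one reading of the text, not necessarily the exact published equation: the symbol $\mathcal{L}$, the sum over word positions $i$, and the sign convention (evaluator score subtracted, so that minimizing the total raises it) are assumptions.

```latex
\mathcal{L}(z_n) \;=\; -\sum_{i=1}^{len} \log q_i\!\left(\hat{y}_{ni} \mid z_n\right) \;-\; \lambda \, \mathrm{Eval}\!\left(x_n, q\right)
```

Minimizing this quantity lowers the cross-entropy term while raising the evaluator score, with λ controlling the trade-off between the two.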

Reranking + Fine-tuning (S2S RR FT)
We also combined fine-tuning with reranking: we obtain the 15 candidates from the fine-tuned response generator and then select the best response using the reranker, which is trained to maximize the BLEU score.

Experiments and Results

Conversation Evaluators
We used a batch size of 128, dropout of 0.3, and a learning rate of 0.00005 for our conversational evaluators. The models were trained using cross-entropy loss. Sentence embeddings for user utterances and system responses are obtained using the FastText embeddings and the Transformer network.
Table 2 shows the results of the evaluator compared with a baseline with no handcrafted features. We present precision, recall, and F-score measures along with the accuracy. Furthermore, since the class distribution of the dataset is highly imbalanced, we also calculate the Matthews correlation coefficient (MCC), which takes into account true and false positives and negatives. It is regarded as a balanced measure that can be used even if the class sizes are very different. With the proposed features we observe significant improvement across all the metrics. We also performed a correlation study between the model-predicted scores and human-annotated scores (1 to 5) on 2000 utterances. The annotators were asked to answer a single question: "On a scale of 1-5, how coherent and engaging is this response given the previous conversation?" From Table 3, it can be observed that the evaluator-predicted scores have a significant correlation (moderate to high) with the overall human evaluation score on this subjective task (0.2-0.4 Pearson correlation with turn-level ratings). Hence, we observe that our evaluators can be used to provide turn-level feedback for a human-chatbot conversation.
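For reference, MCC for a binary classifier can be computed directly from the confusion-matrix counts, as in the sketch below. The function name `mcc` is ours, and returning 0.0 when the denominator vanishes is a common convention that we assume here; the counts in the examples are made up and are not results from this work.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0.0:
        return 0.0  # assumed convention when a row or column of the matrix is empty
    return (tp * tn - fp * fn) / denominator


perfect = mcc(tp=50, tn=50, fp=0, fn=0)
always_negative = mcc(tp=0, tn=90, fp=0, fn=10)  # 90% accuracy, yet no signal
inverted = mcc(tp=0, tn=0, fp=5, fn=5)
```

The second example shows why accuracy alone is misleading on imbalanced data: a classifier that always predicts the majority class scores 90% accuracy but an MCC of 0.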

Response Generation
Table 4 shows the performance comparison of different generation models (Section 5) on the conversational data set (4M utterance-response pairs from the competition; Section 3). The data are split into 80% training, 10% development, and 10% test sets. We observed that reranking n-best responses using the evaluator-based reranker (S2S RR) provides nearly 100% improvement in BLEU-4 scores. The reranker was trained using 20k additional utterances from a development set.
Fine-tuning the generator by adding the evaluator loss (S2S FT) does improve the performance, but the gains are smaller compared to reranking. We suspect that this is due to the reranker directly optimizing for BLEU. However, fine-tuning and reranking complement each other, and using a fine-tuned model and then reranking (S2S RR FT) gives the best performance overall. Furthermore, we observe that even though the reranker is trained to maximize the BLEU scores, reranking shows significant gains in ROUGE scores as well. The hyperparameter λ was chosen to be 10. We also measured the different systems' performance using Distinct-2 (Li et al., 2016a), which is the number of unique length-normalized bigrams in responses. The metric can be a surrogate for measuring diverse outputs. We see that our generators using reranking approaches improve on this metric as well.

Metric      S2S (Base)  S2S RR
BLEU-4      3.9         7.9 (+103%)
ROUGE-2     0.6         0.8 (+33%)
Distinct-2  0.0047      0.0086 (+82%)
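A minimal sketch of a Distinct-n computation is given below. We assume the common formulation in which the number of unique n-grams across all responses is divided by the total number of generated tokens; the exact normalization used in the experiments is not specified in the text, and the whitespace tokenization and function name are likewise assumptions.

```python
def distinct_n(responses, n=2):
    """Unique n-grams across all responses divided by the total token count."""
    unique_ngrams = set()
    total_tokens = 0
    for response in responses:
        tokens = response.split()  # assumption: whitespace tokenization
        total_tokens += len(tokens)
        unique_ngrams.update(
            tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)
        )
    return len(unique_ngrams) / total_tokens if total_tokens else 0.0


repetitive = distinct_n(["i do not know", "i do not know"])
varied = distinct_n(["i do not know", "the dodgers won today"])
```

A system that repeats the same generic response adds tokens without adding new bigrams, so its score falls as the repetition grows.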
To further analyze the impact of the reranker trained to optimize the BLEU score, we trained a baseline response generation system on a Reddit data set⁵, which comprises 9 million comments and corresponding response comments.
We split the data into 80% training, 10% development, and 10% test. Given the vocabulary size and complexity of the data, we used a 3-layer GRU with 1024 units. We trained a new reranker for the Reddit data using the evaluator scores obtained from the models proposed in Section 4. We show in Table 5 that even though the evaluators are trained on a different data set, the reranker learns to select better responses, nearly doubling the BLEU scores as well as improving on the Distinct-2 score. Thus the evaluator generalizes in selecting more coherent and engaging responses in human-human interactions as well as human-computer interactions. As fine-tuning the evaluator is computationally expensive, we did not fine-tune it on the Reddit dataset.
The closest baseline that used BLEU scores for evaluation in an open-domain setting is from Li et al. (2015), who trained their models on Twitter data using Maximum Mutual Information (MMI) as the objective function. They obtained a BLEU score of 5.2 in their best setting on Twitter data (average length 23 chars), which is relatively less complex than Reddit (average length 75 chars).

Human Evaluation
As noted earlier, automatic evaluation metrics may not be the best way to measure chatbot response generation performance. Therefore, we performed human evaluation of our models. We asked annotators to provide ratings on the system responses from the models we evaluated, i.e., the baseline model, S2S RR, S2S FT, and S2S RR FT. A rating was obtained on two metrics: coherence and engagement. We asked the annotators to provide the rating on a scale of 1-5, with 5 being the best. We had four annotators rate 250 interactions. Table 6 shows the performance of the models on the proposed metrics. Our inter-annotator agreement is 0.42 on Cohen's Kappa coefficient, which implies moderate agreement. We believe this is because the task is relatively subjective and the conversations were performed in the challenging open-domain setting. The S2S RR FT model provides the best performance across all the metrics, followed by S2S RR, followed by S2S FT.
⁵ We use a publicly available data set (Baumgartner, 2015).
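For readers unfamiliar with the agreement statistic used in the human evaluation, Cohen's Kappa for a pair of annotators can be computed as sketched below. The function name and the toy ratings are ours; Cohen's Kappa is defined for two annotators, and how the pairwise values were aggregated over the four annotators is not stated in the text, so no aggregation is shown.

```python
from collections import Counter


def cohens_kappa(ratings_a, ratings_b):
    """Cohen's Kappa between two annotators who rated the same items."""
    if len(ratings_a) != len(ratings_b) or not ratings_a:
        raise ValueError("both annotators must rate the same non-empty set of items")
    n = len(ratings_a)
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Agreement expected by chance from each annotator's label frequencies.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


identical = cohens_kappa([1, 2, 3, 4], [1, 2, 3, 4])
chance_level = cohens_kappa([1, 1, 2, 2], [1, 2, 1, 2])
```

Values near 0 indicate agreement no better than chance and values near 1 indicate near-perfect agreement.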

Conclusion
Human annotations for conversations show significant variance, but it is still possible to train models which can extract meaningful signal from the human assessment of the conversations. We show that these models can provide useful turn-level guidance to response generation models. We design a system using various features and context encoders to provide turn-level feedback in a conversational dialog. Our feedback is interpretable on 2 major axes of conversational quality: engagement and coherence. We propose 2 ways to incorporate this feedback into a response generation system, both of which help improve on the baselines. We obtain the best performance when we combine both of the techniques. This work is complementary to other recent work in improving dialog systems such as Li et al. (2015) and Shao et al. (2017). While such open-domain systems are still in their infancy, we view the framework presented in this paper to be an important step towards building end-to-end coherent and engaging chatbots.

Figure 1: Conversation Evaluators

Figure 2: Response Model Configurations. The baseline is shown at the top. The terms $x_n$ and $y_n$ correspond to the $n$-th utterance and response, respectively.
(a) Baseline Response Generator (Seq2Seq with Attention)
(b) Reranking Using Evaluators. Top 15 candidates from beam search are passed to the evaluators. The candidate that maximizes the reranker score is chosen as the output. The encoder-decoder remains unchanged.

Table 1: Sentence embedding accuracy on baseline tasks.

Table 2: Conversation Evaluators Performance. Numbers in parentheses denote relative changes when using our best model (all features) with respect to the baseline (no handcrafted features, only sentence embeddings). Second column shows the class imbalance in our annotations. Note that the baseline model had 0 MCC for Interesting
tors, we do not incorporate inverse document frequency (IDF) or conversation topics into our objective. Extending the objective to include these terms can be a good direction for future work.

Table 3: Evaluators Correlation with Turn-level Ratings

Table 4: Generator performance on automatic metrics.

Table 5: Response Generator on Reddit Conversations. Due to the size of the dataset we could not fine-tune these models.

Table 6: Mean ratings for Qualitative and Human Evaluation of Response Generators