Towards an Automatic Turing Test: Learning to Evaluate Dialogue Responses

Automatically evaluating the quality of dialogue responses for unstructured domains is a challenging problem. Unfortunately, existing automatic evaluation metrics are biased and correlate very poorly with human judgements of response quality (Liu et al., 2016). Yet having an accurate automatic evaluation procedure is crucial for dialogue research, as it allows rapid prototyping and testing of new models with fewer expensive human evaluations. In response to this challenge, we formulate automatic dialogue evaluation as a learning problem. We present an evaluation model (ADEM) that learns to predict human-like scores for input responses, using a new dataset of human response scores. We show that the ADEM model's predictions correlate significantly, and at a level much higher than word-overlap metrics such as BLEU, with human judgements at both the utterance and system level. We also show that ADEM can generalize to evaluating dialogue models unseen during training, an important step for automatic dialogue evaluation.


Introduction
Building systems that can naturally and meaningfully converse with humans has been a central goal of artificial intelligence since the formulation of the Turing test (Turing, 1950). Research on one type of such systems, sometimes referred to as non-task-oriented dialogue systems, goes back to the mid-60s with Weizenbaum's famous program ELIZA: a rule-based system mimicking a Rogerian psychotherapist by persistently either rephrasing statements or asking questions (Weizenbaum, 1966). Recently, there has been a surge of interest towards building large-scale non-task-oriented dialogue systems using neural networks (Sordoni et al., 2015b; Shang et al., 2015; Vinyals and Le, 2015; Serban et al., 2016a; Li et al., 2015). These models are trained in an end-to-end manner to optimize a single objective, usually the likelihood of generating the responses from a fixed corpus. Such models have already had a substantial impact in industry, including Google's Smart Reply system (Kannan et al., 2016), and Microsoft's Xiaoice chatbot (Markoff and Mozur, 2015), which has over 20 million users.

Figure 1: Example where word-overlap scores fail for dialogue evaluation; although the model response is reasonable, it has no words in common with the reference response, and thus would be given low scores by metrics such as BLEU.
One of the challenges when developing such systems is to have a good way of measuring progress, in this case the performance of the chatbot. The Turing test provides one solution to the evaluation of dialogue systems, but there are limitations with its original formulation. The test requires live human interactions, which is expensive and difficult to scale up. Furthermore, the test requires carefully designing the instructions to the human interlocutors, in order to balance their behaviour and expectations so that different systems may be ranked accurately by performance. Although unavoidable, these instructions introduce bias into the evaluation measure. The more common approach of having humans evaluate the quality of dialogue system responses, rather than distinguish them from human responses, induces similar drawbacks in terms of time, expense, and lack of scalability. In the case of chatbots designed for specific conversation domains, it may also be difficult to find sufficient human evaluators with appropriate background in the topic (Lowe et al., 2015).

arXiv:1708.07149v1 [cs.CL] 23 Aug 2017
Despite advances in neural network-based models, evaluating the quality of dialogue responses automatically remains a challenging and understudied problem in the non-task-oriented setting. The most widely used metric for evaluating such dialogue systems is BLEU (Papineni et al., 2002), a metric measuring word overlaps originally developed for machine translation. However, it has been shown that BLEU and other word-overlap metrics are biased and correlate poorly with human judgements of response quality (Liu et al., 2016). There are many obvious cases where these metrics fail, as they are often incapable of considering the semantic similarity between responses (see Figure 1). Despite this, many researchers still use BLEU to evaluate their dialogue models (Ritter et al., 2011; Sordoni et al., 2015b; Li et al., 2015; Galley et al., 2015; Li et al., 2016a), as there are few alternatives available that correlate with human judgements. While human evaluation should always be used to evaluate dialogue models, it is often too expensive and time-consuming to do this for every model specification (for example, for every combination of model hyperparameters). Therefore, having an accurate model that can evaluate dialogue response quality automatically, what could be considered an automatic Turing test, is critical in the quest for building human-like dialogue agents.
To make progress towards this goal, we make the simplifying assumption that a 'good' chatbot is one whose responses are scored highly on appropriateness by human evaluators. We believe this is sufficient for making progress as current dialogue systems often generate inappropriate responses. We also find empirically that asking evaluators for other metrics yields scores that either have low inter-annotator agreement or are highly correlated with appropriateness (see supplemental material). Thus, we collect a dataset of appropriateness scores for various dialogue responses, and we use this dataset to train an automatic dialogue evaluation model (ADEM). The model is trained in a semi-supervised manner using a hierarchical recurrent neural network (RNN) to predict human scores. We show that ADEM scores correlate significantly with human judgement at both the utterance level and the system level. We also show that ADEM can often generalize to evaluating new models, whose responses were unseen during training, making ADEM a strong first step towards effective automatic dialogue response evaluation.

Data Collection
To train a model to predict human scores for dialogue responses, we first collect a dataset of human judgements (scores) of Twitter responses using the crowdsourcing platform Amazon Mechanical Turk (AMT). The aim is to have accurate human scores for a variety of conversational responses, conditioned on dialogue contexts, which span the full range of response qualities. For example, the responses should include both relevant and irrelevant responses, both coherent and non-coherent responses, and so on. To achieve this variety, we use candidate responses from several different models. Following Liu et al. (2016), we use the following 4 sources of candidate responses: (1) a response selected by a TF-IDF retrieval-based model, (2) a response selected by the Dual Encoder (DE) (Lowe et al., 2015), (3) a response generated using the hierarchical recurrent encoder-decoder (HRED) model (Serban et al., 2016a), and (4) human-generated responses. It should be noted that the human-generated candidate responses are not the reference responses from a fixed corpus, but novel human responses that are different from the reference. In addition to increasing response variety, this is necessary because we want our evaluation model to learn to compare the reference responses to the candidate responses. We provide the details of our AMT experiments in the supplemental material, including additional experiments suggesting that several other metrics are currently unlikely to be useful for building evaluation models. Note that, in order to maximize the number of responses obtained with a fixed budget, we only obtain one evaluation score per dialogue response in the dataset.
To train evaluation models on human judgements, it is crucial that we obtain scores of responses that lie near the distribution produced by advanced models. This is why we use the Twitter Corpus (Ritter et al., 2011), as such models are pre-trained and readily available. Further, the set of topics discussed is quite broad, as opposed to the very specific Ubuntu Dialogue Corpus (Lowe et al., 2015), and therefore the model may also be suited to other chit-chat domains. Finally, since it does not require domain-specific knowledge (e.g. technical knowledge), it should be easy for AMT workers to annotate.

Recurrent Neural Networks
Recurrent neural networks (RNNs) are a type of neural network with time-delayed connections between the internal units. This leads to the formation of a hidden state $h_t$, which is updated for every input: $h_t = f(W_{hh} h_{t-1} + W_{ih} x_t)$, where $W_{hh}$ and $W_{ih}$ are parameter matrices, $f$ is a non-linear activation function such as tanh, and $x_t$ is the input at time $t$. The hidden state allows RNNs to better model sequential data, such as language.
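As a minimal sketch of the recurrence described above, the following pure-Python snippet applies the update h_t = tanh(W_hh h_{t-1} + W_ih x_t) over a short sequence. The function names, the zero initial state, and the toy weights are illustrative assumptions, not part of the model described in this paper.

```python
import math

def rnn_step(h_prev, x_t, W_hh, W_ih):
    """One update h_t = tanh(W_hh h_{t-1} + W_ih x_t), with vectors as lists."""
    h_t = []
    for row_hh, row_ih in zip(W_hh, W_ih):
        pre = sum(w * h for w, h in zip(row_hh, h_prev))
        pre += sum(w * x for w, x in zip(row_ih, x_t))
        h_t.append(math.tanh(pre))
    return h_t

def encode(inputs, W_hh, W_ih):
    """Run the recurrence over a sequence and return the last hidden state."""
    h = [0.0] * len(W_hh)  # assumed zero initial state
    for x_t in inputs:
        h = rnn_step(h, x_t, W_hh, W_ih)
    return h

# Toy weights (hypothetical): 2 hidden units, 3-dimensional inputs.
W_hh = [[0.5, 0.0], [0.0, 0.5]]
W_ih = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
sequence = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
final_state = encode(sequence, W_hh, W_ih)
```

Because the hidden state is fed back at every step, the final state depends on the whole input sequence, which is what makes it usable as a fixed-size summary of an utterance.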
In this paper, we consider RNNs augmented with long short-term memory (LSTM) units (Hochreiter and Schmidhuber, 1997). LSTMs add a set of gates to the RNN that allow it to learn how much to update the hidden state. LSTMs are one of the most well-established methods for dealing with the vanishing gradient problem in recurrent networks (Hochreiter, 1991; Bengio et al., 1994).

Word-Overlap Metrics
One of the most popular approaches for automatically evaluating the quality of dialogue responses is to compute their word overlap with the reference response. In particular, the most popular metrics are the BLEU and METEOR scores used for machine translation, and the ROUGE score used for automatic summarization. While these metrics tend to correlate with human judgements in their target domains, they have recently been shown to be highly biased and to correlate very poorly with human judgements for dialogue response evaluation (Liu et al., 2016). We briefly describe BLEU here, and provide a more detailed summary of word-overlap metrics in the supplemental material.

BLEU. BLEU (Papineni et al., 2002) analyzes the co-occurrences of n-grams in the reference and the proposed responses. It computes the n-gram precision for the whole dataset, which is then multiplied by a brevity penalty to penalize short translations. For BLEU-N, N denotes the largest n-gram order considered (usually N = 4).
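To make the description of BLEU concrete, here is a simplified sketch of BLEU-N for a single candidate/reference pair. It is an assumption-laden illustration: it works at the sentence level rather than over a whole dataset, uses one reference, and applies no smoothing, so it is not a replacement for a standard BLEU implementation. The function names are hypothetical.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count the n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu(candidate, reference, max_n=4):
    """Unsmoothed, single-reference BLEU-N for one candidate/reference pair."""
    if not candidate:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        # Clipped counts: an n-gram is credited at most as often as it occurs in the reference.
        matched = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or matched == 0:
            return 0.0  # a zero precision makes the geometric mean zero
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: penalize candidates shorter than the reference.
    if len(candidate) >= len(reference):
        bp = 1.0
    else:
        bp = math.exp(1 - len(reference) / len(candidate))
    return bp * math.exp(sum(log_precisions) / max_n)
```

Under this sketch, a response that shares no words with the reference receives a score of exactly 0 regardless of how appropriate it is, which is the failure mode discussed in this section.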
Drawbacks. One of the major drawbacks of word-overlap metrics is their failure to capture the semantic similarity (and other structure) between the model and reference responses when there are few or no common words. This problem is less critical for machine translation; since the set of reasonable translations of a given sentence or document is rather small, one can reasonably infer the quality of a translated sentence by only measuring the word overlap between it and one (or a few) reference translations. However, in dialogue, the set of appropriate responses given a context is much larger (Artstein et al., 2009); in other words, there is a very high response diversity that is unlikely to be captured by word-overlap comparison to a single response.
Further, word-overlap scores are computed directly between the model and reference responses. As such, they do not consider the context of the conversation. While this may be a reasonable assumption in machine translation, it is not the case for dialogue; whether a model response is an adequate substitute for the reference response is clearly context-dependent. For example, the two responses in Figure 1 are equally appropriate given the context. However, if we simply change the context to: "Have you heard of any good movies recently?", the model response is no longer relevant while the reference response remains valid.

An Automatic Dialogue Evaluation Model (ADEM)
To overcome the problems of evaluation with word-overlap metrics, we aim to construct a dialogue evaluation model that: (1) captures semantic similarity beyond word overlap statistics, and (2) exploits both the context and the reference response to calculate its score for the model response. We call this evaluation model ADEM.
ADEM learns distributed representations of the context, model response, and reference response using a hierarchical RNN encoder. Given the dialogue context $c$, reference response $r$, and model response $\hat{r}$, ADEM first encodes each of them into vectors ($\mathbf{c}$, $\mathbf{r}$, and $\hat{\mathbf{r}}$, respectively) using the RNN encoder. Then, ADEM computes the score using a dot product between the vector representations of $c$, $r$, and $\hat{r}$ in a linearly transformed space:

$$\mathrm{score}(c, r, \hat{r}) = (\mathbf{c}^\top M \hat{\mathbf{r}} + \mathbf{r}^\top N \hat{\mathbf{r}} - \alpha) / \beta \qquad (1)$$

where $M, N \in \mathbb{R}^{n \times n}$ are learned matrices initialized to the identity, and $\alpha$, $\beta$ are scalar constants used to initialize the model's predictions in the range [1, 5]. The model is shown in Figure 2.
The matrices $M$ and $N$ can be interpreted as linear projections that map the model response $\hat{\mathbf{r}}$ into the space of contexts and reference responses, respectively. The model gives high scores to responses that have similar vector representations to the context and reference response after this projection. The model is end-to-end differentiable; all the parameters can be learned by backpropagation. In our implementation, the parameters $\theta = \{M, N\}$ of the model are trained to minimize the squared error between the model predictions and the human score, with L2-regularization:

$$\mathcal{L} = \sum_{i=1:K} [\mathrm{score}(c_i, r_i, \hat{r}_i) - \mathrm{human}_i]^2 + \gamma \|\theta\|_2 \qquad (2)$$

where $\gamma$ is a scalar constant. The simplicity of our model leads to both accurate predictions and fast evaluation (see supplemental material), which is important to allow rapid prototyping of dialogue systems.
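The scoring function and training objective can be sketched in a few lines of pure Python. This is an illustration under stated assumptions rather than the implementation used in the experiments: vectors and matrices are plain lists, the regularizer is taken to be γ times the Euclidean norm of all entries of M and N (the exact form of the L2 penalty is an assumption), and the names `bilinear`, `adem_score`, `adem_loss`, and `identity` are hypothetical.

```python
import math

def bilinear(u, A, v):
    """Compute u^T A v for list-based vectors u, v and matrix A."""
    return sum(u_i * sum(a * v_j for a, v_j in zip(row, v)) for u_i, row in zip(u, A))

def adem_score(c, r, r_hat, M, N, alpha, beta):
    """score = (c^T M r_hat + r^T N r_hat - alpha) / beta."""
    return (bilinear(c, M, r_hat) + bilinear(r, N, r_hat) - alpha) / beta

def adem_loss(batch, M, N, alpha, beta, gamma):
    """Sum of squared errors plus gamma * ||theta||_2 over theta = {M, N}.

    batch: list of (c, r, r_hat, human_score) tuples.
    """
    squared_error = sum(
        (adem_score(c, r, r_hat, M, N, alpha, beta) - human) ** 2
        for c, r, r_hat, human in batch
    )
    l2_norm = math.sqrt(sum(w * w for A in (M, N) for row in A for w in row))
    return squared_error + gamma * l2_norm

def identity(n):
    """n x n identity matrix, the initialization described for M and N."""
    return [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]

# Toy 2-dimensional embeddings (hypothetical values).
M, N = identity(2), identity(2)
c, r, r_hat = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
score = adem_score(c, r, r_hat, M, N, alpha=0.0, beta=1.0)
loss = adem_loss([(c, r, r_hat, 3.0)], M, N, alpha=0.0, beta=1.0, gamma=0.5)
```

With M and N at the identity, the score reduces to a shifted and scaled sum of two dot products, so the untrained model already rewards responses whose embeddings align with the context and the reference.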
The hierarchical RNN encoder in our model consists of two layers of RNNs (El Hihi and Bengio, 1995; Sordoni et al., 2015a). The lower-level RNN, the utterance-level encoder, takes as input words from the dialogue, and produces a vector output at the end of each utterance. The context-level encoder takes the representation of each utterance as input and outputs a vector representation of the context. This hierarchical structure is useful for incorporating information from early utterances in the context (Serban et al., 2016a). Following previous work, we take the last hidden state of the context-level encoder as the vector representation of the input utterance or context. The parameters of the RNN encoder are pretrained and are not learned from the human scores.
An important point is that the ADEM procedure above is not a dialogue retrieval model: the fundamental difference is that ADEM has access to the reference response. Thus, ADEM can compare a model's response to a known good response, which is significantly easier than inferring response quality from the context alone.

Pre-training with VHRED
We would like an evaluation model that can make accurate predictions from few labeled examples, since these examples are expensive to obtain. We therefore employ semi-supervised learning, and use a pre-training procedure to learn the parameters of the encoder. In particular, we train the encoder as part of a neural dialogue model; we attach a third decoder RNN that takes the output of the encoder as input, and train it to predict the next utterance of a dialogue conditioned on the context.
The dialogue model we employ for pre-training is the latent variable hierarchical recurrent encoder-decoder (VHRED) model (Serban et al., 2016b), shown in Figure 3. The VHRED model is an extension of the original hierarchical recurrent encoder-decoder (HRED) model (Serban et al., 2016a) with a turn-level stochastic latent variable. The dialogue context is encoded into a vector using our hierarchical encoder, and the VHRED then samples a Gaussian variable that is used to condition the decoder (see supplemental material for further details). After training VHRED, we use the last hidden state of the context-level encoder, when c, r, and r̂ are fed as input, as the vector representations for c, r, and r̂, respectively. We use representations from the VHRED model as it produces more diverse and coherent responses compared to HRED.

Experimental Procedure
In order to reduce the effective vocabulary size, we use byte pair encoding (BPE) (Gage, 1994; Sennrich et al., 2015), which splits each word into sub-words or characters. We also use layer normalization (Ba et al., 2016) for the hierarchical encoder, which we found worked better at the task of dialogue generation than the related recurrent batch normalization (Ioffe and Szegedy, 2015; Cooijmans et al., 2016). To train the VHRED model, we employed several of the same techniques found in Serban et al. (2016b) and Bowman et al. (2016): we drop words in the decoder with a fixed rate of 25%, and we anneal the KL-divergence term linearly from 0 to 1 over the first 60,000 batches. We use Adam as our optimizer (Kingma and Ba, 2014).
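The two VHRED training heuristics mentioned above, decoder word drop and linear KL annealing, can be sketched as follows. The function names, the `<unk>` placeholder used for dropped words, and the way the annealed weight enters the objective are assumptions made for illustration; only the 25% rate and the 60,000-batch schedule come from the text.

```python
import random

def kl_weight(batch_index, warmup_batches=60000):
    """Linear ramp from 0 to 1 over the first `warmup_batches` batches."""
    return min(1.0, batch_index / warmup_batches)

def drop_words(tokens, rate=0.25, unk="<unk>", rng=random):
    """Replace each decoder input token with `unk` with probability `rate`."""
    return [unk if rng.random() < rate else tok for tok in tokens]

def annealed_objective(reconstruction_nll, kl_divergence, batch_index):
    """Negative lower bound with the KL term scaled by the annealed weight."""
    return reconstruction_nll + kl_weight(batch_index) * kl_divergence
```

Both heuristics serve the same purpose in this kind of model: they make it harder for the decoder to ignore the latent variable early in training.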
When training ADEM, we also employ a sub-sampling procedure based on the model response length. In particular, we divide the training examples into bins based on the number of words in a response and the score of that response. We then over-sample from bins across the same score to ensure that ADEM does not use response length to predict the score. This is because humans have a tendency to give a higher rating to shorter responses than to longer responses (Serban et al., 2016b), as shorter responses are often more generic and thus are more likely to be suitable to the context. Indeed, the test set Pearson correlation between response length and human score is 0.27.
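One plausible reading of this procedure is sketched below: examples are grouped by (score, length bin), and within each score the smaller length bins are over-sampled with replacement until they match the largest bin for that score. The bin width, the equal-size target, and the function names are assumptions; the text does not specify these details.

```python
import random
from collections import defaultdict

def length_bin(response, bin_width=5):
    """Assign a response to a bin by its whitespace-token count."""
    return len(response.split()) // bin_width

def oversample_by_length(examples, bin_width=5, seed=0):
    """Balance length bins within each score.

    examples: list of (response_text, score) pairs.
    Returns a list in which, for each score, every occupied length bin
    contributes the same number of examples.
    """
    rng = random.Random(seed)
    bins = defaultdict(list)
    for response, score in examples:
        bins[(score, length_bin(response, bin_width))].append((response, score))
    # Size of the largest length bin for each score.
    largest = defaultdict(int)
    for (score, _), items in bins.items():
        largest[score] = max(largest[score], len(items))
    balanced = []
    for (score, _), items in bins.items():
        balanced.extend(items)
        extra = largest[score] - len(items)
        balanced.extend(rng.choice(items) for _ in range(extra))
    return balanced

# Hypothetical toy data: short responses dominate the highest score.
toy = [("ok", 5), ("sure", 5), ("yes", 5),
       ("that sounds like a really great plan to me", 5), ("no", 1)]
balanced = oversample_by_length(toy)
```

After balancing, response length carries no information about which score a training example has, so the evaluation model cannot lower its loss by learning that shortcut.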
For training VHRED, we use a context embedding size of 2000. However, we found the ADEM model learned more effectively when this embedding size was reduced. Thus, after training VHRED, we use principal component analysis (PCA) (Pearson, 1901) to reduce the dimensionality of the context, model response, and reference response embeddings to n. We found experimentally that n = 50 provided the best performance.
When training our models, we conduct early stopping on a separate validation set. For the evaluation dataset, we split the train/validation/test sets such that there is no context overlap (i.e. the contexts in the test set are unseen during training).

[...] (Dhingra et al., 2016), and 'VHRED' indicates the dot product of VHRED embeddings (i.e. ADEM at initialization). C- and R-ADEM represent the ADEM model trained to only compare the model response to the context or reference response, respectively. We compute the baseline metric scores (top) on the full dataset to provide a more accurate estimate of their scores (as they are not trained on a training set).

Results
Utterance-level correlations. We first present new utterance-level correlation results for existing word-overlap metrics, in addition to results with embedding baselines and ADEM, in Table 2. The baseline metrics are evaluated on the entire dataset of 4,104 responses to provide the most accurate estimate of the score. We measure the correlation for ADEM on the validation and test sets, which constitute 616 responses each.
We also conduct an analysis of the response data from Liu et al. (2016), where the pre-processing is standardized by removing '<first speaker>' tokens at the beginning of each utterance. The results are detailed in the supplemental material. We can observe from both this data, and the new data in Table 2, that the correlations for the word-overlap metrics are even lower than estimated in previous studies (Liu et al., 2016; Galley et al., 2015). In particular, this is the case for BLEU-4, which has frequently been used for dialogue response evaluation (Ritter et al., 2011; Sordoni et al., 2015b; Li et al., 2015; Galley et al., 2015; Li et al., 2016a).
We can see from Table 2 that ADEM correlates far better with human judgement than the word-overlap baselines. This is further illustrated by the scatterplots in Figure 4. We also compare with ADEM using tweet2vec embeddings (Dhingra et al., 2016). In this case, instead of using the VHRED pre-training method presented in Section 4, we use off-the-shelf embeddings for c, r, and r̂, and fine-tune M and N on our dataset. These tweet2vec embeddings are computed at the character level with a bidirectional GRU on a Twitter dataset for hashtag prediction (Dhingra et al., 2016). We find that they obtain reasonable but inferior performance compared to using VHRED embeddings.
System-level correlations. We show the system-level correlations for various metrics in Table 3, and present them visually in Figure 5. Each point in the scatterplots represents a dialogue model; humans give low scores to TF-IDF and DE responses, higher scores to HRED, and the highest scores to other human responses. It is clear that existing word-overlap metrics are incapable of capturing this relationship for even 4 models. This renders them completely deficient for dialogue evaluation. However, ADEM produces almost the same model ranking as humans, achieving a significant Pearson correlation of 0.954. Thus, ADEM correlates well with humans both at the response and system level.
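A system-level correlation of this kind can be computed by averaging the human scores and the metric scores per dialogue model and then correlating the two lists of averages. The sketch below shows one way to do that; the record layout and function names are assumptions, and any data passed to it here would be hypothetical rather than the values reported in Table 3.

```python
import math
from collections import defaultdict

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def system_level_correlation(records):
    """Correlate per-system mean human scores with per-system mean metric scores.

    records: iterable of (system_name, human_score, metric_score) tuples.
    """
    human, metric = defaultdict(list), defaultdict(list)
    for system, h, m in records:
        human[system].append(h)
        metric[system].append(m)
    systems = sorted(human)
    human_means = [sum(human[s]) / len(human[s]) for s in systems]
    metric_means = [sum(metric[s]) / len(metric[s]) for s in systems]
    return pearson(human_means, metric_means)
```

Note that with only four systems the correlation is computed over four points, so it is much less stable than an utterance-level correlation over hundreds of responses.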

Generalization to previously unseen models
When ADEM is used in practice, it will take as input responses from a new model that it has not seen during training. Thus, it is crucial that ADEM correlates with human judgements for new models. We test ADEM's generalization ability by performing a leave-one-out evaluation. For each dialogue model that was the source of response data for training ADEM (TF-IDF, Dual Encoder, HRED, humans), we conduct an experiment where we train on all model responses except those from the chosen model, and test only on the model that was unseen during training.
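The leave-one-out protocol amounts to partitioning the scored responses by the model that produced them. A minimal sketch, assuming each example is a dictionary with a hypothetical `source` field naming that model:

```python
def leave_one_out_splits(examples):
    """Yield (held_out_source, train, test) for every response source.

    examples: list of dicts, each with a 'source' key naming the dialogue
    model (or 'human') that produced the response.
    """
    sources = sorted({ex["source"] for ex in examples})
    for held_out in sources:
        train = [ex for ex in examples if ex["source"] != held_out]
        test = [ex for ex in examples if ex["source"] == held_out]
        yield held_out, train, test
```

Each split trains the evaluator on three of the four sources and tests it only on responses from the remaining one.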
The results are given in Table 4. We observe that the ADEM model is able to generalize for all models except the Dual Encoder. This is particularly surprising for the HRED model; in this case, ADEM was trained only on responses that [...]

Qualitative Analysis. To illustrate some strengths and weaknesses of ADEM, we show human and ADEM scores for each of the responses to various contexts in Table 5. There are several instances where ADEM predicts accurately: in particular, ADEM is often very good at assigning low scores to poor responses. This is seen in the first two contexts, where most of the responses given a score of 1 by humans are given scores less than 2 by ADEM. The single exception, response (4) for the second context, seems somewhat appropriate and should perhaps have been scored higher by the human evaluator. There are also several instances where the model assigns high scores to suitable responses, as in the first two contexts. One drawback we observed is that ADEM tends to be too conservative when predicting response scores. This is the case in the third context, where the model assigns low scores to most of the responses that a human rated highly. This behaviour is likely due to the squared error loss used to train ADEM; since the model receives a large penalty for incorrectly predicting an extreme value, it learns to predict scores closer to the average human score. We provide many more experiments, including investigation of evaluation speed, learning curves, data efficiency, a failure analysis, and the primary source of improvement over word-overlap metrics, in the supplemental material.

Related Work
Related to our approach is the literature on novel methods for the evaluation of machine translation systems, especially through the WMT evaluation task (Callison-Burch et al., 2011; Machácek and Bojar, 2014; Stanojevic et al., 2015). In particular, Albrecht and Hwa (2007) and Gupta et al. (2015) have proposed to evaluate machine translation systems using regression and Tree-LSTMs, respectively. Their approach differs from ours as, in the dialogue domain, we must additionally condition our score on the context of the conversation, which is not necessary in translation.
There has also been related work on estimating the quality of responses in chat-oriented dialogue systems. DeVault et al. (2011) train an automatic dialogue policy evaluation metric from 19 structured role-playing sessions, enriched with paraphrases and external referee annotations. Gandhe and Traum (2016) propose a semi-automatic evaluation metric for dialogue coherence, similar to BLEU and ROUGE, based on 'wizard of Oz' type data. Xiang et al. (2014) propose a framework to predict utterance-level problematic situations in a dataset of Chinese dialogues using intent and sentiment factors. Finally, Higashinaka et al. (2014) train a classifier to distinguish user utterances from system-generated utterances using various dialogue features, such as dialogue acts, question types, and predicate-argument structures.
Several recent approaches use hand-crafted reward features to train dialogue models using reinforcement learning (RL). For example, Li et al. (2016b) use features related to ease of answering and information flow, and Yu et al. (2016) use metrics related to turn-level appropriateness and conversational depth. These metrics are based on hand-crafted features, which only capture a small set of relevant aspects; this inevitably leads to suboptimal performance, and it is unclear whether such objectives are preferable over retrieval-based cross-entropy or word-level maximum log-likelihood objectives. Furthermore, many of these metrics are computed at the conversation level, and are not available for evaluating single dialogue responses. The metrics that can be computed at the response level could be incorporated into our framework, for example by adding a term to equation 1 consisting of a dot product between these features and a vector of learned parameters.
There has been significant work on evaluation methods for task-oriented dialogue systems, which attempt to solve a user's task such as finding a restaurant. These methods include the PARADISE framework (Walker et al., 1997) and MeMo (Möller et al., 2006), which consider a task completion signal. PARADISE in particular is perhaps the first work on learning an automatic evaluation function for dialogue, accomplished through linear regression. However, PARADISE requires that one can measure task completion and task complexity, which are not available in our setting.

Discussion
We use the Twitter Corpus to train our models as it contains a broad range of non-task-oriented conversations and it has been used to train many state-of-the-art models. However, our model could easily be extended to other general-purpose datasets, such as Reddit, once similar pre-trained models become publicly available. Such models are necessary even for creating a test set in a new domain, which will help us determine if ADEM generalizes to related dialogue domains. We leave investigating the domain transfer ability of ADEM for future work.
The evaluation model proposed in this paper favours dialogue models that generate responses that are rated as highly appropriate by humans. It is likely that this property does not fully capture the desired end-goal of chatbot systems. For example, one issue with building models to approximate human judgements of response quality is the problem of generic responses. Since humans often provide high scores to generic responses due to their appropriateness for many given contexts (Shang et al., 2016), a model trained to predict these scores will exhibit the same behaviour. An important direction for future work is modifying ADEM such that it is not subject to this bias. This could be done, for example, by censoring ADEM's representations (Edwards and Storkey, 2016) such that they do not contain any information about length. Alternatively, one can combine this with an adversarial evaluation model (Kannan and Vinyals, 2017; Li et al., 2017) that assigns a score based on how easy it is to distinguish the dialogue model responses from human responses. In this case, a model that generates generic responses will easily be distinguishable and obtain a low score.
An important direction of future research is building models that can evaluate the capability of a dialogue system to have an engaging and meaningful interaction with a human. Compared to evaluating a single response, this evaluation is arguably closer to the end-goal of chatbots. However, such an evaluation is extremely challenging to do in a completely automatic way. We view the evaluation procedure presented in this paper as an important step towards this goal; current dialogue systems are incapable of generating responses that are rated as highly appropriate by humans, and we believe our evaluation model will be useful for measuring and facilitating progress in this direction.

Second, we filtered these human-generated responses for potentially offensive language, and combined them with approximately 1,000 responses from each of the above models into a single set of responses. We then asked AMT workers to rate the overall quality of each response on a scale of 1 (low quality) to 5 (high quality). Each user was asked to evaluate 4 responses from 50 different contexts. We included four additional attention-check questions, and a set of five contexts was given to each participant for assessment of inter-annotator agreement. We removed all users who either failed an attention-check question or achieved a κ inter-annotator agreement score lower than 0.2 (Cohen, 1968). The remaining evaluators had a median κ score of 0.63, indicating moderate agreement. This is consistent with results from Liu et al. (2016). Dataset statistics are provided in Table 1.
In initial experiments, we also asked humans to provide scores for topicality, informativeness, and whether the context required background information to be understandable. Note that we did not ask for fluency scores, as 3/4 of the responses were produced by humans (including the retrieval models). We found that scores for informativeness and background had low inter-annotator agreement (Table 6), and scores for topicality were highly correlated with the overall score (Pearson correlation of 0.72). Results on these auxiliary questions varied depending on the wording of the question. Thus, we continued our experiments by only asking for the overall score. We provide more details concerning the data collection in the supplemental material, as it may aid others in developing effective crowdsourcing experiments.
Preliminary AMT experiments. Before conducting the primary crowdsourcing experiments to collect the dataset in this paper, we ran a series of preliminary experiments to see how AMT workers responded to different questions. Unlike the primary study, where we asked a small number of overlapping questions to determine the κ score and filtered users based on the results, we conducted a study where all responses (40 in total from 10 contexts) were overlapping. We did this for 18 users in two trials, resulting in 153 pair-wise correlation scores per trial.

Measurement        κ score
Overall            0.63
Topicality         0.57
Informativeness    0.31
Background         0.05

Table 6: Median κ inter-annotator agreement scores for various questions asked in the survey.
In the first trial, we asked the following questions to the users, for each response:

1. How appropriate is the response overall? (overall, scale of 1-5)
2. How on-topic is the response? (topicality, scale of 1-5)
3. How specific is the response to some context? (specificity, scale of 1-5)
4. How much background information is required to understand the context? (background, scale of 1-5)

Note that we do not ask for fluency, as 3/4 of the responses for each context were written by a human (including those from the retrieval models). We also provided the AMT workers with examples that have high topicality and low specificity, and examples with high specificity and low topicality. The background question was only asked once for each context. We observed that both the overall scores and topicality had fairly high inter-annotator agreement (as shown in Table 6), but were strongly correlated with each other (i.e. participants would often put the same scores for topicality and overall score). Conversely, specificity (κ = 0.12) and background (κ = 0.05) had very low inter-annotator agreement.
To better visualize the data, we produce scatterplots showing the distribution of scores for different responses, for each of the four questions in our survey (Figure 6). We can see that the overall and topicality scores are clustered for each question, indicating high agreement. However, these clusters are most often in the same positions for each response, which indicates that they are highly correlated with each other. Specificity and background information, on the other hand, show far fewer clusters, indicating lower inter-annotator agreement. We conjectured that this was partially because the terms 'specificity' and 'background information', along with our descriptions of them, had a high cognitive load, and were difficult to understand in the context of our survey.
To test this hypothesis, we conducted a new survey where we tried to ask the questions for specificity and background in a more intuitive manner. We also changed the formulation of the background question to be a binary 0-1 decision of whether users understood the context. We asked the following questions:
1. How appropriate is the response overall? (overall, scale of 1-5)
2. How on-topic is the response? (topicality, scale of 1-5)
3. How common is the response? (informativeness, scale of 1-5)
4. Does the context make sense? (context, scale of 0-1)
We also clarified our description for the third question, including providing more intuitive examples. Interestingly, the inter-annotator agreement on informativeness (κ = 0.31) was much higher than that for specificity in the original survey. Thus, the formulation of questions in a crowdsourcing survey has a large impact on inter-annotator agreement.
For the context, we found that users either agreed highly (κ > 0.9 for 45 participants), or not at all (κ < 0.1 for 113 participants). We also experimented with asking the overall score on a separate page, before asking questions 2-4, and found that this increased the κ agreement slightly. Similarly, excluding all scores where participants indicated they did not understand the context improved inter-annotator agreement slightly.
Due to these observations, we decided to only ask users for their overall quality score for each response, as it is unclear how much additional information is provided by the other questions in the context of dialogue. We hope this information is useful for future crowdsourcing experiments in the dialogue domain.
Pre-training motivation Maximizing the likelihood of generating the next utterance in a dialogue is not only a convenient way of training the encoder parameters; it is also an objective that is consistent with learning useful representations of the dialogue utterances. Two context vectors produced by the VHRED encoder are similar if the contexts induce a similar distribution over subsequent responses; this is consistent with the formulation of the evaluation model, which assigns high scores to responses that have similar vector representations to the context. VHRED is also closely related to the skip-thought-vector model (Kiros et al., 2015), which has been shown to learn useful representations of sentences for many tasks, including semantic relatedness and paraphrase detection. The skip-thought-vector model takes as input a single sentence and predicts the previous sentence and next sentence. On the other hand, VHRED takes as input several consecutive sentences and predicts the next sentence. This makes it particularly suitable for learning long-term context representations.

Metric    Spearman          Pearson
BLEU-1    -0.026 (0.80)     0.016 (0.87)
BLEU-2     0.065 (0.52)     0.080 (0.43)
BLEU-3     0.139 (0.17)     0.088 (0.39)
BLEU-4     0.139 (0.17)     0.092 (0.36)
ROUGE     -0.083 (0.41)    -0.010 (0.92)
Table 7: Correlations between word-overlap metrics and human judgements on the dataset from (Liu et al., 2016), after removing the speaker tokens at the beginning of each utterance. The correlations are even worse than estimated in the original paper, and none are significant.

Hyperparameters
When evaluating our model, we conduct early stopping on an external validation set to obtain the best parameter setting. We similarly choose our hyperparameters (PCA dimension n, L2 regularization penalty γ, learning rate a, and batch size b) based on validation set results. Our best ADEM model used γ = 0.075, a = 0.01, and b = 32. For ADEM with tweet2vec embeddings, we conducted a similar hyperparameter search, and used n = 150, γ = 0.01, a = 0.01, and b = 16.
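The early-stopping step can be sketched as follows. This is only an illustration under stated assumptions: the validation metric is taken to be "higher is better", and the `patience` parameter and the function name `early_stopping_epoch` are hypothetical, since the stopping rule is not specified in the text.

```python
def early_stopping_epoch(val_scores, patience=3):
    """Return (best_epoch, best_score) under patience-based early stopping.

    val_scores: validation metric after each epoch, higher is better.
    patience: number of epochs without improvement tolerated (assumed).
    """
    best_epoch, best_score = -1, float("-inf")
    for epoch, score in enumerate(val_scores):
        if score > best_score:
            # New best validation score: remember this parameter setting.
            best_epoch, best_score = epoch, score
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` consecutive epochs: stop.
            break
    return best_epoch, best_score
```

The same selection rule applies to hyperparameters: each candidate setting (n, γ, a, b) is trained this way, and the setting with the best validation score is kept.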

Additional Results
New results on (Liu et al., 2016) data In order to ensure that the correlations between word-overlap metrics and human judgements were comparable across datasets, we standardized the processing of the evaluation dataset from (Liu et al., 2016). In particular, the original data from (Liu et al., 2016) has a token (either '<first speaker>', '<second speaker>', or '<third speaker>') at the beginning of each utterance. This is an artifact left over from the processing used to produce input for the hierarchical recurrent encoder-decoder (HRED) model (Serban et al., 2016a). Removing these tokens makes sense for establishing the ability of word-overlap metrics, as the tokens are unrelated to the content of the tweets.
We perform this processing, and report the updated results for word-overlap metrics in Table 7.
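A minimal sketch of this preprocessing step is given below. It assumes the tokens appear exactly as quoted above and only at the start of an utterance; the helper name `strip_speaker_token` is hypothetical and this is not the original processing script.

```python
import re

# The three tokens named in the text. Matching them only at the start of an
# utterance is an assumption based on the description above.
SPEAKER_TOKEN = re.compile(r"^\s*<(?:first|second|third) speaker>\s*")


def strip_speaker_token(utterance):
    """Remove one leading speaker token, if present; otherwise return as-is."""
    return SPEAKER_TOKEN.sub("", utterance, count=1)
```

Applying this to both the model responses and the reference responses before computing BLEU or ROUGE prevents the shared token from contributing spurious n-gram matches.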

Metric        Wall time
ADEM (CPU)    2861s
ADEM (GPU)    168s
Table 8: Evaluation time on the test set.
Surprisingly, almost all significant correlation disappears, particularly for all forms of the BLEU score. Thus, we can conclude that the word-overlap metrics were heavily relying on these tokens to form bigram matches between the model responses and reference responses.
Evaluation speed An important property of evaluation models is speed. We show the evaluation time on the test set for ADEM on both CPU and a Titan X GPU (using Theano, without cuDNN) in Table 8.

From Table 9, which lists cases where ADEM misses a good response, we can see that there are a variety of reasons for this kind of failure. In the first example, ADEM is not able to match the fact that the model response talks about sleep to the reference response or context. This is possibly because the utterance contains a significant amount of irrelevant information: indeed, the first two sentences are not related to either the context or reference response. In the second example, the model response does not seem particularly relevant to the context; despite this, the human scoring this example gave it 4/5. This illustrates one drawback of human evaluations: they are quite subjective, and often have some noise. This makes it difficult to learn an effective ADEM model. Finally, ADEM is unable to score the third response highly, even though it is very closely related to the reference response.
We can observe from the first two examples in Table 10, where the ADEM model erroneously ranks the model responses highly, that ADEM is occasionally fooled into giving high scores for responses that are completely unrelated to the context. This may be because both of the utterances are short, and short utterances are ranked higher by humans in general since they are often more generic (as detailed in Section 5). In the third example, the response actually seems to be somewhat reasonable given the context; this may be an instance where the human evaluator provided a score that was too low.
Data efficiency How much data is required to train ADEM? We conduct an experiment where we train ADEM on different amounts of training data, from 5% to 100%. The results are shown in Table 11. We can observe that ADEM is very data-efficient, and is capable of reaching a Spearman correlation of 0.4 using only half of the available training data (1000 labelled examples). ADEM correlates significantly with humans even when only trained on 5% of the original training data (100 labelled examples).
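For readers who want to reproduce this kind of measurement, the Spearman correlation between model scores and human scores can be sketched in plain Python as below. This is an illustrative implementation (Pearson correlation applied to average ranks), not the code used for the reported results; the function names are hypothetical, and p-values are not computed.

```python
import math


def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

Because human scores are integers on a 1-5 scale, ties are frequent, which is why the sketch uses average ranks rather than the tie-free shortcut formula.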

Improvement over word-overlap metrics
Next, we analyze more precisely how ADEM outperforms traditional word-overlap metrics such as BLEU-2 and ROUGE. We first normalize the metric scores to have the same mean and variance as human scores, clipping the resulting scores to the range [1, 5] (we assign raw scores of 0 a normalized score of 1). We indicate normalization with vertical bars around the metric. We then select all of the good responses that were given low scores by word-overlap metrics (i.e. responses which humans scored as 4 or higher, and which |BLEU-2| and |ROUGE| scored as 2 or lower). The results are summarized in Table 12: of the 237 responses that humans scored 4 or higher, most of them (147/237) were ranked very poorly by both BLEU-2 and ROUGE. This quantitatively demonstrates what we argued qualitatively in Figure 1: a major failure of word-overlap metrics is the inability to consider reasonable responses that have no word-overlap with the reference response. We can also see that, in 60 of the 147 cases (about 41%) where both BLEU-2 and ROUGE fail, |ADEM| is able to correctly assign a score greater than 4. For comparison, there are only 42 responses where humans give a score of 4 and |ADEM| gives a score less than 2, and only 14 of these are assigned a score greater than 4 by either |BLEU-2| or |ROUGE|.

Table 13: Effect of differences in response length on the score, ∆w = absolute difference in #words between the reference response and proposed response. BLEU-1, BLEU-2, and METEOR have previously been shown to exhibit bias towards similar-length responses (Liu et al., 2016).
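The normalization and selection procedure described in this section can be sketched as follows. This is a minimal illustration, assuming the rescaling is a standard z-score transform using population standard deviations; the function names `normalize_metric` and `missed_good_responses` are hypothetical and this is not the original analysis code.

```python
import statistics


def normalize_metric(metric_scores, human_scores, low=1.0, high=5.0):
    """Rescale metric scores to the human mean/std, then clip to [low, high]."""
    m_mean = statistics.mean(metric_scores)
    m_std = statistics.pstdev(metric_scores)  # population std (assumed)
    h_mean = statistics.mean(human_scores)
    h_std = statistics.pstdev(human_scores)
    if m_std == 0:
        raise ValueError("metric scores are constant; cannot normalize")
    normalized = []
    for s in metric_scores:
        if s == 0:
            # Raw scores of 0 are assigned the minimum normalized score.
            normalized.append(low)
            continue
        z = (s - m_mean) / m_std
        normalized.append(min(high, max(low, h_mean + z * h_std)))
    return normalized


def missed_good_responses(human, bleu2_norm, rouge_norm, good=4.0, poor=2.0):
    """Indices humans score >= good while both normalized metrics are <= poor."""
    return [
        i
        for i, (h, b, r) in enumerate(zip(human, bleu2_norm, rouge_norm))
        if h >= good and b <= poor and r <= poor
    ]
```

Under these assumptions, `missed_good_responses` applied to the human scores and the normalized |BLEU-2| and |ROUGE| scores yields the set of good responses that both word-overlap metrics rank poorly.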
To provide further insight, we give specific examples of responses that are scored highly (> 4) by both humans and |ADEM|, and poorly (< 2) by both |BLEU-2| and |ROUGE| in Table 14. We draw 3 responses randomly (i.e. no cherry-picking) from the 60 test set responses that meet these criteria. We can observe that ADEM is able to recognize short responses that are appropriate to the context, without word-overlap with the reference response. This is even the case when the model and reference responses have very little semantic similarity, as in the first and third examples in Table 14.
Finally, we show the behaviour of ADEM when there is a discrepancy between the lengths of the reference and model responses. Liu et al. (2016) show that word-overlap metrics such as BLEU-1, BLEU-2, and METEOR exhibit a bias in this scenario: they tend to assign higher scores to responses that are closer in length to the reference response. However, humans do not exhibit this bias; in other words, the quality of a response as judged by a human is roughly independent of its length. In Table 13, we show that ADEM also does not exhibit this bias towards similar-length responses.
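A length-bias check of this kind can be sketched as below. It assumes whitespace tokenization for counting words and a single hypothetical `threshold` separating small from large ∆w; the bucketing actually used for Table 13 is not given in this section, so both choices are illustrative only.

```python
from collections import defaultdict


def length_difference(reference, response):
    """Delta-w: absolute difference in whitespace-token counts."""
    return abs(len(reference.split()) - len(response.split()))


def mean_score_by_length_gap(examples, threshold):
    """Average score for examples with delta-w <= threshold vs. > threshold.

    examples: iterable of (reference, response, score) triples.
    Returns a dict with keys "small" and/or "large" (only non-empty buckets).
    """
    buckets = defaultdict(list)
    for reference, response, score in examples:
        gap = length_difference(reference, response)
        key = "small" if gap <= threshold else "large"
        buckets[key].append(score)
    return {key: sum(scores) / len(scores) for key, scores in buckets.items()}
```

A metric that is biased towards similar-length responses would show a noticeably higher mean in the "small" bucket than in the "large" one, whereas an unbiased scorer would show similar means in both.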

Figure 2: The ADEM model, which uses a hierarchical encoder to produce the context embedding c.

Figure 3: The VHRED model used for pre-training. The hierarchical structure of the RNN encoder is shown in the red box around the bottom half of the figure. After training using the VHRED procedure, the last hidden state of the context-level encoder is used as a vector representation of the input text.

Figure 4: Scatter plot showing model against human scores, for BLEU-2 and ROUGE on the full dataset, and ADEM on the test set. We add Gaussian noise drawn from N(0, 0.3) to the integer human scores to better visualize the density of points, at the expense of appearing less correlated.

Figure 5: Scatterplots depicting the system-level correlation results for ADEM, BLEU-2, BLEU-4, and ROUGE on the test set. Each point represents the average scores for the responses from a dialogue model (TFIDF, DE, HRED, human). Human scores are shown on the horizontal axis, with normalized metric scores on the vertical axis. The ideal metric has a perfectly linear relationship.

Figure 7: Plots showing the Spearman and Pearson correlations on the test set as ADEM trains. At the beginning of training, the model does not correlate with human judgements.

Table 1: Statistics of the dialogue response evaluation dataset. Each example is in the form (context, model response, reference response, human score).

Table 2: Correlation between metrics and human judgements, with p-values shown in brackets. 'ADEM (T2V)' indicates ADEM with tweet2vec embeddings.

Table 3: System-level correlation, with the p-value in brackets.
were written by humans (from retrieval models or human-generated), but is able to generalize to responses produced by a generative neural network model.When testing on the entire test set, the model achieves comparable correlations to the ADEM model that was trained on 25% less data selected at random.

Table 4: Correlation for ADEM when various model responses are removed from the training set. The left two columns show performance on the entire test set, and the right two columns show performance on responses only from the dialogue model not seen during training. The last row (25% at random) corresponds to the ADEM model trained on all model responses, but with the same amount of training data as the model above (i.e. 25% less data than the full training set).

Table 5: Examples of scores given by the ADEM model.
When run on GPU, ADEM is able to evaluate responses in a reasonable amount of time (approximately 2.5 minutes). This includes the time for encoding the contexts, model responses, and reference responses into vectors with the hierarchical RNN, in addition to computing the PCA projection, but does not include pre-training with VHRED. For comparison, if run on a test set of 10,000 responses, ADEM would take approximately 45 minutes. This is significantly less time consuming than setting up human experiments at any scale. Note that we have not yet made any effort to optimize the speed of the ADEM model.

Table 9: Examples where a human and either BLEU-2 or ROUGE (after normalization) score the model response highly (> 4/5), while the ADEM model scored it poorly (< 2/5). These examples are drawn randomly (i.e. no cherry-picking). The bars around |metric| indicate that the metric scores have been normalized.

Table 10: Examples where a human and either BLEU-2 or ROUGE (after normalization) score the model response low (< 2/5), while the ADEM model scored it highly (> 4/5). These examples are drawn randomly (i.e. no cherry-picking). The bars around |metric| indicate that the metric scores have been normalized.

Table 11: ADEM correlations when trained on different amounts of data.

Table 12: In 60/146 cases, ADEM scores good responses (human score > 4) highly when word-overlap metrics fail. The bars around |metric| indicate that the metric scores have been normalized.