How NOT To Evaluate Your Dialogue System: An Empirical Study of Unsupervised Evaluation Metrics for Dialogue Response Generation

We investigate evaluation metrics for dialogue response generation systems where supervised labels, such as task completion, are not available. Recent works in response generation have adopted metrics from machine translation to compare a model's generated response to a single target response. We show that these metrics correlate very weakly with human judgements in the non-technical Twitter domain, and not at all in the technical Ubuntu domain. We provide quantitative and qualitative results highlighting specific weaknesses in existing metrics, and provide recommendations for future development of better automatic evaluation metrics for dialogue systems.


Introduction
An important aspect of dialogue response generation systems, which are trained to produce a reasonable utterance given a conversational context, is how to evaluate the quality of the generated response. Typically, evaluation is done using human-generated supervised signals, such as a task completion test or a user satisfaction score (Walker et al., 1997; Möller et al., 2006; Kamm, 1995), which are relevant when the dialogue is task-focused. We call models optimized for such supervised objectives supervised dialogue models, while those that are not are unsupervised dialogue models.
This paper focuses on unsupervised dialogue response generation models, such as chatbots. These models are receiving increased attention, particularly using end-to-end training with neural networks (Serban et al., 2016; Sordoni et al., 2015; Vinyals and Le, 2015). This avoids the need to collect supervised labels on a large scale, which can be prohibitively expensive. However, automatically evaluating the quality of these models remains an open question. Automatic evaluation metrics would help accelerate the deployment of unsupervised response generation systems.
Faced with similar challenges, other natural language tasks have successfully developed automatic evaluation metrics. For example, BLEU (Papineni et al., 2002a) and METEOR (Banerjee and Lavie, 2005) are now standard for evaluating machine translation models, and ROUGE (Lin, 2004) is often used for automatic summarization. These metrics have recently been adopted by dialogue researchers (Ritter et al., 2011; Sordoni et al., 2015; Li et al., 2015; Galley et al., 2015b; Wen et al., 2015; Li et al., 2016). However, these metrics assume that valid responses have significant word overlap with the ground truth responses. This is a strong assumption for dialogue systems, where there is significant diversity in the space of valid responses to a given context. This is illustrated in Table 1, where two reasonable responses are proposed to the context, but these responses do not share any words in common and do not have the same semantic meaning.
In this paper, we investigate the correlation between the scores from several automatic evaluation metrics and human judgements of dialogue response quality, for a variety of response generation models. We consider both statistical word-overlap similarity metrics such as BLEU, METEOR, and ROUGE, and word embedding metrics derived from word embedding models such as Word2Vec (Mikolov et al., 2013). We find that all metrics show either weak or no correlation with human judgements, despite the fact that word overlap metrics have been used extensively in the literature for evaluating dialogue response models (see above, and Lasguido et al. (2014)). In particular, we show that these metrics have only a small positive correlation on the chitchat oriented Twitter dataset, and no correlation at all on the technical Ubuntu Dialogue Corpus. For the word embedding metrics, we show that this is true even though all metrics are able to significantly distinguish between baseline and state-of-the-art models across multiple datasets. We further highlight the shortcomings of these metrics using: a) a statistical analysis of our survey's results; b) a qualitative analysis of examples from our data; and c) an exploration of the sensitivity of the metrics.
Table 1: Example showing the intrinsic diversity of valid responses in a dialogue. The (reasonable) model response would receive a BLEU score of 0.
Our results indicate that a shift must be made in the research community away from these metrics, and highlight the need for a new metric that correlates more strongly with human judgement.

Related Work
We focus on metrics that are model-independent, i.e. where the model generating the response does not also evaluate its quality; thus, we do not consider word perplexity, although it has been used to evaluate unsupervised dialogue models (Serban et al., 2015). This is because it is not computed on a per-response basis, and cannot be computed for retrieval models. Further, we only consider metrics that can be used to evaluate proposed responses against ground-truth responses, so we do not consider retrieval-based metrics such as recall, which has been used to evaluate dialogue models (Schatzmann et al., 2005; Lowe et al., 2015). We also do not consider evaluation methods for supervised dialogue systems.
Several recent works on unsupervised dialogue systems adopt the BLEU score for evaluation. Ritter et al. (2011) formulate the unsupervised learning problem as one of translating a context into a candidate response. They use a statistical machine translation (SMT) model to generate responses to various contexts using Twitter data, and show that it outperforms information retrieval baselines according to both BLEU and human evaluations. Sordoni et al. (2015) extend this idea using a recurrent language model to generate responses in a context-sensitive manner. They also evaluate using BLEU; however, they produce multiple ground truth responses by retrieving 15 responses from elsewhere in the corpus, using a simple bag-of-words model. Li et al. (2015) evaluate their proposed diversity-promoting objective function for neural network models using BLEU score with only a single ground truth response. A modified version of BLEU, deltaBLEU (Galley et al., 2015b), which takes into account several human-evaluated ground truth responses, is shown to have a weak to moderate correlation to human judgements using Twitter dialogues. However, such human annotation is often infeasible to obtain in practice. Galley et al. (2015b) also show that, even with several ground truth responses available, the standard BLEU metric does not correlate strongly with human judgements.
There has been significant previous work that evaluates how well automatic metrics correlate with human judgements in both machine translation (Callison-Burch et al., 2010; Callison-Burch et al., 2011; Bojar et al., 2014; Graham et al., 2015) and natural language generation (NLG) (Stent et al., 2005; Cahill, 2009; Reiter and Belz, 2009; Espinosa et al., 2010). There has also been work criticizing the usefulness of BLEU in particular for machine translation (Callison-Burch et al., 2006). While many of the criticisms in these works apply to dialogue generation, we note that generating dialogue responses conditioned on the conversational context is in fact a more difficult problem. This is because most of the difficulty in automatically evaluating language generation models lies in the large set of correct answers. Dialogue response generation given solely the context intuitively has a higher diversity (or entropy) than translation given text in a source language, or surface realization given some intermediate form (Artstein et al., 2009).

Evaluation Metrics
Given a dialogue context and a proposed response, our goal is to automatically evaluate how appropriate the proposed response is to the conversation. We focus on metrics that compare it to the ground truth response of the conversation. In particular, we investigate two approaches: word-based similarity metrics and word-embedding-based similarity metrics.

Word Overlap-based Metrics
We first consider metrics that evaluate the amount of word overlap between the proposed response and the ground-truth response. We examine the BLEU and METEOR scores that have been used for machine translation, and the ROUGE score that has been used for automatic summarization. While these metrics have been shown to correlate with human judgements in their target domains (Papineni et al., 2002a; Lin, 2004), they have not been thoroughly investigated for dialogue systems (footnote 2).
Footnote 2: To the best of our knowledge, only BLEU has been evaluated in the dialogue system setting quantitatively by Galley et al. (2015a) on the Twitter domain. However, they carried out their experiments in a very different setting with multiple ground truth responses, which are rarely available in practice, and without providing any qualitative analysis of their results.
We denote the ground truth response as $r$ (thus we assume that there is a single candidate ground truth response), and the proposed response as $\hat{r}$. The $j$'th token in the ground truth response $r$ is denoted by $w_j$, with $\hat{w}_j$ denoting the $j$'th token in the proposed response $\hat{r}$.
BLEU. BLEU (Papineni et al., 2002a) analyzes the co-occurrences of n-grams in the ground truth and the proposed responses. It first computes an n-gram precision for the whole dataset (we assume that there is a single candidate ground truth response per context):
$$P_n(r, \hat{r}) = \frac{\sum_k \min(h(k, r), h(k, \hat{r}))}{\sum_k h(k, \hat{r})}$$
where $k$ indexes all possible n-grams of length $n$ and $h(k, r)$ is the number of n-grams $k$ in $r$. To avoid the drawbacks of using a precision score, namely that it favours shorter (candidate) sentences, the authors introduce a brevity penalty. BLEU-N, where N is the maximum length of n-grams considered, is defined as:
$$\text{BLEU-N} := b(r, \hat{r}) \exp\left(\sum_{n=1}^{N} \beta_n \log P_n(r, \hat{r})\right)$$
where $\beta_n$ is a weighting that is usually uniform, and $b(\cdot)$ is the brevity penalty. The most commonly used version of BLEU uses N = 4. Modern versions of BLEU also use sentence-level smoothing, as the geometric mean often results in scores of 0 if there is no 4-gram overlap (Chen and Cherry, 2014). Note that BLEU is usually calculated at the corpus level, and was originally designed for use with multiple reference sentences.
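To make the interaction between the n-gram precision, the geometric mean, and the brevity penalty concrete, the following Python sketch computes a sentence-level BLEU-N against a single ground truth response. It is a minimal illustration under stated assumptions rather than the implementation used in our experiments: all function names are hypothetical, the precision is computed per response instead of over the whole dataset, the brevity penalty follows the usual exponential form, and `epsilon` is a simple additive floor standing in for proper smoothing.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram (as a tuple) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_precision(reference, proposed, n):
    """Clipped n-gram precision of `proposed` against a single `reference`."""
    ref, hyp = ngram_counts(reference, n), ngram_counts(proposed, n)
    total = sum(hyp.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, ref[gram]) for gram, count in hyp.items())
    return matched / total


def brevity_penalty(reference, proposed):
    """Penalize proposed responses that are shorter than the reference."""
    if not proposed:
        return 0.0
    if len(proposed) >= len(reference):
        return 1.0
    return math.exp(1.0 - len(reference) / len(proposed))


def bleu(reference, proposed, max_n=4, epsilon=0.0):
    """Sentence-level BLEU-N with uniform weights beta_n = 1 / max_n.

    `epsilon` is a hypothetical additive floor (an assumption, not the
    smoothing of Chen and Cherry); with epsilon=0, a single n-gram order
    with no overlap drives the whole score to 0.
    """
    log_sum = 0.0
    for n in range(1, max_n + 1):
        p = ngram_precision(reference, proposed, n) + epsilon
        if p == 0.0:
            return 0.0
        log_sum += math.log(p) / max_n
    return brevity_penalty(reference, proposed) * math.exp(log_sum)
```

With an invented pair such as `"nah i hate that stuff".split()` and `"oh i love horror".split()`, the unsmoothed score is 0 for any N greater than 1 because no bigram is shared, which is the behaviour that motivates smoothing at the sentence level.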
METEOR. The METEOR metric (Banerjee and Lavie, 2005) was introduced to address several weaknesses in BLEU. It creates an explicit alignment between the candidate and target responses. The alignment is based on exact token matching, followed by WordNet synonyms, stemmed tokens, and then paraphrases. Given a set of alignments, the METEOR score is the harmonic mean of precision and recall between the proposed and ground truth sentence.
ROUGE. ROUGE (Lin, 2004) is a set of evaluation metrics used for automatic summarization. We consider ROUGE-L, which is an F-measure based on the Longest Common Subsequence (LCS) between a candidate and target sentence. The LCS is a set of words which occur in two sentences in the same order; however, unlike n-grams the words do not have to be contiguous, i.e. there can be other words in between the words of the LCS.
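The LCS-based F-measure can be sketched in a few lines of Python. This is an illustrative sketch, not the official ROUGE toolkit: the function names are hypothetical, and the `beta` parameter (set to 1 here, giving a balanced F-measure) is an assumption, since the weighting between precision and recall is not specified above.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            # Extend the diagonal on a match, otherwise keep the best so far.
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(reference, proposed, beta=1.0):
    """F-measure over LCS-based precision and recall (beta is an assumption)."""
    lcs = lcs_length(reference, proposed)
    if lcs == 0:
        return 0.0
    recall = lcs / len(reference)
    precision = lcs / len(proposed)
    return (1 + beta ** 2) * precision * recall / (recall + beta ** 2 * precision)
```

For example, `"thanks for the info"` and `"thanks a lot for info"` share the subsequence (thanks, for, info) even though the matched words are not contiguous.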

Embedding-based Metrics
An alternative to using word-overlap based metrics is to consider the meaning of each word as defined by a word embedding, which assigns a vector to each word. Methods such as Word2Vec (Mikolov et al., 2013) calculate these embeddings using distributional semantics; that is, they approximate the meaning of a word by considering how often it co-occurs with other words in the corpus. These embedding-based metrics usually approximate sentence-level embeddings using some heuristic to combine the vectors of the individual words in the sentence. The sentence-level embeddings of the candidate and target responses are then compared using a measure such as cosine distance.
Greedy Matching. Greedy matching is the one embedding-based metric that does not compute sentence-level embeddings. Instead, given two sequences $r$ and $\hat{r}$, each token $w \in r$ is greedily matched with a token $\hat{w} \in \hat{r}$ based on the cosine similarity of their word embeddings ($e_w$), and the total score is then averaged across all words:
$$G(r, \hat{r}) = \frac{\sum_{w \in r} \max_{\hat{w} \in \hat{r}} \operatorname{cos\_sim}(e_w, e_{\hat{w}})}{|r|}$$
This formula is asymmetric, thus we must average the greedy matching scores $G$ in each direction, i.e. we use $(G(r, \hat{r}) + G(\hat{r}, r)) / 2$. This was originally introduced for intelligent tutoring systems (Rus and Lintean, 2012). The greedy approach favours responses with key words that are semantically similar to those in the ground truth response.
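The asymmetry of the one-directional score, and the averaging that removes it, can be seen in a small Python sketch. The code below is illustrative only: the function names and the two-dimensional toy embedding table `TOY_EMB` are invented for this example, and real experiments would use pretrained vectors such as Word2Vec.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def greedy_one_way(source, target, emb):
    """Average, over tokens in `source`, of the best cosine match in `target`."""
    best = [max(cosine(emb[w], emb[t]) for t in target) for w in source]
    return sum(best) / len(source)


def greedy_matching(reference, proposed, emb):
    """Symmetrized score: mean of the two one-directional scores."""
    return 0.5 * (greedy_one_way(reference, proposed, emb)
                  + greedy_one_way(proposed, reference, emb))


# Hypothetical 2-d embeddings, chosen only so the arithmetic is easy to follow.
TOY_EMB = {"thanks": [1.0, 0.0], "cheers": [0.8, 0.6], "bye": [0.0, 1.0]}
```

With these toy vectors, matching `["thanks", "bye"]` against `["cheers"]` gives 0.7 in one direction and 0.8 in the other, so the one-directional score alone would depend on which response is treated as the reference.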
Embedding Average. The embedding average metric calculates sentence-level embeddings using additive composition, a method for computing the meanings of phrases by averaging the vector representations of their constituent words (Foltz et al., 1998; Landauer and Dumais, 1997; Mitchell and Lapata, 2008). This method has been widely used in other domains, for example in textual similarity tasks (Wieting et al., 2015). The embedding average, $\bar{e}_r$, is defined as the mean of the word embeddings of each token in a sentence $r$:
$$\bar{e}_r = \frac{1}{|r|} \sum_{w \in r} e_w$$
To compare a ground truth response $r$ and proposed response $\hat{r}$, we compute the cosine similarity between their respective sentence-level embeddings: $\text{EA} := \cos(\bar{e}_r, \bar{e}_{\hat{r}})$.
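A minimal Python sketch of the embedding average score follows. It assumes a plain dictionary from tokens to vectors; the function names and the toy vectors mentioned in the note below are hypothetical. Because cosine similarity is invariant to scaling, averaging and summing the word vectors give the same score.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_embedding(tokens, emb):
    """Component-wise mean of the embeddings of all tokens in a sentence."""
    vectors = [emb[w] for w in tokens]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def embedding_average(reference, proposed, emb):
    """Cosine similarity between the two sentence-level mean embeddings."""
    return cosine(mean_embedding(reference, emb), mean_embedding(proposed, emb))
```

Note that every token contributes equally to the mean, so frequent function words influence the sentence vector as much as content words do.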
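Vector Extrema. Another way to calculate sentence-level embeddings is using vector extrema (Forgues et al., 2014). For each dimension of the word vectors, take the most extreme value amongst all word vectors in the sentence, and use that value in the sentence-level embedding:
$$e_{rd} = \begin{cases} \max_{w \in r} e_{wd} & \text{if } \max_{w \in r} e_{wd} > |\min_{w' \in r} e_{w'd}| \\ \min_{w \in r} e_{wd} & \text{otherwise} \end{cases}$$
where $d$ indexes the dimensions of a vector and $e_{wd}$ is the $d$'th dimension of $e_w$ ($w$'s embedding). The min in this equation refers to the selection of the largest negative value, if it has a greater magnitude than the largest positive value. Similarity between response vectors is again computed using cosine distance. Intuitively, this approach prioritizes informative words over common ones; words that appear in similar contexts will be close together in the vector space. Thus, common words are pulled towards the origin because they occur in various contexts, while words carrying important semantic information will lie further away. By taking the extrema along each dimension, we are thus more likely to ignore common words.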
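The per-dimension selection used by vector extrema (keep, for each dimension, whichever of the largest and smallest values across the sentence has the greater magnitude) can be sketched as follows. This is an illustrative Python sketch; the function names are hypothetical and the embedding table is assumed to be a plain dictionary from tokens to vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def extrema_embedding(tokens, emb):
    """For each dimension, keep the value of greatest magnitude across tokens."""
    vectors = [emb[w] for w in tokens]
    sentence = []
    for column in zip(*vectors):
        highest, lowest = max(column), min(column)
        # Keep the maximum unless the minimum is larger in absolute value.
        sentence.append(highest if highest > abs(lowest) else lowest)
    return sentence


def vector_extrema(reference, proposed, emb):
    """Cosine similarity between the two extrema-based sentence embeddings."""
    return cosine(extrema_embedding(reference, emb),
                  extrema_embedding(proposed, emb))
```

For two hypothetical word vectors `[0.2, -0.9]` and `[0.5, 0.3]`, the sentence embedding is `[0.5, -0.9]`: the first dimension keeps its maximum, while the second keeps the negative value because its magnitude is larger.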

Dialogue Response Generation Models
In order to determine the correlation between automatic metrics and human judgements of response quality, we obtain responses from a diverse range of response generation models in the recent literature, including both retrieval and generative models.

Retrieval Models
Ranking or retrieval models for dialogue systems are typically evaluated based on whether they can retrieve the correct response from a corpus of predefined responses, which includes the ground truth response to the conversation (Schatzmann et al., 2005). Such systems can be evaluated using recall or precision metrics. However, when deployed in a real setting these models will not have access to the correct response given an unseen conversation. Thus, in the results presented below we remove one occurrence of the ground-truth response from the corpus and ask the model to retrieve the most appropriate response from the remaining utterances. Note that this does not mean the correct response will not appear in the corpus at all; in particular, if there exists another context in the dataset with an identical ground-truth response, this will be available for selection by the model.
We then evaluate each model by comparing the retrieved response to the ground truth response of the conversation. This closely imitates real-life deployment of these models, as it tests the ability of the model to generalize to unseen contexts.
Table 2: Models evaluated using the vector-based evaluation metrics, with 95% confidence intervals.
TF-IDF. We consider a simple Term Frequency - Inverse Document Frequency (TF-IDF) retrieval model (Lowe et al., 2015). TF-IDF is a statistic that intends to capture how important a given word is to some document, which is calculated as:
$$\text{tfidf}(w, c, C) = f(w, c) \times \log \frac{N}{|\{c \in C : w \in c\}|}$$
where $C$ is the set of all contexts in the corpus, $f(w, c)$ indicates the number of times word $w$ appeared in context $c$, $N$ is the total number of dialogues, and the denominator represents the number of dialogues in which the word $w$ appears.
In order to apply TF-IDF as a retrieval model for dialogue, we first compute the TF-IDF vectors for each context and response in the corpus.We then return the response with the largest cosine similarity in the corpus, either between the input context and corpus contexts (C-TFIDF), or between the input context and corpus responses (R-TFIDF).
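The C-TFIDF variant can be sketched in Python as follows. This is a toy illustration rather than our experimental code: the corpus is assumed to be a list of (context tokens, response tokens) pairs, all names are hypothetical, and query words never seen in the corpus are simply ignored, which is an assumption of this sketch. R-TFIDF would follow the same pattern with the similarity computed against corpus responses instead of corpus contexts.

```python
import math
from collections import Counter


def tfidf_vector(tokens, doc_freq, num_docs):
    """Sparse TF-IDF vector: term count times log(N / document frequency)."""
    counts = Counter(tokens)
    return {w: f * math.log(num_docs / doc_freq[w])
            for w, f in counts.items() if doc_freq.get(w)}


def sparse_cosine(u, v):
    """Cosine similarity of two sparse vectors stored as dictionaries."""
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


def c_tfidf_retrieve(query_context, corpus):
    """Return the response whose context is most similar to the query context."""
    num_docs = len(corpus)
    # Document frequency: number of corpus contexts containing each word.
    doc_freq = Counter(w for context, _ in corpus for w in set(context))
    query = tfidf_vector(query_context, doc_freq, num_docs)
    best = max(corpus, key=lambda pair: sparse_cosine(
        query, tfidf_vector(pair[0], doc_freq, num_docs)))
    return best[1]
```

On a three-dialogue toy corpus, a query such as `"wifi is broken again"` retrieves the response paired with the only context that mentions `wifi` and `broken`, regardless of whether that response is actually appropriate to the new conversation.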
Dual Encoder. Next we consider the recurrent neural network (RNN) based architecture called the Dual Encoder (DE) model (Lowe et al., 2015). The DE model consists of two RNNs which respectively compute the vector representation of an input context and response, $c, r \in \mathbb{R}^n$. The model then calculates the probability that the given response is the ground truth response given the context, by taking a weighted dot product: $p(r \text{ is correct} \mid c, r, M) = \sigma(c^T M r + b)$, where $M$ is a matrix of learned parameters and $b$ is a bias. The model is trained using negative sampling to minimize the cross-entropy error of all (context, response) pairs. To our knowledge, our application of neural network models to large-scale retrieval in dialogue systems is novel.
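The scoring step of the Dual Encoder reduces to a bilinear form followed by a sigmoid, which the following Python sketch makes explicit. It covers only the final scoring function: the two RNN encoders are omitted, the context and response vectors are assumed to be given as plain lists, and the function names are hypothetical.

```python
import math


def sigmoid(x):
    """Logistic function mapping a real score to a probability."""
    return 1.0 / (1.0 + math.exp(-x))


def de_score(c, r, M, b):
    """Compute sigma(c^T M r + b) for vectors c, r and a matrix M (list of rows)."""
    # First M r, then the dot product with c.
    Mr = [sum(m_ij * r_j for m_ij, r_j in zip(row, r)) for row in M]
    return sigmoid(sum(c_i * v for c_i, v in zip(c, Mr)) + b)
```

At retrieval time, a model of this form would score every candidate response against the encoded context and return the highest-scoring one.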

Generative Models
In addition to retrieval models, we also consider generative models. In this context, we refer to a model as generative if it is able to generate entirely new sentences that are unseen in the training set.
LSTM language model. The baseline model is an LSTM language model (Hochreiter and Schmidhuber, 1997) trained to predict the next word in the (context, response) pair. During test time, the model is given a context, encodes it with the LSTM and generates a response using a greedy beam search procedure (Graves, 2013).
HRED. Finally we consider the Hierarchical Recurrent Encoder-Decoder (HRED) (Serban et al., 2015). In the traditional Encoder-Decoder framework, all utterances in the context are concatenated together before encoding. Thus, information from previous utterances is far outweighed by the most recent utterance. The HRED model uses a hierarchy of encoders; each utterance in the context passes through an 'utterance-level' encoder, and the output of these encoders is passed through another 'context-level' encoder, which enables the handling of longer-term dependencies.

Conclusions from an Incomplete Analysis
When evaluation metrics are not explicitly correlated to human judgement, it is possible to draw misleading conclusions by examining how the metrics rate different models. To illustrate this point, we compare the performance of selected models according to the embedding metrics on two different domains: the Ubuntu Dialogue Corpus (Lowe et al., 2015), which contains technical vocabulary and where conversations are often oriented towards solving a particular problem, and a non-technical Twitter corpus collected following the procedure of Ritter et al. (2010). We consider these two datasets since they cover contrasting dialogue domains, i.e. technical help vs casual chit-chat, and because they are amongst the largest publicly available corpora, making them good candidates for building data-driven dialogue systems.
Results on the proposed embedding metrics are shown in Table 2. For the retrieval models, we observe that the DE model significantly outperforms both TFIDF baselines on all metrics across both datasets. Further, the HRED model significantly outperforms the basic LSTM generative model in both domains, and appears to be of similar strength to the DE model. Based on these results, one might be tempted to conclude that there is some information being captured by these metrics that significantly differentiates models of different quality. However, as we show in the next section, the embedding-based metrics correlate only weakly with human judgements on the Twitter corpus, and not at all on the Ubuntu Dialogue Corpus. This demonstrates that metrics that have not been specifically correlated with human judgements on a new task should not be used to evaluate that task.

Human Correlation Analysis
Data Collection. We conducted a human survey to determine the correlation between human judgements on the quality of responses, and the score assigned by each metric. We aimed to follow the procedure for the evaluation of BLEU (Papineni et al., 2002a). 25 volunteers from the Computer Science department at the authors' institution were given a context and one proposed response, and were asked to judge the response quality on a scale of 1 to 5 (footnote 5); a 1 indicates that the response is not appropriate or sensible given the context, and a 5 indicates that the response is very reasonable. Out of the 25 respondents, 23 had Cohen's kappa scores, a standard measure for inter-rater agreement (Cohen, 1968), of κ > 0.2 with respect to the other respondents. The 2 respondents with κ < 0.2, indicating slight agreement, were excluded from the analysis below. The median κ score was approximately 0.55, roughly indicating moderate to strong annotator agreement.
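For readers unfamiliar with the agreement statistic, the following Python sketch computes Cohen's kappa for one pair of raters. It is a sketch under explicit assumptions: it implements the unweighted form of kappa, whereas the text above does not state whether a weighted variant was used, and it does not reproduce how per-respondent scores were aggregated over the other respondents. The function name is hypothetical.

```python
from collections import Counter


def cohen_kappa(ratings_a, ratings_b):
    """Unweighted Cohen's kappa between two equally long lists of labels."""
    n = len(ratings_a)
    # Observed agreement: fraction of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(ratings_a, ratings_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    return (observed - expected) / (1.0 - expected)
```

A value of 0 means agreement no better than chance and 1 means perfect agreement; the function is undefined (division by zero) when both raters use a single identical label for every item.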
Each volunteer was given 100 questions per dataset. These questions correspond to 20 unique contexts, with 5 different responses: one utterance randomly drawn from elsewhere in the test set, the response selected from each of the TF-IDF, DE, and HRED models, and a response written by a human annotator. These were chosen as they cover the range of qualities almost uniformly (see Figure 1).
Footnote 5: Studies asking humans to evaluate text often rate different aspects separately, such as 'adequacy', 'fluency' and 'informativeness' of the text (Hovy, 1999; Papineni et al., 2002b). Our evaluation focuses on adequacy. We did not consider fluency because 4 out of the 5 proposed responses to each context were generated by a human. We did not consider informativeness because in the domains considered, it is not necessarily important (in Twitter), or else it seems to correlate highly with adequacy (in Ubuntu).
Survey Results. We present correlation results between the human judgements and each metric in Table 3. We compute the Pearson correlation, which estimates linear correlation, and Spearman correlation, which estimates any monotonic correlation.
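The difference between the two correlation coefficients can be illustrated with a short Python sketch. It is a plain-Python illustration with hypothetical function names, not the statistical package used for our tables; Spearman correlation is computed as the Pearson correlation of average ranks, which handles the ties that arise when many responses receive the same metric score.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))
```

For a monotonic but non-linear relationship, such as scores `[1, 2, 3, 4, 5]` against `[1, 4, 9, 16, 25]`, the Spearman correlation is 1 while the Pearson correlation is slightly lower. Both functions are undefined when one of the inputs is constant.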
The first observation is that in both domains the BLEU-4 score, which has previously been used to evaluate unsupervised dialogue systems, shows very weak if any correlation with human judgement. In fact we found that the BLEU-3 and BLEU-4 scores were near-zero for a majority of response pairs; for BLEU-4, only four examples had a score $> 10^{-9}$. Despite this, they still correlate with human judgements on the Twitter Corpus at a rate similar to BLEU-2. This is because of the smoothing constant, which gives a tiny weight to unigrams and bigrams despite the absence of higher-order n-grams. BLEU-3 and BLEU-4 behave as a scaled, noisy version of BLEU-2; thus, if one is to evaluate dialogue responses with BLEU, we recommend the choice of N = 2 over N = 3 or 4. Note that using a test corpus larger than the size reported in this paper may lead to stronger correlations for BLEU-3 and BLEU-4, due to a higher number of non-zero scores.

Context of Conversation
A: dearest! question. how many thousands of people can panaad occupy?
B: @user panaad has <number> k seat capacity while rizal has <number> k thats why they choose rizal i think .
Ground Truth Response
A: now i know about the siting capacity . thanks for the info @user great evening.
Proposed Response
A: @user makes sense. thanks!

Context of Conversation
A: never felt more sad than i am now
B: @user aww why ?
A: @user @user its a long story ! sure you wanna know it ? bahaha and thanks for caring btw <heart>
Ground Truth Response
A: @user i don 't mind to hear it i 've got all day and youre welcome <number>
Proposed Response
A: @user i know , i 'm just so happy for you !

Figure 2: Examples where the metrics rated the response poorly and humans rated it highly (left), and the converse (right). Both responses are given near-zero score by BLEU-N for N > 1. While no metric will perform perfectly on all examples, we present these examples to provide intuition on how example-level errors become aggregated into poor correlation to human judgements at the corpus level.
It is interesting to note that, while some of the embedding metrics and BLEU show small positive correlation in the non-technical Twitter domain, there is no metric that significantly correlates with humans on the Ubuntu Dialogue Corpus. This is likely because the correct Ubuntu responses contain specific technical words that are less likely to be produced by our models. Further, it is possible that responses in the Ubuntu Dialogue Corpus have intrinsically higher variability (or entropy) than Twitter when conditioned on the context, making the evaluation problem significantly more difficult.
Figure 1 illustrates the relationship between metrics and human judgements. We include only the best performing metric using word-overlaps, i.e. the BLEU-2 score (left), and the best performing metric using word embeddings, i.e. the embedding average (center). These plots show how weak the correlation is: in both cases, they appear to be random noise. It seems as though the BLEU score obtains a positive correlation because of the large number of responses that are given a score of 0 (bottom left corner of the first plot). This is in stark contrast to the inter-rater agreement, which is plotted between two randomly sampled halves of the raters (right-most plots).
We also calculated the BLEU scores after removing stopwords and punctuation from the responses. As shown in Table 4, this weakens the correlation with human judgements for BLEU-2 compared to the values in Table 3, and suggests that BLEU is sensitive to factors that do not change the semantics of the response.
Finally, we examined the effect of response length on the metrics, by considering changes in scores when the ground truth and proposed response had a large difference in word counts. Table 4 shows that BLEU and METEOR are particularly sensitive to this aspect, compared to the Embedding Average metric and human judgement.
Qualitative Analysis. In order to determine specifically why the metrics fail, we examine qualitative samples where there is a disagreement between the metrics and human rating. Although these only show inconsistencies at the example level, they provide some intuition as to why the metrics don't correlate with human judgements at the corpus level. We present in Figure 2 two examples where all of the embedding-based metrics and BLEU-1 score the proposed response significantly differently than the humans.
The left of Figure 2 shows an example where the embedding-based metrics score the proposed response poorly, while humans rate it highly. It is clear from the context that the proposed response is reasonable; indeed, both responses intend to express gratitude. However, the proposed response has a different wording than the ground truth response, and therefore the metrics are unable to separate the salient words from the rest. This suggests that the embedding-based metrics would benefit from a weighting of word saliency.
The right of the figure shows the reverse scenario: the embedding-based metrics score the proposed response highly, while humans do not. This is most likely due to the frequently occurring 'i' token, and the fact that 'happy' and 'welcome' may be close together in the embedding space. However, from a human perspective there is a significant semantic difference between the responses as they pertain to the context. Metrics that take into account the context may be required in order to differentiate these responses. Note that in both responses in Figure 2, there are no overlapping n-grams greater than unigrams between the ground truth and proposed responses; thus, all of BLEU-2,3,4 would assign a score near 0 to the response.

Discussion
We have shown that many metrics commonly used in the literature for evaluating unsupervised dialogue systems do not correlate strongly with human judgement. Here we elaborate on important issues arising from our analysis.
Constrained tasks. Our analysis focuses on relatively unconstrained domains. Other work, which separates the dialogue system into a dialogue planner and a natural language generation component for applications in constrained domains, may find stronger correlations with the BLEU metric. For example, Wen et al. (2015) propose a model to map from dialogue acts to natural language sentences and use BLEU to evaluate the quality of the generated sentences. Since the mapping from dialogue acts to natural language sentences has lower diversity and is more similar to the machine translation task, it seems likely that BLEU will correlate better with human judgements. However, an empirical investigation is still necessary to justify this.
Incorporating multiple responses. Our correlation results assume that only one ground truth response is available given each context. Indeed, this is the common setting in most of the recent literature on training end-to-end conversation models. There has been some work on using a larger set of automatically retrieved plausible responses when evaluating with BLEU (Galley et al., 2015b). However, there is no standard method for doing this in the literature. Future work should examine how retrieving additional responses affects the correlation with word-overlap metrics.
Searching for suitable metrics. While we provide evidence against existing metrics, we do not yet provide good alternatives for unsupervised evaluation. Despite the poor performance of the word embedding-based metrics in this survey, we believe that metrics based on distributed sentence representations hold the most promise for the future. This is because word-overlap metrics will simply require too many ground-truth responses to find a significant match for a reasonable response, due to the high diversity of dialogue responses. As a simple example, the skip-thought vectors of Kiros et al. (2015) could be considered. Since the embedding-based metrics in this paper only consist of basic averages of vectors obtained through distributional semantics, they are insufficiently complex for modeling sentence-level compositionality in dialogue. Instead, these metrics can be interpreted as calculating the topicality of a proposed response (i.e. how on-topic the proposed response is, compared to the ground-truth).
All of the metrics considered in this paper directly compare a proposed response to the ground-truth, without considering the context of the conversation. However, metrics that take into account the context could also be considered. Such metrics could come in the form of an evaluation model that is learned from data. This model could be either a discriminative model that attempts to distinguish between model and human responses, or a model that uses data collected from the human survey in order to provide human-like scores to proposed responses. Finally, we must consider the hypothesis that learning such models from data is no easier than solving the problem of dialogue response generation. If this hypothesis is true, we must concede and always use human evaluations together with metrics that only roughly approximate human judgements.

Full scatter plots
We present the scatterplots for all of the metrics considered and their correlation with human judgement, in Figures 3-7 below. As previously emphasized, there is very little correlation for any of the metrics, and the BLEU-3 and BLEU-4 scores are often close to zero.

Figure 1: Scatter plots showing the correlation between metrics and human judgements on the Twitter corpus (a) and Ubuntu Dialogue Corpus (b). The plots represent BLEU-2 (left), embedding average (center), and correlation between two randomly selected halves of human respondents (right).

Figure 3: Scatter plots showing the correlation between two randomly chosen groups of human volunteers on the Twitter corpus (left) and Ubuntu Dialogue Corpus (right).

Figure 4: Scatter plots showing the correlation between metrics and human judgement on the Twitter corpus (left) and Ubuntu Dialogue Corpus (right). The plots represent ROUGE (a) and METEOR (b).

Figure 5: Scatter plots showing the correlation between metrics and human judgement on the Twitter corpus (left) and Ubuntu Dialogue Corpus (right). The plots represent vector extrema (a), greedy matching (b), and vector averaging (c).
Vector extrema: one way to calculate sentence-level embeddings is using vector extrema (Forgues et al., 2014). For each dimension of the word vectors, take the most extreme value amongst all word vectors in the sentence, and use that value in the sentence-level embedding:
$$e_{rd} = \begin{cases} \max_{w \in r} e_{wd} & \text{if } \max_{w \in r} e_{wd} > |\min_{w' \in r} e_{w'd}| \\ \min_{w \in r} e_{wd} & \text{otherwise} \end{cases}$$
where $d$ indexes the dimensions of a vector and $e_{wd}$ is the $d$'th dimension of $e_w$ ($w$'s embedding). The min in this equation refers to the selection of the largest negative value, if it has a greater magnitude than the largest positive value.

Table 4: Correlation between BLEU metric and human judgements after removing stopwords and punctuation for the Twitter dataset.

Table 3: Correlation between each metric and human judgements for each response. Correlations shown in the human row result from randomly dividing human judges into two groups.
Generative models: we refer to a model as generative if it is able to generate entirely new sentences that are unseen in the training set.

Table 6 shows the full distribution over κ scores for each pair of human annotators. It is apparent that most of the scores (88.9%) are over 0.4, indicating a moderate agreement. This suggests that the task was reasonable and well understood by the annotators.
Table 6: Distribution of pairwise κ scores between each pair of human annotators, other than the annotators that were discarded due to low scores.