Towards Holistic and Automatic Evaluation of Open-Domain Dialogue Generation

Open-domain dialogue generation has gained increasing attention in Natural Language Processing. Its evaluation requires a holistic approach. Human ratings are regarded as the gold standard, but because human evaluation is inefficient and costly, an automated substitute is highly desirable. In this paper, we propose holistic evaluation metrics that capture different aspects of open-domain dialogues. Our metrics consist of (1) GPT-2 based context coherence between sentences in a dialogue, (2) GPT-2 based fluency in phrasing, (3) n-gram based diversity in responses to augmented queries, and (4) textual-entailment-inference based logical self-consistency. The empirical validity of our metrics is demonstrated by strong correlations with human judgments. We open source the code and relevant materials.


Introduction
Learning to communicate is a key capacity of intelligent agents. Research on enabling a machine to have meaningful and natural conversations with humans plays a fundamental role in developing artificial general intelligence, as can be seen in the formulation of the Turing test (Turing, 1950). Recently, open-domain or non-task-oriented dialogue systems have attracted a surge of research interest (Bessho et al., 2012; Sordoni et al., 2015; Shang et al., 2015; Vinyals and Le, 2015; Ghazvininejad et al., 2018).
Evaluating models of open-domain dialogue generation in an efficient manner poses a significant challenge in developing dialogue systems. The prevalent method of open-domain dialogue evaluation is human-based rating with a given rubric. When many model variants and sets of hyper-parameters need to be compared, labor-intensive human evaluation becomes impractical. This key drawback may hinder research progress and renders the human evaluation approach unscalable.
Table 1: Two responses from a dialogue system (Wolf et al., 2019) on the Daily Dialogue Dataset. The first generated response appears reasonable within the open-domain dialogue, while its BLEU score and the semantic similarity between the model response and the reference response are low. The second generated response conflicts with its prior utterances. The italic text highlights the logical contradiction.
Previous automatic evaluation metrics generally focus on the quality of the dialogue generation: context coherence and fluency. Word-overlap metrics (Papineni et al., 2002; Banerjee and Lavie, 2005; Lin, 2004) or ad-hoc classifiers (Tao et al., 2018; Ghazarian et al., 2019) are designed for measuring this quality. In open-domain dialogue, however, the relation between two utterances is more critical, as shown in the first example of Table 1. Compared with the previous two approaches, a language model trained on an enormous amount of text can naturally capture coherence among both words and utterances. On the other hand, a good evaluation metric should measure not only the quality of generation but also its diversity, which is especially important for open-ended tasks like dialogue or story generation (Hashimoto et al., 2019). Some n-gram based metrics have been utilized to measure diversity (e.g., Mou et al., 2016). However, such metrics might be improper for diversity evaluation, since the utterances generated for the various queries provided by the benchmark are generally diverse. In our experiments, we observe consistently high diversity in terms of both human ratings and n-gram based entropy when evaluating the generated responses directly. In addition to the three aforementioned metrics, logical self-consistency is also a key aspect of dialogue models (Zhang et al., 2018). A dialogue example with a logical contradiction is displayed in the second example of Table 1.
In this work, we propose holistic metrics that evaluate distinctive aspects of generated dialogues. Specifically, we consider (1) context coherence of a dialogue: the meaningfulness of a response within the context of the prior query, (2) language fluency of generated responses: the quality of phrasing relative to a human native speaker, (3) response diversity of a set of generated sentences: the variety in meaning and word choice of responses, and (4) logical self-consistency: the logical consistency of utterances from a dialogue agent. Both context coherence and response fluency (the quality metrics) can naturally be captured by metrics based on strong language models like GPT-2 (Radford et al., 2019). Therefore, we propose to adopt and fine-tune GPT-2 as the basis of our quality metrics. With regard to response diversity and logical self-consistency, we propose to measure them under augmented utterances with controlled paraphrasing. We leverage two effective approaches to generate augmented utterances: word substitution and a text generator with a k-best decoder. Moreover, we utilize n-gram based entropy to capture response diversity and an entailment-based approach to capture logical self-consistency. Our experiments show that the proposed metrics strongly correlate with human judgments. Moreover, our augmented datasets allow for more accurate and straightforward human annotation, significantly improving the agreement among human raters. We release the code and relevant materials as an open-source contribution to pave the way towards further research.

Prior Art
Heuristic-based metrics have been shown to align well with human judgments and are widely applied in various language generation tasks. For machine translation, BLEU (Papineni et al., 2002) computes n-gram precision, whereas METEOR (Banerjee and Lavie, 2005) takes into account both precision and recall. For summarization, ROUGE (Lin, 2004) also considers both precision and recall by calculating the F-measure. These n-gram based metrics are well-suited for generation tasks that are more source-determined or have low conditional entropy, such as translation, image captioning, and summarization. Some dialogue studies adopted these metrics to evaluate the quality of generated conversation responses (Ritter et al., 2011; Su et al., 2018; Sordoni et al., 2015). They are nevertheless not suitable for open-ended generation or high conditional entropy tasks like dialogue generation, where a diverse range of generations is acceptable conditional on a query. Indeed, Liu et al. (2016) conduct extensive empirical studies on these metrics (e.g., BLEU, METEOR, and ROUGE) to test their effectiveness in evaluating dialogue generation and find only a limited relation between these automatic metrics and human judgments.
The word-overlap metrics (e.g., BLEU) fail to capture the semantic similarity between model and reference responses. Subsequent works leverage the distributed representations learned in neural network models to capture semantic similarity among the context, the model response, and the reference response. One such approach collects a dataset of human scores and trains a hierarchical recurrent neural network (RNN) to predict human-like scores for input responses given the context, resulting in an automatic metric that has a medium-level correlation with human judgments. Obtaining this metric, however, requires a large dataset of human-annotated scores, which renders the approach less flexible and extensible. Tao et al. (2018) propose a referenced metric and unreferenced metric blended evaluation routine (RUBER) for open-domain dialogue systems. This blended metric is a combination of two metrics. The referenced metric measures the similarity between model-generated and reference responses on the basis of word embeddings. The unreferenced metric captures the relevance between the query and the response. It is obtained by training a neural network classifier to determine whether a response is appropriate. The positive examples are the references, while the negative examples are reference responses randomly chosen from the dataset, which avoids the need for human-annotated data. After training, the Softmax score is utilized to measure whether the generated response is coherent with the query. Attempting to improve RUBER, Ghazarian et al. (2019) explore the use of contextualized embeddings from BERT. The BERT-based unreferenced metric improves over the word-embedding-based RUBER unreferenced metric. Interestingly, they show that the combined metric has a lower correlation with human judgments than the unreferenced metric alone. Although this finding is counterintuitive, it is consistent with the characteristic of open-domain dialogue that a range of diverse responses is reasonable given a query. Hence a response can be acceptable to human annotators even if it does not align well with the reference in terms of either word overlap or semantic embedding.
Context Coherence. One key component of dialogue response is its coherence to the query as explored in Tao et al. (2018) and Ghazvininejad et al. (2018). Prior work measures the coherence based on the Softmax score of a trained binary classifier. Here we explore an alternative approach based on language modeling (Bengio et al., 2003). A language model can naturally capture the coherence of the response to the query without resorting to an ad-hoc classifier.
Language Fluency. Besides coherence, a good response should be fluent. Fluency is often measured by a language model (Holtzman et al., 2018;Xu et al., 2018). We define the response fluency score as negative perplexity of generated responses.
Response Diversity. In addition to quality metrics, response diversity is also critical, especially for high conditional entropy tasks like dialogue or story generation (Hashimoto et al., 2019). Some n-gram based metrics have been utilized to measure diversity. Mou et al. (2016) and others compute unigram entropy across all generated utterances to measure diversity. This metric might be improper for diversity evaluation, since the utterances generated for various queries are generally diverse. In our experiments, we observe consistently high diversity in terms of both human ratings and n-gram based entropy. From another perspective, the entropy computed across all generated responses essentially measures the marginal entropy of the responses, while our actual interest is in the entropy of the responses conditional on the queries.
Logical Self-Consistency. As with diversity evaluation, current benchmarks are not suitable for evaluating logical self-consistency. The current dataset is well-formed, which leads the system to generate simple and nonredundant responses; unfortunately, logical contradictions still occur, as shown in Table 1. The natural language inference (NLI) task (Williams et al., 2018), which aims to check whether a sentence is entailed or contradicted by a previous sentence, is highly related to the evaluation of logic in open-domain dialogues.

Context Coherence
Language models, which predict the next token given previous tokens, naturally capture the coherence between sentences and particularly the dialogue query and response in our case. GPT-2 (Radford et al., 2019) is a large-scale pre-trained language model based on the transformer architecture (Vaswani et al., 2017). It is trained on a vast amount of diverse data and demonstrates impressive text generation capabilities. In order to better capture the dependence between the queries and responses, GPT-2 can be fine-tuned using the next sentence prediction task on the dialogue dataset of interest.
Suppose a query $q$ contains tokens $\{q_t : t = 1, \dots, T_q\}$ and a response $r$ has tokens $\{r_t : t = 1, \dots, T_r\}$. Let $P$ denote the fine-tuned GPT-2. The context coherence is then defined as the log-likelihood of the response conditional on the query, normalized by the length of the response:
$$c_{\text{raw}}(r \mid q) = \frac{1}{T_r} \sum_{t=1}^{T_r} \log P(r_t \mid r_{<t}, q). \tag{1}$$
Note that $c_{\text{raw}}(r \mid q)$ is a negative number and unbounded from below. A single value is therefore hard to interpret in absolute terms and can only be interpreted relative to other values. The unboundedness also renders it prone to extreme values. Hence, a normalized score is utilized instead. Since the score distribution varies as a function of the dataset, the lower bound is defined as the 5th percentile, denoted $c_{5\text{th}}$, instead of some arbitrary value. The normalized score is then
$$c(r \mid q) = -\frac{\max\bigl(c_{5\text{th}},\, c_{\text{raw}}(r \mid q)\bigr) - c_{5\text{th}}}{c_{5\text{th}}},$$
which ranges from 0 to 1.
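As a rough illustration of the coherence score described above, the following Python sketch computes the length-normalized log-likelihood of a response and then clips and rescales it using a 5th-percentile lower bound. It assumes the per-token log-probabilities have already been obtained from a language model; the function names and the linear-interpolation rule for the percentile are illustrative assumptions and are not taken from the released code.

```python
import math


def raw_coherence(token_logprobs):
    """Length-normalized log-likelihood of a response.

    token_logprobs holds log P(r_t | r_<t, q) for each response token,
    assumed to come from a (fine-tuned) language model.
    """
    if not token_logprobs:
        raise ValueError("response must contain at least one token")
    return sum(token_logprobs) / len(token_logprobs)


def percentile(values, pct):
    """Percentile with linear interpolation (an assumed convention)."""
    ordered = sorted(values)
    if len(ordered) == 1:
        return ordered[0]
    rank = (pct / 100.0) * (len(ordered) - 1)
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def normalize_score(raw, lower_bound):
    """Clip raw at lower_bound, then map [lower_bound, 0] onto [0, 1]."""
    if lower_bound >= 0:
        raise ValueError("lower bound must be negative")
    return -(max(lower_bound, raw) - lower_bound) / lower_bound


# Toy raw scores for a small "dataset" of responses (made-up numbers).
raw_scores = [raw_coherence(lp) for lp in (
    [-1.0, -3.0], [-0.5, -0.5], [-6.0, -8.0], [-2.0, -2.0],
)]
c_5th = percentile(raw_scores, 5)
normalized = [normalize_score(s, c_5th) for s in raw_scores]
```

In practice the lower bound would be estimated once over the whole evaluation set, so that scores from different systems share the same scale.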

Response Fluency
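To capture the fluency of responses, we also adopt the pretrained language model GPT-2. In particular, the raw response fluency score, $f_{\text{raw}}(r)$, is defined as the length-normalized log-likelihood of the response alone, without conditioning on the query:
$$f_{\text{raw}}(r) = \frac{1}{T_r} \sum_{t=1}^{T_r} \log P(r_t \mid r_{<t}).$$
Similar to context coherence, a normalized version, $f(r)$, of $f_{\text{raw}}(r)$ is employed.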

Response Diversity
Prior work (e.g., Mou et al., 2016) measured diversity by computing the n-gram entropy across all generated responses, which essentially reflects the marginal entropy of the responses. For dialogue models, however, the diversity of the responses conditional on the query (e.g., the conditional entropy) is of greater interest. On the other hand, if we measure diversity based on responses randomly sampled from a model conditional on a single query, the response quality is generally low (Caccia et al., 2018). The current work instead proposes to measure response diversity using augmented datasets with controlled paraphrasing, which allows for measuring diversity among top-ranked responses conditional on paraphrased queries and hence avoids the trade-off or dependency between diversity and quality. In other words, for a given query, we slightly tilt the corresponding element in the query-response joint space along the query dimension (achieved by paraphrasing-augmentation) and then measure the entropy of high-quality responses in the neighbourhood of the targeted query.
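The entropy measurement described above can be sketched as follows. This is a minimal illustration, assuming lowercased whitespace tokenization and the natural logarithm; the function name `ngram_entropy` and the toy responses are hypothetical and do not come from the paper's code.

```python
import math
from collections import Counter


def ngram_entropy(responses, n=1):
    """Shannon entropy (in nats) of the n-gram distribution pooled
    over a group of responses, e.g. the responses to one query and
    its paraphrases."""
    counts = Counter()
    for response in responses:
        tokens = response.lower().split()  # assumed tokenization
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values())


# Responses to one query and its paraphrases (toy data).
repetitive = ["i do not know", "i do not know", "i do not know"]
varied = ["i do not know", "maybe next week", "ask my manager about it"]

low = ngram_entropy(repetitive, n=1)
high = ngram_entropy(varied, n=1)
```

A model that returns the same reply to every paraphrase receives a lower entropy than one that varies its wording, which is the behaviour the diversity metric is meant to reward.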
While augmenting the queries to measure the conditional entropy of responses, we need to control the diversity of the augmented queries such that they stay in the vicinity of the targeted query. Hence the goal of controlled augmentation is to minimize diversity in both meaning and word use while avoiding feeding the dialogue model identical inputs. To this end, two augmentation approaches are considered: (1) WordNet (Miller, 1998) Substitution (WS) and (2) Conditional Text Generator (CTG).
WordNet Substitution (WS) is a word-level manipulation method that replaces some words with synonyms defined in WordNet. In contrast to WS, the Conditional Text Generator (CTG) is used to augment queries in multi-turn dialogue. It requires a generator to produce augmented queries conditioned on the context, which is defined as the utterance history prior to the selected query. For instance, suppose $[u_1; \dots; u_{t-1}]$ denotes the utterance history and $u_t$ indicates the query to be augmented; the top-K beams produced by the generator conditioned on this history then serve as the augmented versions of $u_t$.

Logical Self-Consistency
Logical self-consistency measures whether a generated response logically contradicts what the agent uttered earlier in the multi-turn history. The basic idea is to apply a model pretrained on Multi-Genre Natural Language Inference (MNLI; Williams et al., 2018) to label whether the relation between the response and the utterance history of the same agent is logically consistent. More specifically, we train a ternary classifier on the MNLI dataset that takes two utterances as input and predicts their relation as contradiction, entailment, or neutral. We then average the contradiction class probabilities between the current utterance and each prior utterance from this agent to obtain the contradiction score. To match the direction of the human ratings, we use 1 minus the contradiction score as the final logical self-consistency score. Moreover, we measure logical self-consistency under augmented datasets with controlled paraphrasing, using WS and CTG introduced in Section 3.3. The main idea is to generate an augmented multi-turn utterance history that is more likely to induce the dialogue system to produce contradictory responses. We assume that an agent is more likely to produce self-contradictory responses when responding to similar queries. We use WS and CTG to paraphrase the query and then calculate the contradiction score between the current utterance and each prior utterance from this agent.
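The averaging step described above can be sketched as follows. The NLI classifier is replaced here by a deliberately crude, hypothetical stand-in (`toy_contradiction_prob`) so that the example is self-contained; in practice it would be a classifier trained on MNLI. Treating the prior utterance as the premise, and returning a score of 1.0 when the agent has no prior utterances, are assumptions of this sketch.

```python
def toy_contradiction_prob(premise, hypothesis):
    """Hypothetical stand-in for an NLI model's contradiction
    probability: flags pairs where exactly one side contains 'not'."""
    neg_p = "not" in premise.lower().split()
    neg_h = "not" in hypothesis.lower().split()
    return 0.9 if neg_p != neg_h else 0.1


def self_consistency(current, prior_utterances, contradiction_prob):
    """1 minus the mean contradiction probability between the current
    utterance and each prior utterance of the same agent."""
    if not prior_utterances:
        return 1.0  # assumed convention: nothing to contradict
    probs = [contradiction_prob(prev, current) for prev in prior_utterances]
    return 1.0 - sum(probs) / len(probs)


history = ["I like tea", "I am not sure about coffee"]
score = self_consistency("I do not like tea", history, toy_contradiction_prob)
```

Passing the classifier in as a function keeps the scoring logic independent of the particular NLI model used.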

Dataset
To facilitate comparison with prior work (Ghazarian et al., 2019), the DailyDialog dataset (Li et al., 2017) is adopted for the empirical analysis of our proposed metrics. This dataset contains 13,118 high-quality multi-turn dialogues. The data are split into 42,000 / 3,700 / 3,900 train-test-validation partitions.

Response Generation
A sequence-to-sequence (seq2seq) model with attention (Bahdanau et al., 2014) was trained on the train and validation partitions to generate dialogue responses. The implementation in OpenNMT (Klein et al., 2017) was used to train the model. The seq2seq model consists of a 2-layer LSTM with 500 hidden units in both the encoder and the decoder. The model was trained with SGD and a learning rate of 1. To obtain responses covering a wide spectrum of quality and diversity, we generate responses with top-k sampling, where k ∈ {1, 10, 100}.
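For readers unfamiliar with top-k sampling, the following minimal sketch shows a single decoding step: the candidates are restricted to the k highest-scoring tokens, their probabilities are renormalized, and one token is sampled. The function name and the toy logits are illustrative assumptions; this is not the OpenNMT implementation.

```python
import math
import random


def top_k_sample(logits, k, rng=random):
    """Sample one token index from the k highest-scoring candidates."""
    if k < 1:
        raise ValueError("k must be at least 1")
    # Indices of the k largest logits, best first.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Softmax over the retained logits (shifted for numerical stability).
    max_logit = logits[top[0]]
    weights = [math.exp(logits[i] - max_logit) for i in top]
    return rng.choices(top, weights=weights, k=1)[0]


logits = [0.1, 2.0, -1.0, 1.5]
rng = random.Random(0)
greedy = top_k_sample(logits, k=1, rng=rng)  # k = 1 reduces to greedy decoding
sampled = [top_k_sample(logits, k=2, rng=rng) for _ in range(20)]
```

With k = 1 the step is deterministic, while larger k admits lower-probability tokens and therefore yields more varied, and typically lower-quality, responses.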

Language Model Fine-tuning
The base GPT-2 model with 12 layers was used to compute our metrics (see footnote 2). The GPT-2 model was fine-tuned on the training and validation data. During fine-tuning, each query and its response were concatenated into a single sequence and fed into GPT-2. The perplexity of the fine-tuned language model on the test dataset was 16.5.

Controlled Query Generation
WordNet substitution and conditional text generators were used to produce diversity-controlled augmented queries. The Stanford part-of-speech (POS) tagger (Toutanova and Manning, 2000) and WordNet (Miller, 1998) were utilized for WordNet substitution. The tokens in a query are first tagged with the Stanford POS tagger. Then four augmented inputs are generated by substituting verbs, nouns, adjectives and adverbs, or all of the above with synonyms in WordNet. As for the conditional text generator, we trained an OpenNMT Transformer on the training and validation splits for query augmentation, which was then applied to the test set to augment each query with the top-K beams. For response diversity, five variants are obtained: the original query and four paraphrased ones. For logical self-consistency, two variants are obtained: the original query and one paraphrase.
Footnote 2: We also experimented with the medium GPT-2 with 24 layers and found that the results were generally the same. Larger models (the 36- and 48-layer GPT-2) might pose computational difficulty for some researchers and thus were not considered.
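The WordNet substitution step described in this section can be sketched as follows. To keep the example self-contained, the POS tagger and WordNet are replaced by pre-tagged tokens and a small hypothetical synonym table (`TOY_SYNONYMS`); the tag names and all helper names are assumptions made for illustration only.

```python
# Hypothetical stand-in for WordNet synonym lookups, keyed by (word, POS).
TOY_SYNONYMS = {
    ("buy", "VERB"): "purchase",
    ("car", "NOUN"): "automobile",
    ("new", "ADJ"): "fresh",
    ("quickly", "ADV"): "rapidly",
}

# The four substitution settings: verbs, nouns, adjectives and adverbs, all.
VARIANTS = {
    "verbs": {"VERB"},
    "nouns": {"NOUN"},
    "adjectives_adverbs": {"ADJ", "ADV"},
    "all": {"VERB", "NOUN", "ADJ", "ADV"},
}


def substitute(tagged_tokens, target_tags, synonyms=TOY_SYNONYMS):
    """Replace words whose POS tag is targeted and that have a synonym."""
    out = []
    for word, tag in tagged_tokens:
        if tag in target_tags:
            out.append(synonyms.get((word.lower(), tag), word))
        else:
            out.append(word)
    return " ".join(out)


def augment_query(tagged_tokens):
    """Return the four augmented versions of one tagged query."""
    return {name: substitute(tagged_tokens, tags)
            for name, tags in VARIANTS.items()}


tagged = [("I", "PRON"), ("buy", "VERB"), ("a", "DET"),
          ("new", "ADJ"), ("car", "NOUN")]
augmented = augment_query(tagged)
```

Together with the original query, the four outputs form the five query variants used for the diversity evaluation.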

Metric Evaluation
To assess the validity of our proposed metrics, we use Amazon Mechanical Turk to collect high-quality human ratings from 10 subjects. For each metric, we select a set of samples to be presented to humans, and each datapoint is rated from 1 to 5 on that metric, with 1 being the worst and 5 being the best. For context coherence and response fluency, we select 200 datapoints each, covering a diverse range of generation quality: 200 query-response pairs are rated for context coherence and 200 responses are rated for response fluency. For response diversity, we select 100 datapoints, totaling 500 responses, to be rated in groups of 5, all of which are conditioned on the controlled inputs generated by CTG or WS given the same context. For logical self-consistency, 100 datapoints are selected independently of those for response diversity. After the Amazon Mechanical Turk results are collected, we compute the Pearson and Spearman correlations between our automatic metrics and the human ratings to assess the validity of our metrics. We normalize the human rating scores to be in the range of 0 to 1.

Context Coherence
Table 3 demonstrates the Pearson and Spearman correlations between the proposed context coherence metric and human judgments. The results were also compared to the previous best-performing automatic metric, RUBER with BERT embeddings (Ghazarian et al., 2019). Clearly, both versions of our language-model-based coherence metric show higher correlation with human judgments than the classifier-based metric, RUBER. In addition, we compared the proposed metric with a similar metric based on a GPT-2 language model without fine-tuning on the target dataset. The fine-tuned version improved the results, indicating that fine-tuning on the dialogue dataset enables the language model to better capture the dependency between the queries and replies. Interestingly, even the metric based on the language model without fine-tuning correlated more strongly with human ratings than RUBER.
Figure 1: Correlation between context coherence metric c(r|q) and human ratings without and with fine-tuning of GPT-2. Note that random jitter sampled from $\mathcal{N}(0, 0.05^2)$ is added to human ratings in the scatter plots shown in this paper to separate overlapping points.
Table 3: Correlation between RUBER+BERT and context coherence metric c(r|q) with human ratings (without and with fine-tuning of GPT-2).
We also examined the inter-rater reliability. It is computed by holding out the ratings of one rater at a time, calculating their correlation with the average of the other raters' judgments, and finally averaging over, or taking the maximum of, all held-out correlation scores. The inter-rater reliability results also support the strong performance of our proposed context coherence metric, in that the correlation between the automatic metric and human evaluation was close to the inter-rater correlations.
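The held-out inter-rater procedure described above, together with the two correlation coefficients used throughout the evaluation, can be sketched in plain Python. The function names and the toy ratings are illustrative assumptions; an established statistics library would normally be used instead of these hand-written helpers.

```python
import math


def pearson(x, y):
    """Pearson correlation (undefined if either input is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def held_out_reliability(ratings, corr=pearson):
    """ratings[r][i] is rater r's score for item i.

    Each rater is held out in turn and correlated with the mean of the
    remaining raters; returns (mean, max) of those correlations."""
    scores = []
    for r, held in enumerate(ratings):
        others = [row for k, row in enumerate(ratings) if k != r]
        avg = [sum(col) / len(col) for col in zip(*others)]
        scores.append(corr(held, avg))
    return sum(scores) / len(scores), max(scores)


toy_ratings = [[1, 2, 3, 5], [1, 3, 3, 4], [2, 2, 4, 5]]
mean_r, max_r = held_out_reliability(toy_ratings)
```

The same `held_out_reliability` function can be called with `corr=spearman` to obtain the rank-based variant.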
In addition, Figure 1 details the effect of fine-tuning GPT-2: fine-tuning helps to improve the consistency between the human ratings and the automatic metric. Table 2 displays a case study. Our coherence metric and the human evaluation agreed that the generated response is not coherent with the given query, while RUBER indicated that the reply is coherent. This might be because RUBER simply compares the embeddings of the query and the response: business-travel-related words in the query (such as vacation and workweek) and in the reply (such as travel and company) lead RUBER to judge the two as similar.
Table 4: Correlation between response fluency metric f(r) and human ratings without and with fine-tuning of GPT-2. Pairwise mean and max correlations of human ratings.

Response Fluency
Our findings show that the proposed fluency metric f(r) is highly correlated with human judgments. Table 4 summarizes the relation between our proposed fluency metric and human ratings in terms of Pearson and Spearman correlation. The importance of fine-tuning GPT-2 (as outlined in Section 4.3) is evident: the Pearson correlation increases from 0.43 to 0.82 and the Spearman correlation from 0.32 to 0.81. In addition, Figure 2 details the effect of fine-tuning. Notably, a correction of outliers occurs.

Response Diversity
Table 5 shows the evaluation of the proposed diversity metric on the basis of the augmented datasets with WS and CTG. We also include a baseline dataset, which consists of responses to randomly chosen queries from the testing data. Unigram, bigram, and trigram entropy are used to calculate the diversity of the responses and are compared to human ratings with Pearson and Spearman correlation. Automatic evaluations with the controlled paraphrasing datasets consistently achieve higher correlation than those with the baseline dataset. Figure 3 displays the correlations between normalized human ratings and the corresponding n-gram entropy based on the augmented dataset. Entropy values based on the WS and CTG datasets demonstrate stronger relations with human ratings than those based on the baseline dataset, consistent with the reported correlations. Table 6 displays inter-rater Pearson and Spearman correlations and the variance in human ratings.
Human ratings based on the paraphrasing-augmented datasets show high inter-rater correlations and lower variance, indicating that raters generally agree with each other. The poor baseline performance is likely due to the uncontrolled nature of the input sentences: the outputs of the evaluated models are generally diverse, making it difficult for humans to judge the diversity performance of the model. Furthermore, our diversity metrics have correlations with human ratings close to the corresponding mean inter-rater correlations, suggesting that the diversity evaluation based on the paraphrasing-augmented data can reveal the diversity of a dialogue system in a manner consistent with humans.
Table 6: Comparison of response diversity between the baseline dataset and our paraphrasing-augmented datasets (WS and CTG datasets) using inter-rater Spearman and Pearson correlations.

Logical Self-Consistency
For logical self-consistency, the metric computed on the paraphrasing-augmented data shows a stronger relation with human ratings than that based on the baseline. In particular, the metric based on CTG augmentation aligns most closely with human judgments. Inter-rater Pearson and Spearman correlations are reported in Table 9. Human ratings on the augmented data are more consistent than those on the baseline, indicating the necessity and efficiency of using a refined dataset instead of the original one. We show a case study in Table 7.

Relation between the Four Metrics
Although the four proposed metrics are intuitively and theoretically important in evaluating a dialogue system, it is not entirely clear whether they are independent of each other, such that it is necessary to measure all of them. We empirically investigate their association. We randomly choose 50 dialogues from the testing dataset and construct the evaluation data for the four metrics. Five human evaluators rate each dialogue on the four aspects. We then examine the pairwise correlations of the human ratings on the four metrics. Response fluency correlates with context coherence (r = 0.42, p = 0.003). This is mainly because inarticulate responses are often considered incoherent with the context. All other pairwise correlations are non-significant (all r < 0.1, all p > 0.25). Thus, the four metrics are relatively independent of each other, and it is critical to take all of them into account to obtain a holistic evaluation of a dialogue model.

Table 7: Case study of logical self-consistency. Generated Utterance is generated by CTG. Blue italic words highlight the logical contradiction. Our automatic score is low, indicating that the logical contradiction can be detected.
Context of Conversation:
  Speaker A: Are you more of a leader or a follower?
  Speaker B: I don't try to lead people. I'd rather cooperate with everybody, and get the job done by working together.
Generated Utterance:
  Speaker A: Are you more of a follower or a leader?
Model Response:
  Speaker B: I like to keep to myself. I'm a person who does not want to be a follower.
Our Score: 0.09
Human Score: 0.20

Table 9: Comparison of logical self-consistency metric between the paraphrasing-augmented data (WS and CTG data) and the baseline data without augmentation using inter-rater Spearman and Pearson correlations.

Conclusion
This paper provides a holistic and automatic evaluation method for open-domain dialogue models. In contrast to prior art, our means of evaluation captures not only the quality of generation but also the diversity and logical consistency of responses. We adopt GPT-2 as a strong language model to evaluate context coherence and response fluency. For response diversity and logical self-consistency, we propose to measure these two aspects under augmented utterances with controlled paraphrasing. We leverage two effective approaches to generate augmented utterances: word substitution and a text generator with a k-best decoder. Moreover, we utilize n-gram based entropy to capture response diversity and an entailment-based approach to measure logical self-consistency. The proposed metrics show a strong correlation with human judgments. We hope that the proposed holistic metrics may pave the way towards the comparability of open-domain dialogue models.