GRADE: Automatic Graph-Enhanced Coherence Metric for Evaluating Open-Domain Dialogue Systems

Automatically evaluating dialogue coherence is a challenging but highly demanded capability for developing high-quality open-domain dialogue systems. However, current evaluation metrics consider only surface features or utterance-level semantics, without explicitly modeling the fine-grained topic transition dynamics of dialogue flows. Here, we posit that the graph structure formed by the topics in a dialogue can accurately depict the underlying communication logic, which is a more natural basis for persuasive metrics. Capitalizing on the topic-level dialogue graph, we propose a new evaluation metric GRADE, which stands for Graph-enhanced Representations for Automatic Dialogue Evaluation. Specifically, GRADE incorporates both coarse-grained utterance-level contextualized representations and fine-grained topic-level graph representations to evaluate dialogue coherence. The graph representations are obtained by reasoning over topic-level dialogue graphs enhanced with evidence from a commonsense graph, including k-hop neighboring representations and hop-attention weights. Experimental results show that GRADE significantly outperforms other state-of-the-art metrics at measuring diverse dialogue models, in terms of the Pearson and Spearman correlations with human judgements. In addition, we release a new large-scale human evaluation benchmark to facilitate future research on automatic metrics.


Introduction
Coherence, which makes dialogue utterances a unified whole rather than a random group of sentences, is an essential property for any open-domain
[Figure 1 example dialogue] Context: "Why not use the treadmill? Or maybe get a dog?" / "Sometimes my husband goes with me. I like the outdoors." Response: "So, do you enjoy eating too? My love of eating is why I exercise."

Figure 1: An illustrative example of how our GRADE evaluates dialogue coherence by incorporating graph information on topic transitions from a commonsense graph. Topic keywords of the context and the response are highlighted in green and red respectively, and can be aligned to the corresponding nodes in the commonsense graph. The white nodes and all the edges in the commonsense graph are pieces of evidence that assist in constructing the dialogue graph. Taking advantage of such evidence, GRADE can better capture the topic transition dynamics between the context and the response, as shown by the thickness of the edges in the dialogue graph.
dialogue system aiming at conversing with humans. Although open-domain dialogue systems have achieved significant progress and exhibit much more human-like skills in recent years (Zhou et al., 2020; Adiwardana et al., 2020; Roller et al., 2020), automatically measuring dialogue coherence for state-of-the-art open-domain dialogue models remains an open and under-explored research problem, owing to the open-ended nature of dialogue (See et al., 2019).
Statistic-based automatic metrics, such as BLEU (Papineni et al., 2002), mostly rely on the degree of word overlap between a dialogue response and its corresponding gold response. However, because they ignore the underlying semantics of a response, they are biased and correlate poorly with human judgements in terms of response coherence (Liu et al., 2016). To overcome this issue, learning-based metrics such as ADEM (Lowe et al., 2017), RUBER (Tao et al., 2018), and BERT-RUBER (Ghazarian et al., 2019) were proposed to train a coherence scoring model on utterance-level semantics. However, a coherent real-world dialogue should be not only coherent among utterances but also smooth in its topic transitions. As shown in Figure 1, the topics inside a coherent dialogue are close to each other in the commonsense graph, which embodies a smooth topic transition. Although the above metrics demonstrate higher correlations with human judgements than statistic-based metrics, they model dialogue coherence only at the utterance level, without explicitly considering the fine-grained topic transition dynamics of dialogue flows.
To address the above problems, we propose a new automatic metric for open-domain dialogue systems, named Graph-enhanced Representations for Automatic Dialogue Evaluation (GRADE), which explicitly models topic transition dynamics by reasoning over dialogue graphs and incorporates them into utterance-level contextualized representations. As a result, our method captures more accurate semantic transition information and thus measures dialogue coherence in a more human-like manner.
Specifically, our GRADE consists of two semantic extraction branches. One branch deploys BERT (Devlin et al., 2019) to learn the coarse-grained utterance-level contextualized representations, while the other learns the fine-grained topic-level graph representations by constructing topic-level dialogue graphs and applying a graph neural network over them to model the topic transition dynamics. For the dialogue graph construction, we determine nodes and edges by utilizing evidence from the commonsense knowledge graph ConceptNet (Speer et al., 2017), including k-hop neighboring representations and hop-attention weights. GRADE is trained in an unsupervised manner with data automatically generated by a negative sampling strategy that considers both lexical and semantic aspects, rather than the random sampling adopted by previous works (Tao et al., 2018; Ghazarian et al., 2019). Experimental results show that GRADE significantly outperforms other state-of-the-art metrics in terms of the Pearson and Spearman correlations with human judgements, and generalizes well to unseen chit-chat datasets.
Our contributions are summarized as follows:
• We propose GRADE, a novel automatic coherence metric for evaluating open-domain dialogue systems, which is the first attempt to introduce graph reasoning into dialogue evaluation.
• We demonstrate the effectiveness of incorporating graph information into dialogue evaluation. Extensive experiments show that GRADE has significantly stronger correlations with human judgements than other state-of-the-art metrics.
• We construct and release a new large-scale human evaluation benchmark with 11,910 human annotations to the research community to encourage future study on automatic metrics.
The code and data are available at https://github.com/li3cmz/GRADE.

Related Work
Automatic evaluation for open-domain dialogue systems is difficult since there are many appropriate responses for a dialogue context under the open-domain setting, known as the one-to-many problem (Zhao et al., 2017). Initially, statistic-based metrics from language generation tasks were adopted for dialogue evaluation, such as BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005) and ROUGE (Lin, 2004). These metrics use statistical rules to measure the surface similarity between generated responses and reference responses. For example, BLEU computes the geometric average of the n-gram precisions. However, they cannot cope with the one-to-many problem and have weak correlations with human judgements (Liu et al., 2016).
In recent years, learning-based metrics have increasingly attracted interest from researchers. ADEM, proposed by Lowe et al. (2017) and trained with human-annotated data in a supervised manner, achieves higher correlations with human judgements than the statistic-based metrics. However, obtaining large amounts of annotated data is time-consuming and expensive. To reduce this cost, Tao et al. (2018) trained their metric RUBER with auto-constructed negative samples in an unsupervised manner.

Figure 2: The architecture of GRADE consists of two semantic extraction branches. One branch encodes the context-response pair with BERT, while the other constructs a topic-level dialogue graph for the pair by utilizing evidence from ConceptNet and performs reasoning over the constructed graph. The representations from the two branches are concatenated and fed into an MLP to compute the final coherence score. Note that the green and red nodes correspond to the keywords in the context and the response respectively.
With the advance of pre-trained language models, BERT (Devlin et al., 2019) has been adopted for dialogue and NLG evaluation. Ghazarian et al. (2019) proposed BERT-RUBER, which outperforms RUBER significantly by incorporating BERT embeddings. BERTScore (Zhang et al., 2020) performs soft overlap between candidate and reference sentences by using BERT embeddings directly without fine-tuning, and has been shown to correlate robustly with human judgement. Besides, Sellam et al. (2020) introduced BLEURT, which further trains a regular pre-trained BERT with an elaborate pre-training scheme and fine-tunes it on small amounts of rating data, yielding superior results. Note that our model differs from the above learning-based metrics in two respects. First, our metric is trained with high-quality negative samples that are similar to the ground truths in both lexical and semantic aspects, instead of randomly sampled ones. Second, GRADE considers different levels of representations, especially the fine-grained topic-level graph representation.

GRADE Metric
In this paper, we focus on designing an evaluation metric that can automatically assess the coherence of responses produced by dialogue models. Formally, given a dialogue context c = {c_1, ..., c_m} and a response r = {r_1, ..., r_n}, where each c_k is a token in the context and each r_k is a token in the response, our goal is to learn a function f : (c, r) → s that predicts the coherence score s.
As illustrated in Figure 2, our GRADE predicts a coherence score s between a context c and a response r in three steps: (1) producing the utterance-level contextualized representation v_c (Section 3.1); (2) generating the topic-level graph representation v_g (Sections 3.2 and 3.3); (3) predicting the coherence score s based on v_c and v_g (Section 3.4). The training details of GRADE are elaborated in Section 3.5.

Utterance-level Contextualized Encoding
We use BERT (Devlin et al., 2019) to encode the context c and the response r. The pooled output feature of BERT is then taken as the utterance-level contextualized representation v_c:

v_c = BERT(c, r)    (1)

Dialogue Graph Construction
We construct a topic-level dialogue graph based on c and r, denoted as G = (V, E), where V is a set of topic nodes and E is a set of edges between topics. The details are described as follows.
Nodes. To determine the nodes in G, we first apply a rule-based keyword extractor that combines both TF-IDF and Part-Of-Speech features (Tang et al., 2019) to extract the keywords of c and r.
Then the keywords in c form the context-topic nodes of G, denoted as V_c = {t_1, t_2, ..., t_p}, while the keywords in r form the response-topic nodes of G, denoted as V_r = {t_{p+1}, t_{p+2}, ..., t_{p+q}}, where p and q are the numbers of keywords in the context c and the response r respectively. Therefore, V = V_c ∪ V_r. After determining the nodes, we utilize ConceptNet to obtain node representations. Specifically, each topic node t_i is aligned to the corresponding node in ConceptNet and first initialized as

h_i = CN(t_i) ∈ R^d,    (2)

where h_i is the initial representation of the node t_i, CN denotes the ConceptNet Numberbatch embeddings, and d is the dimension of each node representation. Furthermore, in order to better capture the topic relations in reality, h_i is updated with the representations of its k-hop neighbors in ConceptNet, named the k-hop neighboring representations:

h_i ← h_i + Σ_{k=1}^{K} ( W_k · mean_{t_j ∈ N_i^k}(h_j) + b ),    (3)

where K is the maximum number of hops taken into account (set to 2), N_i^k is the set of k-th hop neighboring nodes of t_i in the ConceptNet graph, and W_k and b are the weight matrix and bias vector respectively.

Edges. Since our goal is to predict a coherence score of a response based on a context, we only consider the edges between the context nodes V_c and the response nodes V_r. In other words, edges only exist between each context-topic node V_c^i and each response-topic node V_r^j. Moreover, we consider G as a weighted undirected graph and assign a weight to each edge of G by heuristically using the hop information in the ConceptNet commonsense graph, named hop-attention weights. Specifically, let A be the weighted adjacency matrix of G; then the hop-attention weight of the edge between the nodes t_i and t_j is

A_{ij} = 1 / #hops(V_c^i, V_r^j),    (4)

where #hops(·) is the length of the shortest path between V_c^i and V_r^j over the ConceptNet graph. As a result, the distances between topic nodes are redefined, and nodes that are far away from each other receive low weight values.
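The hop-attention weighting of Equation (4) can be sketched in pure Python. This is a minimal illustration: the toy commonsense graph and the topic keywords below are made up, whereas the actual metric computes shortest paths over ConceptNet.

```python
from collections import deque

def shortest_hops(graph, src, dst):
    """BFS shortest-path length (in hops) between two nodes; None if unreachable."""
    if src == dst:
        return 0
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, d = queue.popleft()
        for nbr in graph.get(node, ()):
            if nbr == dst:
                return d + 1
            if nbr not in seen:
                seen.add(nbr)
                queue.append((nbr, d + 1))
    return None

def hop_attention_weights(graph, context_topics, response_topics):
    """Edge weight between a context topic and a response topic is 1 / #hops
    in the commonsense graph, as in Equation (4); unreachable pairs get 0."""
    weights = {}
    for ti in context_topics:
        for tj in response_topics:
            hops = shortest_hops(graph, ti, tj)
            if hops is None:
                weights[(ti, tj)] = 0.0
            else:
                weights[(ti, tj)] = 1.0 / max(hops, 1)
    return weights

# Toy undirected commonsense graph (adjacency lists listed both ways).
concept_net = {
    "treadmill": ["exercise"], "exercise": ["treadmill", "eating", "outdoors"],
    "eating": ["exercise", "love"], "love": ["eating"],
    "outdoors": ["exercise", "dog"], "dog": ["outdoors"],
}
w = hop_attention_weights(concept_net, ["treadmill", "dog"], ["eating"])
# "treadmill" -> "exercise" -> "eating" is 2 hops, so its edge weight is 0.5
```

Closer topic pairs thus receive larger edge weights, which is exactly what lets the graph branch distinguish smooth from abrupt topic transitions.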
After determining the edges, we randomly deactivate a certain number of edges of G at each training step to prevent over-smoothing, and normalize the adjacency matrix A (Rong et al., 2020):

Ā = D^{-1/2} (A + I) D^{-1/2},    (5)

where Ā is the augmented normalized adjacency matrix, D is the corresponding degree matrix, and I is the identity matrix.
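A minimal numpy sketch of this step, reading Equation (5) as symmetric normalization of A + I, with random edge deactivation in the style of DropEdge (Rong et al., 2020); the drop rate here is an illustrative value, not one taken from the paper:

```python
import numpy as np

def augmented_normalized_adjacency(A):
    """Compute Ā = D^{-1/2} (A + I) D^{-1/2}, with D the degree matrix of A + I."""
    A_hat = A + np.eye(A.shape[0])
    d = A_hat.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)   # degrees are >= 1 thanks to the self-loops
    return A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]

def drop_edge(A, rate, rng):
    """Randomly deactivate a fraction of undirected edges (training time only)."""
    keep = rng.random(A.shape) >= rate
    keep = np.triu(keep, 1)          # sample each undirected edge exactly once
    keep = keep + keep.T             # mirror to keep the matrix symmetric
    return A * keep

A = np.array([[0.0, 0.5],
              [0.5, 0.0]])          # a single weighted edge between two topics
A_norm = augmented_normalized_adjacency(A)
```

Because the self-loops guarantee nonzero degrees, the normalization is always well defined even after edges are dropped.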

Topic-level Graph Reasoning
We explicitly model the topic transition dynamics by reasoning over the constructed topic-level graph G via two steps: aggregation and combination (Hamilton et al., 2017).
In the first step, we apply a graph attention network (GAT) (Veličković et al., 2018) to aggregate the neighboring information of each node t_i. The aggregated representation z_i^{(l)} at layer l for the node t_i is formulated as follows:

e_ij^{(l)} = ρ( (a_l)^T [W_l h_i^{(l-1)} ; W_l h_j^{(l-1)}] ),    (6)

α_ij = exp(e_ij^{(l)}) / Σ_{t_k ∈ N_i} exp(e_ik^{(l)}),    (7)

z_i^{(l)} = Σ_{t_j ∈ N_i} Ā_ij · α_ij · W_l h_j^{(l-1)},    (8)

where h_i^{(0)} = h_i, N_i is the set of neighboring nodes of t_i in the dialogue graph G, W_l ∈ R^{d×d} and a_l ∈ R^{2d} are learnable parameters, α_ij is the attention coefficient, ρ is LeakyReLU, [·;·] denotes concatenation, and ·^T represents transposition. Note that we scale the attention coefficients with the above augmented normalized adjacency matrix Ā, as shown in Equation (8), so that the network pays more attention to the nodes that are closer to t_i in the ConceptNet graph during aggregation.
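A simplified numpy sketch of this scaled aggregation (Equations 6-8): a single attention head, toy dimensions, and randomly initialized parameters are all illustrative assumptions, not the paper's configuration.

```python
import numpy as np

def leaky_relu(x, slope=0.2):
    return np.where(x > 0, x, slope * x)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def gat_aggregate(H, A_norm, W, a):
    """One GAT-style aggregation layer whose attention coefficients are scaled
    by the augmented normalized adjacency A_norm.
    H: (n, d) node features; W: (d, d) weight matrix; a: (2d,) attention vector."""
    n = H.shape[0]
    HW = H @ W
    Z = np.zeros_like(HW)
    for i in range(n):
        nbrs = np.nonzero(A_norm[i])[0]            # neighbors, incl. self-loop
        e = np.array([leaky_relu(a @ np.concatenate([HW[i], HW[j]]))
                      for j in nbrs])               # Eq. (6)
        alpha = softmax(e) * A_norm[i, nbrs]        # Eq. (7), scaled as in Eq. (8)
        Z[i] = (alpha[:, None] * HW[nbrs]).sum(axis=0)
    return Z

rng = np.random.default_rng(0)
H = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 4))
a = rng.normal(size=(8,))
A_norm = np.array([[0.5, 0.5, 0.0],
                   [0.5, 0.5, 0.0],
                   [0.0, 0.0, 1.0]])               # node 2 is isolated (self-loop only)
Z = gat_aggregate(H, A_norm, W, a)
```

An isolated node only attends to its own self-loop, so its aggregated message is just its transformed feature, which makes the scaling behavior easy to verify.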
In the second step, the aggregated representation z_i^{(l)} is combined with the transformed node representation from the previous layer:

h_i^{(l)} = ELU( z_i^{(l)} + V_l h_i^{(l-1)} ),    (9)

where V_l ∈ R^{d×d} is the weight matrix used to transform h_i^{(l-1)}, and ELU is an exponential linear unit (Clevert et al., 2016). Finally, the topic-level graph representation v_g is obtained by:

v_g = FC_0( mean_i( h_i^{(L)} ) ),    (10)

where h_i^{(L)} is the i-th node representation at the last layer, mean represents mean pooling, and FC_0 is a fully-connected layer with an ELU activation.
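The combination and readout steps (Equations 9-10) can be sketched as follows; the additive merge of z with the transformed previous-layer state, and all shapes, follow the reading above rather than a reference implementation:

```python
import numpy as np

def elu(x):
    # exp is applied only to the non-positive part, so this never overflows
    return np.where(x > 0, x, np.exp(np.minimum(x, 0)) - 1)

def combine(Z, H_prev, V):
    """Combination step (Eq. 9): merge aggregated messages with the
    transformed previous-layer node states."""
    return elu(Z + H_prev @ V)

def graph_readout(H_last, W0, b0):
    """Readout (Eq. 10): mean-pool the final node states, then apply a
    fully-connected layer with an ELU activation."""
    return elu(H_last.mean(axis=0) @ W0 + b0)

rng = np.random.default_rng(1)
H_last = combine(rng.normal(size=(3, 4)),   # aggregated messages Z
                 rng.normal(size=(3, 4)),   # previous-layer states H^(l-1)
                 rng.normal(size=(4, 4)))   # weight matrix V_l
v_g = graph_readout(H_last, rng.normal(size=(4, 4)), np.zeros(4))
```

Mean pooling makes v_g independent of the (variable) number of topic nodes, which is what allows a fixed-size scorer downstream.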

Coherence Scoring
To compute the coherence score s, the contextualized representation v_c and the graph representation v_g are concatenated and fed into a multi-layer perceptron (MLP) that transforms the high-dimensional representation into a real number:

s = FC_3( FC_2( FC_1( [v_c ; v_g] ) ) ),    (11)

where FC_1, FC_2 and FC_3 are three different fully-connected layers whose activation functions are ELU, ELU and sigmoid, respectively.
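A minimal sketch of this scoring head; the hidden-layer sizes below are illustrative assumptions, not the paper's dimensions:

```python
import numpy as np

def elu(x):
    return np.where(x > 0, x, np.exp(np.minimum(x, 0)) - 1)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def coherence_score(v_c, v_g, params):
    """Concatenate the two representations and pass them through three
    fully-connected layers (ELU, ELU, sigmoid), yielding a score in (0, 1)."""
    h = np.concatenate([v_c, v_g])
    h = elu(h @ params["W1"] + params["b1"])
    h = elu(h @ params["W2"] + params["b2"])
    return sigmoid(h @ params["w3"] + params["b3"])

rng = np.random.default_rng(2)
params = {
    "W1": rng.normal(size=(8, 8)), "b1": np.zeros(8),
    "W2": rng.normal(size=(8, 4)), "b2": np.zeros(4),
    "w3": rng.normal(size=4),      "b3": 0.0,
}
s = coherence_score(rng.normal(size=4), rng.normal(size=4), params)
```

The final sigmoid bounds the score, which matches the ranking-based training objective described next: only relative order matters, so any monotone squashing works.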

Training
Training Objective. Inspired by Tao et al. (2018), we train our GRADE in an unsupervised manner.
Let (c_i, r_i) be a ground-truth context-response pair and r̃_i a false response for the context c_i, selected via the negative sampling described in the next paragraph. GRADE is trained to predict a higher score for each ground-truth response r_i than for its corresponding false response r̃_i by minimizing the following margin ranking loss:

L = (1/N) Σ_{i=1}^{N} max(0, m - s_i + s̃_i),    (12)

where N is the size of the dataset, m is a margin value set to 0.1, and s_i and s̃_i are the coherence scores of r_i and r̃_i respectively in the i-th example.
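The margin ranking loss can be written out directly in plain Python; in practice the scores would come from the model, the values below are illustrative:

```python
def margin_ranking_loss(pos_scores, neg_scores, margin=0.1):
    """Mean margin ranking loss: each ground-truth response should outscore
    its negative sample by at least `margin` (the paper sets margin = 0.1)."""
    assert len(pos_scores) == len(neg_scores)
    return sum(max(0.0, margin - s + s_neg)
               for s, s_neg in zip(pos_scores, neg_scores)) / len(pos_scores)

# First pair is already separated by more than the margin (zero loss);
# the second pair violates the margin and contributes 0.1 - 0.4 + 0.45 = 0.15.
loss = margin_ranking_loss([0.9, 0.4], [0.2, 0.45], margin=0.1)
```

Note the loss is zero once a pair is separated by the margin, which is exactly the "loose constraint" the paper's conclusion identifies as a limitation relative to absolute human scoring.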
Negative Sampling. Following Sato et al. (2020), we select a false response r̃ that is similar to the ground-truth response r, instead of the random sampling adopted in previous works (Tao et al., 2018; Ghazarian et al., 2019). Overall, we generate negative samples with two methods: lexical sampling and embedding-based sampling. For lexical sampling, we use Lucene to retrieve utterances related to the ground-truth response r from the training set and select the middle one among the retrieved utterances as the false response r̃. For embedding-based sampling, we first randomly sample 1000 utterances and take those with the top-5 cosine similarity to the ground-truth response r. The false response r̃ is then randomly selected from these top-5 utterances.
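The embedding-based sampling step can be sketched as follows; random vectors stand in for the real utterance embeddings here:

```python
import numpy as np

def embedding_negative_sample(gt_emb, cand_embs, rng, top_k=5):
    """Embedding-based negative sampling: rank candidate utterances by cosine
    similarity to the ground-truth response embedding and pick one of the
    top-k at random, so the negative is hard rather than random."""
    gt = gt_emb / np.linalg.norm(gt_emb)
    C = cand_embs / np.linalg.norm(cand_embs, axis=1, keepdims=True)
    sims = C @ gt                      # cosine similarity to the ground truth
    top = np.argsort(-sims)[:top_k]    # indices of the top-k most similar
    return int(rng.choice(top))

rng = np.random.default_rng(3)
cands = rng.normal(size=(1000, 16))    # 1000 randomly sampled utterance embeddings
idx = embedding_negative_sample(rng.normal(size=16), cands, rng)
```

Sampling from the top-k rather than taking the single most similar utterance keeps some diversity in the negatives while still making them hard.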

Experimental Setup
Dialogue Models. We consider both retrieval-based and generation-based dialogue models to obtain diverse responses for metric evaluation, so that the performance of the metrics can be assessed comprehensively. Specifically, we first deploy Transformer-Ranker and Transformer-Generator from the ParlAI platform (Miller et al., 2017), where the former is retrieval-based and the latter is generation-based. Besides, we also deploy two state-of-the-art dialogue models, BERT-Ranker and DialoGPT.

Human Judgements. For each coherence question, workers were provided with a context-response pair and asked to assess the coherence between the context and the response on a scale of 1-5 (not coherent at all to very coherent). Each pair was assessed by 8 to 10 individual workers. In total, there are 1200 different pairs and 11,910 human annotations from 217 unique workers, which serve as the final human judgements. As shown in Figure 3, the distribution of human judgements is balanced from score 1 to 5. Moreover, it demonstrates that the dialogue models we selected are diverse in performance, which helps comprehensively assess the abilities of the metrics.

Experimental Results
DailyDialog Dataset. The test set results on the DailyDialog dataset are presented in Table 1. Overall, our GRADE obtains the highest correlations with human judgements on average. Although the Spearman value of GRADE on the Transformer-Ranker is lower than that of BLEURT, which is trained on a very large-scale dataset, the averaged correlation result of GRADE is 1% higher than BLEURT's. Besides, all the correlation results of GRADE are statistically significant with p-value < 0.05, which makes them more reliable than the baselines.

Other Unseen Datasets. To verify the transferability of GRADE, we further evaluate the human correlations of GRADE and the baselines on two unseen chit-chat datasets, ConvAI2 and EmpatheticDialogues. Results in Table 1 show that GRADE can easily adapt to other unseen datasets without any re-training and obtains more stable and higher correlations with human judgements than the baseline metrics. It is noteworthy that all Pearson and Spearman correlations of GRADE are statistically significant with p-value < 0.05, and most of them with p-value < 0.01. In particular, GRADE achieves a Pearson correlation of 0.606 and a Spearman correlation of 0.617 for evaluating Transformer-Generator on the ConvAI2 dataset, an improvement of 0.411 (Pearson) and 0.417 (Spearman) over BLEURT. Furthermore, Table 2 presents the correlation results of GRADE and the baselines for evaluating two state-of-the-art dialogue models, BERT-Ranker and DialoGPT. Our GRADE significantly outperforms the baseline metrics on human correlations, which shows that GRADE is better at evaluating the coherence of high-quality responses. Besides, Figure 4 illustrates the scatter plots against human judgements for DialoGPT on the ConvAI2 dataset. The scores predicted by GRADE are closer to the human scores than those of the baseline metrics, which intuitively shows the superiority of GRADE.

Figure 4: Score correlations between automatic metrics ((a) ROUGE, (b) BERTScore, (c) BLEURT, (d) GRADE) and human judgements, presented as scatter plots. Each point is associated with a context-response pair where the context is from the ConvAI2 dataset and the response is generated by the DialoGPT model.

Ablation Studies
We perform ablation studies on the main components of GRADE to better analyze their relative contributions. The results are shown in Table 3.

Does the negative sampling strategy work? We first verify the effectiveness of our negative sampling strategy by replacing it with random sampling. As shown in Table 3, adopting the random sampling strategy hurts performance significantly, with a 6.6% drop on average, which indicates the importance of our negative sampling strategy.
Does the graph work? To assess the contribution of our graph components, we perform three ablations: 1) removing the entire graph branch of GRADE; 2) removing the k-hop neighboring representations used for initializing the node representations in the dialogue graph; 3) removing the hop-attention weights used for computing a weight for each edge in the dialogue graph. In all three cases, the performance of GRADE decreases, whether we remove the whole graph branch or one of its components.
How much graph information do we need? Finally, we explore how many k-hop neighboring representations are needed for initializing the dialogue graph's nodes, in two respects: the maximum number of hops (the K in Equation 3) and the number of neighboring nodes in the k-th hop (denoted N_k, i.e., the number of nodes in N_i^k in Equation 3). By comparing the results between the first row and the last three rows in Table 3, we confirm that incorporating both the 1st-hop and the 2nd-hop neighboring nodes brings the best performance. We also observe that considering too much graph information may result in relatively poor performance, as shown in the last row. Therefore, the final version of GRADE adopts the 2-hop neighboring representations with N_1 = 10 and N_2 = 10.

Table 3: Ablation results on the DailyDialog dataset, averaged across five random seeds, with standard deviations shown in gray. N_1 and N_2 refer to the numbers of 1st- and 2nd-hop neighboring nodes in ConceptNet, respectively. The symbol indicates that three or more of the correlation results over the five random seeds are not statistically significant, i.e., p-value > 0.05.

Case Study
To analyze the performance of our GRADE more intuitively, three representative examples are shown in Figure 5. In the first-row example, the score given by our metric is closer to the human score than those of the two baseline metrics. In the second-row example, however, our metric performs poorly. The likely reason is the lack of topics (i.e., keywords) in the model response, as illustrated by the graph that contains only context-topic nodes. As a result, the graph reasoning module in GRADE fails to induce an appropriate graph representation, which harms the coherence scoring. Finally, the example in the last row shows a hard case that both GRADE and the baseline metrics fail to cope with. In this case, the topics of the model response are relevant to the dialogue context, so both GRADE and BERT-RUBER, as learning-based metrics, deem that the response matches the context well. However, the model response is in fact more likely a response to the previous utterance U1 rather than to U2, which is hard for metrics to recognize.

Conclusion and Discussion
In this paper, we proposed GRADE (Graph-enhanced Representations for Automatic Dialogue Evaluation), a novel metric for evaluating the dialogue coherence of open-domain dialogue systems. Empirical results show that GRADE has stronger correlations with human judgements and generalizes to unseen chit-chat datasets. Besides, we release a new large-scale human evaluation benchmark to facilitate future research on automatic metrics. A limitation of GRADE is the inconsistency between the training objective (relative ranking) and the expected behavior (absolute scoring). Specifically, the ranking loss we adopted only requires good responses to be ranked higher than bad ones, which is a relatively loose constraint compared with the absolute scoring that humans perform. Therefore, GRADE may deviate from the human scoring criterion and fail to quantify dialogue responses accurately, and its human correlation results may fluctuate over different runs. Overall, to develop a dialogue metric that quantifies coherence in a more human-like manner, it is critical to reduce the gap between the training objective and the model behavior we truly care about.