Beyond User Self-Reported Likert Scale Ratings: A Comparison Model for Automatic Dialog Evaluation

Open-domain dialog system evaluation is one of the most important challenges in dialog research. Existing automatic evaluation metrics, such as BLEU, are mostly reference-based: they calculate the difference between the generated response and a limited number of available references. Likert-score based self-reported user ratings are widely adopted by social conversational systems, such as Amazon Alexa Prize chatbots. However, self-reported user ratings suffer from bias and variance among different users. To alleviate this problem, we formulate dialog evaluation as a comparison task. We also propose an automatic evaluation model, CMADE (Comparison Model for Automatic Dialog Evaluation), that automatically cleans self-reported user ratings as it trains on them. Specifically, we first use a self-supervised method to learn better dialog feature representations, and then use KNN and Shapley values to remove confusing samples. Our experiments show that CMADE achieves 89.2% accuracy in the dialog comparison task.


Introduction
Open-domain dialog system evaluation is one of the most difficult challenges in the dialog community. Open-domain chatbots have a user-centric goal: to provide humans with an enjoyable user experience. However, user experience is difficult to quantify due to bias and variance among different users. Previous research has optimized automatic dialog evaluation metrics such as BLEU (Papineni et al., 2002), which measures the difference between the generated responses and the reference responses. Due to the contrast between the one-to-many nature of open-domain conversations and the limited number of available references, such metrics correlate poorly with human judgments (Liu et al., 2016; Lowe et al., 2017; Novikova et al., 2017). Designing a fully automatic dialog evaluation metric is still an open research problem.
Currently, both academia and industry (Ram et al., 2018a) rely on human ratings to evaluate open-domain dialogs. Following the ubiquitous application of Likert scores in survey research such as online reviews (Godes and Silva, 2012) and consumer satisfaction (Peterson and Wilson, 1992), a common practice for human evaluation of dialogs is to ask either a third-person rater or the chatbot user to report a Likert score. However, concerns have been raised about the validity of Likert-score based ratings. Kulikov et al. (2018) observe high bias and variance in Likert scores. This issue is more severe in real-world commercial dialog systems such as Alexa social chatbots (Ram et al., 2018a; Venkatesh et al., 2018), because real-world users have neither a monetary incentive nor the annotation training necessary to calibrate their ratings.
To explore the validity of Likert-score based dialog evaluation, we first perform a large-scale data analysis of 3,608 collected real-world human-machine dialogs along with their self-reported Likert scale ratings from the Amazon Alexa Prize Challenge (Ram et al., 2018a; Chen et al., 2018). One noticeable property of the ratings is their J-shaped, skewed distribution: nearly half of the dialogs are rated with the highest Likert score. The prevalence of such extreme rating distributions has long been observed by the business research community in various aspects of real life (Schoenmüller et al., 2018; Godes and Silva, 2012; Hu et al., 2017; Zervas et al., 2015).
Although we could tell which dialog system is better by running a statistical test on a large number of noisy ratings, it is difficult to reliably locate poorly performing dialogs in order to improve dialog system quality. In this paper, we take on the challenge of calibrating a large number of noisy self-reported user ratings to build better dialog evaluation models. We formulate the task as first denoising the self-reported user ratings and then training a model on the cleaned ratings. We design CMADE (Comparison Model for Automatic Dialog Evaluation), a progressive three-stage denoising pipeline. We first perform self-supervised learning to obtain good dialog representations. We then fine-tune CMADE on smoothed self-reported user ratings to improve the dialog representation while preventing the network from overfitting on noisy ratings. Finally, we apply data Shapley to remove noisy training data and fine-tune the model on the cleaned training set. Our experiments show that CMADE successfully identifies noisy training data and achieves 89.2% accuracy and 0.787 Kappa on a test set of unseen expert-rated dialog pairs.

Related Work
Open-domain dialog system evaluation is a long-lasting challenge. It has been shown that previous automatic dialog evaluation metrics correlate poorly with human judgments (Liu et al., 2016; Lowe et al., 2017; Novikova et al., 2017). A well-known reason is that these metrics rely on modeling the distance between the generated response and a limited number of available references. The fundamental gap between the open-ended nature of conversations and the limited references (Gupta et al., 2019) is not addressed in methods that are lexical-level based (Papineni et al., 2002; Lin, 2004; Banerjee and Lavie, 2005), embedding based (Rus and Lintean, 2012; Forgues et al., 2014), or learning based (Tao et al., 2018; Lowe et al., 2017).
Given the aforementioned limitations, Likert-score based rating is the de-facto standard for current dialog research and social conversational systems such as the Amazon Alexa Prize Challenge (Chen et al., 2018). Various evaluation settings have been explored to better measure human judgments. Single-turn pairwise comparison (Vinyals and Le, 2015; Li et al., 2016) is primarily used for comparing two dialog systems: each system predicts a single utterance given the static "gold" context utterance from human-human logs. Although such an A/B test setting is robust to annotator score bias, it cannot capture the multi-turn nature of dialogs. A more complete multi-turn evaluation is typically measured with a Likert scale over the full dialog history, where either a third-person rater or the chatbot user (Pérez-Rosas et al., 2019) reports a Likert score on user experience (Venkatesh et al., 2018), engagement (Bohus and Horvitz, 2009), or appropriateness (Lowe et al., 2017). However, as observed in (Kulikov et al., 2018; Ram et al., 2018a; Venkatesh et al., 2018), Likert scores suffer from bias and variance among different users. Different from these previous empirical observations, we conduct a large-scale quantitative and qualitative data analysis of Likert-score based ratings. To address the issues with Likert scores, the Alexa team proposed a rule-based ensemble of turn-granularity expert ratings (Yi et al., 2019) and automatic metrics such as topical diversity and conversational breadth. ACUTE-EVAL makes a small-scale attempt to use multi-turn pairwise comparison to rank different chatbots. Given the ubiquity and simplicity of Likert-score based evaluation, instead of proposing an alternative measure, we take on the challenge of denoising Likert scores while introducing minimal expert annotations (one order of magnitude fewer). Unlike prior comparison-based work, our proposed expert annotation scheme compares dialogs within the same chatbot.

Dataset
The data used in this study was collected during the 2018 Amazon Alexa Prize Competition (Ram et al., 2018b). Our data contain long and engaging spoken conversations between thousands of real-world Amazon Alexa customers and Gunrock, the 2018 Alexa Prize winning social bot. The chatbot has 11 topic dialog modules, including movies, books, and animals. One notable characteristic of the chatbot is its versatile and complex dialog flows, which interleave facts, opinions, and questions to make the conversation flexible and interesting (Chen et al., 2018). At the end of each dialog, a self-reported Likert scale rating is elicited by the question "On a scale of one to five, how likely would you talk to this social bot again?" We first filter out dialogs with inappropriate content using keyword matching. We then select 3,608 ten-turn dialogs on movies, because movie dialogs are more coherent and diverse than other topics according to both real users and Amazon-selected experts. We observe that dialogs with more than eight turns are more meaningful and semantically versatile, while dialogs of more than ten turns exceed the maximum input length of the BERT model (512 tokens). We therefore select dialogs with exactly ten turns. Our approach could support longer conversations by adopting a memory-efficient self-attention algorithm that handles sequences with thousands of tokens; we leave this to future work. We aim to evaluate user experience for each dialog from the same chatbot and of the same length. This is significantly more challenging than identifying which chatbot provides a better user experience on average, since our problem setup requires capturing more subtle differences in user experience.

Figure 1: Schematic of the CMADE workflow. CMADE contains a three-stage training pipeline to denoise self-reported ratings and train an automatic dialog comparison model: (1) learning representations via self-supervised dialog flow anomaly detection, (2) fine-tuning with smoothed self-reported user ratings, and (3) denoising with data Shapley and further fine-tuning. The gray and blue rectangles in Stage 1 represent system and user utterances; the red rectangle represents the randomly replaced system utterance for dialog flow perturbation. In Stages 2 and 3, each ball represents a dialog in the training data, and the number on each ball represents the dialog's rating.

Table 1: Statistics of self-reported Likert scale ratings. The distribution is heavily skewed and noisy: nearly half of the dialogs are rated with score 5.

J-Shape Skewness
We perform a detailed analysis of the self-reported Likert scale ratings. As shown in Table 1, abnormally, nearly half of the dialogs are rated five, the highest score. A similar skewed distribution was also observed in previous years' Alexa competitions (Fang et al., 2018). In fact, the business research community has long observed the prevalence of extreme review distributions, heavily skewed to the positive end of the rating scale (known as "J-shape"), in online reviews (e.g., Amazon, Airbnb, Yelp) (Godes and Silva, 2012; Hu et al., 2017; Zervas et al., 2015), word of mouth (East et al., 2007), and consumer satisfaction (Peterson and Wilson, 1992; Danaher and Haddrell, 1996).
Comparison to expert ratings We randomly selected 50 dialogs rated score-5 and showed them to an expert; the expert rated 27 of them score-4 or lower. The Alexa team (Venkatesh et al., 2018) has also reported that inter-user agreement in their internal rating analysis is quite low. These phenomena indicate that self-reported Likert scale ratings are extremely noisy; such ratings cannot reliably localize individual bad interactions. In addition, Likert-score based evaluation also suffers from insensitivity. As observed by the Alexa team (Venkatesh et al., 2018) in multiple internal user studies, even when users evaluated multiple dialogs with the same score, they had a clear rank order among those dialogs.
The skewness, noisiness, and insensitivity of self-reported Likert scale ratings make them a suboptimal dialog evaluation metric. In practice, we find that directly training a classifier (even a pre-trained BERT-based model) on the noisy self-reported Likert scale ratings suffers from underfitting. One of the Alexa Prize Challenge teams, Alana (Papaioannou et al., 2017), trains a binary classifier between successful dialogs (human rating 4 or 5) and unsuccessful dialogs (rating 1 or 2) with heavily hand-engineered features. They reach 69.40% accuracy on this binary classification problem, which is far from usable in real-world settings.

Pairwise Comparison Based Evaluation
Selecting the better dialog from two options is easier for a human evaluator than giving an absolute number like a Likert score, which requires the evaluator to maintain a consistent standard. Human perception is inherently relative, and pairwise comparison is local: it does not require the evaluator to be globally consistent. There are many other settings where humans find it easier to perform pairwise comparisons than to provide direct labels (Simpson and Gurevych, 2018; Mailthody et al., 2019; Liang et al., 2018), including content search (Fürnkranz and Hüllermeier, 2010), image retrieval (Wah et al., 2014), and age estimation. We randomly sample 400 dialog pairs for experts to annotate. We ask the question, "If you were the user, in which scenario would you be more likely to come back and talk to the system again?" We guide the experts to focus on the user experience rather than calibrating the performance of any specific module of the dialog system. Two researchers with conversational training experience annotated the data. The leading expert has worked on an Alexa competition team for more than one year with an emphasis on user ratings. For each dialog pair (A, B), they label 'A is better than B', 'B is better than A', or 'cannot tell'. They reached a high inter-annotator agreement (Cohen, 1968) with kappa κ = 0.83. To ensure that the development and test sets are accurate, we discard all "cannot tell" dialog pairs. We then study the correlation between Likert-score based evaluation and pairwise comparison based evaluation.

Correlation Between User Ratings and Expert Ratings
Table 2: Disagreement rate between expert pairwise comparison labels and the delta of self-reported ratings (e.g., 5 − 1 gives ∆ = 4).

Delta of self-reported ratings    ∆=1    ∆=2    ∆=3    ∆=4
Disagreement rate                 0.450  0.383  0.220  0.157

To further analyze the self-reported Likert scale ratings, we also compare the annotated labels of the 403 dialog pairs with the self-reported Likert scale ratings of these dialogs. For each pair of dialogs, we compare the pairwise comparison label with the delta between the two dialogs' self-reported Likert scale ratings. Ideally, the dialog with the higher self-reported rating should be the one annotated as providing the better user experience in the pairwise comparison. We count the number and fraction of disagreements between the two types of ratings. Overall, roughly one third of the dialog pairs disagree. As shown in Table 2, as the gap between the self-reported Likert scale ratings becomes larger, the disagreement between expert and self-reported ratings goes down. This suggests that when the difference between two dialogs' Likert scores is large, the scores are more likely to be consistent with the comparison ratings.
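The disagreement count described above can be sketched as follows. The helper function and the toy pairs are illustrative, not the paper's actual annotations.

```python
# Sketch: disagreement rate between expert pairwise preference labels
# and self-reported rating deltas, restricted to pairs with unequal
# self-reported ratings. Data below is made up for illustration.
def disagreement_rate(pairs):
    """pairs: list of (rating_a, rating_b, expert_prefers_a: bool)."""
    disagree = sum(
        1 for ra, rb, prefers_a in pairs
        if (ra > rb) != prefers_a   # Likert delta contradicts the expert label
    )
    return disagree / len(pairs)

pairs = [(5, 3, True), (4, 5, True), (2, 5, False), (5, 4, False)]
print(disagreement_rate(pairs))  # 2 of 4 pairs disagree -> 0.5
```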

Problem Formulation
Suppose the training set D_train = {(x_i, y_i)}_{i=1}^{N_train}, where x_i is a dialog and y_i is its noisy self-reported user rating. We define a strict partial order ≻, where x_i ≻ x_j means that dialog x_i provides a better user experience than dialog x_j. Note that y_i > y_j does not always imply x_i ≻ x_j, since self-reported user ratings are noisy (§3.3, §3.4). The test set D_test consists of N_test dialog pairs along with their binary pairwise comparison labels z^test_{i,j}, each annotated by experts and indicating whether dialog x_i provides a better user experience than dialog x_j, i.e., z^test_{i,j} = 1(x_i ≻ x_j). The development set D_dev has a similar structure.
Following the structure of the expert-annotated pairs, we formulate our model M(φ, f) as a pairwise dialog predictor with an architecture similar to RankNet (Burges et al., 2005). For a dialog pair (x_i, x_j), the model predicts an un-normalized score o_i = f(φ(x_i)) for each dialog, where φ is a dialog encoder that maps each dialog to a feature space and f is a linear transformation that converts each dialog feature into a real number o. We define a binary relationship ≻̂, where x_i ≻̂ x_j means that the model predicts that dialog x_i provides a better user experience than dialog x_j. We denote the model's prediction of z_{i,j} as ẑ_{i,j}, where ẑ_{i,j} = 1(x_i ≻̂ x_j). We model the predicted posterior as:

P(ẑ_{i,j} = 1) = P(x_i ≻̂ x_j) = 1 / (1 + exp(−(o_i − o_j)))
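As a minimal sketch, the RankNet-style posterior reduces to a sigmoid over the score difference. The function name and example scores here are illustrative.

```python
import math

def pairwise_posterior(o_i, o_j):
    """RankNet-style posterior P(x_i better than x_j) computed from
    un-normalized scores o_i = f(phi(x_i)) and o_j = f(phi(x_j))."""
    return 1.0 / (1.0 + math.exp(-(o_i - o_j)))

# Equal scores express no preference; a higher score wins.
print(pairwise_posterior(0.0, 0.0))   # 0.5
print(pairwise_posterior(2.0, -1.0))  # ~0.95, strong preference for the first dialog
```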

Method
Our goal is to reduce the noise in the self-reported user ratings (§3). Directly training a classification model on the noisy ratings leads to severe underfitting. To this end, we propose a three-stage training pipeline that denoises the self-reported ratings while training an automatic dialog comparison model. Figure 1 describes the overall pipeline:

• In Stage 1, we learn dialog feature representations with a self-supervised dialog flow anomaly detection task.
• In Stage 2, we perform label smoothing to adjust the noisy self-reported ratings in the training set and fine-tune the dialog comparison model on the smoothed ratings.
• In Stage 3, we perform data Shapley (Ghorbani and Zou, 2019; Jia et al., 2019a) on the self-reported user ratings to identify and remove noisy data points. We then fine-tune the dialog comparison model on the cleaned training set.

Stage 1: Learning dialog representations via self-supervised dialog flow anomaly detection

Having a good dialog representation is the first step towards denoising the data. Our primary goal in this stage is to train a dialog encoder φ that learns good dialog feature representations for the following stages. Here φ could be any sequence encoder that can encode a dialog, and we use BERT (Devlin et al., 2019) in this paper.
For each dialog in the training set, we perturb the dialog flow to generate a fake dialog and train the model to differentiate the fake dialog from the real one. Dialog flow is a user-centric measure of whether a conversation is "going smoothly". To perturb the dialog flow of each dialog x_i, we randomly replace a user utterance in x_i with a random user utterance from the training corpus D_train, yielding a perturbed dialog x_{i,fake}. With high probability, the system utterance immediately following the replaced user utterance becomes inappropriate. Therefore, we incorporate {(x_i, x_{i,fake}, z = 1)} into the training pairs. Similarly, we also randomly replace a system utterance to yield another perturbed dialog. We generate two perturbed dialogs for each dialog in the training set, giving 2N_train real-fake dialog pairs in total. An example is shown in Table 3. We note that appropriateness is one of the most widely applied metrics in human evaluation of dialogs (Lowe et al., 2017). By learning to differentiate the perturbed dialog from the original one, we expect CMADE to learn a good dialog encoder φ that maps dialogs with similar dialog flows close to each other in the feature space.
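The user-side perturbation can be sketched as below. The data layout (a list of speaker-tagged turns) and the function name are our own illustrative assumptions, not the paper's implementation.

```python
import random

def perturb_dialog(dialog, corpus_user_utts, rng=random):
    """Replace one randomly chosen user turn with a random user utterance
    from the training corpus, yielding a 'fake' dialog whose flow is broken.
    dialog: list of (speaker, utterance), speaker in {'sys', 'user'}."""
    fake = list(dialog)
    user_idx = [k for k, (spk, _) in enumerate(fake) if spk == 'user']
    k = rng.choice(user_idx)
    fake[k] = ('user', rng.choice(corpus_user_utts))
    return fake  # train on (dialog, fake, z=1): the real dialog should rank higher
```

A system-side perturbation would be analogous, selecting from system turns instead.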

Stage 2: Fine-tuning with smoothed self-reported user ratings
Stage 1 performs only self-supervised learning and does not incorporate any supervision from human ratings. To obtain better dialog feature representations for Stage 3, Stage 2 fine-tunes φ with supervision from the noisy self-reported user ratings. We adopt a simple yet effective label smoothing, inspired by (Szegedy et al., 2016; Nie et al., 2019), using the representation learned in Stage 1. A key assumption in Stage 2 is that dialogs with similar dialog flows provide similar user experiences. For each dialog x_i, we find its K nearest neighbors in the feature space defined by φ and use the average self-reported rating of these K nearest neighbors as a smoothed rating y^s_i for x_i. To construct training dialog pairs, we randomly sample dialog pairs x_i and x_j and derive a pairwise comparison label z^s_{i,j} by comparing the smoothed ratings y^s_i and y^s_j: z^s_{i,j} = 1(y^s_i > y^s_j). We discard pairs with equal y^s_i and y^s_j. To improve the dialog feature representation, we fine-tune the model M(φ, f) on the sampled dialog pairs along with the labels derived from the smoothed scores, {(x_i, x_j, z^s_{i,j})}. We note that z^s_{i,j} depends solely on the noisy self-reported ratings in the training set and does not depend on the expert annotations. Theoretically, we could iterate between label smoothing and model fine-tuning, since the fine-tuned model provides better dialog feature representations. In practice, we find that one iteration is enough to reach good prediction performance.
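A minimal sketch of the KNN label smoothing, assuming plain Euclidean distances over encoder features. Whether a dialog counts itself among its own neighbors is a design choice the paper does not pin down; here it does.

```python
import numpy as np

def smooth_ratings(features, ratings, K):
    """Replace each dialog's noisy self-reported rating with the mean
    rating of its K nearest neighbors (Euclidean distance) in the
    feature space produced by the Stage-1 encoder phi."""
    X = np.asarray(features, dtype=float)
    y = np.asarray(ratings, dtype=float)
    # pairwise distance matrix between all training dialogs
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    smoothed = np.empty_like(y)
    for i in range(len(y)):
        nn = np.argsort(d[i])[:K]      # a point counts itself as a neighbor here
        smoothed[i] = y[nn].mean()
    return smoothed
```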
Label smoothing has led to state-of-the-art models in image classification (Szegedy et al., 2016), language translation (Vaswani et al., 2017) and speech recognition (Chorowski and Jaitly, 2017). Prior attempts in label smoothing (Szegedy et al., 2016;Vaswani et al., 2017;Chorowski and Jaitly, 2017;Müller et al., 2019) focus on categorical labels to prevent the network from becoming overconfident while we apply label smoothing on ordinal labels (i.e., Likert scores) to prevent the network from overfitting on noisy ordinal labels.

Stage 3: Denoising with data Shapley & further fine-tuning
In Stage 2, noisy ratings still affect the smoothed ratings of other data points. In Stage 3, we aim to identify and remove dialogs with noisy self-reported user ratings y_i using the data Shapley value technique (Ghorbani and Zou, 2019; Jia et al., 2019a,b). The Shapley value originally comes from cooperative game theory (Dubey, 1975). In a cooperative game, there are n players D = {1, ..., n} and a utility function v : 2^[n] → ℝ that assigns a reward to each of the 2^n subsets of players: v(S) is the reward if the players in subset S ⊆ D cooperate. The Shapley value defines a unique scheme to distribute the total gains generated by the coalition of all players, v(D), with a set of appealing mathematical properties. It has been applied to problems in various domains, ranging from economics (Gul, 1989) to machine learning (Cohen et al., 2005; Yona et al., 2019). In our setting, given D_train = {(x_i, y_i)}_{i=1}^{N_train}, we view the training points as N_train players and the performance on the development set as the utility function v(S). The Shapley value for player i is defined as the average marginal contribution of (x_i, y_i) to all possible subsets formed by the other players (Jia et al., 2019a):

s_i = (1 / N_train) Σ_{S ⊆ D_train \ {i}} [v(S ∪ {i}) − v(S)] / C(N_train − 1, |S|)

As this definition suggests, computing data Shapley values requires enumerating O(2^{N_train}) possible subsets and training the model M on each subset, which is intractable. Inspired by (Jia et al., 2019a), CMADE tackles this issue by reducing the deep model M to a K-nearest-neighbors (KNN) model and then applying the closed-form solution of the Shapley value for KNN. Using the feature extractor φ trained in Stages 1 and 2, we fix φ and map all dialogs in the training data {x_i}_{i=1}^{N_train} to {φ(x_i)}_{i=1}^{N_train}. We first define the utility function v(S) in the special case where the development set contains only one dialog pair (x^dev_p, x^dev_q, z^dev_{p,q}).
In our setting, the development set contains dialog pairs annotated by experts. Given any non-empty subset S ⊆ D_train, we use a KNN regressor to rate x^dev_p and x^dev_q. To do so, we compute φ(x^dev_p) and sort the dialogs in S by their Euclidean distance to x^dev_p in the dialog feature space, taking the top-K most similar dialogs as the neighbors of x^dev_p (and likewise for x^dev_q). Based on the self-reported user ratings of these neighbors in the training data, the KNN regressor rates x^dev_p as the average rating of its K nearest neighbors:

ŷ^dev_p = (1/K) Σ_{k=1}^{K} y_{α_k(p)},

where α_k(p) is the index of the k-th nearest neighbor of x^dev_p in S. The model predicts ẑ^dev_{p,q} = 1 if ŷ^dev_p > ŷ^dev_q, and vice versa.
To obtain a closed-form solution for the Shapley value, instead of defining the utility function as the accuracy of the pairwise prediction, we define a smoothed utility function based on the KNN regressor's predicted ratings, under which the Shapley value s_{α_m} of the m-th nearest neighbor can be calculated recursively from s_{α_{m+1}} (Theorem 1). With Theorem 1, the Shapley value calculation can be finished in O(N log N) time. The result for a single development pair readily extends to multiple development pairs. In our experiments, with this optimization, the Shapley value calculation takes less than 5 seconds. Theorem 1 comes primarily from (Jia et al., 2019a,b), and we extend their result for the vanilla KNN regressor (Jia et al., 2019a) to our pairwise testing setting.
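For intuition, here is the O(N log N)-style recursion for the vanilla (single-test-point, non-pairwise) KNN classifier utility from Jia et al. (2019a). The paper's actual Theorem 1 extends this style of recursion to its pairwise KNN regression setting; this sketch shows only the base technique.

```python
import numpy as np

def knn_shapley_single(dist_to_test, y_train, y_test, K):
    """Exact Shapley values of training points under the K-NN classifier
    utility v(S) = (1/K) * (# of the test point's top-K neighbors in S
    whose label matches y_test), for a single test point.
    Computed via the recursion of Jia et al. (2019a): sort once, then
    walk from the farthest neighbor to the nearest."""
    N = len(y_train)
    order = np.argsort(dist_to_test)              # alpha_1 ... alpha_N, nearest first
    s = np.zeros(N)
    # base case: the farthest point
    s[order[-1]] = float(y_train[order[-1]] == y_test) / N
    # recursion: s[alpha_m] from s[alpha_{m+1}]
    for m in range(N - 2, -1, -1):
        i, j = order[m], order[m + 1]
        s[i] = s[j] + (float(y_train[i] == y_test)
                       - float(y_train[j] == y_test)) / K * min(K, m + 1) / (m + 1)
    return s
```

A useful sanity check is the efficiency property: the Shapley values sum to v(D_train) − v(∅), i.e., the full-set utility.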
By applying the Shapley technique to the data, we identify noisy training data points that contribute negatively to performance and remove them from the training set. Similar to Stage 2, to construct training dialog pairs, we randomly sample dialog pairs x_i and x_j from the cleaned training set and derive z_{i,j} by comparing the self-reported ratings y_i and y_j. We then further fine-tune the model from Stage 2. Theoretically, we could iterate between Stage 2 and Stage 3 multiple times; in practice, one iteration is enough.

Towards Scalable Pair-based Training
We use a factorization technique similar to the pairwise ranking in LambdaRank (Burges et al., 2006) to speed up training. For Stages 2 and 3, we have O(N²) possible dialog pairs, which leads to quadratically increasing training time. As in LambdaRank, it is possible to calculate the exact gradient over the O(N²) possible dialog pairs with only O(N) forward and backward passes. More specifically, we denote the possible input pairs during training at Stage 2 or Stage 3 as D^pair_train = {(x_i, x_j, z_{i,j})}_{i,j∈I}. The total cost L over the O(N²) possible dialog pairs is the sum of the O(N²) per-pair cross-entropy costs.

Theorem 2 We can compute ∂L/∂w_k in O(N) by factoring it into a weighted sum of ∂o_i/∂w_k, where the weight λ_i ∈ ℝ depends only on {o_j} and {z_{i,j}}. W.l.o.g., we assume z_{i,j} ≡ 1.
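The λ-factorization can be illustrated with a linear scorer o_i = w·x_i (our own simplifying assumption): accumulate a per-dialog weight λ_i over all pairs, then form the weight gradient with a single weighted pass instead of one backward pass per pair. Both routines below compute the same gradient; the point is that the factored form needs only one ∂o_i/∂w per dialog.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lambda_gradient(o, X, pairs):
    """Gradient of the summed pairwise cross-entropy w.r.t. the weights
    of a linear scorer o_i = w . x_i, factored LambdaRank-style.
    pairs: (i, j) with z_{i,j} = 1, i.e., dialog i is the better one."""
    lam = np.zeros_like(o)
    for i, j in pairs:
        d = sigmoid(o[i] - o[j]) - 1.0   # d(-log sigma(o_i - o_j)) / d o_i
        lam[i] += d
        lam[j] -= d
    return X.T @ lam                      # sum_i lambda_i * (d o_i / d w), one pass

def naive_gradient(o, X, pairs):
    """Reference: one 'backward' contribution per pair, O(N^2) passes."""
    g = np.zeros(X.shape[1])
    for i, j in pairs:
        d = sigmoid(o[i] - o[j]) - 1.0
        g += d * (X[i] - X[j])
    return g
```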

Experiment
Model Setup We fine-tune the pre-trained BERT (Devlin et al., 2019) to learn the dialog feature extractor φ. We partition the 403 expert-annotated dialog pairs into a 200-pair development set and a 203-pair test set. We set K = 50 both for the KNN label smoothing in Stage 2 and for the KNN Shapley value calculation in Stage 3.

Model Details
The details of extending BERT to encode multi-turn dialogs are as follows. Each dialog is represented as a sequence of tokens in the following input format: starting with the special token [CLS], we concatenate the tokenized user and system utterances in chronological order, with [SEP] as the separator between adjacent utterances. In other words, we represent each dialog as the sequence [CLS], S_{1,1}, S_{1,2}, ..., [SEP], U_{1,1}, U_{1,2}, ..., [SEP], S_{2,1}, S_{2,2}, ..., [SEP], where S_{i,j} and U_{i,j} are the j-th tokens of the system and user utterances in the i-th turn. Following BERT, we also add a learned embedding to every token indicating whether it comes from a user utterance or a system utterance.
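A sketch of this serialization, with whitespace splitting standing in for WordPiece tokenization; the function name and turn format are illustrative assumptions.

```python
def serialize_dialog(turns):
    """Flatten a multi-turn dialog into BERT-style input tokens:
    [CLS] sys-tokens [SEP] user-tokens [SEP] ... , plus a 0/1 id per
    token marking system vs. user (fed to a learned embedding in the model).
    turns: list of (speaker, utterance), speaker in {'sys', 'user'}."""
    tokens, speaker_ids = ['[CLS]'], [0]
    for spk, utt in turns:
        sid = 0 if spk == 'sys' else 1
        for tok in utt.split():            # stand-in for WordPiece tokenization
            tokens.append(tok)
            speaker_ids.append(sid)
        tokens.append('[SEP]')             # separator after every utterance
        speaker_ids.append(sid)
    return tokens, speaker_ids
```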

Model Comparisons and Ablations
We compare CMADE to several of its ablations (Table 4) and evaluate performance on the test set, which is annotated by experts. We also report the kappa agreement (Cohen, 1968) (kappa κ and standard error SE) between the predicted output and the expert annotations.
(1) BERT-Classification and (2) BERT-Regression fine-tune the pre-trained BERT to perform 5-class classification and regression, respectively, directly on the noisy self-reported ratings. To test BERT-Classification on dialog pairs, we apply the DEX trick (Rothe et al., 2015) to obtain a floating-point predicted rating and thus avoid cases where the model predicts a tie. (3) BERT-Pairwise shares the same model architecture as CMADE. It constructs dialog pairs for training by randomly sampling dialog pairs x_i and x_j and deriving z_{i,j} by comparing the corresponding self-reported user ratings y_i and y_j; pairs with equal y_i and y_j are discarded. (4) BERT-Pairwise+Dev augments (3) by adding the 200 expert-annotated dialog pairs from the development set into the training data. We also compare variants of CMADE that skip one or two of the three stages.
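The DEX trick amounts to taking the expected value of the predicted score distribution, which almost never ties exactly between two dialogs. A minimal sketch; the 5-point score grid mirrors the Likert scale used here.

```python
import numpy as np

def dex_expected_rating(class_probs, scores=(1, 2, 3, 4, 5)):
    """DEX trick (Rothe et al., 2015): turn a 5-way classifier's softmax
    output into a floating-point rating via the expected value over the
    score grid, avoiding tied predictions on dialog pairs."""
    p = np.asarray(class_probs, dtype=float)
    return float(np.dot(p, scores))

print(dex_expected_rating([0.0, 0.1, 0.2, 0.3, 0.4]))  # 4.0
```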

Results
Our first takeaway is that a vanilla classification or regression formulation may not be the best way to frame the problem of learning a dialog evaluation model. As shown in Table 4, the pairwise architecture (BERT-Pairwise, 0.73) is better than classification (BERT-Classification, 0.53) or regression (BERT-Regression, 0.64) on this problem. Similar to our observation, the computer vision community has long found that both vanilla classification and vanilla regression formulations have drawbacks in age estimation (Rothe et al., 2015; Niu et al., 2016). Our second takeaway is that more aggressive denoising algorithms usually make stronger assumptions about the quality of the feature representations. It therefore helps to build a denoising pipeline that starts with better feature representation learning and applies less aggressive denoising to improve the features before running the more aggressive denoising algorithms. As shown in Table 4, our three-stage denoising pipeline CMADE (Acc. 0.892) significantly outperforms all baselines by a large margin. Although (8) Stage 1 alone does not provide high accuracy (Acc. 0.620), the feature representation it learns is extremely important: without Stage 1, both (5) Stage 2 (Acc. 0.755) and (6) Stage 2 + Stage 3 (Acc. 0.763) perform worse.
Since the KNN label smoothing is performed in the feature space, we expect it to perform worse without the self-supervised dialog feature representation learning of Stage 1. However, these variants still work better than baselines (1)-(3), which do not account for the noise in the data. This is because we initialize our dialog encoder φ with the pre-trained BERT, so φ still provides some useful features for Stage 2. In addition, we observe that denoising with data Shapley in Stage 3 requires better dialog feature representations: (7) Stage 3 alone (Acc. 0.714) performs even worse than BERT-Pairwise (0.730), lacking good representations for the Shapley denoising algorithm. Skipping Stage 2 also hurts performance (Acc. 0.788). However, this does not mean that Shapley denoising in Stage 3 is not powerful: we observe a large performance gain when applying Stage 3 after Stages 1 and 2 (Acc. 0.837 vs. 0.892). Finally, we note that adding the expert-annotated development set directly into the training data is much less efficient than using the development set for data Shapley denoising: BERT-Pairwise+Dev reaches an accuracy of only 0.749.

Analysis
Additional analysis We also present an analysis of how Shapley denoising works, shown in Figure 2. We examine the Shapley value of each training datum in Stage 3. We first show an example dialog with a negative Shapley value in Table 5. According to the Shapley values, we remove data points one by one, starting from the least valuable and moving to the most valuable. Each time a point is removed, we create new KNN regressor models with K = 1, 5, 25, 50, 100 on the remaining dialogs and labels and evaluate them on the test set with expert annotations. We extract the features of the remaining dialogs using the dialog encoder φ tuned in Stages 1 and 2. As shown in Figure 2, removing training data with low Shapley values increases performance up to a certain point before convergence, for all choices of K. We observe a similar trend when re-training a model on the remaining data. In contrast, removing data randomly decreases performance on the test set. In addition, larger K yields better performance, which validates the denoising effect of KNN with large K.

Conclusion
The ultimate chatbot evaluation metric should be user-centric, as chatbots exist to provide humans with an enjoyable experience. Likert-score based self-reported rating is the de-facto standard for current dialog evaluation. However, our analysis indicates that self-reported dialog ratings are skewed (J-shaped), noisy, and insensitive due to bias and variance among different users. We propose a three-stage denoising pipeline, CMADE, to denoise self-reported ratings and, at the same time, build an automatic comparison-based dialog quality predictor. CMADE's results highly correlate with expert judgments on pairwise dialog comparison ratings (89.2% agreement, 0.787 Kappa).

Table 5: An example dialog with a noisy self-reported rating (score 5) identified by data Shapley.
User: the whole harry potter series
Sys: whoops i think i misheard you. can you repeat the movie name please?
User: all the harry potter series
Sys: ah, i don't know any movies by that name. sorry if i sometimes confuse you saying something else with a movie name
...
Sys: oh, okay. anyways, i really enjoyed watching black panther and wonder woman recently. what's your favorite superhero movie?
User: super man
Sys: what was that? can you please say the movie name again?
User: stop i gotta go