Learning an Unreferenced Metric for Online Dialogue Evaluation

Evaluating the quality of a dialogue interaction between two agents is a difficult task, especially in open-domain chit-chat style dialogue. There have been recent efforts to develop automatic dialogue evaluation metrics, but most of them do not generalize to unseen datasets and/or need a human-generated reference response during inference, making it infeasible for online evaluation. Here, we propose an unreferenced automated evaluation metric that uses large pre-trained language models to extract latent representations of utterances, and leverages the temporal transitions that exist between them. We show that our model achieves higher correlation with human annotations in an online setting, while not requiring true responses for comparison during inference.


Introduction
Recent approaches in deep neural language generation have opened new possibilities in dialogue generation. Most current language generation efforts are centered around language modelling or machine translation (Ott et al., 2018), which are evaluated by comparing directly against reference sentences. In dialogue, however, comparing with a single reference response is difficult, as there can be many reasonable responses given a context that have nothing to do with each other (Liu et al., 2016). Still, dialogue research papers tend to report scores based on word-overlap metrics from the machine translation literature (e.g., BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014)). However, word-overlap metrics aggressively penalize the generated response based on lexical differences with the ground truth and correlate poorly with human judgements (Liu et al., 2016). One can build dialogue evaluation metrics in two ways: referenced metrics, which compare the generated response with a provided ground-truth response (such as the above word-overlap metrics), or unreferenced metrics, which evaluate the generated response without any such comparison. Lowe et al. (2017) propose a learned referenced metric named ADEM, which learns an alignment score between context and response to predict human score annotations. However, since the score is trained to mimic human judgements, it requires collecting large-scale human annotations on the dataset in question and is not easily applicable to new datasets (Lowe, 2019).
Recently, Tao et al. (2017) proposed a hybrid referenced-unreferenced metric named RUBER, where the metric is trained without requiring human responses by bootstrapping negative samples directly from the dataset. However, referenced metrics (including RUBER, as it is part referenced) are not feasible for evaluating dialogue models in an online setting, when the model is pitted against a human agent (model-human) or another model agent (model-model), due to the lack of a reference response. In this setting, models are usually evaluated directly by humans, which is costly and requires careful annotator training (Li et al., 2019).
The contributions of this paper are (1) a completely unsupervised unreferenced metric MAUDE (Metric for automatic Unreferenced dialogue evaluation), which leverages state-of-the-art pre-trained language models (Devlin et al., 2018; Sanh et al., 2019) combined with a novel discourse-structure aware text encoder and a contrastive training approach; and (2) results showing that MAUDE has good correlation with human judgements.

Background
We consider the problem of evaluating the response of a dialogue system, where an agent is provided with a sequence of sentences (or utterances) c = {u_1, u_2, ..., u_n} (termed the context) and generates a response r = u_{n+1}. Each utterance u_i is itself a sequence of words u_i = {w_1, w_2, ..., w_m}. An utterance u_i can be represented as a vector h_i = f_e(u_i), where f_e is an encoder that maps the words to a fixed-length vector representation.
This work focuses on the evaluation of generative neural dialogue models, which typically consist of an encoder-decoder style architecture trained to generate u_{n+1} word-by-word. The response of a generative model is typically evaluated by comparing with the ground-truth response using various automatic word-overlap metrics, such as BLEU or METEOR. These metrics, along with ADEM and RUBER, are essentially single-step evaluation metrics, where a score is calculated for each context-response pair. If a dialogue D_i contains n utterances, we can extract n − 1 context-response pairs: (c_1: {u_1}, r_1: u_2), (c_2: {u_1, u_2}, r_2: u_3), ..., (c_{n−1}: {u_1, ..., u_{n−1}}, r_{n−1}: u_n). In this paper, we are interested in devising a scalar metric that evaluates the quality of a context-response pair: score(c_i, r_i) ∈ (0, 1). A key benefit of this approach is that such a metric can be used for online evaluation and also for better training and optimization, as it provides partial credit during response generation.
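As a concrete illustration, the extraction of single-step context-response pairs from a dialogue can be sketched as follows (a minimal sketch; the function name is ours, not from the paper):

```python
def context_response_pairs(dialogue):
    """Split a dialogue [u_1, ..., u_n] into n-1 (context, response) pairs.

    Pair i has context (u_1, ..., u_i) and response u_{i+1}.
    """
    return [(tuple(dialogue[:i]), dialogue[i]) for i in range(1, len(dialogue))]
```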

Proposed model
We propose a new model, MAUDE, for online unreferenced dialogue evaluation. We first describe the general framework behind MAUDE, which is inspired by the task of measuring alignment in natural language inference (NLI) (Williams et al., 2017). It involves training text encoders via noise contrastive estimation (NCE) to distinguish between valid dialogue responses and carefully generated negative examples. Following this, we introduce our novel text encoder that is designed to leverage the unique structural properties of dialogue.
MAUDE is designed to output a scalar score(c_i, r_i) ∈ (0, 1), which measures how appropriate a response r_i is given a dialogue context c_i. This task is analogous to measuring alignment in NLI, but instead of measuring entailment or contradiction, our notion of alignment aims to quantify the quality of a dialogue response. As in NLI, we approach this task by defining encoders f_e^θ(c) and f_e^θ(r) to encode the context and response, a combination function f_comb(.) to combine the representations, and a final classifier f_t(.), which outputs the alignment score:

score(c, r) = f_t(f_comb(f_e^θ(c), f_e^θ(r))). (1)

The key idea behind an unreferenced dialogue metric is the use of Noise Contrastive Estimation (NCE) (Gutmann and Hyvärinen, 2010) for training. Specifically, we train the model to differentiate between a correct response (score(c, r) → 1) and a negative response (score(c, r̂) → 0), where r̂ represents a candidate false response for the given context c. The loss to minimize contains one positive example and a range of negative examples drawn from a sampling policy P(r̂):

L = −log(score(c, r)) − E_{r̂∼P(r̂)} [log(1 − score(c, r̂))]. (2)
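In practice, the expectation over the sampling policy is approximated by an average over the sampled negatives. The following sketch (our own illustration, not the paper's code) computes this loss for one positive response score and a list of negative response scores:

```python
import math

def nce_loss(pos_score, neg_scores):
    """Noise-contrastive loss for one (context, response) pair.

    pos_score: score(c, r) for the ground-truth response, in (0, 1).
    neg_scores: scores score(c, r_hat) for sampled negative responses.
    The expectation over P(r_hat) is approximated by the sample mean.
    """
    loss = -math.log(pos_score)
    loss -= sum(math.log(1.0 - s) for s in neg_scores) / len(neg_scores)
    return loss
```

Minimizing this loss pushes the score of the true response toward 1 and the scores of the sampled negatives toward 0.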
The sampling policy P(r̂) consists of syntactic and semantic negative samples.

Syntactic negative samples. We consider three variants of syntax-level adversarial samples: word-order (shuffling the ordering of the words of r), word-drop (dropping x% of the words in r), and word-repeat (randomly repeating words in r).

Semantic negative samples. We also consider three variants of negative samples that are syntactically well formed but corrupted in the semantic space. First, we choose a response r_j at random from a different dialogue such that r_j ≠ r_i (random utterance). Second, we use a seq2seq model pre-trained on the dataset and pair a random seq2seq-generated response with r_i (random seq2seq). Third, to provide greater variation of semantically negative samples, for each r_i we generate high-quality paraphrases r_i^b using Back-Translation (Edunov et al., 2018). We pair random back-translations r_j^b with r_i as in the above setup (random back-translation). We also provide the paired r_i^b as a positive example so the models learn variation in semantic similarity. We further discuss the effect of different sampling policies in Appendix C.

Dialogue-structure aware encoder. Traditional NLI approaches (e.g., Conneau et al. (2017)) use the general setup of Equation 1 to score context-response pairs. The encoder f_e is typically a Bidirectional LSTM- or, more recently, a BERT-based model (Devlin et al., 2018), which uses a large pre-trained language model. f_comb is defined as in Conneau et al. (2017):

f_comb(u, v) = [u; v; u ∗ v; |u − v|]. (3)

However, the standard text encoders used in these traditional NLI approaches ignore the temporal structure of dialogue, which is critical in our setting, where the context is composed of a sequence of distinct utterances with natural and stereotypical transitions between them. (See Appendix A for a qualitative analysis of these transitions.)
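The three syntactic corruption schemes can be sketched as follows (a minimal illustration with our own function names; a seeded random.Random is used for reproducibility):

```python
import random

def word_order(response, rng):
    """Shuffle the word ordering of the response."""
    words = response.split()
    rng.shuffle(words)
    return " ".join(words)

def word_drop(response, rng, frac=0.3):
    """Drop roughly `frac` of the words (keeping at least one)."""
    words = response.split()
    kept = [w for w in words if rng.random() > frac]
    return " ".join(kept or words[:1])

def word_repeat(response, rng, prob=0.3):
    """Randomly repeat words in the response."""
    out = []
    for w in response.split():
        out.append(w)
        if rng.random() < prob:
            out.append(w)
    return " ".join(out)
```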
Thus, we propose a specialized text encoder for MAUDE, which uses a BERT-based encoder f_e^BERT but additionally models dialogue transitions using a recurrent neural network:

h_{u_i} = D_g(f_e^BERT(u_i)), h_i = RNN(h_{i−1}, h_{u_i}), h_c = max-pool(h_1, ..., h_n), (4)

where D_g is a learned mapping that downsamples the BERT representations. The final representation of the dialogue context is learned by pooling the individual hidden states of the RNN using max-pool (Equation 4). This context representation is mapped into the response vector space using a weight matrix W, to obtain the context vector c_i. We then learn the alignment score between c_i and the response representation h_{r_i} following Equation 1, with the combination function f_comb being the same as in Equation 3.
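To make the flow of the recurrence, the max-pooling, and the InferSent-style combination concrete, here is a toy sketch (our own illustration; scalar weights stand in for the learned parameters D_g, the RNN weights, and W, and plain lists stand in for tensors):

```python
import math

def rnn_step(h_prev, x, w_h=0.5, w_x=0.5):
    """Toy elementwise RNN cell: h_t = tanh(w_h * h_{t-1} + w_x * x_t)."""
    return [math.tanh(w_h * h + w_x * xi) for h, xi in zip(h_prev, x)]

def encode_context(utt_vecs):
    """Run the toy RNN over utterance vectors and max-pool the hidden states."""
    dim = len(utt_vecs[0])
    h = [0.0] * dim
    states = []
    for x in utt_vecs:
        h = rnn_step(h, x)
        states.append(h)
    return [max(s[d] for s in states) for d in range(dim)]  # max-pool over time

def f_comb(u, v):
    """InferSent-style combination: concatenation of u, v, u*v and |u - v|."""
    return (u + v
            + [a * b for a, b in zip(u, v)]
            + [abs(a - b) for a, b in zip(u, v)])
```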

Experiments
To empirically evaluate our proposed unreferenced dialogue evaluation metric, we are interested in answering the following key research questions: • Q1: How robust is our proposed metric to different types of responses?
• Q2: How well does the self-supervised metric correlate with human judgements? Datasets. For training MAUDE, we use PersonaChat (Zhang et al., 2018), a large-scale open-domain chit-chat style dataset collected from human-human conversations conditioned on provided user personas. We extract and process the dataset using the ParlAI (Miller et al., 2017) platform. We use the public train split for our training and validation, and the public validation split for testing. We use the human-human and human-model data collected by See et al. (2019) for correlation analysis, where the models themselves are trained on PersonaChat. Baselines. We use InferSent (Conneau et al., 2017) and unreferenced RUBER as LSTM-based baselines. We also compare against BERT-NLI, which is the same as the InferSent model but with the LSTM encoder replaced by a pre-trained BERT encoder. Note that these baselines can be viewed as ablations of the MAUDE framework with simplified text encoders, since we use the same NCE training loss to provide a fair comparison. Also note that in practice we use DistilBERT (Sanh et al., 2019) instead of BERT in both MAUDE and the BERT-NLI baseline (and thus we refer to the BERT-NLI baseline as DistilBERT-NLI).

Evaluating MAUDE on different types of responses
We first analyze the robustness of MAUDE by comparing with the baselines, using the same NCE training for all models for fairness. We evaluate the models on the difference score ∆ = score(c, r_ground-truth) − score(c, r) (Table 6). ∆ provides insight into the range of the score function: an optimal metric would cover the full range of good and bad responses. We evaluate the response r in three settings: semantic positive (paraphrases of the ground-truth response), semantic negative (well-formed but semantically corrupted responses), and syntactic negative (responses that have been adversarially modified in their lexical units). Ideally, we would want ∆ → 1 for semantic and syntactic negative responses and ∆ → 0 for semantic positive responses. We observe that the MAUDE scores are robust across all setups. The RUBER and InferSent baselines are weak, quite understandably so, because they cannot leverage large pre-trained language models and thus generalize poorly. The DistilBERT-NLI baseline performs significantly better than InferSent and RUBER, while MAUDE scores even better and more consistently overall. We provide a detailed ablation of various training scenarios as well as the absolute raw ∆ scores in Appendix C. We also observe that both MAUDE and DistilBERT-NLI are more robust in zero-shot generalization to different datasets; the results are available in Appendix B.
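The difference score used above can be sketched as follows (illustrative only; score_fn stands for any trained metric, and the toy metric in the usage example is ours):

```python
def delta(score_fn, context, ground_truth, response):
    """Difference score: the metric's preference for the ground truth over `response`."""
    return score_fn(context, ground_truth) - score_fn(context, response)
```

For a robust metric, delta should be near 1 for corrupted responses and near 0 for valid paraphrases of the ground truth.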

Correlation with human judgements
Metrics are evaluated by correlation with human judgements (Tao et al., 2017), or by human evaluation of the responses of a generative model trained on the metric (Wieting et al., 2019). However, this introduces a bias, either in the questionnaire setup or during data post-processing, in favor of the proposed metric. In this work, we refrain from collecting human annotations ourselves and instead refer to the recent work by See et al. (2019) on the PersonaChat dataset; the evaluation of our metric is thus less subject to bias. See et al. (2019) conducted a large-scale human evaluation of 28 model configurations to study the effect of controllable attributes in dialogue generation. We use the publicly released model-human and human-human chat logs from See et al. (2019) to generate scores with our models, and correlate them with the associated human judgements on a Likert scale. See et al. (2019) use a multi-step evaluation methodology, where the human annotators rate the entire dialogue rather than a context-response pair. Our setup, on the other hand, is essentially a single-step evaluation method. To align our scores with the multi-turn evaluation, we average over the individual turns to get an aggregate score for a given dialogue. We investigate the correlation between these scores and the uncalibrated individual human scores from 100 crowdworkers (Fig. 2), as well as the aggregated scores released by See et al. (2019), which are adjusted for annotator variance using Bayesian calibration (Kulikov et al., 2018) (Table 2). In all cases, we report Spearman's correlation coefficients.
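The aggregation and correlation steps can be sketched as follows (our own minimal implementation; unlike standard Spearman implementations, ties in the rank computation are not handled):

```python
def dialogue_score(turn_scores):
    """Aggregate single-step scores into one dialogue-level score by averaging."""
    return sum(turn_scores) / len(turn_scores)

def _ranks(xs):
    """Rank positions of each element (no tie handling)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def spearman(a, b):
    """Spearman's rho as the Pearson correlation of the rank vectors."""
    ra, rb = _ranks(a), _ranks(b)
    n = len(a)
    ma, mb = sum(ra) / n, sum(rb) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(ra, rb))
    sa = sum((x - ma) ** 2 for x in ra) ** 0.5
    sb = sum((y - mb) ** 2 for y in rb) ** 0.5
    return cov / (sa * sb)
```

In practice a library routine such as scipy.stats.spearmanr would be used; this sketch only makes the computation explicit.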
For uncalibrated human judgements, we observe that MAUDE has higher relative correlation in 6 out of 8 quality measures. Interestingly, for calibrated human judgements, DistilBERT-NLI proves better in half of the quality measures. MAUDE achieves marginally better overall correlation for calibrated human judgements, due to significantly stronger correlation on two specific measures: Interestingness and Engagingness. These measures answer the questions "How interesting or boring did you find this conversation?" and "How much did you enjoy talking to this user?" (refer to Appendix B of See et al. (2019) for the full list of questions). Overall, using large pre-trained language models provides a significant boost in the human correlation scores.

Conclusion
In this work, we explore the feasibility of learning an automatic dialogue evaluation metric by leveraging pre-trained language models and the temporal structure of dialogue. We propose MAUDE, an unreferenced dialogue evaluation metric that leverages sentence representations from large pre-trained language models and is trained via Noise Contrastive Estimation. MAUDE also learns a recurrent neural network to model the transitions between utterances in a dialogue, allowing it to correlate better with human annotations. This is a good indication that MAUDE can be used to evaluate online dialogue conversations. Since it provides immediate, continuous rewards at the single-step level, MAUDE can also be used to optimize and train better dialogue generation models, which we plan to pursue as future work.

A Temporal Structure
We hypothesize that a good encoding function can capture the structure that exists in dialogue. Often this translates to capturing the semantics and coherency of the dialogue, which are some of the key attributes of a conversation. Formally, we propose using a function f_t^{D_i} that maps one utterance to the next.
To define a good encoding function, we turn to pre-trained language models. These models are typically trained on large corpora and achieve state-of-the-art results on a range of language understanding tasks (Ott et al., 2018). To validate our hypothesis, we use a pre-trained (and fine-tuned) BERT (Devlin et al., 2018) as f_e. We compute h_{u_i} = f_e(u_i) for all u_i ∈ D, and learn a linear classifier to predict an approximate position of u_i in D_i. This task is subtle by design: in goal-oriented dialogues the vocabulary differs across different parts of the conversation, whereas in chit-chat dialogues no such claim can be made. For our experiments, we choose PersonaChat (Zhang et al., 2018) and DailyDialog (Li et al., 2017) as representative of chit-chat style data, and Frames (Asri et al., 2017) and MultiWOZ (Budzianowski et al., 2018) for goal-oriented data.
We tag every consecutive pair of utterances with a percentage score t, denoting that the pair occurs after completion of t% of the dialogue:

t = 100 × (index_up / k),

where index_up denotes the average of the indices of the pair of utterances and k denotes the total number of utterances in the dialogue. We then pre-define a number of bins B and split the range 0-100 into B non-overlapping sets (every set has a minimum and maximum, denoted by s_min^i and s_max^i respectively). We parse every dialogue in the dataset and place the encoding of every utterance pair in the corresponding bin.
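For illustration, the percentage tag and the binning can be sketched as follows (our own helper names; utterance indices are taken to be 1-based):

```python
def position_score(i, k):
    """Percentage position t of the consecutive pair (u_i, u_{i+1}).

    index_up is the average of the two 1-based utterance indices;
    k is the total number of utterances in the dialogue.
    """
    index_up = (i + (i + 1)) / 2.0
    return 100.0 * index_up / k

def bin_index(t, num_bins):
    """Map a percentage t in [0, 100] into one of num_bins non-overlapping bins."""
    return min(int(t * num_bins / 100.0), num_bins - 1)
```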
We then use Linear Discriminant Analysis (LDA) to predict the bin of each utterance u_i in the dialogue, after converting the high-dimensional embedding into 2 dimensions. LDA provides the best possible class-conditioned representation of the data. This gives us a downsampled representation of each utterance u_i, which we plot as shown in Figure 3. The reduction of the BERT encodings to 2 dimensions shows that BERT is useful in nudging the encoded utterances towards useful structures. We see well-defined clusters for goal-oriented dialogues but less well-defined clusters for open-domain dialogues, which is reasonable and intuitive to expect.

B Generalization on unseen dialog datasets
In order for a dialogue evaluation metric to be useful, one has to evaluate how it generalizes to unseen data. We took our models trained on the PersonaChat dataset and evaluated them zero-shot on two goal-oriented datasets, Frames (Asri et al., 2017) and MultiWOZ (Budzianowski et al., 2018), as well as the chit-chat style DailyDialog dataset (Li et al., 2017) (Table 3). We find that BERT-based models are significantly better at generalization than InferSent or RUBER, with MAUDE marginally better than the DistilBERT-NLI baseline. MAUDE has the biggest impact on generalization to the DailyDialog dataset, which suggests that it captures the commonalities of chit-chat style dialogue from PersonaChat. Surprisingly, the generalization of BERT-based models also improves significantly on the goal-oriented datasets. This suggests that, irrespective of the nature of the dialogue, pre-training helps because it captures information common to English lexical usage.

C Noise Contrastive Estimation training ablations
The choice of negative samples (Section 3) for Noise Contrastive Estimation can have a large impact on the test-time scores of the metrics. In this section, we show the effect of training only with syntactic negative samples (Table 4) and only with semantic negative samples (Table 5). For comparison, we show the full results when training with both sampling schemes in Table 6. Overall, we find that training with only syntactic or only semantic negative samples achieves a smaller ∆ than training with both schemes. All models achieve high scores on the semantic positive samples when trained only with syntactic adversaries. However, training only with syntactic negative samples adversely affects the detection of semantic negative items.

D Qualitative Evaluation
We qualitatively investigate how the different models score responses in the online evaluation setup on the data collected by See et al. (2019). In Figure 4, we show a sample conversation where a human evaluator is pitched against a strong model. Here, MAUDE scores correlate strongly with the raw Likert scores on the different metrics. We observe that the RUBER and InferSent baselines overall correlate negatively with the responses. In Figure 5, we show another sample where a human evaluator is pitched against a weak model, which exhibits degenerate responses. We see that both MAUDE and DistilBERT-NLI correlate strongly with the human annotation and provide a very low score, compared to RUBER or InferSent.
Since we essentially cherry-picked good results, it is only fair to show a similarly cherry-picked negative example for MAUDE. We sampled from responses where MAUDE scores are negatively correlated with human annotations on the Inquisitiveness metric (5% of cases), and we show one such response in Figure 6. We notice how both DistilBERT-NLI and MAUDE fail to recognize the duplication of utterances, which leads to a low overall score. This suggests there is still room for improvement in developing MAUDE, possibly by training the model to detect degeneracy in the context.

E Hyperparameters and Training Details
We performed a rigorous hyperparameter search to tune MAUDE. We train MAUDE with downsampling, as we observe poor results when we run the recurrent network on top of the 768-dimensional representations. Specifically, we downsample to 300 dimensions, which is the same dimensionality used by our baselines RUBER and InferSent in their respective encoder representations. We also tested the choice of learning a PCA to downsample the BERT representations versus learning the mapping D_g (Equation 4), and found the latter to produce better results. We keep the final decoder the same for all models: a two-layer MLP with a hidden layer of 200 dimensions and dropout 0.2. For the BERT-based models (DistilBERT-NLI and MAUDE), we use HuggingFace Transformers (Wolf et al., 2019) to first fine-tune on the training dataset with a language model objective. We tested training on frozen fine-tuned representations in our initial experiments, but fine-tuning end-to-end led to better ablation scores. For all models, we train using the Adam optimizer with a learning rate of 0.0001, with early stopping until the validation loss stops improving. For the sake of easy reproducibility, we use the PyTorch Lightning (Falcon, 2019) framework. We used 8 Nvidia TitanX GPUs.