Language Model Transformers as Evaluators for Open-domain Dialogues

Computer-based systems for communication with humans have been a cornerstone of AI research since the 1950s. So far, the most effective way to assess the quality of the dialogues produced by these systems has been resource-intensive manual labor rather than automated means. In this work, we investigate whether language models (LM) based on transformer neural networks can indicate the quality of a conversation. In a general sense, language models are methods that learn to predict one or more words given an already observed context. Due to their unsupervised nature, they are candidates for efficient, automatic indication of dialogue quality. We demonstrate a significant positive correlation between the output of the language models and the scores assigned by human evaluators. We also provide some insights into their behavior and inner workings in a conversational context.


Introduction
Lately, deep learning conversational systems have seen increasing interest from industry and academia alike (Chen et al., 2017). These systems find usage in various contexts, ranging from personal speech assistants like Google Assistant, through "chatbots" on instant messaging platforms like Facebook Messenger, to conversational services like LUIS 1 . Many of these applications serve the objective of completing a specific function, such as purchasing a product or booking services (e.g., hotels, flights). Nonetheless, these applications can still profit from open-domain dialogue skills like chit-chatting, which would provide a more human-like interaction with users.
Presently, scientists and engineers working on computer-based conversational systems need human-based evaluation to assess the quality and usability of their work (Dinan et al., 2019; Yoshino et al., 2019). These evaluations are costly in terms of resources. Thus, the field of dialogue systems could take advantage of an automated method for assessing conversations.
Seminal works in text summarization and machine translation have already proposed field-specific metrics for automated assessment: ROUGE (Lin, 2004) for the former, and BLEU (Papineni et al., 2002) and METEOR (Banerjee and Lavie, 2005) for the latter. Dialogue system research (Ritter et al., 2011; Yoshino et al., 2019) frequently uses these metrics as well. However, Liu et al. (2016) show that these metrics, which are based on word overlap between prediction and references, are not reliable for evaluating the usefulness of dialogue systems. Hence, the field needs more sophisticated methods that consider the previous utterances of a conversation and their semantic meaning.
When human annotators evaluate a dialogue, they do not use an explicit reference or necessarily seek word overlap between context and response (or the lack of it). Their assessment is based on their experience with the language and the implicit knowledge they have about it. The core principle of statistical language models (LM) is to capture and reproduce these properties. LM have proven invaluable in state-of-the-art approaches to natural language processing and natural language understanding (Peters et al., 2018; Devlin et al., 2019; Radford et al., 2019).
Thus, the main aim of this work 2 is to investigate the usability of language models as means for evaluating dialogues, since they require neither a reference nor supervision. We demonstrate that there is a significant positive correlation between the predictions of language models and human evaluation scores. Furthermore, we provide insights into the inner workings and behavior of language models in the dialogue context.

Related Work
In this section, we present earlier work that focuses on dialogue evaluation. Furthermore, we provide a concise introduction to language model transformers and recent advances in this particular set of approaches.

Dialogue Evaluation
Lowe et al. (2017) present a cornerstone work in dialogue evaluation. They propose an automatic dialogue evaluation model (ADEM) that employs a neural network to approximate human judgment, using scored dialogues together with the context, a reference response, and a response generated by a dialogue system. Reference responses and human annotation scores are hard to obtain, which makes it challenging to employ the approach on large dialogue datasets. Another cornerstone is the work of Tao et al. (2018), the Referenced metric and Unreferenced metric Blended Evaluation Routine (RUBER). They suggest a method consisting of two elements: the first captures the resemblance between a generated and a reference response using word vector pooling; the second uses a neural network to estimate the relevance of a reply. The model is trained to distinguish whether an answer in a dialogue is the original one or a random one from another conversation. A drawback of both approaches is that they rely on reference responses to derive a score. Furthermore, Sai et al. (2019) demonstrate that machine learning approaches for dialogue evaluation like ADEM are susceptible to adversarial attacks.
Other works address the issue that there is more than one possible response for a given dialogue context by considering multiple reference responses. For example,  suggest an augmented version of BLEU that uses synthetically generated responses. The algorithm in  operates similarly. Sugiyama et al. (2019) develop a Support Vector Regression approach to consider multiple references.  investigate a framework of dialogue-modeling methods combined with a variety of metrics, where they evaluate dialogues using various references. Zhang et al. (2020) propose BERTScore, which calculates text similarity using contextual embeddings. Their work can be used for evaluating text generation against a reference; unfortunately, it offers no way to evaluate dialogues without a specified ground truth. On another note, Kann et al. (2018) suggest a sentence-level fluency metric derived from the perplexity score of a language model given a sentence, without involving any references. Their results demonstrate significant positive correlations with human annotators. Nedelchev et al. (2020) experiment with an anomaly detection approach in which erroneous dialogues are treated as anomalies.

Language Models
The first application of n-gram-based language models was recorded in the mid-1970s in two independent works by Jelinek (1976) and Baker (1975). Given a sequence of tokens, T = {t_1, ..., t_N}, a forward language model computes the probability of the sequence by modeling the probability of each token t_K (K ≤ N) given its history up to the K-th token (Peters et al., 2018). Some of the initial neural network models (Melis et al., 2017) initially use a context-independent vector representation for each token, which then passes through one or more LSTM layers (Hochreiter and Schmidhuber, 1997). In the end, they produce a context-dependent vector that serves as input to a softmax layer predicting the next token. In a reversed fashion, backward LM use the context to the right of the target token to predict it, while bi-directional language models use a combination of both. Radford et al. (2019) propose generative pre-training (GPT2), which uses the transformer (Vaswani et al., 2017) as a forward language model due to its superiority in terms of long-term memory when contrasted with recurrent neural networks like LSTMs.
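To make the chain-rule factorization of a forward language model concrete, the following minimal Python sketch estimates sequence probabilities with a toy bigram model. The corpus and the bigram truncation of the history are illustrative assumptions, not part of the transformer models discussed in this paper.

```python
from collections import Counter

# A toy corpus to illustrate the chain-rule factorization used by forward
# language models; the corpus and the bigram estimator are illustrative only.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_next(token, history):
    """P(token | history), here truncated to the last token (bigram assumption)."""
    prev = history[-1]
    return bigrams[(prev, token)] / unigrams[prev]

def sequence_probability(tokens):
    """P(t_1, ..., t_N) = P(t_1) * prod_K P(t_K | t_1 .. t_{K-1})."""
    prob = unigrams[tokens[0]] / len(corpus)
    for k in range(1, len(tokens)):
        prob *= p_next(tokens[k], tokens[:k])
    return prob

print(sequence_probability("the cat sat".split()))  # (4/14) * (1/4) * (1/1)
```

A real forward LM replaces the bigram estimate with a learned conditional distribution over the full history, but the accumulation of conditional probabilities is the same.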
Furthermore, Devlin et al. (2019) suggest an innovative way to train language models, also utilizing transformers: Bidirectional Encoder Representations from Transformers (BERT). They invent the masked language model (MLM), where a random subset of tokens from a sequence is masked or replaced, and the model then predicts those tokens using the remaining original context. Furthermore, BERT uses an additional LM objective: next sentence prediction (NSP). It works by teaching the model to recognize whether two sentences appear sequentially in a corpus or not.
Yet another innovative transformer-based language model is XLNet (Yang et al., 2019). It combines the best features of a generative LM like GPT2 and a masked LM like BERT by training over the permutations of all factorization orders of a sequence. Thanks to this, XLNet learns to utilize knowledge from both sides of the target token, as well as the respective context of other positions. Golovanov et al. (2019) demonstrate that pre-trained transformer language models provide benefits for conversational agents.
For completeness, we mention other language models below that utilize transformers but are not integral to this work. We do not employ them because the architectures discussed above already supersede them, or because we deem their additions inadequate for modeling dialogues. Transformer-XL (Dai et al., 2019) is a new approach that allows transformers to model even longer sequences by caching and reusing intermediate hidden states; XLNet also utilizes this method in its implementation. The Cross-lingual Language Model by Lample and Conneau (2019) introduces Translation Language Modeling, i.e., it randomly masks words in parallel sequences in two languages to teach the model to leverage multi-lingual context. Robustly optimized BERT (RoBERTa; Liu et al., 2019) drops BERT's next sentence prediction and makes a few other modifications to training. Raffel et al. (2019) introduce the Text-to-Text Transfer Transformer, whose language-modeling objective uses a text-to-text perspective. The Conditional Transformer Language model by Keskar et al. (2019) incorporates conditioning on control codes to guide the generation of tokens.
Besides capturing syntax, LM are also capable of modeling the semantics of sentences. The results of Tenney et al. (2019) suggest that contextual word embeddings can encode both syntax and semantics at a sub-sentence level. Furthermore, Zhou et al. conduct a systematic benchmark evaluating seven LM for their commonsense knowledge and reasoning, and their work suggests that the models possess these abilities to a certain degree. Such commonsense knowledge would also help in evaluating open-domain dialogues.

Methodology
In this section, we describe the datasets used for assessing the usability of transformer language models for evaluating dialogue quality, introduce the employed approaches in greater detail, and explain their relevance to the task at hand.

Datasets
We use the data gathered during the ConvAI1 3 and ConvAI2 4 (Zhang et al., 2018; Dinan et al., 2019) challenges. The organizers invited competitors to develop dialogue systems addressing specific tasks. For ConvAI1, the participating systems needed to be able to converse about a given topic. In the other competition, the chatbots had to engage in small talk while impersonating a pre-defined personality profile ("persona"). In both cases, human annotators evaluated the capability of the dialogue systems by interacting with them and giving a score at the end. For both competitions, the scoring is on the dialogue level. In Table 1 and Figure 1, we present additional details about the data. However, we do not evaluate the two challenge-specific tasks (topic discussion and role acting). Instead, we aim at general open-domain dialogue evaluation, which implies relevance, coherence, and fluency of the utterances.

Language Model Evaluators
In Section 2.2, we presented a concise introduction to transformer-based language models. In this subsection, we provide more details about three of those architectures and how we use them in this study. Our main goal is to use the LM to assign a probability to the utterances in a conversation. We used HuggingFace's Transformers 5 for the implementation and pre-trained weights of the transformer-based language models. Since intuition dictates that responses depend on their preceding context, we condition the target reply on its history to measure its relevance. Kann et al. (2018) showed that language models can serve as good sentence-level fluency indicators. Thus, the probability calculated by a transformer-based LM can serve as a combined score for fluency and coherency. The following LM are used in this work:

1.) As previously mentioned, BERT (Devlin et al., 2019) uses two language modeling objectives: masked language modeling (MLM) and next sentence prediction (NSP). MLM provides no viable way to compute the probability of a target response, because it originally substitutes only a random subset of tokens; thus, there is no consistent and deterministic way to use masked language modeling for assigning a probability score to a response given its context. However, BERT's next sentence prediction is well suited to the task at hand: it can judge whether an utterance follows its contextual predecessor. Thus, we pair up the sequentially appearing utterances in a conversation and compute a probability score for the second reply:

P(u_2 | u_1) = P(t_{2,1}, t_{2,2}, ..., t_{2,n} | t_{1,1}, t_{1,2}, ..., t_{1,m}),

where P(u_2 | u_1) is the probability score of the target response, while (t_{1,1}, ..., t_{1,m}) and (t_{2,1}, ..., t_{2,n}) are the tokens belonging to the query and response utterances, respectively.
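As an illustration of the NSP-based scoring of an utterance pair, the sketch below converts BERT's two NSP logits into P(u_2 | u_1). The softmax helper is self-contained; the commented usage follows HuggingFace Transformers' `BertForNextSentencePrediction` API (class and attribute names per recent library versions) and is only a sketch, since it requires downloading pre-trained weights.

```python
import math

def nsp_probability(logits):
    """Convert BERT's two NSP logits [is_next, not_next] into
    P(response follows context) via a numerically stable softmax.
    Index 0 is the "is next sentence" class in BERT's NSP head."""
    exps = [math.exp(x - max(logits)) for x in logits]
    return exps[0] / sum(exps)

# Hedged usage sketch with HuggingFace Transformers (downloads a model):
#
#   from transformers import BertTokenizer, BertForNextSentencePrediction
#   import torch
#   tok = BertTokenizer.from_pretrained("bert-base-uncased")
#   model = BertForNextSentencePrediction.from_pretrained("bert-base-uncased")
#   enc = tok("How are you?", "I am fine, thanks!", return_tensors="pt")
#   with torch.no_grad():
#       logits = model(**enc).logits[0].tolist()
#   score = nsp_probability(logits)  # P(u2 | u1) for this utterance pair

print(nsp_probability([2.0, -1.0]))
```

With equal logits the helper returns 0.5, i.e., the model is undecided about whether the response follows the context.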
2.) GPT2 (Radford et al., 2019) follows the standard language model approach that factorizes the joint probability of the sequence tokens (t_1, t_2, ..., t_n) as a product of conditional probabilities (Peters et al., 2018):

P(t_1, ..., t_n) = prod_{k=1}^{n} P(t_k | t_1, ..., t_{k-1}).

In our problem domain, we need to consider two consecutive sequences and capture the coherence between them. Thus, we concatenate them into one sequence, where the context appears first, followed by the second utterance. We then compute the joint probability of the second part conditioned on the past:

P(u_2 | u_1) = prod_{k=m+1}^{m+n} P(t_k | t_1, ..., t_{k-1}),

where m is the length of the context and n is the length of the target utterance.

3.) XLNet (Yang et al., 2019) follows the same general language model approach as GPT2, with some additions to its training objective and neural network architecture. First, unlike GPT2, XLNet optimizes the model over a sequence with respect to all possible permutations of the factorization orders rather than a single one. Second, compared to conventional neural transformers, XLNet adds one more attention stream that includes the positional information of the target token while excluding its content, in order to maintain the autoregressive properties. To compute probabilities for the utterances, we follow the same procedure as described above for GPT2.
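The conditional scoring used for GPT2 and XLNet can be sketched as follows: the response is scored token by token given the growing concatenated history, accumulating log-probabilities only over the response positions (k = m+1 .. m+n). The `toy_lm` distribution is a stand-in assumption; with the real models, one would softmax the logits of the corresponding LM head at each position.

```python
import math

def response_log_prob(context_tokens, response_tokens, next_token_probs):
    """log P(response | context): concatenate both utterances and accumulate
    log-probabilities only over the response positions."""
    sequence = list(context_tokens)
    log_prob = 0.0
    for token in response_tokens:
        dist = next_token_probs(sequence)   # P(. | t_1 .. t_{k-1})
        log_prob += math.log(dist[token])
        sequence.append(token)
    return log_prob

# Toy next-token distribution standing in for a real LM head.
def toy_lm(history):
    return {"hello": 0.5, "there": 0.25, "world": 0.25}

lp = response_log_prob(["hi"], ["hello", "world"], toy_lm)
print(lp)  # log(0.5) + log(0.25)
```

Working in log space avoids numerical underflow for long utterances; exponentiating the result recovers the product of conditional probabilities from the equation above.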
In this work, we use a set of hyper-parameter configurations for each of the three language models. We present them in Table 2.

Scoring
In Equations 2 and 3, we showed how language models compute a probability score for a whole sequence. However, as an aggregated score over the tokens, it loses the initial probabilistic distribution over the tokens. Furthermore, since we are dealing with dialogues, i.e., sequences of utterances, we need to perform two levels of aggregation: the first aggregates the word tokens within an utterance, while the second aggregates over the utterances of a dialogue. Thus, we investigate different ways to derive an aggregated score over the word tokens and over the utterances within a dialogue. Besides a product of probabilities, we also look into a sum, which captures the length of the sequence (utterance or dialogue) and might prove beneficial for a correlation study with human annotators, as well as an unweighted average. We normalize all scores such that they range between 0.0 (population minimum) and 1.0 (population maximum).
For GPT2 and XLNet, our experiments show that the following formulation correlates best with human annotator scores: the token probabilities are averaged within each utterance, and the resulting utterance scores are summed over the dialogue ("u-avg-d-sum"),

score(D) = sum_{i=1}^{U} (1 / n_i) sum_{k=1}^{n_i} P(t_{i,k} | history),

where U is the number of utterances in dialogue D and n_i is the length of the i-th utterance. We investigated other means to compute an aggregated score on the dialogue level and present the results with low correlation coefficients and significance in the appendix.
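The "u-avg-d-sum" aggregation and the min-max normalization described above can be sketched in a few lines; the token probabilities below are made up for illustration.

```python
def u_avg_d_sum(dialogue_token_probs):
    """Average token probabilities within each utterance,
    then sum the utterance scores over the dialogue."""
    return sum(sum(u) / len(u) for u in dialogue_token_probs)

def min_max_normalize(scores):
    """Rescale raw dialogue scores to [0, 1] over the population."""
    lo, hi = min(scores), max(scores)
    return [(s - lo) / (hi - lo) for s in scores]

dialogues = [
    [[0.2, 0.4], [0.6]],          # two utterances
    [[0.1, 0.1, 0.1]],            # one short, low-probability utterance
    [[0.5, 0.5], [0.5], [0.5]],   # longer dialogue, rewarded by the sum
]
raw = [u_avg_d_sum(d) for d in dialogues]
print(min_max_normalize(raw))
```

Note how the third dialogue receives the highest raw score purely because the sum over utterances rewards longer conversations, which matches the correlation with dialogue length discussed in the evaluation.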

Baseline
We take RUBER from Tao et al. (2018) as a baseline. The approach originally employs two components. The first calculates a resemblance score using word vector pooling and reference responses. Since we aim for an unreferenced evaluation approach akin to a human evaluator, we use only the second component of the method, which calculates a relevance score for a given response based on its preceding context. It uses a bidirectional GRU network and negative sampling. To reproduce the original results of RUBER as closely as possible, we sample 1,449,218 pairs of sequential utterances from the OpenSubtitles dataset (Lison and Tiedemann, 2016).
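The negative-sampling setup of RUBER's unreferenced component can be sketched as follows: true (context, response) pairs are labeled positive, and responses sampled from other parts of the corpus are labeled negative. The pair-construction function and the toy dialogues are our own illustrative assumptions; the bidirectional GRU scorer itself is not shown.

```python
import random

def make_training_pairs(dialogues, num_negatives=1, seed=0):
    """Build (context, response, label) triples: label 1 for the true next
    utterance, label 0 for a response sampled from elsewhere in the corpus."""
    rng = random.Random(seed)
    all_utterances = [u for d in dialogues for u in d]
    pairs = []
    for d in dialogues:
        for ctx, resp in zip(d, d[1:]):
            pairs.append((ctx, resp, 1))
            for _ in range(num_negatives):
                neg = rng.choice(all_utterances)
                while neg == resp:          # avoid sampling the true response
                    neg = rng.choice(all_utterances)
                pairs.append((ctx, neg, 0))
    return pairs

dialogues = [["hi", "hello", "how are you?"], ["bye", "see you"]]
pairs = make_training_pairs(dialogues)
print(len(pairs))  # 3 positives + 3 negatives
```

A relevance model trained on such triples learns to assign high scores to responses that plausibly follow their context, without ever seeing a reference response.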

Evaluation
In this part of the paper, we conduct a correlation analysis between the probabilities calculated by the LM and the scores given to the dialogues by human evaluators. We also provide a closer look at some auxiliary model outputs.

Quantitative Assessment
In Table 3, we report the noteworthy Pearson's and Spearman's correlation coefficients between the aggregated probability scores and the evaluations of the dialogues.
The immediate observation from using language models as dialogue evaluators is that there are performance gaps between the three approaches. Most evident is the difference between BERT and the others, which is explained by its next sentence prediction objective. Unlike the other two, BERT takes the most structured approach to modeling two sequences: it recognizes the two utterances as separate and captures their information as a whole. Thus, compared to GPT2 and XLNet, it has the advantage of not needing score aggregation on the utterance level, because it produces a probability for the whole utterance rather than word by word.

Table 3: Pearson's r and Spearman's ρ correlation coefficients between the two dialogue datasets' (ConvAI1 and ConvAI2) human scores and various aggregated scores from the language models. "u-avg-d-sum" stands for probabilities averaged on the utterance level and then summed on the conversation level. Most of the scores have a confidence of p <= 0.001. Exceptions are GPT2-medium and GPT2 on ConvAI1 with 0.812 and 0.212, respectively, as well as RUBER-U on ConvAI2, both r and ρ, with 0.5309 and 0.8166, respectively.
Also, there is a smaller difference in performance between GPT2 and XLNet. First of all, they share a common foundation as autoregressive language models and are thus more similar to each other than to BERT, which explains their overall behavioral similarity. However, XLNet has a structural improvement in its architecture: unlike GPT2, it also encodes the positional information of the target token. Thus, similarly to BERT, it can capture more information about a sequence and consequently achieves a better correlation score.
Additionally, we investigate the effect of model size.
The difference in correlation coefficients between the hyper-parameter configurations is marginal and, in one case, even non-existent. The most evident example is the spectrum displayed by the three GPT2 settings. Ultimately, we conclude that smaller models perform similarly at a much lower energy cost.
Regarding score aggregation, all approaches unanimously show that averaging on the utterance level and summing over the whole conversation is the most informative for dialogue evaluation. At the same time, using a product or an unweighted average produces correlation coefficients very close to zero with extremely low significance (e.g., p-values ranging from 0.4 to 0.8). This behavior indicates that while utterance length is insignificant, the duration of the conversation strongly influences its quality score. In Figures 2a and 2b, regression models show the interaction between the annotator quality score and the length of a conversation in ConvAI1 and ConvAI2, respectively. In both cases, the regression shows a positive trend: the longer a dialogue is, the better its assessment. We also see that in the case of ConvAI1, the confidence area is much wider than in ConvAI2. This further supports the results in Table 3, where the language models have considerably lower correlation coefficients for ConvAI1.
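The length-score regression reported in Figures 2a and 2b can be illustrated with a minimal ordinary-least-squares fit; the (length, score) pairs below are hypothetical and only mimic the observed positive trend, not the actual challenge data.

```python
def linear_fit(x, y):
    """Ordinary least-squares intercept and slope for y ~ a + b*x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    b = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / \
        sum((xi - mx) ** 2 for xi in x)
    return my - b * mx, b

# Hypothetical (dialogue length, annotator score) pairs:
lengths = [2, 4, 6, 8, 10]
scores = [1.0, 2.0, 2.5, 3.5, 4.0]
intercept, slope = linear_fit(lengths, scores)
print(slope > 0)  # positive trend: longer dialogues tend to score higher
```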

Qualitative Assessment
Furthermore, we manually investigated short conversations from both datasets that also received low quality scores. Many of the short dialogues show that the system indeed performed poorly by not responding at all, or by producing first utterances that were not diverse or even identical. Thus, the annotator would terminate the session and evaluate the dialogue with a low score. In contrast, conversations that were more interactive and of longer duration also received better assessments.

Table 4: Sample conversations from the datasets, together with the responses generated by the language model.

Original Context | Original Response (as in dataset) | Generated Response (generated by transformer)
"Wow! Are you man or woman?" | "I am! i am a woman." | " 'm a I am a man! I"
"How nice! Do you have a boyfriend?" | "I do not. i am a single mom." | " 'm . . I am a virgin woman. i"
"What do you mean?" | "granted the right to accept only one religion" | "anted, fact to be or the of"
"Do you know Utrecht?" | "granted the right to accept only one religion" | "ind, title to use donations Dutch application"

In this subsection, we report the correlation scores between the maximum probabilities for each token and the annotator scores. The intuition is that, besides being renowned for advancing the state of the art in various NLP benchmarks, language models are prominent for being capable generators of natural language. Furthermore, Hendrycks and Gimpel (2017) have demonstrated that the maximum class probability of a neural network classifier tends to be lower for samples that are out of distribution. Thus, we set out to investigate whether the predicted maximum-probability tokens of language models can also indicate the quality of dialogues.
Although there are some studies demonstrating BERT generating text, we do not consider BERT in this part of the work due to the nature of its masked language modeling objective, which does not aim at generating text. For GPT2 and XLNet, we look into the most likely words they predict for each position of the sequence instead of the original tokens.
In the context of dialogue evaluation, this means that, on average, max scores should be higher for fluent and coherent text like the one used for pre-training the language models, while erroneous samples should have lower maximum probabilities.
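The max-token variant of the "u-avg-d-sum" score can be sketched as follows: at every position, the probability of the model's most likely token replaces the probability of the original token. The toy distribution is an illustrative assumption standing in for a real LM head.

```python
def max_token_scores(dialogue, next_token_dist):
    """For each position, take the probability of the *most likely* token
    under the model rather than the probability of the original token,
    then aggregate with utterance-average / dialogue-sum."""
    utterance_scores = []
    for utterance in dialogue:
        history, position_scores = [], []
        for token in utterance:
            dist = next_token_dist(history)
            position_scores.append(max(dist.values()))  # max over vocabulary
            history.append(token)
        utterance_scores.append(sum(position_scores) / len(position_scores))
    return sum(utterance_scores)  # dialogue-level sum ("u-avg-d-sum")

def toy_dist(history):
    return {"yes": 0.7, "no": 0.3}

print(max_token_scores([["no", "no"], ["yes"]], toy_dist))  # 0.7 + 0.7
```

Since the maximum is taken over the vocabulary, the score no longer depends on which token actually appears, only on how confident the model is at each position.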
Firstly, we investigate the quantitative relation of the max scores to the human annotator scores. Similarly to Section 4.1, we calculate the aggregated probability scores, replacing the probability of each target token with the probability of the most likely word according to the language model, i.e., max_{t in V} P(t | history) over the vocabulary V.

Table 5: Pearson's correlation coefficients, r, and Spearman's correlation coefficients, ρ, between the two dialogue datasets' human scores and various aggregated scores for the max word instead of the target. "u-avg-d-sum" stands for probabilities averaged on the utterance level and then summed on the conversation level. All of the scores have a confidence of p <= 0.001.
We present the results in Table 5. Compared to the analogous results in Table 3, GPT2 and XLNet demonstrate noticeably higher correlation coefficients, especially for the dialogues in the ConvAI1 dataset. This discrepancy suggests that, in some cases, the models can generate text that would fit better into the conversation. Since ConvAI1 and ConvAI2 took place before the introduction of transformer-based language models, it is safe to assume that the participating systems are inferior.
In Table 4, we present some short sample conversations together with the text generated by a language model. The top two examples received high scores from the human annotators, while the rest are of low quality. For the high-quality dialogues, the model can reconstruct sensible responses that are still different from the original reply. On the other hand, whenever there is an incoherent conversation, as in the third and fourth examples, GPT2 and XLNet are not able to produce a response that is either fluent or related to the current context. Another peculiarity is that the language model possesses, in a sense, common knowledge. The fourth example demonstrates this: the preceding utterance mentions Utrecht, a Dutch city, and the model is then induced to predict "Dutch" as one of the response tokens.

Conclusion
In this study, we investigated whether transformer-based language models can evaluate dialogues in terms of coherency and fluency. Overall, Pearson's and Spearman's correlation coefficients demonstrate that BERT, GPT2, and XLNet can indicate a conversation's quality without any additional supervision or reference. While, at their core, the three use the same approach, transformers, they have further structural modifications that set them apart in the current problem domain. GPT2 performs worst due to its standard language modeling approach, which incorporates the least structural information about a sequence. XLNet achieves an improvement in its correlation score by taking advantage of additional positional information when predicting a target token. Finally, BERT's next sentence prediction approach delivers the highest performance thanks to its structured handling of separate utterances.
While LM-based dialogue evaluators cannot yet replace human annotators, they add value when compared to word-overlap metrics like BLEU or ones that use word embeddings: they can serve as weak indicators of quality. Additionally, we have shown that they can perform better than competing approaches like the unreferenced component of RUBER.
Furthermore, the autoregressive language models, GPT2 and XLNet, demonstrate an excellent initial aptitude for conducting dialogues. They can provide alternative responses that are also coherent with the context of a discussion.
The LM-based method in this work considers dialogues as a series of paired-up utterances (question-answer pairs) rather than one whole sequence. As future work, we will investigate how to extend the procedure so that it is more adept at capturing long-term information over the entire conversation.

A Equations for Aggregating Scores on Dialogue Level
In this section, we list the different aggregation measures that we experimented with. The correlation coefficients between these aggregation scores and the human annotations are of low value, statistically insignificant (high p-values), or both.