Towards a Metric for Automated Conversational Dialogue System Evaluation and Improvement

We present “AutoJudge”, an automated evaluation method for conversational dialogue systems. The method works by first generating dialogues based on self-talk, i.e. a dialogue system talking to itself. Then, it uses human ratings on these dialogues to train an automated judgement model. Our experiments show that AutoJudge correlates well with the human ratings and can be used to automatically evaluate dialogue systems, even in deployed systems. In a second part, we attempt to apply AutoJudge to improve existing systems. This works well for re-ranking a set of candidate utterances. However, our experiments show that AutoJudge cannot be applied as a reward for reinforcement learning, although the metric can distinguish good from bad dialogues. We discuss potential reasons, but state here already that this is still an open question for further research.


Introduction
Conversational dialogue systems (also referred to as chatbots, social bots, or non-task-oriented dialogue systems) allow for a natural conversation between computers and humans. Research on these dialogue systems has recently reemerged due to the availability of large dialogue corpora (Serban et al., 2018) as well as the popularization of deep learning (Sordoni et al., 2015; Vinyals and Le, 2015; Serban et al., 2016b).
One major challenge in developing high-quality dialogue systems is the evaluation process. Ideally, an evaluation method should be automated, have a high correlation with human judgements, and be able to discriminate between different dialogue strategies. The most common techniques to evaluate conversational dialogue systems rely on crowdsourcing, where human judges are asked to rate the appropriateness (or quality) of a generated response given a context. Although this procedure makes it possible to discriminate between different strategies, it has several drawbacks: it is time and cost intensive, it has to be redone for every change in dialogue strategy, and the results cannot be used to improve the system.
On the other hand, automated evaluation is usually performed by applying word-overlap metrics borrowed from the machine translation or text summarization community, which have been shown to correlate poorly with human judgements on the utterance level (Liu et al., 2016).
Trained Metrics. Recently, the notion of trained metrics was introduced for conversational dialogue systems (Lowe et al., 2017). The main idea is that humans rate the generated response of a dialogue system in relation to a given context (i.e. the dialogue history). Based on these ratings, a regression model is trained which models the human judges. For this, the context, the candidate response, and the gold-standard response are used as input and the judgement is predicted. This approach correlates well with human judgements on the turn level as well as on the system level.
However, these metrics rely on a gold-standard and work on static contexts, which is problematic for two reasons. First, as the context is written by humans it does not reflect the behaviour of the dialogue system. Second, it cannot be used in deployed systems where no gold-standard is available. Dynamic context evaluation (Gandhe and Traum, 2016), on the other hand, usually requires human-computer interaction, which is costly, and puts an additional cognitive strain on the users if they are to rate live during the conversation (Schmitt and Ultes, 2015).
Contribution. In this work we propose to automatically generate the dialogues relying on self-talk, which is derived from AlphaGo self-play (Silver et al., 2016). Dialogues are generated by two instances of the same system conversing with each other. Then the automatically generated dialogues are rated by human judges. That is, the judges read the dialogues and rate them on the turn level. Based on these ratings, we train a regression model which learns to predict the ratings of the human judges. Our results show that this method, which we refer to as AutoJudge, achieves high correlation with human judgements. Thus, it can be applied to fully automatically assess the quality of a dialogue system without being dependent on gold-standard responses.
Applications. Since our approach is fully automatic and requires no humans in the loop, we want to go one step further and apply it to improve the dialogue system at hand. More precisely, we attempt to apply the metric in two different ways: (i) response ranking similar to (Shalyminov et al., 2018; Hancock et al., 2019), and (ii) reward for reinforcement learning. It turns out that only the re-ranking shows promising results, whereas the metric is not useful as a reward function. This is very surprising, since the trained metric correlates well with human judgements, and it can discriminate between good and bad utterances. Why this happens, and how it can be resolved, is an open research question, which we discuss towards the end of this paper.

Experimental Setup
Our experimental pipeline follows three phases. First, the data generation phase, where we let the dialogue systems generate dialogues automatically. Second, the data annotation phase, where we rely on crowdsourcing to rate the dialogues on the turn level. Third, the improvement phase, where we train an automated judgement model on the annotated data and apply this model to improve the dialogue system.

Dialogue Systems
For our experiments we relied on the following state-of-the-art dialogue systems (the training details are in Appendix A):
Seq2Seq. The Sequence-to-Sequence model as proposed by (Vinyals and Le, 2015) consists of an encoder and a decoder. Both modules are based on Long Short-Term Memory cells (LSTM) (Hochreiter and Schmidhuber, 1997), where the encoder consumes the last utterance and produces a hidden representation, which is passed as initial state to the decoder to condition the generation process.
HRED. The Hierarchical Recurrent Encoder-Decoder (HRED) model proposed by (Serban et al., 2016a) enhances the Seq2Seq model by a hierarchical encoding procedure. Here, the context turns are encoded by first encoding each turn separately and then by applying a recurrent encoder over the hidden states of the turns. The decoding procedure is conditioned on the hidden state produced by the context encoder.

VHRED. The Hierarchical Latent Variable Encoder-Decoder model (VHRED) (Serban et al., 2017a) enhances the aforementioned HRED model by introducing a stochastic latent variable at the utterance level. This stochastic variable aims to inject variability at the utterance level, which in turn increases the variety of responses a model generates.
MrRNN. The Multi-resolution Recurrent Neural Network (MrRNN) (Serban et al., 2017b) enhances the HRED model by introducing an abstraction layer. More precisely, the dialogue is modelled by processing the inputs and outputs at various levels of abstraction (e.g. at the level of meaning-bearing words and the usual word level).
DE. The Dual Encoder (DE) (Lowe et al., 2015) is a selection-based model, which differs from the generation-based approaches of the aforementioned models. The DE encodes both the context and a candidate response (using the same encoder as the VHRED model) and then classifies whether the candidate is a valid response to the given context.

Turn-Level Annotation
We apply self-talk to automatically generate dialogues. For this, we sample 100 different contexts randomly from a set of unseen contexts and let each dialogue system generate a dialogue of 10 turns starting from each context. For the annotation process, we use Amazon Mechanical Turk (AMT) and follow the procedure outlined by (Lowe et al., 2017), i.e. the judges rated the overall quality of each turn on a scale from 1 (low quality) to 5 (high quality). Each turn is annotated by three different judges. We required the AMT workers to be from an English-speaking country (USA, UK, Ireland or Australia) in order to ensure that they are native speakers, since the generated messages are highly colloquial and make heavy use of slang. For each annotation, we paid 15 cents, where we assumed that each annotation takes between 60 and 90 seconds. For the selection of the final turn label, we apply the MACE procedure (Hovy et al., 2013), which learns confidence scores for the annotators. Our final dataset consists of a total of 500 annotated dialogues, which amounts to 5000 annotated pairs of contexts and responses.
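The self-talk loop described above can be illustrated with a minimal sketch. The `respond` callable is a hypothetical stand-in for any of the dialogue systems (it is not part of the original setup), and `toy_system` only exists to make the example runnable.

```python
from typing import Callable, List


def self_talk(respond: Callable[[List[str]], str],
              context: List[str],
              num_turns: int = 10) -> List[str]:
    """Let one system talk to itself for num_turns turns, starting from context."""
    history = list(context)  # copy so the sampled seed context is not mutated
    generated = []
    for _ in range(num_turns):
        utterance = respond(history)  # the same system produces every turn
        history.append(utterance)     # the new turn becomes part of the context
        generated.append(utterance)
    return generated


# Hypothetical toy system: it only reports how many turns it has seen so far.
def toy_system(history: List[str]) -> str:
    return f"turn {len(history)}"


dialogue = self_talk(toy_system, ["hello", "hi there"])
```

With 100 seed contexts and five systems, calling such a loop once per (context, system) pair would yield the 500 dialogues of 10 turns each mentioned above.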

AutoJudge
Similarly to the ADEM procedure proposed by (Lowe et al., 2017), we train a regression model on the annotated data. For this, we use the pre-trained context and response encoder from the VHRED model. Unlike ADEM, our dialogues are generated automatically; thus, we do not have access to a gold-standard response. For this reason, we use the following scoring function: $\mathrm{score}(c, r) = (c^\top M r - \alpha)/\beta$, where $M \in \mathbb{R}^{d \times d}$ is a learned similarity matrix, $\alpha, \beta$ are scalar constants, and $c, r$ are the context and response embeddings, respectively. The model is optimized to minimize the mean squared error between the predicted ratings and the human judgements.
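As an illustration of the scoring function, the following sketch evaluates the bilinear form on plain Python lists. The function name and the toy embeddings are assumptions made for the example; the default values of alpha and beta are the ones reported in Appendix A. In the actual model, c and r would come from the pre-trained VHRED encoders and M would be learned.

```python
from typing import Sequence


def autojudge_score(c: Sequence[float],
                    r: Sequence[float],
                    M: Sequence[Sequence[float]],
                    alpha: float = 0.01,
                    beta: float = 32.0) -> float:
    """Compute (c^T M r - alpha) / beta for one context/response pair."""
    # Matrix-vector product M r.
    Mr = [sum(m_ij * r_j for m_ij, r_j in zip(row, r)) for row in M]
    # Inner product c^T (M r).
    bilinear = sum(c_i * v_i for c_i, v_i in zip(c, Mr))
    return (bilinear - alpha) / beta


# Toy 2-dimensional embeddings with M set to the identity (illustrative only).
identity = [[1.0, 0.0], [0.0, 1.0]]
score = autojudge_score([1.0, 2.0], [3.0, 4.0], identity)
```

Here the bilinear term reduces to the dot product 11, so the score is (11 - 0.01) / 32.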

Improving Dialogue Systems
Since AutoJudge is fully automated, we apply it to improve the existing dialogue systems. For this, we implemented the following two applications: as a reward for reinforcement learning (RL), and for re-ranking candidate utterances.
Re-Ranking. Given a list of responses from the five aforementioned dialogue systems for a given context, AutoJudge re-ranks them by their predicted score. In our experiments, we use the dialogue systems that we trained for the self-talk experiment. Thus, the re-ranker serves as a meta-selection module.
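A minimal sketch of such a meta-selection step is shown below. The scorer `toy_score` is a hypothetical word-overlap stand-in used only to make the example runnable; in the setting described above, the scorer would be the trained AutoJudge model.

```python
from typing import Callable, List, Tuple

Candidate = Tuple[str, str]  # (system name, response text)


def rerank(context: str,
           candidates: List[Candidate],
           score_fn: Callable[[str, str], float]) -> List[Candidate]:
    """Sort candidate responses by predicted score, best first."""
    return sorted(candidates,
                  key=lambda cand: score_fn(context, cand[1]),
                  reverse=True)


# Hypothetical scorer: number of distinct words shared with the context.
def toy_score(context: str, response: str) -> float:
    return float(len(set(context.split()) & set(response.split())))


candidates = [
    ("HRED", "i do not know"),
    ("Seq2Seq", "yes i like pizza a lot"),
    ("DE", "ok"),
]
ranked = rerank("do you like pizza", candidates, toy_score)
best_system, best_response = ranked[0]
```

The top-ranked candidate is then returned as the system response for that turn.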
Reinforcement Learning Reward. We apply the predicted ratings as reward in the RL framework. For this, we apply the Policy Gradient formulation, as done in (Li et al., 2016), which is defined as follows: $J_{RL}(\theta) = \sum_i \log p(r_i \mid c_i) \times \sum_i R(r_i, c_i)$, where $r_i$ and $c_i$ are the response and context in the $i$-th turn, $R(r_i, c_i)$ is the reward predicted by AutoJudge, and $\sum_i \log p(r_i \mid c_i)$ is the reconstruction error.
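To make the objective concrete, the sketch below evaluates it for a single dialogue, reading the formula as the product of the summed log-likelihood and the summed reward. This reading, the function name, and the toy numbers are assumptions; in training, the probabilities would come from the policy and the rewards from AutoJudge, and gradients would be taken with respect to the policy parameters.

```python
import math
from typing import Sequence


def policy_gradient_objective(response_probs: Sequence[float],
                              rewards: Sequence[float]) -> float:
    """Return (sum_i log p(r_i | c_i)) * (sum_i R(r_i, c_i)) for one dialogue."""
    if len(response_probs) != len(rewards):
        raise ValueError("need exactly one reward per turn")
    log_likelihood = sum(math.log(p) for p in response_probs)
    total_reward = sum(rewards)
    return log_likelihood * total_reward


# Two turns with toy probabilities and toy rewards (illustrative values only).
objective = policy_gradient_objective([0.5, 0.25], [1.0, 3.0])
```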

Results and Discussion
In our experiments we use the Twitter Dialogue Corpus (Ritter et al., 2011), with the tweet IDs provided by (Serban et al., 2017a), which can be found here: www.iulianserban.com/Files/TweetIDs.zip. The Twitter Dialogue Corpus provides social interactions, which we believe to be a good basis for annotation via crowdsourcing.
Data Aggregation. The turn-level ratings provide us with 5000 annotated pairs of context and responses. The distribution over the labels is balanced (i.e. each class accounts for between 19% and 21% of the cases). However, the agreement among the human judges is rather low: the median pairwise Spearman correlation between two judges is only 0.403. Furthermore, the MACE procedure reports a confidence score (between 0 and 1) for each judge, which is used as the basis for selecting the final label. The average confidence is only 0.15. We assume that these problems stem from the high degree of subjectivity of the task.
AutoJudge. We train AutoJudge using k-fold cross validation. There are two ways of splitting the data into folds that ensure that all turns of the same dialogue are in the same fold. First, we group the 100 contexts into 10 folds; thus, each fold consists of 50 dialogues (i.e. 10 contexts times the number of dialogue systems). This is denoted as CONVO SPLIT. The second option is to split the data according to the system which created the conversation, which evaluates the performance of AutoJudge in rating dialogues of unseen dialogue systems. We denote this as SYSTEM SPLIT. In Table 1, we report the average Pearson correlation, Spearman's rho and mean absolute error (MAE) over all folds for the conversation split and the system split. With moderate correlations of 0.573 on the dialogue level, we get results which are comparable to (Lowe et al., 2017), where ADEM achieves a Pearson correlation of 0.436. Note that we cannot directly compare our results to BLEU score and ADEM, since these base their predictions on gold standards, which we do not have in our setting. An interesting result is the SYSTEM SPLIT: our approach is able to maintain a high correlation (0.544) with the ratings of a dialogue system when the data of that system is removed from the training set, which is not the case in (Lowe et al., 2017), where the correlation for a different system dropped significantly.
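The two fold constructions can be sketched as follows. The record layout and the modulo-based assignment of contexts to folds are assumptions made for illustration; the text only specifies that 100 contexts are grouped into 10 folds and that the alternative split is by system.

```python
from collections import defaultdict
from typing import Dict, List

SYSTEMS = ["Seq2Seq", "HRED", "VHRED", "MrRNN", "DE"]

# One record per dialogue: 100 seed contexts times 5 systems = 500 dialogues.
dialogues = [{"context_id": ctx, "system": name}
             for ctx in range(100) for name in SYSTEMS]


def convo_split(records: List[dict], num_folds: int = 10) -> Dict[int, List[dict]]:
    """Group dialogues by seed context so one context never spans two folds."""
    folds = defaultdict(list)
    for rec in records:
        folds[rec["context_id"] % num_folds].append(rec)  # assumed assignment rule
    return dict(folds)


def system_split(records: List[dict]) -> Dict[str, List[dict]]:
    """One fold per dialogue system, so each test fold is an unseen system."""
    folds = defaultdict(list)
    for rec in records:
        folds[rec["system"]].append(rec)
    return dict(folds)


convo_folds = convo_split(dialogues)
system_folds = system_split(dialogues)
```

Because folds are built from whole dialogues, all turns of a dialogue end up in the same fold in both schemes.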
Answer Selection. In order to evaluate the improvements achieved by the re-ranking method, we sample a disjoint set of 100 new contexts and apply self-talk to generate conversations. Then, we use AMT to let humans judge the automatically generated conversations on the dialogue level (i.e. a rating for the entire dialogue as opposed to turn-based ratings). We compare the performance of the five base dialogue systems to the performance of the re-ranking strategy. Table 2 shows the average scores for each dialogue system. Our results show that the re-ranking approach works very well. It raises the score to 3.47, which is 0.16 points higher than the best base-system (i.e. SEQ2SEQ).

Conclusion
Our results show that AutoJudge correlates well with human judgements and that it is useful for measuring the progress of a dialogue system, as it is able to discriminate among different strategies. Furthermore, it generalizes well to unseen strategies for the same domain. Since AutoJudge is independent of a gold standard, it can be applied to deployed systems where gold standards are not available. Finally, it shows promising results when applied as an answer selection module. As a next step, we intend to apply AutoJudge to human-computer dialogues to measure the viability of AutoJudge in a real-world setting.
In this work we tried to use AutoJudge as a reward for reinforcement learning, which resulted in suboptimal dialogues. The main reason seems to be that AutoJudge cannot properly handle the bad utterances that are generated during the initial phase of reinforcement learning. This is surprising, since AutoJudge is able to distinguish good and bad utterances of fully-trained systems. This seems to indicate that there are different types of "bad" utterances, and we need to adapt the training mechanism of AutoJudge if we want to apply it not only to evaluation, but also to improving dialogue systems. Our results indicate that trained metrics suffer from instabilities, which might be caused by the size of the dataset.
One major issue is that it is not clear which aspects AutoJudge captures. Although the correlation between the human judgements and the outputs of AutoJudge is high, we cannot make any statement about which aspects of the context or the response are relevant for the predicted rating. This is a fundamental problem with the evaluation of conversational dialogue systems, as there is no clear definition of "adequate" responses. Thus, an important direction for future work is the investigation of the definition of "adequacy" for conversational dialogue systems.
We conjecture that this might apply also to other automated metrics; thus, this is an important research question that needs to be addressed if we want to understand how to better train and optimize dialogue systems.

Table 3: Randomly sampled output. The conversation is sampled at random and AutoJudge rates each turn.

A Training Details
Model Training. For all models, we used a bidirectional LSTM to encode the turns, and a unidirectional LSTM for both the context encoder and decoder. We set the number of units for the LSTMs to 500, 1000, and 1000 for the turn encoder, context encoder and decoder, respectively. We use the pre-trained 300-dimensional FastText embeddings (Mikolov et al., 2018), which we refine during training. In order to avoid overly large vocabularies, we limit the vocabulary size to 20k distinct tokens. The generative models are trained to minimize the reconstruction error. For the VHRED and MrRNN, we refer to the original papers for the loss function formulation. The Dual Encoder is trained to minimize a contrastive loss function $\log \sigma(c^\top r_{True}) + \sum_{n \in N} \log \sigma(-c^\top r_n)$, where $c$ is the context encoding, $r_{True}$ is the correct response encoding and $N$ is a set of negative samples. For each training sample we sampled 10 negative examples uniformly at random from the training set. All models are optimized using the Adam optimizer (Kingma and Ba, 2014), with a learning rate of 0.001 and a batch size of 80.
AutoJudge Training. We trained AutoJudge using the pre-trained VHRED model to encode the context and the response. During training, only the matrix M is optimized. We also experimented with non-linear transformations of these encodings, which did not yield any improvements. Similar to (Lowe et al., 2017), we use α = 0.01 and β = 32. AutoJudge is optimized using the Adam optimizer (Kingma and Ba, 2014), with a learning rate of 0.001 and a batch size of 512.
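The Dual Encoder expression can be evaluated for a single training sample as in the sketch below. The function names and toy encodings are assumptions. The code returns the expression exactly as written in the text; both terms are log-probabilities, so a minimization setup would typically use its negative, but the sign convention used in training is not stated here and is left open.

```python
import math
from typing import Sequence


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def dual_encoder_objective(c: Sequence[float],
                           r_true: Sequence[float],
                           negatives: Sequence[Sequence[float]]) -> float:
    """log sigma(c^T r_true) + sum over negatives n of log sigma(-c^T r_n)."""
    positive_term = log_sigmoid(dot(c, r_true))
    negative_term = sum(log_sigmoid(-dot(c, r_n)) for r_n in negatives)
    return positive_term + negative_term


# Toy encodings: one correct response and two sampled negatives.
value = dual_encoder_objective([1.0, 0.0], [2.0, 0.0], [[-1.0, 0.0], [0.0, 5.0]])
```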

B Reinforcement Learning
For reinforcement learning, we use the pre-trained HRED system as our initial policy. We apply Policy Gradient as described above. We experimented with various episode batch sizes (1, 10, 100, 1000), i.e. we sample n episodes at once to reduce variance. However, this had no impact on the performance. We also experimented with different formulations, i.e. using Advantage Actor-Critic in order to reduce the variance.
In Table 1, we show the rolling average return over the course of 100 episodes. We used a batch size of 100 and the standard Policy Gradient formulation. The reward oscillates, which is due to finding new local maxima. The maximal observed reward is 37 after 80 episodes. However, the generated dialogues are all empty, i.e. the dialogue system always returns the "end-of-sequence" token right away.
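For reference, a rolling average over episode returns can be computed as in the sketch below. The window size is not given in the text, so the `window` parameter and the toy returns are assumptions.

```python
from typing import List, Sequence


def rolling_average(returns: Sequence[float], window: int) -> List[float]:
    """Average of the last `window` returns at each episode (shorter at the start)."""
    averages = []
    running = 0.0
    for i, value in enumerate(returns):
        running += value
        if i >= window:
            running -= returns[i - window]  # drop the return that left the window
        averages.append(running / min(i + 1, window))
    return averages


smoothed = rolling_average([1.0, 2.0, 3.0, 4.0], window=2)
```

Smoothing of this kind makes oscillations in the per-episode return easier to read, but it does not reveal degenerate policies such as the empty dialogues noted above, which is why the generated text should be inspected alongside the return.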