Towards Best Experiment Design for Evaluating Dialogue System Output

To overcome the limitations of automated metrics (e.g. BLEU, METEOR) for evaluating dialogue systems, researchers typically use human judgments to provide convergent evidence. While it has been demonstrated that human judgments can suffer from the inconsistency of ratings, extant research has also found that the design of the evaluation task affects the consistency and quality of human judgments. We conduct a between-subjects study to understand the impact of four experiment conditions on human ratings of dialogue system output. In addition to discrete and continuous scale ratings, we also experiment with a novel application of Best-Worst scaling to dialogue evaluation. Through our systematic study with 40 crowdsourced workers in each task, we find that using continuous scales achieves more consistent ratings than Likert scale or ranking-based experiment design. Additionally, we find that factors such as time taken to complete the task and no prior experience of participating in similar studies of rating dialogue system output positively impact consistency and agreement amongst raters.


Introduction and Related Work
A tremendous amount of recent research has focused on approaches for generating responses for conversations in an open-domain setting (Radford et al., 2019; Xing et al., 2018; Wolf et al., 2019). An equally challenging task for natural language generation systems is evaluating the quality of the generated responses. Evaluation of generated output is typically conducted using a combination of crowdsourced human judgments and automated metrics adopted from machine translation and text summarization (Liu et al., 2016; Novikova et al., 2017). However, studies conducted by Liu et al. (2016) and Novikova et al. (2017) show that the automated metrics correlate poorly with human judgments. Despite their shortcomings, automated metrics like BLEU, ROUGE, and METEOR are used due to a lack of alternative metrics. This makes obtaining high-quality crowdsourced human judgments imperative. Previous research which employs crowdsourced judgments has focused on metrics including ease of answering, information flow and coherence (Li et al., 2016; Dziri et al., 2018), naturalness (Asghar et al., 2018), interestingness (Asghar et al., 2017; Santhanam and Shaikh, 2019), fluency or readability (Zhang et al., 2018), and engagement (Venkatesh et al., 2018). While experiment designs primarily use Likert scales, Belz and Kow (2010) argue that discrete scales, such as the Likert scale, can be unintuitive and that certain individuals may avoid extreme values in their judgments. Prior research has also shown that the use of continuous scales is more viable for language evaluation (Novikova et al., 2018; Belz and Kow, 2011). Such evidence calls for a careful study of how to obtain reliable and consistent human ratings for dialogue evaluation.
To address this research problem, we focus on a systematic comparison of four experimental conditions by incorporating continuous, relative and ranking scales for obtaining crowdsourced human judgments. In this initial study, we evaluate the use of two metrics: Readability and Coherence.
Our key findings are:
1. Use of Likert scales results in the lowest inter-rater consistency and agreement when compared to the other experiment conditions.
2. Use of continuous scales results in higher inter-rater consistency and agreement.
3. Raters who have no prior experience in evaluating dialogue system output have greater inter-rater consistency and agreement than do those who have previously participated in such rating tasks.
Our findings have the potential to help the research community in the design of their evaluation tasks to obtain higher quality human judgments for natural language generation output.

Data and Models
We used the Reddit conversation corpus to train our models. The Reddit conversation corpus, made available by Dziri et al. (2018), consists of data extracted from 95 top-ranked subreddits that discuss various topics such as sports, news, education and politics.
• ... (Bahdanau et al., 2014)
• HRED: Hierarchical Encoder-Decoder (Serban et al., 2016) which incorporates an utterance and intra-utterance layer to model context.
• THRED: Topic Augmented Hierarchical Encoder-Decoder (Dziri et al., 2018) which uses topic words along with a hierarchical encoder-decoder to produce a response.

Metrics
For this initial study, we focus on two metrics, readability and coherence. These metrics are among those essential to evaluate the quality of generated responses (Novikova et al., 2017; Dziri et al., 2019). We describe an automated method to compute each metric.
Readability or Fluency measures the linguistic quality of text and helps quantify the difficulty of understanding the text for a reader (Gatt and Krahmer, 2018; Novikova et al., 2017). We use the Flesch Reading Ease (FRE) score (Kincaid et al., 1975), which is computed from the number of words, syllables and sentences in the text. Higher readability scores indicate that an utterance is easier to read and comprehend.
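As an illustration of the readability metric, the following minimal sketch computes the standard Flesch Reading Ease formula, FRE = 206.835 - 1.015 (words / sentences) - 84.6 (syllables / words). The function names and the vowel-group syllable heuristic are assumptions made for this example; the paper does not state which tokenizer or syllable counter was used, so exact scores may differ from those reported.

```python
import re


def count_syllables(word: str) -> int:
    """Rough syllable estimate (an assumption, not the paper's method):
    count groups of consecutive vowels and drop a silent trailing 'e'."""
    word = word.lower()
    count = len(re.findall(r"[aeiouy]+", word))
    if word.endswith("e") and not word.endswith(("le", "ee")) and count > 1:
        count -= 1
    return max(count, 1)


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease: higher values mean easier text."""
    # Naive sentence and word splitting, sufficient for short utterances.
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


if __name__ == "__main__":
    print(flesch_reading_ease("The cat sat on the mat."))
```

Short responses made of one-syllable words score above 100 under this formula, which is expected: the scale is not capped at 100.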
Coherence measures the ability of the dialogue system to produce responses consistent with the topic of conversation (Venkatesh et al., 2018). To calculate coherence, we use the method proposed by Dziri et al. (2018). This metric computes the cosine similarity between the embedding vectors of the generated response and the target while accounting for dull and generic responses through a penalty factor.
To overcome the issue of dull and generic responses, Dziri et al. (2018) introduce a penalty factor:
$$P = 1 + \log \frac{2 + L'}{2 + L''} \qquad (1)$$
where $L'$ indicates the length of the response after dropping stop words and punctuation and $L''$ indicates the length of the non-dull parts of the response after dropping stop words. The penalized semantic similarity (SS) score is then calculated as:
$$\mathrm{SS}(utt_{i,j}, resp_i) = P \times \left(1 - \cos(utt_{i,j}, resp_i)\right) \qquad (2)$$
where $i$ represents the index of the dialogue in the dataset and $j$ denotes the index of the utterance in the conversation history.
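The penalized score can be sketched in a few lines. The sketch below assumes that the penalty has the form P = 1 + log((2 + L') / (2 + L'')) and that the score is P multiplied by (1 - cosine similarity); this form, the use of the natural logarithm, and all function names are assumptions that should be checked against Dziri et al. (2018). How the embedding vectors and the "non-dull" token counts are obtained is outside the scope of the example.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for empty or zero vectors
    return dot / (norm_u * norm_v)


def penalty(len_content: int, len_non_dull: int) -> float:
    """Assumed penalty form: grows as more of the response is dull.

    len_content  -- response length without stop words and punctuation (L')
    len_non_dull -- length of the non-dull part without stop words (L'')
    """
    return 1.0 + math.log((2 + len_content) / (2 + len_non_dull))


def penalized_semantic_similarity(utt_vec, resp_vec, len_content, len_non_dull):
    """Assumed score form: penalty times cosine distance (lower is closer)."""
    distance = 1.0 - cosine_similarity(utt_vec, resp_vec)
    return penalty(len_content, len_non_dull) * distance


if __name__ == "__main__":
    # Toy 2-d vectors standing in for utterance and response embeddings.
    print(penalized_semantic_similarity([1.0, 0.0], [0.0, 1.0], 4, 1))
```

Under these assumptions a response with no dull content has P = 1, so the score reduces to the plain cosine distance.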

Experiment Designs
In our study, we use three well-known question types: Likert Scale, Magnitude Estimation and Best-Worst Ranking. We chose to investigate these question types because they are commonly used across various language evaluation tasks (Belz and Kow, 2011; Asghar et al., 2018; Novikova et al., 2018; Kiritchenko and Mohammad, 2017). Using these three question types, we design four rating procedures that are explained below.
Likert Scale (LS): The Likert scale is typically used in experiments for crowdsourcing human evaluation of dialogue systems (Asghar et al., 2018; Lowe et al., 2017). In our experiment, we ask the raters to rate the generated responses on a 6-point scale, following Novikova et al. (2018), where 1 is the lowest and 6 is the highest on the metrics of readability and coherence.
Rank-Based Magnitude Estimation (RME): Prior research by Belz and Kow (2011) demonstrates through six separate experiments that continuous scales are more viable and offer distinct advantages over discrete scales in evaluation tasks. Recently, Novikova et al. (2018) adopted magnitude estimation by providing the rater with a standard value for a reference sentence to evaluate output from goal-oriented systems. Following Novikova et al. (2018), we also set the value of the standard (reference utterance) to 100, since the reference utterance was produced by humans and is considered the gold standard. The crowdsourced workers are asked to provide a score relative to 100 (from 0 to 999) for three system-generated outputs.
Biased Magnitude Estimation (BME): Our third experiment design is biased magnitude estimation (BME). The main difference between the RME and BME methods is that the standard value we provide for the reference utterance is not uniformly set to 100 for all examples, but is instead calculated by automated methods (explained in Section 3). Our motivation is to understand whether anchoring bias affects the ratings differently when judgments are made relative to a fixed value (100) or relative to a value calculated by automated means. Anchoring bias is the tendency to rely too heavily on one piece of information offered (the "anchor", in this case, the number 100) when making decisions (Kahneman, 2016).
Best-Worst Scaling (BWS): Our last experiment condition is best-worst scaling (BWS) in which raters are asked to rank the generated responses in order of best to worst on both metrics (readability and coherence). This approach has previously been used to estimate emotion intensity and has been demonstrated to produce high quality and consistent judgments from humans (Kiritchenko and Mohammad, 2017).
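One common way to turn best-to-worst rankings into per-system scores is the counting procedure associated with best-worst scaling: the fraction of times a system is ranked best minus the fraction of times it is ranked worst. The sketch below assumes this scoring; the paper does not state how the rankings were aggregated, and the function name and system labels are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def best_worst_scores(rankings: List[List[str]]) -> Dict[str, float]:
    """Score each system as (#times best - #times worst) / #appearances.

    rankings -- one list per judgment, ordered from best to worst.
    Scores fall in [-1, 1]; this aggregation is an assumption.
    """
    best, worst, seen = Counter(), Counter(), Counter()
    for ranking in rankings:
        if len(ranking) < 2:
            raise ValueError("each ranking needs at least two items")
        best[ranking[0]] += 1    # first position is the best response
        worst[ranking[-1]] += 1  # last position is the worst response
        seen.update(ranking)
    return {item: (best[item] - worst[item]) / seen[item] for item in seen}


if __name__ == "__main__":
    # Hypothetical judgments from three raters over three systems.
    judgments = [
        ["system_a", "system_b", "system_c"],
        ["system_a", "system_c", "system_b"],
        ["system_b", "system_a", "system_c"],
    ]
    print(best_worst_scores(judgments))
```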
Each task includes 50 randomly sampled conversations from the test set in our corpus along with generated responses from the three models and the ground truth (reference utterance). For each task, we collected ratings from 40 workers with Master qualifications through Amazon Mechanical Turk.

Experiment Results
We organize our findings along five main research questions (RQs) outlined in this section. In the following section, we report on statistical significance using two-way ANOVAs on the between-subject ratings across the four experiment conditions (Tables 1-7).
RQ1: What is the effect of experiment design on the reliability of human ratings? We use intra-class correlation (ICC) to measure the reliability across multiple raters (Shrout and Fleiss, 1979; Landis and Koch, 1977). To compare the scores obtained from the magnitude estimation experiments to the ratings from the task using discrete Likert scales, we normalize the magnitude estimation scores on a logarithmic scale as suggested by Bard et al. (1996). Table 1 reports the ICC scores on consistency (ICC-C) and agreement (ICC-A) for our four experiment tasks. We observe that the use of Magnitude Estimation with anchors (RME or BME) results in more reliable ratings than using the Likert Scale or Best-Worst ranking (BWS). This result is consistent with prior research by Novikova et al. (2018) and Belz and Kow (2011).
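To make the distinction between consistency and agreement concrete, the sketch below computes both from a two-way ANOVA decomposition of an items-by-raters matrix. It uses the standard single-rater forms, ICC-C = (MSR - MSE) / (MSR + (k - 1) MSE) and ICC-A = (MSR - MSE) / (MSR + (k - 1) MSE + (k / n)(MSC - MSE)); whether the paper reports single-rater or averaged measures is not stated, so that choice and the function name are assumptions.

```python
from typing import List, Tuple


def icc_two_way(ratings: List[List[float]]) -> Tuple[float, float]:
    """Return (ICC-C, ICC-A) for an n-items by k-raters matrix.

    Single-rater, two-way forms (an assumption about the variant used).
    Raises ZeroDivisionError if the ratings have no variance at all.
    """
    n = len(ratings)
    if n < 2:
        raise ValueError("need at least two rated items")
    k = len(ratings[0])
    if k < 2 or any(len(row) != k for row in ratings):
        raise ValueError("need a complete matrix with at least two raters")

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for items (rows), raters (columns) and residual.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    consistency = (ms_rows - ms_error) / (ms_rows + (k - 1) * ms_error)
    agreement = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + (k / n) * (ms_cols - ms_error)
    )
    return consistency, agreement


if __name__ == "__main__":
    # Two raters who order items identically but differ by a constant offset.
    print(icc_two_way([[1, 2], [2, 3], [3, 4]]))
```

In the toy example the two raters agree perfectly on ordering, so consistency is 1, while the constant offset between them lowers agreement; this is the difference the two columns of Table 1 capture.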
RQ2: Does time taken to complete the survey influence reliability of the rankings? To analyze RQ2, we calculated the total time spent by each participant from the start to the end of the experiment. We found that the BME task had the longest average completion time (43 minutes), followed by RME (42.8 minutes) and Likert scale (33 minutes); Best-Worst ranking had the shortest average completion time (32.5 minutes). We then test the hypothesis that raters who spent longer than average on the task would be more reliable in their ratings than those who completed it in less than average time. Table 2 reports the ICC scores for raters who spent more than the average time on the task, while Table 3 reports scores for raters who spent less than the average time. Surprisingly, we find that consistency and agreement among raters who spend less than the average time are higher than among those who spend more time for the Likert, BME and BWS experiment designs. When using the RME design, raters who spend more time have higher consistency and agreement.
RQ3: Does prior experience of evaluating dialogue system output or engaging with conversational agents affect reliability of rankings? We asked each rater two additional questions at the end of the task. The questions asked raters to indicate whether or not they had prior experience taking part in studies (a) to evaluate dialogue system output; and (b) to engage with a conversational agent. Tables 4 and 5 show how reliable the ratings from the participants are based on their prior experience of taking part in studies about evaluating conversational responses. We find that participants who have not taken part in prior studies are more consistent and have a higher agreement score than participants who have prior experience. These results are also supported by Tables 6 and 7, which show that participants with no prior experience of engaging with conversational agents are more consistent and reliable.
RQ4: ... calculated using automated methods (outlined in Section 3) with the human ratings in Table 8. Readability scores were computed using the Flesch Reading Ease (Kincaid et al., 1975) and coherence scores were computed based on the method proposed by Dziri et al. (2018). We observe that the automated metrics for Readability (Kincaid et al., 1975) and Semantic Similarity (Dziri et al., 2018) show low correlation with human ratings.
RQ5: Is there any correlation between ratings of readability and coherence for each of the four experiment conditions? To evaluate whether there is any correlation between the ratings obtained for readability and coherence through each of the four experimental designs, we report the Spearman correlation values in Table 9. We find that there is a high, statistically significant correlation between the human ratings of readability and coherence obtained through RME and BME. One likely factor affecting the correlation may be anchoring bias towards the fixed value of the standard utterance provided in RME (100) and the reference value provided in BME. We aim to investigate this further in future work.
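For readers who want to reproduce this kind of analysis, the Spearman correlation is the Pearson correlation computed on ranks, with tied values sharing their average rank. The following is a minimal standard-library sketch; the function names and the toy ratings are illustrative and not taken from the study data, and significance testing is omitted.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values receive the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0.0 or sd_y == 0.0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sd_x * sd_y)


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman rank correlation: Pearson correlation of the ranks."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    # Toy readability and coherence ratings for five responses.
    readability = [100, 120, 80, 150, 90]
    coherence = [95, 130, 70, 140, 100]
    print(spearman(readability, coherence))
```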

Conclusion and Future Work
In this paper, we present our work on designing a systematic experiment with four experiment conditions to evaluate the output of dialogue systems. Unlike prior work, where a similar study was conducted with output from goal-oriented systems (Novikova et al., 2018), our study focuses on evaluating output in open-domain situations. Consistent with prior findings, metrics calculated using automated methods (Dziri et al., 2019) were found to have a negative correlation with human judgments (cf. Table 8). This finding points to the need for more effective automated metrics. We find that the use of continuous scales to obtain crowdsourced ratings provides more consistent and reliable ratings than ratings obtained through Likert scales or Best-Worst scaling. This finding is consistent with prior work conducted by Novikova et al. (2018). Novel in our study was the testing of the Best-Worst scaling method to evaluate responses against one another. Although the Best-Worst scaling method has been shown to be effective in obtaining crowdsourced ratings of emotions (Kiritchenko and Mohammad, 2017), we did not find it to be effective in this study. We aim to investigate further whether this finding can be reproduced in a different experiment.
Further, we were able to identify the effects of time taken to complete the task on rating reliability. We find that workers who spent less than the average time on the task had higher consistency (for the Likert, BME and BWS experiment conditions) than did the workers who spent more than the average time. This finding is counter-intuitive: we expected that spending more time would positively impact inter-rater consistency. Our first step in analyzing the effects of time taken on reliability was to compare workers who spent more or less than the average time, which admittedly offers a limited perspective; an interesting next step would be to study these effects more thoroughly by taking into account the full distribution of the time-spent data.
We also find that lack of prior experience of evaluating open-domain dialogue system output results in more reliable ratings. One potential explanation for this could be that workers may have pre-conceived notions based on their past experience. One limitation of our current study is that although we had output from three separate models, we conducted the study using data from one corpus. Reproducing our findings across additional corpora, additional metrics and other experiment designs would help substantiate these findings further. An analysis of the interaction effects between independent variables such as time taken and prior experience would also help strengthen the findings of our study.
By using a larger sample size (n=40), we are able to make claims about statistical significance across experiment conditions. In future work, we plan to evaluate in depth the impact of cognitive biases such as anchoring and confirmation bias and how they affect consistency and reliability, and to test continuous scale ratings with no reference value.