RankME: Reliable Human Ratings for Natural Language Generation

Human evaluation for natural language generation (NLG) often suffers from inconsistent user ratings. While previous research tends to attribute this problem to individual user preferences, we show that the quality of human judgements can also be improved by experimental design. We present a novel rank-based magnitude estimation method (RankME), which combines the use of continuous scales and relative assessments. We show that RankME significantly improves the reliability and consistency of human ratings compared to traditional evaluation methods. In addition, we show that it is possible to evaluate NLG systems according to multiple, distinct criteria, which is important for error analysis. Finally, we demonstrate that RankME, in combination with Bayesian estimation of system quality, is a cost-effective alternative for ranking multiple NLG systems.


Introduction
Human judgement is the primary evaluation criterion for language generation tasks (Gkatzia and Mahamood, 2015). However, limited effort has been made to improve the reliability of these subjective ratings (Gatt and Krahmer, 2017). In this research, we systematically compare and analyse a wide range of alternative experimental designs for eliciting intrinsic user judgements for the task of comparing multiple systems. We draw upon previous studies in language generation, e.g. Kow, 2010, 2011;Siddharthan and Katsos, 2012), as well as in the related field of machine translation (MT), e.g. (Bojar et al., 2016(Bojar et al., , 2017. In particular, we investigate the following challenges: Distinct criteria: Traditionally, NLG outputs are evaluated according to different criteria, such as naturalness and informativeness (Gatt and Krahmer, 2017). Naturalness, also known as fluency or readability, targets the linguistic competence of the text. Informativeness, otherwise known as accuracy or adequacy, targets the relevance and correctness of the output relative to the input specification. Ideally, we want to measure outputs of NLG systems with respect to these distinct criteria, especially for error analysis. For instance, one system may produce syntactically fluent output but misses important information, while another system, although being less fluent, may generate output that covers the meaning perfectly. Nevertheless, human judges often fail to distinguish between these different aspects, which results in highly correlated scores, e.g. (Novikova et al., 2017a). This is one of the reasons why some more recent research adds a general, overall quality criterion (Wen et al., 2015a,b;Manishina et al., 2016;Novikova et al., 2016Novikova et al., , 2017a, or even uses only that (Sharma et al., 2016). In the following, we show that discriminative ratings for different aspects can still be obtained, using distinctive task design. Consistency: Previous research has identified a high degree of inconsistency in human judgements of NLG outputs, where ratings often differ significantly (p < 0.001) for the same utterance (Walker et al., 2007). While this might be attributed to individual preferences, e.g. (Walker et al., 2007;Dethlefs et al., 2014), we also show that consistency (as measured by inter-annotator agreement) can be improved by different experimental setups, e.g. the use of continuous scales instead of discrete ones. Inconsistent user ratings are problematic in many ways, e.g. when developing metrics for automatic evaluation (Dušek et al., 2017;Novikova et al., 2017a). Relative vs. absolute assessment. Intrinsic human evaluation methods are typically designed to assess the quality of a system. However, they are frequently used to compare the quality of different NLG systems, which is not necessarily appropriate.
In the following, we show that relative assessment methods produce more consistent and more discriminative human ratings than direct assessment methods.
In order to investigate these challenges, we compare several state-of-the-art NLG systems, which are evaluated by human crowd workers using a range of evaluation setups. We show that our newly introduced method, called rank-based magnitude estimation (RankME), outperforms traditional evaluation methods. It combines advances suggested by previous research, such as continuous scales (Belz and Kow, 2011), magnitude estimation  and relative assessment (Callison-Burch et al., 2007). All code and data, as well as a more detailed description of the study setup are publicly available at: https://github.com/jeknov/RankME

Experimental Setup
We were able to obtain outputs of 3 systems from the recent E2E NLG challenge (Novikova et al., 2017b): 1 the Sheffield NLP system (Chen et al., 2018) and the Slug2Slug system (Juraska et al., 2018), as well as the outputs of the baseline TGen system (Dušek and Jurčíček, 2016). We chose these systems in order to assess whether our methods can discriminate between outputs of different quality: Automatic metric scores, including BLEU, ME-TEOR, etc., indicate that the Slug2Slug and TGen systems show similar performance while Sheffield's is further apart. 1 All three systems are based on the sequenceto-sequence (seq2seq) architecture with attention (Bahdanau et al., 2015). Sheffield NLP and TGen both use this basic architecture with LSTM recurrent cells (Hochreiter and Schmidhuber, 1997) and a beam search, TGen further adds a reranker to penalize semantically invalid outputs. Slug2Slug is an ensemble of three seq2seq models with LSTM recurrent decoders. Two of them use LSTM recurrent encoders and one uses a convolutional encoder. A reranker checking for semantic validity selects among the outputs of all three models.
We use the first one hundred outputs for each system, and we collect human ratings from three independent crowd workers for each output using the CrowdFlower platform. We use three different methods to collect human evaluation data: 6point Likert scales, plain magnitude estimation  (plain ME), and rank-based magnitude estimation (RankME). In a magnitude estimation (ME) task (Bard et al., 1996), subjects provide a relative rating of an experimental sentence to a reference sentence, which is associated with a pre-set/fixed number. If the target sentence appears twice as good as the reference sentence, for instance, subjects are to multiply the reference score by two; if it appears half as good, they should divide it in half, etc. Note that ME implies the use of continuous scales, i.e. rating scales without numerical labels, similar to the visual analogue scales used by Belz and Kow (2011) or direct assessment scales of (Graham et al., 2013;Bojar et al., 2017), however, without given end-points. Siddharthan and Katsos (2012) have previously used ME for evaluating readability of automatically generated texts. RankME extends this idea by asking subjects to provide a relative ranking of all target sentences. Table 1 provides a summary of methods and scales, and indicates whether relative ranking or direct assessment was used.

Judgements of Multiple Criteria
In our experiments, we collect ratings on the following criteria: • Informativeness (= adequacy): Does the utterance provide all the useful information from the meaning representation?
• Naturalness (= fluency): Could the utterance have been produced by a native speaker?
• Quality: How do you judge the overall quality of the utterance in terms of its grammatical correctness, fluency, adequacy and other important factors?
In order to investigate whether judgements of these criteria are correlated, we compare two experimental setups: In Setup 1, crowd workers are shown the input meaning representation (MR) and the corresponding output of one of the NLG systems and are asked to evaluate the output with respect to all three aspects in one task. In Setup 2, these aspects are assessed separately, in individual tasks. Furthermore, when crowd workers are asked to assess naturalness, the MR is not shown to them since it is not relevant for the task. Both setups utilise all three data collection methods -Likert scales, plain ME and RankME.
The results in Table 2 show that scores are highly correlated for Setup 1. This is in line with previous research in MT (Callison-Burch et al., 2007;Koehn, 2010). Separate collection (Setup 2), however, decreases correlation between naturalness and quality, as well as naturalness and informativeness to very low levels, especially when using ME methods. Nevertheless, informativeness and quality are still highly correlated. We assume that this is due to the fact that raters see the MR in both cases.
To obtain more insight into informativeness ratings, we asked crowd workers to further distinguish informativeness in terms of added and missed information with respect to the original MR. Crowd workers were asked to select a checkbox for added information if the output contained information not present in the given MR, or a checkbox for missed information if the output missed some information from the MR. The results of Chi-squared test show that distributions of missed and added information are significantly different (p < 0.01), i.e. systems add or delete information at different rates. Again, this information is valuable for error analysis. In addition, results in Table 4 show that assessing the amount of missed information indeed produces a different overall system ranking to added information. As such, it is worth considering missed information as a separate criterion for evaluation. This can also be approximated automatically, as demonstrated by Wiseman et al. (2017).

Consistency and Use of Scales
To assess consistency in human ratings, we calculate the intra-class correlation coefficient (ICC), which measures inter-observer reliability for more than two raters (Landis and Koch, 1977). In our experiments, we compare discrete Likert scales with continuous scales implemented via ME with respect to the resulting reliability of collected human ratings. The results in Table 3 show that the use of ME significantly increases ICC levels for naturalness and quality. This effect is especially pronounced for Setup 2 where ratings are collected separately. Both plain ME and RankME methods show a significant increase in ICC, with the RankME method showing the highest ICC results. This difference is most apparent for naturalness, where RankME shows an ICC of 0.42 compared to plain ME's 0.27. For informativeness, Likert scales already provide satisfactory agreement.
In previous research, discrete, ordinal Likert scales are the dominant method of human evaluation for NLG, although they may produce results where statistical significance is overestimated (Gatt and Krahmer, 2017). Recent studies show that continuous scales allow subjects to give more nuanced judgements (Belz and Kow, 2011;Graham et al., 2013;Bojar et al., 2017). Moreover, raters were found to strongly prefer continuous scales over discrete ones (Belz and Kow, 2011). In addition to this previous work, our results also show that continuous scales significantly improve reliability of human ratings when implemented via ME.

Ranking vs Direct Assessment
Most data collection methods for evaluation, including Likert and plain ME, are designed to directly assess the quality of a system. However, these methods are almost always used to compare multiple systems relative to each other. Recently, the NLP evaluation literature has started to address this issue, mostly using binary comparisons, for example between the outputs of two MT systems (Dras, 2015;Bojar et al., 2016). In our experiments, Likert and plain ME are direct assessment (DA) methods, while RankME is a relative ranking (RR)-based method (see also Table 1). In order to directly compare DA and RR, we generated overall system rankings based on our different methods, using pairwise bootstrap test at 95% confidence level (Koehn, 2004) to establish statistically significant differences.
The results in Table 4 show that both plain ME and RankME methods produce similar rankings of NLG systems, which is in line with previous research in MT (Bojar et al., 2016). It is also apparent that ME methods, by using a continuous scale, provide more distinctive overall rankings than Likert scales. For naturalness scores, no method results in clear system ratings, which possibly reflects in the low ICC of this criterion (cf. Table 3). RankME is the only method to provide a clear ranking with respect to overall utterance quality. However, its ranking of informativeness is less clear than that of plain ME, which might be due to the different results for missed and added information (see Sec. 4). In addition, the results in Table 3 show that RR, in combination with Setup 2, results in more consistent ratings than DA.  Table 2: Spearman correlation between ratings of naturalness and quality, collected using two different setups and three data collection methods -Likert, plain ME and RankME. Here, "*" denotes p < 0.05.

Relative comparisons of many outputs
While there are clear advantages to relative rankbased assessment, the amount of data needed for this approach grows quadratically with the number of systems to compare, which is problematic with larger numbers of systems, e.g. in a shared task challenge. Data-efficient ranking algorithms, such as TrueSkill (Herbrich et al., 2006), are therefore applied by recent MT evaluation studies (Sakaguchi et al., 2014;Bojar et al., 2016) to produce overall system rankings based on a sample of binary comparisons. However, TrueSkill has not previously been used for evaluating NLG systems. TrueSkill produces system rankings by gradually updating a Bayesian estimate of each system's capability according to the "surprisal" of pairwise comparisons of individual system outputs. This way, fewer direct comparisons between systems are needed to establish their overall ranking. We computed system rankings using TrueSkill over comparisons collected via RankME and were able to show that it produces exactly the same system rankings for all three criteria as using RankME directly (see Table 4), despite the fact that the comparisons are only used in a "win-loss-tie" fashion. This shows that RankME can be used with TrueSkill to produce consistent rankings of a larger number of systems.

Conclusion and Discussion
In this paper, we demonstrate that the experimental design has a significant impact on the reliability as well as the outcomes of human evaluation studies for natural language generation. We first show that correlation effects between different evaluation criteria can be minimised by eliciting them separately. Furthermore, we introduce RankME, which combines relative rankings and magnitude estimation (with continuous scales), and demonstrate that this method results in better agreement amongst raters and more discriminative results. Finally, our results suggest that TrueSkill is a cost-effective alternative for producing overall relative rankings of multiple systems. This framework has the potential to not only significantly influence how NLG evaluation studies are run, but also produce more reliable data for further processing, e.g. for developing more accurate automatic evaluation metrics, which we are currently lacking, e.g. (Novikova et al., 2017a). In current work, we test RankME with a wider range of systems (under submission). We also plan to investigate how this method transfers to related tasks, such as evaluating open-domain dialogue responses, e.g. (Lowe et al., 2017). In addition, we aim to investigate additional NLG evaluation methods, such as extrinsic task contributions, e.g. Gkatzia et al., 2016).