Item Response Theory for Efficient Human Evaluation of Chatbots

Conversational agent quality is currently assessed using human evaluation, and often requires an exorbitant number of comparisons to achieve statistical significance. In this paper, we introduce Item Response Theory (IRT) for chatbot evaluation, using a paired comparison in which annotators judge which system responds better to the next turn of a conversation. IRT is widely used in educational testing for simultaneously assessing the ability of test takers and the quality of test questions. It is similarly well suited for chatbot evaluation since it allows the assessment of both models and the prompts used to evaluate them. We use IRT to efficiently assess chatbots, and show that different examples from the evaluation set are better suited for comparing high-quality (nearer to human performance) than low-quality systems. Finally, we use IRT to reduce the number of evaluation examples assessed by human annotators while retaining discriminative power.


Introduction
One of the main problems in conversational dialog modeling is evaluation. Unlike in machine translation and task-driven dialog, automated metrics for non-task-driven open-domain generative conversational models (chatbots) seem not to correlate well with human judgments (Liu et al., 2016; Tao et al., 2017; Lowe et al., 2017). While the creation of new automatic metrics is an extremely active area of research (Liu et al., 2016; Tao et al., 2017; Lowe et al., 2017; Novikova et al., 2017; Sugiyama et al., 2019), human annotations are currently the gold standard for assessing model improvements. Prior work mainly uses straightforward approaches, such as a two-sided t-test or binomial tests (e.g., Li et al., 2015; Asghar et al., 2017; Ghazvininejad et al.; Li et al., 2019b), or a pairwise bootstrap test (e.g. Baheti et al. (2018)). These methods do not assess or incorporate the effectiveness of prompts (conversational chunks used for evaluation). Given that human evaluation is necessary, it is desirable to discriminate the performance of two different systems with minimal cost.
[Table 1: ... is 1 if annotator i rated system A better and 0 otherwise, and similarly for system B. "-" indicates a tie vote.]
In this paper, we present the use of Item Response Theory (IRT) (Lord et al., 1968) to compare chatbot models using a head-to-head paired experimental (A/B test) design (e.g. Table 1), which allows for statistical significance testing and item importance identification. IRT is traditionally used to assess student "ability" based on their answers ('responses') to test questions ('items') and, simultaneously, to determine how informative each question is. Throughout this paper we use the analogy of student ∼ A/B chatbot comparison and question ∼ prompt. We apply IRT to assess chatbot model performance based on human evaluations of chatbot responses to prompts, while simultaneously assessing how informative each prompt is.
IRT is a latent variable Bayesian model, with relative chatbot model quality (or student ability) being latent variables that probabilistically produce observable responses (one chatbot response to a prompt being judged as better than another, or a student answering a question correctly or incorrectly). IRT is widely used in psychometric studies (Embretson and Reise, 2013), and for paired comparison in psychological studies (Maydeu-Olivares and Brown, 2010). However, it is almost entirely ignored in natural language processing (NLP), with the exception of Hopkins and May (2013); Lalor et al. (2016); Otani et al. (2016); Lalor et al. (2019); Dras (2015).
Recent work has criticized the statistical methodology used in NLP and called for use of better statistical methods (Dror et al., 2018). Here, we present IRT as a powerful method for statistical assessment of model improvements. IRT not only assesses the relative quality between two systems, but also assesses the usefulness of a prompt in comparing systems. We show that IRT can filter and choose a subset of prompts from the evaluation set efficiently, i.e. with little loss in statistical power (Figure 2), and that IRT finds different prompts to be useful for assessing high quality vs. low quality chatbots.
Our core contribution is showing how Item Response Theory (IRT) can be used for open-domain social conversational agent (chatbot) comparison. In particular, we showcase the use of IRT in comparing multiple models for neural conversational agents. Finally, we show the utility of IRT for reducing the data collection required to evaluate chatbots by filtering evaluation set prompts. To our knowledge, this is the first work to apply IRT to chatbot evaluation and to use IRT for prompt selection in the evaluation of NLP systems.

Related Work
The structure of our chatbot evaluation is a comparison of two chatbots' responses to each prompt. This form of head-to-head pairwise block (multiple evaluations shown to one annotator) comparison dates back at least to Thurstone (1927). Subsequently, the Bradley-Terry (BT) model has become the most common model for pairwise block comparison experiments (Bradley and Terry, 1952). Dras (2015) describes further extensions and application of the BT model to machine translation. Extended BT models can correct for dependent categorical object covariates (correlated examples) as well as subject covariates (annotator ratings) (Cattelan, 2012). As Dras (2015) points out, the BT model and IRT are similar in formulation, but IRT additionally estimates the difficulty of each item using a latent variable Bayesian model. Fixed effect BT models (Borenstein et al., 2010) or bootstrapping (Koehn, 2012) could be used to compare chatbots, but IRT's ability to assess prompts is more attractive for this task where every annotation has a non-trivial cost.
An alternative straightforward approach to assess usefulness (validity) of a prompt is item-total correlation (ITC; Guilford (1953)). However, ITC does not take the student's ability into account. In general, IRT is preferred over ITC due to the more expressive formulation. ITC is mostly used for survey analysis instead of testing. However, as a sanity check, we find that indeed prompts extremely low in discriminative power (according to IRT) also have a low item-total correlation.
There is surprisingly little work on improving statistical significance testing or prompt selection in chatbot evaluation. While this is less true for machine translation, only two prior works have used IRT for model assessment (Hopkins and May, 2013; Otani et al., 2016). Our work applies IRT in a similar fashion to Otani et al. (2016), but to chatbot evaluation instead of machine translation system evaluation. We differ from Hopkins and May (2013) and Otani et al. (2016) as follows: 1. We do pairwise comparison instead of requiring baselines, which allows for improved prompt selection as models improve. Their method is focused on WMT (batch/competition) settings, whereas our work focuses on perpetual evaluation. 2. We aggregate annotators, which creates much more stable predictions: their graded mean is 1-baseline, 2-tie, 3-win, whereas ours ranges over [-3, 3]. 3. We explicitly assume independence of prompts and account for their correlation and thus do not overstate significance. 4. We use IRT to reduce the total number of comparisons; Otani et al. (2016) suggest this for future work.
IRT has also been applied in NLP for dataset filtering (Lalor et al., 2016). Lalor et al. (2019) use IRT to efficiently subsample training data based on difficulty. We differ from Lalor et al. (2019) on prompt selection: 1. We select individual prompts based on evaluations using the discriminative ability of the prompt, not just the item difficulty. 2. We use model win rank instead of item difficulty for selecting prompts for "better" models. Both of these yield more informative prompts. Kulikov et al. (2018) use a Bayesian approach for testing for significance in interactive evaluation; however, the correlation between items is not taken into account. As in Otani et al. (2016), IRT allows us to directly compare distributions; however, the correlation between the prompts still needs to be accounted for in order not to overstate significance.
Machine Translation. Much effort has been placed in machine translation on correlating human annotator judgements with automatic metrics; however, Lowe et al. (2017) showed that automatic machine translation evaluation methods do not correlate with human judgments of open-domain conversational agents. This may be due to the fact that in machine translation there is a one-to-one semantic equivalence between reference and system output, whereas this is not true in the chatbot setting. Nonetheless, prior work on assessing human evaluation in machine translation is relevant to chatbot evaluation. In machine translation, shared tasks offer standard evaluation sets and workshops, which have yielded standardized results (Callison-Burch et al., 2007). Since 2015, the Workshop on Machine Translation (WMT) has used TrueSkill (Herbrich et al., 2007) for model ranking. TrueSkill can also be applied to chatbot evaluation. Sakaguchi et al. (2014) used it to efficiently pair machine translation systems and compared them using random subsets of data. They show that their non-parametric method is empirically superior in accuracy to Hopkins and May (2013). However, this comparison is limited since the non-parametric method might focus on only one axis of difference, similar to stochastic gradient descent. Returning to our student analogy, in an example of students taking the SAT (an English and a Math test), the TrueSkill method might focus on only the Math portion to discriminate between students, whereas IRT would use both portions. TrueSkill does not select examples using item utility. Otani et al. (2016) and Hopkins and May (2013) applied IRT to machine translation. IRT is more important in chatbot evaluation than in machine translation, as human evaluation is rarely reported in machine translation papers (e.g. Sutskever et al., 2014; Vaswani et al., 2017), but is rarely omitted in chatbot comparison (e.g. Liu et al. (2016) 2020)).
Comparison of conversational generative agents using next utterance generation is in many ways similar to the evaluation of machine translation (MT); however, differentiating between chatbot models is uniquely challenging; many more responses than translations are plausible. Automated evaluation of MT is vastly better than of chatbots (Liu et al., 2016). The higher costs of human evaluation strongly encourage the use of more powerful statistical models such as IRT.

Chatbot Evaluation
Researchers have recently tended to evaluate their methodological improvements relative to a sequence-to-sequence (Seq2Seq) baseline (Sutskever et al., 2014), as proposed for utterance generation by Shang et al. (2015); Vinyals and Le (2015); as well as to compare against each other. While crowd-sourcing experiments are relatively cheap, the lack of automatic metrics means that every change in model architecture requires new evaluations. Our goal is efficient and cost-effective model assessment. Ideally, chatbots would be interactively evaluated, but due to the high cost, next utterance simulation is used as a surrogate. Although next utterance generation is a more artificial task, Logacheva et al. (2018) observed a Pearson correlation of 0.6 between conversation-level and utterance-level ratings.
Human judgments are often inconsistent for non-task-driven chatbots, since there is no clear objective, which leads to low inter-annotator agreement (IAA) (Sedoc et al., 2019; Yuwono et al., 2019). However, Amidei et al. (2019) point out that even with low IAA we can still find statistical significance. There are further tensions between local coherence assessments using standard evaluation sets and human interactive evaluation. These issues are exacerbated for non-task-driven dialog systems, as there is rarely a single "correct" response, leading to more local minima. Thus, there is a need to obtain the maximum possible statistical power at the minimal possible cost. Novikova et al. (2018) found that relative rankings yield more discriminative results than absolute assessments when evaluating natural language generation. Recent work by Li et al. (2019a) introduces both human-bot and self-chat interactive evaluation and shows that this is more effective than conversation-level Likert scales.

IRT for Chatbot Evaluation
We pose chatbot human evaluation as an Item Response Theory (IRT) problem, similar to the approach of Otani et al. (2016). Again, throughout this section we consider the analogy of student ∼ A/B chatbot comparison and question ∼ prompt. In educational testing, one seeks to find the ability of a student and the effectiveness of exam questions (e.g. on the SAT exam); in our setting, these correspond to the comparative difference between a pair of chatbots and the effectiveness of the prompts.
As seen in Table 1, we sum the wins minus losses for each human evaluation of a pair of chatbot systems for each prompt; this net rating ranges over [−n, n], where n is the number of annotators. In the student analogy, this is equivalent to an exam question worth 2n points. This is a well-studied problem, the so-called "graded mean" formulation of IRT (Samejima, 1969).
We first introduce the graded mean formulation of IRT required to estimate the relative assessment of chatbots and the discriminative power of the prompts. Subsequently, we describe the exact problem formulation in our setting.

Item Response Theory
The core idea behind IRT is that the probability that student i gets each question (item) j correct depends both on the ability of the student and the difficulty of the question. IRT aims to assess a latent ability trait $\theta_i$ for each student i from their answers $u_{ij}$ to items j, and, simultaneously, to determine how informative each item j is. This informativeness depends on the ability of the student; one wants to give harder questions to good students and easier questions to weaker students. IRT is a latent variable Bayesian model that can be estimated via expectation maximization (EM) or variational inference. For a comprehensive exposition of IRT see Embretson and Reise (2013).
More formally, we use the graded mean IRT model (Andrich, 1978), which models the probability that a student i obtains a score of at least c (the "rated scale assignment") for question j. $P_{ijc}(\theta_i)$, the probability that the student score (or aggregate chatbot rating) satisfies $u_{ij} \geq c$, is given by
$$P_{ijc}(\theta_i) = \sigma\big(\alpha_j (\theta_i - b_{jc})\big),$$
where $\sigma$ is the logistic function, $b_{jc}$ is the item (jth question) difficulty for the score c (e.g. to score 4 or more points out of 6 on an exam question), $\alpha_j$ is the slope or item's discrimination (measuring how informative the question is for measuring the student's ability), and $\theta_i$ is the latent ability of student i. 1 Better questions (higher $\alpha_j$) allow investigators to determine which student is better with fewer questions. We will use this same model to test which chatbot is better using fewer prompts.
In order to make this model generative, we can define
$$P(u_{ij} = c \mid \theta_i) = P_{ij,c}(\theta_i) - P_{ij,c+1}(\theta_i).$$
If $c \in [-3, 3]$ then $P_{ij,-3}(\theta_i) = 1$ and $P_{ij,4}(\theta_i) = 0$. IRT is a latent variable Bayesian model, where $\theta_i$, $b_j$, and $\log(\alpha_j)$ have priors from a normal distribution. The model is estimated by gradient descent.
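As an illustration of this generative step, the following Python sketch turns cumulative probabilities into per-score probabilities. It assumes the standard logistic form σ(α(θ − b)) for the cumulative curve and reads the cumulative probability as that of scoring at least c, which is the convention implied by the boundary values of 1 at c = −3 and 0 at c = 4. The function names and all parameter values are hypothetical and are not estimates from this paper.

```python
import math


def sigmoid(x):
    """Logistic function."""
    return 1.0 / (1.0 + math.exp(-x))


def category_probs(theta, alpha, thresholds, scores):
    """Return {c: P(u = c | theta)} for an ordered list of possible scores.

    thresholds[c] is the difficulty b_jc for every score except the lowest,
    because the probability of reaching the lowest score is 1 by definition.
    """
    # Cumulative probabilities P(u >= c); fixed to 1 at the lowest score.
    cum = {scores[0]: 1.0}
    for c in scores[1:]:
        cum[c] = sigmoid(alpha * (theta - thresholds[c]))

    probs = {}
    for idx, c in enumerate(scores):
        # Above the highest score the cumulative probability is 0.
        upper = cum[scores[idx + 1]] if idx + 1 < len(scores) else 0.0
        probs[c] = cum[c] - upper
    return probs


# Hypothetical item: net ratings from -3 to 3 with increasing thresholds.
scores = list(range(-3, 4))
thresholds = {-2: -1.5, -1: -0.8, 0: -0.2, 1: 0.3, 2: 0.9, 3: 1.6}
probs = category_probs(theta=0.5, alpha=1.2, thresholds=thresholds, scores=scores)
```

Because the differences telescope, the per-score probabilities sum to one whenever the thresholds are increasing in c; a larger θ shifts probability mass toward higher net ratings.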

Problem Setting
IRT can be easily repurposed for chatbot evaluation. Rather than assessing individuals i based on their answers to exam questions j, we assess the relative rating (preference) between two chatbot models i based on their responses to conversational prompts j. Instead of teachers (or ETS) grading the students' answers, human raters now rate the chatbot responses. The overall score for a chatbot for each item is the accumulated annotator preferences for that chatbot over its competitor. The score for chatbot B compared against chatbot A for item j is
$$u_{ij} = \sum_{k=1}^{n} \left( w^{B}_{kj} - w^{A}_{kj} \right),$$
where n is the number of annotators, and $w^{A}_{kj} = 1$ and $w^{B}_{kj} = 0$ if for prompt j the k-th annotator chose model A as having a better response; values are reversed if model B was preferred (see examples in Table 1). 2 The resulting ability score $\theta_i \in \mathbb{R}$ is then the relative "ability" (i.e. assessed quality) of models i = A vs B. Figure 1 shows a distribution of ability across multiple pairwise comparisons of models.
Figure 1: Each curve shows the estimated distribution of difference (inverse logit) in assessed quality between a pair of two different chatbot models produced by our Bayesian IRT model. The mode of each curve is the expected value of the quality difference, and zero means that the models are believed to be equally good.
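The net rating can be computed directly from annotator votes. The sketch below is illustrative only: the vote encoding ('A', 'B', '-' for a tie) and the function name are assumptions, and the linear rescaling for a variable number of annotators is one plausible reading of scaling to a fixed range, not necessarily the authors' procedure.

```python
def net_rating(votes, scale_to=None):
    """Wins for system B minus wins for system A on one prompt.

    votes: list with one entry per annotator, each 'A', 'B', or '-' (tie).
    scale_to: if given, linearly rescale the rating to [-scale_to, scale_to]
              (assumed handling of a variable number of annotators).
    """
    if not votes:
        raise ValueError("at least one vote is required")
    wins_b = sum(1 for v in votes if v == "B")
    wins_a = sum(1 for v in votes if v == "A")
    raw = wins_b - wins_a  # ties contribute 0 to both counts
    if scale_to is None:
        return raw
    return raw * scale_to / len(votes)


three_votes = net_rating(["B", "B", "A"])        # two wins for B, one for A
with_tie = net_rating(["A", "-", "A"])           # a tie leaves the sum unchanged
six_scaled = net_rating(["B"] * 6, scale_to=3)   # six annotators mapped to [-3, 3]
```

With three annotators the raw rating already lies in [−3, 3], so no rescaling is needed; positive values favour system B.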
A critical difference between our formulation and that of Otani et al. (2016) is that we explicitly account for the independence of prompts, and do not model individual annotators k. Estimating a model of individual annotators would require many annotations for each annotator, which is not practical for estimator convergence.
IRT gives an optimal way to combine item results (given the modeling assumptions). It is flexible in that one need not make comparisons for all items for all chatbot pairs. In order to avoid overstating statistical significance, we group covariate prompts using a simple correlation filter (> 0.6) over all experiments. 3 In order to keep the net rating in [−3, 3], we average the scores in the group. Note that this is the most conservative possible choice. We further control for multiple testing error by analyzing all comparisons simultaneously (Miller, 1981). As more comparisons are made, more information is revealed about the prompts in the evaluation dataset.
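The correlation filter can be sketched as follows. This is a minimal illustration under stated assumptions: prompts are linked whenever their net ratings across comparisons have Pearson correlation above 0.6, linked prompts are merged transitively with a union-find structure (the transitive merge is an assumption), and each group is replaced by the average of its members' scores. The prompt identifiers and ratings are made up for the example.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def group_prompts(ratings, threshold=0.6):
    """ratings: {prompt_id: [net rating in each comparison]} -> list of groups."""
    ids = list(ratings)
    parent = {p: p for p in ids}

    def find(p):
        while parent[p] != p:
            parent[p] = parent[parent[p]]  # path halving
            p = parent[p]
        return p

    for i, p in enumerate(ids):
        for q in ids[i + 1:]:
            if pearson(ratings[p], ratings[q]) > threshold:
                parent[find(q)] = find(p)

    groups = {}
    for p in ids:
        groups.setdefault(find(p), []).append(p)
    return list(groups.values())


def merge_group(ratings, group):
    """Average the members' scores so the merged rating stays in range."""
    n = len(ratings[group[0]])
    return [sum(ratings[p][t] for p in group) / len(group) for t in range(n)]


# Hypothetical net ratings of three prompts over five model comparisons.
ratings = {
    "follower_leader": [3, -1, 2, -3, 1],
    "leader_follower": [3, -2, 2, -3, 0],
    "purpose": [-1, 2, 0, 1, -2],
}
groups = group_prompts(ratings)
merged = merge_group(ratings, groups[0])
```

Averaging rather than summing keeps the merged score within [−3, 3], so a group of near-duplicate prompts counts as a single item.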

Experimental Details
While human evaluation remains the gold standard for dialog research, the design of human evaluation experiments is far from standard. We restrict our analysis to designs where the annotator is shown a prompt and two possible responses and then asked to select the better one or specify a tie. We follow the setup of Sedoc et al. (2019) (see the Appendix for instructions to Amazon Mechanical Turk crowd workers).
2 If the number of annotators is variable, then we scale $u_{ij}$ to a fixed range, which here we set to [−3, 3].
3 We calculate the correlation of judgments $u_{ij}$ between all prompts over all annotators and evaluations.

System Descriptions
We conducted a series of experiments to establish high-quality baselines for several popular training sets to show the efficacy of our proposed method. We compared our baselines against the OpenNMT benchmark for dialog systems 4; Cakechat 5, which is a reimplementation of the hierarchical encoder-decoder model (HRED) (Serban et al., 2016); and the Neural Conversation Model's (NCM) released responses from Vinyals and Le (2015). Cakechat was trained on Twitter data, and NCM and the OpenNMT benchmark were trained on movie subtitle data from OpenSubtitles (Tiedemann, 2012). We also evaluated two state-of-the-art Transformer-based models: DialoGPT 6 medium (Zhang et al., 2019) and Blender (2.7B) 7 (Roller et al., 2020). Two human baselines created by Sedoc et al. (2019) were used.
All other models were trained with the OpenNMT-py (Klein et al., 2017) Seq2Seq implementation with its default parameters: two layers of LSTMs with 512 hidden neurons for the bidirectional encoder and the unidirectional decoder.
We trained several models and chose the best using non-exhaustive human evaluation. 8 OpenNMT Seq2SeqAttn is trained using OpenSubtitles (Tiedemann, 2012) and Seq2SeqAttn OpenSubtitles Questions is trained using pairs where the first utterance ends in a question mark and the second does not. Finally, Seq2SeqAttn Twitter was trained on Twitter micro-blogging data as originally done by Ritter et al. (2010). 9 All of the data was extracted and tokenized using ParlAI (Miller et al., 2017). 10

Selection of Evaluation Set
Our evaluation set is the list of 200 questions released by Vinyals and Le (2015) in their seminal work on neural conversational models using a standard Seq2Seq framework borrowed from machine translation. The evaluation set is handcrafted and there are several correlated examples, such as the prompts "are you a follower or a leader ?" and "are you a leader or a follower ?" This quality is not unique to this evaluation dataset.

Human Evaluation Details
The evaluation prompts are split into blocks (currently defaulted to 10) 11. We used the same experimental setup as Sedoc et al. (2019). The overall inter-annotator agreement (IAA) varies depending on the vagueness of the prompt as well as the similarity of the models. The overall IAA as measured by Fleiss' kappa (Fleiss, 1971) varies between 0.2 and 0.54 if we include tie choices. As Dras (2015) notes, there is little agreement in the community on how to handle tie choices. Our IAA is similar to the findings of Yuwono et al. (2019), who also found low inter-annotator agreement when assessing conversational turns.
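For reference, Fleiss' kappa can be computed from per-prompt vote counts as in the sketch below. It follows the standard textbook definition and assumes every prompt is rated by the same number of annotators; the three categories (A better, B better, tie) and the example counts are hypothetical and are not data from our experiments.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of category counts.

    counts[i][j] is the number of annotators who put item i in category j.
    Every item must have the same number of annotators (at least 2).
    Undefined (division by zero) if all votes fall in a single category.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2 or any(sum(row) != n_raters for row in counts):
        raise ValueError("each item needs the same number (>= 2) of annotators")

    total = n_items * n_raters
    n_categories = len(counts[0])
    # Share of all votes that fall in each category.
    p_cat = [sum(row[j] for row in counts) / total for j in range(n_categories)]
    # Observed agreement on each item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_bar = sum(p_item) / n_items
    p_exp = sum(p * p for p in p_cat)
    return (p_bar - p_exp) / (1 - p_exp)


# Hypothetical votes of 3 annotators on 4 prompts: columns are (A, B, tie).
example = [[3, 0, 0], [0, 3, 0], [2, 1, 0], [1, 1, 1]]
kappa = fleiss_kappa(example)
```

Treating ties as a third category, as here, is only one of the possible conventions; dropping tie votes changes both the counts and the resulting agreement.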
Unfortunately, "bad" workers accounted for roughly seven percent of all annotations, which we remove from our results. To identify such workers, we compare each worker's annotations against the other two annotations of the same prompts, and remove annotators whose correlation with them is not statistically significantly greater than 0. It is important to note two things: 1) the other two annotations likely come from more than two other workers, since we have a minimum of 3 annotators and a maximum of 60, and 2) if a "bad" worker is not adversarial (i.e. labeling the opposite of the correct judgment) but instead just labels randomly, then the annotator will lower inter-annotator agreement, but IRT will not be significantly affected (Hopkins and May, 2013). However, "bad" workers will bias the estimate of the mean difference (a.k.a. ability) of models toward 0 (see the Appendix for further details).
9 From https://github.com/Marsan-Ma/chat_corpus/raw/master/.
10 https://github.com/facebookresearch/ParlAI
11 We used the code from ChatEval https://github.com/chateval/chateval/

Results
We used IRT to compare multiple neural models for their relative strength. Furthermore, we also included human baselines in our model comparison. Finally, we assessed the discriminative quality of the hand-crafted prompts from Vinyals and Le (2015).

Model Comparison Results
A comparison of the models described in section 5.1 is in Table 3 (all model comparisons are in the Appendix). 12 By analyzing the significance of all of the models at once using IRT, we can correct for multiple testing (Miller, 1981). That is, given multiple comparisons, a comparison might look statistically significant by chance if one naively uses a p-value threshold of 0.05.
Overall, there is a roughly uniform distribution of ratings (see the appendix for more detail). The grade is from -3 to 3 since there are 3 annotators per prompt for all but one experiment.
As seen in Table 3 the NCM (Vinyals and Le, 2015) model performance cannot be matched by any other model, even though all models are based on Seq2Seq. This indicates that either baseline models are difficult to properly train and parameterize, or that the NCM model may be overfit for the evaluation set. Interestingly, there are not enough ratings to evaluate whether NCM is worse than our human baselines. NCM also seems to outperform both Blender as well as DialoGPT; however, these results are not statistically significant. Blender is designed for multi-turn interactions, so single-turn prompts may not be a fair comparison.
Note that IRT does not yield a total ordering of systems. In pairwise comparisons among Cakechat, Seq2SeqAttn Twitter, and Seq2SeqAttn OpenSubtitles, Cakechat is superior to Seq2SeqAttn Twitter. However, Seq2SeqAttn OpenSubtitles is almost statistically significantly better than Cakechat, while Seq2SeqAttn Twitter and Seq2SeqAttn OpenSubtitles are rated to have equivalent performance. One possible reason for this might be that both Cakechat and Seq2SeqAttn Twitter are trained on Twitter, so their model responses are more directly comparable.
Table 2: The mean and standard deviation of "ability" (inverse logit) of paired comparisons of various models, where overlap with zero indicates no difference. Larger positive numbers indicate that System B is superior in terms of rating by human annotators, and similarly more negative numbers mean that System A is superior. (* shows significant differences p < 0.05 and the better system is in bold.)

Evaluation Set Selection
In order to minimize the number of evaluations required to assess the relative performance of models, we first removed redundant prompts, and then used IRT to select the prompts that were most discriminative. IRT evaluates the discriminative ability of each prompt independently, so first we analyzed the correlation structure of responses over all evaluations and removed redundant prompts. To test the effect of using IRT to select prompts, we use a leave-one-out design, i.e. we keep 19 model comparisons and then select a subset of prompts with the most discriminative power for the 20th out-of-sample comparison. It is important to note that the most discriminative prompts ($\alpha_j$) are usually not the most difficult ones ($b_{jc}$). This is different from Lalor et al. (2019), who use training example difficulty. Figure 2 shows the change in the standard error of the ability estimates as we reduced the number of prompts. Our main result is that selecting just 100 of the 200 prompts using IRT maintains the same standard error, while selecting 100 random prompts gives a significantly higher error. Thus, using IRT allows us to reliably compare methods using fewer prompts.
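The selection step itself is simple once discrimination estimates are available. The sketch below assumes that an IRT fit on the held-in comparisons has already produced one estimated discrimination per prompt; it ranks prompts by that value and keeps the top k, with a seeded random subset as the baseline. The function names, prompt identifiers, and values are hypothetical.

```python
import random


def select_prompts(discrimination, k):
    """Keep the k prompts with the largest estimated discrimination.

    discrimination: {prompt_id: estimated alpha_j from the held-in comparisons}
    """
    ranked = sorted(discrimination, key=discrimination.get, reverse=True)
    return ranked[:k]


def random_prompts(prompt_ids, k, seed=0):
    """Random baseline: k prompts sampled without replacement."""
    return random.Random(seed).sample(list(prompt_ids), k)


# Hypothetical discrimination estimates for four prompts.
alphas = {"p1": 0.4, "p2": 1.7, "p3": 0.9, "p4": 1.2}
chosen = select_prompts(alphas, k=2)
baseline = random_prompts(alphas, k=2)
```

In a leave-one-out evaluation, the chosen subset would then be used to estimate the ability for the held-out comparison, and its standard error compared with that obtained from the random subset.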
Different Prompts for Better Students. Finally, we assessed the effect of model quality on chatbot evaluation. Intuitively, one wants harder questions for better students. Similarly, an example such as "my name is david . what is my name ?" is an easier prompt than "what is the purpose of being intelligent ?" However, two models that are closer to human parity will only be distinguishable by the latter example. Similarly, for models further from human performance, both would perform poorly, for example OpenNMT Seq2Seq: "I don 't know ." and CakeChat: "i ' m not sure what to say ." Using IRT, we were able to validate this intuition across multiple models.
We split systems into two categories, "better" (NCM, DialoGPT, Blender, and Cakechat) and the other systems (e.g. OpenNMT), by sorting using mean ∆ ability (Table 3). For each set of chatbots, we re-estimate the ability and item difficulty using only the subset of comparisons within each category (i.e., better chatbots are only compared against other better chatbots). We report the average standard error of difference of ability estimates of the left-out comparisons when using IRT with the most discriminative prompts. Thus, different prompts are selected for the better chatbots than for the others. The number of prompts was reduced while maintaining discriminative power as measured by standard error of discriminative ability (Figure 2); using prompts customized to each group yields lower standard error than using the globally "best" prompts. As the number of models increases, such filtering based on model quality further improves sample-wise efficiency. IRT prompt selection using model quality allows us to dynamically update the evaluation set to adapt to better models.
Our work generalizes beyond the evaluation set from Vinyals and Le (2015). While other evaluation sets, such as random subsets of Twitter or OpenSubtitles, may have fewer covariate prompts, there are many examples where further conversational context is required, causing the prompts to have low discriminative power. For example, the prompt "Not really" from the Twitter evaluation set (Sedoc et al., 2019) is difficult to respond to without conversational context. Also, our method is not limited to single-turn prompts; however, for this case study, we focus on the available evaluation set. Multi-turn prompts such as "A: Was this useful to you? B: Yes A: Ok" are not very useful since almost any future response is valid. Initial results show that we can use IRT to automatically filter such uninformative prompts instead of hand-curating an evaluation set.

Conclusion
We present a new method for incorporating IRT into chatbot evaluation and show that we can use IRT to adaptively and optimally weight prompts from the evaluation sets, eliminating less informative prompts. One of the strengths of our method is that prompt discriminative ability and difficulty are re-estimated as new evaluations are added. One can thus start with a larger evaluation set, such as a subset from the Cornell Movie Database (Danescu-Niculescu-Mizil and Lee, 2011) and continue refining the subset of the evaluation set. We showed that our method is effective with the NCM evaluation set. Applying it to the Cornell Movie Database evaluation set of Baheti et al. (2018), we found that we could reduce from 1000 to 150 prompts with negligible loss of accuracy. When evaluating a new model, one would start with a comparison, say against a human baseline on a large set of prompts, then against a similarly ranked model using an appropriate subset of prompts. After each evaluation, the accuracy of all comparisons will increase. IRT can also be used to adapt evaluation sets as chatbot models improve in performance, reducing annotation costs.
While our main exposition addresses single turn prompts for chatbot evaluation, our IRT model comparison method generalizes to many natural language generation tasks, including machine translation and text simplification. It also generalizes to multi-turn prompts, point-wise evaluation, pairwise conversational evaluation (e.g. Acute-Eval (Li et al., 2019a)), and interactive evaluations such as those of Kulikov et al. (2019).

A Further Human Evaluation Details
Crowd workers are paid $0.01 per prompt, and on average it takes 1 minute to evaluate 10 choices with a maximum allowed time of 2 minutes. We used three evaluators per prompt, so, if there are 200 prompts, we have 600 ratings and the net cost of the experiment is $7.2. We chose 3 annotators since we can generalize enough for IAA and it is cost-effective. The instructions seen by AMT workers are shown in Figure 3.
We removed workers with a correlation below 0.05 with other annotators. For a worker identified as "bad", all annotations are removed. Including these workers only increases the standard error by 10%.
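A minimal sketch of such a filter is given below. It assumes votes are encoded as −1 (A better), 0 (tie), and +1 (B better), correlates each worker's votes with the mean vote of the other annotators on the same prompts, and flags workers whose Pearson correlation falls below 0.05. The encoding, the use of the mean of the other votes, and the example data are assumptions made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 if either series is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def flag_bad_workers(annotations, threshold=0.05):
    """annotations: list of (worker_id, prompt_id, vote), vote in {-1, 0, 1}."""
    by_prompt = {}
    for worker, prompt, vote in annotations:
        by_prompt.setdefault(prompt, []).append((worker, vote))

    own, others = {}, {}
    for votes in by_prompt.values():
        for worker, vote in votes:
            rest = [v for w, v in votes if w != worker]
            if not rest:
                continue  # nobody else rated this prompt
            own.setdefault(worker, []).append(vote)
            others.setdefault(worker, []).append(sum(rest) / len(rest))

    # Workers with constant votes get correlation 0.0 and are flagged too.
    return sorted(w for w in own if pearson(own[w], others[w]) < threshold)


# Hypothetical votes of four workers on six prompts; w4 disagrees with the rest.
votes = {
    "w1": [1, -1, 1, -1, 1, -1],
    "w2": [1, -1, 1, -1, 1, -1],
    "w3": [1, -1, 1, -1, 0, -1],
    "w4": [-1, -1, 1, 1, 0, 1],
}
annotations = [(w, p, v) for w, vs in votes.items() for p, v in enumerate(vs)]
bad = flag_bad_workers(annotations)
```

A significance test on the correlation, rather than a fixed cutoff, would need the number of shared prompts per worker; the fixed threshold is used here only to keep the sketch short.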
From the 200 NCM evaluation set prompts, each annotation task has 10 prompts; however, we do not pair the same 3 workers to the 10 prompts; instead we randomize the prompts shown, so worker 1 may compare prompts 1-10, while worker 2 compares prompts 2, 3, 5, 7, 9, 11, 13, 17, 19, 23. As a result, the correlation between one worker and the others is more stable.
A full set of model comparisons on the Neural Conversation Model is available in Table 3.
Table 3: Comparison of various models using IRT. Larger positive numbers indicate that System B is superior in terms of rating by human annotators, and similarly more negative numbers mean that System A is superior. (* shows significant differences.)