What Can We Learn from Collective Human Opinions on Natural Language Inference Data?

Despite the subjective nature of many NLP tasks, most NLU evaluations have focused on using the majority label with presumably high agreement as the ground truth. Less attention has been paid to the distribution of human opinions. We collect ChaosNLI, a dataset with a total of 464,500 annotations, to study Collective HumAn OpinionS in oft-used NLI evaluation sets. This dataset is created by collecting 100 annotations per example for 3,113 examples in SNLI and MNLI and 1,532 examples in Abductive-NLI. Analysis reveals that: (1) high human disagreement exists in a noticeable number of examples in these datasets; (2) state-of-the-art models lack the ability to recover the distribution over human labels; (3) models achieve near-perfect accuracy on the subset of data with a high level of human agreement, whereas they can barely beat a random guess on the data with low levels of human agreement, which accounts for most of the common errors made by state-of-the-art models on the evaluation sets. This questions the validity of improving model performance on old metrics for the low-agreement part of evaluation datasets. Hence, we argue for a detailed examination of human agreement in future data collection efforts, and for evaluating model outputs against the distribution over collective human opinions. The ChaosNLI dataset and experimental scripts are available at https://github.com/easonnie/ChaosNLI

One common practice followed by most of these recent works is to simplify the evaluation of various reasoning abilities into a classification task. This is analogous to posing objective questions to a human in educational testing. The simplification not only facilitates data annotation but also yields interpretable evaluation results, from which model behaviors can be studied and weaknesses diagnosed (Sanchez et al., 2018).
Despite the straightforwardness of this formalization, one assumption behind most prior benchmark data sourcing is that there exists a single prescriptive ground-truth label for each example. This assumption might hold in human educational settings, where prescriptivism is preferred over descriptivism because the goal is to test humans on well-defined knowledge or norms (Trask, 1999). However, it does not hold for many NLP tasks, which are pragmatic in nature: the meaning of the same sentence might differ depending on the context or background knowledge. Specifically for the NLI task, Manning (2006) advocates that annotation tasks should be "natural" for untrained annotators, and that the role of NLP should be to model the inferences that humans make in practical settings. Previous work using a graded labeling schema on NLI (Pavlick and Kwiatkowski, 2019) showed that there are inherent disagreements in inference tasks. All these discussions challenge the majority "gold-label" practice common in most prior data collection and evaluation.
Intuitively, such disagreements among humans should be allowed because different annotators might have different subjective views of the world and might think differently when they encounter the same reasoning task. Thus, from a descriptive perspective, evaluating the capacity of NLP models to predict not only individual human opinions or the majority human opinion, but also the overall distribution over human judgments, provides a more representative comparison between model capabilities and 'collective' human intelligence. Therefore, we collect ChaosNLI, a large set of Collective HumAn OpinionS for examples in several existing (English) NLI datasets, and comprehensively examine the effect of human agreement (measured by the entropy of the distribution over human annotations) on state-of-the-art model performance. Specifically, our contributions are:
• We collect an additional 100 annotations for over 4k examples in SNLI, MNLI-matched, and αNLI (a total of 464,500 annotations) and show that when the number of annotations is significantly increased: (1) a number of original majority labels fail to represent the prevailing human opinion (in 10%, 20%, and 31% of the data we collected for αNLI, SNLI, and MNLI-matched, respectively); and (2) large human disagreements exist and persist in a noticeable number of examples.
• We compare the outputs of several state-of-the-art models with the distribution of human judgements and show that: (1) the models lack the ability to capture the distribution of human opinions; (2) this ability differs from the ability to perform well on the old accuracy metric; (3) models' performance on the subset with high levels of human agreement is substantially better than their performance on the subset with low levels of human agreement (close to solved versus random guessing, respectively), and the mistakes shared across state-of-the-art models are made on examples with large human disagreement.
• We argue for evaluating the models' ability to predict the distribution of human opinions and discuss the merit of such evaluation with respect to NLU evaluation and model calibration. We also give design guidance on crowd-sourcing such collective annotations to facilitate future studies on relevant pragmatic tasks.
Related Work

Uncertainty of Annotations. Past discussions of human disagreement on semantic annotation tasks mostly focused on the uncertainty of individual annotators and the noisiness of the data collection process. These tasks include word sense disambiguation (Erk and McCarthy, 2009; Jurgens, 2013), coreference (Versley, 2008), frame corpus collection (Dumitrache et al., 2019), anaphora resolution (Poesio and Artstein, 2005; Poesio et al., 2019), entity linking (Reidsma and op den Akker, 2008), tagging and parsing (Plank et al., 2014; Alonso et al., 2015), and veridicality (De Marneffe et al., 2012; Karttunen et al., 2014). These works studied the ambiguity of annotations, how the design of the annotation setup might affect inter-annotator agreement, and how to make annotations reliable. In contrast, we consider disagreement and subjectivity to be intrinsic properties of the population. Our work discusses the disagreements among a large group of individuals and further examines the relation between annotation disagreement and model performance.
Disagreements in NLI Annotations. Our work is significantly inspired by previous work revealing the "inherent disagreements in human textual inference" (Pavlick and Kwiatkowski, 2019). That work employed 50 independent annotators for a "graded" textual inference task, yielding a total of roughly 19,840 annotations, and validated that disagreements among the annotations are reproducible signals. In particular, their labeling schema modified the 3-way categorical NLI labels into a graded scale, whereas our study keeps the original 3-way labeling schema to facilitate a direct comparison between old and new labels, and focuses more on an in-depth analysis of the relation between the level of disagreement among humans and state-of-the-art model performance.
Graded Labeling Schema. Some previous work attempts to address the issue of human disagreement by modifying or re-defining the evaluation task with a more fine-grained ordinal or even real-valued labeling schema, rather than a categorical one (Zhang et al., 2017; Pavlick and Kwiatkowski, 2019), to reduce issues of ambiguity. Our work is independent of and complementary to those efforts, providing an analysis of general language understanding from a collective-distribution perspective.

Data Collection
Our goal is to gather annotations from multiple annotators to estimate the distribution over human opinions. Sections 3.1 and 3.2 describe the details of the collection. More importantly, Section 3.3 explains the challenges of such data collection and how our design ensures data quality.

Dataset and Task
ChaosNLI provides 100 annotations for each example in three sets drawn from existing NLI-related datasets. The first two sets are a subset of the SNLI development set and a subset of the MNLI-matched development set, in which the examples satisfy the requirement that their majority label agrees with only three out of the five individual labels collected in the original work (Bowman et al., 2015; Williams et al., 2018). The third set is the entire αNLI development set introduced in Bhagavatula et al. (2020). To simplify the terminology, we denote the SNLI, MNLI-m, and αNLI portions of ChaosNLI as ChaosNLI-S, ChaosNLI-M, and ChaosNLI-α, respectively.

Annotation Interface
To collect multiple labels for each example, we employed crowdsourced workers from Amazon Mechanical Turk with qualifications. The annotation interface is implemented using the ParlAI framework (Miller et al., 2017). The collection is conducted in a multi-round interactive environment where, at each round, annotators are instructed to make a single multiple-choice selection. This reduces the annotators' mental load and helps them focus on the human intelligence tasks (HITs). Compressed versions of the instructions are shown in Figure 1. Screenshots of the Turker interface are attached in Appendix A.

Quality Control
Collecting 100 annotations per example poses a challenge for quality control: we need to ensure that each annotator genuinely tries their best, to avoid errors caused by carelessness. We cannot denoise the data by collecting more annotations and aggregating them with majority voting, nor can we use inter-annotator agreement to measure data quality.
To this end, we select a set of examples which exhibit high human agreement on a single label, to rigorously test and track the performance of each annotator. We call this the set of unanimous examples. To obtain such a set, we sampled examples from the SNLI, MNLI, and αNLI training sets, crowdsourced 50 annotations for each of them, and finally selected those whose human agreement is indeed high (majority > 95%). Throughout the collection process, we employ the following three mechanisms to ensure label quality:

On-boarding test. Every annotator needs to pass an on-boarding test before they can work on any real example. The test includes five easy examples pre-selected from the set of unanimous examples. If they fail to give the correct selection for any of them, they are prevented from working on any example. This mechanism tests whether the annotator understands the task.
Training phase. After passing the on-boarding test, each annotator is given 10 to 12 examples from the set of unanimous examples to annotate. For each example, if an annotator gives a label that differs from the pre-collected legitimate label, the annotator is prompted with the correct label and told to keep concentrating on the task. If an annotator's accuracy on the training examples is below 75%, the annotator is not allowed to proceed. This training mechanism further helps annotators get familiar with the task.
Performance tracking. After the training phase, annotators are given real examples. For each example to be annotated, there is a 10% chance that the example is sampled from the set of unanimous examples. Again, for such examples, if an annotator gives a label that differs from the pre-collected legitimate label, the annotator is prompted with the correct label and told to keep concentrating on the task. If an annotator's accuracy on those examples is below 75%, or if the annotator gives four consecutive incorrect labels, the annotator is blocked. This mechanism tracks the performance of each annotator and helps guarantee that each annotator is capable and focused when working on any example.

Table 1 shows that the on-boarding test filters out more than half of the Turkers. Figure 2 shows that the average accuracy of a single Turker on the set of unanimous examples improves as annotators complete more examples and converges at around 92%. These observations indicate that our filtering mechanisms are rigorous and help improve and sustain annotator concentration during the collection task. The design provides guidelines for future work on how to ensure data quality when the usual inter-annotator agreement measures are not applicable.
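The blocking rules in the performance-tracking mechanism can be sketched as a small tracker. This is an illustrative sketch only: the class and parameter names are hypothetical (not the authors' implementation), and the `min_checks` grace period is our own assumption, since without it a single early mistake would immediately push accuracy below the 75% threshold.

```python
class AnnotatorTracker:
    """Sketch of the performance-tracking rules described above.

    An annotator is blocked when their accuracy on interleaved
    unanimous (quality-control) examples falls below 75%, or after
    four consecutive incorrect labels. `min_checks` (an assumption,
    not from the paper) delays the accuracy check until a few
    quality-control examples have been seen.
    """

    def __init__(self, min_accuracy=0.75, max_consecutive_errors=4,
                 min_checks=4):
        self.min_accuracy = min_accuracy
        self.max_consecutive_errors = max_consecutive_errors
        self.min_checks = min_checks
        self.correct = 0
        self.total = 0
        self.consecutive_errors = 0
        self.blocked = False

    def record(self, given_label, legitimate_label):
        """Record one answer on a quality-control example and return
        True if the annotator should now be blocked."""
        self.total += 1
        if given_label == legitimate_label:
            self.correct += 1
            self.consecutive_errors = 0
        else:
            self.consecutive_errors += 1
        if self.consecutive_errors >= self.max_consecutive_errors:
            self.blocked = True
        if (self.total >= self.min_checks
                and self.correct / self.total < self.min_accuracy):
            self.blocked = True
        return self.blocked
```

In this sketch, an annotator who answers three quality-control examples correctly and one incorrectly sits exactly at the 75% threshold and may continue, while four consecutive wrong answers trigger an immediate block.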

Other Details
The entire collection took about one month to complete over 464K annotations. The mean/median time a Turker spent on each example ranges from 10 to 20 seconds, as shown in Table 1 (we paid up to $0.5 on average per HIT of ten examples). We observe high variance in the time per example across Turkers (including over-estimation due to breaks), hence the median estimate is more reliable. We had a large set of qualified Turkers for our final annotations. The total time of one month is largely attributable to the rigorous quality control mechanism, with its careful on-boarding qualification test and quality monitoring.

Analysis of Human Judgements
Statistics. We collected 100 new annotations for each example in the three sets described in Section 3.1. Table 2 shows the total number of examples in the three sets and the percentage of cases where the new majority label differs from the old majority label (based on 5 annotations for SNLI and MNLI and 1 annotation for αNLI in their original datasets, respectively). Since we only collected labels for subsets of SNLI and MNLI-m, we also include the size of the original SNLI and MNLI-m development sets and the change-of-label ratio with respect to the original sets. The findings suggest that the old labels fail to represent the genuine majority labels among humans for a noticeable proportion (10%, 25%, and 30% for ChaosNLI-α, ChaosNLI-S, and ChaosNLI-M, respectively) of the data. The label statistics for individual datasets can be found in Appendix D.
Examples. Table 3 and Table 4 show some collected NLI examples that either have low levels of human agreement or whose new majority labels differ from the old ground-truth labels. The labels we collected not only provide more fine-grained human judgements but also give a new majority label that better represents the prevailing human opinion. Moreover, different but plausible interpretations indeed exist for the examples with low levels of human agreement, and the discrepancy is not just noise but reflects the distribution over human judgements at a "higher resolution". This is consistent with the findings of Pavlick and Kwiatkowski (2019).
Entropy Distribution. To further investigate the human uncertainty in our collected labels, we show the histogram of the entropy of the label distribution for ChaosNLI-α, ChaosNLI-S, and ChaosNLI-M in Figure 3. The label distribution is approximated by the 100 collected annotations. The entropy is calculated as H(p) = −Σ_{i∈C} p_i log(p_i), with p_i = n_i / Σ_{j∈C} n_j, where C is the label category set and n_i is the number of labels for category i. The entropy value measures the level of uncertainty or agreement among human judgements: high entropy indicates a low level of agreement, and vice versa.
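As a concrete illustration, the entropy computation above fits in a few lines of Python. This is a sketch (the `counts` dictionaries are made-up examples, and the natural logarithm is assumed, matching the formula):

```python
import math

def label_entropy(counts):
    """Entropy H(p) = -sum_i p_i * log(p_i) of the label distribution
    estimated from raw annotation counts, with p_i = n_i / sum_j n_j."""
    total = sum(counts.values())
    h = 0.0
    for n in counts.values():
        if n > 0:  # the term for n_i = 0 is taken as 0
            p = n / total
            h -= p * math.log(p)
    return h

# A unanimous example has entropy 0; an even three-way split has the
# maximum possible entropy for three categories, log(3) ~= 1.10.
print(label_entropy({"entailment": 100, "neutral": 0, "contradiction": 0}))
print(label_entropy({"entailment": 34, "neutral": 33, "contradiction": 33}))
```

With natural-log entropy, the maximum for the 3-way NLI label set is log(3) ≈ 1.10, which matches the scale of the average entropies reported later (0.31 for near-unanimous SNLI examples versus 0.80 for ChaosNLI-S).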
The histogram for ChaosNLI-α shows a roughly U-shaped distribution. This indicates that naturally occurring examples in ChaosNLI-α are either highly certain or highly uncertain among human judgements. In ChaosNLI-S and ChaosNLI-M, the distribution shows only one apparent peak, and the distribution for ChaosNLI-M is slightly skewed towards the higher-entropy direction. As described in Section 3.1, ChaosNLI-S and ChaosNLI-M are subsets of the SNLI and MNLI-m development sets with low levels of human agreement; it can therefore be expected that the majority of naturally occurring SNLI and MNLI data would have low entropy, which would form another peak near the beginning of the x-axis, resulting in a U-like shape similar to ChaosNLI-α.

Analysis of Model Predictions
In Section 4, we discussed the statistics and some examples of the new annotations. The observations naturally raise two questions regarding the development of NLP models: (1) whether the state-of-the-art models are able to capture this distribution over human opinions; and (2) how the level of human agreement affects the performance of the models. We investigate these questions in this section. Sections 5.1 and 5.2 state our experimental choices. Section 5.3 discusses the extent to which the softmax distributions produced by state-of-the-art models trained on the datasets reflect the distributions over human annotations. Section 5.4 demonstrates the surprising influence of human agreement on model performance.

In a pilot study, we collected 50 labels for 100 examples of SNLI where all five original annotators agreed with each other; the average entropy of those examples is 0.31, whereas the average entropy of examples in ChaosNLI-S is 0.80.

Premise (SNLI, majority changed): A group of guys went out for a drink after work, and sitting at the bar was a real a 6 foot blonde with a fabulous face and figure to match.
Hypothesis: The men didn't appreciate the figure of the blonde woman sitting at the bar.

Premise (MNLI, low agreement): In the other sight he saw Adrin's hands cocking back a pair of dragon-hammered pistols.
Hypothesis: He had spotted Adrin preparing to fire his pistols.

Table 3: Examples from the ChaosNLI-S and ChaosNLI-M development sets. 'Old Labels' refers to the 5 label annotations from the original dataset; 'New Labels' refers to the newly collected 100 label annotations. Superscript indicates the frequency of the label.
Observation-1: Sadie was on a huge hike.
Observation-2: Luckily she pushed herself and managed to reach the peak.
Hypothesis-1: Sadie almost gave down mid way.
Hypothesis-2: Sadie wanted to go to the top.
Old Label: Hypothesis-2 | New Labels: Hypothesis-1 (58), Hypothesis-2 (42)

Observation-1: Uncle Jock couldn't believe he was rich.
Observation-2: Jock lived the good life for a whole year, until he was poor again.
Hypothesis-1: He went to town and spent on extravagant things.
Hypothesis-2: Jock poorly managed his finances.
Old Label: Hypothesis-1 | New Labels: Hypothesis-1 (48), Hypothesis-2 (52)

Table 4: Examples from the collected ChaosNLI-α development set. The task asks which of the two hypotheses is more likely to cause Observation-1 to turn into Observation-2. The number in parentheses after each new label indicates its frequency; the majority label is the more frequent of the two.

Models and Setup
Following the pretraining-then-finetuning trend, we focus our experiments on large-scale pretrained language models. We studied BERT (Devlin et al., 2019), XLNet (Yang et al., 2019), and RoBERTa (Liu et al., 2019), since they are considered state-of-the-art models for learning textual representations and have been used for a variety of downstream tasks. We experimented with both the base and the large versions of these models in order to analyze the effect of parameter size. Additionally, we include BART (Lewis et al., 2020), ALBERT (Lan et al., 2019), and DistilBERT (Sanh et al., 2019) in the experiments. ALBERT is designed to reduce the parameters of BERT through cross-layer parameter sharing and embedding factorization. DistilBERT aims to compress BERT with knowledge distillation. BART is a denoising autoencoder for pretraining sequence-to-sequence models.
For NLI, we trained the models on the combined training sets of SNLI and MNLI, which contain over 900k NLI pairs, using the best hyper-parameters chosen by the original authors. For αNLI, we trained the models on the αNLI training set (169,654 examples); its hyper-parameters were tuned on the αNLI development set. Details of the hyper-parameters are in Appendix B.

Evaluation and Metrics
As formulated in Equation 4, we used the 100 collected annotations to approximate the human label distribution for each example. To examine the extent to which current models are capable of capturing collective human opinions, we compared the human label distributions with the softmax outputs of the neural networks, following Pavlick and Kwiatkowski (2019).
We used Jensen-Shannon Distance (JSD) as the primary measure of the distance between the softmax multinomial distribution of the models and the distribution over human labels, because JSD is a metric function based on a mathematical definition (Endres and Schindelin, 2003). It is symmetric and bounded in the range [0, 1], whereas the Kullback-Leibler (KL) divergence (Kullback and Leibler, 1951; Kullback, 1997) has neither property. We also used KL as a complementary measure. The two metrics are calculated as:

JSD(p ‖ q) = sqrt( (KL(p ‖ m) + KL(q ‖ m)) / 2 ),  with m = (p + q) / 2
KL(p ‖ q) = Σ_i p_i log(p_i / q_i)

where p is the estimated human label distribution and q is the model's softmax output.

Table 5: Model performances for JSD, KL, and accuracy on the majority label. ↓ indicates smaller is better; ↑ indicates larger is better. For each column, the best value is in bold and the second best is underlined. "-b" and "-l" in the Model column denote "-base" and "-large", respectively.
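Under the definitions above, both metrics are straightforward to compute. The following sketch uses log base 2 (an assumption consistent with JSD being bounded in [0, 1]) and made-up distributions:

```python
import math

def kl(p, q):
    """KL(p || q) = sum_i p_i * log2(p_i / q_i); terms with p_i = 0
    contribute 0. Asymmetric and unbounded."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jsd(p, q):
    """Jensen-Shannon Distance: the square root of the JS divergence
    against the mixture m = (p + q) / 2. With log base 2 it is a
    symmetric metric bounded in [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return math.sqrt(0.5 * kl(p, m) + 0.5 * kl(q, m))

human = [0.58, 0.30, 0.12]  # label distribution estimated from 100 annotations
model = [0.90, 0.07, 0.03]  # a model's softmax output (made-up numbers)
print(f"JSD = {jsd(human, model):.3f}, KL = {kl(human, model):.3f}")
```

Note that JSD is well defined even when the model assigns zero probability to a label that humans chose (the mixture m is then still positive), whereas KL diverges in that case; this is one practical reason to prefer JSD as the primary measure.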

Main Results
Table 5 shows the results of the models, the chance baseline, and the estimated human performance (the last row). The chance baseline gives each label equal probability when calculating the JSD and KL measures. The accuracy of the chance baseline directly shows the proportion of examples with the majority label in a specific evaluation set. To estimate human performance, we employed a new set of annotators to collect another 100 labels for 200 randomly sampled examples from each of ChaosNLI-α, ChaosNLI-S, and ChaosNLI-M. For a better estimation of 'collective' human performance, we ensured that the new set of annotators employed for estimating human performance is disjoint from the set of annotators employed for the normal label collection. 8 In what follows, we discuss the results.
Significant differences exist between model outputs and human opinions. The most salient takeaway is that there are large gaps between model outputs and human opinions. Specifically, the estimated collective human performance gives JSD and KL scores far below 0.1 on all three sets, whereas the best JSD achieved by the models is larger than 0.2 and the best KL barely goes below 0.5 across the table. This finding is somewhat foreseeable, since none of the models are designed to capture collective human opinions, and it suggests room for improvement.

Footnote 8: The estimation of collective human performance can also be viewed as calculating the JSD and KL between two disjoint sets of 100 human opinions.
Even the chance baseline is hard to beat. More surprisingly, a number of these state-of-the-art models can barely outperform, and sometimes even perform worse than, the chance baseline w.r.t. JSD and KL scores. On ChaosNLI-M, all the models yield JSD scores similar to the chance baseline and are beaten by it on KL. On ChaosNLI-α, BERT-base performs worse than the chance baseline on JSD, and the KL scores of all the models are far higher than that of the chance baseline. This hints that capturing the human label distribution is a common blind spot for many models.
There is no apparent correlation between the accuracy and the two divergence scores. On both ChaosNLI-S and ChaosNLI-M, DistilBERT gives the best KL scores despite obtaining the lowest accuracy on the majority label. BERT-base gives the best JSD while having the second-lowest accuracy on ChaosNLI-M. RoBERTa-large gives the highest accuracy on ChaosNLI-S and ChaosNLI-M and the second-highest accuracy on ChaosNLI-α, but it only obtains the lowest JSD on ChaosNLI-α. The best JSD score on ChaosNLI-α is achieved by BART, which nevertheless fails to give the best accuracy. This hints that the ability required to model the distribution of human labels differs from that required to predict the majority label and perform well on the accuracy metric.
Large models are not always better. A direct comparison between the base and large models for BERT, XLNet, and RoBERTa reveals that the large models cannot beat the base models in KL score on ChaosNLI-α and ChaosNLI-M. Moreover, on ChaosNLI-M, all the large models give higher JSD scores than the base models. However, all the large models achieve higher accuracy than their base counterparts on all three evaluation sets. This observation suggests that modeling collective human opinions might require more thoughtful designs rather than merely increasing model parameter size.

The Effect of Agreement
To study how human agreement influences model performance, we compute the entropy of the human label distribution (by Equation 4) for each data point. Then, we partition ChaosNLI-α and the union of ChaosNLI-S and ChaosNLI-M using their respective entropy quantiles as cut points. This results in several bins with roughly equal numbers of data points whose entropy lies in a specific range. Figures 4 and 5 show the accuracy and the JSD of the models on the different bins. 9 We observe that:
• Across the board, there are consistent correlations between the level of human agreement and the accuracy of the model. This correlation is positive: all models perform well on examples with a high level of human agreement while struggling with examples having a low level of human agreement. Similar trends also exist for JSD.
• Accuracy degrades dramatically (from 0.9 to 0.5) as the level of human agreement decreases.
• The models barely outperform, and sometimes even under-perform, the chance baseline on the bins with the lowest level of human agreement. For both αNLI and NLI, the accuracy of most models on the bin with the lowest level of human agreement does not surpass 60%.

Footnote 9: Model JSD performances are similar to the accuracy performances in that all the models obtain worse results on the bins with a higher entropy range. One exception is the JSD of DistilBERT on ChaosNLI-α. This might be due to the fact that DistilBERT is highly uncertain in its predictions and tends to give an even distribution over the labels, yielding results similar to the chance baseline.
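The binning procedure can be sketched as follows. The helper names are hypothetical, the toy entropies and labels are made up, and the paper's quantile cut points are approximated here by an equal-size split of examples sorted by entropy:

```python
def entropy_bins(entropies, num_bins=4):
    """Partition example indices into `num_bins` roughly equal-sized
    bins, from lowest to highest entropy (i.e., using entropy
    quantiles as cut points)."""
    order = sorted(range(len(entropies)), key=lambda i: entropies[i])
    bins = [[] for _ in range(num_bins)]
    for rank, idx in enumerate(order):
        bins[rank * num_bins // len(order)].append(idx)
    return bins

def per_bin_accuracy(bins, predictions, majority_labels):
    """Accuracy of `predictions` against `majority_labels` within each bin."""
    return [
        sum(predictions[i] == majority_labels[i] for i in b) / len(b)
        for b in bins
    ]

entropies = [0.05, 0.10, 0.40, 0.55, 0.90, 1.05]
preds  = ["e", "e", "n", "c", "e", "n"]
labels = ["e", "e", "n", "n", "c", "c"]
bins = entropy_bins(entropies, num_bins=3)
print(bins)                                    # low-, mid-, high-entropy bins
print(per_bin_accuracy(bins, preds, labels))   # → [1.0, 0.5, 0.0]
```

In this toy run the accuracy drops monotonically from the low-entropy bin to the high-entropy bin, mirroring the trend observed above for the real models.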
These results reveal that most of the data with a high level of human agreement (which often compose the majority of an evaluation set) have been solved by state-of-the-art models, and that most of the common errors on popular benchmarks (like αNLI, SNLI, and MNLI) lie in the subsets where human agreement is low. However, because of the low human agreement, the model prediction amounts to nothing more than a random guess of the majority opinion. This raises an important concern of whether improving or comparing performance on this last missing part of the benchmarks is advisable or useful.

Discussion & Conclusion
While common practice in natural language evaluation compares the model prediction to the majority label, Section 5.4 questions the value of continuing such evaluation on current benchmarks, as most of the unsolved examples have low human agreement. To address this concern, we suggest that NLP models be evaluated against the collective human opinion distribution rather than against one opinion aggregated from a set of opinions, especially on tasks which take a descriptivist approach 10 to language and meaning, including NLI and common sense reasoning. This not only complements prior evaluations by helping researchers understand whether model performance on a specific data point is reliable given its human agreement, but also makes it possible to evaluate a model's ability to capture the whole picture of human opinions. Section 5.3 shows that this ability is missing from current models and that there is substantial room for improvement. It is also important to note that the level of human agreement is an intrinsic property of a data point. Section 5.4 demonstrates that this property can be an indicator of modeling difficulty. This hints at connections between human agreement and uncertainty estimation or calibration (Guo et al., 2017), where machine learning models are required to produce confidence values for their predictions, leading to important benefits in real-world applications.
In conclusion, we hope our data and analysis inspire future directions such as explicit modeling of collective human opinions; providing theoretical supports for the connection between human disagreement and the difficulty of acquiring language understanding in general; exploring potential usage of these human agreements; and studying the source of the human disagreements and its relations to different linguistic phenomena.

B Hyperparameters
For SNLI and MNLI, we used the same hyper-parameters chosen by the original respective authors. For αNLI, we tuned the batch size, learning rate, and number of epochs. For BERT, XLNet, and RoBERTa, we only searched parameters for the large models; the base models use the same hyper-parameters, based on the results of the large ones. Table 8 shows the details.

Figure 8 shows the training trajectory and the changes in accuracy and JSD of RoBERTa-large on four bins as the training data is gradually increased in log space. The plots reveal that the accuracy of the models converges faster, given a fair amount of training data, on bins with a high level of human agreement.

D Label Statistics
Label statistics can be found in Table 7. It is worth noting that there is a shift of majority labels from neutral to entailment in MNLI-m. We assume the difference might be due to the multi-genre nature of the MNLI dataset; collecting more intuitive and concrete explanations for this observation from a cognitive or linguistic perspective will be important future work.

E Other Details
Our neural models are trained on a server with an Intel(R) Xeon(R) CPU E5-2630 v4 @ 2.20GHz (10 cores) and 4 NVIDIA TITAN V GPUs. Table 6 shows the URLs from which we downloaded external resources.