Identifying inherent disagreement in natural language inference

Natural language inference (NLI) is the task of determining whether a piece of text is entailed by, contradicted by, or unrelated to another piece of text. In this paper, we investigate how to tease systematic inferences (i.e., items for which people agree on the NLI label) apart from disagreement items (i.e., items which lead to different annotations), which most prior work has overlooked. To distinguish systematic inferences from disagreement items, we propose Artificial Annotators (AAs), which simulate the uncertainty in the annotation process by capturing the modes in annotations. Results on the CommitmentBank, a corpus of naturally occurring discourses in English, confirm that our approach performs statistically significantly better than all baselines. We further show that AAs learn linguistic patterns and context-dependent reasoning.


Introduction
Learning to effectively understand unstructured text is integral to Natural Language Understanding (NLU), covering a wide range of tasks such as question answering, semantic textual similarity and sentiment analysis. Natural language inference (NLI), an increasingly important benchmark task for NLU research, is the task of determining whether a piece of text is entailed by, contradicted by, or unrelated to another piece of text (i.a., Dagan et al., 2005; MacCartney and Manning, 2009). Pavlick and Kwiatkowski (2019) observed inherent disagreements among annotators in several NLI datasets, which cannot be smoothed out by hiring more people. They pointed out that to achieve robust NLU, we need to be able to tease apart systematic inferences (i.e., items for which most people agree on the annotations) from items inherently leading to disagreement. The last example in Table 1, from the CommitmentBank (de Marneffe et al., 2019), is a typical disagreement item (Premise: "All right, so it wasn't the bottle by the bed. What was it, then?" Cobalt shook his head which might have meant he didn't know or might have been admonishment for Oliver who was still holding the bottle of wine.): some annotators consider it to be an entailment (3 or 2), while others view it as a contradiction (-3). A common practice to generate an inference label from annotations is to take the average (i.a., Pavlick and Callison-Burch, 2016). In this case, the average of the annotations is 0.25 and the gold label for this item would thus be "Neutral", but such a label does not accurately capture the annotation distribution. Alternatively, some work simply ignores items on which annotators disagree and only studies systematic inference items (Jiang and de Marneffe, 2019a,b; Raffel et al., 2019).

Here, we aim at teasing apart systematic inferences from inherent disagreements. In line with what Kenyon-Dean et al. (2018) suggested for sentiment analysis, we propose a finer-grained labeling scheme for NLI: teasing apart disagreement items, labeled "Disagreement", from systematic inferences, which can be "Contradiction", "Neutral" or "Entailment". To this end, we propose Artificial Annotators (AAs), an ensemble of BERT models (Devlin et al., 2019), which simulate the uncertainty in the annotation process by capturing modes in annotations. That is, we expect to utilize simulated modes of annotations to enhance finer-grained NLI label prediction. Our results on the CommitmentBank show that AAs perform statistically significantly better than all baselines (including BERT baselines) by a large margin in terms of both F1 and accuracy. We also show that AAs manage to learn linguistic patterns and context-dependent reasoning.

Data: The CommitmentBank
The CommitmentBank (CB) is a corpus of 1,200 naturally occurring discourses originally collected from news articles, fiction and dialogues. Each discourse consists of up to 2 prior context sentences and 1 target sentence with a clause-embedding predicate under 4 embedding environments (negation, modal, question or antecedent of conditional). Annotators judged the extent to which the speaker/author of the sentences is committed to the truth of the content of the embedded clause (CC), responding on a Likert scale from +3 to -3, labeled at 3 points (+3/speaker is certain the CC is true, 0/speaker is not certain whether the CC is true or false, -3/speaker is certain the CC is false). Following Jiang and de Marneffe (2019b), we recast CB by taking the context and target as the premise and the embedded clause in the target as the hypothesis.
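To make the recast concrete, the sketch below converts one CB discourse into an NLI premise/hypothesis pair as described above; the field names (context, target, embedded_clause) are hypothetical and only illustrate the transformation, not the actual CB release schema.

    # Minimal sketch of the CB-to-NLI recast described above (field names are hypothetical).
    def recast_cb_item(item):
        # Premise: the prior context sentences followed by the target sentence.
        premise = " ".join(item["context"] + [item["target"]])
        # Hypothesis: the clause embedded under the target's clause-embedding predicate.
        hypothesis = item["embedded_clause"]
        return premise, hypothesis

    example = {
        "context": ["\"All right, so it wasn't the bottle by the bed. What was it, then?\""],
        "target": "Cobalt shook his head which might have meant he didn't know or might have "
                  "been admonishment for Oliver who was still holding the bottle of wine.",
        "embedded_clause": "he didn't know",
    }
    print(recast_cb_item(example))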
Common NLI benchmark datasets are SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018), but these datasets have only one annotation per item in the training set. CB has at least 8 annotations per item, which makes it possible to identify items on which annotators disagree. If we simply treated every item on which annotators do not fully agree as a disagreement item, example 3 in Table 1 would be "Disagreement". However, this seems a bit too stringent, given that 70% of the annotators agree on the 0 label and there is only one annotation towards the extreme. Likewise, for example 5, most annotators chose a negative score and the item might therefore be better labeled as "Contradiction" rather than "Disagreement". To decide on the finer-grained NLI labels, we therefore also took variance and mean into account, as follows:

• Entailment: 80% of annotations fall in the range [1, 3], OR the annotation variance is ≤ 1 and the annotation mean is > 1.
• Neutral: 80% of annotations are 0, OR the annotation variance is ≤ 1 and the absolute value of the annotation mean is at most 0.5.
• Contradiction: 80% of annotations fall in the range [-3, -1], OR the annotation variance is ≤ 1 and the annotation mean is < -1.
• Disagreement: items which do not fall in any of the three categories above.

We randomly split CB into train/dev/test sets in a 7:1:2 ratio. Table 2 gives the splits' basic statistics.

Table 2: Number of items per finer-grained NLI label in each split.

           Entailment  Neutral  Contradiction  Disagreement  Total
    Train         177       57            196           410    840
    Dev            23        9             22            66    120
    Test           58       19             54           109    240
    Total         258       85            272           585  1,200
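As a concrete reading of the labeling criteria above, the following sketch derives the finer-grained label from a list of raw CB annotations (integers in [-3, 3]). The thresholds follow the rules above; the function name is ours, and whether variance is computed as population or sample variance is not specified here, so population variance is assumed.

    import statistics

    def finer_grained_label(annotations):
        """Map raw CB annotations (ints in [-3, 3]) to a finer-grained NLI label,
        following the variance/mean criteria described above (checked in order)."""
        n = len(annotations)
        mean = statistics.mean(annotations)
        var = statistics.pvariance(annotations)  # population variance assumed

        frac_pos = sum(1 <= a <= 3 for a in annotations) / n
        frac_zero = sum(a == 0 for a in annotations) / n
        frac_neg = sum(-3 <= a <= -1 for a in annotations) / n

        if frac_pos >= 0.8 or (var <= 1 and mean > 1):
            return "Entailment"
        if frac_zero >= 0.8 or (var <= 1 and abs(mean) <= 0.5):
            return "Neutral"
        if frac_neg >= 0.8 or (var <= 1 and mean < -1):
            return "Contradiction"
        return "Disagreement"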

Model: Artificial Annotators
We aim at finding an effective way to tease items leading to systematic inferences apart from items leading to disagreement. As pointed out by Calma and Sick (2017), annotated labels are subject to uncertainty. Annotations are indeed influenced by several factors: workers' past experience and concentration level, the cognitive complexity of items, etc. They proposed to simulate the annotation process in an active learning paradigm to make use of the annotations that contribute to uncertainty. Likewise, for NLI, Gantt et al. (2020) observed benefits from training directly on raw annotations with annotator-specific modeling rather than on aggregated labels. As Pavlick and Kwiatkowski (2019) argue, if the annotations of an item follow a unimodal distribution, it is suitable to use aggregation (i.e., take an average) to obtain an inference label; but such an aggregation is not appropriate when annotations follow a multi-modal distribution. Without loss of generality, we assume that items are associated with n-modal distributions, where n ≥ 1. Usually, systematic inference items are tied to unimodal annotations while disagreement items are tied to multi-modal annotations. We thus introduce the notion of Artificial Annotators (AAs), where each individual "annotator" learns to model one mode.

Architecture
AAs is an ensemble of n BERT models (Devlin et al., 2019) whose primary goal is finer-grained NLI label prediction. We set n = 3, as there are up to 3 relationships between premise and hypothesis, excluding the disagreement class. Within AAs, each BERT is trained on an auxiliary systematic inference task, which is to predict entailment/neutral/contradiction based on a respective subset of annotations. The subsets of annotations for the three BERT models are mutually exclusive.
A high-level overview of AAs is shown in Figure 1. Intuitively, each BERT separately predicts a systematic inference label, each of which represents a mode of the annotations (the three modes may collapse to (almost) a single point). The representations of these three labels are then aggregated as augmented information to enhance the final finer-grained NLI label prediction (see Eq. 1).
If we view AAs as a committee of three members, our architecture is reminiscent of Query by Committee (QBC) (Seung et al., 1992), an effective approach in the active learning paradigm. The essence of QBC is to select, for labeling, unlabeled data on which disagreement among committee members (i.e., learners pre-trained on the same labeled data) occurs. The selected data is then labeled by an oracle (e.g., domain experts) and used to further train the learners. Likewise, in our approach, each artificial annotator votes on an item independently. However, our purpose is to detect disagreements, rather than to use disagreement as a measure for selecting items for further annotation. Moreover, in our AAs, the three members are trained on three disjoint annotation partitions for each item (see Section 3.2).

Training
We first sort the annotations of each item in descending order and divide them into three partitions. For each partition, we generate an auxiliary label derived from the partition's annotation mean: if the mean is greater than +0.5, the auxiliary label is entailment; if it is smaller than -0.5, contradiction; otherwise, neutral. The first BERT model is always trained to predict the auxiliary label of the first partition, simulating an entailment-biased annotator. Likewise, the second and third BERT models are trained to simulate neutral-biased and contradiction-biased annotators.
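The sketch below illustrates this partitioning and auxiliary-label derivation. Since the exact partitioning scheme is not specified in this section, splitting the sorted annotations into roughly equal thirds is our assumption.

    def partition_annotations(annotations):
        """Sort annotations in descending order and split them into three contiguous
        partitions (entailment-, neutral- and contradiction-biased).
        Splitting into roughly equal thirds is an assumption here."""
        ranked = sorted(annotations, reverse=True)
        k = len(ranked)
        cut1, cut2 = k // 3, 2 * k // 3
        return ranked[:cut1], ranked[cut1:cut2], ranked[cut2:]

    def auxiliary_label(partition):
        """Derive the auxiliary systematic-inference label from a partition's mean."""
        mean = sum(partition) / len(partition)
        if mean > 0.5:
            return "entailment"
        if mean < -0.5:
            return "contradiction"
        return "neutral"

    # Example: annotations of a hypothetical disagreement item.
    anns = [3, 3, 0, 0, 0, 0, -2, -3]
    parts = partition_annotations(anns)
    print([auxiliary_label(p) for p in parts])  # ['entailment', 'neutral', 'contradiction']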
Each BERT produces a pooled representation for the [CLS] token. The three representations are passed through a multi-layer perceptron (MLP) to obtain the finer-grained NLI label:

P(y|x) = softmax(W_s tanh(W_t [e; n; c]))    (1)

where [e; n; c] is the concatenation of the three learned representations from the entailment-biased, neutral-biased and contradiction-biased BERT models, and W_s and W_t are parameters to be learned.
The overall loss is defined as a weighted sum of four cross-entropy losses:

loss = r * loss_f + ((1 - r) / 3) * (loss_e + loss_n + loss_c)    (2)

where loss_f is the loss of the primary finer-grained NLI label prediction task, loss_e, loss_n and loss_c are the losses of the three auxiliary tasks, and r ∈ [0, 1] controls the weight of the primary task.
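A minimal PyTorch sketch of the architecture (Eq. 1) and loss (Eq. 2) is given below, assuming the HuggingFace transformers library. The layer sizes, the use of BERT's pooler output as the [CLS] representation, the bias terms in the linear layers, and the default r value are our assumptions, not details from the paper.

    import torch
    import torch.nn as nn
    from transformers import BertModel

    class ArtificialAnnotators(nn.Module):
        """Sketch of AAs: three BERT encoders, each biased toward one annotation
        partition, plus an MLP combining their pooled [CLS] representations for
        the 4-way finer-grained NLI prediction."""

        def __init__(self, num_fine_labels=4, num_aux_labels=3, hidden=768):
            super().__init__()
            self.encoders = nn.ModuleList(
                [BertModel.from_pretrained("bert-base-uncased") for _ in range(3)]
            )
            # One auxiliary classifier per "annotator" (entailment-/neutral-/contradiction-biased).
            self.aux_heads = nn.ModuleList([nn.Linear(hidden, num_aux_labels) for _ in range(3)])
            # Eq. 1: softmax(W_s tanh(W_t [e; n; c]))
            self.W_t = nn.Linear(3 * hidden, hidden)
            self.W_s = nn.Linear(hidden, num_fine_labels)

        def forward(self, input_ids, attention_mask):
            # The same premise-[SEP]-hypothesis pair is fed to all three encoders.
            pooled = [enc(input_ids=input_ids, attention_mask=attention_mask).pooler_output
                      for enc in self.encoders]
            aux_logits = [head(rep) for head, rep in zip(self.aux_heads, pooled)]
            fine_logits = self.W_s(torch.tanh(self.W_t(torch.cat(pooled, dim=-1))))
            return fine_logits, aux_logits

    def aas_loss(fine_logits, aux_logits, fine_labels, aux_labels, r=0.5):
        """Eq. 2: weighted sum of the primary loss and the three auxiliary losses.
        r = 0.5 is a placeholder; the ratio is a hyperparameter."""
        ce = nn.CrossEntropyLoss()
        loss_f = ce(fine_logits, fine_labels)
        aux = sum(ce(logits, labels) for logits, labels in zip(aux_logits, aux_labels))
        return r * loss_f + (1 - r) / 3 * aux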

Experiment
We compare against five baselines:

• "Always 0": always predict Disagreement.
• CBOW (Continuous Bag of Words): each item is represented as the average of its tokens' GloVe vectors (Pennington et al., 2014).
• Heuristic baseline: linguistics-driven rules (detailed in Appendix A), adapted from Jiang and de Marneffe (2019b); e.g., the conditional embedding environment is indicative of disagreement items.
• Vanilla BERT (Devlin et al., 2019): directly predict among the 4 finer-grained NLI labels.
• Joint BERT: two BERT models are jointly trained, each with a different speciality. The first one (2-way) identifies whether a sentence pair is a disagreement item. If not, the item is fed into the second BERT (3-way), which carries out systematic inference.

For all baselines involving BERT, we follow the standard practice of concatenating the premise and the hypothesis with [SEP].

Table 3 gives the accuracy and F1 for each baseline and AAs on the CB dev and test sets. We run each model 10 times and report the average. CBOW is essentially the same as the "Always 0" baseline, as it keeps predicting Disagreement regardless of the input. The Heuristic baseline achieves competitive performance on the dev set, though it does significantly worse on the test set. Not surprisingly, both BERT-based baselines outperform the Heuristic on the test set: fine-tuning BERT often leads to better performance, including for NLI (Peters et al., 2019; McCoy et al., 2019). These observations are consistent with Jiang and de Marneffe (2019b), who observed a similar trend, though only on systematic inferences. Our proposed AAs perform consistently better than all baselines, and statistically significantly better on the test set (t-test, p ≤ 0.01). AAs also achieve a smaller standard deviation on the test set across the 10 runs, indicating that they are more stable and potentially more robust in the wild.

Table 4 shows examples for which AAs make the correct prediction while other baselines might not. The confusion matrix in Table 5 shows that the majority (∼60%) of errors come from wrongly predicting a systematic inference item as a disagreement item. In 91% of such errors, AAs predict that there is more than one mode in the annotations (i.e., the three labels predicted by the individual "annotators" in AAs are not unanimous), as in example 5 in Table 4. AAs are thus predicting more modes than necessary when the annotations actually follow a unimodal distribution. Conversely, when an item is supposed to be a disagreement item but is missed by AAs (as in examples 6 and 7 in Table 4), AAs mistakenly predict that there is only one mode in the annotations 78% of the time. It thus seems that a method which accurately captures the number of modes in the annotation distribution would lead to a better model (a simple proxy is sketched below). We also examine the model performance on different linguistic constructions to investigate whether the model learns some of the linguistic patterns present in the Heuristic baseline. The Heuristic rules are strongly tied to the embedding environments.
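To make the mode-count analysis concrete, the check below (our formulation) counts the distinct labels predicted by the three individual annotators in AAs; an item is treated as multi-modal when the three predictions are not unanimous.

    def predicted_mode_count(aux_predictions):
        """Number of distinct systematic-inference labels predicted by the three
        individual "annotators" in AAs (1 = unanimous, i.e., a single predicted mode)."""
        return len(set(aux_predictions))

    # A systematic inference item wrongly flagged as multi-modal (cf. example 5 in Table 4):
    print(predicted_mode_count(["entailment", "entailment", "neutral"]))        # 2
    # A unanimous prediction, i.e., a single mode:
    print(predicted_mode_count(["contradiction", "contradiction", "contradiction"]))  # 1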
Another construction used in the Heuristic rules is one which can lead to a "neg-raising" reading, where a negation in the matrix clause is interpreted as negating the content of the complement, as in example 3 (Table 4) where I do not think they have seen a really high improvement is interpreted as I think they did not see a really high improvement. "Neg-raising" readings often occur with know, believe or think in the first person under negation. There are 85 such items in the test set: 41 contradictions (thus neg-raising items), 39 disagreements and 5 entailments. Context determines whether a neg-raising inference is triggered (An and White, 2019). Table 6 gives F1 scores for the Heuristic, the BERT models and AAs on items under the different embedding environments and on potential neg-raising items in the test set. Though AAs achieve the best overall results, they suffer under the conditional and question environments, as the corresponding training data is scarce (9.04% and 14.17%, respectively). The Heuristic baseline always assigns contradiction to the "I don't know/believe/think" items, thus capturing all 41 neg-raising items but missing the disagreements and entailments. BERT, a SOTA NLP model, is not good at capturing such items either: 71.64 F1 on contradiction vs. 52.84 on the others (Vanilla BERT); 71.69 F1 vs. 56.16 (Joint BERT). Our AAs capture neg-raising items better, with 77.26 F1 vs. 59.38, showing an ability to carry out context-dependent inference on top of the learned linguistic patterns. Table 7, comparing performance on test items correctly predicted by the linguistic rules vs. items for which context-dependent reasoning is necessary, confirms this: AAs outperform the BERT baselines in both categories.

Conclusion
We introduced finer-grained natural language inference. This task aims at teasing apart systematic inferences from inherent disagreements, which have been overlooked in prior work. We show that our proposed AAs, which simulate the uncertainty in the annotation process by capturing the modes in annotations, perform statistically significantly better than all baselines. However, the best performance obtained (∼66%) is still far from robust NLU, leaving room for improvement.

baselines and AAs are listed in Table A2.

C.4 Dataset
The characteristics of the CommitmentBank (CB) are detailed in Section 2. The original version is available at https://github.com/mcdm/CommitmentBank, and the recast version used in this work is available at https://github.com/FrederickXZhang/FgNLI.