Uncertain Natural Language Inference

We introduce Uncertain Natural Language Inference (UNLI), a refinement of Natural Language Inference (NLI) that shifts away from categorical labels, targeting instead the direct prediction of subjective probability assessments. We demonstrate the feasibility of collecting annotations for UNLI by relabeling a portion of the SNLI dataset under a probabilistic scale, where items even with the same categorical label differ in how likely people judge them to be true given a premise. We describe a direct scalar regression modeling approach, and find that existing categorically-labeled NLI data can be used in pre-training. Our best models correlate well with humans, demonstrating models are capable of more subtle inferences than the categorical bin assignment employed in current NLI tasks.

Introduction
"[Textual] entailment inference is uncertain and has a probabilistic nature." - Glickman et al. (2005)

Variants of entailment tasks have been used for decades in benchmarking systems for natural language understanding. Recognizing Textual Entailment (RTE) or Natural Language Inference (NLI) is a categorical classification problem: predict which of a set of discrete labels apply to an inference pair, consisting of a premise (p) and hypothesis (h). The FraCaS consortium offered the task as an evaluation mechanism, along with a small challenge set (Cooper et al., 1996), which was followed by the RTE challenges (Dagan et al., 2006). Both employed a binary set of labels, which we here

Figure 1 examples (premise | subjective probability | hypothesis):
Woman reaching for food at the supermarket. | 0.001 | Woman is reaching for frozen corn at the store.
The brown dog is laying down on a blue sheet. | 0.671 | A dog is laying down on its side, sleeping.
While researchers have recognized the inherent probabilistic nature of NLI, this has been primarily restricted to models of inference, in contrast to the task itself (see §2). Here we propose the task of Uncertain Natural Language Inference (UNLI), which shifts NLI away from categorical labels to the direct prediction of human subjective probability assessments (see Figure 1 for examples). We illustrate that human-elicited probability assessments contain subtle distinctions on the likelihood of a hypothesis sentence conditioned on a context given by a premise sentence, far beyond a traditional ternary label (ENT / NEU / CON) assignment. Further, we define UNLI models built upon BERT (Devlin et al., 2019) that utilize recent advancements from large-scale language model pre-training, and provide experimental results illustrating that systems can often predict these judgments, but with clear gaps in understanding and in cases of logical incoherence.

arXiv:1909.03042v1 [cs.CL] 6 Sep 2019
UNLI is therefore a refinement of NLI that captures more subtle distinctions in meaning, that we can build models to target, and that we can collect data to support. We conclude that scalar annotation protocols such as the one employed here should be adopted in future NLI-style dataset creation, which should enable new work in modeling a richer space of interesting inferences.

Background
Uncertainty in NLI has been considered from a variety of perspectives. Glickman et al. (2005) stated that "p probabilistically entails h ... if p increases the likelihood of h being true." Judges annotated a pair positively if they could infer hypothesis h based on premise p with high confidence, and negatively otherwise: the prediction task was categorical, with associated model scores meant to reflect probabilities. Models during training were not provided with annotations capturing subjective uncertainty. Pavlick and Callison-Burch (2016) elicited ordinal annotations reflecting likelihood judgments, then averaged these labels under an assumption of uniform scalar distance between ordinal categories. The results were used for a manual analysis relating to the semantics of adjective-noun composition, but downstream use of the data in a model was restricted to casting the annotations to a ternary NLI classification problem. Lee et al. (2015) and others averaged ordinal judgments from multiple annotators with regard to whether particular events mentioned in a sentence did or did not happen, with the resulting structured prediction task being modeled as scalar regression (Stanovsky et al., 2017; Rudinger et al., 2018). Factuality data has been recast (White et al., 2017) into NLI form (Poliak et al., 2018a), but retains the traditional NLI categories. Reisinger et al. (2015) and White et al. (2016) have similarly asked annotators to annotate semantic properties on an ordinal scale, with resulting data later recast to traditional NLI. Lai and Hockenmaier (2017) leveraged a collection of image captions with a hierarchical structure to construct a probabilistic entailment model, where they state that "learning to predict the conditional probability of one phrase h given another phrase p would be helpful in predicting textual entailment." We directly ask humans for probability assessments, on complete NLI pairs.
Lalor et al. (2016) and Lalor et al. (2018) attempt to capture the uncertainty of each inference pair by Item Response Theory (IRT), which parameterizes the discriminative power of each inference pair: how easy is it to predict the gold label? For example, a pair (p, h) with CONTRADICTION as the gold label has high discriminative power when the pair is labeled (as CONTRADICTION) correctly by reliable human annotators, and vice versa. Lalor et al. (2018) use IRT to estimate the discriminative power of each inference pair in a subset (180 pairs) of SNLI, showing fine-grained differences in discriminative power within each label. The IRT model relies on the discrete labels as an oracle to determine the difficulty (discriminative power) of labeling each inference pair, while we propose a direct elicitation of subjective probability.
COPA (Roemmele et al., 2011) and ROCStories (Mostafazadeh et al., 2016) are examples of multiple-choice tasks which capture relative uncertainty between examples, but do not force a model to predict the probability of h given p. Zhang et al. (2017) made use of a protocol similar to Pavlick and Callison-Burch (2016), using this first to analyze various existing datasets such as SNLI, COPA and ROCStories, as well as their own automatically generated NLI hypotheses. For prediction, they advocated for an extended definition of NLI that increased the number of categories. UNLI can be viewed as a scalar version of their proposal. Li et al. (2019) viewed the plausibility task of COPA as a learning to rank problem, where the model is trained to assign the highest scalar score to the most plausible alternative given context. Our work can be viewed as an extension to this, with the score being an explicit human probability judgment instead.
Linguists such as van Eijck and Lappin (2014), Goodman and Lassiter (2015), Cooper et al. (2015) and Bernardy et al. (2018) have described models for natural language semantics that introduce probabilities into the compositional, model-theoretic tradition begun by those such as Davidson (1967) and Montague (1973). Where they propose probabilistic models for interpreting language, we are concerned with illustrating the feasibility of eliciting probabilistic judgements on examples through crowdsourcing, and contrasting this with prior efforts that were restricted to limited categorical label sets.
Many works in AI (e.g., by Schubert and Hwang (2000) or Garrette et al. (2011)) have proposed general language understanding systems with formal underpinnings based wholly or in part on probabilities. Here our focus is specifically on (U)NLI, as a motivating task for which we can gather data.

Uncertain NLI
We define UNLI by editing the definition by Dagan et al. (2006) for their original shared task, RTE-1, replacing "p entails h" with "h has subjective probability y given p" and "is most probably true" with "has a y chance of being true": We say that h has subjective probability y given p if, typically, a human reading p would infer that h has a y chance of being true. This somewhat informal definition is based on (and assumes) common human understanding of language as well as common background knowledge.
Formally, given a premise p ∈ P and a hypothesis h ∈ H, a UNLI model F : P × H → [0, 1] should output an uncertainty score ŷ ∈ [0, 1] for the premise-hypothesis pair that correlates well with a human-provided subjective probability assessment. This is in contrast with a traditional 3-class NLI classification model F : P × H → T, where T = {ENT, NEU, CON} is the 3-class label set.
Metrics Given a UNLI dataset {(p_i, h_i, y_i)} comprising premise-hypothesis-uncertainty triples, predictions of uncertainty can be computed as ŷ_i = F(p_i, h_i). We compute the Pearson correlation (r), the Spearman rank correlation (ρ) and the mean square error (MSE) between y and ŷ as the metrics to measure the performance of UNLI models.
These metrics measure both the ranking and the regression aspects of the model: Pearson r measures the linear correlation between the gold probability assessments and the model's output; Spearman ρ measures the ability of the model to rank the premise-hypothesis pairs with respect to their subjective probability; whereas MSE measures whether the model can recover the subjective probability value from premise-hypothesis pairs. Note that we desire a high r and ρ and a low MSE.
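As a minimal sketch of the three metrics above, the following pure-Python functions compute Pearson r, Spearman ρ (as the Pearson correlation of average ranks) and MSE. The function names and the toy values are illustrative assumptions, not part of the paper's code or data.

```python
import math

def pearson_r(y, y_hat):
    """Pearson correlation between two equal-length sequences."""
    n = len(y)
    mean_y = sum(y) / n
    mean_p = sum(y_hat) / n
    cov = sum((a - mean_y) * (b - mean_p) for a, b in zip(y, y_hat))
    var_y = sum((a - mean_y) ** 2 for a in y)
    var_p = sum((b - mean_p) ** 2 for b in y_hat)
    return cov / math.sqrt(var_y * var_p)

def average_ranks(values):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman_rho(y, y_hat):
    """Spearman rank correlation: Pearson r computed on the ranks."""
    return pearson_r(average_ranks(y), average_ranks(y_hat))

def mse(y, y_hat):
    """Mean squared error between gold and predicted values."""
    return sum((a - b) ** 2 for a, b in zip(y, y_hat)) / len(y)

# Toy values for illustration only (not taken from U-SNLI).
gold = [0.0, 0.25, 0.5, 0.9, 1.0]
pred = [0.1, 0.2, 0.6, 0.7, 0.95]
print(pearson_r(gold, pred), spearman_rho(gold, pred), mse(gold, pred))
```

Because the toy predictions preserve the gold ordering, ρ is perfect even though the values themselves differ, which is why MSE is reported alongside the correlations.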

Data
We construct a UNLI dataset by eliciting subjective probabilities from crowdsource workers (Mechanical Turk) on presented premise-hypothesis pairs. No new NLI premise-hypothesis pairs are elicited or generated, as our focus is on the uncertainty aspect of NLI. Owing to its familiarity within the community, we choose to illustrate UNLI via reannotating a sampled subset of SNLI (Bowman et al., 2015). For examples taken across the three categories CON / NEU / ENT we elicit a probability annotation y ∈ [0, 1], resulting in what we will call U-SNLI (Uncertain SNLI) (see Table 1 for examples). We preferred SNLI over MultiNLI for this work owing to SNLI containing a subset of examples for which multiple NEU hypotheses were collected per premise. Zhang et al. (2017) reported a wide range of ordinal likelihood judgments collected across SNLI NEU examples, and so we anticipated these multi-neutral premise examples to be good fodder for illustrating our points here.
There are 7,931 distinct premises in the training set of SNLI that are paired with 5 or more distinct NEU hypotheses: we take 5 of these for each premise in this subset as prompts in elicitation, resulting in 39,655 NEU pairs, with an additional 15,862 CON and ENT pairs combined. Altogether we call this our training set, with 55,517 pairs containing 7,931 distinct premises. Dev and test sets were sampled from SNLI dev and test respectively, again with heavy emphasis on NEU examples (see Table 3).

Annotation
Our process was inspired by the Efficient Annotation of Scalar Labels (EASL) framework of Sakaguchi and Van Durme (2018), which combines notions of direct and relative assessments into a single crowd-sourcing interface. Groups of items are put into lists of size k, where k such items are presented to a user in a single page view, each item paired with a slider bar (for example, one may present k = 5 distinct items on one page view). The slider bar enables direct assessment by the annotator per item. The interface has an implicit relative assessment aspect in that performing direct assessment judgments of multiple items placed visually together in a single page view is meant to encourage cross-item calibration of judgments. Our individual items were premise-hypothesis pairs, with instructions requesting a probability assessment (see Figure 2).

Premise | Hypothesis | SNLI | U-SNLI
A man in a white shirt taking a picture. | A man takes a picture. | ENT | 1.0
A little boy in a striped shirt is standing behind a tree. | The boy is hiding outside. | ENT | 0.904
A man is holding a bus pole in front of a building. | The man is waiting for the bus. | NEU | 0.741
A person is out in the water at the beach while the sun sets. | A woman is at the beach. | NEU | 0.5
A town worker working on electrical equipment. | The worker is tired. | NEU | 0.251
A smiling child is standing behind a tree. | A man is eating a hotdog. | CON | 0.0381
Man laying on a platform outside on some rocks. | Man takes a nap on his couch. | CON | 0.0

Table 1: Examples of subjective probability assessments on NLI pairs (from U-SNLI dev).

Premise Hypothesis SNLI U-SNLI
A man is singing into a microphone.
A man performs a song.

Annotators were asked to estimate how likely the situation described in the hypothesis sentence would be true given the premise. Example pairs were provided in the instructions along with suggested probability values (see Figure 3 for three such examples). Annotators were advised to calibrate their score for a given element by taking into account the scores provided to other elements in the same page view.
Interface For each premise-hypothesis pair, we elicit a probability assessment in [0, 1] from annotators using an interface shown in Figure 2, in contrast to the uniform {1, ..., 100} scale employed  ).
We ran pilots to tune β, finding that people often choose far lower probabilities for some events than was intuitive upon inspection (e.g., just below 50%). Therefore, we employed different β values depending on whether the value fell in the range [0, 0.5] or (0.5, 1] (Figure 4).
Qualification Test Annotators were given a qualification test to ensure non-expert workers were able to give reasonable subjective probability estimates. We first extracted seven statements from Book of Odds (Shapiro et al., 2014), and manually split each statement into a bleached premise and hypothesis. We then wrote three easy premise-hypothesis pairs with definite probabilities like (p = "A girl tossed a coin.", h = "The coin comes up a head.", probability: 0.5). We qualify users that meet both criteria: (1) For the three easy pairs, their annotations had to fall within a small error range δ around the correct label y, computed as δ = (1/4) min{y, 1 − y}. (2) Their overall annotations have a Pearson r > 0.7 and Spearman ρ > 0.4. This qualification test led to a pool of 40 trusted annotators, who were employed for the entirety of our dataset creation.
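The two qualification criteria can be sketched as below. This is an illustration under assumptions: the function names are hypothetical, the error range is treated as inclusive, and the correlations are taken as already computed against the reference probabilities.

```python
def tolerance(y):
    """Allowed deviation around a reference probability y: (1/4) * min(y, 1 - y)."""
    return 0.25 * min(y, 1.0 - y)

def passes_easy_pairs(easy_pairs):
    """easy_pairs: list of (reference_probability, annotated_probability)."""
    return all(abs(a - y) <= tolerance(y) for y, a in easy_pairs)

def qualifies(easy_pairs, pearson_r, spearman_rho):
    """Criterion (1) on the easy pairs, criterion (2) on overall correlations."""
    return passes_easy_pairs(easy_pairs) and pearson_r > 0.7 and spearman_rho > 0.4

# For the coin-toss pair (y = 0.5) the allowed window is 0.5 +/- 0.125.
print(tolerance(0.5))
```

Note that the tolerance shrinks to zero as y approaches 0 or 1, so near-certain statements must be annotated almost exactly.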
Incremental Annotation Each item was doubly annotated. In the case where the difference between the first two annotations on the raw slider bar {0, ..., 10000} was greater than 2000, we elicited a third round of annotation. After annotation, the probability associated with a pair was the median of the gathered responses.
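A minimal sketch of this incremental scheme follows. The callback `request_third` is a hypothetical stand-in for collecting one more annotation, and taking the median on the raw slider scale (rather than after mapping to a probability) is an assumption made here for simplicity.

```python
from statistics import median

DISAGREEMENT_THRESHOLD = 2000  # on the raw slider scale {0, ..., 10000}

def aggregate(first, second, request_third):
    """Return the median of the gathered raw slider responses.

    request_third is a hypothetical callback that collects one more
    annotation; it is only called when the first two disagree strongly.
    """
    responses = [first, second]
    if abs(first - second) > DISAGREEMENT_THRESHOLD:
        responses.append(request_third())
    return median(responses)

# Two close annotations: no third round, median is their midpoint.
print(aggregate(4000, 5000, lambda: 0))
# Two distant annotations: a third round is requested and breaks the tie.
print(aggregate(1000, 9000, lambda: 8000))
```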
Statistics We plot the resultant median and quartile for each of the 3 categories of SNLI under our U-SNLI dev set (Figure 5), showing the wide range of probability judgments elicited.

Model
We base our model for UNLI on the sentence pair classifier in BERT (Devlin et al., 2019) to exploit recent advancements brought by large-scale language model pre-training. The original model for NLI first concatenates the premise and the hypothesis, with a special sentinel token (CLS) inserted at the beginning and a separator (SEP) inserted after each sentence, tokenized using WordPiece. After passing this concatenated token sequence to the BERT encoder, it takes the encoding of the first sentinel (CLS; index: 0) token, and passes the resulting feature vector f(p, h) through a linear layer to predict one of the labels in T = {ENT, NEU, CON} for traditional NLI.
We modify this architecture to accommodate our scenario: the last layer of the network is changed from the 3-dimensional output to a scalar output, the logit score. The sigmoid function σ is applied so that the output lies in [0, 1], as any probability should. The UNLI task is therefore directly modeled as a regression problem, trained using a binary cross-entropy loss between the human annotation y and the model output ŷ.
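The scalar output head and its loss can be sketched in plain Python as follows. Here `features` stands in for the encoder's feature vector f(p, h); the weights, bias, function names and the clipping constant `eps` are illustrative assumptions, and no actual BERT encoder is involved.

```python
import math

def sigmoid(z):
    """Logistic function mapping a real-valued logit into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def unli_score(features, weights, bias):
    """Linear layer producing a single logit, followed by a sigmoid."""
    logit = sum(w * f for w, f in zip(weights, features)) + bias
    return sigmoid(logit)

def bce_loss(y, y_hat, eps=1e-7):
    """Binary cross-entropy with a soft target y in [0, 1]."""
    y_hat = min(max(y_hat, eps), 1.0 - eps)  # avoid log(0)
    return -(y * math.log(y_hat) + (1.0 - y) * math.log(1.0 - y_hat))

# Toy feature vector and parameters (illustrative only).
score = unli_score([0.2, -0.1, 0.4], [1.0, 2.0, -0.5], 0.1)
print(score, bce_loss(0.7, score))
```

With a soft target, the binary cross-entropy is minimized when ŷ equals y, although its minimum value is the entropy of y rather than zero.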

Training with SNLI
We establish baselines for the UNLI task by training with just the SNLI dataset and the original 3-way classification labels (i.e., without our annotated uncertainty scores in U-SNLI). For illustrative purposes, we denote the original SNLI dataset as a set of premise-hypothesis-label triples (p_i, h_i, t_i) where t_i ∈ {ENT, NEU, CON}.

Training via regression
We derive a surrogate function s : T → [0, 1] that maps any SNLI label t ∈ {ENT, NEU, CON} to a surrogate score by taking the average of all probabilistic annotations bearing label t in the U-SNLI training set. The SNLI dataset is therefore mapped to a UNLI dataset {(p_i, h_i, s(t_i))} and we use this mapped version to train our regression model.
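The surrogate mapping described above amounts to a per-label mean. A minimal sketch, with hypothetical function names and toy values that are not drawn from U-SNLI:

```python
from collections import defaultdict

def build_surrogate(u_snli_train):
    """u_snli_train: iterable of (snli_label, probability) pairs.

    Returns a dict mapping each label to the mean probability of the
    annotated examples that carry that label.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for label, y in u_snli_train:
        totals[label] += y
        counts[label] += 1
    return {label: totals[label] / counts[label] for label in totals}

def map_snli(snli, s):
    """Replace each categorical label t by its surrogate score s[t]."""
    return [(p, h, s[t]) for p, h, t in snli]

# Toy annotations for illustration only.
s = build_surrogate([("ENT", 0.9), ("ENT", 1.0), ("NEU", 0.4), ("CON", 0.0)])
print(map_snli([("a premise", "a hypothesis", "NEU")], s))
```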
Training via learning to rank Since we focus on the uncertainty of NLI, we alternatively approach the problem as a learning to rank problem. Instead of regression, we train a model that can correctly rank the premise-hypothesis pairs according to their probability: ENT ≻ NEU ≻ CON. To this end, we train the UNLI model F : P × H → [0, 1] with a margin-based loss (Weston and Watkins, 1999):

L = \sum_{(i,j) \in R} \max\{0, \xi - F(p_i, h_i) + F(p_j, h_j)\}    (2)

where ξ is the margin hyperparameter and R = {(i, j) | t_i ≻ t_j}. That is, the model learns to assign a higher score to F(p_i, h_i) than to F(p_j, h_j) if t_i ≻ t_j, ideally with a gap larger than ξ.
However, the summation in Equation 2 is over R = {(i, j) | t_i ≻ t_j}, which unfortunately has a computationally infeasible O(N^2) complexity, where N is the number of samples in the dataset. Hence we only take the summation over subsets: (1) Shared-premise pairs: data pairs with identical premises are included; these pairs rank the probability of different hypotheses given the same premise: R_1 = {(i, j) | p_i = p_j ∧ t_i ≻ t_j}; (2) Cross-premise pairs: for each sample i, we randomly sample K other samples S_i with different premises and lower probability: R_2 = {(i, j) | j ∈ S_i ∧ t_i ≻ t_j}. The union R_1 ∪ R_2 is used as the training set, hence reducing the complexity to O(KN).

with an initial learning rate of 10^−5 and maximum gradient norm 1.0 for all these settings. For the SNLI ranking setting, the hyperparameters are tuned with the margin ξ ∈ {0.3, 0.35, 0.4} and the number of contrasting samples K ∈ {0, 1, 2}. All models are trained for 3 epochs, where the epoch resulting in the highest Pearson r on the U-SNLI dev set is selected. We report results on both the U-SNLI dev and test sets based on the selected model.
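The pair construction and a margin-based ranking loss can be sketched as follows. Several points are assumptions made for illustration: the loss uses the common hinge form max(0, ξ − s_i + s_j), the cross-premise candidates are drawn only from samples with a lower-ranked label, and all names are hypothetical. The linear scans are kept for clarity; an efficient implementation would index samples by premise and label instead.

```python
import random

RANK = {"CON": 0, "NEU": 1, "ENT": 2}  # ENT ranks above NEU, NEU above CON

def build_pairs(samples, k, rng):
    """samples: list of (premise, hypothesis, label).

    Returns sorted index pairs (i, j) where sample i should score higher
    than sample j: all shared-premise pairs plus up to k randomly sampled
    cross-premise pairs per sample.
    """
    pairs = set()
    for i, (p_i, _, t_i) in enumerate(samples):
        lower = [j for j, (_, _, t_j) in enumerate(samples)
                 if RANK[t_i] > RANK[t_j]]
        # Shared-premise pairs: same premise, lower-ranked label.
        pairs.update((i, j) for j in lower if samples[j][0] == p_i)
        # Cross-premise pairs: different premise, lower-ranked label.
        cross = [j for j in lower if samples[j][0] != p_i]
        pairs.update((i, j) for j in rng.sample(cross, min(k, len(cross))))
    return sorted(pairs)

def margin_loss(scores, pairs, margin):
    """Sum of hinge penalties over pairs whose score gap is below the margin."""
    return sum(max(0.0, margin - scores[i] + scores[j]) for i, j in pairs)

samples = [("a", "h1", "ENT"), ("a", "h2", "NEU"), ("b", "h3", "CON")]
pairs = build_pairs(samples, 1, random.Random(0))
print(pairs, margin_loss([0.9, 0.5, 0.1], pairs, 0.3))
```

Setting k = 0 keeps only the shared-premise pairs, which matches the smallest value in the tuned range for K.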

Hypothesis-only baselines
Owing to the concerns raised about annotation artifacts in SNLI (Gururangan et al., 2018; Tsuchiya, 2018; Poliak et al., 2018b), we include a hypothesis-only baseline version for all our settings (see Table 4), where all premises are reduced to an empty string. These baseline systems achieved a correlation around 40%, corroborating the findings in this line of work that a hidden bias exists in the SNLI dataset that allows prediction from hypothesis sentences even if no context information is given by the premise. These baselines show this bias also exists in U-SNLI.

Main results
The results on the U-SNLI dataset can be found in Table 4. Just training on our annotated U-SNLI yields a reasonable 62.71% Pearson r on test; however, this is consistently improved by augmenting with pre-training on SNLI (under both pre-training settings, namely regression and ranking). Ranking consistently achieves a higher correlation than regression: in the SNLI-only setting without U-SNLI, about a 4% boost in Pearson correlation can be observed by switching from regression to ranking; in the U-SNLI fine-tuning scenarios, this switch results in about a 0.6% increase.
Figure 6 illustrates the effect of fine-tuning with U-SNLI on model behavior. It can be seen that before using our U-SNLI data for fine-tuning (just using SNLI), under the surrogate regression setting, the model's predictions concentrate on the 3 surrogate scalar values of the 3 SNLI classes (CON / NEU / ENT). The learning to rank setting results in slightly more flexible probability assignments to premise-hypothesis pairs that also correlate better with elicited U-SNLI labels, as is supported by better Pearson r scores. After fine-tuning with the U-SNLI training set, the model learns smoother predictions for premise-hypothesis pairs, supported by the superior Pearson correlation score r.
Note that the bottom-right corner of the heatmaps in Figure 6 (samples with ≈ 1.0 gold U-SNLI labels and ≈ 1.0 model predictions) is of high accuracy. This is in accordance with what Zhang et al. (2017) found, that the entailments in the SNLI dataset are of close to 1.0 probability, whereas the NEU and CON pairs exhibit a wider range of subjective probability values.
Errors We present a selection of the samples in the U-SNLI dev set with some of the largest gaps between the gold probability assessment and the output of the BERT-based model (the best model we produced) in Table 5. The BERT-based model seems to have learned lexicon-level inference (e.g., race cars going fast, but ignored crucial information sits in the pits), but fails to learn certain commonsense patterns (e.g. riding amusement park ride screaming; man and woman drinking at a bar on a date). These examples show that despite significant improvements from large-scale language model pretraining, commonsense reasoning and plausibility estimation are yet to be solved.

Human performance
We elicit additional annotations on the U-SNLI dev set to establish a randomly sampled human's performance on UNLI. We split the dev set into 3 parts, where each part is labeled by the annotators previously selected from the qualification test. We ensure that each item is new to its annotator (the annotator did not provide a label used in the creation of dev). Scores are then elicited for all premise-hypothesis pairs with no redundancy (one-way annotation). This setting approximates the performance of a randomly sampled human on U-SNLI, and is therefore a reasonable lower bound on the performance one could achieve with a dedicated, trained single human annotator.
The metrics are listed in Table 4. Our best models achieve a higher score than this human performance (Pearson r: 67.97% > 62.28%), demonstrating they can achieve human-level inference on premise-hypothesis pairs sampled from SNLI.
Coherence Since we define UNLI as a modification to RTE in terms of human responses to h given p, here we ask whether judgments by humans and separately our system are always coherent when the same premise is paired with different hypotheses that are mutually incompatible. We consider two examples (see Table 6) pulled from SNLI train, that we selected by hand owing to the premise establishing the potential for an intuitively common-sense, finitely enumerable set of alternatives. Based on the premise we manually constructed alternatives to an existing hypothesis such that (1) they are logically mutually exclusive; and (2) one of the hypotheses must reasonably hold given the premise. Specifically, a preteen must have an age in the range of {0, ..., 12}, and the most commonsense alternatives to lunch include breakfast and dinner. We distribute these constructed pairs into separate HITs, making sure that no annotator is viewing two related premise-hypothesis pairs at the same time, employing 6-way redundancy (see Figure 7).
With respect to human judgments, we observe the sum of probabilities across the options exceeds 1.0 in both cases. That humans can be irrational in their probability assignments is known, and therefore this result is not unexpected: in UNLI we have embraced human judgments in the definition, taking seriously the phrasing of the original RTE task.
Preteen girl with blond-hair plays with bubbles near a vendor stall in a mall courtyard.
The girl is ten.

Regarding our best model's predictions on these examples, we first observe that its scores also lead to a summed over-estimate, with a distribution of values strikingly similar to the median human response in the barbecue example. Second, we observe a clear error in the girl example, where BERT plus subsequent (U)NLI exposure did not appear to provide a definition of the word preteen.
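The coherence check discussed above reduces to comparing a sum against 1.0: for mutually exclusive alternatives of which one must hold, coherent probabilities sum to exactly 1. A minimal sketch, where the function name is hypothetical and the numbers are invented for illustration and are not the elicited or predicted values:

```python
def coherence_gap(probabilities):
    """Sum of probabilities over mutually exclusive, exhaustive alternatives,
    minus 1. Zero means coherent; positive means a summed over-estimate."""
    return sum(probabilities.values()) - 1.0

# Hypothetical values for illustration only (not the values in Figure 7).
meal = {"breakfast": 0.2, "lunch": 0.7, "dinner": 0.4}
gap = coherence_gap(meal)
print(gap)
```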

Conclusion and Future Work
We proposed a new task of directly predicting human likelihood judgments on NLI premise-hypothesis pairs, calling this Uncertain NLI (UNLI). We built the U-SNLI dataset as a proof of this concept, which contains NLI pairs sampled from SNLI and annotated with subjective probabilistic assessments in the form of a scalar quantity between 0 and 1, instead of the 3-way CONTRADICTION / NEUTRAL / ENTAILMENT classification labels commonly used.
We demonstrated that (1) eliciting supporting data is feasible, and (2) annotations in the data can be used to improve a scalar regression model built on recent contextualized word embeddings (BERT), beyond the information contained in existing categorical labels. Performance was on the level of humans, while still retaining non-human-like errors in some predictions.
We suggest future resource creation in NLI shift to UNLI. Regarding what data to (re-)annotate, we chose SNLI as the basis for our proof of concept for reasons described earlier, but there have been various works discussing concerns with SNLI, such as the hypothesis-only artifacts referenced earlier. Zhang et al. (2017) were concerned about the direct elicitation of hypothesis statements, and proposed a procedure consisting of automatic generation of common-sense hypotheses from SNLI premises, followed by human filtering and labeling: such a process could be adapted to UNLI. Based on common-sense errors observed in our model on U-SNLI, we would anticipate such a dataset proving a significant and interesting challenge.

Figure 1: Neutral premise-hypothesis pairs taken from SNLI, relabeled with subjective probability. Here p y h denotes pair (p, h) is labeled with subjective probability y.
is singing a special and meaningful song. | NEU | 0.152
A man performing in a bar. | NEU | 0.144
A man is singing the national anthem at a crowded stadium. | NEU | 6.18×10^−3

Figure 2: An example of our annotation interface.

Figure 3: Three examples from the instructions.

Figure 5: Distribution of U-SNLI training set, illustrating median and quartile for each of the 7 categories (ENT / NEU 1:5 / CON) under our scalar probability scheme. NEU_i denotes the set of NEU samples labeled as the i-th least likely among the 5 hypotheses paired with each premise. Light / dark shade covers 96% / 50% of each category.

Figure 6: Heatmap of the predictions on U-SNLI dev set under the pretrained (left) and the fine-tuned (right); regression pre-trained (top) and ranking pre-trained (bottom) models.Prediction frequencies are normalized along each gold label row.
Alternatives: one | two | ... | twelve
Three young men standing in a field behind a barbecue smiling each giving the two handed thumbs up sign.
Three men are barbecuing lunch.
Alternatives: breakfast | lunch | dinner

Table 6: Examples from SNLI prompting a question of logical coherence of crowd-sourced probabilities.

Figure 7: Subjective probability for the preteen girl (left) and the barbecued meal (right).

Table 2: A premise in SNLI train, whose 5 hypotheses are annotated with subjective probabilities in U-SNLI.

Table 3: Statistics of our U-SNLI dataset.

Table 4 .
Just training by our annotated U-SNLI yields a reasonable 62.71% Pearson r on test -however this is consistently improved

Table 4: Metrics of the prediction models under various configurations for U-SNLI.

Table 5: Selected examples whose BERT prediction deviates from their probability assessments in U-SNLI dev.