How well do NLI models capture verb veridicality?

In natural language inference (NLI), contexts are considered veridical if they allow us to infer that their underlying propositions make true claims about the real world. We investigate whether a state-of-the-art natural language inference model (BERT) learns to make correct inferences about veridicality in verb-complement constructions. We introduce an NLI dataset for veridicality evaluation consisting of 1,500 sentence pairs, covering 137 unique verbs. We find that both human and model inferences generally follow theoretical patterns, but exhibit a systematic bias towards assuming that verbs are veridical, a bias which is amplified in BERT. We further show that, encouragingly, BERT's inferences are sensitive not only to the presence of individual verb types, but also to the syntactic role of the verb, the form of the complement clause (to- vs. that-complements), and negation.


Introduction
A context is veridical when the propositions it contains are taken to be true, even if not explicitly asserted. For example, in the sentence "He does not know that the answer is 5", "know" is veridical with respect to "The answer is 5", since a speaker cannot felicitously say the former sentence unless they believe the latter proposition to be true. In contrast, "think" would not be veridical here, since "He does not think that the answer is 5" is felicitous whether or not it is taken to be true that "The answer is 5". Understanding veridicality requires semantic subtlety and is still an open problem for computational models of natural language inference (NLI).
This paper deals specifically with veridicality in verb-complement constructions. Prior work in this area has focused on characterizing verb classes, e.g. factives like "know that" (Kiparsky and Kiparsky, 1968) and implicatives like "manage to" (Karttunen, 1971), and on incorporating such lexical semantic information into computational models (MacCartney and Manning, 2009). However, increasingly, linguistic evidence suggests that inferences involving veridicality rely heavily on non-lexical information and are better understood as a graded, pragmatic phenomenon (de Marneffe et al., 2012; Tonhauser et al., 2018).
Thus, in this paper, we revisit the question of whether neural models of natural language inference, which are not explicitly endowed with knowledge of verbs' lexical semantic categories, learn to make inferences about veridicality consistent with those made by humans. We solicit human judgements on 1,500 sentence pairs involving 137 verb-complement constructions. Analysis of these annotations provides new evidence of the importance of pragmatic inference in modeling veridicality judgements. We use our collected annotations to analyze how well a state-of-the-art NLI model (BERT, Devlin et al., 2018) is able to mimic human behavior on such inferences. The results suggest that, while not yet solved, BERT represents non-trivial properties of veridicality in context. Our primary contributions are:
• We collect a new NLI evaluation set of 1,500 sentence pairs involving verb-complement constructions (§4).
• We discuss new analysis of human judgements of veridicality and implications for NLI system development going forward (§5).
• We evaluate the state-of-the-art BERT model on these inferences and present evidence that, while there is still work to be done, the model appears to capture non-trivial properties of verbs' veridicality in context (§6).

Background and Related Work
There is significant work, both in linguistics and NLP, on veridicality and closely-related topics (factuality, entailment, etc). We view past work on veridicality within NLP as largely divisible into two groups, which align with two differing perspectives on the role of the NLI task: the sentence-meaning perspective and the speaker-meaning perspective. Briefly, the sentence-meaning approach to NLI takes the position that NLP systems should strive to model the aspects of a sentence's semantics which are closely derivable from the lexicon and which hold independently of context (Zaenen et al., 2005). In contrast, the speaker-meaning approach to NLI takes the position that NLP systems should prioritize representation of the goal-directed meaning of a sentence within the context in which it was generated (Manning, 2006). Work on veridicality which aligns with the sentence-meaning perspective tends to focus on characterizing verbs according to their lexical semantic classes (or "signatures"), while work which aligns with the speaker-meaning approach focuses on representing "world knowledge" and evaluating inferences in naturalistic contexts.
Lexical Semantics (Sentence Meaning). Most prior work treats veridicality as a lexical semantic phenomenon. Such work is largely based on lexicons of verb signatures which specify the types of inferences licensed by individual verbs (Karttunen, 2012; Nairn et al., 2006; Falk and Martin, 2017; White and Rawlins, 2018). Angeli and Manning (2014) and others incorporated knowledge of verb signatures within a natural logic framework (MacCartney, 2009; Sánchez Valencia, 1991) in order to perform natural language inference. Richardson and Kuhn (2012) incorporated signatures into a semantic parsing system. Several recent models of event factuality similarly make use of veridicality lexicons as input to larger machine-learned systems for event factuality (Saurí and Pustejovsky, 2012; Lotan et al., 2013; Stanovsky et al., 2017). Cases et al. (2019) used nested veridicality inferences as a test case for a meta-learning model, again assuming verb signatures as "meta information" known a priori.
Pragmatics (Speaker Meaning). Geis and Zwicky (1971) observed that implicative verbs often give rise to "invited inferences", beyond what is explainable by the lexical semantic type of the verb. For example, on hearing "He did not refuse to speak", one naturally concludes that "He spoke" unless additional qualifications are made (e.g. "...he just didn't have anything to say"). de Marneffe et al. (2012) explored this idea in depth and presented evidence that such pragmatic inferences are both pervasive and annotator-dependent, but nonetheless systematic enough to be relevant for NLP models. Karttunen et al. (2014) make similar observations specifically in the case of evaluative adjectives, and Pavlick and Callison-Burch (2016) specifically in the case of simple implicative verbs. In non-computational linguistics, Simons et al. (2010, 2017) and Tonhauser et al. (2018) take a strong stance and argue that veridicality judgements are entirely pragmatic, dependent solely on the question under discussion (QUD) within the given discourse.

This Work. This paper assumes the speaker-meaning approach: we take the position that models which consistently mirror human inferences about veridicality in context can be said to understand veridicality in general. We acknowledge that the question of what is the "right" approach to NLI has existed since the original definition of the recognizing textual entailment (RTE) task (Dagan et al., 2006) and remains open. However, there has been a de facto endorsement of the speaker-meaning definition, evidenced by the widespread adoption of NLI datasets which favor informal, "natural" inferences over prescriptivist annotation guidelines (Manning, 2006; Bowman et al., 2015; Williams et al., 2018; Westera and Boleda, 2019). Thus, from this perspective, we ask: do NLI models which are not specifically endowed with lexical semantic knowledge pertaining to veridicality nonetheless learn to model this semantic phenomenon?

Table 1: Examples of several verb signatures and illustrative contexts for each. Signature s1/s2 denotes that the complement will project with polarity s1 in a positive environment and polarity s2 in a negative environment. Recoverable example rows: "He did not refuse to do the same. → He did the same." (NA); for the •/• signature, "Many felt that its inclusion was a mistake. → Its inclusion was a mistake." and "Many did not feel that its inclusion was a mistake. → Its inclusion was a mistake."

Projectivity and Verb Signatures
Veridicality is typically treated as a lexical semantic property of verbs, specified by the verb's signature. These signatures can indicate that a verb licenses positive (+), negative (−), or neutral (•) inferences. Specifically, Karttunen (2012) defines these as two-bit signatures, to reflect that verbs may behave differently in positive vs. negative environments. For example, a factive verb construction like "know that" has a +/+ signature, indicating that the complement projects positively in both positive and negative environments. That is, both "He knows that the answer is 5" and "He does not know that the answer is 5" imply that "The answer is 5". In contrast, a verb like "manage to" has the signature +/− since, in a positive environment, the complement projects ("I managed to pass" → "I passed") but, in a negative environment, the negation of the complement projects ("I did not manage to pass" → ¬"I passed"). Other verbs may exhibit veridicality only in positive or negative environments but not in both. For example, "refuse to" has signature −/•: "She refused to dance" → ¬"She danced", but "She did not refuse to dance" neither implies nor contradicts the claim "She danced". Still other verbs are entirely non-veridical (•/•). For example, "hope to" is not expected to license any inferences about the truth of its complement. We consider 8 signatures in total. Table 1 provides several examples. Table 2 lists all of the signatures and the corresponding verbs we consider.
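The two-bit signature notation can be made concrete with a small sketch. The dictionary and function names below are ours, for illustration only (see Table 2 for the full verb inventory):

```python
# Illustrative two-bit projectivity signatures (Karttunen, 2012).
# Each signature is a pair (positive environment, negative environment):
# '+' = complement projects as true, '-' = projects as false,
# 'o' = no inference licensed about the complement.
SIGNATURES = {
    "know that": ("+", "+"),  # factive: projects even under negation
    "manage to": ("+", "-"),  # implicative
    "refuse to": ("-", "o"),
    "hope to":   ("o", "o"),  # non-veridical
}

def expected_inference(verb, negated):
    """Return the expected inference about the complement: '+', '-', or 'o'."""
    pos, neg = SIGNATURES[verb]
    return neg if negated else pos

expected_inference("know that", negated=True)   # '+': "He does not know that X" -> X
expected_inference("manage to", negated=True)   # '-': "did not manage to X" -> not X
```

Under this scheme, a factive like "know that" licenses a positive inference in both environments, while "hope to" licenses none.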

Data
For our analysis, we collect an NLI dataset for veridicality evaluation derived from the MNLI corpus. This data is publicly available at https://github.com/alexisjihyeross/verb_veridicality.
We then generate premise/hypothesis pairs as follows. We use the parse tree provided by MNLI to extract the complement clause C. When needed, we inflect the verb in the complement (using the pattern-en library) to match the tense of the main verb. We then generate two ⟨p, h⟩ pairs: the sentence and the complement as-is ⟨S, C⟩, and the negated sentence plus the complement ⟨¬S, C⟩. For example, given an original sentence like "He knows that the answer is 5", we would generate two ⟨p, h⟩ pairs: ⟨"He knows that the answer is 5", "The answer is 5"⟩ and ⟨"He does not know that the answer is 5", "The answer is 5"⟩. The examples shown in Table 1 illustrate ⟨p, h⟩ pairs drawn from our dataset, generated this way.
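The pair-generation format can be sketched as follows. This is a simplification: the actual pipeline extracts C and produces ¬S using MNLI parse trees and verb inflection, whereas here the negated sentence is supplied directly for illustration:

```python
# Toy sketch of the <p, h> pair format: each source sentence yields
# <S, C> and <not-S, C>, sharing the same hypothesis (the complement).
def make_pairs(sentence, negated_sentence, complement):
    """Return the two premise/hypothesis pairs for one source sentence."""
    return [(sentence, complement), (negated_sentence, complement)]

pairs = make_pairs(
    "He knows that the answer is 5.",
    "He does not know that the answer is 5.",
    "The answer is 5.",
)
# pairs[0] and pairs[1] share the hypothesis "The answer is 5."
```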

Annotation
For each ⟨p, h⟩ pair, we collect human judgements on Amazon Mechanical Turk. We have raters label entailment on a 5-point Likert scale in which −2 means that h is definitely not true given p and 2 means that h is definitely true given p. This ordinal labelling scheme matches prior work on common sense inference (Zhang et al., 2017) and on veridicality specifically (de Marneffe et al., 2012); full annotation guidelines are in the Supplementary. (Verb inflection in §4 uses pattern-en: https://www.clips.uantwerpen.be/pages/pattern-en.) We do not provide examples for boundary cases (the difference between −2 and −1, or 1 and 2) to avoid biasing raters by providing explicit guidance about the extent to which common sense can factor in. Raters have the option of indicating with a check box that one or both sentences does not make sense, and thus that they are unable to judge. We require that raters have had at least 100 approved tasks, have maintained a 98% approval rating, and are located in a primarily English-speaking country (US, AU, GB, CA). We collect three annotations per ⟨p, h⟩ pair, and pay $0.10 per set of six pairs labelled.
Quality Controls and Exclusion Criteria. We remove all sentence pairs in which one or more raters checked the "does not make sense" box. We remove sentences from our analysis unless both the ⟨S, C⟩ and the ⟨¬S, C⟩ pairs passed this filter. Finally, we remove verbs from our analysis which, after the above filtering, do not appear with at least 4 sentences (i.e. 8 p/h pairs). Our final dataset contains 137 verb types across 1,498 sentences (2,996 pairs). Table 2 lists the verbs included in our dataset and the number of sentences in which each appears. To measure inter-rater agreement, for each example and each of the three raters assigned to the example, we calculated the correlation between that rater's score and the averaged score of the other two raters. The Spearman correlation among raters, averaged across the three raters, was 0.78 for positive contexts and 0.74 for negative contexts.
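A minimal sketch of this leave-one-out agreement computation follows. The implementation is ours and simplified: Spearman correlation is computed as Pearson correlation over ranks, ignoring ties:

```python
# Leave-one-out rater agreement: correlate each rater's scores with the
# mean of the other two raters' scores, then average over raters.
def ranks(xs):
    """Rank transform (ties not handled, for simplicity)."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    r = [0.0] * len(xs)
    for rank, i in enumerate(order):
        r[i] = float(rank)
    return r

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    sa = sum((x - ma) ** 2 for x in a) ** 0.5
    sb = sum((y - mb) ** 2 for y in b) ** 0.5
    return cov / (sa * sb)

def leave_one_out_agreement(scores):
    """scores: one [r1, r2, r3] triple of ratings per example."""
    cors = []
    for k in range(3):
        mine = [s[k] for s in scores]
        others = [(sum(s) - s[k]) / 2 for s in scores]  # mean of other two
        cors.append(pearson(ranks(mine), ranks(others)))
    return sum(cors) / 3
```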

Aggregation
We take the mean of the three human judgements for each sentence pair. We then represent each verb v (in the context of a given sentence S) using a continuous analog of the projectivity signatures discussed in §3. That is, we take the mean score for ⟨S, C⟩ as a measure of the veridicality of v (in the context of S) in a positive environment, and the mean score for ⟨¬S, C⟩ as a measure of the veridicality in a negative environment. For example, given S = "David Plotz failed to change my mind" and C = "David Plotz changed my mind", we get a soft projectivity "signature" of −2.0/1.67, which is consistent with the expected (discrete) −/+ signature for "fail to".

Figure 1: Human judgements (top, blue) and model predictions (bottom, orange) for verbs in each category. Gray squares denote the region in which judgements are expected to fall, given the signature. Each colored dot corresponds to a single context (verb within a specific sentence); each black dot corresponds to a single verb (averaged score over all contexts in which it was judged).
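The aggregation amounts to averaging the three ratings per environment. A minimal sketch, with illustrative per-rater ratings chosen to reproduce the "fail to" example above:

```python
# Soft projectivity "signature": (mean rating for <S, C>,
# mean rating for <not-S, C>), each on the -2..2 scale.
def soft_signature(pos_ratings, neg_ratings):
    mean = lambda xs: sum(xs) / len(xs)
    return (mean(pos_ratings), mean(neg_ratings))

# "David Plotz failed to change my mind" / "David Plotz changed my mind":
# hypothetical ratings yielding the -2.0/1.67 signature reported above.
sig = soft_signature([-2, -2, -2], [2, 2, 1])  # -> (-2.0, 1.666...)
```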

Analysis of Human Judgements
Figure 1a plots these soft veridicality signatures for each sentence. We see that, averaged across all the contexts in which they are judged, verbs tend to behave as expected given their assigned lexical semantic signature. However, we observe two noteworthy trends, discussed below. We note that these observations are consistent with arguments made by de Marneffe et al. (2012) about the strong effects of pragmatics on veridicality judgments.
Veridicality Bias. First, we observe a systematic "veridicality bias", in which positive or negative inferences about complements are often made even in environments where the verb is expected to be non-veridical (• signature). This trend is most evident in the case of verbs with •/• signatures, for example, "think that" and "want to".
While embedding under such verbs should not license any inferences about the truth of the complement, we observe that, in practice, these verbs tend to behave like +/− verbs. That is, the complement is taken as true in positive environments and as false in negative environments. Table 3 shows some examples for which this is the case.
Within-Verb Variation. Second, we observe that, while individual verb types tend to behave in line with their expected signatures on average, signatures provide a weak signal for predicting the inferences licensed by the verb in any sentence individually. That is, within each signature, we see high variance across contexts, in all cases spanning at least 2 points (on our −2 to 2 scale). Table 4 shows examples of words receiving different signatures based on context. Quantitatively, in an ordinary least squares regression (statsmodels.regression.linear_model.OLS), we find that using verb signature alone to predict the human judgments in a given context explains only a small amount of the observed variation (R² ≈ 0.11). For context, using the verb type itself produced R² ≈ 0.72. We experimented with other contextual features in combination with linguistic category and/or verb type (e.g. tense of the main verbs, first vs. third person subjects, etc.) to try to improve the fit of the model, but did not find any noteworthy effects. It is likely that more careful featurization of highly relevant concepts, e.g. at-issueness (Tonhauser et al., 2018), could yield more conclusive insights about which aspects of context lead to within-signature, or within-verb, variation. We leave such analysis for future work, and conclude simply that endowing NLI models with knowledge of projectivity signatures is not alone sufficient for producing humanlike inferences on such sentences.

Table 4: Examples of verbs receiving different labels depending on context (bracketed label; mean human score in parentheses):
The GAO has indicated that it is unwilling to compromise. → It is unwilling to compromise.
[−] (−1.0) The GAO has not indicated that it is unwilling to compromise. → ¬ It is unwilling to compromise.
[+] (1.7) Everyone knows that the CPI is the most accurate. → The CPI is the most accurate.
[+] (1.7) Everyone does not know that the CPI is the most accurate. → The CPI is the most accurate.
[+] (0.7) I know that I was born to succeed. → I was born to succeed.
[•] (0.3) I do not know that I was born to succeed. → I was born to succeed.

Takeaways. Overall, we interpret the above analysis as evidence that veridicality judgments rely heavily on contextual as opposed to purely lexical semantic factors. While this is not a novel conclusion (Simons et al., 2010; de Marneffe et al., 2012), system development concerned with improving veridicality judgements still nearly always proceeds by incorporating explicit lexical semantic knowledge into the pipeline or architecture (Richardson and Kuhn, 2012; Lotan et al., 2013; Stanovsky et al., 2017; MacCartney and Manning, 2009; Angeli and Manning, 2014; Saurí and Pustejovsky, 2012; Cases et al., 2019). Our analysis suggests such approaches are likely to yield only incremental gains. While admittedly more difficult to encode, focusing on context-specific factors first, e.g. predicate classes and pragmatics (de Marneffe et al., 2012) or the question under discussion (Simons et al., 2010), would likely be more productive and may ultimately override the need for verb signatures altogether.


Analysis of BERT Predictions

We now turn to our primary question: do current NLI models capture the veridicality of verbs?
In particular, we are interested in the behavior of a distributional model that is not specifically endowed with lexical semantic information related to veridicality. We ask two questions. First: does such a model learn to make inferences consistent with those made by humans? Second: if the model does mirror human inferences, are the predictions based solely on the presence of specific lexical items, or are they sensitive to structural factors (namely, syntactic position and complement type)? Again, we prioritize modeling speaker meaning. Thus, we believe the model should ideally reflect the same biases and variation observed in the human judgments, not necessarily the inferences expected based on the lexical semantic signatures of the verbs.

Setup
We use the state-of-the-art BERT-based NLI model. Specifically, we use the original TensorFlow implementation of the NLI model built on top of the pretrained BERT language model (Devlin et al., 2018). We use the model off-the-shelf, with the default training setup and hyperparameters. To fine-tune the model for the NLI task, we use the standard train/dev splits from the MNLI corpus, but, to avoid confounds, we remove the 1,500 p/h pairs from which our new test set is derived (as described in §4). The model is trained to make a softmax classification over three classes: {ENTAILMENT, CONTRADICTION, NEUTRAL}. When necessary to compare these discrete predictions to our continuous human judgments, we map the prediction to a continuous value using P(ENTAILMENT) − P(CONTRADICTION). This score ranges from −1 to 1 and is comparable to our human scores. Conversely, when necessary to compare the continuous human scores to the discrete predictions, we discretize scores into evenly-sized bins. Overall, similar performance trends hold whether we compare in discrete space or continuous space. In the analyses below, we use whichever is most interpretable, as reported.
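These two conversions can be sketched as follows. The exact bin boundaries for discretization are our assumption; the text states only that the bins are evenly sized over the rating scale:

```python
# Map a 3-way softmax to a scalar in [-1, 1], and discretize a
# continuous human score in [-2, 2] into three evenly-sized bins.
def to_scalar(p_entail, p_contra, p_neutral):
    """P(ENTAILMENT) - P(CONTRADICTION)."""
    return p_entail - p_contra

def discretize(score, lo=-2.0, hi=2.0):
    """Assumed binning: three equal-width thirds of [lo, hi]."""
    third = (hi - lo) / 3
    if score < lo + third:
        return "CONTRADICTION"
    if score < lo + 2 * third:
        return "NEUTRAL"
    return "ENTAILMENT"

to_scalar(0.90, 0.05, 0.05)  # ~0.85
discretize(1.7)              # "ENTAILMENT"
```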

Overall Prediction Performance
We first measure raw prediction performance: do the inferences made by BERT mirror the inferences that our human raters made? Figure 1b shows scatter plots of the model's predictions (mapped into continuous space) side-by-side with the human scores just discussed. Table 5 shows performance evaluated against human judgements in terms of both (discrete) classification accuracy and (continuous) Spearman correlation. Broadly speaking, the model's predictions appear to follow the same trends as the humans' ratings (Figure 1). That is, averaged across contexts, the model's treatment of verbs is the same as the humans': largely in line with the signatures, but with a bias against assuming neutral (non-veridical) behavior. However, whereas the humans' judgments span all levels of certainty (taking a range of values from −2 to 2), the model tends to make predictions with high confidence. This is especially the case in positive environments, where the model nearly always predicts with 99+% confidence. In negative environments, the model expresses a greater range of uncertainty values, and is much more closely in line with what we observe in human judgements. In terms of quantitative measures of accuracy (Table 5), the most notable trend is that model performance is highest for cases in which the negation of the complement is expected to project (− signatures). This is true regardless of whether that behavior occurs in a positive or negative environment. We note that, for such cases, human judgements closely align with the lexical semantic predictions. The model performs worst in positive environments when the verb is expected to be non-veridical (• signatures).

Table 5: Accuracy and Spearman correlation of BERT MNLI model predictions against human judgements. The +/−/• symbols denote the expected labels based on the lexical semantic category of the verb, and are not necessarily the labels given by our human annotators (compare against Figure 1).
This appears to result from the model's tendency to over-exaggerate the veridicality bias: i.e. whereas humans show a general tendency to assume the complement projects in these cases, the model predicts ENTAILMENT with near certainty (see Figure 1).

Counterfactual Analysis
Next, we ask: are the above-observed trends in BERT's predictions driven predominantly by lexical priors-i.e. the presence of a specific verb-or are they sensitive to other lexicosyntactic factors that should ideally affect the inference?
Experimental Design. For each verb construction vt (e.g. "try to" or "realize that") in our dataset, we perform several manipulations in which we insert v or t into sentences where they did not originally appear, and observe the effect this has on the distribution of the model's predictions. Our specific manipulations and the expected effects are described below. Table 6 shows examples. For convenience, we use D to refer to the set of all the ⟨S, C⟩ pairs in our dataset, D_vt to refer to all the pairs in which vt appears as the main verb clause in S, and D_to (D_that) to refer to all the pairs in which C is a to-complement (that-complement). When clear from context, we abuse notation and use e.g. D to refer both to the dataset itself and to the distribution of the model's predictions when run over the dataset.

Replace Main Verb: For each pair ⟨S, C⟩ ∈ D, we replace the main verb in S with the target verb v, generating a new premise S*. We expect that, if the model is sensitive specifically to the presence of v and its effect on inferences, then the distribution of model predictions over all ⟨S*, C⟩ pairs should be more similar to the target distribution of predictions over all of D_vt than to the baseline distribution of predictions over all of D. We differentiate between settings with "matched" complement types, where we generate S* from pairs in D_t, and those with "mismatched" complement types, where we generate from pairs in D_¬t. E.g., for a target vt = "try to", we consider substitutions into premises from D_to as "matched" and substitutions into D_that as "mismatched". Preserving this distinction allows us both to avoid confounds due to ungrammatical substitutions, and to investigate whether the model is sensitive to verbs which behave differently when they take different complements. For example, "forget" is −/+ when it takes to but +/+ when it takes that.

Table 6: Example manipulations for target verb "try" (S: original premise; S*: manipulated premise; C/C*: hypothesis):
Main Verb (Match): S: He attempted to overcome the sensation. S*: He tried to overcome the sensation. C: He overcame the sensation.
Main Verb (Mismatch): S: I decided that the department had acted illegally. S*: I tried that the department had acted illegally. C: The department had acted illegally.
Complement Verb: S: He attempted to overcome the sensation. S*: He attempted to try the sensation. C*: He tried the sensation.
Complement Type: S: They tried to get his attention. S*: They tried that get his attention. C: They got his attention.
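At the string level, the main-verb and complement-type manipulations can be sketched roughly as below. The helper functions are hypothetical; the actual substitutions operate on MNLI parse trees rather than raw strings:

```python
# Toy string-level sketch of two counterfactual manipulations.
def replace_main_verb(premise, old_construction, new_verb):
    """e.g. 'He attempted to ...' -> 'He tried to ...' (keeps the t)."""
    old_v, old_t = old_construction.split()
    return premise.replace(f"{old_v} {old_t}", f"{new_verb} {old_t}", 1)

def replace_complement_type(premise, verb, old_t, new_t):
    """Control: swap only the complementizer ('to' <-> 'that')."""
    return premise.replace(f"{verb} {old_t}", f"{verb} {new_t}", 1)

replace_main_verb("He attempted to overcome the sensation.", "attempted to", "tried")
# -> "He tried to overcome the sensation."
replace_complement_type("They tried to get his attention.", "tried", "to", "that")
# -> "They tried that get his attention." (intentionally ungrammatical)
```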

Replace Complement Verb: For each ⟨S, C⟩ ∈ D, we replace the main verb in C with the target verb v, generating a new hypothesis C*. We expect that, if the model is sensitive not just to the presence of v, but also to its syntactic role, then the distribution of predictions over all ⟨S, C*⟩ pairs should resemble the baseline distribution over D more than the target distribution over D_vt.
Replace Complement Type: For each pair ⟨S, C⟩ ∈ D_vt, we replace the t in S with the alternative complement type ("to" → "that"; "that" → "to"), generating a new premise S*. This generates ungrammatical sentences, and serves as a control experiment to check whether the model is considering the entirety of the context in which the verb construction appears, or merely the vt bigram. We expect that, if the model is considering the whole context, the distribution of predictions over all ⟨S*, C⟩ should resemble the target distribution D_vt more than the baseline distribution D_t.
Results. Table 7 shows, for verbs within each signature, the KL divergence between the post-manipulation prediction distribution (D*) and 1) the baseline distribution (D) and 2) the target distribution (D_vt). Results are shown for both the main and complement verb manipulations.
A few trends are worth highlighting. First, we do see evidence that the model's predictions depend at least in part on the individual verb type. This is supported by the fact that, across verb signatures, manipulation of the main verb leads to distributions which are more similar to the target verb distribution D_vt than to the baseline distribution D. This trend is strongest for verbs which involve − signatures. Second, we see encouraging, though not overwhelming, evidence that the model's predictions are sensitive to the syntactic position of the verb. This is supported by the fact that, in general, the similarity between D* and D_vt is much lower (higher KL) when the manipulation occurs in the complement clause than when it occurs in the main clause. Note that, ideally, this manipulation should not affect the prediction distribution at all. Nonetheless, the trend is clear and points in the right direction. Table 8 shows the KL divergence between D* and the target verb distribution D_vt in the matched and mismatched cases (see the Supplementary for a breakdown by verb type). We see that BERT behaves as we hope: namely, it makes different predictions for v-to constructions and v-that constructions, even when the v is the same. Manipulating the main verb only substantially affects predictions when the manipulation occurs in a context with the right complement type (matched); when the manipulation results in an ungrammatical sentence (mismatched), the prediction remains close to baseline. An example of such verb-construction differentiation is shown in Figure 2 for the verb "know that", but this is a trend seen across verbs. Moreover, we see that this effect is not just driven by sensitivity to the specific vt bigram. That is, simply swapping "to" with "that" (or vice-versa) in a naturally-occurring context leads to a small shift in the distribution of the model's predictions away from the target D_vt distribution, but not to the same degree as replacing a "that"-taking verb with a "to"-taking verb in a naturally-occurring "that" context (or vice-versa). This result provides some evidence that BERT's prediction is influenced by aspects of the context other than just the presence of the vt bigram.

Table 8: KL divergence between D* and D_vt for complement type manipulations ("to" vs. "that"). Inserting v into a context affects BERT's predictions only when the complement is compatible with v.
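The comparison metric can be sketched as a KL divergence between label distributions estimated from prediction counts. The smoothing constant and count-based estimation are our assumptions; the counts below are made up:

```python
import math

# KL(P || Q) over the three NLI labels, from raw prediction counts.
def kl_divergence(p_counts, q_counts, eps=1e-9):
    """Counts are per-label; eps smooths zero probabilities in Q."""
    p_total, q_total = sum(p_counts), sum(q_counts)
    kl = 0.0
    for pc, qc in zip(p_counts, q_counts):
        p = pc / p_total
        q = max(qc / q_total, eps)
        if p > 0:  # 0 * log(0/q) contributes nothing
            kl += p * math.log(p / q)
    return kl

# D* vs. target D_vt: similar label distributions give a small KL.
kl_divergence([80, 10, 10], [75, 15, 10])
```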

Conclusion
We investigate how well BERT, a neural NLI model not explicitly endowed with knowledge of lexical semantic verb signatures, is able to learn to make correct inferences about veridicality. We collect a new NLI dataset of human veridicality judgements. We observe that human judgments often differ from what is predicted given the lexical semantic types of verbs, and that BERT is able to replicate many of these judgments, although there is still significant room for improvement.
Through counterfactual experiments, we show that individual verbs strongly influence BERT's predictions, and that these cues interact with syntactic information in desirable ways.