Are Natural Language Inference Models IMPPRESsive? Learning IMPlicature and PRESupposition

Natural language inference (NLI) is an increasingly important task for natural language understanding, which requires one to infer whether a sentence entails another. However, the ability of NLI models to make pragmatic inferences remains understudied. We create an IMPlicature and PRESupposition diagnostic dataset (IMPPRES), consisting of over 25K semi-automatically generated sentence pairs illustrating well-studied pragmatic inference types. We use IMPPRES to evaluate whether BERT, InferSent, and BOW NLI models trained on MultiNLI (Williams et al., 2018) learn to make pragmatic inferences. Although MultiNLI appears to contain very few pairs illustrating these inference types, we find that BERT learns to draw pragmatic inferences. It reliably treats scalar implicatures triggered by “some” as entailments. For some presupposition triggers like “only”, BERT reliably recognizes the presupposition as an entailment, even when the trigger is embedded under an entailment canceling operator like negation. BOW and InferSent show weaker evidence of pragmatic reasoning. We conclude that NLI training encourages models to learn some, but not all, pragmatic inferences.


Introduction
One of the most foundational semantic discoveries is that systematic rules govern the inferential relationships between pairs of natural language sentences (Aristotle, De Interpretatione, Ch. 6). In natural language processing, Natural Language Inference (NLI), a task whereby a system determines whether a pair of sentences instantiates an entailment, a contradiction, or a neutral relation, has been useful for training and evaluating models on sentential reasoning. However, linguists and philosophers now recognize that there are separate semantic and pragmatic modes of reasoning (Grice, 1975; Clark, 1996; Beaver, 1997; Horn and Ward, 2004; Potts, 2015), and it is not clear which of these modes, if either, NLI models learn. We investigate two pragmatic inference types that are known to differ from classical entailment: scalar implicatures and presuppositions. As shown in Figure 1, implicatures differ from entailments in that they can be denied, and presuppositions differ from entailments in that they are not canceled when placed in entailment-cancelling environments (e.g., negation, questions).
Figure 1: Illustration of key properties of classical entailments, implicatures, and presuppositions. Solid arrows indicate valid commonsense entailments, and arrows with X's indicate lack of entailment. Dashed arrows indicate follow-up statements with the addition of in fact, which can be either acceptable or unacceptable.
* Equal Contribution
To enable research into the relationship between NLI and pragmatic reasoning, we introduce IMPPRES, a fine-grained NLI-style diagnostic test dataset for probing how well NLI models handle implicature and presupposition. Containing 25.5K sentence pairs illustrating key properties of these pragmatic inference types, IMPPRES is automatically generated according to linguist-crafted templates, allowing us to create a large, lexically varied, and well controlled dataset targeting specific instances of both types.
We first investigate whether presuppositions and implicatures are present in NLI models' training data. We take MultiNLI (Williams et al., 2018) as a case study, and find it has few instances of pragmatic inference, and almost none that arise from specific lexical triggers (see §4). Given this, we ask whether training on MultiNLI is sufficient for models to generalize about these largely absent commonsense reasoning types. We find that generalization is possible: the BERT NLI model shows evidence of pragmatic reasoning when tested on the implicature from some to not all, and the presuppositions of certain triggers (only, cleft existence, possessive existence, questions). We also obtain some negative results, which suggest that models like BERT still lack a sophisticated enough understanding of the meanings of the lexical triggers for implicature and presupposition (e.g., BERT treats several word pairs as synonyms, most notably or and and).
Our contributions are: (i) we provide a new diagnostic test set to probe for pragmatic inferences, complete with linguistic controls, (ii) to our knowledge, we present the first work evaluating deep NLI models on specific pragmatic inferences, and (iii) we show that BERT models can perform some types of pragmatic reasoning very well, even when trained on NLI data containing very few explicit examples of pragmatic reasoning. We publicly release all IMPPRES data, models evaluated, annotations of MultiNLI, and the scripts used to process data. 1

Background: Pragmatic Inference
We take pragmatic inference to be a relation between two sentences relying on the utterance context and the conversational goals of interlocutors. Pragmatic inference contrasts with semantic entailment, which instead captures the logical relationship between isolated sentence meanings (Grice, 1975;Stalnaker, 1974). We present implicature and presupposition inferences below.

Implicature
Broadly speaking, implicatures contrast with entailments in that they are inferences suggested by the speaker's utterance, but not included in its literal meaning (Grice, 1975). Although there are many types of implicatures, we focus here on scalar implicatures. Scalar implicatures are inferences, often optional, which can be drawn when one member of a memorized lexical scale (e.g., ⟨some, all⟩) is uttered (see §6.1). For example, when someone utters Jo ate some of the cake, they suggest that Jo didn't eat all of the cake (see Figure 1 for more examples). According to Neo-Gricean pragmatic theory (Horn, 1989; Levinson, 2000), the inference Jo didn't eat all of the cake arises because some has a more informative lexical alternative all that could have been uttered instead. We expect the speaker to make the most informative true statement: as a result, the listener should infer that a stronger statement, where some is replaced by all, is false. Implicatures differ from entailments (and, as we will see, presuppositions; see Figure 1) in that they are deniable, i.e., they can be explicitly negated without resulting in a contradiction. For example, someone can utter Jo ate some of the cake, followed by In fact, Jo ate all of it. In this case, the implicature (i.e., Jo didn't eat all the cake from above) has been denied. We thus distinguish implicated meaning from literal, or logical, meaning.
1 github.com/facebookresearch/ImpPres

Presupposition
Presuppositions of a sentence are facts that the speaker takes for granted when uttering a sentence (Stalnaker, 1974; Beaver, 1997). Presuppositions are generally associated with the presence of certain expressions, known as presupposition triggers. For example, in Figure 1, the definite description the cake triggers the presupposition that there is a cake (Russell, 1905). Other examples of presupposition triggers are shown in Table 1.
Presuppositions differ from other inference types in that they generally project out of operators like questions and negation, meaning that they remain valid inferences even when embedded under these operators (Karttunen, 1973). The inference that there is a cake survives even when the presupposition trigger is in a question (Did Jordan eat some of the cake?), as shown in Figure 1. However, in questions, classical entailments and implicatures disappear. Table 1 provides examples of triggers projecting out of several entailment canceling operators: negation, modals, interrogatives, and conditionals.
It is necessary to clarify in what sense presupposition is a pragmatic inference. There is no consensus on whether presuppositions should be considered part of the semantic content of expressions (see Stalnaker, 1974;Heim, 1983, for opposing views). However, presuppositions may come to be inferred via accommodation, a pragmatic process by which a listener infers the truth of some new fact based on its being presupposed by the speaker (Lewis, 1979). For instance, if Jordan tells Harper that the King of Sweden wears glasses, and Harper did not previously know that Sweden has a king, they would learn this fact by accommodation. With respect to NLI, any presupposition in the premise (short of world knowledge) will be new information, and therefore accommodation is necessary to recognize it as entailed.

Related Work
As datasets for NLI become increasingly numerous, one might wonder, do we need yet another NLI dataset? In this case, the answer is clearly yes: despite NLI's formulation as a commonsense reasoning task, it is still unknown whether this framing has resulted in models that learn specific modes of pragmatic reasoning. IMPPRES is the first NLI dataset to explicitly probe whether models trained on commonsense reasoning actually do treat pragmatic inferences like implicatures and presuppositions as entailments without additional training on these specific inference types.
Beyond NLI, several recent works introduce resources for evaluating sentence understanding models for knowledge of pragmatic inferences. On the presupposition side, datasets such as MegaVeridicality (White and Rawlins, 2018) and CommitmentBank (de Marneffe et al., 2019) compile gradient crowdsourced judgments regarding how likely a clause embedding predicate is to trigger a presupposition that its complement clause is true. Prior work, including Jiang and de Marneffe (2019), finds that LSTMs trained on a gradient event factuality prediction task on these respective datasets make systematic errors. Turning to implicatures, Degen (2015) introduces a dataset measuring the strength of the implicature from some to not all with crowdsourced judgments. Schuster et al. (2020) find that an LSTM with supervision on this dataset can predict human judgments well. These resources all differ from IMPPRES in two respects. First, their empirical scopes are all somewhat narrower, as all these datasets focus on only a single class of presupposition or implicature triggers. Second, the use of gradient judgments makes it non-trivial to use these datasets to evaluate NLI models, which are trained to make categorical predictions about entailment. Both approaches have advantages, and we leave a direct comparison for future work.
Outside the topic of sentential inference, Rashkin et al. (2018) propose a new task where a model must label actor intents and reactions for particular actions described using text. Cianflone et al. (2018) create sentence-level adverbial presupposition datasets and train a binary classifier to detect contexts in which presupposition triggers (e.g., too, again) can be used.

Annotating MultiNLI for Pragmatics
In this section, we present results of an annotation effort that show that MultiNLI contains very little explicit evidence of pragmatic inferences of the type tested by IMPPRES. Although it has been reported that 22% of the MultiNLI development set sentence pairs contain lexical triggers (such as regret or stopped) in the premise and/or hypothesis, the mere presence of presupposition-triggering lexical items in the data does not show that MultiNLI contains evidence that presuppositions are entailments, since the sentential inference may focus on other types of information.
To address this, we randomly selected 200 sentence pairs from the MultiNLI matched development set and presented them to three expert annotators with a combined total of 17 years of training in formal semantics and pragmatics. Annotators answered the following questions for each pair: (1) are the sentences P and H related by a presupposition/implicature relation (entails/is entailed by, negated or not); (2) what subtype of inference (e.g., existence presupposition, ⟨some, all⟩ implicature); (3) is the presupposition trigger embedded under an entailment-cancelling operator?
Agreement among annotators was low, suggesting that few MultiNLI pairs are paradigmatic cases of implicatures or presuppositions. We found only 8 presupposition pairs and 3 implicature pairs on which two or more annotators agreed. Moreover, we found only one example illustrating a particular inference type tested in IMPPRES (the presupposition of possessed definites). All others were tagged as existence presuppositions and conversational implicatures (i.e. loose inferences dependent on world knowledge). The union of annotations was much larger: 42% of examples were identified by at least one annotator as a presupposition or implicature (51 presuppositions and 42 implicatures, with 10 sentences receiving divergent tags). However, of these, only 23 presuppositions and 19 implicatures could reliably be used to learn pragmatic inference (in 14 cases, the given tag did not match the pragmatic inference, and in 27 cases, computing the inference did not affect the relation type). Again, the large majority of implicatures were conversational, and most presuppositions were existential, and generally not linked to particular lexical triggers (e.g., topic marking).
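The counts in this paragraph can be cross-checked with a few lines of arithmetic. The sketch below assumes that the 10 pairs with divergent tags are counted once under each tag, so they are subtracted once when forming the union; the variable names are ours.

```python
# Counts reported for the 200 sampled MultiNLI pairs.
SAMPLE_SIZE = 200
presupposition_tags = 51
implicature_tags = 42
divergent = 10  # assumption: these pairs appear under both tags

# Pairs tagged by at least one annotator as presupposition or implicature.
union = presupposition_tags + implicature_tags - divergent
share = union / SAMPLE_SIZE

usable = 23 + 19       # pairs that could support learning the inference
mismatched_tag = 14    # tag did not match the pragmatic inference
no_label_effect = 27   # inference did not change the relation type

print(union, f"{share:.1%}")
print(usable + mismatched_tag + no_label_effect == union)
```

Under that assumption the union is 83 pairs, which rounds to the reported 42%, and the usable and unusable cases add up to the same 83.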
We conclude that the MultiNLI dataset at best contains some evidence of loose pragmatic reasoning based on world knowledge and discourse structure, but almost no explicit information relevant to lexically triggered pragmatic inference, which is of the type tested in this paper.

Methods
Data Generation. IMPPRES consists of semiautomatically generated pairs of sentences with NLI labels illustrating key properties of implicatures and presuppositions. We generate IMPPRES using a codebase developed by Warstadt et al. (2019a) and significantly expanded for the BLiMP dataset (Warstadt et al., 2019b). The codebase, including our scripts and documentation, is publicly available. Each sentence type in IMPPRES is generated according to a template that specifies the linear order of the constituents in the sentence. The constituents are sampled from a vocabulary of over 3000 lexical items annotated with grammatical features needed to ensure morphological, syntactic, and semantic well-formedness. All sentences generated from a given template are structurally analogous up to the specified constituents, but may vary in sub-constituents. For instance, if the template calls for a verb phrase, the generated constituent may include a direct object or complement clause, depending on the argument structure of the sampled verb. See §6.1 and §7.1 for descriptions of the sentence types in the implicature and presupposition data.
Generating data lets us control the lexical and syntactic content so that we can guarantee that the sentence pairs in IMPPRES evaluate the desired phenomenon (see Ettinger et al., 2016, for related discussion). Furthermore, the codebase we use allows for greater lexical and syntactic variety than in many other templatic datasets (see discussion in Warstadt et al., 2019b). One limitation of this methodology is that generated sentences, while generally grammatical, often describe highly unlikely scenarios, or include low frequency combinations of lexical items (e.g., Sabrina only reveals this pasta). Another limitation is that generated data is of limited use for training models, since it contains simple regularities that supervised classifiers may learn to exploit. Thus, we create IMPPRES solely for the purpose of evaluating NLI models trained on standard datasets like MultiNLI.
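To make the template idea concrete, the following is a minimal sketch of templatic pair generation for the some / not all implicature. It is not the released codebase: the toy vocabulary, the feature names (`base`, `past`), and the function `make_some_not_all_pair` are hypothetical, and the real vocabulary carries far richer grammatical annotations.

```python
import random

# Hypothetical toy vocabulary; the actual one has over 3000 annotated items.
NAMES = ["Jo", "Sabrina", "Harper"]
NOUNS = ["cakes", "flowers", "books"]
VERBS = [
    {"base": "eat", "past": "ate"},   # morphological features per verb
    {"base": "see", "past": "saw"},
    {"base": "hide", "past": "hid"},
]

def make_some_not_all_pair(rng):
    """Sample constituents once and fill two templates that share them."""
    name = rng.choice(NAMES)
    verb = rng.choice(VERBS)
    noun = rng.choice(NOUNS)
    return {
        # Template: NAME VERB-past some of the NOUN.
        "premise": f"{name} {verb['past']} some of the {noun}.",
        # Template: NAME didn't VERB-base all of the NOUN.
        "hypothesis": f"{name} didn't {verb['base']} all of the {noun}.",
    }

rng = random.Random(0)  # fixed seed so the sample is reproducible
pairs = [make_some_not_all_pair(rng) for _ in range(3)]
for pair in pairs:
    print(pair["premise"], "=>", pair["hypothesis"])
```

Because lexical items are sampled independently, some outputs describe unlikely scenarios, which mirrors the limitation noted above.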

Models.
Our experiments evaluate NLI models trained on MultiNLI and built on top of three sentence encoding models: a bag of words (BOW) model, InferSent (Conneau et al., 2017), and BERT-Large (Devlin et al., 2019). The BOW and InferSent models use 300D GloVe embeddings as word representations (Pennington et al., 2014). For the BOW baseline, word embeddings for premise and hypothesis are separately summed to create sentence representations, which are concatenated to form a single sentence-pair representation which is fed to a logistic regression softmax classifier. For the InferSent model, GloVe embeddings for the words in premise and hypothesis are respectively fed into a bidirectional LSTM, after which we concatenate the representations for premise and hypothesis, their difference, and their element-wise product (Mou et al., 2016). BERT is a multilayer bidirectional transformer pretrained with the masked language modelling and next sentence prediction objectives, and finetuned on the MultiNLI dataset. We concatenate the premise and hypothesis after a special [CLS] token and separate them with the [SEP] token. The BERT representation for the [CLS] token is fed into a classifier. We use Huggingface's pre-trained BERT trained on Toronto books (Zhu et al., 2015). The BOW and InferSent models have development set accuracies of 49.6% and 67.6%, respectively. The development set accuracy for BERT-Large on MultiNLI is 86.6%, similar to the results achieved by Devlin et al. (2019), but somewhat lower than state-of-the-art (currently 90.8% on test from the ensembled RoBERTa model with long pretraining optimization, Liu et al. 2019).
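The two GloVe-based pair representations described above can be sketched in plain Python. This is an illustration only: the 2-dimensional toy embeddings stand in for 300D GloVe vectors, skipping out-of-vocabulary tokens is our assumption, and in the actual InferSent model the vectors u and v come from a bidirectional LSTM rather than from summation.

```python
def bow_sentence(tokens, embeddings):
    """Sum word vectors into an order-insensitive sentence vector."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for token in tokens:
        vec = embeddings.get(token)
        if vec is None:
            continue  # assumption: out-of-vocabulary tokens are skipped
        total = [t + x for t, x in zip(total, vec)]
    return total

def bow_pair(u, v):
    """BOW baseline: concatenate premise and hypothesis vectors."""
    return u + v

def infersent_pair(u, v):
    """InferSent-style features: u, v, their difference, and their product."""
    diff = [a - b for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return u + v + diff + prod

# Toy 2D embeddings (hypothetical values).
emb = {"jo": [1.0, 0.0], "ate": [0.0, 2.0], "cake": [1.0, 1.0]}
u = bow_sentence(["jo", "ate", "cake"], emb)
v = bow_sentence(["jo", "ate"], emb)
print(bow_pair(u, v))
print(infersent_pair(u, v))
```

Either feature vector would then be passed to a classifier over the three NLI labels. Note that the summed representation is identical for any permutation of the tokens.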

Experiment 1: Scalar Implicatures

Scalar Implicature Datasets
The scalar implicature portion of IMPPRES includes six datasets, each isolating a different scalar implicature trigger from six types of lexical scales (of the type described in §2): determiners ⟨some, all⟩, connectives ⟨or, and⟩, modals ⟨can, have to⟩, numerals ⟨2, 3⟩, ⟨10, 100⟩, scalar adjectives, and verbs, e.g., ⟨good, excellent⟩, ⟨run, sprint⟩. Example pairs for each implicature trigger can be found in Table 4 in the Appendix. For each type, we generate 100 paradigms, each consisting of 12 unique sentence pairs, as shown in Table 2.
The six target sentence pairs comprise two main relation types: 'implicature' and 'negated implicature'. Pairs tagged as 'implicature' have a premise that implicates the hypothesis (e.g., some and not all). For 'negated implicature', the premise implicates the negation of the hypothesis (e.g., some and all), or vice versa (e.g., all and some). Six control pairs are logical contradictions, representing either scalar 'opposites' (e.g., all and none), or 'negations' (e.g., not all and all; some and none), probing the models' basic grasp of negation.
As mentioned in §2.1, implicature computation is variable and dependent on the context of utterance. Thus, we anticipate two possible rational behaviors for a MultiNLI-trained model tested on an implicature: (a) be pragmatic, and compute the implicature, concluding that the premise and hypothesis are in an 'entailment' relation, (b) be logical, i.e., consider only the literal content, and not compute the implicature, concluding they are in a 'neutral' relation. Thus, we measure both possible conclusions, by tagging sentence pairs for scalar implicature with two sets of NLI labels to reflect the behavior expected under "logical" and "pragmatic" modes of inference, as shown in Table 2.
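The dual-label scheme can be sketched as a small scoring routine that classifies each model prediction as consistent with the pragmatic label, the logical label, or neither. Only two pair types are included, and the label table is our reconstruction from the prose: the logical label for some / all (neutral) is an assumption, since Table 2 is the authoritative source.

```python
from collections import Counter

# (premise term, hypothesis term) -> (logical label, pragmatic label)
LABELS = {
    ("some", "not all"): ("neutral", "entailment"),   # 'implicature'
    ("some", "all"): ("neutral", "contradiction"),    # 'negated implicature'
}

def reasoning_mode(pair_type, predicted):
    """Classify one prediction by which mode of inference it matches."""
    logical, pragmatic = LABELS[pair_type]
    if predicted == pragmatic:
        return "pragmatic"
    if predicted == logical:
        return "logical"
    return "neither"

def mode_rates(predictions):
    """Proportion of predictions consistent with each mode."""
    counts = Counter(reasoning_mode(t, p) for t, p in predictions)
    return {m: counts[m] / len(predictions)
            for m in ("pragmatic", "logical", "neither")}

# Hypothetical model outputs for four sentence pairs.
predictions = [
    (("some", "not all"), "entailment"),
    (("some", "all"), "neutral"),
    (("some", "all"), "entailment"),
    (("some", "not all"), "entailment"),
]
rates = mode_rates(predictions)
print(rates)
```

This works only where the two label sets differ, as they do for the pair types listed here.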

Implicatures Results & Discussion
We first evaluate model performance on the controls, shown in Figure 2. Success on these controls is a necessary condition for us to conclude that a model has learned the basic function of negation (not, none, neither) and the scalar relationship between terms like some and all. We find that BERT performs at ceiling on control conditions for all implicature types, in contrast with InferSent and BOW, whose performance is very variable. Since only BERT passes all controls, its results on the target items are most interpretable. Full results for all models and target conditions by implicature trigger are in Figures 8-13 in the Appendix.
For connectives, scalar adjectives and verbs, the BERT model results correspond neither to the hypothesized pragmatic nor logical behavior. In fact, for each of these subdatasets, the results are consistent with a treatment of scalemates (e.g., and and or; good and excellent) as synonyms: for example, it evaluates the 'negated implicature' sentence pairs as 'entailment' in both directions. This reveals a coarse-grained knowledge of these meanings that lacks information about asymmetric informativity relations between scalemates. Results for modals (can and have to) are split between the three labels, not showing any predicted logical or pragmatic pattern. We conclude that BERT has insufficient knowledge of the meaning of these words.
In addition to pragmatic and logical interpretations, numerals can also be interpreted as exact cardinalities. We thus predict three different behaviors: logical "at least n", pragmatic "at least n", and "exactly n". We observe that results are inconsistent: neither the "exactly" nor "at least" interpretations hold across the board. For the determiner dataset (some-all), Figure 4 breaks down the results by condition and shows that BERT behaves as though it performs pragmatic and logical reasoning in different conditions. Overall, it predicts a pragmatic relation more frequently (55% vs. 36%), and only 9% of results are consistent with neither mode of reasoning. Furthermore, the proportion of pragmatic reasoning shows consistent effects of sentence order (i.e., whether the implicature trigger is in the premise or the hypothesis), and the presence of negation in one or both sentences. Pragmatic reasoning is consistently higher when the implicature trigger is in the premise, which we can see in the results for negated implicatures: the some-all condition shows more pragmatic behavior compared to the all-some condition (a similar behavior is observed with the not all vs. none conditions).
Generally, the presence of negation lowers rates of pragmatic reasoning. First, the negated implicature conditions can be subdivided into pairs with and without negation. Among the negated ones, pragmatic reasoning is lower than for nonnegated ones. Second, having negation in the premise rather than the hypothesis makes pragmatic reasoning lower: among pairs tagged as direct implicatures (some vs. not all), there is higher pragmatic reasoning with non-negated some in the premise than with negated not all. Finally, we observe that pragmatic rates are lower for some vs. not all than for some vs. all. In this final case, pragmatic reasoning could be facilitated by explicit presentation of the two items on the scale.
In sum, for the datasets besides determiners, we find evidence that BERT fails to learn even the logical relations between scalemates, ruling out the possibility of computing scalar implicatures. It remains possible that BERT could learn these logical relations with explicit supervision (see Richard-

Experiment 2: Presuppositions

Presupposition Datasets
The presupposition portion of IMPPRES includes eight datasets, each isolating a different kind of presupposition trigger. The full set of triggers is shown in Table 5 in the Appendix. For each type, we generate 100 paradigms, with each paradigm consisting of 19 unique sentence pairs. (Examples of the sentence types are in Table 1). Of the 19 sentence pairs, 15 contain target items. The first target item tests whether the model correctly determines that the presupposition trigger entails its presupposition. The next two alter the presupposition, either negating it, or replacing a constituent, leading to contradiction and neutrality, respectively. The remaining 12 show that the relation between the trigger and the (altered) presupposition is not affected by embedding the trigger under various entailment-canceling operators. Four control items are designed to test the basic effect of entailment-canceling operators: negation, modals, interrogatives, and conditionals. In each control, the premise is a presupposition trigger embedded under an entailment-canceling operator, and the hypothesis is an unembedded sentence containing the trigger. These controls are necessary to establish whether models learn that presuppositions behave differently under these operators than do classical semantic entailments.
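The composition of a presupposition paradigm can be enumerated directly, which also confirms the counts of 15 targets and 4 controls. The sketch below uses our own labels for the conditions; they are descriptive names, not identifiers from the released data.

```python
from itertools import product

OPERATORS = ["negation", "modal", "interrogative", "conditional"]

# Hypothesis variants and the NLI label each one should receive.
EXPECTED = {
    "presupposition": "entailment",
    "negated presupposition": "contradiction",
    "altered presupposition": "neutral",
}

# Three unembedded targets, then each variant under each operator.
targets = [("unembedded", h) for h in EXPECTED]
targets += list(product(OPERATORS, EXPECTED))

# Controls: embedded trigger as premise, unembedded trigger as hypothesis.
controls = [(op, "unembedded trigger") for op in OPERATORS]

pairs_per_paradigm = len(targets) + len(controls)
pairs_per_dataset = 100 * pairs_per_paradigm  # 100 paradigms per trigger type

print(len(targets), len(controls), pairs_per_paradigm, pairs_per_dataset)
```

The 12 embedded targets arise as 4 operators times 3 hypothesis variants, and the expected label of each variant is the same with or without embedding, which is what projection predicts.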

Presupposition Results & Discussion
The results from presupposition controls are in Figure 5. BERT performs well above chance on each control (acc. > 0.33), whereas BOW and InferSent perform at or below chance. In the "negated" condition, BERT correctly identifies that the trigger is contradicted by its negation 100% of the time, e.g., Jo's cat didn't go contradicts Jo's cat went. In the other conditions, it correctly identifies the neutral relation the majority of the time, e.g., Did Jo's cat go? is neutral with respect to Jo's cat went. This indicates that BERT mostly learns that negation, modals, interrogatives, and conditionals cancel classical entailments, while BOW and InferSent do not capture the ordinary behavior of these common operators.
Next, we test whether models identify presuppositions of the premise as entailments, e.g., that Jo's cat went entails that Jo has a cat. Recall from §2.2 that this is akin to a listener accommodating a presupposition. The results in Figure 6 show that each of the three models accommodates some presuppositions, but this depends on both the nature of the presupposition and the model. For instance, the BOW and InferSent models accommodate presuppositions of nearly all trigger types at well above chance rates (acc. > 33%). For the uniqueness presupposition of clefts, these models generally correctly predict an entailment (acc. > 90%), but for most triggers, performance is less reliable. By contrast, BERT's behavior is bimodal. It always accommodates the existence presuppositions of clefts and possessed definites, as well as the presupposition of only, but almost never accommodates any presupposition involving numeracy, e.g., that Both flowers that bloomed died entails There are exactly two flowers that bloomed.
Finally, we evaluate whether models predict that presuppositions project out of entailment canceling operators (e.g., that Did Jo's cat go? entails that Jo has a cat). We can only consider such a prediction as evidence of projection if two conditions hold: (a) the model correctly identifies that the relevant operator cancels entailments in the control from the same paradigm (e.g., Did Jo's cat go? is neutral with respect to Jo's cat went), and (b) the model identifies the presupposition as an entailment when the trigger is unembedded in the same paradigm (e.g., Jo's cat went entails Jo has a cat). Otherwise, a model might correctly predict entailment essentially by accident if, for instance, it systematically ignores negation. For this reason, we filter out results for the target conditions that do not meet these criteria. Figure 7 shows results for the target conditions after filtering.
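Conditions (a) and (b) amount to a per-paradigm filter applied before scoring the projection targets. The sketch below is one way to express it; the dictionary layout and function names are hypothetical, and the expected control label (contradiction for negation, neutral otherwise) follows the description of the controls above.

```python
def expected_control_label(operator):
    """Negation contradicts the unembedded trigger; other operators leave it neutral."""
    return "contradiction" if operator == "negation" else "neutral"

def passes_filter(paradigm, operator):
    # (a) the model handles the entailment-canceling operator in the control
    control_ok = paradigm["control"][operator] == expected_control_label(operator)
    # (b) the model accommodates the presupposition of the unembedded trigger
    unembedded_ok = paradigm["unembedded"] == "entailment"
    return control_ok and unembedded_ok

def projection_rate(paradigms, operator):
    """Share of retained paradigms where the embedded trigger is judged to entail."""
    kept = [p for p in paradigms if passes_filter(p, operator)]
    if not kept:
        return None  # nothing survives the filter
    hits = sum(p["target"][operator] == "entailment" for p in kept)
    return hits / len(kept)

# Hypothetical predictions for four paradigms under the interrogative operator.
paradigms = [
    {"unembedded": "entailment", "control": {"interrogative": "neutral"},
     "target": {"interrogative": "entailment"}},
    {"unembedded": "entailment", "control": {"interrogative": "entailment"},
     "target": {"interrogative": "entailment"}},   # fails (a)
    {"unembedded": "neutral", "control": {"interrogative": "neutral"},
     "target": {"interrogative": "entailment"}},   # fails (b)
    {"unembedded": "entailment", "control": {"interrogative": "neutral"},
     "target": {"interrogative": "neutral"}},
]
rate = projection_rate(paradigms, "interrogative")
print(rate)
```

In this toy input, two paradigms predict entailment for the target but are discarded, so they cannot inflate the projection rate by accident.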
While InferSent rarely predicts that presuppositions project, we find strong evidence that the BERT and BOW models do. Specifically, they correctly identify that the premise entails the presupposition (acc. ≥ 80% for BERT, acc. ≥ 90% for BOW). Furthermore, BERT is the only model to reliably identify (i.e., over 90% of the time) that the negation of the presupposition is contradicted. These results hold irrespective of the entailment canceling operator. No model reliably performs above chance when the presupposition is altered to be neutral (e.g., Did Jo's cat go? is neutral with respect to Jo has a cat).
It is surprising that the simple BOW model can learn some of the projective behavior of presuppositions. One explanation for this finding is that many of the key features of presupposition projection are insensitive to word order. If a lexical presupposition trigger is present at all in a sentence, a presupposition will generally arise irrespective of its position in the sentence. There are some edge cases where this heuristic is insufficient, but IMPPRES is not designed to test such cases.
To summarize, training on NLI is sufficient for all models we evaluate to learn to accommodate presuppositions of a wide variety of unembedded triggers, though BERT rejects presuppositions involving numeracy. Furthermore, BERT and even the BOW model appear to learn the characteristic projective behavior of some presuppositions.

General Discussion & Conclusion
We observe some encouraging results in §6-7. We find strong evidence that BERT learns scalar implicatures associated with determiners some and all. Pragmatic or logical reasoning was not diagnosable for the other scales, whose meaning was not fully understood by our models (as most scalar pairs were treated as synonymous). In the case of presuppositions, the BERT NLI model, and BOW to some extent, perform well on a number of our subdatasets (only, cleft existence, possessive existence, questions). For the other subdatasets, the models did not perform as expected on the basic unembedded presupposition triggers, again suggesting the models' lack of knowledge of the basic meaning of these words. Though their behavior is far from systematic, this is suggestive evidence that some NLI models can perform in ways that correlate with human-like pragmatic behavior.
Given that MultiNLI contains few examples of the type found in IMPPRES (see §4), where might our positive results come from? There are two potential sources of signal for the BERT model: NLI training, and pretraining (either BERT's masked language modeling objective or its input word embeddings). NLI training provides specific examples of valid (or invalid) inferences constituting an incomplete characterization of what commonsense inference is in general. Since presuppositions and scalar implicatures triggered by specific lexical items are largely absent from the MultiNLI data used for NLI training, any positive results on IMPPRES would likely use prior knowledge from the pretraining stage to make an inductive leap that pragmatic inferences are valid commonsense inferences. The natural language text used for pretraining certainly contains pragmatic information, since, like any natural language data, it is produced with the assumption that readers are capable of pragmatic reasoning. This may induce patterns in the data that make the nature of those assumptions recoverable from the data itself.
This work is an initial step towards rigorously investigating the extent to which NLI models learn semantic versus pragmatic inference types. We have introduced a new dataset, IMPPRES, for probing this question, which can be reused to evaluate the pragmatic performance of any given NLI model.