Evaluating BERT for natural language inference: A case study on the CommitmentBank

Natural language inference (NLI) datasets (e.g., MultiNLI) were collected by soliciting hypotheses for a given premise from annotators. Such data collection led to annotation artifacts: systems can identify the premise-hypothesis relationship without observing the premise (e.g., negation in the hypothesis being indicative of contradiction). We address this problem by recasting the CommitmentBank for NLI, which contains items involving reasoning over the extent to which a speaker is committed to complements of clause-embedding verbs under entailment-canceling environments (conditional, negation, modal and question). Instead of being constructed to stand in certain relationships with the premise, hypotheses in the recast CommitmentBank are the complements of the clause-embedding verb in each premise, leading to no annotation artifacts in the hypothesis. A state-of-the-art BERT-based model performs well on the CommitmentBank with 85% F1. However, analysis of model behavior shows that the BERT models still do not capture the full complexity of pragmatic reasoning, nor encode some of the linguistic generalizations, highlighting room for improvement.


Introduction
Natural language inference (NLI), the task of identifying whether a hypothesis can be inferred from, contradicted by, or not related to a premise, has become one of the standard benchmark tasks for natural language understanding. NLI datasets, such as SNLI (Bowman et al., 2015) and MultiNLI (Williams et al., 2018), are typically built by asking annotators to compose sentences based on premises extracted from corpora, so that the composed sentences stand in entailment/contradiction/neutral relationship to the premise. The hypotheses collected this way have been found to contain annotation artifacts: clues allowing systems to identify the relationship between a premise and a hypothesis without observing the premise. For instance, Gururangan et al. (2018) found that in SNLI and MultiNLI, negation is highly indicative of contradiction and generic nouns (e.g., animal, something) of entailment.

Table 1: Examples from CB with the gold NLI label and the original mean annotation (in parentheses).

Premise: A: Boy that's scary, isn't it. B: Oh, can you imagine, because it happens in the middle of the night, so you know, these parents didn't know the kid was gone until the kid is knocking on the door screaming, let me in.
Hypothesis: the kid was gone.
Neutral (0)

Premise: GM confirmed it received U.S. antitrust clearance to boost its holding. Sansui Electric agreed to sell a 51% stake to Polly Peck of Britain for $110 million. Still, analysts said the accord doesn't suggest Japan is opening up to more foreign takeovers.
Hypothesis: Japan is opening up to more foreign takeovers.
Contradiction (-1.2)
To address this issue, we recast the CommitmentBank (CB henceforth) (de Marneffe et al., 2019), an English dataset of speaker commitment/event factuality, for NLI. The original CommitmentBank includes naturally occurring discourses annotated with speaker commitment towards the content of complements of clause-embedding verbs under entailment-canceling environments (negation, modal, question and conditional). CB does not suffer from the drawback of annotation artifacts in the hypotheses, since the hypotheses are the complement of a clause-embedding verb in the premise. It thus tests for inferences involving a particular kind of syntactic construction and contains no annotation artifacts in the hypothesis, making it suitable to test for robust language understanding. CB has many challenging aspects which are highlighted in various adversarial NLI datasets. It can be thought of as a variant of HANS (McCoy et al., 2019), which contains examples where the hypothesis is a subsequence or a constituent of the premise. It contains several phenomena in the "stress tests" (Naik et al., 2018) including word overlap, negation, and length mismatch. However, these datasets are artificially constructed while CB data are naturally occurring.
Here we evaluate BERT, the state-of-the-art model in NLI, on CB. While BERT models achieve good performance with supervision from both CB and MultiNLI, they still struggle with items involving pragmatic reasoning and lag behind human performance. Experiments show that BERT does not use the linguistic generalizations for speaker commitment to make predictions, although BERT can learn them with direct supervision. CB is thus a useful benchmark for measuring progress on robust natural language understanding and specifically speaker commitment inferences.

The CommitmentBank
To study the linguistic correlates of speaker commitment in English, de Marneffe et al. (2019) introduced the CommitmentBank dataset. 2 It consists of naturally occurring English items with up to two sentences of preceding context and one target sentence, from three genres: newswire (Wall Street Journal), fiction (British National Corpus), and dialogue (Switchboard). The target sentences contain a clause-embedding verb (such as think) in an entailment-canceling environment (negation, modal, question, or conditional). Each item has at least 8 annotations indicating the extent to which the speaker of the sentences is committed to the truth of the embedded clause (+3/speaker is certain that it is true, 0/speaker is not certain about its truth, −3/speaker is certain that it is false).

Recast for NLI
For each item, we take the context and target sentence to be the premise, and the embedded clause in the target sentence to be the hypothesis. We identified a subset of the CommitmentBank with high annotator agreement, and assigned categorical labels (entailment/neutral/contradiction) to them according to their mean annotations in [−3, 3]. We label an item as entailment if at least 80% of its annotations are within [1,3], where the speaker is committed to the complement p, as neutral [...] where the speaker is committed to ¬p. We discard the item if less than 80% of the annotations are within one of the three sub-ranges. Table 1 contains examples from CB with the original mean annotation and the gold NLI label. The number of items in each class is in Table 2.

2 The original CommitmentBank is available at https://github.com/mcdm/CommitmentBank

Table 2: Number of items in each class.

         Entailment   Neutral   Contradiction   Total
Train           115        16             119     250
Dev              23         5              28      56
Test            113        16             121     250
Total           251        37             268     556
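The agreement-based labeling rule can be sketched as follows. This is a minimal illustration, not the authors' code: only the entailment sub-range [1, 3] and the 80% threshold come from the text, while the neutral and contradiction sub-ranges used here ([0, 0] and [-3, -1]) and the function name `recast_label` are assumptions.

```python
from typing import Optional, Sequence

# Sub-ranges on the [-3, 3] annotation scale. Only the entailment range is
# stated in the text; the other two are assumptions made for illustration.
SUBRANGES = {
    "entailment": (1, 3),
    "neutral": (0, 0),          # assumed
    "contradiction": (-3, -1),  # assumed
}


def recast_label(annotations: Sequence[int], threshold: float = 0.8) -> Optional[str]:
    """Return the NLI label of an item, or None if the item is discarded."""
    if not annotations:
        return None
    for label, (low, high) in SUBRANGES.items():
        # Share of annotations that fall inside this sub-range.
        share = sum(low <= a <= high for a in annotations) / len(annotations)
        if share >= threshold:
            return label
    # No sub-range reaches the agreement threshold: discard the item.
    return None
```

Because the sub-ranges do not overlap and the threshold is above 50%, at most one label can be returned for any item.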

Possible Annotation Artifacts
Since the hypotheses in CB are extracted from the premises instead of generated by annotators, we expect CB to contain fewer annotation artifacts than SNLI or MultiNLI.
Length Gururangan et al. (2018) found that entailed hypotheses in SNLI tend to be shorter and neutral ones longer. The hypothesis length in the CB train set is distributed evenly across the three classes (mean length 8.5 tokens for entailment, 6.6 for neutral and 7.3 for contradiction).
Lexical Features Following Gururangan et al. (2018), we computed the PMI between each unigram/4-gram and class in the training set, capturing the extent to which an expression is associated with each class. [...] a particular class. But most of these expressions align with linguistic generalizations about these particular constructions, as elaborated on below.
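As a rough illustration of the PMI statistic, the sketch below computes PMI(expression, class) = log [p(expression, class) / (p(expression) p(class))] from per-item counts. The function names, the once-per-item counting, and the add-k smoothing constant are assumptions for illustration; the text does not specify these details.

```python
import math
from collections import Counter
from itertools import product
from typing import Dict, Iterable, List, Tuple


def ngrams(tokens: List[str], n: int) -> List[str]:
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def pmi_table(
    items: Iterable[Tuple[List[str], str]], smoothing: float = 1.0
) -> Dict[Tuple[str, str], float]:
    """Compute smoothed PMI for every (expression, class) pair.

    Each item is (expressions occurring in the item, class label).
    `smoothing` is an assumed add-k constant and must be positive.
    """
    joint: Counter = Counter()
    for expressions, label in items:
        for expression in set(expressions):  # count each expression once per item
            joint[(expression, label)] += 1
    vocab = sorted({e for e, _ in joint})
    classes = sorted({c for _, c in joint})
    # Smoothed joint counts over the full expression x class grid.
    count = {(e, c): joint[(e, c)] + smoothing for e, c in product(vocab, classes)}
    total = sum(count.values())
    expr_total = {e: sum(count[(e, c)] for c in classes) for e in vocab}
    class_total = {c: sum(count[(e, c)] for e in vocab) for c in classes}
    return {
        (e, c): math.log(count[(e, c)] * total / (expr_total[e] * class_total[c]))
        for (e, c) in count
    }
```

A positive value means the expression occurs with the class more often than independence would predict; ranking expressions by PMI within a class gives its most discriminating expressions.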
Entailment The most discriminating expressions for the entailment premises include the modal operators perhaps, could and might. This is because 63 out of the 115 entailment items in the train set involve the modal environment. The factive verbs notice and know are also discriminating features of entailment, indicating that factive verbs tend to suggest the truth of their complements (Karttunen, 1971).
Neutral The most discriminating expressions for the neutral class include questions: Do you think. This is due to the fact that 10 out of 16 neutral items are under the question environment.
Contradiction For contradiction, the most discriminating expressions involve neg-raising constructions (I don't think/know/believe that p, where the speaker is committed to p being false), the filler phrases Uh and I mean, and the dialogue speaker indicator B:. These are all characteristic of the Switchboard genre, which makes up 80% of the contradictions in the training set.

Predicting NLI labels
We evaluate BERT, the state-of-the-art model for NLI, on CB. 5 The BERT model follows the standard practice for sentence-pair tasks as in Devlin et al. (2019). We concatenate the premise and the hypothesis with [SEP], prepend the sequence with [CLS], and feed the input to BERT. The representation for [CLS] is fed into a softmax layer for a three-way classification.

5 We used jiant (Wang et al., 2019b) with the bert-large-cased model for all our experiments.
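The input construction and classification head can be sketched without any model weights. The snippet below is only a toy illustration: the trailing [SEP] follows the sentence-pair convention of Devlin et al. (2019), and the label order, the function names, and the tiny weight matrix are hypothetical stand-ins for the fine-tuned model.

```python
import math
from typing import List, Sequence, Tuple

LABELS = ["entailment", "neutral", "contradiction"]  # assumed label order


def pack_pair(premise: Sequence[str], hypothesis: Sequence[str]) -> List[str]:
    """Build the token sequence [CLS] premise [SEP] hypothesis [SEP]."""
    return ["[CLS]", *premise, "[SEP]", *hypothesis, "[SEP]"]


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    norm = sum(exps)
    return [x / norm for x in exps]


def classify(
    cls_vector: Sequence[float],
    weights: Sequence[Sequence[float]],
    bias: Sequence[float],
) -> Tuple[str, List[float]]:
    """Apply a linear layer plus softmax to the [CLS] representation."""
    logits = [
        sum(w * x for w, x in zip(row, cls_vector)) + b
        for row, b in zip(weights, bias)
    ]
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return LABELS[best], probs
```

In the real model, `cls_vector` would be the final-layer representation of the [CLS] token, and `weights` and `bias` would be learned during fine-tuning.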
For all experiments, we used the Adam optimizer (Kingma and Ba, 2015) with an initial learning rate of 1e-5 and a batch size of 8, and fine-tuned for at most 10 epochs on each dataset. We fine-tuned BERT with three different sets of training data: CB only (CB B), MultiNLI only (MNLI B), and MultiNLI first then CB (MNLI+CB B). For comparison, we also included the models' performance on the MultiNLI dev set.
Baselines We included two baselines: a bag-of-words baseline (CBOW) in which each item is represented as the average of its tokens' GloVe (Pennington et al., 2014) vectors, and a Heuristics baseline, only applicable to the CB dataset, which uses five rules based on the observations in Section 2.2: 1. items under modals are entailments, 2. neg-raising items of the form I don't think/know/believe are contradictions, 3. items with factive verbs are entailments, 4. items under negation are contradictions, 5. all other items are neutral.
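The five rules of the Heuristics baseline can be written as a short decision procedure. This sketch assumes that the rules are applied in the order listed, with the first matching rule deciding the label, and that the relevant features are already available for each item; the `Item` fields and the function name are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Item:
    """Hypothetical feature record for one CB item."""
    environment: str   # "modal", "negation", "question" or "conditional"
    neg_raising: bool  # of the form "I don't think/know/believe ..."
    factive: bool      # the clause-embedding verb is factive


def heuristic_label(item: Item) -> str:
    """Apply the five rules in order; the first matching rule wins (assumed)."""
    if item.environment == "modal":     # rule 1
        return "entailment"
    if item.neg_raising:                # rule 2
        return "contradiction"
    if item.factive:                    # rule 3
        return "entailment"
    if item.environment == "negation":  # rule 4
        return "contradiction"
    return "neutral"                    # rule 5
```

Under this ordering, a factive verb under negation is labeled entailment because rule 3 fires before rule 4.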
Human Performance We included human performance on CB from Wang et al. (2019a), obtained by asking crowdworkers to re-annotate a part of the test set. The MultiNLI human performance accuracy is for the matched/mismatched test set from Nangia and Bowman (2019).

Figure 1 shows the precision, recall and F1 scores of each class on the CB test set for the three BERT variants and the Heuristics baseline. Heuristics performs similarly to CB B on all classes. Compared with CB B, MNLI+CB B improves the overall performance on contradiction and the recall of neutral. MNLI B identifies contradictions with perfect precision but poor recall. All models do poorly on the neutral class, which has very few items in the dataset and no clear linguistic generalizations.

There is no statistically significant difference between the models' performance on the No-items. There is thus a performance gap between items requiring more pragmatic reasoning in general (No-items) and items which can be correctly predicted by identifying certain structures (Yes-items), suggesting that work remains to achieve robust language understanding. Table 6 shows some items on which MNLI+CB B still fails.

Analysis
Feature Probing To investigate whether BERT actually learns the linguistic features from the Heuristics baseline and uses them to make NLI predictions, we trained two probing models to predict 1. whether the clause-embedding verb is factive and 2. the type of entailment-canceling environment. Following Tenney et al. (2019), we take the weighted sum of BERT layers (fine-tuned for NLI) to produce a pooled representation for each token. Unlike Tenney et al. (2019), in which the representations for the word tokens are used, we take the representation of the [CLS] token for each item and feed it into an MLP classifier to predict whether the discourse has certain features. We extracted the trained scalar mixing weights to see the importance of the different layers. For each layer k, we trained a series of classifiers using all previous layers up to k, to measure at which layer the feature can be correctly predicted. 7 We did the above experiments in two settings: 1. fine-tune all BERT layers to learn the feature-specific representations, and 2. freeze BERT layers tuned for NLI and only train the probing classifier. The results are shown in Figure 2.
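The weighted sum over layers can be illustrated with a small scalar-mixing function. This is a sketch under assumptions: the softmax normalization of the layer scalars and the global scale `gamma` follow the usual scalar-mixing formulation, and the function names and the `up_to` argument (for the classifiers restricted to layers up to k) are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def softmax(values: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    norm = sum(exps)
    return [e / norm for e in exps]


def scalar_mix(
    layer_vectors: Sequence[Sequence[float]],
    scalars: Sequence[float],
    gamma: float = 1.0,
    up_to: Optional[int] = None,
) -> List[float]:
    """Mix per-layer [CLS] vectors into one pooled vector.

    layer_vectors[k] is the [CLS] representation at layer k. Only the first
    `up_to` layers are mixed; by default all layers are used.
    """
    k = len(layer_vectors) if up_to is None else up_to
    layers = layer_vectors[:k]
    weights = softmax(scalars[:k])  # normalized layer importances
    dim = len(layers[0])
    return [
        gamma * sum(w * layer[d] for w, layer in zip(weights, layers))
        for d in range(dim)
    ]
```

The pooled vector would then be passed to the MLP probing classifier, and the learned `scalars` can be inspected after training to see which layers carry the most weight.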
When fine-tuning BERT layers for each feature task, we see that performance increases as more layers are added. Factives, conditionals, and modals are correctly predicted at later layers than nonfactives and negation. For conditionals and modals, this might be due to the fact that they are rare in the dataset. Factives possibly require more contextual information in order to be learned: the scalar weights indicate that factivity is processed at higher layers than entailment-canceling environment. This is consistent with the language acquisition literature (Hacquard and Lidz, 2019) which suggests that rich syntactic/pragmatic information is required to learn the semantics of factive verbs.

7 The code and data are available at https://github.com/njjiang/jiant/tree/cb_emnlp19.

Table 6: Items in the test set with predictions by the Heuristics baseline (H) and MNLI+CB B (B). The first one is correctly predicted by Heuristics, while the second one is not.
However, when we freeze the BERT parameters from MNLI+CB B, the models always give the highest probability to the negation environment and the nonfactive verb class, resulting in zero F1 on every other feature. The scalar mixing weights are smaller than the weights from the fine-tuned model. This suggests that, although BERT can learn these features with direct supervision, training BERT for NLI does not result in representations that encode these features: the model relies on other statistical clues to make decisions.

Conclusion
We introduce CB as a dataset for NLI, and show that it does not contain annotation artifacts in the hypotheses in contrast to previous NLI datasets. Our evaluation shows that despite the high F1 scores, BERT models have systematic error patterns, suggesting that they still do not capture the full complexity of human pragmatic reasoning. There is much room for improvement, and the CB dataset will be a useful testbed to assess models' progress on such reasoning.