TaxiNLI: Taking a Ride up the NLU Hill

Pre-trained Transformer-based neural architectures have consistently achieved state-of-the-art performance in the Natural Language Inference (NLI) task. Since NLI examples encompass a variety of linguistic, logical, and reasoning phenomena, it remains unclear which specific concepts the trained systems learn and where they achieve strong generalization. To investigate this question, we propose a taxonomic hierarchy of categories that are relevant for the NLI task. We introduce TaxiNLI, a new dataset of 10k examples from the MNLI dataset annotated with these taxonomic labels. Through various experiments on TaxiNLI, we observe that whereas for certain taxonomic categories SOTA neural models have achieved near-perfect accuracies (a large jump over previous models), some categories still remain difficult. Our work adds to the growing body of literature that exposes the gaps in current NLI systems and datasets through a systematic presentation and analysis of reasoning categories.


Introduction
The Natural Language Inference (NLI) task tests whether a hypothesis (H) in text contradicts, is entailed by, or is neutral with respect to a given premise (P) text. This 3-way classification task, popularized by Bowman et al. (2015), which was in turn inspired by Dagan et al. (2005), now serves as a benchmark for evaluating the natural language understanding (NLU) capability of models; for example, NLI datasets (Bowman et al., 2015; Williams et al., 2018) are included in all NLU benchmarks such as GLUE and SuperGLUE (Wang et al., 2018). These corpora, in turn, have been successfully used to train models such as BERT (Devlin et al., 2019) to achieve state-of-the-art (SOTA) performance in these tasks. Despite the wide adoption of NLI datasets, a growing concern in the community has been the lack of clarity as to which linguistic or reasoning concepts these trained NLI systems are truly able to learn and generalize (see, for example, Linzen (2020) and Bender and Koller (2020) for a discussion). Over the years, as models have shown steady performance increases in NLI tasks, many authors (Nie et al., 2019; Kaushik et al., 2019) have demonstrated steep drops in performance when these models are tested against adversarially (or counterfactually) created examples by non-experts. Richardson et al. (2019) use templated examples to show that trained NLI systems fail to capture essential logical (negation, boolean, quantifier) and semantic (monotonicity) phenomena.

† denotes equal contribution. * Work was done while the authors were at Microsoft Research India.
Herein lie the central questions of our work: 1) what is the distribution of various categories of reasoning tasks in the NLI datasets? 2) which categories of tasks are rarely captured by current NLI datasets (owing to the nature of the task and the non-expert annotators)? 3) which categories are well-understood by the SOTA models? and 4) are there categories where Transformer-based architectures are consistently deficient?
In order to answer these questions, we first discuss why performance-specific error analysis categories (Wang et al., 2018; Nie et al., 2019) and stress testing categories (Naik et al., 2018) are inadequate. We then propose a taxonomy of the various reasoning tasks that are commonly covered by current NLI datasets (Sec 2). Next, we annotate 10,071 P-H pairs from the MNLI dataset (Williams et al., 2018) with the lowest-level taxonomic categories, 18 in total (Sec 3). Then we conduct various experiments and careful error analysis of the SOTA models (BERT and RoBERTa), as well as other baselines such as bag-of-words Naïve Bayes and ESIM, on their performance across these categories (Sec 4). Our analyses indicate that while these models perform well on some categories, such as linguistic reasoning, the performance on many other categories, such as those that require world knowledge or temporal reasoning, is quite poor. We also look into the embeddings of the P-H pairs to understand which of these categorical distinctions are captured well in the learnt representations, and which get conflated (Sec 5). In line with our previous finding, we observe a strong correlation between the level of clustering within the representations of the examples from a category and the performance of the models on that particular category.

A New Taxonomy for NLI

Necessity for a New Taxonomy
According to Wittgenstein (1922), "Language disguises the thought", and human beings try to gauge such thought from colloquial language using "complex silent adjustments". The journey from the lexicon and syntax of "language" to the aspects of semantics and pragmatics portrays important milestones that an ideal NLU system should achieve. Irrespective of the order of such milestones[1], we believe that NLU (and NLI) systems should be tested and analyzed with respect to fundamental linguistic and logical phenomena. Recently, different types of phenomena have been tested through 1) creating new datasets, 2) probing tasks, and 3) error-analysis categorizations. Researchers have created new datasets by recasting various NLU tasks into a large NLI dataset (Poliak et al., 2018), eliciting counterfactual examples from non-experts by considering different lexical and reasoning factors (Kaushik et al., 2019), and eliciting adversarial examples by letting non-experts come up with examples through interacting with SOTA systems (Nie et al., 2019). However, these datasets do not expose the linguistic aspects where the current systems have difficulty. Using the probing task methodology, researchers (Jawahar et al., 2019; Goldberg, 2019) observed that BERT captures syntactic structure, along with some semantics such as NER and semantic role labels (Tenney et al., 2019b). However, BERT's ability to reason is questioned by the observed performance degradation on MNLI (McCoy et al., 2019). Linzen (2020) also called for a pretraining-agnostic evaluation setup, i.e., one not limited to pre-trained language models. Our taxonomic categorization is meant to serve as a set of necessary inferencing capabilities that one would expect a competent NLI system to possess, thereby promoting more probing tasks along unexplored categories.

[1] "For an infant, a foreigner, or an instant-message addict, context is more important than syntax" (Sowa, 2010).
Existing categorization efforts have centred around informing feature creation in the pre-Transformer era, and model-specific error analysis in more recent times. Earlier, LoBue and Yates (2011) enumerated the types of commonsense knowledge required for NLI. Among recent error-analysis efforts, the GLUE diagnostic dataset (Wang et al., 2018), the inference types for Adversarial NLI (Nie et al., 2019), the CheckList system (Ribeiro et al., 2020), and the Stress Tests (Naik et al., 2018) are worth mentioning. When we attempted to group the categorizations in Nie et al. (2019) and Wang et al. (2018) into four high-level categories (lexical, syntactic, semantic, and pragmatic)[2], we observed a lack of consensus, non-uniformity, and repetitiveness among these categories. For example, the Tricky label in Nie et al. (2019) groups examples that involve "wordplay, linguistic strategies such as syntactic transformations, or inferring writer intentions from contexts", thereby spanning aspects of both syntax and pragmatics. Similarly, Reference and Names requires both reasoning and knowledge. The GLUE diagnostic categories (Wang et al., 2018) do not include interesting reasoning categories such as temporal and spatial reasoning. The stress types proposed by Naik et al. (2018) mostly target lexical and some semantic corner cases. This is expected, as these categorizations are analysis-oriented and often depend on the performance of the particular set of models in question. Here, we propose a taxonomic categorization that delineates a set of necessary, uniform inferencing capabilities for the NLI task.

[2] Table provided in the Appendix.

Taxonomic Categories: Definitions and Examples
In Figure 1, we present our taxonomic categorization, which is based on the following principles. First, we take a model-agnostic approach, working from first principles to arrive at a set of basic inferencing processes that are required in the NLI task. These include an unrestricted variety of linguistic and logical phenomena, and may require knowledge beyond text, thus providing us with the higher-level categories: linguistic, logical, and knowledge-based. Second, we retain categories that are non-overlapping and sufficiently represented in NLI datasets. For example, among the subcategories under linguistic, we prune semantics because its necessary aspects are covered by the logical and knowledge-based categories; we omit specific aspects of pragmatics such as implicatures and presuppositions, as they are rarely observed in NLI datasets (Jeretic et al., 2020). Third, we aim to list a set of necessary sub-categories. For example, for the logical deduction sub-categories, we take inspiration from Davis and Marcus (2015), who list the commonsense reasoning categories where systems have seen success. Lastly, since we aim to employ non-experts for collecting annotations, we restrict further sub-division wherever the definitions become complicated or pre-suppose certain expertise; for example, the lexical category is not sub-divided further (following Wang et al. (2018)). Thus, we take a pragmatic approach that is theory-neutral and does not claim coverage of all reasoning tasks, though we believe the taxonomy is sufficiently deep and generic to allow systematic and meaningful analysis of NLI models with respect to their reasoning capabilities. Next we define the categories; for a full set of examples, please see Table 1.
High-Level Categories: The Linguistic category represents NLI examples where the inference process to determine the entailment is internal to the provided text. We classify examples as Logical when the inference process may involve processes external to the text, such as mapping words to percepts and reasoning with them (Sowa, 2010). The Knowledge-based category represents examples where some form of external, domain, or commonly assumed knowledge is required for inferencing.

Motivated by the role of formal reasoning in dealing with temporal and spatial phenomena (Gabelaia et al., 2005) and causality (Pearl, 2009), we list relational, temporal, spatial, and causal reasoning under "Deductions". Relational reasoning stands for the requirement to perform deductive reasoning using relations present in the text. Spatial (and temporal) reasoning denotes reasoning using spatial (and temporal) properties of objects represented in the text. We also consider language-inspired reasoning categories such as co-reference resolution, which is known to often require event understanding (Ng, 2017) beyond superficial cues. Note that the presence of a lexical trigger for a category (such as negation) does not warrant labeling with that category unless understanding of that concept is invoked in the deduction process.

TaxiNLI: Dataset Details
We present TaxiNLI, a dataset collected based on the principles and categorizations of the aforementioned taxonomy. We curate a subset of examples from MultiNLI (Williams et al., 2018) by sampling uniformly based on the entailment label and the domain. We then annotate this dataset with fine-grained category labels.

Annotation Process
Task Design For large-scale data collection, our aim was to propose an annotation methodology that is relatively flexible in terms of annotator qualifications, yet results in high-quality annotations. To employ non-expert annotators, we designed a simplified guideline (questionnaire/interface) for the task that does not pre-suppose expertise in language or logic. As an overhead, the guideline requires a few rounds of one-on-one training of the annotators. Because it is expensive to perform such rounds of training on most crowdsourcing platforms, we hire and individually train a few chosen annotators. After conducting the previously discussed pilot study and using the resulting feedback, we created a hierarchical questionnaire which first asked the annotator to perform the NLI inference on the P-H pair, and then asked targeted questions to obtain the desired category annotations for the datapoints. The questionnaire is shared in the Appendix.
For the entailment/contradiction examples, we collected binary annotations for each of the 15 categories in our NLI taxonomy, for datapoints in MNLI which had 'entailment' or 'contradiction' as gold labels. For the MNLI datapoints with 'neutral' gold labels, however, we realized through observation and annotator feedback that annotating the categories was difficult, as sometimes the hypothesis could not be connected well back to its premise. Hence, we created two questionnaires, one for the 'entailment/contradiction' examples and one for the 'neutral' examples. To resolve the difficulty with the 'neutral' examples, we specifically asked the annotators whether the premise and hypothesis were discussing 1) the same general topic (politics, geology, etc.), and if so, 2) the same subject and/or object of discussion (Obama, Taj Mahal, etc.). If the response to 2) was 'yes', then they were asked to provide the category annotations as previously defined.
Annotator Training/Testing We first tested our two annotators by asking them to do inference on a set of randomly selected premise-hypothesis pairs from MultiNLI. This was to familiarize them with the inference task. After giving the category annotation task, we also continuously tested and trained the two annotators. After a set of datapoints were annotated, we reviewed and went through clarification and feedback sessions with the annotators to ensure they understood the task, and the categories, and what improvements were required. More details are provided in the Appendix.

Annotation Metrics
Here, we assess the individual annotator performance and inter-annotator agreement. Since automated metrics for individual complex category annotations are hard to define, we use an indicative metric that matches the annotated inference label with the gold label, i.e., their inference accuracy. We also calculated inter-annotator agreement between the two annotators on an overlapping subset of 600 examples, using Fleiss' Kappa (κ) (Fleiss, 1971). We also compute another simple statistic, the 'IOU' (Intersection-Over-Union) of categories per datapoint, defined as IOU = (1/N) Σ_j |C_1j ∩ C_2j| / |C_1j ∪ C_2j|, where C_ij is the set of category annotations by annotator i for datapoint j, and N is the total number of datapoints. Looking at the category-wise Fleiss' κ values in Fig. 2, we observe promising levels of agreement in most of the categories except syntactic, relational, and world. We observe that the average inference accuracy (86.7%) is high despite known ambiguity in MNLI examples. Similarly, both the average Fleiss' κ (0.226) and the IOU metric (0.241) suggest a reasonable overall inter-annotator agreement.
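As an illustration, the IOU statistic above can be computed as follows; the category sets below are made up for the sketch and are not actual TaxiNLI annotations:

```python
# Per-datapoint Intersection-Over-Union of the two annotators' category sets,
# averaged over datapoints (illustrative data only).

def category_iou(ann1, ann2):
    """Mean IOU of category sets; empty-vs-empty counts as full agreement."""
    scores = []
    for c1, c2 in zip(ann1, ann2):
        union = c1 | c2
        scores.append(len(c1 & c2) / len(union) if union else 1.0)
    return sum(scores) / len(scores)

ann1 = [{"lexical", "negation"}, {"causal"}, {"lexical"}]
ann2 = [{"lexical"},             {"causal"}, {"syntactic"}]
iou = category_iou(ann1, ann2)  # (1/2 + 1 + 0) / 3 = 0.5
```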

Dataset Statistics
Each datapoint in TaxiNLI consists of a premise-hypothesis pair, the entailment label, and binary annotations for 18 features: 15 features correspond to the 15 categories discussed in the taxonomy, and 3 additional features for the 'neutral' gold-label datapoints capture same general topic, same subject, and same object. The statistics are listed in Tab.

Categorical Observations: From our annotations, we observe that inferencing each MNLI example requires about 2 categories on average. Fig. 3 shows the distribution of categories in the TaxiNLI dataset. We see that a large number of P-H pairs in MNLI require lexical and syntactic knowledge to make an inference, whereas the challenges of relational, spatial, and taxonomic inference are not adequately represented.

Categorical Correlations: Fig. 4 shows correlations among categories in our dataset. We observe that most categories show weak correlation in the MNLI dataset, hinting at a possible independence of the categories with respect to each other. Relatively stronger positive correlations are seen between the boolean-quantifier and boolean-comparative category pairs. We specifically looked at the genre-wise split of datapoints containing boolean-quantifier and saw that nearly 25% of them came from the 'telephone' genre of MNLI. An example is "P: have that well and it doesn't seem like very many people uh are really i mean there's a lot of people that are on death row but there's not very many people that actually um do get killed H: Most people on death row end up living out their lives awaiting execution.". Factivity, on the other hand, is negatively correlated with almost all other categories except world, which means P-H pairs labeled with factivity typically have no other categories marked.
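The category correlations of Fig. 4 amount to pairwise correlations between binary indicator columns; a minimal, self-contained sketch on made-up indicators (not the actual TaxiNLI annotations):

```python
# Pearson correlation between two binary category columns, one value per
# example. On 0/1 indicators this equals the phi coefficient.
import math

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Toy indicators for two categories over six examples.
boolean    = [1, 1, 0, 0, 1, 0]
quantifier = [1, 1, 0, 0, 0, 0]
r = pearson(boolean, quantifier)  # 1/sqrt(2) ~ 0.707 for these toy vectors
```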

(Re)Evaluation of SOTA Models
We re-evaluate two Transformer-based models and two standard baseline machine learning models on TAXINLI, under the lens of the taxonomic categories. The Transformer-based models are BERT and RoBERTa; as a pre-Transformer neural baseline, we use ESIM (Chen et al., 2017). We also train a Naive Bayes (NB) model using bag-of-words features for the P-H pairs after removing stop words.
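A minimal sketch of such a bag-of-words Naive Bayes baseline, assuming scikit-learn; the P-H pairs and labels below are toy examples, not MNLI data, and this is not the authors' exact setup:

```python
# Bag-of-words Naive Bayes over concatenated premise-hypothesis text,
# with English stop words removed (illustrative data only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

pairs = [
    ("A man is playing a guitar.", "A man is making music."),
    ("A man is playing a guitar.", "Nobody is playing anything."),
    ("A dog runs in the park.", "An animal is outdoors."),
    ("A dog runs in the park.", "The park is empty."),
]
labels = ["entailment", "contradiction", "entailment", "contradiction"]

# Join P and H into one string per example; the separator token is arbitrary.
texts = [p + " [SEP] " + h for p, h in pairs]
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(texts, labels)
pred = model.predict(["A man is playing a guitar. [SEP] Nobody is making music."])[0]
```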

TAXINLI Error Analysis
We report the NLI task accuracy of the baseline systems on the MNLI validation sets in Table 3.

The systems are fine-tuned on the MNLI training set following the procedures in Devlin et al. (2019). We evaluate the systems on a total of 7.7k examples, which lie in the intersection of TAXINLI and the validation sets of MNLI. Figure 5 shows, for each category c_i, the normalized frequency with which a model predicts an NLI example of that category accurately, i.e., #(c_i = 1, correct = 1) / #(c_i = 1). We observe that, compared to NB, the improvements in BERT are higher in the lexical and syntactic categories than in others. Improvements in ESIM over NB show a very similar trend, with negligible improvements for the knowledge categories. ESIM shows its largest improvement on negation.
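The per-category score in Fig. 5 can be sketched as follows, with hypothetical category flags and correctness indicators:

```python
# acc(c) = #(c = 1, correct = 1) / #(c = 1): of the examples carrying a
# category, the fraction the model got right (illustrative data only).
def per_category_accuracy(flags, correct):
    hits = sum(f and c for f, c in zip(flags, correct))
    total = sum(flags)
    return hits / total if total else float("nan")

# Five examples carry the category; the model is correct on four of them.
flags   = [1, 1, 0, 1, 1, 0, 1]
correct = [1, 0, 1, 1, 1, 0, 1]
acc = per_category_accuracy(flags, correct)  # 4 / 5 = 0.8
```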

Factor Analysis
In order to quantify the precise influence of the category labels on the predictions of the NLI models, we probe into indicators and confounding factors using two methods: linear discriminant analysis (LDA) and logistic regression (LR). We use indicators for each category (0 or 1) and for two potential confounding variables (the lengths of P and H) to model the correctness of prediction of the NLI system. The coefficients of these analyses on BERT are shown in Fig. 6; the values for RoBERTa follow a similar trend and are presented in the Appendix. We see that the presence of certain taxonomic categories strongly influences the correctness of prediction (significant LR coefficients: syntactic**, negation***, boolean*, causal***, world***, Length2**; where the p-value is smaller than: 0.001***, 0.01**, 0.05*). As found in the analysis presented in Sec. 4, we observe that syntactic, negation, and spatial are strong indicators of correct predictions, whereas examples from the conditional, relational, causal, and coreference categories are harder to predict accurately. Sentence length does not play a significant role.
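A sketch of the logistic-regression factor analysis, on synthetic data where a hypothetical negation indicator lowers the chance of a correct prediction; scikit-learn and NumPy are assumed, and this is not the authors' actual code:

```python
# Regress correctness of the NLI prediction on a binary category indicator
# plus a length confound (all data simulated for illustration).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500
negation = rng.integers(0, 2, n)       # category indicator (0 or 1)
length_p = rng.integers(5, 40, n)      # premise-length confound
# Simulate: negation examples are harder (lower chance of a correct prediction).
p_correct = 0.9 - 0.4 * negation
correct = (rng.random(n) < p_correct).astype(int)

X = np.column_stack([negation, length_p])
lr = LogisticRegression().fit(X, correct)
neg_coef = lr.coef_[0][0]  # expected negative: negation hurts correctness
```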
We also observe that for categories such as lexical and syntactic, a high proportion of a single NLI label correlates with high prediction accuracy (Fig. 6).

Discussion
Visual Analysis: Section 4 paints a thorough picture by analysing the fine-grained capabilities of SOTA NLI systems at a behavioral level. While we can say the systems are lacking in certain aspects despite their high overall performance, this naturally raises questions at the understanding level: 1) Is there any implicit knowledge acquired by the NLI-finetuned systems about the kinds of reasoning required in the inference task? 2) If not, do the systems simply lack the understanding of what kind of reasoning is required per example, or, despite understanding that, are they unable to perform it? The layer-wise visualizations of the P-H pair representations (Fig. 7) reveal definitive patterns of clustering by taxonomic categories. The earliest separation is observed for the lexical category, at layer 3 in BERT (and layer 1 in RoBERTa), well before any other categories are realized. At layer 6 in BERT and layer 19 in RoBERTa, about the same time as clustering by NLI label is seen, the connectives cluster is revealed. The deductions (see Sec. 2) and syntactic categories are revealed in later layers (layers 11 and 12 in BERT and layer 21 in RoBERTa). The knowledge category is revealed more prominently in RoBERTa at layer 21, while BERT does not seem to show such a cluster. By the last few layers, separation into most categories becomes apparent. This means that, across the layers of an NLI-finetuned language model, taxonomic information is implicitly captured. Despite this, as discussed in the previous sections, SOTA models seem to be deficient in some of the categories: certain categories remain harder to perform inference on. In the later layers, the separation along taxonomic categories also corresponds strongly with separation along NLI labels. For instance, in Fig. 7 (c), the examples categorized as syntactic almost entirely lie in the entailment cloud, which matches our intuition based on the statistics in Fig. 3.
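One way to quantify such layer-wise clustering is a silhouette score of a category's labels against each layer's representations; the sketch below uses synthetic embeddings standing in for the model's hidden states (the helper `fake_layer` is purely illustrative):

```python
# Silhouette score per "layer": higher means the category separates more
# cleanly in that layer's embedding space (all data simulated).
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
labels = np.array([0] * 50 + [1] * 50)   # e.g. lexical vs non-lexical examples

def fake_layer(separation):
    """Stand-in for one layer's embeddings; real ones come from the model."""
    base = rng.normal(size=(100, 16))
    base[labels == 1, 0] += separation   # shift one class along a direction
    return base

# Simulate increasing separation in later layers, as observed for some
# categories, and score each layer.
scores = [silhouette_score(fake_layer(s), labels) for s in (0.0, 2.0, 6.0)]
```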
The layer-wise separation of examples by taxonomy raises an interesting possibility: model architectures that use this discriminative power to identify such taxonomic categories and give specialized treatment to examples requiring certain reasoning capabilities.
Recasting: The under-representation of certain categories in the MNLI dataset raises a need for more balanced data collection. A possible alternative is to build recast diagnostic datasets for each category and create probing tasks. Some datasets (Zhang et al., 2019; Richardson et al., 2020) can be recast to the syntactic and logical categories, respectively, as their data creation aligns with our category definitions. However, most categories lack such aligned synthetic data, and crowdsourced data would require manual annotation as above. This poses an avenue for future work.

Conclusion
To bridge the gap between accuracy-led performance measurement and linguistic analysis of state-of-the-art NLI systems, we propose a taxonomic categorization of necessary inferencing capabilities for the NLI task, and a re-evaluation framework for systems on a re-annotated NLI dataset using this categorization, which underscores the reasoning categories that current systems struggle with.
We would like to emphasize that, unlike challenge and adversarial datasets, TAXINLI re-annotates samples from existing NLI datasets to which the SOTA models have been exposed. Therefore, a lower accuracy in certain taxonomic categories cannot simply be explained away by the "lack of data" and "unnatural distribution" arguments.

Since examples are annotated with multiple categories, we capture the dependencies by defining a Bayesian Network (BN) where each category (a boolean random variable) has a directed edge to the correct node (representing correctness of prediction)[6]. We learn the parameters by fitting this BN to the observed data. In Figure 8, we see from the Bayesian estimate again that the improvements by BERT in categories such as relational reasoning have been low. It also shows a sharp decrease in accuracy for examples requiring the use of taxonomic knowledge. However, RoBERTa improves over NB and ESIM by large margins, albeit non-uniformly.
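The Bayesian estimate of P(correct = 1 | category = 1) in Figure 8 can be sketched as a Laplace-smoothed conditional accuracy; the counts and the Beta(1, 1) prior below are illustrative assumptions, not necessarily the authors' exact parameterization:

```python
# Posterior-mean estimate of P(correct = 1 | category = 1) under a
# Beta(alpha, alpha) prior, fitted to illustrative counts.
def smoothed_accuracy(n_correct, n_total, alpha=1.0):
    """Beta(alpha, alpha) posterior mean for a Bernoulli success rate."""
    return (n_correct + alpha) / (n_total + 2 * alpha)

# e.g. a model is correct on 40 of 50 examples carrying a given category
est = smoothed_accuracy(40, 50)  # (40 + 1) / (50 + 2) ~ 0.788
```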

C Factor Analysis of Correctness of Prediction by RoBERTa
In Fig. 9, we show the results of the Linear Discriminant Analysis (LDA) and Logistic Regression (LR) for RoBERTa predictions. A very similar trend as for BERT can be seen here as well.

[6] Additionally, we attempted to learn a Bayes Net from the data using the bnlearn package, but the limited number of observations yielded non-intuitive results.

Figure 8: We show a Bayesian Estimate of P(correct = 1 | category = 1) for different systems.

D Annotation Questionnaire
Our annotation process went through several steps of refinement and improvement. We started with the most basic annotation flow: a manual that defines each taxonomic category in detail, after which the annotator marks each applicable category. For the pilot study, we took roughly 300 examples from MNLI and asked an initial annotator to annotate them. The feedback was the following:

• The manual describing each taxonomic category had a lot of information and took time to understand and digest.
• It was difficult to keep referring to the guide, although after sufficient examples, it became easier.
• There was confusion and ambiguity about the definitions, and the annotator interpreted the definitions differently than what we intended.
• Figuring out the categorical annotations for neutral examples was a challenge, as sometimes the topic or subject of what the hypothesis was discussing was separate from what the premise was discussing.
Through the analysis of these annotations, we also observed that some of the initial categories we had were either exceedingly underrepresented in the MNLI dataset, or were consistently confused with others. Thus, we revised the set of categories, setting more distinct boundaries and ensuring independence of categories. We revised the questionnaire into a hierarchical 'if-else' multi-choice design. The questionnaire is structured as follows:

Figure 9: Coefficients obtained through Linear Discriminant Analysis (LDA) and Logistic Regression (LR) to model the correctness of NLI prediction by RoBERTa, given taxonomy categories and possible confound variables. Significant LR coefficients: lexical**, negation***, world***, quantifier*, boolean*, Length1*; where the p-value is smaller than: 0.001***, 0.01**, 0.05*.
1. Can you evaluate S2 by just using the information/context given in S1? Or do you require knowledge from external documents, say history books, news articles, science books, etc.?
(a) Need more information (…)

A few examples are animal groups (snakes are reptiles), currencies (dollar is a currency), and types of activities (football is a sport; sport is an activity): basically, how a common noun (snake) belongs to a class (say reptiles), which can belong to yet another class (animals). Do not select this if the name of one object belongs to (or is a substring of) the name of the other class of objects (e.g. 'a green snake is a snake' isn't part of this category), or if the names of the objects are proper nouns (e.g. 'Barack Obama - president' and related examples are part of category 2a, not this one). E.g.:
• S1: Norman hated all musical instruments.
• S2: Norman loves the piano.
This is FALSE and requires external knowledge that a piano belongs to the class of instruments, hence Norman cannot love the piano.

(c) No extra knowledge required

3. Using just the information from S1, and given that you have the required knowledge from the above question, did you have to use some reasoning to figure out the answer, or did you just need the knowledge of words and paraphrasing, or both? (More than one can be ticked)

(a) Some reasoning was required, which wasn't explicitly written down in S1, but was implicitly understood.

(b) Knowledge of words (e.g. synonyms, antonyms), and recognizing paraphrases. The information I needed was explicitly written down in S1.
4. What kind of reasoning was required (if applicable) (More than one can be ticked)?
(a) You needed reasoning about relations in S1. You observed that there were objects/entities in S1 and there were explicit mentions of how they were related (e.g. Jack and his son went to the circus), and you used your reasoning about the nature of those relations to arrive at the