Behavior Analysis of NLI Models: Uncovering the Influence of Three Factors on Robustness

Natural Language Inference is a challenging task that has received substantial attention, and state-of-the-art models now achieve impressive test set performance in the form of accuracy scores. Here, we go beyond this single evaluation metric to examine robustness to semantically-valid alterations to the input data. We identify three factors - insensitivity, polarity and unseen pairs - and compare their impact on three SNLI models under a variety of conditions. Our results demonstrate a number of strengths and weaknesses in the models’ ability to generalise to new in-domain instances. In particular, while strong performance is possible on unseen hypernyms, unseen antonyms are more challenging for all the models. More generally, the models suffer from an insensitivity to certain small but semantically significant alterations, and are also often influenced by simple statistical correlations between words and training labels. Overall, we show that evaluations of NLI models can benefit from studying the influence of factors intrinsic to the models or found in the dataset used.


Introduction
The task of Natural Language Inference (NLI), also known as Recognizing Textual Entailment (RTE), has received a lot of attention and has elicited models which have achieved impressive results on the Stanford NLI (SNLI) dataset (Bowman et al., 2015).
Such results are impressive due to the linguistic knowledge required to solve the task (LoBue and Yates, 2011; MacCartney, 2009). However, the ever-growing complexity of these models inhibits a full understanding of the phenomena that they capture.
As a consequence, evaluating these models purely on test set performance may not yield enough insight into the complete repertoire of abilities learned and any possible abnormal behaviors (Kummerfeld et al., 2012; Sammons et al., 2010). A similar case can be observed in models from other domains; take as an example an image classifier that predicts based on the image's background rather than on the target object (Zhao et al., 2017; Ribeiro et al., 2016), or a classifier used in social contexts that predicts a label based on racial attributes (Crawford and Calo, 2016). In both examples, the models exploit a bias (an undesired pattern hidden in the dataset) to enhance accuracy. In such cases, the models may appear to be robust to new and even challenging test instances; however, this behavior may be due to spurious factors, such as biases. Assessing to what extent the models are robust to these contingencies just by looking at test accuracy is, therefore, difficult.
In this work we aim to study how certain factors affect the robustness of three pre-trained NLI models (a conditional encoder, the DAM model (Parikh et al., 2016), and the ESIM model (Chen et al., 2017)). We call these target factors insensitivity (not recognizing a new instance), polarity (a word-pair bias), and unseen pairs (recognizing the semantic relation of new word pairs). We became aware of these factors based on an exploration of the models' behavior, and we hypothesize that these factors systematically influence the behavior of the models.
In order to systematically test if the above factors affect robustness, we propose a set of challenging instances for the models: we sample a set of instances from SNLI data, we apply a transformation on this set that yields a new set of instances, and we test both how well the models classify these new instances and whether the target factors influence the models' behavior. The transformation (swapping a pair of words between premise and hypothesis sentences) is intended to yield both easy and difficult instances for the models, while keeping every instance easy for a human to annotate.
We draw motivation to study the robustness of NLI models from previous work on evaluating complex models (Isabelle et al., 2017; White et al., 2017). Furthermore, we base our approach on the discipline of behavioral science, which provides methodologies for analyzing how certain factors influence the behavior of subjects under study (Epling and Pierce, 1986).
We aim to answer the following research questions: How robust is the predictive behavior of the pre-trained models under our transformation to input data? Do the target factors (insensitivity, polarity, and unseen pairs) influence the prediction of the models? Are these factors common across models?
Our results show that the models are robust mainly where the semantics of the new instances do not change significantly with respect to the sampled instances and thus the class labels remain unaltered; i.e., the models are insensitive to our transformation to input data. However, when the class labels change, the models' accuracy drops significantly. In addition, the models exploit a bias, polarity, to stay robust when facing new instances. We also find that the models are able to cope with unseen word pairs under a hypernym relation, but not with those under an antonym relation, suggesting their inability to learn a symmetric relation.

Analysis of Complex Models
Previous works in ML and NLP have analyzed different aspects of complex models using a variety of approaches; for example, understanding input-output relationships by approximating the local or global behavior of the model using an interpretable model (Ribeiro et al., 2016; Craven and Shavlik, 1996), or analyzing the output of the model under lesions of its internal mechanism (Li et al., 2016). Another line of work has analyzed the robustness of NLP models both via controlled experiments that complement the information from test set accuracy and test the abilities of the models (Isabelle et al., 2017; B. Hashemi and Hwa, 2016; White et al., 2017) and via adversarial instances that expose weaknesses (Jia and Liang, 2017). In addition, work has been done to uncover and diminish gender biases in datasets captured by structured prediction models (Zhao et al., 2017) and word embeddings (Bolukbasi et al., 2016). However, to the best of our knowledge, no previous work has studied the robustness of NLI models while analyzing factors affecting their predictions.

Behavior Analysis
Previous work on behavioral science has focused on understanding how environmental factors influence behaviors in both human (Soman, 2001) and animal (Mench, 1998) subjects with the objective of predicting behavioral patterns or analyzing environmental conditions. This methodology also helps to identify and understand abnormal behaviour by collecting behavioral data without the need to reach any internal component of the subject (Birkett and Newton-Fisher, 2011).
We base our approach on the discipline of behavioral science since some of our research questions and objectives align with those of this discipline; in addition, its methodology for studying how factors affect the subjects' behavior provides statistical guarantees.

Background
Natural Language Inference
NLI, or RTE, is the task of inferring whether a natural language sentence (hypothesis) is entailed by another natural language sentence (premise) (MacCartney, 2009; Dagan et al., 2009; Dagan and Glickman, 2004).
More formally, given a pair of natural language sentences i = (premise, hypothesis), a model classifies the relation between the two sentences into one of three classes: entailment, where the hypothesis is necessarily true given the premise; neutral, where the hypothesis may be true given the premise; and contradiction, where the hypothesis is necessarily false given the premise. Solving this task is challenging since it requires linguistic and semantic knowledge, such as co-reference, hypernymy, and antonymy (LoBue and Yates, 2011), as well as pragmatic knowledge and informal reasoning (MacCartney, 2009).

Behavior Analysis
Behavior analysis seeks to account for the role that factors (independent variables) play in the behavior (dependent variable) of subjects. Testing for the influence of a factor on the subject's behavior can be done via statistical tests: a null hypothesis states no association between a target factor and behavior, whereas the alternative hypothesis states an association (McDonald, 2014).

SNLI Dataset
The Stanford NLI dataset (Bowman et al., 2015) was created with the purpose of training deep neural models while providing human-annotated data. Each instance was created by providing a premise sentence, harvested from a pre-existing dataset, to a crowdsource worker who was instructed to produce three hypothesis sentences, one for each NLI class (entailment, neutral, contradiction). This process yielded a balanced dataset containing around 570K instances.

Models
Conditional Encoder (CE). We use two bidirectional LSTMs; the first LSTM encodes the premise sentence into a fixed-size vector embedding by reading it sequentially, word by word, while the second LSTM encodes the hypothesis sentence conditioned on the representation of the premise sentence. At the final layer we use a softmax over the class labels on top of a 3-layer MLP. All embeddings, of dimensionality d = 100, were randomly initialized and learned during training. Accuracy on SNLI's dev set is 0.782.

Decomposable Attention Model. DAM (Parikh et al., 2016) consists of 2-layer multilayer perceptrons (MLPs) factorized in a 3-step process. First, a soft-alignment matrix is created for all the words in both the premise and hypothesis. Then, each word of the premise is paired with the soft-alignment representation of the hypothesis sentence and fed into an MLP, and similarly for each word in the hypothesis with the soft-alignment of the premise. The resulting representations are then aggregated: the vector representations of the premise are summed up, and likewise those of the hypothesis; the new representations are then fed to an MLP, followed by a linear layer and a softmax whose output is a class label. We use d = 300 dimensional GloVe embeddings (not updated at training time). All layers use the ReLU function. Accuracy on SNLI's dev set is 0.854.
Enhanced Sequential Inference Model. ESIM (Chen et al., 2017) performs inference in three stages. First, Input Encoding uses BiLSTMs to produce representations of each word in its context within premise or hypothesis. Then, Local Inference Modelling constructs new word representations for each hypothesis (premise) word by summing over the BiLSTM hidden states for the premise (hypothesis) words using weights from a soft attention matrix. Additionally, these representations are enhanced with element-wise products and differences of the original hidden state vectors and the new attention-based vectors. Finally, Inference Composition uses a BiLSTM, average and max pooling, and an MLP output layer to produce predicted labels. Accuracy on SNLI's dev set is 0.882.
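The soft alignment shared by DAM and ESIM, and the enhancement step of ESIM, can be illustrated with a minimal pure-Python sketch. It is not the implementation used in our experiments: the raw dot-product score, the toy two-dimensional vectors, the concatenation order, and the function names `soft_align` and `enhance` are all assumptions made for illustration (DAM scores words after passing them through an MLP, and ESIM operates on BiLSTM hidden states).

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_align(queries, keys):
    """For each query vector, return the attention-weighted sum of the keys."""
    dim = len(keys[0])
    aligned = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        aligned.append(
            [sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)]
        )
    return aligned


def enhance(a, a_tilde):
    """Concatenate a, its aligned vector, their difference and their product.

    The concatenation order is an assumption made for this sketch.
    """
    diff = [x - y for x, y in zip(a, a_tilde)]
    prod = [x * y for x, y in zip(a, a_tilde)]
    return a + a_tilde + diff + prod


# Toy word vectors (hypothetical): two premise words, one hypothesis word.
premise = [[1.0, 0.0], [0.0, 1.0]]
hypothesis = [[1.0, 0.0]]

aligned = soft_align(premise, hypothesis)
enhanced = enhance(premise[1], aligned[1])
print(aligned)
print(enhanced)
```

With a single hypothesis word, every premise word is aligned to that word, so the difference and product features carry all the information about how the two sides disagree.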

Methods
We test our main hypothesis (Section 1) by perturbing instances in a controlled, simple, and meaningful way. This alteration, at the instance level, yields new sets of instances which range from easy (the semantics and the label of the new instance are the same as those of the original instance) to challenging (both semantics and label of the new instance change with respect to those of the original instance), but all of them remain easy for a human to annotate.
To examine how the models generalize from seen instances to transformed instances, we sample our original instances from the SNLI training set, which we refer to as control instances from now on. We then produce new instances which differ either minimally from the control instances, by changing only a single word in the premise and hypothesis, or more substantially, by copying the same sentence structure into the premise and hypothesis with a single word changed. In this way, we produce instances that contain only words seen at training time, within sentence structures also seen at training time. Thus, our evaluation sets are as in-domain as possible, and control for factors associated with novel sentential contexts and vocabulary.

Basic Procedure and Statistical Analyses
We first sample an instance from the SNLI dataset according to a given criterion, namely we look for a specific word pair in the instance; then, we apply our transformation over the word pair.
This procedure generates a new instance. After that, the models label the new instance, and we statistically analyze which target factors influenced the models to respond in such a way via chi-square (McNemar's, independence, and homogeneity) tests (McDonald, 2014; Alpaydin, 2010). When the sample size is too small we apply Yates's correction or a Fisher test. We use the StatsModels (Seabold and Perktold, 2010) and SciPy (Oliphant, 2007) packages. The level of significance is p < 0.0001, unless otherwise stated. This procedure is applied in four experiments, where we study the effect of different word pairs (hypernyms, hyponyms, and antonyms) and the effect of two types of context words surrounding the word pairs, which we refer to as in situ and ex situ (explained in Section 5.3).
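As an illustration of the paired comparison between predictions on control and transformed instances, McNemar's test can be sketched in plain Python. The discordant counts below are hypothetical, the continuity correction is one common variant, and the function name `mcnemar` is an assumption; the sketch is dependency-free for readability and does not stand in for the StatsModels and SciPy routines.

```python
import math


def mcnemar(b, c, correction=True):
    """McNemar's chi-square test on the two discordant cells of a paired 2x2 table.

    b: instances classified correctly as control but incorrectly once transformed.
    c: instances classified incorrectly as control but correctly once transformed.
    Returns (statistic, p_value) with one degree of freedom.
    """
    if b + c == 0:
        return 0.0, 1.0
    diff = abs(b - c)
    if correction:
        diff = max(diff - 1, 0)  # continuity correction
    stat = diff ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


# Hypothetical discordant counts, for illustration only.
stat, p_value = mcnemar(b=70, c=10)
print(stat, p_value)
print(p_value < 0.0001)
```

Only the discordant pairs enter the statistic, which is why the test suits a before/after comparison on the same instances.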

Transformation and Word Pairs
Given a set of word pairs of the form $W = (w_1, w_2)$, where $w_1$ and $w_2$ hold under a semantic relation $s \in \{\text{antonymy}, \text{hypernymy}, \text{hyponymy}\}$, we look through the training set for instances $i_k = (p_k, h_k)$, where $p_k$ and $h_k$ are premise and hypothesis sentences, respectively, such that $w_1 \in p_k$ and $w_2 \in h_k$. For each instance $i_k$ we apply transformation $T$: we swap $w_1$ with $w_2$; this transformation yields an instance $i_m = (p_m, h_m)$ where $w_2 \in p_m$, $w_1 \in h_m$ and $w_1 \notin p_m$, $w_2 \notin h_m$. An example of transformation $T$ on a contradiction instance $i_k$ is the following:
(1) $p_k$: A soccer game occurring at sunset.
    $h_k$: A basketball game is occurring at sunrise.
Here the words in the pair (sunset, sunrise) are antonyms. After applying transformation $T$, we obtain the new contradiction instance $i_m$:
(2) $p_m$: A soccer game occurring at sunrise.
    $h_m$: A basketball game is occurring at sunset.
Consider now the following instance $i_l$ (class label entailment):
(3) $p_l$: A little girl hugs her brother on a footbridge in a forest.
    $h_l$: A pair of siblings are on a bridge.
If we now apply transformation $T$ on the hypernym word pair (footbridge, bridge) we derive the new instance $i_n$ (class neutral):
(4) $p_n$: A little girl hugs her brother on a bridge in a forest.
    $h_n$: A pair of siblings are on a footbridge.
Since swapping word pairs under hypernymy or hyponymy relations may yield a different class label for the new instance, we manually annotate all the instances in the new sample, discarding those that are semantically incoherent.
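The swap can be sketched in a few lines of Python. This is a minimal illustration rather than the code used to build our samples: the function name `swap_word_pair` and the whole-word, case-sensitive matching are assumptions, and the sketch does not perform the manual annotation step described above.

```python
import re


def _replace_word(sentence, old, new):
    """Replace whole-word occurrences of `old` with `new` (case-sensitive)."""
    return re.sub(rf"\b{re.escape(old)}\b", new, sentence)


def swap_word_pair(premise, hypothesis, w1, w2):
    """Apply the swap: w1 moves from premise to hypothesis, w2 the other way."""
    if not re.search(rf"\b{re.escape(w1)}\b", premise):
        raise ValueError("w1 must occur in the premise")
    if not re.search(rf"\b{re.escape(w2)}\b", hypothesis):
        raise ValueError("w2 must occur in the hypothesis")
    return _replace_word(premise, w1, w2), _replace_word(hypothesis, w2, w1)


# Example 1 -> Example 2 (antonym pair).
p_m, h_m = swap_word_pair(
    "A soccer game occurring at sunset.",
    "A basketball game is occurring at sunrise.",
    "sunset",
    "sunrise",
)
print(p_m)
print(h_m)

# Example 3 -> Example 4 (hypernym pair); the gold label must be re-annotated.
p_n, h_n = swap_word_pair(
    "A little girl hugs her brother on a footbridge in a forest.",
    "A pair of siblings are on a bridge.",
    "footbridge",
    "bridge",
)
print(p_n)
print(h_n)
```

Whole-word matching matters for pairs such as (footbridge, bridge), where one word is a substring of the other.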

Experimental Conditions
We consider two types of sentential context for the word pairs, namely in situ and ex situ. Examples of instances under the in situ condition are Examples 1, 2, 3, and 4 in Section 5.2. The name in situ refers to the fact that we analyze the effect of the transformation $T$ within the original context of the premise and hypothesis sentences. This allows us to control for confounding factors, such as sentence length and order of the context words.
We also consider an ex situ condition in which we remove the word pair from the original premise and hypothesis and analyze the effect of the transformation $T$ within a simplified sentential context which is the same in premise and hypothesis. Specifically, we randomly select either the premise or hypothesis context from the original instance and copy it into both positions. In this way, we obtain a sentence pair where the only difference between the premise and hypothesis is the word pair, which allows us to isolate the effect of this pair from its interaction with the surrounding context; this condition thus allows us to control for context words. This process yields a new set of instances, which we refer to as $E$.
An example of an ex situ instance can be constructed from Example 1 (Section 5.2). If the premise sentence is selected, then after performing the procedure described above, the following sentence pair $e_k$ is generated:
(5) $p_k$: A soccer game occurring at sunset.
    $h_k$: A soccer game occurring at sunrise.
Given a sample $E$, we apply the transformation $T$ in order to generate a transformed sample $E_T$ where the word pairs are swapped, similar to the procedure applied in Section 5.2 on SNLI control instances in order to generate their transformed counterparts. In the latter case, we say that given a sample of control instances $I$ we generate a transformed sample $I_T$.
As an example of obtaining a transformed ex situ instance, we apply $T$ to (sunset, sunrise) in Example 5 to obtain the new instance $e_m$:
(6) $p_m$: A soccer game occurring at sunrise.
    $h_m$: A soccer game occurring at sunset.
We note that for both conditions, in situ and ex situ, the same word pairs are swapped, so the differences are the surrounding context words and the factors being controlled.
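The ex situ construction can be sketched as follows. The function names `make_ex_situ` and `transform_ex_situ` and the `keep` parameter are hypothetical, and the sketch assumes each target word occurs exactly once in its sentence. Under that assumption, applying the swap to an ex situ instance is equivalent to exchanging premise and hypothesis, because the two sentences differ only in the word pair.

```python
import re


def _replace_word(sentence, old, new):
    """Replace whole-word occurrences of `old` with `new` (case-sensitive)."""
    return re.sub(rf"\b{re.escape(old)}\b", new, sentence)


def make_ex_situ(premise, hypothesis, w1, w2, keep="premise"):
    """Copy one sentential context into both positions, keeping the word pair.

    keep="premise" reuses the premise context; keep="hypothesis" reuses the
    hypothesis context. w1 is the premise word, w2 the hypothesis word.
    """
    if keep == "premise":
        return premise, _replace_word(premise, w1, w2)
    if keep == "hypothesis":
        return _replace_word(hypothesis, w2, w1), hypothesis
    raise ValueError("keep must be 'premise' or 'hypothesis'")


def transform_ex_situ(instance):
    """Swap the word pair of an ex situ instance by exchanging its sentences."""
    premise, hypothesis = instance
    return hypothesis, premise


# Example 1 -> Example 5 (premise context selected) -> Example 6.
e_k = make_ex_situ(
    "A soccer game occurring at sunset.",
    "A basketball game is occurring at sunrise.",
    "sunset",
    "sunrise",
    keep="premise",
)
e_m = transform_ex_situ(e_k)
print(e_k)
print(e_m)
```

To mirror the random selection of context, a caller could pass `keep=random.choice(["premise", "hypothesis"])`; the choice is made explicit here so the example stays deterministic.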

Test Sets
In each experiment we use two sets of instances in order to measure the robustness of the models and analyze our target factors: 1) the control instances, where the target word pair is in its original position, and 2) the transformed instances generated after applying transformation $T$. The name of each set corresponds to the experimental setting it is used in. Samples used in in situ experiments are named $I$, and those used in ex situ experiments $E$. Subscripts distinguish both the type of word pairs ($A$ for antonyms and $H$ for hypernym/hyponym) and the type of set (control or transformed). For example, $I_A$ refers to the control in situ set whose instances contain antonym word pairs, whereas $E_{TH}$ refers to the ex situ transformed test set containing hypernym/hyponym swapped word pairs.
We clarify: a) the sets $I_A$ and $I_H$ are sampled from the SNLI dataset; b) transformed test sets are generated from control sets containing control instances; c) we refer to the sets $E_A$ and $E_H$ as control test sets because the target word pairs are in their original position, and we apply $T$ on them in order to obtain the transformed samples $E_{TA}$ and $E_{TH}$, respectively.
Details about the sets: In order to build set $I_A$, we sample only contradiction instances (instances in $E_A$ are also contradictions). We use the antonym word pairs from Mohammad et al. (2013) to yield the sets $I_{TA1}$ and $E_{TA}$, which also only contain contradictions since the relation of antonymy is symmetric. We build two more sets, $I_{TA2}$ and $I_{TA3}$ (explained in Section 6.1). Sets $I_H$, $E_H$, $I_{TH}$, and $E_{TH}$ contain instances with any class label. In order to generate sets $I_{TH}$ and $E_{TH}$, we use the hypernym word pairs from Baroni et al. (2012). We manually annotate these transformed sets and discard incoherent instances.

Factors Under Study
We describe the three target factors that we hypothesize affect the models' responses.
Insensitivity is the name we give to the tendency of a model to predict the original label on a transformed instance that is similar to a control instance.Thus a model would be insensitive if, for example, it incorrectly predicts the same class label for both the control instance in Example 3 and the transformed instance in Example 4 just because they closely resemble each other.A simple measure of the impact of this effect is to look at the accuracy on the subset of instances in which the gold label was changed by the transformation.We show this effect by statistically correlating the rate of correct predictions with changes in the labels predicted.
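The simple measure mentioned above can be sketched as follows. The record layout, the toy predictions, and the function name `insensitivity_summary` are hypothetical; the sketch reports accuracy on the transformed instances whose gold label changed, together with the share of those instances on which a model kept the control label.

```python
def insensitivity_summary(records):
    """Summarize behavior on transformed instances whose gold label changed.

    Each record holds the control gold label, the transformed gold label and
    the model's prediction on the transformed instance.
    """
    changed = [r for r in records if r["gold"] != r["control_gold"]]
    if not changed:
        return {"n_changed": 0, "accuracy": None, "kept_control_label": None}
    correct = sum(r["predicted"] == r["gold"] for r in changed)
    kept = sum(r["predicted"] == r["control_gold"] for r in changed)
    return {
        "n_changed": len(changed),
        "accuracy": correct / len(changed),
        "kept_control_label": kept / len(changed),
    }


# Hypothetical predictions, for illustration only.
records = [
    {"control_gold": "entailment", "gold": "neutral", "predicted": "entailment"},
    {"control_gold": "entailment", "gold": "neutral", "predicted": "neutral"},
    {"control_gold": "contradiction", "gold": "neutral", "predicted": "contradiction"},
    {"control_gold": "contradiction", "gold": "contradiction", "predicted": "contradiction"},
]
summary = insensitivity_summary(records)
print(summary)
```

A high `kept_control_label` share combined with low accuracy on this subset is the pattern we describe as insensitivity.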
Unseen Word Pairs are another factor we can use to evaluate robustness. In this case, we are interested in the subset of transformed instances where the swapped word pair is now in an order within premise and hypothesis that was unseen in the training data. An example is Example 2, which contains the unseen word pair (sunrise, sunset); i.e., no instance in the training set contains the word sunrise in the premise and the word sunset in the hypothesis. Poor performance on this subset reflects an inability to exploit the symmetry (antonym pairs) or anti-symmetry (hypernym pairs) of the word pairs involved. We show the models' abilities to cope with unseen pairs by statistically associating proportions of instances containing unseen pairs with incorrect prediction rates.
Polarity is the name we give to the association between a word pair and the most frequent class it is found in across training instances. For example, we associate the word pair (sunset, sunrise) with polarity contradiction because it mainly appears in training instances with label contradiction. We define four main categories of polarity: neutral, contradiction, entailment, and none for unseen word pairs. Accuracy on the subset of instances where polarity and gold label disagree is an indicator of the extent to which a model is influenced by this factor. For example, a model incorrectly predicting label entailment for the instance in Example 4 (class neutral) based on the polarity of class entailment of its word pair (bridge, footbridge) indicates that the model is influenced by this factor. We show this influence by statistically correlating labels predicted with polarities.
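Polarity and the unseen-pair category can be computed with a short sketch such as the one below. The tokenization, the tie-breaking (the first label encountered wins), the toy training instances other than Example 1, and the function name `pair_polarity` are assumptions made for illustration.

```python
import re
from collections import Counter, defaultdict


def tokens(sentence):
    """Lowercased alphabetic tokens of a sentence, as a set."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def pair_polarity(training_instances, word_pairs):
    """Map each ordered pair (premise word, hypothesis word) to its polarity.

    Polarity is the most frequent training label among instances with w1 in
    the premise and w2 in the hypothesis, or "none" if the pair is unseen.
    """
    counts = defaultdict(Counter)
    for premise, hypothesis, label in training_instances:
        p_tokens, h_tokens = tokens(premise), tokens(hypothesis)
        for w1, w2 in word_pairs:
            if w1 in p_tokens and w2 in h_tokens:
                counts[(w1, w2)][label] += 1
    polarity = {}
    for pair in word_pairs:
        seen = counts.get(pair)
        polarity[pair] = seen.most_common(1)[0][0] if seen else "none"
    return polarity


# Toy training set: the first instance is Example 1, the others are hypothetical.
train = [
    ("A soccer game occurring at sunset.",
     "A basketball game is occurring at sunrise.", "contradiction"),
    ("People watch the sunset.", "People watch the sunrise.", "contradiction"),
    ("A couple walks at sunset.", "A couple walks before sunrise.", "neutral"),
]
pairs = [("sunset", "sunrise"), ("sunrise", "sunset")]
polarity = pair_polarity(train, pairs)
print(polarity)
```

Because pairs are ordered, the reversed pair (sunrise, sunset) is treated as unseen even though both words occur in the training data, which is exactly the situation created by the swap.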

Experiments and Results
Table 1 presents the performance of the models across the different test sets. In general, DAM and ESIM seem to be more robust than CE, with the latter's accuracy degrading to essentially random performance on the most challenging subsets. However, this general trend is reversed in a single row of the table. On $E_{TH}$, ESIM shows a comparable performance to CE. And on Subset 3 of $I_H$, DAM appears to rely on a bias (polarity) in the same way as CE. Overall, all models are affected by the three target factors, with performance dropping by up to 0.25, 0.20, and 0.28 for ESIM, DAM, and CE, respectively, just by virtue of our simple transformation of swapping words.

Experiment 1: Swapping Antonyms in In Situ Instances
In this experiment we use sets $I_A$ and $I_{TA1}$. Swapping antonyms seems to have no effect on the overall performance of the DAM model on $I_{TA1}$ when compared to $I_A$, and little effect on ESIM. Thus these two models appear to be robust to this transformation. Nonetheless, further analysis will not support the conclusion that both models have learned that antonymy is symmetric, and we will show that this seemingly robust behavior is due to confounding factors and not due to inference abilities. Accuracy scores of the CE model seem to reveal that it is much less robust to the antonym swap, with performance significantly dropping by roughly 10.5% according to a McNemar's test.
Insensitivity. Because all instances in $I_{TA1}$ are contradictions, we perform a proxy experiment to understand the models' sensitivity. From $I_A$, we substitute one of the antonyms in each word pair (in each instance) with a hyponym, hypernym, or synonym of the other. Doing this on both the premise and hypothesis yields two new samples, $I_{TA2}$ and $I_{TA3}$, which we manually annotate.
Examples of control (Example 7) and transformed (Example 8) instances are given below, showing the replacement of young, in the hypothesis, with aged, a synonym of elderly from the premise. This transformation changes the gold label from contradiction to neutral. Approximately half the sample yields such changes in gold label.
(7) $p_k$: An elderly woman sitting on a bench.
    $h_k$: A young mother sits down.
(8) $p_m$: An elderly woman sitting on a bench.
    $h_m$: An aged mother sits down.
This transformation leads to a considerable drop in overall performance for all models when accuracy scores on sets $I_{TA2}$ and $I_{TA3}$ are compared to the accuracy on the control instances in $I_A$: up to 0.175 (CE), 0.201 (DAM), and 0.24 (ESIM) points (Table 1). To test if insensitivity to the transformation is associated with these behaviors, we measure accuracy only on those instances that changed gold label (Subset 1 from the sets $I_{TA2}$ and $I_{TA3}$), where we see a further reduction in performance for all models. 2-way tests of independence provide strong evidence for the insensitivity of the models (CE: $\chi^2(1) = 73.33$, DAM: $\chi^2(1) = 108.30$, ESIM: $\chi^2(1) = 175.34$).
Table 2 shows the case for ESIM: most of its incorrect predictions are due to predicting the same label on both control and transformed instances when these two types of instances have different gold labels. Paradoxically, this effect works in the models' favour in the antonym swapping case ($I_{TA1}$) because all the gold labels remain contradiction. Thus ignoring the transformation will avoid any loss in performance.
Unseen Word Pairs. The results in the column Subset 2 of $I_{TA1}$ (Table 1) suggest that performance on unseen word pairs is weak. However, only 40 instances within $I_{TA1}$ contain unseen antonym pairs; thus the impact of this result may be limited. 2-way tests of homogeneity show that the difference in accuracy of predictions in instances containing seen or unseen word pairs is nonetheless significant for all models (CE: $\chi^2(1) = 19.46$, DAM: $\chi^2(1) = 74.16$, ESIM: $\chi^2(1) = 39.33$). In other words, the models struggle to recognize the reversed antonym pairs, even though they were all seen in their original order at training time. This effect can be seen, for example, in the contingency table for DAM in Table 3.
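To make the seen versus unseen comparison concrete, the sketch below computes per-group accuracy and a Pearson chi-square statistic from the DAM counts listed in Table 3 (567 and 20 correct, 13 and 20 incorrect, for seen and unseen pairs respectively). The function name `chi_square_2x2` is an assumption, and the sketch is only a plain-Python illustration of a 2x2 test: it is not meant to reproduce the statistics reported above, which depend on the exact test configuration used.

```python
import math


def chi_square_2x2(table, yates=False):
    """Pearson chi-square test on a 2x2 contingency table.

    Returns (statistic, p_value) with one degree of freedom; `yates` applies
    the continuity correction.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    rows = (a + b, c + d)
    cols = (a + c, b + d)
    stat = 0.0
    for i, row_total in enumerate(rows):
        for j, col_total in enumerate(cols):
            expected = row_total * col_total / n
            diff = abs(table[i][j] - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)
            stat += diff ** 2 / expected
    # Survival function of a chi-square variable with 1 degree of freedom.
    return stat, math.erfc(math.sqrt(stat / 2))


# Rows: correct, incorrect. Columns: seen, unseen (DAM counts from Table 3).
dam_counts = [[567, 20], [13, 20]]
seen_accuracy = dam_counts[0][0] / (dam_counts[0][0] + dam_counts[1][0])
unseen_accuracy = dam_counts[0][1] / (dam_counts[0][1] + dam_counts[1][1])
stat, p_value = chi_square_2x2(dam_counts, yates=True)

print(round(seen_accuracy, 3), unseen_accuracy)
print(p_value < 0.0001)
```

The gap between the two accuracies, rather than the overall accuracy on the sample, is what the homogeneity test evaluates.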

Table 3 counts (DAM on $I_{TA1}$; predictions by seen or unseen antonym word pair):
Predictions   seen   unseen
correct        567       20
incorrect       13       20

Polarity. Only 11% of the instances in the transformed sample $I_{TA1}$ contain word pairs that have polarity other than contradiction. Thus, a model relying only on this factor could achieve an accuracy of 89%. We investigate if the predicted labels on instances in $I_{TA1}$ are associated with the polarity of the transformed word pair. For all models, independence tests are highly significant (CE: $\chi^2(6) = 30.69$, DAM: $\chi^2(6) = 101.26$, ESIM: $\chi^2(6) = 64.40$). Table 4 shows that the predictions of DAM change according to the polarity of the word pairs. For example, when the polarity is contradiction, around 98.5% of the predictions are contradictions; however, this figure changes when the polarity is neutral, where the rate of correct predictions (contradictions) falls to 80.7%, and a more dramatic fall is observed when the word pairs are unseen (polarity none), where only 50% of the predictions are correct. This is strong evidence that the models learned to rely on polarity.
We note that a model with perfect accuracy on $I_{TA1}$ would lead to a statistic that does not reject the null hypothesis, showing in this case that the predictions are independent of polarity.
In the ex situ condition, we refrain from analyzing the effect of insensitivity, since doing so would require a transformation similar to that in the in situ condition, which might add an extra layer of change and make the results difficult to interpret.
Unseen Word Pairs. Accuracy scores strongly suggest that the models are weak at dealing with unseen antonym pairs (Subset 2 of $E_{TA}$ in Table 1); drops in performance on this subset range from 0.315 up to 0.429 points across the three models. Tests of homogeneity show strong evidence of this weakness for all models (CE: $\chi^2(1) = 15.91$, DAM: $\chi^2(1) = 59.17$, ESIM: $\chi^2(1) = 44.72$).
Comparing results on this subset with those of Subset 2 in $I_{TA1}$, we notice that ESIM and DAM keep similar behavior, but CE seems to be strongly affected by this context type.
Polarity. All models perform poorly on the subset of instances where polarity disagrees with the gold label of the instance (Subset 3 of $E_{TA}$), showing that the models' behavior relies on this bias. These results are highly significant (CE: $\chi^2(6) = 34.37$, DAM: $\chi^2(6) = 136.99$, ESIM: $\chi^2(6) = 103.47$). This is further evidence that the models get confused by a simple reversal of an antonym pair.

Experiment 3: Swapping Hypernyms and Hyponyms in In Situ Instances
We now study the effect on the robustness of the systems when we swap hypernym and hyponym word pairs in in situ instances. Whole sample accuracy scores in
Unseen Word Pairs. Whereas model performance was significantly worse on unseen antonym pairs, this effect is not obvious on the hyponym-hypernym results (Subset 2 of $I_{TH}$). In fact, all models have a slightly higher accuracy on this subset than overall. Homogeneity tests find no evidence of an association between unseen word pairs and incorrect predictions for any model (CE: $\chi^2(1) = 0.00036$, p = 0.98; DAM: $\chi^2(1) = 0.98$, p = 0.32; ESIM: $\chi^2(1) = 0.178$, p = 0.67). This effect may be explained by the models exploiting information from word embeddings. It has been shown that word embeddings are able to capture hypernymy (Sanchez and Riedel, 2017); thus the models may use this information to generalize to unseen hypernym pairs.
Polarity. We find very strong evidence for an association between polarity and class label predicted on sample $I_H$ for all models (CE: $\chi^2(10) = 168.40$, DAM: $\chi^2(10) = 182.76$, ESIM: $\chi^2(10) = 157.76$). However, for sample $I_{TH}$, only DAM keeps this strong correlation ($\chi^2(14) = 47.71$). In the case of CE, we find weak evidence in favour of this correlation on instances of $I_{TH}$ ($\chi^2(14) = 25.27$, p = 0.03). For ESIM we find no evidence of correlation ($\chi^2(14) = 22.72$, p = 0.06), thus we do not reject the null hypothesis. Polarity's influence can be observed in Subset 3 of $I_H$ (Table 1), where we observe a drop in accuracy for instances whose gold labels do not match the polarity of the word pairs, compared to the accuracy of the whole sample; this means that when the models have polarity as a cue, they improve performance. However, in the case of DAM, this factor seems to play a small role in its behavior, as seen when we compare accuracy on Subset 1 with that of the whole transformed sample. Insensitivity seems to have a bigger influence on the models when the transformed instances are closer to the training set: accuracy scores on Subset 1 from $I_{TH}$ are smaller than those on Subset 1 from $E_{TH}$.

Discussion and Conclusions
Although all three models achieve strong results on the original SNLI development set (CE: 0.782, DAM: 0.854, ESIM: 0.882), each model exhibits particular weaknesses on the transformed training instances. Notably, all perform poorly on $I_{TH}$ instances in which the gold label is changed, with ESIM and CE performing below the level of chance. Thus, on these instances, the models tend to predict the label of the original unaltered training instance, and inference in this case is similar to nearest-neighbour prediction.
On the other hand, much better performance is obtained for the DAM and ESIM models on $I_{TH}$ instances containing unseen word pairs, indicating these models have learned to infer hypernym/hyponym relations from information in the pre-trained word embeddings. In contrast, performance on the unseen word pairs in $I_{TA1}$ and $E_{TA}$ suggests that inferring antonymy from the embeddings is more difficult.
Weak performance is seen again on the $E_A$ and $E_{TA}$ instances where the polarity of the antonym pair is not consistent with the gold label. For these cases, the only difference between premise and hypothesis is the antonym pair, and the models tend to fall back on predicting the most frequent gold label seen for that word pair.
One result that remains anomalous is the overall performance of the ESIM model on the whole $E_{TH}$ sample. While this sample contains unseen word pairs and instances in which the gold label changes or is inconsistent with polarity, these effects do not by themselves explain the poor performance overall. Neither is this weakness explained by the ex situ structure, in which premise and hypothesis differ by only one word, as performance on the control ex situ sample, $E_H$, is much stronger. The effect, then, appears to be due to an interaction of the ex situ structure in combination with the transformation.
In the present work, we have limited ourselves to examining single influences independently. However, there are undoubtedly manifold interactions contributing to model performance. In fact, the complexities of these models (LSTMs, attention mechanisms and MLPs) are specifically intended to capture the interactions between the words in the premise and hypothesis. Further work is required to understand what these interactions are and how they contribute to performance. Fully uncovering these factors in current NLI datasets is a pre-requisite for the construction of more effective resources in the future.

Table 1: Accuracy scores of all models. Exp: experiment number. Whole sample: accuracy scores on the whole sample. Subset 1: subset of transformed instances that have a different gold label with respect to the control instances they were generated from. Subset 2: subset of transformed instances that contain word pairs unseen at training time. Subset 3: subset of control or transformed instances containing word pairs whose polarity does not match the instance's gold label.

Table 2: Contingency table for ESIM: Predictions on transformed instances with different gold labels from those of the control instances.

Table 3: Contingency table for DAM: Predictions distributed according to instances containing a seen or an unseen antonym word pair.

Table 4: Contingency table for DAM: Predictions distributed according to the polarity of target word pairs found in the transformed instances.

In this experiment, we use samples $E_A$ and $E_{TA}$. Swapping antonyms has little effect on the performance of all models, where the biggest drop comes from DAM (0.029 points). However, the CE model performs quite poorly on both samples (0.508 and 0.48 accuracy points on $E_A$ and $E_{TA}$); this drop in performance, with respect to the in situ condition, suggests that the repeated sentence context is too different from the structure of the training instances for the CE model to generalize effectively.

Table 1
Insensitivity. The drop in performance described above can be partially explained by insensitivity to changes in gold label, since around 93% of the instances in $E_{TH}$ changed gold label with respect to $E_H$. We find strong statistical evidence for this hypothesis (CE: $\chi^2(1) = 175.19$, DAM: $\chi^2(1) = 158.62$, ESIM: $\chi^2(1) = 252.27$).
H and $E_{TH}$. Compared to the in situ condition, DAM's performance improves, opposite to CE's and ESIM's behavior.