Hypothesis Only Baselines in Natural Language Inference

We propose a hypothesis only baseline for diagnosing Natural Language Inference (NLI). Especially when an NLI dataset assumes inference is occurring based purely on the relationship between a context and a hypothesis, it follows that assessing entailment relations while ignoring the provided context is a degenerate solution. Yet, through experiments on 10 distinct NLI datasets, we find that this approach, which we refer to as a hypothesis-only model, is able to significantly outperform a majority-class baseline across a number of NLI datasets. Our analysis suggests that statistical irregularities may allow a model to perform NLI in some datasets beyond what should be achievable without access to the context.


Introduction
Though datasets for the task of Natural Language Inference (NLI) may vary in just about every aspect (size, construction, genre, label classes), they generally share a common structure: each instance consists of two fragments of natural language text (a context, also known as a premise, and a hypothesis), and a label indicating the entailment relation between the two fragments (e.g., ENTAILMENT, NEUTRAL, CONTRADICTION). Computationally, the task of NLI is to predict an entailment relation label (output) given a premise-hypothesis pair (input), i.e., to determine whether the truth of the hypothesis follows from the truth of the premise (Dagan et al., 2006, 2013).
When these NLI datasets are constructed to facilitate the training and evaluation of natural language understanding (NLU) systems (Nangia et al., 2017), it is tempting to claim that systems achieving high accuracy on such datasets have successfully "understood" natural language or at least a logical relationship between a premise and hypothesis. While this paper does not attempt to prescribe the sufficient conditions of such a claim, we argue for an obvious necessary, or at least desired, condition: that interesting natural language inference should depend on both premise and hypothesis. In other words, a baseline system with access only to hypotheses (Figure 1b) can be said to perform NLI only in the sense that it is understanding language based on prior background knowledge. If this background knowledge is about the world, this may be justifiable as an aspect of natural language understanding, if not in keeping with the spirit of NLI. But if the "background knowledge" consists of learned statistical irregularities in the data, this may not be ideal. Here we explore the question: do NLI datasets contain statistical irregularities that allow hypothesis-only models to outperform the dataset-specific prior?
We present the results of a hypothesis-only baseline across ten NLI-style datasets and advocate for its inclusion in future dataset reports. We find that this baseline can perform above the majority-class prior across most of the ten examined datasets. We examine whether: (1) hypotheses contain statistical irregularities within each entailment class that are "giveaways" to a well-trained hypothesis-only model, (2) the way in which an NLI dataset is constructed is related to how prone it is to this particular weakness, and (3) the majority baselines might not be as indicative of "the difficulty of the task" (Bowman et al., 2015) as previously thought.
arXiv:1805.01042v1 [cs.CL] 2 May 2018
We are not the first to consider the inherent difficulty of NLI datasets. For example, MacCartney (2009) used a simple bag-of-words model to evaluate early iterations of Recognizing Textual Entailment (RTE) challenge sets. Concerns have been raised previously about the hypotheses in the Stanford Natural Language Inference (SNLI) dataset specifically, such as by Rudinger et al. (2017) and in unpublished work. Here, we survey a large number of existing NLI datasets under the lens of a hypothesis-only model. Concurrently, Tsuchiya (2018) and Gururangan et al. (2018) similarly trained an NLI classifier with access limited to hypotheses and discovered similar results on three of the ten datasets that we study.

Motivation
Our approach is inspired by recent studies that show how biases in an NLU dataset allow models to perform well on the task without understanding the meaning of the text. In the Story Cloze task (Mostafazadeh et al., 2016, 2017), a model is presented with a short four-sentence narrative and asked to complete it by choosing one of two suggested concluding sentences. While the task is presented as a new common-sense reasoning framework, Schwartz et al. (2017b) achieved state-of-the-art performance by ignoring the narrative and training a linear classifier with features related to the writing style of the two potential endings, rather than their content. It has also been shown that features focusing on sentence length, sentiment, and negation are sufficient for achieving high accuracy on this dataset (Schwartz et al., 2017a; Cai et al., 2017; Bugert et al., 2017).
NLI is often viewed as an integral part of NLU. Condoravdi et al. (2003) argue that it is a necessary metric for evaluating an NLU system, since it forces a model to perform many distinct types of reasoning. Goldberg (2017) suggests that "solving [NLI] perfectly entails human level understanding of language", and Nangia et al. (2017) argue that "in order for a system to perform well at natural language inference, it needs to handle nearly the full complexity of natural language understanding." However, if biases in NLI datasets, especially those that do not reflect commonsense knowledge, allow models to achieve high levels of performance without needing to reason about hypotheses based on corresponding contexts, our current datasets may fall short of these goals.

Methodology
We modify Conneau et al. (2017)'s InferSent method to train a neural model to classify just the hypotheses. We choose InferSent because it performed competitively with the best-scoring systems on the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015), while being representative of the types of neural architectures commonly used for NLI tasks. InferSent uses a BiLSTM encoder, and constructs a sentence representation by max-pooling over its hidden states. This sentence representation of a hypothesis is used as input to an MLP classifier to predict the NLI tag.
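The two ideas in this paragraph, discarding the premise and max-pooling over hidden states, can be illustrated with a minimal Python sketch. It is not the InferSent implementation: `hypothesis_only` and `max_pool` are hypothetical helper names, and the hidden states are made-up numbers standing in for real BiLSTM outputs.

```python
def hypothesis_only(instances):
    """Drop the premise from (premise, hypothesis, label) triples."""
    return [(hypothesis, label) for _premise, hypothesis, label in instances]


def max_pool(hidden_states):
    """Element-wise max over a sequence of equal-length hidden-state vectors."""
    if not hidden_states:
        raise ValueError("need at least one hidden state")
    return [max(values) for values in zip(*hidden_states)]


# Toy data (assumed for illustration): one NLI instance and made-up
# 4-dimensional hidden states for a 3-token hypothesis.
instances = [("A dog runs in a park", "An animal is outside", "ENTAILMENT")]
states = [
    [0.1, -0.5, 0.3, 0.0],
    [0.4, -0.2, -0.1, 0.2],
    [-0.3, 0.6, 0.2, 0.1],
]

print(hypothesis_only(instances))
print(max_pool(states))
```

In the model described above, the pooled vector would then be passed to the MLP classifier; that step is omitted here.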
We preprocess each recast dataset using the NLTK tokenizer (Loper and Bird, 2002). Following Conneau et al. (2017), we map the resulting tokens to 300-dimensional GloVe vectors (Pennington et al., 2014) trained on 840 billion tokens from the Common Crawl, using the GloVe OOV vector for unknown words. We optimize via SGD, with an initial learning rate of 0.1 and a decay rate of 0.99. We allow at most 20 epochs of training with optional early stopping according to the following policy: when the accuracy on the development set decreases, we divide the learning rate by 5, and we stop training when the learning rate falls below $10^{-5}$.
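The learning-rate policy can be sketched as follows. This is an illustrative reading of the description, not the training code itself: it assumes the 0.99 decay is applied once per epoch and that a "decrease" means dev accuracy lower than in the previous epoch. The function name `learning_rate_schedule` is hypothetical.

```python
def learning_rate_schedule(dev_accuracies, lr=0.1, decay=0.99, shrink=5.0,
                           min_lr=1e-5, max_epochs=20):
    """Return the learning rate used in each epoch that is actually run."""
    used = []
    previous_acc = None
    for acc in dev_accuracies[:max_epochs]:
        used.append(lr)
        lr *= decay  # assumption: decay is applied once per epoch
        if previous_acc is not None and acc < previous_acc:
            lr /= shrink  # dev accuracy dropped: divide the learning rate by 5
        previous_acc = acc
        if lr < min_lr:
            break  # early stopping once the learning rate is below the floor
    return used


# Made-up dev accuracies: one run that keeps improving, one that keeps dropping.
improving = learning_rate_schedule([0.50 + 0.01 * i for i in range(25)])
dropping = learning_rate_schedule([0.90 - 0.01 * i for i in range(20)])
print(len(improving), len(dropping))
```

Under these assumptions, a run whose dev accuracy never drops uses the full 20 epochs, while a run whose dev accuracy drops every epoch stops early.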

Datasets
We collect ten NLI datasets and categorize them into three distinct groups based on the methods by which they were constructed.

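| Creation Protocol | Dataset | Size | Class | Example Hypothesis |
|---|---|---|---|---|
| Recast | SPR | 150K | 2 | The judge was aware of the dismissing |
| Recast | FN+ | 150K | 2 | the irish are actually principling to come home |
| Judged | ADD-1 | 5K | 2 | A small child staring at a young horse and a pony |
| Judged | SCITAIL | 25K | 2 | Humans typically have 23 pairs of chromosomes |
| Judged | SICK | 10K | 3 | Pasta is being put into a dish by a woman |
| Judged | MPE | 10K | 3 | A man smoking a cigarette |
| Judged | JOCI | 30K | 3 | The flooring is a horizontal surface |
| Elicited | SNLI | 550K | 3 | An animal is jumping to catch an object |
| Elicited | MNLI | 425K | 3 | Kyoto has a kabuki troupe and so does Osaka |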

Human Elicited
In cases where humans were given a context and asked to generate a corresponding hypothesis and label, we consider these datasets to be elicited. Although we consider only two such datasets, they are the largest datasets included in our study and are currently popular amongst researchers. The elicited NLI datasets we look at are:

Stanford Natural Language Inference (SNLI)
To create SNLI, Bowman et al. (2015) showed crowdsourced workers a premise sentence (sourced from Flickr image captions), and asked them to generate a corresponding hypothesis sentence for each of the three labels (ENTAILMENT, NEUTRAL, CONTRADICTION). SNLI is known to contain stereotypical biases based on gender, race, and ethnicity (Rudinger et al., 2017). Furthermore, Zhang et al. (2017) commented that such "elicitation protocols can lead to biased responses unlikely to contain a wide range of possible common-sense inferences."

Multi-NLI
Multi-NLI is a recent expansion of SNLI that aims to add greater diversity to the existing dataset (Williams et al., 2017). Premises in Multi-NLI can originate from fictional stories, personal letters, telephone speech, and a 9/11 report.

Human Judged
Alternatively, if hypotheses and premises were automatically paired but labeled by a human, we consider the dataset to be judged. Our human-judged datasets are:

Sentences Involving Compositional Knowledge (SICK)
To evaluate how well compositional distributional semantic models handle "challenging phenomena", Marelli et al. (2014) introduced SICK, which used rules to expand or normalize existing premises to create more difficult examples. Workers were asked to label the relatedness of these resulting pairs, and these labels were then converted into the same three-way label space as SNLI and Multi-NLI.

Add-one RTE
This mixed-genre dataset tests whether NLI systems can understand adjective-noun compounds (Pavlick and Callison-Burch, 2016). Premise sentences were extracted from Annotated Gigaword (Napoles et al., 2012), image captions (Young et al., 2014), the Internet Argument Corpus (Walker et al., 2012), and fictional stories from the GutenTag dataset (Mac Kim and Cassidy, 2015). To create hypotheses, adjectives were removed or inserted before nouns in a premise, and crowd-sourced workers were asked to provide reliable labels (ENTAILED, NOT-ENTAILED).

SciTail
Recently released, SciTail is an NLI dataset created from 4th-grade science questions and multiple-choice answers (Khot et al., 2018). Hypotheses are automatically paired with premise sentences from domain-specific texts (Clark et al., 2016), and labeled (ENTAILMENT, NEUTRAL) by crowdsourced workers. Notably, the construction method allows for the same sentence to appear as a hypothesis for more than one premise.

Multiple Premise Entailment (MPE)
Unlike the other datasets we consider, the premises in MPE (Lai et al., 2017) are not single sentences, but four different captions that describe the same image in the FLICKR30K dataset (Plummer et al., 2015). Hypotheses were generated by simplifying either a fifth caption that describes the same image or a caption corresponding to a different image, and were given the standard 3-way tags. Each hypothesis has at most a 50% overlap with the words in its corresponding premise. Since the hypotheses are still just one sentence, our hypothesis-only baseline can easily be applied to MPE.

JOCI
Hypotheses were created automatically by systems trained to generate entailed facts from a premise. Crowd-sourced workers labeled the likelihood of the hypothesis following from the premise on an ordinal scale. We convert these into 3-way NLI tags, where 1 maps to CONTRADICTION, 2-4 map to NEUTRAL, and 5 maps to ENTAILMENT. Converting the annotations into a 3-way classification problem allows us to limit the range of the number of NLI label classes in our investigation.
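The ordinal-to-label conversion described above is small enough to write out directly. The sketch below assumes integer scores from 1 to 5; the function name `ordinal_to_nli` is hypothetical.

```python
def ordinal_to_nli(score):
    """Map an ordinal likelihood score (1-5) to a 3-way NLI tag."""
    if score == 1:
        return "CONTRADICTION"
    if 2 <= score <= 4:
        return "NEUTRAL"
    if score == 5:
        return "ENTAILMENT"
    raise ValueError(f"expected an integer score from 1 to 5, got {score!r}")


print([ordinal_to_nli(s) for s in range(1, 6)])
```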

Automatically Recast
If an NLI dataset was automatically generated from existing datasets for other NLP tasks, and sentence pairs were constructed and labeled with minimal human intervention, we refer to such a dataset as recast.

Definite Pronoun Resolution (DPR)
The DPR dataset targets an NLI model's ability to perform anaphora resolution (Rahman and Ng, 2012). In the original dataset, sentences contain two entities and one pronoun, and the task is to link the pronoun to its referent. In the recast version, the premises are the original sentences and the hypotheses are the same sentences with the pronoun replaced with its correct (ENTAILED) and incorrect (NOT-ENTAILED) referent. For example, "People raise dogs because they are obedient" and "People raise dogs because dogs are obedient" is such a context-hypothesis pair. We note that this mechanism would appear to maximally benefit a hypothesis-only approach, as the hypothesis semantically subsumes the context.
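The recasting step can be illustrated with a short sketch. This is a simplified illustration, not the recasting code used to build the dataset: it assumes whitespace tokenization and replaces only the first occurrence of the pronoun. The function name `recast_dpr` and the incorrect referent "people" are hypothetical.

```python
def recast_dpr(sentence, pronoun, correct, incorrect):
    """Build ENTAILED and NOT-ENTAILED pairs by substituting the pronoun."""
    tokens = sentence.split()
    if pronoun not in tokens:
        raise ValueError(f"pronoun {pronoun!r} not found in sentence")
    index = tokens.index(pronoun)  # simplification: first occurrence only

    def substitute(referent):
        return " ".join(tokens[:index] + [referent] + tokens[index + 1:])

    return [
        (sentence, substitute(correct), "ENTAILED"),
        (sentence, substitute(incorrect), "NOT-ENTAILED"),
    ]


pairs = recast_dpr(
    "People raise dogs because they are obedient",
    pronoun="they",
    correct="dogs",
    incorrect="people",  # assumed incorrect referent, for illustration only
)
for premise, hypothesis, label in pairs:
    print(label, "|", hypothesis)
```

Note that each hypothesis repeats the whole premise apart from one token, which is the property discussed in the paragraph above.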

FrameNet Plus (FN+)
Using paraphrases from PPDB (Ganitkevitch et al., 2013)

(1) a. That is the way the system works
    b. That is the way the framework works
    c. That is the road the system works
    d. That is the way the system creations

Results
Our goal is to determine whether a hypothesis-only model outperforms the majority baseline and investigate what may cause significant gains.

Criticism of the Majority Baseline
Across six of the ten datasets, our hypothesis-only model significantly outperforms the majority baseline, even outperforming the best reported results on one dataset, recast SPR. This indicates that there exists a significant degree of exploitable signal that may help NLI models perform well on their corresponding test set without considering NLI contexts. From Table 2, it is unclear whether the construction method is responsible for these improvements. The largest relative gains are on the human-elicited datasets, where the hypothesis-only model more than doubles the majority baseline. However, there are no obvious unifying trends across these datasets: among the judged and recast datasets, where humans do not generate the NLI hypothesis, we observe lower performance margins between majority and hypothesis-only models compared to the elicited datasets. However, the baseline performances on these datasets are noticeably higher than on SNLI and Multi-NLI.
The drop between SNLI and Multi-NLI suggests that by including multiple genres, an NLI dataset may contain fewer biases. However, adding genres might not be enough to mitigate biases, as the hypothesis-only model still drastically outperforms the majority baseline. Therefore, we believe that models tested on SNLI and Multi-NLI should include a baseline version of the model that only accesses hypotheses.
We do not observe general trends across the datasets based on their construction methodology. On three of the five human-judged datasets, the hypothesis-only model defaults to labeling each instance with the majority class tag. We find the same behavior in one recast dataset (DPR). However, across both these categories we find smaller relative improvements than on SNLI and Multi-NLI. These results suggest the existence of exploitable signal in the datasets that is unrelated to NLI contexts. Our focus now shifts to identifying precisely what these signals might be and understanding why they may appear in NLI hypotheses.

Statistical Irregularities
We are interested in determining what characteristics in the datasets may be responsible for the hypothesis-only model often outperforming the majority baseline. Here, we investigate the importance of specific words, grammaticality, and lexical semantics.

Can Labels be Inferred from Single Words?
Since words in hypotheses have a distribution over the class of labels, we can determine the conditional probability of a label l given the word w by

$$p(l \mid w) = \frac{\mathrm{count}(w, l)}{\mathrm{count}(w)},$$

where count(w, l) is the number of occurrences of w in hypotheses with label l and count(w) is the total number of occurrences of w. If p(l|w) is highly skewed across labels, there exists the potential for a predictive bias. Consequently, such words may be "give-aways" that allow the hypothesis model to correctly predict an NLI label without considering the context. If a single occurrence of a highly label-specific word would allow a sentence to be deterministically classified, how many sentences in a dataset are prone to being trivially labeled? The plots in Figure 2 answer this question for SNLI and DPR. The Y-value at X = 1.0 captures the number of such sentences. Other values of X < 1.0 can also have strong correlative effects, but a priori the relationship between the value of X and the coverage of trivially answerable instances in the data is unclear. We illustrate this relationship for varying values of p(l|w). When X = 0, all words are considered highly correlated with a specific class label, and thus the entire dataset would be treated as trivially answerable.
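As a rough illustration of this analysis, the sketch below estimates p(l|w) from token counts and then counts, per label, the hypotheses that contain at least one word whose p(l|w) reaches a threshold x for some label. It assumes pre-tokenized hypotheses and uses a made-up four-sentence toy dataset; the function names `label_given_word` and `giveaway_coverage` are hypothetical.

```python
from collections import Counter


def label_given_word(examples):
    """Estimate p(l|w) = count(w, l) / count(w) from (tokens, label) pairs."""
    word_counts = Counter()
    joint_counts = Counter()
    for tokens, label in examples:
        for word in tokens:
            word_counts[word] += 1
            joint_counts[(word, label)] += 1
    probs = {}
    for (word, label), n in joint_counts.items():
        probs.setdefault(word, {})[label] = n / word_counts[word]
    return probs


def giveaway_coverage(examples, probs, x):
    """Per label, count hypotheses with a word w where p(l|w) >= x for some l."""
    coverage = {label: 0 for _tokens, label in examples}
    for tokens, label in examples:
        if any(max(probs[word].values()) >= x for word in tokens):
            coverage[label] += 1
    return coverage


# Made-up toy hypotheses, for illustration only.
examples = [
    (["nobody", "is", "sleeping"], "CONTRADICTION"),
    (["a", "man", "is", "sleeping"], "CONTRADICTION"),
    (["a", "man", "is", "outside"], "ENTAILMENT"),
    (["a", "tall", "man", "is", "outside"], "NEUTRAL"),
]
probs = label_given_word(examples)
print(giveaway_coverage(examples, probs, x=1.0))
print(giveaway_coverage(examples, probs, x=0.0))
```

With a threshold of 0 every hypothesis is covered, matching the observation that at X = 0 the entire dataset would be treated as trivially answerable.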
DPR has two class labels, so the uncertainty of a label is highest when p(l|w) = 0.5. The sharp drop as X deviates from this value indicates a weaker effect: proportionally fewer sentences contain highly label-specific words than in SNLI. As SNLI uses 3-way classification, we see a gradual decline from 0.33.

What are "Give-away" Words?
Now that we have analyzed the extent to which highly label-correlated words may exist across sentences in a given label, we would like to understand what these words are and why they exist.
Figure 3 reports some of the words with the highest p(l|w) for SNLI, a human-elicited dataset, and MPE, a human-judged dataset on which our hypothesis model performed identically to the majority baseline. Because many of the most discriminative words are low frequency, we report only words which occur at least five times. We rank the words according to their overall frequency, since this statistic is perhaps more indicative of a word w's effect on overall performance compared to p(l|w) alone.
The scores p(l|w) of the words shown for SNLI deviate strongly, regardless of the label. In contrast, in MPE, scores are much closer to a uniform distribution of p(l|w) across labels. Intuitively, the stronger the word's deviation, the stronger the potential for it to be a "give-away" word. A high word frequency indicates a greater potential of the word to affect the overall accuracy on NLI.

Qualitative Examples
Turning our attention to the qualities of the words themselves, we can easily identify trends among the words used in contradictory hypotheses in SNLI. In our top-10 list, for example, three words refer to the act of sleeping. Upon inspecting corresponding context sentences, we find that many contexts, which are sourced from Flickr, naturally deal with activities. This leads us to believe that, as a common strategy, crowd-source workers often do not generate contradictory hypotheses that require fine-grained semantic reasoning, as a majority of such activities can be easily negated by removing an agent's agency, i.e., by describing the agent as sleeping. A second trend we notice is that universal negation constitutes four of the remaining seven terms in this list ("Nobody", "alone", "no", and "empty"), and may also be used to similar effect. The human-elicited protocol neither guides nor incentivizes crowd-source workers to come up with less obvious examples. If not properly controlled, elicited datasets may be prone to many label-specific terms. The existence of label-specific terms in human-elicited NLI datasets does not invalidate the datasets, nor is it surprising. Studies in eliciting norming data are prone to repeated responses across subjects (McRae et al., 2005) (see discussion in §2 of Zhang et al., 2017).

On the Role of Grammaticality
Like MPE, FN+ contains few high-frequency words with high p(l|w). However, unlike on MPE, our hypothesis-only model outperforms the majority baseline. If these gains do not arise from "give-away" words, then what is the statistical irregularity responsible for this discriminative power? Upon further inspection, we notice an interesting imbalance in how our model performs for each of the two classes. The hypothesis-only model performs similarly to the majority baseline for entailed examples, while improving by over 34% on those which are not entailed, as shown in Table 3. As shown by White et al. (2017) and noticed by Poliak et al. (2018), FN+ contains more grammatical errors than the other recast datasets. We explore whether grammaticality could be the statistical irregularity exploited in this case. We manually sample a total of 200 FN+ sentences and categorize them based on their gold label and our model's prediction. Out of 50 sentences that the model correctly labeled as ENTAILED, 88% were grammatical. On the other hand, of the 50 hypotheses incorrectly labeled as ENTAILED, only 38% were grammatical. Similarly, of the 50 hypotheses that the model correctly labeled as NOT-ENTAILED, only 20% were grammatical, compared with 68% of the 50 incorrectly labeled as NOT-ENTAILED. This suggests that a hypothesis-only model may be able to discover the correlation between grammaticality and NLI labels on this dataset.

Lexical Semantics
A survey of gains (Table 4) in the SPR dataset suggests that a number of its property-driven hypotheses, such as "X was sentient in [the event]", can be accurately guessed based on lexical semantics (background knowledge learned from training) of the argument. For example, the hypothesis-only baseline correctly predicts the truth of hypotheses in the dev set such as "Experts were sentient ..." or "Mr. Falls was sentient ...", and the falsity of "The campaign was sentient", while failing on referring expressions like "Some" or "Each side". A model exploiting regularities of the real world would seem to be a different category of dataset bias: while not strictly wrong from the perspective of NLU, one should be aware of what the hypothesis-only baseline is capable of, to recognize those cases where access to the context is required and therefore more interesting under NLI.

Open Questions
There may remain statistical irregularities, which we leave for future work to explore. For example, is there a correlation between sentence length and label class in these datasets? Is there a particular construction method that minimizes the number of "give-away" words present in the dataset? And lastly, our study is another in a line of research which looks for irregularities at

World Knowledge and NLI
As mentioned earlier, hypothesis-only models that perform without exploiting statistical irregularities may be performing NLI only in the sense that they are understanding language based on prior background knowledge. Here, we take the approach that interesting NLI should depend on both premise and hypothesis. Prior work in NLI reflects this approach. For example, Glickman and Dagan (2005) argue that "the notion of textual entailment is relevant only" for hypotheses that are not world facts, e.g., "Paris is the capital of France." Glickman et al. (2005a,b) introduce a probabilistic framework for NLI where the premise entails a hypothesis if, and only if, the probability of the hypothesis being true increases as a result of the premise.

NLI's resurgence
Starting in the mid-2000s, multiple community-wide shared tasks focused on NLI, then commonly referred to as RTE, i.e., recognizing textual entailment. Starting with Dagan et al. (2006), there have been eight iterations of the PASCAL RTE challenge, with the most recent being Dzikovska et al. (2013). NLI datasets were relatively small, ranging from thousands to tens of thousands of labeled sentence pairs. In turn, NLI models often used alignment-based techniques (MacCartney et al., 2008)

Conclusion
We introduced a stronger baseline for ten NLI datasets. Our baseline reduces the task from labeling the relationship between two sentences to classifying a single hypothesis sentence. Our experiments demonstrated that in six of the ten datasets, always predicting the majority-class label is not a strong baseline, as it is significantly outperformed by the hypothesis-only model. Our analysis suggests that statistical irregularities, including word choice and grammaticality, may reduce the difficulty of the task on popular NLI datasets by not fully testing how well a model can determine whether the truth of a hypothesis follows from the truth of a corresponding premise.
We hope our findings will encourage the development of new NLI datasets that exhibit fewer exploitable irregularities, and that encourage the development of richer models of inference. As a baseline, new NLI models should be compared against a corresponding version that only accesses hypotheses. In future work, we plan to apply a similar hypothesis-only baseline to multi-modal tasks that attempt to challenge a system to understand and classify the relationship between two inputs, e.g., Visual QA (Antol et al., 2015).

Figure 1: (1a) shows a typical NLI model that encodes the premise and hypothesis sentences into a vector space to classify the sentence pair. (1b) shows our hypothesis-only baseline method that ignores the premise and only encodes the hypothesis sentence.
Rastogi and Van Durme (2014) automatically replaced words with their paraphrases. Subsequently, Pavlick et al. (2015) asked crowd-source workers to judge how well a sentence with a paraphrase preserved the original sentence's meaning. In this NLI dataset, which targets a model's ability to perform paraphrastic inference, premise sentences are the original sentences, the hypotheses are the edited versions, and the crowd-source judgments are converted to 2-way NLI labels. For not-entailed examples, White et al. (2017) replaced a single token in a context sentence with a word that crowd-source workers labeled as not being a paraphrase of the token in the given context. In turn, we might suppose that positive entailments (1-b) are in keeping with the spirit of NLI, but not-entailed examples might not be, because there are adequacy (1-c) and fluency (1-d) issues.

Figure 2: Plots showing the number of sentences per label (Y-axis) that contain at least one word w such that p(l|w) >= x for at least one label l. Colors indicate different labels. Intuitively, for a sliding definition of what value of p(l|w) might constitute a "give-away", the Y-axis shows the proportion of sentences that can be trivially answered for each class.

Figure 3: Lists of the most highly-correlated words in each dataset for given labels, thresholded to the top 10 and ranked according to frequency.

Table 1: Basic statistics about the NLI datasets we consider. 'Size' refers to the total number of labeled premise-hypothesis pairs in each dataset (for datasets with > 100K examples, numbers are rounded down to the nearest 25K). The 'Creation Protocol' column indicates how the dataset was created. The 'Class' column reports the number of class labels/tags. The last column shows an example hypothesis from each dataset.
We use the recast datasets from White et al. (2017):

Table 2: NLI accuracies on each dataset. Columns 'Hyp-Only' and 'MAJ' indicate the accuracy of the hypothesis-only model and the majority baseline. |∆| and ∆% indicate the absolute difference in percentage points and the percentage increase between Hyp-Only and MAJ. Blue numbers indicate that the hypothesis-only model outperforms MAJ. In the right-most section, 'Baseline' indicates the original baseline on the test set when the dataset was released and 'SOTA' indicates current state-of-the-art results. MNLI-1 is the matched version and MNLI-2 is the mismatched version of MNLI. The names of datasets are italicized if they contain ≤ 10K labeled examples.
In such cases, a hypothesis-only model should be used as a stronger baseline instead of the majority class baseline. For all experiments except for JOCI, we use each NLI dataset's standard train, dev, and test splits. Table 2 compares the hypothesis-only model's accuracy with the majority baseline on each dataset's dev and test set.
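The comparison reported in Table 2 can be sketched as follows. This is an illustrative sketch with made-up labels and accuracies, not the reported results. It assumes the majority label is chosen from the training split, and the function names `majority_baseline_accuracy` and `deltas` are hypothetical.

```python
from collections import Counter


def majority_baseline_accuracy(train_labels, eval_labels):
    """Accuracy (in %) of always predicting the most frequent training label."""
    majority_label = Counter(train_labels).most_common(1)[0][0]
    correct = sum(label == majority_label for label in eval_labels)
    return 100.0 * correct / len(eval_labels)


def deltas(hyp_only_acc, maj_acc):
    """Return (|delta| in percentage points, percentage increase over MAJ)."""
    absolute = abs(hyp_only_acc - maj_acc)
    relative = 100.0 * (hyp_only_acc - maj_acc) / maj_acc
    return absolute, relative


# Made-up labels and a made-up hypothesis-only accuracy, for illustration only.
train = ["NEUTRAL", "NEUTRAL", "ENTAILMENT", "CONTRADICTION"]
test = ["NEUTRAL", "ENTAILMENT", "CONTRADICTION", "NEUTRAL", "ENTAILMENT"]
maj = majority_baseline_accuracy(train, test)
print(maj, deltas(60.0, maj))
```

A percentage increase above 100 corresponds to the hypothesis-only model more than doubling the majority baseline.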

Table 3 .
As shown by White et al. (2017) and noticed

Table 3: Accuracies on FN+ for each class label.
or manually engineered features (Androutsopoulos and Malakasiotis, 2010). Bowman et al. (2015) sparked a renewed interest in NLI, particularly among deep-learning researchers. By developing and releasing a large NLI dataset containing over 550K examples, Bowman et al. (2015) enabled the community to successfully apply deep learning models to the NLI problem.