Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation

We present a large-scale collection of diverse natural language inference (NLI) datasets that help provide insight into how well a sentence representation encoded by a neural network captures distinct types of reasoning. The collection results from recasting 13 existing datasets covering 7 semantic phenomena into a common NLI structure, yielding over half a million labeled context-hypothesis pairs in total. Our collection of diverse datasets is available at http://www.decomp.net/, and will grow over time as additional resources are recast and added from novel sources.


Introduction
A plethora of new natural language inference (NLI)1 datasets has been created in recent years (Bowman et al., 2015; Williams et al., 2017; Lai et al., 2017; Khot et al., 2018). However, these datasets do not provide clear insight into what type of reasoning or inference a model may be performing. For example, these datasets cannot be used to evaluate whether competitive NLI models can determine if an event occurred, correctly differentiate between figurative and literal language, or accurately identify and categorize named entities. Consequently, these datasets cannot answer how well sentence representation learning models capture distinct semantic phenomena necessary for general natural language understanding (NLU).
To answer these questions, we introduce the Diverse NLI Collection (DNC), a large-scale NLI dataset that tests a model's ability to perform diverse types of reasoning. The DNC is a collection of NLI problems, each requiring a model to perform a unique type of reasoning. Each NLI dataset contains labeled context-hypothesis pairs that we recast from semantic annotations for specific structured prediction tasks.

1 The task of determining if a hypothesis would likely be inferred from a context, or premise; also known as Recognizing Textual Entailment (RTE) (Dagan et al., 2006, 2013).
We extend various prior works on challenge NLI datasets (Zhang et al., 2017), and define recasting as leveraging existing datasets to create NLI examples (Glickman, 2006; White et al., 2017). We recast annotations from a total of 13 datasets across 7 NLP tasks into labeled NLI examples. The tasks include event factuality, named entity recognition, gendered anaphora resolution, sentiment analysis, relation extraction, pun detection, and lexicosyntactic inference. Currently, the DNC contains over half a million labeled examples. Table 1 includes NLI pairs that test specific types of reasoning.
Using a hypothesis-only NLI model, with access to just hypothesis sentences, as a strong baseline (Tsuchiya, 2018; Gururangan et al., 2018; Poliak et al., 2018b), our experiments demonstrate how the DNC can be used to probe a model's ability to capture different types of semantic reasoning necessary for general NLU. In short, this work answers a recent plea to the community to test "more kinds of inference" than in previous challenge sets (Chatzikyriakidis et al., 2017).

Motivation & Background
Compared to eliciting NLI datasets directly, i.e. asking humans to author contexts and/or hypothesis sentences, recasting 1) helps determine whether an NLU model performs distinct types of reasoning; 2) limits the types of biases observed in previous NLI data; and 3) generates examples cheaply, potentially at large scales.
The Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015) and its successor Multi-NLI (Williams et al., 2017) were created by eliciting hypotheses from humans. Crowdsourced workers were tasked with writing one sentence each that is entailed by, neutral with respect to, and contradicted by a caption extracted from the Flickr30k corpus (Young et al., 2014).
Although these datasets are widely used to train and evaluate sentence representations, high accuracy on them is not indicative of what types of reasoning NLI models perform. Workers were free to create any type of hypothesis for each context and label. Such datasets cannot be used to determine how well an NLI model captures many desired capabilities of language understanding systems, e.g. paraphrastic inference, complex anaphora resolution (White et al., 2017), or compositionality (Pavlick and Callison-Burch, 2016; Dasgupta et al., 2018). By converting prior annotations of a specific phenomenon into NLI examples, recasting allows us to create a diverse NLI benchmark that tests a model's ability to perform distinct types of reasoning.
Limit Biases Studies indicate that many NLI datasets contain significant biases. Examples in the early PASCAL RTE datasets could be correctly predicted based on syntax alone (Vanderwende and Dolan, 2006; Vanderwende et al., 2006). Statistical irregularities and annotation artifacts within class labels allow a hypothesis-only model to significantly outperform the majority baseline on at least six recent NLI datasets (Poliak et al., 2018b). Class label biases may be attributed to the human-elicited protocol. Moreover, examples in such NLI datasets may contain racial and gendered stereotypes (Rudinger et al., 2017).
We limit some biases by not relying on humans to generate hypotheses. Recast NLI datasets may still contain some biases, e.g. non-uniform distributions over NLI labels caused by the distribution of labels in the original dataset that we recast.2 Experimental results using Poliak et al. (2018b)'s hypothesis-only model indicate to what degree the recast datasets retain some biases that may be present in the original semantic datasets.
NLI Examples at Large-scale Generating NLI datasets from scratch is costly: humans must be paid to generate or label natural language text, and these costs scale linearly with the number of generated NLI pairs. Existing annotations for a wide array of semantic NLP tasks are freely available. By leveraging existing semantic annotations already invested in by the community, we can generate and label NLI pairs at little cost and create large NLI datasets to train data-hungry models.
Why These Semantic Phenomena? A long-term goal is to develop NLU systems that can achieve human levels of understanding and reasoning. Investigating how different architectures and training corpora can help a system perform human-level general NLU is an important step in this direction. The DNC contains recast NLI pairs that are easily understandable by humans and can be used to evaluate different sentence encoders and NLU systems. These semantic phenomena cover distinct types of reasoning that an NLU system may often encounter in the wild. While higher performance on these benchmarks might not be conclusive proof of a system achieving human-level reasoning, a system that does poorly should not be viewed as performing human-level NLU. We argue that these semantic phenomena play integral roles in NLU. More semantic phenomena integral to NLU exist (Allen, 1995), and we plan to include them in future versions of the DNC.
Previous Recast NLI Example sentences in RTE1 (Dagan et al., 2006) were extracted from MT, IE, and QA datasets; the process was referred to as 'recasting' in the thesis by Glickman (2006). NLU problems were reframed under the NLI framework, and candidate sentence pairs were extracted from existing NLP datasets and then labeled under NLI (Dagan et al., 2006). Years later, this term was independently used by White et al. (2017), who proposed to "leverage existing large-scale semantic annotation collections as a source of targeted textual inference examples." In that work, 'recasting' was limited to automatically converting existing semantic annotations into labeled NLI examples without manual intervention. We adopt the broader definition of 'recasting' since our NLI examples were automatically or manually generated from prior NLU datasets.

Applied Framework versus Inference Probing
Traditionally, NLI has not been viewed as a downstream, applied NLP task.3 Instead, the community has often used it as "a generic evaluation framework" to compare models for distinct downstream tasks (Dagan et al., 2006) or to determine whether a model performs distinct types of reasoning (Cooper et al., 1996). These two different evaluation goals may affect which datasets are recast. We target both goals as we recast applied tasks and linguistically focused phenomena.

Recasting Semantic Phenomena
We describe efforts to recast 7 semantic phenomena from a total of 13 datasets into labeled NLI examples. Many of the recasting methods rely on simple templates that do not include the nuances and variation typical of natural language. This allows us to specifically test how well sentence representations capture distinct types of reasoning. When recasting, we preserve each dataset's train/dev/test split. If a dataset does not contain such a split, we create a random split with roughly an 80:10:10 ratio. Table 2 reports statistics about each recast dataset.
Event Factuality (EF) Event factuality prediction is the task of determining whether an event described in text occurred. Determining whether an event occurred enables accurate inferences, e.g. monotonic inferences, based on the event (Rudinger et al., 2018b).4 Incorporating factuality has been shown to improve NLI (Sauri and Pustejovsky, 2007).
We recast event factuality annotations from UW (Lee et al., 2015), MEANTIME (Minard et al., 2016), and Decomp (Rudinger et al., 2018b). We use sentences from the original datasets as contexts and templates (1a) and (1b) as hypotheses.5

(1) a. The Event happened
    b. The Event did not happen

If the predicate denoting the Event was annotated as having happened in the factuality dataset, the context paired with (1a) is labeled as ENTAILED and the same context paired with (1b) is labeled as NOT-ENTAILED. Otherwise, we swap the labels.
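To make the recasting step concrete, the sketch below shows one way the template filling could be implemented; the field names (sentence, predicate, happened) are hypothetical and do not reflect the actual schemas of the UW, MEANTIME, or Decomp releases.

```python
# Minimal sketch of event-factuality recasting, assuming each annotation provides
# the source sentence, the predicate denoting the Event, and a binary judgment of
# whether the Event happened. Field names are illustrative.

def recast_factuality(example):
    """Turn one factuality annotation into two labeled NLI pairs."""
    context = example["sentence"]
    event = example["predicate"]             # e.g. "swatting"
    happened = example["happened"]           # boolean factuality judgment

    pos_hyp = f"The {event} happened"        # template (1a)
    neg_hyp = f"The {event} did not happen"  # template (1b)

    pos_label = "ENTAILED" if happened else "NOT-ENTAILED"
    neg_label = "NOT-ENTAILED" if happened else "ENTAILED"
    return [(context, pos_hyp, pos_label), (context, neg_hyp, neg_label)]
```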
Named Entity Recognition (NER) Distinct types of entities have different properties and relational objects (Prince, 1978) that can help infer facts from a given context. For example, if a system can detect that an entity is the name of a nation, then that entity likely has a leader, a language, and a culture (Prince, 1978; Van Durme, 2010). When classifying NLI pairs, a model can determine whether an object mentioned in the hypothesis can be a relational object typically associated with the type of entity described in the context. NER tags can also be directly used to determine whether a hypothesis is likely not to be entailed by a context, such as when entities in the context and hypothesis do not share NER tags (Castillo and Alemany, 2008; Sammons et al., 2009; Pakray et al., 2010). Given a sentence annotated with NER tags, we recast the annotations by preserving the original sentences as contexts and creating hypotheses using the template "NP is a Label."6 For ENTAILED hypotheses, we replace Label with the correct NER label of the NP; for NOT-ENTAILED hypotheses, we choose an incorrect label from the prior distribution of NER tags for the given phrase. This prevents us from adding additional biases beyond any class-label statistical irregularities present in the original data. We apply this procedure to the Groningen Meaning Bank (Bos et al., 2017) and the CoNLL-2003 shared task (Tjong Kim Sang and De Meulder, 2003).
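A rough sketch of this procedure is given below; the corpus format, the ner_label_priors helper, and the fallback to a uniform choice over other labels are assumptions for illustration, not the exact pipeline used to build the recast data.

```python
import random
from collections import Counter

def ner_label_priors(tagged_corpus):
    """tagged_corpus: iterable of (sentence, [(np, ner_label), ...]) pairs."""
    priors = {}
    for _, spans in tagged_corpus:
        for np, label in spans:
            priors.setdefault(np, Counter())[label] += 1
    return priors

def recast_ner(sentence, np, gold_label, priors, all_labels):
    """Create one ENTAILED and one NOT-ENTAILED pair from a tagged NP."""
    entailed = (sentence, f"{np} is a {gold_label}", "ENTAILED")
    # Sample an incorrect label in proportion to the NP's prior label counts,
    # falling back to a uniform choice over the remaining labels.
    counts = priors.get(np, Counter())
    wrong = [l for l in counts if l != gold_label] or \
            [l for l in all_labels if l != gold_label]
    weights = [counts[l] if counts[l] else 1 for l in wrong]
    bad_label = random.choices(wrong, weights=weights, k=1)[0]
    not_entailed = (sentence, f"{np} is a {bad_label}", "NOT-ENTAILED")
    return [entailed, not_entailed]
```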

Gendered Anaphora Resolution (GAR) The ability to perform pronoun resolution is essential to language understanding, in many cases requiring common-sense reasoning about the world (Levesque et al., 2012). White et al. (2017) show that this task can be directly recast as an NLI problem by transforming Winograd schemas into NLI sentence pairs. Using a similar formula, Rudinger et al. (2018a) introduce Winogender schemas, minimal sentence pairs that differ only by pronoun gender. With this adapted pronoun resolution task, they demonstrate the presence of systematic gender bias in coreference resolution systems. We recast Winogender schemas as an NLI task, introducing a potential method of detecting gender bias in NLI systems or sentence embeddings. In recasting, the context is the original, unmodified Winogender sentence; the hypothesis is a short, manually constructed sentence with a correct (ENTAILED) or incorrect (NOT-ENTAILED) pronoun resolution.

Table 2: Statistics summarizing the recast datasets. The first column refers to the original annotation that was recast; the 'Combined' row refers to the combination of our recast datasets. The second column indicates the datasets that were recast, and the third column reports how many labeled NLI pairs were extracted from the corresponding dataset. The last column indicates whether the recasting method was fully automatic without human involvement (✓), manual (✗), or semi-automatic with human intervention (✓✗). The Multi-NLI and SNLI numbers contextualize the scale of our dataset.
Lexicosyntactic Inference (Lex) While many inferences in natural language are triggered by lexical items alone, there exist pervasive inferences that arise from interactions between lexical items and their syntactic contexts. This is particularly apparent among propositional attitude verbs, e.g. think, want, know, which display complex distributional profiles (White and Rawlins, 2016). For instance, the verb remember can take both finite clausal complements and infinitival clausal complements.
(2) a. Jo didn't remember that she ate
    b. Jo didn't remember to eat

This small change in the syntactic structure gives rise to large changes in the inferences that are licensed: (2a) presupposes that Jo ate, while (2b) entails that Jo didn't eat. We recast data from three datasets that are relevant to these sorts of lexicosyntactic interactions.
Lex #1: MegaVeridicality (MV) White and Rawlins (2018) build the MegaVeridicality dataset by selecting verbs from the MegaAttitude dataset (White and Rawlins, 2016) based on their grammatical acceptability in the [NP _ that S] and [NP was _ed that S] frames.7 They then asked annotators to answer questions of the form in (3) using three possible responses: yes, maybe or maybe not, and no (Karttunen et al., 2014).
(3) a. Someone {knew, didn't know} that a particular thing happened.
    b. Did that thing happen?
We use the same procedure to annotate sentences containing verbs that take various types of infinitival complements: [NP _ for NP to VP], [NP _ to VP], [NP _ NP to VP], and [NP was _ed to VP].8 To recast these annotations, we assign context sentences like (3a) the majority class label (yes, maybe or maybe not, or no) across 10 different annotators, after applying an ordinal model-based normalization to their responses. We then pair each context sentence with three hypotheses.
(4) a. That thing happened
    b. That thing may or may not have happened
    c. That thing didn't happen

If annotated yes, maybe or maybe not, or no, the pair (3a)-(4a), (3a)-(4b), or (3a)-(4c) is respectively assigned ENTAILED, and the other pairings are assigned NOT-ENTAILED; a train/dev/test split label is randomly assigned to each context sentence and applied to every pair that the context sentence appears in.
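The labeling logic can be sketched as follows; we omit the ordinal model-based normalization and simply take a majority vote over raw responses, so this is an approximation of the actual procedure.

```python
from collections import Counter

# Hypotheses (4a)-(4c) keyed by the annotator response they correspond to.
HYPOTHESES = {
    "yes": "That thing happened",
    "maybe": "That thing may or may not have happened",
    "no": "That thing didn't happen",
}

def recast_veridicality(context, responses):
    """context: a sentence like (3a); responses: 10 'yes'/'maybe'/'no' judgments."""
    majority = Counter(responses).most_common(1)[0][0]
    pairs = []
    for answer, hypothesis in HYPOTHESES.items():
        label = "ENTAILED" if answer == majority else "NOT-ENTAILED"
        pairs.append((context, hypothesis, label))
    return pairs
```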
Lex #2: Recasting VerbNet (VN) We create additional lexicosyntactic NLI examples from VerbNet (Schuler, 2005). VerbNet contains classes of verbs, each of which can have multiple frames. Each frame contains a mapping from syntactic arguments to thematic roles, which are used as arguments in Neo-Davidsonian first-order logical predicates (5b) that describe the frame's semantics. Each frame additionally contains an example sentence (5a), which we use as our NLI context, and we create templates (5c) from the most frequent semantic predicates to generate hypotheses (5d).
(5) a. Michael swatted the fly
    b. cause(E, Agent)
    c. Agent caused the E
    d. Michael caused the swatting

We use the Berkeley Parser (Petrov et al., 2006) to match tokens in an example sentence with the thematic roles and then fill in the templates with the matched tokens (5d). We also decompose multi-argument predicates into unary predicates to increase the number of hypotheses we generate. On average, each context is paired with 4.5 hypotheses. We generate NOT-ENTAILED hypotheses by filling in templates with incorrect thematic roles.9 We partition the recast NLI examples into train/development/test splits such that all example sentences from a VerbNet class (which we use as NLI contexts) appear in only one partition of our dataset. In turn, the recast VerbNet dataset's partition is not exactly 80:10:10.
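The sketch below illustrates the template-filling step under the simplifying assumption that the parser-based alignment from thematic roles to surface tokens has already been computed; the role names, templates, and wrong-filler dictionary are illustrative.

```python
def fill_verbnet_templates(context, role_to_token, templates, wrong_fillers):
    """
    context: VerbNet example sentence, e.g. "Michael swatted the fly"
    role_to_token: alignment such as {"Agent": "Michael", "E": "the swatting"}
    templates: hypothesis templates, e.g. ["{Agent} caused {E}"]
    wrong_fillers: an incorrect filler per role, used for NOT-ENTAILED hypotheses
    """
    pairs = []
    for template in templates:
        pairs.append((context, template.format(**role_to_token), "ENTAILED"))
        # Swap in an incorrect filler for each role to create NOT-ENTAILED pairs.
        for role in role_to_token:
            bad = dict(role_to_token, **{role: wrong_fillers[role]})
            pairs.append((context, template.format(**bad), "NOT-ENTAILED"))
    return pairs

# e.g. fill_verbnet_templates("Michael swatted the fly",
#                             {"Agent": "Michael", "E": "the swatting"},
#                             ["{Agent} caused {E}"],
#                             {"Agent": "the fly", "E": "the flying"})
```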
Lex #3: Recasting VerbCorner (VC) The third dataset testing lexicosyntactic inference that we recast is VerbCorner.

Figurative Language (Puns) Figurative language demonstrates natural language's expressiveness and wide variation. Understanding and recognizing figurative language "entail[s] cognitive capabilities to abstract and meta-represent meanings beyond physical words" (Reyes et al., 2012).
Puns are prime examples of figurative language that may perplex general NLU systems, as they are one of the more regular uses of linguistic ambiguity (Binsted, 1996) and rely on a wide range of phonetic, morphological, syntactic, and semantic ambiguity (Pepicello and Green, 1984; Binsted, 1996; Bekinschtein et al., 2011).
We recast puns from Yang et al. (2015) and Miller et al. (2017) using templates to generate contexts (6a) and hypotheses (6b), (6c). We replace Name with names sampled from a distribution based on US census data,11 and Pun with the original sentence. If the original sentence was labeled as containing a pun, the (6a)-(6b) pair is labeled as ENTAILED and the (6a)-(6c) pair is labeled as NOT-ENTAILED; otherwise we swap the labels.

(6) a. Name heard that Pun
    b. Name heard a pun
    c. Name did not hear a pun
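A minimal sketch of this template filling is shown below; the short name list stands in for the census-based name distribution, and the function name is hypothetical.

```python
import random

NAMES = ["James", "Mary", "John", "Patricia"]  # stand-in for census-based sampling

def recast_pun(sentence, contains_pun):
    """sentence: the original (possibly punny) sentence; contains_pun: boolean."""
    name = random.choice(NAMES)
    context = f"{name} heard that {sentence}"                       # (6a)
    pun_label = "ENTAILED" if contains_pun else "NOT-ENTAILED"
    no_pun_label = "NOT-ENTAILED" if contains_pun else "ENTAILED"
    return [
        (context, f"{name} heard a pun", pun_label),                # (6b)
        (context, f"{name} did not hear a pun", no_pun_label),      # (6c)
    ]
```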
Relation Extraction (RE) The goal of the relation extraction (RE) task is to infer the real-world relationships between pairs of entities from natural language text. The task is "grounded" in the sense that the input is natural language text and the output is ⟨entity1, relation, entity2⟩ tuples defined in the schema of some knowledge base. RE requires a system to understand the many different surface forms which may entail the same underlying relation, and to distinguish those from surface forms which involve the same entities but do not entail the relation of interest. For example, (7a) is entailed by (7b) and (7c) but not by (7d).

(7) a. Name was born in Place
    b. Name is from Place
    c. Name, a Place native, . . .
    d. Name visited Place

Natural language surface forms are often used in RE in a weak-supervision setting (Mintz et al., 2009; Hoffmann et al., 2011; Riedel et al., 2013). That is, if entity1 and entity2 are known to be related by relation, every sentence observed which mentions both entity1 and entity2 is assumed to be a realization of relation: i.e. (7d) would (falsely) be taken as evidence of the birthPlace relation.
Here we first generate hypotheses and then corresponding contexts.
To generate hypotheses, we begin with entity-relation triples extracted from DBPedia infoboxes, e.g. ⟨Barack Obama, birthPlace, Hawaii⟩. These relation predicates were extracted directly from Wikipedia infoboxes and are not cleaned. As a result, many relations are redundant with one another (birthPlace, hometown) and some relations do not correspond to obvious natural language glosses based on the name alone (demographics1Info). Thus, we construct a template for each predicate p by manually inspecting 1) a sample of entities which are related by p, 2) a sample of sentences in which those entities co-occur, and 3) the most frequent natural language strings which join entities related by p according to an OpenIE triple database (Schmitz et al., 2012; Fader et al., 2011) extracted from a large text corpus. We then manually write a simple template (e.g. Mention1 was born in Mention2) for p, ignoring any unclear relations. In total, we end up with 574 unique relations, expressed by 354 unique templates.
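The hypothesis-generation step can be pictured as a simple lookup-and-fill, as in the sketch below; the TEMPLATES dictionary is a hypothetical stand-in for the 354 manually written templates.

```python
# Manually authored templates keyed by DBPedia predicate (illustrative entries).
TEMPLATES = {
    "birthPlace": "{m1} was born in {m2}",
    "hometown": "{m1} was born in {m2}",   # redundant predicates can share a template
}

def re_hypothesis(entity1, relation, entity2):
    """Return a hypothesis sentence for a DBPedia triple, or None if the
    relation has no clear natural language gloss and was skipped."""
    template = TEMPLATES.get(relation)
    if template is None:
        return None
    return template.format(m1=entity1, m2=entity2)

# e.g. re_hypothesis("Barack Obama", "birthPlace", "Hawaii")
# -> "Barack Obama was born in Hawaii"
```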
For each such hypothesis generated, we create a number of contexts.
We begin with the FACC1 corpus (Gabrilovich et al., 2013) which contains natural language sentences from ClueWeb in which entities have been automatically linked to disambiguated Freebase entities, when possible.
Then, given a tuple ⟨entity1, relation, entity2⟩, we find every sentence which contains both entity1 and entity2. Since many of these sentences are false positives (7d), we have human annotators vet each context/hypothesis pair, using the ordinal entailment scale described in Zhang et al. (2017).
We include optional binary labels by converting pairs labeled as 5 to ENTAILED and pairs labeled as 1-4 to NOT-ENTAILED.12 We apply pruning methods (described in Appendix B.4) to combat issues related to noisy, ungrammatical hypotheses and disagreement between multiple annotators.
Subjectivity (Sentiment) Some of the previously discussed semantic phenomena deal with objective information, e.g. did an event occur, or what type of entity does a specific name represent. Subjective information is often expressed differently (Wiebe et al., 2005), making it important to use other tests to probe whether an NLU system understands language that expresses subjective information. We are interested in determining whether general NLU models capture 'subjective clues' that can help identify and understand emotions, opinions, and sentiment within a subjective text (Wilson et al., 2006). We recast a sentiment analysis dataset since the task is the "expression of subjectivity as either a positive or negative opinion" (Taboada, 2016). We extract sentences from product, movie, and restaurant reviews labeled as containing positive or negative sentiment (Kotzias et al., 2015). Contexts (8a) and hypotheses (8b), (8c) are generated using the following templates:

(8) a. When asked about Item, Name said Review
    b. Name liked the Item
    c. Name did not like the Item

Item is replaced with either "product", "movie", or "restaurant", and Name is sampled as previously discussed. If the original sentence contained positive (negative) sentiment, the (8a)-(8b) pair is labeled as ENTAILED (NOT-ENTAILED) and the (8a)-(8c) pair is labeled as NOT-ENTAILED (ENTAILED).
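The sketch below shows one way templates (8a)-(8c) could be filled; the name list again stands in for the census-based sampling, and the function is illustrative rather than the exact script used.

```python
import random

NAMES = ["Linda", "Robert", "Maria", "David"]  # stand-in for census-based sampling

def recast_sentiment(review, item, positive):
    """review: the original sentence; item: 'product', 'movie', or 'restaurant';
    positive: True if the review was labeled as positive sentiment."""
    name = random.choice(NAMES)
    context = f"When asked about the {item}, {name} said {review}"     # (8a)
    liked_label = "ENTAILED" if positive else "NOT-ENTAILED"
    disliked_label = "NOT-ENTAILED" if positive else "ENTAILED"
    return [
        (context, f"{name} liked the {item}", liked_label),            # (8b)
        (context, f"{name} did not like the {item}", disliked_label),  # (8c)
    ]
```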

Noise in Recast Data

Noise in the recast datasets is problematic because a model may achieve a high accuracy by learning dataset-specific characteristics that are unrelated to NLU. For example, Poliak et al. (2018a,b) previously noted the association between ungrammaticality and NOT-ENTAILED examples based on how White et al. (2017) recast the FrameNet+ dataset (Pavlick et al., 2015).

In the DNC, most of the noisy examples are in the recast VerbNet and Relation Extraction portions. In recast VerbNet, some examples are noisy because of incorrect subject-verb agreement.13 Since more noisy examples appeared in the Relation Extraction set, we relied on Amazon Mechanical Turk workers to flag ungrammatical hypotheses in the recast dataset, and we remove NLI pairs with ungrammatical hypotheses.14

Experiments
Our experiments demonstrate how these recast datasets may be used to evaluate how well models capture different types of semantic reasoning necessary for general language understanding. We also include results from a hypothesis-only model as a strong baseline. This may reveal whether the recast datasets retain statistical irregularities from the original, task-specific annotations.
13 "Her teeth was cared for" or "Floss were used". 14See Appendix B.4 for details.

Models
To demonstrate how well an NLI model performs these fine-grained types of reasoning, we use InferSent (Conneau et al., 2017).
InferSent independently encodes a context and a hypothesis with a bi-directional LSTM and combines the sentence representations by concatenating the individual sentence representations, their element-wise difference, and their element-wise product. The combined representation is then fed into an MLP with a single hidden layer. The hypothesis-only model is a modified version of InferSent that only accesses hypotheses (Poliak et al., 2018b). We report experimental details in Appendix C.
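For readers unfamiliar with InferSent, the combination-and-classification step can be sketched in PyTorch as below; the dimensions are illustrative and the BiLSTM encoders that produce u and v are omitted, so this is only a schematic of the architecture described above.

```python
import torch
import torch.nn as nn

class NLIClassifier(nn.Module):
    """Combine context encoding u and hypothesis encoding v, then classify."""
    def __init__(self, enc_dim=2048, hidden_dim=512, n_classes=2):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(4 * enc_dim, hidden_dim),  # input is [u; v; u - v; u * v]
            nn.Tanh(),
            nn.Linear(hidden_dim, n_classes),
        )

    def forward(self, u, v):
        features = torch.cat([u, v, u - v, u * v], dim=-1)
        return self.mlp(features)

# A hypothesis-only variant would classify from v alone (an MLP over v),
# mirroring the modification of Poliak et al. (2018b).
```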

Results
Table 3 reports the models' accuracies across the recast NLI datasets. Even though we categorize VerbNet, MegaVeridicality, and VerbCorner as lexicosyntactic inference, we train and evaluate models separately on these three datasets because we use different strategies to recast each of them. When evaluating NLI models, our baseline is the maximum of the accuracies of the hypothesis-only model and the majority class label (MAJ). In six of the eight recast datasets that we use to train our models, the hypothesis-only model outperforms MAJ. The two datasets where the hypothesis-only model does not outperform MAJ are Sentiment and VN, each of which contains fewer than 10K examples.15 We do not train on GAR because of its small size.
Our results suggest that InferSent, when not pre-trained on any other data, might capture specific semantic phenomena better than others. InferSent seems to learn the most about determining whether an event occurred, since the difference between its accuracy and that of the hypothesis-only baseline (+13.93) is largest on the recast EF dataset compared to the other recast annotations. The model seems to similarly learn to perform (or detect) the type of lexico-syntactic inference present in VC and MV. Interestingly, the hypothesis-only model outperforms InferSent on the recast RE dataset.

Hypothesis Only Baseline
The hypothesis-only model can demonstrate how likely it is that an NLI label applies to a hypothesis regardless of its context, and indicates how well each recast dataset tests a model's ability to perform each specific type of reasoning when performing NLI. The high hypothesis-only accuracy on the recast NER dataset may demonstrate that the hypothesis-only model is able to detect that the distribution of class labels for a given word may be peaky. For example, Hong Kong appears 130 times in the training set and is always labeled as a location. Based on this, in future work we may consider different methods to recast NER annotations into labeled NLI examples, or limit the dataset's training size.
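The kind of label peakiness described above is easy to measure directly; the sketch below computes, for each entity string in the training data, the fraction of its occurrences that carry its most frequent label (the input format and names are assumptions).

```python
from collections import Counter, defaultdict

def label_peakiness(train_pairs):
    """train_pairs: iterable of (entity_string, ner_label) pairs from training data."""
    by_entity = defaultdict(Counter)
    for entity, label in train_pairs:
        by_entity[entity][label] += 1
    peakiness = {}
    for entity, counts in by_entity.items():
        peakiness[entity] = max(counts.values()) / sum(counts.values())
    return peakiness

# An entity like "Hong Kong" that is labeled 'location' in all 130 of its
# training occurrences would get a peakiness of 1.0.
```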
Pre-training models on DNC We would like to know whether initializing models with pre-trained parameters improves scores. We notice that when we pre-train our models on the DNC, for the larger datasets, a pre-trained model does not seem to significantly outperform randomly initializing the parameters. For the smaller datasets, specifically Puns, Sentiment, and VN, a pre-trained model significantly outperforms random initialization.16 We are also interested in whether fine-tuning these pre-trained models on each category (update) improves a model's ability to perform well on the category compared to keeping the pre-trained models' parameters static (fixed). Across all of the recast datasets, updating the pre-trained model's parameters during training improves InferSent's accuracies more than keeping the model's parameters fixed. When updating a model pre-trained on the entire DNC, we see the largest improvements on VN (+9.15).

Models trained on Multi-NLI

Williams et al. (2017) argue that Multi-NLI "[makes] it possible to evaluate systems on nearly the full complexity of the language." However, how well does Multi-NLI test a model's capability to understand the diverse semantic phenomena captured in the DNC? We posit that if a model trained on and performing well on Multi-NLI does not perform well on our recast datasets, then Multi-NLI might not evaluate a model's ability to understand the "full complexity" of language as argued.17 When trained on Multi-NLI, our InferSent model achieves an accuracy of 70.22% on (matched) Multi-NLI.18 When we test the model on the recast datasets (without updating the parameters), we see significant drops.19 On the datasets testing a model's lexico-syntactic inference capabilities, the model performs below the majority class baseline. On the NER, EF, and Puns datasets it performs below the hypothesis-only baseline. We also notice that on three of the datasets (EF, Puns, and VN), the fixed hypothesis-only model outperforms the fixed InferSent model.

These results might suggest that Multi-NLI does not evaluate whether sentence representations capture these distinct semantic phenomena. This is a bit surprising for some of the recast phenomena. We would expect Multi-NLI's fiction section (especially its humor subset) in the training set to contain some figurative language that might be similar to puns, and the travel guides (and possibly telephone conversations) to contain text related to sentiment.

Pre-training on DNC or Multi-NLI?
Initializing a model with parameters pre-trained on DNC or Multi-NLI often outperforms random initialization.20 Is it better to pre-train on DNC or Multi-NLI? On five of the recast datasets, using a model pre-trained on DNC outperforms a model pre-trained on Multi-NLI. The results are flipped on the two datasets focused on downstream tasks (Sentiment and RE) and MV. However, the differences between pre-training on the DNC or Multi-NLI are small. From this, it is unclear whether pre-training on DNC is better than Multi-NLI.
Size of Pre-trained DNC Data We randomly sample 10K and 20K examples from each dataset's training set to investigate what happens if we train our models on a subsample of each training set instead of the entire DNC. Although we noticed a slight decrease across each recast test set, the decrease was not significant. We leave this investigation for a future, thorough study.

Related Work
Exploring what linguistic phenomena neural models learn Many tests have been used to probe how well neural models learn different linguistic phenomena. Linzen et al. (2016) use "number agreement in English subject-verb dependencies" to show that LSTMs learn about syntax-sensitive dependencies. In addition to syntax (Shi et al., 2016), researchers have used other labeling tasks to investigate whether neural machine translation (NMT) models learn different linguistic phenomena (Belinkov et al., 2017a,b; Dalvi et al., 2017; Marvin and Koehn, 2018). Recently, Poliak et al. (2018a) used recast NLI datasets to investigate the semantics captured by NMT encoders.
Targeted Tests for Natural Language Understanding We follow a long line of work focused on building datasets to test how well NLU systems perform distinct types of semantic reasoning. FraCaS uses a limited number of sentence pairs to test whether systems understand semantic phenomena, e.g. generalized quantifiers, temporal references, and (nominal) anaphora (Cooper et al., 1996). FraCaS cannot be used to train neural models; it includes just roughly 300 high-quality instances manually created by linguists. MacCartney (2009) created the FraCaS textual inference test suite by automatically "convert[ing] each FraCaS question into a declarative hypothesis." Levesque et al. (2012)'s Winograd Schema Challenge forces a model to choose between two possible answers for a question based on a sentence describing an event.
Recent benchmarks test whether NLI models handle adjective-noun composition (Pavlick and Callison-Burch, 2016), other types of composition (Dasgupta et al., 2018), paraphrastic inference, anaphora resolution, and semantic proto-roles (White et al., 2017). Concurrently, Conneau et al. (2018)'s benchmark can be used to probe whether sentence representations capture many linguistic properties. It includes syntactic and surface-form tests but does not focus on as wide a range of semantic phenomena as the DNC does. Glockner et al. (2018) introduce a modified version of SNLI to test how well NLI models perform when lexical and world knowledge are required. Wang et al. (2018)'s GLUE dataset is intended to evaluate, and potentially train, a sentence representation to perform well across different NLP tasks. This continues an aspect of the initial RTE collection, designed to be representative of downstream tasks like QA, MT, and IR (Dagan et al., 2010). While GLUE is therefore concerned with applied tasks, the DNC, as well as Naik et al. (2018)'s NLI stress tests, is concerned with probing the capabilities of NLU models to capture explicitly distinguished aspects of meaning. While one may conjecture that the latter needs to be "solved" to eventually "solve" the former, it may be that these goals only partially overlap. Some NLP researchers might focus on probing for semantic phenomena in sentence representations, while others may be more interested in developing single sentence representations that can help models perform well on a wide array of downstream tasks.

Conclusion
We described how we recast a wide range of semantic phenomena from many NLP datasets into labeled NLI sentence pairs. These examples serve as a diverse NLI framework that may help diagnose whether NLU models capture and perform distinct types of reasoning. Our experiments demonstrate how to use this framework as an NLU benchmark. The DNC is actively growing as we continue recasting more datasets into labeled NLI examples. We encourage dataset creators to recast their datasets into NLI and invite them to add their recast datasets to the DNC. The collection, along with baselines and trained models, is available online at http://www.decomp.net.

B Recasting Semantic Phenomena
Here we add secondary information about the original datasets and our recasting efforts.

B.1 Event Factuality
We demonstrate how determining whether an event occurred can enable accurate inferences based on the event.

B.3 Figurative Language (Puns)

Puns in Yang et al. (2015) were originally extracted from punsoftheday.com, and sentences without puns came from newswire and proverbs. The sentences are labeled as containing a pun or not. Puns in Miller et al. (2017) were sampled from prior pun detection datasets (Miller and Gurevych, 2015; Miller and Turković, 2016), and the dataset includes new examples generated from scratch for the shared task; the original labels denote whether the sentences contain homographic, heterographic, or no pun at all. Here, we are only interested in whether a sentence contains a pun or not, rather than discriminating between homographic and heterographic puns.

B.4 Relation Extraction
Since hypotheses were automatically generated from Wikipedia infoboxes, many examples are noisy and ungrammatical. We presented hypotheses (independent of their corresponding contexts) to Mechanical Turk workers and asked them to label each sentence as containing no grammatical error, minor grammatical issues, or major grammatical issues. We removed the 2,056 NLI examples with hypotheses containing major grammatical issues, resulting in 28,041 labeled pairs. Interestingly, almost 70% of those examples were labeled between 1-4, which we view as NOT-ENTAILED. We release the ungrammatical NLI examples as supplementary data.

A second source of noise in the recast relation extraction dataset can be caused by disagreement amongst multiple annotators. Examples in our training and development sets are annotated by a single annotator, while we use 3- to 5-way redundancy to annotate the test examples. To guarantee high-quality test examples, we only include examples with 100% inter-annotator agreement. Additionally, we remove the 16 examples labeled with 4 from our NOT-ENTAILED examples in this pruned test set, since some of these examples are arguably entailments. Consequently, the test set contains 761 examples, out of the original 3,670 test examples. Nevertheless, we separately release all 3,670 test examples and include the original annotations as well, enabling others to consider other methods to collapse the multi-way annotations.

C Experimental Details
In all our experiments, we use pre-computed GloVe embeddings (Pennington et al., 2014) and use the OOV vector for words that do not have a defined embedding. We follow Conneau et al. (2017)'s procedure to train our models. During training, our models are optimized with SGD. Our initial learning rate is 0.1 with a decay rate of 0.99.
Our models train for at most 20 epochs and can optionally terminate early when the learning rate is less than 10^-5. If the accuracy decreases on the development set in any epoch, the learning rate is divided by a fixed factor, following Conneau et al. (2017)'s schedule.
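As a rough illustration of this schedule, the loop below applies SGD with an initial learning rate of 0.1, a per-epoch decay of 0.99, early stopping when the rate drops below 10^-5, and a reduction when development accuracy decreases; the reduction factor of 5 is an assumption borrowed from Conneau et al. (2017), and the model/data interfaces are hypothetical.

```python
import torch

def train(model, train_loader, dev_accuracy, max_epochs=20):
    """model maps a batch of (contexts, hypotheses) to class logits;
    dev_accuracy(model) returns accuracy on the development set."""
    lr, decay, best_dev = 0.1, 0.99, 0.0
    for _ in range(max_epochs):
        if lr < 1e-5:                       # optional early termination
            break
        optimizer = torch.optim.SGD(model.parameters(), lr=lr)
        for contexts, hypotheses, labels in train_loader:
            optimizer.zero_grad()
            loss = torch.nn.functional.cross_entropy(model(contexts, hypotheses), labels)
            loss.backward()
            optimizer.step()
        acc = dev_accuracy(model)
        if acc < best_dev:
            lr /= 5                         # assumed reduction factor (Conneau et al., 2017)
        best_dev = max(best_dev, acc)
        lr *= decay                         # per-epoch decay
```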

◮ Find him before he finds the dog food
The finding did not happen (Event) ✓

Table 1: Example sentence pairs for different semantic phenomena. ◮ indicates the line is a context and the following line is its corresponding hypothesis. ✓ and ✗ respectively indicate that the context entails, or does not entail, the hypothesis. Appendix A includes more recast examples.

Table 3: NLI accuracies on test data. Columns correspond to each semantic phenomenon and rows correspond to the model used. Columns are ordered from larger to smaller in size, but the last three (VC, MV, VN) are separated because they fall under lexico-syntactic inference. (update) refers to a model that was initialized with pre-trained parameters and then re-trained on the corresponding recast data; (fixed) refers to a model whose pre-trained parameters were kept static when evaluated on these datasets. Bold numbers in each column indicate which settings were responsible for the highest accuracy on the specific recast dataset.
Table 4 includes examples from all of the recast NLI datasets. We include one ENTAILED and one NOT-ENTAILED example from each dataset that tests a distinct type of reasoning.

Table 4: Recast NLI examples from the different semantic phenomena. The ✓ and ✗ columns respectively indicate that the context entails, or does not entail, the hypothesis. Each cell's first and second line respectively represent a context and hypothesis.

B.5 Sentiment

Kotzias et al. (2015) compiled examples from previous sources. The movie dataset came from Maas et al. (2011), the Amazon product reviews were released by McAuley and Leskovec (2013), and the restaurant reviews were sourced from the Yelp dataset challenge.21