Collecting Diverse Natural Language Inference Problems for Sentence Representation Evaluation

We present a large-scale collection of diverse natural language inference (NLI) datasets that help provide insight into how well a sentence representation captures distinct types of reasoning. The collection results from recasting 13 existing datasets from 7 semantic phenomena into a common NLI structure, resulting in over half a million labeled context-hypothesis pairs in total. We refer to our collection as the DNC: Diverse Natural Language Inference Collection. The DNC is available online at https://www.decomp.net, and will grow over time as additional resources are recast and added from novel sources.


Introduction
A plethora of new natural language inference (NLI) [1] datasets has been created in recent years (Bowman et al., 2015; Williams et al., 2017; Lai et al., 2017; Khot et al., 2018). However, these datasets do not provide clear insight into what type of reasoning or inference a model may be performing. For example, these datasets cannot be used to evaluate whether competitive NLI models can determine if an event occurred, correctly differentiate between figurative and literal language, or accurately identify and categorize named entities. Consequently, these datasets cannot answer how well sentence representation learning models capture distinct semantic phenomena necessary for general natural language understanding (NLU).
[1] The task of determining if a hypothesis would likely be inferred from a context, or premise; also known as Recognizing Textual Entailment (RTE) (Dagan et al., 2006, 2013).
To answer these questions, we introduce the Diverse NLI Collection (DNC), a large-scale NLI dataset that tests a model's ability to perform diverse types of reasoning. DNC is a collection of NLI problems, each requiring a model to perform a unique type of reasoning. Each NLI dataset contains labeled context-hypothesis pairs that we recast from semantic annotations for specific structured prediction tasks. We extend various prior works on challenge NLI datasets (Zhang et al., 2017), and define recasting as leveraging existing datasets to create NLI examples (Glickman, 2006; White et al., 2017). We recast annotations from a total of 13 datasets across 7 NLP tasks into labeled NLI examples. The tasks include event factuality, named entity recognition, gendered anaphora resolution, sentiment analysis, relationship extraction, pun detection, and lexicosyntactic inference. Currently, the DNC contains over half a million labeled examples. Table 1 includes NLI pairs that test specific types of reasoning.
[Table 1 caption: one marker indicates that a line is a context and the following line is its corresponding hypothesis; two further markers respectively indicate that the context entails, or does not entail, the hypothesis. Appendix A includes more recast examples.]
Using a hypothesis-only NLI model, with access to just hypothesis sentences, as a strong baseline (Tsuchiya, 2018;Gururangan et al., 2018;Poliak et al., 2018b), our experiments demonstrate how DNC can be used to probe a model's ability to capture different types of semantic reasoning necessary for general NLU. In short, this work answers a recent plea to the community to test "more kinds of inference" than in previous challenge sets (Chatzikyriakidis et al., 2017).

Motivation & Background
Compared to eliciting NLI datasets directly, i.e. asking humans to author contexts and/or hypothesis sentences, recasting can 1) help determine whether an NLU model performs distinct types of reasoning; 2) limit types of biases observed in previous NLI data; and 3) generate examples cheaply, potentially at large scales.
NLU Insights Popular NLI datasets, e.g. Stanford Natural Language Inference (SNLI) (Bowman et al., 2015) and its successor Multi-NLI (Williams et al., 2017), were created by eliciting hypotheses from humans. Crowd-source workers were tasked with writing one sentence each that is entailed, neutral, and contradicted by a caption extracted from the Flickr30k corpus (Young et al., 2014). Although these datasets are widely used to train and evaluate sentence representations, a high accuracy is not indicative of what types of reasoning NLI models perform. Workers were free to create any type of hypothesis for each context and label. Such datasets cannot be used to determine how well an NLI model captures many desired capabilities of language understanding systems, e.g. paraphrastic inference, complex anaphora resolution (White et al., 2017), or compositionality (Pavlick and Callison-Burch, 2016;Dasgupta et al., 2018). By converting prior annotation of a specific phenomenon into NLI examples, recasting allows us to create a diverse NLI benchmark that tests a model's ability to perform distinct types of reasoning.
Limit Biases Studies indicate that many NLI datasets contain significant biases. Examples in the early Pascal RTE datasets could be correctly predicted based on syntax alone (Vanderwende and Dolan, 2006). Statistical irregularities and annotation artifacts within class labels allow a hypothesis-only model to significantly outperform the majority baseline on at least six recent NLI datasets (Poliak et al., 2018b). Class label biases may be attributed to the human-elicited protocol. Moreover, examples in such NLI datasets may contain racial and gendered stereotypes. We limit some biases by not relying on humans to generate hypotheses. Recast NLI datasets may still contain some biases, e.g. non-uniform distributions over NLI labels caused by the distribution of labels in the original dataset that we recast. Experimental results using Poliak et al. (2018b)'s hypothesis-only model indicate to what degree the recast datasets retain some biases that may be present in the original semantic datasets.
NLI Examples at Large-scale Generating NLI datasets from scratch is costly. Humans must be paid to generate or label natural language text. This linearly scales costs as the amount of generated NLI-pairs increases. Existing annotations for a wide array of semantic NLP tasks are freely available. By leveraging existing semantic annotations already invested in by the community we can generate and label NLI pairs at little cost and create large NLI datasets to train data hungry models.
Why These Semantic Phenomena? A long term goal is to develop NLU systems that can achieve human levels of understanding and reasoning. Investigating how different architectures and training corpora can help a system perform human-level general NLU is an important step in this direction. DNC contains recast NLI pairs that are easily understandable by humans and can be used to evaluate different sentence encoders and NLU systems. These semantic phenomena cover distinct types of reasoning that an NLU system may often encounter in the wild. While higher performance on these benchmarks might not be conclusive proof of a system achieving human-level reasoning, a system that does poorly should not be viewed as performing human-level NLU. We argue that these semantic phenomena play integral roles in NLU. There exist more semantic phenomena integral to NLU (Allen, 1995) and we plan to include them in future versions of the DNC.
Previous Recast NLI Example sentences in RTE1 (Dagan et al., 2006) were extracted from MT, IE, and QA datasets, with the process referred to as 'recasting' in the thesis by Glickman (2006). NLU problems were reframed under the NLI framework and candidate sentence pairs were extracted from existing NLP datasets and then labeled under NLI (Dagan et al., 2006). Years later, this term was independently used by White et al. (2017), who proposed to "leverage existing large-scale semantic annotation collections as a source of targeted textual inference examples." The term 'recasting' was limited to automatically converting existing semantic annotations into labeled NLI examples without manual intervention. We adopt the broader definition of 'recasting' since our NLI examples were automatically or manually generated from prior NLU datasets.

Applied Framework versus Inference Probing
Traditionally, NLI has not been viewed as a downstream, applied NLP task. Instead, the community has often used it as "a generic evaluation framework" to compare models for distinct downstream tasks (Dagan et al., 2006) or to determine whether a model performs distinct types of reasoning (Cooper et al., 1996). These two different evaluation goals may affect which datasets are recast. We target both goals as we recast applied tasks and linguistically focused phenomena.

Recasting Semantic Phenomena
We describe efforts to recast 7 semantic phenomena from a total of 13 datasets into labeled NLI examples. Many of the recasting methods rely on simple templates that do not include nuances and variances typical of natural language. This allows us to specifically test how sentence representations capture distinct types of reasoning. When recasting, we preserve each dataset's train/dev/test split. If a dataset does not contain such a split, we create a random split with roughly an 80:10:10 ratio. Table 2 reports statistics about each recast dataset.
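The random split mentioned above can be sketched as follows. This is a minimal illustration, not the authors' released code: the function name `random_split`, the fixed seed, and the dictionary layout are assumptions made here for clarity.

```python
import random


def random_split(examples, seed=0, ratios=(0.8, 0.1, 0.1)):
    """Shuffle examples and cut them into train/dev/test partitions.

    The seed and the exact rounding are illustrative assumptions; the
    paper only states that the split is roughly 80:10:10.
    """
    items = list(examples)
    random.Random(seed).shuffle(items)
    n_train = int(ratios[0] * len(items))
    n_dev = int(ratios[1] * len(items))
    return {
        "train": items[:n_train],
        "dev": items[n_train:n_train + n_dev],
        # Whatever is left over after rounding goes to the test partition.
        "test": items[n_train + n_dev:],
    }
```

Note that a split of this kind operates on individual examples; some recast datasets described later instead keep all pairs derived from the same source sentence or class in a single partition.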
Event Factuality (EF) Event factuality prediction is the task of determining whether an event described in text occurred. Determining whether an event occurred enables accurate inferences, e.g. monotonic inferences, based on the event. Incorporating factuality has been shown to improve NLI (Sauri and Pustejovsky, 2007).
We recast event factuality annotations from UW (Lee et al., 2015), MEANTIME (Minard et al., 2016), and Decomp. We use sentences from original datasets as contexts and templates (1a) and (1b) as hypotheses.
(1) a. The Event happened
    b. The Event did not happen
If the predicate denoting the Event was annotated as having happened in the factuality dataset, the context paired with (1a) is labeled as ENTAILED and the same context paired with (1b) is labeled as NOT-ENTAILED. Otherwise, we swap the labels.
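A minimal sketch of this template-based recasting is shown below. The function name `recast_event_factuality`, its arguments, and the dictionary format are hypothetical; how the Event string is derived from the annotated predicate is not specified here and is left to the caller.

```python
def recast_event_factuality(sentence, event, happened):
    """Return the two labeled NLI pairs for one factuality annotation.

    `event` is the string substituted for Event in templates (1a)/(1b);
    `happened` is the original binary factuality judgment.
    """
    positive = f"The {event} happened"
    negative = f"The {event} did not happen"
    # If the event happened, (1a) is entailed and (1b) is not; otherwise swap.
    pos_label = "ENTAILED" if happened else "NOT-ENTAILED"
    neg_label = "NOT-ENTAILED" if happened else "ENTAILED"
    return [
        {"context": sentence, "hypothesis": positive, "label": pos_label},
        {"context": sentence, "hypothesis": negative, "label": neg_label},
    ]
```

Each annotated predicate therefore yields exactly one ENTAILED and one NOT-ENTAILED pair, which keeps the label distribution balanced per context.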
Named Entity Recognition (NER) Distinct types of entities have different properties and relational objects (Prince, 1978) that can help infer facts from a given context. For example, if a system can detect that an entity is a name of a nation, then that entity likely has a leader, a language, and a culture (Prince, 1978; Van Durme, 2010). When classifying NLI pairs, a model can determine if an object mentioned in the hypothesis can be a relational object typically associated with the type of entity described in the context. NER tags can also be directly used to determine if a hypothesis is likely to not be entailed by a context, such as when entities in contexts and hypotheses do not share NER tags (Castillo and Alemany, 2008; Sammons et al., 2009; Pakray et al., 2010). Given a sentence annotated with NER tags, we recast the annotations by preserving the original sentences as contexts and creating hypotheses using the template "NP is a Label." For ENTAILED hypotheses we replace Label with the correct NER label of the NP; for NOT-ENTAILED hypotheses, we choose an incorrect label from the prior distribution of NER tags for the given phrase. This prevents us from adding additional biases besides any class-label statistical irregularities present in the original data. We apply this procedure on the Groningen Meaning Bank (Bos et al., 2017) and the CoNLL-2003 Shared Task (Tjong Kim Sang and De Meulder, 2003).
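The NER recasting step can be sketched as below. This is an illustration under stated assumptions: `recast_ner` and its arguments are hypothetical names, and `label_counts` stands in for the prior distribution of NER tags, whose exact estimation is not detailed in the text.

```python
import random


def recast_ner(sentence, phrase, gold_label, label_counts, rng=None):
    """Build one ENTAILED and one NOT-ENTAILED pair for a tagged phrase.

    `label_counts` maps each NER label to a count and plays the role of
    the prior over tags (an assumption about how the prior is stored).
    It must contain at least one label other than `gold_label`.
    """
    rng = rng or random.Random(0)
    # Remove the correct label, then sample a wrong one in proportion
    # to its prior frequency.
    wrong = {label: n for label, n in label_counts.items() if label != gold_label}
    labels = list(wrong)
    wrong_label = rng.choices(labels, weights=[wrong[l] for l in labels], k=1)[0]
    return [
        {"context": sentence,
         "hypothesis": f"{phrase} is a {gold_label}",
         "label": "ENTAILED"},
        {"context": sentence,
         "hypothesis": f"{phrase} is a {wrong_label}",
         "label": "NOT-ENTAILED"},
    ]
```

Sampling the incorrect label from the prior, rather than uniformly, is what keeps the NOT-ENTAILED hypotheses from introducing a label distribution that differs from the original data.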

Gendered Anaphora Resolution (GAR)
The ability to perform pronoun resolution is essential to language understanding, in many cases requiring common-sense reasoning about the world (Levesque et al., 2012). White et al. (2017) show that this task can be directly recast as an NLI problem by transforming Winograd schemas into NLI sentence pairs.
Using [...] adapted pronoun resolution task, they demonstrate the presence of systematic gender bias in coreference resolution systems. We recast Winogender schemas as an NLI task, introducing a potential method of detecting gender bias in NLI systems or sentence embeddings. In recasting, the context is the original, unmodified Winogender sentence; the hypothesis is a short, manually constructed sentence having a correct (ENTAILED) or incorrect (NOT-ENTAILED) pronoun resolution.
Table 2: Statistics summarizing the recast datasets. The first column refers to the original annotation that was recast; the 'Combined' row refers to the combination of our recast datasets. The second column indicates the datasets that were recast, and the third column reports how many labeled NLI pairs were extracted from the corresponding dataset. The last column indicates whether the recasting method was fully automatic without human involvement, manual, or semi-automatic with human intervention. The Multi-NLI and SNLI numbers contextualize the scale of our dataset.
Lexicosyntactic Inference (Lex) While many inferences in natural language are triggered by lexical items alone, there exist pervasive inferences that arise from interactions between lexical items and their syntactic contexts. This is particularly apparent among propositional attitude verbs (e.g. think, want, know), which display complex distributional profiles (White and Rawlins, 2016). For instance, the verb remember can take both finite clausal complements and infinitival clausal complements.
(2) a. Jo didn't remember that she ate
    b. Jo didn't remember to eat
This small change in the syntactic structure gives rise to large changes in the inferences that are licensed: (2a) presupposes that Jo ate while (2b) entails that Jo didn't eat. We recast data from three datasets that are relevant to these sorts of lexicosyntactic interactions. To recast these annotations, we assign the context sentences like (3a) to the majority class (yes, maybe or maybe not, no) across 10 different annotators, after applying an ordinal model-based normalization to their responses. We then pair each context sentence with three hypotheses.
(4) a. That thing happened
    b. That thing may or may not have happened
    c. That thing didn't happen
If annotated yes, maybe or maybe not, or no, the pair (3a)-(4a), (3a)-(4b), or (3a)-(4c) is respectively assigned ENTAILED and the other pairings are assigned NOT-ENTAILED; train/dev/test split labels are randomly assigned to every pair that context sentence appears in.
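The majority-class recasting described above can be sketched as follows. The names `HYPOTHESES` and `recast_veridicality` are hypothetical, and the sketch takes already-normalized responses as input: the ordinal model-based normalization is not reproduced here.

```python
from collections import Counter

# Hypothesis templates (4a)-(4c), keyed by the annotation class they express.
HYPOTHESES = {
    "yes": "That thing happened",
    "maybe or maybe not": "That thing may or may not have happened",
    "no": "That thing didn't happen",
}


def recast_veridicality(context, responses):
    """Pair a context with all three hypotheses.

    Only the hypothesis matching the majority response is ENTAILED.
    Ties are broken by first occurrence, which is an assumption of
    this sketch rather than something the text specifies.
    """
    majority, _ = Counter(responses).most_common(1)[0]
    return [
        {"context": context,
         "hypothesis": hypothesis,
         "label": "ENTAILED" if answer == majority else "NOT-ENTAILED"}
        for answer, hypothesis in HYPOTHESES.items()
    ]
```

Under this scheme each context contributes one ENTAILED and two NOT-ENTAILED pairs, so the recast data is skewed toward NOT-ENTAILED by construction.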
Lex #2: Recasting VerbNet (VN) We create additional lexicosyntactic NLI examples from VerbNet (Schuler, 2005). VerbNet contains classes of verbs that each can have multiple frames. Each frame contains a mapping from syntactic arguments to thematic roles, which are used as arguments in Neo-Davidsonian first-order logical predicates (5b) that describe the frame's semantics. Each frame additionally contains an example sentence (5a) that we use as our NLI context and we create templates (5c) from the most frequent semantic predicates to generate hypotheses (5d).
(5) a. Michael swatted the fly
    b. cause(E, Agent)
    c. Agent caused the E
    d. Michael caused the swatting
We use the Berkeley Parser (Petrov et al., 2006) to match tokens in an example sentence with the thematic roles and then fill in the templates with the matched tokens (5d). We also decompose multi-argument predicates into unary predicates to increase the number of hypotheses we generate. On average, each context is paired with 4.5 hypotheses. We generate NOT-ENTAILED hypotheses by filling in templates with incorrect thematic roles. We partition the recast NLI examples into train/development/test splits such that all example sentences from a VerbNet class (which we use as NLI contexts) appear in only one partition of our dataset. In turn, the recast VerbNet dataset's partition is not exactly 80:10:10.
Each sentence in VC is judged based on the decomposed semantic properties. We convert each semantic property into declarative statements to create hypotheses and pair them with the original sentences which we preserve as contexts. The NLI pair is ENTAILED or NOT-ENTAILED depending on the given sentence's semantic judgment.
Figurative Language (Puns) Figurative language demonstrates natural language's expressiveness and wide variations. Understanding and recognizing figurative language "entail[s] cognitive capabilities to abstract and meta-represent meanings beyond physical words" (Reyes et al., 2012). Puns are prime examples of figurative language that may perplex general NLU systems as they are one of the more regular uses of linguistic ambiguity (Binsted, 1996) and rely on a wide-range of phonetic, morphological, syntactic, and semantic ambiguity (Pepicello and Green, 1984;Binsted, 1996;Bekinschtein et al., 2011).
We recast puns from Yang et al. (2015) and Miller et al. (2017) using templates to generate contexts (6a) and hypotheses (6b), (6c). We replace Name with names sampled from a distribution based on US census data, and Pun with the original sentence. If the original sentence was labeled as containing a pun, the (6a)-(6b) pair is labeled as ENTAILED and (6a)-(6c) is labeled as NOT-ENTAILED, otherwise we swap the labels.
(6) a. Name heard that Pun
    b. Name heard a pun
    c. Name did not hear a pun
Relation Extraction (RE) The goal of the relation extraction (RE) task is to infer the real-world relationships between pairs of entities from natural language text. The task is "grounded" in the sense that the input is natural language text and the output is ⟨entity1, relation, entity2⟩ tuples defined in the schema of some knowledge base. RE requires a system to understand the many different surface forms which may entail the same underlying relation, and to distinguish those from surface forms which involve the same entities but do not entail the relation of interest. For example, (7a) is entailed by (7b) and (7c) but not by (7d).
(7) a. Name was born in Place
    b. Name is from Place
    c. Name, a Place native, . . .
    d. Name visited Place
Natural language surface forms are often used in RE in a weak-supervision setting (Mintz et al., 2009; Hoffmann et al., 2011; Riedel et al., 2013). That is, if entity1 and entity2 are known to be related by relation, it is assumed that every observed sentence which mentions both entity1 and entity2 is a realization of relation: i.e. (7d) would (falsely) be taken as evidence of the birthPlace relation.
Here we first generate hypotheses and then corresponding contexts.
To generate hypotheses, we begin with entity-relation triples extracted from DBPedia infoboxes: e.g. ⟨Barack Obama, birthPlace, Hawaii⟩. These relation predicates were extracted directly from Wikipedia infoboxes and are not cleaned. As a result, many relations are redundant with one another (birthPlace, hometown) and some relations do not correspond to obvious natural language glosses based on the name alone (demographics1Info). Thus, we construct a template for each predicate p by manually inspecting 1) a sample of entities which are related by p, 2) a sample of sentences in which those entities co-occur, and 3) the most frequent natural language strings which join entities related by p according to an OpenIE triple database (Schmitz et al., 2012; Fader et al., 2011) extracted from a large text corpus. We then manually write a simple template (e.g. Mention1 was born in Mention2) for p, ignoring any unclear relations. In total, we end up with 574 unique relations, expressed by 354 unique templates.
For each such hypothesis generated, we create a number of contexts.
We begin with the FACC1 corpus (Gabrilovich et al., 2013) which contains natural language sentences from ClueWeb in which entities have been automatically linked to disambiguated Freebase entities, when possible.
Then, given a tuple ⟨entity1, relation, entity2⟩, we find every sentence which contains both entity1 and entity2. Since many of these sentences are false positives (7d), we have human annotators vet each context/hypothesis pair, using the ordinal entailment scale described in Zhang et al. (2017). We include optional binary labels by converting pairs labeled as 1-4 and 5 to NOT-ENTAILED and ENTAILED respectively. We apply pruning methods (described in Appendix B.4) to combat issues related to noisy, ungrammatical hypotheses and disagreement between multiple annotators.
Subjectivity (Sentiment) Some of the previously discussed semantic phenomena deal with objective information: did an event occur, or what type of entity does a specific name represent? Subjective information is often expressed differently (Wiebe et al., 2005), making it important to use other tests to probe whether an NLU system understands language that expresses subjective information. We are interested in determining whether general NLU models capture 'subjective clues' that can help identify and understand emotions, opinions, and sentiment within a subjective text (Wilson et al., 2006).
We recast a sentiment analysis dataset since the task is the "expression of subjectivity as either a positive or negative opinion" (Taboada, 2016). We extract sentences from product, movie, and restaurant reviews labeled as containing positive or negative sentiment (Kotzias et al., 2015). Contexts (8a) and hypotheses (8b), (8c) are generated using the following templates:
(8) a. When asked about Item, Name said Review
    b. Name liked the Item
    c. Name did not like the Item
Item is replaced with either "product", "movie", or "restaurant", and the Name is sampled as previously discussed. If the original sentence contained positive (negative) sentiment, the (8a)-(8b) pair is labeled as ENTAILED (NOT-ENTAILED) and (8a)-(8c) is labeled as NOT-ENTAILED (ENTAILED).
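The sentiment templates can be instantiated along the following lines. This is a sketch only: `recast_sentiment` is a hypothetical name, the name is passed in rather than sampled from census data, and the article and quotation marks placed around the review in the context are assumptions about the surface realization of template (8a).

```python
def recast_sentiment(review, item, name, positive):
    """Fill templates (8a)-(8c) for one labeled review sentence.

    `item` is one of "product", "movie", or "restaurant"; `positive`
    is the original binary sentiment label.
    """
    # Surface form of (8a); the punctuation around the review is assumed.
    context = f'When asked about the {item}, {name} said, "{review}"'
    liked = f"{name} liked the {item}"
    disliked = f"{name} did not like the {item}"
    return [
        {"context": context, "hypothesis": liked,
         "label": "ENTAILED" if positive else "NOT-ENTAILED"},
        {"context": context, "hypothesis": disliked,
         "label": "NOT-ENTAILED" if positive else "ENTAILED"},
    ]
```

The same two-hypothesis pattern is used for puns, with "heard a pun" and "did not hear a pun" in place of the liked and disliked hypotheses.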

Experiments
Our experiments demonstrate how these recast datasets may be used to evaluate how well models capture different types of semantic reasoning necessary for general language understanding. We also include results from a hypothesis-only model as a strong baseline. This may reveal whether the recast datasets retain statistical irregularities from the original, task-specific annotations.

Models
For demonstrating how well an NLI model performs these fine-grained types of reasoning, we use InferSent (Conneau et al., 2017). InferSent independently encodes a context and hypothesis with a bi-directional LSTM and combines the sentence representations by concatenating the individual sentence representations, their element-wise subtraction and product. The combined representation is then fed into an MLP with a single hidden layer. The hypothesis-only model is a modified version of InferSent that only accesses hypotheses (Poliak et al., 2018b). We report experimental details in Appendix C.
Table 3 reports the models' accuracies across the recast NLI datasets. Even though we categorize VerbNet, MegaVeridicality, and VerbCorner as lexicosyntactic inference, we train and evaluate models separately on these three datasets because we use different strategies to individually recast them. When evaluating NLI models, our baseline is the maximum between the accuracies of the hypothesis-only model and the majority class label (MAJ). In six of the eight recast datasets that we use to train our models the hypothesis-only model outperforms MAJ. The two datasets where the hypothesis-only model does not outperform MAJ are Sentiment and VN, each of which contains fewer than 10K examples. We do not train on GAR because of its small size.
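The feature combination and the baseline rule described in this section can be sketched in plain Python. Both function names are hypothetical, the vectors are ordinary lists rather than LSTM outputs, and the subtraction is taken literally from the description above (no absolute value), which may differ from a given InferSent implementation.

```python
def combine_representations(u, v):
    """Concatenate u, v, their element-wise difference, and their product.

    `u` and `v` stand for the encoded context and hypothesis; both must
    have the same length.
    """
    if len(u) != len(v):
        raise ValueError("context and hypothesis vectors must match in size")
    difference = [a - b for a, b in zip(u, v)]
    product = [a * b for a, b in zip(u, v)]
    return list(u) + list(v) + difference + product


def baseline_accuracy(hypothesis_only_acc, majority_acc):
    """Baseline used for comparison: the stronger of the two trivial models."""
    return max(hypothesis_only_acc, majority_acc)
```

For d-dimensional sentence vectors the combined representation has 4d entries, and this is what the single-hidden-layer MLP consumes.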

Results
Our results suggest that InferSent, when not pre-trained on any other data, might capture specific semantic phenomena better than other semantic phenomena. InferSent seems to learn the most about determining if an event occurred, since the difference between its accuracy and that of the hypothesis-only baseline (+13.93) is largest on the recast EF dataset compared to the other recast annotations. The model seems to similarly learn to perform (or detect) the type of lexicosyntactic inference present in VC and MV. Interestingly, the hypothesis-only model outperforms InferSent on the recast RE.

Hypothesis Only Baseline
The hypothesis-only model can demonstrate how likely it is that an NLI label applies to a hypothesis, regardless of its context and indicates how well each recast dataset tests a model's ability to perform each specific type of reasoning when performing NLI. The high hypothesis-only accuracy on the recast NER dataset may demonstrate that the hypothesis-only model is able to detect that the distribution of class labels for a given word may be peaky. For example, Hong Kong appears 130 times in the training set and is always labeled as a location. Based on this, in future work we may consider different methods to recast NER annotations into labeled NLI examples, or limit the dataset's training size.
Pre-training models on DNC We would like to know whether initializing models with pre-trained parameters improves scores. When we pre-train our models on DNC, for the larger datasets a pre-trained model does not seem to significantly outperform randomly initialized parameters. For the smaller datasets, specifically Puns, Sentiment and VN, a pre-trained model significantly outperforms random initialization (by 32.81, 31.00, and 30.83 points respectively). We are also interested to know whether fine-tuning these pre-trained models on each category (update) improves a model's ability to perform well on the category compared to keeping the pre-trained models' parameters static (fixed). Across all of the recast datasets, updating the pre-trained model's parameters during training improves InferSent's accuracies more than keeping the model's parameters fixed. When updating a model pre-trained on the entire DNC, we see the largest improvements on VN (+9.15). We also notice that on three of the datasets (EF, Puns, and VN), the fixed hypothesis-only model outperforms the fixed InferSent model.

Models trained on Multi-NLI
Williams et al. (2017) argue that Multi-NLI "[makes] it possible to evaluate systems on nearly the full complexity of the language." However, how well does Multi-NLI test a model's capability to understand the diverse semantic phenomena captured in DNC? We posit that if a model, trained on and performing well on Multi-NLI, does not perform well on our recast datasets, then Multi-NLI might not evaluate a model's ability to understand the "full complexity" of language as argued. When trained on Multi-NLI, our InferSent model achieves an accuracy of 70.22% on (matched) Multi-NLI. When we test the model on the recast datasets (without updating the parameters), we see significant drops. On the datasets testing a model's lexicosyntactic inference capabilities, the model performs below the majority class baseline. On the NER, EF, and Puns datasets it performs below the hypothesis-only baseline.
These results might suggest that Multi-NLI does not evaluate whether sentence representations capture these distinct semantic phenomena. This is a bit surprising for some of the recast phenomena. We would expect Multi-NLI's fiction section (especially its humor subset) in the training set to contain some figurative language that might be similar to puns, and the travel guides (and possibly telephone conversations) to contain text related to sentiment.
Pre-training on DNC or Multi-NLI? Initializing a model with parameters pre-trained on DNC or Multi-NLI often outperforms random initialization. Is it better to pre-train on DNC or Multi-NLI? On five of the recast datasets, using a model pre-trained on DNC outperforms a model pre-trained on Multi-NLI. The results are flipped on the two datasets focused on downstream tasks (Sentiment and RE) and MV. However, the differences between pre-training on the DNC or Multi-NLI are small. From this, it is unclear whether pre-training on DNC is better than Multi-NLI.
Size of Pre-trained DNC Data We randomly sample 10K and 20K examples from each dataset's training set to investigate what happens if we train our models on a subsample of each training set instead of the entire DNC. Although we noticed a slight decrease across each recast test set, the decrease was not significant. We leave this investigation for a more thorough future study.

Related Work
Exploring what linguistic phenomena neural models learn Many tests have been used to probe how well neural models learn different linguistic phenomena. Linzen et al. (2016) use "number agreement in English subject-verb dependencies" to show that LSTMs learn about syntax-sensitive dependencies. In addition to syntax (Shi et al., 2016), researchers have used other labeling tasks to investigate whether neural machine translation (NMT) models learn different linguistic phenomena (Belinkov et al., 2017a,b; Dalvi et al., 2017; Marvin and Koehn, 2018). Recently, Poliak et al. (2018a) used recast NLI datasets to investigate semantics captured by NMT encoders.
Targeted Tests for Natural Language Understanding We follow a long line of work focused on building datasets to test how well NLU systems perform distinct types of semantic reasoning. FraCaS uses a limited number of sentence pairs to test whether systems understand semantic phenomena, e.g. generalized quantifiers, temporal references, and (nominal) anaphora (Cooper et al., 1996). FraCaS cannot be used to train neural models: it includes just roughly 300 high-quality instances manually created by linguists. MacCartney (2009) [...] Recent benchmarks test whether NLI models handle adjective-noun composition (Pavlick and Callison-Burch, 2016), other types of composition (Dasgupta et al., 2018), paraphrastic inference, anaphora resolution, and semantic proto-roles (White et al., 2017). Concurrently, Conneau et al. (2018)'s benchmark can be used to probe whether sentence representations capture many linguistic properties. It includes syntactic and surface form tests but does not focus on as wide a range of semantic phenomena as the DNC. Glockner et al. (2018) introduce a modified version of SNLI to test how well NLI models perform when requiring lexical and world knowledge. Wang et al. (2018)'s GLUE dataset is intended to evaluate and potentially train a sentence representation to perform well across different NLP tasks. This continues an aspect of the initial RTE collection, designed to be representative of downstream tasks like QA, MT, and IR (Dagan et al., 2010). While GLUE is therefore concerned with applied tasks, DNC, as well as Naik et al. (2018)'s NLI stress tests, is concerned with probing the capabilities of NLU models to capture explicitly distinguished aspects of meaning. While one may conjecture that the latter needs to be "solved" to eventually "solve" the former, it may be that these goals only partially overlap. Some NLP researchers might focus on probing for semantic phenomena in sentence representations while others may be more interested in developing single sentence representations that can help models perform well on a wide array of downstream tasks.

Conclusion
We described how we recast a wide range of semantic phenomena from many NLP datasets into labeled NLI sentence pairs. These examples serve as a diverse NLI framework that may help diagnose whether NLU models capture and perform distinct types of reasoning. Our experiments demonstrate how to use this framework as an NLU benchmark. The DNC is actively growing as we continue recasting more datasets into labeled NLI examples. We encourage dataset creators to recast their datasets in NLI and invite them to add their recast datasets into the DNC. The collection, along with baselines and trained models, is available online at http://www.decomp.net.

A More Recast NLI Examples
Table 4 includes examples from all of the recast NLI datasets. We include one ENTAILED and one NOT-ENTAILED example from each dataset that tests a distinct type of reasoning.

B Recasting Semantic Phenomena
Here we add secondary information about the original datasets and our recasting efforts.

B.1 Event Factuality
We demonstrate how determining whether an event occurred can enable accurate inferences based on the event. Consider the following sentences:
(9) a. She walked a beagle
    b. She walked a dog
    c. She walked a brown beagle
If the walking occurred, (9a) entails (9b) but not (9c). If we negate the action in sentences (9a), (9b), and (9c), they respectively become:
(10) a. She did not walk a beagle
     b. She did not walk a dog
     c. She did not walk a brown beagle
The new hypothesis (10c) is now entailed by the context (10a) while (10b) is not.

B.2.1 VerbCorner
When recasting VerbCorner, we use the following templates for hypotheses, assigning them as ENTAILED and NOT-ENTAILED based on the positive or negative answers to the annotation task questions about the context sentence. [...]
Puns in Yang et al. (2015) were originally extracted from punsoftheday.com, and sentences without puns came from newswire and proverbs. The sentences are labeled as containing a pun or not. Puns in Miller et al. (2017) were sampled from prior pun detection datasets (Miller and Gurevych, 2015; Miller and Turković, 2016) and include new examples generated from scratch for the shared task; the original labels denote whether the sentences contain a homographic pun, a heterographic pun, or no pun at all. Here, we are only interested in whether a sentence contains a pun or not instead of discriminating between homographic and heterographic puns.

B.4 Relation Extraction
Since hypotheses were automatically generated from Wikipedia infoboxes, many examples are noisy and ungrammatical. We presented hypotheses (independent of their corresponding contexts) to Mechanical Turk workers and asked them to label each sentence as containing no grammatical error, minor grammatical issues, or major grammatical issues. We removed the 2,056 NLI examples whose hypotheses contained major grammatical issues, resulting in 28,041 labeled pairs. Interestingly, almost 70% of those examples were labeled between 1 and 4, which we view as NOT-ENTAILED. We release the ungrammatical NLI examples as supplementary data.
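The mapping from the ordinal annotation scale to binary NLI labels can be written down directly. The sketch below assumes integer scores from 1 to 5 on the Zhang et al. (2017) scale, with scores of 1 to 4 viewed as NOT-ENTAILED as stated above and 5 therefore treated as ENTAILED; the function name is hypothetical.

```python
def ordinal_to_binary(score):
    """Collapse a 1-5 ordinal entailment judgment into a binary NLI label."""
    if score not in (1, 2, 3, 4, 5):
        raise ValueError(f"expected an integer score from 1 to 5, got {score!r}")
    # Only the top of the scale counts as entailment.
    return "ENTAILED" if score == 5 else "NOT-ENTAILED"
```

Because only the highest score maps to ENTAILED, a label distribution in which most examples fall between 1 and 4 yields a NOT-ENTAILED majority class.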
A second source of noise in the recast relation extraction dataset can be caused by disagreement amongst multiple annotators. Examples in our training and development sets are annotated by a single annotator while we use 3- to 5-way redundancy [...]
[Table 4 fragment, Sentiment Analysis]
Context: When asked about the product, Liam said, "Don't waste your money"
Hypothesis: Liam did not like the product
Context: When asked about the movie, Angel said, "A bit predictable"
Hypothesis: Angel liked the movie

C Experimental Details
In all our experiments, we use pre-computed GloVe embeddings (Pennington et al., 2014) and use the OOV vector for words that do not have a defined embedding. We follow Conneau et al. (2017)'s procedure to train our models. During training, our models are optimized with SGD. Our initial learning rate is 0.1 with a decay rate of 0.99. Our models train for at most 20 epochs and can optionally terminate early when the learning rate is less than 10^-5. If the accuracy decreases on the development set in any epoch, the learning rate is