Collecting Entailment Data for Pretraining: New Protocols and Negative Results

Natural language inference (NLI) data has proven useful in benchmarking and, especially, as pretraining data for tasks requiring language understanding. However, the crowdsourcing protocol that was used to collect this data has known issues and was not explicitly optimized for either of these purposes, so it is likely far from ideal. We propose four alternative protocols, each aimed at improving either the ease with which annotators can produce sound training examples or the quality and diversity of those examples. Using these alternatives and a fifth baseline protocol, we collect and compare five new 8.5k-example training sets. In evaluations focused on transfer learning applications, our results are solidly negative, with models trained on our baseline dataset yielding good transfer performance to downstream tasks, but none of our four new methods (nor the recent ANLI) showing any improvements over that baseline. In a small silver lining, we observe that all four new protocols, especially those where annotators edit pre-filled text boxes, reduce previously observed issues with annotation artifacts.


Introduction
The task of natural language inference (NLI; also known as textual entailment) has been widely used as an evaluation task when developing new methods for language understanding, and it has recently become clear that high-quality NLI data can be useful in transfer learning as well, driving much of the recent use of the task: several recent papers have shown that training large neural network models on NLI data, then fine-tuning them for other language understanding tasks, often yields substantially better results on those target tasks (Conneau et al., 2017; Subramanian et al., 2018). This result holds even when starting from large models like BERT (Devlin et al., 2019) that have already been pretrained extensively on unlabeled data (Phang et al., 2018; Clark et al., 2019; Liu et al., 2019b; Wang et al., 2019b).
The largest general-purpose corpus for NLI, and the one that has proven most successful in this setting, is the Multi-Genre NLI Corpus (MNLI; Williams et al., 2018). MNLI was designed informally for use in a benchmark task (with no consideration of transfer learning), and in any case, no explicit experimental research went into its design. Further, data collected under MNLI's data collection protocol has known issues with annotation artifacts, which make it possible to perform much better than chance using only one of the two sentences that make up each example (Tsuchiya, 2018; Gururangan et al., 2018; Poliak et al., 2018).
This work experimentally evaluates four potential changes to the original MNLI data collection protocol that are designed to improve either the ease with which annotators can produce sound training examples or the quality and diversity of those examples. We collect a baseline dataset of 8.5k examples that follows the MNLI protocol with our annotator pool, followed by four additional datasets of the same size which isolate each of our candidate changes. (See Figure 1 for a schematic.) We then compare all five in a set of experiments, focused on transfer learning, that look at our ability to use each of these datasets to improve performance on the eight downstream language understanding tasks in the SuperGLUE (Wang et al., 2019b) benchmark.
All five of our datasets are consistent with the task definition that was used in MNLI, which is in turn based on the definition introduced by Dagan et al. (2006). In this task, each example consists of a pair of short texts: a premise and a hypothesis. The model is asked to read both texts and make a three-way classification decision: Given the premise, would a reasonable person infer that the hypothesis must be true (entailment), that it must be false (contradiction), or that there is not enough information to judge (neutral)? While it is certainly not clear that this design is optimal for any application, we leave a broader exploration of task definitions for future work.
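For concreteness, a single example under this task definition can be represented as a small labeled record. The sentences below are invented illustrations, not items drawn from any of the datasets discussed here:

```python
from dataclasses import dataclass

@dataclass
class NLIExample:
    premise: str
    hypothesis: str
    label: str  # one of "entailment", "contradiction", "neutral"

# Three hypotheses written for a single premise, one per label,
# as in the MNLI-style writing task. (Invented example sentences.)
examples = [
    NLIExample("A dog is chasing a ball in the park.",
               "An animal is playing outside.", "entailment"),
    NLIExample("A dog is chasing a ball in the park.",
               "The dog is asleep indoors.", "contradiction"),
    NLIExample("A dog is chasing a ball in the park.",
               "The dog belongs to a child.", "neutral"),
]
```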
Our BASE data collection protocol follows MNLI closely in asking annotators to read a premise sentence and then write three corresponding hypothesis sentences in empty text boxes corresponding to the three different labels (entailment, contradiction, and neutral). When an annotator follows this protocol, they produce three sentence pairs at once, all sharing a single premise.
Our PARAGRAPH protocol tests the effect of supplying annotators with complete paragraphs, rather than sentences, as premises. Longer texts offer the potential for discourse-level inferences, the addition of which should yield a dataset that is more difficult, more diverse, and less likely to contain trivial artifacts. However, one might expect that asking annotators to read full paragraphs would increase the time required to create a single example, time that could potentially be better spent creating more examples.
Our EDITPREMISE and EDITOTHER protocols test the effect of pre-filling a single seed text in each of the three text boxes that annotators are asked to fill out. By reducing the raw amount of typing required, this could allow annotators to produce good examples more quickly. By encouraging them to keep the three sentences similar, it could also indirectly facilitate the construction of minimal-pair-like examples that minimize artifacts, in the style of Kaushik et al. (2020). We test two variants of this idea: one uses a copy of the premise sentence as the seed text; the second retrieves a new sentence from an existing corpus that is similar to the premise sentence and uses that.
Our CONTRAST protocol tests the effect of adding artificial constraints on the kinds of hypothesis sentences annotators can write. Giving annotators difficult and varying constraints could encourage creativity and prevent annotators from falling into patterns in their writing that lead to easier or more repetitive data. However, as with the use of longer contexts in PARAGRAPH, this protocol risks substantially slowing the annotation process. We experiment with a procedure inspired by that used to create the language-and-vision dataset NLVR2 (Suhr et al., 2019), in which annotators must write sentences that show some specified relationship (entailment or contradiction) to a given premise, but do not show that relationship to a second similar distractor premise.
Because we see transfer learning as the primary application area for which it would be valuable to collect additional large-scale NLI datasets, we focus our evaluation on this setting, and do not collect or designate test sets for the experimental datasets we collect. In transfer evaluations on the SuperGLUE benchmark (Wang et al., 2019b), our BASE dataset and the datasets collected under our four new protocols offer substantial improvements in transfer ability over a plain RoBERTa or XLNet model, comparable to the gains seen with an equally-sized sample of MNLI. However, BASE reliably shows the strongest transfer results. This finding, combined with a low variance across runs, strongly suggests that none of these four interventions improves the suitability of NLI data for transfer learning. We also observe that BASE, PARAGRAPH, EDITPREMISE, and EDITOTHER all require very similar amounts of annotator time, reducing the potential downside of PARAGRAPH, but also invalidating the primary motivation behind EDITPREMISE and EDITOTHER. While our primary results are negative, we also observe that all four of these methods produce data of comparable subjective quality to BASE while significantly reducing the incidence of previously reported annotation artifacts.

Related Work
Existing NLI datasets have been built using a wide range of strategies: FraCaS (Cooper et al., 1996) and several targeted evaluation sets were constructed manually by experts from scratch. The RTE challenge corpora (Dagan et al., 2006, et seq.) primarily used expert annotations on top of existing premise sentences. SICK (Marelli et al., 2014) was created using a structured pipeline centered on asking crowdworkers to edit sentences in prescribed ways. MPE (Lai et al., 2017) uses a similar strategy, but constructs unordered sets of sentences for use as premises. SNLI (Bowman et al., 2015) introduced the method, also used in MNLI, of asking crowdworkers to compose labeled hypotheses for a given premise. SciTail (Khot et al., 2018) and SWAG (Zellers et al., 2018) used domain-specific resources to pair up existing sentences as potential entailment pairs for annotation, with SWAG additionally using trained models to select the examples most worth annotating. There has been little work directly evaluating and comparing these many methods. In the absence of such comparisons, we focus on the SNLI/MNLI approach, because it has been shown to be effective for the collection of pretraining data and because its reliance on only crowdworkers and unstructured source text makes it simple to scale.
Two recent papers have investigated methods that could augment the base MNLI protocol we study here. ANLI (Nie et al., 2020) collects new examples following this protocol, but adds an incentive for crowdworkers to produce sentence pairs on which a baseline system will perform poorly. Kaushik et al. (2020) introduce a method for expanding an already-collected dataset by making minimal edits to existing examples that change their labels, with the intent to better teach models to isolate the factors that are causally responsible for the label assignments. Both of these papers offer methodological changes that are potentially complementary to the changes we investigate here, and neither evaluates the impact of their methods on transfer learning. Since ANLI is large and roughly comparable with MNLI, we include it in our transfer evaluations here.
In addition to NLVR2 (which motivated our CONTRAST protocol), WinoGrande (Sakaguchi et al., 2019) also showed promising results from the use of artificial constraints during the annotation process for another style of dataset.
The observation that NLI data can be effective in pretraining was first reported for SNLI and MNLI by Conneau et al. (2017) on models pretrained from scratch on NLI data. This finding was replicated in the setting of multi-task pretraining by Subramanian et al. (2018). It was later extended to the context of intermediate training, where a model is pretrained on unlabeled data, then on relatively abundant labeled data (MNLI), and finally on scarce task-specific labeled data; this setting has proven especially effective for target tasks centered on common sense (Sap et al., 2019; Bhagavatula et al., 2020). Another related body of work (Mou et al., 2016; Bingel and Søgaard, 2017; Wang et al., 2019a; Pruksachatkun et al., 2020) has explored the broader empirical landscape of which supervised NLP tasks can offer effective pretraining for other supervised NLP tasks.

Data Collection
The annotation interface for our tasks is similar to that used for SNLI and MNLI: we provide a premise from a preexisting text source and ask human annotators to provide three hypothesis sentences: one that says something true about the fact or situation in the prompt (entailment); one that may or may not say something true about the fact or situation in the prompt (neutral), with the additional instruction that this sentence should discuss the same topic as the prompt but could be either true or false because the prompt does not provide enough information to be sure; and one that definitely does not say something true about the fact or situation in the prompt (contradiction).
We evaluate five variants of this interface: BASE We show annotators a premise sentence and ask them to compose one new sentence for each label.
PARAGRAPH We use the same instructions as BASE, but with full paragraphs, rather than single sentences, as the supplied premises.
EDITPREMISE We pre-fill three text boxes with editable copies of the premise sentence, and ask annotators to edit each text field to compose sentences that match the three different labels. Annotators may delete the pre-filled text.
EDITOTHER We follow the same procedure as EDITPREMISE, but rather than pre-filling the premise as a seed sentence, we instead use a similarity search method to retrieve a new sentence that is similar to the premise.
CONTRAST We again retrieve a second sentence that is similar to the premise, but we display it as a contrasting premise rather than using it to seed an editable text box. We then ask annotators to compose two new sentences: One sentence must be true only about the fact or situation in the first premise (that is, contradictory or neutral with respect to the contrasting premise). The other sentence must be false only about the fact or situation in the first premise (and true or neutral with respect to the contrasting premise). This yields an entailment pair and a contradiction pair, both of which use only the first premises, with the contrasting premise serving only as a constraint on the annotation process. We could not find a sufficiently intuitive way to collect neutral sentence pairs under this protocol and opted to use only two classes rather than increase the difficulty of an already unintuitive task.
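The labeling constraint in CONTRAST can be stated as a simple predicate. The function below is our own illustrative formalization of that constraint, not code from the actual annotation interface:

```python
def satisfies_contrast(target, label_vs_premise, label_vs_distractor):
    """Check the CONTRAST constraint for one composed hypothesis.

    target: the relation the hypothesis is meant to hold against the
        first premise ("entailment" or "contradiction"; CONTRAST
        collects no neutral pairs).
    label_vs_premise: the hypothesis's actual relation to the first premise.
    label_vs_distractor: its relation to the second, similar premise,
        which must NOT be the targeted relation.
    """
    if target not in ("entailment", "contradiction"):
        raise ValueError("CONTRAST collects only two classes")
    # The hypothesis must show the target relation to the first premise
    # and any OTHER relation (including neutral) to the distractor.
    return (label_vs_premise == target
            and label_vs_distractor != target)
```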

Text Source
MNLI uses the small but stylistically diverse OpenANC corpus (Ide and Suderman, 2006) as its source for premise sentences, but uses nearly every available sentence from its non-technical sections, making it impractical for our use. To avoid re-using premise sentences, we instead draw on English Wikipedia. 1

Similarity Search
The EDITOTHER and CONTRAST protocols require pairs of similar sentences as their inputs. To construct these, we assemble a heuristic sentence-matching system intended to generate pairs of highly similar sentences that can be minimally edited to construct entailments or contradictions: Given a premise, we retrieve its 10k nearest neighbors according to dot-product similarity over Universal Sentence Encoder (Cer et al., 2018) embeddings. Using a parser and an NER system, we then select those neighbors which share a subject noun phrase with the premise (dropping premises for which no such neighbors exist). From those filtered neighbors, we retrieve the single non-identical neighbor that has the highest overlap with the premise in both raw tokens and entity mentions, preferring sentences of similar length. 2

[Table 2: Key text statistics. Premises are drawn from essentially the same distribution in all our tasks except PARAGRAPH, so are shown only once. The Unique column shows the number of tokens that appear in a hypothesis but not in the corresponding premise.]
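The retrieval heuristic can be sketched as follows. This is a simplified stand-in for the actual system: the Universal Sentence Encoder, parser, and NER model are replaced here by precomputed per-sentence embeddings and token/subject/entity sets, and the overlap score is our own illustrative combination:

```python
import numpy as np

def nearest_neighbors(query_vec, corpus_vecs, k=10000):
    # Dot-product similarity of the query against every candidate
    # sentence embedding; return indices of the k best matches.
    scores = corpus_vecs @ query_vec
    return np.argsort(-scores)[:k]

def pick_partner(premise_idx, embeddings, subjects, tokens, entities):
    """Choose a seed/contrasting sentence for one premise.

    subjects / tokens / entities: per-sentence sets assumed to be
    extracted offline (stand-ins for the parser and NER systems).
    Returns None when the premise should be dropped.
    """
    neighbors = nearest_neighbors(embeddings[premise_idx], embeddings)
    # Keep non-identical neighbors sharing a subject noun phrase.
    shared = [i for i in neighbors
              if i != premise_idx and subjects[i] & subjects[premise_idx]]
    if not shared:
        return None
    # Prefer the neighbor with the highest token + entity overlap.
    def overlap(i):
        return (len(tokens[i] & tokens[premise_idx])
                + len(entities[i] & entities[premise_idx]))
    return max(shared, key=overlap)
```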

The Annotation Process
We start data collection for each protocol with a pilot of 100 items, which are not included in the final datasets. We use these to refine task instructions and to provide feedback to our annotator pool on the intended task definition. We continue to provide regular feedback throughout the annotation process to clarify ambiguities in the protocols and to discourage the use of systematic patterns, such as consistently composing shorter hypotheses for entailments than for contradictions, that could make the resulting data artificially easy.
Annotators are allowed to skip prompts which they deem unusable for any reason. These generally involve either non-sentence strings that were mishandled by our sentence tokenizer or premises with inaccessible technical language. Skip rates ranged from about 2.5% for EDITOTHER to about 10% for CONTRAST (which can only be completed when the two premises are both comprehensible and sufficiently different from one another).
A pool of 19 professional annotators located in the United States worked on our tasks, with about ten working on each protocol. As a consequence of this relatively small annotation team, many annotators worked under more than one protocol, which we ran consecutively. This introduces a modest potential bias against BASE, in that annotators start the later tasks having seen somewhat more feedback.
Because of our focus on collecting training data for transfer learning applications, we do not use any kind of second-pass annotation process for quality control, and we neither designate a test set nor recommend the use of our released datasets for system evaluation. We aim to use our limited annotation time budget to collect the largest and best possible sample of (pre)training data, and we are motivated by work like Khetan et al. (2018) which calls into question the value of second-pass quality-control annotations for training data.

The Resulting Data
Using each protocol, we collect a training set of exactly 8,500 examples and a small validation set of at least 300 examples. 3 Table 1 shows examples.
Hypotheses are mostly fluent, full sentences that adhere to writing conventions for US English. In constructing hypotheses, annotators often reuse words or phrases from the premise, but rearrange them, alter their inflectional forms, or substitute synonyms or antonyms. Hypotheses tend to differ from premises both grammatically and stylistically. Table 2 shows some statistics for the collected text. The two methods that use seed sentences tend to yield longer hypotheses and tend not to show a clear relationship between hypothesis-premise token overlap and label. CONTRAST tends to produce shorter hypotheses.
Time Cost Annotators completed each of the five protocols at a similar rate, taking 3-4 minutes per prompt. This goes against our expectations that the longer premises in PARAGRAPH would substantially slow the annotation process, and that the pre-filled text in EDITPREMISE and EDITOTHER would speed annotation. Since the relatively complex CONTRAST produces only two sentence pairs per prompt rather than three, it yields fewer examples per minute.

Label-Word Associations

Table 3 shows the four words in each dataset that are most predictive of example labels, using the smoothed PMI method of Gururangan et al. (2018). We also include results for two baselines: 8.5k-example samples from MNLI and from MNLI's government documents single-genre section, which is meant to be maximally comparable to the single-genre datasets we collect. BASE shows similar associations to MNLI, but all four of our interventions reduce these associations at least slightly. The use of seed sentences, especially in EDITPREMISE, largely eliminates the strong association between negation and contradiction seen in MNLI, and no new strong associations appear to take its place.
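The label-word association statistic can be computed roughly as in the sketch below. The normalization and the smoothing constant here are our illustrative assumptions; Gururangan et al. (2018) describe additive smoothing of the counts, and their exact formulation may differ in detail:

```python
import math
from collections import Counter

def smoothed_pmi(examples, smoothing=100):
    """PMI(word, label) with additive smoothing on the joint counts.

    examples: iterable of (hypothesis_tokens, label) pairs.
    Returns {(word, label): pmi}; higher values indicate words that
    are more predictive of a label. The smoothing constant dampens
    scores for rare words (an assumption for illustration).
    """
    word_counts = Counter()
    label_counts = Counter()
    joint_counts = Counter()
    total = 0
    for tokens, label in examples:
        label_counts[label] += len(tokens)
        for w in tokens:
            word_counts[w] += 1
            joint_counts[(w, label)] += 1
            total += 1
    pmi = {}
    for (w, y), c in joint_counts.items():
        p_joint = (c + smoothing) / (total + smoothing)
        p_word = word_counts[w] / total
        p_label = label_counts[y] / total
        pmi[(w, y)] = math.log(p_joint / (p_word * p_label))
    return pmi
```

Reporting the top few words per label by this score reproduces the kind of artifact analysis shown in Table 3 (e.g., negation words scoring highly for contradiction).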

Modeling Experiments
We run three types of machine learning experiments: sanity-check experiments where we train and test on the NLI task (both in a standard setting and in a hypothesis-only limited-input setting that measures relevant annotation artifacts), and our primary evaluation experiments in which we train models on NLI before evaluating them on other tasks through transfer learning.
These experiments generally compare models trained on ten NLI datasets: each of the five 8.5k-example training sets introduced in this paper; 8.5k-example samples of MNLI, of MNLI's government documents section, and of ANLI; and the full MNLI and combined ANLI training sets. All experiments use RoBERTa and XLNet (Yang et al., 2019). RoBERTa was at or near the state of the art on most of our target tasks as of the launch of our experiments. XLNet is competitive with RoBERTa on most tasks and offers a natural replication; because of its substantially different design, it also mitigates issues with evaluating ANLI that arise because ANLI was collected with a model-in-the-loop procedure using RoBERTa.
We run our experiments using jiant 1.2 (Wang et al., 2019d), which implements the SuperGLUE tasks, MNLI, and ANLI, and in turn builds on transformers (Wolf et al., 2019), AllenNLP (Gardner et al., 2017), and PyTorch (Paszke et al., 2017). To make it possible to train these large models on single consumer GPUs, we use small-batch (b = 4) training and a maximum total sequence length of 128 word pieces, cut to a slightly lower number on a few individual runs as needed to satisfy memory constraints; note that this potentially limits the gains observable for PARAGRAPH, which has a longer mean premise length of 66.7 words. We train for up to 2 epochs for the very large ReCoRD, 10 epochs for the very small CB, COPA, and WSC, and 4 epochs for the remaining tasks. Except where noted, all results reflect the median final performance from three random restarts of training. Scripts implementing our experiments are available at https://github.com/nyu-mll/jiant/tree/nli-data.

Direct NLI Evaluations As a preliminary sanity check, Table 4 shows the results of evaluating models trained in each of the settings described above on their own validation sets, on the MNLI validation set, and on the expert-constructed GLUE diagnostic set (Wang et al., 2019c). As NLI classifiers trained on CONTRAST cannot produce the neutral labels used in MNLI, we evaluate them separately and compare them with two-class variants of the MNLI models.

Our BASE data yields a model that performs somewhat worse than a comparable MNLI Gov. 8.5k model, both on the full MNLI validation set and on the GLUE diagnostic set. This suggests, at least tentatively, that the new annotations are significantly less consistent with the MNLI labeling standard. This is disconcerting, but it does not interfere with our key comparisons. Precise comparisons between MNLI and our new data on in-domain test sets are not possible, since only MNLI has in-domain evaluation data that has undergone substantial quality control. The main conclusion we draw from these results is that none of the first three interventions improves performance on the out-of-domain GLUE diagnostic set, suggesting that they do not help in the collection of high-quality training data that is consistent with the MNLI label definitions. We also observe that the newer ANLI data yields worse performance than MNLI on the out-of-domain evaluation data when we control for dataset size.

Hypothesis-Only Models
To further investigate the degree to which our hypotheses contain artifacts that reveal their labels, Table 5 shows results with single-input versions of our models trained on hypothesis-only versions of the datasets under study and evaluated on the datasets' validation sets.
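As a minimal illustration of the hypothesis-only setting, the toy classifier below sees only hypothesis tokens and no premise; it is a hand-rolled Naive Bayes stand-in for the single-input neural models actually used, with invented data, meant only to show why any above-chance accuracy in this setting signals artifacts:

```python
import math
from collections import Counter, defaultdict

class HypothesisOnlyNB:
    """Tiny Naive Bayes over hypothesis tokens alone. If this model
    beats chance, the hypotheses leak their labels (artifacts)."""

    def fit(self, hypotheses, labels):
        self.label_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, y in zip(hypotheses, labels):
            for w in tokens:
                self.word_counts[y][w] += 1
                self.vocab.add(w)
        return self

    def predict(self, tokens):
        # Argmax over labels of log P(label) + sum log P(word | label),
        # with add-one smoothing over the vocabulary.
        best, best_lp = None, -math.inf
        n = sum(self.label_counts.values())
        for y, cy in self.label_counts.items():
            lp = math.log(cy / n)
            denom = sum(self.word_counts[y].values()) + len(self.vocab)
            for w in tokens:
                lp += math.log((self.word_counts[y][w] + 1) / denom)
            if lp > best_lp:
                best, best_lp = y, lp
        return best
```

On data with a negation-contradiction artifact, such a model picks up the cue immediately, which is exactly the behavior the seed-sentence protocols are designed to suppress.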
Our first three interventions, especially EDITPREMISE, show much lower hypothesis-only performance than BASE. This drop is much larger than the drop seen in our standard NLI experiments in the Self column of Table 4, which indicates that these results cannot be explained away as a consequence of the lower label consistency of the evaluation sets for these three new datasets. This adds further support to the conclusion that these protocols reduce annotation artifacts.

Transfer Evaluations For our primary evaluation, we use the training sets from our datasets in STILTs-style intermediate training (Phang et al., 2018): we fine-tune a large pretrained model on our collected data using standard fine-tuning procedures, then fine-tune copies of the resulting model again on each of the target task datasets we use. We then measure the aggregate performance of the resulting models across those evaluation datasets. We evaluate on the target tasks in the SuperGLUE benchmark (Wang et al., 2019b), which consists of standardized splits and metrics for the question answering tasks BoolQ (Clark et al., 2019), MultiRC (Khashabi et al., 2018), and ReCoRD (Zhang et al., 2018); the entailment and reasoning tasks CommitmentBank (CB; De Marneffe et al., 2019), Choice of Plausible Alternatives (COPA; Roemmele et al., 2011), Recognizing Textual Entailment (RTE; Bar Haim et al., 2006; Giampiccolo et al., 2007; Bentivogli et al., 2009), and the Winograd Schema Challenge (WSC; Levesque et al., 2012); and the word sense disambiguation task WiC (Pilehvar and Camacho-Collados, 2019). These tasks were selected to be difficult for BERT but relatively easy for crowdworkers, and are meant to replace the largely solved GLUE benchmark (Wang et al., 2019c).
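The intermediate-training recipe can be sketched framework-agnostically as below. The epoch counts and the `update` callback are placeholders for illustration, not the paper's actual training configuration:

```python
import copy

def finetune(model, dataset, epochs, update):
    # Generic training loop; `update` applies one gradient step
    # (a placeholder for a real optimizer step).
    for _ in range(epochs):
        for batch in dataset:
            update(model, batch)
    return model

def stilts_pipeline(pretrained, nli_data, target_tasks, update):
    """STILTs-style intermediate training (Phang et al., 2018):
    fine-tune once on NLI, then fine-tune a fresh copy of the
    resulting intermediate model on each target task."""
    intermediate = finetune(copy.deepcopy(pretrained), nli_data,
                            epochs=2, update=update)
    return {name: finetune(copy.deepcopy(intermediate), data,
                           epochs=4, update=update)
            for name, data in target_tasks.items()}
```

Copying the intermediate model per target task is the key design point: each target task starts from the same NLI-tuned checkpoint, so the aggregate target-task scores isolate the value of the NLI data.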
SuperGLUE does not include labeled test data, and does not allow for substantial ablation analyses on its test sets. Since we have no single final model whose performance we aim to show off, we do not use the test sets. We train our WSC model in the standard way, without adding data or modifying the format as in Kocijan et al. (2019) or Liu et al. (2019b). Without these modifications, few of our models exceed chance accuracy.
Results are shown in Table 6. Our first observation is that our overall data collection pipeline worked well for our purposes: Our BASE data yields models that transfer substantially better than the plain RoBERTa or XLNet baseline, and at least slightly better than 8.5k-example samples of MNLI, MNLI Government or ANLI. However, all four of our interventions yield worse transfer performance than BASE. The variances across runs are small, and this pattern is consistent across both RoBERTa and XLNet, and across most individual target tasks. We believe that this is a genuine negative result: At least under the broad experimental setting outlined here, we find that none of these four interventions is helpful for transfer learning.
We chose to collect 8,500-example samples because of the prior observation that this approximate amount was sufficient to show clear results on transfer learning, and we reproduce that finding here: Both MNLI 8.5k and the BASE dataset yield large improvements over plain RoBERTa or XLNet through transfer learning. If any of our interventions were to be helpful in general, we would expect them to be harmless or helpful in our regime relative to BASE. This is not what we observe.
We believe that this is the first study to evaluate ANLI as a pretraining task in transfer learning, and we observe that the large combined ANLI training set yields consistently better transfer than the original MNLI dataset. However, we observe (to our surprise) that this result reverses when we control for ANLI's larger size, with an 8.5k-example sample of MNLI yielding consistently better performance than an equivalently small sample of ANLI.
Our best overall result uses only 8.5k NLI training examples, suggesting either that this size is enough to maximize the gains available through NLI pretraining, or that the potential for models to forget skills learned in pretraining makes using larger intermediate datasets more challenging.
Finally, we replicate the finding from Phang et al. (2018) that intermediate-task training with NLI data substantially reduces the variance across restarts seen in target task tuning.

[Table 6: Model performance on the SuperGLUE task validation sets. The Avg. column shows the overall SuperGLUE score, an average across the eight tasks, as a mean and standard deviation across three restarts.]

Conclusion
Our chief results on transfer learning are conclusively negative: All four interventions yield substantially worse transfer performance than our base MNLI data collection protocol. However, we also observe promising signs that all four of our interventions help to reduce the prevalence of artifacts in the generated hypotheses that reveal the label. While these interventions may be helpful for future evaluation data, it appears that the type of creativity induced by our relatively open-ended BASE prompt works well for pretraining, and the resulting artifacts are a tolerable side-effect of that creativity. The need and opportunity that motivated this work remain compelling: Human-annotated data like MNLI has already proven itself as a valuable tool in teaching machines general-purpose skills for language understanding, and discovering ways to more effectively build and use such data could further accelerate the field's already fast progress toward robust, general-purpose language understanding technologies.
On another note, most available text corpora, including our Wikipedia source text and comparable past NLI datasets, contain evidence of social inequalities and stereotypes, which models can easily learn to reproduce (Wagner et al., 2015;Rudinger et al., 2017). Our interventions are not meant to address this, and are likely orthogonal. Bias mitigation in models and datasets remains a crucial direction for future work if systems based on datasets like the ones we study are to be widely deployed.
Beyond this: work on incentive structures and task design could facilitate the creation of crowdsourced datasets that are both creative and consistently labeled. Machine learning work on transfer learning could help us better understand and exploit the effects that drive the successes we have seen with NLI data so far. Finally, there remains room for further empirical work investigating the kinds of task definitions and data collection protocols most likely to yield training data that teaches models transferable skills.