Adversarial NLI: A New Benchmark for Natural Language Understanding

We introduce a new large-scale NLI benchmark dataset, collected via an iterative, adversarial human-and-model-in-the-loop procedure. We show that training models on this new dataset leads to state-of-the-art performance on a variety of popular NLI benchmarks, while posing a more difficult challenge with its new test set. Our analysis sheds light on the shortcomings of current state-of-the-art models, and shows that non-expert annotators are successful at finding their weaknesses. The data collection method can be applied in a never-ending learning scenario, becoming a moving target for NLU, rather than a static benchmark that will quickly saturate.


Introduction
Progress in AI has been driven by, among other things, the development of challenging large-scale benchmarks like ImageNet (Russakovsky et al., 2015) in computer vision, and SNLI (Bowman et al., 2015), SQuAD (Rajpurkar et al., 2016), and others in natural language processing (NLP). Recently, for natural language understanding (NLU) in particular, the focus has shifted to combined benchmarks like SentEval (Conneau and Kiela, 2018) and GLUE (Wang et al., 2018), which track model performance on multiple tasks and provide a unified platform for analysis.
With the rapid pace of advancement in AI, however, NLU benchmarks struggle to keep up with model improvement. Whereas it took around 15 years to achieve "near-human performance" on MNIST (LeCun et al., 1998; Cireşan et al., 2012; Wan et al., 2013) and approximately 7 years to surpass humans on ImageNet (Deng et al., 2009; Russakovsky et al., 2015; He et al., 2016), the GLUE benchmark did not last as long as we would have hoped after the advent of BERT (Devlin et al., 2018), and rapidly had to be extended into SuperGLUE (Wang et al., 2019). This raises an important question: Can we collect a large benchmark dataset that can last longer?
The speed with which benchmarks become obsolete raises another important question: are current NLU models genuinely as good as their high performance on benchmarks suggests? A growing body of evidence shows that state-of-the-art models learn to exploit spurious statistical patterns in datasets (Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018; Glockner et al., 2018; Geva et al., 2019; McCoy et al., 2019), instead of learning meaning in the flexible and generalizable way that humans do. Given this, human annotators, be they seasoned NLP researchers or non-experts, might easily be able to construct examples that expose model brittleness.
We propose an iterative, adversarial human-and-model-in-the-loop solution for NLU dataset collection that addresses both benchmark longevity and robustness issues. In the first stage, human annotators devise examples that our current best models cannot determine the correct label for. The resulting hard examples, which should expose additional model weaknesses, can be added to the training set and used to train a stronger model. We then subject the strengthened model to the same procedure and collect weaknesses over several rounds. After each round, we train a new model and set aside a new test set. The process can be iteratively repeated in a never-ending learning (Mitchell et al., 2018) setting, with the model getting stronger and the test set getting harder in each new round. Thus, not only is the resultant dataset harder than existing benchmarks, but this process also yields a "moving post" dynamic target for NLU systems, rather than a static benchmark that will eventually saturate.
Our approach draws inspiration from recent efforts that gamify collaborative training of machine learning agents over multiple rounds (Yang et al., 2017) and pit "builders" against "breakers" to learn better models (Ettinger et al., 2017). Recently, Dinan et al. (2019) showed that such an approach can be used to make dialogue safety classifiers more robust. Here, we focus on natural language inference (NLI), arguably the most canonical task in NLU. We collected three rounds of data, and call our new dataset Adversarial NLI (ANLI).

Figure 1: The HAMLET procedure. Collection phase: Step 1, write examples; Step 2, get model feedback; Step 3, verify examples and make splits. Training phase: Step 4, retrain model for next round.
Our contributions are as follows: 1) We introduce a novel human-and-model-in-the-loop dataset, consisting of three rounds that progressively increase in difficulty and complexity, and that includes annotator-provided explanations. 2) We show that training models on this new dataset leads to state-of-the-art performance on a variety of popular NLI benchmarks. 3) We provide a detailed analysis of the collected data that sheds light on the shortcomings of current models, categorizes the data by inference type to examine weaknesses, and demonstrates good performance on NLI stress tests. The ANLI dataset is available at github.com/facebookresearch/anli/. A demo is available at adversarialnli.com.

Dataset collection
The primary aim of this work is to create a new large-scale NLI benchmark on which current state-of-the-art models fail. This constitutes a new target for the field to work towards, and can elucidate model capabilities and limitations. As noted, however, static benchmarks do not last very long these days. If continuously deployed, the data collection procedure we introduce here can pose a dynamic challenge that allows for never-ending learning.

HAMLET
To paraphrase the great bard (Shakespeare, 1603), there is something rotten in the state of the art. We propose Human-And-Model-in-the-Loop Enabled Training (HAMLET), a training procedure to automatically mitigate problems with current dataset collection procedures (see Figure 1).
In our setup, our starting point is a base model, trained on NLI data. Rather than employing automated adversarial methods, here the model's "adversary" is a human annotator. Given a context (also often called a "premise" in NLI) and a desired target label, we ask the human writer to provide a hypothesis that fools the model into predicting an incorrect label. One can think of the writer as a "white hat" hacker, trying to identify vulnerabilities in the system. For each human-generated example that is misclassified, we also ask the writer to provide a reason why they believe it was misclassified.
For examples that the model misclassified, it is necessary to verify that they are actually correct, i.e., that the given context-hypothesis pairs genuinely have their specified target label. The best way to do this is to have them checked by another human. Hence, we provide the example to human verifiers. If two human verifiers agree with the writer, the example is considered a good example. If they disagree, we ask a third human verifier to break the tie. If there is still disagreement between the writer and the verifiers, the example is discarded. If the verifiers disagree, they can overrule the original target label of the writer.

Once data collection for the current round is finished, we construct a new training set from the collected data, with accompanying development and test sets, which are constructed solely from verified correct examples. The test set was further restricted so as to: 1) include pairs from "exclusive" annotators who are never included in the training data; and 2) be balanced by label classes (and genres, where applicable). We subsequently train a new model on this and other existing data, and repeat the procedure.

Table 1 (excerpt): example contexts, hypotheses, and annotator-provided reasons.

Context: A melee weapon is any weapon used in direct hand-to-hand combat; by contrast with ranged weapons which act at a distance. The term "melee" originates in the 1640s from the French word "mêlée", which refers to hand-to-hand combat, a close quarters battle, a brawl, a confused fight, etc. Melee weapons can be broadly divided into three categories.
Hypothesis: Melee weapons are good for ranged and hand-to-hand combat.
Reason: Melee weapons are good for hand to hand combat, but NOT ranged.

Context: If you can dream it, you can achieve it, unless you're a goose trying to play a very human game of rugby. In the video above, one bold bird took a chance when it ran onto a rugby field mid-play. Things got dicey when it got into a tussle with another player, but it shook it off and kept right on running. After the play ended, the players escorted the feisty goose off the pitch. It was a risky move, but the crowd chanting its name was well worth it.
Hypothesis: The crowd believed they knew the name of the goose running on the field.
Reason: Because the crowd was chanting its name, the crowd must have believed they knew the goose's name. The word "believe" may have made the system think this was an ambiguous statement.
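The verification logic described above can be summarized in a short sketch. This is not the authors' released code; the example fields, the verify helper, and the three-verifier cap are illustrative assumptions based on the description in this section.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

@dataclass
class Example:
    context: str
    hypothesis: str
    writer_label: str                                   # label intended by the writer
    verifier_labels: List[str] = field(default_factory=list)

def verify(example: Example,
           verifiers: List[Callable[[str, str], str]],
           max_verifiers: int = 3) -> Tuple[str, Optional[str]]:
    """Collect verifier judgements one at a time: two verifiers agreeing with the
    writer accept the example, a third verifier breaks ties, agreement among
    verifiers against the writer overrules the label, anything else is discarded."""
    for verifier in verifiers[:max_verifiers]:
        example.verifier_labels.append(verifier(example.context, example.hypothesis))
        labels = example.verifier_labels
        if labels.count(example.writer_label) >= 2:
            return "accepted", example.writer_label
        for label in set(labels):
            if label != example.writer_label and labels.count(label) >= 2:
                return "overruled", label
    return "discarded", None
```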

Annotation details
We employed Mechanical Turk workers with qualifications and collected hypotheses via the ParlAI framework (https://parl.ai/). Annotators are presented with a context and a target label, either 'entailment', 'contradiction', or 'neutral', and asked to write a hypothesis that corresponds to the label. We phrase the label classes as "definitely correct", "definitely incorrect", or "neither definitely correct nor definitely incorrect" given the context, to make the task easier to grasp. Model predictions are obtained for the context and submitted hypothesis pair, and the probability of each label is shown to the worker as feedback. If the model prediction is incorrect, the job is complete. If not, the worker continues to write hypotheses for the given (context, target label) pair until the model predicts the label incorrectly or the number of tries exceeds a threshold (5 tries in the first round, 10 tries thereafter).
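A rough sketch of this per-example writing loop follows. The function names (model, get_hypothesis) and the probability-dictionary interface are hypothetical; only the retry budget (5 tries in round 1, 10 thereafter) comes from the text.

```python
def collect_example(context, target_label, model, get_hypothesis, max_tries=5):
    """One writer turn: the worker keeps submitting hypotheses, sees the model's
    per-label probabilities as feedback, and stops as soon as the model predicts a
    label other than the target, or when the retry budget runs out.

    `model(context, hypothesis)` is assumed to return a dict mapping each label to
    its probability; `get_hypothesis(feedback)` returns the worker's next attempt.
    """
    feedback = None
    for attempt in range(1, max_tries + 1):
        hypothesis = get_hypothesis(feedback)
        feedback = model(context, hypothesis)
        if max(feedback, key=feedback.get) != target_label:
            return hypothesis, attempt        # model fooled: send to verification
    return None, max_tries                    # budget exhausted without a model error
```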
To encourage workers, payments increased as rounds became harder. For hypotheses that the model predicted incorrectly, and that were verified by other humans, we paid an additional bonus on top of the standard rate.

Round 1
For the first round, we used a BERT-Large model (Devlin et al., 2018) trained on a concatenation of SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2017), and selected the best-performing model we could train as the starting point for our dataset collection procedure. For Round 1 contexts, we randomly sampled short multi-sentence passages from Wikipedia (of 250-600 characters) from the manually curated HotpotQA training set (Yang et al., 2018). Contexts are either ground-truth contexts from that dataset, or they are Wikipedia passages retrieved using TF-IDF (Chen et al., 2017) based on a HotpotQA question.
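As an illustration of the second way contexts were obtained, the sketch below retrieves a Wikipedia passage for a HotpotQA question using TF-IDF similarity. The actual pipeline follows Chen et al. (2017); the scikit-learn-based retrieval and the length filter shown here are simplified stand-ins.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def retrieve_context(question, passages, min_chars=250, max_chars=600):
    """Return the candidate passage most similar to the question under TF-IDF,
    restricted to the 250-600 character range used for Round 1 contexts."""
    candidates = [p for p in passages if min_chars <= len(p) <= max_chars]
    vectorizer = TfidfVectorizer(stop_words="english")
    matrix = vectorizer.fit_transform(candidates + [question])
    similarities = cosine_similarity(matrix[-1], matrix[:-1])[0]
    return candidates[similarities.argmax()]
```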

Round 2
For the second round, we used a more powerful RoBERTa model (Liu et al., 2019b) trained on SNLI, MNLI, an NLI version of FEVER (Thorne et al., 2018), and the training data from the previous round (A1). After a hyperparameter search, we selected the model with the best performance on the A1 development set. Then, using the hyperparameters selected from this search, we created a final set of models by training several models with different random seeds. During annotation, we constructed an ensemble by randomly picking a model from the model set as the adversary each turn. This helps us avoid annotators exploiting vulnerabilities in any single model. A new, non-overlapping set of contexts was again constructed from Wikipedia via HotpotQA using the same method as in Round 1.
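A minimal sketch of the per-turn adversary selection just described; the pool of seed-varied models and their (context, hypothesis) to label-probability interface are assumptions for illustration.

```python
import random

def model_feedback(context, hypothesis, model_pool, rng=random):
    """Sample one model from the seed-varied pool to act as this turn's adversary,
    so writers attack the model class rather than a single fixed set of weights."""
    adversary = rng.choice(model_pool)
    return adversary(context, hypothesis)   # per-label probabilities shown to the writer
```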

Round 3
For the third round, we selected a more diverse set of contexts, in order to explore robustness under domain transfer. In addition to contexts from Wikipedia for Round 3, we also included contexts from the following domains: News (extracted from Common Crawl), fiction (extracted from StoryCloze (Mostafazadeh et al., 2016) and CBT (Hill et al., 2015)), formal spoken text (excerpted from court and presidential debate transcripts in the Manually Annotated Sub-Corpus (MASC) of the Open American National Corpus, anc.org/data/masc/corpus/), and causal or procedural text, which describes sequences of events or actions, extracted from WikiHow. Finally, we also collected annotations using the longer contexts present in the GLUE RTE training data, which came from the RTE5 dataset (Bentivogli et al., 2009). We trained an even stronger RoBERTa ensemble by adding the training set from the second round (A2) to the training data.

Comparing with other datasets
The ANLI dataset, comprising three rounds, improves upon previous work in several ways. First, and most obviously, the dataset is collected to be more difficult than previous datasets, by design. Second, it remedies a problem with SNLI, namely that its contexts (or premises) are very short, because they were selected from the image captioning domain. We believe longer contexts should naturally lead to harder examples, and so we constructed ANLI contexts from longer, multi-sentence source material. Following previous observations that models might exploit spurious biases in NLI hypotheses (Gururangan et al., 2018; Poliak et al., 2018), we conduct a study of the performance of hypothesis-only models on our dataset. We show that such models perform poorly on our test sets.
With respect to data generation with naïve annotators, Geva et al. (2019) noted that models can pick up on annotator bias, modelling annotator artefacts rather than the intended reasoning phenomenon. To counter this, we selected a subset of annotators (i.e., the "exclusive" workers) whose data would only be included in the test set. This enables us to avoid overfitting to the writing style biases of particular annotators, and also to determine how much individual annotator bias is present for the main portion of the data. Examples from each round of dataset collection are provided in Table 1.
Furthermore, our dataset poses new challenges to the community that were less relevant for previous work, such as: Can we improve performance online without having to train a new model from scratch every round? How can we overcome catastrophic forgetting? How do we deal with mixed model biases? Because the training set includes examples that the model got right but that were not verified, learning from noisy and potentially unverified data becomes an additional interesting challenge.

Dataset statistics
The dataset statistics can be found in Table 2. The amount of collected data grows with each round, up to over 103k examples for Round 3. We collected more data for later rounds not only because that data is likely to be more interesting, but also simply because the base model is better, and so it took longer to collect good, verified correct examples of model vulnerabilities.
For each round, we report the model error rate, both on verified and unverified examples. The unverified model error rate captures the percentage of examples where the model disagreed with the writer's target label, but where we are not (yet) sure if the example is correct. The verified model error rate is the percentage of model errors from example pairs for which other annotators confirmed the correct label. Note that the error rate is a useful way to evaluate model quality: the lower the model error rate, assuming constant annotator quality and context difficulty, the better the model. We observe that model error rates decrease as we progress through rounds. In Round 3, where we included a more diverse range of contexts from various domains, the overall error rate went slightly up compared to the preceding round, but for Wikipedia contexts the error rate decreased substantially. While in the first round roughly 1 in every 5 examples was a verified model error, this quickly dropped over consecutive rounds, and the overall model error rate is now less than 1 in 10. On the one hand, this is impressive, and shows how far we have come with just three rounds. On the other hand, it shows that we still have a long way to go if even untrained annotators can fool ensembles of state-of-the-art models with relative ease.

Table 2 also reports the average number of "tries", i.e., attempts made for each context until a model error was found (or the number of possible tries was exceeded), and the average time this took (in seconds). Again, these metrics are useful for evaluating model quality: the average number of tries and the average time per verified error both go up with later rounds, demonstrating that the rounds are getting increasingly difficult. Further dataset statistics and inter-annotator agreement are reported in Appendix C.
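To make the two error-rate definitions concrete, a small helper along the following lines could compute them from the per-example collection logs; the field names are hypothetical.

```python
def error_rates(examples):
    """Verified and unverified model error rates, as defined above.

    Each example is a dict with boolean fields:
      'model_fooled' - the model disagreed with the writer's target label;
      'verified'     - other annotators confirmed the writer's label.
    """
    total = len(examples)
    fooled = [ex for ex in examples if ex["model_fooled"]]
    verified = [ex for ex in fooled if ex["verified"]]
    return {
        "unverified_error_rate": len(fooled) / total,
        "verified_error_rate": len(verified) / total,
    }
```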

Results

Table 3 reports the main results. In addition to BERT (Devlin et al., 2018) and RoBERTa (Liu et al., 2019b), we also include XLNet (Yang et al., 2019) as an example of a strong, but different, model architecture. We show test set performance on the ANLI test sets per round, the total ANLI test set, and the exclusive test subset (examples from test-set-exclusive workers). We also show accuracy on the SNLI test set and the MNLI development set (for the purpose of comparing between different model configurations across table rows). In what follows, we discuss our observations.

Base model performance is low. Notice that the base model for each round performs very poorly on that round's test set. This is the expected outcome: for round 1, the base model gets the entire test set wrong, by design. For rounds 2 and 3, we used an ensemble, so performance is not necessarily zero. However, as it turns out, performance still falls well below chance, indicating that workers did not find vulnerabilities specific to a single model, but generally applicable ones for that model class.

Rounds become increasingly more difficult. As already foreshadowed by the dataset statistics, round 3 is more difficult (yields lower performance) than round 2, and round 2 is more difficult than round 1. This is true for all model architectures.
Training on more rounds improves robustness. Generally, our results indicate that training on more rounds improves model performance. This is true for all model architectures. Simply training on more "normal NLI" data would not make a model robust to adversarial attacks, but our adversarially collected data actively helps mitigate them.

Continuously augmenting training data does not degrade performance. Even though the ANLI training data is different from SNLI and MNLI, adding it to the training set does not harm performance on those tasks. Our results (see also rows 2-3 of Table 6) suggest the method could successfully be applied for multiple additional rounds.
Exclusive test subset difference is small. We included an exclusive test subset (ANLI-E) with examples from annotators never seen in training, and find negligible differences, indicating that our models do not over-rely on individual annotators' writing styles.

The effectiveness of adversarial training
We examine the effectiveness of the adversarial training data in two ways. First, we sample from the respective datasets to ensure exactly equal amounts of training data. Table 5 shows that the adversarial data improves performance, including on SNLI and MNLI when we replace part of those datasets with the adversarial data. This suggests that the adversarial data is more data-efficient than "normally collected" data. Figure 2 shows that adversarial data collected in later rounds is of higher quality and more data-efficient. Second, we compared verified correct examples of model vulnerabilities (examples that the model got wrong and that were verified to be correct) to unverified ones. Figure 3 shows that the verified correct examples are much more valuable than the unverified examples, especially in the later rounds (where the latter drop to random).

Stress Test Results
We also test models on two recent hard NLI test sets: SNLI-Hard (Gururangan et al., 2018) and the NLI stress tests (Naik et al., 2018); the results are reported in Table 4. We observe that all our models outperform the models presented in the original papers for these common stress tests. The RoBERTa models perform best on SNLI-Hard and achieve accuracy levels in the high 80s on the 'antonym' (AT), 'numerical reasoning' (NR), 'length' (LN), and 'spelling error' (SE) sub-datasets, and show marked improvement on both 'negation' (NG) and 'word overlap' (WO).
Training on ANLI appears to be particularly useful for the AT, NR, NG and WO stress tests.

Hypothesis-only results
For SNLI and MNLI, concerns have been raised about the propensity of models to pick up on spurious artifacts that are present just in the hypotheses (Gururangan et al., 2018; Poliak et al., 2018). Here, we compare full models to models trained only on the hypothesis (marked H). Table 6 reports results on ANLI, as well as on SNLI and MNLI. The table shows that hypothesis-only models perform poorly on ANLI and obtain good performance on SNLI and MNLI. Hypothesis-only performance decreases over rounds for ANLI. (Obviously, without manual intervention, some bias remains in how people phrase hypotheses, e.g., contradictions might contain more negation, which explains why hypothesis-only models perform slightly above chance when trained on ANLI.)
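For concreteness, a hypothesis-only baseline simply discards the context before training. The sketch below uses a bag-of-words classifier purely for illustration; the hypothesis-only results in Table 6 come from the same transformer architectures as the full models, trained only on the hypothesis.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

def train_hypothesis_only(pairs, labels):
    """Hypothesis-only baseline: the context is dropped entirely, so anything the
    classifier learns must come from artifacts in the hypotheses themselves."""
    hypotheses = [hypothesis for _, hypothesis in pairs]   # ignore the context
    model = make_pipeline(
        CountVectorizer(ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(hypotheses, labels)
    return model
```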
We observe that in rounds 2 and 3, RoBERTa is not much better than hypothesis-only. This could mean two things: either the test data is very difficult, or the training data is not good. To rule out the latter, we trained RoBERTa only on ANLI (∼163k training examples): on MNLI, it matches BERT trained on the much larger, fully in-domain SNLI+MNLI combination (943k training examples), with both reaching ∼86 (the third row in Table 6). Hence, the training data is good, and the test sets are simply so difficult that state-of-the-art models cannot substantially outperform a hypothesis-only prior.

Linguistic analysis
We explore the types of inferences that fooled models by manually annotating 500 examples from each round's development set. A dynamically evolving dataset offers the unique opportunity to track how model error rates change over time.
Since each round's development set contains only verified examples, we can investigate two interesting questions: which types of inference do writers employ to fool the models, and are base models differentially sensitive to different types of reasoning?
The results are summarized in Table 7. We devised an inference ontology containing six types of inference: Numerical & Quantitative (e.g., reasoning about cardinal numbers, dates, and ages), Reference & Names (e.g., coreference), Standard inferences, Lexical inferences, Tricky inferences, and Reasoning from outside knowledge or additional facts (e.g., "You can't reach the sea directly from Rwanda"). The quality of annotations was also tracked; if a pair was ambiguous or a label debatable (from the expert annotator's perspective), it was flagged. Quality issues were rare, at 3-4% per round. Any one example can have multiple types, and every example had at least one tag.
We observe that both round 1 and 2 writers rely heavily on numerical and quantitative reasoning in over 30% of the development set-the percentage in A2 (32%) dropped roughly 6% from A1 (38%)-while round 3 writers use numerical or quantitative reasoning for only 17%. The majority of numerical reasoning types were references to cardinal numbers that referred to dates and ages. Inferences predicated on references and names were present in about 10% of rounds 1 & 3 development sets, and reached a high of 20% in round 2, with coreference featuring prominently. Standard inference types increased in prevalence as the rounds increased, ranging from 18%-27%, as did 'Lexical' inferences (increasing from 13%-31%). The percentage of sentences relying on reasoning and outside facts remains roughly the same, in the mid-50s, perhaps slightly increasing over the rounds. For round 3, we observe that the model used to collect it appears to be more susceptible to Standard, Lexical, and Tricky inference types. This finding is compatible with the idea that models trained on adversarial data perform better, since annotators seem to have been encouraged to devise more creative examples containing harder types of inference in order to stump them. Further analysis is provided in Appendix B.
Related work

Bias in datasets. Machine learning methods are well-known to pick up on spurious statistical patterns. For instance, in the first visual question answering dataset (Antol et al., 2015), biases like "2" being the correct answer to 39% of the questions starting with "how many" allowed learning algorithms to perform well while ignoring the visual modality altogether (Jabri et al., 2016). In question answering, Kaushik and Lipton (2018) showed that question- and passage-only models can perform surprisingly well, while Jia and Liang (2017) added adversarially constructed sentences to passages to cause a drastic drop in performance. Many tasks do not actually require sophisticated linguistic reasoning, as shown by the surprisingly good performance of random encoders (Wieting and Kiela, 2019). Similar observations were made in machine translation (Belinkov and Bisk, 2017) and dialogue (Sankar et al., 2019). Machine learning also has a tendency to overfit on static targets, even if that does not happen deliberately (Recht et al., 2018). In short, the field is rife with dataset bias and papers trying to address this important problem. This work presents a potential solution: if such biases exist, they will allow humans to fool the models, resulting in valuable training examples until the bias is mitigated.
Dynamic datasets. Bras et al. (2020) proposed AFLite, an approach for avoiding spurious biases through adversarial filtering, which is a model-in-the-loop approach that iteratively probes and improves models. Kaushik et al. (2019) offer a causal account of spurious patterns, and counterfactually augment NLI datasets by editing examples to break the model. That approach is human-in-the-loop, using humans to find problems with one single model. In this work, we employ both human and model-based strategies iteratively, in a form of human-and-model-in-the-loop training, to create completely new examples, in a potentially never-ending loop (Mitchell et al., 2018).
Human-and-model-in-the-loop training is not a new idea. Mechanical Turker Descent proposes a gamified environment for the collaborative training of grounded language learning agents over multiple rounds (Yang et al., 2017). The "Build it Break it Fix it" strategy from the security domain (Ruef et al., 2016) has likewise been adapted to NLP, pitting "builders" against "breakers" to learn more robust models (Ettinger et al., 2017). There has been a flurry of work in constructing datasets with an adversarial component, such as Swag (Zellers et al., 2018) and HellaSwag (Zellers et al., 2019), CODAH, Adversarial SQuAD (Jia and Liang, 2017), Lambada (Paperno et al., 2016), and others. Our dataset is not to be confused with abductive NLI (Bhagavatula et al., 2019), which calls itself αNLI, or ART.

Discussion & Conclusion
In this work, we used a human-and-model-in-the-loop training method to collect a new benchmark for natural language understanding. The benchmark is designed to be challenging to current state-of-the-art models. Annotators were employed to act as adversaries, and encouraged to find vulnerabilities that fool the model into misclassifying, but that another person would correctly classify. We found that non-expert annotators, in this gamified setting and with appropriate incentives, are remarkably creative at finding and exploiting weaknesses. We collected three rounds, and as the rounds progressed, the models became more robust and the test sets for each round became more difficult. Training on this new data yielded the state of the art on existing NLI benchmarks.
The ANLI benchmark presents a new challenge to the community. It was carefully constructed to mitigate issues with previous datasets, and was designed from first principles to last longer. The dataset also presents many opportunities for further study. For instance, we collected annotator-provided explanations for each example that the model got wrong. We provided inference labels for the development set, opening up possibilities for interesting, more fine-grained studies of NLI model performance. While we verified the development and test examples, we did not verify the correctness of each training example, which means there is probably some room for improvement there.
A concern might be that the static approach is probably cheaper, since dynamic adversarial data collection requires a verification step to ensure examples are correct. However, verifying examples is probably also a good idea in the static case, and adversarially collected examples can still prove useful even if they didn't fool the model and weren't verified. Moreover, annotators were better incentivized to do a good job in the adversarial setting. Our finding that adversarial data is more data-efficient corroborates this theory. Future work could explore a detailed cost and time trade-off between adversarial and static collection.
It is important to note that our approach is model-agnostic. HAMLET was applied against an ensemble of models in rounds 2 and 3, and it would be straightforward to put more diverse ensembles in the loop to examine what happens when annotators are confronted with a wider variety of architectures.
The proposed procedure can be extended to other classification tasks, as well as to ranking with hard negatives either generated (by adversarial models) or retrieved and verified by humans. It is less clear how the method can be applied in generative cases.
Adversarial NLI is meant to be a challenge for measuring NLU progress, even for as yet undiscovered models and architectures. Luckily, if the benchmark does turn out to saturate quickly, we will always be able to collect a new round.

A Performance on challenge datasets
Recently, several hard test sets have been made available for revealing the biases NLI models learn from their training datasets (Nie and Bansal, 2017; McCoy et al., 2019; Gururangan et al., 2018; Naik et al., 2018). We examine model performance on two of these: the SNLI-Hard (Gururangan et al., 2018) test set, which consists of examples that hypothesis-only models label incorrectly, and the NLI stress tests (Naik et al., 2018).

B Further linguistic analysis
We compare the incidence of linguistic phenomena in ANLI with extant popular NLI datasets to get an idea of what our dataset contains. We observe that the FEVER and SNLI datasets generally contain many fewer hard linguistic phenomena than MultiNLI and ANLI (see Table 8). ANLI and MultiNLI have roughly the same percentage of hypotheses that exceed twenty words in length, and/or contain negation (e.g., 'never', 'no'), tokens of 'or', and modals (e.g., 'must', 'can'). MultiNLI hypotheses generally contain more pronouns, quantifiers (e.g., 'many', 'every'), WH-words (e.g., 'who', 'why'), and tokens of 'and' than do their ANLI counterparts, although A3 reaches nearly the same percentage as MultiNLI for negation and modals. However, ANLI contains more cardinal numerals and time terms (such as 'before', 'month', and 'tomorrow') than MultiNLI. These differences might be due to the fact that the two datasets are constructed from different genres of text. Since A1 and A2 contexts are constructed from a single Wikipedia data source (i.e., HotpotQA data), and most Wikipedia articles include dates in the first line, annotators appear to prefer constructing hypotheses that highlight numerals and time terms, leading to their high incidence.
Focusing on ANLI more specifically, A1 has roughly the same incidence of most tags as A2 (i.e., within 2% of each other), which, again, accords with the fact that we used the same Wikipedia data source for A1 and A2 contexts. A3, however, has the highest incidence of every tag (except for numbers and time) in the ANLI dataset. This could be due to our sampling of A3 contexts from a wider range of genres, which likely affected how annotators chose to construct A3 hypotheses; this idea is supported by the fact that A3 contexts differ in tag percentage from A1 and A2 contexts as well.
The higher incidence of all tags in A3 is also interesting, because it could be taken as providing yet another piece of evidence that our HAMLET data collection procedure generates increasingly more difficult data as rounds progress.
C Dataset properties
Table 9 shows the label distribution. Figure 4 shows a histogram of the number of tries per good verified example across the three different rounds. Figure 5 shows the time taken per good verified example. Figure 6 shows a histogram of the number of tokens for contexts and hypotheses across the three rounds. Figure 7 shows the proportion of different types of collected examples across the three rounds.
Inter-annotator agreement. Table 10 reports the inter-annotator agreement for verifiers on the dev and test sets. For reference, the Fleiss' kappa of FEVER (Thorne et al., 2018) is 0.68 and that of SNLI (Bowman et al., 2015) is 0.70. Table 11 shows the percentage of agreement of verifiers with the intended author label.
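For reference, Fleiss' kappa over the verifier labels can be computed as below; the count-matrix input format is an assumption about how the labels would be tabulated.

```python
import numpy as np

def fleiss_kappa(label_counts):
    """Fleiss' kappa for inter-annotator agreement.

    `label_counts` is an (examples x labels) array where entry (i, j) is the number
    of verifiers who assigned label j to example i; every row must sum to the same
    number of raters.
    """
    counts = np.asarray(label_counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts.sum(axis=1)[0]
    p_label = counts.sum(axis=0) / (n_items * n_raters)                      # label marginals
    p_item = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_bar, p_expected = p_item.mean(), np.square(p_label).sum()
    return (p_bar - p_expected) / (1 - p_expected)
```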

D Examples
We include more examples of collected data in Table 12.

E User interface
Examples of the user interface are shown in Figures 8, 9 and 10.

Table 12: Extra examples from development sets. 'An' refers to round number (and genre), 'orig.' is the original annotator's gold label, 'pred.' is the model prediction, 'valid.' is the validator labels, 'reason' was provided by the original annotator, and 'Annotations' are the tags determined by the expert linguist annotator.

A2 (Wiki); orig. E, pred. N, valid. E, E; Annotations: Tricky Presupposition, Standard Negation
Context: Bernardo Provenzano (31 January 1933 - 13 July 2016) was a member of the Sicilian Mafia ("Cosa Nostra") and was suspected of having been the head of the Corleonesi, a Mafia faction that originated in the town of Corleone, and de facto "capo di tutti capi" (boss of all bosses) of the entire Sicilian Mafia until his arrest in 2006.
Hypothesis: It was never confirmed that Bernardo Provenzano was the leader of the Corleonesi.
Reason: Provenzano was only suspected as the leader of the mafia. It wasn't confirmed.

A2; orig. C, pred. N, valid. C, C; Annotations: Tricky Presupposition, Reasoning Facts
Context: HMAS "Lonsdale" is a former Royal Australian Navy (RAN) training base that was located at Beach Street, Port Melbourne, Victoria, Australia. Originally named "Cerberus III", the Naval Reserve Base was commissioned as HMAS "Lonsdale" on 1 August 1940 during the Second World War.
Hypothesis: Prior to being renamed, Lonsdale was located in Perth, Australia.
Reason: A naval base cannot be moved; based on the information in the scenario, the base has always been located in Victoria.

Context: Toolbox Murders is a 2004 horror film directed by Tobe Hooper, and written by Jace Anderson and Adam Gierasch. It is a remake of the 1978 film of the same name and was produced by the same people behind the original. The film centralizes on the occupants of an apartment who are stalked and murdered by a masked killer.

A3 (News); orig. N, pred. E, valid. N, C, N; Annotations: Reasoning Plausibility Likely, Tricky Presupposition
Context: "We had to make a decision between making payroll or paying the debt," Melton said Monday. "If we are unable to make payroll Oct. 19, we will definitely be able to make it next week Oct. 26 based on the nature of our sales taxes coming in at the end of the month. However we will have payroll the following week again on Nov. 2 and we are not sure we will be able to make that payroll because of the lack of revenue that is coming in."
Hypothesis: The company will not be able to make payroll on October 19th and will instead dispense it on October 26th.
Reason: It's not definitely correct nor definitely incorrect because the company said "if" they can't make it on the 19th they will do it on the 26th; they didn't definitely say they won't make it on the 19th.

A3 (Fiction); orig. C, pred. E, valid. C, C; Annotations: Tricky (Scalar Implicature)
Context: The Survey: Greg was answering questions. He had been asked to take a survey about his living arrangements. He gave all the information he felt comfortable sharing. Greg hoped the survey would improve things around his apartment. The complex had really gone downhill lately.
Hypothesis: He gave some of the information he felt comfortable sharing.
Reason: Greg gave all of the information he felt comfortable sharing, not some. It was difficult for the system because it couldn't tell a significant difference between the words "some" and "all."