Dynabench: Rethinking Benchmarking in NLP

We introduce Dynabench, an open-source platform for dynamic dataset creation and model benchmarking. Dynabench runs in a web browser and supports human-and-model-in-the-loop dataset creation: annotators seek to create examples that a target model will misclassify, but that another person will not. In this paper, we argue that Dynabench addresses a critical need in our community: contemporary models quickly achieve outstanding performance on benchmark tasks but nonetheless fail on simple challenge examples and falter in real-world scenarios. With Dynabench, dataset creation, model development, and model assessment can directly inform each other, leading to more robust and informative benchmarks. We report on four initial NLP tasks, illustrating these concepts and highlighting the promise of the platform, and address potential objections to dynamic benchmarking as a new standard for the field.


Introduction
While it used to take decades for machine learning models to surpass estimates of human performance on benchmark tasks, that milestone is now routinely reached within just a few years for newer datasets (see Figure 1). As with the rest of AI, NLP has advanced rapidly thanks to improvements in computational power, as well as algorithmic breakthroughs, ranging from attention mechanisms (Bahdanau et al., 2014; Luong et al., 2015), to Transformers (Vaswani et al., 2017), to pre-trained language models (Howard and Ruder, 2018; Devlin et al., 2019; Liu et al., 2019b; Radford et al., 2019; Brown et al., 2020). Equally important has been the rise of benchmarks that support the development of ambitious new data-driven models and that encourage apples-to-apples model comparisons. Benchmarks provide a north star goal for researchers, and are part of the reason we can confidently say we have made great strides in our field. In light of these developments, one might be forgiven for thinking that NLP has created models with human-like language capabilities. Practitioners know that, despite our progress, we are actually far from this goal. Models that achieve super-human performance on benchmark tasks (according to the narrow criteria used to define human performance) nonetheless fail on simple challenge examples and falter in real-world scenarios. A substantial part of the problem is that our benchmark tasks are not adequate proxies for the sophisticated and wide-ranging capabilities we are targeting: they contain inadvertent and unwanted statistical and social biases that make them artificially easy and misaligned with our true goals.
We believe the time is ripe to radically rethink benchmarking. In this paper, which both takes a position and seeks to offer a partial solution, we introduce Dynabench, an open-source, web-based research platform for dynamic data collection and model benchmarking. The guiding hypothesis behind Dynabench is that we can make even faster progress if we evaluate models and collect data dynamically, with humans and models in the loop, rather than in the traditional static way.
Concretely, Dynabench hosts tasks for which we dynamically collect data against state-of-the-art models in the loop, over multiple rounds. The stronger the models are and the fewer weaknesses they have, the lower their error rate will be when interacting with humans, giving us a concrete metric, i.e., how well do AI systems perform when interacting with humans? This reveals the shortcomings of state-of-the-art models, and it yields valuable training and assessment data which the community can use to develop even stronger models.
In this paper, we first document the background that led us to propose this platform. We then describe the platform in technical detail, report on findings for four initial tasks, and address possible objections. We finish with a discussion of future plans and next steps.

Background
Progress in NLP has traditionally been measured through a selection of task-level datasets that gradually became accepted benchmarks (Marcus et al., 1993; Pradhan et al., 2012). Recent well-known examples include the Stanford Sentiment Treebank (Socher et al., 2013), SQuAD (Rajpurkar et al., 2016, 2018), SNLI (Bowman et al., 2015), and MultiNLI (Williams et al., 2018). More recently, multi-task benchmarks such as SentEval (Conneau and Kiela, 2018), DecaNLP (McCann et al., 2018), GLUE (Wang et al., 2018), and SuperGLUE were proposed with the aim of measuring general progress across several tasks. When the GLUE dataset was introduced, "solving GLUE" was deemed "beyond the capability of current transfer learning methods" (Wang et al., 2018). However, GLUE saturated within a year and its successor, SuperGLUE, already has models rather than humans at the top of its leaderboard. These are remarkable achievements, but there is an extensive body of evidence indicating that these models do not in fact have the human-level natural language capabilities one might be led to believe.

Challenge Sets and Adversarial Settings
Whether our models have learned to solve tasks in robust and generalizable ways has been a topic of much recent interest. Challenging test sets have shown that many state-of-the-art NLP models struggle with compositionality (Kim and Linzen, 2020; Yu and Ettinger, 2020; White et al., 2020), and find it difficult to pass the myriad stress tests for social (May et al., 2019; Nangia et al., 2020) and/or linguistic competencies (Geiger et al., 2018; Naik et al., 2018; Glockner et al., 2018; White et al., 2018; Warstadt et al., 2019; Gauthier et al., 2020; Hossain et al., 2020; Jeretic et al., 2020; Lewis et al., 2020; Saha et al., 2020; Schuster et al., 2020; Sugawara et al., 2020). Yet, challenge sets may suffer from performance instability (Liu et al., 2019a; Rozen et al., 2019; Zhou et al., 2020) and often lack sufficient statistical power (Card et al., 2020), suggesting that, although they may be valuable assessment tools, they are not sufficient for ensuring that our models have achieved the learning targets we set for them.
Models are susceptible to adversarial attacks, and despite impressive task-level performance, state-of-the-art systems still struggle to learn robust representations of linguistic knowledge (Ettinger et al., 2017), as also shown by work analyzing model diagnostics (Ettinger, 2020;Ribeiro et al., 2020). For example, question answering models can be fooled by simply adding a relevant sentence to the passage (Jia and Liang, 2017).
Text classification models have been shown to be sensitive to single input character change (Ebrahimi et al., 2018b) and first-order logic inconsistencies (Minervini and Riedel, 2018). Similarly, machine translation systems have been found susceptible to character-level perturbations (Ebrahimi et al., 2018a) and synthetic and natural noise (Belinkov and Bisk, 2018; Khayrallah and Koehn, 2018). Natural language inference models can be fooled by simple syntactic heuristics or hypothesis-only biases (Gururangan et al., 2018; Poliak et al., 2018; Tsuchiya, 2018; Belinkov et al., 2019; McCoy et al., 2019). Dialogue models may ignore perturbations of dialogue history (Sankar et al., 2019). More generally, Wallace et al. (2019) find universal adversarial perturbations forcing targeted model errors across a range of tasks. Recent work has also focused on evaluating model diagnostics through counterfactual augmentation (Kaushik et al., 2020), decision boundary analysis (Gardner et al., 2020; Swayamdipta et al., 2020), and behavioural testing (Ribeiro et al., 2020).

Adversarial Training and Testing
Research progress has traditionally been driven by a cyclical process of resource collection and architectural improvements. Similar to Dynabench, recent work seeks to embrace this phenomenon, addressing many of the previously mentioned issues through an iterative human-and-model-in-the-loop annotation process (Yang et al., 2017; Dinan et al., 2019; Chen et al., 2019; Bartolo et al., 2020), to find "unknown unknowns" (Attenberg et al., 2015) or in a never-ending or life-long learning setting (Silver et al., 2013; Mitchell et al., 2018). The Adversarial NLI (ANLI) dataset, for example, was collected with an adversarial setting over multiple rounds to yield "a 'moving post' dynamic target for NLU systems, rather than a static benchmark that will eventually saturate". In its few-shot learning mode, GPT-3 barely shows "signs of life" (Brown et al., 2020) (i.e., it is barely above random) on ANLI, which is evidence that we are still far away from human performance on that task.

Other Related Work
While crowdsourcing has been a boon for large-scale NLP dataset creation (Snow et al., 2008; Munro et al., 2010), we ultimately want NLP systems to handle "natural" data (Kwiatkowski et al., 2019) and be "ecologically valid" (de Vries et al., 2020). Ethayarajh and Jurafsky (2020) analyze the distinction between what leaderboards incentivize and "what is useful in practice" through the lens of microeconomics. A natural setting for exploring these ideas might be dialogue (Hancock et al., 2019; Shuster et al., 2020). Other works have pointed out misalignments between maximum-likelihood training on i.i.d. train/test splits and human language (Linzen, 2020; Stiennon et al., 2020).
We think there is widespread agreement that something has to change about our standard evaluation paradigm and that we need to explore alternatives. The persistent misalignment between benchmark performance and performance on challenge and adversarial test sets reveals that standard evaluation paradigms overstate the ability of our models to perform the tasks we have set for them. Dynabench offers one path forward from here, by allowing researchers to combine model development with the stress-testing that needs to be done to achieve true robustness and generalization.

Dynabench
Dynabench is a platform that encompasses different tasks. Data for each task is collected over multiple rounds, each starting from the current state of the art. In every round, we have one or more target models "in the loop." These models interact with humans, be they expert linguists or crowdworkers, who are in a position to identify models' shortcomings by providing examples for an optional context. Examples that models get wrong, or struggle with, can be validated by other humans to ensure their correctness. The data collected through this process can be used to evaluate state-of-the-art models, and to train even stronger ones, hopefully creating a virtuous cycle that helps drive progress in the field. Figure 2 provides a sense of what the example creation interface looks like.
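As a minimal sketch of a single collection round as described above (not the platform's actual code), the loop can be written as follows. The names `Example`, `run_round`, `target_model`, and `validate` are hypothetical, and the verified model error rate is assumed here to be the number of validated model-fooling examples divided by the number of examples submitted.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Example:
    context: str  # optional context shown to the annotator
    text: str     # annotator-written input (e.g., a hypothesis or question)
    label: str    # label the annotator claims is correct


def run_round(
    examples: List[Example],
    target_model: Callable[[str, str], str],  # hypothetical: (context, text) -> predicted label
    validate: Callable[[Example], bool],      # hypothetical: other humans confirm the label
) -> Tuple[List[Example], float]:
    """Return the verified model-fooling examples and the verified model error rate."""
    # An example fools the model when the prediction disagrees with the annotator's label.
    fooled = [ex for ex in examples if target_model(ex.context, ex.text) != ex.label]
    # Only examples confirmed by other humans count as genuine model errors.
    verified = [ex for ex in fooled if validate(ex)]
    rate = len(verified) / len(examples) if examples else 0.0
    return verified, rate
```

The verified examples from one round would then be added to the training data of the next round's target model, which is what closes the loop.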
As a large-scale collaborative effort, the platform is meant to be a platform technology for human-and-model-in-the-loop evaluation that belongs to the entire community. In the current iteration, the platform is set up for dynamic adversarial data collection, where humans can attempt to find model-fooling examples. This design choice is due to the fact that the average case, as measured by maximum likelihood training on i.i.d. datasets, is much less interesting than the worst (i.e., adversarial) case, which is what we want our systems to be able to handle if they are put in critical systems where they interact with humans in real-world settings.
However, Dynabench is not limited to the adversarial setting, and one can imagine scenarios where humans are rewarded not for fooling a model or ensemble of models, but for finding examples that models, even if they are right, are very uncertain about, perhaps in an active learning setting. Similarly, the paradigm is perfectly compatible with collaborative settings that utilize human feedback, or even negotiation. The crucial aspect of this proposal is the fact that models and humans interact live "in the loop" for evaluation and data collection.
One of the aims of this platform is to put expert linguists center stage. Creating model-fooling examples is not as easy as it used to be, and finding interesting examples is rapidly becoming a less trivial task. In ANLI, the verified model error rate for crowd workers in the later rounds went below 1-in-10, while in "Beat the AI", human performance decreased while time per valid adversarial example went up with stronger models in the loop (Bartolo et al., 2020). For expert linguists, we expect the model error to be much higher, but if the platform actually lives up to its virtuous cycle promise, that error rate will go down quickly. Thus, we predict that linguists with expertise in exploring the decision boundaries of machine learning models will become essential.
While we are primarily motivated by evaluating progress, both ANLI and "Beat the AI" show that models can overcome some of their existing blind spots through adversarial training. They also find that best model performance is still quite far from that of humans, suggesting that while the collected data appears to lie closer to the model decision boundaries, there still exist adversarial examples beyond the remit of current model capabilities.

Features and Implementation Details
Dynabench offers low-latency, real-time feedback on the behavior of state-of-the-art NLP models. The technology stack is based on PyTorch (Paszke et al., 2019), with models served via TorchServe (https://pytorch.org/serve). The platform not only displays prediction probabilities but, through an "inspect model" functionality, also allows the user to examine the token-level layer integrated gradients (Sundararajan et al., 2017), obtained via the Captum interpretability library. For each example, we allow the user to explain what the correct label is, as well as why they think it fooled a model if the model got it wrong; or why the model might have been fooled if it wasn't. All collected model-fooling (or, depending on the task, even non-model-fooling) examples are verified by other humans to ensure their validity.
Task owners can collect examples through the web interface, by engaging with the community, or through Mephisto, which makes it easy to connect, e.g., Mechanical Turk workers to the exact same backend. All collected data will be open sourced, in an anonymized fashion.
In its current mode, Dynabench could be described as a fairly conservative departure from the status quo. It is being used to develop datasets that support the same metrics that drive existing benchmarks. The crucial change is that the datasets are now dynamically created, allowing for more kinds of evaluation, e.g., tracking progress through rounds and across different conditions.

Initial Tasks
We have selected four official tasks as a starting point, which we believe represent an appropriate cross-section of the field at this point in time. Natural Language Inference (NLI) and Question Answering (QA) are canonical tasks in the field. Sentiment analysis is a task that some consider "solved" (and is definitely treated as such, with all kinds of ethically problematic repercussions), which we show is not the case. Hate speech is very important as it can inflict harm on people, yet classifying it remains challenging for NLP.
Natural language inference. Built upon the semantic foundation of natural logic (Sánchez Valencia, 1991, i.a.) and hailing back much further (van Benthem, 2008), NLI is one of the quintessential natural language understanding tasks. NLI, also known as 'recognizing textual entailment' (Dagan et al., 2006), is often formulated as a 3-way classification problem where the input is a context sentence paired with a hypothesis, and the output is a label (entailment, contradiction, or neutral) indicating the relation between the pair. We build on the ANLI dataset and its three rounds to seed the Dynabench NLI task. During the ANLI data collection process, the annotators were presented with a context (extracted from a pre-selected corpus) and a desired target label, and asked to provide a hypothesis that fools the target model adversary into misclassifying the example. If the target model was fooled, the annotator was invited to speculate about why, or motivate why their example was right. The target model of the first round (R1) was a single BERT-Large model fine-tuned on SNLI and MNLI, while the target model of the second and third rounds (R2, R3) was an ensemble of RoBERTa-Large models fine-tuned on SNLI, MNLI, FEVER (Thorne et al., 2018) recast as NLI, and all of the ANLI data collected prior to the corresponding round. The contexts for Round 1 and Round 2 were curated Wikipedia passages, and the contexts for Round 3 were from various domains. Results indicate that state-of-the-art models (which can obtain 90%+ accuracy on SNLI and MNLI) cannot exceed 50% accuracy on rounds 2 and 3.
With the launch of Dynabench, we have started collection of a fourth round, which has several innovations: not only do we select candidate contexts from a more diverse set of Wikipedia featured articles but we also use an ensemble of two different models with different architectures as target adversaries to increase diversity and robustness. Moreover, the ensemble of adversaries will help mitigate issues with creating a dataset whose distribution is too closely aligned to a particular target model or architecture. Additionally, we are collecting two types of natural language explanations: why an example is correct and why a target model might be wrong. We hope that disentangling this information will yield an additional layer of interpretability and yield models that are at least as explainable as they are robust.
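The text does not specify how the predictions of the ensemble members are combined. As one plausible illustration only, the sketch below assumes simple averaging of class probabilities over the three NLI labels; the function name `ensemble_predict` and the input format (one label-to-probability dictionary per model) are hypothetical.

```python
from typing import Dict, List, Tuple


def ensemble_predict(prob_dists: List[Dict[str, float]]) -> Tuple[str, Dict[str, float]]:
    """Average per-label probabilities across models and return the top label.

    Assumption: every model reports a probability for the same set of labels.
    """
    if not prob_dists:
        raise ValueError("need at least one model's probabilities")
    labels = list(prob_dists[0])
    mean = {label: sum(p[label] for p in prob_dists) / len(prob_dists) for label in labels}
    return max(mean, key=mean.get), mean


# Two hypothetical target models that disagree on an example.
model_a = {"entailment": 0.6, "neutral": 0.3, "contradiction": 0.1}
model_b = {"entailment": 0.2, "neutral": 0.7, "contradiction": 0.1}
label, mean = ensemble_predict([model_a, model_b])
print(label)
```

Under this assumption, an annotator fools the ensemble only when the averaged prediction differs from the gold label, so an example that exploits a quirk of a single architecture is less likely to count.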
Question answering. The QA task takes the same format as SQuAD1.1 (Rajpurkar et al., 2016), i.e., given a context and a question, extract an answer from the context as a continuous span of text. The first round of adversarial QA (AQA) data comes from "Beat the AI" (Bartolo et al., 2020). During annotation, crowd workers were presented with a context sourced from Wikipedia, identical to those in SQuAD1.1, and asked to write a question and select an answer. The annotated answer was compared to the model prediction using a word-overlap F1 threshold and, if sufficiently different, considered to have fooled the model. The target models in round 1 were BiDAF (Seo et al., 2017), BERT-Large, and RoBERTa-Large.
The model in the loop for the current round is RoBERTa trained on the examples from the first round combined with SQuAD1.1. Despite the super-human performance achieved on SQuAD1.1, machine performance is still far from humans on the current leaderboard. In the current phase, we seek to collect rich and diverse examples, focusing on improving model robustness through generative data augmentation, to provide more challenging model adversaries in this constrained task setting. We should emphasize that we don't consider this task structure representative of the broader definition even of closed-domain QA, and are looking to expand this to include unanswerable questions (Rajpurkar et al., 2018), longer and more complex passages, Yes/No questions and multi-span answers (Kwiatkowski et al., 2019), and numbers, dates and spans from the question (Dua et al., 2019) as model performance progresses.
Sentiment analysis. The sentiment analysis project is a multi-pronged effort to create a dynamic benchmark for sentiment analysis and to evaluate some of the core hypotheses behind Dynabench. Potts et al. (2020) provide an initial report and the first two rounds of this dataset.
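To make the word-overlap criterion concrete, the following is a minimal sketch of a token-level F1 check. It assumes lowercasing and whitespace tokenization only (the official SQuAD evaluation additionally strips punctuation and articles), and the function names and the `threshold` parameter are hypothetical; the threshold value itself is not given in the text, so it is left as an argument.

```python
from collections import Counter


def token_f1(prediction: str, answer: str) -> float:
    """Word-overlap F1 between a predicted span and the annotated answer."""
    pred_tokens = prediction.lower().split()
    gold_tokens = answer.lower().split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as no overlap.
        return float(pred_tokens == gold_tokens)
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def model_fooled(prediction: str, answer: str, threshold: float) -> bool:
    """The model counts as fooled when its answer overlaps too little with the annotation."""
    return token_f1(prediction, answer) < threshold
```

For instance, with an illustrative threshold of 0.4 (an assumed value, not taken from the text), the prediction "the Eiffel Tower" against the annotated answer "Eiffel Tower" has precision 2/3 and recall 1, giving an F1 of 0.8, so the model would not count as fooled.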
The task is structured as a 3-way classification problem: positive, negative, and neutral. The motivation for using a simple positive/negative dichotomy is to show that there are still very challenging phenomena in this traditional sentiment space. The neutral category was added to avoid (and helped trained models avoid) the false presupposition that every text conveys sentiment information (Pang and Lee, 2008). In future iterations, we plan to consider additional dimensions of sentiment and emotional expression (Alm et al., 2005;Neviarouskaya et al., 2010;Wiebe et al., 2005;Liu et al., 2003;Sudhof et al., 2014).
In this first phase, we examined the question of how best to elicit examples from workers that are diverse, creative, and naturalistic. In the "prompt" condition, we provide workers with an actual sentence from an existing product or service review and ask them to edit it so that it fools the model. In the "no prompt" condition, workers try to write original sentences that fool the model. We find that the "prompt" condition is superior: workers generally make substantial edits, and the resulting sentences are more linguistically diverse than those in the "no prompt" condition.
In a parallel effort, we also collected and validated hard sentiment examples from existing corpora, which will enable another set of comparisons that will help us to refine the Dynabench protocols and interfaces. We plan for the dataset to continue to grow, probably mixing attested examples with those created on Dynabench with the help of prompts. With these diverse rounds, we can address a wide range of questions pertaining to dataset artifacts, domain transfer, and overall robustness of sentiment analysis systems.
Hate speech detection. The hate speech task classifies whether a statement expresses hate against a protected characteristic or not. Detecting hate is notoriously difficult given the important role played by context and speaker (Leader Maynard and Benesch, 2016) and the variety of ways in which hate can be expressed (Waseem et al., 2017). Few high-quality, varied and large training datasets are available for training hate detection systems (Vidgen and Derczynski, 2020; Poletto et al., 2020; Vidgen et al., 2019).
We organised four rounds of data collection and model training, with preliminary results reported separately. In each round, annotators are tasked with entering content that tricks the model into giving an incorrect classification. The content is created by the annotators and as such is synthetic in nature. At the end of each round the model is retrained and the process is repeated. For the first round, we trained a RoBERTa model on 470,000 hateful and abusive statements. For subsequent rounds the model was trained on the original data plus content from the prior rounds. Due to the complexity of online hate, we hired and trained analysts rather than paying for crowd-sourced annotations. Each analyst was given training, support, and feedback throughout their work.
In all rounds annotators provided a label for whether content is hateful or not. In rounds 2, 3 and 4, they also gave labels for the target (i.e., which group has been attacked) and type of statement (e.g., derogatory remarks, dehumanization, or threatening language). These granular labels help to investigate model errors and improve performance, as well as directing the identification of new data for future entry. For approximately half of entries in rounds 2, 3 and 4, annotators created "perturbations" where the text is minimally adjusted so as to flip the label (Gardner et al., 2020; Kaushik et al., 2020). This helps to identify decision boundaries within the model, and minimizes the risk of overfitting given the small pool of annotators.
Over the four rounds, content becomes increasingly adversarial (shown by the fact that target models have lower performance on later rounds' data) and models improve (shown by the fact that the model error rate declines and the later rounds' models have the highest accuracy on each round). We externally validate performance using the HATECHECK suite of diagnostic tests from Röttger et al. (2020). We show substantial improvement over the four rounds, and our final round target model achieves 94% on HATECHECK, outperforming the models presented by the original authors.

Caveats and Objections
There are several obvious and valid objections one can raise. We do not have all the answers, but we can try to address some common concerns.
Won't this lead to unnatural distributions and distributional shift? Yes, that is a real risk. First, we acknowledge that crowdsourced texts are likely to have unnatural qualities: the setting itself is artificial from the perspective of genuine communication, and crowdworkers are not representative of the general population. Dynabench could exacerbate this, but it also has features that can help alleviate it. For instance, as we discussed earlier, the sentiment analysis project is using naturalistic prompt sentences to try to help workers create more diverse and naturalistic data. Second, if we rely solely on dynamic adversarial collection, then we increase the risks of creating unnatural datasets. For instance, Bartolo et al. (2020) show that training solely on adversarially-collected data for QA was detrimental to performance on non-adversarially collected data. However, they also show that models are capable of simultaneously learning both distributions when trained on the combined data, retaining if not slightly improving performance on the original distribution (of course, this may not hold if we have many more examples of one particular kind). Ideally, we would combine adversarially collected data with non-adversarial (preferably naturally collected) data, so as to capture both the average and worst case scenarios in our evaluation.
Finally, we note that Dynabench could enable the community to explore the kinds of distributional shift that are characteristic of natural languages. Words and phrases change their meanings over time, between different domains, and even between different interlocutors. Dynabench could be a tool for studying such shifts and finding models that can succeed on such phenomena.
What if annotators "overfit" on models? A potential risk is cyclical "progress," where improved models forget things that were relevant in earlier rounds because annotators focus too much on a particular weakness. Continual learning is an exciting research direction here: we should try to understand distributional shift better, as well as how to characterize how data shifts over time might impact learning, and how any adverse effects might be overcome. Because of how most of us have been trained, it is natural to assume that the last round is automatically the best evaluation round, but that does not mean that it should be the only round: in fact, most likely, the best way to evaluate progress is to evaluate on all rounds as well as any high-quality static test set that exists, possibly with a recency-based discount factor. To make an analogy with software testing, similar to checklists (Ribeiro et al., 2020), it would be a bad idea to throw away old tests just because you've written some new ones. As long as we factor in previous rounds, Dynabench's dynamic nature offers a way out from forgetting and cyclical issues: any model biases will be fixed in the limit by annotators exploiting vulnerabilities.
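The text leaves the form of the recency-based discount open. As one possible instantiation, the sketch below assumes geometric weighting: with per-round accuracies ordered from oldest to newest, the newest round gets weight 1 and each earlier round is down-weighted by a further factor `gamma`. Both the function name `discounted_score` and the choice of a geometric schedule are assumptions made for illustration.

```python
from typing import Sequence


def discounted_score(round_accuracies: Sequence[float], gamma: float) -> float:
    """Weighted mean of per-round accuracies, ordered oldest to newest.

    Round i of n receives weight gamma ** (n - 1 - i), so the most recent
    round has weight 1 and older rounds count progressively less.
    """
    if not round_accuracies:
        raise ValueError("need at least one round")
    if not 0.0 < gamma <= 1.0:
        raise ValueError("gamma must be in (0, 1]")
    n = len(round_accuracies)
    weights = [gamma ** (n - 1 - i) for i in range(n)]
    return sum(w * a for w, a in zip(weights, round_accuracies)) / sum(weights)


# Hypothetical accuracies on three rounds, oldest first.
print(discounted_score([0.8, 0.6, 0.4], gamma=0.5))
print(discounted_score([0.8, 0.6, 0.4], gamma=1.0))  # gamma = 1 reduces to the plain mean
```

With `gamma` below 1, a model that regresses on the latest round is penalized most, yet forgetting earlier rounds still lowers the score, which is the property the software-testing analogy asks for.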
Another risk is that the data distribution might be too heavily dependent on the target model in the loop. When this becomes an issue, it can be mitigated by using ensembles of many different architectures in the loop, for example the top current state-of-the-art ones, with multiple seeds.
How do we account for future, not-yet-in-the-loop models? Obviously, we can't, so this is a very valid criticism. However, we can assume that an ensemble of model architectures is a reasonable approximation, if and only if the models are not too bad at their task. This latter point is crucial: we take the stance that models by now, especially in aggregate, are probably good enough to be reasonably close to the decision boundaries, but it is definitely true that we have no guarantees that this is the case.
How do we compare results if the benchmark keeps changing? This is probably the main hurdle from a community adoption standpoint. But if we consider, e.g., the multiple iterations of SemEval or WMT datasets over the years, we've already been handling this quite well: we accept that a model's BLEU score on WMT16 is not comparable to WMT14. That is, it is perfectly natural for benchmark datasets to evolve as the community makes progress. The only thing Dynabench does differently is that it anticipates dataset saturation and embraces the loop so that we can make faster and more sustained progress.
What about generative tasks? For now Dynabench focuses on classification or span extraction tasks where it is relatively straightforward to establish whether a model was wrong. If instead the evaluation metric is something like ROUGE or BLEU and we are interested in generation, we need a way to discretize an answer to determine correctness, since we wouldn't have ground truth annotations, which makes determining whether a model was successfully fooled less straightforward. However, we could discretize generation by re-framing it as multiple choice with hard negatives, or simply by asking the annotator if the generation is good enough. In short, going beyond classification will require further research, but is definitely doable.
Do we need models in the loop for good data? The potential usefulness of adversarial examples can be explained at least in part by the fact that having an annotation partner (so far, a model) simply provides better incentives for generating quality annotation. Having the model in the loop is obviously useful for evaluation, but it's less clear if the resultant data is necessarily also useful in general for training. So far, there is evidence that adversarially collected data provides performance gains irrespective of the model in the loop (Dinan et al., 2019; Bartolo et al., 2020). For example, ANLI shows that replacing equal amounts of "normally collected" SNLI and MNLI training data with ANLI data improves model performance, especially when training size is small, suggesting higher data efficiency. However, it has also been found that model-in-the-loop counterfactually-augmented training data does not necessarily lead to better generalization (Huang et al., 2020). Given the distributional shift induced by adversarial settings, it would probably be wisest to combine adversarially collected data with non-adversarial data during training (ANLI takes this approach), and to also test models in both scenarios. To get the most useful training and testing data, it seems the focus should be on collecting adversarial data with the best available model(s), preferably with a wide range of expertise, as that will likely be beneficial to future models also. That said, we expect this to be both task and model dependent. Much more research is required, and we encourage the community to explore these topics.
Is it expensive? Dynamic benchmarking is indeed expensive, but it is worth putting the numbers in context, as all data collection efforts are expensive when done at the scale of our current benchmark tasks. For instance, SNLI has 20K examples that were separately validated, and each one of these examples cost approximately $0.50 to obtain and validate (personal communication with SNLI authors). Similarly, the 40K validated examples in MultiNLI cost $0.64 each (p.c., MultiNLI authors). By comparison, the average cost of creation and validation for ANLI examples is closer to $1.00 (p.c., ANLI authors). This is a substantial increase at scale. However, dynamic adversarial datasets may also last longer as benchmarks. If true, then the increased costs could turn out to be a bargain.
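The arithmetic behind these figures can be laid out as a small worked example. It uses only the counts and per-example costs quoted above and treats the ANLI figure of roughly $1.00 as exact for illustration; because the number of ANLI examples is not given here, no ANLI total is computed. The function name `cost_summary` is hypothetical.

```python
from typing import Dict, Tuple

# (number of validated examples, approximate cost per example in USD), as quoted in the text.
STATIC_BENCHMARKS: Dict[str, Tuple[int, float]] = {
    "SNLI": (20_000, 0.50),
    "MultiNLI": (40_000, 0.64),
}
ANLI_COST_PER_EXAMPLE = 1.00  # "closer to $1.00"; treated as exact for illustration


def cost_summary() -> Dict[str, Dict[str, float]]:
    """Total cost of each validated set and the ANLI per-example cost relative to it."""
    summary = {}
    for name, (n_examples, unit_cost) in STATIC_BENCHMARKS.items():
        summary[name] = {
            "total_usd": n_examples * unit_cost,
            "anli_cost_ratio": ANLI_COST_PER_EXAMPLE / unit_cost,
        }
    return summary


for name, row in cost_summary().items():
    print(f"{name}: total ${row['total_usd']:,.0f}, ANLI costs {row['anli_cost_ratio']:.2f}x per example")
```

Under these numbers, the validated SNLI and MultiNLI sets cost about $10,000 and $25,600 respectively, and an ANLI example costs roughly 2 times an SNLI example and about 1.56 times a MultiNLI example.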
We should acknowledge, though, that dynamic benchmarks will tend to be more expensive than regular benchmarks for comparable tasks, because not every annotation attempt will be model-fooling and validation is required. Such expenses are likely to increase through successive rounds, as the models become more robust to workers' adversarial attacks. The research bet is that each example obtained this way is actually worth more to the community and thus worth the expense.
In addition, we hope that language enthusiasts and other non-crowdworker model breakers will appreciate the honor that comes with being high up on the user leaderboard for breaking models. We are working on making the tool useful for education, as well as gamifying the interface to make it (even) more fun to try to fool models, as a "game with a purpose" (Von Ahn and Dabbish, 2008), for example through the ability to earn badges.

Conclusion and Outlook
We introduced Dynabench, a research platform for dynamic benchmarking. Dynabench opens up exciting new research directions, such as investigating the effects of ensembles in the loop, distributional shift characterisation, exploring annotator efficiency, investigating the effects of annotator expertise, and improving model robustness to targeted adversarial attacks in an interactive setting. It also facilitates further study in dynamic data collection, and more general cross-task analyses of human-and-machine interaction. The current iteration of the platform is only just the beginning of a longer journey. In the immediate future, we aim to achieve the following goals:
Anyone can run a task. Having created a tool that allows for human-in-the-loop model evaluation and data collection, we aim to make it possible for anyone to run their own task. To get started, only three things are needed: a target model, a (set of) context(s), and a pool of annotators.
Multilinguality and multimodality. As of now, Dynabench is text-only and focuses on English, but we hope to change that soon.

Live model evaluation. Model evaluation should not be about one single number on some test set. If models are uploaded through a standard interface, they can be scored automatically along many dimensions. We would be able to capture not only accuracy, for example, but also usage of computational resources, inference time, fairness, and many other relevant dimensions. This will in turn enable dynamic leaderboards, for example based on utility (Ethayarajh and Jurafsky, 2020). This would also allow for backward-compatible comparisons, not having to worry about the benchmark changing, and automatically putting new state-of-the-art models in the loop, addressing some of the main objections.
One can easily imagine a future where, in order to fulfill reproducibility requirements, authors do not only link to their open source codebase but also to their model inference point so others can "talk with" their model. This will help drive progress, as it will allow others to examine models' capabilities and identify failures to address with newer even better models. If we cannot always democratize the training of state-of-the-art AI models, at the very least we can democratize their evaluation.