Errudite: Scalable, Reproducible, and Testable Error Analysis

Though error analysis is crucial to understanding and improving NLP models, the common practice of manual, subjective categorization of a small sample of errors can yield biased and incomplete conclusions. This paper codifies model- and task-agnostic principles for informative error analysis, and presents Errudite, an interactive tool for better supporting this process. First, error groups should be precisely defined for reproducibility; Errudite supports this with an expressive domain-specific language. Second, to avoid spurious conclusions, a large set of instances should be analyzed, including both positive and negative examples; Errudite enables systematic grouping of relevant instances with filtering queries. Third, hypotheses about the cause of errors should be explicitly tested; Errudite supports this via automated counterfactual rewriting. We validate our approach with a user study, finding that Errudite (1) enables users to perform high-quality and reproducible error analyses with less effort, (2) reveals substantial ambiguities in prior published error analysis practices, and (3) enhances the error analysis experience by allowing users to test and revise prior beliefs.


Introduction
The attempt to analyze when, how, and why models fail (error analysis) is a crucial part of the development cycle. Understanding model shortcomings helps NLP developers revise their models, uncover bugs, make deployment decisions, and communicate model performance. Two common forms of error analysis are (1) data grouping, where aggregate metrics are computed for particular slices of interest (e.g., accuracy over question types in machine comprehension, per-label performance in semantic role labeling), and (2) counterfactual tests.
In practice, however, groupings and counterfactual tests are very coarse or limited. The input to most NLP tasks is unstructured text, which makes systematic in-depth error analysis challenging. Even answering simple questions such as "how accurate is my model when person names are involved?" requires extensive coding and the use of additional tools such as NER or POS taggers. Due to such difficulties, a common alternative is to group a subset of error samples with manual labels on potential error causes. While useful, the high cost of manual labeling limits analyses to small samples. We surveyed 10 papers with error analyses that examine a sample of incorrect predictions, e.g., (Wadhwa et al., 2018; Min et al., 2017), and found that the sample sizes ranged from 50 to 200 model errors (µ = 85.5, a range corroborated by our user study survey), frequently covering less than 5% of the total errors. Such small samples are likely unrepresentative of the true error distribution, resulting in high sampling error in the analysis. Furthermore, due to subjectivity, the labels themselves are not precisely defined (Chang et al., 2017). Indeed, our user study (§5) reveals that inter-researcher agreement is very low even for simple labels, an inconsistency that greatly harms reproducibility.
Focusing exclusively on errors, while overlooking successful predictions for instances with similar attributes, may also lead researchers to draw biased conclusions and mistakenly prioritize groups that are in fact well-handled on average (Rondeau and Hazen, 2018). Finally, there may be multiple plausible explanations for an error, with the true cause not immediately apparent. Figure 2 illustrates an incorrect prediction from a machine comprehension (MC) model that could be caused by the presence of a distractor entity with the same type as the ground truth (PERSON), the need to perform multi-sentence reasoning, a combination of both, or something else altogether. In a manual analysis, researchers may gravitate to the first or most salient explanation, without verifying it via counterfactual analysis (e.g., by removing the distractor).
We present an error analysis methodology grounded in three principles: hypothesized error causes should be (1) formalized in a precise and reproducible manner, (2) applied to all instances rather than a small sample of errors, and (3) tested explicitly via counterfactual analysis. We instantiate these principles in the design of an interactive system called Errudite. At the core of Errudite is an expressive domain-specific language (DSL) for precisely querying instances based on linguistic features. The DSL concretizes unambiguous error hypotheses, allows grouping to scale to all instances, and enables rewriting for counterfactual testing. For example, it makes it easy to create a precise group containing all instances where the ground truth and the prediction share an entity type (which would include the example in Figure 2), verify how often the model gets distracted, and check if the model turns to the correct entity when the distractor is removed. This sequence is precisely what we use to illustrate the design of Errudite (§3). At each step in the sequence, Errudite helps users inspect and refine their hypotheses in real time with interactive visualizations (Figure 1) and query suggestions based on programming-by-demonstration (§4). We validate our methodology and Errudite via a user study (§5), where MC experts applied it to gain valuable and reproducible insights into model behavior. The same users, when given identical descriptions of an error type from a prior published analysis and asked to reproduce it, produced groups that vary in size from 13.8% to 45.2% of all errors, which illustrates the ambiguity in subjective manual labeling.
In summary, we contribute: (1) an enumeration of key challenges for NLP error analysis: manual, subjective inspection of a small sample of errors can be ambiguous, biased, and miss the root cause of errors; (2) principles for informative error analysis: precise and reproducible, scalable, and testable; (3) the design of Errudite, an interactive graphical tool that instantiates these principles by systematically grouping and rewriting instances using a domain-specific language; and (4) a user study and case studies comparing Errudite with status quo error analysis practices. Errudite is available as an open source resource at https://github.com/uwdata/errudite, together with all analyses in this paper for easy replication.

Error Analysis Principles & Errudite
We identify three principles (abbreviated to the three subsection titles) for effective and unbiased error analysis, and describe tactics in Errudite that instantiate them.

Precise and Reproducible Hypotheses
Manual labeling of errors involves forming qualitative descriptions that implicitly refer to characteristics of the input and/or model output, often in an ambiguous form. For example, "the model is bad on long questions" refers to questions that have more than N tokens, with N left open to interpretation. In order to make error analysis scalable (not dependent on manual labels) and reproducible (unambiguous), our first principle is therefore P1: error hypotheses should be defined precisely with concrete descriptions, e.g., describing questions as "longer than 20 tokens" rather than "long." Errudite enables this through a domain-specific language (DSL) with targets, attribute extractors and operators, in increasing order of abstraction.
Targets are primitives which allow users to access inputs and outputs at different levels of granularity, such as the question (q), passage context (c), ground truth (g), the prediction of a model m (denoted by p(m)), sentence and token. Targets can be composed, e.g., sentence(g) extracts the sentence that contains the ground truth span.
Attribute extractors act on targets to extract fundamental instance metadata (e.g., length(q) returns the length of a question). These include (1) basic extractors like length, (2) general purpose linguistic features like token LEMMA, POS tags, and entity (ENT) annotations, (3) standard prediction performance metrics such as f1 or accuracy, (4) between-target relations such as overlap(t1, t2), and (5) domain-specific attributes (e.g., for MC or VQA) such as question type and answer type (Wadhwa et al., 2018; Shen et al., 2017). Table 1 provides an abridged listing of extractors, with example values from Figure 2.
Finally, extractors are composable through standard logical and numerical operators, serving as building blocks for more complex attributes. For example, to create a boolean attribute that checks if the ground truth span contains an entity, the != operator is used, yielding ENT(g) != "". A more complex example is counting the number of times the ground truth entity appears in the passage context: count(token(c, pattern=ENT(g))). Being reusable and composable makes extractors much more expressive than predefined attributes, and helps formulate much richer hypotheses.
Errudite's data grouping and rewriting (introduced below) are both supported by these abstractions in the DSL. Precise hypotheses and queries enable reproducible analyses that can be shared between research groups, and automatically applied to new datasets and models.
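To make the layering of targets, extractors, and operators concrete, the following is a minimal Python sketch of the same idea. It is not Errudite's implementation or API: the Instance class, its field names, the whitespace tokenization, and the naive sentence split are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical MC instance; the field names are illustrative only."""
    question: str
    context: str
    groundtruth: str
    prediction: str


# Targets: expose parts of an instance as token lists (whitespace tokens).
def q(inst):
    return inst.question.split()


def c(inst):
    return inst.context.split()


def sentence(inst, span):
    """Tokens of the first context sentence containing `span` (naive split on '.')."""
    for sent in inst.context.split("."):
        if span in sent:
            return sent.split()
    return []


# Attribute extractors: act on targets.
def length(tokens):
    return len(tokens)


def overlap(t1, t2):
    """Ratio of t1 tokens that also occur in t2 (case-insensitive)."""
    if not t1:
        return 0.0
    vocab = {tok.lower() for tok in t2}
    return sum(tok.lower() in vocab for tok in t1) / len(t1)


# Operators compose extractors into a precise, reusable hypothesis.
def is_long_question(inst, n=20):
    """'Long question' made explicit: more than n tokens."""
    return length(q(inst)) > n


inst = Instance(
    question="Who wrote the famous letter",
    context="John wrote the letter. Mary read it aloud.",
    groundtruth="John",
    prediction="Mary",
)
q_overlap = overlap(q(inst), sentence(inst, inst.groundtruth))
print(q_overlap, is_long_question(inst))
```

The point of the sketch is that a threshold such as n=20 is written down once and travels with the query, rather than being left to each reader's interpretation of "long."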

Analyze All Relevant Instances
Random spot checking of errors can lead to confirmation bias and spurious conclusions (Rondeau and Hazen, 2018). To avoid these, we propose P2: error prevalence should be assessed over the entire dataset. Grouping queries created with the DSL can scale the analysis to cover not only errors that are otherwise missed by small samples, but also correct cases that are typically overlooked. We now provide an example that illustrates the pitfalls of not following this principle, and how including all of the relevant successes and failures can lead to different insights than looking at a small sample of mistakes.
Table 1: Definitions for a subset of attribute extractors, including sample values from the example in Figure 2.
Extractor | Description | Example
starts_with, has_pattern | Checks whether the target contains a certain pattern; pattern automatically detects queries on POS tags and entity types. | starts_with(q, pattern="who VBZ") == True; has_pattern(g, pattern="PERSON") == True
overlap(t1, t2) | The ratio of t1 tokens that also occur in t2. | overlap(q, sentence(g)) == 0.25
Distractor Example. The distractor hypothesis states that BiDAF is good at matching questions to entity types (e.g., knowing when a PERSON is expected as an answer), but is often distracted by other spans with the same entity type (e.g., other PERSONs), leading to wrong predictions such as the one in Figure 2. This is a hypothesis independently raised by four out of ten user study participants (§5). (Participants tested the hypothesis for a specific entity type, numbers; we present a more general case here.) Consider the group is_distracted, defined by the following query:
    ENT(g) != "" and
    count(token(c, pattern=ENT(g))) >
    count(token(g, pattern=ENT(g))) and
    ENT(g) == ENT(p(m)) and
    f1(m) == 0
The query can be broken down into the following conditions: the ground truth is an entity (line 1); there are potential distractors, i.e., there are more tokens matching the ground truth entity type (ENT(g)) in the whole context than in the ground truth (lines 2-3); the prediction entity type matches the ground truth one (line 4); and the prediction is incorrect (line 5). Starting from all instances, we can subset groups by applying these conditions successively in order. Errudite conveys useful statistics about the groups via visualizations, as in Figure 3.
If we only consider is_distracted, without also considering correct predictions, we might conclude that the distractor hypothesis is correct: the 192 instances in the group are all cases where BiDAF predicts a wrong span that has the same entity type as the ground truth, and the group accounts for 5.7% of all BiDAF errors. However, looking at the groups in succession reveals a different, and more complete story: BiDAF predicts the exact correct span (exact match) 68% of the time overall, which rises to 80% when the ground truth is an entity. When other entities with the same type are present in the passage, BiDAF is still 79% accurate (i.e., it is not particularly worse when there are potential distractors), and conditioned on having matched the question to the right entity type, it is quite accurate (88% exact match). The user study participants who previously believed the distractor hypothesis either rejected or revised it after creating similar groups.
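The successive subsetting described above can be sketched in a few lines of Python. This is an illustrative sketch rather than Errudite's code: the dictionary keys (ent_g, ent_p, n_ctx, n_g, exact_match) are hypothetical precomputed attributes, and the six toy instances are invented, so the printed numbers have no relation to the BiDAF figures reported in the text.

```python
def group_stats(instances, conditions):
    """Apply conditions successively; report size and exact-match rate of each nested group."""
    stats = []
    current = list(instances)
    for name, cond in conditions:
        current = [x for x in current if cond(x)]
        n = len(current)
        acc = sum(x["exact_match"] for x in current) / n if n else float("nan")
        stats.append((name, n, acc))
    return stats


# Invented toy data. n_ctx / n_g: number of tokens with the ground truth
# entity type in the context / in the ground truth span.
toy = [
    {"ent_g": "PERSON", "ent_p": "PERSON", "n_ctx": 3, "n_g": 1, "exact_match": 1},
    {"ent_g": "PERSON", "ent_p": "PERSON", "n_ctx": 2, "n_g": 1, "exact_match": 0},
    {"ent_g": "DATE", "ent_p": "DATE", "n_ctx": 1, "n_g": 1, "exact_match": 1},
    {"ent_g": "", "ent_p": "", "n_ctx": 0, "n_g": 0, "exact_match": 0},
    {"ent_g": "ORG", "ent_p": "PERSON", "n_ctx": 2, "n_g": 1, "exact_match": 0},
    {"ent_g": "PERSON", "ent_p": "PERSON", "n_ctx": 4, "n_g": 1, "exact_match": 1},
]

conditions = [
    ("all instances", lambda x: True),
    ("ground truth is an entity", lambda x: x["ent_g"] != ""),
    ("potential distractors present", lambda x: x["n_ctx"] > x["n_g"]),
    ("prediction has the same entity type", lambda x: x["ent_g"] == x["ent_p"]),
]

stats = group_stats(toy, conditions)
for name, n, acc in stats:
    print(f"{name}: n={n}, exact match={acc:.2f}")
```

Reporting accuracy at every step, and not only the size of the final error group, is what keeps correct predictions in view when judging a hypothesis.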

Explicitly Test Error Hypotheses
In the example from the previous section, the presence of distractors in the context of a wrong prediction does not necessarily indicate that distractors were the root cause of the mistake. To isolate the essential cause of errors, we state P3: error hypotheses should be explicitly tested. This requires answers to counterfactual questions, such as "If the predicted distractor was not there, would the model predict correctly?" Errudite allows manual editing of individual examples (i.e., changing the input arbitrarily), a common practice for verifying whether the suspected error causes are really causes. While useful for quick spot tests on single instances, manual editing does not scale. For scalable counterfactual analysis, Errudite uses rules to rewrite all relevant instances within a group, similar to search and replace but with the flexibility and power of the Errudite DSL.
A rewrite rule is specified using the syntax rewrite(target, from→to), where target indicates the part of the instance that should be rewritten by replacing from with to. Both from and to can include linguistic annotations, in ALL CAPS. A rule to replace "Who" followed by a verb with "What person" followed by the same verb is written as rewrite(q, "who VERB"→"what person VERB"). For convenience, Errudite also includes default rules suggested in formative interviews with MC experts, such as "remove all sentences except the one that contains the ground truth", and "replace pronouns (he) with raw references (John Smith) from a coreference model."
Returning to our distractor example, we can verify whether distractors are causing mistakes by using a rewrite rule on the is_distracted group, replacing the predicted distractor with a non-entity placeholder token "#": rewrite(c, STRING(p(m))→"#").
in case A it seems other factors are at play. In case C, further analysis indicates that the predicted span is almost always a different distractor (i.e., has the same entity type). Thus, while BiDAF is fairly accurate when the distractors are present and the entity type is matched (88%), when it is incorrect, it seems distractors are indeed confusing the model. This kind of analysis is rarely seen (if at all) in the literature; yet it helps users develop insights not available through data grouping.
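As a rough illustration of the distractor rewrite, the sketch below masks the predicted span in the context and re-runs a model on the rewritten instances. It is only a sketch under stated assumptions: the dictionary fields are hypothetical, plain string replacement stands in for the DSL rule, and toy_predict is an invented stand-in for a real MC model such as BiDAF.

```python
def rewrite_context(instance, placeholder="#"):
    """Rough analogue of rewrite(c, STRING(p(m)) -> "#").

    Masks every occurrence of the predicted string in the context.
    """
    rewritten = dict(instance)
    rewritten["context"] = instance["context"].replace(instance["prediction"], placeholder)
    return rewritten


def counterfactual_outcomes(group, predict):
    """Re-run the model on rewritten instances and tally how the predictions change."""
    tally = {"now_correct": 0, "still_wrong": 0}
    for inst in group:
        rewritten = rewrite_context(inst)
        new_pred = predict(rewritten["question"], rewritten["context"])
        key = "now_correct" if new_pred == inst["groundtruth"] else "still_wrong"
        tally[key] += 1
    return tally


def toy_predict(question, context):
    """Invented stand-in model: returns the last capitalized word in the context."""
    caps = [w.strip(".,") for w in context.split() if w[:1].isupper()]
    return caps[-1] if caps else ""


# Two invented instances on which the stand-in model predicts a distractor.
group = [
    {"question": "Who wrote the letter?",
     "context": "John wrote the letter, and Mary read it.",
     "groundtruth": "John", "prediction": "Mary"},
    {"question": "Who met Bob?",
     "context": "Ann met Bob and Carl.",
     "groundtruth": "Ann", "prediction": "Carl"},
]

tally = counterfactual_outcomes(group, toy_predict)
print(tally)
```

Instances that become correct after the rewrite support the distractor explanation, while instances that stay wrong suggest that another factor, or another distractor, is involved.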

Interactive User Interface
We now walk through the interactive interface of Errudite in more detail. The interface not only integrates the entire analysis process, but also provides additional exploration support such as visualizing data distributions, suggesting potential queries, and presenting the grouping and rewriting results. While not strictly necessary for the error analysis principles previously outlined, it makes their application much more straightforward by helping users formulate and inspect their hypotheses in real time, and at scale (P2).
Attribute distribution. To guide the exploration, group creation and refinement, Errudite supports defining complex attributes and inspecting their distributions. An example in Figure 5 shows the histogram of ground truth entity types. It displays the relative frequency of different entity answers, as well as the proportion of incorrect predictions. The histograms are updated to show conditional distributions when a user selects a group. Figure 5(a) shows histograms for the ground truth entity type in the group is_entity: when the answer is an entity, it is most often a DATE, PERSON, ORG, or CARDINAL. Figure 5(b) displays the same histogram for the group is_distracted. We note that the frequencies of "distraction" mistakes for PERSON and CARDINAL are higher, and the frequency for ORG is lower, relative to the base frequencies in Figure 5(a), an insight that may warrant further investigation.
Programming-by-Demonstration. To make it easier for users to formulate group queries and rewrite rules, interactive selections can trigger suggestions for related DSL statements. If a user selects any text span in an instance in the central browser, she is shown suggestions for related queries. For example, selecting "John" in Figure 1 (or Figure 2) triggers the following suggestions:
    starts_with(p(m), pattern="NNP")
    starts_with(p(m), pattern="PERSON")
    answer_type(g) == answer_type(p(m))
    exact_match(m) == 0
    is_correct_sent(m) == False
    overlap(q, sentence(p(m))) > overlap(q, sentence(g))
These suggestions cover pattern searches (the first two), ranked by their occurrence frequency and error rate, and target comparisons (the remaining four), which are particularly relevant when the prediction or ground truth is selected. Selecting a different text span yields different suggestions, heuristically ranked and filtered with the goal of surfacing queries likely to be of interest.
For rewrite rules, we use a technique inspired by Ribeiro et al. (2018) to generalize manual edits into suggested rewrite rules: including context, POS tags and named entities, attempting to maximize coverage and relevance without redundancy. Figure 6 shows an example in which various suggestions are displayed after a user rewrites an instance by changing "Who" to "What person." Appendix B provides a more detailed description of our searching and ranking criteria.
Layout. The UI (Figure 1) contains three main components. The central component contains a filter panel (C) and an instance browser (D), which help examine the results of data groupings or rewrite rules for iterative refinement. The collapsible sidebar on the left contains a list of different models being analyzed with summary statistics (A) and customizable attribute histograms (B). The one on the right contains a list of saved data groups (E) and rewrite rules (F); these can be loaded into the central component via mouse click. All groups and rewrite rules can be saved and loaded through the interface, so the analysis can be easily shared and reproduced.

User Study
We conducted user studies to evaluate Errudite. Though less common in NLP, this type of evaluation is widely used in fields like Human-Computer Interaction for understanding how certain methods or systems impact the intended user group (Nielsen, 1994; Olsen Jr, 2007), which is precisely our objective here. We recruited ten participants with prior Machine Comprehension experience (developed 1-6 models each, µ = 3.1, σ = 2.02) for a 90-minute study: four NLP graduate students and six researchers or QA engineers from industry. Participants analyzed BiDAF on SQuAD v1.1.
User studies can take various forms, ranging from experiments that quantitatively compare human performance, to interviews or observational studies that qualitatively inspect users' behaviors and perspectives. We take a more qualitative approach, as we are primarily interested in how Errudite shapes participants' error analysis experience. The study started with a background survey about users' prior experience in MC and error analysis. After a walk-through tour of Errudite (described in Appendix A.3), participants were asked to perform two tasks: Replication ( §5.1), in which they attempted to reproduce the error analysis from Seo et al. (2016); and Exploration ( §5.2), in which they freely explored the model and reported their findings. We collected multiple subjective measures from participants in the form of five-point Likert scale ratings (Likert, 1932), with 5 being strongly positive and 1 strongly negative. Participants were compensated at a rate of $25/hr.

Task 1: Replication
The goal of this task was two-fold: (1) to verify if Errudite is flexible enough to support the creation of groups traditionally labeled by hand, and (2) to assess the reproducibility of current ad-hoc error analysis methods.
Figure 7: Percentage of errors covered by user-defined groups (axes: Error Coverage, 0% to 50%; Group): Boundary (µ = 30.9%, σ = 10.5%) and Multi-sentence (µ = 13.5%, σ = 8.29%). The dispersion of grey ticks shows that users come up with different definitions for groups described by Seo et al. (2016), even when they think they replicated the group faithfully.
Results. Participants rated the accuracy of the replication of each group after seeing a variety of examples, i.e., "how close the approximation matches the paper definition." For the groups we wrote queries for, participants were confident that Preprocess was accurate (µ = 4.3, σ = 0.64), but ambivalent towards Paraphrase (µ = 3.1, σ = 0.54). Participants' comments indicated the ambivalence did not come from Errudite: 6 participants disagreed with the example given by Seo et al. (2016), and participants who gave low ratings found Paraphrase itself too fuzzy and confusing to formalize. Despite being used widely as an error group (Kundu and Ng, 2018; Chen et al., 2016), participants had conflicting understandings of Paraphrase, either as "the question and the ground truth sentence are semantically similar but with great lexical variations", or "the predicted answer is a paraphrased version of the ground truth." When replicating groups themselves, participants were able to express the queries they wanted. Participants were not very confident in the accuracy of their produced Multi-sentence group (µ = 2.8, σ = 1.32), for reasons similar to Paraphrase: they thought the group was under-specified in the original analysis. More interestingly, users were the most confident in the fidelity of an apparently "easy" group Boundary (µ = 4.8, σ = 0.60), yet the groups they produced were wildly different (Figure 7). While users were able to express what they thought was meant by "imprecise error boundaries", they applied different definitions.
For example, one user defined the group as (D1) "the predicted span can be off by at most two tokens both on the left and right" (yielding 22.1% of all BiDAF errors), while another defined it as (D2) "there is no exact match but high overlap: F1 is higher than 0.7" (yielding 13.8% of all errors). Figure 8 shows samples that fit the two definitions or just one of them. Errudite makes the different interpretations explicit. The author of D2 observed examples like Figure 8(c) in his samples, but decided ultimately that what mattered was just the returned short text, not the span index. In contrast, D1's author carefully refined his initial query precisely to rule out cases like Figure 8(c).
In summary, users were able to express their intended groups well with Errudite, but they were unable to consistently replicate the analysis of Seo et al. (2016), even when they thought they did, due to the ambiguity inherent in manual grouping.
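The gap between the two Boundary definitions quoted above becomes easy to see once both are written as predicates. The Python sketch below is one possible reading of D1 and D2, not the participants' actual Errudite queries: spans are assumed to be (start, end) token indices with an exclusive end, F1 is a plain token-level F1, and the context and spans are invented.

```python
from collections import Counter


def token_f1(pred_tokens, gold_tokens):
    """Token-level F1 between predicted and ground truth tokens."""
    common = Counter(pred_tokens) & Counter(gold_tokens)
    n_common = sum(common.values())
    if n_common == 0:
        return 0.0
    precision = n_common / len(pred_tokens)
    recall = n_common / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def boundary_d1(pred_span, gold_span, max_offset=2):
    """D1 (one reading): not an exact span match, start and end each off by at most two tokens."""
    (ps, pe), (gs, ge) = pred_span, gold_span
    return (ps, pe) != (gs, ge) and abs(ps - gs) <= max_offset and abs(pe - ge) <= max_offset


def boundary_d2(pred_tokens, gold_tokens, threshold=0.7):
    """D2 (one reading): no exact text match, but token-level F1 above the threshold."""
    return pred_tokens != gold_tokens and token_f1(pred_tokens, gold_tokens) > threshold


def classify(context_tokens, pred_span, gold_span):
    """Return (fits D1, fits D2) for one prediction."""
    pred = context_tokens[pred_span[0]:pred_span[1]]
    gold = context_tokens[gold_span[0]:gold_span[1]]
    return boundary_d1(pred_span, gold_span), boundary_d2(pred, gold)


# Invented context; spans are (start, end) with an exclusive end index.
ctx = "the red car stopped near a wall and then the red car too left".split()

both = classify(ctx, pred_span=(0, 4), gold_span=(0, 3))      # one extra token on the right
d1_only = classify(ctx, pred_span=(5, 8), gold_span=(6, 7))   # short answer, low F1
d2_only = classify(ctx, pred_span=(9, 13), gold_span=(0, 3))  # similar text, distant position
print(both, d1_only, d2_only)
```

Even on three toy predictions the two predicates select different sets, which is the kind of divergence that stays hidden when a group is only described in prose.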

Task 2: Exploration
To assess the usefulness of Errudite, we let participants freely analyze BiDAF. We asked them to "think aloud" in real time, vocalizing their hypotheses, intriguing observations, objectives, and expectations. At the end of the session, subjects rated each of their discovered insights in terms of (1) importance (very trivial to very helpful), (2) confidence in insight correctness, and (3) relative ease of discovery compared to existing methods.
Results. All participants found at least one insight by building semantically meaningful groups or rewrite rules. On average, subjects reported µ = 2.1 findings (σ = 0.94). Some insights confirmed prior hypotheses about BiDAF more formally, increasing users' confidence. For example, one user created a group to verify that mistakes frequently occur when there is significant overlap between the question and a sentence that does not contain the ground truth. Indeed, that group accounts for about 18% of BiDAF errors. Other insights extended previous knowledge, such as explorations by two users who examined low performance on "why" questions (Appendix A.4). Participants also rejected some prior hypotheses after using Errudite, such as the distractor case in §3. Participants rated their findings to be important (µ = 3.7, σ = 1.12), were confident that their findings were valid (µ = 4.0, σ = 1.05), and consistently agreed that Errudite made finding insights easier (µ = 4.9, σ = 0.35). Participants agreed that they learned more about the model (µ = 3.9, σ = 0.94), and valued Errudite's support for assessing their hypotheses.

Usability and User Feedback
When rating the usefulness of different components of the tool (Figure 9), users rated the DSL (µ = 4.8, σ = 0.40) and the attribute distribution (µ = 4.3, σ = 0.78) as very useful, and rated query suggestions (µ = 3.6, σ = 0.91) and rewrite rules (µ = 3.6, σ = 1.11) as potentially useful. We hypothesize that rewrite rules pose a learning curve that makes them difficult to evaluate in a single session. This kind of counterfactual analysis is not common, and a few participants were concerned about possible unintended side effects of edits. We also asked participants to describe their impressions with free-form comments, which were very positive across the board: all participants thought Errudite enhanced their error analysis experience. In particular, four users stated that they felt it systematically scaled up the analysis, making it more precise and thus inspiring more confidence. Five users noted how much faster exploration became with Errudite, and how having a good set of building blocks and visualizations let them bypass the large coding overhead otherwise needed to test a single hypothesis about a model.
2017)). While useful and accessible, they do not allow more semantically meaningful observations (like distractors or paraphrases). In contrast, some define groups that are highly specific to a particular dataset or model, such as hand-crafting factors to quantify MC instance difficulties (Rondeau and Hazen, 2018). While often insightful, these suffer from potential pitfalls similar to labeling individual instances: they are laborious, often subjective, and hard to reproduce. In other words, just as in manual error labeling mentioned in §1, typical automatic grouping also struggles with the trade-off between being reproducible/scalable, and being in-depth and meaningful.
In contrast, Errudite addresses the challenge with an expressive domain-specific language, which helps users build filters that can slice the entire dataset, and thereby build scalable and semantically meaningful groups.
Chung et al. (2018) made a similar attempt to balance the trade-off in Slice Finder, a framework that uses statistical techniques to identify large and interpretable slices that models perform poorly on. However, their purely automated data slicing does not allow users to customize groups based on their own hypotheses. Furthermore, Slice Finder only uses predefined attributes. While this is reasonable in the context of the structured data classifiers that they tested (with features explicitly defined), it is not flexible enough for unstructured text in NLP. Other interactive error analysis tools tend to face similar customization issues. QADiver (Lee et al., 2018) enriches question attributes in SQuAD 2.0 by including factors like word frequencies and question-context word match ratios, but users cannot query or create groups based on these attributes. QSAnglyzer (Chen and Kim, 2017) aims at category-oriented analysis by pre-defining seven groups for QA models, but there is limited support for group customization. ActiVis (Kahng et al., 2018) allows for flexible data attribute and group definitions, but only supports group creation prior to the interactive process. Rarely does a user know which group they want to inspect beforehand, and thus it is to be expected that users would revert to coarse and easy-to-program groups. Errudite emphasizes customization: it allows users to define extractors for rich instance attributes, and helps them adjust their groups in real time with quick trial and error, visualizations, and suggestions based on programming-by-demonstration.

Counterfactual Analysis
Counterfactual attacks on models have taken various forms, e.g., adding distracting sentences to the context in MC (Jia and Liang, 2017), or feeding partial questions or wrong images into models. Slightly closer to our work is SEARs (Ribeiro et al., 2018) (also incorporated into QADiver), which also takes the form of rewrite rules: it generates semantic-preserving rules that cause models to change predictions. However, these focus on robustness, i.e., counterfactual perturbations are mainly for the purpose of detecting over-stability or over-sensitivity. In contrast, our counterfactual analysis is for the purpose of understanding why models fail in certain groups. Furthermore, our DSL allows for more complex counterfactual rules and for applying rules only to certain groups, such as "delete the predicted distractor for instances in the is_distracted group." As far as we know, such analysis is novel, and a promising direction for more in-depth error analysis.

Conclusion and Future Work
In this paper, we characterize deficiencies with current error analysis methods used in NLP: they are laborious and subjective, which can lead to high variance and low reproducibility. Moreover, by focusing on error cases independent of situations where the model is correct, they can yield biased results. Finally, since it is difficult to perform counterfactual analysis, the root cause of errors can easily be overlooked.
In response, we identify three principles required for successful error analysis, and present an interactive tool called Errudite to enable their application: (1) building precise instance groups with composable building blocks in a domain-specific language; (2) scaling the analysis to cover all the relevant successes and failures by automatically building large groups with filtering queries, and providing visual summaries for them; and (3) testing error hypotheses using counterfactual analysis by rewriting the instances with rules. Data groups and rewrite rules can be easily saved and shared for replication or for analysis of different models with the same groups and rules.
We conduct a detailed user study with NLP experts, confirming that Errudite makes hypothesis definitions both concrete and apparent, reduces sampling bias, and helps researchers verify the true causes of errors. We find that Errudite significantly lowers the barrier for insightful error analysis, which we hope will lead to a more in-depth understanding of current models, safer deployments, and improvements in the state of the art.
While our primary experiments are on Machine Comprehension, the DSL primitives in Errudite are general enough to make extensions to other tasks and domains straightforward. For example, we have extended Errudite to Visual Question Answering with only minor adjustments to the performance metrics and the instance browser (to include images). We share case studies in Appendices A.1 and A.2, together with further analysis on SQuAD (Appendix A.4). Similar adjustments could be made to extend Errudite to other tasks such as Machine Translation, Natural Language Inference, and text classification, along with customization of domain-specific attributes.
In the future, we hope to design and evaluate more sophisticated query and attribute suggestions to guide exploration by less expert users, as well as social features that facilitate collaboration within an organization to promote sharing, review, reuse, and extension of error analyses.

A Additional Use Cases
We use case studies in Visual Question Answering and Machine Comprehension to further demonstrate the usefulness of Errudite. A video demo is available at https://youtu.be/Dil5i0AYyu8.
A.1 VQA: Breaking down "How many"
Figure 10: Two "how many" examples: VQACounting improves on SAAA for instance (a), but predicts an even higher count in (b). Highlighting "how many brownish", we create groups based on the suggested query A.
We demonstrate Errudite's ability to compare multiple models in the context of Visual Question Answering (VQA). We analyze SAAA (Kazemi and Elqursh, 2017) and VQACounting (Zhang et al., 2018) concurrently on the validation set of VQA v1 (Antol et al., 2015), which contains 21,512 instances. VQACounting is built on top of SAAA, with increased performance on counting questions. Querying "how many" questions, we notice two interesting cases in Figure 10: VQACounting correctly predicts the "how many people" question in (a), but in (b) it is even further off than SAAA, which is also wrong. We suspect the token following "how many" can make a difference. Highlighting "how many brownish", we follow the first returned suggestion (Figure 10A) to build a "how many ADJ" group (starts_with(q, pattern="how many ADJ")), and similarly, a "how many NOUN" group.
Per-group comparison shows that VQACounting improves on SAAA more for "how many NOUN" questions than for "how many ADJ" questions (Figure 11): the former shows an accuracy increase from 38% to 49%, whereas the latter improves by only 3%. However, note the group size difference: the NOUN group is 14 times larger than the ADJ group. In fact, extracting the POS tags following "how many" into an attribute, we see that "NOUN" drastically stands out, suggesting a very imbalanced data distribution (Figure 12).

Figure 11: VQACounting improves much more on how many NOUN, compared to how many ADJ, though many fewer instances follow the latter pattern.

Figure 12: Extracting the POS tag for the token immediately after "how many", we notice most instances follow a "how many NOUN" pattern.
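The per-group comparison above can be illustrated with a minimal Python sketch. It is not Errudite's implementation: the instances, field names, and hand-assigned POS tags are hypothetical toy data, chosen only to show how a POS-after-"how many" attribute yields groups whose accuracy and size can be compared across models.

```python
from collections import defaultdict

# Hypothetical toy data: questions are hand-tagged (word, coarse POS) pairs so the
# sketch needs no NLP library; "correct" records whether each model was right.
INSTANCES = [
    {"question": [("How", "ADV"), ("many", "ADJ"), ("people", "NOUN")],
     "correct": {"SAAA": False, "VQACounting": True}},
    {"question": [("How", "ADV"), ("many", "ADJ"), ("dogs", "NOUN")],
     "correct": {"SAAA": True, "VQACounting": True}},
    {"question": [("How", "ADV"), ("many", "ADJ"), ("cars", "NOUN")],
     "correct": {"SAAA": False, "VQACounting": False}},
    {"question": [("How", "ADV"), ("many", "ADJ"), ("brownish", "ADJ"), ("cows", "NOUN")],
     "correct": {"SAAA": False, "VQACounting": False}},
]


def pos_after_how_many(tagged_question):
    """Return the POS tag of the token right after a leading "how many", else None."""
    words = [word.lower() for word, _ in tagged_question]
    if words[:2] == ["how", "many"] and len(tagged_question) > 2:
        return tagged_question[2][1]
    return None


def per_group_accuracy(instances, model):
    """Map each POS group to (accuracy, group size) for the given model."""
    hits, sizes = defaultdict(int), defaultdict(int)
    for inst in instances:
        tag = pos_after_how_many(inst["question"])
        if tag is None:
            continue  # not a "how many" question
        sizes[tag] += 1
        hits[tag] += int(inst["correct"][model])
    return {tag: (hits[tag] / sizes[tag], sizes[tag]) for tag in sizes}


for model in ("SAAA", "VQACounting"):
    print(model, per_group_accuracy(INSTANCES, model))
```

Reporting the group size next to the accuracy keeps the imbalance visible, so a large gain on a small group is not over-read.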

A.2 VQA: Ambiguous Questions
With groups and attributes independent of models or predictions, Errudite can help analyze the consistencies and ambiguities of the datasets. In this case, we use Errudite to group all the "ambiguous VQA questions", or questions where the answers exhibit high human annotator disagreement. If humans cannot agree on the answer, it is to be expected that machine learning models will not be accurate. In the annotations for the VQA v1 dataset, each question collects up to 10 human answers, while in evaluation an answer is considered fully accurate if it matches the answer of at least three humans. We count the unique ground truth annotations for each instance (count(g)), which results in the distribution shown in Figure 13: instances with more ground truth labels are more poorly predicted. Querying for instances with count(g) > 5, we find many instances like the ones in Figure 14, covering 29.9% of all the errors. This means the dataset is far from "clean" and that 30% of the model's mistakes should probably not be considered mistakes.
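As a rough illustration of the count(g) attribute and the evaluation rule described above, the following Python sketch counts unique human answers and scores a prediction. The partial-credit form min(matches / 3, 1) is assumed here from the usual VQA metric; the text only states that three matching humans give full accuracy. The answer list and the lowercase/strip normalization are hypothetical.

```python
def count_unique_answers(human_answers):
    """count(g): number of distinct ground truth answers after light normalization."""
    return len({answer.strip().lower() for answer in human_answers})


def vqa_accuracy(prediction, human_answers):
    """Full credit when at least three annotators gave the predicted answer.

    Partial credit below three matches is an assumption of this sketch.
    """
    target = prediction.strip().lower()
    matches = sum(answer.strip().lower() == target for answer in human_answers)
    return min(matches / 3.0, 1.0)


def is_ambiguous(human_answers, threshold=5):
    """Mirror the query count(g) > 5 used to select ambiguous questions."""
    return count_unique_answers(human_answers) > threshold


# Hypothetical annotations for one counting question.
answers = ["2", "2", "3", "2", "4", "3", "two", "2", "5", "6"]
print(count_unique_answers(answers), is_ambiguous(answers))
print(vqa_accuracy("2", answers), vqa_accuracy("3", answers))
```

In this toy case a model answering "3" is penalized even though the annotators themselves gave six different answers, which is the kind of instance the count(g) > 5 group is meant to surface.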

A.3 MC: Incorrect Pre-processing
We used the following case as the tutorial demo in our user study (§5). When sorting instances by their F1 score, instances like those in Figure 15(a) appear. Due to incorrect tokenization, BiDAF treats "1641-1679" as one token, and because evaluation is token-wise, its mismatch with the ground truth "1679" results in F1 = 0. We simulate the above tokenization issue with a query stating that (a) at the character level the ground truth answer is a substring of the prediction, yet (b) the two have no token-level overlap:

    STRING(g) in STRING(p(m)) and f1(m) == 0

Among the 26 instances returned (0.8% of all incorrect instances), we find multiple instances like the ones in Figure 15(a), and also unexpected cases like Figure 15(b).
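To make the query concrete, here is a minimal Python sketch of a token-wise F1 and the substring-but-no-token-overlap check. It assumes plain whitespace tokenization and skips the answer normalization that standard evaluation scripts apply; the function names are illustrative and are not Errudite's API.

```python
from collections import Counter


def token_f1(prediction, ground_truth):
    """Token-level F1 between two answer strings, using whitespace tokens."""
    pred_tokens = prediction.split()
    gold_tokens = ground_truth.split()
    # Multiset intersection counts each shared token at most as often as it occurs in both.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def looks_like_tokenization_error(prediction, ground_truth):
    """Ground truth is a character-level substring, yet no token is shared."""
    return ground_truth in prediction and token_f1(prediction, ground_truth) == 0.0


print(token_f1("1641-1679", "1679"))                        # no shared token
print(looks_like_tokenization_error("1641-1679", "1679"))   # flagged by the query
print(token_f1("1641 - 1679", "1679"))                      # dash separated from the years
```

With the dash separated from the years, the prediction shares the token "1679" with the ground truth, so the score is no longer zero.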
It is unclear from these examples if tokenization is the only issue. To further assess, we define a rewrite rule that separates dashes from nearby words: rewrite(sentence(g), "-" → " -").
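A minimal sketch of such a rewrite rule in Python, assuming a plain string substitution over the sentence. The helper name, the example sentence, and the choice to pad the dash on both sides are assumptions of this sketch, not Errudite's rule engine.

```python
def rewrite(text, source, target):
    """Apply a literal source -> target rewrite rule to a piece of text."""
    return text.replace(source, target)


# Hypothetical context sentence containing a year range.
sentence = "The period 1641-1679 saw several conflicts"
rewritten = rewrite(sentence, "-", " - ")

print(sentence.split())   # the range is a single whitespace token
print(rewritten.split())  # the years are now separate tokens
```

After the rewrite, the model is run again on the edited instance, and the new prediction is compared with the original one.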
The rewritten instances are then queryable using a wrapper function: apply(func, rewrite="rule name") runs the query function func on the new instances generated by rule "rule name". We use the queries in Figure 16 to further divide the 13 instances rewritten (the rule cannot edit additional cases like those in Figure 15): 4 were predicted correctly after the rewrite, 5 remained the same (with spaces added), and 4 returned a different incorrect span after the rewrite. This counterfactual analysis confirms that these errors are not solely due to preprocessing errors.

We merge two cases from our user study in §5 to demonstrate how participants P1 and P2 can start with similar attributes and then diverge and discover complementary insights.
Both participants started by grouping "why" questions, as they observed them to have much lower performance than other primary question types. P1 realized these questions had longer predictions, and the ground truths were usually a small substring of the prediction (with multiple unnecessary tokens on both ends). Meanwhile, "what" questions have relatively shorter predictions. He hypothesized that reframing "why" to "what" questions could result in reasonable prediction lengths, and created a rule rewrite(q, "Why VERB"→"What is the reason that") to confirm it. Out of 151 rewritten instances, 46 had shorter predictions, and 6 had longer ones; the remaining instances had unchanged predictions. Out of the 19 instances where F1 improved after the rewrite (apply(f1(m), rewrite="why to what") > f1(m)), 13 had the prediction shortened to approximately the correct ground truth answer.

Q: Why is Priestley usually given credit for being first to discover oxygen?
...Because he published his findings first, Priestley is usually given priority in the discover.
Figure 18: A "why" question where the model ignored the apparent hint "because."

P2 found the example in Figure 18 and chose a different angle. He was surprised to see the incorrect prediction, when the ground truth contained the word "because", which should make the prediction easier for BiDAF. Grouping all "why" questions with a "because" in their context:

    question_type(q) == "why" and has_pattern(c, pattern="because")

he found most instances still had a prediction following "because", and that removing "because" from the context made predictions worse. He confirmed that "because" was indeed an essential signal. The prediction in Figure 18 remained the same, and P2 therefore hypothesized that aggressive pattern matching affected this instance, as all the words surrounding the prediction "priority" were in the question. He was also surprised that there were only 40 instances in the because group, and suggested more labeling might easily help bump up the performance.
The two participants explored complementary angles on "why" questions, suggesting the value of collaborative sharing among Errudite users.

B Programming-by-Demonstration
To help users express their intent, Errudite supports programming by demonstration (PBD) (Gulwani and Jain, 2017), a well-recognized technique. As users interact with instances, Errudite detects and returns potential queries that can assist generalization from a single observation to a larger set. As running examples, we explain our query ranking methods assuming "John" is selected in Figure 19, and "How many brownish" is selected in Figure 10.

Figure 19: The illustrating example we used in the paper; we repeat it here to explain our programming-by-demonstration heuristics. The scenario here assumes "John" is selected by a user.

There are three broad types of suggestions with different granularity. To ensure diversity, our suggestions cover at least one query from each type, and the inter-type suggestion ranking always follows the order below.

Span-related suggestions closely relate to the specific token(s) selected ("John" in Figure 19). The most typical span-related suggestions are pattern searches. We generate a list of possible linguistic patterns from the cross-product of raw token text with POS tags (coarse for multiple tokens, and fine-grained for single tokens), as well as the entity type (if any). The resulting possible patterns for "John" are "John", "NNP", "PERSON". Similarly, in Figure 10, "how many brownish" results in "how many brownish", "how many ADJ", "ADV ADJ ADJ", etc. The functional predicate used differs if the selected span lies at the beginning, middle, or end of a target (starts_with, has_pattern, and ends_with).
Target-related suggestions are based on the target under inspection. For instance, we return question_type when a user interacts with the question (q) in Figure 10. A prediction (p) as in Figure 19 will instead trigger different levels of comparisons with the ground truth, including accuracy checks (exact_match and is_correct_sent), answer type comparisons (ENT(p) == ENT(q)), answer offsets (answer_offset_delta) and sentence-level comparisons (overlap):

    answer_type(g) == answer_type(p(m))
    exact_match(m) == 0
    is_correct_sent(m) == False
    overlap(q, sentence(p(m))) > overlap(q, sentence(g))

Instance-level suggestions are conventional attributes that domain experts often find useful. For example, performance, question type, and answer type are considered the most important "instance" suggestions if they are not triggered by the target-related suggestions. In addition, lengths of inputs also belong to this suggestion type.
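The span-related candidate generation described above (the cross-product of raw token text with POS tags and entity types) can be sketched in Python as follows. This is an illustrative reading of the description, not Errudite's code: the token dictionaries are hand-annotated, the caller is assumed to supply fine-grained tags for a single token and coarse tags for several, and the predicate names are assumed spellings of the three predicates mentioned in the text.

```python
from itertools import product


def candidate_patterns(span):
    """Cross-product of per-token alternatives: raw text, POS tag, and entity type (if any)."""
    options = []
    for token in span:
        choices = [token["text"], token["pos"]]
        if token.get("ent"):
            choices.append(token["ent"])
        options.append(choices)
    return [" ".join(combo) for combo in product(*options)]


def predicate_for(start, end, target_len):
    """Pick the predicate from where the selected token span [start, end) sits in the target."""
    if start == 0:
        return "starts_with"
    if end == target_len:
        return "ends_with"
    return "has_pattern"


# Hand-annotated spans for the two running examples.
john = [{"text": "John", "pos": "NNP", "ent": "PERSON"}]
how_many = [
    {"text": "how", "pos": "ADV"},
    {"text": "many", "pos": "ADJ"},
    {"text": "brownish", "pos": "ADJ"},
]

print(candidate_patterns(john))
print(candidate_patterns(how_many))
```

For a three-token span with no entity types this yields eight candidates, which is why a ranking step is needed before anything is shown to the user.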
To perform intra-group ranking, we precompute the resulting groups for each candidate suggestion, and rank them by their in-group error rate R_e and dataset coverage C_d, maximizing a usefulness score S_u (Equation 1). Intuitively, R_e measures group difficulty. We would like to prioritize patterns that return subsets that are not well-handled on average, resulting in a high in-group error rate. The |C_d − 50%| term, on the other hand, ensures reasonable coverage. We prioritize groups that lean towards 50% coverage of the entire validation set, so as to penalize patterns that cover too few instances to be significant, as well as those that cover so many instances that they essentially return the entire dataset. Taking the ranking of span-related suggestions as an example, candidate patterns for Figure 19 and Figure 10, and their scores S_u, are shown in Table 2.
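The exact form of the usefulness score is not reproduced in this text, so the Python sketch below assumes one form consistent with the description: S_u = R_e · (1 − |C_d − 0.5|), which rewards a high in-group error rate and discounts coverage far from 50%. Both the formula and the candidate numbers are assumptions for illustration only.

```python
def usefulness(error_rate, coverage):
    """Assumed score: high in-group error rate, discounted as coverage moves away from 50%."""
    return error_rate * (1.0 - abs(coverage - 0.5))


def rank_suggestions(candidates):
    """Sort candidate patterns by descending usefulness."""
    return sorted(
        candidates,
        key=lambda c: usefulness(c["error_rate"], c["coverage"]),
        reverse=True,
    )


# Hypothetical precomputed statistics for the patterns derived from "John".
candidates = [
    {"pattern": "John", "error_rate": 0.7, "coverage": 0.001},
    {"pattern": "NNP", "error_rate": 0.4, "coverage": 0.9},
    {"pattern": "PERSON", "error_rate": 0.6, "coverage": 0.3},
]

for c in rank_suggestions(candidates):
    print(c["pattern"], round(usefulness(c["error_rate"], c["coverage"]), 4))
```

Under this assumed form, the very specific "John" pattern loses to "PERSON" despite its higher error rate, because it covers almost none of the dataset.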

B.2 Rewrite Rule Extraction
When a source x is edited to x', we propose a set of rules R = {r_1, ..., r_m} in the same manner as [...]. Then, we apply every rule in the candidate set onto a random subset of instances S = {s_1, ..., s_n}, n = 100. Similar to Ribeiro et al. (2018), we prioritize rules that have (1) high coverage and (2) low redundancy, while loosening their constraint on semantic equivalence: rules resulting in different semantics are still valid in our error cause testing context. In addition, we heuristically score the linguistic features used based on their specificity: we consider raw text the most specific and POS tags the least, and we penalize rules that are too general and abstract (as they are likely to result in unexpected changes). For example, in addition to the rules reported in Figure 20, an additional rule found in the candidate set from "Who" to "What person" was "NOUN"→"What person". By editing random NOUNs, this rule will have high coverage, but our specificity score weighs it down enough that "Who"→"What person" is ranked more highly. We then report the five highest-ranked candidate rules to the user.

Table 3 lists the 10 papers we surveyed to inspect the scale of the status quo error analysis practice. Papers were randomly selected from top-tier conferences, and either develop novel MC models (our primary test case) or focus on error analysis.

Table 3: Surveyed papers and their error sample sizes.