Inconsistencies in Crowdsourced Slot-Filling Annotations: A Typology and Identification Methods

Slot-filling models in task-driven dialog systems rely on carefully annotated training data. However, annotations by crowd workers are often inconsistent or contain errors. Simple solutions like manually checking annotations or having multiple workers label each sample are expensive and waste effort on samples that are correct. If we can identify inconsistencies, we can focus effort where it is needed. Toward this end, we define six inconsistency types in slot-filling annotations. Using three new noisy crowd-annotated datasets, we show that a wide range of inconsistencies occur and can impact system performance if not addressed. We then introduce automatic methods of identifying inconsistencies. Experiments on our new datasets show that these methods effectively reveal inconsistencies in data, though there is further scope for improvement.


Introduction
Slot-filling is a key component of task-driven dialog systems, providing a way for systems to extract key properties from user queries. For example, a slot-filling model can extract the tokens "New York" as a TO LOCATION slot in the query "book a flight to New York". Standard slot-filling models are trained or fine-tuned on large, domain-specific datasets of carefully annotated data. This means there is an ongoing need to annotate new datasets for new domains.
Typically, data annotation is broken into small tasks with brief instructions that are completed by non-expert crowd workers. Collecting high quality data in this setting is challenging, involving multiple rounds of pilot annotations and analysis to develop suitable instructions (Alonso et al., 2015). Figure 1 shows examples of inconsistencies in annotations performed by crowd workers. Training on data with these issues leads to lower quality models, which in turn decrease the effectiveness of the overall dialog system. Most research on improving data quality has focused on mechanisms such as aggregation (Parde and Nielsen, 2017), worker filtering (Li and Liu, 2015), and attention checks (Oppenheimer et al., 2009). These all raise costs and primarily address clear inconsistencies (such as in examples 1, 4, 5, and 6) but not more subtle cases like the inclusion of "dollar" in examples 2 and 3. Additionally, annotation-as-a-service (e.g., scale.ai) is widely used by developers but often cannot be customized by the developer to add these kinds of mechanisms. If we could identify all of these inconsistencies, we could fix them without expending effort on correctly annotated examples, and we could improve task instructions to reduce further inconsistencies.
In this paper, we catalog a typology of annotation inconsistencies that occur in slot-filling data and present several automatic methods for identifying inconsistencies. To demonstrate our ideas, we introduce and analyze three new crowd-annotated datasets. By analyzing the data with our typology, we find that the distribution of errors varies substantially depending on the domain. We also measure the impact of these inconsistencies on trained models and examine the relative impact of each inconsistency type through a controlled experiment where we artificially inject errors.

Figure 1: Examples of inconsistent annotations by crowd workers. In 1, the SOURCE and TARGET labels are backwards. In 2 and 3, there is variation in whether "dollar" is included in the SOURCE span. In 4, annotations for "yen" and "usd" are missing. In 5, the label for "canadian dollar" is incorrect. In 6, "much" is annotated, but in 5 it is not.
We also propose several inconsistency identification methods that require no additional annotation, using only the data already being collected. We evaluate by comparing to manually checking every example, measuring the effort required and the quality of models trained on the resulting data. The approaches have different strengths and weaknesses. One reduces effort by 16-31% and produces models within 1.6 F1 of models trained on consistent data. Another reduces effort by 50-87%, but leads to weaker models. These results indicate that inconsistency identification is possible, but there is scope to further improve the tradeoff between effort and data quality. This work provides the basis for a new research direction in addressing inconsistencies in crowdsourcing by defining a typology of inconsistency types, collecting new benchmark datasets, and exploring directions for automatic identification of inconsistencies.

Annotation Consistency
Inconsistencies have been studied across a wide range of tasks. Part-of-speech tagging has received particular attention, with a range of methods based on model scores (Abney et al., 1999; Eskin, 2000; Matsumoto and Yamashita, 2000; van Halteren, 2000; Ma et al., 2001; Nakagawa and Matsumoto, 2002). One particular method based on variation in POS tags of n-grams (Dickinson and Meurers, 2003) has been extended to predicate-argument relations (Dickinson and Lee, 2008) and dependency parses (Dickinson, 2010; Dickinson and Smith, 2011). More recent work has explored automatic identification methods for word sense and multi-word-expression annotation (Dligach and Palmer, 2011; Hollenstein et al., 2016).

Crowdsourcing Quality
A range of options have been developed for improving crowdsourcing quality. The most common approach is to collect multiple annotations and then aggregate them (Hovy et al., 2013; Passonneau and Carpenter, 2014; Parde and Nielsen, 2017; Dumitrache et al., 2018). This can identify inconsistencies, but at significant cost, as each example must be annotated multiple times. Less expensive options include using examples with known answers to test worker attention (Oppenheimer et al., 2009), or filtering workers based on qualification tasks or preliminary annotation (Li and Liu, 2015; Roit et al., 2020). However, these primarily address worker attentiveness, which does not cover all the inconsistency types we consider. In particular, we also cover inconsistencies related to subtle cases without an obvious correct answer.

Error Type Categorization
Another line of work has explored automatic identification of error types for tasks such as constituency parsing (Kummerfeld et al., 2012), coreference resolution, semantic role labeling (He et al., 2017), and slot-filling (Béchet and Raymond, 2018). However, that work focuses on evaluating system outputs against a gold standard reference in order to understand shortcomings of the systems. One notable exception (Niu and Penn, 2019) establishes a taxonomy of annotation errors, although only for a single corpus (ATIS (Hemphill et al., 1990)).

Table 1: Examples of inconsistency types. Slot Format: the boundaries of a slot are inconsistent; "british" is part of TARGET in 1 but not 3, "the" is part of TARGET in 2 but not 1, and "dollars" is part of SOURCE in 1 but not 2. Addition: a span is labeled that should not be; here, "many" should not be labeled.

Inconsistencies
In this section, we define six inconsistency types, measure their frequency in three new datasets, and evaluate their impact on model performance. We developed the set of inconsistencies by manually inspecting crowd-annotated data and looking for patterns in the common mistakes.

Types of Inconsistencies
We identify six inconsistency types that arise in the crowdsourcing setting when multiple non-expert human annotators work independently. Table 1 briefly describes the types through examples. Some inconsistencies arise when there is ambiguity in an aspect of the conventions for labeling slots, with different workers resolving the ambiguity differently. Other inconsistencies are mistakes that directly violate the specified annotation request, arising due to either confusion or rushed effort.

Slot Format Inconsistency
In some cases, it is clear that a slot is present, but the precise boundaries of the slot may be unclear. For example, consider labeling an ACCOUNT in the text "to my checking account please". There are multiple reasonable annotations, ranging from "to my checking account" down to just "checking", depending on which surrounding tokens are included. Depending on how the extracted slots are used, which of these is right may differ. For example, the "my" indicates that the checking account belongs to the user, which could be important for distinguishing it from another context (e.g., "Sarah's checking"). It is possible that any of these would work for some downstream uses, but training with inconsistent labels will present a confusing training signal for models.

Omission Inconsistency
This covers cases where a span is meant to be labeled with a certain slot label, but is not. The span "savings" should be labeled as ACCOUNT in example 6, but is not. This might occur if a worker is moving quickly through examples and submits after assigning the first label.

Addition Inconsistency
These occur when a span is labeled with a slot label but should be unlabeled. Consider the following: "savings" is a valid ACCOUNT, but "checking" is not. This might occur if a worker is simply pattern matching on key terms rather than carefully reading each example.

Wrong Label Inconsistency
This is when a span is correctly identified as a slot, but assigned an incorrect label. For example: 9. "transfer [30 dollars]ACCOUNT please" Here, "30 dollars" is marked as a slot, but with ACCOUNT instead of AMOUNT. These could be the result of a typo or mis-click when using an annotation tool, or due to a misunderstanding of the slot.

Swapped Label Inconsistency
This is a special case of Wrong Label inconsistency. A Swapped Label inconsistency is when (a) two spans in the same sample have the wrong label, but would be correct if they were swapped, and (b) the two labels are for the same item type (e.g., a currency, or a location). For example, consider labels for the SOURCE and TARGET currencies below: The first example clearly has the two labels reversed. The second example is ambiguous and so may be annotated one way by some workers and the reverse by others.

Chop and Join Inconsistency
These occur when either a span is labeled as one slot where several annotations are appropriate (Join), or when a span is labeled as several annotations where a single annotation is appropriate (Chop). Consider the following two samples related to scheduling a flight departure date (DATE): In the first sample, the DATE slot is applied to one continuous token span, but in the second sample the DATE slot is applied to each individual attribute (i.e., day, month, and year).

Other
When annotating inconsistencies we also include an Other category. This covers annotations that do not fit within the cases defined above, but do seem inconsistent with the rest of the annotations.

Datasets
We crowdsourced slot annotations for datasets in three domains to analyze the frequency of each inconsistency type and the effect inconsistencies have on slot-filling model performance. Table 2 shows examples from each dataset and the number of different slot types. The sentences used for two (Forex and Companies) were collected using the crowdsourcing approach described by Larson et al. (2019) and sentences for the third (Flights) were sampled from prior work (Jaech et al., 2016). The Flights and Companies datasets contain 500 samples each, and Forex contains 520. All samples are in English. We crowdsourced slot label annotations using Amazon Mechanical Turk, paying workers $0.10 per sample. Each sample was annotated by one worker, who was asked to annotate all slot types present. Instructions contained (1) descriptions of each slot label type, and (2)

Manually Labeling Inconsistencies
Once annotated by the crowd, we manually checked each annotated sample for inconsistencies. Inconsistencies in the data were labeled independently by two of the authors of this paper, who then discussed and reconciled any disagreements. We did not predefine an annotation policy for every slot, particularly for Slot Format inconsistencies. Instead, we looked at the data and followed the dominant behavior of workers over all similar annotations in a dataset, labeling annotations in the minority as inconsistencies. We used a search tool (Larson et al., 2020) to help find all relevant samples to determine the majority annotation scheme. For the Flights dataset we also used the original annotations to help identify inconsistencies (but again, followed the crowd majority conventions even when they deviated from the original data).

Table 4: Slot-filling F1 scores when training on data with different levels of inconsistency. The evaluation sets are composed of consistently annotated data.

The frequency of samples having any type of inconsistency is quite high, ranging from 27.5% to 65% (Table 3). Datasets with more slot types have more inconsistencies. The Companies dataset has the highest rate of inconsistencies, confirming our hypothesis that this is a challenging dataset for non-experts to annotate. The Slot Format inconsistency is the most common type for all of the datasets, and Omission is the second most common. This observation guided the focus of the next section of the paper, where we develop automatic methods for finding inconsistencies. As expected, datasets with slot values that can be one of two slot types (e.g., TO LOCATION and FROM LOCATION) have a higher occurrence of Swapped Label inconsistencies.

Measuring the Impact on Model Performance
Given that annotated slot-filling data is used to train slot-filling models, we study the impact that the noisy datasets have on slot-filling performance. We trained slot-filling models on three versions of each dataset: the raw crowd-annotated dataset (Crowd), a version in which we fixed Slot Format inconsistencies (Formatted), and a version in which we fixed all inconsistencies (Consistent). We include the Formatted version in our analysis since Slot Format inconsistencies are the most common. Each model was tested on a test set with consistent annotations. For the slot-filling model, we used a Bi-LSTM architecture with randomly initialized word embeddings, similar to the template-based approach of Finegan-Dollak et al. (2018). We computed the slot F1 score using 10-fold cross-validation.
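To make the evaluation protocol concrete, the sketch below shows one way a cross-validated slot F1 could be computed, assuming BIO-tagged token sequences and the seqeval library for entity-level F1. The train_fn argument and its .predict interface are placeholders for whichever slot-filling model is used, not our actual implementation.

```python
import numpy as np
from sklearn.model_selection import KFold
from seqeval.metrics import f1_score

def cross_validated_slot_f1(sentences, bio_tags, train_fn, n_splits=10):
    """Average entity-level slot F1 over K folds.

    sentences: list of token lists; bio_tags: list of BIO tag lists (same shape).
    train_fn(train_sents, train_tags) is assumed to return a model exposing
    .predict(sents) -> list of tag lists (a placeholder interface).
    """
    folds = KFold(n_splits=n_splits, shuffle=True, random_state=0)
    scores = []
    for train_idx, test_idx in folds.split(sentences):
        model = train_fn([sentences[i] for i in train_idx],
                         [bio_tags[i] for i in train_idx])
        predictions = model.predict([sentences[i] for i in test_idx])
        scores.append(f1_score([bio_tags[i] for i in test_idx], predictions))
    return float(np.mean(scores))
```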
Results Table 4 shows slot-filling F1 scores for each training set. The difference in model performance when trained on noisy versus consistently annotated training samples is substantial, with the Companies dataset seeing an absolute difference of roughly 40 F1 points. The differences for the Forex and Flights datasets are also pronounced. We also note that fixing Slot Format inconsistencies (Formatted) yields model improvements, but not enough to reach Consistent performance, indicating that the other, less frequent inconsistencies also cause substantial model degradation.
Inconsistency Type Impact To investigate the relative impact of each inconsistency type, we performed a second experiment in which we introduced artificial inconsistencies into the Consistent version of the Flights dataset. Separately for each inconsistency type, we automatically modified a fraction of the training samples to introduce that inconsistency, then trained models and measured performance. Figure 2 shows that an increase in any type of inconsistency negatively affects model performance. Swap and Chop inconsistencies have the largest effect on model performance. Fortunately, they are rare in our crowdsourced data (see Table 3). Addition inconsistencies had relatively little impact, but are also relatively rare. The remaining types, which are the more common ones, all have similar impacts.
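As an illustration of how such corruption can be injected, the sketch below introduces Swapped Label inconsistencies for one label pair (SOURCE/TARGET, chosen as an example) in a random fraction of samples; the sample format and function name are hypothetical, not our actual procedure.

```python
import random

def inject_swapped_labels(samples, fraction, pair=("SOURCE", "TARGET"), seed=0):
    """Introduce Swapped Label inconsistencies into a fraction of samples.

    A sample is assumed to be {"tokens": [...], "slots": [(start, end, label), ...]}
    with token-index spans; this format is illustrative only.
    """
    rng = random.Random(seed)
    corrupted = [dict(s, slots=list(s["slots"])) for s in samples]
    # Only samples containing both labels of the pair can be swapped.
    eligible = [i for i, s in enumerate(corrupted)
                if {label for _, _, label in s["slots"]} >= set(pair)]
    flip = {pair[0]: pair[1], pair[1]: pair[0]}
    for i in rng.sample(eligible, int(fraction * len(eligible))):
        corrupted[i]["slots"] = [(start, end, flip.get(label, label))
                                 for start, end, label in corrupted[i]["slots"]]
    return corrupted
```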
The results from these experiments demonstrate that model performance is degraded when training on inconsistently-annotated data. This observation motivates our next goal: automatically detecting inconsistent annotations.

Automatic Inconsistency Detection
Addressing the challenge of inconsistent annotations usually involves a combination of two approaches: improving annotation instructions and manual post-processing. Developing better instructions is a challenge, as it requires either foreseeing all scenarios where there may be inconsistency in annotation conventions, or looking at enough annotated data to identify inconsistencies across examples. Even if comprehensive instructions are developed, there is no guarantee that annotators will follow the conventions perfectly. Manual post-processing can address these issues, but without any means of guiding effort to potentially problematic examples it involves a substantial amount of effort. Collecting multiple annotations of each sample can reduce these issues, but increases costs and will not resolve all cases as the majority annotation for one example may be inconsistent with the majority for another example.
In this section, we explore methods of automatic inconsistency detection. These could be used either to assist in refinement of instructions or to guide post-processing to focus effort. We propose a range of automatic inconsistency detection algorithms that are a first step towards addressing this challenge.
The setting we consider is when a new dataset has been annotated by crowd workers with one annotation per example. There is no in-domain gold standard annotated data available and no other data with the same annotation scheme available. This matches the application described in Section 1: developing a dialog system for a new domain with its own set of slots.

Format Checker
The Format Checker is specifically tailored to detect Slot Format inconsistencies. It has three steps: (1) form sets of characteristic left and right n-grams for slots, (2) check when text adjacent to a span is in an n-gram set, (3) use majority voting to suggest whether an n-gram should be part of a slot or not. We say a sample has been "flagged" if it contains a token span with a suggested change. Suggestions for flagged samples can be applied automatically or on a case-by-case basis. We explain these three steps in more detail below.
N-gram sets The first step is to construct left- and right-n-gram sets. A left n-gram set is the set of all contiguous token sequences in a span that do not include the last token. For example, for the text span "my premier savings account", the left n-gram set is {"my", "premier", "savings", "my premier", "premier savings", "my premier savings"}. A right n-gram set is the same, except it excludes the first token of a span rather than the last one. We build these sets over all text spans in the data for each slot.
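A minimal sketch of this step, assuming spans are given as token lists (not our exact implementation):

```python
def left_ngram_set(span_tokens):
    """All contiguous token sequences of a span that exclude its last token."""
    prefix = span_tokens[:-1]
    return {" ".join(prefix[i:j])
            for i in range(len(prefix))
            for j in range(i + 1, len(prefix) + 1)}

def right_ngram_set(span_tokens):
    """All contiguous token sequences of a span that exclude its first token."""
    suffix = span_tokens[1:]
    return {" ".join(suffix[i:j])
            for i in range(len(suffix))
            for j in range(i + 1, len(suffix) + 1)}

# Example from the text:
# left_ngram_set(["my", "premier", "savings", "account"]) ==
#   {"my", "premier", "savings", "my premier", "premier savings", "my premier savings"}
```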
Checking adjacent tokens Using the n-gram sets, we identify cases where a boundary may be inconsistent. For each slot in each example, we consider the text to the left of the slot and see if it is in the left-n-gram set. Similarly, we check if the text to the right of the slot is in the right-n-gram set. If there is a match, then there may be an inconsistency, where the adjacent text should be part of the span. For example, consider these three samples: All three samples contain an ACCOUNT slot. The left n-gram set for these samples is {"primary", "primary savings", "savings", "checking", "primary checking"}. When checking the second example, the algorithm would identify "primary" as a word that is in the left n-gram set but not in the span, which indicates that there may be a span inconsistency here.
Voting The previous step provides a set of possible inconsistencies. For each one, we then use a voting process to determine what the dominant convention appears to be. This works by counting how frequently the n-gram is included in spans and how often it is adjacent to a span. If it is included more often, the suggestion is to add it in the other cases; otherwise, the suggestion is to remove it from the cases where it is used. In the example above, including "primary" is more common than excluding it, and so the suggestion would be to add it in the second sample.
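The sketch below puts steps (2) and (3) together, reusing left_ngram_set and right_ngram_set from the previous sketch. The sample format ({"tokens": ..., "slots": [(start, end, label), ...]} with half-open token-index spans) is assumed for illustration and is not our actual data format.

```python
from collections import Counter, defaultdict

def format_checker(samples, max_adjacent=3):
    """Flag spans whose adjacent text matches a characteristic left/right n-gram."""
    inside = Counter()    # (side, label, ngram): times seen inside an annotated span
    adjacent = Counter()  # (side, label, ngram): times seen next to an annotated span
    matches = []          # (sample index, slot, side, ngram) candidates to vote on

    # Step 1: per-label left/right n-gram sets, with inside counts for voting.
    sets = {"left": defaultdict(set), "right": defaultdict(set)}
    for sample in samples:
        for start, end, label in sample["slots"]:
            span = sample["tokens"][start:end]
            for ngram in left_ngram_set(span):
                sets["left"][label].add(ngram)
                inside[("left", label, ngram)] += 1
            for ngram in right_ngram_set(span):
                sets["right"][label].add(ngram)
                inside[("right", label, ngram)] += 1

    # Step 2: check the text adjacent to each span against the n-gram sets.
    for idx, sample in enumerate(samples):
        for slot in sample["slots"]:
            start, end, label = slot
            for n in range(1, max_adjacent + 1):
                if start - n >= 0:
                    left = " ".join(sample["tokens"][start - n:start])
                    if left in sets["left"][label]:
                        adjacent[("left", label, left)] += 1
                        matches.append((idx, slot, "left", left))
                if end + n <= len(sample["tokens"]):
                    right = " ".join(sample["tokens"][end:end + n])
                    if right in sets["right"][label]:
                        adjacent[("right", label, right)] += 1
                        matches.append((idx, slot, "right", right))

    # Step 3: majority vote. If the n-gram occurs inside spans more often than it
    # is adjacent to them, suggest widening this span; otherwise suggest shrinking
    # the spans that contain it.
    flags = []
    for idx, slot, side, ngram in matches:
        key = (side, slot[2], ngram)
        suggestion = "add to span" if inside[key] > adjacent[key] else "remove from spans"
        flags.append({"sample": idx, "slot": slot, "side": side,
                      "ngram": ngram, "suggestion": suggestion})
    return flags
```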

Label Variation Detector
This approach is an adaptation of the variation n-gram method proposed by Dickinson and Meurers (2003) for detecting errors in part-of-speech annotations. The approach involves two steps: (1) extract n-gram patterns, (2) use majority vote on annotations in the n-gram patterns to identify the most common pattern. These suggestions can be applied automatically, or on a case-by-case basis.
N-gram patterns In this step, for each slot in each example we extract an n-gram consisting of the slot and k tokens on either side. Consider these four examples: If k is one, the n-gram extracted would be "my checking to". In these examples, the n-gram receives a SOURCE label in the first and third cases, a TARGET label in the second, and no annotation in the fourth.
Identifying the most common annotation For each n-gram from the previous step, we count how many times each annotation occurs. The most common annotation is suggested as the standard, and all examples containing the n-gram are presented. In the example above, this means that the suggestion would be to add a label in the fourth case and change the label in the second case.
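A minimal sketch of the two steps, using the same illustrative sample format as in the earlier sketches; for brevity it only votes over labeled occurrences of each n-gram and does not additionally search for unlabeled occurrences of the same n-gram text.

```python
from collections import Counter, defaultdict

def label_variation_flags(samples, k=1):
    """Variation n-grams: the same slot-plus-context n-gram should always get the
    same label.  A sample is {"tokens": [...], "slots": [(start, end, label), ...]}."""
    occurrences = defaultdict(list)  # n-gram text -> [(sample index, slot)]
    for idx, sample in enumerate(samples):
        for start, end, label in sample["slots"]:
            ngram = " ".join(sample["tokens"][max(0, start - k):end + k])
            occurrences[ngram].append((idx, (start, end, label)))

    flags = []
    for ngram, occs in occurrences.items():
        counts = Counter(slot[2] for _, slot in occs)
        majority_label, _ = counts.most_common(1)[0]
        for idx, slot in occs:
            if slot[2] != majority_label:  # disagreement with the dominant convention
                flags.append({"sample": idx, "slot": slot, "ngram": ngram,
                              "suggested_label": majority_label})
    return flags
```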

FCLVD: Combining Methods
The two methods discussed so far are complementary: the Format Checker is intended to be applied first to detect and fix Slot Format inconsistencies, and the Label Variation Detector then detects everything else. We can combine these methods to detect a larger set of potential inconsistencies, as sketched below.
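In its simplest form, the combination can be taken as the union of the flags produced by the two sketches above (function names from those sketches; illustrative only):

```python
def fclvd_flags(samples):
    """Union of sample indices flagged by the Format Checker and the
    Label Variation Detector sketches above."""
    flagged = {f["sample"] for f in format_checker(samples)}
    flagged |= {f["sample"] for f in label_variation_flags(samples)}
    return flagged
```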

CRF Agreement
In this method, we use cross-validation to train models on the data and predict labels for the same data. Any disagreement between a prediction and the annotation is considered a potential inconsistency. We use a conditional random field (CRF) (Lafferty et al., 2001) as it is fast to train and less likely to overfit the data than a large neural model. The reasoning behind this approach is that if there is an inconsistency, the CRF will learn one of the conventions, leading to disagreements wherever the other convention is present.
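A sketch of this procedure is below. We assume BIO-tagged sequences, simple per-token features, and the sklearn-crfsuite package as the CRF implementation; these choices are illustrative rather than the exact setup.

```python
import sklearn_crfsuite
from sklearn.model_selection import KFold

def token_features(tokens, i):
    """Minimal per-token feature dict; a real feature set would be richer."""
    return {"word": tokens[i].lower(),
            "prev": tokens[i - 1].lower() if i > 0 else "<s>",
            "next": tokens[i + 1].lower() if i < len(tokens) - 1 else "</s>"}

def crf_agreement_flags(sentences, bio_tags, n_splits=10):
    """Flag samples where a cross-validated CRF disagrees with their annotation."""
    X = [[token_features(tokens, i) for i in range(len(tokens))] for tokens in sentences]
    flagged = set()
    for train_idx, test_idx in KFold(n_splits=n_splits, shuffle=True,
                                     random_state=0).split(X):
        crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                                   max_iterations=100)
        crf.fit([X[i] for i in train_idx], [bio_tags[i] for i in train_idx])
        predictions = crf.predict([X[i] for i in test_idx])
        for i, predicted in zip(test_idx, predictions):
            if predicted != bio_tags[i]:  # any token-level disagreement flags the sample
                flagged.add(i)
    return flagged
```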

Model Discussion
One strength of the Format Checker and Label Variation Detector is that they produce very interpretable results because they use pattern matching. Not only do the algorithms specify the token range of a flagged inconsistency, they also inform the user of precisely which other samples are annotated in a manner that is inconsistent with the flagged sample. In contrast, the CRF Agreement method provides no cross-example feedback, only specifying that a given annotation does not match its predictions. On the other hand, the CRF Agreement approach is more flexible, capable of identifying any error type, including those in the Other category. It is possible that future work could develop specific detection methods for more error types, but that remains an open question.

Experiments
We evaluated inconsistency detection performance on our three crowd-annotated datasets (§3.2). We measured inconsistency detection precision, recall, F1 score, and % Flagged. Precision, recall, and F1 are standard measures of performance. % Flagged is the proportion of samples that a method flags as containing inconsistencies; in practice, this is the proportion of the data that might need to be verified by an expert. This metric is related to precision, but more directly measures the effort required when using the method.

Table 5 shows results for all three datasets with the CRF method and the FCLVD method that combines the Format Checker and Label Variation Detector. The CRF has consistently higher recall, catching almost all of the inconsistencies. However, this comes at the cost of creating more work, as its % Flagged is consistently higher, by a factor of five in the case of the Forex data. Table 5 thus presents a tradeoff between the two approaches: the CRF has higher recall but flags more samples than the actual observed inconsistency rates, while the FCLVD method flags fewer samples and has higher precision.

Table 7: Slot-filling F1 scores when training on data fixed based on the output of each method. The evaluation sets are composed of consistently annotated data.
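For concreteness, the four detection metrics can be computed as below, treating flagged and manually identified inconsistent samples as sets of indices (an illustrative sketch):

```python
def detection_metrics(flagged, inconsistent, total):
    """Precision, recall, F1, and % Flagged for inconsistency detection.

    flagged: set of sample indices flagged by a detection method;
    inconsistent: set of sample indices manually judged inconsistent;
    total: number of samples in the dataset.
    """
    true_positives = len(flagged & inconsistent)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(inconsistent) if inconsistent else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    percent_flagged = 100.0 * len(flagged) / total
    return {"precision": precision, "recall": recall,
            "f1": f1, "% flagged": percent_flagged}
```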

Results
Table 6 highlights the difference in performance between the Format Checker and Label Variation Detector methods on Slot Format inconsistencies. We analyze this inconsistency type in particular since it is the most common type (see Table 3). The Format Checker has much higher recall on this inconsistency type than the Label Variation Detector, which is expected since the Format Checker is specifically designed to detect Slot Format inconsistencies. When combined, the two methods have higher recall on Slot Format inconsistencies than either method individually.
Finally, Table 7 shows slot-filling F1 scores when training on datasets with corrections based on each automatic inconsistency detection approach. For FCLVD, the improvement can be as large as 14 points, coming within 3 points of consistent data while checking only 36% of examples (from Table 5). CRF does even better, reaching within 1.6 points of consistent data on all three datasets, but as shown in Table 5, it requires checking 69-84% of the examples.

Conclusion
Understanding the nature and frequency of different types of annotation inconsistencies in noisy slot-filling corpora is important for the development of robust task-driven dialog systems. This paper presents a typology of inconsistency types. Applying the typology to three new crowdsourced datasets, we find that the overall inconsistency rate is high and that the Slot Format and Omission types are the most common.
We show that correcting inconsistencies improves the quality of models and propose several methods of automatically detecting inconsistencies. In evaluating our approaches, we find no clearly dominant choice, and there is scope for further work to balance detection recall against the overall number of samples flagged as inconsistent. By detecting inconsistencies we can fix issues in annotations, improve task instructions, and better understand the challenges of slot-filling.