Label Noise in Context

Label noise (incorrectly or ambiguously labeled training examples) can negatively impact model performance. Although noise detection techniques have been around for decades, practitioners rarely apply them, as manual noise remediation is a tedious process. Examples incorrectly flagged as noise waste reviewers’ time, and correcting label noise without guidance can be difficult. We propose LNIC, a noise-detection method that uses an example’s neighborhood within the training set to (a) reduce false positives and (b) provide an explanation as to why the example was flagged as noise. We demonstrate on several short-text classification datasets that LNIC outperforms the state of the art on measures of precision and F0.5-score. We also show how LNIC’s training set context helps a reviewer to understand and correct label noise in a dataset. The LNIC tool lowers the barriers to label noise remediation, increasing its utility for NLP practitioners.


Introduction
Label noise (examples with incorrect or ambiguous labels in a training set) degrades the performance of the learned model, resulting in inaccurate predictions (Frénay and Verleysen, 2014). Automated data collection risks generating noisy datasets, and human annotators may introduce noise through a lack of attention or expertise.
Automatic noise-detection algorithms analyze a training set and flag "suspicious" examples that are likely mislabeled (Brodley and Friedl, 1999; Frénay and Verleysen, 2014). Suspicious examples can be deleted, automatically corrected by an algorithm, or reviewed by a human. Human review is the most effective of these mitigation options but is comparatively expensive.
* The first two authors contributed equally.
Table 1
sports:
⇒ Unexpected increase in running ability
• Is it possible for the libero to score points in volleyball?
• How counter-productive would having two coaches be?
fitness:
• Why doesn't my stamina seem to improve?
• Is there a rule of thumb for setting running goals?

Two problems contribute to making human review time-consuming: false positives and a lack of explanation. False positives are examples that are incorrectly flagged as noise; reviewing such examples wastes the annotator's time. Showing a reviewer a suspicious example without an explanation is effective in the simplest cases, but is likely to cause difficulty and frustration in the more common case of non-obvious noise that requires a deeper comprehension of the data.
To date, few noise-detection algorithms have been designed with human review in mind. Sluban et al. (2010) is the only work we are aware of that recognized that a noise-detection algorithm for use in a human review process should emphasize precision (i.e., reduce the proportion of false positives). However, we are unaware of any existing work that addresses the explainability of detected label noise.
We propose the Label Noise in Context system, or LNIC, which uses the neighborhood surrounding a suspicious example in the training set to improve both precision and explainability. By calculating a similarity matrix for the dataset, we are able to identify a suspicious example's neighborhood and use a method similar to a nearest-neighbors classifier to filter out false positives. Applying a set of simple heuristics to the same similarity matrix allows us to construct a training set context, like that in Table 1. Seen in isolation, an example about running ability labeled as belonging to the sports class is not obviously wrong; however, once the annotator understands that she is seeing it because there are more similar examples in the fitness class, it becomes apparent that there is a better label.
The main contributions of this work are:
• We describe LNIC's nearest-neighbors-based algorithm to improve precision and explainability of automatically detected label noise (Sec. 3).
• We show that neighborhood-based filtering after noise-detection improves precision and F0.5 over the state of the art for five short-text classification datasets (Sec. 4 and 5).
• We present the LNIC tool for reviewing noise in context, demonstrating the value of explanations for understanding and fixing label noise (Sec. 6). A demo video is available at https://www.youtube.com/watch?v=20cigQaCc_k, and a live web demo is at http://lnic.mybluemix.net/

Related Work
Noise Detection. Frénay and Verleysen (2014) conducted a comprehensive survey of the various approaches to detecting and remediating label noise. Many works advocate removing label noise to improve model performance (Brodley and Friedl, 1999; Sánchez et al., 2003; Smith and Martinez, 2011). Teng (2000) advocates automatic relabeling, while others present the case for human-in-the-loop (Ekambaram et al., 2016; Fefilatyev et al., 2012; Matic et al., 1992; Sluban et al., 2010) and hybrid techniques (Miranda et al., 2009). In work contemporaneous with ours, Northcutt et al. (2019) remove examples where a classifier's confidence is low.
The most directly related work is Brodley and Friedl (1999), describing a noise detection method using predictions from an ensemble of classifiers, and Sluban et al. (2010), proposing the High Agreement Random Forest (HARF) system; both systems are described in detail in Section 3.1. Brodley and Friedl (1999) dropped suspicious examples but proposed correction instead as future work. Sluban et al. (2010) note that precision of noise-detection is important when a human will review all suspicious examples. Garcia et al. (2016)'s experiments show that HARF also achieved state-of-the-art F1 scores on a variety of datasets.
Active Learning. Similar to label noise remediation, active learning (Settles, 2014) seeks to minimize the effort a human needs to expend on data labeling activities in order to improve model performance. However, active learning aims to select the most informative unlabeled data to label next, while label noise detection identifies already-labeled data that may require additional labeling effort. We consider active learning and label noise detection to be complementary technologies that might be woven together within a robust model improvement flow.
At a technical level, some active learning and label noise detection techniques are based on similar foundations. Query By Committee (QBC) (Seung et al., 1992) active learning uses an ensemble of classifiers, selecting examples on which the ensemble disagrees for labeling. Similarly, ensemble-based noise detection algorithms select examples where the ensemble agrees (but disagrees with the given label). Model uncertainty, which underpins many effective active learning strategies such as least confident, margin, and entropy, is also the basis of label noise detection methods such as cleanlab (Northcutt et al., 2019).
Explainability. With the rise of increasingly complex classification models, explaining classifier predictions has received a great deal of attention. Perhaps the most well-known system is LIME (Ribeiro et al., 2016). The LIME authors noted that explaining classifier predictions increases human trust and provides insights that can be used to improve the model. To explain a classifier's prediction on a particular example, the algorithm collects nearby examples and the model's predictions for them. It trains a linear model on a simpler representation of this data, allowing it to indicate which words or super-pixels are important in the classifier's decision.
Numerous recent works in NLP and machine learning emphasize explainability. Dhurandhar et al. (2018) explained classifier predictions with positive features that push an example towards its assigned class and negative features whose absence prevents an example from being placed in a different class. Lei et al. (2016) jointly trained a generator and an encoder in order to generate rationales for sentiment prediction and a similar-question-retrieval task. Mullenbach et al. (2018) used a convolutional neural network to predict codes describing the diagnosis and treatment of patients given the clinical notes on the patient encounter. Their attention mechanism not only improved the system's precision and F1, but also highlighted the text that was most relevant to each code. Chiyah Garcia et al. (2018)'s system used an expert-generated decision tree and a set of templates to generate natural language explanations of what an autonomous underwater vehicle was doing and why.
Despite the interest in explainable models, no work that we are aware of has attempted to make detected label noise explainable.

Algorithms
LNIC uses a three-step process. First, a noise-detection algorithm flags suspicious examples. Second, a neighborhood-based filter decides which of these examples to ignore and which to flag for human review. Finally, we generate a context, using rules to select neighbors to present to the user.

Noise-Detection Algorithms
LNIC's noise-detection phase can use any noise-detection algorithm. Here, we report on three ensemble algorithms derived from the literature: consensus (Brodley and Friedl, 1999), agreed correction, and HARF (Sluban et al., 2010).¹
Ensemble noise detection algorithms train several classifiers on cross-validation splits of the train set. Each classifier predicts labels for the left-out examples. The predicted label is the classifier's "vote" for that example. If it matches the current label, the classifier voted that the example is not suspicious; otherwise, the classifier voted that it is. In Brodley and Friedl (1999)'s consensus algorithm, if all votes agree that an example is suspicious, the algorithm flags that example as suspicious. Our agreed correction variant requires all votes from the ensemble to agree not only that an example is mislabeled, but also on what the correct label would be. HARF (Sluban et al., 2010) relies on the fact that a random forest is an ensemble of decision trees; it flags an example as suspicious if a super-majority of trees vote that it is.
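The three voting rules described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the LNIC implementation: the function names are hypothetical, the votes are assumed to be already collected as one list of predicted labels per classifier (or per tree), and the super-majority threshold of 0.6 is a placeholder because the text does not give the value HARF uses.

```python
def consensus_flags(votes, labels):
    """Flag example i when every classifier predicts a label other than labels[i].

    votes[c][i] is the label predicted by classifier c for held-out example i.
    """
    return [all(clf[i] != current for clf in votes)
            for i, current in enumerate(labels)]


def agreed_correction_flags(votes, labels):
    """Flag example i when all classifiers agree on one label and it differs from labels[i]."""
    flags = []
    for i, current in enumerate(labels):
        predicted = {clf[i] for clf in votes}
        flags.append(len(predicted) == 1 and current not in predicted)
    return flags


def supermajority_flags(votes, labels, threshold=0.6):
    """Flag example i when at least `threshold` of the voters disagree with labels[i].

    The default threshold is an assumed placeholder, not a value from the paper.
    """
    flags = []
    for i, current in enumerate(labels):
        disagree = sum(1 for clf in votes if clf[i] != current)
        flags.append(disagree / len(votes) >= threshold)
    return flags


# Hypothetical toy data: four examples, three classifiers.
labels = ["sports", "sports", "fitness", "health"]
votes = [
    ["fitness", "sports", "fitness", "fitness"],
    ["fitness", "fitness", "fitness", "fitness"],
    ["health", "fitness", "fitness", "fitness"],
]

print(consensus_flags(votes, labels))          # [True, False, False, True]
print(agreed_correction_flags(votes, labels))  # [False, False, False, True]
print(supermajority_flags(votes, labels))      # [True, True, False, True]
```

In the toy data, the first example is flagged by consensus but not by agreed correction, because the classifiers agree that "sports" is wrong without agreeing on a replacement.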

Neighborhood Filtering
Neighborhood filtering reduces the number of examples that are incorrectly flagged as noise. If a majority of neighbors of an example have the same label as that example, it suggests that the example is correctly labeled, so LNIC filters it out of the list of suspicious examples.
The neighborhood filter calculates the pairwise cosine similarity of all examples in the training data, then finds the k neighbors closest to each suspicious example s, where k is a tunable hyperparameter. If s's current label y_c is also the most common among those neighbors, s is filtered from the pool of suspicious examples as a false positive; otherwise, s is flagged for human review.
LNIC supports filtering on the feature neighborhood or the activation neighborhood. The feature neighborhood represents each example using its original feature vector (here, USE embeddings (Cer et al., 2018)). The activation neighborhood represents each example in the training set using final layer activations from a neural classifier trained on the entire data set, the idea being to project training examples into a classification space.
¹ Models and hyperparameters are listed in Appendix A.
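The filtering rule can be illustrated with a small, dependency-free Python sketch. It is not the LNIC code: the function names and the two-dimensional toy vectors are hypothetical stand-ins for USE embeddings or classifier activations, and treating a tie for the most common neighbor label as "current label is most common" is an assumption the text does not settle.

```python
import math
from collections import Counter


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def nearest_neighbors(vectors, index, k):
    """Indices of the k examples most similar to vectors[index], excluding itself."""
    scored = [(cosine_similarity(vectors[index], vec), j)
              for j, vec in enumerate(vectors) if j != index]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [j for _, j in scored[:k]]


def neighborhood_filter(vectors, labels, suspicious, k=5):
    """Keep only the suspicious examples whose label is NOT the most common among their k neighbors."""
    kept = []
    for s in suspicious:
        counts = Counter(labels[j] for j in nearest_neighbors(vectors, s, k))
        top = max(counts.values())
        most_common = {label for label, count in counts.items() if count == top}
        if labels[s] in most_common:
            continue  # neighbors support the current label: treat as a false positive
        kept.append(s)  # neighbors contradict the current label: send to human review
    return kept


# Hypothetical toy data: index 6 sits among "fitness" examples but is labeled "sports".
vectors = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2],
           [0.0, 1.0], [0.1, 0.9], [0.2, 0.8], [0.15, 0.85]]
labels = ["sports", "sports", "sports", "fitness", "fitness", "fitness", "sports"]

print(neighborhood_filter(vectors, labels, suspicious=[2, 6], k=3))  # [6]
```

Example 2 is dropped because two of its three nearest neighbors share its label, while example 6 is kept because all three of its nearest neighbors are labeled "fitness".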

Context Generation
The final step of the LNIC algorithm is to apply heuristics to the neighborhood to generate a training set context. This context acts as an explanation, showing (a) which classes the noise-detection ensemble proposed as a better label for the suspicious example, and (b) the most similar examples from the current class and those proposed classes.
The ensembles in the noise-detection algorithms generate a list of predicted labels for each suspicious example. These labels plus the example's current label comprise the permitted labels for that example. The heuristic selects the example from each permitted label that is closest to the suspicious example. If the number of permitted labels n is smaller than the desired context size k, the balance of the context is filled out by selecting the remaining k − n nearest neighbors from the permitted labels.
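One way to read the selection heuristic above is the following Python sketch. It is an interpretation, not the LNIC implementation: `build_context` and its arguments are hypothetical names, `sim_row` is assumed to hold the similarity between the suspicious example and every training example, and the behavior when there are more permitted labels than k (here, one example per label is still returned) is an assumption.

```python
def build_context(sim_row, labels, suspicious, predicted_labels, k):
    """Select context examples for one suspicious example.

    sim_row[j] is the similarity between the suspicious example and example j.
    predicted_labels are the labels proposed by the noise-detection ensemble.
    """
    permitted = set(predicted_labels) | {labels[suspicious]}

    # All other examples carrying a permitted label, nearest first.
    candidates = sorted(
        (j for j in range(len(labels))
         if j != suspicious and labels[j] in permitted),
        key=lambda j: (-sim_row[j], j),
    )

    # Pass 1: the single nearest example from each permitted label.
    context, seen = [], set()
    for j in candidates:
        if labels[j] not in seen:
            seen.add(labels[j])
            context.append(j)

    # Pass 2: fill the remaining slots with the next-nearest permitted examples.
    for j in candidates:
        if len(context) >= k:
            break
        if j not in context:
            context.append(j)
    return context


# Hypothetical toy data: example 0 is suspicious and the ensemble proposed "fitness".
labels = ["sports", "sports", "sports", "fitness", "fitness", "health"]
sim_row = [1.0, 0.4, 0.6, 0.9, 0.8, 0.7]

print(build_context(sim_row, labels, suspicious=0, predicted_labels=["fitness"], k=4))
# [3, 2, 4, 1]
```

The "health" example is never shown, even though it is closer than some "sports" examples, because "health" is not a permitted label for this suspicious example.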
We build the explanation based on both the activation neighborhood and the feature neighborhood; an example that already appears in the activation context is omitted from the feature context and replaced by the next-nearest neighbor.

Experiments
We hypothesize that adding a neighborhood-based filter after noise detection reduces the rates of false positives while retaining true noisy examples. We test this by injecting noise into datasets, running algorithms over them, and measuring the correctly and incorrectly flagged suspicious examples.
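Because the injected noise is known, scoring a detector reduces to comparing two sets of example indices. The sketch below shows that bookkeeping; the function name is hypothetical, and returning 0.0 when nothing is flagged (or nothing was injected) is an assumed convention.

```python
def precision_recall(flagged, injected):
    """Precision and recall of flagged example indices against injected-noise indices."""
    flagged, injected = set(flagged), set(injected)
    true_positives = len(flagged & injected)
    precision = true_positives / len(flagged) if flagged else 0.0
    recall = true_positives / len(injected) if injected else 0.0
    return precision, recall


# Hypothetical run: noise was injected at indices 1, 4, and 5;
# the detector flagged indices 1, 4, 7, and 9.
precision, recall = precision_recall(flagged=[1, 4, 7, 9], injected=[1, 4, 5])
print(precision)  # 0.5
```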

Datasets
We evaluate on the short-text classification datasets listed in Table 2. Phase one of the evaluation introduces label noise, effectively "corrupting" the datasets. The amount of introduced label noise was controlled by an error-rate parameter, interpreted as the fraction of the training set to mislabel.
We used two strategies to introduce label noise: random and next-best. Both selected a random sample of the training data to mislabel. The random strategy assigned a random incorrect label to each selected example. The next-best strategy assigned the "next-best" incorrect label, as predicted by a classifier trained on the entire train set; this simulates a best effort but incorrect labeling, as might be performed by a confused human labeler.
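The two corruption strategies can be sketched as follows. This is an illustrative reading rather than the experimental code: the function names are hypothetical, the per-class scores passed to the next-best strategy are assumed to come from a classifier trained on the full training set, and rounding the number of corrupted examples to the nearest integer is an assumption.

```python
import random


def sample_indices(n, error_rate, rng):
    """Pick round(error_rate * n) distinct example indices to corrupt."""
    return sorted(rng.sample(range(n), int(round(error_rate * n))))


def inject_random_noise(labels, classes, error_rate, seed=0):
    """Replace the label of each sampled example with a random incorrect class."""
    rng = random.Random(seed)
    noisy = list(labels)
    flipped = sample_indices(len(labels), error_rate, rng)
    for i in flipped:
        noisy[i] = rng.choice([c for c in classes if c != labels[i]])
    return noisy, flipped


def inject_next_best_noise(labels, class_scores, error_rate, seed=0):
    """Replace the label of each sampled example with its highest-scoring incorrect class.

    class_scores[i] maps each class to a classifier score for example i (assumed input).
    """
    rng = random.Random(seed)
    noisy = list(labels)
    flipped = sample_indices(len(labels), error_rate, rng)
    for i in flipped:
        wrong = {c: s for c, s in class_scores[i].items() if c != labels[i]}
        noisy[i] = max(wrong, key=wrong.get)
    return noisy, flipped


# Hypothetical toy data: ten examples, three classes, 20% error rate.
classes = ["sports", "fitness", "health"]
labels = ["sports"] * 4 + ["fitness"] * 3 + ["health"] * 3
scores = [{"sports": 0.5, "fitness": 0.3, "health": 0.2} for _ in labels]

noisy_random, flipped_random = inject_random_noise(labels, classes, error_rate=0.2)
noisy_next, flipped_next = inject_next_best_noise(labels, scores, error_rate=0.2)
print(len(flipped_random), len(flipped_next))  # 2 2
```

Returning the corrupted indices alongside the noisy labels keeps the ground truth needed to score a detector afterwards.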

Metrics
Because the goal of the algorithm is to avoid wasting human time, our evaluation should heavily punish false positives. We therefore measure the precision of each algorithm. We also follow Sluban et al. (2010) in reporting F0.5, an F-score that values precision twice as much as recall.
$$F_{0.5} = (1 + 0.5^2) \cdot \frac{\text{precision} \cdot \text{recall}}{(0.5^2 \cdot \text{precision}) + \text{recall}} \qquad (1)$$
Not every situation calls for precision to be valued twice as much as recall. Therefore, we also report Fβ (Rijsbergen, 1979) for β ∈ {1.0, 0.2, 0.1} to reflect the preferences of users who value precision and recall equally, precision five times more than recall, and precision ten times more.

Results
Figure 1 shows average precision and F0.5 scores across the five datasets, and Table 3 further summarizes by averaging across error rates. Appendix B shows results split by dataset and error rate.
Table 3 shows that, averaged across datasets and error rates, adding neighborhood filtering of any kind improves precision of all of the underlying algorithms. For randomly generated noise, this is true for F0.5 as well. Figure 1a also shows that the neighborhood activation filter gives a large boost to precision over all three noise-detection algorithms, and the feature neighborhood filter gives a smaller but still observable benefit.
For next-best noise, adding the feature neighborhood filtering improves F0.5, but activation neighborhood filtering slightly worsens F0.5. From the graph in Figure 1d, it is apparent that activation neighborhood filtering has a benefit to F0.5 at low error rates but declines relative to the other systems as the error rate increases, crossing at error rates near 15%. Addition of too much next-best noise negatively impacts the neural network trained on the uncorrected data, distorting the activation space. While this distortion does not harm precision, it is detrimental to recall.
For both random and next-best noise, agreed correction with activation neighborhood filtering achieves the best average precision. For random noise, HARF with activation-neighborhood filtering gives the best F0.5 across noise rates. However, for next-best noise, HARF suffered a dramatic loss in recall when error rates exceeded about 12% (Figure 1d), leading it to have low overall F0.5. This may be due to the random forest's use of bagging: if a subset of trees trains on samples with a great deal of non-random noise, those trees could learn to misclassify systematically. Agreed correction with feature neighborhood filtering gave the highest average F0.5 for next-best noise.
The upward trend in precision as error rates increase suggests that the same core of false positives is consistently detected. As the number of true positives increases with higher error rates, the core of false positives makes up a smaller fraction of the total number of examples flagged as suspicious.
Table 4 lists Fβ scores. As expected, using a neighborhood filter, which reduces the number of suspicious examples shown to a user, is particularly advantageous when precision is valued more than recall (F0.2 and F0.1), but often exacts a cost when recall and precision are equally important (F1.0). Thus, agreed correction with no neighborhood filter is the best system to optimize F1.0 when using next-best noise. Nevertheless, the strongest system for F1.0 on random noise is still HARF with activation neighborhood filtering, followed closely by consensus with activation neighborhood filtering.
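The effect of β on this trade-off can be seen with a short calculation. The sketch below implements the general Fβ formula, of which Equation 1 is the β = 0.5 case; the precision and recall pairs are invented purely for illustration and are not results from Table 3 or Table 4.

```python
def f_beta(precision, recall, beta):
    """General F-beta score; beta < 1 weights precision more heavily than recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical systems: a filter that raises precision at the expense of recall.
unfiltered = (0.60, 0.90)  # (precision, recall), illustrative values only
filtered = (0.85, 0.55)    # (precision, recall), illustrative values only

for beta in (1.0, 0.5, 0.2, 0.1):
    print(beta,
          round(f_beta(*unfiltered, beta), 3),
          round(f_beta(*filtered, beta), 3))
```

With these made-up numbers the unfiltered system scores higher at β = 1.0, while the filtered system scores higher at β = 0.5 and below, which mirrors the pattern described for Table 4.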

The LNIC Tool
The LNIC tool implements the algorithms described above and provides a web interface to review label noise in context. The interface visually summarizes the overall label noise within a dataset and links to groups of suspicious examples in context, illustrated in Figures 3 and 4.
Figure 1: (a) Precision of noise detection for randomly generated noise. (b) Precision of noise detection for next-best noise. (c) F0.5 of noise detection for randomly generated noise. (d) F0.5 of noise detection for next-best noise.
Table 3: Mean precision and F0.5 for the five datasets, averaged across all error rates. The top row in each section is a baseline system with no filtering.
Data from Stack Exchange illustrates how context helps a reviewer understand problems in a dataset. Sometimes, context shows that an example is mislabeled. Without context, it is easy for an annotator to be uncertain of whether a question about the existence of a myth belongs in the history class; it is a question about a historical civilization, after all. However, from the context in Figure 4, it is clear that even questions about the history of myths are categorized as mythology, and so the example's label should be changed to maintain consistency.
Other times, context can reveal more complex issues with the class structure of the data. Figure 5 shows a suspicious example from the health class that the noise detection algorithm suggests may belong in the fitness class. The context shows that in fact, both classes include questions about the timing of meals with regard to exercise. A human reviewer should make a decision about where the boundary between these two classes should lie and assign these utterances consistently to one class.

Conclusion
Although NLP practitioners know that label noise harms performance, and noise detection algorithms have long been available, this technology is not widely applied because review of detected errors is difficult and time-consuming. LNIC makes human review of possible label noise easier and more efficient. It reduces the number of false positive examples that the reviewer must look at, providing state-of-the-art precision and F0.5 across several short text datasets. And by providing an explanation of why the model flagged an example as suspicious, it makes the output of label noise detectors understandable and actionable.

A Appendix: Models and Hyperparameters
For the neighborhood filter, we set k = 5.
Our raw vector representation of all utterances was USE (Cer et al., 2018). The activations for activation-based filtering and context generation were generated using an MLPClassifier with hidden layer sizes = [100, 512].

B Appendix: Detailed Results
Results were summarized in the body of the paper for conciseness. In this appendix, we present precision and F0.5 for each of the five datasets and for each of the error rates.

C Appendix: Enlarged Figures
This appendix contains the same images as the body of the paper, enlarged to improve accessibility.