FIND: Human-in-the-Loop Debugging Deep Text Classifiers

Since obtaining a perfect training dataset (i.e., a dataset which is considerably large, unbiased, and well-representative of unseen cases) is hardly possible, many real-world text classifiers are trained on the available, yet imperfect, datasets. These classifiers are thus likely to have undesirable properties. For instance, they may have biases against some sub-populations or may not work effectively in the wild due to overfitting. In this paper, we propose FIND -- a framework which enables humans to debug deep learning text classifiers by disabling irrelevant hidden features. Experiments show that by using FIND, humans can improve CNN text classifiers which were trained under different types of imperfect datasets (including datasets with biases and datasets with dissimilar train-test distributions).


Introduction
Deep learning has become the dominant approach to most Natural Language Processing (NLP) tasks, including text classification. With sufficient high-quality training data, deep learning models can perform incredibly well (Zhang et al., 2015; Wang et al., 2019). However, in real-world settings, such ideal datasets are scarce. Oftentimes, the available datasets are small, full of regular but irrelevant words, and contain unintended biases (Wiegand et al., 2019; Gururangan et al., 2018). These can lead to suboptimal models with undesirable properties. For example, the models may be biased against some sub-populations or may not work effectively in the wild because they overfit the imperfect training data.
To improve the models, previous work has looked into different techniques beyond standard model fitting. If the weaknesses of the training datasets or the models are anticipated, strategies can be tailored to mitigate them. For example, augmenting the training data with gender-swapped input texts helps reduce gender bias in the models (Park et al., 2018; Zhao et al., 2018). Adversarial training can prevent the models from exploiting irrelevant and/or protected features (Jaiswal et al., 2019; Zhang et al., 2018). With a limited number of training examples, using human rationales or prior knowledge together with training labels can help the models perform better (Zaidan et al., 2007; Bao et al., 2018; Liu and Avci, 2019).
Nonetheless, some side-effects of suboptimal datasets cannot be predicted and are only found after training, through post-hoc error analysis. To rectify such problems, there have been attempts to enable humans to fix the trained models, i.e., to perform model debugging (Stumpf et al., 2009; Teso and Kersting, 2019). Since the models are usually too complex to understand, manually modifying the model parameters is not possible. Existing techniques, therefore, allow humans to provide feedback on individual predictions instead; additional training examples are then created based on the feedback to retrain the models. However, such local improvements for individual predictions can add up to inferior overall performance (Wu et al., 2019). Furthermore, these existing techniques allow us to rectify only errors related to the examples at hand and provide no way to fix problems hidden in the model parameters.
In this paper, we propose a framework which allows humans to debug and improve deep text classifiers by disabling hidden features which are irrelevant to the classification task. We name this framework FIND (Feature Investigation aNd Disabling). FIND exploits an explanation method, namely layer-wise relevance propagation (LRP) (Arras et al., 2016), to understand the behavior of a classifier when it predicts each training instance.
Then it aggregates all the information using word clouds to create a global visual picture of the model. This enables humans to comprehend the features automatically learned by the deep classifier and then decide to disable some features that could undermine the prediction accuracy during testing. The main differences between our work and existing work are: (i) first, FIND leverages human feedback on the model components, not the individual predictions, to perform debugging; (ii) second, FIND targets deep text classifiers which are more convoluted than traditional classifiers used in existing work (such as Naive Bayes classifiers and Support Vector Machines).
We conducted three human experiments (one feasibility study and two debugging experiments) to demonstrate the usefulness of FIND. For all the experiments, we used as classifiers convolutional neural networks (CNNs) (Kim, 2014), which are a popular, well-performing architecture for many text classification tasks, including the tasks we experimented with (Gambäck and Sikdar, 2017; Johnson and Zhang, 2015; Zhang et al., 2019). The overall results show that FIND with human-in-the-loop can improve the text classifiers and mitigate the said problems in the datasets. After the experiments, we discuss the generalization of the proposed framework to other tasks and models. Overall, the main contributions of this paper are:
• We propose using word clouds as visual explanations of the features learned.
• We propose a technique to disable the learned features which are irrelevant or harmful to the classification task so as to improve the classifier. This technique and the word clouds form the human-debugging framework -FIND.
• We conduct three human experiments that demonstrate the effectiveness of FIND in different scenarios. The results not only highlight the usefulness of our approach but also reveal interesting behaviors of CNNs for text classification.
The rest of this paper is organized as follows. Section 2 explains related work about analyzing, explaining, and human-debugging text classifiers. Section 3 proposes FIND, our debugging framework. Section 4 explains the experimental setup followed by the three human experiments in Section 5 to 7. Finally, Section 8 discusses generalization of the framework and concludes the paper. Code and datasets of this paper are available at https://github.com/plkumjorn/FIND.

Related Work
Analyzing deep NLP models - There has been substantial work in gaining a better understanding of complex, deep neural NLP models. By visualizing dense hidden vectors, Li et al. (2016) found that some dimensions of the final representation learned by recurrent neural networks capture the effect of intensification and negation in the input text. Karpathy et al. (2015) revealed the existence of interpretable cells in a character-level LSTM model for language modelling. For example, they found a cell acting as a line length counter and cells checking if the current letter is inside a parenthesis or a quote. Jacovi et al. (2018) presented interesting findings about CNNs for text classification, including the fact that one convolutional filter may detect more than one n-gram pattern and may also suppress negative n-grams. Many recent papers studied several types of knowledge in BERT (Devlin et al., 2019), a deep transformer-based model for language understanding, and found that syntactic information is mostly captured in the middle BERT layers while the final BERT layers are the most task-specific (Rogers et al., 2020). Inspired by these findings, we make the assumption that each dimension of the final representation (i.e., the vector before the output layer) captures patterns or qualities in the input which are useful for classification. Therefore, understanding the roles of these dimensions (we refer to them as features) is a prerequisite for effective human-in-the-loop model debugging, and we exploit an explanation method to gain such an understanding.

Explaining predictions from text classifiers - Several methods have been devised to generate explanations supporting classifications in many forms, such as natural language texts (Liu et al., 2019), rules (Ribeiro et al., 2018), extracted rationales (Lei et al., 2016), and attribution scores (Lertvittayakumjorn and Toni, 2019).
Some explanation methods, such as LIME (Ribeiro et al., 2016) and SHAP (Lundberg and Lee, 2017), are model-agnostic and do not require access to model parameters. Other methods access the model architectures and parameters to generate the explanations, such as DeepLIFT (Shrikumar et al., 2017) and LRP (layer-wise relevance propagation) (Bach et al., 2015; Arras et al., 2016). In this work, we use LRP to explain not the predictions but the learned features, so as to expose the model behavior to humans and enable informed model debugging.
Debugging text classifiers using human feedback -Early work in this area comes from the human-computer interaction community. Stumpf et al. (2009) studied the types of feedback humans usually give in response to machine-generated predictions and explanations. Also, some of the feedback collected (i.e., important words of each category) was used to improve the classifier via a user co-training approach. Kulesza et al. (2015) presented an explanatory debugging approach in which the system explains to users how it made each prediction, and the users then rectify the model by adding/removing words from the explanation and adjusting important weights. Even without explanations shown, an active learning framework proposed by Settles (2011) asks humans to iteratively label some chosen features (i.e., words) and adjusts the model parameters that correspond to the features. However, these early works target simpler machine learning classifiers (e.g., Naive Bayes classifiers with bag-of-words) and it is not clear how to apply the proposed approaches to deep text classifiers.
Recently, there have been new attempts to use explanations and human feedback to debug classifiers in general. Some of them were tested on traditional text classifiers. For instance, Ribeiro et al. (2016) showed a set of LIME explanations for individual SVM predictions to humans and asked them to remove irrelevant words from the training data in subsequent training. The process was run for three rounds to iteratively improve the classifiers. Teso and Kersting (2019) proposed CAIPI, an explanatory interactive learning framework. At each iteration, it selects an unlabelled example to predict and explain to users using LIME, and the users respond by removing irrelevant features from the explanation. CAIPI then uses this feedback to generate augmented data and retrain the model. While these recent works use feedback on low-level features (input words) and individual predictions, our framework (FIND) uses feedback on the learned features with respect to the big picture of the model. This helps us avoid local decision pitfalls which usually occur in interactive machine learning (Wu et al., 2019). Overall, what makes our contribution different from existing work is that (i) we collect the feedback on the model, not the individual predictions, and (ii) we target deep text classifiers which are more complex than the models used in previous work.

Motivation
Generally, deep text classifiers can be divided into two parts. The first part performs feature extraction, transforming an input text into a dense vector (i.e., a feature vector) which represents the input. There are several ways to implement this part, such as using convolutional layers, recurrent layers, or transformer layers. The second part performs classification, passing the feature vector through a dense layer with softmax activation to obtain the predicted probabilities of the classes. These deep classifiers are not transparent, as humans cannot interpret the meaning of either the intermediate vectors or the model parameters used for feature extraction. This prevents humans from applying their knowledge to modify or debug the classifiers.
In contrast, if we understand which patterns or qualities of the input are captured by each feature, we can comprehend the overall reasoning mechanism of the model, as the dense layer in the classification part then becomes interpretable. In this paper, we make this possible using LRP. By understanding the model, humans can check whether the input patterns detected by each feature are relevant for classification, and whether the features are used by the subsequent dense layer to support the right classes. If either is not the case, debugging can be done by disabling the features that may harm the model. Figure 1 shows the overview of our debugging framework, FIND.

Notation
Let us consider a text classification task with |C| classes, where C is the set of all classes, and let V be the set of unique words in the corpus (the vocabulary). A training dataset D = {(x_1, y_1), ..., (x_N, y_N)} is given, where x_i is the i-th document containing a sequence of L words, [x_i1, x_i2, ..., x_iL], and y_i ∈ C is the class label of x_i. A deep text classifier M trained on dataset D classifies a new input document x into one of the classes (i.e., M(x) ∈ C). In addition, M can be divided into two parts: a feature extraction part M_f and a classification part M_c.
Specifically, f = M_f(x) = [f_1, f_2, ..., f_d] ∈ R^d is the feature vector of x, while W ∈ R^{|C|×d} and b ∈ R^{|C|} are the parameters of the dense layer of M_c. The final output is the predicted probability vector p = M_c(f) = softmax(Wf + b) ∈ [0, 1]^{|C|}.
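As a concrete illustration of the classification part, the following sketch (not the paper's code; all shapes and values are toy placeholders) computes p = softmax(Wf + b) from a feature vector:

```python
import numpy as np

# Illustrative sketch of the classification part M_c: it maps a feature
# vector f in R^d to class probabilities p = softmax(Wf + b).
rng = np.random.default_rng(0)
d, n_classes = 30, 2                 # d features, |C| classes

W = rng.normal(size=(n_classes, d))  # dense-layer weights, shape |C| x d
b = np.zeros(n_classes)              # dense-layer bias, shape |C|
f = rng.normal(size=d)               # feature vector produced by M_f(x)

def softmax(z):
    e = np.exp(z - z.max())          # subtract max for numerical stability
    return e / e.sum()

p = softmax(W @ f + b)               # predicted probability vector, shape |C|
```

The i-th column of W holds the per-class weights of feature f_i, which is what FIND later inspects and masks.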

Understanding the Model
To understand how the model M works, we analyze the patterns or characteristics of the input that activate each feature f_i. Specifically, using LRP, for each feature f_i of an example x_j in the training dataset, we calculate a relevance vector r_ij ∈ R^L containing the relevance scores (the contributions) of each word in x_j to the value of f_i. After doing this for all d features of all training examples, we can produce word clouds to help the users better understand the model M.
Word clouds -For each feature f i , we create (one or more) word clouds to visualize the patterns in the input texts which highly activate f i . This can be done by analyzing r ij for all x j in the training data and displaying, in the word clouds, words or n-grams which get high relevance scores. Note that different model architectures may have different ways to generate the word clouds so as to effectively reveal the behavior of the features.
For CNNs, the classifiers we experiment with in this paper, each feature has one word cloud containing the n-grams, from the training examples, which were selected by the max-pooling of the CNNs. For instance, Figure 2, corresponding to a feature of filter size 2, shows bi-grams (e.g., "love love", "love my", "loves his", etc.) whose font size corresponds to the feature values of the bi-grams. This is similar to how previous work analyzes CNN features (Jacovi et al., 2018; Lertvittayakumjorn and Toni, 2019), and it is equivalent to back-propagating the feature values to the input using LRP and cropping the consecutive input words with non-zero LRP scores to show in the word clouds.

Figure 2: A word cloud (or, literally, an n-gram cloud) of a feature from a CNN.
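The collection step for one CNN filter can be sketched as follows. This is a hypothetical simplification: `feature_maps[j]` is assumed to hold the filter's activation at each position of document j, and `docs[j]` the document's tokens; max-pooling picks the position with the highest activation, and that n-gram enters the cloud with font size proportional to the feature value.

```python
from collections import defaultdict

# Sketch: gather the max-pooled n-grams of one convolutional filter
# across the training documents, keeping each n-gram's largest activation.
def ngrams_for_cloud(docs, feature_maps, filter_size):
    cloud = defaultdict(float)
    for tokens, acts in zip(docs, feature_maps):
        pos = max(range(len(acts)), key=acts.__getitem__)   # max-pooled position
        ngram = " ".join(tokens[pos:pos + filter_size])
        cloud[ngram] = max(cloud[ngram], acts[pos])         # keep largest activation
    return dict(cloud)

docs = [["i", "love", "my", "dog"], ["she", "loves", "his", "cooking"]]
feature_maps = [[0.1, 0.9, 0.3], [0.2, 0.8, 0.4]]           # length L - n + 1 each
print(ngrams_for_cloud(docs, feature_maps, filter_size=2))
# the bi-grams "love my" and "loves his" dominate this toy cloud
```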

Disabling Features
As explained earlier, we want to know whether the learned features are valid and relevant to the classification task and whether they get appropriate weights from the next layer. This is possible by letting humans consider the word cloud(s) of each feature and tell us which class the feature is relevant to. A word cloud receiving human answers that differ from the class it should support (as indicated by W) exhibits a flaw in the model. For example, if the word cloud in Figure 2 represents the feature f_i in a sentiment analysis task but the i-th column of W implies that f_i supports the negative sentiment class, we know the model is not correct here. If this word cloud appears in a product categorization task, this is also problematic because the phrases in the word cloud are not discriminative of any product category. Hence, we provide options for the users to disable the features which correspond to any problematic word clouds so that these features play no role in the classification. To achieve this, we modify M_c into M_c' where p = M_c'(f) = softmax((W ⊙ Q)f + b), Q ∈ R^{|C|×d} is a masking matrix, and ⊙ is the element-wise multiplication operator. Initially, all elements of Q are ones, which enables all the connections between the features and the output.
To disable feature f_i, we set the i-th column of Q to be a zero vector. After disabling features, we freeze the parameters of M_f and fine-tune the parameters of M_c' (except the masking matrix Q) on the original training dataset D in a final step.
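The masking step can be sketched numerically (shapes follow the notation above; the weights, the feature vector, and the chosen feature indices are toy placeholders):

```python
import numpy as np

# Sketch of feature disabling: M_c' computes p = softmax((W ⊙ Q)f + b);
# disabling feature i zeroes the i-th column of the masking matrix Q,
# cutting that feature's connection to every output class.
def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

d, n_classes = 30, 2
rng = np.random.default_rng(1)
W = rng.normal(size=(n_classes, d))
b = np.zeros(n_classes)
f = rng.normal(size=d)

Q = np.ones((n_classes, d))          # initially every connection is enabled
disabled = [3, 17]                   # hypothetical features chosen for disabling
Q[:, disabled] = 0.0                 # zero out the corresponding columns

p = softmax((W * Q) @ f + b)         # disabled features no longer contribute
```

Zeroing a column of Q is equivalent to zeroing the corresponding entries of f before the dense layer, which is why the disabled features carry no weight during fine-tuning.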

Experimental Setup
All datasets and their splits used in the experiments are listed in Table 1. We will explain each of them in the following sections. For each classification task, we trained and improved three models, using different random seeds, independently of one another, and the reported results are the average of the three runs. Regarding the models, we used 1D CNNs with the same structure for all the tasks and datasets. The convolution layer had three filter sizes [2, 3, 4] with 10 filters for each size (i.e., d = 10 × 3 = 30). All the activation functions were ReLU, except the softmax at the output layer. The input documents were padded or trimmed to 150 words (L = 150). We used pre-trained 300-dimensional GloVe vectors (Pennington et al., 2014) as non-trainable weights in the embedding layers. All the models were implemented using Keras and trained with the Adam optimizer. We used iNNvestigate (Alber et al., 2018) to run LRP on the CNN features. In particular, we used the LRP-ε propagation rule to stabilize the relevance scores (ε = 10^-7). Finally, we used Amazon Mechanical Turk (MTurk) to collect crowdsourced responses for selecting features to disable. Each question was answered by ten workers, and the answers were aggregated using majority votes or average scores depending on the question type (as explained next).
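The forward pass of this architecture can be sketched in pure NumPy (random weights stand in for trained parameters and GloVe embeddings; only the shapes follow the setup above):

```python
import numpy as np

# Sketch of the CNN forward pass used in the experiments: 1D convolutions of
# filter sizes [2, 3, 4] (10 ReLU filters each) over the embedded text,
# max-pooling over time, then a softmax dense layer (L = 150, 300-dim
# embeddings, d = 30 features).
rng = np.random.default_rng(0)
L, emb_dim, n_classes = 150, 300, 2
X = rng.normal(size=(L, emb_dim))                 # embedded input document

def conv_maxpool(X, filters, size):
    # filters: (n_filters, size * emb_dim); returns one max-pooled value per filter
    windows = np.stack([X[i:i + size].ravel() for i in range(len(X) - size + 1)])
    acts = np.maximum(windows @ filters.T, 0.0)   # ReLU activations per position
    return acts.max(axis=0)                       # max-pooling over time

feats = np.concatenate([
    conv_maxpool(X, rng.normal(size=(10, s * emb_dim)) * 0.01, s)
    for s in (2, 3, 4)
])                                                # feature vector f, d = 30
W, b = rng.normal(size=(n_classes, 30)), np.zeros(n_classes)
z = W @ feats + b
p = np.exp(z - z.max()); p /= p.sum()             # softmax over the classes
```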

Exp 1: Feasibility Study
In this feasibility study, we assessed the effectiveness of word clouds as visual explanations to reveal the behavior of CNN features. We trained CNN models using small training datasets and evaluated the quality of CNN features based on responses from MTurk workers to the feature word clouds. Then we disabled features based on their average quality scores. The assumption was: if the scores of the disabled features correlated with the drop in the model predictive performance, it meant that humans could understand and accurately assess CNN features using word clouds. We used small training datasets so that the trained CNNs had features with different levels of quality. Some features detected useful patterns, while others overfitted the training data.

Human Feedback Collection and Usage
We used human responses on MTurk to assign ranks to features. As each classifier had 30 original features (d = 30), we divided them into three ranks (A, B, and C) of 10 features each. We expected that features in rank A would be the most relevant and useful for the prediction task, and features in rank C the least relevant, potentially undermining the performance of the model. To make the annotation more accessible to lay users, we designed the questions to ask whether a given word cloud is (mostly or partially) relevant to one of the classes or not, as shown in Figure 3. If the answer matches how the model really uses this feature (as indicated by W), the feature gets a positive score from this human response. For example, if the CNN feature of the word cloud in Figure 3 is used by the model for the negative sentiment class, the scores of the five options in the figure are -2, -1, 0, 1, and 2, respectively. We collected ten responses for each question and used the average score to sort the features in descending order. After sorting, the 1st-10th, 11th-20th, and 21st-30th features are considered as ranks A, B, and C, respectively. To show the effects of feature disabling, we compared the original model M with modified models in which the features of one or two ranks were disabled.

Figure 4 shows the distribution of average feature scores from one of the three CNN instances for the Yelp dataset. Examples of the word clouds from each rank are displayed in Figure 5. We can clearly see the dissimilar qualities of the three features. Some participants answered that the rank B feature in Figure 5 was relevant to the positive class (probably due to the word 'delicious'), and the weights of this feature in W agreed (Positive:Negative = 0.137:-0.135). Interestingly, the rank C feature in Figure 5 got a negative score because some participants believed that this word cloud was relevant to the positive class, whereas the model actually used this feature as evidence for the negative class (Positive:Negative = 0.209:0.385).
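The score-to-rank step can be sketched as follows (the average scores below are hypothetical; the paper's actual scores come from ten crowd responses per word cloud, mapped to -2..2 depending on agreement with W):

```python
# Sketch: sort 30 features by average crowd score (descending) and split
# them into ranks A (top 10), B (middle 10), and C (bottom 10).
def assign_ranks(avg_scores):
    order = sorted(range(len(avg_scores)), key=lambda i: -avg_scores[i])
    ranks = {}
    for pos, feat in enumerate(order):
        ranks[feat] = "ABC"[pos // 10]   # 0-9 -> A, 10-19 -> B, 20-29 -> C
    return ranks

avg = [1.8 - 0.1 * i for i in range(30)]   # hypothetical average scores
ranks = assign_ranks(avg)
# feature 0 has the highest score, so it lands in rank A; feature 29 in rank C
```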

Results and Discussions
Considering all three runs, Figure 6 (top) shows the average macro F1 score of the original model (the blue line) and of each modified model. The order of the performance drops is AB > A > AC > BC > B > Original > C. This makes sense because disabling important features (rank A and/or B) caused larger performance drops, and the overall results are consistent with the average feature scores given by the participants (as in Figure 4). It confirms that using word clouds is an effective way to assess CNN features. Also, it is worth noting that the macro F1 of the model slightly increased when we disabled the low-quality features (rank C). This shows that humans can improve the model by disabling irrelevant features. The CNNs for the Amazon Products dataset behaved in a similar way (Figure 6, bottom), except that disabling rank C features slightly undermined, rather than increased, performance. This implies that even the rank C features contain a certain amount of useful knowledge for this classifier.

Exp 2: Training Data with Biases
Given a biased training dataset, a text classifier may absorb the biases and produce biased predictions against some sub-populations. We hypothesize that if the biases are captured by some of the learned features, we can apply FIND to disable such features and reduce the model biases.

Datasets and Metrics
We focus on reducing the gender bias of CNN models trained on two datasets: Biosbias (De-Arteaga et al., 2019) and Waseem (Waseem and Hovy, 2016). For Biosbias, the task is predicting the occupation of a given bio paragraph, i.e., whether the person is 'a surgeon' (class 0) or 'a nurse' (class 1). Due to the gender imbalance in each occupation, a classifier usually exploits gender information when making predictions. As a result, bios of female surgeons and male nurses are often misclassified. For Waseem, the task is abusive language detection: assessing whether a given text is abusive (class 1) or not abusive (class 0). Previous work found that this dataset contains a strong negative bias against females (Park et al., 2018). In other words, texts related to females are usually classified as abusive although the texts themselves are not abusive at all. Also, we tested the models trained on the Waseem dataset on another abusive language detection dataset, Wikitoxic (Thain et al., 2017), to assess the generalizability of the models. To quantify gender biases, we adopted two metrics: false positive equality difference (FPED) and false negative equality difference (FNED) (Dixon et al., 2018). The lower these metrics are, the fewer biases the model has.

(Footnote: We also conducted the same experiments with bidirectional LSTM networks (BiLSTMs), which required a different way to generate the word clouds (see Appendix C). The results on BiLSTMs, however, are not as promising as on CNNs. This might be because the way we created word clouds for each BiLSTM feature was not an accurate way to reveal its behavior. Unlike for CNNs, understanding recurrent neural network features for text classification is still an open problem.)
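The two equality-difference metrics (Dixon et al., 2018) can be sketched as follows: for each identity group g, FPED sums |FPR_g - FPR_overall| and FNED sums |FNR_g - FNR_overall|. The example dicts and group labels below are hypothetical.

```python
# Sketch of FPED/FNED computation over hypothetical per-example records.
def rates(examples):
    fp = sum(1 for e in examples if e["pred"] == 1 and e["gold"] == 0)
    fn = sum(1 for e in examples if e["pred"] == 0 and e["gold"] == 1)
    neg = sum(1 for e in examples if e["gold"] == 0) or 1   # avoid division by zero
    pos = sum(1 for e in examples if e["gold"] == 1) or 1
    return fp / neg, fn / pos                                # (FPR, FNR)

def equality_differences(examples, groups):
    fpr_all, fnr_all = rates(examples)
    fped = fned = 0.0
    for g in groups:
        fpr_g, fnr_g = rates([e for e in examples if e["group"] == g])
        fped += abs(fpr_g - fpr_all)
        fned += abs(fnr_g - fnr_all)
    return fped, fned

examples = [
    {"gold": 0, "pred": 1, "group": "f"},   # the only false positive hits group "f"
    {"gold": 0, "pred": 0, "group": "m"},
    {"gold": 1, "pred": 1, "group": "f"},
    {"gold": 1, "pred": 1, "group": "m"},
]
print(equality_differences(examples, ["f", "m"]))   # -> (1.0, 0.0)
```

A perfectly unbiased model would have identical error rates across groups, giving FPED = FNED = 0.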

Human Feedback Collection and Usage
Unlike the interface in Figure 3, for each word cloud we asked the participants to select the relevant class from three options (Biosbias: surgeon, nurse, it could be either; Waseem: abusive, non-abusive, it could be either). A feature is disabled if the majority vote does not select the class suggested by the weight matrix W. To ensure that the participants did not apply their own biases while answering our questions, we explicitly stated in the instructions that gender-related terms should not be used as an indicator for one or the other class.
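The disabling rule can be sketched as follows (the vote labels are illustrative; the model's class for a feature is the one its column of W supports):

```python
from collections import Counter

# Sketch: ten workers each pick the class a word cloud is relevant to; the
# feature is disabled when the majority vote differs from the class the
# model actually uses the feature for.
def should_disable(votes, model_class):
    majority, _ = Counter(votes).most_common(1)[0]
    return majority != model_class

votes = ["nurse"] * 6 + ["surgeon"] * 2 + ["either"] * 2
# workers see a 'nurse'-flavored cloud; disable it if the model uses it for 'surgeon'
print(should_disable(votes, model_class="surgeon"))   # -> True
print(should_disable(votes, model_class="nurse"))     # -> False
```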

Results and Discussions
The results of this experiment are displayed in Figure 7. For Biosbias, on average, the participants' responses suggested disabling 11.33 out of 30 CNN features. By doing so, the FPED of the models decreased from 0.250 to 0.163, and the FNED decreased from 0.338 to 0.149. After investigating the word clouds of the CNN features, we found that some of them detected patterns containing both gender-related terms and occupation-related terms, such as "his surgical expertise" and "she supervises nursing students". Most of the MTurk participants answered that these word clouds were relevant to the occupations, and thus the corresponding features were not disabled. However, we believe that these features might contain gender biases. So, we asked one annotator to consider all the word clouds again and disable every feature whose prominent n-gram patterns contained any gender-related terms, regardless of whether the patterns also detected occupation-related terms. With this new disabling policy, 12 out of 30 features were disabled on average, and the model biases decreased further, as shown in Figure 7 (Debugged (One)). The side-effect of disabling 33% of all the features here was only a slight drop in the macro F1, from 0.950 to 0.933. Hence, our framework succeeded in reducing gender biases without severe negative effects on classification performance.
Concerning the abusive language detection task, on average, the MTurk participants' responses suggested disabling 12 out of 30 CNN features. Unlike Biosbias, disabling features based on MTurk responses unexpectedly increased the gender bias for both the Waseem and Wikitoxic datasets. However, we found one similarity to Biosbias: many of the CNN features captured n-grams which were both abusive and related to a gender, such as 'these girls are terrible' and 'of raping slave girls', and these features had not been disabled. So, we asked one annotator to disable features using the new "brutal" policy: disabling all features which involved gender words, even though some of them also detected abusive words. By disabling 18 out of 30 features on average, the gender biases were reduced for both datasets (except FPED on Wikitoxic, which stayed close to the original value). Another consequence was that we sacrificed 4% and 1% macro F1 on the Waseem and Wikitoxic datasets, respectively. This finding is consistent with Park et al. (2018) in that reducing the bias and maintaining the classification performance at the same time is very challenging.

Figure 7: The average FPED and FNED of the CNN models in Experiment 2 (the lower, the better).

Exp 3: Dataset Shift
Dataset shift is a problem where the joint distribution of inputs and outputs differs between the training and test stages (Quionero-Candela et al., 2009). Many classifiers perform poorly under dataset shift because some of the learned features are inapplicable (or sometimes even harmful) for classifying test documents. We hypothesize that FIND is useful for investigating the learned features and disabling the overfitted ones to increase the generalizability of the model.

Datasets
We considered two tasks in this experiment. The first task aims to classify "Christianity" vs "Atheism" documents from the 20 Newsgroups dataset (http://qwone.com/~jason/20Newsgroups/). This dataset is special because it contains many artifacts: tokens (e.g., person names, punctuation marks) which are not relevant but strongly co-occur with one of the classes. For evaluation, we used the Religion dataset by Ribeiro et al. (2016), containing "Christianity" and "Atheism" web pages, as a target dataset. The second task is sentiment analysis. We used Amazon Clothes, with reviews of clothing, shoes, and related products, as a training dataset, and as test datasets we used Amazon Music, Amazon Mixed (Zhang et al., 2015), and the Yelp dataset (which was used in Experiment 1). Amazon Music contains only reviews from the "Digital Music" product category, which was found to have an extreme distribution shift from the clothes category (Hendrycks et al., 2020). Amazon Mixed compiles reviews from various kinds of products, while Yelp focuses on restaurant reviews.

Human Feedback Collection and Usage
We collected responses from MTurk workers using the same user interfaces as in Experiment 2. Simply put, we asked the workers to select a class which was relevant to a given word cloud and checked if the majority vote agreed with the weights in W.

Results and Discussions
For the first task, on average, 14.33 out of 30 features were disabled, and the macro F1 scores on 20Newsgroups before and after debugging are 0.853 and 0.828, respectively. The same metrics on the Religion dataset are 0.731 and 0.799. This shows that disabling irrelevant features mildly undermined the predictive performance on the in-distribution dataset, but clearly enhanced the performance on the out-of-distribution dataset (see Figure 8, left). This is especially evident for the Atheism class, for which the F1 score increased by around 15% absolute. We noticed from the word clouds that many prominent words for the Atheism class learned by the models are person names (e.g., Keith, Gregg, Schneider) and these are not applicable to the Religion dataset. Forcing the models to use only relevant features (detecting terms like 'atheists' and 'science'), therefore, increased the macro F1 on the Religion dataset.
Unlike 20Newsgroups, Amazon Clothes does not seem to have obvious artifacts. Still, the responses from crowd workers suggested that we disable 6 features. The disabled features were correlated to, but not the reason for, the associated class. For instance, one of the disabled features was highly activated by the pattern "my .... year old" which often appeared in positive reviews such as "my 3 year old son loves this.". However, these correlated features are not very useful for the three out-of-distribution datasets (Music, Mixed, and Yelp). Disabling them made the model focus more on the right evidence and increased the average macro F1 for the three datasets, as shown in Figure 8 (right). Nonetheless, the performance improvement here was not as apparent as in the previous task because, even without feature disabling, the majority of the features are relevant to the task and can lead the model to the correct predictions in most cases.

Discussion and Conclusions
We proposed FIND, a framework which enables humans to debug deep text classifiers by disabling irrelevant or harmful features. Using the proposed framework on CNN text classifiers, we found that (i) word clouds generated by running LRP on the training data accurately revealed the behaviors of CNN features, (ii) some of the learned features might be more useful to the task than the others and (iii) disabling the irrelevant or harmful features could improve the model predictive performance and reduce unintended biases in the model.

Generalization to Other Models
In order to generalize the framework beyond CNNs, there are two questions to consider. First, what is an effective way to understand each feature? We exemplified this with two word clouds representing each BiLSTM feature in Appendix C, and we plan to experiment with advanced visualizations such as LSTMVis (Strobelt et al., 2018) in the future. Second, can we make the model features more interpretable? For example, using ReLU as activation functions in LSTM cells (instead of tanh) renders the features non-negative. So, they can be summarized using one word cloud which is more practical for debugging.
In general, the principle of FIND is understanding the features and then disabling the irrelevant ones. The process makes visualizations and interpretability more actionable. Over the past few years, we have seen rapid growth of scientific research in both topics (visualizations and interpretability) aiming to understand many emerging advanced models including the popular transformer-based models (Jo and Myaeng, 2020;Voita et al., 2019;Hoover et al., 2020). We believe that our work will inspire other researchers to foster advances in both topics towards the more tangible goal of model debugging.

Generalization to Other Tasks
FIND is suitable for any text classification task where a model might learn irrelevant or harmful features during training. It is also convenient to use, since only the trained model and the training data are required as input. Moreover, it can address several problems simultaneously, such as removing religious and racial bias together with gender bias, even if we are not aware of such problems before using FIND. In general, FIND is at least useful for model verification.
For future work, it would be interesting to extend FIND to other NLP tasks, e.g., question answering and natural language inference. This will require some modifications to understand how the features capture relationships between two input texts.

Limitations
Nevertheless, FIND has some limitations. First, the word clouds may reveal sensitive contents in the training data to human debuggers. Second, the more hidden features the model has, the more human effort FIND requires for debugging. For instance, BERT-base (Devlin et al., 2019) has 768 features (before the final dense layer), which would require a lot of human effort to investigate. In this case, it would be more efficient to use FIND to disable attention heads rather than individual features (Voita et al., 2019). Third, it is possible that one feature detects several patterns (Jacovi et al., 2018), and it will be difficult to disable the feature if some of the detected patterns are useful while the others are harmful. Hence, FIND would be more effective when used together with disentangled text representations (Cheng et al., 2020).

A Layer-wise Relevance Propagation
Consider a neuron $k$ whose value is computed from $n$ neurons in the previous layer:
$$x_k = g\left(\sum_{j=1}^{n} x_j w_{jk} + b_k\right)$$
where $x_k$ is the value of the neuron $k$, $g$ is a non-linear activation function, and $w_{jk}$ and $b_k$ are weights and biases in the network, respectively. We can see that the contribution of a single neuron $j$ to the value of the neuron $k$ is
$$z_{jk} = x_j w_{jk} + \frac{b_k}{n},$$
assuming that the bias term $b_k$ is distributed equally to the $n$ neurons. LRP works by propagating the activation of a neuron of interest back through the previous layers in the network proportionally to these contributions. We call the value each neuron receives a relevance score ($R$) of the neuron. To back-propagate, if the relevance score of the neuron $k$ is $R_k$, the relevance score that the neuron $j$ receives from the neuron $k$ is
$$R_{j \leftarrow k} = \frac{z_{jk}}{\sum_{j'=1}^{n} z_{j'k}} R_k.$$
To make the relevance propagation more stable, we add a small positive number $\epsilon$ (as a stabilizer) to the denominator of the propagation rule:
$$R_{j \leftarrow k} = \frac{z_{jk}}{\sum_{j'=1}^{n} z_{j'k} + \epsilon \cdot \mathrm{sign}\left(\sum_{j'=1}^{n} z_{j'k}\right)} R_k.$$
We used this propagation rule, so-called LRP-$\epsilon$, in the experiments of this paper. For more details about LRP propagation rules, please see Montavon et al. (2019).
To explain a prediction of a CNN text classifier, we propagate the activation value of the output node back to the word embedding matrix. After that, the relevance score of an input word equals the sum of the relevance scores that each dimension of its word vector receives. However, in this paper, we want to analyze the hidden features rather than the output, so we start back-propagating from the hidden features instead, to capture the patterns of input words which highly activate each feature.
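As a concrete illustration, the LRP-ε rule for a single dense layer can be written in a few lines of numpy. This is only a sketch of the propagation rule described above, not the implementation used in the experiments:

```python
import numpy as np

def lrp_eps_dense(x, W, b, R_out, eps=1e-6):
    """One LRP-epsilon step through a dense layer y = g(x @ W + b).

    x:     (n,) input activations
    W:     (n, m) weight matrix
    b:     (m,) bias, shared equally among the n inputs
    R_out: (m,) relevance scores of the output neurons
    Returns the (n,) relevance scores of the input neurons.
    """
    n = x.shape[0]
    # Contribution of input j to output k: z[j, k] = x_j * w_jk + b_k / n
    Z = x[:, None] * W + b[None, :] / n
    denom = Z.sum(axis=0)                 # total contribution per output neuron
    denom = denom + eps * np.sign(denom)  # epsilon stabilizer
    return (Z / denom * R_out[None, :]).sum(axis=1)

x = np.array([0.5, 1.0, -0.2])
W = np.array([[0.3, -0.1], [0.2, 0.4], [-0.5, 0.6]])
b = np.array([0.1, -0.2])
R_in = lrp_eps_dense(x, W, b, R_out=np.array([1.0, 0.5]))
```

Note that relevance is approximately conserved: for small ε, the input relevance scores sum to (nearly) the total output relevance, here 1.5.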

B Multiclass Classification
As shown in Figure 9, we used a slightly different user interface in Experiment 1 for the Amazon Products dataset, which is a multiclass classification task. In this setting, we did not provide the options mostly relevant and partly relevant; otherwise, there would have been nine options per question, which would be too many for the participants to answer accurately. With the user interface in Figure 9, we gave a score to the feature f_i based on the participant's answer. Specifically, we re-scaled the values in the i-th column of W to the range [0, 1] using min-max normalization and gave the normalized value of the chosen class as a score to the feature f_i. If the participant selected None, the feature got a zero score. The distribution of the average feature scores for this task (one CNN) is displayed in Figure 10.
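The scoring step above can be sketched as follows; the helper function and the (classes × features) orientation of W are our illustrative assumptions, not the authors' released code:

```python
import numpy as np

def feature_score(W, i, chosen_class):
    """Score feature i from a participant's chosen class.

    W:            (c, d) weight matrix of the final dense layer
    i:            index of the feature (column of W)
    chosen_class: class index the participant picked, or None
    """
    if chosen_class is None:
        return 0.0  # "None" answers give the feature a zero score
    col = W[:, i]
    # Min-max normalize the column to [0, 1], then take the
    # normalized weight of the class the participant chose.
    normalized = (col - col.min()) / (col.max() - col.min())
    return float(normalized[chosen_class])

W = np.array([[0.9, -0.2],
              [0.1,  0.5],
              [-0.3, 0.0]])  # 3 classes, 2 features
score = feature_score(W, 0, 0)  # class 0 has the largest weight for feature 0
```

Under this scheme, a feature earns a full score of 1.0 only when the participant picks the class with the largest weight for that feature, so the score measures agreement between the human's reading of the word cloud and the model's actual use of the feature.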

C Bidirectional LSTM networks
To understand BiLSTM features, we created two word clouds for each feature. The first word cloud contains the top three words which gain the highest positive relevance scores from each training example, while the second word cloud does the same for the top three words which gain the lowest negative relevance scores (see Figure 11). Furthermore, we also conducted Experiment 1 for BiLSTMs. Each direction of the recurrent layer had 15 hidden units, and the feature vector was obtained by taking the element-wise max of all the hidden states (i.e., d = 15 × 2 = 30). We adapted the code of Arras et al. (2017) to run LRP on BiLSTMs. Regarding human feedback collection, we collected feedback from Amazon Mechanical Turk workers by splitting each pair of word clouds into two and asking the question about the relevant class for each of them independently. The answer for the positive relevance word cloud should be consistent with the weight matrix W, while the answer for the negative relevance word cloud should be the opposite. The score of a BiLSTM feature is the sum of its scores from the positive word cloud and the negative word cloud.
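The BiLSTM feature vector described above is simply an element-wise max over time of the concatenated forward and backward hidden states. A schematic numpy sketch (the actual models were of course built in a deep learning framework, and the hidden states here are random placeholders):

```python
import numpy as np

T, h = 8, 15                     # sequence length, hidden units per direction
rng = np.random.default_rng(42)
h_fwd = rng.normal(size=(T, h))  # forward hidden states, one row per time step
h_bwd = rng.normal(size=(T, h))  # backward hidden states

# Concatenate the two directions, then max-pool over the time axis.
H = np.concatenate([h_fwd, h_bwd], axis=1)  # (T, 30)
features = H.max(axis=0)                    # (30,), i.e., d = 15 * 2
```

Each entry of the resulting vector is the strongest activation of one hidden unit anywhere in the sequence, which is why a single feature can be traced back to the specific words that triggered it.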
The results of the extra BiLSTM experiments are shown in Tables 4 and 5. Table 4 shows unexpected results after disabling features. For instance, disabling rank B features caused a larger performance drop than removing rank A features. This suggests that how we created word clouds for each BiLSTM feature (i.e., displaying the top three words with the highest positive and lowest negative relevance) might not be an accurate way to explain the feature. Nevertheless, another observation from Table 4 is that even when we disabled two-thirds of the BiLSTM features, the maximum macro F1 drop was less than 5%. This suggests that there is a lot of redundant information in the features of the BiLSTMs.

D Metrics for Biases
In this paper, we used two metrics to quantify biases in the models -- false positive equality difference (FPED) and false negative equality difference (FNED) -- with the following definitions (Dixon et al., 2018):
$$\text{FPED} = \sum_{t \in T} |\text{FPR} - \text{FPR}_t|, \qquad \text{FNED} = \sum_{t \in T} |\text{FNR} - \text{FNR}_t|$$
where $T$ is the set of identity terms, FPR and FNR are the false positive and false negative rates on the whole test set, and $\text{FPR}_t$ and $\text{FNR}_t$ are the corresponding rates computed only on the test examples containing the term $t$. Lower values indicate less unintended bias.
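These two metrics can be computed as follows; the function and variable names are ours, for illustration only:

```python
def rates(labels, preds):
    """Return (FPR, FNR) for binary labels/predictions (1 = positive class)."""
    fp = sum(1 for y, p in zip(labels, preds) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, preds) if y == 1 and p == 0)
    neg = sum(1 for y in labels if y == 0)
    pos = sum(1 for y in labels if y == 1)
    return (fp / neg if neg else 0.0), (fn / pos if pos else 0.0)

def fped_fned(labels, preds, texts, identity_terms):
    """Equality-difference metrics of Dixon et al. (2018)."""
    fpr_all, fnr_all = rates(labels, preds)
    fped = fned = 0.0
    for t in identity_terms:
        idx = [i for i, x in enumerate(texts) if t in x]
        fpr_t, fnr_t = rates([labels[i] for i in idx], [preds[i] for i in idx])
        fped += abs(fpr_all - fpr_t)  # sum the per-term rate gaps
        fned += abs(fnr_all - fnr_t)
    return fped, fned

# Tiny sanity check: a perfect model has no rate gaps, so both metrics are 0.
labels = [0, 1, 0, 1]
preds  = [0, 1, 0, 1]
texts  = ["gay people", "some text", "muslim person", "other text"]
result = fped_fned(labels, preds, texts, ["gay", "muslim"])  # → (0.0, 0.0)
```

Note that the metrics are sums, not averages, so their magnitude grows with the number of identity terms considered.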
E Datasets
• Wikitoxic: The dataset is publicly available online. We used only the examples which were given the same label by all the annotators.
• 20Newsgroups: We downloaded the standard splits of the dataset using scikit-learn. The header and the footer of each text were removed.

F Full Experimental Results
Tables 2-9 in this section report the full results of all the experiments and datasets. All the results shown are averaged from three runs. Boldface numbers are the best scores in the columns. They are further underlined if they are significantly better than the scores of all the other models. We conducted the statistical significance analysis using the approximate randomization test with 1,000 iterations and a significance level α of 0.05 (Noreen, 1989; Graham et al., 2014).
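The significance test above can be sketched as follows. This is a generic implementation of the approximate randomization test for paired scores, not the exact script used in the experiments:

```python
import random

def approx_randomization_test(scores_a, scores_b, iterations=1000, seed=0):
    """Approximate randomization test for paired scores of two systems.

    Returns a p-value for the observed difference in means under the
    null hypothesis that the two systems' scores are exchangeable.
    """
    rng = random.Random(seed)
    observed = abs(sum(scores_a) / len(scores_a) - sum(scores_b) / len(scores_b))
    count = 0
    for _ in range(iterations):
        a, b = [], []
        for x, y in zip(scores_a, scores_b):
            if rng.random() < 0.5:  # randomly swap each paired score
                x, y = y, x
            a.append(x)
            b.append(y)
        diff = abs(sum(a) / len(a) - sum(b) / len(b))
        if diff >= observed:
            count += 1
    return (count + 1) / (iterations + 1)  # add-one smoothing on the p-value

# Identical score lists can never look significant: p = 1.0.
p = approx_randomization_test([0.80, 0.79, 0.81], [0.80, 0.79, 0.81])
```

Randomization tests like this make no normality assumption, which is why they are a common choice for comparing NLP systems on small numbers of runs.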