Human-grounded Evaluations of Explanation Methods for Text Classification

Due to the black-box nature of deep learning models, methods for explaining the models’ results are crucial to gain trust from humans and support collaboration between AIs and humans. In this paper, we consider several model-agnostic and model-specific explanation methods for CNNs for text classification and conduct three human-grounded evaluations, focusing on different purposes of explanations: (1) revealing model behavior, (2) justifying model predictions, and (3) helping humans investigate uncertain predictions. The results highlight dissimilar qualities of the various explanation methods we consider and show the degree to which these methods could serve for each purpose.


Introduction
Explainable Artificial Intelligence (XAI) aims at providing explanations for decisions made by AI systems. The explanations are useful for supporting collaboration between AIs and humans in many cases (Samek et al., 2018). Firstly, if an AI outperforms humans in a certain task (e.g., AlphaGo (Silver et al., 2016)), humans can learn and distill knowledge from the given explanations. Secondly, if an AI's performance is close to human intelligence, the explanations can increase humans' confidence and trust in the AI (Symeonidis et al., 2009). Lastly, if an AI performs worse than humans, the explanations help humans verify the decisions made by the AI and also improve the AI (Biran and McKeown, 2017).
One of the challenges in XAI is to explain prediction results given by deep learning models, which sacrifice transparency for high prediction performance. Such explanations are called local explanations, as they explain individual predictions (in contrast to global explanations, which explain the trained model independently of any specific prediction). Several methods have been proposed to produce local explanations. Some of them are model-agnostic, applicable to any machine learning model (Ribeiro et al., 2016; Lundberg and Lee, 2017). Others are applicable to a class of models such as neural networks (Bach et al., 2015a; Dimopoulos et al., 1995) or to a specific model such as Convolutional Neural Networks (CNNs) (Zhou et al., 2016).
With so many explanation methods available, the next challenge is how to evaluate them so as to choose the right methods for different settings. In this paper, we focus on human-grounded evaluations of local explanation methods for text classification. Particularly, we propose three evaluation tasks which target different purposes of explanations for text classification: (1) revealing model behavior to human users, (2) justifying the predictions, and (3) helping humans investigate uncertain predictions. We then use the proposed tasks to evaluate nine explanation methods working on a standard CNN for text classification. These explanation methods differ in several aspects. For example, regarding granularity, four explanation methods select words from the input text as explanations, whereas the other five select n-grams. In terms of generality, one of the explanation methods is model-agnostic, two are random baselines, another two (newly proposed in this paper) are specific to 1D CNNs for text classification, and the rest are applicable to neural networks in general. Overall, the contributions of our work can be summarized as follows.
• We propose three human-grounded evaluation tasks to assess the quality of explanation methods with respect to different purposes of usage for text classification. (Section 3)
• To increase diversity in the experiments, we develop two new explanation methods for CNNs for text classification. One is based on gradient-based analysis (Grad-CAM-Text); the other is based on model extraction using decision trees. (Section 4.3.1-4.3.2)
• We evaluate both new methods as well as random baselines and well-known existing methods using the three proposed evaluation tasks. The results highlight dissimilar qualities of the explanation methods and show the degree to which these methods could serve for each purpose. (Section 5)

Terminology Used
We use the following terms throughout the paper.
(1) Model: a deep learning classifier we want to explain, e.g., a CNN.
(2) Explanation: an ordered list of text fragments (words or n-grams) in the input text which are most relevant to a prediction. Explanations for and against the predicted class are called evidence and counter-evidence, respectively.
(3) (Local) explanation method: a method producing an explanation for a given model and input text.
(4) Evaluation method: a process to quantitatively assign scores to explanations, reflecting the quality of the explanation method.

Background and Related Work
This section discusses recent advances in explanation methods and their evaluation for text classification, as well as background knowledge about 1D CNNs, the model used in our experiments.

Local Explanation Methods
Generally, there are several ways to explain a result given by a deep learning model, such as explaining by examples and generating textual explanations (Liu et al., 2019). For text classification in particular, most of the existing explanation methods identify parts of the input text which contribute most towards the predicted class (so-called attribution methods or relevance methods) by exploiting various techniques such as input perturbation (Li et al., 2016), gradient analysis (Dimopoulos et al., 1995), and relevance propagation (Arras et al., 2017b). Besides, there are other explanation methods designed for specific deep learning architectures, such as attention mechanisms (Ghaeini et al., 2018) and extractive rationale generation (Lei et al., 2016). We select some well-known explanation methods (which are applicable to CNNs for text classification) and evaluate them together with the two new explanation methods proposed in this paper.

Evaluation Methods
Focusing on text classification, early works evaluated explanation methods by word deletion: gradually deleting words from the input text in the order of their relevance and checking how the prediction confidence drops (Arras et al., 2016; Nguyen, 2018). Arras et al. (2017a) and Xiong et al. (2018) used relevance scores generated by explanation methods to construct document vectors by weighted-averaging word vectors and checked how well traditional machine learning techniques handle these document vectors. Poerner et al. (2018) proposed two evaluation paradigms, hybrid documents and morphosyntactic agreements, both checking whether an explanation method correctly points to the (known) root cause of the prediction. Note that all of the aforementioned evaluation methods are conducted with no humans involved.
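The word-deletion paradigm can be sketched as follows. The bag-of-words scorer here is a toy stand-in for a real classifier, and all words and weights are purely illustrative.

```python
import math

# Toy classifier: predicted-class confidence from a bag-of-words score.
# In a real evaluation this would be the model's softmax probability.
WEIGHTS = {"excellent": 2.0, "great": 1.5, "good": 1.0, "the": 0.0, "was": 0.0}

def confidence(words):
    score = sum(WEIGHTS.get(w, 0.0) for w in words)
    return 1.0 / (1.0 + math.exp(-score))  # sigmoid

def word_deletion_curve(words, relevance):
    """Delete words in decreasing relevance order; record confidence after each."""
    order = sorted(range(len(words)), key=lambda i: relevance[i], reverse=True)
    remaining = list(words)
    curve = [confidence(remaining)]
    for i in order:
        remaining.remove(words[i])
        curve.append(confidence(remaining))
    return curve

doc = ["the", "movie", "was", "excellent", "and", "great"]
rel = [0.0, 0.1, 0.0, 0.9, 0.05, 0.6]   # e.g., produced by an explanation method
curve = word_deletion_curve(doc, rel)
# A faithful explanation ranks truly influential words first,
# so the confidence should drop fastest at the start of the curve.
```

A steeper initial drop indicates that the explanation method ranked the genuinely influential words at the top.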
For human-grounded evaluation, Mohseni and Ragan (2018) proposed a benchmark which contains a list of relevant words for the actual class of each input text, identified by human experts. However, comparing human explanations with the explanations given by the tested method may be inappropriate, since mismatches could be due not only to a poor explanation method but also to an inaccurate model or to the model reasoning differently from humans. Nguyen (2018) asked humans to guess the output of a text classifier, given an input text with the most relevant words highlighted by the tested explanation method; informative (and discriminative) explanations lead to humans' correct guesses. Ribeiro et al. (2016) asked humans to choose the model which generalizes better by considering local explanations. Also, they let humans remove irrelevant words appearing in the explanations from the corpus to improve the prediction performance. Compared to previous work, our work is more comprehensive in terms of the various human-grounded evaluation tasks proposed and the number and diversity of explanation methods being evaluated.

CNNs for Text Classification
CNNs have been found to achieve promising results in many text classification tasks (Johnson and Zhang, 2015; Gambäck and Sikdar, 2017; Zhang et al., 2019). Figure 1 shows a 1D CNN for text classification, which consists of four main steps: (i) embedding the input text into an embedded matrix W; (ii) applying K fixed-size convolution filters to W to find n-grams that possibly discriminate one class from the others; (iii) pooling only the maximum value found by each filter, corresponding to the most relevant n-gram in the text, to construct a filter-based feature vector v of the input; and (iv) using fully-connected layers (FC) to predict the results and applying a softmax function to the outputs to obtain the predicted probabilities of the classes (p), i.e., p = softmax(FC(v)). While the original version of this model uses only one linear layer as FC (Kim, 2014), more hidden layers can be added to increase the model capacity for prediction. Also, more than one filter size can be used to detect n-grams with short- and long-span relations (Conneau et al., 2017).
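The four steps can be sketched in NumPy as a minimal forward pass with random weights and a single linear FC layer; all dimensions are illustrative and not those of our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)
V, d, K, n, C = 20, 8, 5, 3, 2   # vocab size, embed dim, #filters, filter size, #classes

emb = rng.normal(size=(V, d))            # (i) embedding table
filters = rng.normal(size=(K, n, d))     # (ii) K convolution filters of width n
W_fc = rng.normal(size=(K, C))           # (iv) one linear layer as FC
b_fc = np.zeros(C)

def predict(token_ids):
    W = emb[token_ids]                   # (i) embedded matrix, shape (T, d)
    T = len(token_ids)
    # (ii) slide each filter over the text, with ReLU activation
    conv = np.array([[np.maximum((W[t:t + n] * f).sum(), 0.0)
                      for t in range(T - n + 1)] for f in filters])  # (K, T-n+1)
    v = conv.max(axis=1)                 # (iii) max-over-time pooling -> feature vector
    logits = v @ W_fc + b_fc             # (iv) fully-connected layer
    p = np.exp(logits - logits.max())    # numerically stable softmax
    return v, p / p.sum()

v, p = predict(np.array([1, 4, 2, 7, 3, 0, 9]))
```

Each entry v_k records the strongest match of filter k anywhere in the text, which is why a single n-gram per filter can be recovered later for explanation purposes.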

Human-grounded Evaluation Methods
We propose three human tasks to evaluate explanation methods for text classification as summarized in Table 1. Figure 2 gives an example question for each task, discussed next.

Revealing the Model Behavior
Task 1 evaluates whether explanations can expose irrational behavior of a poor model. This property of explanation methods is very useful when we do not have a labelled dataset to evaluate the model quantitatively. To set up the task, firstly, we train two models so that they achieve different performance on classifying testing examples (i.e., different capabilities to generalize to unseen data). Then we use these models to classify an input text and apply the explanation method of interest to explain the predictions, highlighting the top-m evidence text fragments on the text for each model. Next, we ask humans, based on the highlighted texts from the two models, which model behaves more reasonably.
If the performance of the two models is clearly different, good explanation methods should enable humans to notice the poor model, which is more likely to decide based on non-discriminative words, even though both models predict the same class for an input text.
Additionally, there are some important points to note for this task. First, the chosen input texts must be classified into the same class by both models, so that the humans make decisions based only on the different explanations. However, it is worth considering both the case where both models classify correctly and the case where both misclassify. Second, we provide choices for the two models along with confidence levels for the humans to select. If they select the right model with high confidence, the explanation method gets a high positive score. In contrast, a confident but incorrect answer results in a large negative score. Also, the humans have the option to state no preference, for which the explanation method gets a zero score (see the last row of Table 1).

Justifying the Predictions
Explanations are sometimes used by humans as the reasons for the predicted class. This task tests whether the evidence texts are truly related to the predicted class and can distinguish it from the other classes, i.e., whether they are class-discriminative (Selvaraju et al., 2017). To set up the task, we use a well-trained model and select an input example classified by this model with high confidence (max_c p_c > τ_h, where τ_h is a threshold parameter), so as to reduce the cases of unclear explanations due to low model accuracy or text ambiguity. (Note that we will look at low-confidence predictions later in Task 3.) Then we show only the top-m evidence text fragments generated by the method of interest to humans and ask them to guess the class of the document containing the evidence. An explanation method which makes the humans confidently guess the class predicted by the model gets a high positive score. As in the previous task, this task considers both correct and incorrect predictions with high confidence to see how well the explanations justify each of the cases. For incorrect predictions, an explanation method gets a positive score when a human guesses the same incorrect class after seeing the explanation. In real applications, convincing explanations for incorrect classes can help humans understand the model's weakness and create additional fixing examples to retrain and improve the model.

Investigating Uncertain Predictions
If an AI system makes a prediction with low confidence, it may need to raise the case with humans and let them decide, with the analyzed results as additional information. This task aims to check whether the explanations can help humans comprehend the situation and correctly classify the input text. To set up, we use a well-trained model and an input text classified by this model with low confidence (max_c p_c < τ_l, where τ_l is a threshold parameter). Then we apply the explanation method of interest to find the top-m evidence and top-m counter-evidence texts of the predicted class. We present both types of evidence to humans together with the predicted class and probability p, and ask the humans to use all the information to guess the actual class of the input text, without seeing the input text itself. The scoring criteria of this task are similar to those of the previous tasks, except that we do not provide the "no preference" option, as the humans can still rely on the predicted scores when all the explanations are unhelpful.

Datasets
We used two English textual datasets for the three tasks.
(1) Amazon Review Polarity is a sentiment analysis dataset with positive and negative classes. We randomly selected 100K, 50K, and 100K examples for training, validating, and testing the CNN models, respectively.
(2) ArXiv Abstract is a text classification dataset we created by collecting abstracts of scientific articles publicly available on ArXiv. Particularly, we collected abstracts from the "Computer Science (CS)", "Mathematics (MA)", and "Physics (PH)" categories, which are the three main categories on ArXiv. We then created a dataset with three disjoint classes, removing the abstracts which belong to more than one of the three categories. In the experiments, we randomly selected 6K, 1.5K, and 10K examples for training, validating, and testing the CNN model, respectively.

Classification Models: 1D CNNs
As for the classifiers, we used 1D CNNs with the same structure for all the tasks and datasets. Specifically, we used 200-dimensional GloVe vectors as non-trainable weights in the embedding layer (Pennington et al., 2014). The convolution layer had three filter sizes [2, 3, 4] with 50 filters for each size, while the intermediate fully-connected layer had 150 units. The activation function of the filters and the fully-connected layers was ReLU (except the softmax at the output layer). The models were implemented using Keras and trained with the Adam optimizer. The macro-average F1 scores are 0.90 and 0.94 for the Amazon and the ArXiv datasets, respectively. Overall, ArXiv appears to be the easier task, as it is likely solvable by looking at individual keywords. In contrast, the Amazon sentiment analysis is less straightforward: many reviews mention both pros and cons of the products, so a classifier needs to analyze several parts of the input to reach a conclusion. However, this is still manageable by the CNN architecture we used.
Also, in task 1, we need another model which performs worse than the well-trained model. In this experiment, we trained the second CNNs (i.e., the worse models) for the two datasets in different ways, to examine the capability of explanation methods in two different scenarios.

Explanation Methods
We evaluated nine explanation methods, as summarized in Table 2. First, we used Random (W) and Random (N) as two baselines, selecting words and non-overlapping n-grams randomly from the input text as evidence and counter-evidence. For the n-gram random baseline (and the other n-gram-based explanation methods in this paper), n is one of the CNN filter sizes [2, 3, 4]. Second, we selected LIME, a well-known model-agnostic perturbation-based method (Ribeiro et al., 2016). It trains a linear model using samples (5,000 samples in this paper) around the input text to explain the importance of each word towards the prediction. The importance scores can be either positive (for the predicted class) or negative (against the predicted class).
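The perturbation idea behind LIME can be sketched as follows. This is a self-contained illustration, not the actual LIME library: a toy probability function stands in for the black-box classifier, and the kernel width and sample count are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
words = ["this", "camera", "is", "truly", "excellent"]

def prob_positive(texts):
    """Toy stand-in for the black-box classifier's predicted-class probability."""
    return np.array([0.5 + 0.4 * ("excellent" in t) for t in texts])

# Sample binary masks around the input (LIME uses 5,000 samples in this paper).
n_samples = 1000
Z = rng.integers(0, 2, size=(n_samples, len(words)))       # 1 = word kept
texts = [[w for w, keep in zip(words, z) if keep] for z in Z]
y = prob_positive(texts)

# Proximity weights: perturbations closer to the original text count more.
dist = 1.0 - Z.mean(axis=1)                 # fraction of words removed
sw = np.sqrt(np.exp(-(dist ** 2) / 0.25))   # sqrt of an exponential kernel

# Weighted least squares: per-word importance for the prediction.
X = np.hstack([Z, np.ones((n_samples, 1))])  # add an intercept column
coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
importance = dict(zip(words, coef[:-1]))
# "excellent" should receive the largest positive importance here.
```

Because removing "excellent" is the only perturbation that changes the toy probability, the fitted linear model assigns it essentially all of the importance, which is exactly the behavior LIME exploits to rank words.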
Third, we selected layer-wise relevance propagation (LRP), specifically ε-LRP (Bach et al., 2015b), and DeepLIFT (Shrikumar et al., 2017), which are applicable to neural networks in general and performed very well in several evaluations using proxy tasks (Xiong et al., 2018; Poerner et al., 2018). LRP propagates the output of the target class (before softmax) back through the layers to find attributing words, while DeepLIFT does the same but propagates the difference between the output and the predicted output of the reference input (i.e., all-zero embeddings in this paper). These two methods assign relevance scores to every word in the input text. Words with the highest and the lowest scores are selected as evidence for and counter-evidence against the predicted class, respectively. Also, we extended LRP and DeepLIFT to generate explanations at an n-gram level. We considered all possible n-grams in the input text, where n is one of the CNN filter sizes. The explanations are then generated based on the relevance score of each n-gram, i.e., the sum of the relevance scores of all words in the n-gram.
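The n-gram extension can be sketched as follows, assuming word-level relevance scores (e.g., from LRP or DeepLIFT) are already available; the scores below are made up for illustration.

```python
def top_ngram_evidence(word_scores, sizes=(2, 3, 4), m=3):
    """Score every n-gram by the sum of its word relevances, then greedily
    pick the m highest-scoring non-overlapping n-grams as evidence."""
    T = len(word_scores)
    candidates = [(sum(word_scores[i:i + n]), i, n)
                  for n in sizes for i in range(T - n + 1)]
    candidates.sort(reverse=True)
    chosen, used = [], set()
    for score, i, n in candidates:
        span = set(range(i, i + n))
        if not span & used:          # skip n-grams overlapping earlier picks
            chosen.append((i, n, score))
            used |= span
        if len(chosen) == m:
            break
    return chosen

# Word-level relevance scores for an 8-word text (values illustrative).
scores = [0.1, 0.9, 0.8, -0.2, 0.0, 0.7, 0.6, -0.1]
evidence = top_ngram_evidence(scores, m=2)
```

For counter-evidence, the same procedure can be run on the negated scores to pick the most negatively relevant n-grams.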
Next, we searched for model-specific explanation methods targeting 1D CNNs for text classification. We found that Jacovi et al. (2018) proposed one: listing only the n-grams corresponding to feature values in v (see Figure 1) that pass thresholds for their filters, where each threshold is set subject to sufficient purity of the classification results above it. However, their method is applicable only to CNNs with one linear layer as FC, while our CNNs have an additional hidden layer (with ReLU activation), so we could not compare with their method in this work. To increase diversity in the experiments, we therefore propose two additional model-specific methods applicable to 1D CNNs with multiple layers in FC, presented next.

Grad-CAM-Text
We adapt Grad-CAM (Selvaraju et al., 2017), originally devised for explaining 2D CNNs, to find the most relevant n-grams for text classification. Since each value in the feature vector v corresponds to an n-gram selected by a filter, we use E_j,k to quantify the effect of the n-gram selected by the k-th filter on the prediction of class j:

E_j,k = max(∂FC(v)_j / ∂v_k, 0) × v_k    (1)

The partial derivative term shows how much the prediction of class j changes if the value from the k-th filter slightly changes. As we are finding the evidence for the target class j, we consider only the positive part of the derivative. E_j,k then combines this term with the strength of v_k to show the overall effect of the k-th filter for the input text. Next, we calculate the effect of each word w_i in the input text by aggregating the effects of all the n-grams containing w_i:

E_j,w_i = Σ_{k : w_i ∈ N_k} E_j,k    (2)

where N_k is the n-gram detected by the k-th filter. Lastly, we select, as the evidence, non-overlapping n-grams which are detected by at least one of the filters and have the highest sums of the effects of all the words they contain. For example, to decide whether we select the n-gram N_k as an evidence text, we consider Σ_{w_i ∈ N_k} E_j,w_i. Note that we can find counter-evidence by changing, in equation (1), from max to min.
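The selection procedure can be sketched as follows, assuming the gradients ∂FC(v)_j/∂v_k and the spans of the detected n-grams have already been computed; all numerical values below are hypothetical.

```python
import numpy as np

def grad_cam_text(grad, v, spans, text_len, m=2):
    """Select evidence n-grams in the spirit of Grad-CAM-Text.

    grad  : gradients of the class-j output w.r.t. v, shape (K,)
    v     : pooled filter-based feature vector, shape (K,)
    spans : spans[k] = (start, n) of the n-gram detected by filter k
    """
    E = np.maximum(grad, 0.0) * v                 # effect per filter, eq. (1)
    word_effect = np.zeros(text_len)              # aggregate effect per word
    for k, (start, n) in enumerate(spans):
        word_effect[start:start + n] += E[k]
    # Rank detected n-grams by the summed effect of the words they contain.
    ranked = sorted(range(len(spans)),
                    key=lambda k: word_effect[spans[k][0]:
                                              spans[k][0] + spans[k][1]].sum(),
                    reverse=True)
    chosen, used = [], set()
    for k in ranked:                              # keep non-overlapping spans
        start, n = spans[k]
        span = set(range(start, start + n))
        if not span & used:
            chosen.append(spans[k])
            used |= span
        if len(chosen) == m:
            break
    return chosen

# Hypothetical values: 4 filters over a 6-word text.
spans = [(0, 2), (1, 3), (3, 2), (4, 2)]
grad = np.array([0.5, -0.3, 0.8, 0.1])
v = np.array([1.0, 2.0, 0.5, 0.2])
evidence = grad_cam_text(grad, v, spans, text_len=6)
```

Replacing `np.maximum(grad, 0.0)` with `np.minimum(grad, 0.0)` (and ranking ascending) yields counter-evidence, mirroring the max-to-min change in equation (1).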

Decision Trees
This explanation method is based on model extraction (Bastani et al., 2017). We create a decision tree (DT) which mimics the behavior of the classification part (the fully-connected layers) of the trained CNN. Given a filter-based feature vector v, the DT needs to predict the same class as predicted by the CNN; formally, we want DT(v) = argmax_c FC(v)_c. For multi-class classification, we construct one DT for each class (one-vs-rest classification). We employ CART with the Gini index for learning DTs (Breiman et al., 1984). All the training examples are generated by the trained CNN using a training dataset, whereas a validation dataset is used to prune the DTs to prevent overfitting. Also, for each feature v_j in v, we calculate the Pearson correlation between v_j and the output of each class (before softmax) in FC(v) using the training dataset, so we know which class is usually predicted given a high value of v_j (i.e., the class correlated most with this feature). We use c_j to denote the most correlated class of the feature v_j.
We can consider the DTs as a global explanation of the model, as they explain the CNN in general. To create a local explanation, we use the DT of the predicted class to classify the input. At each decision node it visits, we collect the associated n-grams passing the node's threshold as evidence for (or counter-evidence against) the predicted class, depending on the most correlated class of the splitting feature. For example, suppose an input text X is classified to class a, so we use the DT of class a to predict the input. If a decision node checks whether feature v_j of this input is greater than 0.25, and this is true for the input, the n-gram corresponding to v_j will be evidence if the most correlated class of v_j is class a (i.e., c_j = a). Otherwise (c_j ≠ a), it will be counter-evidence.
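The extraction step can be sketched with scikit-learn as follows. A fixed linear map stands in for the CNN's fully-connected part, the feature vectors are random rather than CNN-produced, and pruning on a validation set is omitted for brevity.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Stand-in for the CNN's classification part FC(v): a fixed linear map
# from 3 filter-based features to 2 class outputs (values illustrative).
W = np.array([[ 2.0, -1.0],
              [-1.5,  2.0],
              [ 0.5,  0.5]])

V_train = rng.uniform(0, 1, size=(2000, 3))   # filter-based feature vectors
logits = V_train @ W
y_cnn = logits.argmax(axis=1)                 # labels produced by the "CNN"

# Model extraction: fit a decision tree that mimics the CNN's predictions.
dt = DecisionTreeClassifier(max_depth=5, random_state=0).fit(V_train, y_cnn)
fidelity = (dt.predict(V_train) == y_cnn).mean()

# Most correlated class c_j per feature: the Pearson correlation between
# v_j and each class output tells which class a high v_j usually supports.
corr = np.array([[np.corrcoef(V_train[:, j], logits[:, c])[0, 1]
                  for c in range(2)] for j in range(3)])
most_correlated = corr.argmax(axis=1)         # c_j for each feature v_j
```

The decision path of `dt` for a given v then yields the splitting features whose associated n-grams become evidence (c_j equal to the predicted class) or counter-evidence (c_j different).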
An example from the Amazon dataset, Actual: Pos, Predicted: Pos, (Predicted scores: Pos 0.514, Neg 0.486): "OK but not what I wanted: These would be ok but I didn't realize just how big they are. I wanted something I could actually cook with. They are a full 12" long. The handles didn't fit comfortably in my hand and the silicon tips are hard, not rubbery texture like I'd imagined. The tips open to about 6" between them. Hope this helps someone else know ..."

Table 3: Examples of evidence and counter-evidence texts generated by some of the explanation methods.

Implementations
We used public libraries of LIME (https://github.com/marcotcr/lime), LRP (Alber et al., 2018), and DeepLIFT (https://github.com/kundajelab/deeplift) in our experiments. Besides, the code for computing Grad-CAM-Text was adapted from keras-vis (https://github.com/raghakot/keras-vis), whereas we used scikit-learn (Pedregosa et al., 2011) for decision tree construction. All the DTs achieved over 80% macro-F1 in mimicking the CNNs' predictions.
For the task parameters, we set m = 3, τ_h = 0.9, and τ_l = 0.7. For each task and dataset, we used 100 input texts, half of which were classified correctly by the model(s) while the rest were misclassified. So, with nine explanation methods being evaluated, each task had 900 questions per dataset for human participants to answer. Examples of questions for each task are given in Figure 2.
For the Amazon dataset, we posted our tasks on Amazon Mechanical Turk (MTurk). To ensure the quality of crowdsourcing, each question was answered by three workers and the scores were averaged. For the ArXiv dataset which requires background knowledge of the related subjects, we recruited graduates and post-graduate students in Computer Science, Mathematics, Physics, and Engineering to perform the tasks, and each question was answered by one participant. In total, we had 367 and 121 participants for the Amazon and the ArXiv datasets, respectively.

Results and Discussion
Examples of the generated explanations are shown in Table 3 and a separate appendix. Table 4 shows the average scores of each explanation method for each task and dataset, while Figure 3 displays the distributions of individual scores for all three tasks. We do not show the distributions of tasks 2 and 3 of the Amazon dataset as they look similar to the associated ones of the ArXiv dataset. The code and datasets of this paper are available at https://github.com/plkumjorn/CNNAnalysis.

Task 1
For the Amazon dataset, though Grad-CAM-Text achieved the highest overall score, its performance was not significantly different from that of the other methods, including the random baselines. Also, the inter-rater agreement for this task was quite poor. This suggests that existing explanation methods cannot clearly reveal irrational behavior of the underfitting CNN to lay human users; accordingly, the scores of most explanation methods are distributed symmetrically around zero, as shown in Figure 3(a). For the ArXiv dataset, LRP (N) and DeepLIFT (N) got the highest scores when both CNNs predicted correctly. Hence, they can help humans identify the poor model to some extent. However, there was no clear winner when both CNNs predicted wrongly. One plausible reason is that evidence for an incorrect prediction, even by a well-trained CNN, is usually not convincing unless we set a (high) lower bound on the confidence of the predictions (as we did in task 2).
The last row of Table 4 reports inter-rater agreement measures (Fleiss' kappa) in the format α / β, where α considers answers with human confidence levels (5 categories for tasks 1-2 and 4 categories for task 3) and β considers answers regardless of the confidence levels (3 categories for tasks 1-2 and 2 categories for task 3). Besides, some explanations highlighted text in a strange way, such as "... greedy algorithm. In this paper, we ...". Hence, in real applications, syntax integrity should be taken into account when generating explanations.

Task 2
LIME clearly achieved the best results in task 2, followed by Grad-CAM-Text and DTs. These methods are class-discriminative, being able to find good evidence for the predicted class regardless of whether the prediction is correct. We believe that LIME performed well because it verifies that removing the evidence words from the input text greatly reduces the probability of the predicted class, so these words are semantically related to the predicted class (given that the model is accurate). Meanwhile, the DTs method selects evidence based on the most correlated class of the splitting features, so the evidence n-grams are more likely related to the predicted class than to the other classes. However, they may be less relevant than LIME's, as the evidence is generated from a global explanation of the model (the DTs). Besides, Grad-CAM-Text worked relatively well here, probably because it preserves the class-discriminative property of Grad-CAM (Selvaraju et al., 2017).
By contrast, LRP and DeepLIFT generated acceptable evidence only for the correct predictions. Also, LRP (N) and DeepLIFT (N) performed better than LRP (W) and DeepLIFT (W) on both datasets. This might be because one evidence n-gram contains more information than one evidence word. Nevertheless, even the Random (N) method surpassed LRP (W) and DeepLIFT (W) on the ArXiv dataset. Therefore, whenever we use LRP and DeepLIFT, we should present to humans the most relevant words together with their contexts.

Task 3
The negative scores under the columns of task 3 show that using explanations to rectify the predictions is not easy. Hence, the overall average scores of many explanation methods stay close to zero.
DTs performed well only on the Amazon dataset. The average numbers of n-grams per explanation, generated by the DTs, are 2.00 and 1.77 for the Amazon and ArXiv datasets, respectively. Also, the reported n-grams could be repetitive and overlapping. This reduced the amount of useful information displayed, and it may be insufficient for humans to choose one of the CS, MA, and PH categories, which are more similar to one another than the positive and negative sentiments.
Meanwhile, LRP (N) performed consistently well on both datasets. This is reasonable considering our discussions in task 2. First, LRP (N) generates good evidence for correct predictions, so it can gain high scores there. On the other hand, the evidence for incorrect predictions is usually not convincing, so the counter-evidence (which is likely to be the evidence of the correct class) can attract humans' attention. Furthermore, the fact that LRP is not class-discriminative does not harm it in this task, as humans can recognize an evidence text even if it is selected by LRP (N) as counter-evidence (and vice versa).
For example, in the ArXiv dataset, we found a case in which the predicted class is PH (score = 0.48) but the correct class is CS (score = 0.07). LRP (N) selected 'armed bandit settings with', 'the Wasserstein distance', and 'derive policy gradients' as evidence for the class PH. These n-grams, however, are not truly related to PH. Rather, they revealed the true class of this text and made a human choose the CS option with high confidence despite the low predicted score.
Regarding LIME, the situation is reversed, as LIME can find both good evidence and good counter-evidence. This can make humans indecisive and, possibly, select a wrong option, since the explanation is presented at a word level (without any contexts).

Model Complexity
Apart from the results of the three tasks, it is worth discussing the size of the DTs which mimic the four CNNs in our experiments. As shown in Table 5, the size of the DTs can reflect the complexity of the CNNs. Although the well-trained CNN of the Amazon dataset achieved a 0.90 F1 score, the DTs of this CNN needed more than 5,500 nodes to reach 85% fidelity (compared to only hundreds of nodes required for the ArXiv dataset). This illustrates the high complexity of the Amazon task compared to the ArXiv task, even though both tasks were managed effectively by the same CNN architecture.
For the ArXiv dataset, the DTs of the poor CNN are smaller than the ones of the well-trained CNN. This is likely because the poor CNN was trained on a specific dataset (i.e., selected subtopics of the main categories), so it had to deal with fewer discriminative patterns in the texts compared to the first CNN, which was trained using texts from all subtopics. Further studies of this property of the DTs would be useful for some applications, e.g., measuring model complexity (Bianchini and Scarselli, 2014) and model compression (Cheng et al., 2018).

Conclusion
We proposed three human tasks to evaluate local explanation methods for text classification. Using these tasks, we experimented on 1D CNNs and found that (i) LIME is the most class-discriminative method, justifying predictions with relevant evidence; (ii) LRP (N) works fairly well in helping humans investigate uncertain predictions; (iii) using explanations to reveal model behavior is challenging, and none of the methods achieved impressive results; (iv) whenever using LRP and DeepLIFT, we should present to humans the most relevant words together with their contexts; and (v) the size of the DTs can also reflect the model complexity. Lastly, we consider evaluating on other datasets and other advanced architectures beneficial future work, as it may reveal further interesting qualities of the explanation methods.