A Diagnostic Study of Explainability Techniques for Text Classification

Recent developments in machine learning have introduced models that approach human performance at the cost of increased architectural complexity. Efforts to make the rationales behind the models' predictions transparent have inspired an abundance of new explainability techniques. Provided with an already trained model, they compute saliency scores for the words of an input instance. However, there exists no definitive guide on (i) how to choose such a technique given a particular application task and model architecture, and (ii) the benefits and drawbacks of using each such technique. In this paper, we develop a comprehensive list of diagnostic properties for evaluating existing explainability techniques. We then employ the proposed list to compare a set of diverse explainability techniques on downstream text classification tasks and neural network architectures. We also compare the saliency scores assigned by the explainability techniques with human annotations of salient input regions to find relations between a model's performance and the agreement of its rationales with human ones. Overall, we find that the gradient-based explanations perform best across tasks and model architectures, and we present further insights into the properties of the reviewed explainability techniques.


Introduction
Understanding the rationales behind models' decisions is becoming a topic of pivotal importance, as both the architectural complexity of machine learning models and the number of their application domains increase. Having greater insight into the models' reasons for making a particular prediction has already proven to be essential for discovering potential flaws or biases in medical diagnosis (Caruana et al., 2015) and judicial sentencing (Rich, 2016). In addition, European law has mandated "the right . . . to obtain an explanation of the decision reached" (Regulation, 2016).
Explainability methods attempt to reveal the reasons behind a model's prediction for a single data point, as shown in Figure 1. They can be produced post-hoc, i.e., with already trained models. Such post-hoc explanation techniques can be applicable to one specific model (Martens et al., 2008; Wagner et al., 2019) or to a broader range thereof (Ribeiro et al., 2016; Lundberg and Lee, 2017). They can further be categorised as: employing model gradients (Sundararajan et al., 2017; Simonyan et al., 2013), being perturbation based (Shapley, 1953; Zeiler and Fergus, 2014), or providing explanations through model simplifications (Ribeiro et al., 2016; Johansson et al., 2004). There also exist explainability methods that generate textual explanations (Camburu et al., 2018) and are trained post-hoc or jointly with the model at hand.
While there is a growing amount of explainability methods, we find that they can produce varying, sometimes contradicting explanations, as illustrated in Figure 1. Hence, it is important to assess existing techniques and to provide a generally applicable and automated methodology for choosing one that is suitable for a particular model architecture and application task (Jacovi and Goldberg, 2020). Robnik-Šikonja and Bohanec (2018) compile a list of property definitions for explainability techniques, but it remains a challenge to evaluate them in practice. Several other studies have independently proposed different setups for probing varied aspects of explainability techniques (DeYoung et al., 2020; Sundararajan et al., 2017). However, existing studies evaluating explainability methods are discordant and do not compare to properties from previous studies. In our work, we consider properties from related work and extend them to be applicable to a broader range of downstream tasks.
Furthermore, to create a thorough setup for evaluating explainability methods, one should include at least: (i) different groups of explainability methods (explanation by simplification, gradient-based, etc.), (ii) different downstream tasks, and (iii) different model architectures. However, existing studies usually consider at most two of these aspects, thus providing insights tied to a specific setup.
We propose a number of diagnostic properties for explainability methods and evaluate them in a comparative study. We consider explainability methods from different groups, all widely applicable to most ML models and application tasks. We conduct an evaluation on three text classification tasks, which contain human annotations of salient tokens. Such annotations are available for Natural Language Processing (NLP) tasks, as they are relatively easy to obtain. This is in contrast to ML sub-fields such as image analysis, for which we only found one relevant dataset: 536 manually annotated object bounding boxes for Visual Question Answering (Subramanian et al., 2020).
We further compare explainability methods across three of the most widely used model architectures: CNN, LSTM, and Transformer. The Transformer model achieves state-of-the-art performance on many text classification tasks but has a complex architecture, hence methods to explain its predictions are strongly desirable. The proposed properties can also be directly applied to Machine Learning (ML) subfields other than NLP. The code for the paper is publicly available at https://github.com/copenlu/xai-benchmark.

In summary, the contributions of this work are:
• We compile a comprehensive list of diagnostic properties for explainability and automatic measurement of them, allowing for their effective assessment in practice.
• We study and compare the characteristics of different groups of explainability techniques in three different application tasks and three different model architectures.
• We study the attributions of the explainability techniques and human annotations of salient regions to compare and contrast the rationales of humans and machine learning models.
Related Work

Explainability methods provide explanations of different qualities, so assessing them systematically is pivotal. A common attempt to expose shortcomings in explainability techniques is to probe a model's reasoning process with counter-examples (Alvarez-Melis and Jaakkola, 2018; Kindermans et al., 2019; Atanasova et al., 2020b), finding different explanations for the same output. However, single counter-examples do not provide a measure to evaluate explainability techniques (Jacovi and Goldberg, 2020).
Another group of studies performs human evaluation of the outputs of explainability methods (Lertvittayakumjorn and Toni, 2019; Narayanan et al., 2018). Such studies exhibit low inter-annotator agreement and reflect mostly what appears to be reasonable and appealing to the annotators, not the actual properties of the method.
The most related studies to our work design measures and properties of explainability techniques. Robnik-Šikonja and Bohanec (2018) propose an extensive list of properties. The Consistency property captures the difference between explanations of different models that produce the same prediction; and the Stability property measures the difference between the explanations of similar instances given a single model. We note that similar predictions can still stem from different reasoning paths. Instead, we propose to explore instance activations, which reveal more of the model's reasoning process than just the final prediction. The authors propose other properties as well, which we find challenging to apply in practice. We construct a comprehensive list of diagnostic properties tied with measures that assess the degree of each characteristic.
Another common approach to evaluate explainability methods is to measure the sufficiency of the most salient tokens for predicting the target label (DeYoung et al., 2020). We also include a sufficiency estimate, but instead of fixing a threshold for the tokens to be removed, we measure the decrease of a model's performance, varying the proportion of excluded tokens. Other perturbation-based evaluation studies and measures exist (Sundararajan et al., 2017; Adebayo et al., 2018), but we consider the above, as it is the most widely applied.
Another direction of explainability evaluation is to compare the agreement of salient words annotated by humans to the saliency scores assigned by explanation techniques (DeYoung et al., 2020). We also consider the latter and further study the agreement across model architectures, downstream tasks, and explainability methods. While we consider human annotations at the word level (Camburu et al., 2018; Lei et al., 2016), there are also datasets (Clark et al., 2019; Khashabi et al., 2018) with annotations at the sentence level, which would require other model architectures, so we leave this for future work.
Existing studies for evaluating explainability heavily differ in their scope. Some concentrate on a single model architecture: BERT-LSTM (DeYoung et al., 2020), RNN (Arras et al., 2019), CNN (Lertvittayakumjorn and Toni, 2019), whereas a few consider more than one model (Guan et al., 2019; Poerner et al., 2018). Some studies concentrate on one particular dataset (Guan et al., 2019; Arras et al., 2019), while only a few generalize their findings over downstream tasks (DeYoung et al., 2020; Vashishth et al., 2019). Finally, existing studies focus on one (Vashishth et al., 2019) or a single group of explainability methods (DeYoung et al., 2020; Adebayo et al., 2018). Our study is the first to propose a unified comparison of different groups of explainability techniques across three text classification tasks and three model architectures.

Evaluating Attribution Maps
We now define a set of diagnostic properties of explainability techniques, and propose how to quantify them. Similar notions can be found in related work (Robnik-Šikonja and Bohanec, 2018; DeYoung et al., 2020), and we extend them to be generally applicable to downstream tasks. We first introduce the prerequisite notation. Let X = {(x_i, y_i, w_i) | i ∈ [1, N]} be the test dataset, where each instance consists of a list of tokens x_i = {x_{i,j} | j ∈ [1, |x_i|]}, a gold label y_i, and a gold saliency score for each of the tokens in x_i: w_i = {w_{i,j} | j ∈ [1, |x_i|]}. Let ω be an explanation technique that, given a model M, a class c, and a single instance x_i, computes saliency scores for each token in the input: ω^M_{x_i,c} = {ω^M_{x_i,c,j} | j ∈ [1, |x_i|]}. Finally, let M = {M_1, . . ., M_K} be models with the same architecture, each trained from a randomly chosen seed, and let M' = {M'_1, . . ., M'_K} be models of the same architecture, but with randomly initialized weights.
Agreement with human rationales (HA). This diagnostic property measures the degree of overlap between saliency scores provided by human annotators, specific to the particular task, and the word saliency scores computed by an explainability technique on each instance. The property is a simple way of approximating the quality of the produced feature attributions. While it does not necessarily mean that the saliency scores explain the predictions of a model, we assume that explanations with high agreement scores would be more comprehensible for the end-user as they would adhere more to human reasoning. With this diagnostic property, we can also compare how the type and the performance of a model and/or dataset affect the agreement with human rationales when observing one type of explainability technique.
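As a minimal sketch (not the paper's implementation; variable names are our own), such agreement can be computed as per-instance Average Precision of the saliency scores against binary gold rationales, averaged over the dataset:

```python
import numpy as np
from sklearn.metrics import average_precision_score

def mean_average_precision(gold_rationales, saliency_scores):
    """MAP between binary gold token annotations w_i and the saliency
    scores an explainability technique assigns to the same tokens."""
    aps = [average_precision_score(w, s)
           for w, s in zip(gold_rationales, saliency_scores)]
    return float(np.mean(aps))

# Toy example: two instances of four tokens each; the technique ranks
# every human-annotated token above the rest, so MAP is 1.0.
gold = [np.array([1, 0, 0, 1]), np.array([0, 1, 0, 0])]
sal = [np.array([0.9, 0.1, 0.2, 0.8]), np.array([0.2, 0.9, 0.1, 0.1])]
print(mean_average_precision(gold, sal))  # 1.0
```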
During evaluation, we provide an estimate of the average agreement of the explainability technique across the dataset. To this end, we start at the instance level and compute the Average Precision (AP) of produced saliency scores ω^M_{x_i,c} by comparing them to the gold saliency annotations w_i. Here, the label for computing the saliency scores is the gold label: c = y_i. Then, we compute the average across all instances, arriving at the Mean Average Precision: MAP(ω, M, X) = (1/N) Σ_{i ∈ [1, N]} AP(w_i, ω^M_{x_i, y_i}).

Confidence Indication (CI). A token from a single instance can receive several saliency scores, indicating its contribution to the prediction of each of the classes. Thus, when a model recognizes a highly indicative pattern of the predicted class k, the tokens involved in the pattern would have highly positive saliency scores for this class and highly negative saliency scores for the remaining classes. On the other hand, when the model is not highly confident, we can assume that it is unable to recognize a strong indication of any class, and the tokens accordingly do not have high saliency scores for any class. Thus, the computed explanation of an instance i should indicate the confidence p_{i,k} of the model in its prediction.
We propose to measure the predictive power of the produced explanations for the confidence of the model. We start by computing the Saliency Distance (SD) between the saliency scores for the predicted class k and the saliency scores of the other classes K \ k (Eq. 2). Given the distance between the saliency scores, we predict the confidence of the class with logistic regression (LR) and finally compute the Mean Absolute Error (MAE, Eq. 3) of the predicted confidence with respect to the actual one.
For tasks with two classes, D is the difference between the saliency value for class k and that of the other class.
For more than two classes, D is the concatenation of the max, min, and average across the differences of the saliency value for class k and the other classes. A low MAE indicates that the model's confidence can be easily identified by looking at the produced explanations.

Faithfulness (F). Since explanation techniques are employed to explain model predictions for a single instance, an essential property is that they are faithful to the model's inner workings and not based on arbitrary choices. A well-established way of measuring this property is by replacing a number of the most-salient words with a mask token (DeYoung et al., 2020) and observing the drop in the model's performance. To avoid choosing an unjustified percentage of words to be perturbed, we produce several dataset perturbations by masking 0, 10, 20, . . ., 100% of the tokens in order of decreasing saliency, thus arriving at X^ω_0, X^ω_10, . . ., X^ω_100. Finally, to produce a single number to measure faithfulness, we compute the area under the threshold-performance curve (AUC-TP): AUC-TP(ω, M, X) = AUC([(i, P(X^ω_0) − P(X^ω_i)) | i ∈ [0, 10, . . ., 100]]), where P is a task-specific performance measure. We also compare the AUC-TP of the saliency methods to a random saliency map to find whether there are explanation techniques producing saliency scores without any contribution over a random score.
Using AUC-TP, we perform an ablation analysis which is a good approximation of whether the most salient words are also the most important ones for a model's prediction. However, some prior studies (Feng et al., 2018) find that models remain confident about their prediction even after stripping most input tokens, leaving a few that might appear nonsensical to humans. The diagnostic properties that follow aim to facilitate a more in-depth analysis of the alignment between the inner workings of a model and produced saliency maps.
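A minimal sketch of the AUC-TP computation, assuming the drop-based formulation (performance values below are hypothetical, and the trapezoid rule stands in for whatever AUC estimator the authors use):

```python
def auc_tp(perf):
    """Area under the threshold-performance curve.

    perf[i] is the task performance P(X_i) after masking the 10*i %
    most salient tokens, i = 0..10; we integrate the performance drop
    P(X_0) - P(X_i) over the rescaled thresholds with the trapezoid rule.
    """
    thresholds = [i / 10.0 for i in range(len(perf))]
    drops = [perf[0] - p for p in perf]
    area = 0.0
    for k in range(len(drops) - 1):
        area += 0.5 * (drops[k] + drops[k + 1]) * (thresholds[k + 1] - thresholds[k])
    return area

# Hypothetical F1 values as more and more salient tokens are masked.
perf = [0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.35, 0.33, 0.33, 0.33, 0.33]
print(auc_tp(perf))  # ~0.4045: a larger area means a faster performance drop
```

Under this formulation, a faithful technique yields large drops already at small masking thresholds, and hence a larger area.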

Rationale Consistency (RC).
A desirable property of an explainability technique is to be consistent with the similarities in the reasoning paths of several models on a single instance. Thus, when two reasoning paths are similar, the scores provided by an explainability technique ω should also be similar, and vice versa. Note that we are interested in similar reasoning paths as opposed to similar predictions, as the latter does not guarantee analogous model rationales. For models with diverse architectures, we expect rationales to be diverse as well and to cause low consistency. Therefore, we focus on a set of models with the same architecture, trained from different random seeds, as well as the same architecture, but with randomly initialized weights. The latter would ensure that we can have model pairs (M_s, M_p) with similar and distant rationales. We further claim that the similarity in the reasoning paths could be measured effectively with the distance between the activation maps (averaged across layers and neural nodes) produced by two distinct models (Eq. 5). The distance between the explanation scores is computed simply by subtracting the two (Eq. 6). Finally, we compute Spearman's ρ between the similarity of the explanation scores and the similarity of the attribution maps (Eq. 7).
The higher the positive correlation is, the more consistent the attribution method would be. We choose Spearman's ρ as it measures the monotonic correlation between the two variables. On the other hand, Pearson's ρ measures only the linear correlation, and we can have a non-linear correlation between the activation difference and the saliency score differences. When subtracting saliency scores and layer activations, we also take the absolute value of the vector difference, as the property should be invariant to the order of subtraction. An additional benefit of the property is that low correlation scores would also help to identify explainability techniques that are not faithful to a model's rationales.
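A sketch of the Rationale Consistency computation, under the simplifying assumption that activations and saliency maps are already given as fixed-size arrays (function and variable names are our own):

```python
import numpy as np
from scipy.stats import spearmanr

def rationale_consistency(acts_a, acts_b, sal_a, sal_b):
    """Correlate per-instance activation distances between two models
    (Eq. 5) with the distances between their saliency maps (Eq. 6),
    via Spearman's rho (Eq. 7). All arrays: [num_instances, num_features]."""
    d_act = np.abs(acts_a - acts_b).mean(axis=1)  # absolute value: order-invariant
    d_sal = np.abs(sal_a - sal_b).mean(axis=1)
    rho, _ = spearmanr(d_act, d_sal)
    return rho

# Toy check: saliency distances that grow monotonically with activation
# distances should give a perfect positive correlation.
acts_a = np.zeros((4, 3))
acts_b = np.array([[0.1] * 3, [0.2] * 3, [0.3] * 3, [0.4] * 3])
sal_a = np.zeros((4, 5))
sal_b = np.array([[0.2] * 5, [0.3] * 5, [0.5] * 5, [0.9] * 5])
print(rationale_consistency(acts_a, acts_b, sal_a, sal_b))  # ~1.0
```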
Dataset Consistency (DC). The next diagnostic property is similar to the above notion of rationale consistency but focuses on consistency across instances of a dataset as opposed to consistency across different models of the same architecture. In this case, we test whether instances with similar rationales also receive similar explanations. While Rationale Consistency compares explanations of the same instance for different model rationales, Dataset Consistency compares explanations for pairs of instances on the same model. We again measure the similarity between instances x_i and x_j by comparing their activation maps, as in Eq. 8. The next step is to measure the similarity of the explanations produced by an explainability technique ω, which is the difference between the saliency scores, as in Eq. 9. Finally, we measure Spearman's ρ between the similarity in the activations and the saliency scores, as in Eq. 10. We again take the absolute value of the difference.

Models
We experiment with different commonly used base models, namely CNN (Fukushima, 1980), LSTM (Hochreiter and Schmidhuber, 1997), and the Transformer (Vaswani et al., 2017) architecture BERT (Devlin et al., 2019). The selected models allow for a comparison of the explainability techniques on diverse model architectures. The CNN model contains an embedding layer initialized with GloVe embeddings, followed by convolutional layers with ReLU (Hahnloser et al., 2000) as an activation function. The result is compressed down with a max-pooling layer, passed through a dropout layer, and into a final linear layer responsible for the prediction. The final layer has a size equal to the number of classes in the dataset.
The LSTM model again contains an embedding layer initialized with the GloVe embeddings.The embeddings are passed through several bidirectional LSTM layers.The final output of the recurrent layers is passed through three linear layers and a final dropout layer.
For the Transformer model, we fine-tune the pre-trained base, uncased language model (LM) (Wolf et al., 2019). The fine-tuning is performed with a linear layer on top of the LM with a size equal to the number of classes in the corresponding task. Further implementation details for all of the models, as well as their F1 scores, are presented in Appendix A.1.
Explainability Techniques

Starting with the gradient-based approaches, we select Saliency (Simonyan et al., 2013), as many other gradient-based explainability methods build on it. It computes the gradient of the output w.r.t. the input. We also select two widely used improvements of the Saliency technique, namely InputXGradient (Kindermans et al., 2016) and Guided Backpropagation (Springenberg et al., 2014). InputXGradient additionally multiplies the gradient with the input, and Guided Backpropagation overwrites the gradients of ReLU functions so that only non-negative gradients are backpropagated.
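The Saliency and InputXGradient computations can be sketched on a toy differentiable model (the linear "model" over mean-pooled embeddings is purely illustrative, not one of the paper's architectures):

```python
import torch

def gradient_saliency(model, embeddings, target_class, times_input=False):
    """Gradient of the target-class score w.r.t. the input embeddings.

    Plain gradients correspond to Saliency; with times_input=True the
    result is the InputXGradient attribution instead.
    """
    embeddings = embeddings.clone().detach().requires_grad_(True)
    logits = model(embeddings)            # [batch, num_classes]
    logits[0, target_class].backward()
    grads = embeddings.grad               # [batch, seq_len, emb_dim]
    if times_input:
        grads = grads * embeddings
    return grads.detach()

# Toy usage: an illustrative linear model over mean-pooled embeddings.
W = torch.tensor([[1.0, -1.0], [2.0, 0.5]])       # [emb_dim=2, classes=2]
model = lambda e: e.mean(dim=1) @ W
emb = torch.tensor([[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]])
sal = gradient_saliency(model, emb, target_class=0)
print(sal.shape)  # torch.Size([1, 3, 2])
```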
From the perturbation-based approaches, we employ Occlusion (Zeiler and Fergus, 2014), which replaces each token with a baseline token (as per standard, we use the value zero) and measures the change in the output. Another popular perturbation-based technique is Shapley Value Sampling (Castro et al., 2009). It is based on the Shapley Values approach, which computes the average marginal contribution of each word across all possible word perturbations. The Sampling variant allows for a faster approximation of Shapley Values by considering only a fixed number of random perturbations as opposed to all possible ones. Finally, we select the simplification-based explanation technique LIME (Ribeiro et al., 2016). For each instance in the dataset, LIME trains a linear model to approximate the local decision boundary for that instance.
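Occlusion with a zero baseline can be sketched as follows, again on an illustrative toy model rather than any of the paper's architectures:

```python
import torch

def occlusion_saliency(model, embeddings, target_class):
    """Per-token score drop when the token is replaced by a zero baseline."""
    with torch.no_grad():
        base = model(embeddings)[0, target_class]
        drops = []
        for j in range(embeddings.shape[1]):
            perturbed = embeddings.clone()
            perturbed[0, j] = 0.0          # zero out token j's embedding
            drops.append(base - model(perturbed)[0, target_class])
    return torch.stack(drops)

# Toy usage: an illustrative linear model over mean-pooled embeddings.
W = torch.tensor([[1.0, -1.0], [2.0, 0.5]])
model = lambda e: e.mean(dim=1) @ W
emb = torch.tensor([[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]])
print(occlusion_saliency(model, emb, target_class=0))
```

A larger drop means the occluded token contributed more to the target-class score.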
Generating explanations. The saliency scores from each of the explainability methods are generated for each of the classes in the dataset. As all of the gradient approaches provide saliency scores for the embedding layer (the last layer for which we can compute the gradient), we have to aggregate them to arrive at one saliency score per input token. As we found different aggregation approaches in related studies (Bansal et al., 2016; DeYoung et al., 2020), we employ the two most common methods: the L2 norm and averaging (denoted as ℓ2 and µ in the explainability method names).

Table 3: Mean of the diagnostic property measures for all tasks and models. The best result for the particular model architecture and downstream task is in bold and the second-best is underlined.
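The two aggregation strategies can be sketched as follows (the gradient values are toy numbers; the function name is our own):

```python
import numpy as np

def aggregate_saliency(grad, how="l2"):
    """Collapse [seq_len, emb_dim] embedding-level scores to one per token."""
    if how == "l2":
        return np.linalg.norm(grad, axis=-1)   # amplifies large components
    return grad.mean(axis=-1)                  # opposite signs cancel out

g = np.array([[3.0, -4.0], [0.5, 0.5]])
l2 = aggregate_saliency(g, "l2")      # [5.0, ~0.707]
mu = aggregate_saliency(g, "mean")    # [-0.5, 0.5]: the first token's large
                                      # components partially cancel under the mean
```

The toy values illustrate why the two aggregations can rank tokens differently: the ℓ2 norm scores the first token far above the second, while the mean does the opposite.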

Results and Discussion
We report the measures of each diagnostic property as well as FLOPs as a measure of the computing time used by the particular method.For all diagnostic properties, we also include the randomly assigned saliency as a baseline.

Results
Of the three model architectures, unsurprisingly, the Transformer model performs best, while the CNN and the LSTM models are close in performance.
It is only for the IMDB dataset that the LSTM model performs considerably worse than the CNN, which we attribute to the fact that the instances contain a large number of tokens, as shown in Table 1.As this is not the core focus of this paper, detailed results can be found in the supplementary material.
Overall results. Table 3 presents the mean of all properties across tasks and models, with all property measures normalized to be in the range [0, 1]. We see that gradient-based explainability techniques always have the best or the second-best performance for the diagnostic properties across all three model architectures and all three downstream tasks. Note that InputXGrad µ and GuidedBP µ, which are computed with a mean aggregation of the scores, have some of the worst results. We conjecture that this is due to the large number of values that are averaged: the mean smooths out any differences in the values. In contrast, the L2 norm aggregation amplifies the presence of large and small values in the vector. Of the non-gradient-based explainability methods, LIME performs best, achieving the top result in two out of nine cases. It is followed by ShapSampl and Occlusion. We can conclude that the occlusion-based methods overall have the worst performance according to the diagnostic properties.
Furthermore, we see that the explainability methods achieve better performance for the e-SNLI and the TSE datasets with the Transformer and LSTM architectures, whereas the results for the IMDB dataset are the worst. We hypothesize that this is due to the longer text of the input instances in the IMDB dataset. The scores also indicate that the explainability techniques have the highest diagnostic property measures for the CNN model with the e-SNLI and the IMDB datasets, followed by the LSTM, and the Transformer model. We suggest that the performance of the explainability tools can be worse for large, complex architectures with a huge number of neural nodes, like the Transformer, and better for small, linear architectures like the CNN.

Diagnostic property performance. Figure 2 shows the performance of each explainability technique for all diagnostic properties on the e-SNLI dataset, and Figure 3 for the TSE dataset; both are considerably bigger than IMDB. The IMDB dataset shows similar tendencies, and a corresponding figure can be found in the supplementary material.
Agreement with human rationales.
We observe that the best performing explainability technique for the Transformer model is InputXGrad ℓ2, followed by the other gradient-based techniques with L2 norm aggregation. While for the CNN and the LSTM models we observe similar trends, their MAP scores are always lower than for the Transformer, which indicates a correlation between the performance of a model and its agreement with human rationales. Furthermore, the MAP scores of the CNN model are higher than for the LSTM model, even though the latter achieves higher F1 scores on the e-SNLI dataset. This might indicate that the representations of the LSTM model are less in line with human rationales. Finally, we note that the mean aggregations of the gradient-based explainability techniques have MAP scores close to or even worse than those from the randomly initialized models.
Faithfulness. We find that gradient-based techniques have the best performance for the Faithfulness diagnostic property. On the e-SNLI dataset, it is particularly InputXGrad ℓ2 that performs well across all model architectures. We further find that the CNN exhibits the highest Faithfulness scores for seven out of nine explainability methods. We hypothesize that this is due to the simple architecture with relatively few neural nodes, compared to the recurrent nature of the LSTM model and the large number of neural nodes in the Transformer architecture. Finally, models with high Faithfulness scores do not necessarily have high Human agreement scores and vice versa. This suggests that these two are indeed separate diagnostic properties, and the first should not be confused with estimating the faithfulness of the techniques.
Confidence Indication. We find that the Confidence Indication of all models is predicted most accurately by the ShapSampl, LIME, and Occlusion explainability methods. This result is expected, as they compute the saliency of words based on differences in the model's confidence using different instance perturbations. We further find that the CNN model's confidence is better predicted with InputXGrad µ. The lowest MAE with the balanced dataset is for the CNN and LSTM models. We hypothesize that this could be due to these models' overconfidence, which makes it challenging to detect when the model is not confident of its prediction.
Rationale Consistency. There is no single universal explainability technique that achieves the highest score for the Rationale Consistency property. We see that LIME often achieves high performance, which is expected, as it is trained to approximate the model's predictions. This is beneficial especially for models with complex architectures, like the Transformer. The gradient-based approaches also have high Rationale Consistency scores. We find that the Occlusion technique is the best performing for the LSTM across all tasks, as it is the simplest of the explored explainability techniques and does not inspect the model's internals or try to approximate them. This might serve as an indication that LSTM models, due to their recurrent nature, can be best explained with simple perturbation-based methods that do not examine a model's reasoning process.
Dataset Consistency. Finally, the results for the Dataset Consistency property show low to moderate correlations of the explainability techniques with similarities across instances in the dataset. The correlation is present for LIME and the gradient-based techniques, again with higher scores for the L2-aggregated gradient-based methods.
Overall. To summarise, the proposed list of diagnostic properties allows for assessing existing explainability techniques from different perspectives and supports the choice of the best performing one. Individual property results indicate that gradient-based methods have the best performance. The only strong exception is the better performance of ShapSampl and LIME for the Confidence Indication diagnostic property. However, ShapSampl, LIME, and Occlusion take considerably more time to compute and have worse performance for all other diagnostic properties.

Conclusion
We proposed a comprehensive list of diagnostic properties for the evaluation of explainability techniques from different perspectives. We further used them to compare and contrast different groups of explainability techniques on three downstream tasks and three diverse architectures. We found that gradient-based explanations are the best for all of the three models and all of the three downstream text classification tasks that we consider in this work. Other explainability techniques, such as ShapSampl, LIME, and Occlusion take more time to compute, and are in addition considerably less faithful to the models and less consistent with the rationales of the models and similarities in the datasets.

A.1 Machine Learning Models

The models used in our experiments are trained on the training splits, and the parameters are selected according to the development split. We conducted fine-tuning in a grid-search manner with the ranges and parameters we describe next. We use superscripts to indicate when a parameter value was selected for one of the datasets: e-SNLI (1), Movie Review (2), and TSE (3). For the CNN model, we experimented with the following parameters: embedding dimension ∈ {50, 100, 200, 300^{1,2,3}}, batch size ∈ {16^2, 32, 64^3, 128, 256^1}, dropout rate ∈ {0.05^{1,2,3}, 0.1, 0.15, 0.2}, learning rate for an Adam optimizer ∈ {0.01, 0.03, 0.001^{2,3}, 0.003, 0.0001^1, 0.0003}, kernel sizes ∈ {[3,4,5,6], [4,5,6], [4,5,6,7]^1}, and number of output channels ∈ {50^{2,3}, 100, 200, 300^1}. We leave the stride and the padding parameters to their default values of one and zero.
The CNN and LSTM models are trained with early stopping on validation accuracy with a patience of five and a maximum of 100 training epochs. We also experimented with other optimizers, but none yielded improvements.
Finally, for the Transformer model, we fine-tuned the pre-trained base, uncased LM (Wolf et al., 2019) (110M parameters), where the maximum input size is 512 and the hidden size of each of the 12 layers is 768. We performed a grid search over learning rates ∈ {1e-5, 2e-5^{1,2}, 3e-5^3, 4e-5, 5e-5}. The models were trained with a warm-up period, in which the learning rate increases linearly over the first 0.05% of the steps, a proportion found with a grid search. We train the models for five epochs with early stopping with a patience of one, as Transformer models are easily fine-tuned within a small number of epochs.
All experiments were run on a single NVIDIA TitanX GPU with 8GB of memory, 4GB of RAM, and 4 Intel Xeon Silver 4110 CPUs.
The models were evaluated with the macro F1 score, as implemented in scikit-learn (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.precision_recall_fscore_support.html), which is defined per class as F1 = 2 · P · R / (P + R), with P = TP / (TP + FP) and R = TP / (TP + FN), averaged over the classes, where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives.

Explainability generation. When evaluating the Confidence Indication property of the explainability measures, we train a logistic regression over 5 splits and report the MAE over the five test splits. As for some of the models, e.g. the Transformer, the confidence is always very high, the LR starts to predict only the average confidence. To avoid this, we additionally randomly up-sample the training instances with a smaller confidence, making the number of instances in each confidence interval ([0.0-0.1], . . ., [0.9-1.0]) equal to the maximum number of instances found in one of the separate intervals.
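The confidence-interval up-sampling described above can be sketched as follows (function and variable names are our own, not the paper's code):

```python
import random

def upsample_confidence(instances, confidences, seed=0):
    """Up-sample so every confidence interval [0.0-0.1), ..., [0.9-1.0]
    holds as many instances as the most populated interval."""
    rng = random.Random(seed)
    bins = {}
    for inst, conf in zip(instances, confidences):
        b = min(int(conf * 10), 9)          # interval index 0..9
        bins.setdefault(b, []).append(inst)
    target = max(len(v) for v in bins.values())
    balanced = []
    for members in bins.values():
        balanced.extend(members)
        # randomly duplicate members until the bin reaches the target size
        balanced.extend(rng.choices(members, k=target - len(members)))
    return balanced

balanced = upsample_confidence(list(range(6)),
                               [0.95, 0.96, 0.97, 0.98, 0.15, 0.55])
print(len(balanced))  # 12: three occupied bins, each up-sampled to 4
```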
For both the Rationale and Dataset Consistency properties, we consider Spearman's ρ. While Pearson's ρ measures only the linear correlation between two variables (a change in one variable should be proportional to the change in the other variable), Spearman's ρ measures the monotonic correlation (when one variable increases, the other increases, too). In our experiments, we are interested in the monotonic correlation, as the activation differences do not have to be linearly proportional to the differences of the explanations, and we therefore measure Spearman's ρ.
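A small illustration of this choice (toy data): for a monotonic but non-linear relation, Spearman's ρ is exactly 1 while Pearson's r is below 1.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr

x = np.arange(1.0, 11.0)
y = x ** 3                     # monotonic but non-linear function of x
rho, _ = spearmanr(x, y)       # rank-based: perfect monotonic agreement
r, _ = pearsonr(x, y)          # linear: penalized by the curvature
print(rho)  # ~1.0
print(r)    # below 1.0
```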
The Dataset Consistency property is estimated over instance pairs from the test dataset. As computing it for all possible pairs in the dataset is computationally expensive, we select the 2,000 pairs from each dataset with the highest word overlap and sample another 2,000 from the remaining instance pairs. This ensures that we compute the diagnostic property on a set containing tuples of both similar and dissimilar instances.
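The pair-selection procedure can be sketched as follows (an illustration; the Jaccard overlap over token sets, the function name, and the seed are our own assumptions):

```python
import itertools
import random

def select_pairs(instances, n_top=2000, n_random=2000, seed=0):
    """Build an evaluation set of instance pairs: the n_top pairs with the
    highest word overlap (here, Jaccard similarity over token sets), plus
    n_random pairs sampled from the remaining pairs."""
    pairs = list(itertools.combinations(range(len(instances)), 2))

    def overlap(pair):
        a = set(instances[pair[0]].split())
        b = set(instances[pair[1]].split())
        return len(a & b) / len(a | b) if a | b else 0.0

    pairs.sort(key=overlap, reverse=True)
    top, rest = pairs[:n_top], pairs[n_top:]
    random.Random(seed).shuffle(rest)
    return top + rest[:n_random]
```

The returned list thus mixes high-overlap (similar) pairs with randomly chosen (mostly dissimilar) ones.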

Figure 1: Example of the saliency scores for the words (columns) of an instance from the Twitter Sentiment Extraction dataset. They are produced by the explainability techniques (rows) given a Transformer model. The first row is the human annotation of the salient words. The scores are normalized in the range [0, 1].
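The scaling of saliency scores into [0, 1] can be sketched with standard min-max normalization (an assumption about the exact scaling used for the visualisation):

```python
def minmax_normalize(scores):
    """Scale a list of saliency scores into [0, 1] via min-max
    normalization; constant score vectors are mapped to all zeros."""
    lo, hi = min(scores), max(scores)
    if hi == lo:
        return [0.0 for _ in scores]
    return [(s - lo) / (hi - lo) for s in scores]
```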

Figure 2: Diagnostic property evaluation for all explainability techniques, on the e-SNLI dataset. The ↑ and ↓ signs indicate that higher, correspondingly lower, values of the property measure are better.

Figure 3: Diagnostic property evaluation for all explainability techniques, on the TSE dataset. The ↑ and ↓ signs indicate that higher, correspondingly lower, values of the property measure are better.

A.2 Spider Figure for the IMDB dataset

A.3 Detailed explainability techniques evaluation results

Figure 4: Diagnostic property evaluation for all explainability techniques, on the IMDB dataset. The ↑ and ↓ signs following the names of each explainability method indicate that higher, correspondingly lower, values of the property measure are better.

Table 1: Datasets with human-annotated saliency explanations. The Size column presents the dataset split sizes we use in our experiments. The Length column presents the average number of instance tokens in the test set (inst.) and the average number of human-annotated explanation tokens (expl.).

Table 1 provides an overview of the used datasets. For e-SNLI, models predict inference -- contradiction, neutral, or entailment -- between sentence tuples. For the Movie Reviews dataset, models predict the sentiment -- positive, negative, or neutral -- of reviews with multiple sentences. Finally, for the TSE dataset, models predict tweets' sentiment -- positive, negative, or neutral. The e-SNLI dataset provides three dataset splits with human-annotated rationales, which we use as training, dev, and test sets, respectively. The Movie Reviews dataset provides rationale annotations for nine out of ten splits. Hence, we use the ninth split as a test set and the eighth split as a dev set, while the rest are used for training. Finally, the TSE dataset only provides rationale annotations for the training dataset, and we therefore randomly split it into 80/10/10% chunks for training, development, and testing.
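The random 80/10/10 split used for TSE can be sketched as follows (the seed and shuffling strategy are illustrative assumptions):

```python
import random

def split_80_10_10(examples, seed=42):
    """Randomly shuffle the examples, then cut them into
    80% train, 10% dev, and 10% test portions."""
    idx = list(range(len(examples)))
    random.Random(seed).shuffle(idx)
    n = len(examples)
    n_train, n_dev = int(0.8 * n), int(0.1 * n)
    train = [examples[i] for i in idx[:n_train]]
    dev = [examples[i] for i in idx[n_train:n_train + n_dev]]
    test = [examples[i] for i in idx[n_train + n_dev:]]
    return train, dev, test
```

Seeding the shuffle keeps the split reproducible across runs.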

Table 2: Models' F1 score on the test and the validation datasets. The results present the average and the standard deviation of the Performance measure over five models trained from different seeds. The random versions of the models are again five models, but only randomly initialized, without training.
volutional, a max-pooling, and a linear layer.The embedding layer is initialized with GloVe (Pennington et al., 2014) embeddings and is followed by a dropout layer.The convolutional layer computes convolutions with several window sizes and multiple-output channels with ReLU (Hahnloser

Table 4: Hyper-parameter tuning details. Time is the average time (mean, with standard deviation in brackets), measured in minutes, required for a particular model with all hyper-parameter combinations. Score is the mean and standard deviation of the performance on the validation set as a function of the number of the different hyper-parameter searches.

Table 5: Evaluation of the explainability techniques with Human Agreement (HA) and time for computation. HA is measured as the Mean Average Precision (MAP) with respect to the gold human annotations, together with the MAP of a randomly initialized model (MAP RI). The time is computed with FLOPs. The presented numbers are averaged over five different models, with the standard deviation of the scores in brackets. Explainability methods with the best MAP for a particular dataset and model are in bold, while the best MAP across all models for a dataset is underlined as well. Methods that have MAP worse than the randomly generated saliency are in red.
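The MAP computation against gold rationales can be sketched as follows (an illustration under our assumptions: the gold annotation is a binary vector over tokens, and tokens are ranked by decreasing saliency):

```python
def average_precision(saliency, gold):
    """Average precision of a saliency ranking against binary gold
    annotations: rank tokens by decreasing saliency and average the
    precision at each position where a gold token is retrieved."""
    order = sorted(range(len(saliency)), key=lambda i: -saliency[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if gold[i]:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(all_saliency, all_gold):
    """MAP: mean of the per-instance average precisions."""
    aps = [average_precision(s, g) for s, g in zip(all_saliency, all_gold)]
    return sum(aps) / len(aps)
```

A technique that puts all gold tokens at the top of its ranking scores an AP of 1.0 on that instance, while misplaced gold tokens pull the score down.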

Table 8: Rationale Consistency, measured with Spearman's ρ correlation. The estimated p-value for the correlation is provided in the brackets. The best results for a particular dataset and model are in bold, and the best results across a dataset are also underlined. Correlations lower than that of the randomly sampled saliency scores are colored in red.