Evaluating Saliency Methods for Neural Language Models

Saliency methods are widely used to interpret neural network predictions, but different variants of saliency methods often disagree even on the interpretations of the same prediction made by the same model. In these cases, how do we identify when these interpretations are trustworthy enough to be used in analyses? To address this question, we conduct a comprehensive and quantitative evaluation of saliency methods on a fundamental category of NLP models: neural language models. We evaluate the quality of prediction interpretations from two perspectives, each representing a desirable property of these interpretations: plausibility and faithfulness. Our evaluation is conducted on four different datasets constructed from existing human annotations of syntactic and semantic agreements, at both the sentence and document level. Through our evaluation, we identify various ways in which saliency methods can yield interpretations of low quality. We recommend that future work deploying such methods on neural language models carefully validate their interpretations before drawing insights.


Introduction
While neural network models for Natural Language Processing (NLP) have recently become popular, a general complaint is that their internal decision mechanisms are hard to understand. To alleviate this problem, recent work has deployed interpretation methods on top of the neural network models. Among them, there is a category of interpretation methods, called saliency methods, that is especially widely adopted (Li et al., 2016a,b; Arras et al., 2016, 2017; Mudrakarta et al., 2018; Ding et al., 2019). At a very high level, these methods assign an importance score to each feature in the input feature set F, regarding a specific prediction y made by a neural network model M. Such feature importance scores can hopefully shed light on the neural network models' internal decision mechanisms.

Table 1: An example from our evaluation where different saliency methods (V, SG, and IG; see Section 2) assign different importance scores over the prefix "U.S. companies wanting to expand in Europe" for the same model (Transformer language model) and the same next word prediction (are). The tints of green and yellow mark the magnitude of positive and negative importance scores, respectively.
While analyzing saliency interpretations uncovers useful insights for their respective task of interest, different saliency methods often give different interpretations even when the internal decision mechanism remains the same (with F, y, and M held constant), as exemplified in Table 1. Even so, most existing work that deploys these methods makes an ungrounded assumption that a specific saliency method can reliably uncover the internal model decision mechanism or, at best, relies merely on qualitative inspection to determine their applicability. Such practice has been pointed out in Adebayo et al. (2018); Lipton (2018); Belinkov and Glass (2019) to be potentially problematic for model interpretation studies: it can lead to misleading conclusions about the deep learning model's reasoning process. Meanwhile, in the context of NLP, the quantitative evaluation of saliency interpretations largely remains an open problem (Belinkov and Glass, 2019).
In this paper, we address this problem by building a comprehensive quantitative benchmark to evaluate saliency methods. Our benchmark focuses on a fundamental category of NLP models: neural language models. Following the concepts proposed by Jacovi and Goldberg (2020), our benchmark evaluates the credibility of saliency interpretations from two aspects: plausibility and faithfulness. In short, plausibility measures how much these interpretations align with basic human intuitions about the model decision mechanism, while faithfulness measures how consistent the interpretations are regarding perturbations that are supposed to preserve the same model decision mechanism on either the input feature F or the model M .
With these concepts in mind, our main contribution is materializing the procedures of these tests in the context of neural language modeling and building four test sets from existing linguistic annotations to conduct them. Our study, covering SOTA-level models on three different network architectures, reveals that the applicability of saliency methods depends heavily on specific choices of saliency methods, model architectures, and model configurations. We suggest that future work deploying these methods on NLP models carefully validate their interpretations before drawing conclusions.
This paper is organized as follows: Section 2 briefly introduces saliency methods; Section 3 describes the plausibility and faithfulness tests in our evaluation; Section 4 presents the datasets we built for the evaluation; Section 5 presents our experiment setup and results; Section 6 discusses some limitations and implications of the evaluation; Section 7 concludes the paper.

Saliency
The notion of saliency discussed in this paper is a category of neural network interpretation methods that interpret a specific prediction y made by a neural network model M , by assigning a distribution of importance Ψ(F ) over the input feature set F of the original neural network model.
The most basic and widely used method is to assign importance by the gradient (Simonyan et al., 2013), which we refer to as the vanilla gradient method (V). For each x ∈ F, ψ(x) = ∂p_y/∂x, where p_y is the score of prediction y generated by M. We also examine two improved versions of gradient-based saliency: SmoothGrad (SG) (Smilkov et al., 2017) and Integrated Gradients (IG) (Sundararajan et al., 2017). SmoothGrad reduces the noise in vanilla gradient-based scores by constructing several corrupted instances of the original input through added Gaussian noise, and then averaging the scores. Integrated Gradients computes feature importance as a line integral of the vanilla saliency from a "baseline" point F_0 to the input F in feature space. We refer the readers to the cited papers for details of these saliency methods.
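To make the three methods concrete, here is a minimal PyTorch sketch for a model mapping an input feature vector to prediction scores. This is a simplification: real language models operate on embedded token sequences, and the zero baseline F_0 for IG is an assumption left open by the method.

```python
import torch

def vanilla_grad(model, x, y):
    """Vanilla gradient (V): d p_y / d x for each input dimension."""
    x = x.clone().detach().requires_grad_(True)
    score = model(x)[y]          # scalar score p_y of the prediction y
    score.backward()
    return x.grad.detach()

def smoothgrad(model, x, y, n=20, sigma=0.1):
    """SmoothGrad (SG): average vanilla gradients over noisy copies of x."""
    grads = [vanilla_grad(model, x + sigma * torch.randn_like(x), y)
             for _ in range(n)]
    return torch.stack(grads).mean(dim=0)

def integrated_gradients(model, x, y, baseline=None, steps=50):
    """Integrated Gradients (IG): line integral of gradients from a baseline to x."""
    if baseline is None:
        baseline = torch.zeros_like(x)   # a common but assumed choice of F_0
    total = torch.zeros_like(x)
    for k in range(1, steps + 1):
        point = baseline + (k / steps) * (x - baseline)
        total += vanilla_grad(model, point, y)
    return (x - baseline) * total / steps  # Riemann approximation of the integral
```

For a linear model the three methods coincide up to the input scaling in IG, which makes the sketch easy to sanity-check.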
There is a slight complication in the meaning of F when applying these methods in the context of NLP: all the methods above generate one importance score for each dimension of the word embedding, but most applications of saliency to NLP want a word-level importance score. Hence, we need composition schemes to combine scores over word embedding dimensions into a single score for each word. In the rest of this paper, we assume the "features" in the feature set F are input words to the language model, and word-level importance scores are composed using the gradient · input scheme (Denil et al., 2014; Ding et al., 2019).
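As a concrete sketch (assuming PyTorch-style tensors), the gradient · input composition reduces a per-dimension gradient matrix to one score per word:

```python
import torch

def word_scores(grad, emb):
    """Compose per-dimension gradients into one score per word (gradient . input).

    grad, emb: (seq_len, emb_dim) tensors -- the gradient of p_y with respect to
    each embedding, and the embeddings themselves.
    Returns a (seq_len,) tensor of signed word-level importance scores.
    """
    return (grad * emb).sum(dim=-1)
```

The element-wise product keeps the sign of each contribution, so the composed word score can be negative, which matters for the cue/attractor distinction used later in the evaluation.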

Evaluation Paradigm
In this section, we first introduce the notions of plausibility and faithfulness in the context of neural network interpretations (following Jacovi and Goldberg (2020)), and then introduce the tests we adopt to evaluate each.

Plausibility
Concept An interpretation is plausible if it aligns with human intuitions about how a specific neural model makes decisions. For example, intuitively, an image classifier can identify the object in the image because it can capture some features of the main object in the image. Hence, a plausible interpretation would assign high importance to the area occupied by the main object. This idea of comparison with human-annotated ground-truth (often as "bounding-boxes" signaling the main object's area) is used by various early studies in computer vision to evaluate saliency methods' reliability (Jiang et al., 2013, inter alia). However, the critical challenge of such evaluations for neural language models is the lack of such ground-truth annotations.
Test To overcome this challenge, we follow Poerner et al. (2018) and construct ground-truth annotations from existing lexical agreement annotations. Consider, for example, the case of morphological number agreement. Intuitively, when the language model predicts a verb with a singular morphological number, the singular nouns in the prefix should be considered important features, and vice versa. Based on this intuition, we divide the nouns in the prefix into two sets: the cue set C, which shares the same morphological number as the verb in the sentence, and the attractor set A, which has a different morphological number than the verb in the sentence. Then, according to the prediction y made by the model M, the test is conducted under one of the two following scenarios:

• Expected: when y is the verb with the correct number, the interpretation passes the test if max_{w∈C} ψ(w) > max_{w∈A} ψ(w).

• Alternative: when y is the verb with the incorrect number, the interpretation passes the test if max_{w∈C} ψ(w) < max_{w∈A} ψ(w).

However, this test has a flaw: while the evaluation criteria focus on a specific category of lexical agreement, the prediction of a word could depend on multiple lexical agreements simultaneously. To illustrate this point, consider the verb prediction following the prefix "At the polling station people ...". Suppose the model M predicts the verb vote. One could argue that people is more important than polling station because the model needs the subject to determine the morphological number of the verb. However, the semantic relation between vote and polling station is also important, because that is what makes vote more likely than other random verbs, e.g., sing.
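The pass criterion for both scenarios can be sketched as a small check function (the word-to-score mapping and set arguments are illustrative names, not the paper's implementation):

```python
def passes_plausibility(scores, cue_set, attractor_set, expected):
    """Plausibility check for one test instance.

    Under the expected scenario, the maximum cue score must exceed the
    maximum attractor score; under the alternative scenario, vice versa.
    `scores` maps each word to its composed importance score psi(w).
    """
    cue_max = max(scores[w] for w in cue_set)
    attr_max = max(scores[w] for w in attractor_set)
    return cue_max > attr_max if expected else cue_max < attr_max
```

For the Table 1 example, an interpretation that ranks "companies" above the attractors "U.S." and "Europe" passes the expected-scenario test for the prediction are.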
To minimize such discrepancy and constrain the scope of agreements used to make predictions, we draw inspiration from previous work on representation probing and make an adjustment to the model M we are evaluating (Tenney et al., 2019a,b; Kim et al., 2019; Conneau et al., 2018; Adi et al., 2017; Shi et al., 2016). The idea is to take a language model that is trained to predict words (e.g., vote in the example above) and substitute the original final linear layer with a new linear layer (which we refer to as a probe) fine-tuned to predict a binary lexical agreement tag (e.g., PLURAL) corresponding to the word choice. With this adjustment, the final layer extracts a subspace of the representation that is relevant to the prediction of the particular lexical agreement during the forward computation and, conversely, filters out gradients that are irrelevant to the agreement prediction in the backward pass. This yields an interpretation that is subject only to the same agreement constraints as those under which the test set was annotated.
Apart from the adjustment made to the model M above, we also extend Poerner et al. (2018) in two other aspects: (1) we evaluate on an additional lexical agreement, gender agreement between pronouns and referenced entities, and on both natural and synthetic datasets; (2) instead of evaluating small models, we evaluate large SOTA-level models for each architecture. We also show that evaluation results obtained on smaller models cannot be trivially extended to larger models.

Faithfulness
Concept An interpretation is faithful if the feature importance it assigns is consistent with the internal decision mechanism of a model. However, as Jacovi and Goldberg (2020) pointed out, the notion of "decision mechanism" lacks a standard definition and a practical way to make comparisons. Hence, as a proxy, we follow the working definition of faithfulness proposed in their work, which states that an interpretation is faithful if the feature importance it assigns remains consistent under changes that should not change the internal model decision mechanism. Among the three relevant factors for saliency methods (prediction y, model M, and input feature set F), we focus on consistency upon changes in the model M (model consistency) and the input feature set F (input consistency). Note that these two consistencies respectively correspond to assumptions 1 and 2 in the discussion of faithfulness evaluation in Jacovi and Goldberg (2020).

Model Consistency Test To measure model consistency, we propose to measure the consistency between feature importance Ψ_M(F) and Ψ_M′(F), respectively generated from the original model M and a smaller model M′ that is trained by distilling knowledge from M. In this way, although M and M′ have different architectures, M′ is trained to mimic the behavior of M to the extent possible, and thus has a similar underlying decision mechanism.

Input Consistency Test To measure input consistency, we perform substitutions in the input and measure the consistency between feature importance Ψ(F) and Ψ(F′), where F and F′ are the input feature sets before and after the substitution. For example, the following prefix-prediction pairs should have the same feature importance distribution:

• The nun bought the son a gift because (she...)

• The woman bought the boy a gift because (she...)

For both tests, we measure consistency by the Pearson correlation between pairs of importance scores over the input feature set F.
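For reference, the Pearson correlation between two importance score sequences of equal length can be computed directly from its definition:

```python
import math

def pearson(a, b):
    """Pearson correlation between two importance score sequences.

    a, b: equal-length sequences of word-level importance scores.
    Returns a value in [-1, 1]; 1 means the two interpretations rank and
    scale features identically up to an affine transform.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    return cov / (std_a * std_b)
```

A perfectly consistent pair of interpretations scores 1.0; sign-flipped interpretations score -1.0.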
Also, note that although we could theoretically conduct faithfulness tests with any model M and any dataset, for simplicity of analysis and data creation, we use the same model M (with lexical agreement probes) and the same datasets as in the plausibility tests.

Data
Following the formulation in Section 3, we constructed four novel datasets for our benchmark, as exemplified in Table 2. Two of the datasets concern number agreement of a verb with its subject. The other two concern gender agreement of a pronoun with its anteceding entity mentions. For each lexical agreement type, we have one synthetic dataset and one natural dataset. Both synthetic datasets ensure there is exactly one cue and one attractor for each test instance, while the natural datasets often contain more than one of each.
For number agreement, our synthetic dataset is constructed from selected sections of Syneval, a targeted language model evaluation dataset from Marvin and Linzen (2018), where the verbs and subjects can easily be induced with heuristics. We only use the most challenging sections, where strong interfering attractors are involved. Our natural dataset for this task is filtered from the Penn Treebank (Marcus et al., 1993, PTB), including its training, development, and test sections. We choose PTB because it offers not only the human-annotated POS tags necessary for benchmark construction but also the dependent subjects of verbs for further analysis.
For gender agreement, our synthetic dataset comes from the unambiguous Winobias coreference resolution dataset used in Jumelet et al. (2019), and we only use the 1000-example subset where there is exactly one male and one female antecedent. Because this dataset is intentionally designed such that most humans would find pronouns of either gender equally likely to follow the prefix, neither pronoun gender is considered "correct". Hence, without loss of generality, we designate the female pronoun as the expected case. Our natural dataset for this task is filtered from the CoNLL-2012 shared task dataset for coreference resolution (Pradhan et al., 2012; also including training, development, and test sets). The prefix of each test example covers a document-level context, which usually spans several hundred words.
Plausibility Test For number agreement, the cue set C is the set of all nouns that have the same morphological number as the verb. In contrast, the attractor set A is the set of all nouns with a different morphological number. For gender agreement, the cue set C is the set of all nouns with the same gender as the pronoun, while the attractor set A is the set of all nouns with a different gender.

Model Consistency Test
No special treatment to data is needed for this test. We conduct model consistency tests on all datasets we built.

Input Consistency Test
We recognize that generating interpretation-preserving input perturbations for natural datasets is quite tricky. Hence, unlike the model consistency test, we focus on the two synthetic datasets for this test because they are generated from templates. As can be seen from the examples, when the nouns in the cue/attractor set are substituted while maintaining the lexical agreement, the underlying model decision mechanism should be left unchanged; hence, these substitutions can be viewed as interpretation-preserving perturbations. We identified 24 and 254 such interpretation-preserving templates from our Syneval and Winobias datasets, respectively, and generated perturbation pairs by combining the first example of each template with the other examples generated from the same template.

Table 2: Example prefixes from the four evaluation datasets, followed by the probing tag prediction under the expected scenario. The cue and attractor sets are marked with solid green and yellow, respectively.

Table 3: Plausibility benchmark result. Each number is the fraction of cases in which the interpretation passes the benchmark test, while the numbers in brackets for each architecture are the fraction of times these scenarios occur for predictions generated by the corresponding model. Results from the best interpretation method for each architecture are boldfaced. The exp. and alt. columns break the evaluation results down into the expected and alternative scenarios defined in Section 3. V, SG, and IG stand for the vanilla saliency, SmoothGrad, and Integrated Gradients methods, respectively.

Experiment Setup

[...] implementation in the fairseq toolkit (Ott et al., 2019). For all the task-specific "probes", the fine-tuning is performed on examples extracted from the WikiText-2 training data. A tuning example consists of an input prefix and a gold tag for the lexical agreement in both cases.
For number agreement, we first run the Stanford POS Tagger (Toutanova et al., 2003) on the data, and an example is extracted for each present tense verb and each instance of was or were. For gender agreement, an example is extracted for each gendered pronoun. During fine-tuning, we fix all parameters except the final linear layer. The final layer is tuned to minimize cross-entropy with the Adam optimizer (Kingma and Ba, 2015), an initial learning rate of 1e−3, and a ReduceLROnPlateau scheduler.
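A minimal sketch of this probe fine-tuning setup, with hypothetical sizes (a frozen LM producing 512-dimensional hidden states, two agreement tags), might look like:

```python
import torch
from torch import nn

# Hypothetical shapes for illustration: the frozen LM maps a prefix to a
# hidden state of size d; the probe predicts one of two agreement tags.
d, num_tags = 512, 2                        # e.g. SINGULAR vs PLURAL
probe = nn.Linear(d, num_tags)              # the only trainable parameters

optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer)
loss_fn = nn.CrossEntropyLoss()

def tune_step(hidden, gold_tags):
    """One fine-tuning step: only the probe's weights receive updates,
    since all other model parameters are fixed (here, outside the graph)."""
    optimizer.zero_grad()
    loss = loss_fn(probe(hidden), gold_tags)
    loss.backward()
    optimizer.step()
    return loss.item()

# Once per epoch, scheduler.step(dev_loss) would reduce the learning rate
# when the held-out loss plateaus.
```

The frozen-LM hidden states stand in for the output of the penultimate layer; in the actual setup, they would come from the pretrained language model's forward pass.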
We follow the setup of DistilBERT (Sanh et al., 2019) for the distillation process involved in the model consistency test, which reduces the depth of models but not their width. For our LSTM (3 layers) and QRNN (4 layers) models, the student M′ we distill is one layer shallower than the original model M. For our Transformer model (16 layers), we distill a 4-layer M′, largely due to memory constraints.
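The distillation objective itself is not spelled out here; a common DistilBERT-style choice (an assumption, not necessarily the paper's exact loss) is a temperature-softened KL divergence between teacher and student output distributions:

```python
import torch
import torch.nn.functional as F

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Soft-target distillation loss (DistilBERT-style sketch).

    KL divergence between the temperature-softened teacher distribution
    and the student distribution; the T*T factor keeps gradient magnitudes
    comparable across temperatures.
    """
    s = F.log_softmax(student_logits / T, dim=-1)
    t = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * (T * T)
```

When the student exactly matches the teacher, the loss is zero, which is the sense in which M′ is trained to mimic M's behavior.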

Main Results
Plausibility According to our plausibility evaluation results, summarized in Table 3, both SG and IG consistently perform better than the vanilla saliency method across benchmark datasets and interpreted models. However, the comparison between SG and IG interpretations varies depending on the model architecture and test set.
Across different architectures, the Transformer language model achieves the best plausibility, except on the Syneval dataset. LSTM closely follows Transformer on most benchmarks, while the plausibility of interpretations from QRNN is much worse. Another trend worth noting is that the gap between Transformer and the other two architectures is much larger on the CoNLL benchmark, which is the only test that involves interpreting document-level contexts. However, these architectures' prediction accuracy is similar, meaning that there is no significant difference in modeling power for gender agreement on this dataset. We hence conjecture that the recurrent structure of LSTM and QRNN might diminish gradient signals over increasing time steps, causing the deterioration of interpretation quality for long-distance agreements, a problem from which Transformer is exempt thanks to its self-attention structure.
Faithfulness Table 4a shows the input consistency benchmark result. Firstly, the interpretations of LSTM and Transformer are more resilient to input perturbations than those of QRNN, the same trend we observed in the plausibility benchmark on these datasets. When comparing saliency methods, SG consistently outperforms the others for Transformer but fails for the other two architectures, especially QRNN. Also note that higher plausibility does not necessarily imply higher faithfulness. For example, compared to the vanilla saliency method, SG and IG almost always significantly improve plausibility but do not always improve faithfulness. This lack of improvement differs from the findings in computer vision (Yeh et al., 2019), which show that both SG and IG improve input consistency. Also, for LSTM, although SG works slightly better than IG in terms of plausibility, IG outperforms SG in terms of input consistency by a large margin.

Table 4b shows the model consistency benchmark result. One should first notice that model consistency numbers are lower than input consistency numbers across the board, and that the drop is more significant for LSTM and QRNN, even though their student models differ less from the originals than the Transformer student does (<20% parameter reduction vs. 61%). As a result, there is a significant gap in the best model consistency results between LSTM/QRNN and Transformer. As in the plausibility results, this gap is most notable on the CoNLL dataset. On the other hand, when comparing saliency methods, we again see that SG outperforms the others for Transformer while failing most of the time for QRNN and LSTM.

Analysis
Plausibility vs. Faithfulness A natural question for our evaluation is how the properties of plausibility and faithfulness interact with each other. Table 5 illustrates this interaction with qualitative examples. Among them, cases 1 and 2 are two cases where the plausibility and input faithfulness evaluation results do not correlate. In general, the interpretations in both cases are of low quality, but they fail in different ways. In case 1, the interpretation assigns the correct relative ranking to the cue words and attractor words, but the importance of the words outside the cue/attractor sets varies upon perturbation. On the other hand, in case 2, the importance ranking among features is roughly maintained upon perturbation, but the importance scores assigned in both examples do not agree with the prediction being interpreted (the FEMININE tag) and thus can hardly be understood by humans. It should be noted that these defects can only be revealed when both plausibility and faithfulness tests are deployed.
Case 3 shows a scenario where the saliency method yields very different interpretations for the same input/prediction pair, indicating that interpretations from this architecture/saliency method combination are subject to changes upon changes in the architecture configurations. Finally, in case 4, we see that an architecture/saliency method combination performing well in all tests yields stable interpretations that humans can easily understand.
Sensitivity to Model Configurations Our model faithfulness evaluation shows that variations in model configuration (number of layers) can drastically change model interpretations in many cases. Hence, we want to answer two analysis questions: (1) do these interpretations change for the better or worse, quality-wise, with the distilled smaller models? (2) are there any patterns to such changes? Due to space constraints, we only show some analysis results for question (1) in Table 6. Overall, compared to the corresponding results in Table 3 (for plausibility) and Table 4 (for faithfulness), interpretation quality generally does not deteriorate with the smaller distilled models. Most remarkably, we see a drastic performance improvement for QRNN, both in plausibility and faithfulness. For LSTM and Transformer, we observe an improvement in input faithfulness on Winobias and roughly the same performance on other tests.
As for the second question, we build smaller Transformer language models with various settings of depth, number of heads, embedding size, and feedforward layer width, while keeping other hyperparameters unchanged. Unfortunately, the trends are quite noisy and also depend heavily on the chosen saliency method (detailed discussion of these analyses is in Appendix B.2). Hence, we highly recommend that the evaluation of saliency methods be conducted on the specific model configurations of interest, and that trends of interpretation quality on a specific model configuration not be over-generalized to other configurations.
Saliency vs. Probing Our evaluation incorporates probing to focus only on the specific lexical agreements of interest. It should be pointed out that the representation probing literature has always worked under the following assumption: when the model makes an expected-scenario ("correct") prediction, it is always referring to a grammatical cue, for example, the subject of the verb in the number agreement case. However, in our evaluation, we also observe some interesting phenomena in the saliency interpretations that break this assumption, as exemplified in Table 7. This calls for future work that aims to better understand language model behaviors by examining other possible cues used for predictions made in representation probing, under the validated cases where saliency methods can be reliably applied.

Discussion
Most existing work on evaluating saliency methods focuses only on computer vision models (Adebayo et al., 2020; Hooker et al., 2019; Adebayo et al., 2018; Heo et al., 2019; Ghorbani et al., 2019, inter alia).

Our evaluation is not without its limitations. The first limitation, inherited from earlier work by Poerner et al. (2018), is that our plausibility test only concerns the words in the cue/attractor sets rather than the other words in the input prefix. This limitation is inevitable because the annotations from which we build our ground-truth interpretations concern only a specific lexical agreement. It can be mitigated by combining plausibility tests with faithfulness tests, which concern all the words in the input prefix.
The second limitation is that the test sets used in these benchmarks need to be constructed in a case-by-case manner, according to the chosen lexical agreements and input perturbations. While it is hard to create plausibility test sets without human involvement, future work could explore automatic input consistency tests by utilizing adversarial input generation techniques in NLP (Alzantot et al., 2018; Cheng et al., 2019, 2020). It should also be noted that while our work focuses on evaluating a specific category of interpretation methods for neural language models, our evaluation paradigm can easily be extended to other interpretation methods, such as the attention mechanism, and to other sequence models, such as masked language models (e.g., BERT). We would also like to extend these evaluations beyond English datasets, especially to languages with richer morphological inflection.

Conclusion
We conduct a quantitative evaluation of saliency methods on neural language models from the perspectives of plausibility and faithfulness. Our evaluation shows that a model interpretation can fail due to a lack of either plausibility or faithfulness, and that interpretations are trustworthy only when they do well on both tests. We also notice that the performance of saliency interpretations is generally sensitive to even minor model configuration changes. Hence, trends of interpretation quality on a specific model configuration should not be over-generalized to other configurations.
We want the community to be aware that saliency methods, like many other post-hoc interpretation methods, still do not generate trustworthy interpretations all the time. Hence, we recommend that adopting any model interpretation method as a source of knowledge about NLP models' reasoning processes should only happen after quantitative checks similar to those presented in this paper are performed. We also hope our proposed test paradigm and accompanying test sets provide useful guidance for future work on evaluating interpretation methods. Our evaluation dataset and code to reproduce the analysis are available online.

A.1 PTB

A potential candidate for a test case is extracted every time a word with POS tag VBZ (verb, 3rd person singular present) or VBP (verb, non-3rd person singular present), or a copula among is, are, was, were, shows up. The candidates are then filtered subject to the following criteria:

1. The prefix has at least one attractor word (a noun with a different morphological number than the predicted verb). This is to ensure that the evaluation can be conducted in the alternative scenario.
2. The verb must not immediately follow its grammatical subject (note: it may still immediately follow a cue word that is not the grammatical subject). This is to ensure that the signal of the subject is not overwhelmingly strong compared to the attractors.
3. Not all attractors may occur more than 10 words earlier than the grammatical subject, for the same reason as the previous criterion.
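The three criteria above can be sketched as a single filter function (the position indices, the `nouns` map, and the argument names are hypothetical, for illustration only):

```python
def keep_candidate(verb_idx, verb_number, nouns, subject_idx):
    """Apply the three PTB filtering criteria to one verb candidate.

    verb_idx, subject_idx: word positions of the verb and its subject.
    nouns: maps each noun's position to its morphological number ('sg'/'pl').
    """
    attractors = [i for i, num in nouns.items() if num != verb_number]
    # 1. At least one attractor, so the alternative scenario can be tested.
    if not attractors:
        return False
    # 2. The verb must not immediately follow its grammatical subject.
    if verb_idx == subject_idx + 1:
        return False
    # 3. Reject if every attractor occurs more than 10 words before the subject.
    if all(subject_idx - i > 10 for i in attractors):
        return False
    return True
```

The function mirrors the numbered criteria one-to-one, which makes each rejection reason easy to audit.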
Overall, we obtained 1448 test cases out of 49168 sentences in PTB (including the train, dev, and test sets). We lose the vast majority of sentences mostly because of the last two criteria.

A.2 Syneval
We use the following sections of the original data (followed by their names in the data dump; Marvin and Linzen, 2018): • Agreement in a sentential complement. We select these sections because they all have strong interfering attractors or have cues that may potentially be mistaken for attractors. We obtained far fewer examples (6280) than the original data (249760) because many examples differ only in the verb or the object they use, which become duplicates when we extract the prefix before the verb.
The original dataset does not come with cue/attractor annotations, but it can be easily inferred because they are generated by simple heuristics.
Note that most of these sections have only around 50% prediction accuracy with RNNs in the original paper. Our results on large-scale language models corroborate the findings in the original paper.

A.3 CoNLL
We use the dataset (Pradhan et al., 2012) with gold parses, entity mentions, and mention boundaries. A potential candidate for a test case is extracted every time a pronoun shows up. The male pronouns are he, him, himself, and his, while the female pronouns are she, her, herself, and hers. We do not include epicene pronouns such as it and they because they often involve tricky cases, e.g., entity mentions covering a whole clause. We break prefixes at the document boundaries provided in the original dataset, unless the prefix is longer than 512 words, in which case we instead break at the nearest sentence boundary.
The annotation for this dataset does not cover the gender of entities. We are aware that the original shared task provides gender annotation, but to this day the documentation for that data is missing, so we cannot make use of it. We instead used several heuristics to infer the gender of an entity mention, applied in descending order of priority:

• If an entity mention and a pronoun have a coreference relationship, they should share the same gender.
• If an entity mention starts with "Mr." or "Mrs." or "Ms.", we assign the corresponding gender.
• If the entity mention is two tokens long, we assume it is a name and use a gender inference tool to guess its gender. The gender guesser may also indicate that it cannot infer the gender; in that case, we do not assign a gender.
• If a mention is co-referenced with another mention that is not a pronoun, they should also have the same gender.
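The first three heuristics in the cascade above can be sketched as follows (the function and argument names are hypothetical; `guess_name_gender` stands in for an external name-based gender guesser, and the fourth heuristic, propagation through non-pronoun co-referent mentions, is omitted because it requires the full coreference graph):

```python
def infer_gender(mention, coref_pronouns, guess_name_gender):
    """Cascade of gender-inference heuristics, applied in priority order.

    mention: the entity mention string.
    coref_pronouns: pronouns co-referent with the mention, if any are known.
    guess_name_gender: callable mapping a name to 'male', 'female', or None.
    """
    male = {"he", "him", "himself", "his"}
    female = {"she", "her", "herself", "hers"}
    # 1. Share the gender of a co-referent gendered pronoun.
    for p in coref_pronouns:
        if p in male:
            return "male"
        if p in female:
            return "female"
    # 2. Honorific prefixes.
    first = mention.split()[0]
    if first == "Mr.":
        return "male"
    if first in ("Mrs.", "Ms."):
        return "female"
    # 3. Two-token mentions are assumed to be names; ask the guesser.
    if len(mention.split()) == 2:
        return guess_name_gender(mention)  # may return None if unsure
    return None
```

Keeping the heuristics in one ordered cascade ensures that higher-confidence evidence (coreference with a pronoun) always overrides weaker signals (name-based guessing).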
Manual inspection of the resulting data indicates that the scheme above covers the gender of most entity mentions correctly. We hope that our dataset can be further improved with higher-quality annotation of entity genders. Since each entity mention can span more than one word, we add all words within the span to the corresponding cue/attractor set. A tricky case is when two entity mention spans are nested or intersecting. For nested spans, we exclude the smaller span from the larger one to create two non-overlapping spans for the cue/attractor sets. For intersecting spans, we exclude the intersecting part from both spans.
Finally, all candidates are filtered subject to the following two criteria: 1. The prefix should include one attractor entity.
2. The entity mention that is closest to the predicted pronoun should be of a different gender (either the opposite gender or epicene).
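The two criteria can be expressed as a single filter predicate. The sketch below assumes a hypothetical representation of mentions as dicts with "gender" and "position" fields; the real pipeline's data structures may differ.

```python
# Sketch of the two filtering criteria applied to each candidate.
# target_gender is the gender of the pronoun to be predicted.

def keep_candidate(mentions, target_gender):
    """mentions: entity mentions in the prefix, each a dict with
    'gender' and 'position' (token index). True iff both criteria pass."""
    if not mentions:
        return False
    # Criterion 1: the prefix must include at least one attractor,
    # i.e. a mention whose gender differs from the target pronoun's.
    attractors = [m for m in mentions if m["gender"] != target_gender]
    if not attractors:
        return False
    # Criterion 2: the mention closest to the prediction point must be
    # of a different gender (opposite or epicene).
    closest = max(mentions, key=lambda m: m["position"])
    return closest["gender"] != target_gender
```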
We obtained 586 document segments from the 2280 documents in the original data. As pointed out in Zhao et al. (2018), the CoNLL dataset is significantly biased towards male entity mentions. Nevertheless, our filtering scheme generated a relatively balanced test set: among 586 test cases, 258 are male pronouns, while 328 are female pronouns.

A.4 Winobias
We used the same data as the unambiguous coreference resolution dataset in Jumelet et al. (2019), which is in turn generated by a script from Zhao et al. (2018), except that we excluded cases where both nouns in the sentence are of the same gender. As with the Syneval dataset, the cues and attractors can easily be inferred with heuristics.

B Additional Results
We present here additional results that did not fit into the main paper.

B.1 Vector Norm (VN) Composition Scheme
In this section, we explain why we chose not to cover the vector norm composition scheme (mentioned in Section 2) in our main evaluation results.
We first argue that, mathematically, VN is not a good fit for our evaluation paradigm. The vector norm composition scheme indicates only the magnitude of a feature's importance, not its polarity, because it cannot generate negative importance scores. Polarity matters because our plausibility evaluation distinguishes between input words that should have positive and negative importance scores by placing them in the cue and attractor sets, respectively. For example, in Table 1, the singular proper nouns U.S. and Europe are important input words because they could potentially lead the model to make the alternative prediction is instead of the expected prediction are. Hence, they are placed in the attractor set, and when interpreting the next-word prediction are, our plausibility test expects them to have large negative importance scores.
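The polarity point can be seen directly in a toy sketch: given a per-word gradient vector g and word embedding x (random stand-ins here, not actual model quantities), the gradient · input score g · x is signed, while the vector norm ||g|| is non-negative by construction.

```python
import numpy as np

# Toy illustration of the two composition schemes on random vectors:
# gradient-input (GI) can be negative, the vector norm (VN) cannot.

rng = np.random.default_rng(0)
g = rng.normal(size=(4, 8))   # per-word gradients: 4 words, embedding dim 8
x = rng.normal(size=(4, 8))   # word embeddings

gi_scores = (g * x).sum(axis=1)        # dot product per word: signed
vn_scores = np.linalg.norm(g, axis=1)  # L2 norm per word: always >= 0

print(gi_scores)
print(vn_scores)
```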
Besides, we did run the plausibility evaluation with the vector norm composition scheme under some settings, as shown in Table 8. For the vanilla gradient saliency method, the VN composition scheme performs on par with the gradient · input (GI) scheme (which is used for our main results). With SmoothGrad, however, the plausibility result does not change significantly, unlike with the GI scheme. This corroborates the results in Ding et al. (2019), where SmoothGrad likewise fails to improve interpretation quality under the VN composition scheme.
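For reference, a minimal sketch of SmoothGrad with both composition schemes: average the gradient over noisy copies of the input, then compose via GI or VN. The `grad_fn` below is a hypothetical stand-in for backpropagation through the language model, not the actual model used in our experiments.

```python
import numpy as np

# Minimal SmoothGrad sketch: average gradients over Gaussian-perturbed
# inputs, then apply the GI and VN composition schemes.

def smoothgrad(x, grad_fn, n=50, sigma=0.1, seed=0):
    rng = np.random.default_rng(seed)
    grads = [grad_fn(x + rng.normal(scale=sigma, size=x.shape))
             for _ in range(n)]
    g = np.mean(grads, axis=0)         # (num_words, dim) averaged gradient
    gi = (g * x).sum(axis=1)           # gradient-input: signed scores
    vn = np.linalg.norm(g, axis=1)     # vector norm: non-negative scores
    return gi, vn

# Toy "model" whose gradient is 1 on embedding dimension 0, 0 elsewhere,
# regardless of the input (so the SmoothGrad average is exact).
grad_fn = lambda x: np.concatenate(
    [np.ones((x.shape[0], 1)), np.zeros((x.shape[0], x.shape[1] - 1))],
    axis=1)
x = np.arange(6.0).reshape(3, 2)       # 3 words, embedding dim 2
gi, vn = smoothgrad(x, grad_fn)
print(gi, vn)
```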
Given this theoretical and empirical evidence, we decided to drop the vector norm composition scheme from our evaluation.

B.2 Patterns for Changes of Interpretation Quality with Varying Model Configurations
As mentioned in Section 5.3, we would like to know whether there are predictable patterns in how interpretation quality changes with varying model configurations. To answer this question, we build smaller Transformer language models with various depths, numbers of heads, embedding sizes, and feedforward layer widths, while keeping all other hyperparameters unchanged (see Table 3 for notations). We show two different groups of comparisons here. Figure 1 shows our investigation of the interaction between model configuration and interpretation plausibility on the PTB and CoNLL test sets. In general, the Integrated Gradients method works better for deeper models, while SG works better for shallower models on the PTB test set but performs roughly the same across all architectures on the CoNLL test set. This indicates how noisy the trend under investigation is, as both the interpretation method and the choice of evaluation dataset influence it. As for the other factors of the model configuration, the trend is even noisier (note how much the rankings of different configurations change when moving from shallow to deep models) and does not show any clear patterns. Figure 2, on the other hand, focuses on one specific dataset and investigates the trend in both plausibility and input faithfulness with varying model configurations. For plausibility, we largely see the same trend as on the PTB dataset. For faithfulness, the trend for SG is largely the same as for plausibility. For IG, the variance across the other configuration factors differs between shallower and deeper models, but overall the numbers are still higher for deeper models, as with plausibility.
Overall, these analyses further support our conclusion in the main paper that interpretation quality is sensitive to model configuration changes. We reiterate that evaluations of saliency methods should be conducted on the specific model configurations of interest, and that trends of interpretation quality observed for one model configuration should not be over-generalized to other configurations.

C Language Model Perplexities
Parameter size (in millions) and perplexity on the WikiText-103 dev set for all language models we trained are shown in Table 9 for reference, together with the respective commands to reproduce these results.

D Additional Interpretation Examples
We show some additional interpretations generated by the state-of-the-art LSTM (Table 10), QRNN (Table 11), and Transformer (Table 12) models on the PTB and CoNLL datasets, each with its respective best-performing interpretation method.