Contextual explanation rules for neural clinical classifiers

Several previous studies on explaining recurrent neural networks focus on finding the input segments that are most important for the network and presenting these as explanations. In that case, the manner in which these input segments combine with each other to form an explanatory pattern remains unknown. To address this, some previous work finds patterns (called rules) in the data that explain neural outputs. However, these explanations are often insensitive to model parameters, which limits their scalability for text explanations. To overcome these limitations, we propose a pipeline to explain RNNs by means of decision lists (also called rules) over skipgrams. For evaluation of explanations, we create a synthetic sepsis-identification dataset, and additionally apply our technique to real clinical and sentiment analysis datasets. We find that our technique consistently achieves high explanation fidelity and qualitatively interpretable rules.


Introduction
Understanding and explaining the decisions of complex models such as neural networks has recently gained a lot of attention, as it engenders trust in these models, helps improve them, and leads to a better understanding of them (Montavon et al., 2018; Belinkov and Glass, 2019). Several previous studies developing interpretability techniques provide a set of input features or segments that are the most salient for the model output. Approaches such as input perturbation and gradient computation are popular for this purpose (Ancona et al., 2018; Arras et al., 2019). A drawback of these approaches is the lack of information about the interaction between different features. While heatmaps (Li et al., 2016b,a; Arras et al., 2017) and partial dependence plots (Lundberg and Lee, 2017) are popularly used, they only provide a qualitative view which quickly gets complex as the number of features increases. To overcome this limitation, rule induction for model interpretability has become popular, as it accounts for interactions between multiple features and output classes (Lakkaraju et al., 2017; Puri et al., 2017; Ming et al., 2018; Ribeiro et al., 2018; Sushil et al., 2018; Evans et al., 2019; Pastor and Baralis, 2019). Most of these works treat the explained models as black boxes, and fit a separate interpretable model on the original input to find rules that mimic the output of the explained model. However, because the interpretable model has no information about the parameters of the complex model, global explanation is expensive, and the explaining and explained models could fit different curves to arrive at the same output. Sushil et al. (2018) incorporate model gradients in the explanation process to overcome these challenges, but their technique cannot be used with current state-of-the-art models that use word embeddings, due to its reliance on interpretable model input in the form of bag-of-words.
Murdoch and Szlam (2017) explain long short-term memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997) by means of ngram rules, but their rules are limited to the presence of single ngrams and do not capture interaction between ngrams in text. To learn explanation rules for RNNs while overcoming the limitations of the previous approaches, we make the following contributions in this paper: 1. We induce explanation rules over important skipgrams in text, while ensuring that these rules generalize to unseen data. To this end, we quantify skipgram importance in LSTMs by first pooling gradients across embedding dimensions to compute word importance, and then aggregating word scores into skipgram importance. Skipgrams incorporate word order in explanations and increase interpretability.
2. To overcome existing limitations with automated explanation evaluation (Lertvittayakumjorn and Toni, 2019; Poerner et al., 2018), we provide a synthetic clinical text classification dataset for evaluating interpretability techniques. We construct this dataset according to existing medical knowledge and a real clinical corpus. We validate our explanation pipeline on this synthetic dataset by recovering the labeling rules of the dataset. We then apply our pipeline to two clinical datasets for sepsis classification and one dataset for sentiment analysis, and confirm that the explanation results obtained on synthetic data carry over to real corpora.

Explanation pipeline
We propose a method to find decision lists as explanation rules for RNNs with word embedding input. We quantify word importance in an RNN by comparing multiple pooling operations, both qualitatively and quantitatively. After establishing a preferred pooling technique, we move to finding the importance of skipgrams, which provide larger context around words in explanations. We then find decision lists that associate the relative importance of multiple skipgrams in the RNN with an output class. This is an extension of our prior work (Sushil et al., 2018), where we found if-then-else rules for feedforward neural networks. However, the previous approach relies on interpretable inputs independent of word order and does not scale to current state-of-the-art approaches that use word embeddings instead. Moreover, explanation of binary classifiers is not supported by that pipeline, and the explanation rules are not generalized to unseen examples. Furthermore, the previous explanation rules are hierarchical, and hence cannot be understood independently without parsing the entire rule hierarchy.
In the proposed research, we address all these limitations and extend the explanations to binary cases, to unseen data, and to sequential neural networks with word embedding input. Additionally, each explanation rule can be understood as an independent decision path. We present the complete pipeline for our approach, which we name UNRAVEL, in Figure 1. Code for the paper is available at https://github.com/clips/rnn_expl_rules.

Word importance computation
Saliency (importance) scores of input features are often computed as gradients of the predicted output node w.r.t. all the input nodes for all the instances (Simonyan et al., 2013; Adebayo et al., 2018). In neural architectures that have an embedding layer, interpretable input features are replaced by corresponding low-dimensional embeddings. As a result, we obtain a different saliency score for every embedding dimension of a word in a document. Because embedding dimensions are not interpretable, it is difficult to understand what these multiple saliency scores mean. To instead obtain a single score for a word by combining the saliency values of all the dimensions, we consider the following commonly used pooling techniques: • L2 norm of the gradient scores (Bansal et al., 2016; Hechtlinger, 2016; Poerner et al., 2018).
• Sum of gradients across all the dimensions.
• Dot product between the embeddings and the gradient scores (Denil et al., 2014;Montavon et al., 2018;Arras et al., 2019). This additionally accounts for the embedding value itself.
We also experimented with max pooling, but we omit the discussion here because its scores show the same patterns as the L2 norm, albeit with higher magnitudes.
In Section 4.1, we analyze the importance scores obtained with these techniques qualitatively and quantitatively to identify the preferred one.
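The three pooling operations above can be sketched as follows. This is an illustrative implementation, not the paper's released code; the function name and the toy gradient and embedding values are ours.

```python
import numpy as np

def pool_word_saliency(grads, embs, method="dot"):
    """Collapse per-dimension gradient scores into one saliency score per word.

    grads, embs: arrays of shape (num_words, emb_dim) holding, for each word,
    the gradient of the predicted output w.r.t. each embedding dimension,
    and the word embedding itself.
    """
    if method == "l2":   # always non-negative, since it squares the gradients
        return np.linalg.norm(grads, axis=1)
    if method == "sum":  # signed sum across embedding dimensions
        return grads.sum(axis=1)
    if method == "dot":  # gradient * input; also accounts for the embedding values
        return (grads * embs).sum(axis=1)
    raise ValueError(f"unknown pooling method: {method}")

# toy example: 2 words with 3-dimensional embeddings
g = np.array([[0.2, -0.1, 0.4], [-0.3, 0.0, 0.1]])
e = np.array([[1.0, 2.0, -1.0], [0.5, -0.5, 2.0]])
```

Note how the dot product can flip the sign of a word's score relative to sum pooling when the embedding values are negative, which matches the qualitative differences discussed in Section 4.1.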

Skipgrams to incorporate context
One of the contributions of this work is to find explanation rules for sequential models such as RNNs. Conjunctive clauses of if-then-else rules are order independent, although word order is critical for RNNs. To account for word order in input documents, some previous approaches find the most important ngrams instead of only the top words (Murdoch and Szlam, 2017; Jacovi et al., 2018). To incorporate word order in explanation rules as well, we compute the importance of subsequences in the documents before combining different subsequences into conjunctive rules. We define the importance of a subsequence as the mean saliency of all the tokens in that subsequence. We represent subsequences as skipgrams with length in the range [1,4] and with a maximum of two skip tokens. After computing the scores, we retain the 50 most important skipgrams for every document (based on absolute importance scores). The number of unique skipgrams obtained in this manner is very high. To limit the complexity of explanations, we retain the 5k skipgrams with the highest total absolute importance score across the entire training set and learn explanation rules over these. To this end, we create a bag-of-skipgram-importance representation of the documents, where the vocabulary corresponds to the 5k most important skipgrams across the training set. For ease of understanding, we discretize the importance scores of the skipgrams into five levels of importance: {−−, −, 0, +, ++}. Here −− and ++ represent high negative and positive importance, respectively, for the predicted output class, 0 means that the skipgram is absent in the document, and − and + indicate low negative and positive importance scores, respectively. This skipgram set, along with the output predictions of a model, is then input to a rule induction module to obtain decision lists as explanations.
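The skipgram scoring and discretization steps can be sketched as below. This is a minimal illustration of the idea, not the paper's implementation; the discretization thresholds are invented for the example, and 0 is reserved for skipgrams absent from a document, as described above.

```python
from itertools import combinations

def skipgram_importance(tokens, saliency, max_len=4, max_skip=2):
    """Score every skipgram of up to `max_len` tokens, allowing at most
    `max_skip` skipped tokens inside the window, as the mean saliency
    of its member tokens."""
    scores = {}
    for length in range(1, max_len + 1):
        for idxs in combinations(range(len(tokens)), length):
            skipped = idxs[-1] - idxs[0] + 1 - length  # gaps inside the window
            if skipped > max_skip:
                continue
            sg = " ".join(tokens[i] for i in idxs)
            scores[sg] = sum(saliency[i] for i in idxs) / length
    return scores

def discretize(score, low=0.1, high=0.5):
    """Map the score of a skipgram that is *present* in a document onto
    {--, -, +, ++}; absent skipgrams get level '0' in the
    bag-of-skipgram-importance vector. Thresholds here are illustrative."""
    if score >= 0:
        return "++" if score >= high else "+"
    return "--" if score <= -high else "-"
```

For instance, for the tokens "no signs of infection" the skipgram "no infection" (two skipped tokens) receives the mean of the saliencies of "no" and "infection".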

Learning transferable explanations
In the prediction phase, a model merely applies the knowledge it has learned from the training data.
Hence, an explanation technique should not require prior knowledge of the test set to find global explanations of a model. We hypothesize that explanation rules should be consistently accurate between the training data and the predictions on unseen data. In accordance with this hypothesis, instead of learning explanations directly from validation or test instances, which is common in interpretability research (Ribeiro et al., 2018; Sushil et al., 2018), we modify the explanation procedure to learn accurate, transferable explanations only from the training set. We first feed the training data to our neural network and record the corresponding output predictions. These output predictions, combined with the corresponding set of top discretized skipgrams, are used to fit the rule inducer. The hyperparameters of the rule inducer are optimized to best explain the validation set outputs. Finally, we report a score that quantifies how well the learned rules transfer to the test predictions. This training scheme ensures that the explanations generalize to unseen data instead of overfitting the test set.
We obtain decision lists using PART (Frank and Witten, 1998), which finds simplified paths of partial C4.5 decision trees. These decision lists can be comprehended independently of their order, and support both binary and multi-class cases.
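PART itself is a Weka rule learner; the sketch below only illustrates how an induced decision list is applied to a document's discretized skipgram vector: rules are tried in order, the first rule whose conditions all hold fires, and a default class covers the rest. The example rules are hypothetical, written in the spirit of the rules discussed later.

```python
def apply_decision_list(instance, rules, default):
    """Apply an ordered rule list to one instance.

    instance: dict mapping skipgram -> discretized importance level
    rules: list of (conditions, label) pairs, where conditions is a dict
           skipgram -> required level; missing skipgrams count as '0' (absent)
    """
    for conditions, label in rules:
        if all(instance.get(sg, "0") == lvl for sg, lvl in conditions.items()):
            return label  # first matching rule decides the class
    return default

# hypothetical decision list (not taken from the paper's figures)
rules = [
    ({"infection": "++", "hypothermia": "++"}, "septic"),
    ({"infection": "0"}, "non-septic"),
]
```

Because each rule is only reached when all earlier rules have failed, a single rule together with the list prefix forms an independent decision path, which is the property exploited in the explanations.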

Synthetic dataset
A big challenge for interpretability research is the evaluation of the results (Lertvittayakumjorn and Toni, 2019). Human evaluation is not ideal because a model can learn correct classification patterns that are counter-intuitive for humans (Poerner et al., 2018). In complex domains like healthcare, such an evaluation is additionally infeasible. To overcome existing limitations with automated evaluation of explanations, we create a synthetic binary clinical document classification dataset. We base the dataset construction on sepsis screening guidelines. Sepsis identification is a critical task for preventing deaths in ICUs (Futoma et al., 2017), and new insights about the problem are important in the medical domain. The synthetic dataset includes a subset of sentences from the freely available clinical corpus MIMIC-III (Johnson et al., 2016). The dataset construction process is as follows: • From the MIMIC-III corpus, we sample sentences of 3-15 words that mention the keywords discussed in the screening guidelines, grouped into an infection set (I), an inflammatory response set (Infl), and a set of remaining sentences (Others). • We populate 50k documents with 17 sentences each by randomly sampling one sentence from set I, one sentence for each term in set Infl, and 10 sentences from set Others. We additionally populate 20k documents with 17 sentences, all from set Others.
• We then run the CLAMP clinical NLP pipeline (Soysal et al., 2017) to identify if these keywords are negated in the documents.
• Next, we assign class labels to the documents using the following rule: if the infection term sampled from set I is not negated and at least 2 responses sampled from set Infl are not negated, the class label is septic; otherwise, the class label is non-septic.
49% of the documents are thus labeled as septic.
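The labeling rule above can be written as a small function. This is an illustrative sketch of the gold rule, not the authors' dataset-generation code; the function and argument names are ours.

```python
def label_document(infection_negated, inflammatory_negated):
    """Gold labeling rule of the synthetic dataset: a document is septic iff
    the sampled infection term is not negated AND at least two of the sampled
    inflammatory response terms are not negated.

    infection_negated: bool, for the single sentence drawn from set I
    inflammatory_negated: list of bools, one per term drawn from set Infl
    """
    active_responses = sum(1 for neg in inflammatory_negated if not neg)
    if not infection_negated and active_responses >= 2:
        return "septic"
    return "non-septic"
```

In the real pipeline, the negation flags come from the CLAMP negation detector, so its errors propagate into the labels, which is part of the intended noise.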
Sampling sentences from the MIMIC-III corpus introduces language diversity through a large vocabulary and varied sentence structures. The use of an imperfect tool to identify negation for document labeling also adds noise to the dataset. These properties are desirable because they allow for controlled explanation evaluation while also simulating real-world corpora and tasks, unlike several synthetic datasets used for explanation evaluation (Arras et al., 2019).
For example, for the document No signs of infection were found. Altered mental status exists. Patient is suffering from hypothermia., the set of gold terms would include all the underlined words. Among these words, infection, altered, mental, status, and hypothermia are keyword terms, and no, signs, and of are terms corresponding to the negation scope.

Model:
We split the dataset into subsets of 80-10-10% as training, validation, and test sets. We obtain a vocabulary of 47,015 tokens after lower-casing the documents without removing punctuation. We replace unknown words in the validation and test sets with the unk token. We train LSTM classifiers to predict the document class from the hidden representation after the final timestep, which is obtained after processing the entire document as a sequence of tokens. The classifiers use randomly initialized word embeddings and a single RNN layer without attention. The hidden state size and embedding dimension are set to either 50 or 100. We use the Adam optimizer (Kingma and Ba, 2014) with learning rate 0.001 and a batch size of 64 (without hyperparameter optimization). Classification performance is shown in Table 1.
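The split-and-unk preprocessing described above can be sketched as follows, assuming documents are already tokenized. This is a minimal sketch, not the released code; the function name is ours, and the vocabulary is built from the training portion only so that out-of-vocabulary tokens in validation and test are mapped to unk.

```python
def split_and_index(documents, unk="unk"):
    """80-10-10 split; lower-case everything; build the vocabulary from the
    training split and replace unseen validation/test tokens with `unk`."""
    n = len(documents)
    train = documents[: int(0.8 * n)]
    valid = documents[int(0.8 * n): int(0.9 * n)]
    test = documents[int(0.9 * n):]

    vocab = {tok.lower() for doc in train for tok in doc}

    def map_doc(doc):
        return [t.lower() if t.lower() in vocab else unk for t in doc]

    return ([list(map(str.lower, d)) for d in train],
            [map_doc(d) for d in valid],
            [map_doc(d) for d in test])
```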

Real clinical datasets
We additionally find explanation rules for sepsis classifiers on the MIMIC-III clinical corpus. We define the sepsis label as all the cases where patients are assigned one of the following diagnostic codes: We analyze two different setups after removing blank notes and the notes marked as error in the MIMIC-III corpus: 1. We use the last discharge note for every patient to classify whether the patient has sepsis. The class distribution among 58,028 instances is 90-10% for non-septic and septic cases, respectively, and the vocabulary size is 229,799. The task is easier in this setup because 70% of septic cases mention sepsis directly, whereas only 13% of non-septic cases mention sepsis.
2. We classify whether a patient has a sepsis diagnosis or not using the last note about a patient, excluding the categories discharge notes, social work, rehab services and nutrition. We obtain 52,691 patients in this manner, of which only 9% are septic. The vocabulary size is 87,753. In this setup, only 17% of septic cases mention sepsis, as opposed to 6% of non-septic cases.

Models:
We train 2-layer bidirectional LSTM classifiers with 100-dimensional randomly initialized word embeddings and a 100-dimensional hidden layer. We train for 50 epochs with early stopping with patience 5. The remaining data processing and implementation details are the same as discussed for the synthetic dataset. The macro F1 score of classification is 0.68 when using discharge notes (septic class F1 is 0.41), and 0.60 without discharge notes (septic class F1 is 0.27). The majority baseline is 0.5.

Baseline explanation rules
Several existing approaches for global rule-based interpretability (Lakkaraju et al., 2017; Puri et al., 2017) have one common aspect: they directly use the original input to find explanation rules for complex classifiers, without making use of the parameters of the complex models. However, these approaches do not scale to NLP tasks due to the combinatorial complexity of finding explanation rules. For comparison, as baseline rules, we induce explanations directly from the input data without using gradients of neural models. To this end, we create a bag-of-skipgrams representation by binarizing the most frequent skipgrams to represent whether they are present in a document. We then train rule induction classifiers on this binarized skipgram data to explain the neural outputs.
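The baseline representation can be sketched as follows: a hedged illustration in which the function name is ours, documents are given as lists of skipgram strings, and ties in skipgram frequency are broken arbitrarily.

```python
from collections import Counter

def binary_skipgram_matrix(doc_skipgrams, top_k=5000):
    """Baseline input: keep the `top_k` most frequent skipgrams in the corpus
    (counted once per document) and encode each document as 0/1 presence
    indicators, with no gradient information from the explained model."""
    counts = Counter(sg for doc in doc_skipgrams for sg in set(doc))
    vocab = [sg for sg, _ in counts.most_common(top_k)]
    matrix = [[1 if sg in set(doc) else 0 for sg in vocab]
              for doc in doc_skipgrams]
    return vocab, matrix
```

The rule inducer is then fit on this binary matrix against the neural predictions, in contrast to UNRAVEL, which feeds it the discretized gradient-based importance levels instead of plain presence.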
We also compare to Anchors (Ribeiro et al., 2018) for the SST2 explanations by implementing their submodular pick algorithm for obtaining global explanations. Anchors does not scale to the longer documents used for sepsis classification.

Evaluation metrics
We record fidelity scores of the explanation rules on the test set, as well as the complexity of these explanations. Fidelity scores refer to how faithful the explanations are to the test output predictions of the explained neural network. Following our prior work (Sushil et al., 2018), we quantify fidelity as the macro F1-measure of the explanation outputs compared to the original predictions. We define explanation complexity as the number of rules in an explanation.
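The fidelity metric amounts to treating the neural model's predictions as the reference labels and scoring the rule outputs against them with macro-averaged F1. A plain-Python sketch (the function name is ours):

```python
def fidelity_macro_f1(model_preds, rule_preds):
    """Macro-averaged F1 of the rule-based predictions, taking the neural
    model's predictions as the reference labels."""
    labels = set(model_preds) | set(rule_preds)
    f1s = []
    for c in labels:
        tp = sum(1 for m, r in zip(model_preds, rule_preds) if m == c and r == c)
        fp = sum(1 for m, r in zip(model_preds, rule_preds) if m != c and r == c)
        fn = sum(1 for m, r in zip(model_preds, rule_preds) if m == c and r != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

Macro averaging weights every output class equally, so a rule list that only mimics the majority class is penalized even under skewed class distributions like the sepsis datasets.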

Comparing pooling techniques
To compare the different pooling techniques described in Section 2.1, we evaluate the sets of most important words obtained with each technique against the gold sets of important terms for the documents.

Qualitative analysis
In Figure 2, we compare the word importance distributions of the pooling techniques for an instance in the validation set of the synthetic corpus. The L2 norm produces a distribution over positive values only, and the importance scores are low because it squares the gradients. Sum pooling and the dot product instead return a distribution over both positive and negative values, with the dot product returning a more peaked distribution. However, as we can see, sum and dot product often assign opposite importance signs to the same words. This is caused by the presence of the word embeddings in the dot product computation, since embedding values can be both positive and negative. In this instance, both the true and predicted classes are non-septic. Looking at Figure 2c, we find positive peaks over negative and infection, and negative peaks over altered mental status and hyperglycemia. This corresponds to the class labeling rule in the synthetic data, where the non-septic class is assigned when infection terms are negated. These directions of influence are counter-intuitive for sum pooling in Figure 2b. Due to its intuitive, peaked importance distributions, the dot product seems better than the other techniques. However, we move to quantitative evaluation for a global perspective, because this qualitative analysis is biased towards a specific instance and model.

Quantitative analysis
We find the top k tokens for test documents in the synthetic dataset by ranking absolute word importance scores, where k is the number of gold important terms used to label the document. We ignore the 20k documents that only consist of sentences without any keyword term, and hence have an empty gold set. We compute the accuracy of the set of most important words for every document compared to its corresponding gold set, and report the mean across all the documents in Table 1. We find that the dot product consistently recovers more important tokens than the other pooling techniques across all the classifiers, confirming the earlier qualitative analysis and the findings of Arras et al. (2019). Hence, we use the dot product for computing word importance before inducing explanation rules.
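The per-document evaluation just described can be sketched as below; this is our illustrative reading of the metric, with the function name and example values invented for the sketch.

```python
def topk_recovery_accuracy(saliency, tokens, gold_terms):
    """Fraction of the k highest-|saliency| tokens that appear in the
    document's gold set, with k set to the size of the gold set."""
    k = len(gold_terms)
    ranked = sorted(range(len(tokens)),
                    key=lambda i: abs(saliency[i]), reverse=True)
    top = [tokens[i] for i in ranked[:k]]
    return sum(1 for t in top if t in gold_terms) / k
```

The corpus-level number in Table 1 is then simply the mean of this accuracy over all documents with a non-empty gold set.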
We additionally see that the mean accuracy is nearly twice as high for the classifier with 50 hidden nodes and 100-dimensional word embeddings as for the larger classifier that uses 100 hidden units instead, although the latter classifier is nearly 5% more accurate. This suggests that the larger network obtains its higher performance by focusing on tokens that are not included among the gold keywords. The reason different tokens are considered important could be that our gold set of important terms is noisy: • Some tokens, such as punctuation symbols, are missing from the gold set, although they are important for identifying the scope of negation, as seen in Figure 3.
• Some terms in the gold set are not required for correct classification. For example: 1. Too many words are included as negation triggers: in the sentence no signs of infection were found., 'no', 'signs', and 'of' are all added to the gold set as negation markers, although the subset {'no', 'infection'} may be sufficient. 2. Similarly, the keyword altered mental status could already be recognized from a subset of its terms.

Explaining synthetic data classifiers
We obtain explanations of all the LSTM classifiers for the synthetic dataset. We record the fidelity scores of explanations and the corresponding complexity in Table 2. We find that when we use the proposed pipeline UNRAVEL for learning gradient-informed rules, we obtain explanations with high fidelity scores also on the test data. With the baseline approach, on the other hand, we obtain nearly 15% lower fidelity scores, and the explanations are more complex. This confirms that making use of model parameters by means of gradients acts as an additional useful cue for the rule-based explainability module, resulting in more faithful explanations. We present some example explanation rules for the most accurate LSTM classifier for the synthetic dataset in Figure 3. Here, we indicate the infection keywords that were used to populate the dataset with a single underline, and the inflammatory response keywords with a double underline. The first rule in the figure indicates that if two inflammatory response criteria are highly important for the network, the term infection is highly important, and phrases negating the presence of infection are absent, then the class is recognized as septic. This is similar to the rule we have used to label the synthetic dataset, which requires at least one infection term and at least two inflammatory response criteria to not be negated in the document for the septic class to be assigned. In the next rule, applied after all the cases from the previous rule have been excluded from the dataset, if several keyword terms are absent, the document is classified as non-septic. It is useful to remember that urinary tract is usually followed by the word infection in the dataset, and that several instances mentioning infection have already been explained by the previous rule and hence are ignored by this rule. This explanation rule is also in accordance with the synthetic dataset, where 20k documents do not contain any keyword term and are labeled as non-septic. The third rule is an example rule for the same model when explanations are based on unigrams only, as opposed to skipgrams. In this case, we lose the context of the negation marker no. When using skipgrams, this context of negation is available, which makes the negation scope clearer. Further, terms like evidence, fungal and urinary tract captured by skipgrams provide additional context for understanding the rules. This illustrates that even though the fidelity scores of the explanations are similar, skipgram-based explanations are more interpretable than unigram-based explanations. Hence, we use skipgrams for further analysis.

Table 2: Test set fidelity scores of explanations (in % macro-F1), and the number of explanation rules as the measure of explanation complexity, for different LSTM classifiers on the synthetic dataset using our approach compared to the baseline approach. LSTMx,Ey refers to an LSTM with x hidden nodes and y-dimensional word embeddings; sg in parentheses refers to skipgram-based explanations.

Figure 3: Example explanation rules for the best LSTM classifier on the synthetic dataset. Infection keywords from set I are marked with a single underline, and the corresponding inflammatory response keywords from set Infl are marked with a double underline. ++ refers to high positive importance of a term, 0 represents absence of a term, and − means that the term gets a low negative importance, i.e., presence of the term reduces the output probability. The numbers (a/b) mean that b training instances are explained by the rule, of which a are correct. The first two rules are obtained with skipgrams, and the third one is obtained using only unigrams for explanations.

Table 3: Explanation fidelity (% macro F1) and complexity for sepsis classification: 1) with discharge notes, 2) without discharge notes, and on the SST2 dataset. The baseline method did not converge (in several weeks) for sepsis classification without discharge notes and for SST2 classification. Anchors did not scale (in memory usage) to the document-level sepsis datasets.

Explaining clinical models
We rerun our explainability pipeline on both clinical models for sepsis classification, with and without discharge notes (Section 3.2). For the first classifier, which uses discharge notes, we again obtain very high fidelity scores of explanations (Table 3). The baseline explanations have significantly lower fidelity scores while also being extremely complex. On inspecting the corresponding explanation rules, given in Figure 4, we find that they refer to direct mentions of sepsis in the discharge notes. In the first rule, if sepsis major surgical is mentioned, the class is directly septic. The second rule first rules out the mention of a complaint of sepsis and then checks for additional conditions. This confirms that not only does the classifier pick up on these direct mentions, but the explanations also recover this information. This illustrates the utility of UNRAVEL in understanding our models, which is the first step towards improving them. For example, if our model is learning direct mentions of sepsis as a discriminating feature, we could remove these direct mentions from the dataset before training new models to ensure that they generalize.
Next, for the more difficult case where we use only the final non-discharge note about patients to classify whether they have sepsis, the fidelity score is 77.33%. Although this score is good as an absolute number, it is much lower than in the other two cases. Explanations for this model are also much more complex. This highlights that more complex classifiers and explanations come with lower explanation fidelity. On manually inspecting these explanations, we find that the absence of terms such as diagnosis : sepsis, indication endocarditis . valve, indication bacteremia, admitting diagnosis fever and pyelonephritis is used to rule out sepsis. These are similar to the explanations of the other two datasets, albeit enriched with information about additional infections and body conditions. This confirms that the synthetic dataset closely models a real clinical use case, and suggests that these explanation rules could result in useful hypothesis generation.

Explaining sentiment classifier
Results of the SST2 explanations are given in Table 3. On inspecting the explanation rules for our method and Anchors, presented in Figures 5 and 6 respectively, we find that the Anchors rules consist only of single words, as opposed to UNRAVEL, which finds conjunctions of different phrases. Furthermore, the explanation rules obtained with UNRAVEL achieve 71% classification accuracy on the original task. This performance drop compared to the LSTM is ∼7% lower than the gradient decomposition-based performance drop reported by Murdoch and Szlam (2017), although the numbers are not strictly comparable because we explain different classifiers (their implementation is not openly available for direct comparison).

Conclusions and Future Work
We have developed a pipeline, UNRAVEL, to obtain transferable, accurate, gradient-informed explanation rules from RNNs. We have constructed a synthetic dataset to qualitatively and quantitatively evaluate the results, and we obtain informative explanations with high fidelity scores. We obtain similar results on real clinical datasets and on sentiment analysis. Our approach is transferable to all similar neural models. In the future, it would be interesting to extend this approach to obtain more accurate, less complex, and scalable explanations for classifiers with more complex patterns.