Generating Hierarchical Explanations on Text Classification via Feature Interaction Detection

Generating explanations for neural networks has become crucial for their real-world applications with respect to reliability and trustworthiness. In natural language processing, existing methods usually provide important features, which are words or phrases selected from an input text, as an explanation, but ignore the interactions between them. This makes it hard for humans to interpret an explanation and connect it to the model prediction. In this work, we build hierarchical explanations by detecting feature interactions. Such explanations visualize how words and phrases are combined at different levels of the hierarchy, which can help users understand the decision-making of black-box models. The proposed method is evaluated with three neural text classifiers (LSTM, CNN, and BERT) on two benchmark datasets, via both automatic and human evaluations. Experiments show the effectiveness of the proposed method in providing explanations that are both faithful to models and interpretable to humans.


Introduction
Deep neural networks have achieved remarkable performance in natural language processing (NLP) (Devlin et al., 2018;Howard and Ruder, 2018;Peters et al., 2018), but the lack of understanding of their decision-making leads them to be characterized as black-box models and increases the risk of applying them in real-world applications (Lipton, 2016;Burns et al., 2018;Jumelet and Hupkes, 2018;Jacovi et al., 2018).
Understanding model prediction behaviors has been a critical factor in whether people will trust and use these black-box models (Ribeiro et al., 2016). A typical approach to understanding decision-making is to generate prediction explanations for each input example, called local explanation generation. In NLP, most existing work on local explanation generation focuses on producing word-level or phrase-level explanations by quantifying the contributions of individual words or phrases to a model prediction (Ribeiro et al., 2016;Lundberg and Lee, 2017;Lei et al., 2016;Plumb et al., 2018). Figure 1 (a) and (b) show explanations generated by LIME (Ribeiro et al., 2016) and CD (Murdoch et al., 2018), respectively, for explaining sentiment classification. Both explanations provide scores to quantify how a word or a phrase contributes to the final prediction. For example, the explanation generated by LIME captures a keyword waste and the explanation from CD identifies an important phrase waste of.
However, neither of them is able to explain the model decision-making in terms of how words and phrases interact with each other and are composed together for the final prediction. In this example, since the final prediction is NEGATIVE, one question we could ask is how the word good, or a phrase related to the word good, contributes to the model prediction. An explanation that can answer this question gives users a better understanding of the model decision-making and more confidence to trust the prediction.
The goal of this work is to reveal prediction behaviors of a text classifier by detecting feature (e.g., words or phrases) interactions with respect to model predictions. For a given text, we propose a model-agnostic approach, called HEDGE (for Hierarchical Explanation via Divisive Generation), to build hierarchical explanations by recursively detecting the weakest interactions and then dividing large text spans into smaller ones based on the interactions. As shown in Figure 1 (c), the hierarchical structure produced by HEDGE provides a comprehensive picture of how features of different granularity interact with each other within the model. For example, it shows how the word good is dominated by others in the model prediction, which eventually leads to the correct prediction. Furthermore, the scores of text spans across the whole hierarchy also help identify the most important feature waste of good, which can serve as a phrase-level explanation for the model prediction.
The contribution of this work is three-fold: (1) we design a top-down model-agnostic method of constructing hierarchical explanations via feature interaction detection; (2) we propose a simple and effective scoring function to quantify feature contributions with respect to model predictions; and (3) we compare the proposed algorithm with several competitive methods on explanation generation via both automatic and human evaluations. The experiments were conducted on sentiment classification tasks with three neural network models, LSTM (Hochreiter and Schmidhuber, 1997), CNN (Kim, 2014), and BERT (Devlin et al., 2018), on the SST (Socher et al., 2013) and IMDB (Maas et al., 2011) datasets. The comparison with other competitive methods illustrates that HEDGE provides more faithful and human-understandable explanations.

Related Work
Over the past years, many approaches have been explored to interpret neural networks, such as contextual decomposition (CD) for LSTM (Murdoch et al., 2018) or CNN models (Godin et al., 2018), gradient-based interpretation methods (Hechtlinger, 2016;Sundararajan et al., 2017), and attention-based methods (Ghaeini et al., 2018;Serrano and Smith, 2019). However, these methods have limited capacity in real-world applications, as they require a deep understanding of neural network architectures (Murdoch et al., 2018) or only work with specific models (Alvarez-Melis and Jaakkola, 2018). On the other hand, model-agnostic methods (Ribeiro et al., 2016;Lundberg and Lee, 2017) generate explanations solely based on model predictions and are applicable to any black-box model. In this work, we mainly focus on model-agnostic explanations.

Model-Agnostic Explanations
The core of generating model-agnostic explanations is how to efficiently evaluate the importance of features with respect to the prediction. So far, most existing work on model-agnostic explanations focuses on the word level. For example, Li et al. (2016) proposed Leave-one-out to probe the black-box model by observing the probability change on the predicted class when erasing a certain word. LIME, proposed by Ribeiro et al. (2016), estimates individual word contributions locally by linear approximation from perturbed examples. A line of work relevant to ours is Shapley-based methods, where variants of Shapley values (Shapley, 1953) are used to evaluate feature importance, such as SampleShapley (Kononenko et al., 2010), KernelSHAP (Lundberg and Lee, 2017), and L/C-Shapley (Chen et al., 2018). They still fall into the category of generating word-level explanations, and mainly focus on addressing the computational complexity of Shapley values (Datta et al., 2016). In this work, inspired by an extension of Shapley values (Owen, 1972;Grabisch, 1997;Fujimoto et al., 2006), we design a function to detect feature interactions for building hierarchical model-agnostic explanations in subsection 3.1. Unlike prior work that uses Shapley values for feature importance evaluation, we propose a simpler and effective way to evaluate feature importance, described in subsection 3.3, which outperforms Shapley-based methods in selecting important words as explanations in subsection 4.2.

Hierarchical Explanations
Addressing the limitation of word-level explanations (as discussed in section 1) has motivated work on generating phrase-level or hierarchical explanations. For example, Tsang et al. (2018) generated hierarchical explanations by considering the interactions between any features with exhaustive search, which is computationally expensive. Singh et al. (2019) proposed agglomerative contextual decomposition (ACD), which utilizes CD scores (Murdoch et al., 2018;Godin et al., 2018) for feature importance evaluation and employs a hierarchical clustering algorithm to aggregate features together for hierarchical explanation. Furthermore, Jin et al. (2019) indicated the limitations of CD and ACD in calculating phrase interactions in a formal context, and proposed two explanation algorithms by quantifying context-independent importance of words and phrases.
A major component of the proposed method for feature interaction detection is based on the Shapley interaction index (Owen, 1972;Grabisch, 1997;Fujimoto et al., 2006), which is extended in this work to capture the interactions in a hierarchical structure. Lundberg et al. (2018) calculated feature interactions via SHAP interaction values along a given tree structure. Chen and Jordan (2019) suggested utilizing a linguistic tree structure to capture the contributions beyond individual features for text classification. The difference from our work is that both methods (Lundberg et al., 2018;Chen and Jordan, 2019) require the hierarchical structure to be given, while our method constructs structures solely based on feature interaction detection without resorting to external structural information. In addition, unlike Singh et al. (2019), our algorithm works top-down, dividing long texts into short phrases and words based on the weakest interactions, which is shown to be more effective and efficient in the experiments in section 4.

Method
This section explains the proposed algorithm on building hierarchical explanations (subsection 3.1) and two critical components of this algorithm: detecting feature interaction (subsection 3.2) and quantifying feature importance (subsection 3.3).

Generating Hierarchical Explanations
For a classification task, let $x = (x_1, \ldots, x_n)$ denote a text with $n$ words and $\hat{y}$ be the prediction label from a well-trained model. Furthermore, we define $\mathcal{P} = \{x_{(0,s_1]}, x_{(s_1,s_2]}, \ldots, x_{(s_{P-1},n]}\}$ to be a partition of the word sequence with $P$ text spans, where $x_{(s_i,s_{i+1}]} = (x_{s_i+1}, \ldots, x_{s_{i+1}})$. For a given text span $x_{(s_i,s_{i+1}]}$, the basic procedure of HEDGE is to divide it into two smaller text spans $x_{(s_i,j]}$ and $x_{(j,s_{i+1}]}$, where $j$ is the dividing point ($s_i < j < s_{i+1}$), and then evaluate their contributions to the model prediction $\hat{y}$.
Algorithm 1 describes the whole procedure of dividing $x$ into different levels of text spans and evaluating the contribution of each of them. Starting from the whole text $x$, the algorithm first divides $x$ into two segments. In the next iteration, it picks one of the two segments and further splits it into even smaller spans. As shown in Algorithm 1, to perform the top-down procedure, we need to answer two questions: at the next timestep, which text span should the algorithm pick to split, and where is the dividing point?
Both questions can be addressed via the following optimization problem:

$$\min_{x_{(s_i,s_{i+1}]} \in \mathcal{P}} \; \min_{j \in (s_i, s_{i+1})} \phi(x_{(s_i,j]}, x_{(j,s_{i+1}]} \mid \mathcal{P}), \tag{1}$$

where $\phi(x_{(s_i,j]}, x_{(j,s_{i+1}]} \mid \mathcal{P})$ defines the interaction score between $x_{(s_i,j]}$ and $x_{(j,s_{i+1}]}$ given the current partition $\mathcal{P}$. The details of this score function are explained in subsection 3.2.
For a given $x_{(s_i,s_{i+1}]} \in \mathcal{P}$, the inner optimization problem finds the weakest interaction point at which to split the text span $x_{(s_i,s_{i+1}]}$ into two smaller ones. It answers the question of where the dividing point should be for a given text span. A trivial case of the inner optimization problem is a text span of length 2, since there is only one possible way to divide it. The outer optimization answers the question of which text span should be picked. This optimization problem can be solved by simply enumerating all the elements in a partition $\mathcal{P}$. A special case of the outer optimization problem is the first iteration $t = 1$, where $\mathcal{P}_0 = \{x_{(0,n]}\}$ has only one element, the whole input text. Once the partition is updated, it is added to the hierarchy $\mathcal{H}$. The last step in each iteration is to evaluate the contributions of the new spans and update the contribution set $\mathcal{C}$, as in line 9 of Algorithm 1. For each new span, the algorithm evaluates its contribution to the model prediction with the feature importance function $\psi(\cdot)$ defined in Equation 5. The final output of Algorithm 1 includes the contribution set $\mathcal{C}_{n-1}$, which contains all the text spans produced at each timestep together with their importance scores, and the hierarchy $\mathcal{H}$, which contains all the partitions of $x$ along timesteps. A hierarchical explanation can be built from $\mathcal{C}_{n-1}$ and $\mathcal{H}$ by visualizing the partitions with all text spans and their importance scores along timesteps, as Figure 1 (c) shows.
Note that with the feature interaction function $\phi(\cdot, \cdot)$, we could also design a bottom-up approach that merges two short text spans if they have the strongest interaction. Empirically, we found that this bottom-up approach performs worse than Algorithm 1, as shown in Appendix A.
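The top-down procedure described above can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation: the function name `hedge_divide` and the two callables `interaction` and `importance` are hypothetical stand-ins for the interaction score and the importance function, and spans are represented as `(start, end)` pairs meaning words `start+1` to `end`.

```python
from typing import Callable, Dict, List, Tuple

Span = Tuple[int, int]  # (s, e) stands for the span x_(s, e], i.e. words s+1..e


def hedge_divide(
    n: int,
    interaction: Callable[[Span, Span, List[Span]], float],  # hypothetical phi
    importance: Callable[[Span], float],                      # hypothetical psi
) -> Tuple[List[List[Span]], Dict[Span, float]]:
    """Divide a text of n words top-down, always cutting at the weakest interaction."""
    partition: List[Span] = [(0, n)]              # initial partition: the whole text
    hierarchy = [list(partition)]                 # all partitions along timesteps
    contributions = {(0, n): importance((0, n))}  # span -> importance score

    for _ in range(n - 1):                        # n-1 cuts end with single words
        best = None
        for idx, (s, e) in enumerate(partition):  # outer search: which span to split
            for j in range(s + 1, e):             # inner search: where to split it
                left, right = (s, j), (j, e)
                score = interaction(left, right, partition)
                if best is None or score < best[0]:
                    best = (score, idx, left, right)
        _, idx, left, right = best
        partition = partition[:idx] + [left, right] + partition[idx + 1:]
        hierarchy.append(list(partition))
        contributions[left] = importance(left)
        contributions[right] = importance(right)
    return hierarchy, contributions
```

For instance, with a toy interaction that only depends on the cut position (weakest at position 2), a four-word text is first split into two halves, and the remaining cuts follow in order of increasing interaction strength.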

Detecting Feature Interaction
For a given text span $x_{(s_i,s_{i+1}]} \in \mathcal{P}$ and the dividing point $j$, the new partition will be

$$\mathcal{N} = \mathcal{P} \setminus \{x_{(s_i,s_{i+1}]}\} \cup \{x_{(s_i,j]}, x_{(j,s_{i+1}]}\}.$$

We consider the effects of the other text spans in $\mathcal{N}$ when calculating the interaction between $x_{(s_i,j]}$ and $x_{(j,s_{i+1}]}$, since the interaction between two words or phrases depends closely on the context (Hu et al., 2016;Chen et al., 2016). We adopt the Shapley interaction index from coalition game theory (Owen, 1972;Grabisch, 1997;Fujimoto et al., 2006) to calculate the interaction. For simplicity, we denote $x_{(s_i,j]}$ and $x_{(j,s_{i+1}]}$ as $j_1$ and $j_2$ respectively. The interaction score is defined as (Lundberg et al., 2018)

$$\phi(j_1, j_2 \mid \mathcal{P}) = \sum_{S \subseteq \mathcal{N} \setminus \{j_1, j_2\}} \frac{|S|!\,(P - |S| - 1)!}{P!}\, \gamma(j_1, j_2, S), \tag{2}$$

where $S$ represents a subset of text spans in $\mathcal{N} \setminus \{j_1, j_2\}$, $|S|$ is the size of $S$, and $\gamma(j_1, j_2, S)$ is defined as follows,

$$\gamma(j_1, j_2, S) = E[f(x') \mid S \cup \{j_1, j_2\}] - E[f(x') \mid S \cup \{j_1\}] - E[f(x') \mid S \cup \{j_2\}] + E[f(x') \mid S], \tag{3}$$

where $x'$ is the same as $x$ except for some missing words that are not covered by the given subset (e.g. $S$), $f(\cdot)$ denotes the model output probability on the predicted label $\hat{y}$, and $E[f(x') \mid S]$ is the expectation of $f(x')$ over all possible $x'$ given $S$. In practice, the missing words are usually replaced with a special token <pad>, and $f(x')$ is calculated to estimate $E[f(x') \mid S]$ (Chen et al., 2018;Datta et al., 2016;Lundberg and Lee, 2017). We also adopt this method in our experiments. Another way to estimate the expectation is to replace the missing words with substitute words randomly drawn from the full dataset and calculate the empirical mean over all the sampled data (Kononenko et al., 2010;Strumbelj and Kononenko, 2014), which has a relatively high computational complexity.

As the number of text spans (features) increases, the exponential number of model evaluations in Equation 2 becomes intractable. We calculate an approximation of the interaction score based on the assumption (Chen et al., 2018;Singh et al., 2019;Jin et al., 2019) that a word or phrase usually has strong interactions with its neighbours in a sentence. The computational complexity can be reduced to polynomial by only considering $m$ neighbour text spans of $j_1$ and $j_2$ in $\mathcal{N}$. The interaction score is rewritten as

$$\phi(j_1, j_2 \mid \mathcal{P}) = \sum_{S \subseteq \mathcal{N}_m \setminus \{j_1, j_2\}} \frac{|S|!\,(M - |S| - 2)!}{(M - 1)!}\, \gamma(j_1, j_2, S), \tag{4}$$

where $\mathcal{N}_m$ is the set containing $j_1$, $j_2$ and their neighbours, and $M = |\mathcal{N}_m|$. In section 4, we set $m = 2$, which performs well. The performance can be further improved by increasing $m$, but at the cost of increased computational complexity.
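To make the neighbour-restricted interaction score concrete, the sketch below enumerates every subset $S$ of the neighbouring spans and accumulates the weighted second-order difference $\gamma(j_1, j_2, S)$ with the standard Shapley interaction index weights $|S|!\,(M-|S|-2)!/(M-1)!$, where $M$ counts $j_1$, $j_2$ and their neighbours. The name `interaction_score` and the callable `value` are assumptions made for illustration: `value(S)` is taken to return the model probability on the predicted label when only the spans in `S` are kept and all other words are replaced with a pad token.

```python
from itertools import combinations
from math import factorial
from typing import Callable, FrozenSet, Hashable, Sequence


def interaction_score(
    j1: Hashable,
    j2: Hashable,
    neighbours: Sequence[Hashable],
    value: Callable[[FrozenSet], float],  # hypothetical: f(x') with only S kept
) -> float:
    """Shapley interaction index of j1 and j2, restricted to the given neighbours."""
    M = len(neighbours) + 2                      # j1, j2 and their neighbours
    total = 0.0
    for size in range(len(neighbours) + 1):
        weight = factorial(size) * factorial(M - size - 2) / factorial(M - 1)
        for subset in combinations(neighbours, size):
            S = frozenset(subset)
            gamma = (
                value(S | {j1, j2})
                - value(S | {j1})
                - value(S | {j2})
                + value(S)
            )
            total += weight * gamma
    return total
```

Two sanity checks follow from the definition: a purely additive `value` gives a score of zero, and a `value` that is 1 only when both spans are present gives a score of 1, because the weights sum to 1 over all subsets. With `m` neighbours the loop makes on the order of `4 * 2**m` model calls per candidate split, which is why a small `m` keeps the search tractable.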

Quantifying Feature Importance
To measure the contribution of a feature $x_{(s_i,s_{i+1}]}$ to the model prediction, we define the importance score as

$$\psi(x_{(s_i,s_{i+1}]}) = f_{\hat{y}}(x_{(s_i,s_{i+1}]}) - \max_{y' \neq \hat{y},\, y' \in \mathcal{Y}} f_{y'}(x_{(s_i,s_{i+1}]}), \tag{5}$$

where $f_{\hat{y}}(x_{(s_i,s_{i+1}]})$ is the model output on the predicted label $\hat{y}$, and $\max_{y' \neq \hat{y},\, y' \in \mathcal{Y}} f_{y'}(x_{(s_i,s_{i+1}]})$ is the highest model output among all classes excluding $\hat{y}$. This importance score measures how far the prediction on a given feature is from the prediction boundary, and hence the confidence of classifying $x_{(s_i,s_{i+1}]}$ into the predicted label $\hat{y}$. In text classification in particular, it can be interpreted as the contribution to a specific class $\hat{y}$. The effectiveness of Equation 5 as a feature importance score is verified in subsection 4.2, where HEDGE outperforms several competitive baseline methods (e.g. LIME (Ribeiro et al., 2016), SampleShapley (Kononenko et al., 2010)) in identifying important features.
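As a small worked illustration of the importance score, the sketch below computes the margin between the model output on the predicted label and the highest output among the other classes. The helper `keep_only_span` is an assumption: it shows one plausible way to feed a single span to the model, by replacing every word outside the span with a pad token, mirroring the padding convention used for the interaction score. Both function names are hypothetical.

```python
from typing import List, Sequence


def keep_only_span(tokens: Sequence[str], start: int, end: int, pad: str = "<pad>") -> List[str]:
    """Keep words start+1..end (the span x_(start, end]) and pad everything else."""
    return [t if start <= i < end else pad for i, t in enumerate(tokens)]


def importance_score(outputs: Sequence[float], predicted: int) -> float:
    """Output on the predicted label minus the best output among the other labels."""
    others = [p for k, p in enumerate(outputs) if k != predicted]
    return outputs[predicted] - max(others)
```

A positive score means the span alone already pushes the model toward the predicted label, and a negative score means the span on its own favours another class.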

Experiments
The proposed method is evaluated on text classification tasks with three typical neural network models, a long short-term memory network (Hochreiter and Schmidhuber, 1997, LSTM), a convolutional neural network (Kim, 2014, CNN), and BERT (Devlin et al., 2018), on the SST (Socher et al., 2013) and IMDB (Maas et al., 2011) datasets.

Models. The CNN model (Kim, 2014) includes a single convolutional layer with filter sizes ranging from 3 to 5. The LSTM (Hochreiter and Schmidhuber, 1997) has a single layer with 300 hidden states. Both models are initialized with 300-dimensional pretrained word embeddings (Mikolov et al., 2013). We use the pretrained BERT model (https://github.com/huggingface/pytorch-transformers) with 12 transformer layers, 12 self-attention heads, and a hidden size of 768, which was then fine-tuned on the different downstream tasks to achieve the best performance.

Quantitative Evaluation
We adopt two metrics from prior work on evaluating word-level explanations: the area over the perturbation curve (AOPC) (Nguyen, 2018;Samek et al., 2016) and the log-odds scores (Shrikumar et al., 2017;Chen et al., 2018), and define a new evaluation metric called cohesion-score to evaluate the interactions between words within a given text span. The first two metrics measure local fidelity by deleting or masking top-scored words and comparing the probability change on the predicted label. They are used to evaluate Equation 5 in quantifying feature contributions to the model prediction.
The cohesion-score measures the synergy of words within a text span to the model prediction by shuffling the words to see the probability change on the predicted label.
AOPC. By deleting the top k% words, AOPC calculates the average change in the prediction probability on the predicted class over all test data as follows,

$$\mathrm{AOPC}(k) = \frac{1}{N} \sum_{i=1}^{N} \left\{ p(\hat{y} \mid x_i) - p(\hat{y} \mid \tilde{x}_i^{(k)}) \right\}, \tag{6}$$

where $\hat{y}$ is the predicted label, $N$ is the number of examples, $p(\hat{y} \mid \cdot)$ is the probability on the predicted class, and $\tilde{x}_i^{(k)}$ is constructed by dropping the k% top-scored words from $x_i$. Higher AOPCs are better, which means that the deleted words are important for model prediction. To compare with other word-level explanation generation methods under this metric, we select word-level features from the bottom level of a hierarchical explanation and sort them in the order of their estimated importance to the prediction.

Log-odds. The log-odds score is calculated by averaging the difference of negative logarithmic probabilities on the predicted class over all of the test data before and after masking the top r% features with zero paddings,

$$\text{Log-odds}(r) = \frac{1}{N} \sum_{i=1}^{N} \log \frac{p(\hat{y} \mid \tilde{x}_i^{(r)})}{p(\hat{y} \mid x_i)}. \tag{7}$$

The notations are the same as in Equation 6, with the only difference that $\tilde{x}_i^{(r)}$ is constructed by replacing the top r% word features with the special token <pad> in $x_i$. Under this metric, lower log-odds scores are better.
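The two fidelity metrics reduce to simple averages once the model probabilities before and after perturbation are available. The sketch below is an illustration under stated assumptions: the function names are hypothetical, the probabilities are assumed to be precomputed by the classifier under study, and the rounding rule used to turn a percentage into a word count (`round`, with at least one word removed) is a guess, since the text does not specify it.

```python
import math
from typing import List, Sequence


def drop_top_fraction(tokens: Sequence[str], scores: Sequence[float], fraction: float) -> List[str]:
    """Delete the highest-scored words; the rounding rule here is an assumption."""
    k = max(1, round(fraction * len(tokens)))
    ranked = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    top = set(ranked[:k])
    return [t for i, t in enumerate(tokens) if i not in top]


def aopc(p_orig: Sequence[float], p_deleted: Sequence[float]) -> float:
    """Mean drop in predicted-class probability after deletion (higher is better)."""
    return sum(po - pd for po, pd in zip(p_orig, p_deleted)) / len(p_orig)


def log_odds(p_orig: Sequence[float], p_masked: Sequence[float]) -> float:
    """Mean log ratio of masked to original probability (lower is better)."""
    return sum(math.log(pm / po) for po, pm in zip(p_orig, p_masked)) / len(p_orig)
```

If the removed words really matter, the perturbed probabilities fall, so AOPC becomes large and positive while the log-odds score becomes strongly negative.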
Cohesion-score. We propose the cohesion-score to justify an important text span identified by HEDGE. Given an important text span $x_{(a,b]}$, we randomly pick a position in the word sequence $(x_1, \ldots, x_a, x_{b+1}, \ldots, x_n)$ and insert a word of the span back. The process is repeated until a shuffled version $\bar{x}$ of the original sentence is constructed. The cohesion-score is the difference between $p(\hat{y} \mid x)$ and $p(\hat{y} \mid \bar{x})$. Intuitively, the words in an important text span have strong interactions. By perturbing such interactions, we expect to observe the output probability decreasing. To obtain a robust evaluation, for each sentence $x_i$, we construct $Q$ different word sequences $\{\bar{x}_i^{(q)}\}_{q=1}^{Q}$ and compute the average as

$$\text{Cohesion-score} = \frac{1}{N} \sum_{i=1}^{N} \frac{1}{Q} \sum_{q=1}^{Q} \left( p(\hat{y} \mid x_i) - p(\hat{y} \mid \bar{x}_i^{(q)}) \right), \tag{8}$$

where $\bar{x}_i^{(q)}$ is the $q$-th perturbed version of $x_i$, $Q$ is set to 100, and the most important text span in the contribution set $\mathcal{C}$ is considered. Higher cohesion-scores are better.
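The perturbation behind the cohesion-score can be sketched as follows for a single sentence. This is a hedged illustration rather than the evaluation code used in the experiments: `scatter_span` and `cohesion_score` are hypothetical names, `prob` stands for any callable returning the probability of the originally predicted label for a token list, and the fixed seed is only there to make the sketch reproducible.

```python
import random
from typing import Callable, List, Sequence


def scatter_span(tokens: Sequence[str], a: int, b: int, rng: random.Random) -> List[str]:
    """Remove the span x_(a, b] and reinsert its words one by one at random positions."""
    rest = list(tokens[:a]) + list(tokens[b:])
    for word in tokens[a:b]:
        rest.insert(rng.randint(0, len(rest)), word)  # randint is inclusive on both ends
    return rest


def cohesion_score(
    prob: Callable[[List[str]], float],  # hypothetical: p(predicted label | tokens)
    tokens: Sequence[str],
    a: int,
    b: int,
    q: int = 100,
    seed: int = 0,
) -> float:
    """Average probability drop over q random scatterings of the span (one sentence)."""
    rng = random.Random(seed)
    p_orig = prob(list(tokens))
    drops = [p_orig - prob(scatter_span(tokens, a, b, rng)) for _ in range(q)]
    return sum(drops) / q
```

Averaging this per-sentence value over all test sentences gives the dataset-level score; a model that ignores word order inside the span would score close to zero.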
The AOPCs and log-odds scores on different models and datasets are shown in Table 2, where k = r = 20. Additional results of AOPCs and log-odds changing with different k and r are shown in Appendix B. For the IMDB dataset, we tested on a subset with 2000 randomly selected samples due to computation costs. HEDGE achieves the best performance on both evaluation metrics. SampleShapley also achieves good performance with the number of samples set to 100, but its computational cost is 200 times that of HEDGE. Other variants, L/C-Shapley and KernelSHAP, which apply approximations to Shapley values, perform worse than SampleShapley and HEDGE. LIME performs comparably to SampleShapley on the LSTM and CNN models, but is not fully capable of interpreting the deep neural network BERT. The limitation of contextual decomposition mentioned by Jin et al. (2019) is validated by the worst performance of CD in identifying important words. We also observed an interesting phenomenon: the simplest baseline, Leave-one-out, can achieve relatively good performance, even better than HEDGE when k and r are small. We suspect this is because the criterion Leave-one-out uses for picking single keywords matches the evaluation metrics.

Table 3: Cohesion scores of HEDGE and ACD in interpreting different models on the SST and IMDB datasets. For ACD, we adopt the existing application from the original paper (Singh et al., 2019) to explain LSTM on text classification.

Methods   Models   SST     IMDB
HEDGE     CNN      0.016   0.012
HEDGE     BERT     0.124   0.103
HEDGE     LSTM     0.020   0.050
ACD       LSTM     0.015   0.038
Overall, the experimental results demonstrate the effectiveness of Equation 5 in measuring feature importance. Its computational complexity is only O(n), which is much smaller than that of other baselines (e.g. SampleShapley, and L/C-Shapley with polynomial complexity).

Table 3 shows the cohesion-scores of HEDGE and ACD with different models on the SST and IMDB datasets. HEDGE outperforms ACD with LSTM, achieving higher cohesion-scores on both datasets, which indicates that HEDGE is good at capturing important phrases. Comparing the results of HEDGE on different models, the cohesion-scores of BERT are significantly higher than those of LSTM and CNN. This indicates that BERT is more sensitive to perturbations on important phrases and tends to utilize context information for predictions.

Qualitative Analysis
For qualitative analysis, we present two typical examples. In the first example, we compare HEDGE with ACD in interpreting the LSTM model. Figure 2 visualizes two hierarchical explanations, generated by HEDGE and ACD respectively, on a negative movie review from the SST dataset. In this case, LSTM makes a wrong prediction (POSITIVE). Figure 2(a) shows HEDGE correctly captures the sentiment polarities of bravura and emptiness, and the interaction between them as bravura exercise flips the polarity of in emptiness to positive. It explains why the model makes the wrong prediction. On the other hand, ACD incorrectly marks the two words with opposite polarities, and misses the feature interaction, as Figure 2(b) shows.
In the second example, we compare HEDGE in interpreting two different models (LSTM and BERT). Figure 3 visualizes the explanations on a positive movie review. In this case, BERT gives the correct prediction (POSITIVE), while LSTM makes a wrong prediction (NEGATIVE). The comparison between Figure 3(a) and 3(b) shows the difference in feature interactions within the two models and explains how a correct or wrong prediction was made. Specifically, Figure 3(b) illustrates that BERT captures the key phrase not a bad at step 1, and thus makes the positive prediction, while LSTM (as shown in Figure 3(a)) misses the interaction between not and bad, and the negative word bad pushes the model toward the NEGATIVE prediction. Both cases show that HEDGE is capable of explaining model prediction behaviors, which helps humans understand the decision-making. More examples are presented in Appendix C due to the page limitation.

Human Evaluation
We recruited 9 human annotators from Amazon Mechanical Turk (AMT) for human evaluation. The features (e.g., words or phrases) with the highest importance scores given by HEDGE and other baselines are selected as the explanations. Note that HEDGE and ACD can potentially give very long top features which are not user-friendly in human evaluation, so we additionally limit the maximum length of selected features to five. We provided the input text with different explanations in the user interface (as shown in Appendix D) and asked human annotators to guess the model's prediction (Nguyen, 2018) from {"Negative", "Positive", "N/A"} based on each explanation, where "N/A" was selected when annotators could not guess the model's prediction. We randomly picked 100 movie reviews from the IMDB dataset for human evaluation.
There are two dimensions of human evaluation. We first compare HEDGE with other baselines using the predictions made by the same LSTM model. Second, we compare the explanations generated by HEDGE on three different models: LSTM, CNN, and BERT. We measure the number of human annotations that are coherent with the actual model predictions, and define the coherence score as the ratio between the coherent annotations and the total number of examples. Table 4 shows the coherence scores of eight different interpretation methods for LSTM on the IMDB dataset. HEDGE outperforms other baselines with higher coherence score, which means that HEDGE can capture important features which are highly consistent with human interpretations. LIME is still a strong baseline in providing interpretable explanations, while ACD and Shapley-based methods perform worse. Table 5 shows both the accuracy and coherence scores of different models. HEDGE succeeds in interpreting black-box models with relatively high coherence scores. Moreover, although BERT can achieve higher prediction accuracy than the other two models, its coherence score is lower, manifesting a potential tradeoff between accuracy and interpretability of deep models.

Conclusion
In this paper, we proposed an effective method, HEDGE, which builds model-agnostic hierarchical interpretations by detecting feature interactions. In this work, we mainly focus on the sentiment classification task. We test HEDGE with three different neural network models on two benchmark datasets, and compare it with several competitive baseline methods. The superiority of HEDGE is confirmed by both automatic and human evaluations.

A Comparison between Top-down and Bottom-up Approaches
Given the sentence a waste of good performance as an example, Figure 4 shows the hierarchical interpretations for the LSTM model using the bottom-up and top-down approaches respectively. Figure 4(a) shows that the interaction between waste and good cannot be captured until the last (top) layer, while the important phrase waste of good can be extracted in the intermediate layer by the top-down algorithm. We can see that waste flips the polarity of of good to negative, causing the model to predict negative as well. Top-down segmentation performs better than bottom-up in capturing feature interactions. The reason is that the bottom layer contains more features than the top layer, which incurs larger errors in calculating interaction scores. Even worse, the calculation error will propagate and accumulate during clustering.

Figure 11: HEDGE for BERT on a positive movie review from the SST dataset. BERT makes the correct prediction because it captures the interaction between never and fails.