DoLFIn: Distributions over Latent Features for Interpretability

Interpreting the inner workings of neural models is a key step in ensuring the robustness and trustworthiness of the models, but work on neural network interpretability typically faces a trade-off: either the models are too constrained to be very useful, or the solutions found by the models are too complex to interpret. We propose a novel strategy for achieving interpretability that – in our experiments – avoids this trade-off. Our approach builds on the success of using probability as the central quantity, as, for instance, in the attention mechanism. In our architecture, DoLFIn (Distributions over Latent Features for Interpretability), we do not determine beforehand what each feature represents; instead, the features together form an unordered set. Each feature has an associated probability ranging from 0 to 1, weighing its importance for further processing. We show that, unlike attention and saliency-map approaches, this setup makes it straightforward to compute the probability with which an input component supports the decision the neural model makes. To demonstrate the usefulness of the approach, we apply DoLFIn to text classification, and show that DoLFIn not only provides interpretable solutions, but even slightly outperforms the classical CNN and BiLSTM text classifiers on the SST2 and AG-news datasets.


Introduction
Having insights into how a trained neural network solves a given task is important for building more robust, trustworthy, and accurate models (Zhou et al., 2016; Gilpin et al., 2018; Poerner et al., 2018; Belinkov et al., 2017). However, gaining such insights is often challenging because of the large number of parameters and the nonlinear dependencies between components of the model. One approach to 'opening the black box' is to use saliency maps (Jacovi et al., 2018; Gupta and Schütze, 2018) to examine which input components (e.g., words) are taken into account. Nevertheless, salience alone does not tell us how exactly salient components contribute to the decision of a model. For instance, a sentiment-analysis model might mark both "boring" and "wonderful" in "This film would be wonderful, if the beginning wasn't so boring" as salient, but saliency scores do not reveal the crucial interaction between all the words in the sentence that ultimately makes it express a negative sentiment. Alternatively, one can analyse the flow of information through the network, by tracking (relevance) gradients backwards (Arras et al., 2017) or decomposing (forward) contributions to all intermediate quantities (Murdoch et al., 2018; Jumelet et al., 2019). These methods address some of the problems of saliency maps, but produce results that themselves require further interpretation. For instance, such a method might quantify the relative contribution of the 3rd word when processing the 7th word in the 2nd layer of a multilayer LSTM. That quantity, however, does not easily translate to an explanation for the final prediction that the model generates.
A third approach looks at the attention mechanism (Bahdanau et al., 2014), which regulates which components a model attends to. Visualising attended components is straightforward (Xu et al., 2015; Abnar and Zuidema, 2020), as an attention weight is the probability that the corresponding component is taken into account for further computation. At a higher level, as in Lei et al. (2016) and Bastings et al. (2019), one can compute rationales, which are attended pieces of text. However, these approaches face some of the same problems as saliency maps; e.g., attention alone does not indicate how much an input component supports a category.

Figure 1: Text classification models. (a) DoLFIn with bags of latent features (BoLF), using linear-softmax layers (LSL). The whole input text s is mapped to r, which is a bag of latent features. p(f_i|w, s) is the probability of mapping word w (in text s) to latent feature f_i, and q(c|f_i) is the probability that f_i supports category c. (b) Traditional CNN proposed by Kim (2014). (c) BiLSTM.
In this paper we focus, like the attention-based approach, on probability. We start from the observation that probability is a well-studied mathematical tool that has already been widely used in building explainable and robust neural networks (although we also note that simply turning interpretation quantities from other approaches into probabilities by normalizing is not necessarily helpful, as such probabilities are not faithful to the true behaviour of the model). We propose DoLFIn (Distributions over Latent Features for Interpretability), which is based on probability; our architecture maps an input (e.g., a text) to a bag of latent features (BoLF, see Figure 1a and §3) so that interpretation quantities are probabilistic. In other words, the representation of an input is a vector whose elements range from 0 to 1, indicating to what extent a latent feature is in the bag. To do so, we employ linear-softmax layers (i.e., neural layers with softmax activation, LSL for short) to map input components to distributions over latent features, and a truncated sum to aggregate the resulting distributions. With this new type of representation, we can easily compute q(c|w, s), the probability that input component w in context s supports category c, by decomposing the term into p(f|w, s), the distribution over latent features, and q(c|f), the probability that feature f supports category c. The former is given by an LSL and the latter can be estimated from the input-output statistics.
To illustrate the feasibility and benefits of this idea, we employ DoLFIn for text classification, where, going beyond saliency maps and attention, it can tell us the probability with which a word supports a category. We demonstrate that DoLFIn does not trade off interpretability against classification accuracy. We build DoLFIn-conv and DoLFIn-bilstm, which are variants of the classical CNN proposed by Kim (2014) and the BiLSTM text classifier (Figure 1b,c). Carrying out experiments on three popular datasets, TREC Question, SST2, and AG-news, we find that DoLFIn achieves slightly higher accuracy than CNN and BiLSTM on SST2 and AG-news. It is worth noting that, although used for text classification in this paper, DoLFIn is applicable to a wide range of classification tasks such as natural language inference and relation prediction.

Text Classification Baselines
Given a text of n words s = (w_0, ..., w_{n−1}) and a set of m categories C = {c_0, ..., c_{m−1}}, a probabilistic text classifier is p(c|s), which assigns a probability to the prediction that c is the category of s. A traditional text classification architecture adopts the pipeline s → encoder → s → classifier → p(c|s), where s ∈ R^{d_s} is a vector representing text s. The classifier module is often a linear-softmax layer.
Shown in Figure 1b, a classical CNN text classifier (Kim, 2014) utilises a convolutional layer to map each word w_i (with its context) to a vector w_i ∈ R^{d_w}. It then uses max pooling over {w_i}_{i=0}^{n−1} to compute s for text s. A BiLSTM text classifier (Figure 1c) uses a BiLSTM to compute w_i, and an aggregator concatenating the backward part of w_0 and the forward part of w_{n−1} to form s.
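The two baseline aggregation steps can be sketched in a few lines of NumPy (a minimal sketch; the function names and toy dimensions are ours, not from the paper's released code):

```python
import numpy as np

def cnn_aggregate(word_vecs):
    # word_vecs: n x d_w matrix of contextualised word vectors.
    # Max pooling over the n word positions yields s in R^{d_w}.
    return word_vecs.max(axis=0)

def bilstm_aggregate(fwd_states, bwd_states):
    # fwd_states / bwd_states: n x d_h forward / backward BiLSTM states.
    # Concatenate the backward state of w_0 and the forward state of
    # w_{n-1} to form s in R^{2 d_h}.
    return np.concatenate([bwd_states[0], fwd_states[-1]])

# Toy example: n = 3 words, d_w = d_h = 4.
W = np.arange(12, dtype=float).reshape(3, 4)
s_cnn = cnn_aggregate(W)        # element-wise max over rows
s_bi = bilstm_aggregate(W, W)   # 8-dimensional concatenation
```

In a real model the word vectors would come from a trained convolutional or BiLSTM layer; here they are a toy matrix so the pooling behaviour is easy to inspect.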
Training a text classifier amounts to minimising the cross-entropy loss

L(θ) = − Σ_{(s,c) ∈ D_train} log p(c|s; θ),

where θ is the parameter vector and D_train is a training set of (s, c) pairs.
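Concretely, the objective sums the negative log-probability of each gold category; a minimal sketch (hypothetical helper name, toy inputs):

```python
import numpy as np

def cross_entropy_loss(pred_probs, gold_labels):
    # pred_probs: one m-dimensional distribution p(.|s; theta) per text;
    # gold_labels: the gold category index c for each text s.
    # L(theta) = - sum over (s, c) in D_train of log p(c|s; theta)
    return -sum(np.log(p[c]) for p, c in zip(pred_probs, gold_labels))

# A single text whose gold category (index 0) gets probability 0.9:
loss = cross_entropy_loss([np.array([0.9, 0.1])], [0])  # -log 0.9
```

In practice the loss is minimised with a gradient-based optimizer (the paper uses Adam); this sketch only shows the quantity being minimised.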

DoLFIn with Bags of latent features (BoLF)
For simplicity, we introduce DoLFIn as a neural text classifier, depicted in Figure 1a, but DoLFIn should be applicable to many other classification tasks.
The key part of DoLFIn, BoLF, is an aggregator consisting of linear-softmax layers (LSL) and a truncated sum. This aggregator first maps each w_i to a distribution u_i over d latent features using an LSL, i.e., u_i = LSL(w_i), so that u_{i,j} = p(f_j|w_i, s). It then uses the truncated sum to compute

r = min(1, Σ_{i=0}^{n−1} u_i).

We apply the min operator element-wise, so that the j-th entry of r, i.e., r_j ∈ [0, 1], can be considered a soft indicator of the extent to which the j-th latent feature f_j contributes to the final output. Intuitively, we represent s by a bag of latent features r. If r_j is close to 1, feature f_j likely appears in the bag.
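The aggregator can be sketched in NumPy as follows (a minimal sketch; the LSL weights here are random placeholders standing in for trained parameters):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def bolf_aggregate(word_vecs, W, b):
    # Linear-softmax layer: u_i = softmax(w_i W + b) is a distribution
    # over the d latent features, so u[i, j] = p(f_j | w_i, s).
    u = softmax(word_vecs @ W + b, axis=-1)   # n x d
    # Truncated sum: r_j = min(1, sum_i u[i, j]) lies in [0, 1].
    r = np.minimum(1.0, u.sum(axis=0))
    return u, r

rng = np.random.default_rng(0)
n, d_w, d = 5, 8, 10                          # 5 words, 10 latent features
u, r = bolf_aggregate(rng.normal(size=(n, d_w)),
                      rng.normal(size=(d_w, d)),
                      np.zeros(d))
```

Each row of u is a proper distribution (it sums to 1), and every entry of r stays in [0, 1] by construction, which is what licenses reading r_j as a soft feature indicator.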
Finally, we compute

s = Σ_{j=0}^{d−1} r_j f_j

to represent s, where each feature f_j is represented by a vector f_j ∈ R^{d_s}. This sum can be seen as a bag of the latent feature representations.
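The resulting representation is then fed to the classifier as in the baselines; a sketch under the definition above (the feature matrix here is a random placeholder for the learned vectors f_j):

```python
import numpy as np

def bolf_represent(r, F):
    # s = sum_j r_j f_j, where F is a d x d_s matrix whose j-th row is
    # the vector f_j representing latent feature f_j.
    return r @ F

rng = np.random.default_rng(0)
d, d_s = 10, 16
r = rng.uniform(size=d)        # soft indicators in [0, 1]
F = rng.normal(size=(d, d_s))  # learned feature vectors (placeholder)
s = bolf_represent(r, F)       # s in R^{d_s}
```

Because r gates the features multiplicatively, a feature with r_j near 0 contributes almost nothing to s, matching the bag-of-features reading.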

Interpretability
We now show how to analyse the impact of each input component w in context s on the classification decision of the model. To do so, we examine the probability q(c|w, s), which can be seen as the support of w in context s for category c. We decompose this probability (see the purple arrows in Figure 1) as

q(c|w, s) = Σ_j q(c|f_j) p(f_j|w, s),     (1)

where p(f_j|w, s) is given by the LSLs as mentioned above. (Note that, because we aggregate p(f_j|w, s) for all j into r, we do not need to include r in this equation.) Because directly computing q(c|f) is not trivial, we approximate it using the statistics of the model's input-output behaviour. Recall that the latent features from the words of s are aggregated into r, so that if r_j is close to 1, f_j is on and used to make the prediction. For simplicity, we assume that f_j is on when r_j > δ for δ ∈ [0, 1]. Let S be a large set of unlabelled texts, and let count_S(c, f_j) be the number of texts s ∈ S whose r_j > δ and which are assigned to category c by the model. Then

q(c|f_j) = count_S(c, f_j) / Σ_{c'} count_S(c', f_j).

Intuitively, we take into account how many times feature f_j appears when yielding the prediction c. Consequently, the closer q(c|f_j) is to 1, the more likely f_j supports category c. Moreover, if q(·|f_j) is close to uniform, f_j is not helpful for classification. In our experiments, we set δ = 0.5 and use the texts of the dev sets as S.
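Both quantities reduce to simple counting and a dot product; a sketch (hypothetical variable names), assuming the model's r vectors and predictions over S are available:

```python
import numpy as np

def estimate_q_c_given_f(rs, preds, num_classes, delta=0.5):
    # rs: |S| x d matrix of r vectors; preds: predicted category per text.
    # count_S(c, f_j) = number of texts with r_j > delta predicted as c;
    # q(c|f_j) = count_S(c, f_j) / sum_{c'} count_S(c', f_j).
    d = rs.shape[1]
    counts = np.zeros((num_classes, d))
    for r, c in zip(rs, preds):
        counts[c] += (r > delta)
    # Guard against features that never fire (all-zero columns).
    return counts / np.maximum(counts.sum(axis=0, keepdims=True), 1)

def word_support(q_c_f, p_f_w):
    # Eq. (1): q(c|w, s) = sum_j q(c|f_j) p(f_j|w, s)
    return q_c_f @ p_f_w

# Toy example: 2 latent features, 2 classes, 3 texts.
rs = np.array([[0.9, 0.1], [0.8, 0.7], [0.2, 0.9]])
preds = [0, 0, 1]
q = estimate_q_c_given_f(rs, preds, num_classes=2)
# f_0 fired twice, both times predicted as class 0 -> q(0|f_0) = 1.
support = word_support(q, np.array([0.5, 0.5]))
```

The toy numbers make the counting visible: with δ = 0.5, feature f_0 is on in two texts (both predicted as class 0) and f_1 in two texts (one per class), so f_0 strongly supports class 0 while f_1 is uninformative.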

Experiments
In our experimental evaluation we investigate whether DoLFIn can indeed produce interpretable solutions without sacrificing accuracy. Our implementation is in Python with PyTorch (Paszke et al., 2019). The source code and data are available at https://github.com/lephong/dolfin. Extra information is provided in the appendix.

Table 1: Left - The statistics of the TREC, SST2, and AG-news datasets. #c is the number of categories; AvgL is the average length (in words) of test texts. Right - Accuracy (%) of the four models on TREC, SST2, and AG-news. We show the mean and standard deviation across five runs.
Figure 2: Top - A TREC question. Each word is highlighted according to q(c|w, s). For example, "How" is highlighted more than "bails" in the second question because q(DESC|How, s) > q(DESC|bails, s). The left-most token on each line is the label of category c (e.g., the second question is assigned to category DESC). Bottom - A text in AG-news.
Dataset We used the following three text classification datasets, whose statistics are given in Table 1-left. • TREC Question (Li and Roth, 2002) (TREC for short) is for classifying questions into six categories: ABBR (Abbreviation), DESC (Description), ENTY (Entity), HUM (Human), LOC (Location), and NUM (Number). The questions are generally short (7.5 words, on average), such as "Who are cartoondom 's Super Six ?" • SST2 (Socher et al., 2013) is for predicting the binary sentiment (positive/negative) of movie reviews. Unlike the dev and test sets, the train set contains labelled phrases as well as sentences, rather than sentences alone.
• AG-news (Zhang et al., 2015) is a news topic classification dataset with four topics: WORLD, SPORTS, BUSINESS, and SCI-TECH. Among the three datasets, AG-news is the largest in terms of the number of texts and the average length.
Models We evaluated four models: CNN, BiLSTM, DoLFIn-conv, and DoLFIn-bilstm. Most of their hyperparameters are identical to those used by Kim (2014) (see Appendix A). The number of latent features d is 20, 10, and 100 for DoLFIn when tested on TREC, SST2, and AG-news, respectively. We used GloVe word embeddings (Pennington et al., 2014) and the Adam optimizer (Kingma and Ba, 2014) with the default learning rate 0.001.
Results Table 1-right shows the results. Although DoLFIn performs worse than CNN and BiLSTM on TREC, it slightly outperforms them on SST2 and AG-news. These results suggest that using DoLFIn does not sacrifice classification accuracy.
Interpretation To illustrate how to interpret DoLFIn, in Figure 2-top we visualise a TREC question from the dev set, where the weight for a word is q(c|w, s), given by Eq. (1). If a word (in a context) supports the category in question, it is highlighted. For instance, when considering category DESC, we can see that the word "How" supports it strongly, "bails" slightly, and the other words not at all. For NUM, "many" and "are there in" have high q(NUM|w, s). DoLFIn correctly chose NUM. Figure 2-bottom shows a text from AG-news. DoLFIn reasonably focused on "PeopleSoft" and "Oracle" for SCI-TECH, and on "Takeover" and "merger" for BUSINESS. It then chose BUSINESS, whereas the correct topic is SCI-TECH. (This is a difficult case even for humans.) Another way to interpret the behaviour of DoLFIn is to examine q(c|f) and p(f|w, s), and especially the meanings of the latent features. Figure 3-left shows a heatmap visualising q(c|f) for TREC. We can see that features f_2 and f_16 are not helpful because they support all categories almost uniformly. Feature f_4 strongly supports ENTY, whereas f_0, f_11, and f_19 support HUM. Feature f_9 prefers ENTY but can also be used for DESC and HUM. Interestingly, there are no features strongly supporting ABBR; DoLFIn seems to rely on the absence of all features when predicting ABBR. Figure 3-right-top shows a heatmap of p(f|w, s). Figure 3-right-bottom shows TREC questions in which each word is highlighted together with its most probable latent feature. For instance, the first "What" is mapped to feature f_9 and p(f_9|What, s) is high (see the first row in Figure 3-right-top). In general, DoLFIn uses f_9 for the words "What" and "Which", which are often used for ENTY, but sometimes for HUM (e.g., "What is the most popular last name ?") or DESC ("What is Java ?") (see Figure 3-left). If DoLFIn can utilise the following words to choose ENTY, it assigns f_4 to them (e.g., DoLFIn knows that "What sports_4" in the first question of Figure 3-right-bottom asks about an entity). Otherwise, it again uses f_9 (e.g., DoLFIn still cannot decide whether "What is_9" in the third question asks about an entity or a description).

Figure 3: Left - Heatmap of q(c|f) with 20 latent features for TREC. The (j+1)-th row indicates q(c|f_j) (%). For example, feature f_9 supports categories ENTY, DESC, and HUM. Right-top - Heatmap of p(f|w, s) for a TREC question. "What" in this sentence is most likely characterised by feature f_9, whereas "a" can be f_7, f_8, f_10, or f_15. Right-bottom - TREC questions predicted as ENTY. Each word w has a subscript j = arg max_k p(f_k|w, s), and is highlighted according to p(f_j|w, s). The left-most token on each line is the gold label-the predicted label.

Conclusion
We have proposed DoLFIn, a new probability-based architecture for building explainable models. DoLFIn represents the input by a bag of latent features, using linear-softmax layers to map input components to distributions over latent features and a truncated sum to aggregate the resulting distributions. We showed that, unlike with attention and saliency maps, it is straightforward to compute how much an input component supports a category. To demonstrate our idea, we applied DoLFIn to text classification. Compared with the classical CNN and BiLSTM text classifiers, DoLFIn achieved comparable accuracies, but much better interpretability.