LISA: Explaining Recurrent Neural Network Judgments via Layer-wIse Semantic Accumulation and Example to Pattern Transformation

Recurrent neural networks (RNNs) are temporal and cumulative in nature and have shown promising results in various natural language processing tasks. Despite their success, it remains a challenge to understand their hidden behavior. In this work, we analyze and interpret the cumulative nature of RNNs via a proposed technique named Layer-wIse-Semantic-Accumulation (LISA) for explaining decisions and detecting the most likely (i.e., saliency) patterns that the network relies on while making decisions. We demonstrate (1) LISA: "How an RNN accumulates or builds semantics during its sequential processing for a given text example and expected response" (2) Example2pattern: "How the saliency patterns look for each category in the data according to the network in decision making". We analyse the sensitivity of RNNs to different inputs to check the increase or decrease in prediction scores, and further extract the saliency patterns learned by the network. We employ two relation classification datasets, SemEval 10 Task 8 and TAC KBP Slot Filling, to explain RNN predictions via LISA and example2pattern.


Introduction
The interpretability of systems based on deep neural networks is required to explain the reasoning behind network prediction(s), which makes it possible to (1) verify that the network works as expected and identify the cause of incorrect decision(s), and (2) understand the network in order to improve data or model with or without human intervention. There is a long line of research on techniques for the interpretability of deep neural networks (DNNs) from different aspects, such as explaining network decisions, data generation, etc. Erhan et al. (2009); Hinton (2012); Simonyan et al. (2013) and Nguyen et al. (2016) focused on model aspects to interpret neural networks via the activation maximization approach, i.e., by finding inputs that maximize the activations of given neurons. Goodfellow et al. (2014) interpret by generating adversarial examples. In contrast, Baehrens et al. (2010) and Bach et al. (2015); Montavon et al. (2017) explain neural network predictions by sensitivity analysis to different input features and by decomposition of decision functions, respectively.
Past works (Zeiler and Fergus, 2014; Dosovitskiy and Brox, 2016) have mostly analyzed deep neural networks, especially CNNs in the field of computer vision, to study and visualize the features learned by neurons. Recent studies have investigated the visualization of RNNs and their variants. Tang et al. (2017) visualized the memory vectors to understand the behavior of LSTM and gated recurrent unit (GRU) networks in a speech recognition task. For given words in a sentence, Li et al. (2016) employed heat maps to study sensitivity and meaning composition in recurrent networks. Ming et al. (2017) proposed a tool, RNNVis, to visualize hidden states based on an RNN's expected response to inputs. Peters et al. (2018) studied the internal states of deep bidirectional language models to learn contextualized word representations and observed that the higher-level hidden states capture word semantics, while lower-level states capture syntactic aspects. Despite the possibility of visualizing hidden state activations and performance-based analyses, it remains a challenge for humans to interpret the hidden behavior of these "black box" networks, which has raised questions in the NLP community as to whether a network behaves as expected. In this respect, we address the cumulative nature of RNNs with text input and computed response to answer "how does it aggregate and build the semantic meaning of a sentence word by word at each time point in the sequence for each category in the data".

Contribution:
In this work, we analyze and interpret the cumulative nature of RNNs via a proposed technique named Layer-wIse-Semantic-Accumulation (LISA) for explaining decisions and detecting the most likely (i.e., saliency) patterns that the network relies on while making decisions. We demonstrate (1) LISA: "How an RNN accumulates or builds semantics during its sequential processing for a given text example and expected response" (2) Example2pattern: "How the saliency patterns look for each category in the data according to the network in decision making". We analyse the sensitivity of RNNs to different inputs to check the increase or decrease in prediction scores. For an example sentence that is classified correctly, we identify and extract a saliency pattern (an N-gram of words in order, learned by the network) that contributes the most to the prediction score; hence the term example2pattern transformation for each category in the data. We employ two relation classification datasets, SemEval 10 Task 8 and TAC KBP Slot Filling (SF) Shared Task (ST), to explain RNN predictions via the proposed LISA and example2pattern techniques.

Connectionist Bi-directional RNN
We adopt the bi-directional recurrent neural network architecture with ranking loss proposed by Vu et al. (2016a). The network consists of three parts: a forward pass, which processes the original sentence word by word (Equation 1); a backward pass, which processes the reversed sentence word by word (Equation 2); and a combination of both (Equation 3). The forward and backward passes are combined by adding their hidden layers. There is also a connection to the previous combined hidden layer, with weight W_bi, motivated by including all intermediate hidden layers in the final decision of the network (see Equation 3). They named the neural architecture 'Connectionist Bi-directional RNN' (C-BRNN). Figure 1 shows the C-BRNN architecture, where all three parts are trained jointly:

h_f(t) = f(U_f w_t + W_f h_f(t-1))                   (1)
h_b(t) = f(U_b w_t + W_b h_b(t+1))                   (2)
h_bi(t) = f(h_f(t) + h_b(t) + W_bi h_bi(t-1))        (3)

where w_t is the word vector of dimension d for the word at time step t in a sentence of length n, and D is the hidden unit dimension. U_f ∈ R^{d×D} and U_b ∈ R^{d×D} are the weight matrices between hidden units and input w_t in the forward and backward networks, respectively; W_f ∈ R^{D×D} and W_b ∈ R^{D×D} are the weight matrices connecting hidden units in the forward and backward networks, respectively. W_bi ∈ R^{D×D} is the weight matrix connecting the hidden vectors of the combined forward and backward network. Following Gupta et al. (2015), during model training we use 3-gram and 5-gram representations of each word w_t at time step t in the word sequence, where a 3-gram for w_t is obtained by concatenating the corresponding word embeddings, i.e., w_{t-1} w_t w_{t+1}.
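The three recurrences above can be sketched in a few lines of NumPy. This is a minimal illustration of Equations 1-3, not the authors' implementation; the tanh nonlinearity, zero initial states, and the matrix shapes are assumptions.

```python
import numpy as np

def c_brnn_states(words, U_f, W_f, U_b, W_b, W_bi):
    """Sketch of the C-BRNN combined hidden states.
    Assumed shapes: words (n, d); U_f, U_b (d, D); W_f, W_b, W_bi (D, D)."""
    n, _ = words.shape
    D = W_f.shape[0]
    f = np.tanh  # assumed nonlinearity

    # Forward pass over the original word order (Eq. 1), h_f(0) = 0.
    h_f = np.zeros((n + 1, D))
    for t in range(1, n + 1):
        h_f[t] = f(words[t - 1] @ U_f + h_f[t - 1] @ W_f)

    # Backward pass over the reversed word order (Eq. 2), h_b(n+1) = 0.
    h_b = np.zeros((n + 2, D))
    for t in range(n, 0, -1):
        h_b[t] = f(words[t - 1] @ U_b + h_b[t + 1] @ W_b)

    # Combination (Eq. 3): add forward and backward states and connect
    # to the previous combined state through W_bi.
    h_bi = np.zeros((n + 1, D))
    for t in range(1, n + 1):
        h_bi[t] = f(h_f[t] + h_b[t] + h_bi[t - 1] @ W_bi)
    return h_bi[1:]  # one combined state per time step
```

The final row of the returned matrix corresponds to the h^bi vector that is attached to the softmax layer in Figure 1.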
Ranking Objective: Similar to Santos et al. (2015) and Vu et al. (2016a), we applied a ranking loss function to train the C-BRNN. The ranking scheme maximizes the distance between the true label y+ and the best competitive label c- given a data point x. It is defined as:

L = log(1 + exp(γ(m+ - s_θ(x)_{y+}))) + log(1 + exp(γ(m- + s_θ(x)_{c-})))        (4)

where s_θ(x)_{y+} and s_θ(x)_{c-} are the scores for the classes y+ and c-, respectively. The parameter γ controls the penalization of the prediction errors, and m+ and m- are margins for the correct and incorrect classes. Following Vu et al. (2016a), we set γ = 2, m+ = 2.5 and m- = 0.5.
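As a sanity check, the ranking loss of Santos et al. (2015) can be computed directly from a score vector. The function below is a minimal sketch (the name and list-based interface are ours, not from the paper):

```python
import numpy as np

def ranking_loss(scores, y_true, gamma=2.0, m_pos=2.5, m_neg=0.5):
    """Pairwise ranking loss: penalizes a low score for the true class y+
    and a high score for the best competitive (incorrect) class c-."""
    s_pos = scores[y_true]
    # Best competitive class: highest-scoring class other than y_true.
    s_neg = max(s for i, s in enumerate(scores) if i != y_true)
    return (np.log1p(np.exp(gamma * (m_pos - s_pos)))
            + np.log1p(np.exp(gamma * (m_neg + s_neg))))
```

With γ = 2, m+ = 2.5 and m- = 0.5, the loss is near zero only when the true class scores well above m+ and the best competitor well below -m-.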

Model Training and Features:
We represent each word by the concatenation of its word embedding and position feature vectors. We use word2vec (Mikolov et al., 2013) embeddings, which are updated during model training. As position features in the relation classification experiments, we use position indicators (PI) (Zhang and Wang, 2015) in the C-BRNN to annotate the target entities/nominals in the word sequence. This does not require changing the input vectors, but it increases the length of the input word sequences, as four independent words (<e1>, </e1>, <e2>, </e2>) are introduced as position indicators around the relation arguments.
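The position-indicator annotation can be sketched as a simple token-level insertion. The helper below is hypothetical (the span convention is an assumption, not from Zhang and Wang, 2015):

```python
def add_position_indicators(tokens, e1_span, e2_span):
    """Insert <e1>...</e1> and <e2>...</e2> around the relation arguments.
    Spans are (start, end) token indices, end exclusive; assumes e1 precedes
    e2 and the spans do not overlap."""
    out = []
    for i, tok in enumerate(tokens):
        if i == e1_span[0]:
            out.append("<e1>")
        if i == e2_span[0]:
            out.append("<e2>")
        out.append(tok)
        if i == e1_span[1] - 1:
            out.append("</e1>")
        if i == e2_span[1] - 1:
            out.append("</e2>")
    return out
```

The output sequence is four tokens longer than the input, matching the description above.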
In our analysis and interpretation of recurrent neural networks, we use the trained C-BRNN model (Figure 1) of Vu et al. (2016a).

LISA and Example2Pattern in RNN
There are several aspects to interpreting a neural network, for instance via (1) Data: "Which dimensions of the data are the most relevant for the task?" (2) Prediction or Decision: "Why is a certain pattern classified in a certain way?" (3) Model: "What do patterns belonging to each category in the data look like according to the network?"
In this work, we focus on explaining the RNN via the decision and model aspects, by finding the patterns that explain "why" the model arrives at a particular decision for each category in the data and by verifying that the model behaves as expected. To do so, we propose a technique named LISA that interprets the RNN with respect to "how it accumulates and builds meaningful semantics of a sentence word by word" and "how the saliency patterns look according to the network" for each category in the data while making decisions. We extract the saliency patterns via an example2pattern transformation.
LISA Formulation: To explain the cumulative nature of recurrent neural networks, we show how the network builds the semantic meaning of a sentence word by word for a sentence belonging to a particular category in the data, and compute prediction scores for the expected category on different inputs, as shown in Figure 2. The scheme also depicts the contribution of each word in the sequence towards the final classification score (prediction probability).
At first, we compute the different subsequences of word(s) for a given sequence of words (i.e., sentence). Consider a sequence S of words [w_1, w_2, ..., w_k, ..., w_n] for a given sentence S of length n. We compute n subsequences, where each subsequence S_{≤k} is a subvector of words [w_1, ..., w_k], i.e., S_{≤k} consists of the words preceding and including the word w_k in the sequence S. In the context of this work, extending a subsequence by a word means appending the next word in the sequence to the subsequence. Observe that the number of subsequences, n, is equal to the total number of time steps in the C-BRNN.
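Enumerating the n subsequences is a one-liner; the sketch below makes the construction of S_{≤k} explicit (the function name is ours):

```python
def prefix_subsequences(words):
    """All n prefixes S<=k = [w_1, ..., w_k] of a sentence, one per
    C-BRNN time step k = 1, ..., n."""
    return [words[:k] for k in range(1, len(words) + 1)]
```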
Next, we compute the RNN prediction score for the category R associated with sentence S. We compute the score via the autoregressive conditional P(R|S_{≤k}, M) for each subsequence S_{≤k}, as:

P(R|S_{≤k}, M) = softmax(h^bi_k W_hy + b_y)_R        (5)

We compute the network prediction P(R|S_{≤k}, M) to demonstrate the cumulative property of the recurrent neural network, which builds meaningful semantics of the sequence S by extending each subsequence S_{≤k} word by word. The internal state h^bi_k (attached to the softmax layer as in Figure 1) is involved in decision making for each input subsequence S_{≤k}, with bias vector b_y ∈ R^C and hidden-to-softmax weight matrix W_hy ∈ R^{D×C} for C categories.
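The softmax scoring step from the hidden state can be sketched as follows (a minimal, numerically stabilized implementation; shapes as defined above):

```python
import numpy as np

def predict_proba(h_bi_k, W_hy, b_y):
    """Class distribution over C categories from the combined hidden state
    h_bi_k of shape (D,), with W_hy of shape (D, C) and b_y of shape (C,)."""
    logits = h_bi_k @ W_hy + b_y
    logits = logits - logits.max()  # subtract max for numerical stability
    p = np.exp(logits)
    return p / p.sum()
```

P(R|S_{≤k}, M) is then the entry of the returned vector at the index of category R.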
LISA is illustrated in Figure 2, where each word in the sequence contributes to the final classification score. It allows us to understand the network's decisions via peaks in the prediction score over different subsequences. The peaks signify the saliency patterns (i.e., sequences of words) that the network has learned in order to make a decision. For instance, the input word 'of' following the subsequence '<e1> demolition </e1> was the cause' introduces a sudden increase in the prediction score for the relation type cause-effect(e1, e2). This suggests that the C-BRNN collects the semantics layer-wise via temporally organized subsequences. Observe that the subsequence '...cause of' is salient enough in decision making (i.e., prediction score = 0.77), while the next subsequence '...cause of <e2>' adds to the score to reach 0.98.
Example2pattern for Saliency Pattern: To further interpret the RNN, we seek to identify and extract the most likely input pattern (or phrase) for a given class that is discriminating enough in decision making. Therefore, each example input is transformed into a saliency pattern that informs us about the network's learning. To do so, we first compute an N-gram for each word w_t in the sentence S. For instance, a 3-gram representation of w_t is given by [w_{t-1}, w_t, w_{t+1}]. Therefore, an N-gram (for N=3) sequence S of words is represented as [[w_0, w_1, w_2], [w_1, w_2, w_3], ..., [w_{n-1}, w_n, w_{n+1}]], where w_0 and w_{n+1} are PADDING (zero) vectors of embedding dimension.
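The padded N-gram view of a sentence can be sketched as below (here padding is shown as a string token; in the model it is a zero embedding vector):

```python
def to_ngrams(words, n=3, pad="PADDING"):
    """N-gram view of a sentence: each word w_k is replaced by the window
    [w_{k-n//2}, ..., w_k, ..., w_{k+n//2}], padded at the boundaries."""
    half = n // 2
    padded = [pad] * half + list(words) + [pad] * half
    return [padded[k:k + n] for k in range(len(words))]
```

For N=3 this yields exactly one tri-gram per word, i.e., one per time step.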

Analysis: Relation Classification
Given a sentence and two annotated nominals, the task of binary relation classification is to predict the semantic relation between the pair of nominals. In most cases, the context between the two nominals defines the relationship. However, Vu et al. (2016a) have shown that the extended context helps. In this work, we focus on building the semantics of a given sentence using the relationship context between the two nominals.
LISA Analysis: As discussed in Section 3, we interpret the C-BRNN by explaining its predictions via semantic accumulation over the subsequences S_{≤k} (Figure 2) for each sentence S. We select the example sentences S1-S7 (Table 1), for which the network predicts the correct relation type with high scores. For the example sentence S1, Table 2 illustrates how different subsequences are input to the C-BRNN in order to compute the prediction scores pp in the softmax layer for the relation cause-effect(e1, e2). We use the tri-gram (Section 3) representation for each word in the examples S1-S7. For instance, in Figure 3a and Table 2, the C-BRNN builds the meaning of the sentence S1 word by word, where a sudden increase in pp is observed when the input subsequence <e1> demolition </e1> was the cause is extended with the next term of in the word sequence S. Note that the relationship context between the arguments demolition and terror is sufficient to detect the relationship type. Interestingly, we also observe that prepositions (such as of, by, into, etc.) in combination with verbs are key features in building the meaningful semantics.

Figures
Saliency Patterns via example2pattern Transformation: Following the discussion in Section 3 and Algorithm 1, we transform each correctly identified example into a pattern by extracting the most likely N-gram in the input subsequence(s). In each of the Figures 3a, 3b, 3c, 3d, 3e, 3f and 3g, the red square box signifies that the relation type is correctly identified (with τ = 0.5) at this particular subsequence input (without the remaining context in the sentence). We extract the last N-gram of such a subsequence.

TAC KBP Slot Filling dataset
We investigate another dataset from the TAC KBP Slot Filling (SF) shared task (Surdeanu, 2013), where we use the relation classification dataset of Adel et al. (2016) in the context of slot filling. We selected two slots, per:loc_of_birth and per:spouse, out of 24 types.
LISA Analysis: Following Section 4.1, we analyse the C-BRNN for LISA using sentences S8 and S9 (Table 1). Figures 3h and 3i demonstrate the cumulative nature of the recurrent neural network, where we observe that the salient patterns born in <e2> and </e1> married <e2> lead to correct decision making for S8 and S9, respectively. Interestingly, for S8 we see a decrease in the prediction score from 0.59 to 0.52 on including the terms in the subsequence following the term in.
Saliency Patterns via example2pattern Transformation: Following Section 3 and Algorithm 1, we demonstrate the example2pattern transformation of sentences S8 and S9 in Table 1 with tri-grams. In addition, Table 4 shows the tri-gram salient patterns extracted for the two slots.

Visualizing Latent Semantics
In this section, we visualize the hidden state of each test (and train) example that has accumulated (or built) the meaningful semantics during sequential processing in the C-BRNN. To do this, we compute the last hidden vector h^bi of the combined network (i.e., the h^bi attached to the softmax layer in Figure 1) for each test (and train) example and visualize it (Figures 3k and 3j) using t-SNE (Maaten and Hinton, 2008). Each color represents a relation type. Observe the distinctive clusters of accumulated semantics in the hidden states for each category in the data (SemEval10 Task 8).
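This projection step can be sketched with scikit-learn's t-SNE. The hidden-state matrix H below is random stand-in data (the real input would be one h^bi row per example); the dimensions are assumptions for illustration:

```python
import numpy as np
from sklearn.manifold import TSNE

# Hypothetical matrix of last combined hidden states h^bi,
# one row per example (here: 100 examples, D = 50).
rng = np.random.default_rng(0)
H = rng.normal(size=(100, 50))

# Project the D-dimensional hidden states to 2-D for plotting,
# as in Figures 3j and 3k; points would then be colored by relation type.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(H)
print(emb.shape)  # (100, 2)
```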

Conclusion and Future Work
We have demonstrated the cumulative nature of recurrent neural networks via sensitivity analysis over different inputs, i.e., LISA, to understand how they build meaningful semantics and to explain predictions for each category in the data. We have also detected a salient pattern in each of the example sentences, i.e., the example2pattern transformation that the network learns in decision making. We extracted the salient patterns for different categories in two relation classification datasets.
In future work, it would be interesting to analyse the sensitivity of RNNs to corruption of the salient patterns. One could also investigate visualizing the dimensions of hidden states (activation maximization) and word embedding vectors along with the network decisions over time. We foresee applying LISA and example2pattern to different tasks such as document categorization, sentiment analysis, language modeling, etc. Another interesting direction would be to analyze bag-of-words neural topic models such as DocNADE (Larochelle and Lauly, 2012) and iDocNADE (Gupta et al., 2018b) to interpret their semantic accumulation during autoregressive computations in building document representation(s). The saliency patterns we extract for each category in the data can be effectively used to instantiate pattern-based information extraction systems, such as bootstrapping entity (Gupta and Manning, 2014) and relation extractors (Gupta et al., 2018e).

Algorithm 1: Example2pattern Transformation
Input: sentence S, length n, category R, threshold τ, C-BRNN M, N-gram size N
Output: N-gram saliency pattern patt
1: for k = 1 to n do
2:     compute N-gram_k (eqn 8) of words in S
3: for k = 1 to n do
4:     compute S_{≤k} (eqn 7) of N-grams
5:     compute P(R|S_{≤k}, M) using eqn 5
6:     if P(R|S_{≤k}, M) ≥ τ then
7:         return patt = N-gram_k, the last N-gram in S_{≤k}

Figure 3: (a-i) Layer-wIse Semantic Accumulation (LISA) by C-BRNN for different relation types in the SemEval10 Task 8 and TAC KBP Slot Filling datasets. The red square signifies that the relation is correctly detected with the input subsequence (which is enough for decision making). (j-k) t-SNE visualization of the last combined hidden unit (h^bi) of C-BRNN computed using the SemEval10 train and test sets.

4.1 SemEval10 Shared Task 8 dataset
The relation classification dataset of the Semantic Evaluation 2010 (SemEval10) shared task 8 (Hendrickx et al., 2009) consists of 19 relations (9 directed relations and one artificial class Other), with 8,000 training and 2,717 testing sentences. We split the training data into train (6.5k) and development (1.5k) sentences to optimize the C-BRNN.
(Footnote 1: data from the slot filler classification component of the slot filling pipeline, treated as relation classification.)

Table 1:
Example sentences for the LISA and example2pattern illustrations. The sentences S1-S7 belong to the SemEval10 Task 8 dataset and S8-S9 to the TAC KBP Slot Filling (SF) shared task dataset.

The last element of N-gram_k consists of the word w_{k+1} (i.e., PADDING) if k = n. To generalize, for i ∈ [1, ⌊N/2⌋], an N-gram_k of size N for the word w_k in the C-BRNN is given by:

N-gram_k = [w_{k-i}, ..., w_k, ..., w_{k+i}]        (8)

Table 3:
Figures 3a, 3b, 3c, 3d, 3e, 3f and 3g demonstrate the cumulative nature and sensitivity of the RNN via the prediction probability (pp) for different inputs for sentences S1-S7, respectively.