Iterative Recursive Attention Model for Interpretable Sequence Classification

Natural language processing has greatly benefited from the introduction of the attention mechanism. However, standard attention models are of limited interpretability for tasks that involve a series of inference steps. We describe an iterative recursive attention model, which constructs incremental representations of input data through reusing results of previously computed queries. We train our model on sentiment classification datasets and demonstrate its capacity to identify and combine different aspects of the input in an easily interpretable manner, while obtaining performance close to the state of the art.


Introduction
The introduction of the attention mechanism offered a way to demystify the inference process of neural models. By assigning scalar weights to different elements of the input, we are able to visualize and potentially understand why the model made the decision it made, or discover a deficiency in the model by tracing down a relevant aspect of the input being overlooked by the model. Specifically in natural language processing (NLP), which abounds with variable-length word sequence classification tasks, attention alleviates the issue of learning long-term dependencies in recurrent neural networks (Bengio et al., 1994) by offering the model a glimpse into previously processed tokens.
Attention offers a good retrospective explanation of the classification decision by indicating what parts of the input contributed the most to the decision. However, in many cases the final decision is best interpreted as a result of a series of inference steps, each of which can potentially affect its polarity. A case in point is sentiment analysis, in which contrastive clauses and negations act as polarity switches of the overall sentence sentiment. In such cases, attention will only point to the part of the input sentence whose polarity matches that of the final decision. However, unfolding the inference process of a model into a series of interpretable steps would make the model more interpretable and allow one to identify its shortcomings.
As a step toward that goal, we propose an extension of the iterative attention mechanism (Sordoni et al., 2016), which we call the iterative recursive attention model (IRAM), where the result of an attentive query is nonlinearly transformed and then added to the set of vector representations of the input sequence. The nonlinear transformation, along with reusing the representations obtained in previous steps, allows the model to construct a recursive representation and process the input sequence bit by bit. The upshot is that we can inspect how the model weighs the different parts of the sentence and recursively combines them to give the final decision. We test the model on two sentiment analysis tasks and demonstrate its capacity to isolate different task-related aspects of the input, while reaching performance comparable with the state of the art.

Related Work
Attention and its variants (Luong et al., 2015) were initially proposed for machine translation, but are now widely adopted in NLP. Attention has proven especially useful in tasks that involve long text sequences, such as summarization (Rush et al., 2015; See et al., 2017), question answering (Hermann et al., 2015; Cui et al., 2017), and natural language inference (Rocktäschel et al., 2015; Yin et al., 2016; Parikh et al., 2016), as well as purely attentional machine translation (Vaswani et al., 2017; Gu et al., 2017).
Thus far, there have been a number of interesting and effective approaches for interpreting the inner workings of recurrent neural networks through methods such as representing them as finite automata (Weiss et al., 2017), extracting inference rules (Zanzotto and Ferrone, 2017), and analyzing saliency of inputs through first-order derivative information (Li et al., 2016; Arras et al., 2017).
Akin to the saliency analysis approaches, we opt not to condense the trained network into a finite set of rules. We differ from Li et al. (2016) and Arras et al. (2017) in that we attempt to decode the steps of the decision process of a recurrent network instead of demonstrating through saliency how the decision changes with respect to the inputs. In the context of sentiment analysis, the main benefit we see in representing the decision process of a recurrent network as a sequence of steps is that it offers a simple way to isolate sentiment-bearing phrases by observing how they get grouped in a single iteration. Secondly, we aim for improved interpretability of functional dependencies such as negation, where we demonstrate that our method first attends to the negated phrase, constructing an intermediate representation, which is then recursively transformed in the next iteration.
Sordoni et al. (2016) introduced the iterative attention mechanism for question answering, where attention alternates between the question and the document, and the query is updated in each step by a GRU cell. The model combines the weights obtained throughout the iterations to select the final answer, similar to the attention sum reader of Kadlec et al. (2016) and pointer networks of Vinyals et al. (2015).
We believe there is much to gain from the iterative attention mechanism by eliminating the direct link between the intermediate representations and the output, allowing the model to construct its own sequential representation of the input. Our model only connects the last attention step to the output, removing the need for intermediate steps to contain all the information relevant for the final decision. Apart from Sordoni et al. (2016), related work closest to ours consists of concepts of multi-head attention (Lin et al., 2017; Vaswani et al., 2017), in which all queries are generated at once; pairwise attention (Cui et al., 2017), where attention is applied to multiple inputs but is not applied iteratively; and hierarchical iterative attention (Yang et al., 2016), where the authors first use an intra-sentence attention mechanism and then combine the intermediate representations with inter-sentence attention. In contrast to their work, we do not predetermine the level on which the attention is applied: in each iteration the mechanism can focus on any element of the input sequence.

Model
Throughout the experiments, we will use two variants of our model: (1) the vanilla model and (2) the full model. The vanilla model contains the bare minimum of components needed for the attention mechanism to function as intended. The purpose of the vanilla model is to eliminate any additional confounders for the performance and showcase the interpretability of the model. For the full model, we extend the vanilla model with additional deep learning components commonly employed in state-of-the-art models, to showcase the performance of the model when given capacity akin to competing models.
In both versions of the model, data is processed in three phases: (1) encoding phase, which contextualizes the word representations; (2) attention phase, which uses iterative recursive attention to isolate and combine the different parts of the input; and (3) classification phase, where the learned representation is fed as an input to a classifier.
The vanilla and full model differ only in the encoding phase, while our proposed attention mechanism is employed only in the second phase. We begin with a detailed account of the proposed attention mechanism and its regularization, and continue with a description of the remaining components, highlighting the differences between the vanilla and full models. Fig. 1 shows the architecture of the iterative attention mechanism. The mechanism uses a recurrent network, dubbed the controller, to refine the attention query throughout T iterations.

Iterative Recursive Attention
Inputs to the mechanism are an initial query $x$ and a set of hidden states $H = [h_1, \ldots, h_N]$ constituting the input sequence, both obtained from the encoding step.
As the controller, we use a gated recurrent unit (GRU) cell. The input to the controller is the transformed result of the previous query, while the hidden state is the previous query.
For the attention mechanism we use bilinear attention (Luong et al., 2015):
$$a_i^{(t)} = \mathrm{softmax}_i\left(q^\top W h_i\right),$$
where $q$ is the current query vector, $W \in \mathbb{R}^{d_q \times d_h}$ is a parameter matrix, and $d_q$ and $d_h$ are the dimensionalities of the query and the hidden state, respectively. The attention weights are then used to compute the input summary in timestep $t$ as a linear combination of the hidden states:
$$\hat{s}^{(t)} = \sum_i a_i^{(t)} h_i.$$
As we intend to use $\hat{s}^{(t)}$ in the next iteration of the attention mechanism, we need to allow the network the capacity to discern between the new additions and original inputs. To this end, we use a highway network (Srivastava et al., 2015), which gives the model the option to pass subsets of the summary as-is or transform them with a nonlinearity. If the summary is not transformed with a nonlinearity, it ends up being merely a linear combination of the hidden states, and we gain no information from adding it to the sequence.
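The mechanism described in this section can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the function names, the `tanh` nonlinearity inside the highway layer, the bias-free GRU, and the random weights are all hypothetical choices made for the example. The loop appends each transformed summary to the set of states, so later iterations can attend over it, and records the attention weights in a matrix with $N + T - 1$ columns.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def highway(x, W_h, b_h, W_g, b_g):
    """One highway layer: the gate g chooses, per dimension, between a
    nonlinear transform of x and passing x through unchanged."""
    g = sigmoid(W_g @ x + b_g)
    return g * np.tanh(W_h @ x + b_h) + (1.0 - g) * x

def gru_cell(x, h, P):
    """Standard GRU update (biases omitted); P is a dict of weight matrices."""
    z = sigmoid(P["Wz"] @ x + P["Uz"] @ h)              # update gate
    r = sigmoid(P["Wr"] @ x + P["Ur"] @ h)              # reset gate
    h_tilde = np.tanh(P["Wh"] @ x + P["Uh"] @ (r * h))  # candidate state
    return (1.0 - z) * h + z * h_tilde

def iram(H, q0, W, hw, gru, T):
    """H: (N, d_h) hidden states, q0: (d_q,) initial query, W: (d_q, d_h).
    Returns the last summary and the (T, N + T - 1) attention matrix."""
    N = H.shape[0]
    states, q = H.copy(), q0
    A = np.zeros((T, N + T - 1))
    for t in range(T):
        a = softmax(states @ (W.T @ q))   # bilinear scores q^T W h_i
        A[t, :a.size] = a                 # unavailable summaries stay at zero
        s = highway(a @ states, *hw)      # transform the linear combination
        if t < T - 1:
            states = np.vstack([states, s])  # summary becomes attendable
            q = gru_cell(s, q, gru)          # controller refines the query
    return s, A

# Toy run with random weights (illustrative sizes only).
rng = np.random.default_rng(0)
N, d_h, d_q, T = 5, 4, 3, 3
H = rng.normal(size=(N, d_h))
q0 = rng.normal(size=d_q)
W = rng.normal(size=(d_q, d_h))
hw = (rng.normal(size=(d_h, d_h)), np.zeros(d_h),
      rng.normal(size=(d_h, d_h)), np.ones(d_h))
gru = {k: rng.normal(size=(d_q, d_h if k[0] == "W" else d_q))
       for k in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}
s, A = iram(H, q0, W, hw, gru, T)
```

Each row of `A` sums to one, and the columns of summaries that do not yet exist in a given iteration remain zero.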

Attention Regularization
Ideally, we want the model to focus on different task-related aspects of the input in each iteration. However, the model is in no way incentivized to learn to propagate information through the summaries and can in principle focus on the same segment in each step.
To prevent this from happening, and to push the model to focus on different aspects of the input in every step, we regularize it by minimizing the pairwise dot products between all iterations of attention:
$$\mathcal{L}_{\mathrm{reg}} = \frac{\gamma}{2T} \sum_{i \neq j} \left(A A^\top\right)_{ij},$$
where $\gamma$ is a hyperparameter determining the regularization strength and $A \in \mathbb{R}^{T \times (N+T-1)}$ is a matrix containing the attention weights generated in $T$ steps over $N$ inputs by the iterative attention mechanism. The matrix has $N + T - 1$ columns to account for attention over $T - 1$ added summaries, as the summary generated in the last iteration cannot be attended over. In each row $t$, the matrix has $T - t - 1$ trailing zeroes, corresponding to summaries that are not yet available in iteration $t$.
Concretely, the attention weight vector in row $t$ of the matrix $A$ consists of the weights over the $N$ inputs, followed by the weights over the summaries available in iteration $t$, followed by the trailing zeroes:
$$A_t = \left[\, a^{(t)}_1, \ldots, a^{(t)}_N,\; a^{(t)}_{N+1}, \ldots,\; 0, \ldots, 0 \,\right], \tag{4}$$
resulting in each element $(i, j)$ of the regularization matrix $AA^\top$ storing the dot product between the attention weights in iterations $i$ and $j$. The regularization expression is a sum over all off-diagonal elements. The diagonal elements are dot products of attention weights in the same iteration, so we ignore them. We scale by $\frac{1}{2}$ to account for the symmetrical elements in $AA^\top$ and by $\frac{1}{T}$ to account for the number of dot product comparisons.
We note that, while this regularization penalty does encourage the model to focus on different elements of the input sequence, there is still a trivial way for the model to minimize the penalty without learning a meaningful behavior. Since the attention weight over the summary generated in iteration $t$ is zero in all earlier iterations $t' < t$, the model can simply attend over any elements of the input sequence in the first iteration, and afterwards propagate the information forward by fully attending only over the summary generated in the previous iteration. We will illustrate this behavior with concrete examples in Section 4.
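As a small worked example, the penalty can be computed directly from the attention matrix. The sketch below follows the verbal description in this section (sum of the off-diagonal entries of $AA^\top$, scaled by $\frac{1}{2}$ and $\frac{1}{T}$); the function name and the two toy matrices are hypothetical. The first matrix attends to the same input in every step, while the second reproduces the trivial pass-through strategy noted above, which drives the penalty to zero.

```python
import numpy as np

def attention_penalty(A, gamma):
    """gamma / (2T) times the sum of the off-diagonal entries of A A^T."""
    T = A.shape[0]
    G = A @ A.T                       # G[i, j] = dot product of rows i and j
    off_diag = G.sum() - np.trace(G)  # drop same-iteration products
    return gamma * off_diag / (2.0 * T)

# T = 3 iterations over N = 4 inputs, so A has N + T - 1 = 6 columns.
# Columns 0-3 are inputs, column 4 is the first summary, column 5 the second.
same_focus = np.array([[1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                       [1.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                       [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]])

# Trivial strategy: look at the inputs once, then only at the latest summary.
pass_through = np.array([[0.5, 0.5, 0.0, 0.0, 0.0, 0.0],
                         [0.0, 0.0, 0.0, 0.0, 1.0, 0.0],
                         [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]])

penalty_same = attention_penalty(same_focus, gamma=1.0)
penalty_trivial = attention_penalty(pass_through, gamma=1.0)
```

With `gamma = 1`, repeating the same focus yields a penalty of 1, whereas the pass-through pattern yields 0 because its rows are mutually orthogonal.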

Vanilla Encoder
For training, the inputs of the encoding phase are a sequence of words $x = [w_1, \ldots, w_N]$ and a class label $y$. The encoder of the vanilla model maps the word indices to dense vector representations using pretrained GloVe vectors (Pennington et al., 2014). The sequence of word vectors is then fed as input to a bidirectional long short-term memory (BiLSTM) network (Hochreiter and Schmidhuber, 1997). The outputs of the BiLSTM are used as the input sequence to the iterative attention step, while the cell state in the last timestep is used as the initial query.

Full Encoder
There are three key differences between the full encoder and the vanilla encoder. The full encoder uses (1) character n-gram embeddings, (2) an additional highway network, whose task is to fine-tune the word embeddings, and (3) an additional layer of BiLSTM, followed by a highway layer to construct the initial query. For extensions (1) and (2), we took inspiration from McCann et al. (2017), who also use both components. However, unlike McCann et al. (2017), who used a ReLU feedforward network to fine-tune the embeddings for the task, we use a highway network, which we found performs better.
The pretrained character n-gram vectors obtained from Hashimoto et al. (2016) are first averaged over all character n-grams for a given word and then concatenated to the GloVe embedding. Further on, before feeding the sequence of word embeddings to a recurrent model, we use a two-layer highway network (Srivastava et al., 2015) to fine-tune the embeddings for the task, which is especially beneficial when the input vectors are kept fixed.
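The construction of the word representation can be illustrated with a minimal NumPy sketch. The function name is hypothetical and the random arrays merely stand in for pretrained vectors; the 300 and 100 dimensions follow the embedding sizes reported in the experimental setup. The subsequent highway fine-tuning is not shown.

```python
import numpy as np

def embed_word(glove_vec, ngram_vecs):
    """Average the character n-gram vectors of a word and append the
    result to the word's GloVe vector."""
    return np.concatenate([glove_vec, np.mean(ngram_vecs, axis=0)])

rng = np.random.default_rng(0)
glove_vec = rng.normal(size=300)        # placeholder for a pretrained GloVe vector
ngram_vecs = rng.normal(size=(7, 100))  # placeholder: 7 n-gram vectors of one word
word_vec = embed_word(glove_vec, ngram_vecs)  # 400-dimensional representation
```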
To contextualize the input sequence and produce an initial attention query, we use a bidirectional long short-term memory (BiLSTM) network. We split the network conceptually into two parts: the lower $l_{ctx}$ layers are used to transform the input sequence of word embeddings into a sequence of contextualized word representations, while the upper $l_{query}$ layers are used to read and comprehend the now-transformed sequence and capture its relevant aspects into a single vector. The rationale for the split is that recurrent networks are often required to tackle two tasks at once: contextualize the input and comprehend the whole sequence. Intuitively, the split should incite a division of labor between the two parts of the network: the contextualization network only has to memorize the local information specific to each word (e.g., verb tense, noun gender) in order to transform its representation, while the comprehension network needs to model aspects of meaning pertaining to the entire sequence (e.g., the overall sentiment of the sentence, locations of sentiment-bearing phrases).
We use a single $(l_{ctx} + l_{query})$-layered BiLSTM, where the output of the $l_{ctx}$-th layer is used as the contextualized input sequence, while the cell state from the last layer is used as the sequence representation $x$.
Lastly, since the weights of the BiLSTM network are suited toward processing the input sequence rather than preparing the query vector, we add an additional highway layer designed to fine-tune the sentence representation into the initial query.

Classifier
As input to the classifier, we use the summary vector obtained from the last step of iterative attention, $s^{(T)}$. We do so for two reasons: it forces the network to propagate information through the attention steps, and the intermediate summaries, which do not contribute directly toward the classification, need not have the same polarity. The last summary vector is fed into a maxout network (Goodfellow et al., 2013) to obtain the class-conditional probabilities. Fig. 2 shows the full version of the iterative attention mechanism with all of the aforementioned components.
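For readers unfamiliar with maxout, the classification phase can be sketched as follows. This is an illustrative NumPy version, not the authors' code: the function names are hypothetical, and the final linear projection with a softmax is an assumption about how class-conditional probabilities are obtained. The two 200-dimensional layers with a pool size of 4 mirror the configuration given in the experimental setup.

```python
import numpy as np

def maxout_layer(x, W, b):
    """W: (k, d_out, d_in), b: (k, d_out). Each output unit is the
    maximum over its k affine pieces."""
    return (np.einsum("kod,d->ko", W, x) + b).max(axis=0)

def classify(s, layers, W_out, b_out):
    """Stacked maxout layers followed by an assumed softmax output layer."""
    h = s
    for W, b in layers:
        h = maxout_layer(h, W, b)
    logits = W_out @ h + b_out
    e = np.exp(logits - logits.max())  # numerically stable softmax
    return e / e.sum()

rng = np.random.default_rng(0)
d_in, d_hid, k, n_classes = 8, 200, 4, 2   # d_in is an arbitrary toy size
layers = [
    (rng.normal(scale=0.01, size=(k, d_hid, d_in)), np.zeros((k, d_hid))),
    (rng.normal(scale=0.01, size=(k, d_hid, d_hid)), np.zeros((k, d_hid))),
]
W_out = rng.normal(scale=0.01, size=(n_classes, d_hid))
last_summary = rng.normal(size=d_in)       # stands in for the final summary
probs = classify(last_summary, layers, W_out, np.zeros(n_classes))
```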

Datasets
We test IRAM on two sentiment classification datasets. The first is the Stanford Sentiment Treebank (SST) (Socher et al., 2013), a dataset derived from movie reviews on Rotten Tomatoes and containing 11,855 sentences labeled into five classes at the sentence level and at the level of each node in the constituency parse tree. The binary version with the neutral class removed contains 56,400 instances, while the fine-grained version with scores ranging from 1 (very negative) to 5 (very positive) contains 94,200 text-sentiment pairs. The second dataset is the Internet Movie Database (IMDb) (Maas et al., 2011), containing 22,500 multi-sentence reviews extracted from positive and negative reviews. We truncate each sentence from this dataset to a maximum length of 200 tokens.
Firstly, we demonstrate and analyze how each component in the vanilla model contributes to the performance and interpretability. We then analyze the full model and evaluate it on the aforementioned datasets.

Experimental Setup
Unless stated otherwise, all weights are initialized from a Gaussian distribution with zero mean and standard deviation of 0.01. We use the Adam optimizer (Kingma and Ba, 2014) with the AMSGrad modification (Reddi et al., 2018) and a learning rate of α = 0.0003. We clip the global norm of the gradients to 1.0 and set weight decay to 0.00003.
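Clipping by global norm rescales all gradients jointly so that their combined L2 norm does not exceed the threshold. A minimal NumPy sketch of this standard operation is given below; the function name and the toy gradients are hypothetical, and deep learning frameworks provide their own equivalents.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale a list of gradient arrays so their joint L2 norm is at most
    max_norm. Returns the scaled gradients and the original norm."""
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total + 1e-12))  # no-op if already small
    return [g * scale for g in grads], total

grads = [np.array([3.0]), np.array([4.0])]   # joint norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
```

Because a single scale factor is applied to every array, the direction of the overall update is preserved.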
We use 300-dimensional GloVe word embeddings trained on the Common Crawl corpus and 100-dimensional character embeddings. We follow the recommendation of Mu et al. (2017) and standardize the embeddings. Dropout of 0.1 is applied to the word embedding matrix.
For both datasets, we set $l_{ctx} = 2$ and $l_{query} = 1$. The highway network for fine-tuning the input embeddings has two layers, while the ones fine-tuning the query and the summary have a single layer. The gate biases of all highway networks are initialized to 1, as recommended by Srivastava et al. (2015), as are the biases of the LSTM forget gates.
The maxout network uses two 200-dimensional layers with a pool size of 4. Throughout our experiments, we selected the batch size from {32, 64}, dropout for the recurrent layers and the maxout classifier from {0.1, 0.2, 0.3, 0.4}, and the LSTM hidden state size from {400, 500, 1000}. The word and character n-gram vectors are kept fixed for SST but are learned for the IMDb dataset. These parameters are optimized using cross-validation, and the best configuration is run on the test set. As IMDb has no official validation set, we randomly select 10% of the dataset and use it for all of the experiments. The values of other hyperparameters were selected through non-exhaustive search.

Analysis of the Vanilla Model
The vanilla model defined in Section 3.3 has two main confounding variables: strength (and presence) of attention regularization (γ) and the number of iterations of the iterative recursive attention mechanism (T). We would also like to examine the difference in performance of the vanilla IRAM compared to a baseline sequence classifier. To this end, we implement a baseline model without the attention mechanism: a maxout classifier over the last hidden state of the encoder BiLSTM. To keep the running time of the experiments feasible, in this section we use only the binary SST dataset.
Effect of regularization. For each experiment in this round, we run every model three times with different random seeds and report the average results along with the standard deviations across the experiments. In Fig. 3 we present the comparison between the performance of the vanilla model with and without regularization. A more telling sign of  In Fig. 4 we can see that the attention mechanism, when not regularized, fails to use its capacity and simply attends over the same element in each timestep. The last two columns, which contain the summaries from the first two steps of the iterative attention mechanism, have an attention weight of 0, which means that the model does not pass any information through the summaries nor refine the query. This behavior initially prompted us to add the regularization penalty term.
Through inexhaustive search we isolated a critical range of values for γ, for which we perform a detailed analysis of performance. For this experiment, we fix T = 3 as it has exhibited better performance for the vanilla model.
Effect of the number of iterations. Apart from comparing the effect of the existence of regularization, in Fig. 3 we can also observe the effect of the number of timesteps T. Increasing T beyond 3 reduces classification performance, something which we find to be consistent for the IMDb dataset as well.
We attribute this decrease in performance to the fact that SST is relatively simple, containing at most two contrastive aspects in each sentence, making any additional steps unnecessary. While the model could in theory exploit the pass-through mechanism, we believe that this operation adds some noise to the final representations and in turn affects performance slightly.
Table 1: Accuracy on the test portions of the SST and IMDb datasets
Dataset | Model | Accuracy
SST | NSE (Munkhdalai and Yu, 2017) | 89.7
SST | IRAM | 90.1
SST | BCN + CoVe (McCann et al., 2017) | 90.3
SST | bmLSTM (Radford et al., 2017) | 91.8
SST-5 | IRAM | 53.7
SST-5 | BCN + CoVe (McCann et al., 2017) | 53.7
SST-5 | BCN + ELMo (Peters et al., 2018) | 54.7
IMDb | IRAM | 91.2
IMDb | TRNN (Dieng et al., 2016) | 93.8
IMDb | oh-LSTM (Johnson and Zhang, 2016) | 94.1
IMDb | Virtual (Miyato et al., 2016) | 94.1
Table 2: Effect of removing components on performance

Analysis of the Full Model
We now evaluate the full model. Table 1 shows the accuracy scores of our best models (for T = 3, γ = 0.0003) and other state-of-the-art models on the test portions of the SST and IMDb datasets. Our model performs competitively with the best results on SST and SST-5 datasets. It is important to note that our model does not use transfer learning apart from the pretrained word vectors, which is not the case for the competing models.
Ablation study of encoder components. As mentioned in Section 3.4, through adding various components to the model we introduced a number of confounders. In order to determine the effect of each of the added components on the overall score, we evaluate the performance of the full model on the binary SST dataset with the remaining hyperparameters fixed and one of the components removed in isolation.

Visualizing Attention
To gain an intuition about the working of IRAM, we visually analyzed its attention mechanism on a number of sentences from our dataset. We limit ourselves to examples from the test set of the SST dataset as the length of examples is manageable for visualization. We isolate three specific cases where the attention mechanism demonstrates interesting results: (1) simple unipolar sentences, (2) sentences with negations, and (3) multipolar sentences. The least interesting case is the unipolar, as the attention mechanism often does not need multiple iterations. Fig. 6a shows the attention mechanism simply propagating information, since sentiment classification is straightforward and does not require multiple attention steps. This can be seen from most of the attention weight in the second and third steps being on the columns corresponding to the summaries.
The more interesting cases are sentences involving negations and modifiers. Fig. 6b shows the handling of negation: attention is initially on all words except on the negator. In the second step, the mechanism combines the output of the first step with the negation. We interpret this as flipping the sentiment: the model cannot rely solely on recognizing a negative word, and has to account for what that word negates through a functional dependence. These examples highlight one of the drawbacks of recurrent networks which we aim to alleviate. When a standard attention mechanism is applied to a sentence containing a negator, the hidden representation of the negator has to scale or negate the intensity of an expression. Our model has the capacity to process such sequences iteratively, first constructing the representation of an expression, which is then adjusted by the nonlinear transformation and simpler to combine with the negator in the next step.
Lastly, Fig. 6c shows a contrastive multipolar sentence, where the model in the first step focuses on positive words, and then combines the negative words (tortured, unsettling) with the results of the first step. In such cases, the model succeeds in isolating the contrasting aspects of the sentence and attends to them in different iterations, alleviating the burden of simultaneously representing the positive and negative aspects. After both contrastive representations have been formed, the model has the capacity to weigh them against each other and compute the final representation.

Conclusion
The proposed iterative recursive attention model (IRAM) has the capacity to construct representations of the input sequence in a recursive fashion, making inference more interpretable. We demonstrated that the model can learn to focus on various task-relevant parts of the input, and can propagate the information in a meaningful way to handle the more difficult cases. On the sentiment analysis task, the model performs comparably to the state of the art. Our next goals are to use the iterative attention mechanism to extract tree-like sentence structures akin to constituency parse trees, to evaluate the model on more complex datasets, and to extend the model to support an adaptive number of iterative steps.