Visualizing and Understanding Neural Models in NLP

While neural networks have been successfully applied to many NLP tasks, the resulting vector-based models are very difficult to interpret. For example, it's not clear how they achieve compositionality, building sentence meaning from the meanings of words and phrases. In this paper we describe four strategies for visualizing compositionality in neural models for NLP, inspired by similar work in computer vision. We first plot unit values to visualize compositionality of negation, intensification, and concessive clauses, allowing us to see well-known markedness asymmetries in negation. We then introduce three simple and straightforward methods for visualizing a unit's salience, the amount it contributes to the final composed meaning: (1) gradient back-propagation, (2) the variance of a token from the average word node, (3) LSTM-style gates that measure information flow. We test our methods on sentiment using simple recurrent nets and LSTMs. Our general-purpose methods may have wide applications for understanding compositionality and other semantic properties of deep networks, and also shed light on why LSTMs outperform simple recurrent nets.


Introduction
Neural models match or outperform the performance of other state-of-the-art systems on a variety of NLP tasks. Yet unlike traditional feature-based classifiers that assign and optimize weights for a variety of human-interpretable features (parts of speech, named entities, word shapes, syntactic parse features, etc.), the behavior of deep learning models is much less easily interpreted. Deep learning models mainly operate on word embeddings (low-dimensional, continuous, real-valued vectors) through multi-layer neural architectures, each layer of which is characterized as an array of hidden neuron units. It is unclear how deep learning models deal with composition, implementing functions like negation or intensification, or combining meaning from different parts of the sentence, separating the informational wheat from the chaff, to build sentence meaning.
In this paper, we explore multiple strategies to interpret meaning composition in neural models. We employ traditional methods like representation plotting, and introduce simple strategies that use first derivatives to measure how much a neural unit contributes to meaning composition, that is, its 'salience' or importance.
The visualization techniques presented in this work shed light on how neural models work. For example, we illustrate that the LSTM's success is due to its ability to maintain a much sharper focus on the important key words than other models; that composition in multiple clauses works competitively; that the models are able to capture negative asymmetry, an important property of semantic compositionality in natural language understanding; and that there is sharp dimensional locality, with certain dimensions marking negation and quantification in a surprisingly localist manner. Though our attempts only touch superficial points in neural models, and each method has its pros and cons, together they may offer some insights into the behaviors of neural models in language-based tasks, marking one initial step toward understanding how they achieve meaning composition in natural language processing. The next section describes some visualization models in vision and NLP that have inspired this work. We describe datasets and the adopted neural models in Section 3. Different visualization strategies and the corresponding analytical results are presented separately in Sections 4, 5, and 6, followed by a brief conclusion.

A Brief Review of Neural Visualization
Similarity is commonly visualized graphically, generally by projecting the embedding space into two dimensions and observing that similar words tend to be clustered together (e.g., Elman (1989), Ji and Eisenstein (2014), Faruqui and Dyer (2014)). Karpathy et al. (2015) attempt to interpret recurrent neural models from a statistical point of view and do not deeply touch on the compositionality of meaning. Other relevant attempts include (Fyshe et al., 2015; Faruqui et al., 2015).
Methods for interpreting and visualizing neural models have been much more extensively explored in vision, especially for Convolutional Neural Networks (CNNs or ConvNets) (Krizhevsky et al., 2012), multi-layer neural networks in which the original matrix of image pixels is convolved and pooled as it is passed on to hidden layers. ConvNet visualization techniques consist mainly in mapping the different layers of the network (or other features like SIFT (Lowe, 2004) and HOG (Dalal and Triggs, 2005)) back to the initial image input, thus capturing the human-interpretable information they represent in the input, and how units in these layers contribute to any final decisions (Simonyan et al., 2013; Mahendran and Vedaldi, 2014; Nguyen et al., 2014; Szegedy et al., 2013; Girshick et al., 2014; Zeiler and Fergus, 2014). Such methods include:
(1) Inversion: Inverting the representations by training an additional model to project outputs from different neural levels back to the initial input images (Mahendran and Vedaldi, 2014; Vondrick et al., 2013; Weinzaepfel et al., 2011). The intuition behind reconstruction is that the pixels that are reconstructable from the current representations are the content of the representation. The inverting algorithms allow the current representation to align with corresponding parts of the original images.
(2) Back-propagation (Erhan et al., 2009; Simonyan et al., 2013) and Deconvolutional Networks (Zeiler and Fergus, 2014): Errors are back-propagated from output layers to each intermediate layer and finally to the original image inputs. Deconvolutional Networks work in a similar way by projecting outputs back to initial inputs layer by layer, each layer associated with one supervised model for projecting upper ones to lower ones. These strategies make it possible to spot active regions or ones that contribute the most to the final classification decision.
(3) Generation: This group of work generates images in a specific class from a sketch guided by already trained neural models (Szegedy et al., 2013;Nguyen et al., 2014). Models begin with an image whose pixels are randomly initialized and mutated at each step. The specific layers that are activated at different stages of image construction can help in interpretation.
While the above strategies inspire the work we present in this paper, there are fundamental differences between vision and NLP. In NLP words function as basic units, and hence (word) vectors rather than single pixels are the basic units. Sequences of words (e.g., phrases and sentences) are also presented in a more structured way than arrangements of pixels. In parallel to our research, independent research (Karpathy et al., 2015) has explored a similar direction from an error-analysis point of view, by analyzing predictions and errors from a recurrent neural model. Other distantly relevant works include Murphy et al. (2012) and Fyshe et al. (2015), who used a manual task to quantify the interpretability of semantic dimensions by presenting human users with a list of words and asking them to choose the one that does not belong to the list. Faruqui et al. (2015). Similar

Stanford Sentiment Treebank
The Stanford Sentiment Treebank is a benchmark dataset widely used for neural model evaluations. The dataset contains gold-standard sentiment labels for every parse tree constituent, from sentences to phrases to individual words, for 215,154 phrases in 11,855 sentences. The task is to perform both fine-grained (very positive, positive, neutral, negative and very negative) and coarse-grained (positive vs. negative) classification at both the phrase and sentence level. For more details about the dataset, please refer to Socher et al. (2013).
While many studies on this dataset use recursive parse-tree models, in this work we employ only standard sequence models (RNNs and LSTMs), since these are the most widely used current neural models and sequential visualization is more straightforward. We therefore first transform each parse tree node to a sequence of tokens. The sequence is then mapped to a phrase/sentence representation and fed into a softmax classifier. Phrase/sentence representations are built with the following three models: standard recurrent sequence models with tanh activation functions, LSTMs, and bidirectional LSTMs. For details about the three models, please refer to the Appendix.
Training. AdaGrad with mini-batches was used for training, with parameters (L2 penalty, learning rate, mini-batch size) tuned on the development set. The number of iterations is treated as a variable to tune, and parameters are harvested based on the best performance on the dev set. The number of dimensions for the word and hidden layers is set to 60, with a dropout rate of 0.1. The standard recurrent model achieves 0.429 (fine-grained) and 0.850 (coarse-grained) accuracy at the sentence level; the LSTM achieves 0.469 and 0.870, and the bidirectional LSTM 0.488 and 0.878, respectively.

Sequence-to-Sequence Models
SEQ2SEQ models are neural models that aim to generate a sequence of output text given inputs. In principle, SEQ2SEQ models can be adapted to any NLP task that can be formalized as predicting outputs given inputs, and they serve different purposes depending on the inputs and outputs: e.g., machine translation, where inputs correspond to source sentences and outputs to target sentences (Luong et al., 2014), or conversational response generation, where inputs correspond to messages and outputs to responses (Vinyals and Le, 2015). SEQ2SEQ models need to be trained on massive amounts of data for the implicit semantic and syntactic relations between pairs to be learned. SEQ2SEQ models map an input sequence to a vector representation using LSTM models and then sequentially predict tokens based on the pre-obtained representation. The model defines a distribution over outputs (Y) and sequentially predicts tokens given inputs (X) using a softmax function.
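Written out, a standard form of this distribution is given below. It is a reconstruction under the assumption that it matches the definitions that follow; $N_X$ and $N_Y$ (input and output lengths) and $y'$ (ranging over the vocabulary) are notational assumptions:

```latex
p(Y \mid X) = \prod_{t=1}^{N_Y} p(y_t \mid x_1, \ldots, x_{N_X}, y_1, \ldots, y_{t-1})
            = \prod_{t=1}^{N_Y} \frac{\exp\big(f(h_{t-1}, e_{y_t})\big)}{\sum_{y'} \exp\big(f(h_{t-1}, e_{y'})\big)}
```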
where $f(h_{t-1}, e_{y_t})$ denotes the activation function between $h_{t-1}$ and $e_{y_t}$, and $h_{t-1}$ is the representation output from the LSTM at time $t-1$. For each time step in word prediction, SEQ2SEQ models combine the current token with previously built embeddings for next-step word prediction.
For ease of visualization, we turn to the most straightforward task, the autoencoder, where inputs and outputs are identical. The goal of an autoencoder is to reconstruct inputs from the pre-obtained representation. We would like to see how individual input tokens affect the overall sentence representation and each of the tokens to predict in outputs. We trained the autoencoder on a subset of the WMT'14 corpus containing 4 million English sentences with an average length of 22.5 words. We followed training protocols described in .

Representation Plotting
We begin with simple plots of representations to shed light on local compositions using Stanford Sentiment Treebank. Figure 1 shows a 60d heatmap vector for the representation of selected words/phrases/sentences, with an emphasis on extent modifications (adverbial and adjectival) and negation. Embeddings for phrases or sentences are attained by composing word representations from the pretrained model.

Local Composition
The intensification part of Figure 1 shows suggestive patterns where values for a few dimensions are strengthened by modifiers like "a lot" (the red bar in the first example), "so much" (the red bar in the second example), and "incredibly". Though the patterns for negation are not as clear, there is still a consistent reversal for some dimensions, visible as a shift between blue and red for the dimensions boxed on the left.
We then visualize words and phrases using t-SNE (Van der Maaten and Hinton, 2008) in Figure 2, deliberately adding in some random words for comparative purposes. As can be seen, neural models nicely learn the properties of local compositionality, clustering negation+positive words ('not nice', 'not good') together with negative words. Note also the asymmetry of negation: "not bad" is clustered more with the negative than the positive words (as shown in both Figures 1 and 2). This asymmetry has been widely discussed in linguistics, for example as arising from markedness, since 'good' is the unmarked direction of the scale (0; Horn, 1989; Fraenkel and Schul, 2008). This suggests that although the model does seem to focus on certain units for negation in Figure 1, the neural model is not just learning to apply a fixed transform for 'not' but is able to capture the subtle differences in the composition of different words.
Concessive Sentences. In concessive sentences, two clauses have opposite polarities, usually related by a contrary-to-expectation implicature. We plot evolving representations over time for two concessives in Figure 3. The plots suggest:
1. For tasks like sentiment analysis whose goal is to predict a specific semantic dimension (as opposed to general tasks like language model word prediction), too large a dimensionality leaves many dimensions non-functional (with values close to 0), causing two sentences of opposite sentiment to differ only in a few dimensions. This may explain why more dimensions don't necessarily lead to better performance on such tasks (for example, Socher et al. (2013) report that optimal performance is achieved when word dimensionality is set to between 25 and 35).
2. Both sentences contain two clauses connected by the conjunction "though". Such two-clause sentences might work either collaboratively (models would remember the word "though" and make the second clause share the same sentiment orientation as the first) or competitively, with the stronger one dominating. The region within the dotted line in Figure 3(a) favors the second assumption: the difference between the two sentences is diluted when the final words ("interesting" and "boring") appear. In Figure 4 we explore this clause composition in more detail. Representations move closer to the negative sentiment region when negative clauses like "although it had bad acting" or "but it is too long" are added to the end of a simply positive "I like the movie". By contrast, adding a concessive clause to a negative clause does not move the representation toward the positive; "I hate X but ..." is still very negative, not that different from "I hate X". This difference again suggests the model is able to capture negative asymmetry (0; Horn, 1989; Fraenkel and Schul, 2008).

First-Derivative Saliency
In this section, we describe another strategy, which is inspired by the back-propagation strategy in vision (Erhan et al., 2009; Simonyan et al., 2013). It measures how much each input unit contributes to the final decision, which can be approximated by first derivatives.
More formally, for a classification model, an input $E$ is associated with a gold-standard class label $c$. (Depending on the NLP task, an input could be the embedding for a word or a sequence of words, while labels could be POS tags, sentiment labels, the next word index to predict, etc.) Given embeddings $E$ for input words with the associated gold class label $c$, the trained model associates the pair $(E, c)$ with a score $S_c(E)$. The goal is to decide which units of $E$ make the most significant contribution to $S_c(E)$, and thus to the decision, the choice of class label $c$.
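In the case of deep neural models, the class score $S_c(e)$ for an input embedding $e$ is a highly non-linear function. We approximate $S_c(e)$ with a linear function of $e$ by computing the first-order Taylor expansion
$$S_c(e) \approx w(e)^T e + b \quad (1)$$
where $w(e)$ is the derivative of $S_c$ with respect to the embedding $e$:
$$w(e) = \frac{\partial S_c}{\partial e}\Big|_{e} \quad (2)$$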
The magnitude (absolute value) of the derivative indicates the sensitivity of the final decision to a change in one particular dimension, telling us how much one specific dimension of the word embedding contributes to the final decision. The saliency score is given by
$$S(e) = |w(e)| \quad (3)$$
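The saliency score above can be sketched in code. This is a minimal illustration rather than the implementation used in the experiments: the tiny tanh recurrent classifier, its random weights, and the names `class_score` and `saliency` are assumptions made for the example, and central finite differences stand in for back-propagation so that the sketch needs only the standard library.

```python
import math
import random

def class_score(embeddings, W_h, W_e, U, c):
    """Score S_c(E): tanh recurrent pass over the word embeddings,
    followed by a linear read-out for class c (assumed toy model)."""
    h = [0.0] * len(W_h)
    for e in embeddings:
        h = [
            math.tanh(
                sum(W_h[i][j] * h[j] for j in range(len(h)))
                + sum(W_e[i][k] * e[k] for k in range(len(e)))
            )
            for i in range(len(W_h))
        ]
    return sum(U[c][i] * h[i] for i in range(len(h)))

def saliency(embeddings, W_h, W_e, U, c, eps=1e-5):
    """Return |dS_c/de| for every dimension of every word embedding,
    estimated with central finite differences."""
    scores = []
    for t, e in enumerate(embeddings):
        row = []
        for k in range(len(e)):
            plus = [list(v) for v in embeddings]
            minus = [list(v) for v in embeddings]
            plus[t][k] += eps
            minus[t][k] -= eps
            grad = (class_score(plus, W_h, W_e, U, c)
                    - class_score(minus, W_h, W_e, U, c)) / (2 * eps)
            row.append(abs(grad))
        scores.append(row)
    return scores

# Hypothetical toy setup: 4 words, 4-d embeddings, 3 hidden units, 2 classes.
rng = random.Random(0)
dim, hidden, n_classes = 4, 3, 2
W_h = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(hidden)]
W_e = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(hidden)]
U = [[rng.uniform(-0.5, 0.5) for _ in range(hidden)] for _ in range(n_classes)]
sentence = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(4)]

# One row per word, one column per embedding dimension (a heatmap-ready grid).
sal = saliency(sentence, W_h, W_e, U, c=1)
```

In practice the gradient would come from a single backward pass through the trained model; the finite-difference loop here only keeps the example dependency-free.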

Results on Stanford Sentiment Treebank
We first illustrate results on the Stanford Sentiment Treebank. We plot in Figures 5, 6 and 7 the saliency scores (the absolute value of the derivative of the loss function with respect to each dimension of all word inputs) for three sentences, applying the trained model to each sentence. Each row corresponds to the saliency scores for the corresponding word representation, with each grid cell representing one dimension. The examples are based on the clear sentiment indicator "hate", which lends them all negative sentiment.
"I hate the movie": All three models assign high saliency to "hate" and dampen the influence of other tokens. The LSTM offers a clearer focus on "hate" than the standard recurrent model, but the bidirectional LSTM shows the clearest focus, attaching almost zero emphasis to words other than "hate". This is presumably due to the gate structures in LSTMs and Bi-LSTMs that control information flow, making these architectures better at filtering out less relevant information.
"I hate the movie that I saw last night": All three models assign the correct sentiment. The simple recurrent models again do poorly at filtering out irrelevant information, assigning too much salience to words unrelated to sentiment. However, none of the models suffers from the vanishing gradient problem despite this sentence being longer; the salience of "hate" still stands out after the 7-8 composition steps that follow it.
"I hate the movie though the plot is interesting": The simple recurrent model emphasizes only the second clause "the plot is interesting", assigning no credit to the first clause "I hate the movie". This might seem to be caused by a vanishing gradient, yet the model correctly classifies the sentence as very negative, suggesting that it is successfully incorporating information from the first negative clause. We separately tested the individual clause "though the plot is interesting". The standard recurrent model confidently labels it as positive. Thus, despite the lower saliency scores for words in the first clause and the higher saliency scores of the later words, the simple recurrent system manages to rely on the first clause and downplay the information from the latter positive clause. This illustrates a limitation of saliency visualization: first-order derivatives don't capture all the information we would like to visualize, perhaps because they are only a rough approximation to individual contributions and might not suffice to deal with highly non-linear cases. By contrast, the LSTM emphasizes the first clause, sharply dampening the influence from the second clause, while the Bi-LSTM focuses on both "hate the movie" and "plot is interesting".
Results on Sequence-to-Sequence Autoencoder
Figure 9 shows the saliency heatmap for the autoencoder in terms of predicting the corresponding token at each time step. We compute first derivatives for each preceding word through back-propagation as decoding goes on. Each grid cell corresponds to the magnitude of the average saliency value for each 1000-dimensional word vector. The heatmaps give a clear overview of the behavior of neural models during decoding. Observations can be summarized as follows:
1. For each time step of word prediction, SEQ2SEQ models manage to link the word to predict back to the corresponding region of the inputs (automatically learning alignments); e.g., the input region centered on the token "hate" exerts more impact when the token "hate" is to be predicted, and similarly for the tokens "movie", "plot" and "boring".
2. Neural decoding combines the previously built representation with the word predicted at the current step. As decoding proceeds, the influence of the initial input on decoding (i.e., tokens in source sentences) gradually diminishes as more previously predicted words are encoded in the vector representations. Meanwhile, the influence of the language model gradually dominates: when the word "boring" is to be predicted, models attach more weight to the earlier predicted tokens "plot" and "is" but less to the corresponding regions in the inputs, i.e., the word "boring" in the inputs.

Average and Variance
For settings where word embeddings are treated as parameters to optimize from scratch (as opposed to using pre-trained embeddings), we propose a second, surprisingly easy and direct way to visualize important indicators. We first compute the average of the word embeddings for all the words within the sentences. The measure of salience or influence for a word is its deviation from this average. The idea is that during training, models would learn to render indicators different from non-indicator words, enabling them to stand out even after many layers of computation. As the figure shows, the variance-based salience measure also does a good job of emphasizing the relevant sentiment words. The model does have shortcomings: (1) it can only be used in scenarios where word embeddings are parameters to learn; (2) it's unclear how well the model is able to visualize local compositionality.
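The deviation-from-average measure can be sketched as follows. This is an illustrative reading of the description above, assuming the deviation is taken per dimension as a squared difference from the sentence average; the function name `variance_salience` and the 2-d toy embeddings are made up for the example.

```python
def variance_salience(embeddings):
    """Squared deviation of each embedding dimension from the average
    embedding of the sentence (one row per word)."""
    n = len(embeddings)
    dim = len(embeddings[0])
    mean = [sum(e[j] for e in embeddings) / n for j in range(dim)]
    return [[(e[j] - mean[j]) ** 2 for j in range(dim)] for e in embeddings]

# Hypothetical 2-d embeddings for "I hate the movie"; real embeddings
# would be the learned parameters of the trained model.
words = ["I", "hate", "the", "movie"]
embeddings = [[0.1, 0.0], [0.9, -0.8], [0.0, 0.1], [0.2, 0.1]]

scores = variance_salience(embeddings)
per_word = {w: sum(row) for w, row in zip(words, scores)}
most_salient = max(per_word, key=per_word.get)
```

With these toy values the sentiment word is the one that lies furthest from the sentence average, which is the behavior the measure relies on.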

Conclusion
In this paper, we offer several methods to help visualize and interpret neural models and to understand how neural models are able to compose meanings, demonstrating asymmetries of negation and explaining some aspects of the strong performance of LSTMs at these tasks.
Though our attempts only touch superficial points in neural models, and each method has its pros and cons, together they may offer some insights into the behaviors of neural models in language-based tasks, marking one initial step toward understanding how they achieve meaning composition in natural language processing. Our future work includes using the results of the visualization to perform error analysis, and understanding strengths and limitations of

Acknowledgement
The authors want to thank Sam Bowman, Percy Liang, Will Monroe, Sida Wang, Chris Manning and other members of the Stanford NLP group, as well as anonymous reviewers for their helpful advice on various aspects of this work. This work was partially supported by NSF Award IIS-1514268. Jiwei Li is supported by a Facebook fellowship, which we gratefully acknowledge. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of NSF or Facebook.
states at time $t$. $h_t$ denotes the hidden vector output by the LSTM model at time $t$, and $e_t$ denotes the word embedding input at time $t$. Here $\sigma$ denotes the sigmoid function, $i_t$, $f_t$ and $o_t$ take values within the range [0, 1], and $\times$ denotes the element-wise product.
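The gating can be sketched as a single time step in code. The sketch assumes the commonly used LSTM formulation, which may differ in detail from the variant used in the experiments; the weight layout, the omission of bias terms, and the names `lstm_step`, `W`, `l_t` and `c_prev` are illustrative assumptions.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(h_prev, c_prev, e_t, W):
    """One LSTM step (assumed standard form, no biases).
    W maps a gate name to a matrix of shape (hidden, hidden + dim)."""
    x = h_prev + e_t  # list concatenation: [h_{t-1}; e_t]

    def affine(M):
        return [sum(w * v for w, v in zip(row, x)) for row in M]

    i_t = [sigmoid(a) for a in affine(W["i"])]    # input gate, in (0, 1)
    f_t = [sigmoid(a) for a in affine(W["f"])]    # forget gate, in (0, 1)
    o_t = [sigmoid(a) for a in affine(W["o"])]    # output gate, in (0, 1)
    l_t = [math.tanh(a) for a in affine(W["l"])]  # candidate update
    # Element-wise products combine the old cell state and the new candidate.
    c_t = [f * c + i * l for f, c, i, l in zip(f_t, c_prev, i_t, l_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t, (i_t, f_t, o_t)

# Hypothetical toy run: 2 hidden units, 3-d word embeddings, 2 tokens.
rng = random.Random(0)
hidden, dim = 2, 3
W = {g: [[rng.uniform(-1, 1) for _ in range(hidden + dim)]
         for _ in range(hidden)]
     for g in ("i", "f", "o", "l")}
h, c = [0.0] * hidden, [0.0] * hidden
for e_t in [[0.5, -0.1, 0.3], [0.0, 0.2, -0.4]]:
    h, c, gates = lstm_step(h, c, e_t, W)
```

Returning the gate activations alongside the hidden state makes it straightforward to plot them per token as a measure of information flow.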
A multi-layer LSTM model works in the same way as a multi-layer recurrent model, by enabling composition across multiple layers.
Bidirectional Models (Schuster and Paliwal, 1997) add bidirectionality to the recurrent framework, where embeddings for each time step are calculated both forward and backward. Normally, bidirectional models feed the concatenation of the vectors calculated from both directions, $[e^{\leftarrow}_{1}, e^{\rightarrow}_{N_S}]$, to the classifier. Bidirectional models can be similarly extended to both multi-layer and LSTM versions.
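The concatenation fed to the classifier can be sketched as follows. The helper names `run_rnn` and `bidirectional_representation` and the toy step function are assumptions for illustration; any recurrent or LSTM step could be plugged in for `step_fwd` and `step_bwd`.

```python
import math

def run_rnn(embeddings, step, h0):
    """Apply `step` token by token and return the state after every token."""
    states, h = [], h0
    for e in embeddings:
        h = step(h, e)
        states.append(h)
    return states

def bidirectional_representation(embeddings, step_fwd, step_bwd, h0):
    forward = run_rnn(embeddings, step_fwd, h0)          # reads tokens 1..N
    backward = run_rnn(embeddings[::-1], step_bwd, h0)   # reads tokens N..1
    # backward[-1] is the backward state at token 1,
    # forward[-1] is the forward state at token N.
    return backward[-1] + forward[-1]  # list concatenation

# Hypothetical toy step: element-wise tanh(h + e), with equal dimensions.
def toy_step(h, e):
    return [math.tanh(a + b) for a, b in zip(h, e)]

rep = bidirectional_representation([[1.0], [0.0]], toy_step, toy_step, [0.0])
```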