Evaluating neural network explanation methods using hybrid documents and morphosyntactic agreement

The behavior of deep neural networks (DNNs) is hard to understand. This makes it necessary to explore post hoc explanation methods. We conduct the first comprehensive evaluation of explanation methods for NLP. To this end, we design two novel evaluation paradigms that cover two important classes of NLP problems: small context and large context problems. Both paradigms require no manual annotation and are therefore broadly applicable. We also introduce LIMSSE, an explanation method inspired by LIME that is designed for NLP. We show empirically that LIMSSE, LRP and DeepLIFT are the most effective explanation methods and recommend them for explaining DNNs in NLP.


Introduction
DNNs are complex models that combine linear transformations with different types of nonlinearities. If the model is deep, i.e., has many layers, then its behavior during training and inference is notoriously hard to understand. This is a problem for both scientific methodology and real-world deployment. Scientific methodology demands that we understand our models. In the real world, a decision (e.g., "your blog post is offensive and has been removed") by itself is often insufficient; in addition, an explanation of the decision may be required (e.g., "our system flagged the following words as offensive"). The European Union plans to mandate that intelligent systems used for sensitive applications provide such explanations (European General Data Protection Regulation, expected 2018, cf. Goodman and Flaxman (2016)).
A number of post hoc explanation methods for DNNs have been proposed. Due to the complexity of the DNNs they explain, these methods are necessarily approximations and come with their own sources of error. At this point, it is not clear which of these methods to use when reliable explanations for a specific DNN architecture are needed.
Definitions. (i) A task method solves an NLP problem, e.g., a GRU that predicts sentiment.
(ii) An explanation method explains the behavior of a task method on a specific input. For our purpose, it is a function φ(t, k, X) that assigns real-valued relevance scores for a target class k (e.g., positive) to positions t in an input text X (e.g., "great food"). For this example, an explanation method might assign: φ(1, k, X) > φ(2, k, X).
(iii) An (explanation) evaluation paradigm quantitatively evaluates explanation methods for a task method, e.g., by assigning them accuracies.
Contributions. (i) We present novel evaluation paradigms for explanation methods for two classes of common NLP tasks (see §2). Crucially, neither paradigm requires manual annotations and our methodology is therefore broadly applicable.
(ii) Using these paradigms, we perform a comprehensive evaluation of explanation methods for NLP (§3). We cover the most important classes of task methods, RNNs and CNNs, as well as the recently proposed Quasi-RNNs.

Evaluation paradigms
In this section, we introduce two novel evaluation paradigms for explanation methods on two types of common NLP tasks, small context tasks and large context tasks. Small context tasks are defined as those that can be solved by finding short, self-contained indicators, such as words and phrases, and weighing them up (i.e., tasks where CNNs with pooling can be expected to perform well). We design the hybrid document paradigm for evaluating explanation methods on small context tasks. Large context tasks require the correct handling of long-distance dependencies, such as subject-verb agreement. We design the morphosyntactic agreement paradigm for evaluating explanation methods on large context tasks.

We could also use human judgments for evaluation. While we use Mohseni and Ragan (2018)'s manual relevance benchmark for comparison, there are two issues with it: (i) Due to the cost of human labor, it is limited in size and domain. (ii) More importantly, a good explanation method should not reflect what humans attend to, but what task methods attend to. For instance, the family name "Kolstad" has 11 out of its 13 appearances in the 20 newsgroups corpus in sci.electronics posts. Thus, task methods probably learn it as a sci.electronics indicator. Indeed, the explanation method in Fig 1 (top) marks "Kolstad" as relevant, but the human annotator does not.

Small context: Hybrid document paradigm
Given a collection of documents, hybrid documents are created by randomly concatenating document fragments. We assume that, on average, the most relevant input for a class $k$ in a hybrid document is located in a fragment that stems from a document with gold label $k$. Hence, an explanation method succeeds if it places maximal relevance for $k$ inside the correct fragment.

Formally, let $x_t$ be a word inside hybrid document $X$ that originates from a document $X'$ with gold label $y(X')$. The gold label of $x_t$, $y(X, t)$, is set to $y(X')$. Let $f(X)$ be the class assigned to the hybrid document by a task method, and let $\phi$ be an explanation method as defined above. Let $\mathrm{rmax}(X, \phi)$ denote the position of the maximally relevant word in $X$ for the predicted class $f(X)$. If this maximally relevant word comes from a document with the correct gold label, the explanation method is awarded a hit:

$$\mathrm{hit}(\phi, X) = I[y(X, \mathrm{rmax}(X, \phi)) = f(X)] \tag{1}$$

where $I[P]$ is 1 if $P$ is true and 0 otherwise. In Fig 1 (bottom), the explanation method $\mathrm{grad}^{L2}_{1p}$ places rmax outside the correct (underlined) fragment. Therefore, it does not get a hit point, while $\mathrm{limsse}^{ms}_{s}$ does. The pointing game accuracy of an explanation method is calculated as its total number of hit points divided by the number of possible hit points. This is a form of the pointing game paradigm from computer vision (Zhang et al., 2016).
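The hit criterion of Eq 1 and the pointing game accuracy can be illustrated with a minimal sketch. The function names, the toy relevance scores and the tie-breaking rule (first maximum wins) are assumptions made for illustration; they are not taken from the paper's implementation.

```python
from typing import Sequence, Tuple

def rmax(relevance: Sequence[float]) -> int:
    """Position of the maximally relevant word (assumption: first one wins on ties)."""
    return max(range(len(relevance)), key=lambda t: relevance[t])

def hit(word_labels: Sequence[str], relevance: Sequence[float], predicted: str) -> int:
    """Eq 1: 1 if the word at rmax stems from a document whose gold label
    equals the class predicted for the hybrid document, else 0."""
    return int(word_labels[rmax(relevance)] == predicted)

def pointing_game_accuracy(examples: Sequence[Tuple[Sequence[str], Sequence[float], str]]) -> float:
    """Number of hit points divided by the number of possible hit points."""
    return sum(hit(*example) for example in examples) / len(examples)

# Toy hybrid document: three words from a sci.electronics post, two from rec.autos.
word_labels = ["sci.electronics"] * 3 + ["rec.autos"] * 2
examples = [
    (word_labels, [0.1, 0.9, 0.2, 0.3, 0.0], "sci.electronics"),  # rmax inside the correct fragment
    (word_labels, [0.1, 0.2, 0.2, 0.8, 0.0], "sci.electronics"),  # rmax inside the wrong fragment
]
accuracy = pointing_game_accuracy(examples)
```

Only the position of the maximum matters here; the remaining relevance scores do not affect the hit.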

Large context: Morphosyntactic agreement paradigm
Many natural languages display morphosyntactic agreement between words $v$ and $w$. A DNN that predicts the agreeing feature in $w$ should pay attention to $v$. For example, in the sentence "the children with the telescope are home", the number of the verb (plural for "are") can be predicted from the subject ("children") without looking at the verb. If the language allows for $v$ and $w$ to be far apart (Fig 3, top), successful task methods have to be able to handle large contexts.

Linzen et al. (2016) show that English verb number can be predicted by a unidirectional LSTM with accuracy > 99%, based on left context alone. When a task method predicts the correct number, we expect successful explanation methods to place maximal relevance on the subject:

$$\mathrm{hit}_{target}(\phi, X) = I[\mathrm{rmax}(X, \phi) = \mathrm{target}(X)]$$

where $\mathrm{target}(X)$ is the location of the subject, and rmax is calculated as above. Regardless of whether the prediction is correct, we expect rmax to fall onto a noun that has the predicted number:

$$\mathrm{hit}_{feat}(\phi, X) = I[\mathrm{feat}(X, \mathrm{rmax}(X, \phi)) = f(X)]$$

where $\mathrm{feat}(X, t)$ is the morphological feature (here: number) of $x_t$. In Fig 2, rmax on "link" gives a $\mathrm{hit}_{target}$ point (and a $\mathrm{hit}_{feat}$ point), rmax on "editor" gives a $\mathrm{hit}_{feat}$ point. $\mathrm{grad}^{L2}_{\int s}$ does not get any points as "history" is not a plural noun.
Labels for this task can be automatically generated using part-of-speech taggers and parsers, which are available for many languages.
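The two hit criteria of this paradigm can be sketched as follows, using the example sentence from above. The feature encoding (a list with `None` for tokens that are not nouns) and the relevance scores are hypothetical.

```python
from typing import Optional, Sequence

def rmax(relevance: Sequence[float]) -> int:
    """Position of the maximally relevant word (assumption: first one wins on ties)."""
    return max(range(len(relevance)), key=lambda t: relevance[t])

def hit_target(relevance: Sequence[float], target: int) -> int:
    """1 if maximal relevance falls on the subject; meant for correct predictions only."""
    return int(rmax(relevance) == target)

def hit_feat(relevance: Sequence[float], feats: Sequence[Optional[str]], predicted: str) -> int:
    """1 if maximal relevance falls on a noun that carries the predicted number."""
    return int(feats[rmax(relevance)] == predicted)

# Left context of the verb in "the children with the telescope are home".
tokens = ["the", "children", "with", "the", "telescope"]
number = [None, "plural", None, None, "singular"]  # hypothetical automatic annotation
subject = 1                                        # target(X): position of "children"

relevance_a = [0.0, 0.8, 0.1, 0.0, 0.3]  # hypothetical explanation peaking on the subject
relevance_b = [0.0, 0.2, 0.1, 0.0, 0.6]  # hypothetical explanation peaking on "telescope"
```

With a plural prediction, `relevance_a` earns both points and `relevance_b` earns none. If the task method had wrongly predicted singular, `relevance_b` would still earn a feat point, because feat hits are counted regardless of correctness.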

Explanation methods
In this section, we define the explanation methods that will be evaluated. For our purpose, explanation methods produce word relevance scores $\phi(t, k, X)$, which are specific to a given class $k$ and a given input $X$. $\phi(t, k, X) > \phi(t', k, X)$ means that $x_t$ contributed more than $x_{t'}$ to the task method's (potential) decision to classify $X$ as $k$.

Gradient-based explanation methods
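Gradient-based explanation methods approximate the contribution of some DNN input $i$ to some output $o$ with $o$'s gradient with respect to $i$ (Simonyan et al., 2014). In the following, we consider two output functions $o(k, X)$, the unnormalized class score $s(k, X)$ and the class probability $p(k|X)$:

$$s(k, X) = \vec{w}_k \cdot \vec{h}(X) + b_k \tag{2}$$

$$p(k|X) = \frac{\exp(s(k, X))}{\sum_{k'} \exp(s(k', X))} \tag{3}$$

where $k$ is the target class, $\vec{h}(X)$ the document representation (e.g., an RNN's final hidden layer), $\vec{w}_k$ (resp. $b_k$) $k$'s weight vector (resp. bias). The simple gradient of $o(k, X)$ w.r.t. $i$ is:

$$\mathrm{grad}_1(i, k, X) = \frac{\partial o(k, X)}{\partial i} \tag{4}$$

$\mathrm{grad}_1$ underestimates the importance of inputs that saturate a nonlinearity (Shrikumar et al., 2017). To address this, Sundararajan et al. (2017) integrate over all gradients on a linear interpolation $\alpha \in [0, 1]$ between a baseline input $\bar{X}$ (here: all-zero embeddings) and $X$:

$$\mathrm{grad}_{\int}(i, k, X) = \frac{1}{M} \sum_{m=1}^{M} \frac{\partial o(k, \bar{X} + \frac{m}{M}(X - \bar{X}))}{\partial i} \tag{5}$$

where $M$ is a big enough constant (here: 50).

In NLP, symbolic inputs (e.g., words) are often represented as one-hot vectors $\vec{x}_t \in \{1, 0\}^{|V|}$ and embedded via a real-valued matrix: $\vec{e}_t = \mathbf{M} \vec{x}_t$. Gradients are computed with respect to individual entries of $\mathbf{E} = [\vec{e}_1 \ldots \vec{e}_{|X|}]$. Bansal et al. (2016) and Hechtlinger (2016) use the L2 norm to reduce vectors of gradients to single values:

$$\phi_{\mathrm{grad}^{L2}}(t, k, X) = ||\mathrm{grad}(\vec{e}_t, k, \mathbf{E})|| \tag{6}$$

where $\mathrm{grad}(\vec{e}_t, k, \mathbf{E})$ is a vector of elementwise gradients w.r.t. $\vec{e}_t$. Denil et al. (2015) use the dot product of the gradient vector and the embedding, i.e., the gradient of the "hot" entry in $\vec{x}_t$:

$$\phi_{\mathrm{grad}^{dot}}(t, k, X) = \vec{e}_t \cdot \mathrm{grad}(\vec{e}_t, k, \mathbf{E}) \tag{7}$$

We use "$\mathrm{grad}_1$" for Eq 4, "$\mathrm{grad}_{\int}$" for Eq 5, "$p$" for Eq 3, "$s$" for Eq 2, "L2" for Eq 6 and "dot" for Eq 7. This gives us eight explanation methods: $\mathrm{grad}^{L2}_{1s}$, $\mathrm{grad}^{L2}_{1p}$, $\mathrm{grad}^{dot}_{1s}$, $\mathrm{grad}^{dot}_{1p}$, $\mathrm{grad}^{L2}_{\int s}$, $\mathrm{grad}^{L2}_{\int p}$, $\mathrm{grad}^{dot}_{\int s}$, $\mathrm{grad}^{dot}_{\int p}$.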
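The difference between the simple and the integrated gradient, and between the L2 and dot reductions, can be illustrated on a toy model. The "task method" below is hypothetical: a single tanh unit over two 2-dimensional embeddings with hand-picked weights, so that its gradient can be written analytically. It is a sketch under these assumptions, not the paper's implementation, which relies on automatic differentiation.

```python
import math

# Hypothetical toy model: o(E) = tanh(sum_t W[t] . E[t])
W = [[2.0, 0.0], [0.0, -1.0]]   # per-position weight vectors (assumption)
E = [[3.0, 1.0], [0.5, 1.0]]    # embeddings e_1, e_2 (assumption)

def pre_activation(emb):
    return sum(w * e for w_t, e_t in zip(W, emb) for w, e in zip(w_t, e_t))

def grad_simple(emb):
    """Simple gradient: d tanh(u) / d e_{t,j} = (1 - tanh(u)^2) * w_{t,j}."""
    slope = 1.0 - math.tanh(pre_activation(emb)) ** 2
    return [[slope * w for w in w_t] for w_t in W]

def grad_integrated(emb, steps=50):
    """Average of simple gradients along the line from the all-zero baseline to emb."""
    total = [[0.0] * len(e_t) for e_t in emb]
    for m in range(1, steps + 1):
        point = [[m / steps * e for e in e_t] for e_t in emb]
        for t, g_t in enumerate(grad_simple(point)):
            for j, g in enumerate(g_t):
                total[t][j] += g / steps
    return total

def l2(g_t):
    """L2 reduction: norm of the gradient vector of one position."""
    return math.sqrt(sum(g * g for g in g_t))

def dot(e_t, g_t):
    """Dot reduction: embedding times gradient of one position."""
    return sum(e * g for e, g in zip(e_t, g_t))

g1 = grad_simple(E)
gi = grad_integrated(E)
phi_l2 = [l2(g_t) for g_t in gi]
phi_dot = [dot(e_t, g_t) for e_t, g_t in zip(E, gi)]
```

The pre-activation of this toy model is 5, so tanh is saturated and the simple gradient nearly vanishes, while the integrated gradient does not. The second position lowers the output: its dot relevance is negative, whereas its L2 relevance is positive by construction.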

Layer-wise relevance propagation
Layer-wise relevance propagation (LRP) is a backpropagation-based explanation method developed for fully connected neural networks and CNNs (Bach et al., 2015) and later extended to LSTMs (Arras et al., 2017b). In this paper, we use Epsilon LRP (Eq 58, Bach et al. (2015)). Remember that the activation of neuron $j$, $a_j$, is the sum of weighted upstream activations, $\sum_i a_i w_{i,j}$, plus bias $b_j$, squeezed through some nonlinearity. We denote the pre-nonlinearity activation of $j$ as $a'_j$. The relevance of $j$, $R(j)$, is distributed to upstream neurons $i$ proportionally to the contribution that $i$ makes to $a'_j$ in the forward pass:

$$R(i) = \sum_j R(j) \frac{a_i w_{i,j}}{a'_j + \mathrm{esign}(a'_j)} \tag{8}$$

This ensures that relevance is conserved between layers, with the exception of relevance attributed to $b_j$. To prevent numerical instabilities, $\mathrm{esign}(a')$ returns $-\epsilon$ if $a' < 0$ and $\epsilon$ otherwise. We set $\epsilon = .001$. The full algorithm is:

$$R(L_{k'}) = s(k, X) \, I[k' = k]$$

... recursive application of Eq 8 ...

$$\phi_{lrp}(t, k, X) = \sum_j R(e_{t,j})$$

where $L$ is the final layer, $k$ the target class and $R(e_{t,j})$ the relevance of dimension $j$ in the $t$'th embedding vector. For $\epsilon \to 0$ and provided that all nonlinearities up to the unnormalized class score are relu, Epsilon LRP is equivalent to the product of input and raw score gradient. On gated (Q)RNNs, we follow Arras et al. (2017b)'s modification and treat sigmoid-activated gates as time step-specific weights rather than neurons. For instance, the relevance of LSTM candidate vector $\vec{g}_t$ is calculated from memory vector $\vec{c}_t$ and input gate vector $\vec{i}_t$ as

$$R(g_{t,j}) = R(c_{t,j}) \frac{g_{t,j} \, i_{t,j}}{c_{t,j} + \mathrm{esign}(c_{t,j})}$$

This is equivalent to applying Eq 8 while treating $\vec{i}_t$ as a diagonal weight matrix. The gate neurons in $\vec{i}_t$ do not receive any relevance themselves. See supplementary material for formal definitions of Epsilon LRP for different architectures.
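One backward step of Epsilon LRP through a single fully connected layer can be sketched as below. The function names, the layer layout (`weights[i][j]` connects upstream neuron `i` to neuron `j`) and the toy numbers are assumptions for illustration only.

```python
EPS = 0.001

def esign(a, eps=EPS):
    """Stabilizer: -eps for negative pre-activations, +eps otherwise."""
    return -eps if a < 0 else eps

def lrp_dense(a_in, weights, bias, rel_out):
    """Redistribute the relevance of layer outputs (rel_out) to the layer inputs."""
    n_in, n_out = len(a_in), len(bias)
    # Pre-nonlinearity activations a'_j of the forward pass.
    pre = [sum(a_in[i] * weights[i][j] for i in range(n_in)) + bias[j]
           for j in range(n_out)]
    return [
        sum(rel_out[j] * a_in[i] * weights[i][j] / (pre[j] + esign(pre[j]))
            for j in range(n_out))
        for i in range(n_in)
    ]

a_in = [1.0, 2.0]
weights = [[1.0, -1.0],
           [0.5, 2.0]]
bias = [0.0, 0.0]
rel_out = [2.0, 3.0]
rel_in = lrp_dense(a_in, weights, bias, rel_out)
```

With zero biases, the relevance entering the layer is conserved up to the small loss caused by the stabilizer. Applying the same function layer by layer, starting from the class score, yields the recursive procedure described above.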

DeepLIFT
DeepLIFT (Shrikumar et al., 2017) is another backpropagation-based explanation method. Unlike LRP, it does not explain $s(k, X)$, but $s(k, X) - s(k, \bar{X})$, where $\bar{X}$ is some baseline input (here: all-zero embeddings). Following Ancona et al. (2018) (Eq 4), we use this backpropagation rule:

$$R(i) = \sum_j R(j) \frac{a_i w_{i,j} - \bar{a}_i w_{i,j}}{a'_j - \bar{a}'_j + \mathrm{esign}(a'_j - \bar{a}'_j)}$$

where $\bar{a}$ refers to the forward pass of the baseline. Note that the original method has a different mechanism for avoiding small denominators; we use esign for compatibility with LRP. The DeepLIFT algorithm is started with $R(L_{k'}) = (s(k, X) - s(k, \bar{X})) \, I[k' = k]$. On gated (Q)RNNs, we proceed analogously to LRP and treat gates as weights.
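The DeepLIFT rule differs from Epsilon LRP only in that it works with differences to the baseline forward pass. A minimal sketch for one fully connected layer, with hypothetical names and toy values:

```python
EPS = 0.001

def esign(a, eps=EPS):
    """Stabilizer: -eps for negative values, +eps otherwise."""
    return -eps if a < 0 else eps

def pre_activations(a_in, weights, bias):
    return [sum(a_in[i] * weights[i][j] for i in range(len(a_in))) + bias[j]
            for j in range(len(bias))]

def deeplift_dense(a_in, a_base, weights, bias, rel_out):
    """Redistribute rel_out to the inputs, relative to the baseline activations a_base."""
    pre = pre_activations(a_in, weights, bias)
    pre_base = pre_activations(a_base, weights, bias)
    diff = [p - pb for p, pb in zip(pre, pre_base)]  # the bias cancels out here
    return [
        sum(rel_out[j] * (a_in[i] - a_base[i]) * weights[i][j]
            / (diff[j] + esign(diff[j]))
            for j in range(len(bias)))
        for i in range(len(a_in))
    ]

a_in = [1.0, 2.0]
a_base = [0.0, 0.0]          # all-zero baseline
weights = [[1.0, -1.0],
           [0.5, 2.0]]
bias = [5.0, -1.0]
rel_out = [2.0, 3.0]
rel_in = deeplift_dense(a_in, a_base, weights, bias, rel_out)
```

Because the bias contributes equally to the actual and the baseline pass, it drops out of the denominator, and no relevance is attributed to it in this sketch.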

Cell decomposition for gated RNNs
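The cell decomposition explanation method for LSTMs (Murdoch and Szlam, 2017) decomposes the unnormalized class score $s(k, X)$ (Eq 2) into additive contributions. For every time step $t$, we compute how much of $\vec{c}_t$ "survives" until the final step $T$ and contributes to $s(k, X)$. This is achieved by applying all future forget gates $\vec{f}$, the final tanh nonlinearity, the final output gate $\vec{o}_T$, as well as the class weights of $k$ to $\vec{c}_t$. We call this quantity "net load of $t$ for class $k$":

$$\mathrm{nl}(t, k, X) = \vec{w}_k \cdot \Big( \vec{o}_T \odot \tanh\Big( \Big( \prod_{j=t+1}^{T} \vec{f}_j \Big) \odot \vec{c}_t \Big) \Big)$$

where $\odot$ and $\prod$ are applied elementwise. The relevance of $t$ is its gain in net load relative to $t - 1$:

$$\phi_{decomp}(t, k, X) = \mathrm{nl}(t, k, X) - \mathrm{nl}(t - 1, k, X)$$

For GRU, we change the definition of net load:

$$\mathrm{nl}(t, k, X) = \vec{w}_k \cdot \Big( \Big( \prod_{j=t+1}^{T} \vec{z}_j \Big) \odot \vec{h}_t \Big)$$

where $\vec{z}$ are GRU update gates.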
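The net load computation for an LSTM can be sketched as follows. The sketch assumes that memory vectors and gate activations have already been recorded during a forward pass; all values below are made up, and index 0 stands for the all-zero initial memory state.

```python
import math

def net_load(t, cells, forgets, out_gate, w_k):
    """How much of c_t survives all later forget gates and reaches the class score."""
    T = len(cells) - 1
    surviving = list(cells[t])
    for j in range(t + 1, T + 1):
        surviving = [s * f for s, f in zip(surviving, forgets[j])]
    return sum(w * o * math.tanh(s) for w, o, s in zip(w_k, out_gate, surviving))

def decomp_relevance(cells, forgets, out_gate, w_k):
    """Relevance of step t = gain in net load relative to step t - 1."""
    T = len(cells) - 1
    return [net_load(t, cells, forgets, out_gate, w_k)
            - net_load(t - 1, cells, forgets, out_gate, w_k)
            for t in range(1, T + 1)]

# Hypothetical recorded activations for T = 3 time steps and hidden size 2.
cells = [[0.0, 0.0], [0.5, -0.2], [0.9, 0.4], [0.3, 1.1]]  # c_0 (initial) .. c_3
forgets = [None, [0.9, 0.8], [0.7, 0.6], [0.5, 0.95]]      # f_1 .. f_3 (index 0 unused)
out_gate = [0.6, 0.9]                                      # o_T
w_k = [1.5, -0.5]                                          # class weights of k

relevances = decomp_relevance(cells, forgets, out_gate, w_k)
score_without_bias = sum(w * o * math.tanh(c)
                         for w, o, c in zip(w_k, out_gate, cells[-1]))
```

The relevances telescope: since the net load of the initial state is zero, they add up to the class score without its bias term, which is what makes the decomposition additive.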

Input perturbation methods
Input perturbation methods assume that the removal or masking of relevant inputs changes the output (Zeiler and Fergus, 2014). Omission-based methods remove inputs completely (Kádár et al., 2017), while occlusion-based methods replace them with a baseline (Li et al., 2016b). In computer vision, perturbations are usually applied to patches, as neighboring pixels tend to correlate (Zintgraf et al., 2017). To calculate the $\mathrm{omit}_N$ (resp. $\mathrm{occ}_N$) relevance of word $x_t$, we delete (resp. occlude), one at a time, all $N$-grams that contain $x_t$, and average the change in the unnormalized class score from Eq 2:

$$\phi_{[omit|occ]_N}(t, k, X) = \frac{1}{N} \sum_{j=1}^{N} \Big[ s(k, [\vec{e}_1 \ldots \vec{e}_{|X|}]) - s(k, [\vec{e}_1 \ldots \vec{e}_{t-N-1+j}] \, \| \, \bar{E} \, \| \, [\vec{e}_{t+j} \ldots \vec{e}_{|X|}]) \Big]$$

where $\vec{e}_t$ are embedding vectors, $\|$ denotes concatenation and $\bar{E}$ is either a sequence of length zero ($\phi_{omit}$) or a sequence of $N$ baseline (here: all-zero) embedding vectors ($\phi_{occ}$).
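The perturbation scheme can be sketched with a generic scoring function. Positions are 0-based here, the scoring function is a hypothetical stand-in for $s(k, \cdot)$, and N-grams that would extend beyond the document are clipped to its boundaries, which is an assumption of this sketch rather than something stated in the text.

```python
def perturbation_relevance(E, t, N, score, occlude):
    """Average score change over all N-grams containing position t (0-based).

    occlude=False removes the N-gram (omit_N); occlude=True replaces it with
    all-zero vectors of the same length (occ_N).
    """
    dim = len(E[0])
    full = score(E)
    changes = []
    for j in range(1, N + 1):
        start = max(t - N + j, 0)        # first masked position (clipped)
        end = min(t + j, len(E))         # one past the last masked position (clipped)
        filler = [[0.0] * dim for _ in range(end - start)] if occlude else []
        changes.append(full - score(E[:start] + filler + E[end:]))
    return sum(changes) / N

def toy_score(E):
    """Hypothetical class score: sum of the first embedding dimension."""
    return sum(e[0] for e in E)

E = [[1.0], [3.0], [0.5]]
omit1 = perturbation_relevance(E, t=1, N=1, score=toy_score, occlude=False)
occ1 = perturbation_relevance(E, t=1, N=1, score=toy_score, occlude=True)
omit3 = perturbation_relevance(E, t=1, N=3, score=toy_score, occlude=False)
```

For this linear toy score, omission and occlusion coincide. With a real task method they generally differ, because omission shortens the sequence while occlusion preserves its length.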

LIMSSE: LIME for NLP
Local Interpretable Model-agnostic Explanations (LIME) (Ribeiro et al., 2016) is a framework for explaining predictions of complex classifiers. LIME approximates the behavior of classifier $f$ in the neighborhood of input $X$ with an interpretable (here: linear) model. The interpretable model is trained on samples $Z_1 \ldots Z_N$ (here: $N = 3000$), which are randomly drawn from $X$, with "gold labels" $f(Z_1) \ldots f(Z_N)$.
Since RNNs and CNNs respect word order, we cannot use the bag of words sampling method from the original description of LIME. Instead, we introduce Local Interpretable Model-agnostic Substring-based Explanations (LIMSSE). LIMSSE uniformly samples a length $l_n$ (here: $1 \le l_n \le 6$) and a starting point $s_n$, which define the substring $Z_n = [x_{s_n} \ldots x_{s_n + l_n - 1}]$. To the linear model, $Z_n$ is represented by a binary vector $\vec{z}_n \in \{0, 1\}^{|X|}$, where $z_{n,t} = I[s_n \le t < s_n + l_n]$.
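The substring sampling and the binary representation can be sketched as below, together with a linear fit to numerical outputs. Several parts are assumptions made for illustration: the toy scoring function, the handling of document boundaries (lengths are capped at the document length and start points are chosen so that the substring fits), the squared-error objective and its optimization by plain gradient descent, and the reduced number of samples (200 instead of 3000, to keep the example fast).

```python
import random

def sample_substrings(doc_len, n_samples, max_len=6, seed=0):
    """Uniformly sample (start, length) pairs that fit into the document."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n_samples):
        length = rng.randint(1, min(max_len, doc_len))
        start = rng.randint(0, doc_len - length)
        samples.append((start, length))
    return samples

def binary_vector(start, length, doc_len):
    """z_{n,t} = 1 if position t lies inside the sampled substring, else 0."""
    return [1.0 if start <= t < start + length else 0.0 for t in range(doc_len)]

def fit_linear(Z, targets, steps=1000, lr=0.2):
    """Least-squares fit of a weight vector by full-batch gradient descent."""
    n, d = len(Z), len(Z[0])
    v = [0.0] * d
    for _ in range(steps):
        grad = [0.0] * d
        for z, y in zip(Z, targets):
            err = sum(zi * vi for zi, vi in zip(z, v)) - y
            for t in range(d):
                grad[t] += err * z[t]
        v = [vi - lr * g / n for vi, g in zip(v, grad)]
    return v

doc = ["the", "food", "was", "great", "today"]

def toy_score(words):
    """Hypothetical task method output: fires if the substring contains 'great'."""
    return 1.0 if "great" in words else 0.0

samples = sample_substrings(len(doc), n_samples=200)
Z = [binary_vector(s, l, len(doc)) for s, l in samples]
targets = [toy_score(doc[s:s + l]) for s, l in samples]
relevance = fit_linear(Z, targets)
top_word = doc[max(range(len(doc)), key=lambda t: relevance[t])]
```

Each entry of the fitted weight vector is read as the relevance of the word at that position; in this toy case the largest weight falls on "great".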
We learn a linear weight vector $\hat{v}_k \in \mathbb{R}^{|X|}$, whose entries are word relevances for $k$, i.e., $\phi_{limsse}(t, k, X) = \hat{v}_{k,t}$. To optimize it, we experiment with three loss functions. The first, which we will refer to as $\mathrm{limsse}^{bb}$, assumes that our DNN is a total black box that delivers only a classification, $f(Z_n) = \mathrm{argmax}_{k'} \, p(k'|Z_n)$. The black box approach is maximally general, but insensitive to the magnitude of evidence found in $Z_n$. Hence, we also test magnitude-sensitive loss functions ($\mathrm{limsse}^{ms}_{s}$ and $\mathrm{limsse}^{ms}_{p}$), which fit the linear model to the numerical outputs $s(k, Z_n)$ and $p(k|Z_n)$, respectively.

Hybrid document experiment
In all five architectures, the resulting document representation is projected to 20 (resp. two) dimensions using a fully connected layer, followed by a softmax. See supplementary material for details on training and regularization. After training, we sentence-tokenize the test sets, shuffle the sentences, concatenate ten sentences at a time and classify the resulting hybrid documents. Documents that are assigned a class that is not the gold label of at least one constituent word are discarded (yelp: < 0.1%; 20 newsgroups: 14%-20%). On the remaining documents, we use the explanation methods from §3 to find the maximally relevant word for each prediction. The random baseline samples the maximally relevant word from a uniform distribution.
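The construction of hybrid documents and the discard rule can be sketched as follows. The data layout (sentences as pairs of token list and gold label) and the decision to drop a final group of fewer than ten sentences are assumptions of this sketch.

```python
import random

def make_hybrid_documents(sentences, per_doc=10, seed=0):
    """Shuffle sentences and concatenate per_doc of them into one hybrid document.

    Returns (words, word_labels) pairs; every word inherits the gold label
    of the document its sentence came from.
    """
    rng = random.Random(seed)
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    hybrids = []
    for i in range(0, len(shuffled) - per_doc + 1, per_doc):
        words, word_labels = [], []
        for tokens, label in shuffled[i:i + per_doc]:
            words.extend(tokens)
            word_labels.extend([label] * len(tokens))
        hybrids.append((words, word_labels))
    return hybrids

def keep_for_evaluation(hybrid, predicted):
    """Discard documents whose predicted class is no constituent word's gold label."""
    return predicted in hybrid[1]

# Hypothetical tokenized test sentences with the gold label of their source document.
sentences = [(["sentence", str(i), "."], "positive" if i % 2 == 0 else "negative")
             for i in range(25)]
hybrids = make_hybrid_documents(sentences)
```

The word-level labels produced here are exactly what the hit criterion of the hybrid document paradigm compares against the predicted class.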

Morphosyntactic agreement experiment
As task methods, we replicate Linzen et al. (2016)'s unidirectional LSTM ($\mathbb{R}^{50}$ randomly initialized word embeddings, hidden size 50). We also train unidirectional GRU, QGRU and QLSTM architectures with the same dimensionality. We use the explanation methods from §3 to find the most relevant word for predictions on the test set. As described in §2.2, explanation methods are awarded a $\mathrm{hit}_{target}$ (resp. $\mathrm{hit}_{feat}$) point if this word is the subject (resp. a noun with the predicted number feature). For reference, we use a random baseline as well as a baseline that assumes that the most relevant word directly precedes the verb.

Explanation methods
Our experiments suggest that explanation methods for neural NLP differ in quality.
As in previous work (see §6), gradient L2 norm ($\mathrm{grad}^{L2}$) performs poorly, especially on RNNs. We assume that this is due to its inability to distinguish relevances for and against $k$.
Gradient embedding dot product ($\mathrm{grad}^{dot}$) is competitive on CNN (Table 2, $\mathrm{grad}^{dot}_{1p}$ C05, $\mathrm{grad}^{dot}_{1s}$ C10, C15), presumably because relu is linear on positive inputs, so gradients are exact instead of approximate. $\mathrm{grad}^{dot}$ also has decent performance for GRU ($\mathrm{grad}^{dot}_{1p}$ C01, $\mathrm{grad}^{dot}_{\int s}$ C{06, 11, 16, 20, 24}), perhaps because GRU hidden activations are always in [-1,1], where tanh and σ are approximately linear.

[Figure 3. Top: verb context "initially a pagan culture , detailed information about the return of the christian religion to the islands during the norse-era [is ...]", explained by decomp, deeplift and $\mathrm{limsse}^{ms}_{p}$. Bottom: yelp hybrid document, explained by lrp and $\mathrm{limsse}^{ms}_{p}$.]
Integrated gradient ($\mathrm{grad}_{\int}$) mostly outperforms simple gradient ($\mathrm{grad}_1$), though not consistently (C01, C07). Contrary to expectation, integration did not help much with the failure of the gradient method on LSTM on 20 newsgroups ($\mathrm{grad}^{dot}_{1}$ vs. $\mathrm{grad}^{dot}_{\int}$ in C08, C13), which we had assumed to be due to saturation of tanh on large absolute activations in $\vec{c}$. Smaller intervals may be needed to approximate the integration; however, this means additional computational cost.
The gradient of $s(k, X)$ performs better than or similar to the gradient of $p(k|X)$. The main exception is yelp ($\mathrm{grad}^{dot}_{1s}$ vs. $\mathrm{grad}^{dot}_{1p}$, C01-C05). This is probably due to conflation by $p(k|X)$ of evidence for $k$ (numerator in Eq 3) and against competitor classes (denominator). In a two-class scenario, there is little incentive to keep classes separate, leading to information flow through the denominator. In future work, we will replace the two-way softmax with a one-way sigmoid such that $\phi(t, 0, X) := -\phi(t, 1, X)$.
LRP and DeepLIFT are the most consistent explanation methods across evaluation paradigms and task methods. (The comparatively low pointing game accuracies on the yelp QRNNs and CNN (C02, C04, C05) are probably due to the fact that they explain $s(k, \cdot)$ in a two-way softmax, see above.) On CNN (C05, C10, C15), LRP and $\mathrm{grad}^{dot}_{1s}$ perform almost identically, suggesting that they are indeed quasi-equivalent on this architecture (see §3.2). On (Q)RNNs, modified LRP and DeepLIFT appear to be superior to the gradient method (lrp vs. $\mathrm{grad}^{dot}_{1s}$, deeplift vs. $\mathrm{grad}^{dot}_{\int s}$, C01-C04, C06-C09, C11-C14, C16-C27).
Decomposition performs well on LSTM, especially in the morphosyntactic agreement experiment, but it is inconsistent on other architectures. Gated RNNs have a long-term additive and a multiplicative pathway, and the decomposition method only detects information traveling via the additive one. Miao et al. (2016) show qualitatively that GRUs often reorganize long-term memory abruptly, which might explain the difference between LSTM and GRU. QRNNs only have additive recurrent connections; however, given that $\vec{c}_t$ (resp. $\vec{h}_t$) is calculated by convolution over several time steps, decomposition relevance can be incorrectly attributed inside that window. This likely is the reason for the stark difference between the performance of decomposition on QRNNs in the hybrid document experiment and on the manually labeled data (C07, C09 vs. C12, C14). Overall, we do not recommend the decomposition method, because it fails to take into account all routes by which information can be propagated.
Omission and occlusion produce inconsistent results in the hybrid document experiment. Shrikumar et al. (2017) show that perturbation methods can lack sensitivity when there are more relevant inputs than the "perturbation window" covers. In the morphosyntactic agreement experiment, omission is not competitive; we assume that this is because it interferes too much with syntactic structure. occ 1 does better (esp. C16-C19), possibly because an all-zero "placeholder" is less disruptive than word removal. But despite some high scores, it is less consistent than other explanation methods.
Magnitude-sensitive LIMSSE (limsse ms ) consistently outperforms black-box LIMSSE (limsse bb ), which suggests that numerical outputs should be used for approximation where possible. In the hybrid document experiment, magnitude-sensitive LIMSSE outperforms the other explanation methods (exceptions: C03, C05). However, it fails in the morphosyntactic agreement experiment (C16-C27). In fact, we expect LIMSSE to be unsuited for large context problems, as it cannot discover dependencies whose range is bigger than a given text sample. In Fig 3 (top), limsse ms p highlights any singular noun without taking into account how that noun fits into the overall syntactic structure.

Evaluation paradigms
The assumptions made by our automatic evaluation paradigms have exceptions: (i) the correlation between fragment of origin and relevance does not always hold (e.g., a positive review may contain negative fragments, and will almost certainly contain neutral fragments); (ii) in morphological prediction, we cannot always expect the subject to be the only predictor for number. In Fig 2 (bottom) for example, "few" is a reasonable clue for plural despite not being a noun. This imperfect ground truth means that absolute pointing game accuracies should be taken with a grain of salt; but we argue that this does not invalidate them for comparisons.
We also point out that there are characteristics of explanations that may be desirable but are not reflected by the pointing game. Consider Fig 3 (bottom). Both explanations get hit points, but the lrp explanation appears "cleaner" than $\mathrm{limsse}^{ms}_{p}$, with relevance concentrated on fewer tokens.

Related work

Explanation methods
Explanation methods can be divided into local and global methods (Doshi-Velez and Kim, 2017). Global methods infer general statements about what a DNN has learned, e.g., by clustering documents (Aubakirova and Bansal, 2016) or n-grams (Kádár et al., 2017) according to the neurons that they activate. Li et al. (2016a) compare embeddings of specific words with reference points to measure how drastically they were changed during training. In computer vision, Simonyan et al. (2014) optimize the input space to maximize the activation of a specific neuron. Global explanation methods are of limited value for explaining a specific prediction as they represent average behavior. Therefore, we focus on local methods.
Local explanation methods explain a decision taken for one specific input at a time. We have attempted to include all important local methods for NLP in our experiments (see §3). We do not address self-explanatory models (e.g., attention (Bahdanau et al., 2015) or rationale models (Lei et al., 2016)), as these are very specific architectures that may not be applicable to all tasks.

Explanation evaluation
According to Doshi-Velez and Kim (2017)'s taxonomy of explanation evaluation paradigms, application-grounded paradigms test how well an explanation method helps real users solve real tasks (e.g., doctors judge automatic diagnoses); human-grounded paradigms rely on proxy tasks (e.g., humans rank task methods based on explanations); functionally-grounded paradigms work without human input, like our approach.
Arras et al. (2016) (cf. Samek et al. (2016)) propose a functionally-grounded explanation evaluation paradigm for NLP where words in a correctly (resp. incorrectly) classified document are deleted in descending (resp. ascending) order of relevance. They assume that the fewer words must be deleted to reduce (resp. increase) accuracy, the better the explanations. According to this metric, LRP (§3.2) outperforms $\mathrm{grad}^{L2}$ on CNNs. An issue with the word deletion paradigm is that it uses syntactically broken inputs, which may introduce artefacts (Sundararajan et al., 2017). In our hybrid document paradigm, inputs are syntactically intact (though semantically incoherent at the document level); the morphosyntactic agreement paradigm uses unmodified inputs.
Another class of functionally-grounded evaluation paradigms interprets the performance of a secondary task method, on inputs that are derived from (or altered by) an explanation method, as a proxy for the quality of that explanation method. Murdoch and Szlam (2017) build a rule-based classifier from the most relevant phrases in a corpus (task method: LSTM). The classifier based on decomp (§3.4) outperforms the gradient-based classifier, which is in line with our results. Arras et al. (2017a) build document representations by summing over word embeddings weighted by relevance scores (task method: CNN). They show that K-nearest neighbor performs better on document representations derived with LRP than on those derived with $\mathrm{grad}^{L2}$, which also matches our results. Denil et al. (2015) condense documents by extracting top-K relevant sentences, and let the original task method (CNN) classify them. The accuracy loss, relative to uncondensed documents, is smaller for $\mathrm{grad}^{dot}$ than for heuristic baselines.
In the domain of human-based evaluation paradigms, Ribeiro et al. (2016) compare different variants of LIME (§3.6) by how well they help non-experts clean a corpus of words that lead to overfitting. Selvaraju et al. (2017) assess how well explanation methods help non-experts identify the more accurate out of two object recognition CNNs. These experiments come closer to real use cases than functionally-grounded paradigms; however, they are less scalable.

Summary
We conducted the first comprehensive evaluation of explanation methods for NLP, an important undertaking because there is a need for understanding the behavior of DNNs.
To conduct this study, we introduced evaluation paradigms for explanation methods for two classes of NLP tasks, small context tasks (e.g., topic classification) and large context tasks (e.g., morphological prediction). Neither paradigm requires manual annotations. We also introduced LIMSSE, a substring-based explanation method inspired by LIME and designed for NLP.
Based on our experimental results, we recommend LRP, DeepLIFT and LIMSSE for small context tasks and LRP and DeepLIFT for large context tasks, on all five DNN architectures that we tested. On CNNs and possibly GRUs, the (integrated) gradient embedding dot product is a good alternative to DeepLIFT and LRP.

Code
Our implementation of LIMSSE, the gradient, perturbation and decomposition methods can be found in our branch of the keras package: www.github.com/NPoe/keras.
To re-run our experiments, see scripts in www.github.com/NPoe/neural-nlp-explanation-experiment. Our LRP implementation (same repository) is adapted from Arras et al. (2017b).

For sentiment analysis we use the Pennsylvania subset of the 10th yelp dataset challenge. It contains 206,338 reviews with 1 to 5 star ratings. 1 or 2 stars are mapped to "negative", 4 or 5 stars to "positive", 3 star reviews are discarded. We randomly split the data into training, heldout and test sets (90%/5%/5%). On both corpora, we use NLTK (Bird et al., 2009) for word and sentence tokenization. Words with a frequency rank above 50000 are mapped to oov. To create hybrid documents, we sentence-tokenize the test sets, shuffle, and then concatenate ten sentences at a time.
The manually annotated 20 newsgroups documents were obtained from Mohseni and Ragan (2018). The relevance ground truth consists of one list of lowercased word types per document. There are a number of mismatches between the ground truth and the documents (e.g., one list contains rays but its document only contains x-rays). This made some reverse engineering necessary: Given $X$ and its list, we add $t$ to $\mathrm{gt}(X)$ if lower-cased $x_t$ is a prefix or suffix of at least one word type in the list.
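The matching rule can be sketched as below. It implements the rule literally as stated (the lower-cased token must be a prefix or suffix of a listed word type); the function name and the example tokens are hypothetical.

```python
from typing import List, Sequence

def ground_truth_positions(tokens: Sequence[str], relevant_types: Sequence[str]) -> List[int]:
    """Positions t whose lower-cased token is a prefix or suffix of a listed word type."""
    types = [w.lower() for w in relevant_types]
    positions = []
    for t, token in enumerate(tokens):
        tok = token.lower()
        if any(w.startswith(tok) or w.endswith(tok) for w in types):
            positions.append(t)
    return positions

tokens = ["The", "X-ray", "rays", "are", "harmful"]
relevant_types = ["x-rays", "harmful"]
gt = ground_truth_positions(tokens, relevant_types)
```

Exact matches are covered as a special case, since every string is a prefix of itself.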
For the morphosyntactic agreement experiment, we use Linzen et al. (2016)'s corpus of 1,577,211 English Wikipedia sentences with automatic morphosyntactic annotation. We replicate the original dataset sizes (9% train, 1% heldout, 90% test). As in the original corpus, words with a frequency rank above 10,000 are replaced by their part-of-speech tag.

Neural networks
Every neural network used in our paper is made up of a word embedding matrix, followed by a core layer, followed by a fully-connected layer with softmax activation.
In the hybrid document experiment, the $|V| \times 300$ embedding matrix is initialized with GloVe embeddings (Pennington et al., 2014), which are fine-tuned during training. The core layer is a bidirectional Gated Recurrent Unit (GRU, Cho et al. (2014)), bidirectional Long-Short Term Memory Network (LSTM, Hochreiter and Schmidhuber (1997)), bidirectional Quasi-GRU or Quasi-LSTM (Bradbury et al., 2017), or a 1D Convolutional Neural Network (CNN) with global max pooling (Collobert et al., 2011). In all cases, the core layer has a hidden size of 150 (bidirectional architectures: 75 per direction). For QRNNs and CNN, we use a kernel width of 5. For regularization, we use 50% dropout between layers and on hidden-to-hidden connections (GRU/LSTM only).
We minimize categorical crossentropy using Adam (Kingma and Ba, 2015), with learning rate 0.001, β1 = 0.9, β2 = 0.999 and batch size 8. Heldout accuracy is monitored; after two stagnant epochs, the learning rate is halved, and after 5 (yelp), resp. 25 (20 newsgroups) [...]

In the morphosyntactic agreement experiment, the $|V| \times 50$ embedding matrix is randomly initialized. All (Q)RNNs are unidirectional and have a hidden size of 50. QRNN kernel width is 5. The core layer is followed by a fully connected 50 × 2 layer with softmax activation. We minimize categorical crossentropy using Adam (see above), with early stopping after 20 epochs based on heldout accuracy, and a batch size of 16.

Epsilon LRP and DeepLIFT
In the following, we assume that the hidden layer relevance vector R( h) (resp. R( hT )) has been backpropagated by the upstream fully connected layer using equations from Sections 3.2 and 3.3 (main paper). DeepLIFT can be derived by re-   i like the fact that there is n't an editor making news decisions , that nearly any news story published [has ...] omit7 i like the fact that there is n't an editor making news decisions , that nearly any news story published [has ...] occ1 i like the fact that there is n't an editor making news decisions , that nearly any news story published [has ...] occ3 i like the fact that there is n't an editor making news decisions , that nearly any news story published [has ...] occ7 i like the fact that there is n't an editor making news decisions , that nearly any news story published [has ...] decomp i like the fact that there is n't an editor making news decisions , that nearly any news story published [has ...] lrp i like the fact that there is n't an editor making news decisions , that nearly any news story published [has ...] deeplift i like the fact that there is n't an editor making news decisions , that nearly any news story published [has ...] limsse bb i like the fact that there is n't an editor making news decisions , that nearly any news story published [has ...] limsse ms s i like the fact that there is n't an editor making news decisions , that nearly any news story published [has ...] limsse ms p i like the fact that there is n't an editor making news decisions , that nearly any news story published [has ...] Figure 6: Verb context classified singular by GRU and plural by QLSTM. Green (resp. red): evidence for (resp. against) the prediction. Underlined: subject. Bold: rmax position. If you find faith to be honest , show me how . David The whole denominational mindset only causes more problems , sadly . ( See section 7 for details . ) Thank you . 'The Armenians just shot and shot . 
Figure (caption not recovered): Document shown: "If you find faith to be honest , show me how . David The whole denominational mindset only causes more problems , sadly . ( See section 7 for details . ) Thank you . 'The Armenians just shot and shot . Maybe coz they 're 'quality' cars ; -) 200 posts/day . can you explain this or is it that they usually talk to stars more than regular players which explains the hight percentage of results after . It was produced in collaboration with the American College of Surgeons Commission on Cancer ." Recoverable row labels: omit3, omit7, occ1, occ3, occ7, decomp, lrp, deeplift.

Figure 13: Hybrid yelp review, classified negative. Green (resp. red): evidence for (resp. against) negative. Underlined: negative fragments. Italics: OOV. Task method: LSTM. Bold: rmax position. Document shown: "When we went to pay we handing the guy the card and our payment , he checked us out and handed back our payment . After we got our food our waitress went M.I.A . The room was good size . : ) The waitresses need to work on their skills a little more . This place is terrible . ! We will not be back . Luckily I do eat salmon , so I headed to the smoked salmon station . One of the few places where you can find good Italian food ." Recoverable row labels: grad L2 1p, grad L2 s, grad L2 p, grad dot 1s, grad dot 1p, grad dot s, grad dot p, omit1, omit3, omit7, occ1, occ3, occ7, decomp, lrp, deeplift, limsse bb, limsse ms s, limsse ms p.