Interpretable Neural Predictions with Differentiable Binary Variables

The success of neural networks comes hand in hand with a desire for more interpretability. We focus on text classifiers and make them more interpretable by having them provide a justification–a rationale–for their predictions. We approach this problem by jointly training two neural network models: a latent model that selects a rationale (i.e. a short and informative part of the input text), and a classifier that learns from the words in the rationale alone. Previous work proposed to assign binary latent masks to input positions and to promote short selections via sparsity-inducing penalties such as L0 regularisation. We propose a latent model that mixes discrete and continuous behaviour allowing at the same time for binary selections and gradient-based training without REINFORCE. In our formulation, we can tractably compute the expected value of penalties such as L0, which allows us to directly optimise the model towards a pre-specified text selection rate. We show that our approach is competitive with previous work on rationale extraction, and explore further uses in attention mechanisms.


Introduction
Neural networks are bringing incredible performance gains on text classification tasks (Howard and Ruder, 2018; Peters et al., 2018; Devlin et al., 2019). However, this power comes hand in hand with a desire for more interpretability, even though its definition may differ (Lipton, 2016). While it is useful to obtain high classification accuracy, with more data available than ever before it also becomes increasingly important to justify predictions. Imagine having to classify a large collection of documents, while verifying that the classifications make sense. It would be extremely time-consuming to read each document to evaluate the results. Moreover, if we do not [...]

[Figure 1: an example beer review ("pours a dark amber color with decent head that does not recede much . it 's a tad too dark to see the carbonation , but fairs well . smells of roasted malts and mouthfeel is quite strong in the sense that you can get a good taste of it before you even swallow .") is passed through the Rationale Extractor.]

What if the model could provide us with the most important parts of the document, as a justification for its prediction? That is exactly the focus of this paper. We use a setting that was pioneered by Lei et al. (2016). A rationale is defined to be a short yet sufficient part of the input text; short so that it makes clear what is most important, and sufficient so that a correct prediction can be made from the rationale alone. One neural network learns to extract the rationale, while another neural network, with separate parameters, learns to make a prediction from just the rationale. Lei et al. model this by assigning a binary Bernoulli variable to each input word. The rationale then consists of all the words for which a 1 was sampled. Because gradients do not flow through discrete samples, the rationale extractor is optimized using REINFORCE (Williams, 1992). An L0 regularizer is used to make sure the rationale is short.
We propose an alternative to purely discrete selectors for which gradient estimation is possible without REINFORCE, instead relying on a reparameterization of a random variable that exhibits both continuous and discrete behavior (Louizos et al., 2017). To promote compact rationales, we employ a relaxed form of L0 regularization (Louizos et al., 2017), penalizing the objective as a function of the expected proportion of selected text. We also propose the use of Lagrangian relaxation to target a specific rate of selected input text.
Our contributions are summarized as follows: [1]
1. we present a differentiable approach to extractive rationales (§2) including an objective that allows for specifying how much text is to be extracted (§4);
2. we introduce HardKuma (§3), which gives support to binary outcomes and allows for reparameterized gradient estimates;
3. we empirically show that our approach is competitive with previous work and that HardKuma has further applications, e.g. in attention mechanisms (§6).

Latent Rationale
We are interested in making NN-based text classifiers interpretable by (i) uncovering which parts of the input text contribute features for classification, and (ii) basing decisions on only a fraction of the input text (a rationale). Lei et al. (2016) approached (i) by inducing binary latent selectors that control which input positions are available to an NN encoder that learns features for classification/regression, and (ii) by regularising their architectures using sparsity-inducing penalties on latent assignments. In this section we put their approach under a probabilistic light, and this will then more naturally lead to our proposed method.

In text classification, an input $x$ is mapped to a distribution over target labels:
$$Y \mid x \sim \mathrm{Cat}(f(x; \theta)),$$
where we have a neural network architecture $f(\cdot; \theta)$ parameterize the model; $\theta$ collectively denotes the parameters of the NN layers in $f$. That is, an NN maps from data space (e.g. sentences, short paragraphs, or premise-hypothesis pairs) to the categorical parameter space (i.e. a vector of class probabilities). For the sake of concreteness, consider the input a sequence $x = \langle x_1, \ldots, x_n \rangle$. A target $y$ is typically a categorical outcome, such as a sentiment class or an entailment decision, but with an appropriate choice of likelihood it could also be a numerical score (continuous or integer).

Lei et al. (2016) augment this model with a collection of latent variables which we denote by $z = \langle z_1, \ldots, z_n \rangle$. These variables are responsible for regulating which portions of the input $x$ contribute with predictors (i.e. features) to the classifier. The model formulation changes as follows:
$$Z_i \mid x \sim \mathrm{Bern}(g_i(x; \phi)), \qquad Y \mid x, z \sim \mathrm{Cat}(f(x \odot z; \theta)),$$
where an NN $g(\cdot; \phi)$ predicts a sequence of $n$ Bernoulli parameters (one per latent variable) and the classifier is modified such that $z_i$ indicates whether or not $x_i$ is available for encoding.

We can think of the sequence $z$ as a binary gating mechanism used to select a rationale, which with some abuse of notation we denote by $x \odot z$. Figure 1 illustrates the approach.

[1] Code available at https://github.com/bastings/interpretable_predictions.
Parameter estimation for this model can be done by maximizing a lower bound $\mathcal{E}(\phi, \theta)$ on the log-likelihood of the data derived by application of Jensen's inequality:
$$\log P(y \mid x) = \log \mathbb{E}_{p(z \mid x, \phi)}\left[P(y \mid x, z, \theta)\right] \geq \mathbb{E}_{p(z \mid x, \phi)}\left[\log P(y \mid x, z, \theta)\right] = \mathcal{E}(\phi, \theta).$$
These latent rationales approach the first objective, namely, uncovering which parts of the input text contribute towards a decision. Note, however, that an NN controls the Bernoulli parameters, so nothing prevents this NN from selecting the whole of the input, thus defaulting to a standard text classifier. To promote compact rationales, Lei et al. (2016) impose sparsity-inducing penalties on latent selectors. They penalise the total number of selected words, $L_0$ in (4), as well as the total number of transitions, fused lasso in (4), and approach the following optimization problem via gradient-based optimisation,
$$\min_{\phi, \theta} \; -\mathcal{E}(\phi, \theta) + \lambda_0 \underbrace{\sum_{i=1}^{n} z_i}_{L_0(z)} + \lambda_1 \underbrace{\sum_{i=1}^{n-1} |z_i - z_{i+1}|}_{\text{fused lasso}}, \tag{4}$$
where $\lambda_0$ and $\lambda_1$ are fixed hyperparameters. The objective is, however, intractable to compute: the lower bound, in particular, requires marginalization over $O(2^n)$ binary sequences. For that reason, Lei et al. sample latent assignments and work with gradient estimates using REINFORCE (Williams, 1992).
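The two penalties computed on a sampled binary assignment can be illustrated with a minimal Python sketch. The function names `l0_penalty` and `fused_lasso_penalty` and the example mask are hypothetical, chosen only for illustration; they are not taken from the released code.

```python
def l0_penalty(z):
    """Number of selected positions, i.e. non-zero entries of the mask z."""
    return sum(1 for z_i in z if z_i != 0)


def fused_lasso_penalty(z):
    """Number of transitions between neighbours: sum_i |z_i - z_{i+1}|."""
    return sum(abs(z[i] - z[i + 1]) for i in range(len(z) - 1))


# Hypothetical sampled mask over a six-word input.
z = [0, 1, 1, 0, 0, 1]
print("L0:", l0_penalty(z))
print("fused lasso:", fused_lasso_penalty(z))
```

Both quantities are piecewise constant in the mask, which is why they provide no useful gradient signal when computed on discrete samples.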
The key ingredients are, therefore, binary latent variables and sparsity-inducing regularization, and therefore the solution is marked by non-differentiability. We propose to replace Bernoulli variables by rectified continuous random variables (Socci et al., 1998), for they exhibit both discrete and continuous behaviour. Moreover, they are amenable to reparameterization in terms of a fixed random source (Kingma and Welling, 2014), in which case gradient estimation is possible without REINFORCE. Following Louizos et al. (2017), we exploit one such distribution to relax L0 regularization and thus promote compact rationales with a differentiable objective. In section 3, we introduce this distribution and present its properties. In section 4, we employ a Lagrangian relaxation to automatically target a pre-specified selection rate. And finally, in section 5 we present an example for sentiment classification.

Hard Kumaraswamy Distribution
Key to our model is a novel distribution that exhibits both continuous and discrete behaviour, which we introduce in this section. With non-negligible probability, samples from this distribution evaluate to exactly 0 or exactly 1. In a nutshell: i) we start from a distribution over the open interval (0, 1) (see dashed curve in Figure 2); ii) we then stretch its support from l < 0 to r > 1 in order to include {0} and {1} (see solid curve in Figure 2); finally, iii) we collapse the probability mass over the interval (l, 0] to {0}, and similarly, the probability mass over the interval [1, r) to {1} (shaded areas in Figure 2). This stretch-and-rectify technique was proposed by Louizos et al. (2017), who rectified samples from the BinaryConcrete (or GumbelSoftmax) distribution (Maddison et al., 2017; Jang et al., 2017). We adapted their technique to the Kumaraswamy distribution motivated by its close resemblance to a Beta distribution, for which we have stronger intuitions (for example, its two shape parameters transit rather naturally from unimodal to bimodal configurations of the distribution). In the following, we introduce this new distribution formally. [3]

[3] We use uppercase letters for random variables (e.g. K, T, and H) and lowercase for assignments (e.g. k, t, h).

Kumaraswamy distribution
The Kumaraswamy distribution (Kumaraswamy, 1980) is a two-parameter distribution over the open interval (0, 1). We denote a Kumaraswamy-distributed variable by $K \sim \mathrm{Kuma}(a, b)$, where $a \in \mathbb{R}_{>0}$ and $b \in \mathbb{R}_{>0}$ control the distribution's shape. The dashed curve in Figure 2 illustrates the density of Kuma(0.5, 0.5). For more details including its pdf and cdf, consult Appendix A. The Kumaraswamy is a close relative of the Beta distribution, though not itself an exponential family, with a simple cdf whose inverse
$$F_K^{-1}(u; a, b) = \left(1 - (1 - u)^{1/b}\right)^{1/a},$$
for $u \in [0, 1]$, can be used to obtain samples by transformation of a uniform random source $U \sim \mathcal{U}(0, 1)$. We can use this fact to reparameterize expectations (Nalisnick and Smyth, 2016).
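Sampling by inverse transform can be sketched in a few lines of plain Python, assuming the standard Kumaraswamy cdf $F_K(k; a, b) = 1 - (1 - k^a)^b$. The function names are hypothetical and the snippet is an illustration rather than the released implementation.

```python
import random


def kuma_cdf(k, a, b):
    """Kumaraswamy cdf on (0, 1): F_K(k; a, b) = 1 - (1 - k^a)^b."""
    return 1.0 - (1.0 - k ** a) ** b


def kuma_inverse_cdf(u, a, b):
    """Inverse cdf: maps a uniform draw u to a Kumaraswamy draw k."""
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)


def sample_kuma(a, b, rng=random):
    """Draw one sample by transforming a uniform random source."""
    return kuma_inverse_cdf(rng.random(), a, b)


rng = random.Random(0)
print([round(sample_kuma(0.5, 0.5, rng), 3) for _ in range(5)])
```

Because all randomness enters through the uniform draw, the sample is a deterministic and differentiable function of `a` and `b`, which is what makes reparameterized gradients possible.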

Rectified Kumaraswamy
We stretch the support of the Kumaraswamy distribution to include 0 and 1. The resulting variable $T \sim \mathrm{Kuma}(a, b, l, r)$ takes on values in the open interval $(l, r)$ where $l < 0$ and $r > 1$, with cdf
$$F_T(t; a, b, l, r) = F_K\left(\frac{t - l}{r - l}; a, b\right).$$
We now define a rectified random variable, denoted by $H \sim \mathrm{HardKuma}(a, b, l, r)$, by passing a sample $T \sim \mathrm{Kuma}(a, b, l, r)$ through a hard-sigmoid, i.e. $h = \min(1, \max(0, t))$. The resulting variable is defined over the closed interval [0, 1]. Note that while there is 0 probability of sampling $t = 0$, sampling $h = 0$ corresponds to sampling any $t \in (l, 0]$, a set whose mass under $\mathrm{Kuma}(t \mid a, b, l, r)$ is available in closed form:
$$P(H = 0) = F_K\left(\frac{-l}{r - l}; a, b\right).$$
That is because all negative values of $t$ are deterministically mapped to zero. Similarly, samples $t \in [1, r)$ are all deterministically mapped to $h = 1$, whose total mass amounts to
$$P(H = 1) = 1 - F_K\left(\frac{1 - l}{r - l}; a, b\right).$$
See Figure 2 for an illustration, and Appendix A for the complete derivations.

Notation: for a random variable K, $f_K(k; \alpha)$ is the probability density function (pdf), conditioned on parameters $\alpha$, and $F_K(k; \alpha)$ is the cumulative distribution function (cdf).
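The closed-form masses at 0 and 1 can be evaluated with a short Python sketch. It assumes the standard Kumaraswamy cdf $F_K(k; a, b) = 1 - (1 - k^a)^b$ and, purely for illustration, the support boundaries l = -0.1 and r = 1.1; these default values and the function names are assumptions, not values stated in the text.

```python
def kuma_cdf(k, a, b):
    """Kumaraswamy cdf on (0, 1)."""
    return 1.0 - (1.0 - k ** a) ** b


def hardkuma_masses(a, b, l=-0.1, r=1.1):
    """Return (P(H=0), P(H=1), P(0<H<1)) for HardKuma(a, b, l, r)."""
    p_zero = kuma_cdf((0.0 - l) / (r - l), a, b)      # mass of t in (l, 0]
    p_one = 1.0 - kuma_cdf((1.0 - l) / (r - l), a, b)  # mass of t in [1, r)
    p_cont = 1.0 - p_zero - p_one                      # mass of t in (0, 1)
    return p_zero, p_one, p_cont


for a, b in [(0.5, 0.5), (1.0, 1.0), (5.0, 0.1)]:
    print((a, b), [round(p, 4) for p in hardkuma_masses(a, b)])
```

With a = b = 1 the underlying variable is uniform, so each point mass equals the fraction of the stretched support that falls outside (0, 1).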

Reparameterization and gradients
Because this rectified variable is built upon a Kumaraswamy, it admits a reparameterisation in terms of a uniform variable $U \sim \mathcal{U}(0, 1)$. We need to first sample a uniform variable in the open interval (0, 1) and transform the result to a Kumaraswamy variable via the inverse cdf (10a), then shift and scale the result to cover the stretched support (10b), and finally, apply the rectifier in order to get a sample in the closed interval [0, 1] (10c):
$$k = F_K^{-1}(u; a, b) \tag{10a}$$
$$t = l + (r - l)\,k \tag{10b}$$
$$h = \min(1, \max(0, t)) \tag{10c}$$
We denote this $h = s(u; a, b, l, r)$ for short. Note that this transformation has two points of non-differentiability, namely, $t = 0$ and $t = 1$. Recall, though, that the probability of sampling $t$ exactly 0 or exactly 1 is zero, which essentially means stochasticity circumvents the points of non-differentiability of the rectifier (see Appendix A.3).
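The three-step transformation from a uniform draw to a HardKuma sample can be sketched as follows. This is a plain-Python illustration with hypothetical names; the boundaries l = -0.1 and r = 1.1 are assumed for the example and are not specified in the text.

```python
import random


def hardkuma_sample(u, a, b, l=-0.1, r=1.1):
    """h = s(u; a, b, l, r): uniform draw -> Kumaraswamy -> stretch -> rectify."""
    k = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)  # inverse cdf
    t = l + (r - l) * k                               # shift and scale
    return min(1.0, max(0.0, t))                      # hard-sigmoid


rng = random.Random(0)
samples = [hardkuma_sample(rng.random(), 0.5, 0.5) for _ in range(10000)]
frac_zero = sum(h == 0.0 for h in samples) / len(samples)
frac_one = sum(h == 1.0 for h in samples) / len(samples)
print("fraction exactly 0:", frac_zero)
print("fraction exactly 1:", frac_one)
```

A noticeable share of the draws lands exactly on 0 or 1, while the remainder is spread over the open interval, which is the mixed discrete-continuous behaviour the section describes.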

Controlled Sparsity
Following Louizos et al. (2017), we relax nondifferentiable penalties by computing them on expectation under our latent model p(z|x, φ). In addition, we propose the use of Lagrangian relaxation to target specific values for the penalties.
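Thanks to the tractable Kumaraswamy cdf, the expected value of $L_0(z)$ is known in closed form
$$\mathbb{E}_{p(z \mid x)}\left[L_0(z)\right] = \sum_{i=1}^{n} \mathbb{E}_{p(z_i \mid x)}\left[\mathbb{I}[z_i \neq 0]\right] = \sum_{i=1}^{n} \left(1 - P(Z_i = 0)\right),$$
where $P(Z_i = 0) = F_K\left(\frac{-l}{r - l}; a_i, b_i\right)$. This quantity is a tractable and differentiable function of the parameters $\phi$ of the latent model. We can also compute a relaxation of fused lasso by computing the expected number of zero-to-nonzero and nonzero-to-zero changes:
$$\sum_{i=1}^{n-1} \Big[ P(Z_i = 0)\left(1 - P(Z_{i+1} = 0)\right) + \left(1 - P(Z_i = 0)\right) P(Z_{i+1} = 0) \Big].$$
In both cases, we make the assumption that latent variables are independent given $x$. In Appendix B.1.2 we discuss how to estimate the regularizers for a model $p(z_i \mid x, z_{<i})$ that conditions on the prefix $z_{<i}$ of sampled HardKuma assignments.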
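Under the independence assumption, both expected penalties reduce to sums over per-position probabilities of a zero. The sketch below illustrates this in plain Python; the function names, the example shape parameters, and the boundaries l = -0.1, r = 1.1 are assumptions made for the example.

```python
def prob_zero(a, b, l=-0.1, r=1.1):
    """P(Z = 0) for HardKuma(a, b, l, r): the Kumaraswamy cdf at -l / (r - l)."""
    return 1.0 - (1.0 - ((0.0 - l) / (r - l)) ** a) ** b


def expected_l0(shapes, l=-0.1, r=1.1):
    """Expected number of non-zero selectors: sum_i (1 - P(Z_i = 0))."""
    return sum(1.0 - prob_zero(a, b, l, r) for a, b in shapes)


def expected_fused_lasso(shapes, l=-0.1, r=1.1):
    """Expected number of zero-to-nonzero and nonzero-to-zero changes."""
    p0 = [prob_zero(a, b, l, r) for a, b in shapes]
    return sum(
        p0[i] * (1.0 - p0[i + 1]) + (1.0 - p0[i]) * p0[i + 1]
        for i in range(len(p0) - 1)
    )


# Hypothetical (a_i, b_i) pairs for a four-word input.
shapes = [(0.2, 3.0), (4.0, 0.3), (4.0, 0.3), (0.2, 3.0)]
print("expected L0:", round(expected_l0(shapes), 4))
print("expected selection rate:", round(expected_l0(shapes) / len(shapes), 4))
print("expected fused lasso:", round(expected_fused_lasso(shapes), 4))
```

Dividing the expected L0 by the sequence length gives an expected selection rate, which is the quantity one would compare against a target rate.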
We can use regularizers to promote sparsity, but just how much text will our final model select? Ideally, we would target specific values $r$ and solve a constrained optimization problem. In practice, constrained optimisation is very challenging, thus we employ Lagrangian relaxation instead:
$$\max_{\lambda} \min_{\phi, \theta} \; -\mathcal{E}(\phi, \theta) + \lambda^{\top} \left(R(\phi) - r\right),$$
where $R(\phi)$ is a vector of regularisers, e.g. expected L0 and expected fused lasso, and $\lambda$ is a vector of Lagrangian multipliers. Note how this differs from the treatment of Lei et al. (2016) shown in (4), where regularizers are computed for assignments, rather than on expectation, and where $\lambda_0$, $\lambda_1$ are fixed hyperparameters.
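The interplay between the model parameters and a multiplier can be illustrated on a toy problem. In the sketch below, a scalar `s` stands in for the expected selection rate, the quadratic `(s - 0.9) ** 2` stands in for the task loss (it prefers selecting 90% of the text), and the target rate is 0.3. All of these, along with the plain gradient descent-ascent update and step size, are assumptions for illustration; the update schedule actually used for the multipliers is not described in this section.

```python
def lagrangian_step(s, lam, target, lr=0.1):
    """One simultaneous descent step on s and ascent step on the multiplier.

    Toy Lagrangian: (s - 0.9)^2 + lam * (s - target).
    """
    grad_s = 2.0 * (s - 0.9) + lam   # derivative with respect to s
    grad_lam = s - target            # derivative with respect to lam
    return s - lr * grad_s, lam + lr * grad_lam


s, lam = 0.9, 0.0
for _ in range(500):
    s, lam = lagrangian_step(s, lam, target=0.3)

print("selection rate:", round(s, 4))
print("multiplier:", round(lam, 4))
```

The multiplier grows while the rate exceeds the target and stops changing once the constraint is met, so no penalty weight has to be tuned by hand to hit a given rate.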

Sentiment Classification
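As a concrete example, consider the case of sentiment classification where $x$ is a sentence and $y$ is a 5-way sentiment class (from very negative to very positive). The model consists of
$$Z_i \mid x \sim \mathrm{HardKuma}(a_i, b_i, l, r), \qquad Y \mid x, z \sim \mathrm{Cat}(f(x \odot z; \theta)),$$
where the shape parameters $a, b = g(x; \phi)$, i.e. two sequences of $n$ strictly positive scalars, are predicted by an NN, and the support boundaries $(l, r)$ are fixed hyperparameters. We first specify an architecture that parameterizes latent selectors and then use a reparameterized sample to restrict which parts of the input contribute encodings for classification: [4]
$$e_i = \mathrm{emb}(x_i), \qquad h_1^n = \mathrm{birnn}(e_1^n; \phi_r), \qquad u_i \sim \mathcal{U}(0, 1),$$
$$a_i = f_a(h_i; \phi_a), \qquad b_i = f_b(h_i; \phi_b), \qquad z_i = s(u_i; a_i, b_i, l, r),$$
where $\mathrm{emb}(\cdot)$ is an embedding layer, $\mathrm{birnn}(\cdot; \phi_r)$ is a bidirectional encoder, $f_a(\cdot; \phi_a)$ and $f_b(\cdot; \phi_b)$ are feed-forward transformations with softplus outputs, and $s(\cdot)$ turns the uniform sample $u_i$ into the latent selector $z_i$ (see §3). We then use the sampled $z$ to modulate inputs to the classifier:
$$h_i^{\mathrm{fwd}} = \mathrm{rnn}(h_{i-1}^{\mathrm{fwd}}, z_i e_i; \theta_{\mathrm{fwd}}), \qquad h_i^{\mathrm{bwd}} = \mathrm{rnn}(h_{i+1}^{\mathrm{bwd}}, z_i e_i; \theta_{\mathrm{bwd}}), \qquad o = f_o(h_n^{\mathrm{fwd}}, h_1^{\mathrm{bwd}}; \theta_o),$$
where $\mathrm{rnn}(\cdot; \theta_{\mathrm{fwd}})$ and $\mathrm{rnn}(\cdot; \theta_{\mathrm{bwd}})$ are recurrent cells such as LSTMs (Hochreiter and Schmidhuber, 1997) that process the sequence in different directions, and $f_o(\cdot; \theta_o)$ is a feed-forward transformation with softmax output. Note how $z_i$ modulates features $e_i$ of the input $x_i$ that are available to the recurrent composition function.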
We then obtain gradient estimates of $\mathcal{E}(\phi, \theta)$ via Monte Carlo (MC) sampling from
$$\mathcal{E}(\phi, \theta) = \mathbb{E}_{\mathcal{U}(0, I)}\left[\log P(y \mid x, s_\phi(u, x), \theta)\right],$$
where $z = s_\phi(u, x)$ is a shorthand for elementwise application of the transformation from uniform samples to HardKuma samples. This reparameterisation is the key to gradient estimation through stochastic computation graphs (Kingma and Welling, 2014; Rezende et al., 2014).

[4] We describe architectures using blocks denoted by layer(inputs; subset of parameters), boldface letters for vectors, and the shorthand $v_1^n$ for a sequence $\langle v_1, \ldots, v_n \rangle$.

Table 1:
Method                      MSE
SVM (Lei et al., 2016)      0.0154
BiLSTM (Lei et al., 2016)   0.0094
BiRCNN (Lei et al., 2016)   [...]

Deterministic predictions. At test time we make predictions based on what is the most likely assignment for each $z_i$. We take the arg max across the three configurations of the distribution, namely, $z_i = 0$, $z_i = 1$, or $0 < z_i < 1$. When the continuous interval is more likely, we take the expected value of the underlying Kumaraswamy variable.
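The test-time rule can be sketched in plain Python. The snippet makes several assumptions that the text does not settle: it reads "underlying Kumaraswamy variable" as the unstretched Kuma(a, b) on (0, 1), uses the standard Kumaraswamy mean $b\,\Gamma(1 + 1/a)\,\Gamma(b) / \Gamma(1 + 1/a + b)$, and fixes l = -0.1 and r = 1.1. Function names are hypothetical.

```python
import math


def kuma_cdf(k, a, b):
    """Kumaraswamy cdf on (0, 1)."""
    return 1.0 - (1.0 - k ** a) ** b


def kuma_mean(a, b):
    """Mean of Kuma(a, b), computed in log space for numerical stability."""
    log_mean = (math.log(b) + math.lgamma(1.0 + 1.0 / a) + math.lgamma(b)
                - math.lgamma(1.0 + 1.0 / a + b))
    return math.exp(log_mean)


def deterministic_z(a, b, l=-0.1, r=1.1):
    """Pick the most likely of {z = 0, z = 1, 0 < z < 1}."""
    p_zero = kuma_cdf((0.0 - l) / (r - l), a, b)
    p_one = 1.0 - kuma_cdf((1.0 - l) / (r - l), a, b)
    p_cont = 1.0 - p_zero - p_one
    if p_cont >= p_zero and p_cont >= p_one:
        return kuma_mean(a, b)   # continuous case: use the expected value
    return 0.0 if p_zero > p_one else 1.0


for a, b in [(0.1, 5.0), (1.0, 1.0), (5.0, 0.1)]:
    print((a, b), deterministic_z(a, b))
```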

Experiments
We perform experiments on multi-aspect sentiment analysis to compare with previous work, as well as experiments on sentiment classification and natural language inference. All models were implemented in PyTorch, and Appendix B provides implementation details.
Goal. When rationalizing predictions, our goal is to perform as well as systems using the full input text, while using only a subset of the input text, leaving unnecessary words out for interpretability.

Multi-aspect Sentiment Analysis
In our first experiment we compare directly with previous work on rationalizing predictions (Lei et al., 2016). We replicate their setting.

Data.
A pre-processed subset of the BeerAdvocate data set is used (McAuley et al., 2012). It consists of 220,000 beer reviews, where multiple aspects (e.g. look, smell, taste) are rated. As shown in Figure 1, a review typically consists of multiple sentences, and contains a 0-5 star rating (e.g. 3.5 stars) for each aspect. Lei et al. mapped the ratings to scalars in [0, 1].
Model. We use the models described in §5 with two small modifications: 1) since this is a regression task, we use a sigmoid activation in the output layer of the classifier rather than a softmax, and 2) we use an extra RNN to condition $z_i$ on $z_{<i}$. For a fair comparison we follow Lei et al. by using RCNN [7] cells rather than LSTM cells for encoding sentences on this task. Since this cell is not widely used, we verified its performance in Table 1. We observe that the BiRCNN performs on par with the BiLSTM (while using 50% fewer parameters), and similarly to previous results.

Table 2: Precision (% of selected words that were also annotated as the gold rationale) and selected (% of words not zeroed out) per aspect. In the attention baseline, the top 13% (7%) of words with highest attention weights are used for classification. Models were selected based on validation loss.

Evaluation.
A test set with sentence-level rationale annotations is available. The precision of a rationale is defined as the percentage of selected words (those with $z \neq 0$) that are part of the annotation. We also evaluate the predictions made from the rationale using mean squared error (MSE).
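The two quantities reported per aspect, precision and the percentage of selected words, can be computed as in the following sketch. It works on a single document with a hypothetical mask and gold annotation; how the scores are aggregated over the test set is not specified here and is left out.

```python
def rationale_precision(z, gold):
    """Percentage of selected words (z_i != 0) that are in the gold rationale."""
    selected = [i for i, z_i in enumerate(z) if z_i != 0]
    if not selected:
        return 0.0  # convention chosen for this sketch when nothing is selected
    hits = sum(1 for i in selected if gold[i])
    return 100.0 * hits / len(selected)


def selected_rate(z):
    """Percentage of words that are not zeroed out."""
    return 100.0 * sum(1 for z_i in z if z_i != 0) / len(z)


# Hypothetical selector values and gold annotation for a five-word input.
z = [0.0, 1.0, 0.5, 0.0, 1.0]
gold = [0, 1, 1, 0, 0]
print("precision:", round(rationale_precision(z, gold), 2))
print("selected:", round(selected_rate(z), 2))
```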
Baselines. For our baseline we reimplemented the approach of Lei et al. (2016) which we call Bernoulli after the distribution they use to sample z from. We also report their attention baseline, in which an attention score is computed for each word, after which it is simply thresholded to select the top-k percent as the rationale.
Results. Table 2 shows the precision and the percentage of selected words for the first three aspects. The models here have been selected based on validation MSE and were tuned to select a similar percentage of words ('selected'). We observe that our Bernoulli reimplementation reaches precision similar to previous work, doing a little worse for the 'look' aspect. Our HardKuma managed to get even higher precision, and it extracted exactly the percentage of text that we specified (see §4). Figure 3 shows the MSE for all aspects for various percentages of extracted text. We observe that HardKuma does better with a smaller percentage of text selected. The performance becomes more similar as more text is selected.

[7] An RCNN cell can replace any LSTM cell and works well on text classification problems. See Appendix B.

Sentiment Classification
We also experiment on the Stanford Sentiment Treebank (SST) (Socher et al., 2013). There are 5 sentiment classes: very negative, negative, neutral, positive, and very positive. Here we use the HardKuma model described in §5, a Bernoulli model trained with REINFORCE, as well as a BiLSTM.

Results. Figure 4 shows the classification accuracy for various percentages of selected text. We observe that HardKuma outperforms the Bernoulli model at each percentage of selected text.

Figure 4: SST validation accuracy for various percentages of extracted text. HardKuma (blue crosses) has higher accuracy than Bernoulli (red circles) for a similar amount of text, and reaches the full-text baseline (black star, 46.3 ± 2σ with σ = 0.7) around 40% text.

Analysis.
We wonder what kind of words are dropped when we select smaller amounts of text. For this analysis we exploit the word-level sentiment annotations in SST, which allows us to track the sentiment of words in the rationale. Figure 5 shows that a large portion of dropped words have neutral sentiment, and it seems plausible that exactly those words are not important features for classification. We also see that HardKuma drops (relatively) more neutral words than Bernoulli.

Natural Language Inference
In natural language inference (NLI), given a premise sentence $x^{(p)}$ and a hypothesis sentence $x^{(h)}$, the goal is to predict their relation $y$, which can be contradiction, entailment, or neutral. As our dataset we use the Stanford Natural Language Inference (SNLI) corpus (Bowman et al., 2015).

Baseline. We use the Decomposable Attention model (DA) of Parikh et al. (2016). [9] DA does not make use of LSTMs, but rather uses attention to find connections between the premise and the hypothesis that are predictive of the relation. Each word in the premise attends to each word in the hypothesis, and vice versa, resulting in a set of comparison vectors which are then aggregated for a final prediction. If there is no link between a word pair, it is not considered for prediction.

[9] Better results (e.g. Chen et al., 2017) and data sets for NLI exist, but are not the focus of this paper.

Model. Because the premise and hypothesis interact, it does not make sense to extract a rationale for the premise and hypothesis independently. Instead, we replace the attention between premise and hypothesis with HardKuma attention. Whereas in the baseline a similarity matrix is softmax-normalized across rows (premise to hypothesis) and columns (hypothesis to premise) to produce attention matrices, in our model each cell in the attention matrix is sampled from a HardKuma parameterized by (a, b). To promote sparsity, we use the relaxed L0 to specify the desired percentage of non-zero attention cells. The resulting matrix does not need further normalization.

Results. With a target rate of 10%, the HardKuma model achieved 8.5% non-zero attention. Table 3 shows that, even with so many zeros in the attention matrices, it only does about 1% worse compared to the DA baseline. Figure 6 [...]

Related Work
This work has connections with work on interpretability, learning from rationales, sparse structures, and rectified distributions. We discuss each of those areas.
Interpretability. Machine learning research has been focusing more and more on interpretability (Gilpin et al., 2018). However, there are many nuances to interpretability (Lipton, 2016), and amongst them we focus on model transparency. One strategy is to extract a simpler, interpretable model from a neural network, though this comes at the cost of performance. For example, Thrun (1995) extract if-then rules, while Craven and Shavlik (1996) extract decision trees.
There is also work on making word vectors more interpretable. Faruqui et al. (2015) make word vectors more sparse, and Herbelot and Vecchi (2015) learn to map distributional word vectors to model-theoretic semantic vectors.
Similarly to Lei et al. (2016), Titov and McDonald (2008) extract informative fragments of text by jointly training a classifier and a model predicting a stochastic mask, while relying on Gibbs sampling to do so. Their focus is on using the sentiment labels as a weak supervision signal for opinion summarization rather than on rationalizing classifier predictions.
There are also related approaches that aim to interpret an already-trained model, in contrast to Lei et al. (2016) and our approach where the rationale is jointly modeled. Ribeiro et al. (2016) make any classifier interpretable by approximating it locally with a linear proxy model in an approach called LIME, and Alvarez-Melis and Jaakkola (2017) propose a framework that returns input-output pairs that are causally related.
Learning from rationales. Our work is different from approaches that aim to improve classification using rationales as an additional input (Zaidan et al., 2007;Zaidan and Eisner, 2008;Zhang et al., 2016). Instead, our rationales are latent and we are interested in uncovering them. We only use annotated rationales for evaluation.
Sparse layers. Also arguing for enhanced interpretability, Niculae and Blondel (2017) propose a framework for learning sparsely activated attention layers based on smoothing the max operator. They derive a number of relaxations to max, including softmax itself, but in particular, they target relaxations such as sparsemax (Martins and Astudillo, 2016) which, unlike softmax, are sparse (i.e. produce vectors of probability values with components that evaluate to exactly 0). Their activation functions are themselves solutions to convex optimization problems, to which they provide efficient forward and backward passes. The technique can be seen as a deterministic sparsely activated layer which they use as a drop-in replacement to standard attention mechanisms. In contrast, in this paper we focus on binary outcomes rather than K-valued ones. Niculae et al. (2018) extend the framework to structured discrete spaces where they learn sparse parameterizations of discrete latent models. In this context, parameter estimation requires exact marginalization of discrete variables or gradient estimation via REINFORCE. They show that oftentimes distributions are sparse enough to enable exact marginal inference. Peng et al. (2018) propose SPIGOT, a proxy gradient to the non-differentiable arg max operator. This proxy requires an arg max solver (e.g. Viterbi for structured prediction) and, like the straight-through estimator (Bengio et al., 2013), is a biased estimator. Though, unlike ST it is efficient for structured variables. In contrast, in this work we chose to focus on unbiased estimators.
Rectified Distributions. The idea of rectified distributions has been around for some time. The rectified Gaussian distribution (Socci et al., 1998), in particular, has found applications to factor analysis (Harva and Kaban, 2005) and approximate inference in graphical models (Winn and Bishop, 2005). Louizos et al. (2017) propose to stretch and rectify samples from the BinaryConcrete (or GumbelSoftmax) distribution (Maddison et al., 2017; Jang et al., 2017). They use rectified variables to induce sparsity in parameter space via a relaxation to L0. We adapt their technique to promote sparse activations instead. Rolfe (2017) learns a relaxation of a discrete random variable based on a tractable mixture of a point mass at zero and a continuous reparameterizable density, thus enabling reparameterized sampling from the half-closed interval [0, ∞). In contrast, with HardKuma we focused on giving support to both 0s and 1s.

Conclusions
We presented a differentiable approach to extractive rationales, including an objective that allows for specifying how much text is to be extracted. To allow for reparameterized gradient estimates and support for binary outcomes we introduced the HardKuma distribution. Apart from extracting rationales, we showed that HardKuma has further potential uses, which we demonstrated on premise-hypothesis attention in SNLI. We leave further explorations for future work.
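A Kumaraswamy-distributed variable $K \sim \mathrm{Kuma}(a, b)$ takes on values in the open interval (0, 1) and has density
$$f_K(k; a, b) = a\,b\,k^{a-1}\,(1 - k^a)^{b-1},$$
where $a \in \mathbb{R}_{>0}$ and $b \in \mathbb{R}_{>0}$ are shape parameters. Its cumulative distribution takes a simple closed-form expression
$$F_K(k; a, b) = 1 - (1 - k^a)^b,$$
with inverse
$$F_K^{-1}(u; a, b) = \left(1 - (1 - u)^{1/b}\right)^{1/a}. \tag{19}$$

A.1 Generalised-support Kumaraswamy
We can generalise the support of a Kumaraswamy variable by specifying two constants $l < r$ and transforming a random variable $K \sim \mathrm{Kuma}(a, b)$ to obtain $T \sim \mathrm{Kuma}(a, b, l, r)$ as shown in (20, left):
$$t = l + (r - l)\,k, \qquad k = \frac{t - l}{r - l}. \tag{20}$$
The density of the resulting variable is
$$f_T(t; a, b, l, r) = f_K\left(\frac{t - l}{r - l}; a, b\right) \frac{1}{r - l},$$
where $r - l > 0$ by definition. This affine transformation leaves the cdf unchanged, i.e.
$$F_T(t; a, b, l, r) = F_K\left(\frac{t - l}{r - l}; a, b\right). \tag{22}$$
Thus we can obtain samples from this generalised-support Kumaraswamy by sampling from a uniform distribution $\mathcal{U}(0, 1)$, applying the inverse transform (19), then shifting and scaling the sample according to (20, left).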

A.2 Rectified Kumaraswamy
First, we stretch a Kumaraswamy distribution to include 0 and 1 in its support, that is, with $l < 0$ and $r > 1$, we define $T \sim \mathrm{Kuma}(a, b, l, r)$. Then we apply a hard-sigmoid transformation to this variable, that is, $h = \min(1, \max(0, t))$, which results in a rectified distribution which gives support to the closed interval [0, 1]. We denote this rectified variable by $H \sim \mathrm{HardKuma}(a, b, l, r)$ whose distribution function is
$$f_H(h; a, b, l, r) = P(H = 0)\,\delta(h) + P(H = 1)\,\delta(h - 1) + f_T(h; a, b, l, r)\,\mathbb{1}_{(0,1)}(h),$$
where
$$P(H = 0) = F_K\left(\frac{-l}{r - l}; a, b\right)$$
is the probability of sampling exactly 0, where
$$P(H = 1) = 1 - F_K\left(\frac{1 - l}{r - l}; a, b\right)$$
is the probability of sampling exactly 1, and
$$P(0 < H < 1) = 1 - P(H = 0) - P(H = 1),$$
the total mass of the last term, is the probability of drawing a continuous value in (0, 1). Note that we used the result in (22) to express these probabilities in terms of the tractable cdf of the original Kumaraswamy variable.

A.3 Reparameterized gradients
Let us consider the case where we need derivatives of a function $L(u)$ of the underlying uniform variable $u$, as when we compute reparameterized gradients in variational inference. Note that
$$\frac{\partial L}{\partial u} = \frac{\partial L}{\partial h} \times \frac{\partial h}{\partial t} \times \frac{\partial t}{\partial k} \times \frac{\partial k}{\partial u}$$
by chain rule. The term $\frac{\partial L}{\partial h}$ depends on a differentiable observation model and poses no challenge; the term $\frac{\partial h}{\partial t}$ is the derivative of the hard-sigmoid function, which is 0 for $t < 0$ or $t > 1$, 1 for $0 < t < 1$, and undefined for $t \in \{0, 1\}$; the term $\frac{\partial t}{\partial k} = r - l$ follows directly from (20, left); the term $\frac{\partial k}{\partial u} = \frac{\partial}{\partial u} F_K^{-1}(u; a, b)$ depends on the Kumaraswamy inverse cdf (19) and also poses no challenge. Thus the only two points of non-differentiability are $t \in \{0, 1\}$, which is a set of measure 0 under the stretched Kumaraswamy: we say this reparameterisation is differentiable almost everywhere, a useful property which essentially circumvents the non-differentiable points of the rectifier.

Figure 8 plots the pdf of the HardKumaraswamy for various a and b parameters. Figure 9 does the same but with the cdf.
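The chain-rule factors can be checked numerically. The sketch below computes the derivative of the sample with respect to the uniform draw analytically, using $\partial k / \partial u = \frac{1}{ab}\,(1 - (1-u)^{1/b})^{1/a - 1}\,(1 - u)^{1/b - 1}$ obtained by differentiating the inverse cdf, and compares it with a central finite difference. The function names and the boundaries l = -0.1, r = 1.1 are assumptions made for this illustration.

```python
def hardkuma_and_grad(u, a, b, l=-0.1, r=1.1):
    """Return the sample h = s(u; a, b, l, r) and its derivative dh/du."""
    w = 1.0 - (1.0 - u) ** (1.0 / b)
    k = w ** (1.0 / a)
    t = l + (r - l) * k
    h = min(1.0, max(0.0, t))
    dk_du = (1.0 / (a * b)) * w ** (1.0 / a - 1.0) * (1.0 - u) ** (1.0 / b - 1.0)
    dt_dk = r - l
    dh_dt = 1.0 if 0.0 < t < 1.0 else 0.0  # rectifier is flat outside (0, 1)
    return h, dh_dt * dt_dk * dk_du


def finite_difference(u, a, b, eps=1e-6):
    """Central finite-difference estimate of dh/du."""
    h_plus, _ = hardkuma_and_grad(u + eps, a, b)
    h_minus, _ = hardkuma_and_grad(u - eps, a, b)
    return (h_plus - h_minus) / (2.0 * eps)


h, grad = hardkuma_and_grad(0.5, 2.0, 3.0)
print("h:", h)
print("analytic dh/du:", grad)
print("finite difference:", finite_difference(0.5, 2.0, 3.0))
```

When the stretched sample falls outside (0, 1) the derivative is exactly zero, so no gradient reaches the shape parameters through that position for that draw.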

B.1 Multi-aspect Sentiment Analysis
Our hyperparameters are taken from Lei et al. (2016) and listed in Table 4. The pre-trained word embeddings and data sets are available online at http://people.csail.mit.edu/taolei/beer/. We train for 100 epochs and [...]

For the Bernoulli baselines we vary the L0 weight $\lambda_1$ among {0.0002, 0.0003, 0.0004}, just as in the original paper. We set the fused lasso (coherence) weight $\lambda_2$ to $2\lambda_1$.
For the HardKuma models we set a target selection rate to the values targeted in Table 2, and optimize to this end using the Lagrange multiplier. We chose the fused lasso weight from {0.0001, 0.0002, 0.0003, 0.0004}.

B.1.1 Recurrent Unit
In our multi-aspect sentiment analysis experiments we use the RCNN of Lei et al. (2016). Intuitively, the RCNN is supposed to capture n-gram features that are not necessarily consecutive. We use the bigram version (filter width n = 2) used in Lei et al. (2016), which is defined as:

B.1.2 Expected values for dependent latent variables
The expected L0 is a chain of nested expectations, and we solve each term as a function of a sampled prefix, and the shape parameters $a_i, b_i = g_i(x, z_{<i}; \phi)$ are predicted in sequence.

B.2 Sentiment Classification (SST)
For sentiment classification we make use of the PyTorch bidirectional LSTM module for encoding sentences, for both the rationale extractor and the classifier. The BiLSTM final states are concatenated, after which a linear layer followed by a softmax produces the prediction. Hyperparameters are listed in Table 5. We apply dropout to the embeddings and to the input of the output layer.

B.3 Natural Language Inference (SNLI)
Our hyperparameters are taken from Parikh et al. (2016) and listed in Table 6. Different from Parikh et al. is that we use Adam as the optimizer and a batch size of 64. Word embeddings are projected to 200 dimensions with a trained linear layer. Unknown words are mapped to 100 unknown word classes based on the MD5 hash function, just as in Parikh et al. (2016), and unknown word vectors are randomly initialized. We train for 100 epochs, evaluate every 1000 updates, and select the best model based on validation loss. Figure 10 shows a correct and incorrect example with HardKuma attention for each relation type (entailment, contradiction, neutral).
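Mapping unknown words to a fixed number of classes with MD5 can be sketched as below. The exact scheme of Parikh et al. (2016) is not reproduced in this section, so taking the digest modulo the number of classes, and the function name `unk_class`, are assumptions made for illustration.

```python
import hashlib


def unk_class(word, num_classes=100):
    """Map an out-of-vocabulary word to one of num_classes buckets via MD5."""
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_classes


for word in ["zyzzyva", "quokka", "zyzzyva"]:
    print(word, "->", "<unk-%d>" % unk_class(word))
```

Because the hash is deterministic, the same unknown word always shares one randomly initialized vector, while different unknown words are spread over the 100 classes.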