Rationale-Augmented Convolutional Neural Networks for Text Classification

We present a new Convolutional Neural Network (CNN) model for text classification that jointly exploits labels on documents and their constituent sentences. Specifically, we consider scenarios in which annotators explicitly mark sentences (or snippets) that support their overall document categorization, i.e., they provide rationales. Our model exploits such supervision via a hierarchical approach in which each document is represented by a linear combination of the vector representations of its component sentences. We propose a sentence-level convolutional model that estimates the probability that a given sentence is a rationale, and we then scale the contribution of each sentence to the aggregate document representation in proportion to these estimates. Experiments on five classification datasets that have document labels and associated rationales demonstrate that our approach consistently outperforms strong baselines. Moreover, our model naturally provides explanations for its predictions.


Introduction
Neural models that exploit word embeddings have recently achieved impressive results on text classification tasks (Goldberg, 2015). Feed-forward Convolutional Neural Networks (CNNs), in particular, have emerged as a relatively simple yet powerful class of models for text classification (Kim, 2014).
These neural text classification models have tended to assume a standard supervised learning setting in which instance labels are provided. Here we consider an alternative scenario in which we assume that we are provided a set of rationales (Zaidan et al., 2007; Zaidan and Eisner, 2008) in addition to instance labels, i.e., sentences or snippets that support the corresponding document categorizations. Providing such rationales during manual classification is a natural interaction for annotators, and requires little additional effort (Settles, 2011). Therefore, when training new classification systems, it is natural to acquire supervision at both the document and sentence level, with the aim of inducing a better predictive model, potentially with less effort.
Learning algorithms must be designed to capitalize on these two types of supervision. Past work (Section 2) has introduced such methods, but these have relied on linear models such as Support Vector Machines (SVMs) (Joachims, 1998), operating over sparse representations of text. We propose a novel CNN model for text classification that exploits both document labels and associated rationales. The specific contributions of this work are as follows.
(1) This is the first work to incorporate rationales into neural models for text classification. (2) Empirically, we show that the proposed model uniformly outperforms relevant baseline approaches across five datasets, including previously proposed models that capitalize on rationales (Zaidan et al., 2007; Marshall et al., 2016) and multiple baseline CNN variants, including a CNN equipped with an attention mechanism. We also report state-of-the-art results on the important task of automatically assessing the risks of bias in the studies described in full-text biomedical articles (Marshall et al., 2016). (3) Our model naturally provides explanations for its predictions, providing interpretability.

Kim (2014) proposed the basic CNN model we describe below and then build upon in this work. Properties of this model were explored empirically in (Zhang and Wallace, 2015). We also note that Zhang et al. (2016) extended this model to jointly accommodate multiple sets of pre-trained word embeddings. Roughly concurrently to Kim, Johnson and Zhang (2014) proposed a similar CNN architecture, although they swapped in one-hot vectors in place of (pre-trained) word embeddings. They later developed a semi-supervised variant of this approach (Johnson and Zhang, 2015).
In related recent work on Recurrent Neural Network (RNN) models for text, Tang et al. (2015) proposed using a Long Short Term Memory (LSTM) layer to represent each sentence and then passing another RNN variant over these. And very recently, Yang et al. (2016) proposed a hierarchical network with two levels of attention mechanisms for document classification. We discuss this model specifically as well as attention more generally and its relationship to our proposed approach in Section 4.3.

Exploiting rationales
The notion of rationales was first introduced by Zaidan et al. (2007). They proposed modifying the Support Vector Machine (SVM) objective function to encode a preference for parameter values that result in instances that include manually annotated rationales being more confidently classified than 'pseudo'-instances from which these rationales have been stripped. They showed that this approach dramatically outperformed baseline SVM variants that did not exploit rationales.
Another line of related work concerns models that capitalize on dual supervision, i.e., labels on individual features. This work has largely involved inserting constraints into the learning process that favor parameter values that align with a priori feature-label affinities or rankings (Druck et al., 2008; Mann and McCallum, 2010; Small et al., 2011; Settles, 2011). We do not discuss this line of work further here, as our focus is on exploiting provided rationales, rather than individual labeled features.
We first review the simple one-layer CNN for sentence modeling proposed by Kim (2014). Given an instance (sentence or document) comprising $n$ words $w_1, w_2, \ldots, w_n$, we replace each word with its $d$-dimensional pretrained embedding, forming an instance matrix $A \in \mathbb{R}^{n \times d}$.
We then apply convolution operations on this matrix using multiple linear filters, each of which has the same width $d$ but differing heights. In effect each filter thus considers distinct n-gram features, where n corresponds to filter height. (In practice, we introduce multiple, redundant features of each height.) Applying filter $i$ parameterized by $W_i \in \mathbb{R}^{h_i \times d}$ to the instance matrix induces a feature map $f_i \in \mathbb{R}^{n - h_i + 1}$. We then apply an element-wise nonlinear transformation to the feature map; specifically, we run $f_i$ through a Rectified Linear Unit, or ReLU (Krizhevsky et al., 2012). We extract the maximum value $o_i$ from each feature map $i$ (1-max pooling).
Finally, we concatenate all of the features $o_i$ to form a vector representation $o \in \mathbb{R}^{|F|}$ for this instance, where $|F|$ denotes the total number of filters.
Classification is then performed on top of $o$, via a softmax function. Dropout (Srivastava et al., 2014) is often applied at this layer as a means of regularization. We provide an illustrative example of the basic CNN architecture just described in Figure 1.
This model was originally proposed for sentence classification, but we can adapt it for document classification by simply treating the document as one long sentence. We will refer to this basic CNN variant as CNN in the rest of the paper. Below we consider extensions that account for document structure.
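The forward pass described above can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the names `cnn_features` and `softmax`, the random initialization, the two-class output layer, and the toy sizes are all hypothetical, and bias terms, dropout, and training are omitted. With n = 7 words, filters of heights 2 and 3 give feature maps of lengths 6 and 5, matching the toy example in Figure 1.

```python
import numpy as np

def cnn_features(A, filters):
    """Map an instance matrix A (n x d) to a feature vector o of length |F|.

    `filters` is a list of weight matrices W_i, each of shape (h_i, d).
    Assumes n >= h_i for every filter.
    """
    n, d = A.shape
    pooled = []
    for W in filters:
        h = W.shape[0]
        # Slide the filter over every window of h consecutive words,
        # giving a feature map of length n - h + 1.
        fmap = np.array([np.sum(W * A[t:t + h]) for t in range(n - h + 1)])
        fmap = np.maximum(fmap, 0.0)   # ReLU
        pooled.append(fmap.max())      # 1-max pooling
    return np.array(pooled)

def softmax(z):
    z = z - z.max()                    # shift for numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
n, d = 7, 5                            # toy sizes (hypothetical)
A = rng.normal(size=(n, d))            # stands in for pretrained embeddings
# Two filters of height 2 and two of height 3.
filters = [rng.normal(size=(h, d)) for h in (2, 2, 3, 3)]
o = cnn_features(A, filters)           # one pooled feature per filter
W_out = rng.normal(size=(2, len(filters)))  # softmax weights, 2 classes assumed
probs = softmax(W_out @ o)
```

Each filter contributes exactly one entry to `o`, so the representation has a fixed size regardless of the instance length.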

Rationale-Augmented CNN for Document Classification
We now move to the main contribution of this work: a rationale-augmented CNN for text classification. We first introduce a simple variant of the above CNN that models document structure (Section 4.1) and then introduce a means of incorporating rationale-level supervision into this model (Section 4.2). In Section 4.3 we discuss connections to attention mechanisms and describe a baseline equipped with one, inspired by Yang et al. (2016).

Modeling Document Structure
Recall that rationales are snippets of text marked as having supported document-level categorizations.
We aim to develop a model that can exploit these annotations during training to improve classification.
Here we achieve this by developing a hierarchical model that estimates the probabilities of individual sentences being rationales and uses these estimates to inform the document level classification.
As a first step, we extend the CNN model above to explicitly account for document structure. Specifically, we apply a CNN to each individual sentence in a document to obtain sentence vectors independently. We then sum the respective sentence vectors to create a document vector. As before, we add a softmax layer on top of the document-level vector to perform classification. We perform regularization by applying dropout both on the individual sentence vectors and the final document vector. We will refer to this model as Doc-CNN. Doc-CNN forms the basis for our novel approach, described below.
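As a rough sketch of the Doc-CNN aggregation step, the snippet below sums per-sentence vectors into a document vector, with dropout at both levels. The function names are hypothetical, and the use of inverted dropout (rescaling the kept units during training) is an assumption; the text does not specify the dropout variant.

```python
import numpy as np

def dropout(x, rate, rng):
    """Inverted dropout: zero entries with probability `rate` (0 <= rate < 1)
    and rescale the survivors. Assumed variant, applied at training time only."""
    if rate == 0.0:
        return x
    mask = rng.random(x.shape) >= rate
    return x * mask / (1.0 - rate)

def doc_cnn_vector(sentence_vectors, rng, sent_rate=0.0, doc_rate=0.0):
    """Sum sentence vectors (N_i x |F|) into a single document vector (|F|)."""
    dropped = dropout(sentence_vectors, sent_rate, rng)
    return dropout(dropped.sum(axis=0), doc_rate, rng)

rng = np.random.default_rng(0)
S = np.array([[1.0, 2.0],
              [3.0, 4.0]])             # two toy sentence vectors
doc = doc_cnn_vector(S, rng)           # dropout disabled: a plain sum
```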

RA-CNN
In this section we present the Rationale-Augmented CNN (RA-CNN). Briefly, RA-CNN induces a document-level vector representation by taking a weighted sum of its constituent sentence vectors. Each sentence weight is set to reflect the estimated probability that it is a rationale in support of the most likely class. We provide a schematic of this model in Figure 2.
RA-CNN capitalizes on both sentence- and document-level supervision. There are thus two steps in the training phase: sentence-level training and document-level training. For the former, we apply a CNN to each sentence $j$ in document $i$ to obtain sentence vectors $x^{ij}_{sen}$. We then add a softmax layer parametrized by $W_{sen}$; this takes as input sentence vectors. We fit this model to maximize the probabilities of the observed rationales:

$$p(y^{ij}_{sen} = k;\, E, C, W_{sen}) = \frac{\exp\left(W_{sen}^{(k)\top} x^{ij}_{sen}\right)}{\sum_{k'=1}^{K_{sen}} \exp\left(W_{sen}^{(k')\top} x^{ij}_{sen}\right)} \qquad (1)$$

where $y^{ij}_{sen}$ denotes the rationale label for sentence $j$ in document $i$, $K_{sen}$ denotes the number of possible classes for sentences, $E$ denotes the word embedding matrix, $C$ denotes convolution layer parameters, and $W_{sen}$ is a matrix of weights (comprising one weight vector $W_{sen}^{(k)}$ per sentence class).
In our setting, each sentence has three possible labels ($K_{sen} = 3$). When a rationale sentence appears in a positive document, it is a positive rationale; when a rationale sentence appears in a negative document, it is a negative rationale. All other sentences belong to a third, neutral class: these are non-rationales.
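The three-way labeling rule can be written down directly. In this small sketch the integer codes and the function name are arbitrary choices made for illustration.

```python
# Integer codes for the three sentence classes (arbitrary, for illustration).
POSITIVE, NEGATIVE, NEUTRAL = 0, 1, 2

def sentence_label(is_rationale, doc_is_positive):
    """Derive a sentence label from its rationale flag and its document's label."""
    if not is_rationale:
        return NEUTRAL                 # non-rationale sentences
    return POSITIVE if doc_is_positive else NEGATIVE
```

Note that a non-rationale sentence is neutral regardless of the label of the document it appears in.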
We train an estimator using the provided rationale annotations, optimizing over $\{E, C, W_{sen}\}$ to minimize the categorical cross-entropy of sentence labels. Once trained, this sub-model can provide conditional probability estimates regarding whether a given sentence is a positive or a negative rationale, which we will denote by $p_{pos}$ and $p_{neg}$, respectively.
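To make the sentence-level objective concrete, here is a minimal NumPy sketch of the softmax layer and the categorical cross-entropy it is trained to minimize. It assumes the sentence vectors have already been produced by the convolution layer, and it leaves out bias terms and the gradient updates to E, C, and W_sen; the function names and toy data are hypothetical.

```python
import numpy as np

def sentence_probs(X_sen, W_sen):
    """Softmax over sentence classes.

    X_sen: (num_sentences, |F|) sentence vectors.
    W_sen: (K_sen, |F|) weights, one row per sentence class.
    """
    logits = X_sen @ W_sen.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    """Mean negative log-probability of the observed sentence labels."""
    return -np.mean(np.log(probs[np.arange(len(labels)), labels]))

rng = np.random.default_rng(0)
X_sen = rng.normal(size=(5, 4))        # toy sentence vectors
W_sen = np.zeros((3, 4))               # untrained weights: uniform predictions
labels = np.array([0, 1, 2, 2, 0])     # toy sentence labels, K_sen = 3
probs = sentence_probs(X_sen, W_sen)
loss = cross_entropy(probs, labels)    # equals log(3) for uniform predictions
```

If columns 0 and 1 of `probs` are taken to be the positive and negative rationale classes (an ordering assumed here), they play the roles of p_pos and p_neg.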
We next train the document-level classification model. The inputs to this are vector representations $x^i_{doc}$ of documents, induced by summing over constituent sentence vectors, as in Doc-CNN. However, in the RA-CNN model this is a weighted sum. Specifically, weights are set to the estimated probabilities that corresponding sentences are rationales in the most likely direction. More precisely:

$$x^i_{doc} = \sum_{j=1}^{N_i} x^{ij}_{sen} \cdot \max\{p^{ij}_{pos},\, p^{ij}_{neg}\} \qquad (2)$$

where $N_i$ is the number of sentences in the $i$th document. The intuition is that sentences likely to be rationales will have greater influence on the resultant document vector representation, while the contribution of neutral sentences (which are less relevant to the classification task) will be minimized.
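The weighted aggregation can be sketched as follows. The sketch reads "the most likely direction" as the larger of the positive and negative rationale probabilities for each sentence, which is also the quantity used to rank rationales in the qualitative results; the function name and toy numbers are hypothetical.

```python
import numpy as np

def ra_cnn_doc_vector(sentence_vectors, p_pos, p_neg):
    """Weighted sum of sentence vectors (N_i x |F|).

    Each sentence is weighted by max(p_pos, p_neg), its estimated probability
    of being a rationale in the more likely direction.
    """
    weights = np.maximum(p_pos, p_neg)          # shape (N_i,)
    return weights @ sentence_vectors           # shape (|F|,)

X = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])                      # three toy sentence vectors
p_pos = np.array([0.90, 0.05, 0.10])
p_neg = np.array([0.05, 0.05, 0.70])
x_doc = ra_cnn_doc_vector(X, p_pos, p_neg)      # weights are 0.9, 0.05, 0.7
```

In this toy case the second sentence is almost certainly neutral, so it contributes very little to the document vector.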
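The final classification is performed by a softmax layer parameterized by $W_{doc}$; the inputs to this layer are the document vectors. The $W_{doc}$ parameters are trained using the document-level labels, $y^i_{doc}$:

$$p(y^{i}_{doc} = k;\, E, C, W_{doc}) = \frac{\exp\left(W_{doc}^{(k)\top} x^{i}_{doc}\right)}{\sum_{k'=1}^{K_{doc}} \exp\left(W_{doc}^{(k')\top} x^{i}_{doc}\right)} \qquad (3)$$

where $K_{doc}$ is the cardinality of the document label set. We optimize over parameters to minimize cross-entropy loss (w.r.t. the document labels).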
We note that the sentence- and document-level models share word embeddings $E$ and convolution layer parameters $C$, but the document-level model has its own softmax parameters $W_{doc}$. When training the document-level model, $E$, $C$ and $W_{doc}$ are fit, but we hold $W_{sen}$ fixed.
The above two-step strategy can be equivalently described as follows. We first estimate $E$, $C$ and $W_{sen}$, which parameterize our model for identifying rationales in documents. We then move to fitting our document classification model. For this we initialize the word embedding and convolution parameters to the $E$ and $C$ estimates from the preceding step. We then directly minimize the document-level classification objective, tuning $E$ and $C$ and simultaneously fitting $W_{doc}$.
Note that this sequential training strategy differs from the alternating training approach commonly used in multi-task learning (Collobert and Weston, 2008). We found that the latter approach does not work well here, leading us to instead adopt the cascade-like feature learning approach (Collobert and Weston, 2008) just described.
One nice property of our model is that it naturally provides explanations for its predictions: the model identifies rationales then categorizes documents informed by these. Thus if the model classifies a test instance as positive, then by construction the sentences associated with the highest $p^{ij}_{pos}$ estimates are those that the model relied on most heavily in coming to this disposition. These sentences can of course be output in conjunction with the prediction. We provide concrete examples of this in Section 7.2.
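A minimal sketch of how such explanations might be produced: rank a document's sentences by the rationale probability in the predicted direction and return the top few. The function name, the cutoff `k`, and the placeholder sentences are hypothetical.

```python
def explain(sentences, p_pos, p_neg, predicted_positive, k=2):
    """Return the k sentences with the highest rationale probability
    in the direction of the predicted document class."""
    scores = p_pos if predicted_positive else p_neg
    ranked = sorted(zip(scores, sentences), key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in ranked[:k]]

sentences = ["s0", "s1", "s2"]          # placeholder sentences
p_pos = [0.1, 0.8, 0.3]
p_neg = [0.7, 0.1, 0.2]
top_if_positive = explain(sentences, p_pos, p_neg, predicted_positive=True)
top_if_negative = explain(sentences, p_pos, p_neg, predicted_positive=False)
```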

Rationales as 'Supervised Attention'
One may view RA-CNN as a supervised variant of a model equipped with an attention mechanism (Bahdanau et al., 2014). On this view, it is apparent that rather than capitalizing on rationales directly, we could attempt to let the model learn which sentences are important, using only the document labels. We therefore construct an additional baseline that does just this, thereby allowing us to assess the impact of learning directly from rationales.
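Following the very recent work of Yang et al. (2016), we first posit for each sentence vector a hidden representation $u^{ij}_{sen}$. We then define a sentence-level context vector $u_s$, which is applied to each $u^{ij}_{sen}$ to induce a weight $\alpha_{ij}$. Finally, the document vector is taken as a weighted sum over sentence vectors, where weights reflect the $\alpha$'s:

$$u^{ij}_{sen} = \tanh\left(W_s x^{ij}_{sen} + b_s\right)$$

$$\alpha_{ij} = \frac{\exp\left(u^{ij\,\top}_{sen} u_s\right)}{\sum_{j'=1}^{N_i} \exp\left(u^{ij'\,\top}_{sen} u_s\right)} \qquad (4)$$

$$x^i_{doc} = \sum_{j=1}^{N_i} \alpha_{ij}\, x^{ij}_{sen} \qquad (5)$$

where $x^i_{doc}$ again denotes the document vector fed into a softmax layer, and $W_s$, $u_s$ and $b_s$ are learned during training. We will refer to this attention-based method as AT-CNN.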
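The attention-based aggregation can be sketched in NumPy as below. The sketch assumes the formulation of Yang et al. (2016), in which the hidden representation is a tanh-transformed affine map of the sentence vector and the weights are a softmax over its inner products with the context vector; the hidden size, function name, and random parameters are hypothetical, and in practice W_s, b_s and u_s would be learned.

```python
import numpy as np

def at_cnn_doc_vector(X, W_s, b_s, u_s):
    """Attention-weighted sum of sentence vectors.

    X:   (N_i, |F|) sentence vectors.
    W_s: (H, |F|), b_s: (H,), u_s: (H,) attention parameters.
    Returns the document vector (|F|,) and the weights alpha (N_i,).
    """
    U = np.tanh(X @ W_s.T + b_s)        # hidden representation per sentence
    scores = U @ u_s                    # similarity to the context vector
    scores = scores - scores.max()      # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ X, alpha

rng = np.random.default_rng(0)
N, F, H = 6, 4, 3                       # toy sizes (hypothetical)
X = rng.normal(size=(N, F))
W_s, b_s, u_s = rng.normal(size=(H, F)), np.zeros(H), rng.normal(size=H)
x_doc, alpha = at_cnn_doc_vector(X, W_s, b_s, u_s)
```

Unlike the RA-CNN weights, the attention weights here sum to one over a document's sentences and receive no direct supervision.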

Datasets
In total, we used five text classification datasets to evaluate our approach. Four of these are biomedical text classification datasets (5.1) and the last is a collection of movie reviews (5.2). These datasets share the property of having recorded rationales associated with each document categorization. We summarize attributes of all datasets used in this work in Table 1.

Risk of Bias (RoB) Datasets
We used a collection of Risk of Bias (RoB) text classification datasets, described at length elsewhere (Marshall et al., 2016). Briefly, the task concerns assessing the reliability of the evidence presented in full-text biomedical journal articles that describe the conduct and results of randomized controlled trials (RCTs). This involves, e.g., assessing whether or not patients were properly blinded as to whether they were receiving an active treatment or a comparator (such as a placebo). If such blinding is not done correctly, it compromises the study by introducing statistical bias into the treatment efficacy estimate(s) derived from the trial.
A formal system for making bias assessments is codified by the Cochrane Risk of Bias Tool (Higgins et al., 2011). This tool defines multiple domains; the risk of bias may be assessed in each of these. We consider four domains here. (1) Random sequence generation (RSG): were patients assigned to treatments in a truly random fashion? (2) Allocation concealment (AC): were group assignments revealed to the person assigning patients to groups (so that she may have, knowingly or unknowingly, influenced these assignments)? (3) Blinding of Participants and Personnel (BPP): were all trial participants and individuals involved in running the trial blinded as to who was receiving which treatment? (4) Blinding of outcome assessment (BOA): were the parties who measured the outcome(s) of interest blinded to the intervention group assignments? These assessments are somewhat subjective. To increase transparency, researchers performing RoB assessment therefore record rationales (sentences from articles) supporting their assessments.
Table 1: N is the number of instances, #sen is the average sentence count, #token is the average per-sentence token count and #rat is the average number of rationales per document.

Movie Review Dataset
We also ran experiments on a movie review (MR) dataset with accompanying rationales. Pang and Lee (2004) developed and published the original version of this dataset, which comprises 1000 positive and 1000 negative movie reviews from the Internet Movie Database (IMDB). Zaidan et al. (2007) then augmented this dataset by adding rationales corresponding to the binary classifications for 1800 documents, leaving the remaining 200 for testing. Because 200 documents is a modest test sample size, we ran 9-fold cross validation on the 1800 annotated documents (each fold comprising 200 documents). The rationales, as originally marked in this dataset, were sub-sentential snippets; for the purposes of our model, we considered the entire sentences containing the marked snippets as rationales.
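Two of the preprocessing steps above lend themselves to a short sketch: promoting sub-sentential snippets to sentence-level rationales, and partitioning the 1800 annotated documents into 9 folds of 200. Both functions are hypothetical illustrations. The snippet matching assumes each snippet lies within a single sentence and appears verbatim in it, and the shuffling with a fixed seed is an assumption, since the text does not say how folds were formed.

```python
import random

def rationale_sentences(sentences, snippets):
    """Flag every sentence that contains at least one marked snippet."""
    return [any(snippet in sentence for snippet in snippets)
            for sentence in sentences]

def make_folds(n_docs=1800, fold_size=200, seed=0):
    """Shuffle document indices and cut them into equal-sized folds."""
    indices = list(range(n_docs))
    random.Random(seed).shuffle(indices)
    return [indices[i:i + fold_size] for i in range(0, n_docs, fold_size)]

# Made-up review sentences, for illustration only.
review = ["I saw this last week.", "The plot is thin.", "A cinematic gem overall."]
flags = rationale_sentences(review, ["cinematic gem"])
folds = make_folds()
```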
6 Experimental Setup

Baselines
We compare against several baselines to assess the advantages of directly incorporating rationale-level supervision into the proposed CNN architecture. We describe these below.
SVMs. We evaluated a few variants of linear Support Vector Machines (SVMs). These rely on sparse representations of text. We consider variants that exploit uni- and bi-grams; we refer to these as uni-SVM and bi-SVM, respectively. We also re-implemented the rationale-augmented SVM (RA-SVM) proposed by Zaidan et al. (2007), described in Section 2.
For the RoB datasets, we also compare to a recently proposed multi-task SVM (MT-SVM) model developed specifically for these RoB datasets (Marshall et al., 2015; Marshall et al., 2016). This model exploits the intuition that the risks of bias across the domains codified in the aforementioned Cochrane RoB tool will likely be correlated. That is, if we know that a study exhibits a high risk of bias for one domain, then it seems reasonable to assume it is at an elevated risk for the remaining domains. Furthermore, Marshall et al. (2016) include rationale-level supervision by first training a (multi-task) sentence-level model to identify sentences likely to support RoB assessments in the respective domains. Special features extracted from these predicted rationales are then activated in the document-level model, informing the final classification. This model is the state-of-the-art on this task.
CNNs. We compare against several baseline CNN variants to demonstrate the advantages of our approach. We emphasize that our focus in this work is not to explore how to induce generally 'better' document vector representations; this question has been addressed at length elsewhere, e.g., (Le and Mikolov, 2014; Jozefowicz et al., 2015; Tang et al., 2015; Yang et al., 2016).
Rather, the main contribution here is an augmentation of CNNs for text classification to capitalize on rationale-level supervision, thus improving performance and enhancing interpretability. This informed our choice of baseline CNN variants: standard CNN (Kim, 2014), Doc-CNN (described above), and AT-CNN, which capitalizes on an (unsupervised) attention mechanism at the sentence level (Section 4.3).

Implementation/Hyper-Parameter Details
Sentence splitting. To split the documents from all datasets into sentences for consumption by our Doc-CNN and RA-CNN models, we used the Natural Language Toolkit (NLTK) sentence splitter.
SVM-based models. We kept the 50,000 most frequently occurring features in each dataset. For estimation we used SGD. We tuned the $C$ hyperparameter using nested development sets. For the RA-SVM, we additionally tuned the $\mu$ and $C_{contrast}$ parameters, as per Zaidan et al. (2007).
Table 2: Accuracies on the four RoB datasets. Uni-SVM: unigram SVM, RA-SVM: Rationale-augmented SVM (Zaidan et al., 2007), MT-SVM: a multi-task SVM model specifically designed for the RoB task, which also exploits the available sentence supervision (Marshall et al., 2016). We also report an estimate of human-level performance, as calculated using subsets of the data for each domain that were assessed by two experts (one was arbitrarily assumed to be correct). We report these numbers for reference; they are not directly comparable to the cross-fold estimates reported for the models.

CNN-based models. For all models and datasets we initialized word embeddings to pre-trained vectors fit via Word2Vec. For the movie reviews dataset these were 300-dimensional and trained on Google News. For the RoB datasets, these were 200-dimensional and trained on biomedical texts in PubMed/PubMed Central (Pyysalo et al., 2013).
Training proceeded as follows. We first extracted all sentences from all documents in the training data. The distribution of sentence types is highly imbalanced (nearly all are neutral). Therefore, we downsampled sentences before each epoch, so that sentence classes were equally represented. After training on sentence-level supervision, we moved to document-level model fitting. For this we initialized embedding and convolution layer parameters to the estimates from the preceding sentence-level training step (though these were further tuned to optimize the document-level objective).
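The per-epoch downsampling step can be sketched as follows. This is one plausible reading, in which every class is reduced to the size of the rarest class; the exact procedure is not specified in the text, and the function name and toy labels are hypothetical.

```python
import random
from collections import defaultdict

def balanced_epoch_sample(labels, rng):
    """Return shuffled sentence indices with every class equally represented.

    Each class is downsampled to the size of the rarest class (an assumption).
    Calling this once per epoch draws a fresh subset of the frequent classes.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    n_min = min(len(indices) for indices in by_class.values())
    sample = []
    for indices in by_class.values():
        sample.extend(rng.sample(indices, n_min))
    rng.shuffle(sample)
    return sample

labels = [2] * 10 + [0] * 2 + [1] * 3   # toy labels: mostly neutral (class 2)
rng = random.Random(0)
epoch_indices = balanced_epoch_sample(labels, rng)
```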
For RA-CNN, we tuned the dropout rate (range: 0 to 0.9) applied at the sentence vector level on each training fold (using a subset of the training data as a validation set) during the document-level training phase. Anecdotally, we found this has a greater effect than the other model hyperparameters, which we thus set after a small informal process of experimentation on a subset of the data. Specifically, we fixed the dropout rate at the document level to 0.5, and we used 3 different filter heights: 3, 4 and 5, following (Zhang and Wallace, 2015). For each filter height, we used 100 feature maps for the baseline CNN, and 20 feature maps for all the other CNN variants.
For parameter estimation we used ADADELTA (Zeiler, 2012), mini-batches of size 50, and an early stopping strategy (using a validation set).
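The early stopping criterion is not spelled out in the text. As a purely illustrative sketch, one common variant stops once validation accuracy has failed to improve for a fixed number of epochs and keeps the best epoch; the `patience` value and the function name below are hypothetical.

```python
def early_stopping_epoch(val_accuracies, patience=3):
    """Return the index of the best epoch seen before stopping.

    Training is assumed to stop once `patience` consecutive epochs pass
    without a new best validation accuracy (a hypothetical criterion).
    """
    best_acc, best_epoch = float("-inf"), -1
    for epoch, acc in enumerate(val_accuracies):
        if acc > best_acc:
            best_acc, best_epoch = acc, epoch
        elif epoch - best_epoch >= patience:
            break                       # no improvement for `patience` epochs
    return best_epoch

# Toy validation curve: the late spike at epoch 5 is never reached.
stopped_at = early_stopping_epoch([0.60, 0.70, 0.69, 0.68, 0.67, 0.90])
```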

Quantitative Results
For all CNN models, we replicated experiments 5 times, where each replication constitutes 5-fold CV. We report the mean and observed ranges in accuracy across these 5 replications for these models, because attributes of the model (notably, dropout) and the estimation procedure render model fitting stochastic (Zhang and Wallace, 2015). We do not report ranges for SVM-based models because the variance inherent in the estimation procedure is much lower for these simpler, linear models. Results on the RoB datasets and the movies dataset are shown in Table 2 and Table 3, respectively. We also display these graphically in Figures 3 and 4. One can observe that RA-CNN consistently outperforms all of the baseline models, across all five datasets. We also observe that CNN/Doc-CNN do not necessarily improve over the results achieved by SVM-based models, which actually prove to be very strong baselines for longer document classification. This differs from previous comparisons w.r.t. classifying shorter texts (Zhang and Wallace, 2015).
Another observation is that AT-CNN does often improve performance over vanilla variants of CNN (i.e., without attention), especially on the RoB datasets, probably because these comprise longer documents. However, as one might expect, RA-CNN clearly outperforms AT-CNN by exploiting rationale-level supervision directly, and it consistently performs better than the baseline CNN and SVM model variants. Indeed, we find that RA-CNN outperformed the MT-SVM on all of the RoB datasets, and this was accomplished without exploiting cross-domain correlations (i.e., without multi-task learning).

Qualitative Results: Illustrative Rationales
In addition to realizing superior classification performance, RA-CNN also provides explainable categorizations. The model can provide the highest scoring rationales (ranked by $\max\{p_{pos}, p_{neg}\}$) for any given target instance, which in turn (by construction) will most influence the final document classification.
For example, a sample positive rationale supporting a correct designation of a study as being at low risk of bias with respect to blinding of outcomes assessment reads simply "The study was performed double blind." An example rationale extracted for a study (correctly) deemed at high risk of bias, meanwhile, reads "as the present study is retrospective, there is a risk that the woman did not properly recall how and what they experienced ...".
Turning to the movie reviews dataset, an example rationale extracted from a glowing review of 'Goodfellas' (correctly classified as positive) reads "this cinematic gem deserves its rightful place among the best films of 1990s". Meanwhile, a rationale extracted from an unfavorable review of 'The English Patient' asserts that "the only redeeming qualities about this film are the fine acting of Fiennes and Dafoe and the beautiful desert cinematography".
In each of these cases, the extracted rationales directly support the respective classifications. This provides direct, meaningful insight into the automated classifications, an important benefit for neural models, which are often seen as opaque.

Conclusions
We developed a new model (RA-CNN) for text classification that extends the CNN architecture to directly exploit rationales when available. We showed that this model outperforms several strong, relevant baselines across five datasets, including vanilla and hierarchical CNN variants, and a CNN model equipped with an attention mechanism. Moreover, RA-CNN automatically provides explanations for classifications made at test time, thus providing interpretability.

Figure 1: A toy example of a CNN for sentence classification. Here there are four feature maps, two with heights 2 and two with heights 3, resulting in feature maps with lengths 6 and 5 respectively.
Figure 2: A schematic of our proposed Rationale-Augmented Convolutional Neural Network (RA-CNN). See text for details.

Figure 3: Accuracies on the four RoB datasets; RA-CNN uniformly outperforms the baselines.

Figure 4: Accuracies achieved on the movies dataset. RA-CNN outperforms all baseline models.

Table 3: Accuracies on the movie review dataset.