ERASER: A Benchmark to Evaluate Rationalized NLP Models

State-of-the-art models in NLP are now predominantly based on deep neural networks that are opaque in terms of how they come to make predictions. This limitation has increased interest in designing more interpretable deep models for NLP that reveal the ‘reasoning’ behind model outputs. But work in this direction has been conducted on different datasets and tasks with correspondingly unique aims and metrics; this makes it difficult to track progress. We propose the Evaluating Rationales And Simple English Reasoning (ERASER) benchmark to advance research on interpretable models in NLP. This benchmark comprises multiple datasets and tasks for which human annotations of “rationales” (supporting evidence) have been collected. We propose several metrics that aim to capture how well the rationales provided by models align with human rationales, and also how faithful these rationales are (i.e., the degree to which provided rationales influenced the corresponding predictions). Our hope is that releasing this benchmark facilitates progress on designing more interpretable NLP systems. The benchmark, code, and documentation are available at https://www.eraserbenchmark.com/


Introduction
Interest has recently grown in interpretable NLP systems that can reveal how and why models make their predictions. But work in this direction has been conducted on different datasets with correspondingly different metrics, and the inherent subjectivity in defining what constitutes 'interpretability' has translated into researchers using different metrics to quantify performance. We aim to facilitate measurable progress on designing interpretable NLP models by releasing a standardized benchmark of datasets (augmented and repurposed from pre-existing corpora, and spanning a range of NLP tasks) and associated metrics for measuring the quality of rationales. We refer to this as the Evaluating Rationales And Simple English Reasoning (ERASER) benchmark.

[Figure: example instances from three of the datasets. Commonsense Explanations (CoS-E): "Where do you find the most amount of leafs?" Movie Reviews: "In this movie, … Plots to take over the world. The acting is great! The soundtrack is run-of-the-mill, but the action more than makes up for it", with labels (a) Positive and (b) Negative. Evidence Inference (article): "Patients for this trial were recruited … Compared with 0.9% saline, 120 mg of inhaled nebulized furosemide had no effect on breathlessness during exercise."]
In curating and releasing ERASER we take inspiration from the stickiness of the GLUE (Wang et al., 2019b) and SuperGLUE (Wang et al., 2019a) benchmarks for evaluating progress in natural language understanding tasks. These have enabled rapid progress on models for general language representation learning. We believe the still somewhat nascent subfield of interpretable NLP stands to similarly benefit from an analogous collection of standardized datasets/tasks and metrics.
'Interpretability' is a broad topic with many possible realizations (Doshi-Velez and Kim, 2017; Lipton, 2016). In ERASER we focus specifically on rationales, i.e., snippets of text from a source document that support a particular categorization. All datasets contained in ERASER include such rationales, explicitly marked by annotators as supporting particular categorizations. By definition rationales should be sufficient to categorize documents, but they may not be comprehensive. Therefore, for some datasets we have collected comprehensive rationales, i.e., in which all evidence supporting a classification has been marked.
How one measures the 'quality' of extracted rationales will invariably depend on their intended use. With this in mind, we propose a suite of metrics to evaluate rationales that might be appropriate for different scenarios. Broadly, this includes measures of agreement with human-provided rationales, and assessments of faithfulness. The latter aim to capture the extent to which rationales provided by a model in fact informed its predictions.
While we propose metrics that we think are reasonable, we view the problem of designing metrics for evaluating rationales, especially for capturing faithfulness, as a topic for further research that we hope ERASER will help facilitate. We plan to revisit the metrics proposed here in future iterations of the benchmark, ideally with input from the community. Notably, while we provide a 'leaderboard', this is perhaps better viewed as a 'results board'; we do not privilege any one particular metric. Instead, we hope that ERASER permits comparison between models that provide rationales with respect to different criteria of interest.
We provide baseline models and report their performance across the corpora in ERASER. While implementing and initially evaluating these baselines, we found that no single 'off-the-shelf' architecture was readily adaptable to datasets with very different average input lengths and associated rationale snippets. This suggests a need for the development of new models capable of consuming potentially lengthy input documents and adaptively providing rationales at the level of granularity appropriate for a given task. ERASER provides a resource to develop such models, as it comprises datasets with a wide range of input text and rationale lengths (Section 4).
In sum, we introduce the ERASER benchmark (www.eraserbenchmark.com), a unified set of diverse NLP datasets (repurposed from existing corpora, including sentiment analysis, Natural Language Inference, and Question Answering tasks, among others) in a standardized format featuring human rationales for decisions, along with the starter code and tools, baseline models, and standardized metrics for rationales.

Desiderata for Rationales
In this section we discuss properties that might be desirable in rationales, and the metrics we propose to quantify these (for evaluation). We attempt to operationalize these criteria formally in Section 5.
As one simple metric, we can assess the degree to which the rationales extracted by a model agree with those highlighted by human annotators. To measure exact and partial match, we propose adopting metrics from named entity recognition (NER) and object detection. In addition, we consider more granular ranking metrics that account for the individual weights assigned to tokens (when models assign such token-level scores, that is).
One distinction to make when evaluating rationales is the degree to which explanation for predictions is desired. In some cases it may be important that rationales tell us why a model made the prediction that it did, i.e., that rationales are faithful. In other settings, we may be satisfied with "plausible" rationales, even if these are not faithful.
Another key consideration is whether one wants rationales that are comprehensive, rather than simply sufficient. A comprehensive set of rationales comprises all snippets that support a given label. Put another way, if we remove a comprehensive set of rationales from an instance, there should be no way to categorize it (Yu et al., 2019). ERASER permits evaluation of comprehensiveness by including exhaustive annotated rationales that we have collected for some of the datasets in the benchmark.

Related Work
Interpretability in NLP is a large and fast-growing area, and we do not attempt to provide a comprehensive overview here. Instead, we focus on directions particularly relevant to ERASER, i.e., prior work on models that provide rationales for their predictions.
Learning to Explain. In ERASER we assume that rationales (marked by humans) are provided during training. However, models will of course not always have access to such direct supervision. This has motivated work on methods that can explain (or "rationalize") model predictions using only instance-level supervision.
In the context of modern neural models for text classification, one might use variants of attention to extract rationales. Attention mechanisms learn to assign soft weights to (usually contextualized) token representations, and so one can extract highly weighted tokens as rationales. However, attention weights do not in general provide faithful explanations for predictions (Jain and Wallace, 2019; Serrano and Smith, 2019; Wiegreffe and Pinter, 2019; Zhong et al., 2019; Pruthi et al., 2019; Brunner et al., 2019; Moradi et al., 2019; Vashishth et al., 2019). This likely owes to encoders entangling inputs, which complicates the interpretation of attention weights over contextualized representations. In some cases, however, faithfulness may not be a primary concern (see footnote 1). By contrast, hard attention mechanisms discretely extract snippets from the input to pass to the classifier, and so by construction provide a sort of faithfulness in their explanations. Recent work has therefore pursued hard attention mechanisms as a means of providing explanations (Lei et al., 2016; Yu et al., 2019). Lei et al. (2016) proposed instantiating two models with their own parameters: an encoder to extract rationales, and a decoder that consumes the snippets it selects to make a prediction. They trained these models jointly. This is complicated by the discrete snippet selection performed by the encoder, which precludes gradient-based parameter estimation. They instead propose adopting a REINFORCE (Williams, 1992) style optimization technique.
Footnote 1: Interestingly, Zhong et al. (2019) report that attention provides plausible but not faithful (explanatory) rationales. In other related work, Pruthi et al. (2019) show that one can easily learn to deceive using attention weights. These findings further highlight that one should be mindful of what criteria one wants rationales to fulfill.
Post-hoc explanation. Another strand of work in the interpretability literature considers post-hoc explanation methods. Such methods seek to explain why a given model made its prediction on a given input, most commonly in the form of token-level importance scores. Many of these methods rely on differentiability of the output with respect to inputs (Sundararajan et al., 2017; Smilkov et al., 2017). These types of explanations often have clear inherent semantics (e.g., simple gradients tell us exactly how perturbing inputs affects outputs), but they may nonetheless be difficult for humans to understand due to counterintuitive behaviors (Feng et al., 2018).
Another class of 'black-box' methods does not require any specific conditions on models. Examples include LIME (Ribeiro et al., 2016) and Alvarez-Melis and Jaakkola (2017); these methods approximate model behavior locally by repeatedly asking the model to make predictions over perturbed inputs and fitting an explainable, low-complexity model over these predictions.
Acquiring rationales. In addition to potentially providing model transparency, collecting rationales from annotators may afford greater efficiency in terms of model performance realized given a fixed amount of annotator effort (Zaidan and Eisner, 2008). In particular, recent work (McDonnell et al., 2017, 2016) has observed that, at least for some tasks, asking annotators to provide rationales justifying their categorizations does not impose much overhead in terms of effort. Active learning (AL) (Settles, 2012) is a complementary strategy for reducing annotator effort that entails the model selecting the examples with which it is to be trained. Sharma et al. (2015) explored actively collecting both instance labels and supporting rationales. Their work suggests that selecting instances via an acquisition function specifically designed for learning with rationales can provide predictive gains over standard AL methods. A limitation of this work is that they relied on simulated rationales, for want of access to datasets with marked rationales; this is a gap that our work addresses.
Learning from Rationales. Work on learning from rationales that have been explicitly provided by users for text classification dates back over a decade (Zaidan et al., 2007; Zaidan and Eisner, 2008). Earlier efforts proposed extending standard discriminative models like Support Vector Machines (SVMs) with regularization terms that penalized parameter estimates which disagreed with provided rationales (Zaidan et al., 2007; Small et al., 2011). Other efforts have attempted to specify generative models of rationales (Zaidan and Eisner, 2008).
More recent work has looked to exploit rationales in training neural text classification models. For example, Zhang et al. (2016) proposed a rationale-augmented Convolutional Neural Network (CNN) for text classification, explicitly trained to identify sentences supporting document categorizations. Strout et al. (2019) have demonstrated that providing this model with target rationales at train time results in the model providing rationales at test time that are preferred by humans (compared to rationales provided when the model learns to weight sentences in an end-to-end fashion). Other recent work has proposed training 'pipeline' models in which one model learns to extract rationales (using available rationale supervision), and a second, independent model is trained to make predictions on the basis of these (Lehman et al., 2019; Chen et al., 2019).
Elsewhere, Camburu et al. (2018) enriched the SNLI (Bowman et al., 2015) corpus with human rationales and trained an RNN for this task with the aim of being able to justify its predictions, in addition to learning better universal sentence representations. The authors used perplexity and BLEU scores as well as a manual scoring of a random sample of explanations.
Rajani et al. (2019) augmented the CommonsenseQA (Talmor et al., 2019) corpus with rationales and trained a transformer (Vaswani et al., 2017) based GPT (Radford et al.) language model with an objective of using explanations to improve performance on the downstream task. Here the authors used perplexity to evaluate performance. The same work (Rajani et al., 2019) also pursued an innovative approach of training the model to generate natural language explanations directly, such that these agree with human-provided free-text justifications. We view abstractive explanation as an exciting direction for future work, but here we focus on extractive rationalization.
The above efforts have measured rationale or explanation quality as a function of agreement with human rationales. This is natural in the setting in which supervision over rationales is assumed to be provided, as extracting these becomes a secondary predictive target which can be directly measured. However, agreement with human rationales demonstrates only plausibility; it does not guarantee that the model actually relied on the provided snippets to come to its prediction. Rationales that do meet this criterion are termed faithful; we discuss these two potential properties of rationales (plausibility and faithfulness) in more detail below. Importantly, we provide metrics that aim to measure these.

Datasets in ERASER
In this section we describe the datasets that comprise the proposed rationales benchmark. All datasets constitute predictive tasks for which we distribute both reference labels and spans marked by humans, in a standardized format. For some of the datasets we have acquired comprehensive rationales from humans for a subset of instances. This permits evaluation of model recall, with respect to extracted rationales.
We distribute train, validation, and test sets for all corpora (see Appendix A for processing details). We ensure that these sets comprise disjoint sets of source documents to avoid contamination. We have made the decision to distribute the test sets publicly, in part because we do not view the 'correct' metrics to use as settled. We plan to acquire additional human annotations on held-out portions of some of the included corpora so as to offer hidden test set evaluation opportunities in the future.
Evidence inference (Lehman et al., 2019). This is a dataset of full-text articles describing the conduct and results of randomized controlled trials (RCTs). The task is to infer whether a given intervention is reported to either significantly increase, significantly decrease, or have no significant effect on a specified outcome, as compared to a comparator of interest. A justifying rationale extracted from the text should be provided to support the inference. As the original annotations are not necessarily exhaustive, we collect exhaustive annotations on a subset of the test data.
BoolQ (Clark et al., 2019). This corpus consists of passages selected from Wikipedia, and yes/no questions generated from these passages. As the original Wikipedia article versions used were not maintained, we have made a best-effort attempt to recover these, and then find within them the passages answering the corresponding questions. For public release, we acquired comprehensive annotations on a subset of documents in our test set.
Movie Reviews (Zaidan and Eisner, 2008). One of the original datasets providing extractive rationales, the movies dataset has positive or negative sentiment labels on movie reviews. As the included rationale annotations are not necessarily comprehensive (i.e., annotators were not asked to mark all text supporting a label), we collect a comprehensive evaluation set on the final fold of the original dataset (Pang and Lee, 2004).
FEVER (Thorne et al., 2018). FEVER 1.0 (short for Fact Extraction and VERification) is a fact-checking dataset. The task is to verify claims from textual sources. In particular, each claim is to be classified as supported, refuted or not enough information with reference to a collection of potentially relevant source texts. We restrict this dataset to supported or refuted.
MultiRC (Khashabi et al., 2018). This is a reading comprehension dataset composed of questions with multiple correct answers that by construction depend on information from multiple sentences. In MultiRC, each rationale is associated with a question, while answers are independent of one another. We convert each rationale/question/answer triplet into an instance within our dataset. Each answer candidate then has a label of True or False.

Metrics
In ERASER, models are evaluated both for their 'downstream' performance (i.e., performance on the actual classification task) and with respect to the rationales that they extract. For the former we rely on the established metrics for the respective tasks. Here we describe the metrics we propose to evaluate the quality of extracted rationales. We do not claim that these are necessarily the best metrics for evaluating rationales, but they are reasonable starting measures. We hope the release of ERASER will spur additional research into how best to measure the quality of model explanations in the context of NLP.

Agreement with human rationales
The simplest means of evaluating rationales extracted by models is to measure how well they agree with those marked by humans. To this end we propose two classes of metrics: those based on exact matches, and ranking metrics that provide a measure of the model's ability to discriminate between evidence and non-evidence tokens (appropriate for models that provide soft scores for tokens). For the former, we borrow from Named Entity Recognition (NER); we effectively measure the overlap between spans extracted and marked. Specifically, given a set of $l$ rationales $\{r_1, \ldots, r_l\}$ extracted for instance $i$, we compute precision, recall, and F1 with respect to $m$ human rationales $\{h_1, \ldots, h_m\}$.
Exact match is a particularly harsh metric in that it may not reflect subjective rationale quality; consider that an extra token destroys the match but not (usually) the meaning. We therefore consider softer variants. Intersection-Over-Union (IOU), borrowed from computer vision (Everingham et al., 2010), permits credit assignment in the case of partial matches. We define IOU on a token level: for two spans x, y, it is the size of the overlap of the tokens covered by the spans divided by the size of the union. We count a prediction as a match if it overlaps with any of the ground truth rationales by more than some threshold (0.5 for this work). We compute true positives from these matches; other measures (false positives, false negatives) are computed normally, and yield a more forgiving precision, recall, and F-measure.
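The partial-match scoring described above can be sketched as follows. This is a minimal illustration rather than the benchmark's own implementation: spans are assumed to be `(start, end)` token offsets with an exclusive end, the names `span_iou` and `iou_f1` are hypothetical, and recall is taken (as one reasonable reading) to be the fraction of human spans matched by at least one prediction.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) token offsets, end exclusive (assumed format)


def span_iou(a: Span, b: Span) -> float:
    """Token-level intersection-over-union of two spans."""
    tokens_a = set(range(a[0], a[1]))
    tokens_b = set(range(b[0], b[1]))
    union = tokens_a | tokens_b
    if not union:
        return 0.0
    return len(tokens_a & tokens_b) / len(union)


def iou_f1(predicted: List[Span], human: List[Span],
           threshold: float = 0.5) -> Tuple[float, float, float]:
    """Precision, recall, F1 where a match requires IOU above the threshold."""
    # A predicted span is a true positive if it overlaps any human span enough.
    matched_pred = sum(
        1 for p in predicted if any(span_iou(p, h) > threshold for h in human)
    )
    # Assumption: a human span counts as recalled if any prediction matches it.
    matched_human = sum(
        1 for h in human if any(span_iou(p, h) > threshold for p in predicted)
    )
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_human / len(human) if human else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

For example, a prediction covering tokens 0 to 3 against a human span covering tokens 0 to 4 has an IOU of 0.8 and so counts as a match at the 0.5 threshold, even though it would fail an exact-match test.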
We provide two additional relaxations of the exact match metric. First, a token-level precision, recall, and F1 allow for a broader sense of model coverage, although these ignore contiguousness, which is likely a desirable property of rationales. Systems may also provide a sentence-level decision as a second relaxed scoring metric. In general we consider token and span-level metrics superior to sentence metrics as they are more granular, but some datasets have meaningful sentence-level annotations.
Our second class of metrics considers rankings. This rewards models for assigning relatively high scores to marked tokens. In particular, we take the Area Under the Precision-Recall curve (AUPRC) constructed by sweeping a threshold over token scores.
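A small sketch of the token-level and ranking metrics may help make these concrete. It assumes rationales are given as binary token masks and that the model emits one soft score per token; the function names are hypothetical, and the AUPRC here is the step-wise "average precision" approximation, which ignores tied scores and may differ from whatever interpolation the benchmark code uses.

```python
from typing import Sequence, Tuple


def token_prf(predicted: Sequence[int],
              human: Sequence[int]) -> Tuple[float, float, float]:
    """Token-level precision, recall, F1 over binary masks (1 = rationale token)."""
    tp = sum(1 for p, h in zip(predicted, human) if p == 1 and h == 1)
    n_pred = sum(predicted)
    n_true = sum(human)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_true if n_true else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def auprc(scores: Sequence[float], human: Sequence[int]) -> float:
    """Step-wise area under the precision-recall curve (average precision)."""
    n_true = sum(human)
    if n_true == 0:
        return 0.0
    # Sweep the threshold from the highest-scoring token downwards.
    ranked = sorted(zip(scores, human), key=lambda pair: pair[0], reverse=True)
    tp = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            area += tp / rank  # precision at the point where recall increases
    return area / n_true
```

A model that ranks every marked token above every unmarked token obtains an AUPRC of 1.0 under this definition, regardless of the absolute values of its scores.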
In general, the rationales we have for tasks are sufficient to make judgments, but not necessarily comprehensive. However, for some datasets we have explicitly collected comprehensive rationales for at least a subset of the test set. Therefore, on these datasets recall evaluates comprehensiveness directly (it does so only noisily on other datasets). We highlight which corpora contain comprehensive rationales in the test set in Table 4.

Measuring faithfulness
Above we proposed simple metrics for agreement with human-provided rationales. But as discussed above, a model may provide rationales that are plausible (and agree with those marked by humans) but that it did not in fact rely on to come to its disposition. In some scenarios this may be acceptable, but in many settings one may want rationales that actually explain model predictions, i.e., rationales extracted for an instance in this case ought to have meaningfully influenced its prediction for the same. We refer to these as faithful rationales.
How best to measure the faithfulness of rationales is an open question. In this first version of ERASER we propose a few straightforward metrics motivated by prior work (Zaidan et al., 2007; Yu et al., 2019). In particular, following Yu et al. (2019) we define metrics intended to capture the comprehensiveness and sufficiency of rationales, respectively. The former should capture whether all features needed to come to a prediction were selected, and the latter should tell us whether the extracted rationales contain enough signal to come to a disposition.
Comprehensiveness. To calculate rationale comprehensiveness we create contrast examples (Zaidan et al., 2007) by taking an input instance $x_i$ with rationales $r_i$ and erasing from the former all tokens found in the latter. That is, we construct a contrast example for $x_i$, $\tilde{x}_i$, which is $x_i$ with the rationales removed. Assuming a simple classification setting, let $\hat{p}_{ij}$ be the original prediction provided by a model $m$ for the predicted class $j$: $\hat{p}_{ij} = m(x_i)_j$. Then we consider the predicted probability from the model for the same class once the supporting rationales are stripped: $\tilde{p}_{ij} = m(\tilde{x}_i)_j$. Intuitively, the model ought to be less confident in its prediction once rationales are removed from $x_i$. We can measure this as:

$$\text{comprehensiveness} = \hat{p}_{ij} - \tilde{p}_{ij} \qquad (1)$$

If this is high, this implies that the rationales were indeed influential in the prediction; if it is low, then this suggests that they were not. A negative value here means that the model became more confident in its prediction after the rationales were removed; this would seem quite counter-intuitive if the rationales were indeed the reason for its prediction in the first place.
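The comprehensiveness computation can be illustrated with a short sketch: take the model's probability for its predicted class on the full input, then subtract its probability for that same class on the contrast example with rationale tokens erased. Everything named here is hypothetical; in particular `toy_model` is a stand-in classifier invented for illustration, and rationales are assumed to be given as a set of token positions.

```python
from typing import Callable, Dict, List, Set

# Assumed interface: a model maps a token list to class probabilities.
Model = Callable[[List[str]], Dict[str, float]]


def erase(tokens: List[str], rationale_positions: Set[int]) -> List[str]:
    """Build the contrast example: the input with rationale tokens removed."""
    return [t for i, t in enumerate(tokens) if i not in rationale_positions]


def comprehensiveness(model: Model, tokens: List[str],
                      rationale_positions: Set[int]) -> float:
    """Original probability of the predicted class minus its probability
    after the rationale tokens are erased."""
    probs = model(tokens)
    predicted_class = max(probs, key=probs.get)
    contrast_probs = model(erase(tokens, rationale_positions))
    return probs[predicted_class] - contrast_probs[predicted_class]


def toy_model(tokens: List[str]) -> Dict[str, float]:
    """Hypothetical classifier: confident only when 'great' is present."""
    p_pos = 0.9 if "great" in tokens else 0.5
    return {"positive": p_pos, "negative": 1.0 - p_pos}


tokens = ["the", "acting", "is", "great"]
influential = comprehensiveness(toy_model, tokens, {3})  # erases "great"
irrelevant = comprehensiveness(toy_model, tokens, {0})   # erases "the"
```

In this toy setting, erasing the token the classifier depends on yields a clearly positive score, while erasing an irrelevant token yields a score of zero, matching the intuition that only influential rationales should move the prediction.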
Sufficiency. The second metric for measuring the faithfulness of rationales that we use is intended to capture the degree to which the snippets within the extracted rationales are adequate for a model to make a prediction. Denote by $\bar{p}_{ij} = m(r_i)_j$ the predicted probability of class $j$ using only rationales $r_i$. Then, relative to the original prediction $\hat{p}_{ij}$:

$$\text{sufficiency} = \hat{p}_{ij} - \bar{p}_{ij} \qquad (2)$$

These metrics are illustrated in Figure 2.

As defined, the above measures have assumed discrete rationales $r_i$. We would like also to evaluate the faithfulness of continuous importance scores assigned to tokens by models. Here we adopt a simple approach for this. We convert soft scores over features $s_i$ provided by a model into discrete rationales $r_i$ by taking the top-$k_d$ values, where $k_d$ is a threshold for dataset $d$. We set $k_d$ to the average rationale length provided by humans for dataset $d$ (see Table 4). Intuitively, this asks: how much does the model prediction change if we remove a number of tokens equal to what humans use (on average for this dataset), in order of the importance scores assigned to these by the model? Once we have discretized the soft scores into rationales in this way, we compute the faithfulness scores as per Equations 1 and 2.

This approach is conceptually simple. It is also computationally cheap to evaluate, in contrast to measures that require per-token measurements, e.g., importance score correlations with 'leave-one-out' scores (Jain and Wallace, 2019), or counting how many 'important' tokens need to be erased before a prediction flips (Serrano and Smith, 2019). However, the necessity of discretizing continuous scores forces us to rely on the rather ad hoc application of threshold $k_d$. We believe that picking this based on human rationale annotations per dataset is reasonable, but acknowledge that an alternative choice of threshold may yield quite different results for a given model and rationale set. It may be better to construct curves of this measure across varying $k_d$ and compare these, but this is both subtle (such curves will not necessarily be monotonic) and computationally intensive.
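The discretization step and the sufficiency score can be sketched together. This is a rough illustration under stated assumptions: sufficiency is taken to be the original probability of the predicted class minus the probability obtained from the rationale tokens alone (so values near zero indicate a sufficient rationale), `k` stands in for the dataset-level threshold, and `toy_model` is a hypothetical stand-in classifier.

```python
from typing import Callable, Dict, List, Sequence, Set

# Assumed interface: a model maps a token list to class probabilities.
Model = Callable[[List[str]], Dict[str, float]]


def top_k_rationale(scores: Sequence[float], k: int) -> Set[int]:
    """Discretize soft token scores: keep positions of the k highest scores."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return set(ranked[:k])


def sufficiency(model: Model, tokens: List[str],
                rationale_positions: Set[int]) -> float:
    """Original probability of the predicted class minus its probability
    when the model sees only the rationale tokens (assumed sign convention)."""
    probs = model(tokens)
    predicted_class = max(probs, key=probs.get)
    only_rationale = [t for i, t in enumerate(tokens) if i in rationale_positions]
    return probs[predicted_class] - model(only_rationale)[predicted_class]


def toy_model(tokens: List[str]) -> Dict[str, float]:
    """Hypothetical classifier: confident only when 'great' is present."""
    p_pos = 0.9 if "great" in tokens else 0.5
    return {"positive": p_pos, "negative": 1.0 - p_pos}


tokens = ["the", "acting", "is", "great"]
soft_scores = [0.10, 0.05, 0.20, 0.90]  # illustrative importance scores
rationale = top_k_rationale(soft_scores, k=1)
score = sufficiency(toy_model, tokens, rationale)
```

Varying `k` in a loop over this sketch is one way to produce the curves mentioned above, at the cost of one extra model call per threshold value.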
Ultimately, we hope that ERASER inspires additional research into designing faithfulness metrics for rationales. We plan to incorporate additional such metrics into future versions of the benchmark, if appropriate.

Baseline Models
Our focus in this work is primarily on the ERASER benchmark itself, rather than on any particular model(s). However, to establish initial empirical results that might provide a starting point for future work, we evaluate several baseline models across the corpora in ERASER. We broadly class these into models that assign 'soft' (continuous) scores to tokens, and those that perform a 'hard' (discrete) selection over inputs. We additionally consider models specifically designed to select individual tokens (and very short sequences) as rationales, as compared to longer snippets.
We describe these models in the following subsections. All of our implementations are available in the ERASER repository. Note that we do not aim to provide, by any means, a comprehensive suite of models: rather, our aim is to establish a reasonable starting point for additional work on such models.
All of the datasets in ERASER have a similar structure: inputs, rationales, labels. But they differ considerably in length (Table 4), both of documents and corresponding rationales. We found that this motivated use of different models for datasets, appropriate to their sizes and rationale granularities. In our case this was in fact necessitated by computational constraints, as we were unable to run larger models on lengthier documents such as those within Evidence Inference. We hope that this benchmark motivates design of models that provide rationales that can flexibly adapt to varying input lengths and expected rationale granularities. Indeed, only with such models can we perform comparisons across datasets.

[Figure 2: illustration of the faithfulness metrics on the CoS-E example "Where do you find the most amount of leafs?", showing the model's predicted probability p(Forest | x_i).]

Hard selection
Models that perform hard selection may be viewed as comprising two independent modules: an encoder which is responsible for extracting snippets of inputs, and a decoder that makes a prediction based only on the text provided by the encoder. We consider two variants of such models.
Lei et al. (2016). In this model, the encoder induces a binary mask $z$ over inputs $x$. The decoder consumes the attributes of $x$ indicated by $z$ to make a prediction $\hat{y}$. The components are typically trained jointly. This end-to-end training is complicated by the use of (non-differentiable) hard attention, i.e., the binary mask $z$, which means it is not possible to train the model directly using variants of gradient descent. Instead, Lei et al. (2016) propose using REINFORCE (Williams, 1992) style estimation, minimizing the loss over expected binary vectors $z$ yielded from the encoder.
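To make the REINFORCE-style estimation concrete, here is a pure-Python sketch of the score-function gradient estimator for a Bernoulli mask. It is not the implementation of Lei et al. (2016): the per-token logits stand in for encoder outputs, `loss_fn` is a hypothetical stand-in for the decoder loss (plus any sparsity penalty), and no variance-reduction baseline is used.

```python
import math
import random
from typing import Callable, List


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def reinforce_gradient(logits: List[float],
                       loss_fn: Callable[[List[int]], float],
                       n_samples: int = 20000,
                       seed: int = 0) -> List[float]:
    """Monte Carlo estimate of d E_z[loss(z)] / d logits, where each mask
    entry z_i is sampled independently from Bernoulli(sigmoid(logit_i))."""
    rng = random.Random(seed)
    probs = [sigmoid(l) for l in logits]
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        z = [1 if rng.random() < p else 0 for p in probs]
        loss = loss_fn(z)
        # Score function: d log p(z) / d logit_i = z_i - p_i.
        for i, (z_i, p_i) in enumerate(zip(z, probs)):
            grad[i] += loss * (z_i - p_i)
    return [g / n_samples for g in grad]


def toy_loss(z: List[int]) -> float:
    """Hypothetical loss: low when token 0 is selected, plus a small
    penalty per selected token to encourage short rationales."""
    return (1 - z[0]) + 0.1 * sum(z)


gradient = reinforce_gradient([0.0, 0.0], toy_loss)
```

For this toy loss the exact gradient at zero logits is -0.225 for the first token and 0.025 for the second, so a gradient-descent step raises the selection probability of the useful token and lowers that of the other; the sampled estimate should land close to these values.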
One of the advantages of this approach is that it need not have access to marked rationales; it can learn to rationalize on the basis of instance labels alone. However, given that here we do have access to rationales in the training data, we experiment with a variant in which we train the encoder explicitly using rationale-level annotations.
In our implementation of Lei et al. (2016), we drop in two independent BERT (Devlin et al., 2018) base modules with a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) on top to induce contextualized representations of tokens for the encoder and decoder, respectively (the decoder, in addition, uses additive attention to collapse the LSTM hidden representations to a single vector). The encoder generates a scalar (denoting the probability of selecting that token) for each LSTM hidden state using a feedforward layer and sigmoid. In the model where we do use human rationales during training, we minimize binary cross entropy between our sigmoid output and the ground truth rationale. Thus our final loss function is composed of the decoder classification loss, the REINFORCE estimator loss (details can be found in Lei et al. (2016)) and, if used, a rationale supervision loss.
Pipeline models. These are simple models in which we first train the encoder to extract rationales, and then train the decoder to perform prediction using only rationales. No parameters are shared between the two models. Realizing this type of approach is possible only when one has access to direct rationale supervision in order to train the encoder (which in general we assume in ERASER).
Here we first consider a simple pipeline that first segments inputs into sentences. It passes these, one at a time, through a Gated Recurrent Unit (GRU) to yield hidden representations that we compose via an attentive decoding layer. This aggregate representation is then passed to a classification module which predicts whether the corresponding sentence is a rationale (or not). A second model, using effectively the same architecture but parameterized independently, consumes the outputs (rationales) from the first to make predictions. This simple model is described at length in prior work (Lehman et al., 2019). We further consider a 'BERT-to-BERT' pipeline, where we replace each stage with a BERT module for prediction (Devlin et al., 2018).
In all pipeline models, we train each stage independently. The rationale identification stage is trained using approximate sentence boundaries from our source annotations, with randomly sampled negative examples at each epoch. The classification stage uses the same positive rationales as the identification stage, a kind of teacher forcing. See Appendix C for more detail.
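The data preparation for the two independently trained stages can be sketched as below. This is an illustrative assumption-laden example: the function names and the `negatives_per_positive` ratio are hypothetical, and the source does not state how many negatives are drawn per epoch.

```python
import random


def build_identification_examples(sentences, rationale_ids,
                                  negatives_per_positive=1, rng=None):
    """Build one epoch of (sentence, label) pairs for the rationale
    identification stage: every rationale sentence as a positive, plus
    a fresh random sample of non-rationale sentences as negatives."""
    if rng is None:
        rng = random.Random()
    positives = [(s, 1) for i, s in enumerate(sentences) if i in rationale_ids]
    pool = [s for i, s in enumerate(sentences) if i not in rationale_ids]
    k = min(len(pool), negatives_per_positive * len(positives))
    negatives = [(s, 0) for s in rng.sample(pool, k)]
    examples = positives + negatives
    rng.shuffle(examples)
    return examples


def classification_stage_inputs(sentences, rationale_ids):
    """The classification stage is trained on the gold rationale
    sentences (not on the first stage's predictions)."""
    return [s for i, s in enumerate(sentences) if i in rationale_ids]
```

Calling `build_identification_examples` once per epoch yields a different negative sample each time, while the positives stay fixed.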

Soft selection
A subset of datasets in ERASER contains token-level annotations, i.e., in these cases individual words and/or comparatively short sequences of words are marked as supporting classification decisions. These are: MultiRC, Movies, e-SNLI, and CoS-E. For these datasets we consider a model that passes tokens through BERT (Devlin et al., 2018) to induce contextualized representations that are then passed to a bi-directional LSTM (Hochreiter and Schmidhuber, 1997). The hidden representations from the LSTM are collapsed into a single vector using additive attention and finally passed through a linear layer followed by a sigmoid to yield (per-token) relevance predictions. We use the LSTM layer in part to bypass the 512-token limit imposed by BERT; when we exceed this length, we effectively start encoding a 'new' sequence (setting the positional index to 0) via BERT. The hope is that the LSTM layer learns to compensate for this. We have not yet trained this model on larger corpora due to computational constraints. For now we instead use a similar setup as above for Evidence Inference, BoolQ, and FEVER, except we swap in GloVe 300-d embeddings (Pennington et al., 2014) in place of BERT representations for tokens.
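The windowing scheme for over-length inputs can be illustrated with a short sketch. The function name is hypothetical, and the sketch ignores the room that special tokens such as [CLS] and [SEP] would occupy in each window, which a real implementation would need to account for.

```python
def chunk_with_position_reset(token_ids, max_len=512):
    """Split a long token sequence into consecutive windows of at most
    max_len tokens. Each window gets its own positional indices that
    restart at 0, so every window looks like a 'new' sequence to BERT;
    the downstream LSTM then runs over the concatenated outputs."""
    if max_len <= 0:
        raise ValueError("max_len must be positive")
    chunks = []
    for start in range(0, len(token_ids), max_len):
        window = token_ids[start:start + max_len]
        positions = list(range(len(window)))
        chunks.append((window, positions))
    return chunks
```

For a 1030-token input and the default limit, this yields three windows of 512, 512, and 6 tokens, each with positions starting at 0.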
For these models we consider input gradients (with respect to output) and attention induced over contextualized representations as 'soft' scores.

Evaluation
Here we present initial results for the baseline models discussed in Section 6, with respect to the metrics proposed in Section 5. We present results in two parts, reflecting the two classes of rationales discussed above: 'hard' approaches that perform discrete selection of snippets and 'soft' methods that assign continuous importance scores to tokens.
First, in Table 3 we evaluate models that perform discrete selection of rationales. We view these models as faithful by design, because by construction we know what snippets of text the decoder used to make a prediction (see footnote 7). Therefore, for these methods we report only metrics that measure agreement with human annotations.
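The two agreement metrics reported for discrete rationales, token-level F1 and IOU F1, can be sketched as follows. The definitions in Section 5 are not reproduced in this part of the text, so the details here are assumptions: spans are half-open `(start, end)` token offsets, and a predicted span counts as a match when its intersection-over-union with some gold span reaches a threshold of 0.5.

```python
def token_f1(pred_tokens, gold_tokens):
    """F1 over sets of selected token positions."""
    pred_tokens, gold_tokens = set(pred_tokens), set(gold_tokens)
    tp = len(pred_tokens & gold_tokens)
    if tp == 0:
        return 0.0
    precision = tp / len(pred_tokens)
    recall = tp / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def span_iou(a, b):
    """Intersection-over-union of two half-open token spans."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union else 0.0


def iou_f1(pred_spans, gold_spans, threshold=0.5):
    """F1 where a span is a (partial) match if its IOU with any span
    on the other side is at least `threshold` (assumed value)."""
    if not pred_spans or not gold_spans:
        return 0.0
    hit_pred = sum(any(span_iou(p, g) >= threshold for g in gold_spans)
                   for p in pred_spans)
    hit_gold = sum(any(span_iou(p, g) >= threshold for p in pred_spans)
                   for g in gold_spans)
    precision = hit_pred / len(pred_spans)
    recall = hit_gold / len(gold_spans)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```

The IOU variant credits predictions that overlap a human rationale substantially without matching its boundaries exactly, whereas token F1 scores every token individually.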
Due to computational constraints, we are currently unable to run our BERT-based implementation of Lei et al. (2016) over larger corpora. Conversely, Lehman et al. (2019) assumes a setting in which rationales are sentences, and so is not appropriate for datasets in which rationales tend to comprise only very short spans. Again, in our view this highlights the need for models that can rationalize at varying levels of granularity, depending on what is appropriate.
We observe that for the "rationalizing" model of Lei et al. (2016), exploiting rationale-level supervision generally improves agreement with human-provided rationales, which is consistent with prior work (Zhang et al., 2016; Strout et al., 2019). Here, Lei et al. (2016) consistently outperforms the simple pipeline model from Lehman et al. (2019). Furthermore, Lei et al. (2016) outperforms the 'BERT-to-BERT' pipeline on the comparable datasets for the final classification tasks. This may be an artifact of the amount of text each model can select: 'BERT-to-BERT' is limited to sentences, while Lei et al. (2016) can select any subset of the text.
In Table 4 we report metrics for models that assign soft (continuous) importance scores to individual tokens. For these models we again measure downstream (task) performance (F1 or accuracy, as appropriate). Here the models are actually the same, and so downstream performance is equivalent. To assess the quality of token scores with respect to human annotations, we report the Area Under the Precision-Recall Curve (AUPRC). Finally, as these scoring functions assign only soft scores to inputs (and may still use all inputs to come to a particular prediction), we report the metrics intended to measure faithfulness defined above: comprehensiveness and sufficiency. Here we observe that the simple gradient attribution yields consistently more 'faithful' rationales with respect to comprehensiveness, and in a slight majority of cases also with respect to sufficiency. Interestingly, however, attention weights yield better AUPRCs.
Footnote 7: Note that this further assumes that the encoder and decoder do not share parameters.
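The AUPRC over per-token soft scores can be approximated as in the sketch below. It uses the step-wise (average precision) estimate, which is an assumption made for this example; the benchmark's evaluation code may integrate the curve differently, and ties in the scores are broken by input order here.

```python
def average_precision(scores, labels):
    """Step-wise area under the precision-recall curve.

    scores: one soft importance score per token.
    labels: 1 if the token is in the human rationale, else 0.
    """
    assert len(scores) == len(labels)
    n_pos = sum(labels)
    if n_pos == 0:
        return 0.0
    # Rank tokens from most to least important.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    area = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            true_positives += 1
            # Precision at the point where recall increases by 1 / n_pos.
            area += true_positives / rank
    return area / n_pos
```

A scoring function that ranks every rationale token above every non-rationale token obtains a value of 1.0 under this estimate.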
We view these as preliminary results and intend to implement and evaluate additional baselines in the near future. Critically, we see a need for establishing the performance of a single architecture across ERASER, which comprises datasets of very different sizes that feature rationales at differing granularities.

Discussion
We have described a new publicly available Evaluating Rationales And Simple English Reasoning (ERASER) benchmark. This comprises seven datasets, all of which have both instance-level labels and corresponding supporting snippets ('rationales') marked by human annotators. We have augmented many of these datasets with additional annotations, and converted them into a standard format comprising inputs, rationales, and outputs. ERASER is intended to facilitate progress on explainable models for NLP.
We have proposed several metrics intended to measure the quality of rationales extracted by models, both in terms of agreement with human annotations, and in terms of 'faithfulness' with respect to comprehensiveness and sufficiency. We believe these metrics provide a reasonable means of comparing specific aspects of interpretability. However, we view the problem of measuring faithfulness, in particular, as a topic ripe for additional research; we hope that ERASER facilitates this.
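To make the two faithfulness measures concrete, the following sketch computes them for a single instance. It assumes the reading that comprehensiveness is the drop in predicted-class probability when the rationale tokens are removed from the input, and sufficiency is the drop when only the rationale tokens are kept. The `toy_model` classifier is a hypothetical stand-in, not one of the baselines.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[str]], List[float]]


def comprehensiveness(model: Model, tokens, rationale_mask, label: int) -> float:
    """Probability drop for `label` after deleting the rationale tokens.
    Larger values suggest the rationale mattered to the prediction."""
    without = [t for t, keep in zip(tokens, rationale_mask) if not keep]
    return model(tokens)[label] - model(without)[label]


def sufficiency(model: Model, tokens, rationale_mask, label: int) -> float:
    """Probability drop for `label` when only the rationale is kept.
    Smaller values suggest the rationale alone supports the prediction."""
    only = [t for t, keep in zip(tokens, rationale_mask) if keep]
    return model(tokens)[label] - model(only)[label]


def toy_model(tokens: Sequence[str]) -> List[float]:
    """Hypothetical classifier: P(positive) grows with the number of
    occurrences of the word 'great'. Returns [P(negative), P(positive)]."""
    hits = sum(1 for t in tokens if t == "great")
    p_pos = (1 + hits) / (2 + hits)
    return [1 - p_pos, p_pos]
```

With the tokens `["the", "acting", "is", "great"]` and a rationale mask selecting only the last token, the toy model gives a positive comprehensiveness score and a sufficiency score of zero, since the single selected word fully determines its output.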
More generally, our hope is that ERASER facilitates progress on designing and comparing relative strengths and weaknesses of interpretable NLP models across a variety of tasks and datasets. We aim to continually update this benchmark and the corresponding metrics that it defines. In contrast to most benchmarks, we are not privileging any one measure of performance. Our view is that for interpretability, different models may excel at different things, and our aim for ERASER is to facilitate meaningful contrastive comparisons that highlight which models excel with respect to particular metrics of interest (e.g., certain models may provide superior faithfulness, though with lower predictive performance). We host a leaderboard, but allow for sorting with respect to any metric of interest.
The ERASER datasets, code for working with the data and performing evaluations, and our baseline model implementations are all available at: www.eraserbenchmark.com, which we will be continuously updating.

C Hyperparameter and training details

C.1 (Lei et al., 2016) models
For these models, we set the sparsity rate to 0.01 and the contiguity loss weight to 2 times the sparsity rate (following the original paper). We used bert-base-uncased (Wolf et al., 2019) as the token embedder and a bidirectional LSTM with a 128-dimensional hidden state in each direction. A dropout (Srivastava et al., 2014) rate of 0.2 was applied before feeding the hidden representations to the attention layer in the decoder and the linear layer in the encoder. A one-layer MLP with a 128-dimensional hidden state and ReLU activation was used to compute the decoder output distribution.
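The role of the two regularization settings can be illustrated with a small sketch. It assumes, following the general form of the regularizer in Lei et al. (2016), that the sparsity term penalizes the number of selected tokens and the contiguity term penalizes transitions between selected and unselected tokens; treating the "sparsity rate" directly as a loss weight is an assumption, and the function name is hypothetical.

```python
def selection_regularizer(mask, sparsity_rate=0.01, contiguity_multiplier=2.0):
    """Penalty on a 0/1 (or soft) token selection mask.

    sparsity term:   number of selected tokens, weighted by sparsity_rate.
    contiguity term: number of on/off transitions, weighted by
                     contiguity_multiplier * sparsity_rate.
    """
    sparsity = sum(mask)
    contiguity = sum(abs(b - a) for a, b in zip(mask, mask[1:]))
    return (sparsity_rate * sparsity
            + contiguity_multiplier * sparsity_rate * contiguity)
```

Under this formulation, a mask that selects three adjacent tokens is penalized less than one that selects three scattered tokens, which encourages short, coherent rationales.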
A learning rate of 2e-5 with the Adam (Kingma and Ba, 2014) optimizer was used for all models, and we fine-tuned only the top two layers of the BERT encoder. The models were trained for 20 epochs, with early stopping at a patience of 5 epochs. The best model was selected on the validation set using the final task performance metric.
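The training schedule with early stopping can be sketched as below. The function and its `run_epoch` callback are hypothetical; the callback is assumed to train for one epoch and return the validation task metric, where higher is better.

```python
def train_with_early_stopping(run_epoch, max_epochs=20, patience=5):
    """Run up to max_epochs epochs, stopping once the validation
    metric has not improved for `patience` consecutive epochs.

    run_epoch(epoch) -> float: trains one epoch and returns the
    validation metric (assumed: higher is better).
    Returns (best_epoch, best_metric).
    """
    best_metric = float("-inf")
    best_epoch = -1
    for epoch in range(max_epochs):
        metric = run_epoch(epoch)
        if metric > best_metric:
            # New best model; a real loop would checkpoint weights here.
            best_metric, best_epoch = metric, epoch
        elif epoch - best_epoch >= patience:
            break
    return best_epoch, best_metric
```

In a real training loop the model weights would be saved at each new best epoch and restored after the loop exits.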
The input for the above model was encoded in the form [CLS] document [SEP] query [SEP].
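A minimal sketch of this input layout, operating on already-tokenized text, is shown below. The function names are hypothetical, and the helper that returns the document span is an added convenience for locating the tokens eligible for rationale selection, not something described in the text.

```python
def encode_input(document_tokens, query_tokens):
    """Lay out the input as [CLS] document [SEP] query [SEP]."""
    return (["[CLS]"] + list(document_tokens) + ["[SEP]"]
            + list(query_tokens) + ["[SEP]"])


def document_span(document_tokens):
    """Half-open (start, end) offsets of the document tokens within
    the encoded sequence (position 0 is [CLS])."""
    return 1, 1 + len(document_tokens)
```

For a two-token document and a one-token query, the encoded sequence has six positions, with the document occupying positions 1 and 2.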
This model was implemented using the AllenNLP library (Gardner et al., 2017).

C.2 BERT-LSTM/GloVe-LSTM
This model is essentially the same as the decoder in the previous section. The BERT-LSTM uses the same hyperparameters, and the GloVe-LSTM is trained with a learning rate of 1e-2.

C.3 (Lehman et al., 2019) models
With the exception of the Evidence Inference dataset, these models were trained using 200-dimensional GloVe (Pennington et al., 2014) word vectors; for Evidence Inference we used the PubMed word vectors of Pyysalo et al. (2013). We use Adam (Kingma and Ba, 2014) with a learning rate of 1e-3 and dropout (Srivastava et al., 2014) of 0.05 at each layer (embedding, GRU, attention layer) of the model, training for 50 epochs with a patience of 10. We monitor validation loss, and keep the best model on the validation set.

C.4 BERT-to-BERT model
We primarily used the bert-base-uncased model for both the identification and classification stages of the pipeline, with the sole exception being Evidence Inference, for which we used SciBERT (Beltagy et al., 2019).