Leakage-Adjusted Simulatability: Can Models Generate Non-Trivial Explanations of Their Behavior in Natural Language?

Data collection for natural language (NL) understanding tasks has increasingly included human explanations alongside data points, allowing past works to introduce models that both perform a task and generate NL explanations for their outputs. Yet to date, model-generated explanations have been evaluated on the basis of surface-level similarities to human explanations, through both automatic metrics like BLEU and human evaluations. We argue that these evaluations are insufficient, since they fail to indicate whether explanations support actual model behavior (faithfulness), rather than simply match what a human would say (plausibility). In this work, we address the problem of evaluating explanations from the model simulatability perspective. Our contributions are as follows: (1) We introduce a leakage-adjusted simulatability (LAS) metric for evaluating NL explanations, which measures how well explanations help an observer predict a model's output, while controlling for how explanations can directly leak the output. We use a model as a proxy for a human observer, and validate this choice with two human subject experiments. (2) Using the CoS-E and e-SNLI datasets, we evaluate two existing generative graphical models and two new approaches; one rationalizing method we introduce achieves roughly human-level LAS scores. (3) Lastly, we frame explanation generation as a multi-agent game and optimize explanations for simulatability while penalizing label leakage, which can improve LAS scores.


Introduction
Deep neural models have achieved impressive success in many areas, but their interpretability and explainability remain broadly limited. To make neural models more interpretable, previous works have proposed methods for explaining model decisions, e.g., through various feature importance estimates (Hendricks et al., 2018; Ribeiro et al., 2016) or model-generated natural language (NL) (Hendricks et al., 2016; Kim et al., 2018). Early work on generating NL explanations focused on providing explanations that were both descriptive of an image and discriminative as labels (Hendricks et al., 2016). Since then, a variety of datasets have been collected with free-form, human-generated explanations accompanying each data point (Camburu et al., 2018; Kim et al., 2018; Zellers et al., 2019; Wang et al., 2019; Rajani et al., 2019). Models have been proposed for these datasets with two aims: (1) to teach models how to explain their own decisions in natural language, by offering demonstrations of humans doing this, and (2) to increase model accuracy on the task, by making use of additional information in human explanations.
Past works have proposed varying methods for generating NL explanations, which can be represented by distinct graphical models. In our work, we explore four graphical models, shown in Figure 1. Each model generates explanations in either a reasoning (RE) or rationalizing (RA) mode, where rationalizing models explicitly condition explanations on a label and reasoning models condition only on the input. Approaches further differ by whether they use explanations as inputs to a task model (ST) or as additional supervision in a multitask framework (MT). Two of these models are drawn from prior works: MT-RA (Camburu et al., 2018) and ST-RE (Rajani et al., 2019). We introduce ST-RA and also test MT-RE as the reasoning counterpart to MT-RA. To fairly compare the approaches, we implement each graphical model with a state-of-the-art pretrained T5 model (Raffel et al., 2019) (details in Section 3).
Figure 1: Graphical models representing varying roles of explanations, where the task input is denoted by x, task output by y, and explanation by e. We introduce a new rationalizing model, ST-RA, while also testing a reasoning multi-task model, MT-RE, and two other methods from past works (Camburu et al., 2018; Rajani et al., 2019).

Generated explanations have typically been evaluated by automatic measures of similarity with human explanations. Most commonly, phrase-matching metrics such as BLEU (Papineni et al., 2002) are used. In a few cases, human evaluations have been employed, also primarily to assess the similarity of explanations to what humans would say. On the basis of these evaluations, past works have suggested their models produce "justifications of its classification decisions" (Camburu et al., 2018) and "explanations to justify its predictions" (Rajani et al., 2019). While useful starting points, we argue that these evaluations are insufficient, because they do not necessarily indicate anything about a model's true internal reasoning. For example, suppose the ground-truth label is A, while a model predicts B; a higher BLEU score will be observed when the model gives an explanation supporting the human label A instead of the model prediction B. This point is substantiated by Jacovi and Goldberg (2020b), who advocate for evaluations of explanation faithfulness rather than plausibility.
To resolve this evaluation problem, we introduce the leakage-adjusted simulatability (LAS) metric, which is better suited for identifying when explanations actually support model behavior. LAS scores combine two key mechanisms: they measure simulatability, which reflects how well an observer can use model explanations to predict the model's output, while controlling for explanation leakage, which occurs when explanations directly leak the output. This metric is inspired by prior work on model interpretability (Doshi-Velez and Kim, 2017;Hase and Bansal, 2020), but to date no simulatability analysis has been carried out for NL explanations. We automate our evaluation by using a pretrained language model as the observer, serving as a proxy for a human. Using LAS scores, we evaluate model-generated as well as human explanations for COMMONSENSEQA (CQA) (Talmor et al., 2019;Rajani et al., 2019) and SNLI (Bowman et al., 2015;Camburu et al., 2018) tasks. We provide two human evaluations to validate our model-based approach. The first is an expert simulatability evaluation, where we manually play the role of the simulator in our LAS metric computation. The second is a subjective ratings task, where we collect data from Mechanical Turkers.
Lastly, since we propose a metric for evaluation, the question naturally arises of whether an objective besides standard language modeling is better suited to improving explanations under this metric. While our formulation of LAS is not differentiable, we present a proxy objective that involves using a simulator during training. This training procedure is neatly interpreted as a multi-agent game. Agents share a common objective, which is for the simulator to predict the task model's output using the explanation it receives, but we penalize agents for pursuing the trivial solution, i.e., restating outputs without giving additional information.
We summarize our key results as follows:
1. We introduce the LAS score, which captures how explanations improve simulatability while controlling for direct label leakage, and we use it to evaluate four generative models.
2. We show that LAS scores provide a deeper understanding of explanation effectiveness than metrics like BLEU, and we discuss their relationship with our expert simulation analysis and crowdsourced human quality ratings.
3. We find that our ST-RA approach achieves nearly human-level LAS scores, and that rationalizing models outperform reasoning models.
4. We observe no trade-off between interpretability and accuracy, though this also means that existing methods struggle to learn from human explanations.
5. In a multi-agent game, we show that optimizing explanations for simulatability while penalizing trivial explanations can improve LAS scores in some settings.

Related Work
Generating Natural Language Explanations. Early work on this topic proposes to generate explanations for images that are descriptive as captions and discriminative as labels (Hendricks et al., 2016). However, they seek to explain the image's label rather than a classifier's output. There is now a wealth of work on evaluating explanations of machine learning models (Ribeiro et al., 2016; Doshi-Velez and Kim, 2017; Hooker et al., 2019; Jacovi and Goldberg, 2020b). For NLP tasks, past works have focused on extractive rather than generative explanations (Nguyen, 2018; DeYoung et al., 2020). Such methods extract parts of the model input that are important to the output according to some criterion. However, they are not suited to evaluating NL explanations that are not part of the input, which motivates our new simulatability metric.
Measures of similarity between model-generated and human explanations are used to evaluate nearly every method introduced above, with BLEU being the most common (Hendricks et al., 2016; Ling et al., 2017; Park et al., 2018; Kim et al., 2018; Camburu et al., 2018; Rajani et al., 2019). In a few cases, human evaluations are employed for similar purposes (Hendricks et al., 2016; Park et al., 2018; Kim et al., 2018). While these evaluations provide a good starting point, they do not support previous claims that explanations show the reasons for model behavior, because they evaluate plausibility rather than faithfulness. We introduce a leakage-adjusted simulatability (LAS) metric in response to this issue. As observed by Jacovi and Goldberg (2020a), faithfulness and simulatability are closely related, but simulatability primarily captures causal attribution of explanations and not necessarily social attribution. Simulatability-based evaluations have been conducted before (Ribeiro et al., 2018; Hase and Bansal, 2020), but we are the first to consider NL explanations and to employ model-based controls for label leakage. Two contemporaneous works also explore relevant topics. Narang et al. (2020) train a T5 model to generate explanations in a set-up analogous to our MT-RA setting. They also notice the shortcomings of BLEU and collect binary human ratings of whether explanations "support" model outputs. Kumar and Talukdar (2020) introduce label-specific versions of the method in Rajani et al. (2019), one of which shares the graphical structure of our ST-RA model. However, their evaluation focuses on whether humans can recover ground-truth labels from generated explanations alone, which they term "explanation accuracy." Given these interesting concurrent works, our contributions are still distinguished by our joint focus on (1) simulatability-based evaluation, (2) controls for explanation label leakage, and (3) comparison of several distinct graphical models.

Multi-Agent Communication. The most relevant work to our multi-agent game concerns discrete communication policies with natural language or artificial protocols grounded in NL. Lazaridou et al. (2017) ground a communication protocol in natural language via an auxiliary image classification task. In concurrent work, Lazaridou et al. (2020) learn NL protocols for an image-based reference game by pretraining with image captions. While our approach shares the premise that language use is goal-oriented, we optimize full explanations of model outputs rather than descriptions of images in reference games. Another contemporaneous work optimizes for simulatability in a multi-agent setting, but with extractive rather than generative explanations (Treviso and Martins, 2020).

Modeling With Explanations
In this section, we delineate our baseline model and the four graphical models we study. The graphical models are depicted in Figure 1. We also summarize the key features of each approach in Table 1, and show examples of task inputs and outputs along with explanations in Table 2. In general, we initialize models from T5-Base, a Transformer-based sequence-to-sequence model pretrained on a large-scale English corpus.

Figure 2: Inputs and outputs for the T5 multi-task framework (example output: "The answer is: hen house"). In the reasoning mode, explanations are not conditioned on the model's prediction, whereas in the rationalizing mode they are dependent on the model output.
Baseline. The baseline model simply predicts y given x. We adopt the approach of Raffel et al. (2019) for fine-tuning on multiple-choice problems, which is to maximize the likelihood of the correct answer tokens conditioned on the task inputs. To produce predictions, however, we compute a likelihood for each answer choice and select the most likely choice, rather than sampling text. SNLI also fits into this framework by taking the three relations as answer choices.
ST-RE. Rajani et al. (2019) propose a Commonsense Auto-Generated Explanation (CAGE) framework for CQA with a two-phase training procedure: first, with human explanations as supervision, a model is trained to generate explanations given task inputs; then, generated explanations are supplied along with task inputs to a classifier that performs the task. We represent this framework in Figure 1, where we term it ST-RE to fit within our data-agnostic model taxonomy: ST stands for serial-task (from the separate training phases) and RE for reasoning-mode explanation generation. While originally composed of GPT and BERT, we implement this approach with two separate T5 models.
ST-RA. We extend the ST-RE approach to operate in a rationalizing mode (shown in Figure 5 in Appendix). Instead of generating one explanation per example, we propose to generate explanations for each possible task output, conditioned on that output. Then, we give each answer choice its own input sequence, which includes the task input and an explanation supporting that answer choice.

Finally, a classifier scores each input and output sequence. Instead of maximizing the likelihood of correct answer tokens, we find that a new learning objective is necessary for training the task model: we renormalize the decoder likelihoods of each answer choice a_i given its encoder input s_i. With the set of encoder sequences S and answer choices A, we define the probability of each answer choice as

p(a_i | S, A) = p_dec(a_i | s_i) / Σ_j p_dec(a_j | s_j),

and we maximize the likelihood of the correct answer choice.
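As a concrete (if simplified) illustration of this renormalization, the following sketch scores hypothetical per-choice decoder log-likelihoods; it is not the paper's T5 implementation, and the scores are made up:

```python
import math

def choice_probabilities(decoder_loglikes):
    """Renormalize per-choice decoder likelihoods into a distribution.

    decoder_loglikes: one value log p_dec(a_i | s_i) per answer choice,
    where s_i pairs the task input with the explanation for choice i.
    Dividing the raw likelihoods by their sum is a softmax over the
    log-likelihoods.
    """
    m = max(decoder_loglikes)  # subtract max for numerical stability
    exps = [math.exp(l - m) for l in decoder_loglikes]
    z = sum(exps)
    return [e / z for e in exps]

# three hypothetical answer choices; training would then maximize the
# probability assigned to the correct one
probs = choice_probabilities([-1.2, -0.3, -2.5])
```

Cross-entropy over these renormalized choice probabilities then replaces the token-level likelihood objective used by the baseline.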
MT-RE. The alternative to using explanations as task model inputs is to use them as supervision in a multi-task framework. As a counterpart to ST-RE, we test a reasoning multi-task model, where explanations are conditioned only on the task input (shown in Figure 2). We use a single task-specific word prepended to the input sequence so that the encoder hidden states will be tailored to either the task or explanation generation. For this model, the multi-task learning objective mixes a label prediction loss L_task (for the task itself) and a language modeling loss L_LM (for explanation generation):

L_MT = α L_task + (1 − α) L_LM,

where α is the mixing ratio, tuned on the development set. We reach a value of α = .5 on both datasets when tuning for task accuracy.

Table 2: Two example data points from CQA with HUMAN or ST-RA label (bold in text) and explanation. We give leakage indicators and example-level LAS scores from both model-based (T5) and human simulators (see Section 4). More examples can be found in Table 7.
MT-RA. Represented in Figure 2, MT-RA is a multi-task model where explanations are conditioned on the model output. This approach originates in Camburu et al. (2018), where it is introduced as an LSTM-based model. As above, we use a task mixing weight of α = .5 for both datasets.

LAS: Leakage-Adjusted Simulatability
While many motivations drive humans' explanations for their behavior, we consider one central purpose to be helping others understand one's internal reasoning. This notion is captured by the concept of simulatability (Doshi-Velez and Kim, 2017). A model is simulatable to the extent that an observer, or simulator, can predict its outputs. The simulator can be either a human or a learned model; we will consider both settings. From this perspective, one might use the simulator's accuracy to measure explanation quality. With task inputs x_i, task model outputs ŷ_i, and explanations ê_i, the accuracy is defined as

Acc(ŷ|x, ê) = (1/n) Σ_i 1[ŷ_i | x_i, ê_i],

where 1[ŷ_i | x_i, ê_i] indicates whether the simulator correctly predicts ŷ_i given x_i and ê_i.² However, this measure fails to distinguish between the ways in which the simulator can successfully predict the task model output, as shown in the causal diagram in Figure 3. We suggest that the simulator's success does not reflect explanation quality when (1) the simulator can guess behavior correctly from the input x alone, or (2) the explanation ê directly restates the task model output, i.e., leaks the label to the simulator. What we are truly looking for in explanations is that they provide semantic content that informs the simulator of the task model's output in the context of its input. Note that we do not think label leakage means an explanation is bad. Explanations will leak more often than not: human explanations leak about 85% of the time for CoS-E and about 97% of the time for e-SNLI (as estimated by a T5 simulator). Instead, we think the more important aspect to evaluate is the explanation's semantic content. For examples of leaking and nonleaking explanations, see Table 2.

²For the remainder of the paper, we use the indicator function in this way to describe the correctness of predictions, which is a slight abuse of notation for the sake of brevity.

To deal with issue (1) above, we introduce an input-only baseline and measure the effect of an explanation on simulatability as 1[ŷ|x, ê] − 1[ŷ|x]. To resolve issue (2), we propose to control for a label-leaking variable, which has the effect of blocking that causal pathway (Pearl, 2009). We do so by using a proxy variable for label leakage: an indicator variable for whether the simulator can predict ŷ solely from ê. The correctness of this prediction suggests that the explanation gives away the answer directly. With this approach, we can estimate explanations' leakage-controlled effect on simulatability by (1) grouping data by the level of explanation label leakage and (2) averaging the explanation effect within each group, then across groups:

LAS = (1/2) [ (1/n_0) Σ_{leak=0} (1[ŷ_i|x_i, ê_i] − 1[ŷ_i|x_i]) + (1/n_1) Σ_{leak=1} (1[ŷ_i|x_i, ê_i] − 1[ŷ_i|x_i]) ],

where n_0 and n_1 are the number of examples in the nonleaking and leaking groups, respectively. We use a pretrained T5-Base model as a proxy for a human simulator (depicted in Figure 4). This approach has the advantage of scaling across large datasets with uniform quality in predictions, and, as described in Section 5, it enables directly optimizing explanations for simulatability. We validate this choice of proxy with two human subject experiments (see Section 6.2). Simulator models are trained with task model outputs as labels and with x and ê combined into input sequences. To make sure the simulator makes good use of both x and ê, we randomly drop either x or ê from the input during training. At test time, the simulator's correctness on each example is 1[ŷ_i|x_i, ê_i], and we obtain 1[ŷ_i|x_i] and 1[ŷ_i|ê_i] by dropping ê_i or x_i from the input. We compare LAS and Acc(ŷ|x, ê) for explanations from the models introduced above as well as for human explanations, and we discuss both metrics' relationship with our human experiments in Section 6.2. In the analysis to follow, we also refer to example-level LAS scores, given as 1[ŷ|x, ê] − 1[ŷ|x], which take values -1, 0, or 1 (see Table 2 for examples).
Lastly, while we use a binary proxy for label leakage, a continuous measure can be obtained from p(ŷ|ê). After calibrating the simulator probabilities via Platt scaling (Platt, 2000), we perform a sensitivity analysis of our results for bin counts between 2 and 100: LAS estimates typically vary by less than 1 point across bin counts. For further details, see Appendix B.1.
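Given per-example simulator correctness indicators, the binary-leakage LAS computation reduces to a few lines. The following schematic sketch uses toy indicator values and assumes both leakage groups are nonempty:

```python
def las_score(sim_xe, sim_x, sim_e):
    """Leakage-adjusted simulatability from 0/1 simulator-correctness lists.

    sim_xe[i]: simulator correct given (x_i, e_i)
    sim_x[i]:  simulator correct given x_i alone (input-only baseline)
    sim_e[i]:  simulator correct given e_i alone (label-leakage proxy)
    Assumes at least one leaking and one nonleaking example.
    """
    groups = {0: [], 1: []}
    for xe, x, leak in zip(sim_xe, sim_x, sim_e):
        groups[leak].append(xe - x)  # example-level LAS in {-1, 0, 1}
    # average the explanation effect within each leakage group, then across groups
    return sum(sum(g) / len(g) for g in groups.values()) / 2

# toy indicators for four examples (two nonleaking, two leaking)
score = las_score([1, 1, 0, 1], [0, 1, 0, 0], [0, 0, 1, 1])
```

Averaging within groups first is what prevents heavily leaking explanation sources from dominating the final score.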

Multi-Agent Explanation Optimization
In this section, we explore an approach to optimizing explanations for LAS, rather than just relying on a standard language modeling loss to produce explanations. The approach is naturally framed as a multi-agent game. Note that we do not aim to improve model accuracy or explanations' BLEU scores in these experiments.
In our game, there are two agents. The first is a task model that predicts labels and generates explanations jointly; here, we use MT-RE or MT-RA. The second is a simulator model that predicts the task model's output ŷ_i given its explanation ê_i and the model input x_i, matching the simulation format shown in Figure 4. The two agents are trained jointly. The objective of the simulator is the same as in the previous section: to predict ŷ_i given x_i and ê_i, with x_i or ê_i randomly dropped from the input to ensure both are used. As in Section 3, the task model learns to perform the task (minimizing L_task) and generate explanations (minimizing L_LM) via supervision from ground-truth labels and human explanations. Here, the task model also tries to minimize the simulator's loss through its explanations. The chief computational challenge with this approach is that explanations are sampled by greedy decoding, so the loss is not differentiable with respect to the task model. We explore two optimization methods that circumvent this issue: approximate SGD via argmax relaxation (Maddison et al., 2017) and REINFORCE (Williams, 1992). Our aim is for explanations to better communicate the task model's reasoning process without adopting the trivial solution, i.e., directly stating its output. Thus, while we optimize explanations for simulatability, we also penalize label leakage, which we formalize below. Note that the task model's predictions are not optimized to agree with the simulator; only its explanations are optimized.
Approximate SGD. With a simulator model p_φ, the simulatability loss term for explanations is

L_exp = −α log p_φ(ŷ | x, ê) + (1 − α) log p_φ(ŷ | ê),

where α is a mixing weight between the simulatability term and the label-leakage penalty. To differentiate through the greedy decoding used for explanation sampling, we use one half of the Gumbel-Softmax trick (Maddison et al., 2017): during the forward pass in training, the argmax is used as normal, while during the backward pass, we relax the argmax to a softmax with temperature 1 for purposes of computing gradients.
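The argmax relaxation can be sketched in isolation as follows. This standalone numpy illustration mirrors the forward-hard, backward-soft behavior described above; it is not the actual T5 training code, which relies on an autograd framework:

```python
import numpy as np

def st_argmax_forward(logits):
    # forward pass: hard one-hot at the argmax, as in greedy decoding
    onehot = np.zeros_like(logits, dtype=float)
    onehot[np.argmax(logits)] = 1.0
    return onehot

def st_argmax_backward(logits, grad_output):
    # backward pass: compute gradients as if the forward pass had been a
    # temperature-1 softmax rather than a hard argmax
    p = np.exp(logits - logits.max())
    p /= p.sum()
    # vector-Jacobian product of the softmax: p * (g - <g, p>)
    return p * (grad_output - np.dot(grad_output, p))
```

Because the backward pass substitutes the softmax Jacobian, gradients flow to all logits even though only the argmax token was emitted.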
REINFORCE. Our second approach is to use the REINFORCE RL algorithm proposed by Williams (1992). Here we take the simulator's output probabilities as a reward for the task model. With the same goals as above, we define the reward for explanation ê_i as

r_i = p_φ(ŷ_i | x_i, ê_i) − p_φ(ŷ_i | ê_i),

so that explanations are rewarded for supporting simulation and penalized for leaking the label. Then the explanation loss L_exp for task model p_θ is defined as

L_exp = −r_i log p_θ(ê_i | x_i).

Finally, with either method, the full learning objective of the task model is

L_TaskModel = λ_1 L_task + λ_2 L_LM + λ_3 L_exp.

The tuning procedure and values for the mixing weights are given in Appendix A.5.
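A minimal sketch of the REINFORCE surrogate for a single sampled explanation follows. The reward shaping (simulator probability minus a leakage term) and the zero baseline are illustrative assumptions, not the paper's tuned configuration:

```python
import math

def reinforce_loss(reward, explanation_logprob, baseline=0.0):
    """Policy-gradient surrogate loss for one sampled explanation.

    reward: e.g. p_sim(y_hat | x, e_hat) - p_sim(y_hat | e_hat),
            rewarding simulatability while penalizing label leakage
    explanation_logprob: log p_theta(e_hat | x) under the task model
    Minimizing this surrogate increases the log-probability of
    explanations that earn above-baseline reward.
    """
    return -(reward - baseline) * explanation_logprob

# hypothetical values: reward 0.8 for an explanation sampled with probability 0.5
loss = reinforce_loss(0.8, math.log(0.5))
```

In practice a learned or moving-average baseline is typically subtracted to reduce gradient variance.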

Experimental Results
Here, we discuss experiments conducted with each method using two (English) datasets. The first is the COMMONSENSEQA (CQA) dataset of Talmor et al. (2019), paired with human explanations from the CoS-E dataset (Rajani et al., 2019); the second is SNLI (Bowman et al., 2015), paired with human explanations from e-SNLI (Camburu et al., 2018).

Automatic Explanation Evaluation
Below we describe key conclusions from our evaluation of leakage-adjusted simulatability (LAS), and we show results alongside overall simulator accuracy Acc(ŷ|x,ê) and BLEU in Table 3.
Humans vs. Models. Some models do achieve roughly human-level LAS scores for CQA and NLI. First, we find that human explanations are helpful to models: we estimate that explanations improve humans' simulatability by 4.31 percentage points for SNLI and by 14.73 points for CQA. Our ST-RA method performs similarly to humans on both datasets. On SNLI, MT-RA also achieves about human performance. We emphasize that this does not mean these models match human explanations in every respect. Rather, the semantics of the explanations have a similar effect on simulator accuracy as human explanations in our experimental settings.

BLEU vs. Simulatability. BLEU is not correlated with our LAS metric, which supports our conjecture that BLEU does not reflect the effect of explanations on simulatability. LAS also does not correlate with the simulator accuracy, Acc(ŷ|x, ê), which is expected given how heavily simulator accuracy is influenced by explanation leakage.

Human Validation of LAS
We validate our model proxy variables with two human evaluations: an expert simulation experiment and a crowdsourced subjective rating test.

Expert Simulation. We (meaning the first three authors, as expert annotators) validate our use of models as simulators of both model-generated and human explanations by manually playing the role of the simulator for 600 data points. With effectively the same design as our automatic metric computation, we simulate humans and our ST-RA model with both datasets, only with no training period in this case. Each annotator is randomly assigned a role for each data point (whether they see the input, the explanation, or both), and points are sampled such that an annotator never sees the same point in different roles. The sample is roughly balanced across the strata of our model's proxy variables. We note that, ideally, we would use only expert human simulators instead of proxies, though even annotating less than 1% of the data across conditions required 1800 individual responses. The correlations between the proxy variables and our own are shown in Table 4. We group the data across subsets (e.g., explanation source and dataset) since the trends were similar between them. We find a strong correlation between the leakage proxy variable and the human leakage variable, with a Spearman rank correlation of ρ = 0.53 (p < 1e−15), and a moderate correlation between the model-based and human example-level LAS, ρ = 0.29 (p < 1e−12) (Cohen, 1988).
The disagreements are concentrated in false negatives for leakage, where we identify leaking explanations when the model does not. With LAS, model scores of -1 and 1 often end up as a human 0, meaning that an explanation confuses the model but not the human rater (for -1), or the human can predict based on the input alone when the model cannot (for 1). Because of this tendency toward 0, human LAS will shrink slightly toward 0 in expectation, relative to the model LAS (see row-normalized Table 13 in Appendix). We also observe a degree of pragmatic drift between models and humans. Lazaridou et al. (2020) operationalize this as the difference in performance between human and model listeners in a reference game. Similarly, we can use simulator accuracy given the input and explanations. We find that humans are better simulators of humans, and models are better at predicting model outputs. Across datasets and simulators, the difference in accuracies is 12.83 percentage points on average.
Lastly, one may notice from Table 4 that our predictions of the human label are sometimes wrong. In fact, our own task accuracy is 70% (±7.33) for SNLI and 72% for CQA (±7.19). These accuracies are similar to those obtained by Pavlick and Kwiatkowski (2019) when re-annotating the SNLI dataset. Interestingly, they find that tasks such as these can have distributions over labels under human annotation, rather than consensus.
Human Subjective Quality Ratings. We collect human ratings from Mechanical Turk workers for 200 test examples for both CQA and SNLI. Each example includes shuffled, unlabeled explanations (one from each model, plus humans, for a total of five), which we ask workers to rate separately on a 5-point Likert scale. After collecting 3 responses per item, we apply a worker quality filter, obtaining 902 ratings in total. Further collection details are provided in Appendix D.

Human rating trends across example-level LAS scores are shown in Table 5. A first observation is that LAS scores do not correlate well with human ratings. Curiously, though, simulator accuracies do correlate with human ratings. We show these trends in Table 6, along with regression coefficients for predicting ratings from simulator accuracies. For both datasets, 1[ŷ|ê] correlates best with human ratings, and the association with 1[ŷ|x, ê] is only significant for SNLI. Since good explanations tend to leak the label, it is not surprising that ratings correlate with label leakage. However, it is surprising that this association is stronger than the relationship with overall accuracy, 1[ŷ|x, ê]. Together, these results help explain why models may struggle to learn from human explanations: models may focus on label leakage in human explanations at the expense of other information. They also suggest that collecting human ratings that do not correlate with label leakage may require a highly controlled rating environment.

Accuracy-Interpretability Trade-off
Past works on model interpretability have observed trade-offs between accuracy and model constraints imposed for interpretation purposes (Bastings et al., 2019; Jain et al., 2020). Yet Rudin (2018) and Jacovi and Goldberg (2020a) argue that we need not always face such a trade-off. Our findings provide quantitative evidence supporting these prior qualitative arguments. We observe consistently small changes in accuracy for our four models, and the largest changes, -.47 (p = .3124) for SNLI and -2.10 (p = .3272) for CQA, are not statistically significant. We also test methods that use human explanations purely for improving accuracy, e.g., through Masked Language Modeling objectives that have been successful for pretraining models. We find that this objective does not lead to statistically significant accuracy improvements, suggesting models still struggle to truly learn from human explanations (results are shown in Table 14).

Table 6: Human ratings broken down by dataset and simulator prediction, shown alongside regression results. 95% confidence intervals in parentheses.

Multi-Agent Game
Multi-agent game results appear in Table 3, though we note that the RL results should be interpreted cautiously, as we observe unstable training behavior with this method. We find that optimization with SGD can reduce label leakage (e.g., from 85.58% to 75.21% for CQA MT-RA) while slightly improving LAS scores, but only one of the four changes in LAS scores is statistically significant, for MT-RE on SNLI. This approach does pull BLEU scores down. No statistically significant differences in accuracy are found; the largest change, a 3.37 point drop on CQA, has a p-value of .1287. We note that this kind of optimization may have the effect of increasing pragmatic drift, as is found for jointly optimized agents by Lazaridou et al. (2020).

Conclusion
We introduce a leakage-adjusted simulatability metric to evaluate the influence of natural language explanations on model simulatability while controlling for explanations leaking the model outputs.
We validate our metric with two human subject experiments, and find that: (1) our ST-RA model attains similar LAS scores to human explanations, (2) rationalizing methods do better than reasoning methods, (3) no statistically significant relationship emerges between simulatability and accuracy, (4) our automatic metric correlates with expert simulation results, (5) the strongest predictor of crowdsourced explanation ratings is whether explanations leak the answer choice, and (6) optimizing explanations for simulatability while penalizing label leakage can improve LAS scores in some settings.

Note that explanations for the CQA test split were not collected for the CoS-E dataset, as the CQA test split itself is withheld as a leaderboard test set. Meanwhile, we report results using 10% of the SNLI training data, since training our multi-task T5 models with the full e-SNLI dataset can take over 24 hours per epoch on a single T4 GPU. These accuracy results are shown in Table 8. We report test set statistics for simulation-related experiments for CQA, shown in Table 3, along with dev statistics for SNLI. Trends across models remain the same as with the data split statistics reported in the main paper. In Table 12, we confirm trends observed with the SNLI training data subset using models trained with the entire dataset. Finally, Table 7 shows additional examples from CQA and SNLI plus model-generated explanations.

A.2 Hypothesis Testing
We describe results as statistically significant when p-values are below .05, where p-values are calculated by bootstrap for LAS, a difference in the binomial means test for model accuracies, and by linear regression with i.i.d. normal noise for associations between human ratings and simulator correctness. Note that confidence intervals for LAS vary in width based on how many data points are in each leakage bin. With the expert evaluation, we compute Spearman's rank correlation between proxy and human simulation variables (with a corresponding p-value). For our data, the results are nearly identical to Pearson's linear correlation and Kendall's Tau.

A.3 Model Selection and Training Details
Our model selection procedure is to train each task model five times with differing seeds, then select the model with the best development performance. We train one simulator model per condition. Since the two-agent experiments have a far greater computational load, we run one seed, using a T5-Small simulator during training and selecting the best task model according to its LAS with this weaker simulator. Afterward, we retrain with a T5-Base simulator.
Our training procedures result in the following approximate experiment times for each model when training on a single NVIDIA T4 GPU. With a T5-Base model and CQA data, our baseline takes about 10 hours for 20 epochs; ST-RE about 10 hours for 20 epochs; ST-RA about 20 hours for 20 epochs; MT-RE about 12 hours for 20 epochs; and MT-RA about 12 hours for 20 epochs. Multi-agent RL optimization with a T5-Small simulator takes about 16 hours for 10 epochs, and SGD takes 24 hours for 10 epochs. With a T5-Base model and SNLI data (using 10% of the training data), our baseline takes about 24 hours for 10 epochs; ST-RE about 24 hours for 10 epochs; ST-RA about 48 hours for 10 epochs; MT-RE about 30 hours for 10 epochs; and MT-RA about 30 hours for 10 epochs. Multi-agent RL optimization with a T5-Small simulator takes about 3 days for 5 epochs, and SGD takes 5 days for 5 epochs. Using the full SNLI dataset, the baseline took four days to train for five epochs, and each MT model took 5 days for 5 epochs. We train generators for the ST conditions for 5 epochs on the 10% subset, which takes under 6 hours. Note that, to follow our model selection procedure, these times should be multiplied by five, and further extended to include training simulators.
Lastly, we note that T5-Base has 220 million parameters, while T5-Small has 60 million parameters (Raffel et al., 2019). In general, this means our model size is 220 million parameters, although for multi-agent training our effective model size is 280 million parameters.

A.4 Training Simulator Models
When training simulators, it is critical that the model can approximate the three distributions used in the LAS computation: p_φ(ŷ_i | x_i, ê_i), p_φ(ŷ_i | x_i), and p_φ(ŷ_i | ê_i). This is achieved by applying dropout at the input token level to either (1) the entire x subsequence or (2) the entire ê subsequence. The same proportion of inputs in each batch is affected by the dropout, with the subset chosen randomly. Without this technique, simulator models rely too heavily on explanations, and when conditioned only on x, they underperform baseline models trained only with x. In our multi-agent experiments, we take a nearly identical approach, but we exploit the fact that all three simulator predictions (p_φ(ŷ_i | x_i, ê_i), p_φ(ŷ_i | x_i), and p_φ(ŷ_i | ê_i)) are made for each batch. That is, rather than using dropout directly, we weight these terms in the simulator objective by the ratios implied by our dropout technique. See Section A.5 for the relevant hyperparameters.
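The subsequence-level dropout can be sketched as follows, assuming already-tokenized inputs; the function name and the default rates are illustrative, not taken from the actual implementation:

```python
import random

def subsequence_dropout(x_tokens, e_tokens, p_drop_x=0.25, p_drop_e=0.25):
    """For a single example, drop the entire x or the entire e subsequence,
    so one simulator learns p(y|x,e), p(y|x), and p(y|e) with shared weights.
    The token lists stand in for real tokenized x and e subsequences."""
    r = random.random()
    if r < p_drop_x:
        return e_tokens               # simulator sees only the explanation
    elif r < p_drop_x + p_drop_e:
        return x_tokens               # simulator sees only the input
    return x_tokens + e_tokens        # full input: x and explanation
```

Applying this per example within a batch yields the fixed proportions described above, while the particular examples affected vary randomly.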
For simulator models, we tune the mixing weights (or dropout proportions) by selecting based on each of the three predictions' accuracies, relative to baseline models trained on one input type only. Specifically, we select based on the maximum summed accuracy of the subsequence (x and ê) predictions, under the constraint that models must come within 1 percentage point of the overall p_φ(ŷ_i | x_i, ê_i) accuracy. Taking λ_{x,e}, λ_x, and λ_e as loss-function weights for predictions conditioned on their subscripts, the effective loss-function weights for CoS-E data are λ_{x,e} = .5, λ_x = .5, and λ_e = 0; for NLI, we use λ_{x,e} = .4, λ_x = .4, and λ_e = .2.
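One reading of this selection criterion can be sketched as follows (the candidate format and helper name are hypothetical, with accuracies in percentage points):

```python
def select_weights(candidates):
    """candidates: dicts with keys 'weights', 'acc_xe', 'acc_x', 'acc_e'.
    Pick the candidate maximizing acc_x + acc_e (the two subsequence
    accuracies), subject to its full-input accuracy acc_xe falling within
    1 percentage point of the best acc_xe among all candidates."""
    best_xe = max(c['acc_xe'] for c in candidates)
    feasible = [c for c in candidates if c['acc_xe'] >= best_xe - 1.0]
    return max(feasible, key=lambda c: c['acc_x'] + c['acc_e'])
```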
The most complex set-up to tune is our multi-agent method. Here, we must tune mixing weights for the task, LM, and explanation objectives, as well as the weight for penalizing leaking explanations. First, we tune the task, LM, and simulatability weights directly for overall simulator accuracy, without applying a penalty for leaking. We search each parameter over the range [.2, .5], spaced by .05, with the constraints that the three terms must sum to 1, the task weight must be at least as high as the LM weight, and the sim weight must be at least as high as the task weight. Lastly, we tune the α trading off between explanation rewards and penalties by selecting directly for LAS scores, searching the unit interval spaced by .1. For SGD, α is set to .8 for CQA and .9 for SNLI; the task loss weight is .35, the LM loss weight is .15, the explanation loss weight is .5, and the simulator objective adopts the same weights described above. For RL, α is set to .8 for both datasets; the task loss weight is .025, the LM loss weight is .025, and the explanation loss weight is .95, with the simulator objective again using the same weights.
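The constrained grid over mixing weights can be enumerated with a short sketch like the following (the helper name and rounding details are our own assumptions):

```python
from itertools import product

def weight_grid(lo=0.2, hi=0.5, step=0.05):
    """Enumerate (task, lm, sim) mixing-weight triples over [lo, hi] with
    the constraints above: weights sum to 1, task >= lm, and sim >= task."""
    n = int(round((hi - lo) / step)) + 1
    vals = [round(lo + i * step, 2) for i in range(n)]
    grid = []
    for task, lm, sim in product(vals, repeat=3):
        if abs(task + lm + sim - 1.0) < 1e-9 and task >= lm and sim >= task:
            grid.append((task, lm, sim))
    return grid
```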

B.1 Continuous Leakage Scores and LAS Metric
While we binarize our proxy for label leakage based on prediction correctness and take the raw average of explanation effects across two leakage bins, a continuous measure of leakage can be obtained directly from p(ŷ|ê). Then, an arbitrary number of bins can be used. Interestingly, for a T5 model fine-tuned by decoder sequence likelihood maximization, these probabilities are tightly concentrated around values just above random-chance performance (.33 for both CQA v1.0 and SNLI), taking a roughly normal distribution. As a result, they are easily calibrated via Platt scaling (Platt, 2000). To check our results' robustness, we perform a sensitivity analysis with respect to the number of evenly spaced leakage bins used to subset the data, after calibrating our leakage probabilities. Across bin counts between 2 and 100, LAS estimates typically vary by less than 1 point, and as a result, the method ranking is almost always preserved. In the limit of the number of bins, our metric becomes the integral of the explanation effect as a function of leakage probability. To ensure the robustness of LAS scores, this type of sensitivity analysis should be performed whenever possible, especially when explanation effectiveness is not linearly related to the leakage probability.

Figure 5: Inputs and outputs for the sequence-to-sequence ST-Ra framework. One explanation is generated for each answer choice, conditioned on the choice. The sequences and answers are supplied to a sequence-to-sequence task model for scoring. We use separate T5 models for the generator and task model.
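The bin-count sensitivity analysis can be sketched as follows, assuming calibrated leakage probabilities and per-example explanation effects have already been computed (names are illustrative):

```python
import numpy as np

def binned_las(leak_prob, effect, n_bins=2):
    """LAS with a continuous leakage measure: average the explanation
    effect within each evenly spaced leakage-probability bin, then
    average over the non-empty bins."""
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each example to a bin by its leakage probability.
    idx = np.clip(np.digitize(leak_prob, edges[1:-1]), 0, n_bins - 1)
    bin_means = [effect[idx == b].mean()
                 for b in range(n_bins) if (idx == b).any()]
    return float(np.mean(bin_means))
```

Sweeping `n_bins` from 2 to 100 and checking that the resulting scores stay within about a point of each other reproduces the sensitivity check described above.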

B.2 Robustness to Seed and Model Choice
We check LAS scores across three random seeds, since random seeds tend to have a large influence on all statistics derived from pretrained neural language models (Dodge et al., 2020). Results are shown in Table 10. The rank ordering of scores is typically preserved, and in most cases scores display relatively low variance, although there are some outlying values.
We also check the effect of using a different simulator model, shown in Table 11. We compare our primary choice, T5-Base, with RoBERTa-Large on SNLI data. For ST models, the task model and simulator share the same architecture, but we do not evaluate MT conditions since RoBERTa is not generative. RoBERTa produces lower LAS scores than T5, and the two rank orderings are not identical, though ST-RA is highest on average in both cases. The differences between them could result from pretraining procedures, architectural differences, fine-tuning sample efficiency, or another cause.

C Alternative Computational Models and Language Modeling Objectives
Our generative models neither gained nor lost accuracy relative to their baselines when implemented with T5 models. Since learning from explanations to improve accuracy is another goal of collecting human explanations as data, we assess this trend with alternative model classes and language modeling objectives. Hence, we test our MT models with masked language modeling (MLM) objectives in place of the causal objectives used for generation, and wherever a generator or task model appears in the current experiments, we test the effect of substituting GPT2 and BERT in their place. We show results for these models in Table 14; GPT2+BERT methods are tagged as ENC methods. Just as with our generative approaches, we observe no accuracy differences between baselines and other methods.

Table 9: Evaluations of human and model-generated explanations by LAS score, overall simulator accuracy, and BLEU. We show the opposite data split relative to the main paper, for reproducibility. 95% confidence intervals, calculated by bootstrap, are shown in parentheses. Confidence intervals are wider when the non-leaking subset is very small, and narrower when the leaking and non-leaking subsets are both large.

Table 10: LAS scores across three random seeds, since random seeds tend to have a large influence on all statistics derived from pretrained neural language models (Dodge et al., 2020). Seed 1 is the result reported in the main body. We test two additional seeds for our primary experiments, retraining all models involved in the LAS score (including the task model, simulator, and ST generators).

D Human Quality Rating Collection
We collected human ratings of explanation quality on Amazon Mechanical Turk. For each of CQA and SNLI, we sample 200 examples from the development or test set (CQA's test set does not contain human explanations). Each example has five explanations: one generated by each of the four models introduced in the main paper, plus the human explanation. We shuffle the five explanations, hiding their sources, and ask workers to rate them separately on a 5-point Likert scale. We instruct workers to rate explanations by how well they support the answer choice, rather than by whether they are literally true, and we describe cases in which explanations should be rated low. Figure 6 shows the full instructions used for collecting explanation ratings for CQA, and Figure 7 shows one CQA question with its answer choices, the first model's choice, and its explanation. The SNLI interface is similar. Workers rate five (choice, explanation) pairs per page. We collected 3 responses per example, for 600 responses in total per dataset. We then apply a simple quality filter to remove responses from unreliable workers. We first manually picked 10 explanations from each of CQA and SNLI that contradict their corresponding model outputs (choices). Since these explanations are certain to be bad, we filter out all responses from workers who rated them highly (> 2 for CQA, > 3 for SNLI, since SNLI has a higher average rating). After filtering, we obtained 466 responses for CQA and 436 responses for SNLI.
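This filter amounts to dropping every response from any worker who rated a known-bad explanation above the dataset's threshold; a minimal sketch (the response format and names are hypothetical):

```python
def filter_workers(responses, bad_ids, threshold):
    """responses: dicts {'worker': str, 'item': str, 'rating': int}.
    bad_ids: identifiers of manually picked contradictory explanations.
    Drop all responses from any worker who rated a known-bad explanation
    above `threshold` (e.g., 2 for CQA, 3 for SNLI)."""
    bad_workers = {r['worker'] for r in responses
                   if r['item'] in bad_ids and r['rating'] > threshold}
    return [r for r in responses if r['worker'] not in bad_workers]
```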