Why do you think that? Exploring faithful sentence–level rationales without supervision

Evaluating the trustworthiness of a model's prediction is essential for differentiating between 'right for the right reasons' and 'right for the wrong reasons'. Identifying textual spans that determine the target label, known as faithful rationales, usually relies on pipeline approaches or reinforcement learning. However, such methods either require supervision, and thus costly annotation of the rationales, or employ non-differentiable models. We propose a differentiable training framework to create models which output faithful rationales on a sentence level, by solely applying supervision on the target task. To achieve this, our model solves the task based on each rationale individually and learns to assign high scores to those which solved the task best. Our evaluation on three different datasets shows competitive results compared to a standard BERT blackbox, while exceeding a pipeline counterpart's performance in two cases. We further exploit the transparent decision-making process of these models to prefer selecting the correct rationales by applying direct supervision, thereby boosting the performance on the rationale level.


Introduction
Large pre-trained language models, such as BERT (Devlin et al., 2018) or RoBERTa (Liu et al., 2019b), gain impressive results on a large variety of NLP tasks, including reasoning and inference (Rogers et al., 2020). Despite this success, research shows that their strong performance can rely, to some extent, on dataset-specific artifacts and not necessarily on the ability to solve the underlying task (Gururangan et al., 2018; Schuster et al., 2019). These observations undermine the models' trustworthiness and impede their deployment in situations where 'blindly trusting' the model is deemed irresponsible (Sokol and Flach, 2020). Explainability has thus emerged as an increasingly popular field (Gilpin et al., 2018; Guidotti et al., 2018).

1 Code available at https://github.com/UKPLab/emnlp2020-faithful-rationales

Figure 1: Example of the proposed rationale selecting process on one of the datasets (FEVER): Given a query and a document, our model selects the best rationale and predicts the label solely based on this selection.
We aim at faithful explanations: the identification of the actual reason for the model's prediction, which is essential for accountability, fairness, and credibility (Chakraborty et al., 2017; Wu and Mooney, 2019), and for evaluating whether a model's prediction is based on the correct evidence. The recently published ERASER benchmark (DeYoung et al., 2020) provides multiple datasets with annotated rationales, i.e., parts of the input document which are essential for correct predictions of the target variable (Zaidan et al., 2007). In contrast to post-hoc techniques for identifying relevant input parts, such as LIME (Ribeiro et al., 2016) or input reduction (Feng et al., 2018), we focus on models that are faithful by design, in which the selected rationale matches the full underlying evidence used for the prediction.
Existing strategies mostly rely on REINFORCE-style (Williams, 1992) learning (Lei et al., 2016) or on training two disjoint models (Lehman et al., 2019; DeYoung et al., 2020), in the latter case depending on rationale supervision. This poses critical limitations, as rationale annotations are costly to obtain and, in many cases, not available. Additionally, only when the model can select the "best" rationale from the full context do we obtain an unbiased indicator for artifacts within a dataset that may influence models without rationale supervision.
In our proposed setup, we turn the hard selection into a differentiable problem by (a) decomposing each document into its constituent sentences, and (b) similar to Clark and Gardner (2018), optimizing the weighted loss based on each of these candidates. We show that this end-to-end trainable model (see Figure 1) can compete with a standard BERT on two reasoning tasks without rationale supervision, and even slightly improve upon it when supervised towards gold rationales. Our quantitative analysis shows how we can exploit these extracted rationales to identify the model's decision boundaries and annotation artifacts of a multi-hop reasoning dataset.

Related Work
Understanding deep neural networks' decisions has gained increasing interest in the research community (DeYoung et al., 2020; Alishahi et al., 2019; Jacovi and Goldberg, 2020). Several works are concerned with post-hoc techniques to explain decisions of blackbox models (Ribeiro et al., 2016; Feng et al., 2018; Camburu et al., 2019). Visualizing attention weights has been heavily used, but is known to be insufficient (Jain and Wallace, 2019; Serrano and Smith, 2019). Other works focus on making the models themselves more interpretable via neural module networks (Jiang and Bansal, 2019; Gupta et al., 2020), graph-based networks (Tu et al., 2019; Qiu et al., 2019), pipeline models (Lehman et al., 2019), or by generating textual explanations (Camburu et al., 2018; Rajani et al., 2019; Liu et al., 2019a). Rather than only producing this explanation as additional output, Latcinnik and Berant (2020) base the target prediction on this automatically created hypothesis.
Some approaches jointly use rationales to explain the predictions and boost performance without ensuring faithfulness (Zaidan et al., 2007; Melamud et al., 2019; Strout et al., 2019).

Figure 2: While the example from FEVER provides two alternative single-sentence rationales (R1 and R2), the MultiRC example requires considering two sentences at once for a single rationale (R1).
Very recent work aims, similarly to us, to infer faithful rationales based on their impact on the target prediction without supervision, thereby relying on a dedicated explanation technique to identify rationales and an additional model for the prediction. Our work differs in that we (a) rely on the same network weights for rationale selection and target prediction, and (b) provide a quantitative analysis of the models' decision criteria on the reasoning tasks.

Datasets
We conduct our experiments on three different datasets as provided by ERASER. Specifically, we use FEVER (Thorne et al., 2018), MultiRC (Khashabi et al., 2018), and Movies (Zaidan et al., 2007), as shown in Table 1. We limit ourselves to this subset of ERASER, as these datasets require the identification of rationales from multi-sentence documents (as opposed to single sentences). Further, our approach must process the full sample, including the document, within the same minibatch. We do not consider datasets whose documents' size imposes memory issues with pre-trained language models, as this would require external preprocessing, which is not controlled by the model. FEVER is a large fact-checking dataset based on Wikipedia. Given a claim and a relevant document, the model must either support or refute the claim. In FEVER, multiple alternative rationales may exist, each of which can be used to refute or support a claim.
MultiRC is a multi-hop reasoning multiple-choice dataset. It encompasses a variety of genres. Each question is annotated with a single rationale, which always consists of multiple sentences. For each question, an arbitrary number of correct answers exists. Examples for both datasets can be found in Figure 2.
Movies is a sentiment dataset of movie reviews. As opposed to the other two corpora, it (a) does not require reasoning between the document and an additional claim/question, and (b) contains rationale annotations on a span level. Though we are primarily interested in sentence-level reasoning tasks, we apply our method to this dataset and map its annotations to sentences.

Our Model
Task Overview We propose a model that (a) explains its decisions by outputting which input parts are used for the predictions as faithful rationales and (b) achieves performance comparable to a standard blackbox approach. Importantly, the model must be able to select rationales that are useful to solve the target task, without relying on additional supervision. We achieve this by first creating multiple smaller samples for each original sample, each associated with a potential rationale, and solving the task based on each sub-sample individually. Similar to Clark and Gardner (2018), each sub-sample is associated with a learned score. Our model utilizes this score to jointly predict the target and the rationale. Instead of learning these scores via direct supervision (Min et al., 2019), our approach derives them solely based on how useful each rationale is for solving the target task.

Figure 3: Model architecture. Each sample is split into its sentences (1), each individually encoded via BERT (2) followed by a linear layer (3). The loss for each input part is calculated separately (4, 5). The score is computed via max-pooling (6) and normalized (7) to compute the weighted loss (8). The input part with the highest score (6) is used for prediction.

Single-Sentence without Rationale Supervision
Given a sample, the model must predict the label y based on a query q, i.e., the concatenation of the question and answer (MultiRC) or the claim (FEVER), and a document D. Instead of optimizing the objective given (q, D), we split D into segments and solve the overall task for each segment individually. We opt to split each document into sentences, as a trade-off between capturing enough semantic information within each segment while restricting each candidate's amount of information. Because some samples may be solved without any context (Schuster et al., 2019), we add a query-only part, which is associated with no sentence (∅). Hence, for each (q_k, D_k) with D_k containing n_k sentences s_{k,i}, we create new input samples x_k^new with |x_k^new| = n_k + 1 as

x_k^new = {(q_k, ∅), (q_k, s_{k,1}), ..., (q_k, s_{k,n_k})}

We use a standard model m to compute the logits z_k (without softmax) based on all (q_k, s_{k,i}) in x_k^new within the same minibatch. All experiments use BERT-base-uncased (Devlin et al., 2018) with a linear layer on top of the [CLS] token, such that z_k ∈ R^{|x_k^new| × t}, where t reflects the number of target labels. Based on z_k, we compute |x_k^new| losses l_k via softmax and cross-entropy for each (q_k, s_{k,i}) individually. Likewise, |x_k^new| different target predictions ŷ_k are computed. Not all (q_k, s_{k,i}) contain the right information to properly solve the target task. Similar to Clark and Gardner (2018) and Min et al. (2019), we rely on confidence scores to identify the best prediction, based on the most relevant rationale. To do so we must (a) compute scalar values c_{k,i} as confidence scores for each (q_k, s_{k,i}), and (b) ensure that high scores c_{k,i} are assigned to those input parts that are most useful from the model's perspective. We compute c_k via row-wise max-pooling over z_k, as it represents the value of the selected class:

c_{k,i} = max_j z_{k,i,j}

The key idea is to multiply these c_k with the losses l_k to compute the overall loss, s.t. high losses will be associated with low confidence and vice versa.
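The candidate construction and per-candidate loss computation described above can be sketched in plain Python (a minimal sketch: hand-written logit lists stand in for BERT's [CLS]-plus-linear-layer output, and all function names are ours, not the authors'):

```python
import math

def make_candidates(query, sentences):
    """Build the n_k + 1 candidates: the query-only part (None stands in
    for the empty sentence) plus one (query, sentence) pair per sentence."""
    return [(query, None)] + [(query, s) for s in sentences]

def cross_entropy(logits, gold):
    """Softmax + cross-entropy loss for one candidate's logits over t labels."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return -math.log(exps[gold] / sum(exps))

def per_candidate_outputs(candidate_logits, gold_label):
    """One loss l_{k,i} and one prediction per candidate."""
    losses = [cross_entropy(z, gold_label) for z in candidate_logits]
    preds = [max(range(len(z)), key=z.__getitem__) for z in candidate_logits]
    return losses, preds
```

A candidate whose logits favor the gold label receives a lower loss; the training objective then weights these per-candidate losses by confidence.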
Yet, we cannot merely multiply both terms, as this would allow the model to decrease the loss towards minus infinity only by assigning high negative values to all c_k without optimizing towards the actual label. To overcome this problem and obtain meaningful scores c_k solely based on how useful each rationale is for the target task, we normalize all c_k via softmax to obtain weights w_{k,i} for each (q_k, s_{k,i}):

w_{k,i} = exp(c_{k,i} / τ) / Σ_j exp(c_{k,j} / τ)

As an overall objective, we minimize the weighted sum of losses using these weights:

L_k = Σ_i w_{k,i} · l_{k,i}

The rationale behind this is threefold, building on the observation that a right prediction, i.e., a low loss l_{k,i}, is only possible for sentences that are informative from the model's perspective. First, by allowing the model to distribute the weights for the losses amongst all candidates, it can neglect non-informative sentences by learning to assign low values to high losses. Second, by normalizing these scores, it cannot ignore all sentences, but must assign comparatively higher scores to at least one (q_k, s_{k,i}). Hence, to minimize the overall loss, high values must be assigned to the best-suited (q_k, s_{k,i}), i.e., the one with the lowest (expected) loss. Finally, by deriving these scores directly from the predicted class, the same function is used and optimized for prediction and selection. The hyperparameter τ is the temperature of the softmax, controlling its distribution: higher values for τ result in a softer distribution, i.e., the loss is more evenly distributed amongst rationale candidates, while lower values result in a harder distribution, i.e., the model focuses more quickly on one selected rationale. For both prediction and training, all rationales are always considered. The process is visually exemplified in Figure 3 and, for the most part (steps 2-5), resembles a standard setup.
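Under the same assumptions (plain Python, illustrative names), the confidence scores, temperature softmax, and weighted objective can be sketched as:

```python
import math

def weighted_loss(candidate_logits, losses, tau=1.0):
    """Confidence-weighted training loss for one sample's candidates."""
    # c_{k,i}: row-wise max over the logits, i.e. the value of the predicted class
    c = [max(z) for z in candidate_logits]
    # w_{k,i}: temperature-tau softmax over the confidences
    m = max(ci / tau for ci in c)
    exps = [math.exp(ci / tau - m) for ci in c]
    total = sum(exps)
    w = [e / total for e in exps]
    # overall objective: weighted sum of the per-candidate losses
    return sum(wi * li for wi, li in zip(w, losses))
```

With equal confidences this reduces to the plain mean of the losses, while a small tau concentrates almost all weight on the most confident candidate.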
Prediction For predictions, we select the sentence with the highest confidence from all sentences as the rationale r̂, and the prediction based on r̂ as the target ŷ:

r̂ = argmax_i c_{k,i},  ŷ = ŷ_{k,r̂}

Though the rationale is faithful on a sentence level, we note that it does not indicate whether all information of r̂ is relevant to the model.
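Continuing the same sketch (illustrative names, hand-written logit lists), inference selects the most confident candidate as the rationale and takes that candidate's predicted label:

```python
def predict(candidate_logits):
    """Return (rationale index, predicted label) for one sample:
    pick the candidate with the highest confidence (row-wise max logit),
    then predict from that candidate's logits alone."""
    c = [max(z) for z in candidate_logits]
    r_hat = max(range(len(c)), key=c.__getitem__)
    y_hat = max(range(len(candidate_logits[r_hat])),
                key=candidate_logits[r_hat].__getitem__)
    return r_hat, y_hat
```

Because the label is read off the selected candidate only, the selected sentence is, by construction, the full evidence behind the prediction.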

Rationale supervision
We believe that rationales without supervision provide more trustworthy explanations: they are not affected by an additional objective and are selected solely if they are useful for the target task. Nevertheless, we experimentally show how rationale supervision can be applied by jointly (Yin and Roth, 2018) supervising on the target and the rationale. To compute the rationale loss as an additional objective, we treat slightly adapted confidence values c*_k as a multi-label problem via a sigmoid layer and binary cross-entropy loss.
This ensures that the correct class's confidence is increased even if the model (currently) predicts the wrong class.
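A minimal sketch of this auxiliary objective (plain Python; the adapted confidences c*_k are passed in as plain floats, and the function name is ours):

```python
import math

def rationale_bce_loss(confidences, gold_mask):
    """Multi-label rationale loss: sigmoid on each confidence, then binary
    cross-entropy against the gold-rationale indicator per candidate."""
    total = 0.0
    for c, g in zip(confidences, gold_mask):
        p = 1.0 / (1.0 + math.exp(-c))  # sigmoid
        total += -(g * math.log(p) + (1 - g) * math.log(1.0 - p))
    return total / len(confidences)
```

Unlike the softmax weighting of the main objective, each candidate is pushed towards its own gold indicator independently, so the gold rationale's confidence is raised even when another candidate currently scores higher.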
Multiple Sentences Due to memory consumption, encoding all (ordered) permutations of sentences up to a certain length through BERT is infeasible. To allow the model to select multiple sentences, for each permutation up to a length h, the representation is computed by max-pooling over the [CLS] token embeddings of its sentences. We experiment with up to two sentences.
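The pooling step can be sketched as follows (plain Python lists stand in for the [CLS] embeddings; the function name is ours):

```python
from itertools import permutations

def multi_sentence_representations(cls_embeddings, h=2):
    """Represent every ordered permutation of up to h sentences by
    element-wise max-pooling over the sentences' [CLS] embeddings."""
    dim = len(cls_embeddings[0])
    reps = {}
    for length in range(1, h + 1):
        for combo in permutations(range(len(cls_embeddings)), length):
            reps[combo] = [max(cls_embeddings[i][d] for i in combo)
                           for d in range(dim)]
    return reps
```

Each sentence is thus encoded only once through BERT, and the cheap max-pooling combines the cached embeddings for multi-sentence candidates.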

Results
All experiments use AllenNLP and BERT-base-uncased (Devlin et al., 2018) as provided by Wolf et al. (2019). We manually tune hyper-parameters for the standard BERT baseline models and the sentence-selecting models, and show results in Table 2. We report results for the best configurations using three different seeds. We additionally report results of the BERT-to-BERT pipeline models from ERASER, which are based on the implementation of Lehman et al. (2019).

Table 2: Mean performance and standard deviation for all models. U represents models without supervision on the rationale, S indicates supervision is applied on the rationale. The first two columns measure the performance on the target task using macro-averaged F1 and accuracy. The next three columns specify Precision, Recall and F1 of the rationales on a sentence level. The last two columns jointly show the performance based on a correct rationale and target. Majority is only computed for the target-task performance.
Metrics As opposed to DeYoung et al. (2020), we choose sentences as the lexical unit for rationales. We report precision, recall, and F1 for the rationales rather than token-level IOU, to avoid that the length of sentences impacts the metrics.3 As we are interested in understanding whether a model makes the right prediction for the right reasons, we focus on sufficiency of selected rationales rather than comprehensiveness: the claim of FEVER in Figure 2 shows two valid rationales, and only one of these is required to support the claim. To compute precision, recall, and F1 w.r.t. sufficiency, we therefore compute these metrics based on the single most similar4 gold rationale when evaluating any of the models. We additionally report the joint accuracy of the target task and the rationale. Here we consider a prediction correct for the right reason when it correctly predicts the target and all sentences of one gold rationale (Acc. Full). A weaker measure (Acc. Part) only requires the intersection of the selected sentences and one gold rationale to be non-empty. As multi-hop classification tasks tend to be easy to "trick" (Chen and Durrett, 2019), this joint evaluation with the underlying evidence gives a better impression of the performance on the task itself.

3 To simplify comparisons with future work, we report the original ERASER metrics in Appendix A.
4 Determined by highest F1 on the sentence level.
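The sufficiency-oriented scoring can be sketched as follows (rationales as Python sets of sentence indices; function names are ours):

```python
def sentence_prf(pred, gold):
    """Precision, recall, and F1 between predicted and gold sentence-index sets."""
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f

def sufficiency_scores(pred, gold_rationales):
    """Score against the single most similar gold rationale (highest F1)."""
    return max((sentence_prf(pred, g) for g in gold_rationales),
               key=lambda prf: prf[2])
```

Selecting only one of several alternative gold rationales is thus not penalized, which is the intended sufficiency behavior.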
Observations The Target columns in Table 2 show that our models can compete with the standard BERT on both reasoning tasks, FEVER and MultiRC. This is especially surprising for single-sentence models on the multi-hop reasoning task MultiRC. We find that the single-sentence model U is more sensitive towards seeds, yielding a slightly lower overall performance and higher variance on MultiRC (see Appendix B). We believe this is because, given an unfortunate initialization, the model can focus on arbitrary features too quickly on this challenging dataset. Applying rationale supervision helps to stabilize this by improving the selected rationales rather than generally reaching higher target performance. The BERT-to-BERT pipeline makes its prediction based on the best single sentence and can only fairly be compared with the single-sentence selecting models. The unsupervised approach is far behind all other models on the Movies dataset, which we partly attribute to the small training data combined with the much larger document size. Primarily, however, we find (see Section 5.1) that by design, our approach is unsuitable for this kind of data, which, due to its discussion-based nature, contains evidence for both labels within the same document. The closest measure for "right for the right reasons" is represented by Acc. Full. Yet, it can only measure whether the prediction is based on the correct rationale on a sentence level, whereas it may still solely rely on certain contained words. Assuming comprehensive rationale annotations,5 the opposite can be said, i.e., 92.9% of MultiRC are not classified correctly for the right reasons. Note that both the single-sentence models and the BERT-to-BERT pipeline are bound to reach 0% for Acc. Full on MultiRC, since they can only select a single sentence as the rationale.

Analysis
Leveraging the information about the used rationales, we analyze more closely the decision criteria for FEVER and MultiRC, and why our method performs poorly on Movies. Further, except for the two-sentence S models on MultiRC, no other model selects two sentences as a rationale in more than 1.3% of cases. We partly attribute this to the less-than-optimal aggregation via max-pooling. As these are only selected due to the additional supervision, not for their utility in solving the overall task,6 we focus on single-sentence models.

Poor Performance on Movies Dataset
Without rationale supervision, our approach lags far behind its counterparts. To better understand the reason for this performance gap, we analyze the underlying data and the predictions. We find that our models U reach an average recall of 0.93 and 0.32 for NEG and POS respectively on the dev set, despite the balanced training data. We emphasize that this is due to the very different nature of the data compared to FEVER and MultiRC: rather than all sentences within a document carrying the same sentiment, reviews usually discuss pros and cons, and hence contain evidence for the gold label as well as for the opposite label. An extract of such a document can be seen in Figure 4, and two full examples in Appendix E. During prediction, even for humans, it is impossible to predict the correct overall sentiment based on isolated, out-of-context sentences of opposing stances. An additional problem arises during training in our setup: for the presented example, the model must either learn to predict the label NEG even for sentences with clearly (only) positive indicators, or learn to reduce their confidence values c_k to mitigate their impact. Either way, this naturally compromises its ability to detect the opposite sentiment. This discussion-based nature of Movies significantly differs from MultiRC and FEVER. In the latter case, each document only contains evidence for or against a claim, not both. There, the model need not learn contradicting patterns and only has to lower the confidence for irrelevant sentences, which is consistent with both labels. Both the pipeline and the model S show that by guiding the model towards gold rationales, it can detect sentences matching the overall movie sentiment. Without this guidance, however, our approach seems not suitable for such tasks.

Figure 4: Extract of a Movies document containing sentences of opposing stances:
(35) the scenes between nick and danny are very good, and i actually got a feel for their characters; a bond forms between them that holds parts of the film together.
(36) chow and wahlberg are both good actors; chow is a pro, and can do this kind of stuff in his sleep.
(37) wahlberg seems less at home in this atmosphere, but he's still fun to watch.
(38) i also liked the subplot involving danny's father; brian cox's performance is powerful, and his character makes a compelling moral compass for danny.
(39) but the film ultimately fails, mostly at the hands of insane incoherence and overly-familiar action scenes.

5 FEVER does not provide comprehensive rationale annotations.
6 We show supporting analysis for this in Appendix D.

Learning curves
We investigate the impact of the amount of available training data for the three different models: blackbox, model S, and model U. To limit the data's impact, we create three random subsets of the training data of different sizes and report the average performance of each of the models on these subsets in Figure 5. All three models show similar trends across all training sizes for MultiRC. On FEVER, the rationale supervision offers an additional boost in scenarios with little data. Without rationale supervision, the model tends to require more data to reach its peak performance.

Model decisions on FEVER
Both (best) single-sentence models U and S perform very strongly and predict the same label in 93.8% of all cases, of which they select the same rationale in 86%. We therefore focus on how supervision affects the model internally. Specifically, we exploit the fact that relevance and prediction are jointly encoded and optimized within the same logits z_{k,i}. In Figure 6 we compare these z_{k,i} from a global perspective after normalizing them using min-max normalization. Applying rationale supervision leads to more decisive predictions, as the vast majority of unselected sentences score close to the global minimum, whereas selected sentences have scores close to the maximum. Invalid selected rationales tend to be shifted slightly more towards the lower end than selected correct rationales. This looks very different for model U: most importantly, a non-trivial amount of unselected sentences reach scores very close to the global maximum.
Does it learn semantically better decision criteria with supervision? A possible reason why such high values occur for unchosen sentences is that the selected rationale is not substantial for a correct target prediction. Schuster et al. (2019) identify n-grams within claims that highly correlate with certain classes. By adding new evidence and claims for each of their selected claims, they design a symmetric test set which cannot be solved using such artifacts. Intuitively, similar to Stacey et al. (2020), applying rationale supervision (model S) forces the model to learn, based on the rationale, high and low values for the same claim, i.e., containing the same artifacts. It should therefore be more sensitive to the context and not rely on claim-only features. We show the performance on this symmetric test set in Table 3. Despite a small improvement, it still lags far behind the performance on FEVER. Even the model U rarely selects the claim-only candidate as the rationale, suggesting that, at least partially, additional context helps to solve the task properly. Yet, it shows that smaller lexical units than sentences as a rationale may be beneficial in such cases.

Model decisions on MultiRC
What is the impact of rationale supervision?
The ceiling performance on the target task remains the same, even with rationale supervision. We analyze the validity of the selected rationales on the validation split to shed light on (a) how the model can achieve a strong performance, and (b) how rationale supervision affects the model. For simplicity, we select the best performing single-sentence models and group the predictions by the gold and predicted target label in Table 4. The model U results show that evidence of positive samples is more likely to get selected. While the correctly predicted positive samples mostly rely on gold evidence for the answer, for correctly predicted negative samples, the absence of supporting evidence seems sufficient, rather than explicit evidence against it. Note that none of this "evidence" is truly sufficient, as multiple sentences are technically required. To see whether this behavior is due to our training method or helpful for the underlying data, we re-evaluate the best performing BERT on the validation set and exclude all gold rationales from the documents. The results show a recall of 28.4 (True) and 81.8 (False), suggesting a similar behavior. Hence, the major benefit from rationale supervision is to predict the label False based on the correct sentence, which is not required to solve the overall task. To limit this property of future datasets, we believe it is important to add unanswerable instances, as done for instance by Thorne et al. (2018) or Rajpurkar et al. (2018).
What kind of sentences are selected as a rationale?
We jointly look at the selected sentences and the target predictions of both models U and S and observe a high correlation with the word overlap of the question and the answer. Figure 7 shows KDE plots of the selected sentences based on the percentage of non-stopwords of the question and answer, respectively, that are also contained within the selected sentence. We make multiple observations: positive predictions mostly depend on a high overlap with the answer; the overlap with the question has a lower priority. Especially for the model S, a clear decision boundary between rationales for both labels can be seen based on the lexical overlap. Interestingly, Yadav et al. (2019) also rely, to a large part, on similar lexical features for their unsupervised detection of justification sentences on MultiRC. In line with the previous section, rationale supervision only has a limited impact on positive predictions. A significant difference shows for the negative predictions: whereas model U tends to select rationales for both labels based on similar criteria, the rationales selected for samples predicted False by model S almost entirely have lexical overlaps with the question only. This intuitively makes sense, as the same rationales are valid for all answers to the same question. Negative rationales should therefore be relevant for the question, not for the answer. We show some examples in Appendix C.
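The overlap statistic underlying these plots can be sketched as follows (whitespace tokenization and the stopword list are simplifying assumptions of ours):

```python
def overlap_fraction(selected_sentence, reference, stopwords):
    """Fraction of the reference's non-stopwords (e.g. question or answer)
    that also appear in the selected sentence; both sides lowercased."""
    ref = {t for t in reference.lower().split() if t not in stopwords}
    sent = set(selected_sentence.lower().split())
    return len(ref & sent) / len(ref) if ref else 0.0
```

Computing this fraction once against the question and once against the answer yields the two coordinates of each point in the KDE plots.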
Are single sentences sufficient for MultiRC?
It has been shown that noisy detection of evidence can already improve the performance on MultiRC, yet this should not be possible via single sentences. To see whether BERT exploits such biases, we follow Gururangan et al. (2018) and identify samples within the test set that are solvable using a single hop only, i.e., those which the single-sentence U model classified correctly. To limit the impact of lucky guesses, we group samples by the number of these models that could solve them in Table 5. As pointed out in Section 4, one of our single-sentence models U on MultiRC performed poorly due to its seed sensitivity. To exclude impacts from this specific model and group the test split by meaningful criteria, we retrain BERT blackbox and model U with a new random seed, reaching an F1a score of 66.3 and 67.6 respectively on the test set. We select the best three seeds of both model types for splitting the data (model U) and evaluation (BERT blackbox).

Table 5: Average performance of BERT models based on subsets of the test split that can be solved using a single sentence, compared with a lexical-overlap logistic regression. ∆F1a measures the difference w.r.t. the performance on the full test set. Columns indicate how many single-sentence models U could solve each contained instance correctly.
Lexical Overlap Logistic Regression Additionally, we mimic our observations on high lexical overlap using a simple logistic regression. We calculate a rationale score r = w_q * q_s + w_a * a_s for each sentence s, where q_s and a_s represent the absolute/relative word overlap of the sentence with the question and answer, respectively. For each sample, the sentence with the highest r is selected as a rationale (shorter sentences are preferred as a tie-breaker) and used to train a logistic regression (LR), reducing the multi-hop reasoning task to two overlap features based on a single sentence. We run a grid search with different values for w_q and w_a and select the model with the highest F1a score of 63.5 on the validation set (F1a score of 58.1 on the test set), using absolute word overlaps, w_q = 0.4 and w_a = 1.0.
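The scoring and tie-breaking rule can be sketched as follows (whitespace tokenization is a simplifying assumption; the LR training itself is omitted and the function name is ours):

```python
def select_rationale(sentences, question, answer, w_q=0.4, w_a=1.0):
    """Pick the sentence maximizing r = w_q * q_s + w_a * a_s, where q_s and
    a_s are absolute word overlaps with question/answer; shorter sentences
    win ties."""
    q_toks = set(question.lower().split())
    a_toks = set(answer.lower().split())
    best_key, best_sentence = None, None
    for s in sentences:
        toks = set(s.lower().split())
        r = w_q * len(toks & q_toks) + w_a * len(toks & a_toks)
        key = (r, -len(toks))  # prefer shorter sentences on equal scores
        if best_key is None or key > best_key:
            best_key, best_sentence = key, s
    return best_sentence
```

The two overlap counts of the winning sentence then serve as the only features of the logistic regression.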

Results
The performances are shown in Table 5. BERT performs strongly on samples that can be solved using a single sentence while struggling with the same instances as model U. Further, the simple logistic regression shows a similar trend; on the easiest (and largest) subset, it even exceeds the performance of any BERT model on the full test set. The results suggest that high performance does not indicate successful multi-hop reasoning.9

Discussion
Limitations From a technical perspective, a limitation is memory consumption, as the model must process all rationale candidates of the same instance within the same minibatch. Though single-sentence rationales can be processed, encoding all combinations of multiple sentences via BERT is problematic. Future work could investigate better sampling strategies or a greedy breadth-first search to reduce the number of candidates. Another limitation is the inability to resolve coreference between different sentences and to consider the context in general. Solving this is non-trivial, as we essentially buy faithfulness by explicitly omitting all information other than the selected sentence(s). While this does not seem crucial in the evaluated datasets, it poses potential dangers for malicious attacks, most importantly when considering the permutations of multiple sentences. Therefore, we recommend always showing the identified evidence in context when using our approach in the real world.

9 This is not the official, hidden test set of MultiRC.
Conclusion We proposed a conceptually simple approach that allows models to extract faithful rationales, which can compete with standard BERT on two reasoning tasks without supervision and even improve the overall performance when supervising on the rationale. We showed that by outputting faithful rationales, it is possible not only to compare models based on the target performance alone, but also to quantify how well even correct predictions are based on the correct evidence. Our analysis showed that exploiting this knowledge about the selected rationales helps shed light on the models' decision mechanisms for debugging purposes and on the underlying data.

D Two-Sentence Rationales on MultiRC

On MultiRC, the two-sentence model U selects a single sentence as the rationale in 99.0% of cases, whereas the model S selects two sentences in 51.4%. In 83.4% of cases, both models predict the same target ŷ. Based on these, we consider all instances where model S selects the same sentence as model U plus one additional sentence as a rationale, to identify whether (i) both sentences are relevant, (ii) the shared sentence is relevant, or (iii) the additional sentence is relevant for model S. Instead of looking at the prediction of the joint rationale of both sentences of model S, i.e., the selected rationale with the highest confidence score, we now look at the predictions of both selected sentences individually. Table 8 shows whether the prediction of model S remains stable for both predicted labels if only one of the sentences out of the two-sentence rationale is used. For False predictions, the additional sentence (only selected when supervised) has a major impact on the prediction and seems most relevant. This is in line with our observations in Section 5.4, namely that supervision affects the decision mechanism predicting this label. For the prediction of True, in almost all cases the same sentence as the one selected by model U yields the same prediction.
The additional sentence in isolation, however, changes the prediction to False in almost half of all cases. Though bound to our approach, these results suggest that rationale supervision may lead to selecting rationales that are not required by the model to solve the target task, but rather satisfy the rationale objective, thereby losing some of their faithfulness. This may be a relevant consideration when measuring faithfulness at a more fine-grained level. Figure 9 shows an example of positive sentiment in which the model disregards sentences with clear positive stances and instead selects a sentence containing "scary" as the rationale. Figure 10 shows how the model correctly selects a sentence of positive stance but interprets this sentence as negative. Both examples show that sentences with opposing stances occur when the review discusses the plot and the movie in general.
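The stability check described above can be sketched as follows. This is a simplified, hypothetical reconstruction of the analysis, not our actual code; the function name and label strings are illustrative:

```python
def attribute_rationale(pred_joint: str, pred_shared: str, pred_additional: str) -> str:
    """Classify which sentence of a two-sentence rationale drives the
    prediction, by checking whether each sentence in isolation
    reproduces the prediction made on the joint rationale."""
    shared_stable = pred_shared == pred_joint
    additional_stable = pred_additional == pred_joint
    if shared_stable and additional_stable:
        return "both relevant"
    if shared_stable:
        return "shared sentence relevant"
    if additional_stable:
        return "additional sentence relevant"
    return "neither stable"

# A False prediction that only survives with the additional sentence,
# i.e. the sentence selected solely under rationale supervision:
print(attribute_rationale("False", "True", "False"))
# -> additional sentence relevant
```

Aggregating this classification over all instances where model S extends model U's single-sentence rationale would produce counts of the three cases (i)-(iii) discussed above.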

E Movies Examples
(1) there's a thin line between satire and controversy, and mike nichols (the birdcage, wolf) has directed a sharp and very honest look at a us presidential election. (2) based on the book written by "anonymous" (actually former "newsweek" writer joe klein), john travolta plays governor jack stanton. (3) but he doesn't actually play stanton. (4) he plays bill clinton; just the same as emma thompson no doubt plays the first lady and billy bob thorton is the campaign manipulator james carville (although the credits will of course say otherwise). (5) the film is taken from the perspective of henry burton (adrian lester), a morally correct and somewhat hesitant new advisor to stanton. (6) he searches for justice and dignity in the ugliest possible situations, and whether it be keeping the history of his boss' pants under wraps or contemplating digging up dirt on another politician, he approaches his work with a keen desire to skillfully serve his country and his fellow workers. (7) richard jemmons (billy bob thorton) and daisy green (maura tierney) team up with henry as the would-be president's advisors, and hire lesbian veteran libby holden (kathy bates) as the campaign's eccentric "tougher than dirt" incriminator. (8) together they face all sorts of sexual allegations, the irritatingly discourteous media and other witty politicians in the election race. (9) in its satire and controversy, primary colors is a similar film to wag the dog: they both are not afraid to wipe their noses in the nitty-gritty and take a bold look at something that will never has honesty as a virtue. (10) but whereas wag showed us how much affect a few people can have on the media, primary colors is much more concerned with fleshing out it's characters, letting us understand what they want and why, and making us truly appreciate the humanity and rectitude that they graciously represent. (11) seeing john travolta play bill clinton (12) so confidently and justly is enough to make the film more than worth a look. and the rest of the cast also make (13) superb performances - adrian lester sharply portrays the intellect of henry whilst kathy bates is perfect as the robust and energetic libby holden. (14) at occasions, you can't help but feel that these terrific characters are going to waste. (15) there are long slabs of time where john travolta (unquestionably the most interesting to watch) is missed from the screen; and since it is awkwardly structured as henry's story we are often forced to watch scenes that perhaps are not so necessary to the central plot - or even the point of the film. (16) having said that, make no mistake - primary colors is always enjoyable to watch (17). (18) but frequently we have to ask ourselves - exactly what are we watching? (19) most of the first half of its duration is a lightheaded look at melodramatic confrontations that seem so genuine we cannot help but laugh, but the way primary colors chooses to finish tackles aspects that are very contrary, and almost unsuitable, to the rest of the film. (20) but as i mentioned before, there is a thin line between satire and controversy - and for the most part, primary colors delivers an entertaining indulgence of political matters combined with a far-from-overpowering look at winning the public's opinion. (21) although at occasions the film may jump around a little too freely, focus is never lost on how important and vulnerable the subject matter really is.