When Choosing Plausible Alternatives, Clever Hans can be Clever

Pretrained language models, such as BERT and RoBERTa, have shown large improvements in the commonsense reasoning benchmark COPA. However, recent work found that many improvements in benchmarks of natural language understanding are not due to models learning the task, but due to their increasing ability to exploit superficial cues, such as tokens that occur more often in the correct answer than the wrong one. Is the good performance of BERT and RoBERTa on COPA also caused by this? We find superficial cues in COPA, as well as evidence that BERT exploits these cues. To remedy this problem, we introduce Balanced COPA, an extension of COPA that does not suffer from easy-to-exploit single token cues. We analyze BERT’s and RoBERTa’s performance on original and Balanced COPA, finding that BERT relies on superficial cues when they are present, but still achieves comparable performance once they are made ineffective, suggesting that BERT learns the task to a certain degree when forced to. In contrast, RoBERTa does not appear to rely on superficial cues.


Introduction
Pretrained language models such as ELMo (Peters et al., 2018), BERT (Devlin et al., 2019), and RoBERTa (Liu et al., 2019b) have led to improved performance in benchmarks of natural language understanding, in tasks such as natural language inference (NLI, Liu et al., 2019a), argumentation (Niven and Kao, 2019), and commonsense reasoning (Li et al., 2019; Sap et al., 2019). However, recent work has identified superficial cues in benchmark datasets which are predictive of the correct answer, such as token distributions and lexical overlap. Once these cues are neutralized, models perform poorly, suggesting that their good performance is an instance of the Clever Hans effect (Pfungst, 1911): Models trained on datasets with superficial cues learn heuristics for exploiting these cues, but do not develop any deeper understanding of the task. While superficial cues have been identified in, among others, datasets for NLI (Gururangan et al., 2018; McCoy et al., 2019), machine reading comprehension (Sugawara et al., 2018), and argumentation (Niven and Kao, 2019), one of the main benchmarks for commonsense reasoning, namely the Choice of Plausible Alternatives (COPA, Roemmele et al., 2011), has not been analyzed so far. Here we present an analysis of superficial cues in COPA.

(a) Original COPA instance. Premise: The woman hummed to herself. What was the cause for this? Correct alternative: She was in a good mood. Wrong alternative: She was nervous.
(b) Mirrored COPA instance. Premise: The woman trembled. What was the cause for this? Wrong alternative: She was in a good mood. Correct alternative: She was nervous.
Figure 1: A COPA instance (a) with premise and correct and wrong alternatives. Our analysis reveals that the unigram "a" is a superficial cue exploited by BERT. We neutralize such superficial cues by creating a mirrored instance (b). After mirroring, the superficial cue becomes ineffective in predicting the correct answer, since it occurs with equal probability in correct and wrong alternatives.

* Equal contribution.
Given a premise, such as The man broke his toe, COPA requires choosing the more plausible, causally related alternative, in this case either: because He got a hole in his sock (wrong) or because He dropped a hammer on his foot (correct). To test whether COPA contains superficial cues, we conduct a dataset ablation in which we provide only partial input to the model. Specifically, we provide only the two alternatives, but not the premise, which makes solving the task impossible and hence should reduce the model to random performance. However, we observe that a model trained only on alternatives performs considerably better than random chance and trace this result to an unbalanced distribution of tokens between correct and wrong alternatives. Further analysis (§4.3) reveals that finetuned BERT (Devlin et al., 2019) performs very well (83.9 percent accuracy) on easy instances containing superficial cues, but worse (71.9 percent) on hard instances without such simple cues.
To prevent models from exploiting superficial cues in COPA, we introduce Balanced COPA. Balanced COPA contains one additional, mirrored instance for each original training instance. This mirrored instance uses the same alternatives as the corresponding original instance, but introduces a new premise which matches the wrong alternative of the original instance, e.g. The man hid his feet, for which the correct alternative is now because He got a hole in his sock (see another example in Figure 1). Since each alternative occurs exactly once as the correct answer and exactly once as the wrong answer in Balanced COPA, the lexical distribution between correct and wrong answers is perfectly balanced, i.e., superficial cues in the original alternatives have become uninformative.
Balanced COPA allows us to study the impact of the presence or absence of superficial cues on model performance.
Since BERT exploits cues in the original COPA, we expected performance to degrade when training on Balanced COPA. However, BERT trained on Balanced COPA performed comparably overall. As we will show, this is due to better performance on the "hard" instances. This suggests that once superficial cues are made uninformative, BERT learns the task to a certain degree.
In summary, our contributions are:
• We identify superficial cues in COPA that allow models to use simple heuristics instead of learning the task (§2);
• We introduce Balanced COPA, which prevents models from exploiting these cues (§3);
• Comparing models on original and Balanced COPA, we find that BERT heavily exploits cues when they are present, but is also able to learn the task when they are not (§4); and
• We show that RoBERTa does not appear to exploit superficial cues.
Superficial Cues in COPA

COPA: Choice of Plausible Alternatives
Causal reasoning is an important prerequisite for natural language understanding. The Choice Of Plausible Alternatives (COPA) (Roemmele et al., 2011) is a dataset that aims to benchmark causal reasoning in a simple binary classification setting. COPA requires classifying sentence pairs consisting of a first sentence, the premise, and a second sentence that is either a cause of, an effect of, or unrelated to the premise. Given the premise and two alternatives, one of which has a causal relation to the premise, while the other does not, models need to choose the more plausible alternative. Figure 1a shows an example of a COPA instance. The 1,000 instances overall are split into a training set and a test set of 500 instances each. Prior to neural network approaches, the dominant way of solving COPA was via statistics based on Pointwise Mutual Information (PMI) between the content words in the premise and the alternatives, estimated from a large background corpus (Luo et al., 2016; Sasaki et al., 2017; Goodwin et al., 2012). Recent studies show that BERT and RoBERTa achieve considerable improvements on COPA (see Table 1).
However, recent work found that the strong performance of BERT and other deep neural models in benchmarks of natural language understanding can be partly or in some cases entirely explained by their capability to exploit superficial cues present in benchmark datasets. For example, Niven and Kao (2019) found that BERT exploits superficial cues, namely the occurrence of certain tokens such as not, in the Argument Reasoning Comprehension Task (Habernal et al., 2018). [...] (Williams et al., 2018) when given incomplete input, even though the task should not be solvable without the full input. This suggests that the partial input contains unintended superficial cues that allow the models to take shortcuts without learning the actual task. Sugawara et al. (2018) investigated superficial cues that make questions easier across recent machine reading comprehension datasets. Given the fact that superficial cues were found in benchmark datasets for a wide variety of natural language understanding tasks, does COPA contain such cues as well?

Table 1
Model                                     Accuracy
BigramPMI (Goodwin et al., 2012)          63.4
PMI                                       65.4
PMI+Connectives (Luo et al., 2016)        70.2
PMI+Con.+Phrase (Sasaki et al., 2017)     71.4
BERT-large (Wang et al., 2019)            70.5
BERT-large (Sap et al., 2019)             75.0
BERT-large (Li et al., 2019)              75.4
RoBERTa-large (finetuned)                 90.6
BERT-large (finetuned)*                   76.5 ± 2.7
RoBERTa-large (finetuned)*                87.7 ± 0.9

Token Distribution
One of the simplest types of superficial cues are unbalanced token distributions, i.e., tokens appearing more or less frequently with one particular instance label than with other labels. For example, Niven and Kao (2019) found that the token not occurs more often in one type of instance in an argumentation dataset. Similarly, we identify superficial cues in the COPA training set, in this case single tokens that appear more frequently in correct alternatives or in wrong alternatives.

To find superficial cues in the form of predictive tokens, we use the following measures, defined by Niven and Kao (2019). Let $T_j^{(i)}$ be the set of tokens in the alternatives for data point $i$ with label $j$, let $\neg j$ denote the other label, let $y_i$ be the correct label of data point $i$, and let $n$ be the number of data points. The applicability $\alpha_k$ of a token $k$ counts how often this token occurs in an alternative with one label, but not the other:

$$\alpha_k = \sum_{i=1}^{n} \mathbb{1}\left[\exists j,\; k \in T_j^{(i)} \wedge k \notin T_{\neg j}^{(i)}\right]$$

The productivity $\pi_k$ of a token is the proportion of applicable instances for which it predicts the correct answer:

$$\pi_k = \frac{\sum_{i=1}^{n} \mathbb{1}\left[\exists j,\; k \in T_j^{(i)} \wedge k \notin T_{\neg j}^{(i)} \wedge y_i = j\right]}{\alpha_k}$$

Finally, the coverage $\xi_k$ of a token is the proportion of applicable instances among all instances:

$$\xi_k = \frac{\alpha_k}{n}$$

Table 2 shows the five tokens with highest coverage. For example, "a" is the token with the highest coverage and appears in either a correct alternative or a wrong alternative in 21.2% of COPA training instances. Its productivity of 57.5% expresses that it appears in correct alternatives 7.5 percentage points more often than expected by random chance. This suggests that a model could rely on such unbalanced distributions of tokens to predict answers based only on alternatives without understanding the task.
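As a minimal sketch, the three token measures described above could be computed as follows. The data layout (a list of tokenized alternative pairs with the index of the correct alternative), the whitespace tokenization, and the function name `cue_statistics` are assumptions made for illustration; they are not taken from the original experiments.

```python
from collections import Counter


def cue_statistics(instances):
    """Compute applicability, productivity and coverage per token.

    `instances` is a list of (alt1_tokens, alt2_tokens, label) triples,
    where `label` (0 or 1) is the index of the correct alternative.
    This layout is a hypothetical one chosen for the example.
    """
    applicable = Counter()  # token occurs in exactly one alternative
    productive = Counter()  # ... and that alternative is the correct one
    for alt1, alt2, label in instances:
        token_sets = (set(alt1), set(alt2))
        for j in (0, 1):
            # Tokens present in alternative j but absent from the other one.
            for token in token_sets[j] - token_sets[1 - j]:
                applicable[token] += 1
                if j == label:
                    productive[token] += 1
    n = len(instances)
    return {
        token: {
            "applicability": count,
            "productivity": productive[token] / count,
            "coverage": count / n,
        }
        for token, count in applicable.items()
    }


# Toy data (invented for illustration, not real COPA statistics).
toy_instances = [
    ("he dropped a hammer".split(), "he got hole".split(), 0),
    ("she was in a good mood".split(), "she was nervous".split(), 0),
    ("the sun rose".split(), "a dog barked".split(), 0),
    ("it rained".split(), "it snowed".split(), 1),
]
toy_stats = cue_statistics(toy_instances)
```

In this toy data the token "a" is applicable in three of four instances and points to the correct alternative in two of those three, while "he" occurs in both alternatives of its instance and is therefore never applicable.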
To test this hypothesis, we perform a dataset ablation, providing only the two alternatives as input to RoBERTa, but not the premise, following similar ablations by Gururangan et al. (2018) and Niven and Kao (2019). RoBERTa trained in this setting, i.e. on alternatives only, achieves a mean accuracy of 59.6 (± 2.3). This is problematic because COPA is designed as a choice between alternatives given the premise. Without a premise given, model performance should not exceed random chance. Consequently, a result better than random chance shows that the dataset allows solving the task in a way that was not intended by its creators. To fix this problem, we create a balanced version of COPA that does not suffer from unbalanced token distributions in correct and wrong alternatives.

[...] ineffective. Our approach is to balance the token distributions in correct alternatives and wrong alternatives in the training set. Without unbalanced token distributions, we hope models are able to learn other patterns more closely related to the task, e.g. a pair of causally related events, rather than superficial cues.

Data Collection
To create the Balanced COPA training set, we manually mirror the original training set by modifying the premise. Taking the original training set as a starting point, we duplicate the COPA instances and modify their premises so that incorrect alternatives become correct. Consider the following original COPA instance:
• Premise: The stain came out of the shirt. What was the CAUSE of this?
We create the following balanced COPA instance, where the wrong alternative now becomes the correct choice:
• Premise: The shirt did not have a hole anymore. What was the CAUSE of this?
• Alternative 1: I bleached the shirt.
• Alternative 2: [...]
To collect such balanced data, we asked five fluent English speakers with background knowledge of NLP to write the mirrored premises (see Appendix A for the detailed guideline). In total, we collected 500 new mirrored instances. Concatenating them with the original training instances, Balanced COPA consists of 1,000 instances in total. The corpus is publicly available at https://balanced-copa.github.io.
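The effect of mirroring on token distributions can be illustrated with a small sketch. It uses the instance pair from Figure 1; the dictionary layout, the helper names `mirror` and `token_counts_by_role`, and the whitespace tokenization are assumptions made for this example only.

```python
from collections import Counter


def mirror(instance, new_premise):
    """Build the mirrored counterpart of an instance (hypothetical layout).

    The alternatives and question type stay the same; the premise is
    replaced and the formerly wrong alternative becomes the correct one.
    """
    return {
        "premise": new_premise,
        "question": instance["question"],
        "alternatives": instance["alternatives"],
        "label": 1 - instance["label"],
    }


def token_counts_by_role(instances):
    """Count tokens separately for correct and for wrong alternatives."""
    correct, wrong = Counter(), Counter()
    for inst in instances:
        for idx, alternative in enumerate(inst["alternatives"]):
            target = correct if idx == inst["label"] else wrong
            target.update(alternative.lower().split())
    return correct, wrong


original = {
    "premise": "The woman hummed to herself.",
    "question": "cause",
    "alternatives": ("She was in a good mood.", "She was nervous."),
    "label": 0,
}
mirrored = mirror(original, "The woman trembled.")

# Original data only: "a" occurs in a correct alternative but in no wrong one.
correct_orig, wrong_orig = token_counts_by_role([original])
# Original plus mirrored instance: every token count is identical across roles.
correct_bal, wrong_bal = token_counts_by_role([original, mirrored])
```

Because each alternative appears once as the correct answer and once as the wrong answer, the two counters computed on the combined data are equal, which is exactly the property that makes single-token cues uninformative.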

Quality Evaluation
To ensure the quality of the mirrored instances, we estimate human performance using Amazon Mechanical Turk (AMT), a widely used crowdsourcing platform. We randomly sample 100 instances from the original COPA training set and 100 instances from Balanced COPA, and ask crowdworkers to solve each instance (see Appendix B for an actual screenshot). To avoid noisy workers, we present our tasks to workers who hold the AMT Masters qualification and have at least 10,000 approved HITs and a 99% HIT approval rate. Per HIT, we assign three crowdworkers and offer a reward of 10 cents.
From the collected responses, we calculate the accuracy of workers (by majority voting) and inter-annotator agreement by Fleiss' Kappa (Fleiss, 1981). The human evaluation shows that our mirrored instances are comparable in difficulty to the original ones (see Table 3). However, some mirrored instances are a bit tricky at first glance, although the answer becomes quite obvious with a bit more attention (see Appendix C for an example).
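The two evaluation measures mentioned above can be sketched as follows. This is a plain implementation of majority voting and of the standard Fleiss' kappa formula under the assumption that every item is judged by the same number of workers; the response data are invented for illustration and do not reproduce the numbers in Table 3.

```python
from collections import Counter


def majority_vote_accuracy(responses, gold):
    """Accuracy of the majority answer per item.

    With three workers and two alternatives a strict majority always exists.
    """
    correct = 0
    for votes, answer in zip(responses, gold):
        majority = Counter(votes).most_common(1)[0][0]
        correct += majority == answer
    return correct / len(gold)


def fleiss_kappa(responses, categories=(1, 2)):
    """Fleiss' kappa for items rated by an equal number of raters.

    Undefined (division by zero) if all ratings fall into one category.
    """
    n_items = len(responses)
    n_raters = len(responses[0])
    category_totals = Counter()
    agreement_sum = 0.0
    for votes in responses:
        counts = Counter(votes)
        category_totals.update(counts)
        # Proportion of agreeing rater pairs for this item.
        agreeing_pairs = sum(c * (c - 1) for c in counts.values())
        agreement_sum += agreeing_pairs / (n_raters * (n_raters - 1))
    observed = agreement_sum / n_items
    total_ratings = n_items * n_raters
    expected = sum((category_totals[c] / total_ratings) ** 2 for c in categories)
    return (observed - expected) / (1 - expected)


# Invented responses: four items, three workers each, answers are 1 or 2.
toy_responses = [(1, 1, 1), (2, 2, 2), (1, 1, 2), (2, 2, 1)]
toy_gold = [1, 2, 1, 1]
toy_accuracy = majority_vote_accuracy(toy_responses, toy_gold)
toy_kappa = fleiss_kappa(toy_responses)
```

For the toy responses, three of the four majority answers match the gold label, and the observed pairwise agreement of 2/3 against a chance agreement of 1/2 gives a kappa of 1/3.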

BERT and RoBERTa on COPA
In this section we analyze the performance of two recent pretrained language models on COPA: BERT and RoBERTa, an optimized variant of BERT that achieves better performance on the SuperGLUE benchmark (Wang et al., 2019), which includes COPA.

[...] (Noreen, 1989), with * indicating a significant difference between performance on Easy and Hard (p < 5%). Methods are pointwise mutual information (PMI), word frequency provided by the wordfreq package (Speer et al., 2018), pretrained language model (LM), and next-sentence prediction (NSP).
We convert COPA instances as follows to make them compatible with the input format required by BERT/RoBERTa. For a COPA instance $\langle p, a_1, a_2, q \rangle$, where $p$ is a premise, $a_i$ is the $i$-th alternative, and $q$ is a question type (either effect or cause), we construct BERT's input depending on the question type. We assume that the first sentence and the second sentence in the next sentence prediction task describe a cause and an effect, respectively. Specifically, for the $i$-th alternative, we define the following input function:

$$\mathrm{input}(p, a_i) = \begin{cases} \text{``[CLS] } p \text{ [SEP] } a_i \text{ [SEP]''} & \text{if } q \text{ is effect} \\ \text{``[CLS] } a_i \text{ [SEP] } p \text{ [SEP]''} & \text{if } q \text{ is cause} \end{cases}$$

Part of BERT's training objective includes next sentence prediction. Given a pair of sentences, BERT predicts whether one sentence can be plausibly followed by the other. For this, BERT's input format contains two [SEP] tokens to mark the two sentences and the [CLS] token, which is used as the input representation for next sentence prediction. This part of BERT's architecture makes it a natural fit for COPA.
One of the key differences between BERT and RoBERTa is that the next sentence prediction objective is not part of RoBERTa's training objective. Instead, RoBERTa is trained with masked language modeling only, with its input consisting of multiple concatenated sentences. To match this training setting, we encode two sentences in a single segment as follows:

$$\mathrm{input}(p, a_i) = \begin{cases} \text{``<s> } p \; a_i \text{ </s>''} & \text{if } q \text{ is effect} \\ \text{``<s> } a_i \; p \text{ </s>''} & \text{if } q \text{ is cause} \end{cases}$$

After encoding the premise-alternative pair with BERT or RoBERTa, we take the first hidden representation $z_i^0$, i.e. the one corresponding to [CLS] or <s>, in the final model layer and pass it through a linear layer for binary classification:

$$y_i = w^\top z_i^0 + b \tag{1}$$

where the parameters $w \in \mathbb{R}^h$ and $b \in \mathbb{R}$ are learned on the COPA training set. Finally, we choose the alternative with the higher score, i.e., $a_{\hat{i}}$ with $\hat{i} = \arg\max_{i \in \{1,2\}} y_i$. For training, we minimize the cross entropy loss with the logits $[y_1; y_2]$ and fine-tune BERT and RoBERTa's parameters. In our experiments, we use pretrained BERT-large (uncased) with 24 layers, 16 self-attention heads (340M parameters) and pretrained RoBERTa-large with 24 layers, 16 self-attention heads (355M parameters).
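The input construction and scoring described above can be illustrated with a small, dependency-free sketch. The function names, the toy vectors, and the plain-string rendering of special tokens are assumptions for illustration; in practice a tokenizer would insert the special tokens, and the representation would come from the encoder rather than from hand-written numbers.

```python
import math


def build_input(premise, alternative, question, model="roberta"):
    """Order the two sentences so that the cause precedes the effect."""
    if question == "effect":
        first, second = premise, alternative
    else:  # question == "cause": the alternative is the candidate cause
        first, second = alternative, premise
    if model == "bert":
        return f"[CLS] {first} [SEP] {second} [SEP]"
    return f"<s> {first} {second} </s>"


def linear_score(z, w, b):
    """Linear layer on top of the first hidden representation."""
    return sum(w_d * z_d for w_d, z_d in zip(w, z)) + b


def choose_alternative(scores):
    """Index of the alternative with the higher score."""
    return max(range(len(scores)), key=lambda i: scores[i])


def cross_entropy(scores, gold_index):
    """Cross entropy over the two logits, via a stable log-sum-exp."""
    largest = max(scores)
    log_norm = largest + math.log(sum(math.exp(s - largest) for s in scores))
    return log_norm - scores[gold_index]


bert_text = build_input(
    "The man broke his toe.",
    "He dropped a hammer on his foot.",
    "cause",
    model="bert",
)
# Toy representation and parameters (made-up numbers, h = 2).
toy_score = linear_score([1.0, 2.0], w=[0.5, -1.0], b=0.25)
```

For a cause question, the alternative is placed first and the premise second, mirroring the assumption that the first sentence describes the cause and the second the effect.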

Training Details
For training, we consider two configurations: (i) using the original COPA training set (§4.3), and (ii) using B-COPA (§4.4). We randomly split the training data into training data and validation data at a ratio of 9:1. For B-COPA, we make sure that an original instance and its mirrored counterpart always belong to the same split, so that the model is trained without superficial cues. For testing, we use all 500 instances from the original COPA test set.
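One way to realize such a split is to shuffle pair identifiers rather than individual instances. The sketch below assumes that each instance carries a hypothetical `pair_id` shared with its mirrored counterpart; the function name, the seed, and the indexing scheme are illustrative choices, not the original implementation.

```python
import random


def paired_split(pair_ids, validation_ratio=0.1, seed=0):
    """Split instance indices 9:1 while keeping mirrored pairs together.

    `pair_ids[i]` is the identifier shared by instance i and its mirror.
    """
    unique_ids = sorted(set(pair_ids))
    rng = random.Random(seed)
    rng.shuffle(unique_ids)
    n_validation = max(1, round(len(unique_ids) * validation_ratio))
    validation_ids = set(unique_ids[:n_validation])
    train = [i for i, pid in enumerate(pair_ids) if pid not in validation_ids]
    validation = [i for i, pid in enumerate(pair_ids) if pid in validation_ids]
    return train, validation


# Hypothetical layout: instances 2k and 2k+1 form an original/mirrored pair.
toy_pair_ids = [i // 2 for i in range(1000)]
train_idx, validation_idx = paired_split(toy_pair_ids)
```

Splitting at the level of pairs means that no alternative seen during training reappears in the validation data with the opposite label.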

Evaluation on Easy and Hard subsets
To investigate the behaviour of BERT and RoBERTa trained on the original COPA, which contains superficial cues, we split the test set into an Easy subset and a Hard subset. The Easy subset consists of instances that are correctly solved by the premise-oblivious model described in §2.
To account for variation between the three runs with different random seeds, we deem an instance correctly classified only if the premise-oblivious model's prediction is correct for all three runs. This results in an Easy subset with 190 instances and a Hard subset comprising the remaining 310 instances. Such an easy/hard split follows similar splits in NLI datasets (Gururangan et al., 2018). We then compare BERT and RoBERTa with previous models on the Easy and Hard subsets.⁷ As Table 4 shows, previous models perform similarly on both subsets, with the exception of Sasaki et al. (2017).⁸ Overall, both BERT (76.5%) and RoBERTa (87.7%) considerably outperform the best previous model (71.4%). However, BERT's improvements over previous work can be almost entirely attributed to high accuracy on the Easy subset: on this subset, finetuned BERT-large improves by 8.6 percentage points over the model by Sasaki et al. (2017) (83.9% vs. 75.3%), but on the Hard subset, the improvement is only 2.9 percentage points (71.9% vs. 69.0%). This indicates that BERT relies on superficial cues. The difference between accuracy on Easy and Hard is less pronounced for RoBERTa, but still suggests some reliance on superficial cues. We speculate that superficial cues in the COPA training set prevented BERT and RoBERTa from focusing on task-related non-superficial cues such as causally related event pairs.

⁷ For previous models, we use the prediction keys available on http://people.ict.usc.edu/~gordon/copa.html
⁸ We conjecture that word frequency is another superficial cue exploited by models. To verify this we train a classifier based on word frequencies only (Speer et al., 2018) and find that this classifier is able to identify the correct alternative better than random chance, but this result is not significant (p = 9.8%).
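The Easy/Hard partition described above can be sketched in a few lines. The function names and the toy predictions are assumptions for illustration; the only rule taken from the text is that an instance counts as Easy when the premise-oblivious model is correct in all runs.

```python
def easy_hard_split(gold, runs):
    """Partition test indices by the premise-oblivious model's predictions.

    `runs` holds one prediction list per random seed. An instance is Easy
    only if every run predicts its gold label; otherwise it is Hard.
    """
    easy, hard = [], []
    for idx, answer in enumerate(gold):
        if all(run[idx] == answer for run in runs):
            easy.append(idx)
        else:
            hard.append(idx)
    return easy, hard


def accuracy(predictions, gold, indices):
    """Accuracy of a model restricted to the given subset of instances."""
    return sum(predictions[i] == gold[i] for i in indices) / len(indices)


# Made-up labels and predictions for six test instances and three seeds.
toy_gold = [0, 1, 1, 0, 1, 0]
toy_runs = [
    [0, 1, 0, 0, 1, 1],
    [0, 1, 1, 0, 0, 1],
    [0, 1, 1, 1, 1, 1],
]
easy_idx, hard_idx = easy_hard_split(toy_gold, toy_runs)

toy_model_predictions = [0, 1, 1, 0, 0, 1]
easy_accuracy = accuracy(toy_model_predictions, toy_gold, easy_idx)
hard_accuracy = accuracy(toy_model_predictions, toy_gold, hard_idx)
```

A large gap between the two subset accuracies, as in this toy example, is the signal interpreted above as reliance on superficial cues.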

Evaluation on Balanced COPA (B-COPA)
How will BERT and RoBERTa behave when there are no superficial cues in the training set? To answer this question, we now train BERT and RoBERTa on B-COPA and evaluate on the Easy and Hard subsets. The results are shown in Table 5. The smaller performance gap between Easy and Hard subsets indicates that training on B-COPA encourages BERT and RoBERTa to rely less on superficial cues. Moreover, training on B-COPA improves performance on the Hard subset, both when training with all 1000 instances in B-COPA, and when matching the training size of the original COPA (500 instances, B-COPA 50%). Note that training on B-COPA 50% exposes the model to lexically less diverse training instances than the original COPA due to the high overlap between mirrored alternatives (see §3).
These results show that once superficial cues are removed, the models are able to learn the task to a high degree. This contrasts with the results of Niven and Kao (2019), who found that BERT's performance on the Argument Reasoning Comprehension Task (Habernal et al., 2018) does not exceed random chance level after superficial cues are made uninformative. A likely explanation for this contrast is the difference in inherent task difficulty. Argument reasoning comprehension is a high-level natural language understanding task requiring world knowledge and complex reasoning skills, while COPA can be largely solved with associative reasoning, as the performance of the PMI-based baselines shows (Table 4). A second possible explanation is BERT's insensitivity to negations (Ettinger, 2019). Since Niven and Kao (2019) made superficial cues uninformative by adding negated instances to the dataset, BERT's insensitivity to negations makes distinguishing between instances and negated instances difficult (see §3).

Analysis of sentence pair embeddings
The findings presented in the previous sections, namely BERT's and RoBERTa's good performance on COPA in spite of the rather small amount of training data, lead us to the hypothesis that pretraining enables these models to create an embedding space in which embeddings of plausible sentence pairs are distinguishable from embeddings of less plausible pairs.
To investigate how well the respective embedding spaces of BERT and RoBERTa separate plausible and less-plausible pairs, we train BERT-large and RoBERTa-large without fine-tuning. Specifically, we freeze model weights and train a classifier by parameterizing w and b in Equation 1 as a soft-margin Support Vector Machine (SVM, Cortes and Vapnik, 1995).⁹ We also report results for a simple model that only uses BERT's pretrained next sentence predictor (BERT-base-NSP, BERT-large-NSP), i.e., we choose the alternative with the higher next sentence prediction score. The results are shown in Table 6. The relatively high accuracies of BERT-large, RoBERTa-large and BERT-*-NSP show that these pretrained models are already well-equipped to perform this task "out-of-the-box".

Analysis of sensitivity to cues
To analyze the sensitivity of BERT and RoBERTa to superficial cues and to content words, we employ a gradient-based approach, following Brunner et al. (2019). Specifically, we define the sensitivity $s_{i,t}$ of the classification score in the $i$-th COPA test instance to input token $t$ as follows: [...] where $T_i$ is the sequence of all input tokens in the $i$-th COPA test instance, $y$ is the score function defined by Equation (1), and $x_t \in \mathbb{R}^{1024}$ is the position-augmented token embedding of $t$. We then define the sensitivity $S(k)$ to cue $k$ as the average over all $m$ COPA test instances:

$$S(k) = \frac{1}{m} \sum_{i=1}^{m} s_{i,k}$$

We are interested in the change of sensitivity towards cue $k$ of a model trained on original COPA compared to a model trained on Balanced COPA. We plot this difference as a function of the cue's productivity (Figure 2). We observe that BERT trained on Balanced COPA is less sensitive to a few highly productive superficial cues than BERT trained on original COPA. Note the decrease in the sensitivity for cues with productivity between 0.7 and 0.9. These cues are shown in Table 7. However, for cues with lower productivity, the picture is less clear. In the case of RoBERTa, there are no noticeable trends in the change of sensitivity.

⁹ We tune the SVM hyperparameter C ∈ {0.0001, 0.001, 0.01, 0.1, 1} on the validation set.
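The aggregation step of this analysis can be sketched as follows, taking the per-instance sensitivities as given. The numbers are invented, the function names are hypothetical, and two choices are assumptions of this sketch rather than statements about the original analysis: a cue that does not occur in an instance contributes zero, and the change is computed as Balanced COPA minus original COPA.

```python
def cue_sensitivity(per_instance, cue):
    """Average sensitivity to `cue` over all m test instances.

    `per_instance` is a list with one dict per test instance, mapping
    tokens to their sensitivity. Absent cues count as 0.0 (an assumption).
    """
    m = len(per_instance)
    return sum(instance.get(cue, 0.0) for instance in per_instance) / m


def sensitivity_change(original, balanced, cues):
    """Change in average sensitivity: balanced-trained minus original-trained."""
    return {
        cue: cue_sensitivity(balanced, cue) - cue_sensitivity(original, cue)
        for cue in cues
    }


# Invented per-instance sensitivities for two models on three test instances.
toy_original = [{"a": 0.4, "woman": 0.2}, {"a": 0.6}, {"went": 0.1}]
toy_balanced = [{"a": 0.1, "woman": 0.3}, {"a": 0.3}, {"went": 0.1}]
toy_change = sensitivity_change(toy_original, toy_balanced, ["a", "went"])
```

A negative value for a cue, as for "a" in this toy data, would correspond to a model that has become less sensitive to that cue after training on the balanced data.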

Conclusions
We established that COPA, an important benchmark of commonsense reasoning, contains superficial cues, specifically single tokens predictive of the correct answer, that allow models to solve the task without actually understanding it. Our experiments suggest that BERT's good performance on COPA can be explained by its ability to exploit these superficial cues. BERT performs well on Easy instances with such superficial cues, and comparably to previous methods on Hard instances without such cues. RoBERTa, in contrast, represents a real improvement: it considerably outperforms both BERT and previous methods on Hard instances as well.
To allow evaluating models on a benchmark without predictive single tokens, we created the Balanced COPA dataset. Balanced COPA neutralizes this kind of superficial cue by mirroring instances from the original COPA dataset, thereby removing any differences in token distributions between correct and wrong alternatives. Surprisingly, we found that both BERT and RoBERTa finetuned on Balanced COPA perform comparably overall to the models finetuned on the original COPA. However, a more detailed analysis revealed quite different behaviour. Whereas BERT finetuned on original COPA heavily exploited superficial cues, we now find evidence that BERT finetuned on Balanced COPA learns some aspects of the task, achieving similar accuracies on both Easy and Hard instances. Even more surprisingly, RoBERTa benefits from training on Balanced COPA instances and achieves higher accuracy than on the original COPA with superficial cues.
Two important questions remain unanswered at present, which we plan to explore in future work. First, why does RoBERTa not appear to rely on superficial cues, even when they are available? And second, are the results of our experiments on Balanced COPA specific to BERT and RoBERTa, or are all pretrained language models able to exploit superficial cues in COPA and able to solve the task by other means if no such cues are present?