Probing Neural Network Comprehension of Natural Language Arguments

We are surprised to find that BERT’s peak performance of 77% on the Argument Reasoning Comprehension Task reaches just three points below the average untrained human baseline. However, we show that this result is entirely accounted for by exploitation of spurious statistical cues in the dataset. We analyze the nature of these cues and demonstrate that a range of models all exploit them. This analysis informs the construction of an adversarial dataset on which all models achieve random accuracy. Our adversarial dataset provides a more robust assessment of argument comprehension and should be adopted as the standard in future work.


Introduction
Argumentation mining is the task of determining argumentative structure in natural language text, e.g., which text segments represent claims, and which comprise reasons that support or attack those claims (Mochales and Moens, 2011; Lippi and Torroni, 2016). This is a challenging task for machine learners, as it can be hard even for humans to determine when two text segments stand in argumentative relation, as evidenced by studies on argument annotation (Habernal et al., 2014).
One approach to this problem is to focus on warrants (Toulmin, 1958), a form of world knowledge that permits inferences. Consider a simple argument: "(1) It is raining; therefore (2) you should take an umbrella." 1 The warrant "(3) it is bad to get wet" could license this inference. Knowing (3) facilitates drawing the inferential connection between (1) and (2). However, it would be hard to find it stated anywhere, since warrants are most often left implicit (Walton, 2005). Thus, on this approach, machine learners must not only reason with warrants but also discover them.
1 This example is adapted from Black and Hunter (2012).
Claim: Google is not a harmful monopoly
Reason: People can choose not to use Google
Warrant: Other search engines don't redirect to Google
Alternative: All other search engines redirect to Google
Reason (and since) Warrant → Claim
Reason (but since) Alternative → ¬ Claim
Figure 1: An example of a data point from the ARCT test set and how it should be read. The inference from R and A to ¬C is by design.
The Argument Reasoning Comprehension Task (ARCT) (Habernal et al., 2018a) defers the problem of discovering warrants and focuses on inference. An argument is provided, comprising a claim C and reason R. The task is to pick the correct warrant W over a distractor, called the alternative warrant A. The alternative is written such that R ∧ A → ¬C. An alternative warrant for our earlier example could be "(4) it is good to get wet," in which case we have (1) ∧ (4) → "(¬2) you shouldn't take an umbrella." An example from the dataset is given in Figure 1. The ARCT SemEval shared task (Habernal et al., 2018b) verified the challenging nature of this problem. Even when warrants are supplied, learners still need to rely on further world knowledge. For example, to correctly classify the data point in Figure 1 one must at least know how consumer choice and web redirects relate to the concept of monopoly, and that Google is a search engine. Only one participating system in the shared task exceeded 60% accuracy (on binary classification).
It is therefore surprising that BERT (Devlin et al., 2018) achieves 77% test set accuracy with its best run (Table 1).
Table 1: Baselines and BERT results. Our results come from 20 different random seeds (± gives the standard deviation). The mean for BERT Large is skewed by the 5/20 random seeds for which it failed to train, a problem noted by Devlin et al. (2018). We therefore consider the median a better measure of BERT's average performance. The mean of the non-degenerate runs for BERT (Large) is 0.716 ± 0.04.
To investigate BERT's decision making we looked at data points it finds easy to classify over multiple runs. Habernal et al. (2018b) performed a similar analysis with the SemEval submissions, and consistent with their results we found that BERT exploits the presence of cue words in the warrant, especially "not." Through probing experiments designed to isolate such effects, we demonstrate in this work that BERT's surprising performance can be entirely accounted for in terms of exploiting spurious statistical cues.
However, we show that the major problem can be eliminated in ARCT. Since R ∧ A → ¬C, we can add a copy of each data point with the claim negated and the label inverted. This means that the distribution of statistical cues in the warrants will be mirrored over both labels, eliminating the signal. On this adversarial dataset all models perform randomly, with BERT achieving a maximum test set accuracy of 53%. The adversarial dataset therefore provides a more robust evaluation of argument comprehension and should be adopted as the standard in future work on this dataset.

Task Description and Baselines
Let i = 1, ..., n index each point in the dataset D, where |D| = n. The two candidate warrants in each case are randomly assigned a binary label j ∈ {0, 1}, such that each has an equal probability of being correct. The inputs are the representations for the claim $c^{(i)}$, reason $r^{(i)}$, warrant zero $w_0^{(i)}$, and warrant one $w_1^{(i)}$.
The general architecture for all models is given in Figure 2. Shared parameters θ are learned to classify each warrant independently with the argument, yielding the logits:
$$z_j^{(i)} = \theta\left[c^{(i)}; r^{(i)}; w_j^{(i)}\right]$$
These are then concatenated and passed through softmax to determine a probability distribution over the two warrants:
$$p^{(i)} = \mathrm{softmax}\left(\left[z_0^{(i)}, z_1^{(i)}\right]\right)$$
The baselines are a bag of vectors (BoV), bidirectional LSTM (Hochreiter and Schmidhuber, 1997) (BiLSTM), the SemEval winner GIST (Choi and Lee, 2018), the best model of Botschen et al. (2018), and human performance (Table 1). For all of our experiments we use grid search to select hyperparameters, dropout regularization (Srivastava et al., 2014), and Adam (Kingma and Ba, 2014) for optimization. We anneal the learning rate by 1/10 when validation accuracy drops. The final parameters come from the epoch with maximum validation accuracy. The BoV and BiLSTM inputs are 300-dimensional GloVe embeddings trained on 640B tokens (Pennington et al., 2014). Code to reproduce all experiments, and detailing all hyperparameters, is provided on GitHub. 2
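The shared-parameter scoring scheme described above can be illustrated with a minimal sketch. It is not the released code: the linear scorer `score`, the toy vectors, and the function names are assumptions made for illustration, standing in for whichever encoder (BoV, BiLSTM, or BERT) produces the logit for each warrant.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def score(theta, claim, reason, warrant):
    """Logit for one warrant: a linear score over [claim; reason; warrant].

    A linear map is an assumption here; it stands in for any encoder
    that shares the same parameters theta across both warrants.
    """
    features = claim + reason + warrant  # list concatenation
    return sum(t * x for t, x in zip(theta, features))


def predict(theta, claim, reason, warrant0, warrant1):
    """Score each warrant independently, then normalize over the pair."""
    z0 = score(theta, claim, reason, warrant0)
    z1 = score(theta, claim, reason, warrant1)
    probs = softmax([z0, z1])
    label = 0 if probs[0] >= probs[1] else 1
    return probs, label


# Toy 2-dimensional representations (hypothetical values).
theta = [0.0, 0.0, 0.0, 0.0, 1.0, 1.0]
claim, reason = [0.3, 0.1], [0.2, 0.4]
probs, label = predict(theta, claim, reason, [1.0, 0.0], [0.0, 0.5])
```

Because the same `theta` scores both candidates, anything in a warrant that raises its logit does so regardless of which label position the warrant occupies.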

BERT
Our BERT classifier is visualized in Figure 3. The claim and reason are joined to form the first text segment, which is paired with each warrant and independently processed. The final layer CLS vector is passed to a linear layer to obtain the logits $z_j^{(i)}$. The whole architecture is fine-tuned. The learning rate is 2e-5 and we allow a maximum of 20 training epochs, taking the parameters from the epoch with the best validation set accuracy. We use the Hugging Face PyTorch implementation. 3
Devlin et al. (2018) report that, on small datasets, BERT sometimes fails to train, yielding degenerate results. ARCT is very small, with 1,210 training observations. In 5/20 runs we encountered this phenomenon, seeing close to random accuracies on validation and test sets. These cases occurred where training accuracy was also not significantly above random (< 80%). Removing the degenerate runs, BERT's mean accuracy is 0.716 ± 0.04, which would beat the previous state of the art, as would the median of 71.2%, which is a better average than the overall mean since it is not skewed by the degenerate cases. However, our main finding is that these results are not meaningful and should be discarded. In the following sections we focus on BERT's peak performance of 77% to make this case.
3 https://github.com/huggingface/pytorch-pretrained-BERT
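The argument for preferring the median over the mean when some runs fail to train can be checked with a small sketch. The per-seed accuracies below are invented for illustration only; they are not the per-seed results of the experiments, and only the 15/5 split between trained and degenerate runs mirrors the text.

```python
from statistics import mean, median

# Hypothetical accuracies: 15 runs that trained and 5 degenerate runs near chance.
trained_runs = [0.66, 0.68, 0.69, 0.70, 0.70, 0.71, 0.71, 0.72,
                0.72, 0.73, 0.73, 0.74, 0.75, 0.76, 0.77]
degenerate_runs = [0.49, 0.50, 0.50, 0.51, 0.52]
all_runs = trained_runs + degenerate_runs

overall_mean = mean(all_runs)        # pulled down by the degenerate runs
overall_median = median(all_runs)    # barely affected by them
trained_mean = mean(trained_runs)    # reference: typical performance when training succeeds

print(f"mean={overall_mean:.3f} median={overall_median:.3f} trained mean={trained_mean:.3f}")
```

With a minority of degenerate runs, the median stays close to the mean of the runs that trained, while the overall mean is dragged toward chance.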

Statistical Cues
The major source of spurious statistical cues in ARCT comes from uneven distributions of linguistic artifacts over the warrants, and therefore over the labels. This section aims to demonstrate the presence and nature of these cues. We only consider unigrams and bigrams, although more sophisticated cues may be present. To this end, we aim to calculate how beneficial it is for a model to exploit a cue k, and how pervasive it is in the dataset (indicating the strength of the signal).
Formally, let $T_j^{(i)}$ be the set of tokens in the warrant for data point i with label j, and let $y^{(i)}$ denote the correct label for data point i. We define a cue's applicability $\alpha_k$ as the number of data points where it occurs with one label but not the other:
$$\alpha_k = \sum_{i=1}^{n} \mathbb{1}\left[\exists j,\ k \in T_j^{(i)} \wedge k \notin T_{\neg j}^{(i)}\right]$$
The productivity $\pi_k$ of a cue is defined as the proportion of applicable data points for which it predicts the correct answer:
$$\pi_k = \frac{\sum_{i=1}^{n} \mathbb{1}\left[\exists j,\ k \in T_j^{(i)} \wedge k \notin T_{\neg j}^{(i)} \wedge y^{(i)} = j\right]}{\alpha_k}$$
Finally, we define the coverage $\xi_k$ of a cue as the proportion of applicable cases over the total number of data points: $\xi_k = \alpha_k / n$. In these terms, the productivity of a cue measures the benefit of exploiting it, while coverage measures the strength of the signal it provides. With m labels, if $\pi_k > 1/m$ then the presence of a cue is going to be useful for the task and a machine learner would do well to make use of it.
The productivity and coverage of the strongest unigram cue we found ("not") are given in Table 2. It provides a particularly strong training signal. While it is less productive in the test set, it is just one among many such cues. We found a range of other unigrams, albeit with less overall productivity, mostly being high frequency words such as "is," "do," and "are." Bigrams that occurred with "not," such as "will not" and "cannot," were also found to be highly productive. These statistics indicate the nature of the problem. In the next section we demonstrate that our models are in fact exploiting these cues.
Table 2: Productivity and coverage of using the presence of "not" in the warrant to predict the label in ARCT. Across the whole dataset, if you pick the warrant with "not" you will be right 61% of the time, which covers 64% of all data points.
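The three cue statistics can be computed directly from their verbal definitions. The sketch below is a minimal illustration, not the released analysis code: the function name `cue_statistics`, the tuple layout of `toy_data`, and the toy warrants are all assumptions.

```python
def cue_statistics(cue, data):
    """Return (applicability, productivity, coverage) for a single cue.

    data: list of (warrant0_tokens, warrant1_tokens, label) tuples, where the
    token collections are sets and label in {0, 1} indexes the correct warrant.
    """
    applicable = 0
    correct = 0
    for tokens0, tokens1, label in data:
        in0, in1 = cue in tokens0, cue in tokens1
        if in0 != in1:  # the cue occurs with exactly one of the two labels
            applicable += 1
            cued_label = 0 if in0 else 1
            if cued_label == label:
                correct += 1
    productivity = correct / applicable if applicable else 0.0
    coverage = applicable / len(data) if data else 0.0
    return applicable, productivity, coverage


# Hypothetical toy data: four data points with tokenized warrants.
toy_data = [
    ({"it", "is", "not", "good"}, {"it", "is", "good"}, 0),  # applicable, cue is correct
    ({"they", "can"}, {"they", "can", "not"}, 1),            # applicable, cue is correct
    ({"do", "not", "go"}, {"do", "go"}, 1),                  # applicable, cue is wrong
    ({"yes"}, {"no"}, 0),                                    # cue absent: not applicable
]
alpha, pi, xi = cue_statistics("not", toy_data)
```

On this toy set the cue applies to three of four points and is right on two of them, so its productivity exceeds the 1/2 threshold for a two-label task.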

Probing Experiments
If a model is exploiting distributional cues over the labels, then if trained only on the warrants (W) it should perform relatively well. The same can be said for removing either just the claim, leaving the reason and warrant (R, W), or removing the reason (C, W). The latter setups allow the models to additionally consider cues in the reasons and claims, as well as cues holding over their combinations with the warrants. Each of these setups breaks the task since we no longer have an argument to match with a warrant.
Experimental results are given in Table 3. On warrants alone (W) BERT achieves a maximum 71% accuracy. That leaves only six percentage points to account for its peak of 77%. We find a gain of four percentage points for (R, W) over (W), and a gain of two for (C, W), accounting for the missing six points. Based on this evidence our major finding is that the entirety of BERT's performance can be accounted for in terms of exploiting spurious statistical cues.
Table 3: Results of probing experiments with BERT Large, and the BoV and BiLSTM baselines. These results indicate that BERT's peak 77% performance can be entirely accounted for by exploiting spurious cues. By just considering warrants (W) we can get to 71%. Adding cues over reasons (R, W) and claims (C, W) accounts for the remaining six points.
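The probing setups differ only in which parts of the argument the model is allowed to see. A minimal sketch of that input ablation follows; the dictionary field names, the `SETUPS` table, and the function `probe_inputs` are hypothetical, and the example text is the data point from Figure 1.

```python
# Which argument fields accompany each warrant under each setup (names are assumptions).
SETUPS = {
    "C,R,W": ("claim", "reason"),  # the full task
    "W": (),                       # warrants only
    "R,W": ("reason",),            # claim removed
    "C,W": ("claim",),             # reason removed
}


def probe_inputs(example, setup):
    """Return one list of visible text fields per candidate warrant."""
    context = [example[field] for field in SETUPS[setup]]
    warrants = (example["warrant0"], example["warrant1"])
    return [context + [warrant] for warrant in warrants]


example = {
    "claim": "Google is not a harmful monopoly",
    "reason": "People can choose not to use Google",
    "warrant0": "Other search engines don't redirect to Google",
    "warrant1": "All other search engines redirect to Google",
    "label": 0,
}

warrant_only = probe_inputs(example, "W")
reason_and_warrant = probe_inputs(example, "R,W")
```

Under "W" the model sees no argument at all, so any accuracy above chance in that setup can only come from regularities in the warrants themselves.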

Adversarial Test Set
The major problem of statistical cues over labels in ARCT can be eliminated due to the original design of the dataset. Given that R ∧ A → ¬C, we can produce adversarial examples by negating the claim and inverting the label for each data point (Figure 4). The adversarial examples are then combined with the original data. This eliminates the problem by mirroring the distributions of cues around both labels. The ARCT authors provide a training set augmented in this way. The negations of most claims in the validation and test sets already exist elsewhere in the dataset. The remaining claims were manually negated by a native English speaker.
We tried two experimental setups. In the first, models trained and validated on the original data were evaluated on the adversarial set. All results were worse than random due to overfitting the cues in the original training set. In the second, models were trained from scratch on the adversarial training and validation sets, then evaluated on the adversarial test set. Results are given in Table 4. BERT's peak performance is reduced to 53%, with mean and median at 50%. We conclude from these results that the adversarial dataset has successfully eliminated the cues as expected, providing a more robust evaluation of machine argument comprehension. This result accords better with our intuitions about this task: with little to no understanding of the reality underlying these arguments, good performance shouldn't be feasible.
Figure 4: Original and adversarial data points. The claim is negated and the warrants are swapped. The assignment of labels to W and A is kept the same. By including both, the distribution of linguistic artifacts in the warrants is thereby mirrored around the labels, eliminating the major source of spurious statistical cues in ARCT.
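The mirroring effect of the augmentation can be seen on a toy example. This is a sketch under stated assumptions: the field names, helper functions, and single data point (built from the umbrella example in the introduction) are hypothetical, and the negated claim is supplied by hand, as it was for the claims that had to be manually negated.

```python
def make_adversarial(example, negated_claim):
    """Copy a data point with the claim negated and the label inverted.

    The warrants keep their positions, so inverting the label is equivalent
    to swapping which warrant counts as correct.
    """
    return {
        "claim": negated_claim,
        "reason": example["reason"],
        "warrant0": example["warrant0"],
        "warrant1": example["warrant1"],
        "label": 1 - example["label"],
    }


def augment(data, negated_claims):
    """Combine the original data with its adversarial copies."""
    return data + [make_adversarial(ex, neg) for ex, neg in zip(data, negated_claims)]


def productivity(data, cue):
    """Share of applicable data points where picking the cued warrant is correct."""
    applicable = correct = 0
    for ex in data:
        in0 = cue in ex["warrant0"].lower().split()
        in1 = cue in ex["warrant1"].lower().split()
        if in0 != in1:
            applicable += 1
            correct += int((0 if in0 else 1) == ex["label"])
    return correct / applicable if applicable else 0.0


original = [{
    "claim": "You should take an umbrella",
    "reason": "It is raining",
    "warrant0": "It is bad to get wet",
    "warrant1": "It is good to get wet",
    "label": 0,
}]
augmented = augment(original, ["You should not take an umbrella"])

before = productivity(original, "bad")
after = productivity(augmented, "bad")
```

Every warrant-level cue that pointed to the correct label in an original data point points to the wrong label in its copy, so its productivity over the combined data is pinned at one half.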

Related Work
The most successful previous work on ARCT (Choi and Lee, 2018; Zhao et al., 2018; Niven and Kao, 2018) involved transfer learning from Natural Language Inference (NLI) datasets (Bowman et al., 2015; Williams et al., 2017), and utilized effective NLI models such as ESIM (Chen et al., 2016) and InferSent (Conneau et al., 2017). More recently, Botschen et al. (2018) added FrameNet knowledge with modest performance gains. These models should be evaluated on our adversarial dataset. In particular, it will be interesting to see whether Botschen et al.'s model stands out due to the inclusion of some of the required world knowledge. There is much recent work focusing on statistical cues in datasets in vision (Jo and Bengio, 2017) and NLP (Sanchez et al., 2018; McCoy et al., 2019; Gururangan et al., 2018; Glockner et al., 2018; Poliak et al., 2018; Rajpurkar et al., 2018; Jia and Liang, 2017). Similar to our experiment with warrants, Poliak et al. (2018) classified NLI data based on the hypothesis only. A similar experiment to our probing task was performed by Niven and Kao (2018), but only with reasons and warrants. They found that independent warrant classification with shared parameters provides some regularization against warrant-label cues (Niven and Kao, 2018). However, this does not solve the problem since the presence of a cue is enough to increase the logits for either warrant.
The original ARCT data comes with a training set created in the same way as our adversarial dataset. Habernal et al. (2018a) reported experiments using this training data that led to random accuracy. They suggested it could be that high similarity between the data points made the problem too difficult for the simple models they implemented. Our work indicates the necessity of applying this transformation to the entire dataset in order to obtain a more robust evaluation by eliminating spurious statistical cues over the labels.

Conclusion
ARCT provides a fortuitous opportunity to see how stark the problem of exploiting spurious statistics can be. Due to our ability to eliminate the major source of these cues, we were able to show that BERT's maximum performance fell from just three points below the average untrained human baseline to essentially random. To answer our question in the introduction: BERT has learned nothing about argument comprehension. However, our investigations confirmed that BERT is indeed a very strong learner. Analysis of easy-to-classify data points showed reliance on a lower proportion of the strongest cue word than the BoV and BiLSTM; i.e., BERT has learned when to ignore the presence of "not" and focus on different cues. This indicates an ability to exploit much more subtle joint distributional information. As our learners get stronger, controlling for spurious statistics becomes more important in order to have confidence in their apparent performance. Taken with a growing body of previous work, our results indicate the need for further research into the extent of this problem in NLP more generally.
The adversarial dataset should be adopted as the standard in future work on ARCT. We hope that providing a more robust evaluation will help to spur more productive research on this problem.