The Curse of Performance Instability in Analysis Datasets: Consequences, Source, and Suggestions

We find that the performance of state-of-the-art models on Natural Language Inference (NLI) and Reading Comprehension (RC) analysis/stress sets can be highly unstable. This raises three questions: (1) How will the instability affect the reliability of the conclusions drawn based on these analysis sets? (2) Where does this instability come from? (3) How should we handle this instability and what are some potential solutions? For the first question, we conduct a thorough empirical study over analysis sets and find that in addition to the unstable final performance, the instability exists all along the training curve. We also observe lower-than-expected correlations between the analysis validation set and standard validation set, questioning the effectiveness of the current model-selection routine. Next, to answer the second question, we give both theoretical explanations and empirical evidence regarding the source of the instability, demonstrating that the instability mainly comes from high inter-example correlations within analysis sets. Finally, for the third question, we discuss an initial attempt to mitigate the instability and suggest guidelines for future work such as reporting the decomposed variance for more interpretable results and fair comparison across models. Our code is publicly available at: https://github.com/owenzx/InstabilityAnalysis


1 Introduction
Neural network models have significantly pushed forward performance on natural language processing benchmarks with the development of large-scale language model pre-training (Radford et al., 2018; Devlin et al., 2019; Radford et al., 2019; Liu et al., 2019b). For example, on two semantically challenging tasks, Natural Language Inference (NLI) and Reading Comprehension (RC), state-of-the-art results have reached or even surpassed the estimated human performance on certain benchmark datasets (Rajpurkar et al., 2016a, 2018).

[Figure 1: Training trajectories of model accuracy on MNLI-m, SNLI, HANS, and the Numerical Stress Test (Naik et al., 2018a) (from the topmost line to the bottom, respectively). The solid lines represent the means of ten runs and the shaded area indicates one standard deviation from the mean. The two dashed lines show the trajectories of a single run for MNLI-m and the Numerical Stress Test using the same model.]

These astounding improvements, in turn, motivate a new trend of research to analyze what language understanding and reasoning skills are actually achieved, versus what is still missing in current models. Following this trend, numerous analysis approaches have been proposed to examine models' ability to capture different linguistic phenomena (e.g., named entities, syntax, lexical inference, etc.).
Such studies are often conducted in three steps: (1) proposing an assumption about a certain ability of models; (2) building analysis datasets by automatic generation or crowd-sourcing; (3) drawing conclusions about that ability based on results on these analysis datasets.
Past analysis studies have led to many key discoveries about NLP models, such as over-stability (Jia and Liang, 2017) and surface pattern overfitting (Gururangan et al., 2018). Recently, however, McCoy et al. (2019a) found that the results of different runs of BERT NLI models have large, non-negligible variance on the HANS (McCoy et al., 2019b) analysis dataset, contrasting sharply with their stable results on the standard validation set across multiple seeds. This finding raises concerns regarding the reliability of individual results reported on such datasets, the conclusions drawn from those results, and the lack of reproducibility (Makel et al., 2012). Thus, to help consolidate further developments, we conduct a deep investigation of model instability, showing how unstable the results are and how such instability compromises the feedback loop between model analysis and model development.
We start our investigation with a thorough empirical study of several representative models on both NLI and RC. Overall, we make four worrisome observations in our experiments: (1) the final results of the same model with different random seeds on several analysis sets have significantly high variance; the largest variance is more than 27 times that on the standard development set; (2) these large instabilities on certain datasets are model-agnostic, i.e., certain datasets yield unstable results across different models; (3) the instability not only occurs in the final performance but exists all along the training trajectory, as shown in Fig. 1; (4) the results of the same model on the analysis sets and on the standard development set have low correlation, making it hard to draw any constructive conclusion and questioning the effectiveness of the standard model-selection routine.
Next, in order to gain a better understanding of this instability issue, we explore theoretical explanations behind it. Through theoretical analysis and empirical demonstration, we show that inter-example correlation within the dataset is the dominant factor causing this performance instability. Specifically, the variance of model accuracy on the entire analysis set can be decomposed into two terms: (1) the sum of single-data variances (the variance caused by individual prediction randomness on each example), and (2) the sum of inter-data covariances (caused by the correlation between different predictions). To understand the latter term better, consider the following case: if many examples in the evaluation set are correlated with each other, then a change in the model's prediction on one example will be mirrored on all the correlated examples, causing high variance in the final accuracy. We estimate these two terms with multiple runs of experiments and show that the inter-data covariance contributes significantly more than the single-data variance to the final accuracy variance, indicating its major role in the cause of instability.
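To make this intuition concrete, the following minimal simulation (our illustration with made-up flip rates, not an experiment from this paper) contrasts a dataset whose examples err independently with one whose examples err in correlated groups:

```python
import numpy as np

rng = np.random.default_rng(0)
N, RUNS, FLIP = 1000, 2000, 0.1  # examples, simulated runs, per-example error rate

# Independent case: each prediction errs on its own.
errs = rng.random((RUNS, N)) < FLIP
acc_indep = 1.0 - errs.mean(axis=1)

# Correlated case: examples form 10 groups whose predictions err together.
group_errs = rng.random((RUNS, 10)) < FLIP
acc_corr = 1.0 - np.repeat(group_errs, N // 10, axis=1).mean(axis=1)

print(f"std (independent): {acc_indep.std():.4f}")  # ~ sqrt(p(1-p)/N)  = 0.009
print(f"std (correlated):  {acc_corr.std():.4f}")   # ~ sqrt(p(1-p)/10) = 0.095
```

Both settings have the same per-example error rate, but grouping concentrates the randomness into a few joint events, inflating the standard deviation of the aggregate accuracy by roughly a factor of ten.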
Finally, in order for the continuous progress of the community to be built upon trustworthy and interpretable results, we provide initial suggestions on how to perceive the implications of this instability issue and how to potentially handle it. To this end, we encourage future research to: (1) report, alongside the mean and variance over multiple runs, the two decomposed variance terms (i.e., the sum of single-data variances and the sum of inter-data covariances) for more interpretable results and fair comparison across models; (2) focus on designing models with better inductive and structural biases, and datasets with higher linguistic diversity.
Overall, our contribution is three-fold. First, we provide a thorough empirical study of the instability issue in models' performance on analysis datasets. Second, we demonstrate theoretically and empirically that the performance variance is attributed mostly to inter-example correlation. Finally, we provide suggestions on how to deal with this instability, including reporting the decomposed variance for more interpretable evaluation and better comparison.
2 Related Work

NLI and RC Analysis. Many analysis works have studied what the models actually capture alongside recent improvements on NLI and RC benchmark scores. In NLI, some analyses target word/phrase-level lexical/semantic inference (Glockner et al., 2018; Shwartz and Dagan, 2018; Carmona et al., 2018), some are more syntax-related (McCoy et al., 2019b; Nie et al., 2019; Geiger et al., 2019), and some involve logic-related study (Minervini and Riedel, 2018). Naik et al. (2018a) proposed a suite of analysis sets covering different linguistic phenomena. In RC, adversarial-style analysis is used to test the robustness of models (Jia and Liang, 2017). Most of this work follows the style of Carmona et al. (2018) in diagnosing/analyzing model behavior on pre-designed analysis sets. In this paper, we analyze NLI and RC models from a broader perspective by inspecting models' performance across different analysis sets, and their inter-dataset and intra-dataset relationships.
Dataset-Related Analysis. As deep learning models rely heavily on high-quality training sets, another line of work has aimed at studying meta-issues of the data and the dataset-creation process itself. The best-known issue of this kind is undesirable bias. In VQA datasets, unimodal biases were found, compromising their authority on multi-modality evaluation (Jabri et al., 2016; Goyal et al., 2017). In machine comprehension, Kaushik and Lipton (2018) found that passage-only models can achieve decent accuracy. In NLI, hypothesis bias was found in SNLI and MultiNLI (Tsuchiya, 2018; Gururangan et al., 2018). All these findings raise concerns regarding spurious shortcuts that emerge during dataset collection and their unintended, harmful effects on trained models.
To mitigate these problems, several recent works have proposed new guidelines for better collection and use of datasets. Specifically, Liu et al. (2019a) introduced a systematic, task-agnostic method for analyzing datasets. Rozen et al. (2019) further explain how to improve challenging datasets and why diversity matters. Geva et al. (2019) suggest that annotator bias should be monitored throughout the collection process and that part of the test data be created by exclusive annotators. Our work is complementary to those analyses.
Robustifying NLI and RC Models. Recently, a number of works have sought to directly improve performance on the analysis datasets: for NLI, through model ensembling (Clark et al., 2019; He et al., 2019), novel training mechanisms (Pang et al., 2019; Yaghoobzadeh et al., 2019), and enhanced word representations (Moosavi et al., 2019); for RC, through different training objectives (Yeh and Chen, 2019; Lewis and Fan, 2019). While improvements have been made on certain analysis datasets, the stability of the results has not been examined. For the reasons explained in this paper, we highly recommend that such result variances be scrutinized in future work.
Instability in Performance. Performance instability has already been recognized as an important issue in deep reinforcement learning (Irpan, 2018) and active learning (Bloodgood and Grothendieck, 2013). Supervised learning, by contrast, is presumed to be stable, especially with fixed datasets and labels. This assumption has recently been challenged by several analyses. McCoy et al. (2019a) show high variance in NLI models' performance on analysis datasets. Phang et al. (2018) found high variance when fine-tuning pre-trained models on several NLP tasks from the GLUE benchmark. Reimers and Gurevych (2017, 2018) state that conclusions based on single-run performance may not be reliable for machine learning approaches. Weber et al. (2018) found that a model's ability to generalize beyond the training distribution depends greatly on the chosen random seed. Dodge et al. (2020) showed that weight initialization and training-data order both contribute to the randomness in BERT performance. In our work, we present a comprehensive explanation and analysis of the instability of neural models on analysis datasets and give general guidance for future work.
3 The Curse of Instability

3.1 Tasks and Datasets
In this work, we focus our experiments on NLI and RC for two reasons: (1) their straightforwardness for both automatic evaluation and human understanding, and (2) their wide acceptance as benchmark tasks for evaluating natural language understanding.
For NLI, we use SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2018) as the main standard datasets, and HANS (McCoy et al., 2019b), SNLI-hard (Gururangan et al., 2018), BREAK-NLI (Glockner et al., 2018), and the Stress Test (Naik et al., 2018a) as our auxiliary analysis sets. Note that the Stress Test contains six subsets (denoted 'STR-X') targeting different linguistic categories. For RC, we use SQuAD 1.1 (Rajpurkar et al., 2016b) as the main standard dataset and AdvSQuAD (Jia and Liang, 2017) as the analysis set. Detailed descriptions of the models and datasets are in the Appendix.

3.2 Models and Training
Since BERT (Devlin et al., 2019) achieves state-of-the-art results on several NLP tasks, the pretraining-then-finetuning framework has been widely used. To keep our analysis aligned with recent progress, we focus our experiments on this framework. Specifically, we use the two most popular pre-trained transformer models, BERT and XLNet (Yang et al., 2019). Moreover, for NLI, we additionally use RoBERTa (Liu et al., 2019b) and ESIM (Chen et al., 2017) in our experiments. RoBERTa is almost the same as BERT except that it has been trained on 10 times more data during the pre-training phase to be more robust. ESIM is the most representative pre-BERT model for sequence-matching problems, and we use an ELMo-enhanced version (Peters et al., 2018).

Training Details. For all pre-trained transformer models, namely BERT, RoBERTa, and XLNet, we use the same set of hyper-parameters for the sake of comparable analysis. For NLI, we use the hyper-parameters suggested in Devlin et al. (2019): the batch size is set to 32 and the peak learning rate to 2e-5. We save checkpoints every 500 iterations, resulting in 117 intermediate checkpoints. In our preliminary experiments, we found that tuning these hyper-parameters does not significantly influence the results. The training set for NLI is the union of the SNLI and MNLI training sets.

Fluctuation in Training Trajectory. Intuitively, the inconsistency and instability in the final performance of different runs can be caused by randomness in initialization and stochasticity in the training dynamics. To see how much these factors contribute to the inconsistency in the final performance, we keep track of the results on the different evaluation sets along the training process and compare their training trajectories. We choose HANS and STR-NU as example unstable analysis datasets because their variances in final performance are the largest, and we choose SNLI and MNLI-m for standard validation set comparison. As shown in Fig. 1, the training curves on MNLI and SNLI (the top two lines) are highly stable, while there are significant fluctuations in the HANS and STR-NU trajectories (the bottom two lines). Besides the mean and standard deviation over multiple runs, we also show the accuracy of a single run as the bottom dashed line in Fig. 1. We find that two adjacent checkpoints can have a dramatically large performance gap on STR-NU. Such fluctuation during training is very likely one of the causes of the instability in the final performance and may give rise to untrustworthy conclusions drawn from final results.

Low Correlation between Datasets. The typical routine for neural network model selection requires practitioners to choose a model or checkpoint based on its performance on the validation set. This routine was followed in all previous NLI analysis studies, where models were chosen by performance on the standard validation set and then tested on analysis sets. An important assumption behind this routine is that performance on the validation set correlates with the models' general ability. However, as shown in Fig. 1, the striking difference between the wildly fluctuating training curves for the analysis sets and the smooth curves for the standard validation set questions the validity of this assumption. Therefore, to check the effectiveness of model selection under these instabilities, we measure the correlation between performance on different datasets during training. For dataset D_i, we use a^i_{t,s} to denote the accuracy of the checkpoint at the t-th time step trained with seed s ∈ S, where S is the set of all seeds.
We calculate the correlation Corr_{i,j} between datasets D_i and D_j as the Spearman correlation of the two accuracy trajectories, averaged over all seeds:

$$\mathrm{Corr}_{i,j} = \frac{1}{|S|} \sum_{s \in S} \rho\big(\{a^i_{t,s}\}_{t=1}^{T},\, \{a^j_{t,s}\}_{t=1}^{T}\big) \quad (1)$$

where ρ is the Spearman rank-correlation coefficient and T is the number of checkpoints. The correlations between different NLI datasets are shown in Fig. 3. We observe high correlations (> 0.95) among the standard validation datasets (e.g., MNLI-m, MNLI-mm, SNLI) but low correlations between other dataset pairs, especially when pairing STR-O or STR-NU with MNLI or SNLI. This indicates that: (1) the standard validation set is not representative enough of certain analysis sets; and (2) model selection based solely on the standard validation set cannot reduce the instability on low-correlated analysis sets.
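For illustration, this correlation can be computed from saved per-checkpoint accuracies as in the following minimal sketch (our own; array names and shapes are assumptions and may differ from the released implementation):

```python
import numpy as np
from scipy.stats import spearmanr

def dataset_correlation(acc_i: np.ndarray, acc_j: np.ndarray) -> float:
    """Mean Spearman correlation between two accuracy trajectories.

    acc_i, acc_j: arrays of shape (num_seeds, num_checkpoints), where
    acc_i[s, t] is the accuracy on dataset D_i at checkpoint t under seed s.
    """
    rhos = [spearmanr(acc_i[s], acc_j[s])[0]  # Spearman rho per seed
            for s in range(acc_i.shape[0])]
    return float(np.mean(rhos))

# e.g., 10 seeds x 117 checkpoints, matching the training setup above:
# corr = dataset_correlation(acc_mnli_m, acc_str_nu)
```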

4 Tracking Instability
Before answering the question of how to handle these instabilities, we first seek the source of the instability to better understand the issue. We start with the intuition that high variance could result from high inter-example correlation within the dataset, and we provide hints from experimental observations. Next, we present theoretical evidence that formalizes our claim. Finally, based on empirical results, we conclude that the major source of variance is inter-example correlation.

4.1 Inter-Example Correlations
Presumably, the wild fluctuation in the training trajectories on different datasets comes from two potential sources. First, the individual prediction for each example may be highly unstable, so that predictions are constantly changing. Second, there might be strong inter-example correlations in the datasets, such that a large proportion of predictions are likely to change simultaneously, causing large instability. Here we show that the second reason, i.e., strong inter-example prediction correlation within the dataset, is what contributes most to the overall instability. We examine the correlations between the predictions of different example pairs during the training process. In Fig. 4, we calculate the inter-example Spearman correlation on MNLI and HANS. Fig. 4 shows a clear difference between the inter-example correlations in a stable dataset (MNLI) versus an unstable one (HANS). For the stable dataset (MNLI), the correlations between the predictions of examples are uniformly low, while for the unstable dataset (HANS), there exist clear groups of examples with very strong inter-correlation between their predictions. This observation suggests that those groups could be a major source of instability if they contain examples with frequently changing predictions.
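This measurement can be sketched as follows (our illustration; the exact tracking details in the original experiments may differ). We record each example's per-checkpoint correctness during training and then correlate example pairs:

```python
import numpy as np
from scipy.stats import spearmanr

def inter_example_correlation(correct: np.ndarray) -> np.ndarray:
    """Pairwise Spearman correlation between example prediction histories.

    correct: array of shape (num_checkpoints, num_examples), where
    correct[t, k] is 1 if example k is predicted correctly at checkpoint t.
    Returns a (num_examples, num_examples) correlation matrix; examples
    whose correctness never changes yield NaN rows/columns.
    """
    rho, _ = spearmanr(correct)  # with a 2-D input, columns are variables
    return rho

# Groups of examples that flip together appear as blocks of high
# correlation in this matrix (cf. the HANS heatmap in Fig. 4).
```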

4.2 Variance Decomposition
Next, we provide theoretical support for how high inter-example correlation contributes to large variance in the final accuracy. Later, we also demonstrate that it is the major source of the large variance. Suppose dataset D contains examples {(x_i, y_i)}_{i=1}^N, where N is the number of data points in the dataset, and x_i and y_i are the inputs and labels, respectively. We use a random variable C_i to denote whether the model predicts example i correctly (C_i = 1 if correct, 0 otherwise), where the randomness comes from the training run (e.g., the random seed). The accuracy over the dataset is then Acc = (1/N) Σ_i C_i.

[Table 3: Decomposed variance statistics for the standard datasets (MNLI-m, MNLI-mm, SNLI) and the analysis datasets (BREAK, HANS, SNLI-hard, STR-L, STR-S, STR-NE, STR-O, STR-A, STR-NU).]
We then decompose the variance of the accuracy, Var(Acc), into the sum of single-data variances Var(C_i) and the sum of inter-data covariances Cov(C_i, C_j):

$$\mathrm{Var}(\mathrm{Acc}) = \frac{1}{N^2}\sum_{i=1}^{N}\mathrm{Var}(C_i) \;+\; \frac{2}{N^2}\sum_{i<j}\mathrm{Cov}(C_i, C_j) \quad (2)$$

Here, the first term, (1/N²) Σ_i Var(C_i), captures the instability caused by randomness in individual example predictions, while the second term, (2/N²) Σ_{i<j} Cov(C_i, C_j), captures the instability caused by the covariance between predictions on different examples. The latter term is directly tied to the inter-example correlation.
[Table 5: A pair of highly correlated examples from an NLI analysis set. Premise: 'Though the author encouraged the lawyer, the tourist waited.' Hypothesis: 'The author encouraged the lawyer.' Label: entailment. Premise: 'The lawyer thought that the senators supported the manager.' Hypothesis: 'The senators supported the manager.' Label: non-entailment.]

Finally, to demonstrate that inter-example correlation is the major source of the high variance, we calculate the total variance, the independent variance (the first term in Eq. 2), and the covariance (the second term in Eq. 2) on every dataset. The results are shown in Table 3. While the averages of the independent variance are similar on standard and analysis datasets, there is a large gap between the average covariances of the two kinds of datasets. This diverging trend between the total variance and the independent variance shows that inter-example correlation within the dataset is the major reason for the difference in variance on the analysis datasets.
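The two terms of Eq. 2 can be estimated from multiple runs as in the following sketch (ours, assuming a saved correctness matrix; not necessarily how the released code is organized):

```python
import numpy as np

def decompose_variance(C: np.ndarray) -> tuple[float, float, float]:
    """Estimate the decomposition in Eq. 2 from multiple training runs.

    C: array of shape (num_seeds, num_examples), where C[s, i] is 1 if the
    model trained with seed s predicts example i correctly, and 0 otherwise.
    Returns (total_var, independent_var, covariance_term).
    """
    n = C.shape[1]
    acc = C.mean(axis=1)               # per-seed accuracy
    total_var = acc.var(ddof=1)        # Var(Acc)
    var_i = C.var(axis=0, ddof=1)      # per-example Var(C_i)
    indep_var = var_i.sum() / n**2     # first term of Eq. 2
    cov_term = total_var - indep_var   # second term, exact by Eq. 2
    return float(total_var), float(indep_var), float(cov_term)
```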

4.3 Highly-Correlated Cases
In this section, we look at examples whose predictions are highly inter-correlated. As shown in Table 5, example pairs in NLI datasets with high covariance usually target the same linguistic phenomenon and share similar lexical usage. These similarities in both syntax and lexicon make the predictions on the two examples highly correlated. The situation is similar for RC datasets. Since adversarial RC datasets such as AddSent are created by appending a distractor sentence to the end of the original passage, different examples can look very similar. In Table 6, we see two examples created by appending two similar distractor sentences to the same context, making the predictions on these two examples highly correlated.
[Table 6: A pair of highly correlated examples from AddSent. Original Context: 'In February 2010, in response to controversies regarding claims in the Fourth Assessment Report, five climate scientists, all contributing or lead IPCC report authors, wrote in the journal Nature calling for changes to the IPCC. They suggested a range of new organizational options, from tightening the selection of lead authors and contributors to dumping it in favor of a small permanent body or even turning the whole climate science assessment process into a moderated "living" Wikipedia-IPCC. Other recommendations included that the panel employs full-time staff and remove government oversight from its processes to avoid political interference.' Question: 'How was it suggested that the IPCC avoid political problems?' Answer: 'remove government oversight from its processes.' Distractor Sentence 1: 'It was suggested that the PANEL avoid nonpolitical problems.' Distractor Sentence 2: 'It was suggested that the panel could avoid nonpolitical problems by learning.']

In conclusion, since analysis datasets are usually created with pre-specified linguistic patterns/properties and investigation phenomena in mind, their distributions are less diverse than those of standard datasets. The difficulty of the dataset and the lack of diversity can lead to highly correlated predictions and high instability in models' final performance.

5 Implications, Suggestions, and Discussion
So far, we have demonstrated how severe this instability issue is and how the instability can be traced back to high correlation between the predictions of certain example clusters. Based on the preceding analysis results and conclusions, we now discuss potential ways to deal with this instability. We first want to point out that the issue cannot be solved by trivial modifications of the dataset, model, or training algorithm. Below, we present one initial attempt that illustrates the difficulty of solving it via dataset resplitting.
Limitation of Model Selection. In this experiment, we test whether an oracle model-selection process can help reduce instability. Unlike benchmark datasets such as SNLI, MNLI, and SQuAD, analysis sets are often released as a single set without dev/test splits. In Sec. 4, we observe that models' performance on analysis sets has little correlation with performance on standard validation sets, making the usual model-selection routine ineffective at reducing performance instability on analysis sets. Therefore, we perform oracle model selection by dividing the original analysis set into an 80% analysis-dev set and a 20% analysis-test set.
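A minimal sketch of this oracle selection (our illustration; function and array names are assumptions):

```python
import numpy as np

def oracle_selection(acc_dev: np.ndarray, acc_test: np.ndarray) -> np.ndarray:
    """Pick the best checkpoint per seed on analysis-dev; report analysis-test.

    acc_dev, acc_test: arrays of shape (num_seeds, num_checkpoints) holding
    accuracies on the 80% analysis-dev and 20% analysis-test splits.
    Returns the analysis-test accuracy of the selected checkpoint per seed.
    """
    best_ckpt = acc_dev.argmax(axis=1)  # oracle choice per seed
    return acc_test[np.arange(acc_test.shape[0]), best_ckpt]

# The spread of these per-seed scores shows whether instability persists
# even with in-distribution model selection (cf. Table 7).
```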
In Table 7, we compare the results of BERT-B on the new analysis-test split when model selection is based on either MNLI or the corresponding analysis-dev split. Model selection on analysis-dev helps increase the mean performance on several datasets, especially on HANS, STR-O, and STR-NU, indicating the expected high correlation within an analysis set; however, the variances of the final results are not consistently reduced across datasets. Hence, beyond the performance instability caused by noisy model selection, different random seeds indeed lead to models with different performance on analysis datasets. This observation suggests that performance instability is relatively independent of mean performance and hints that current models have intrinsic randomness brought by different random seeds, which is unlikely to be removed through simple dataset/model fixes.

5.1 Implications of Result Instability
If the intrinsic randomness in the model prevents a quick fix, what does this instability issue imply? At first glance, one may view the instability as a problem caused by careless dataset design or deficiencies in model architectures/training algorithms. While both parts are indeed imperfect, we suggest that it is more productive to view this instability as an inevitable consequence of the current datasets and models. On the data side, as analysis datasets usually leverage specific rules or linguistic patterns to generate examples targeting specific linguistic phenomena and properties, they contain highly similar examples (see the examples in Sec. 4.3). Hence, models' predictions on these examples will inevitably be highly correlated. On the model side, since current models cannot yet stably capture these hard linguistic/logical properties through learning, they exhibit instability on some examples, and this instability is amplified by the high correlation between example predictions. These datasets can still serve as good evaluation tools as long as we are aware of the instability issue and report results with multiple runs. To better handle the instability, we also propose some long- and short-term suggestions below, based on variance reporting and analysis-set diversification.

5.2 Short/Long Term Suggestions
Better Analysis Reporting (Short Term). Even without a quick fix that removes the instability from results, it is still important to keep making progress using currently available resources and, more importantly, to evaluate this progress accurately. Therefore, in the short term, we encourage researchers to report the decomposed variance (independent variance, i.e., Idp Var, and covariance, Cov) for a more accurate understanding of models and datasets, as in Sec. 4.2 and Tables 3 and 4. The first number (Idp Var) can be viewed as a metric of how stable the model's individual predictions are, and it can be compared across different models; models with a lower score can be interpreted as more stable on single predictions. By comparing models using both the total variance and the Idp Var, we can better understand where a model's instability comes from. Work aiming at more stable models should focus on reducing the total variance with an emphasis on Idp Var; work aiming at better learning the targeted property of a dataset should focus more on the covariance term when analyzing results.

Model and Dataset Suggestions (Long Term). In the long term, we should focus on improving models (including better inductive biases and large-scale pre-training with tasks concerning structure/compositionality) so that they achieve high accuracy stably. On the dataset side, since different analysis datasets show poor correlation with each other, we suggest building datasets that use a diverse set of patterns to create examples, so as to test the systematic capability regarding a certain linguistic property under different contexts rather than a model's ability to solve one single pattern. A more diverse dataset should lead to lower covariance between predictions, which Sec. 4 shows to be the major source of the instability.

6 Conclusions
Auxiliary analysis datasets are meant to be important resources for debugging and understanding models. However, the large instability of current models on these analysis sets undermines such benefits and creates non-negligible obstacles for future research. In this paper, we examine the instability issue in detail and provide theoretical and empirical evidence that high inter-example correlation is its cause. Finally, we give suggestions on future research directions and on better reporting of analysis-set variance. We hope this paper will guide researchers in handling instability in practice and inspire future work on reducing instability in experiments.

A Models

We experiment with both pre-trained transformer models and traditional models to see how different structures and the use of pre-trained representations influence the results.

A.1 Transformer Models
BERT (Devlin et al., 2019). BERT is a Transformer model (Vaswani et al., 2017) pre-trained with masked-language-modeling supervision on a large unlabeled corpus to obtain deep bi-directional representations. To conduct NLI, the premise and the hypothesis are concatenated as the input, and a simple classifier is added on top of the pre-trained representations to predict the label. The whole model is fine-tuned on NLI datasets before evaluation.
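As a sketch of this fine-tuning setup (our illustration via the HuggingFace transformers API, which is an assumption; the original experiments may use different tooling):

```python
import torch
from transformers import BertForSequenceClassification, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained(
    "bert-base-uncased", num_labels=3)  # entailment / neutral / contradiction

# Premise and hypothesis are concatenated as one sentence pair:
# [CLS] premise [SEP] hypothesis [SEP]
inputs = tokenizer(
    "Though the author encouraged the lawyer, the tourist waited.",
    "The author encouraged the lawyer.",
    return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # classification over the [CLS] token
```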
RoBERTa (Liu et al., 2019b). RoBERTa uses the same structure as BERT but carefully tunes the pre-training hyper-parameters and is trained on 10 times more data during pre-training. The fine-tuning architecture and process are the same as for BERT.
XLNet (Yang et al., 2019). XLNet also adopts the Transformer structure, but its pre-training objective is generalized auto-regressive language modeling. It can also take in inputs of unbounded length by using the Transformer-XL (Dai et al., 2019) architecture. The fine-tuning architecture and process are the same as for BERT.

A.2 Traditional Models
ESIM (Chen et al., 2017). ESIM first uses a BiLSTM to encode both the premise and the hypothesis and performs cross-attention before making the prediction with a classifier. It is one representative model from before the use of pre-trained Transformer structures.
B Dataset Details

Break NLI (Glockner et al., 2018). The examples in Break NLI resemble the examples in SNLI. The hypothesis is generated by swapping words in the premise, so that lexical or world knowledge is required to make the correct prediction.
SNLI-Hard (Gururangan et al., 2018). SNLI-hard is a subset of the SNLI test set. Examples that can be predicted correctly by looking only at the annotation artifacts in the hypothesis sentence are removed.
NLI Stress (Naik et al., 2018a). The NLI Stress Tests are a collection of datasets modified from MNLI. Each dataset targets one specific linguistic phenomenon: word overlap, negation, antonyms, numerical reasoning, length mismatch, or spelling errors. Models with a corresponding weakness will score low on that dataset.

HANS (McCoy et al., 2019b). The examples in HANS are created to reveal three heuristics used by models: the lexical overlap heuristic, the sub-sequence heuristic, and the constituent heuristic. For each heuristic, examples are generated using five different templates.
Dataset statistics and categories for all the NLI datasets can be seen in Table 8.

C Means and Standard Deviations of Final Results on NLI/RC datasets
Here we provide the means and standard deviations of the final performance over 10 different seeds in Table 9 and Table 10.