We Need to Talk about Standard Splits

It is standard practice in speech & language technology to rank systems according to their performance on a test set held out for evaluation. However, few researchers apply statistical tests to determine whether differences in performance are likely to arise by chance, and few examine the stability of system ranking across multiple training-testing splits. We conduct replication and reproduction experiments with nine part-of-speech taggers published between 2000 and 2018, each of which claimed state-of-the-art performance on a widely-used “standard split”. While we replicate results on the standard split, we fail to reliably reproduce some rankings when we repeat this analysis with randomly generated training-testing splits. We argue that randomly generated splits should be used in system evaluation.


Introduction
Evaluation with a held-out test set is one of the few methodological practices shared across nearly all areas of speech and language processing. In this study we argue that one common instantiation of this procedure, evaluation with a standard split, is insufficient for system comparison, and propose an alternative based on multiple random splits.
Standard split evaluation can be formalized as follows. Let $G$ be a set of ground truth data, partitioned into a training set $G_{\text{train}}$, a development set $G_{\text{dev}}$, and a test (evaluation) set $G_{\text{test}}$. Let $S$ be a system with arbitrary parameters and hyperparameters, and let $M$ be an evaluation metric. Without loss of generality, we assume that $M$ is a function with domain $G \times S$ and that higher values of $M$ indicate better performance. Furthermore, we assume a supervised training scenario in which the free parameters of $S$ are set so as to maximize $M(G_{\text{train}}, S)$, optionally tuning hyperparameters so as to maximize $M(G_{\text{dev}}, S)$. Then, if $S_1$ and $S_2$ are competing systems so trained, we prefer $S_1$ to $S_2$ if and only if $M(G_{\text{test}}, S_1) > M(G_{\text{test}}, S_2)$.

Hypothesis testing for system comparison
One major concern with this procedure is that it treats $M(G_{\text{test}}, S_1)$ and $M(G_{\text{test}}, S_2)$ as exact quantities when they are better seen as estimates of random variables corresponding to true system performance. In fact many widely used evaluation metrics, including accuracy and F-score, have known statistical distributions, allowing hypothesis testing to be used for system comparison.
For instance, consider the comparison of two systems $S_1$ and $S_2$ trained and tuned to maximize accuracy. The difference in test accuracy, $\hat{\delta} = M(G_{\text{test}}, S_1) - M(G_{\text{test}}, S_2)$, can be thought of as an estimate of some latent variable $\delta$ representing the true difference in system performance. While the distribution of $\hat{\delta}$ is not obvious, the probability that there is no population-level difference in system performance (i.e., $\delta = 0$) can be computed indirectly using McNemar's test (Gillick and Cox, 1989). Let $n_{1>2}$ be the number of samples in $G_{\text{test}}$ which $S_1$ correctly classifies but $S_2$ misclassifies, and $n_{2>1}$ be the number of samples which $S_1$ misclassifies but $S_2$ correctly classifies. When $\delta = 0$, roughly half of the disagreements should favor $S_1$ and the other half should favor $S_2$. Thus, under the null hypothesis, $n_{1>2} \sim \mathrm{Bin}(n, 0.5)$ where $n = n_{1>2} + n_{2>1}$. The (one-sided) probability of the null hypothesis is then the probability of sampling $n_{1>2}$ from this distribution. Similar methods can be used for other evaluation metrics, or a reference distribution can be estimated with bootstrap resampling (Efron, 1981).
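The binomial form of McNemar's test described above can be illustrated with a minimal Python sketch. The function names are hypothetical and are not taken from the released software. The one-sided p-value is computed here as the upper-tail probability of seeing at least as many disagreements in favor of the first system as were observed, which is one common reading of the test.

```python
from math import comb


def discordant_counts(gold, pred1, pred2):
    """Count items that exactly one of the two systems labels correctly.

    Returns (n12, n21): n12 counts items system 1 gets right and system 2
    gets wrong; n21 counts the reverse. Items on which both systems are
    right, or both are wrong, are ignored.
    """
    n12 = n21 = 0
    for g, p1, p2 in zip(gold, pred1, pred2):
        if p1 == g and p2 != g:
            n12 += 1
        elif p1 != g and p2 == g:
            n21 += 1
    return n12, n21


def mcnemar_one_sided(n12, n21):
    """Upper-tail exact p-value: P(X >= n12) for X ~ Bin(n12 + n21, 0.5)."""
    n = n12 + n21
    if n == 0:
        return 1.0  # no disagreements, so no evidence either way
    tail = sum(comb(n, k) for k in range(n12, n + 1))
    return tail / 2 ** n


gold = ["DT", "NN", "VBZ", "DT", "NN"]
sys1 = ["DT", "NN", "VBZ", "DT", "VB"]
sys2 = ["DT", "VB", "VBD", "DT", "NN"]
n12, n21 = discordant_counts(gold, sys1, sys2)
p_value = mcnemar_one_sided(n12, n21)
```

Only the disagreements enter the test, so two systems with very different accuracies on a small test set can still fail to differ significantly if they disagree on few tokens.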
Despite this, few recent studies make use of statistical system comparison. Dror et al. (2018) survey statistical practices in all long papers presented at the 2017 meeting of the Association for Computational Linguistics (ACL), and all articles published in the 2017 volume of the Transactions of the ACL. They find that the majority of these works do not use appropriate statistical tests for system comparison, and many others do not report which test(s) were used. We hypothesize that the lack of hypothesis testing for system comparison may lead to type I error, the error of rejecting a true null hypothesis. As it is rarely possible to perform the necessary hypothesis tests from published results, we evaluate this risk using a replication experiment.

Standard vs. random splits
Furthermore, we hypothesize that standard split methodology may be insufficient for system evaluation. While evaluations based on standard splits are an entrenched practice in many areas of natural language processing, the static nature of standard splits may lead researchers to unconsciously "overfit" to the vagaries of the training and test sets, producing poor generalization. This tendency may also be amplified by publication bias in the sense of Scargle (2000). The field has chosen to define "state of the art" performance as "the best performance on a standard split", and few experiments which do not report improvements on a standard split are ultimately published. This effect is likely to be particularly pronounced on highly-saturated tasks for which system performance is near ceiling, as this increases the prior probability of the null hypothesis (i.e., of no difference). We evaluate this risk using a series of reproductions.

Replication and reproduction
In this study we perform a replication and a series of reproductions. These techniques were until recently quite rare in this field, despite the inherently repeatable nature of most natural language processing experiments. Researchers attempting replications or reproductions have reported problems with availability of data (Mieskes, 2017; Wieling et al., 2018) and software (Pedersen, 2008), and various details of implementation (Fokkens et al., 2013; Reimers and Gurevych, 2017; Schluter and Varab, 2018). While we cannot completely avoid these pitfalls, we select a task, English part-of-speech tagging, for which both data and software are abundantly available. This task has two other important affordances for our purposes. First, it is face-valid, both in the sense that the equivalence classes defined by POS tags reflect genuine linguistic insights and that standard evaluation metrics such as token and sentence accuracy directly measure the underlying construct. Secondly, POS tagging is useful both in zero-shot settings (e.g., Elkahky et al., 2018; Trask et al., 2015) and as a source of features for many downstream tasks, and in both settings, tagging errors are likely to propagate. We release the underlying software under a permissive license.

Data
The Wall St. Journal (WSJ) portion of Penn Treebank-3 (LDC99T42; Marcus et al., 1993) is commonly used to evaluate English part-of-speech taggers. In experiment 1, we also use a portion of OntoNotes 5 (LDC2013T19; Weischedel et al., 2011), a substantial subset of the Penn Treebank WSJ data re-annotated for quality assurance.

Models
We attempted to choose a set of taggers claiming state-of-the-art performance at time of publication. We first identified candidate taggers using the "State of the Art" page for part-of-speech tagging on the ACL Wiki. We then selected six taggers for which all needed software and external data were available at time of writing. These taggers are described in more detail below.

Metrics
Our primary evaluation metric is token accuracy, the percentage of tokens which are correctly tagged with respect to the gold data. We compute 95% Wilson (1927) score confidence intervals for accuracies, and use the two-sided mid-p variant (Fagerland et al., 2013) of McNemar's test for system comparison. We also report out-of-vocabulary (OOV) accuracy (that is, token accuracy limited to tokens not present in the training data) and sentence accuracy, the percentage of sentences for which there are no tagging errors. Table 1 reports statistics for the standard split. The OntoNotes sample is slightly smaller as it omits sentences on financial news, most of which are highly redundant and idiosyncratic. However, the entire OntoNotes sample was tagged by a single experienced annotator, eliminating any annotator-specific biases in the Penn Treebank (e.g., Ratnaparkhi, 1997, 137f.).
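The two statistical tools named in this section, the Wilson score interval and the two-sided mid-p McNemar test, can be sketched in a few lines of Python. This is an illustrative sketch using the textbook formulas, with hypothetical function names. It is not the released software, and the default critical value assumes a 95% interval.

```python
from math import comb, sqrt


def wilson_interval(correct, total, z=1.959963984540054):
    """Wilson score interval for a proportion (default z gives 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


def mcnemar_midp(n12, n21):
    """Two-sided mid-p McNemar test on the two discordant counts.

    Computed as twice the lower binomial tail up to the smaller count,
    minus the point probability of that count, under Bin(n12 + n21, 0.5).
    """
    n = n12 + n21
    if n == 0:
        return 1.0  # no disagreements between the systems
    k = min(n12, n21)
    cdf = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    pmf = comb(n, k) / 2 ** n
    return min(1.0, 2 * cdf - pmf)


low, high = wilson_interval(50, 100)
p_value = mcnemar_midp(1, 3)
```

Unlike the normal-approximation interval, the Wilson interval stays inside [0, 1] even when accuracy is near ceiling, which is the regime in which POS taggers operate.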

Experiment 1: Replication
In experiment 1, we adopt the standard split established by Collins (2002)

Experiment 2: Reproduction
We now repeat these analyses across twenty randomly generated 80%-10%-10% splits. Following Dror et al. (2017), we use the Bonferroni procedure to control familywise error rate, the probability of falsely rejecting at least one true null hypothesis. This is appropriate insofar as each individual trial (i.e., evaluation on a random split) has a non-trivial statistical dependence on other trials. Table 3 reports the number of random splits, out of twenty, where the McNemar test p-value is significant after the correction for familywise error rate. This provides a coarse estimate of how often the second system would be likely to significantly outperform the first system given a random partition of similar size. Most of these pairwise comparisons are stable across random trials. However, the Stanford tagger, for example, is not a significant improvement over LAPOS for nearly all random trials, and in some random trials (two for Penn Treebank, fourteen for OntoNotes) it is in fact worse. Recall that the Stanford tagger was also not significantly better than LAPOS for OntoNotes in experiment 1. Figure 1 shows token accuracies across the two experiments. The last row of the figure gives results for an oracle ensemble which correctly predicts the tag just in case any of the six taggers predicts the correct tag.
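The random-split procedure and the Bonferroni count can be sketched as follows. The function names, the seeding scheme, and the significance level of 0.05 are assumptions made for illustration. The sketch also treats the trials for a single system pair as the family being corrected, which may differ from the exact family used in the experiments.

```python
import random


def random_split(sentences, seed, train=0.8, dev=0.1):
    """Shuffle sentences and cut them into train/dev/test partitions.

    Splitting is done at the sentence level; the test partition receives
    whatever remains after the train and dev partitions are taken.
    """
    rng = random.Random(seed)  # seeded so each split is reproducible
    shuffled = list(sentences)
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_dev = round(len(shuffled) * dev)
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_dev],
        shuffled[n_train + n_dev:],
    )


def count_significant(p_values, alpha=0.05):
    """Count trials significant after Bonferroni correction.

    Each p-value is compared against alpha divided by the number of
    trials, which bounds the familywise error rate by alpha.
    """
    threshold = alpha / len(p_values)
    return sum(p < threshold for p in p_values)


# Twenty splits of a toy corpus of 100 sentence identifiers.
splits = [random_split(range(100), seed=s) for s in range(20)]
# Hypothetical p-values, one per split, for one pair of systems.
p_values = [0.001] * 3 + [0.01] * 17
n_significant = count_significant(p_values)
```

With twenty trials and an assumed level of 0.05, each p-value must fall below 0.0025 to count, so a comparison that is only marginally significant on a single split will rarely survive the correction.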

Error analysis
From experiment 1, we estimate that the last two decades of POS tagging research have produced a 1.28% absolute reduction in token errors. At the same time, the best tagger is 1.16% below the oracle ensemble. Thus we were interested in disagreements between taggers. We investigate this by treating each of the six taggers as separate coders in a collaborative annotation task. We compute per-sentence inter-annotator agreement using Krippendorff's α (Artstein and Poesio, 2008), then manually inspect sentences with the lowest α values, i.e., with the highest rate of disagreement. By far the most common source of disagreement is "headline"-like sentences such as Foreign Bonds. While these sentences are usually quite short, high disagreement is also found for some longer headlines, as in the example sentence in table 4; the effect seems to be due more to capitalization than sentence length. Several taggers lean heavily on capitalization cues to identify proper nouns, and thus capitalized tokens in headline sentences are frequently misclassified as proper nouns and vice versa, as are sentence-initial capitalized nouns in general. Most other sentences with low α have local syntactic ambiguities. For example, the word lining, acting as a common noun (NN) in the context …a silver lining for the…, is mislabeled as a gerund (VBG) by two of six taggers.
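The oracle ensemble used as the upper bound in this comparison counts a token as correct whenever at least one tagger predicts the gold tag. A minimal sketch follows, with hypothetical function names and toy tag sequences that are not drawn from the experimental data.

```python
def token_accuracy(gold, predicted):
    """Fraction of tokens whose predicted tag matches the gold tag."""
    return sum(g == p for g, p in zip(gold, predicted)) / len(gold)


def oracle_accuracy(gold, predictions):
    """Fraction of tokens tagged correctly by at least one system.

    `predictions` is a list of tag sequences, one per tagger, each
    aligned token-for-token with `gold`.
    """
    hits = sum(
        any(system[i] == g for system in predictions)
        for i, g in enumerate(gold)
    )
    return hits / len(gold)


gold = ["DT", "NN", "VBZ", "NNP"]
tagger_a = ["DT", "VBG", "VBZ", "NN"]
tagger_b = ["DT", "NN", "VBD", "NN"]
best = max(token_accuracy(gold, t) for t in (tagger_a, tagger_b))
oracle = oracle_accuracy(gold, [tagger_a, tagger_b])
```

The gap between the best single tagger and the oracle measures how often the taggers make complementary errors, which is what motivates inspecting the sentences on which they disagree.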

Discussion
We draw attention to two distinctions between the replication and reproduction experiments. First, we find that a system judged to be significantly better than another on the basis of performance on the standard split does not in fact outperform that system on re-annotated data or randomly generated splits, suggesting that it is "overfit to the standard split" and does not represent a genuine improvement in performance. Secondly, as can be seen in figure 1, overall performance is slightly higher on the random splits. We posit this to be an effect of randomization at the sentence level. For example, in the standard split the word asbestos occurs fifteen times in a single training set document, but just once in the test set. Such discrepancies are far less likely to arise in random splits.
Diversity of languages, data, and tasks are all highly desirable goals for natural language processing. However, nothing about this demonstration depends on any particularities of the English language, the WSJ data, or the POS tagging task. English is a somewhat challenging language for POS tagging because of its relatively impoverished inflectional morphology and pervasive noun-verb ambiguity (Elkahky et al., 2018). It would not do to use these six taggers for other languages as they are designed for English text and in some cases depend on English-only external resources for feature generation. However, random split experiments could, for instance, be performed for the subtasks of the CoNLL-2018 shared task on multilingual parsing (Zeman et al., 2018).
We finally note that repeatedly training the Flair tagger in experiment 2 required substantial grid computing resources and may not be feasible for many researchers at the present time.

Conclusions
We demonstrate that standard practices in system comparison, and in particular, the use of a single standard split, may result in avoidable type I error. We suggest that practitioners who wish to firmly establish that a new system is truly state-of-the-art augment their evaluations with Bonferroni-corrected random split hypothesis testing.
It is said that statistical praxis is of greatest import in those areas of science least informed by theory. While linguistic theory and statistical learning theory both have much to contribute to part-of-speech tagging, we still lack a theory of the tagging task rich enough to guide hypothesis formation. In the meantime, we must depend on system comparison, backed by statistical best practices and error analysis, to make forward progress on this task.