Is the Best Better? Bayesian Statistical Model Comparison for Natural Language Processing

Recent work raises concerns about the use of standard splits to compare natural language processing models. We propose a Bayesian statistical model comparison technique which uses k-fold cross-validation across multiple data sets to estimate the likelihood that one model will outperform the other, or that the two will produce practically equivalent results. We use this technique to rank six English part-of-speech taggers across two data sets and three evaluation metrics.


Introduction
Gorman and Bedrick (2019) raise concerns about standard procedures used to compare speech and language processing models. They evaluate the performance of six English part-of-speech taggers using multiple randomly-generated training-testing splits; in some cases, they fail to reproduce previously-published system rankings established using a single "standard" split. They argue that point estimates of performance derived from a single training-testing split are insufficient to establish system rankings, even when null hypothesis significance testing is used for model comparison.
In this study, we propose a technique for system comparison based on Bayesian statistical analysis. Our approach, motivated in Section 2 and described in Section 3, allows us to infer the likelihood that one model will outperform the other, or even that both models' performance will be practically equivalent, something that is not possible with the frequentist statistical tests used by Gorman and Bedrick. Our approach can also be applied simultaneously across multiple data sets. As a proof of concept, in Sections 4-5 we apply the proposed method using the experimental setup of Gorman and Bedrick, and in Section 6, we use it to rank the six taggers, compare evaluation metrics, and interpret the results. Our failure to reproduce some earlier-reported results leads us to discuss, in Section 6, the impact of repeating experiments, the value of contrasting performance across multiple metrics, and the advantages of comparing likelihoods. We also discuss the notion of practical equivalence for speech and language technology.

Prior work

Langley (1988) argues that machine learning should be viewed as an experimental science, and as such, machine learning technologies should be evaluated according to their performance on multiple held-out data sets. Dietterich (1998) proposes a framework for comparing two supervised classifiers using null hypothesis tests to determine whether the two classifiers have the same likelihood of predicting a correct result. He introduces several methods, including a paired t-test for k-fold cross-validation results, but notes that the assumptions of normality and independence may not be satisfied in all cases. Nadeau and Bengio (2000) propose a correlation-based correction to Dietterich's t-test procedure which adjusts for the overlap between folds. Hull (1994) and Schütze et al. (1995) propose non-parametric tests for comparing models across multiple data sets; Salzberg (1997) proposes Bonferroni-corrected ANOVA analysis. Demšar (2006) reports that the Friedman non-parametric test with the Nemenyi correction makes fewer assumptions and has greater power than parametric tests. Other authors (e.g., Luengo et al., 2009; García et al., 2010; Derrac et al., 2011) further adapt the Friedman test for model comparison.
However, as Demšar (2006) notes, there still does not exist a non-parametric null hypothesis test designed for use with a repeated-measures (i.e., k-fold) design across multiple data sets. As a result, there is no procedure that takes into consideration the variance of scores within a given data set, at least within the frequentist paradigm. Demšar (2008) and Benavoli et al. (2017) enumerate additional problems with null hypothesis significance testing (NHST) procedures for model comparison:

• NHST does not estimate probabilities for hypotheses; i.e., it does not tell us how likely two models are to perform equivalently,
• NHST p-values conflate effect size and sample size; i.e., with a sufficiently large sample, one can claim significance even if the effect size is trivial,
• NHST yields no information about the null hypothesis; i.e., one cannot draw further conclusions from a failure to reject it, and
• there is no principled way to select an appropriate α-level for NHST.
These issues lead Benavoli et al. to reject NHST-based model comparison in favor of a Bayesian approach. Bayesian hypothesis tests are defined by a likelihood function p(d | θ), a probability model of the data d conditioned on θ, a vector of parameters. The prior distribution for θ, p(θ), must also be defined. From these components, a posterior probability distribution p(θ | d) can then be calculated and queried (i.e., sampled from) to perform inference. Various techniques can be used to estimate θ; the parameters are usually related to the differences in models' scores under some evaluation metric, and ultimately, to whether one method is likely to perform better or worse than the other. Thus, the posterior distribution can be used to perform model comparison. Benavoli et al. also introduce the notion of a region of practical equivalence (henceforth, ROPE), which allows Bayesian hypothesis testing to estimate the likelihood that two models' results will be functionally indistinguishable. The ROPE defines an interval around a model's result; if another model's performance falls within this interval, the two are deemed practically equivalent. For example, if one deems that a difference of 1 percentage point in accuracy denotes practical equivalence, the interval [−0.01, 0.01] is used as the ROPE, and a model performing at .941 accuracy and another at .949 will be deemed practically equivalent. This helps protect the statistical procedure from artifacts and false alarms of significance. Readers are referred to the accessible tutorial by Benavoli et al. (2017) for further details. Corani et al. (2017) generalize Bayesian model comparison to a repeated-measures scenario in which there are multiple data sets with unequal score variances. They propose a hierarchical Bayesian model for estimating the likelihood of one model performing better than, worse than, or equivalently to another.
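The ROPE check itself is simple; a minimal sketch in Python, using the illustrative interval and accuracy values from the paragraph above:

```python
def practically_equivalent(acc_a, acc_b, rope=(-0.01, 0.01)):
    """Two results are practically equivalent if their difference
    falls inside the region of practical equivalence (ROPE)."""
    return rope[0] <= acc_a - acc_b <= rope[1]

# .941 vs. .949: the difference (-0.008) lies inside [-0.01, 0.01]
equivalent = practically_equivalent(0.941, 0.949)
```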
We now proceed to briefly describe and adapt this procedure to re-evaluate the findings of Gorman and Bedrick (2019).

Bayesian model comparison
Imagine a scenario in which one wishes to compare the performance of two classifiers across q data sets. By performing m k-fold evaluations, the experimenter obtains, for the i-th data set, a vector of n = mk observations, i.e., differences in scores between the two models: x_i = (x_{i,1}, ..., x_{i,n}). The values in these vectors exhibit a positive cross-correlation ρ because cross-validation introduces overlap in training data. Let δ_i denote the mean difference in score on the i-th data set, and let δ_0 denote the average population-level difference. Corani et al. (2017) propose a hierarchical probabilistic model

    x_i ∼ MVN(1 δ_i, Σ_i),    δ_i ∼ t(δ_0, σ_0, µ),

where MVN is a multivariate normal distribution over the vector of classifier differences with mean δ_i and a covariance matrix Σ_i with variance σ_i² along the diagonal and ρσ_i² on the off-diagonal. The per-data-set mean differences δ_i are drawn from a Student t distribution parameterized by the average population-level difference δ_0 and variance σ_0, with µ degrees of freedom. The prior distributions for δ_0, σ_0, and µ are defined so as to preserve the robustness of the model; they are motivated and described in more detail by Corani et al. (2017). Crucially, we model the differences obtained in individual runs using a multivariate normal distribution centered on the per-data-set mean differences with a per-data-set variance, and the mean differences using a unimodal distribution robust to outliers and non-normality. The per-data-set variances σ_i are modeled by a uniform distribution.
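For intuition, the within-data-set sampling distribution can be sketched as follows; the compound-symmetric covariance below mirrors the Σ_i just described. All parameter values, and the choice ρ = 1/k for a 10-fold design, are illustrative assumptions, not the priors of Corani et al.:

```python
import numpy as np

def sample_fold_differences(delta_i, sigma_i, rho, n, rng):
    """Draw n correlated fold-wise score differences for one data set:
    mean delta_i everywhere, variance sigma_i**2 on the diagonal, and
    covariance rho * sigma_i**2 between every pair of folds."""
    cov = sigma_i**2 * ((1.0 - rho) * np.eye(n) + rho * np.ones((n, n)))
    return rng.multivariate_normal(np.full(n, delta_i), cov)

rng = np.random.default_rng(0)
# 20 repetitions of 10-fold CV -> n = 200 observations; rho = 1/10
x = sample_fold_differences(delta_i=0.01, sigma_i=0.005, rho=0.1, n=200, rng=rng)
```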
After the model learns the parameter distributions from experimental data, we obtain a posterior probability distribution p(δ_0, σ_0, µ | d). To infer whether one classifier is more likely to outperform another, or whether the two are practically equivalent, we draw N_s samples from the posterior distribution. We use decision counters n_left, n_rope, and n_right to keep track of how many times the left model was more likely to outperform the right model, to be practically equivalent to it, and to be outperformed by it, respectively. For each sample of the parameters, we define the posterior of the mean difference in accuracy on a new, unseen data set δ_next as t(δ_0, σ_0, µ). We obtain the outcome probabilities by integrating this distribution over the three intervals (e.g., we obtain the probability that the left model is better than the right by integrating from the left end of the distribution to the left edge of the ROPE interval, and so on), and then increment the decision counter for the region with the highest outcome probability. Finally, we compute likelihoods for the three scenarios by dividing the decision counts by the number of samples drawn: P(left) = n_left/N_s, P(rope) = n_rope/N_s, and P(right) = n_right/N_s. Instead of significance, we thus estimate the likelihood that one method is better than the other (or that the two are practically equivalent). These estimates follow from observing the beliefs of a Bayesian model of the methods' mean difference on unseen data sets, after sampling parameters from a meta-distribution which estimates the difference and variance over the population of data sets with µ degrees of freedom.
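The per-sample decision rule and the final tally can be sketched as follows. This is a minimal illustration, not the baycomp implementation; the stand-in posterior samples are fabricated, and the convention that the left tail corresponds to the left model being better is an assumption about how the score differences are signed:

```python
from scipy.stats import t as student_t

def rope_decision(delta0, sigma0, nu, rope=0.01):
    """For one posterior sample (delta0, sigma0, nu), integrate the
    predictive t-distribution for delta_next over the three regions
    and return the most probable outcome."""
    dist = student_t(df=nu, loc=delta0, scale=sigma0)
    p_left = dist.cdf(-rope)                    # left model better
    p_rope = dist.cdf(rope) - dist.cdf(-rope)   # practically equivalent
    p_right = 1.0 - dist.cdf(rope)              # right model better
    probs = {"left": p_left, "rope": p_rope, "right": p_right}
    return max(probs, key=probs.get), probs

# tally decisions over N_s posterior samples (a fixed stand-in list here)
samples = [(-0.02, 0.005, 5.0)] * 100
counts = {"left": 0, "rope": 0, "right": 0}
for d0, s0, nu in samples:
    decision, _ = rope_decision(d0, s0, nu)
    counts[decision] += 1
likelihoods = {k: v / len(samples) for k, v in counts.items()}
```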

Materials and methods
To compare the results of our study with those obtained by Gorman and Bedrick (2019), we use the same two data sets, the Penn Treebank and OntoNotes; summary statistics for these data are given in Table 1.
We perform 20 randomized 10-fold cross-validations, obtaining 200 measurements of each model's performance on each data set. In each run, 80% of the data is used for training, 10% for validation, and 10% for evaluation. We fit Bayesian models using the baycomp library (http://github.com/janezd/baycomp) and draw 50,000 samples from the posterior.
Following Gorman and Bedrick, we use three evaluation metrics. Token accuracy is simply the number of test data tokens correctly tagged divided by the total number of tokens, and is the standard intrinsic evaluation metric for this task. OOV accuracy is similar to token accuracy but is restricted to out-of-vocabulary tokens, i.e., those found in the test data but not in the training data. Finally, sentence accuracy is the number of test data sentences that contain no tagging errors, divided by the number of test sentences. Ground-truth data is provided by human annotators; annotation quality for these data has been studied by Ratnaparkhi (1997) and Manning (2011), among others.
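The three metrics can be computed in a single pass; a minimal sketch, in which the sentence representation (parallel lists of (token, tag) pairs) and the function name are our own conventions, not those of any particular tagger's tooling:

```python
def evaluate(gold_sents, pred_sents, train_vocab):
    """Return (token accuracy, OOV accuracy, sentence accuracy).

    gold_sents / pred_sents: parallel lists of sentences, each a list
    of (token, tag) pairs; train_vocab: set of tokens seen in training.
    """
    tok_correct = tok_total = 0
    oov_correct = oov_total = 0
    sent_correct = 0
    for gold, pred in zip(gold_sents, pred_sents):
        all_match = True
        for (tok, gtag), (_, ptag) in zip(gold, pred):
            tok_total += 1
            correct = gtag == ptag
            tok_correct += correct
            if not correct:
                all_match = False
            if tok not in train_vocab:      # OOV: unseen in training
                oov_total += 1
                oov_correct += correct
        sent_correct += all_match
    return (tok_correct / tok_total,
            oov_correct / oov_total if oov_total else float("nan"),
            sent_correct / len(gold_sents))
```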

Results
Posterior distributions of the hierarchical models are visualized in Figures 1-3 and summarized in Table 2. We define the ROPE to have the same size as the 95% confidence interval; this is roughly 2-3% for sentence and OOV accuracy, and 0.2% for token accuracy. Thus, two models are judged to be practically equivalent in sentence accuracy if they differ in performance on fewer than 98 sentences of the Penn Treebank or 75 sentences on the slightly smaller OntoNotes corpus. For token accuracy, they are practically equivalent if they differ on fewer than 210 PTB tokens or 162 OntoNotes tokens, respectively.
The hierarchical model estimates, for example, that TnT, the simplest tagger, would be outperformed in token accuracy by any of the other five taggers 80-90% of the time. However, there is a surprisingly high chance of practical equivalence in token accuracy between the Collins tagger, LAPOS, and the Stanford tagger; for instance, the probability of practical equivalence of the latter two is 84%. This result is contrary to Gorman and Bedrick's replication of a standard-split evaluation (they report that LAPOS is significantly better than the Collins tagger, and that the Stanford tagger is significantly better than LAPOS, according to two-tailed McNemar tests at α = .05), but it is consistent with their subsequent failure to consistently reproduce this ranking using randomly-generated splits and Bonferroni-corrected McNemar tests. In contrast, NLP4J and Flair are quite likely to outperform the other taggers, and Flair has an 80% chance of outperforming NLP4J.
Similar results are obtained with sentence accuracy, a less commonly used metric. TnT is once again quite likely to be outperformed by the other models. Whereas LAPOS is quite likely to outperform the Collins tagger, there is an 82% probability that the LAPOS and Stanford taggers will yield practically equivalent results. NLP4J and Flair are both quite likely to outperform earlier models, and Flair is most likely to outperform NLP4J.
There is a 55% chance of practical equivalence between TnT and the Collins tagger for OOV accuracy. This is somewhat surprising because the two models use rather different strategies for OOV inference: TnT estimates hidden Markov model emission probabilities for OOVs using a simple suffix-based heuristic (Brants, 2000, 225f.), whereas the Collins tagger, a discriminatively-trained model, uses sub-word features developed by Ratnaparkhi (1997) to handle rare or unseen words. NLP4J and Flair likewise use distinct OOV modeling strategies. We also find that while LAPOS and the Stanford tagger were practically equivalent in both token and sentence accuracy, the Stanford tagger is likely to outperform LAPOS on OOV words, which could have contributed to the statistical significance observed in the original experiment, as repetition of the k-fold procedure causes strong variation, both in the vocabulary available at training time and in the resulting OOV token sets, between experimental runs.
We note that Bayesian comparison, and the precise quantities it estimates, may give insights into the particular strengths and weaknesses of the various models and evaluation metrics. For instance, we infer that whereas the Collins tagger improves upon TnT, and Flair improves upon NLP4J, in both token and sentence accuracy, these improvements are not likely to be due to differences in the models' handling of out-of-vocabulary words. This is because TnT and the Collins tagger, and NLP4J and Flair, are most likely practically equivalent in their tagging accuracy for OOV words.
Because the Bayesian approach reasons directly about the probability that one method outperforms another, we can do what was not possible in the NHST approach taken by Gorman and Bedrick: we can order methods into a partial ordering to gain insight into which methods are more likely to perform better than others. We can do this based on the modeled likelihoods, but it would not be possible in an NHST framework, because there are currently no multiple-comparison correction procedures that take into account the variance of repeated runs of a method on the same data set.
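Such a partial ordering can be read off the pairwise likelihoods directly; a minimal sketch, in which the threshold and all probability values are hypothetical:

```python
def partial_order(names, p_better, threshold=0.9):
    """Keep an edge a -> b whenever the estimated probability that
    a outperforms b meets the threshold; sub-threshold pairs are
    simply left unordered."""
    return [(a, b) for a in names for b in names
            if a != b and p_better.get((a, b), 0.0) >= threshold]

# hypothetical pairwise likelihoods P(row outperforms column)
p = {("Flair", "TnT"): 0.95, ("NLP4J", "TnT"): 0.92, ("Flair", "NLP4J"): 0.80}
edges = partial_order(["TnT", "NLP4J", "Flair"], p)
```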
Gorman and Bedrick report that LAPOS outperformed the Collins tagger in token accuracy on every Penn Treebank random split (20 out of 20 times), but on only 7 of 20 OntoNotes splits. We find that, when performance is modeled using a hierarchical model on evidence from both data sets jointly, the most likely scenario is that these differences lie within the region of practical equivalence.
We set the interval of practical equivalence of observed accuracies to match the 95% confidence intervals reported by Gorman and Bedrick, to maintain a capacity for comparing the two experimental approaches. However, we believe it is much more useful to have an interpretable and intuitively understandable definition of what practical equivalence means in the experiment. Instead of setting it based on statistical confidence intervals, we recommend selecting the ROPE to represent the scale of human annotator differences, or the error level that does not negatively impact a downstream task that depends on the prediction quality of the evaluated methods.

Conclusions
We compare the performance of six part-of-speech taggers on two data sets using twenty repetitions of a ten-fold cross-validation procedure, with statistical system comparison performed using hierarchical Bayesian models. By sampling from the posterior distribution of these models, we estimate the likelihood that a given tagger will be better than, worse than, or practically equivalent to the other taggers on three different evaluation metrics. These estimates are valid insofar as the data sets used to estimate the Bayesian models comprise a representative sample of a coherent population of data sets. This method provides a principled way to perform statistical model comparison using k-fold cross-validation, a data-efficient evaluation technique. It also allows us to incorporate results obtained across multiple data sets and to make population-level inferences. Finally, we compare the results obtained with the proposed method to those computed using randomly-generated splits and traditional NHST-based model comparison. The results provide new insights into the strengths and weaknesses of English part-of-speech tagging models, complementing other approaches to model comparison and interpretation.

A Visualizations
Visualizations of the posterior samples in Section 5 are shown in Figures 1-3.