Not All Claims are Created Equal: Choosing the Right Statistical Approach to Assess Hypotheses

Empirical research in Natural Language Processing (NLP) has adopted a narrow set of principles for assessing hypotheses, relying mainly on p-value computation, which suffers from several known issues. While alternative proposals have been well-debated and adopted in other fields, they remain rarely discussed or used within the NLP community. We address this gap by contrasting various hypothesis assessment techniques, especially those not commonly used in the field (such as evaluations based on Bayesian inference). Since these statistical techniques differ in the hypotheses they can support, we argue that practitioners should first decide their target hypothesis before choosing an assessment method. This is crucial because common fallacies, misconceptions, and misinterpretation surrounding hypothesis assessment methods often stem from a discrepancy between what one would like to claim versus what the method used actually assesses. Our survey reveals that these issues are omnipresent in the NLP research community. As a step forward, we provide best practices and guidelines tailored to NLP research, as well as an easy-to-use package for Bayesian assessment of hypotheses, complementing existing tools.


Introduction
Empirical fields, such as Natural Language Processing (NLP), must follow scientific principles for assessing hypotheses and drawing conclusions from experiments. For instance, suppose we come across the results in Table 1, summarizing the accuracy of two question-answering (QA) systems S1 and S2 on some datasets. What is the correct way to interpret this empirical observation in terms of the superiority of one system over another? While S1 has higher accuracy than S2 in both cases, the gap is moderate and the datasets are of limited size. Can this apparent difference in performance be explained simply by random chance, or do we have sufficient evidence to conclude that S1 is in fact inherently different (in particular, inherently stronger) than S2 on these datasets? If the latter, can we quantify this gap in inherent strength while accounting for random fluctuation?

Table 1: Accuracy of two QA systems (Devlin et al., 2019; Sun et al., 2018) on the ARC question-answering dataset (Clark et al., 2018). ARC-easy and ARC-challenge have 2376 and 1172 instances, respectively. Acc.: accuracy as a percentage.

Such fundamental questions arise in one form or another in every empirical NLP effort. Researchers often wish to draw conclusions such as:
(Ca) I'm 95% confident that S1 and S2 are inherently different, in the sense that if they were inherently identical, it would be highly unlikely to witness the observed 3.5% empirical gap for ARC-easy.
(Cb) With probability at least 95%, the inherent accuracy of S1 exceeds that of S2 by at least 1% for ARC-easy.
These two conclusions differ in two respects. First, Ca claims the two systems are inherently different, while Cb goes further to claim a margin of at least 1% between their inherent accuracies. The second, more subtle difference lies in the interpretation of the 95% figure: the 95% confidence expressed in Ca is in terms of the space of empirical observations we could have made, given some underlying truth about how the inherent accuracies of S1 and S2 relate; while the 95% probability expressed in Cb is directly over the space of possible inherent accuracies of the two systems.

arXiv:1911.03850v3 [cs.CL] 5 May 2020
To support such a claim, one must turn it into a proper mathematical statement that can be validated using a statistical calculation. This in turn brings in additional choices: we can make at least four statistically distinct hypotheses here, each supported by a different statistical evaluation:
(H1) Assuming S1 and S2 have inherently identical accuracy, the probability (p-value) of making a hypothetical observation with an accuracy gap at least as large as the empirical observation (here, 3.5%) is at most 5% (making us 95% confident that the above assumption is false).
(H2) Assuming S1 and S2 have inherently identical accuracy, the empirical accuracy gap (here, 3.5%) is larger than the maximum possible gap (confidence interval) that could hypothetically be observed with a probability of over 5% (making us 95% confident that the above assumption is false).
(H3) Assume a prior belief (a probability distribution) w.r.t. the inherent accuracy of typical systems. Given the empirically observed accuracies, the probability (posterior interval) that the inherent accuracy of S1 exceeds that of S2 by a margin of 1% is at least 95%.
(H4) Assume a prior belief (a probability distribution) w.r.t. the inherent accuracies of typical systems. Given the empirically observed accuracies, the odds increase by a factor of 1.32 (Bayes factor) in favor of the hypothesis that the inherent accuracy of S1 exceeds that of S2 by a margin of 1%.
As this illustrates, there are multiple ways to formulate empirical hypotheses and support empirical claims. Since each hypothesis starts with a different assumption and makes a (mathematically) different claim, it can only be tested with a certain set of statistical methods. Therefore, NLP practitioners ought to define their target hypothesis before choosing an assessment method.
The most common statistical methodology used in NLP is null-hypothesis significance testing (NHST), which uses p-values (Søgaard et al., 2014; Koehn, 2004; Dror and Reichart, 2018). Hypotheses H1 and H2 can be tested with p-value-based methods, which include confidence intervals and operate over the probability space of observations (footnote 2; §2.1 and §2.2). On the other hand, there are often overlooked approaches, based on Bayesian inference (Kruschke and Liddell, 2018), that can be used to assess hypotheses H3 and H4 (§2.3 and §2.4) and have two broad strengths: they can deal more naturally with accuracy margins and they operate directly over the probability space of inherent accuracy (rather than of observations).

Footnote 2: More precisely, over the probability space of an aggregation function over observations, called the test statistic.

For each technique reviewed in this work, we discuss how it compares with alternatives and summarize common misinterpretations surrounding it (§3). For example, a common misconception about the p-value is that it represents the probability that a hypothesis is valid. While such an interpretation would be desirable, p-values do not in fact provide it (§3.2). It is instead through a Bayesian analysis of the posterior distribution of the test statistic (inherent accuracy in the earlier example) that one can make claims about the probability space of that statistic, such as H3.
We quantify and demonstrate related common malpractices in the field through a manual annotation of 439 ACL-2018 conference papers and a survey filled out by 55 NLP researchers (§4). We highlight surprising findings from the survey, such as the following: while 86% expressed fair-to-complete confidence in the interpretation of p-values, only a small percentage of them correctly answered a basic p-value interpretation question.
Contributions. This work seeks to inform the NLP community about crucial distinctions between various statistical hypotheses and their corresponding assessment methods, helping move the community towards well-substantiated empirical claims and conclusions. Our exposition covers a broader range of methods (§2) than those included in recent related efforts (§1.1), and highlights that these methods achieve different goals. Our survey of NLP researchers reveals problematic trends (§4), emphasizing the need for increased scrutiny and clarity. We conclude by suggesting guidelines for better testing (§5), as well as providing a toolkit called HyBayes (cf. Footnote 1) tailored towards commonly used NLP metrics. We hope this work will encourage an improved understanding of statistical assessment methods and effective reporting practices with measures of uncertainty.

Related Work
While there is an abundant discussion of significance testing in other fields, only a handful of NLP efforts address it. For instance, Chinchor (1992) defined the principles of using hypothesis testing in the context of NLP problems. Most notably, there are works studying various randomized tests (Koehn, 2004; Ojala and Garriga, 2010; Graham et al., 2014), or metric-specific tests (Evert, 2004). More recently, Dror et al. (2018) and Dror and Reichart (2018) provide a thorough review of frequentist tests. While an important step in better informing the community, this review covers only a subset of statistical tools. Our work complements this effort by pointing out alternative tests.
With increasing over-reliance on certain hypothesis testing techniques, there are growing troubling trends of misuse or misinterpretation of such techniques (Goodman, 2008; Demšar, 2008). Some communities, such as statistics and psychology, have even published guidelines and restrictions on the use of p-values (Trafimow and Marks, 2015; Wasserstein et al., 2016). In parallel, some authors have advocated for using alternate paradigms such as Bayesian evaluations (Kruschke, 2010).
NLP is arguably an equally empirical field, yet proper practices of scientific testing, common pitfalls, and various alternatives are rarely discussed. In particular, while limitations of p-values are heavily discussed in statistics and psychology, only a few NLP efforts approach them: over-estimation of significance by model-based tests (Riezler and Maxwell, 2005), lack of independence assumption in practice (Berg-Kirkpatrick et al., 2012), and sensitivity to the choice of the significance level (Søgaard et al., 2014). Our goal is to provide a unifying view of the pitfalls and best practices, and equip NLP researchers with Bayesian hypothesis assessment approaches as an important alternative tool in their toolkit.

Assessment of Hypotheses
We often wish to draw qualitative inferences based on the outcome of experiments (for example, inferring the relative inherent performance of systems). To do so, we usually formulate a hypothesis that can be assessed through some analysis.
Suppose we want to compare two systems on a dataset of instances $\mathbf{x} = [x_1, \ldots, x_n]$ with respect to a measure $M(S, x)$ representing the performance of a system $S$ on an instance $x$.
Let $M(S, \mathbf{x})$ denote the vector $[M(S, x_i)]_{i=1}^{n}$. Given systems $S_1, S_2$, define $y \triangleq [M(S_1, \mathbf{x}), M(S_2, \mathbf{x})]$ as a vector of observations. In a typical NLP experiment, the goal is to infer some inherent and unknown properties of systems. To this end, a practitioner assumes a probability distribution on the observations $y$, parameterized by $\theta$, the properties of the systems. In other words, $y$ is assumed to have a distribution with unknown parameters $\theta$. In this setting, a hypothesis $H$ is a condition on $\theta$. Hypothesis assessment is a way of evaluating the degree to which the observations $y$ are compatible with $H$. The overall process is depicted in Figure 1.
Following our running example, we use the task of answering natural language questions (Clark et al., 2018). While our examples are shown for this particular task, all the ideas are applicable to more general experimental settings.
For this task, the performance metric $M(S, x)$ is defined as a binary function indicating whether a system $S$ answers a given question $x$ correctly or not. The performance vector $M(S, \mathbf{x})$ captures the system's accuracy on the entire dataset (cf. Table 1). We assume that each system $S_i$ has an unknown inherent accuracy value, denoted $\theta_i$. Let $\theta = [\theta_1, \theta_2]$ denote the unknown inherent accuracies of the two systems. In this setup, one might, for instance, be interested in assessing the credibility of the hypothesis $H$ that $\theta_1 < \theta_2$.
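The observation model described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the inherent accuracies `theta_1` and `theta_2` are hypothetical values chosen for the example (in practice they are unknown), the function names are ours, and each question is assumed to be answered correctly independently with probability equal to the system's inherent accuracy.

```python
import random

def simulate_observations(theta, n, seed=0):
    """Draw n binary outcomes M(S, x_i), each correct with probability theta."""
    rng = random.Random(seed)
    return [1 if rng.random() < theta else 0 for _ in range(n)]

def empirical_accuracy(outcomes):
    """Fraction of correctly answered questions."""
    return sum(outcomes) / len(outcomes)

# Hypothetical inherent accuracies; real experiments never observe these.
theta_1, theta_2 = 0.70, 0.67
n = 2376  # number of ARC-easy instances, as stated in the text

# y = [M(S_1, x), M(S_2, x)]: one binary vector per system.
y = [
    simulate_observations(theta_1, n, seed=1),
    simulate_observations(theta_2, n, seed=2),
]
print([round(empirical_accuracy(v), 3) for v in y])
```

Rerunning with different seeds shows how much the empirical gap fluctuates even though the inherent accuracies stay fixed, which is the fluctuation that hypothesis assessment has to account for.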
Table 2 shows a categorization of statistical tools developed for the assessment of such hypotheses. The two tools on the left are based on frequentist statistics, while the ones on the right are based on Bayesian inference (Kruschke and Liddell, 2018). A complementary categorization of these tools is based on the nature of the results that they provide: the ones on the top encourage binary decision making, while those on the bottom provide uncertainty around estimates. We discuss all four classes of tests in the following sub-sections.

Null-Hypothesis Significance Testing
In frequentist hypothesis testing, there is an asymmetric relationship between two hypotheses. The hypothesis formulated to be rejected is usually called the null-hypothesis $H_0$. For instance, in our example $H_0: \theta_1 = \theta_2$. A decision procedure is devised by which, depending on $y$, the null-hypothesis will either be rejected in favor of $H_1$, or the test will stay undecided.
A key notion here is the p-value: the probability, under the null-hypothesis $H_0$, of observing an outcome at least as extreme as the empirical observations $y$. To apply this notion to a set of observations $y$, one has to define a function that maps $y$ to a numerical value. This function is called the test statistic $\delta(\cdot)$ and it formalizes the interpretation of extremeness. Concretely, the p-value is defined as
$$\text{p-value}(y) \triangleq P\big(\delta(Y) \geq \delta(y) \mid H_0\big).$$
In this notation, $Y$ is a random variable over possible observations and $\delta(y)$ is the empirically observed value of the test statistic.
A large p-value implies that the data could easily have been observed under the null-hypothesis. Therefore, a lower p-value is used as evidence towards rejecting the null-hypothesis.
Example 1 (Assessment of H1) We test a null-hypothesis about the accuracy of the two systems (Table 1) using a one-sided z-test (footnote a), with a test statistic $\delta(y)$ based on the difference between the two systems' empirical accuracies. We formulate a null-hypothesis against the claim of $S_1$ having strictly better accuracy than $S_2$. This results in a p-value of 0.0037 (details in §A.1), which can be interpreted as follows: if the systems have inherently identical accuracy values, the probability of observing a superiority at least as extreme as our observations is 0.0037. For a significance level of 0.05 (picked before the test), this p-value is small enough to reject the null-hypothesis.
Footnote a: The choice of this test is based on an implicit assumption that the events corresponding to answering two distinct questions are independent with identical probability, i.e., equal to the inherent accuracy of the system. Hence, the number of correct answers follows a binomial distribution. Since the total number of questions is large, i.e., 2376 in ARC-easy, this distribution can be approximated with a normal distribution. It is possible to use other tests with less restrictive assumptions (see Dror et al. (2018)), but for the sake of simplicity we use this test to illustrate the core ideas of p-value analysis.
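To make the mechanics of Example 1 concrete, the following is a minimal sketch of a one-sided z-test for the difference of two proportions, using the normal approximation described in the footnote. It is not the exact computation from §A.1: the pooled-variance form of the test and the function name are our choices, and the counts of correct answers are hypothetical (chosen only to give a gap of roughly 3.5% on 2376 questions), so the resulting p-value is not expected to match 0.0037.

```python
import math

def one_sided_z_test(correct_1, correct_2, n):
    """p-value for H0: theta_1 = theta_2 against H1: theta_1 > theta_2.

    Both systems are evaluated on n questions. Uses the pooled-variance
    normal approximation to the difference of two binomial proportions.
    """
    acc_1, acc_2 = correct_1 / n, correct_2 / n
    pooled = (correct_1 + correct_2) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if std_err == 0:
        raise ValueError("Degenerate data: both systems are all-correct or all-wrong.")
    z = (acc_1 - acc_2) / std_err
    # Upper tail of the standard normal distribution: P(Z >= z).
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value

# Hypothetical counts of correct answers (not the values behind Table 1).
z, p = one_sided_z_test(correct_1=1663, correct_2=1580, n=2376)
print(f"z = {z:.3f}, p-value = {p:.4f}")
```

The returned p-value is a statement about hypothetical observations under the null-hypothesis, not about the probability that the null-hypothesis itself is true.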
This family of tests is thus far the most widely used tool in NLP research. Each variant of this test is based on some assumptions about the distribution of the observations under the null-hypothesis, and on an appropriate definition of the test statistic $\delta(\cdot)$. Since a complete exposition of such tests is outside the scope of this work, we encourage interested readers to refer to existing reviews, such as Dror et al. (2018).

Confidence Intervals
Confidence Intervals (CIs) are used to express the uncertainty of estimated parameters. In particular, the 95% CI is the range of values for the parameter $\theta$ such that the corresponding p-value-based test is not rejected, that is, the set of values of $\theta$ for which the p-value computed under the corresponding null-hypothesis is at least 0.05. In other words, the confidence interval merely asks which values of the parameter $\theta$ could be used before the test is rejected.
Example 2 (Assessment of H2) Consider the same setting as in Example 1. According to Table 1, the estimated value of the accuracy difference (maximum-likelihood estimate) is $\theta_1 - \theta_2 = 0.035$. A 95% CI of this quantity provides a range of values that are not rejected under the corresponding null-hypothesis. The blue bar in Figure 2 (right) shows the corresponding CI. Notice that the conclusion of Example 1 is compatible with this CI: the null-hypothesis $\theta_1 = \theta_2$, which got rejected, is not included in the CI.
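As an illustration of how such an interval can be computed, here is a sketch of a Wald-style 95% confidence interval for the difference of two accuracies. This is an assumption on our part rather than the paper's exact procedure: it uses the unpooled normal approximation with the two-sided critical value 1.96, and the counts are the same hypothetical ones used for illustration elsewhere, not the values behind Table 1.

```python
import math

def diff_confidence_interval(correct_1, correct_2, n, z_crit=1.96):
    """Wald confidence interval for theta_1 - theta_2.

    z_crit = 1.96 corresponds to a two-sided 95% interval under the
    normal approximation to the two binomial proportions.
    """
    acc_1, acc_2 = correct_1 / n, correct_2 / n
    diff = acc_1 - acc_2
    std_err = math.sqrt(acc_1 * (1 - acc_1) / n + acc_2 * (1 - acc_2) / n)
    return diff - z_crit * std_err, diff + z_crit * std_err

# Hypothetical counts of correct answers on 2376 questions.
low, high = diff_confidence_interval(correct_1=1663, correct_2=1580, n=2376)
print(f"95% CI for the accuracy difference: ({low:.4f}, {high:.4f})")
print("zero difference rejected:", not (low <= 0.0 <= high))
```

Checking whether zero lies inside the interval mirrors the compatibility noted in Example 2: a null value that is rejected by the test falls outside the corresponding CI.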

Posterior Intervals
Bayesian methods focus on prior and posterior distributions of $\theta$. Recall that in a typical NLP experiment, these parameters can be, e.g., the actual mean or standard deviation for the performance of a system, as its inherent and unobserved property.
In Bayesian inference frameworks, a priori assumptions and beliefs are encoded in the form of a prior distribution $P(\theta)$ on the parameters of the model. In other words, a prior distribution describes the common belief about the parameters of the model. It also implies a distribution over possible observations. For assessing hypotheses H3 and H4 in our running example, we will simply use the uniform prior, i.e., the inherent accuracy is uniformly distributed over [0, 1]. This corresponds to having no prior belief about how high or low the inherent accuracy of a typical QA system may be.
In general, the choice of this prior can be viewed as a compromise between the beliefs of the analyzer and those of the audience. The above uniform prior, which is equivalent to the Beta(1,1) distribution, is completely non-committal and thus best suited for a broad audience who has no reason to believe an inherent accuracy of 0.8 is more likely than 0.3. For a moderately informed audience that already believes the inherent accuracy is likely to be widely distributed but centered around 0.67, the analyzer may use a Beta(3,1.5) prior to evaluate a hypothesis. Similarly, for an audience that already believes the inherent accuracy to be highly peaked around 0.75, the analyzer may want to use a Beta(9,3) prior.
Formally, one incorporates $\theta$ in a hierarchical model in the form of a likelihood function $P(y \mid \theta)$. This explicitly models the underlying process that connects the latent parameters to the observations. Consequently, a posterior distribution is inferred using the Bayes rule and conditioned on the observations:
$$P(\theta \mid y) = \frac{P(y \mid \theta)\, P(\theta)}{P(y)}.$$
The posterior distribution is a combined summary of the data and prior information about likely values of $\theta$. The mode of the posterior (maximum a posteriori) can be seen as an estimate for $\theta$. Additionally, the posterior can be used to describe the uncertainty around the mode.
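For the special case of a Beta prior on a single accuracy combined with a binomial likelihood, Bayes' rule has a well-known closed form: a Beta(a, b) prior updated with k correct answers out of n yields a Beta(a + k, b + n − k) posterior. The sketch below applies this to the three priors mentioned above. It is a simplified single-system illustration, not the hierarchical model of §A.3; the function names and the observed counts are hypothetical.

```python
def beta_posterior(prior_a, prior_b, num_correct, num_total):
    """Conjugate update: Beta prior + binomial likelihood -> Beta posterior."""
    return prior_a + num_correct, prior_b + (num_total - num_correct)

def beta_mean(a, b):
    """Mean of a Beta(a, b) distribution."""
    return a / (a + b)

# Priors discussed in the text: non-committal, moderately informed, peaked.
priors = {"Beta(1,1)": (1, 1), "Beta(3,1.5)": (3, 1.5), "Beta(9,3)": (9, 3)}

# Hypothetical observations: 1663 correct answers out of 2376 questions.
num_correct, num_total = 1663, 2376

for name, (a, b) in priors.items():
    post_a, post_b = beta_posterior(a, b, num_correct, num_total)
    print(
        f"{name}: prior mean = {beta_mean(a, b):.3f}, "
        f"posterior = Beta({post_a}, {post_b}), "
        f"posterior mean = {beta_mean(post_a, post_b):.4f}"
    )
```

With a few thousand observations the posterior means under the three priors nearly coincide, which illustrates why posterior estimates tend to be only mildly sensitive to the prior when data are plentiful.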
While the posterior distribution can be analytically calculated for simple models, it is not so straightforward for general models. Fortunately, recent advances in hardware, Markov Chain Monte Carlo (MCMC) techniques (Metropolis et al., 1953; Gamerman and Lopes, 2006), and probabilistic programming allow sufficiently accurate numerical approximations of posteriors.
One way to summarize the uncertainty around the point estimate of parameters is by marking the span of values that cover α% of the most-credible density in the posterior distribution (e.g., α = 95%). Such spans are called Highest Density Intervals (HDIs) or Bayesian Confidence Intervals (Oliphant, 2006) (not to be confused with CIs, in §2.2).
Recall that a hypothesis $H$ is a condition on $\theta$ (see Figure 1). Therefore, given the posterior $P(\theta \mid y)$, one can calculate the probability of $H$, as a probabilistic event, conditioned on $y$: $P(H \mid y)$.
For example, in an unpaired t-test, $H_0$ is the event that the means of two groups are equal. Bayesian statisticians usually relax this strict equality $\theta_1 = \theta_2$ and instead evaluate the credibility of $|\theta_1 - \theta_2| < \varepsilon$ for some small value of $\varepsilon$. The intuition is that when $\theta_1$ and $\theta_2$ are close enough they are practically equivalent. This motivates the definition of the Region Of Practical Equivalence (ROPE): an interval around zero with "negligible" radius. The boundaries of the ROPE depend on the application, the meaning of the parameters, and the audience. In our running example, a radius of one percent for the ROPE implies that improvements of less than 1 percent are not considered notable. For a discussion on setting the ROPE, see Kruschke (2018).
These concepts give researchers the flexibility to define and assess a wide range of hypotheses. For instance, we can address H3 (from the Introduction) and its different variations that can be of interest depending on the application. The analysis of H3 is depicted in Figure 2 and explained next.
Example 3 (Assessment of H3) Recall the setting from previous examples. The left panel of Figure 2 shows the prior on the latent accuracy of the systems and their differences (further details on the hierarchical model in §A.3). We then obtain the posterior distribution (Figure 2, right), in this case via numerical methods.
Notice that one can read the following conclusion: with probability 0.996, the hypothesis H3 (with a margin of 0%) holds true. As explained in §C.2, this statement does not imply any difference with a notable margin. In fact, the posterior in Figure 2 implies that this experiment is not sufficient to claim the following: with probability at least 0.95, hypothesis H3 (with a margin of 1%) holds true. This is the case since the ROPE (−0.01, 0.01) overlaps with the 95% HDI (0.00939, 0.0612).
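The HDI and ROPE reasoning used in Example 3 can be sketched with plain Monte Carlo sampling. The code below is a simplified stand-in, not the HyBayes implementation: it assumes independent uniform priors on the two accuracies (so each posterior is a Beta distribution), uses hypothetical counts of correct answers, and estimates the HDI as the shortest interval containing 95% of the sampled differences. All function names are ours.

```python
import random

def sample_accuracy_difference(correct_1, correct_2, n, num_samples=20000, seed=0):
    """Posterior samples of theta_1 - theta_2 under independent uniform priors."""
    rng = random.Random(seed)
    return [
        rng.betavariate(1 + correct_1, 1 + n - correct_1)
        - rng.betavariate(1 + correct_2, 1 + n - correct_2)
        for _ in range(num_samples)
    ]

def highest_density_interval(samples, mass=0.95):
    """Shortest interval that contains a `mass` fraction of the samples."""
    ordered = sorted(samples)
    window = min(len(ordered), max(1, round(mass * len(ordered))))
    starts = range(len(ordered) - window + 1)
    best = min(starts, key=lambda i: ordered[i + window - 1] - ordered[i])
    return ordered[best], ordered[best + window - 1]

def compare_hdi_to_rope(hdi, rope):
    """Three-way decision: HDI inside ROPE, disjoint from it, or overlapping."""
    (lo, hi), (rope_lo, rope_hi) = hdi, rope
    if rope_lo <= lo and hi <= rope_hi:
        return "practically equivalent"
    if hi < rope_lo or lo > rope_hi:
        return "notable difference"
    return "undecided"

# Hypothetical counts of correct answers on 2376 questions.
diffs = sample_accuracy_difference(correct_1=1663, correct_2=1580, n=2376)
hdi = highest_density_interval(diffs, mass=0.95)
prob_margin = sum(d > 0.01 for d in diffs) / len(diffs)
print("95% HDI:", hdi)
print("P(theta_1 - theta_2 > 0.01 | y) is approximately", prob_margin)
print("decision:", compare_hdi_to_rope(hdi, rope=(-0.01, 0.01)))
```

Applied to the interval reported in Example 3, `compare_hdi_to_rope((0.00939, 0.0612), (-0.01, 0.01))` returns "undecided", matching the conclusion that the experiment does not support a 1% margin at the 95% level.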

Bayes Factor
A common tool among Bayesian frameworks is the notion of the Bayes Factor (footnote 9). Intuitively, it compares how the observations $y$ shift the credibility of the two competing hypotheses from prior to posterior:
$$BF_{01} = \frac{P(H_0 \mid y)}{P(H_1 \mid y)} \Big/ \frac{P(H_0)}{P(H_1)}.$$
If $BF_{01}$ equals 1, then the data provide equal support for the two hypotheses and there is no reason to change our a priori opinion about the relative likelihood of the two hypotheses. A Bayes Factor smaller than 1 is an indication for rejecting the null-hypothesis $H_0$. If it is greater than 1, then there is support for the null-hypothesis and we should infer that the odds are in favor of $H_0$.
Notice that the symmetric nature of the Bayes Factor allows all three outcomes of "accept", "reject", and "undecided," as opposed to the definition of the p-value, which cannot accept a hypothesis.
Footnote 9: "Bayesian Hypothesis Testing" usually refers to arguments based on the Bayes Factor. However, as shown in §2.3, there are other Bayesian approaches for assessing hypotheses.
Example 4 (Assessment of H4) ... The resulting value is very close to 1, which means that this observation does not change our prior belief about the difference between the two systems.
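To show how a Bayes Factor can be computed in a simple case, the sketch below considers a point null-hypothesis, H0: θ1 = θ2 with a shared uniform prior, against H1: θ1 and θ2 drawn independently from uniform priors. This is a swapped-in textbook variant, not the margin-based hypothesis H4 from the running example. With binomial likelihoods, both marginal likelihoods reduce to Beta functions and the binomial coefficients cancel in the ratio. The function names are ours.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bayes_factor_01(correct_1, correct_2, n):
    """BF_01 for H0: theta_1 = theta_2 (shared uniform prior) versus
    H1: theta_1 and theta_2 independent with uniform priors."""
    total_correct = correct_1 + correct_2
    log_m0 = log_beta(total_correct + 1, 2 * n - total_correct + 1)
    log_m1 = (log_beta(correct_1 + 1, n - correct_1 + 1)
              + log_beta(correct_2 + 1, n - correct_2 + 1))
    return math.exp(log_m0 - log_m1)

def posterior_odds(prior_odds, bayes_factor):
    """Posterior odds of H0 over H1 are the prior odds scaled by BF_01."""
    return prior_odds * bayes_factor

# Tiny worked cases: one question answered by each system.
print(bayes_factor_01(1, 1, 1))  # both correct: mild support for H0
print(bayes_factor_01(1, 0, 1))  # systems disagree: mild support for H1
```

Changing the priors under H1 changes the marginal likelihood in the denominator, which is one way to see why Bayes Factors are sensitive to the choice of prior.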

Comparisons
Many aspects influence the choice of an approach to assess the significance of hypotheses. This section provides a comparative summary, with details in Appendix C and an overall summary in Table 3.

Susceptibility to Misinterpretation
The complexity of interpreting significance tests, combined with insufficient reporting, could result in ambiguous or misleading conclusions. This ambiguity can confuse not only the authors but also the readers of their papers.
While p-values (§2.1) are the most common approach, they are inherently complex, which makes them easier to misinterpret (see examples in §C.1). Interpretation of confidence intervals (§2.2) can also be challenging since they are an extension of the p-value (Hoekstra et al., 2014). Approaches that provide measures of uncertainty directly in the hypothesis space (like the ones in §2.3) are often more natural choices for reporting the results of experiments (Kruschke and Liddell, 2018).

Measures of Certainty
A key difference is that not all methods studied here provide a measure of uncertainty over the hypothesis space. For instance, p-values (§2.1) do not provide probability estimates on two systems being different (or equal) (Goodman, 2008). On the contrary, they encourage binary thinking (Gelman, 2013), that is, confidently concluding that one system is better than another, without taking into account the extent of the difference between the systems. CIs (§2.2) provide a range of values for the target parameter. However, this range also does not have any probabilistic interpretation in the hypothesis space (du Prel et al., 2009). On the other hand, posterior intervals (§2.3) generally provide a useful summary as they capture probabilistic estimates of the correctness of the hypothesis.

Dependence on Stopping Intention
The process by which samples in the test are collected can affect the outcome of a test. For instance, the sample size n (whether it is determined before the process of gathering information begins, or is a random variable itself) can change the result. Once observations are recorded, this distinction is usually ignored. Hence, testing algorithms that do not depend on the distribution of n are more desirable. Unfortunately, the definition of the p-value (§2.1) depends on the distribution of n. For instance, Kruschke (2010, §11.1) provides examples where this subtlety can change the outcome of a test, even when the final set of observations is identical.
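A classic textbook illustration of this dependence (not taken from this paper) uses a fair-coin null-hypothesis and the observation of 9 heads and 3 tails; heads can be read as correct answers and tails as errors. The one-sided p-value differs depending on whether the experimenter intended to stop after 12 flips or after the third tail, even though the recorded data are identical. The function names below are ours.

```python
from fractions import Fraction
from math import comb

def p_value_fixed_n(heads, flips):
    """P(X >= heads) for a fair coin when the number of flips was fixed
    in advance (binomial sampling)."""
    favorable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return Fraction(favorable, 2 ** flips)

def p_value_fixed_tails(heads, tails):
    """P(at least `heads` heads occur before the `tails`-th tail) for a fair
    coin (negative binomial sampling).

    Equivalent to seeing at most tails - 1 tails in the first
    heads + tails - 1 flips.
    """
    flips = heads + tails - 1
    favorable = sum(comb(flips, k) for k in range(tails))
    return Fraction(favorable, 2 ** flips)

# Same data (9 heads, 3 tails), two different stopping intentions.
p_binomial = p_value_fixed_n(heads=9, flips=12)
p_negative_binomial = p_value_fixed_tails(heads=9, tails=3)
print(p_binomial, float(p_binomial))                    # 299/4096, about 0.073
print(p_negative_binomial, float(p_negative_binomial))  # 67/2048, about 0.033
```

At a significance level of 0.05, the first intention does not reject the null-hypothesis while the second does, which is exactly the kind of subtlety described above.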

Sensitivity to the Choice of Prior
The choice of the prior can change the outcome of Bayesian approaches (§2.3 and §2.4). Decisions based on the Bayes Factor (§2.4) are known to be sensitive to the choice of prior, while posterior estimates (§2.3) are less so. For further discussion, see §C.4 or refer to discussions by Sinharay and Stern (2002), Liu and Aitkin (2008), or Dienes (2008).

Current Trends and Malpractices
This section highlights common practices relevant to our target approaches. To better understand the common practices or misinterpretations in the field, we conducted a survey. We shared the survey among ∼450 NLP researchers (randomly selected from ACL'18 Proceedings), of whom 55 individuals filled out the survey. While similar surveys have been performed in other fields (Windish et al., 2007), this is the first in the NLP community, to the best of our knowledge. Here we review the main highlights (see Appendix for more details and charts).
Interpreting p-values. While the majority of the participants have a self-claimed ability to interpret p-values (Figure 9f), many choose the imprecise interpretation "The probability of the observation this extreme happening due to pure chance" (the popular choice) over the more precise statement "Conditioned on the null hypothesis, the probability of the observation this extreme happening."
The use of CIs. Even though 95% of the participants self-claimed knowledge of CIs (Figure 9e), CIs are rarely used in practice. In an annotation of ACL'18 papers done by two of the authors, only 6 (out of 439) papers were found to use CIs.
The use of Bayes Factors. A majority of the participants had "heard" about "Bayesian Hypothesis Testing" but did not know the definition of "Bayes Factor" (Figure 3). HDIs (discussed in §2.3) were the least known. We did not find any papers in ACL'18 that use Bayesian tools.
[Figure 3: survey responses to the following questions. "Have you heard about 'Bayesian Hypothesis Testing'?"; "I have used 'hypothesis testing' in the past (in a homework, a paper, etc)."; "Do you know the definition of 'Bayes Factor'?"; "Do you know the definition of 'Highest Density Interval'?"]

The use of "significan*". A notable portion of NLP papers express their findings by using the term "significant" (e.g., "our approach significantly improves over X."). Almost all ACL'18 papers use the term "significant" somewhere. Unfortunately, there is no single universal interpretation of such phrases across readers. In our survey, we observe that when participants read "X significantly improves Y" in the abstract of a hypothetical paper:
1. About 82% expect the claim to be backed by "hypothesis testing"; however, only 57% expect notable empirical improvement (see Q3 in Appendix B);
2. About 35% expect the paper to test "practical significance", which is not generally assessed by popular tests (see §C.2);
3. A few also expect a theoretical argument.
Recent trends. Table 3 provides a summary of the techniques studied here. We make two key observations: (i) many papers do not use any hypothesis assessment method and would benefit from one; (ii) as the final column shows, p-value-based techniques clearly dominate the field, in clear disregard of the advantages that the bottom two alternatives offer.

Recommended Practices
Having discussed common issues, we provide a collection of recommendations (in addition to prior recommendations, such as those by Dror et al. (2018)). The first step is to define your goal. Each of the tools in §2 provides a distinct set of information. Therefore, you need to formalize a hypothesis and, consequently, the question you intend to answer by assessing this hypothesis. Here are four representative questions, one for each method:
1. Assuming that the null-hypothesis is true, is it likely to witness observations this extreme? (§2.1)
2. How much can my null-hypothesis deviate from the mean of the observations before a p-value argument rejects it? (§2.2)
3. Having observed the observations, how probable is my claimed hypothesis? (§2.3)
4. By observing the data, how much do the odds increase in favor of the hypothesis? (§2.4)
If you decide to use frequentist tests:
• Check if your setting is compatible with the assumptions of the test. In particular, investigate if the meaning of the null-hypothesis and the sampling distribution match the experimental setting.
• Include a summary of the above investigation. Justify unresolved assumption mismatches.
• Statements reporting p-values and confidence intervals must be precise enough so that the results are not misinterpreted (see §3.1).
• The term "significant" should be used with caution and clear purpose to avoid misinterpretations (see §4). One way to achieve this is by using the adjectives "statistical" or "practical" before any (possibly inflected) usage of "significance."
• Often, the desired conclusion is a notable margin in the superiority of one system over another (see §3). In such cases, a pointwise p-value argument is not sufficient; a confidence interval analysis is needed. If a CI is inapplicable for some reason, this should be mentioned.
If you decide to use Bayesian approaches:
• Since Bayesian tests are less known, it is better to provide a short motivation for the usage.
• Comment on the certainty (or the lack thereof) of your inference in terms of HDI and ROPE: (I) the HDI is completely inside the ROPE, (II) they are completely disjoint, or (III) the HDI contains values both inside and outside the ROPE (see §2.3).
• For reproducibility, include further details about your test: MCMC traces, convergence plots, etc. (Our HyBayes package provides all of this.)
• Be wary that the Bayes Factor is highly sensitive to the choice of prior (see §3.4). See Appendix §C.4 for possible ways to mitigate this.

Table 4: Select models supported by our package HyBayes at the time of this publication.
Count observations: The count of certain patterns an algorithm could find in a big pool, in a fixed amount of time. Notice that you can't convert this into a ratio form, since there is no well-defined denominator. Ex: measuring how many questions could be answered correctly (from an infinite pool of questions) by a particular QA system in a limited amount of time (the system is allowed to skip questions too).
Model parameters: For each group: mu ∈ R and sigma ∈ R+. Shared between groups: thresholds between possible levels.
Ordinal observations: Collection of objects/labels arranged in a certain ordering, not necessarily with a metric distance between them; for example sentiment labels (https://www.aclweb.org/anthology/S16-1001.pdf), product review categories, grammaticality of sentences.
Frequentist tests: bootstrap / permutation.
Assumption 1: The observations are distributed as a Student's t distribution with unknown normality parameter (a normal distribution with potentially longer tails).
Assumption 2: The observations from each group are assumed to be i.i.d., conditioned on the inherent characteristics of the two systems.
Assumption 3: The total number of instances (the denominators) is known.
Assumption 4: The variable is inherently continuous, or the granularity (the denominator) is high enough to treat the variable as continuous.
Assumption 6: The observations follow a binomial distribution.
Assumption 7: The observations follow a normal distribution.
* In this model (unlike the frequentist t-test), outliers don't need to be discarded manually to realize the strict normality assumption.

Package HyBayes
We provide an accompanying package, HyBayes, to facilitate comparing systems using the two Bayesian hypothesis assessment approaches discussed earlier: (a) posterior probabilities and (b) Bayes Factors. (Several packages are already available for frequentist assessments.) Table 4 summarizes common settings in which HyBayes can be employed (footnote 11) in NLP research, including typical use cases, underlying data assumptions, recommended hierarchical model, metrics (accuracy, exact match, etc.), and frequentist tests generally used in these cases. These settings cover several typical assumptions on observed NLP data. However, if a user has specific information on observations or can capitalize on other assumptions, we recommend adding a custom model, which can be done relatively easily.
Footnote 11: These settings are available at the time of this publication, with more options likely to be added in the future.

Conclusion
Using well-founded mechanisms for assessing the validity of hypotheses is crucial for any field that relies on empirical work. Our survey indicates that the NLP community is not fully utilizing scientific methods geared towards such assessment, with only a relatively small number of papers using such methods, and most of them relying on p-values.
Our goal was to review different alternatives, especially a few often ignored in NLP. We surfaced various issues and potential dangers of careless use and interpretation of different approaches. We do not recommend a particular approach, since every technique has its own weaknesses. Hence, a researcher should pick the right approach according to their needs and intentions, with a proper understanding of the techniques. Incorrect use of any technique can result in misleading conclusions.
We contribute a new toolkit, HyBayes, to make it easy for NLP practitioners to use Bayesian assessment in their efforts. We hope that this work provides a complementary picture of hypothesis assessment techniques for the field and encourages more rigorous reporting trends.

At the higher level, θi is assumed to follow a uniform distribution in [0, 1]. In a later section below, we show how a Beta distribution with slightly different parameter values can be used instead, as a generalization.
To run this analysis with HyBayes, one can run the following command:
    python -m HyBayes \
        --config config_QABinomial.ini
where config_QABinomial.ini is a configuration file that defines a hierarchical model and the locations of the observations. This particular config file corresponds to the 2nd row of Table 4. For more information and examples, please refer to the manual of our released package (footnote 1). With this command, HyBayes runs a sampling-based Bayesian inference on the observed data, using the specified hierarchical model. The results should look like Fig. 2.
Generalizations. It is possible to take previous observations in the literature into account and start with a non-uniform prior. For example, setting α and β to 1/2 incorporates the idea that the performances are generally closer to 0 or 1, whereas α = β = 2 lowers the probability of such extremes. Notice that, as long as θ1 and θ2 are assumed to follow the same distribution, the prior probability that S1 is better than S2 is 0.5, as expected from any fair prior.
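Both properties of these Beta priors can be checked by simulation. The sketch below uses only the Python standard library; the function name `prior_summary` and the cutoffs 0.1 and 0.9 used to define "extreme" accuracies are assumptions made for illustration.

```python
import random

def prior_summary(alpha, beta, n=20000, seed=0):
    """Sample two accuracies from the same Beta(alpha, beta) prior.

    Returns (estimate of P(theta1 > theta2), share of draws below 0.1 or above 0.9).
    """
    rng = random.Random(seed)
    theta1 = [rng.betavariate(alpha, beta) for _ in range(n)]
    theta2 = [rng.betavariate(alpha, beta) for _ in range(n)]
    p_better = sum(a > b for a, b in zip(theta1, theta2)) / n
    extreme = sum(x < 0.1 or x > 0.9 for x in theta1) / n
    return p_better, extreme

for a in (0.5, 1.0, 2.0):
    p_better, extreme = prior_summary(a, a)
    print(f"alpha=beta={a}: P(theta1>theta2)~{p_better:.3f}, extreme mass~{extreme:.3f}")
```

The first number should stay close to 0.5 for every symmetric choice, while the share of extreme values shrinks as α = β grows from 1/2 to 2.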
New observations. The researcher might proceed with performing another experiment with another dataset. Fig. 5 shows the posterior given both observations from the performances on the "easy" and "challenge" datasets. Notice that in this case, the HDI lies entirely above a two percent superiority of the accuracy of S1 over S2. This means that one can make the following statement: With probability 0.95, system S1's accuracy is at least two percent higher than that of S2. In this case, it seems acceptable to informally state this claim as: system S1 practically significantly outperforms system S2.
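A claim of the form "with probability p, S1's accuracy exceeds that of S2 by at least a margin" can be approximated by sampling from the posteriors. The sketch below is a simplification of the hierarchical model: it assumes independent Beta(1, 1) priors and a binomial likelihood for each system, so the posteriors are available in closed form without MCMC. The function name `prob_margin` and the counts in the usage line are hypothetical and are not the numbers of Table 1.

```python
import random

def prob_margin(k1, n1, k2, n2, margin, a=1.0, b=1.0, draws=20000, seed=0):
    """Estimate P(theta1 - theta2 > margin | data) under conjugate Beta posteriors.

    k1/n1 and k2/n2 are correct/total counts for the two systems;
    Beta(a, b) is the shared prior (uniform by default).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(draws):
        theta1 = rng.betavariate(a + k1, b + n1 - k1)
        theta2 = rng.betavariate(a + k2, b + n2 - k2)
        hits += (theta1 - theta2) > margin
    return hits / draws

# Hypothetical counts: 720/1000 correct for S1 and 680/1000 for S2.
print(prob_margin(720, 1000, 680, 1000, margin=0.0))
print(prob_margin(720, 1000, 680, 1000, margin=0.02))
```

The second probability is lower than the first, since requiring a two percent margin is a stronger claim than strict superiority.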

B Additional Details About the Survey in Section 4
Below we include three questions from the survey discussed in Section 4 that measure participants' interpretation of various terms or statistical tools. Next to each question, we show the distribution of the collected responses. Q1 & Q2 reveal that individuals have substantial difficulties in interpreting p-values. In Q1, while (b) is the popular response, the correct answer is the much less popular option (c). In Q2, while (d) is the most popular response, the correct answer is (c).
For Q3, as discussed in §4, researchers often find it difficult to interpret statements that contain "significant". Therefore, we must be extra cautious about how this loaded keyword is used.
Q1: An NLP paper shows a performance of 38% for a classifier. They also show that adding a feature F improves the performance to 39%. In this finding, the authors have stated that p-value > 0.05. This means:
(a) The probability is greater than 1 in 20 that an improvement would be found again, if the study were repeated.
(b) The probability is greater than 1 in 20 that an improvement this large could occur by chance alone.
(c) The probability is less than 1 in 20 that an improvement this large could occur by chance alone.
(d) The chance is 95% that the study is correct.

Q2: An NLP paper shows a performance of 38% for a classifier-1. They also show that adding a feature improves the performance to 45% (call this classifier-2). The authors claim that this finding is "statistically significant" with a significance level of 0.01. Which of the following(s) make sense?
(a) With probability at least 0.01 classifier-2 will have better results than classifier-1, if we repeat this experiment.
(b) With probability 0.01 the observations are sampled from two equal classifiers.
(c) The probability of observing a margin 7% is at most 0.01, assuming that the two classifiers inherently have the same performance.
(d) If we repeat the experiment, with a probability 99% classifier-2 will have a higher performance than classifier-1.
(e) If we repeat the experiment, with a probability 99% classifier-2 will have a performance margin of 7% compared to classifier-1.
(f) I don't know, because I don't know how hypothesis testing is done.
(g) Don't have time. Skipping this question.

Q3: An NLP paper presents system-1 and it compares it with a baseline system-2. In its "abstract" it writes: " ... system-1 significantly improves over system-2." What are the right way(s) to interpret this (select all that applies)
(a) It is expected that authors have performed some type of "hypothesis testing."
(b) It is expected that the authors have reported the performances of two systems on a dataset where system-1 has a higher performance than system-2 with a notable margin in the dataset.
(c) It is expected that the authors have reported experiments and concluded that system-1, inherently, has a higher perf. than system-2 with a notable margin.
(d) It is expected that the authors conclude through theoretical arguments that, system-1 has a higher inherent perf. than system-2 with a notable margin.
(e) Even though not expected, it is acceptable that the authors are concluding, through some theoretical arguments, that system-1 has a higher inherent performance than system-2 with a notable margin.
(f) Too tedious. Skipping this question.

Responses to the remaining questions are summarized in Figure 9.
Figure 9 (a): Current positions of the participants in our survey.

C Comparison, Misinterpretations, and Fallacies
This is an extension of the ideas discussed in Section 3.

C.1 Susceptibility to Misinterpretation
The complexity of interpreting significance tests, combined with insufficient reporting (as shown by Dror et al. (2018)), could result in ambiguous or misleading conclusions. Not only are many papers unclear about their tests, but their results could also be misinterpreted by readers. Among the techniques studied here, p-values have received by far the most criticism: although they are the most common approach, their complex definition makes them easy to misinterpret. Here are a few common misinterpretations (Demšar, 2008; Goodman, 2008):
• Misconception #1: If p < 0.05, the null-hypothesis has only a 5% chance of being true: To see that this is false, remember that the p-value is defined under the assumption that the null-hypothesis is correct (Eq. 1).
• Misconception #2: If p > 0.05, there is no difference between the two systems: A large p-value only means that the null-hypothesis is consistent with the observations; it says nothing about how likely the null-hypothesis is.
• Misconception #3: A statistically significant result (p < 0.05) indicates a large/notable difference between two systems: The p-value only indicates strict superiority and provides no information about the margin of the effect.
Interpretation of CIs can also be challenging, since it is an extension of p-value (Hoekstra et al., 2014). Tests that provide measures of uncertainty (like the ones in §2.3) are more natural choices for reporting the results of experiments (Kruschke and Liddell, 2018). Overall, our view on "ease of interpretation" of the approaches is summarized in Table 3.
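Misconception #1 can be made concrete with Bayes' rule. The probability that the null-hypothesis is true given a significant result depends on quantities the p-value does not contain: how often the null-hypothesis holds a priori and how often the test detects a real difference (its power). The sketch below is a back-of-the-envelope calculation under hypothetical base rates; the function name and all numbers are assumptions for illustration.

```python
def prob_null_given_significant(prior_null, alpha=0.05, power=0.8):
    """P(null is true | test is significant), by Bayes' rule.

    prior_null: assumed share of comparisons in which the null-hypothesis holds.
    alpha:      significance level, i.e. P(significant | null is true).
    power:      P(significant | null is false).
    """
    false_positives = alpha * prior_null
    true_positives = power * (1.0 - prior_null)
    return false_positives / (false_positives + true_positives)

# If 90% of tested ideas have no real effect and the test has 50% power:
print(round(prob_null_given_significant(0.9, alpha=0.05, power=0.5), 3))  # 0.474
# If only half of them have no real effect and the test has 80% power:
print(round(prob_null_given_significant(0.5, alpha=0.05, power=0.8), 3))  # 0.059
```

In neither scenario is the answer the 5% that the misconception suggests, even though the significance level is 0.05 in both.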

C.2 Practical Significance
Statistical significance is different from practical significance (Berger and Sellke, 1987). While we saw in Example 1 that the difference between systems S1 and S2 is statistically significant (H1), it does not provide any intuition on the magnitude of their difference (the x parameter in H2-4). In other words, a small p-value does not necessarily mean that the effect is practically important. Confidence intervals alleviate this issue by providing the range of parameters that are compatible with the data; however, they do not provide probability estimates (Kruschke and Liddell, 2018). Bayesian analysis provides probability distributions directly over the target parameters of interest. This allows users to report uncertainty estimates for hypotheses that capture the margins of the effects.
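To see how a practically negligible gap can become statistically significant, the sketch below keeps the accuracy gap fixed at half a percentage point and only changes the number of instances. It uses an unpaired two-proportion z-test as a simple stand-in for the tests discussed in this paper (systems evaluated on the same instances would normally call for a paired test); the function name and all counts are hypothetical.

```python
import math

def two_proportion_p_value(k1, n1, k2, n2):
    """Two-sided p-value of an unpaired two-proportion z-test (pooled variance)."""
    p1, p2 = k1 / n1, k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        return 1.0  # all outcomes identical: no evidence of a difference
    z = (p1 - p2) / se
    return math.erfc(abs(z) / math.sqrt(2))  # P(|Z| >= |z|) for standard normal Z

# Same 50.5% vs 50.0% accuracies, very different sample sizes.
print(two_proportion_p_value(505, 1_000, 500, 1_000))
print(two_proportion_p_value(505_000, 1_000_000, 500_000, 1_000_000))
```

The first p-value is far above 0.05 and the second is far below it, although the effect size is identical in both cases.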

C.3 Unintended Misleading Result by Iterative Testing
While many tests are designed for a single-round experiment, in practice researchers perform multiple rounds of experiments until a predetermined condition is satisfied.This is particularly a problem in binary tests (such as p-value) when the condition is to achieve a desired result.For example, a researcher could continue experimenting until they achieve a statistically significant result (even if they don't necessarily have any intention of cheating the test) (Kim and Bang, 2016).
Since p-values and CIs can only "reject" or stay "undecided", these tests reinforce an unintentional bias towards the only possible decision. Consequently, it becomes easy to misuse this testing mechanism: with a large enough number of data points, almost any nonzero difference can be made statistically significant (Amrhein et al., 2017).
On the other hand, the approaches in §2.3 and §2.4 provide both "accept" and "reject" outcomes, besides staying "undecided." Therefore, an honest researcher is more likely to accept that their data supports the opposite of their conjecture.
For an in-depth study of how each approach behaves in sequential tests, we refer to Kruschke (2010, §13.3).
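The effect of testing repeatedly until significance is reached can be illustrated with a small simulation. In the sketch below, two systems have exactly the same true accuracy, so every significant result is a false positive. It compares a researcher who tests once at the end with one who tests after every batch and stops at the first p < 0.05. The unpaired z-test, the batch sizes, and all names are assumptions chosen for illustration.

```python
import math
import random

def p_value(k1, k2, n):
    """Two-sided unpaired two-proportion z-test with n instances per system."""
    pooled = (k1 + k2) / (2 * n)
    se = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if se == 0:
        return 1.0
    return math.erfc(abs(k1 - k2) / n / se / math.sqrt(2))

def false_positive_rates(trials=400, batches=20, batch_size=50,
                         acc=0.7, alpha=0.05, seed=0):
    """Return (rate with a test after every batch, rate with one final test)."""
    rng = random.Random(seed)
    peeking = fixed = 0
    for _ in range(trials):
        k1 = k2 = n = 0
        significant_at_some_look = False
        for _batch in range(batches):
            k1 += sum(rng.random() < acc for _i in range(batch_size))
            k2 += sum(rng.random() < acc for _i in range(batch_size))
            n += batch_size
            if p_value(k1, k2, n) < alpha:
                significant_at_some_look = True  # this researcher would stop here
        peeking += significant_at_some_look
        fixed += p_value(k1, k2, n) < alpha
    return peeking / trials, fixed / trials

peek_rate, fixed_rate = false_positive_rates()
print(f"test after every batch: {peek_rate:.2f}, single final test: {fixed_rate:.2f}")
```

The single final test should stay near the nominal 5%, while the stop-when-significant strategy produces false positives several times as often.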
Since p-values and CIs do not depend on a prior, they are not subject to this issue.

C.4 Choice of Prior for Bayes Factor
As discussed in §2.4, §C.3, and §5, Bayes Factor is quite sensitive to the choice of prior. To address this, here are a few options for setting the prior:
1. Within the framework of model selection, if the priors are decided based on a clear meaning, as opposed to formulating a "vague" prior, then they can be justified to the audience.
2. Often, there are a few choices of noncommittal priors that seem equally representative of our beliefs. The best option is to perform and report one test for each of these choices, to expose the sensitivity to the prior.
3. Another common approach to mitigate this concern is to use a small portion of the data to get an "informed" prior and do the analysis using this prior. This ensures that the new prior is meaningful and defensible.
4. If none of the above applies, it is recommended to use the approaches in §2.3 instead. Even though the posterior density depends on the prior, this measure is known to be robust across different choices of similar priors.
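Option 2 above, reporting one Bayes Factor per candidate prior, can be sketched for a deliberately simple one-parameter setting. Suppose that among the instances on which two systems disagree, S1 is correct on k out of n. The null-hypothesis is θ = 0.5 and the alternative places a symmetric Beta(a, a) prior on θ. This is not the hierarchical model used by HyBayes; the function names and the counts are hypothetical.

```python
import math

def log_beta(a, b):
    """Logarithm of the Beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def bayes_factor_10(k, n, a):
    """Bayes Factor of H1: theta ~ Beta(a, a) against H0: theta = 0.5.

    The binomial coefficient is common to both marginal likelihoods and cancels.
    """
    log_m1 = log_beta(k + a, n - k + a) - log_beta(a, a)
    log_m0 = n * math.log(0.5)
    return math.exp(log_m1 - log_m0)

# Hypothetical data: S1 wins 60 of the 100 instances where the systems disagree.
for a in (0.5, 1.0, 2.0):
    print(f"prior Beta({a}, {a}): BF10 = {bayes_factor_10(60, 100, a):.2f}")
```

For the same data, the three noncommittal priors give Bayes Factors that differ by more than a factor of two, which is the kind of sensitivity worth reporting alongside the result.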

C.5 Flexible Choice of Hierarchical Model
In this respect, the approaches in §2.3 and §2.4 are better suited to take the specifics of each setting into account.

C.6 Bayes Factors vs. Posterior Intervals
We refer interested readers to Kruschke and Liddell (2018, pp. 165-166) for a list of Bayes factor caveats.

Figure 1: Progression of steps taken during a scientific assessment of claims from empirical observations.

Figure 2: Left: Prior distributions of two systems (bottom row) and their difference (top row). Right: Posterior distributions of two systems (bottom row) and their difference (top row) after observing the performances on the ARC-easy dataset. Note the posterior HDI estimate, (0.00939, 0.0612). Here we assume at least one percent accuracy difference to be considered practically different. Hence, we indicate the interval (−0.01, 0.01) as ROPE (§2.3).

Figure 3: Select results from our survey.

Figure 4: The hierarchical model in Example 3&4.

Figure 5: Posterior distributions of two systems (bottom row) and their difference (top row) after observing the performances on both datasets.
Q1, remaining options: (e) I don't know, because I don't know what p-value means. (f) Don't have time. Skipping this question.
Distribution of responses to Q1.
Distribution of responses to Q2.
Distribution of responses to Q3.
(b) Distribution of responses to "What venues do you usually publish in?"
(c) Distribution of responses to "I have learned about statistical hypothesis testing/assessment (via taking classes or reading it from other places)."
(d) Distribution of responses to "I can understand almost all the "statistical" terms I encounter in papers."
(e) Distribution of responses to "Do you know the definition of Confidence Interval?"
(f) Distribution of responses to "I know p-values and I know how to interpret them."

Figure 9: Responses to several questions in our survey of NLP researchers.

Table 1: Performance of two systems

Table 2: Various classes of methods for statistical assessment of hypotheses.

Table 3: A comparison of different statistical methods for evaluating the credibility of a hypothesis given a set of observations. The total number of papers published at the ACL-2018 conference is 439.