Showing Your Work Doesn’t Always Work

In natural language processing, a recently popular line of work explores how to best report the experimental results of neural networks. One exemplar publication, titled “Show Your Work: Improved Reporting of Experimental Results” (Dodge et al., 2019), advocates for reporting the expected validation effectiveness of the best-tuned model, with respect to the computational budget. In the present work, we critically examine this paper. As far as statistical generalizability is concerned, we find unspoken pitfalls and caveats with this approach. We analytically show that their estimator is biased and uses error-prone assumptions. We find that the estimator favors negative errors and yields poor bootstrapped confidence intervals. We derive an unbiased alternative and bolster our claims with empirical evidence from statistical simulation. Our codebase is at https://github.com/castorini/meanmax.


Introduction
Questionable answers and irreproducible results represent a formidable beast in natural language processing research. Worryingly, countless experimental papers lack empirical rigor, disregarding necessities such as the reporting of statistical significance tests (Dror et al., 2018) and computational environments (Crane, 2018). As Forde and Paganini (2019) concisely lament, explorimentation, the act of tinkering with metaparameters and praying for success, while helpful in brainstorming, does not constitute a rigorous scientific effort.
Against the crashing wave of explorimentation, though, a few brave souls have resisted the urge to feed the beast. Reimers and Gurevych (2017) argue for the reporting of neural network score distributions. Gorman and Bedrick (2019) demonstrate that deterministic dataset splits yield less robust results than random ones for neural networks. Dodge et al. (2019) advocate for reporting the expected validation quality as a function of the computation budget used for hyperparameter tuning, which is paramount to drawing robust conclusions.
But carefully tread we must. Papers that advocate for scientific rigor must be held to the very same standards that they espouse, lest they birth a new beast altogether. In this work, we critically examine one such paper, from Dodge et al. (2019). We acknowledge the validity of their technical contribution, but we find several notable caveats, as far as statistical generalizability is concerned. Analytically, we show that their estimator is negatively biased and uses assumptions that are subject to large errors. Based on our theoretical results, we hypothesize that this estimator strongly prefers underestimates to overestimates and yields poor confidence intervals with the common bootstrap method (Efron, 1982).
Our main contributions are as follows: First, we prove that their estimator is biased under weak conditions and provide an unbiased solution. Second, we show that one of their core approximations often contains large errors, leading to poorly controlled bootstrapped confidence intervals. Finally, we empirically confirm the practical hypothesis using the results of neural networks for document classification and sentiment analysis.

Background and Related Work
Notation. We describe our notation for fundamental concepts in probability theory. First, the cumulative distribution function (CDF) of a random variable (RV) X is defined as F(x) := Pr[X ≤ x]. Given an i.i.d. sample X_1, . . . , X_B drawn from F, the empirical CDF (ECDF) is defined as F̂_B(x) := (1/B) Σ_{i=1}^{B} I[X_i ≤ x], where I denotes the indicator function. Note that we pick "B" instead of "n" to be consistent with Dodge et al. (2019). The error of the ECDF is popularly characterized by the Kolmogorov–Smirnov (KS) distance between the ECDF and the CDF:

KS(F̂_B, F) := sup_x |F̂_B(x) − F(x)|.    (2.1)

Naturally, by definition of the CDF and ECDF, KS(F̂_B, F) ≤ 1. Using the CDF, the expectation for both discrete and continuous (cts.) RVs is defined using the Riemann–Stieltjes integral E[X] = ∫ x dF(x).
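These definitions translate directly into code. The following NumPy sketch (function names are ours) builds the ECDF from a sample and evaluates the KS distance against a reference CDF, using the fact that the supremum is attained at the jump points of the ECDF:

```python
import numpy as np

def ecdf(sample):
    """Return the ECDF F_hat_B as a callable, built from an i.i.d. sample."""
    x = np.sort(np.asarray(sample, dtype=float))
    return lambda v: np.searchsorted(x, v, side="right") / len(x)

def ks_distance(sample, cdf):
    """KS distance between the ECDF of `sample` and a (vectorized) reference CDF.

    Since the ECDF is a step function, the supremum over all v is attained at
    (or just before) a sample point, so only the jump locations need checking.
    """
    x = np.sort(np.asarray(sample, dtype=float))
    B = len(x)
    f = cdf(x)
    upper = np.arange(1, B + 1) / B - f   # deviation just after each jump
    lower = f - np.arange(0, B) / B       # deviation just before each jump
    return float(max(upper.max(), lower.max()))
```

For example, a single observation at 0.5 drawn from Uniform(0, 1) yields a KS distance of exactly 0.5.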
We write the i-th order statistic of independent and identically distributed (i.i.d.) X_1, . . . , X_B as X_(i:B). Recall that the i-th order statistic X_(i:B) is an RV representing the i-th smallest value if the RVs were sorted.

Hyperparameter tuning. In random search, a probability distribution p(H) is first defined over a k-tuple hyperparameter configuration H := (H_1, . . . , H_k), which can include both cts. and discrete variables, such as the learning rate and the random seed of the experimental environment. Commonly, researchers choose the uniform distribution over a bounded support for each hyperparameter (Bergstra and Bengio, 2012). Combined with the appropriate model family M and dataset D := (D_T, D_V), split into training and validation sets, respectively, a configuration then yields a numeric score V on D_V. Finally, after sampling B i.i.d. configurations, we obtain the scores V_1, . . . , V_B and pick the hyperparameter configuration associated with the best one.
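The random search procedure above can be sketched in a few lines. The hyperparameter names and ranges below are illustrative only, not those of any experiment in this paper:

```python
import random

def sample_config(rng):
    # One i.i.d. draw from p(H); each hyperparameter gets its own distribution.
    return {
        "lr": 10 ** rng.uniform(-5, -2),   # log-uniform learning rate
        "dropout": rng.uniform(0.0, 0.5),
        "seed": rng.randrange(2 ** 31),    # the random seed as a tunable variable
    }

def random_search(evaluate, budget, seed=0):
    """Draw `budget` i.i.d. configurations and return the best (config, score)."""
    rng = random.Random(seed)
    best_config, best_score = None, float("-inf")
    for _ in range(budget):
        config = sample_config(rng)
        score = evaluate(config)   # e.g., validation score of the trained model
        if score > best_score:
            best_config, best_score = config, score
    return best_config, best_score
```

Here `evaluate` stands in for training model family M on D_T and scoring on D_V.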

Analysis of Showing Your Work
In "Show Your Work: Improved Reporting of Experimental Results," Dodge et al. (2019) recognize the ramifications of underreporting the hyperparameter tuning policy and its associated budget. One of their key findings is that, given different computation quotas for hyperparameter tuning, researchers may arrive at drastically different conclusions for the same model. Given a small tuning budget, a researcher may conclude that a smaller model outperforms a bigger one, while they may reach the opposite conclusion for a larger budget.
To ameliorate this issue, Dodge et al. (2019) argue for fully reporting the expected maximum of the score as a function of the budget. Concretely, the parameters of interest are θ_1, . . . , θ_B, where θ_n := E[max{V_1, . . . , V_n}] = E[V_(n:n)] for 1 ≤ n ≤ B. In other words, θ_n is precisely the expected value of the n-th order statistic for a sample of size n drawn i.i.d. at tuning time. For this quantity, they propose an estimator, derived as follows: first, observe that the CDF of V*_n := V_(n:n) is Pr[V_(n:n) ≤ v] = Pr[V_1 ≤ v] · · · Pr[V_n ≤ v] = F(v)^n, which we denote as F_n(v). Then θ_n = ∫ v dF_n(v). For approximating the CDF, Dodge et al. (2019) use the ECDF F̂_B(v)^n, constructed from some sample V_1, . . . , V_B of size B. To construct an estimator θ̂_n for θ_n, they then replace the CDF with the ECDF: θ̂_n := ∫ v dF̂_B(v)^n, which, by definition, evaluates to

θ̂_n = Σ_{i=1}^{B} v_i (F̂_B(v_i)^n − F̂_B(v_{i−1})^n),

where v_1 ≤ · · · ≤ v_B are the sorted sample values and, with some abuse of notation, v_0 < v_1 is a dummy variable with F̂_B(v_0) := 0. We henceforth refer to θ̂_n as the MeanMax estimator. Dodge et al. (2019) recommend plotting the number of trials on the x-axis and θ̂_n on the y-axis.
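For concreteness, the MeanMax estimator admits a short implementation (a NumPy sketch; the function name is ours):

```python
import numpy as np

def meanmax(scores, n):
    # MeanMax estimate of E[max of n i.i.d. draws], obtained by substituting
    # the ECDF F_hat_B for the true CDF F, so that the n-run maximum has
    # estimated CDF F_hat_B(v) ** n.
    v = np.sort(np.asarray(scores, dtype=float))   # v_1 <= ... <= v_B
    B = len(v)
    cdf_max = (np.arange(1, B + 1) / B) ** n       # F_hat_B(v_i) ** n at each point
    prev = np.concatenate(([0.0], cdf_max[:-1]))   # value at v_{i-1}; dummy v_0
    return float(np.sum(v * (cdf_max - prev)))
```

At n = 1 this reduces to the sample mean, and as n grows the mass shifts toward the sample maximum.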

Pitfalls and Caveats
We find two unspoken caveats in Dodge et al. (2019): first, the MeanMax estimator is statistically biased under weak conditions. Second, the ECDF, as formulated, is a poor drop-in replacement for the true CDF, in the sense that the finite-sample error can be unacceptable if certain realistic conditions are unmet.

Estimator bias. The bias of an estimator θ̂ is defined as the difference between its expectation and its estimand θ: Bias(θ̂) := E[θ̂] − θ. An estimator is said to be unbiased if its bias is zero; otherwise, it is biased. We make the following claim:

Theorem 1. Let V_1, . . . , V_B be an i.i.d. sample (of size B) from an unknown distribution F on the real line. Then, for all 1 ≤ n ≤ B, Bias(θ̂_n) ≤ 0, with strict inequality iff V_(1) < V_(n) with nonzero probability. In particular, if n = 1, then Bias(θ̂_1) = 0, while if n > 1 with F continuous, or discrete but non-degenerate, then Bias(θ̂_n) < 0.
Proof. Let 1 < n ≤ B. We are interested in estimating the expectation of the maximum of n i.i.d. samples: θ_n = E[max{V_1, . . . , V_n}]. An obvious unbiased estimator, based on the given sample of size B, is the U-statistic that averages over all subsets of n distinct indices:

Û^B_n := C(B, n)^{−1} Σ_{1 ≤ i_1 < · · · < i_n ≤ B} max{V_{i_1}, . . . , V_{i_n}}.

This estimator is obviously unbiased, since E[max{V_{i_1}, . . . , V_{i_n}}] = θ_n for any n distinct indices, due to the i.i.d. assumption on the sample. A second, biased estimator is the corresponding V-statistic, which allows repeated indices:

V̂^B_n := B^{−n} Σ_{i_1=1}^{B} · · · Σ_{i_n=1}^{B} max{V_{i_1}, . . . , V_{i_n}}.

This estimator is only asymptotically unbiased, when n is fixed while B tends to ∞. In fact, we will prove below that for all 1 ≤ n ≤ B:

V̂^B_n ≤ Û^B_n,    (3.8)

where, as before, V_(i) denotes the i-th smallest order statistic of the sample. We start by simplifying the calculation of the two estimators. It is easy to see that the following holds:

Û^B_n = Σ_{j=1}^{B} [C(j−1, n−1) / C(B, n)] V_(j),

where we basically enumerate all possibilities for max{V_{i_1}, . . . , V_{i_n}} = V_(j). By convention, C(m, n) = 0 if m < n, so the above summation effectively goes from n to B, but our convention will make the comparison more convenient. Similarly,

V̂^B_n = Σ_{j=1}^{B} [(j^n − (j−1)^n) / B^n] V_(j).

We make an important observation that connects our estimators to that of Dodge et al.: if there are no ties in the sample, then F̂_B(V_(j)) = j/B, and hence θ̂_n = V̂^B_n. The formula continues to hold even if there are ties, in which case we simply collapse the tied values. Now, we are ready to prove Eq. (3.8). All we need to do is compare the cumulative sums of the coefficients in the two estimators:

Σ_{j=1}^{k} (j^n − (j−1)^n) / B^n = (k/B)^n versus Σ_{j=1}^{k} C(j−1, n−1) / C(B, n) = C(k, n) / C(B, n).

We need only consider k ≥ n (the case k < n is trivial). One can easily verify the following expression backwards:

C(k, n) / C(B, n) = Π_{i=0}^{n−1} (k−i) / (B−i) ≤ (k/B)^n,

where each factor satisfies (k−i)/(B−i) ≤ k/B for k ≤ B, and the last inequality is strict since k < B and n > 1. Thus, we have verified the following for all 1 ≤ k < B:

Σ_{j=1}^{k} [(j^n − (j−1)^n) / B^n − C(j−1, n−1) / C(B, n)] ≥ 0.

Eq. (3.8) now follows, since V_(1) ≤ · · · ≤ V_(B) lies in the isotonic cone, while we have proved that the difference of the two coefficient vectors lies in the dual cone of the isotonic cone. An elementary way to see this is to first compare the coefficients in front of V_(B): clearly, Û^B_n's is larger, since it has a smaller sum of all coefficients but the one in front of V_(B) (take k = B − 1), whereas the total sum is always one. Repeat this comparison for V_(B−1), . . . , V_(1).
Lastly, if V_(1) < V_(n), then there exists a subset (with repetition) 1 ≤ i_1 ≤ . . . ≤ i_n ≤ n such that max{V_(i_1), . . . , V_(i_n)} < V_(n). For instance, setting i_1 = . . . = i_n = 1 suffices. Since V̂^B_n puts positive mass on every subset of n elements (with repetitions allowed), the strict inequality follows. We note that if F is continuous, or if F is discrete but non-degenerate, then V_(1) < V_(n) with nonzero probability, hence Bias(θ̂_n) < 0. The proof is now complete.
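The two estimators in the proof can be evaluated directly from the order-statistic formulas above (a sketch; function names are ours):

```python
from math import comb

def u_stat(scores, n):
    # Unbiased U-statistic: the coefficient of V_(j) is C(j-1, n-1) / C(B, n).
    v = sorted(scores)
    B = len(v)
    return sum(comb(j - 1, n - 1) / comb(B, n) * v[j - 1] for j in range(1, B + 1))

def v_stat(scores, n):
    # Biased V-statistic (equals the MeanMax estimator when there are no ties):
    # the coefficient of V_(j) is (j^n - (j-1)^n) / B^n.
    v = sorted(scores)
    B = len(v)
    return sum((j ** n - (j - 1) ** n) / B ** n * v[j - 1] for j in range(1, B + 1))
```

For any sample, v_stat(s, n) ≤ u_stat(s, n), with equality at n = 1, illustrating Eq. (3.8) numerically.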
For further caveats, see Appendix A. The practical implication is that researchers may falsely conclude, on average, that a method is worse than it truly is, since the MeanMax estimator is negatively biased. In the context of environmental consciousness, more computation than necessary is used to reach a conclusion. Notably, this result always holds for cts. distributions, since the population maximum is never in the sample. Practically, this theorem suggests the failure of bootstrapping (Efron, 1982) for statistical hypothesis testing and for constructing confidence intervals (CIs) of the expected maximum, since the bootstrap requires a good approximation of the CDF (Canty et al., 2006). Thus, relying on the bootstrap method for constructing CIs of the expected maximum, as in Lucic et al. (2018), may lead to poor coverage of the true parameter.
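A small Monte Carlo check makes the negative bias concrete. For Uniform(0, 1) scores, the true expected maximum of n draws is n/(n + 1), which the MeanMax estimate falls short of on average. This is a sketch under those assumptions, not the paper's exact simulation:

```python
import numpy as np

def meanmax(sample, n):
    # Plug-in (MeanMax) estimate of the expected maximum of n i.i.d. draws.
    v = np.sort(np.asarray(sample, dtype=float))
    B = len(v)
    w = (np.arange(1, B + 1) / B) ** n
    return float(np.sum(v * np.diff(np.concatenate(([0.0], w)))))

rng = np.random.default_rng(0)
B, n, trials = 50, 10, 2000
estimates = [meanmax(rng.uniform(size=B), n) for _ in range(trials)]
# True parameter for Uniform(0, 1): theta_n = n / (n + 1).
empirical_bias = float(np.mean(estimates)) - n / (n + 1)  # consistently negative
```

With these settings the empirical bias is a small but reliably negative number.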

Experimental Setup
To support the validity of our conclusions, we opt for cleanroom Monte Carlo simulations, which enable us to determine the true parameter and draw millions of samples. To maintain the realism of our study, we apply kernel density estimation to actual results, using the resulting probability density (or discretized mass) function as the ground-truth distribution. Specifically, we examine the experimental results of the following neural networks:

Document classification. We first conduct hyperparameter search over neural networks for document classification, namely a multilayer perceptron (MLP) and a long short-term memory (LSTM; Hochreiter and Schmidhuber, 1997) model, the latter representing the state of the art for LSTMs from Adhikari et al. (2019). For our dataset and evaluation metric, we choose Reuters (Apté et al., 1994) and the F1 score, respectively. Next, we fit discretized kernel density estimators to the results; see the appendix for experimental details. We name the distributions after their models, MLP and LSTM.

Sentiment analysis. Similar to Dodge et al. (2019), on the task of sentiment analysis, we tune the hyperparameters of two LSTMs: one ingesting embeddings from language models (ELMo; Peters et al., 2018), the other shallow word vectors (GloVe; Pennington et al., 2014). We choose the binary Stanford Sentiment Treebank (Socher et al., 2013) dataset and apply the same kernel density estimation method. We denote the distributions by their embedding types, GloVe and ELMo.
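Sampling from a Gaussian kernel density estimate amounts to picking an observed score uniformly at random and perturbing it with Gaussian noise. The following NumPy sketch illustrates the idea; the function name and bandwidth value are ours, not the paper's exact setup:

```python
import numpy as np

def kde_sample(scores, size, bandwidth=0.01, seed=0):
    # Draw from the Gaussian KDE fit to the observed scores: choose an observed
    # score uniformly at random, then add N(0, bandwidth^2) noise to it.
    rng = np.random.default_rng(seed)
    centers = rng.choice(np.asarray(scores, dtype=float), size=size, replace=True)
    return centers + rng.normal(0.0, bandwidth, size=size)
```

Treating the fitted density as the ground truth lets one draw arbitrarily many "tuning runs" while knowing the true parameter exactly (via large-scale simulation).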

Experimental Test Battery
False conclusion probing. To assess the impact of the estimator bias, we measure the probability of researchers falsely concluding that a method's expected maximum falls below its true value for a given n. An estimator that prefers neither underestimates nor overestimates yields a probability of 0.5.
Concretely, denote the true n-run expected maximum of a method as θ_n and its estimate as θ̂_n. We iterate over n = 1, . . . , 50 and report the proportion of samples (of size B = 50) for which θ̂_n < θ_n. We compute the true parameter using 1,000,000 iterations of Monte Carlo simulation and estimate the proportion with 5,000 samples for each n.

CI coverage. To evaluate the validity of bootstrapping the expected maximum, we measure the coverage probability of CIs constructed using the percentile bootstrap method (Efron, 1982). Specifically, we set B = 50 and iterate over n = 1, . . . , 50. For each n, across M = 1,000 samples, we compare the empirical coverage probability (ECP) to the nominal coverage rate of 95%, with CIs constructed from 5,000 bootstrapped resamples. The ECP α̂_n is computed as

α̂_n = (1/M) Σ_{i=1}^{M} I[θ_n ∈ CI_i],

where CI_i is the CI computed from the i-th sample.
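The percentile bootstrap step can be sketched as follows. This is our own minimal version, with the MeanMax statistic inlined; the constants are illustrative, not the exact experimental settings:

```python
import numpy as np

def meanmax_stat(sample, n):
    # MeanMax (plug-in ECDF) estimate of the expected maximum of n draws.
    v = np.sort(np.asarray(sample, dtype=float))
    B = len(v)
    w = (np.arange(1, B + 1) / B) ** n
    return float(np.sum(v * np.diff(np.concatenate(([0.0], w)))))

def percentile_ci(scores, n, n_boot=2000, alpha=0.05, seed=0):
    # Percentile bootstrap: resample with replacement, recompute the statistic,
    # then take the alpha/2 and 1 - alpha/2 quantiles of the replicates.
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores, dtype=float)
    boots = [meanmax_stat(rng.choice(scores, size=len(scores)), n)
             for _ in range(n_boot)]
    lo, hi = np.percentile(boots, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)
```

The ECP is then the fraction of simulated samples whose interval contains the true θ_n.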

Results
Following Dodge et al. (2019), we present the budget–quality curves for each model pair in Figure 1. For each number of trials n, we vertically average each curve across the 5,000 samples. We construct CIs but do not display them, since the estimate is precise (standard error < 0.001). For document classification, we observe that the LSTM is more difficult to tune but achieves higher quality after some effort. For sentiment analysis, using ELMo consistently attains better accuracy with the same number of trials; we do not consider wall-clock time.
In Figure 2, we show a failure case of biased estimation in the document classification task. At B = 25, from n = 20 to 25, the averaged estimate yields the wrong conclusion that the MLP outperforms the LSTM-see the true LSTM line, which is above the true MLP line, compared to its estimate, which is below.
False conclusion probing. Figure 3 shows the results of our false conclusion probing experiment. We find that the estimator quickly comes to prefer negative errors as n increases. The curves are mostly similar across both tasks, except that the MLP fares worse. This requires further analysis, though we conjecture that the reason is lower estimator variance, which would result in more consistent errors.

CI coverage. We present the results of the CI coverage experiment in Figure 4. We find that the bootstrapped confidence intervals quickly fail to contain the true parameter at the nominal coverage rate of 0.95, decreasing to an ECP of 0.7 by n = 20. Since the underlying ECDF is the same, this result extends to Lucic et al. (2018), who construct CIs for the expected maximum.

Conclusions
In this work, we provide a dual-pronged theoretical and empirical analysis of Dodge et al. (2019). We find unspoken caveats in their work, namely that the estimator is statistically biased under weak conditions and uses an ECDF assumption that is subject to large errors. We empirically study its practical effects on tasks in document classification and sentiment analysis. We demonstrate that the estimator prefers negative errors and that bootstrapping it leads to poorly controlled confidence intervals.

A Cautionary Notes
We caution that the estimator described in the text of Dodge et al. is V̂^n_n. This is clear from their equation (7), where the empirical distribution is defined over the first n samples, instead of the B samples that we use here. In other words, they claim, at least in the text, to use F̂_n instead of F̂_B for their estimator V̂^n_n. Clearly, the estimator V̂^n_n is (much) worse than V̂^B_n, since the latter exploits all B samples while the former looks at only the first n. However, close examination of their codebase reveals that they use V̂^B_n, so the discrepancy in the paper is a simple notation error.
Lastly, we mention that our notation for Û^B_n and V̂^B_n is motivated by the fact that the former is a U-statistic while the latter is a V-statistic. The relation between the two has been heavily studied in statistics since Hoeffding's seminal work. For us, it suffices to point out that V̂^B_n ≤ Û^B_n, with the latter being unbiased while the former is only asymptotically unbiased. The difference between the two is more pronounced when n is close to B. We note that Û^B_n can be computed with a reasonable approximation of the binomial coefficients, using, say, Stirling's formula.
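For large B, the binomial coefficients in Û^B_n can be evaluated in log space with the log-gamma function, in the spirit of the Stirling-approximation remark. A sketch (function names are ours):

```python
from math import lgamma, exp

def log_comb(m, k):
    # log C(m, k) via log-gamma, valid for 0 <= k <= m; avoids huge integers.
    return lgamma(m + 1) - lgamma(k + 1) - lgamma(m - k + 1)

def u_coeff(j, n, B):
    # Coefficient of the j-th order statistic in U^B_n: C(j-1, n-1) / C(B, n).
    if j < n:
        return 0.0
    return exp(log_comb(j - 1, n - 1) - log_comb(B, n))

def u_stat_stable(sorted_scores, n):
    # Unbiased U-statistic computed from the order statistics in log space.
    B = len(sorted_scores)
    return sum(u_coeff(j, n, B) * sorted_scores[j - 1] for j in range(1, B + 1))
```

The coefficients sum to one by the hockey-stick identity, which gives a quick sanity check.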

B Proof of Theorem 2
Theorem 2. If the sample does not contain the population maximum, KS(F̂_B^n, F_n) → 1 exponentially quickly as n and B increase.
Proof. Let v_max denote the largest value in the sample, and let c := F(v_max). Since the sample does not contain the population maximum, c < 1, whereas the ECDF satisfies F̂_B(v_max) = 1. Hence KS(F̂_B^n, F_n) ≥ F̂_B(v_max)^n − F(v_max)^n = 1 − c^n, which approaches 1 exponentially quickly as n increases. Thus concluding the proof.

C Experimental Settings
We present hyperparameters in Tables 1 and 2 and Figure 5. We conduct all GloVe and ELMo experiments using PyTorch 1.3.0 with CUDA 10.0 and cuDNN 7.6.3, running on NVIDIA Titan RTX, Titan V, and RTX 2080 Ti graphics accelerators. Our MLP and LSTM experiments use PyTorch 0.4.1 with CUDA 9.2 and cuDNN 7.1.4, running on RTX 2080 Ti's. We use Hedwig for the document classification experiments and the Show Your Work codebase (see link in Table 1) for the sentiment classification ones.