Preregistering NLP Research

Preregistration refers to the practice of specifying what you are going to do and what you expect to find, before carrying out a study. This practice is increasingly common in medicine and psychology, but it is rarely discussed in NLP. This paper discusses preregistration in more detail, explores how NLP researchers could preregister their work, and presents several preregistration questions for different kinds of studies. Finally, we argue in favour of registered reports, which could provide firmer grounds for slow science in NLP research. The goal of this paper is to elicit a discussion in the NLP community, which we hope to synthesise into a general NLP preregistration form in future research.


Introduction
Scientific results are only as reliable as the methods used to obtain them. Recent years have seen growing concerns about the reproducibility of scientific research, leading some to speak of a 'reproducibility crisis' (see Fidler and Wilcox 2018 for an overview of the debate). Although the main focus of the debate has been on psychology, the concerns carry over to other fields. A useful starting point is Munafò et al.'s (2017) 'Manifesto for reproducible science', in which the authors discuss the different threats to reproducible science, and different ways to address these threats. We will first highlight some of their proposals, and discuss their adoption rate in NLP. Our main observation is that preregistration is rarely used. We believe this is an undesirable situation, and devote the rest of this paper to arguing for the preregistration of NLP research.
Munafò et al. recommend more methodological training, so that e.g. statistical methods are applied correctly. In NLP, we see different researchers picking up the gauntlet to teach others about statistics (Dror et al., 2018, 2020), achieving language-independence (Bender, 2011), or best practices in human evaluation (van der Lee et al., 2019, 2021). Moreover, every *ACL conference offers tutorials on a wide range of different topics. While efforts to improve methodology could be more systematic (e.g. by actively encouraging methodology tutorials, and working towards community standards), 1 the infrastructure is in place.
Munafò et al. also recommend diversifying peer review. Instead of relying solely on journals, which are responsible for both the evaluation and dissemination of research, we can now also solicit peer feedback after publishing our work on a platform like arXiv or OpenReview. The NLP community is clearly ahead of the curve in terms of the adoption of preprints, and is actively discussing ways to improve peer review (ACL Reviewing Committee 2020a,b; Rogers and Augenstein 2020). To improve the quality of the reviews themselves, ACL 2020 featured a tutorial on peer reviewing (Cohen et al., 2020).
Another recommendation from Munafò et al. is to adopt reporting guidelines, so that papers include all the details that others need to reproduce the results. The NLP community is rapidly adopting such guidelines, in the form of Dodge et al.'s (2019) reproducibility checklist, which authors submitting to EMNLP 2020 need to fill in. Beyond reproducibility, we are also seeing more and more researchers adopting data statements (Bender and Friedman, 2018).

Data collection
Have any data been collected for this study already?

Hypothesis
What's the main question being asked or hypothesis being tested in this study?

Dependent variable
Describe the key dependent variable(s), specifying how they will be measured.

Conditions
How many and which conditions will participants be assigned to?

Analyses
Specify exactly which analyses you will conduct to examine the main question/hypothesis.

Outliers and Exclusions
Describe exactly how outliers will be defined and handled, and your precise rule(s) for excluding observations.

Sample Size
How many observations will be collected, or what will determine sample size?

Other
Anything else you would like to pre-register?

Research aim
Specify the overall aim of the research.

Use of literature
Specify the role of theory in your research design.

Rationale
Elaborate if your research is conducted from a certain theoretical perspective.

Tradition
Specify the type of tradition you work in: grounded theory, phenomenology, . . .

Data collection plan
Describe your data collection plan freely. Be as explicit as possible.

Type of data collected
Select the type(s) of data you will collect.

Type of sampling
Indicate the type of sampling you will rely on: purposive, theoretical, convenience, snowball, . . .

Rationale
Indicate why you chose this particular type of sampling.

Sort of sample
Pick the ideal composition of your sample: heterogeneous, homogeneous, . . .

Stopping rule
Indicate what will determine when to stop data collection: saturation, planning, resources, other.

Data collection script
Upload your topic guide, observation script, focus group script, etc.

Table 1: Preregistration questions from the AsPredicted form (top) and for qualitative research (bottom; Haven and Grootel, 2019).

Compared to the work on reporting quality, there has been little talk of preregistration in the NLP literature; the terms 'preregister' or 'preregistration' are hardly used in the ACL Anthology. 2 For this reason, we will focus on preregistration and its application in NLP research. The next sections discuss how preregistration works (§2), propose preregistration questions for NLP research (§3), discuss the idea of 'registered reports' as an alternative pathway to publication (§4), and assess the overall feasibility of preregistration in NLP (§5).

2 Looking for these terms, we found four papers that men…

How does preregistration work?
Before you begin, you enter the hypotheses, design, and analysis plan of your study on a website like the Open Science Framework, AsPredicted, or ResearchBox. These sites provide a time stamp: evidence that you indeed made all the relevant decisions before carrying out the study. During your study, you follow the preregistered plans as closely as possible. In an ideal world, there would be an exact match between your plans and the actual study you carried out, but there are usually unforeseen circumstances that force you to change your study. This is fine, as long as the changes (including the reasons for those changes) are clearly specified in your final report (Nosek et al., 2018).
A typical preregistration form. Table 1 shows questions from the preregistration form from AsPredicted. 3 This form is geared towards hypothesis-driven, experimental research where human participants are assigned to different experimental conditions. Simmons et al. (2017) note that answers should state exactly how the study will be executed, but also that the form should be short and easy to read.
Data collection, hypothesis, dependent variable. The form first asks whether data collection has been carried out yet (ideally the answer should be no, but see Appendix §A.1), and then asks researchers to make their main hypothesis explicit, so that it cannot be changed after the fact. Following the hypothesis, researchers should describe their key dependent variables (i.e. the main outcome variables) and how they will be measured. This includes cutoff points that will be used to discretise continuous variables (e.g. to divide participants into different groups).

What are your hypotheses/key assumptions?
What is the independent variable? (e.g. model architecture)
What is the dependent variable? (e.g. output quality)
How will you measure the dependent variable?
Is there just one condition (corpus/task), or more?
What parameter settings will you use?
What data will you use, and how is it split in train/val/test?
Why this data? What are key properties of the data?
How will you analyse the results and test the hypotheses?

Table 2: Questions for analysis, experiments, and reproduction papers (expanded in Appendix A).
Conditions, analyses, outliers and exclusions. Next, the form asks about the design of the study, the analyses, and the process of determining outliers (and whether those should be excluded). The answer needs to be detailed enough so that other researchers are able to reproduce the study.
Sample size and other. The form then asks how much data will be collected, so as to prevent optional stopping (where researchers keep collecting data until the results are in line with their preferred hypothesis). 4 Finally, the form allows researchers to specify other aspects of the study they would like to preregister, such as "secondary analyses, variables collected for exploratory purposes, [or] unusual analyses."

Qualitative research. Preregistration is not only suitable for quantitative research; Haven and Grootel (2019) present a proposal to preregister qualitative studies as well. Their suggestions are also presented in Table 1. The authors argue that, although qualitative research differs in its goals from quantitative research (developing theories rather than testing them), it is still valuable to make your assumptions and research plans explicit before carrying out your planned study. Because qualitative research is more flexible than quantitative research, Haven and Grootel view qualitative preregistrations as living documents, continuously updated to track the research progress. This stimulates conscientiousness, and avoids sloppy research. Public preregistrations also allow for immediate feedback.
What do you aim to learn from the error analysis?
What do you know from the literature about system errors?
What kinds of errors do you expect to find?
How will you sample the outputs to analyse?
Do you also consider the input in your sampling strategy?
How do you plan to analyse the output?
How many judges will assess the output? Are they trained?
How is the reliability of the judges assessed?
Is there a fixed error categorisation scheme or not?

Table 3: Questions to ask before an error analysis.

Preregistration in NLP research
To determine what a preregistration for NLP research should look like, we need to consider the different kinds of research contributions in NLP. For this, we use the paper types proposed for COLING 2018. 5 These are: Computationally-aided linguistic analysis; NLP engineering experiment paper; Reproduction/Resource/Position/Survey paper. Of these, position papers are less suitable for preregistration, since these are more opinion/experience-driven, and the process of writing them cannot be formalised. We treat the others below.
Analysis, experiments, and reproduction papers typically have one or more hypotheses, even though they may not always be marked as such. 6 This means we can ask many of the same questions for these studies as for experimental research. Table 2 provides a rough overview of important questions to ask before carrying out your research.
If your study contains an error analysis, then you could ask the more qualitatively oriented questions in Table 3. They acknowledge that you always enter error analysis with some expectation (i.e. researcher bias) of what kinds of mistakes systems are likely to make, and where those mistakes may be found. The questions also stimulate researchers to go beyond the practice of providing some 'lemons' alongside cherry-picked examples showing good performance.
The main benefit of asking these questions beforehand is that they force researchers to carefully consider their methodology, and they make researchers' expectations explicit. This also helps to identify unexpected findings, or changes that were made to the research design during the study.
Resource papers are on the qualitative side of the spectrum, and as such the questions from Haven and Grootel (2019), presented at the bottom of Table 1, are generally appropriate for these kinds of papers as well. Particularly 1) the original purpose for collecting the data, 2) sampling decisions (what documents to include), and 3) annotation (what framework/perspective to use) are important. Because the former typically influences the latter two, it is useful to document how the goal of the study influenced decisions regarding sampling and annotation, in case the study at some point pivots towards another goal.
Survey papers should follow the PRISMA guidelines for structured reviews.

Registered reports
Registered reports "[split] conventional peer review in half" (Chambers, 2019). First, authors submit a well-motivated research plan for review, before carrying out the study (similar to a preregistration). This plan may go back and forth between the authors and the reviewers, but once the plan is accepted, the authors receive the guarantee that, if they carry out the study according to plan, their work will be published. As with preregistration, deviations from the original plan are allowed, but these should be identified in the final report. The main advantage of registered reports is that they provide a means to avoid publication bias. Because studies aren't judged on the basis of their results, positive results are equally likely to be published as negative results. As long as the study is deemed valuable a priori, it should get published. An additional benefit of registered reports is that reviews may actually correct flaws in the research design, meaning that we reduce the chance of running an expensive study all for nothing. In the case of NLP research, this may save a lot of energy (cf. Strubell et al. 2019). We are not aware of any NLP journals that offer registered reports, but strongly encourage the NLP community to take steps in this direction. 7

Feasibility
Gelman and Loken (2013, 2014) touch upon the feasibility of preregistration, noting that "[f]or most of our own research projects this strategy hardly seems possible: in our many applied research projects, we have learned so much by looking at the data. Our most important hypotheses could never have been formulated ahead of time." This certainly rings true for NLP as well. However, we should be careful about conclusions that are drawn on the basis of pre-existing data. Gelman and Loken (2013) note that in such cases, if it is feasible to collect more data, it is good to follow up positive results with a preregistered replication to confirm your initial findings. One way to do this is to collect a new test set and evaluate your model on it (cf. Recht et al. 2019). This tells us to what extent trained models generalise to unseen data. Another idea could be to preregister the human evaluation (or error analysis) of the model output.
We believe that preregistration, and especially registered reports, could ease the pressure to publish as soon as possible. If your analysis plan is accepted for publication, you can take as long as you want to actually carry out the study, without having to worry about being scooped. This provides new opportunities for slow science in NLP (also see Min-Yen Kan's keynote at COLING 2018).

Questions about preregistration
Below we address some common questions about preregistration. We thank our anonymous reviewers for raising some of these questions.

Is preregistration more work? In our experience, preregistration adds little overhead to a research project. Especially if a project requires approval by an Institutional Review Board (IRB), you need to write a description along similar lines anyway. For projects not requiring IRB approval, it is good practice to provide a model card (Mitchell et al., 2019), data sheet (Gebru et al., 2018) or data statement (Bender and Friedman, 2018) with your model or resource. Given the ethical aspects of NLP research, it is advisable to consider all dimensions of your study before you carry it out. Moreover, preregistration is a good way to start writing the paper before carrying out the research, a practice advocated by Eisner (2010) to maximise the impact of your work. Finally, it may be more work to prepare a registered report, but this comes with the benefit of having a pre-approved methodology: once the project is completed, reviewers will not reject your paper based on methodological choices.

Should I worry about being scooped? There is no need to worry. We already discussed registered reports, where research proposals are provisionally accepted before data collection starts. Otherwise, this worry has been addressed through the existence of both public and private preregistrations. A researcher can choose to keep a preregistration private until the research is completed. They can make their preregistration public whenever they like, for example to invite feedback from the community. In addition, preregistrations are also time-stamped, and you can use these time stamps during the review phase to show that you have had these ideas before similar work was published. 8

What about citing preregistrations? In some regards, the discussion about preregistrations is similar to the discussion about preprints (i.e. papers on arXiv), so similar questions arise. Both preregistrations and published studies are being cited. For example, medical journals like BMC Public Health also publish study protocols (similar to preregistrations), without any results, and these are cited by others (e.g. by work using a similar protocol).
What should we do with concurrent work? It may of course happen that multiple researchers have similar ideas around the same time. We believe that it is still valuable to publish multiple independent studies with similar results. Even if they don't provide any new insights (which is rare), they do provide evidence towards the robustness of the findings. Where and how those findings should be published is a separate discussion. 9

How should we teach preregistration? Preregistration is already being incorporated into psychology courses (see, for example, Blincoe and Buchert 2020). It is relatively straightforward to implement as part of student research proposals during applied courses in NLP: specify exactly what you plan to do, and what you expect to find. It is often useful for students to have an explicit format to think through their research plans, to make sure that they make sense.

8 The public/private distinction has been implemented by both the Open Science Framework and AsPredicted.org. The Open Science Framework allows for a 4-year embargo, during which the preregistration is kept private. AsPredicted allows for preregistrations to be private indefinitely.

9 However, if there is value in publishing the 'first' paper, there is probably also value in publishing the 'second' one. The same holds for the question of whether both studies should be cited; good scholarship considers all the available evidence.

Limitations
Although preregistration is offered as a solution to improve our work, it does not solve all of our problems. Van 't Veer and Giner-Sorolla (2016) mention three limitations:

1. Flexibility. It may be difficult or infeasible for authors to foresee all possible outcomes, and as such there may be gaps in the preregistration, which still allow for flexibility in the analysis.

2. Fraud. There is no way to prevent fraudulent researchers from, e.g., creating multiple preregistrations, or falsely 'preregistering' studies that were already run. At some point we just have to trust each other to do the right thing, but increased transparency does make it harder to commit fraud.

3. Applicability. Preregistration may not be possible for all kinds of studies. As discussed above, it has mainly been developed for quantitative studies (particularly experiments); there are proposals for the preregistration of qualitative research (Haven and Grootel, 2019), but we have yet to see whether this idea will catch on.

Finally, Szollosi et al. (2020) argue that, although preregistration might offer greater transparency, it does not by itself improve scientific reasoning and theory development. Since large parts of NLP are pre-theoretical (we have observed effects but do not have any theoretical explanations for why these effects occur), one might reasonably argue that we should focus on theory development first, before we can carry out any meaningful experiments.

Conclusion
We have discussed how preregistration could benefit NLP research, and how different kinds of contributions could be preregistered. We have also proposed an initial list of questions to ask before carrying out NLP research (and see Appendix A for example preregistration forms). With this paper, we hope to encourage other NLP researchers to consider preregistering their work, so that they will no longer get lost in the garden of forking paths. Still, there is no silver bullet to cure sloppy science. Although preregistration is certainly helpful, it does not guarantee high-quality research, and we do need to stay critical about preregistered studies, and the way they are carried out.

References

Andrew Gelman and Eric Loken. 2013. The garden of forking paths: Why multiple comparisons can be a problem, even when there is no "fishing expedition" or "p-hacking" and the research hypothesis was posited ahead of time. Department of Statistics, Columbia University.

Andrew Gelman and Eric Loken. 2014. The statistical crisis in science: data-dependent analysis - a "garden of forking paths" - explains why many statistically significant comparisons don't hold up. American Scientist, 102(6).

A Preregistration forms
This appendix provides preregistration forms for different kinds of paper types. These forms are preliminary, and they are mainly meant as a starting point for discussions of preregistration in NLP.
We are happy to admit that there may be flaws in this appendix (either in the forms or in our reasoning). Future work should investigate whether these forms are complete (i.e. limit researcher degrees of freedom as much as possible) and appropriate for different kinds of NLP research.

A.1 Preface: data availability in NLP
Preregistration is a means to avoid hindsight bias, because you have to specify your expectations upfront, when your perspective is not yet coloured by your experience with the data. But for NLP studies it is unclear what 'the data' is. We can distinguish three kinds of data:
1. the training/validation/test sets,
2. the model output,
3. human judgments.
In an ideal situation, preregistration would occur before any kind of data has been obtained. The problem is that this is often not the case; there are many canonical datasets for which the data is publicly available. Of course one could collect an additional test set (as we suggested above), but the community often judges new approaches based on their performance for established datasets. So what should we do? Still preregister! Arguably the training, validation, and test data is usually not central to the work. What matters is how a particular system performs. So even if we don't usually find ourselves in the ideal situation where none of the data is available yet, it is typically fine to preregister your study if the train/eval/test data is available but system outputs and evaluation scores are not. When authors are transparent in their data sharing policy, we can reconstruct the timeline of events before and after the preregistration, to see how much their knowledge about the data may have influenced them.

A.2 Computationally aided linguistic analysis
This paper type corresponds to several different setups, ranging from experiments with human subjects, to corpus analyses to see if particular generalisations from the literature hold up. Preregistration has been discussed from a linguistics perspective by Roettger (2020). For experiments with human participants, readers may refer to the standard preregistration forms from AsPredicted (see our Table 1) or OSF, or to the questions in Roettger's Figure 1.
For more corpus-oriented studies (e.g. Ruppenhofer et al. 2018), we should consider a mix of the quantitative and qualitative questions from our Table 1. Usually these kinds of studies do require some data collection, so authors should ask: