AmbigQA: Answering Ambiguous Open-domain Questions

Ambiguity is inherent to open-domain question answering; especially when exploring new topics, it can be difficult to ask questions that have a single, unambiguous answer. In this paper, we introduce AmbigQA, a new open-domain question answering task which involves predicting a set of question-answer pairs, where every plausible answer is paired with a disambiguated rewrite of the original question. To study this task, we construct AmbigNQ, a dataset covering 14,042 questions from NQ-open, an existing open-domain QA benchmark. We find that over half of the questions in NQ-open are ambiguous, exhibiting diverse types of ambiguity. We also present strong baseline models for AmbigQA which we show benefit from weakly supervised learning that incorporates NQ-open, strongly suggesting our new task and data will support significant future research effort. Our data is available at https://nlp.cs.washington.edu/ambigqa.


Introduction
In the open-domain setting, it can be difficult to formulate clear and unambiguous questions. For example, Figure 1 shows a Google search query (Kwiatkowski et al., 2019) that, perhaps surprisingly, has two possible interpretations given the evidence in Wikipedia. Although open-domain question answering (QA) systems aim to answer any factoid question (Voorhees et al., 1999), existing methods assume all questions have a single well-defined answer. In this paper, we define the task of ambiguous question answering, and present the first data and baseline methods for the study of how to disambiguate and answer such questions.
Ambiguity arises frequently in open-domain QA, where questions are written during information gathering (e.g. search queries) without knowledge of the answer. As we will see in Section 4, over 50% of the questions we sampled from a set of Google search queries are ambiguous. Furthermore, identifying ambiguities is difficult both for humans and machines. As we saw in Figure 1, ambiguity is a function of both the question and the evidence provided by a very large text corpus.
To study this challenge, we introduce AMBIGQA (Answering Ambiguous Open-domain Questions), a new task which involves disambiguating and answering a potentially ambiguous question. Specifically, the model must (1) find a set of distinct, equally plausible answers to the question, and (2) provide minimal yet unambiguous rewrites of the question that clarify the interpretation which leads to each answer. Figure 1 provides an example of two such disambiguated questions and their associated answers.
To support the study of this task, we construct AMBIGNQ, a dataset covering 14,042 questions from NQ-OPEN (Kwiatkowski et al., 2019), an existing open-domain QA benchmark. For each question, annotators search for, navigate, and read multiple Wikipedia pages to find as many answers as possible. The high prevalence of ambiguity makes the annotation task difficult even for human experts; it is inherently difficult to know if you have found every possible interpretation of a question. Nonetheless, we were able to collect high quality data that covers high levels of ambiguity (2.1 distinct answers per question on average) with high estimated agreement (89.0 F1) on valid answers. The types of ambiguity are diverse and sometimes subtle (Table 1), including ambiguous entity or event references, or ambiguity over the answer type, many of which are only apparent after examining one or more Wikipedia pages.
To establish initial performance levels on this data, we present results for a set of strong baseline methods. We extend a state-of-the-art model for NQ-OPEN (Karpukhin et al., 2020) with three new components: (1) set-based question answering with a BART-based sequence-to-sequence model (Lewis et al., 2020), (2) a question disambiguation model also based on BART, and (3) a modification to democratic co-training (Zhou and Goldman, 2004) which allows this model to leverage the partial supervision available in the full NQ-OPEN dataset. We also report an ablation study and qualitative analysis, which suggest there is significant room for future work on this task.
To summarize, our contributions are threefold: a new task (AMBIGQA), a new dataset (AMBIGNQ), and strong baseline models.


Related Work

Open-domain question answering typically assumes each question has a single well-defined answer (Lee et al., 2019; Asai et al., 2020; Min et al., 2019a,b; Guu et al., 2020; Karpukhin et al., 2020). Nonetheless, Kwiatkowski et al. (2019) report that the answers to such questions are often debatable, and the average agreement rate on NQ-OPEN test data is 49.2%, in large part due to ambiguous questions. In this paper, we embrace this ambiguity as inherent to information-seeking open QA, and present the first methods for returning sets of answers paired with different interpretations of the question.
Clarification questions have been used to study and mitigate question ambiguity in other related settings. Research on community question answering (Braslavski et al., 2017; Rao and Daumé III, 2018, 2019) models ambiguity that arises from common information gaps in the question, which are often specific to the community being studied.
Recently, Xu et al. (2019) study clarification of questions deliberately constructed to have prespecified entity reference ambiguities. Aliannejadi et al. (2019) investigate using clarification questions to iteratively refine intents of simple queries where the information need is not immediately apparent (e.g., single keywords like dinosaur).
In contrast, we study open-domain factoid questions asked by real users: these present clear information needs, but carry diverse naturally occurring ambiguities, as shown in Table 1. Furthermore, instead of prolonging the user's information-seeking session with clarification questions, our task formulation provides a complete and immediate solution with a set of disambiguated answers.
Task: AMBIGQA

AMBIGQA Setup

Figure 1 depicts the AMBIGQA task. The input is a prompt question q, and the output is a list of n question-answer pairs (x_1, y_1), ..., (x_n, y_n), where each y_i is an equally plausible answer to q, and each x_i is a minimally edited modification of q whose answer is unambiguously y_i. We consider two subtasks.

Multiple Answer Prediction. Given a question q, output a set of semantically distinct and equally plausible answers y_1, ..., y_n, where n is unknown.

(Footnote 1: The NQ-OPEN test data is derived from NATURAL QUESTIONS development data, which has 5-way annotations; we compute their pairwise agreement based on string match.)
Question Disambiguation. Given q and a set of answers y_1, ..., y_n, generate disambiguated questions x_1, ..., x_n, where each x_i is a minimal edit of q which makes it specific enough that y_i is a correct answer and every y_j with j ≠ i is incorrect. When n = 1, this task is trivial, as x_1 = q.
We choose to represent ambiguity with a set of disambiguated questions because it is well-defined, immediately human-interpretable, and allows for straightforward annotation of a wide range of ambiguities without complex guidelines.

Evaluation Metrics
To evaluate model performance, we present several ways to compare a model prediction consisting of m question-answer pairs (x_1, y_1), ..., (x_m, y_m) against a gold reference set of n pairs (x̂_1, ŷ_1), ..., (x̂_n, ŷ_n).
Since there may be more than one way to refer to a single answer (e.g., Michael Jordan versus Michael Jeffrey Jordan), each gold answer ŷ_i is a set of acceptable answer strings, where all ŷ_i are disjoint.
We assign each predicted question-answer pair (x_i, y_i) a correctness score c_i = I[y_i ∈ ŷ_j] f(x_i, x̂_j), where j is the index of the matched gold pair and f is a string similarity function valued in [0, 1]. Intuitively, c_i counts a correct answer weighted by the similarity f(x_i, x̂_j) between the predicted and reference questions, to account for the fact that there can be many acceptable paraphrases of each reference question. Treating the c_i as measures of correctness, we compute precision as (Σ_i c_i)/m and recall as (Σ_i c_i)/n, and report their harmonic mean F1.

We consider three choices of the question similarity function f. F1 ans is the F1 score on answers only, where f always yields 1; it may be used without the question disambiguation step. F1 BLEU accounts for string similarity between questions, calculating f with BLEU (Papineni et al., 2002). F1 EDIT-F1 uses EDIT-F1 as f. EDIT-F1 is a new measure that represents each disambiguated question by its added and deleted unigrams compared to the prompt question, and computes the F1 score between the predicted and gold edits. For example, consider the prompt question "Who made the play the crucible?", the reference "Who wrote the play the crucible?" and the prediction "Who made the play the crucible in 2012?". The gold edits here are −made, +wrote, while the predicted edits are +in, +2012. So even though the questions are similar, their EDIT-F1 is zero. Unlike BLEU, which we use to directly measure similarity to the gold question, this metric only gives credit for getting the key semantic differences between the original question and the clarification correct.
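As a concrete illustration, here is a minimal Python sketch of EDIT-F1 and the correctness-weighted F1 described above. The function names, whitespace tokenization, and greedy gold matching are assumptions for clarity; the official evaluation script may differ in these details.

```python
from collections import Counter

def get_edits(prompt, question):
    """Added (+) and deleted (-) unigrams of `question` relative to `prompt`."""
    p = Counter(prompt.lower().rstrip("?").split())
    q = Counter(question.lower().rstrip("?").split())
    return Counter(["+" + w for w in (q - p).elements()] +
                   ["-" + w for w in (p - q).elements()])

def edit_f1(prompt, reference, prediction):
    """F1 overlap between the predicted and reference edit multisets."""
    gold = get_edits(prompt, reference)
    pred = get_edits(prompt, prediction)
    if not gold and not pred:
        return 1.0  # both questions are identical to the prompt
    overlap = sum((gold & pred).values())
    if overlap == 0:
        return 0.0
    prec = overlap / sum(pred.values())
    rec = overlap / sum(gold.values())
    return 2 * prec * rec / (prec + rec)

def ambigqa_f1(preds, golds, f=lambda x, x_hat: 1.0):
    """Correctness-weighted F1. preds: list of (question, answer) pairs;
    golds: list of (question, set-of-acceptable-answer-strings) pairs.
    With the default f == 1 this is F1 ans; pass a BLEU or EDIT-F1 based
    similarity (closed over the prompt question) for the other metrics."""
    used, scores = set(), []
    for x, y in preds:
        c = 0.0
        for j, (x_hat, y_hat) in enumerate(golds):
            if j not in used and y in y_hat:  # each gold pair matched at most once
                used.add(j)
                c = f(x, x_hat)
                break
        scores.append(c)
    prec = sum(scores) / len(preds) if preds else 0.0
    rec = sum(scores) / len(golds) if golds else 0.0
    return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
```

On the paper's own example, the gold edits {−made, +wrote} and predicted edits {+in, +2012} share no unigram, so EDIT-F1 is zero despite the surface similarity of the questions.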

Data Collection
We construct AMBIGNQ using prompt questions from NQ-OPEN and English Wikipedia as the evidence corpus. We use Amazon Mechanical Turk for crowdsourcing.
The crucial challenge in our annotation task is maximizing recall: finding all possible distinct answers to a question. This is difficult even for humans, as ambiguities are often only apparent after carefully searching the evidence for multiple possible answers. However, we were able to collect high quality data with high levels of ambiguity using careful worker selection and an annotation pipeline with two stages: generation and validation.
Generation. Workers in the first stage are given a prompt question and provided access to a search box that uses the Google Search API to return results restricted to English Wikipedia. They may write any number of search queries to find, read, and navigate Wikipedia pages in search of answers to the prompt question. By allowing annotators to find Wikipedia pages on their own, we closely approximate the real process people use to answer open-ended questions, an approach for which there is no existing large-scale dataset. Workers annotate answers to the prompt question as free text, which we instruct them to copy and paste from Wikipedia. A single distinct answer may be annotated with multiple possible spans (e.g., Michael Jordan versus Michael Jeffrey Jordan). We ask workers to consider all possible user intents, or plausible answers to the question. For questions with multiple distinct answers, each answer must be annotated with a minimal edit of the prompt question which differentiates it from the other answers, in line with our task requirements. As a special case, some questions contain temporal deixis which depends on the time of writing, e.g., "When does the new family guy season come out?" To avoid unmanageably large sets of answers, we instruct workers to remove the time-dependence by refining the prompt question to be based on at most the three most recent events before Jan 1, 2018, e.g., "When does family guy season 16 come out?" (example in Table 1).
Generators may skip the prompt question if the answer is not found in Wikipedia, or if the question is ill-formed, too subjective, or too ambiguous, e.g., "When did the new tax cuts go into effect?"

Validation. Workers in the validation stage review the complete annotations provided by multiple generators. Validators mark each generator's annotations as correct or incorrect; the correct ones are provided as gold references in the final dataset. If only some question-answer pairs from each generator are correct, the validator provides a new set of question-answer pairs by combining the valid ones from each generator and disambiguating the questions accordingly. In this case, only the new set of question-answer pairs is used as gold.
Just like in the generation stage, validators have access to a search engine over Wikipedia, and are additionally given the Wikipedia pages that the generators viewed in order to speed up the process. Validation is skipped when all annotated answers from all generators exactly match (37% of cases).
Quality control. We recruit highly qualified workers through a qualification test (details in Appendix A). Although the task was difficult for most workers, we found that our highly qualified full-time workers, given quick and detailed feedback on their work, produced annotations with high accuracy and recall.
At the beginning of data collection, we used a more complex pipeline consisting of three generators and two validators. However, we eventually found that managing a small pool of highly qualified workers allowed us to collect high-quality data with a simpler pipeline and lower costs. We pay 0.75 and 0.15 USD per prompt question for generation and validation, respectively. For the development and test data, we use two generators and one validator per prompt question. For the training data, we skip validation and use only one generator per prompt question.
Inter-annotator agreement. Evaluating both generators against each other on the development set yields 60.8 F1 ans . All annotations passed validation for 76% of questions, while validators made changes (edits or exclusions) in the remaining 24%. This indicates that the task is hard even for humans, but the validation stage enables aggregation of work from multiple generators.
The average F1 ans between the co-authors and workers on a sample of 50 validations was 89.0%. This indicates that, despite the intrinsic difficulty and the subjectivity of the task, humans agree on the boundary between valid and invalid answers in most cases.

Data Analysis
The final dataset contains 14,042 annotated examples. As shown in Table 2, over 50% of development and test examples contain multiple question-answer pairs. This indicates a surprisingly high rate of ambiguity in NQ-OPEN, even though previous work has studied it with the assumption that each question has a single answer. We also find a discrepancy between development and test; this is likely due to the way in which NQ-OPEN is constructed, which over-samples difficult questions in the test set (see Appendix B for details). In comparison to development and test, fewer training examples contain multiple question-answer pairs (47%), presumably because using only one worker per training example yielded slightly lower recall.
Types of ambiguity. Table 1 shows a breakdown of the types of ambiguity, where a single example may fall into multiple categories. They are diverse, including ambiguity in entity references, event references, properties, and answer types, with a relatively uniform distribution between them. It is worth comparing with Xu et al. (2019), who intentionally elicited questions with ambiguous entity references; our analysis shows that unintended ambiguity comes from more diverse sources. In many cases, the ambiguity is not apparent from the prompt question alone, but only after researching the question on Wikipedia, as evidenced by differences in model performance (Section 6.2).
Annotator behavior. Figures 2a and 2b show the number of unique Wikipedia pages visited from the search interface and the number of unique search queries written by workers during the annotation process. More often than not, workers used multiple search queries and navigated multiple Wikipedia pages. This shows how our setup captures ambiguity in the retrieval step of open-domain question answering, which is missed by annotation approaches that assume a pre-specified evidence document.
Distribution of edits. Figure 2c shows a word cloud of unigram edits made to questions in the development data, where we remove stopwords except wh-words and group numeric values by the number of digits. We find that adding a numeric value such as a year is common, as it is an easy way to disambiguate entity or event references and to remove time dependence. Edits to the wh-word are also common, especially for specifying the answer type (e.g., from "who" to "which group"; see Table 1). The distribution of edits is fairly long-tailed, with the top 100 most frequent edits covering 36% of the total, and the top 1,000 covering 69%.

Algorithm 1: Democratic co-training with weak supervision

    procedure DEMOCRATIC-CO-TRAINING(D_full, D_partial)
        // Each question in D_full has an annotated answer list
        // Each question in D_partial has one annotated answer
        D̃_full ← D_full
        for iter ∈ {1..N} do
            for i ∈ {1..C} do
                f_i ← train(D̃_full)            // train C sequence-to-sequence QA models
            D_L ← D̃_full
            for (q_j, y_j) ∈ D_partial do       // predict using y_j as a decoding prefix
                Ŷ_j ← {ŷ : ŷ ≠ y_j and |{i : ŷ ∈ f_i(q_j | y_j), i = 1..C}| > C/2}
                if |Ŷ_j| > 0 then               // a majority predicts an additional answer
                    D̃_full ← D_L ∪ {(q_j, {y_j} ∪ Ŷ_j)}
                else if ∀i = 1..C, |f_i(q_j) − {y_j}| = 0 then  // all models predict y_j alone
                    D̃_full ← D_L ∪ {(q_j, {y_j})}
Mismatches with NQ-OPEN. 29.4% of our development examples do not include the NQ-OPEN answer. We report a breakdown of a random sample of 50 such questions in Appendix C. In short, the mismatch cases do not indicate low precision or low recall; in many cases both AMBIGNQ answers and NQ-OPEN answers are correct (60%), or AMBIGNQ answers include a better answer than NQ-OPEN answers (32%).

Model
To set initial performance levels on AMBIGNQ, we present a baseline AMBIGQA model combining ideas from recent advances in open-domain QA (Karpukhin et al., 2020) and generative pretraining (Lewis et al., 2020). Given a prompt question q, our model predicts answers y_1..y_n and generates corresponding questions x_1..x_n conditioning on q, the answers y_1..y_n, and the passages containing them. A novel co-training step also allows the model to leverage the partial supervision available in NQ-OPEN.

Multiple Answer Prediction. Our answer prediction model, SPANSEQGEN, retrieves passages for q following Karpukhin et al. (2020) and generates answers with a sequence-to-sequence model based on BART (Lewis et al., 2020). Specifically, it conditions on the concatenation of q and the top passages, in order, up to 1024 tokens, and sequentially generates distinct answers token-by-token, separated by [SEP], in the order in which they appear in the input passages. We pretrain SPANSEQGEN on NQ-OPEN and fine-tune it on AMBIGNQ.
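To make the input/output format concrete, here is a small sketch of how SPANSEQGEN's input might be assembled and its [SEP]-separated output parsed. The helper names are illustrative, and whitespace tokenization is a stand-in for BART's subword tokenizer.

```python
def build_input(question, passages, max_tokens=1024):
    """Concatenate the question with the top retrieved passages, in order,
    truncated to max_tokens (whitespace tokens approximate BART subwords)."""
    tokens = question.split()
    for passage in passages:
        tokens.extend(passage.split())
        if len(tokens) >= max_tokens:
            break
    return " ".join(tokens[:max_tokens])

def parse_answers(decoded):
    """Split a generated sequence on [SEP] into a list of distinct answers."""
    answers = []
    for piece in decoded.split("[SEP]"):
        piece = piece.strip()
        if piece and piece not in answers:  # keep only distinct answers
            answers.append(piece)
    return answers
```

For example, a decoded sequence "May 21, 2012 [SEP] June 1, 2012" would yield two distinct answers.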
We develop SPANSEQGEN primarily because the model of Karpukhin et al. (2020) is designed to generate a single answer; SPANSEQGEN also boosts performance on NQ-OPEN (41.5→42.2 on the test data). We include ablations on different approaches and models in Section 6.2.
Question Disambiguation. We design a question disambiguation (QD) model based on BART. The model generates each question x_i (i = 1..n) conditioning on the concatenation of q, the target answer y_i, the other answers y_1..y_{i−1}, y_{i+1}..y_n, and the same top passages used by SPANSEQGEN. We pretrain on NQ-OPEN to generate questions given an answer and a passage, and then fine-tune on the full task data in AMBIGNQ. We include ablations on different variants of the model in Section 6.2.
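A rough sketch of this conditioning is below; the field order and separator are assumptions for illustration, not the paper's exact serialization.

```python
def qd_input(prompt_q, target_answer, other_answers, passages, sep=" [SEP] "):
    """Serialize the QD model's input: prompt question, target answer,
    the remaining answers, then the retrieved passages."""
    other = ", ".join(other_answers) if other_answers else "none"
    return sep.join([prompt_q, target_answer, other] + list(passages))
```

The decoder then generates the disambiguated question x_i for the target answer from this single flattened sequence.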
Co-training with weak supervision. Given the prevalence of unmarked ambiguity in NQ-OPEN, we introduce a method for treating the original annotations as weak supervision, while learning to correct for some of the false negatives in the data. We modify a democratic co-training algorithm (Zhou and Goldman, 2004) as described in Algorithm 1. We iteratively grow the training set D̃_full from AMBIGNQ (D_full) with silver data from NQ-OPEN (D_partial) predicted by a majority of C SPANSEQGEN models trained on D̃_full. The key step is injecting the known answer y_j from NQ-OPEN as a prefix to SPANSEQGEN's output during prediction. In each step, if a majority of the C models predict an additional answer, we assume we have found a false negative and add the result to the training set D̃_full. On the other hand, if all models predict no additional answer, we add the example to D̃_full with y_j as its single answer.
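A minimal executable sketch of this procedure follows, assuming a `train` callable that returns models exposing `predict(q, prefix=None)`; these interfaces are illustrative placeholders, not the paper's implementation.

```python
from collections import Counter

def democratic_cotrain(d_full, d_partial, train, num_iters=2, num_models=6):
    """Grow an answer-set training set with silver labels from singly-annotated data.
    d_full: list of (question, answer_set); d_partial: list of (question, answer)."""
    d_full = list(d_full)
    for _ in range(num_iters):
        models = [train(d_full) for _ in range(num_models)]
        for q, y in d_partial:
            # Decode with the known answer y forced as a prefix of the output.
            preds = [set(m.predict(q, prefix=y)) for m in models]
            votes = Counter(a for p in preds for a in p if a != y)
            extra = {a for a, v in votes.items() if v > num_models / 2}
            if extra:
                # A majority found an additional answer: treat y alone as a false negative.
                d_full.append((q, {y} | extra))
            elif all(not (set(m.predict(q)) - {y}) for m in models):
                # All models agree y is the only answer.
                d_full.append((q, {y}))
    return d_full
```

The majority vote makes the silver labels conservative: an extra answer is accepted only when more than half of the models, each decoding from the known-answer prefix, produce it.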

Experiments
We describe the baseline models used in our experiments, followed by results and ablations. Implementation details and hyperparameters of all models are provided in Appendix D.

Baselines
DISAMBIG-FIRST. This baseline disambiguates the prompt question without any context from plausible answers or reference passages. Specifically, it implements the following pipeline: (1) Feed the prompt question q into a BERT-based binary classifier to determine whether it is ambiguous.
(2) If q is ambiguous, pass it into a BART-based model which generates a sequence of disambiguated questions x 1 ..x n (n > 1), separated by [SEP]; otherwise, consider only x 1 = q.
(3) Feed each x i into a state-of-the-art model on NQ-OPEN (Karpukhin et al., 2020) to produce its answer y i .
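The three-step pipeline above can be sketched as follows, with the three callables standing in for the trained components.

```python
def disambig_first(q, is_ambiguous, rewrite, answer):
    """DISAMBIG-FIRST pipeline sketch: classify, rewrite, then answer each rewrite."""
    if is_ambiguous(q):            # (1) BERT-based binary ambiguity classifier
        questions = rewrite(q)     # (2) BART-based generation of disambiguated questions
    else:
        questions = [q]            # unambiguous: keep the prompt question
    return [(x, answer(x)) for x in questions]  # (3) answer each question independently
```

Note that in this design the rewriter never sees retrieved evidence, which is exactly the weakness the results in Section 6.2 attribute to DISAMBIG-FIRST.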
Thresholding + QD. We also include a model based on Karpukhin et al. (2020), with thresholding for multiple answer prediction and our BART-based question disambiguation (QD) model. The Karpukhin et al. (2020) model outputs a passage selection probability P(p_i) for passage p_i, and probabilities P(s|p_i) and P(t|p_i) for the start and end tokens of the answer span, respectively. We treat the product of these three probabilities as the span's likelihood of being the answer, and obtain y_1..y_n by taking valid spans with likelihood larger than a hyperparameter γ. The model is trained to maximize the marginal likelihood of any span in the gold answer set ŷ_1..ŷ_n. As with SPANSEQGEN, we pretrain on NQ-OPEN and finetune on AMBIGNQ. Finally, we produce disambiguated questions using our BART-based QD model (Section 5).

Table 3 reports the performance of our baselines; example model outputs are provided in Table 6.
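The thresholding step can be sketched as follows; the data layout and the default γ are illustrative assumptions.

```python
def threshold_answers(passage_probs, candidate_spans, gamma=0.05):
    """Score each candidate span as P(passage) * P(start|passage) * P(end|passage)
    and keep every distinct span whose score exceeds gamma.
    candidate_spans[i] is a list of (text, p_start, p_end) tuples for passage i."""
    scores = {}
    for p_prob, spans in zip(passage_probs, candidate_spans):
        for text, p_start, p_end in spans:
            score = p_prob * p_start * p_end
            if score > gamma:
                # Keep the best score for spans appearing in several passages.
                scores[text] = max(score, scores.get(text, 0.0))
    return sorted(scores, key=scores.get, reverse=True)
```

Because all spans share one probability mass, raising γ quickly suppresses secondary answers, which is the tension discussed in the results below.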

Results
Main results. We first find that DISAMBIG-FIRST is significantly worse than the other models. In particular, its classification accuracy on whether the prompt question is ambiguous is 67%, close to the majority baseline (60%). When the model does identify an ambiguous question, its rewrites often look reasonable on the surface, but do not match the facts. For instance, in example 1 of Table 6, it asks about filming in 2017 and during season 1 for Snow White and the Huntsman, which was actually a film released in 2012. This shows that reading evidence documents is crucial for identifying and characterizing ambiguities.

While SPANSEQGEN outperforms Karpukhin et al. (2020) with thresholding, the difference is not as great as we expected. This suggests two things. First, thresholding may be a surprisingly effective baseline for outputting multiple answers, even though the answers must compete with each other for probability mass in order to surpass the threshold γ. Second, maximizing likelihood in a sequence-to-sequence model like SPANSEQGEN may not produce well-calibrated results for question answering. For instance, the model seems to suffer from variation in the length of the output sequence, producing shorter sequences on average (3.0 tokens) than gold (6.7). This problem has also been reported in other conditional generation tasks (Sountsov and Sarawagi, 2016; Stahlberg and Byrne, 2019); we leave it for future work.

Overall, SPANSEQGEN achieves reasonable F1 ans scores. F1 ans on examples with multiple question-answer pairs (F1 ans (multi)) is lower, indicating that predicting all plausible answers is more challenging than predicting a single answer, as expected. SPANSEQGEN also obtains the best performance in F1 BLEU and F1 EDIT-F1, although their absolute values are low in general; we investigate this in our question disambiguation ablations below. See Table 6 for example predictions and error cases.
There is a substantial difference in performance between development and test overall, likely due to distributional differences in the original questions in NQ-OPEN; detailed discussion is in Appendix B.
Effect of co-training. To see the effect of our co-training method, we compare with a naive ensemble, as co-training also requires multiple trained models. While we see gains from ensembling alone, an ensemble trained with the co-training method achieves the best performance on all metrics. This result demonstrates the importance of jointly using AMBIGNQ and partial supervision from NQ-OPEN.
Ablations on question disambiguation. Table 4 reports results of an ablation experiment on question disambiguation. Among our ablations, we include models without the prompt question or untargeted answers as input, and a naive baseline that always outputs the prompt question. We report the metrics both on the full task and with gold answers given, to measure performance dependent on and independent of multiple answer prediction, respectively. Simply copying the prompt question gives high F1 BLEU, which is natural since the questions were disambiguated using minimal edits. This justifies using F1 EDIT-F1 to evaluate semantic differences from the prompt question. In addition, we find that our QD model conditioned on all available context is better than the other variants on the overall metrics.
Performance is low overall, even given the gold answers, highlighting the challenge of the task. We think there are two major reasons. First, maximizing the likelihood of the output sequence can underweight the edits to the prompt question, leading the QD model to miss the information that is most important for differentiating one answer from the others. Second, there is a lack of annotated data, especially for question disambiguation, which does not benefit from weakly supervised learning with NQ-OPEN; future work can explore how to maximize the use of supervision from other available data. It is also worth noting that the metric may miss edits that are semantically correct but phrased differently (see Table 6, example 2).

Table 6: Model predictions on samples from the development data. (#1) DISAMBIG-FIRST generates questions that look reasonable on the surface but do not match the facts; SPANSEQGEN produces reasonable, though imperfect, answers and questions. (#2) SPANSEQGEN produces correct answers and questions. (#3) The model produces the incorrect answer "February 9, 2018", which is the release date of Fifty Shades Freed.

Zero-shot multiple answer prediction
Since AMBIGNQ provides an evaluation set with explicit sets of multiple answers, we can also test whether models trained on partial supervision only (NQ-OPEN) are capable of producing full answer sets. This may be important for modeling in domains where single-answer datasets are available but full annotations like those in AMBIGNQ are not. To this end, we present a zero-shot setting where a system predicts multiple distinct answers without using AMBIGNQ training data. We include four NQ-OPEN models, including ours, covering diverse approaches and model architectures, as baselines. When trained on NQ-OPEN, these models may be made to predict multiple answers via thresholding as described in Section 6.1. (We allow using development data to tune the threshold γ, although this arguably makes our setting not zero-shot in the strictest sense.)

Table 5 reports zero-shot performance. Although Min et al. (2019b) outperforms Asai et al. (2020) on NQ-OPEN, it is worse on AMBIGNQ. In addition, although SPANSEQGEN outperforms Karpukhin et al. (2020) in the standard setting, it is worse in zero-shot F1 ans (multi), potentially because thresholding exacerbates the problems that SPANSEQGEN has with long sequences (Section 6.2).

Conclusion & Future Work
We introduced AMBIGQA, a new task that involves providing multiple possible answers to a potentially ambiguous open-domain question, together with a disambiguated question corresponding to each answer. We constructed AMBIGNQ, a dataset with 14,042 annotations on NQ-OPEN questions. Our analysis shows the dataset contains diverse types of ambiguity, often not visible from the prompt question alone but only found upon reading evidence documents. We also introduced a first baseline model for producing multiple answers to open-domain questions, with experiments showing its effectiveness in learning from our data while highlighting avenues for future work. Future work may investigate (1) more effective ways of dealing with highly ambiguous questions (e.g., returning tables or other structures), (2) providing information related to the inferred information need when no answers are found, or (3) dealing with ill-formed questions.

A Data Collection Details
We use Amazon Mechanical Turk and Spacro (Michael et al., 2018) for crowdsourcing. All data was collected in February and March of 2020. We use the Google Search API restricted to English Wikipedia for the search tool.
Crowdsourcing interface. Figure 3 shows the interface used for generation and validation. We use an iframe to render Wikipedia pages in a mobile view, in order to present documents in the format workers are familiar with, rather than plain, unformatted text. When workers write questions and answers in the generation stage, we show appropriate error messages (e.g., when the written question is the same as the prompt question) or warnings (e.g., when the answer is composed of more than 20 words) in order to give immediate feedback.
Quality control. We recruit full-time workers dedicated to our task by requiring a minimum number of HITs achievable only by working 40 hours a week. We also host a public website where workers can monitor validation status, ask about examples whose validation result they do not understand, or contest validations they believe are incorrect. We found this communication very useful for giving feedback and fixing incorrect annotations.
Inter-annotator agreement. When two independent generators are evaluated on the answer list from each other, they obtain 60.8 F1 ans . Specifically, for 76% of questions, all annotations passed validation, either automatically because they exactly matched (37%) or because they were both accepted by validators (39%). In the remaining 24% of cases, one annotator missed a possible question-answer pair that the other one found, or included an invalid question-answer pair.

B Discrepancy between development and test in NQ-OPEN
In our experiments on AMBIGNQ, we found a significant discrepancy between the development and test sets. Upon further investigation, we identified that this is at least in part due to a distributional difference between the development and test sets of NQ-OPEN, upon which we built the data. As this may be important for other researchers working on NQ-OPEN, we detail our findings here. Following Lee et al. (2019), NQ-OPEN is constructed by filtering NATURAL QUESTIONS to questions where at least one annotator provided a non-null short answer. While the training and development sets of NQ-OPEN were drawn from the training set of NATURAL QUESTIONS, in which one annotator answered each question, the test set of NQ-OPEN is taken from its development set, which had five annotators per question.
This difference in the number of annotators introduces a sampling bias: questions for which an annotator is less likely to find an answer are overrepresented in the NQ-OPEN test set, in comparison to training and development. Suppose, for example, that a randomly sampled annotator has a 50% chance of producing a short answer for some question q. Then q has a 50% chance of making it into NQ-OPEN's development set, but a (1 − 0.5^5 ≈) 97% chance of making it into test. Concretely, when each annotator is considered independently, 34.6% of the short answer annotations in the test set of NQ-OPEN are null answers, and the majority of annotations are null for 33.9% of questions.
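The sampling-bias arithmetic above can be checked with a one-line calculation:

```python
def inclusion_prob(p_short_answer, n_annotators):
    """Probability that at least one of n independent annotators produces a short
    answer, i.e. the chance the question passes the NQ-OPEN filter."""
    return 1 - (1 - p_short_answer) ** n_annotators
```

With p = 0.5, a single annotator admits the question half the time, while five independent annotators admit it about 97% of the time, so hard questions are far more likely to survive into the five-way-annotated test set.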
As a consequence, there is a significant gap in model performance between development and test when they are evaluated under the same conditions. The official evaluation protocol for NQ-OPEN counts a prediction as correct if it matches any of the gold reference answers. Under these conditions, the gap between development and test appears marginal (Table 7, first two columns). However, as the NQ-OPEN test set was more comprehensively annotated than development, its evaluation is more generous; the number of unique reference answers is 1.2 and 1.8 on development and test, respectively. In order to make the evaluation more consistent, we try evaluating models against the first reference answer only, and find a significant gap between development and test (5-8%) across all models (Table 7, last two columns). Despite this discrepancy, AMBIGNQ follows the setup and data split of NQ-OPEN, providing consistency with prior work. Since the AMBIGNQ development and test sets were annotated under the same conditions, this discrepancy now shows up in the metrics. We leave the distribution shift of questions in the test data as one of the challenges of AMBIGNQ.

C Data Analysis Details
Mismatches with NQ-OPEN. 29.4% of AMBIGNQ development examples do not include the NQ-OPEN answer. We analyze a random sample of 50 such questions, and present a breakdown in Table 8. We find that our answers are correct in 92% of cases, including 44% of disagreements due to mismatched spans, 22% due to the NQ-OPEN answer being incorrect, and 14% due to time-dependence in the question. Of the 8% of cases where our answer is incorrect, the NQ-OPEN answers are also incorrect over half the time, indicating that these may be difficult questions.

D Baseline Implementation Details
Evidence corpus. We use English Wikipedia dumps from 2018-12-20 and 2020-01-20 for NQ-OPEN and AMBIGNQ, respectively.

(Note on Appendix B: it is unlikely that the discrepancy between development and test is due to overfitting on development, because the effect is consistent across models and not present on the other datasets that they are evaluated on.)
Model implementation. All models are implemented in PyTorch (Paszke et al., 2017), with PyTorch-Transformers (Wolf et al., 2019) for BERT and fairseq (Ott et al., 2019) for BART. We use BERT BASE and BART LARGE for all models. For any procedure we adopt from Karpukhin et al. (2020), we use the exact same setup and hyperparameters. For passage retrieval with the dual encoder, we use the provided multi-setting trained model. For all BART-based models, we follow the default hyperparameters from the BART summarization code in fairseq, using one 32GB GPU. For finetuning, we change the learning rate to 5e-6 on both tasks. We use beam search to decode the output sequence. We train the model for 4 epochs (when trained on NQ-OPEN or pseudo-labelled data) or 15 epochs (when trained on AMBIGNQ), and take the best checkpoint based on the development data. Note that the perplexity of the output sequence does not correlate with the metric of interest (Exact Match, F1 ans , or F1 EDIT-F1 ), as briefly discussed in Section 6.2, so using the metric of interest instead of perplexity is important for hyperparameter tuning and checkpoint selection.
Details of the ensemble and co-training. We use an ensemble based on voting: the answers predicted by the highest number of models are chosen as the final answers. The number of models in the ensemble is C = 5 before co-training and C = 4 after co-training. For co-training, we use N = 2 and C = 6, where N is the number of iterations and C is the number of models, in line with Algorithm 1. The choice of C is determined by taking the best combination of models as follows. We train sixteen different models, using different hyperparameters including checkpoints from NQ-OPEN, learning rates, the order of the answers in the output sequence, and the random seed. We then measure development F1 ans for different combinations of the models with varying C (4 ≤ C ≤ 6) and take the best one.