Quoref: A Reading Comprehension Dataset with Questions Requiring Coreferential Reasoning

Machine comprehension of texts longer than a single sentence often requires coreference resolution. However, most current reading comprehension benchmarks do not contain complex coreferential phenomena and hence fail to evaluate the ability of models to resolve coreference. We present a new crowdsourced dataset containing more than 24K span-selection questions that require resolving coreference among entities in over 4.7K English paragraphs from Wikipedia. Obtaining questions focused on such phenomena is challenging, because it is hard to avoid lexical cues that shortcut complex reasoning. We deal with this issue by using a strong baseline model as an adversary in the crowdsourcing loop, which helps crowdworkers avoid writing questions with exploitable surface cues. We show that state-of-the-art reading comprehension models perform significantly worse than humans on this benchmark—the best model performance is 70.5 F1, while the estimated human performance is 93.4 F1.


Introduction
Paragraphs and other longer texts typically make multiple references to the same entities. Tracking these references and resolving coreference is essential for full machine comprehension of these texts. Significant progress has recently been made in reading comprehension research, due to large crowdsourced datasets (Rajpurkar et al., 2016; Bajaj et al., 2016; Joshi et al., 2017; Kwiatkowski et al., 2019, inter alia). However, these datasets focus largely on understanding local predicate-argument structure, with very few questions requiring long-distance entity tracking. Obtaining such questions is hard for two reasons: (1) teaching crowdworkers about coreference is challenging, with even experts disagreeing on its nuances (Pradhan et al., 2007; Versley, 2008; Recasens et al., 2011; Poesio et al., 2018), and (2) even if we can get crowdworkers to target coreference phenomena in their questions, these questions may contain giveaways that let models arrive at the correct answer without performing the desired reasoning (see §3 for examples).

Byzantines were avid players of tavli (Byzantine Greek: τάβλη), a game known in English as backgammon, which is still popular in former Byzantine realms, and still known by the name tavli in Greece. Byzantine nobles were devoted to horsemanship, particularly tzykanion, now known as polo. The game came from Sassanid Persia in the early period and a Tzykanisterion (stadium for playing the game) was built by Theodosius II (r. 408-450) inside the Great Palace of Constantinople. Emperor Basil I (r. 867-886) excelled at it; Emperor Alexander (r. 912-913) died from exhaustion while playing, Emperor Alexios I Komnenos (r. 1081-1118) was injured while playing with Tatikios, and John I of Trebizond (r. 1235-1238) died from a fatal injury during a game. Aside from Constantinople and Trebizond, other Byzantine cities also featured tzykanisteria, most notably Sparta, Ephesus, and Athens, an indication of a thriving urban aristocracy.
Q1. What is the Byzantine name of the game that Emperor Basil I excelled at? it → tzykanion
Q2. What are the names of the sport that is played in a Tzykanisterion? the game → tzykanion; polo
Q3. What cities had tzykanisteria? cities → Constantinople; Trebizond; Sparta; Ephesus; Athens
Figure 1: Example paragraph and questions from the dataset. Highlighted text in paragraphs is where the questions with matching highlights are anchored. Next to the questions are the relevant coreferent mentions from the paragraph. They are bolded for the first question, italicized for the second, and underlined for the third in the paragraph.
We introduce a new dataset, QUOREF, that contains questions requiring coreferential reasoning (see examples in Figure 1). The questions are derived from paragraphs taken from a diverse set of English Wikipedia articles and are collected using an annotation process (§2) that deals with the aforementioned issues in the following ways: First, we devise a set of instructions that gets workers to find anaphoric expressions and their referents, asking questions that connect two mentions in a paragraph. These questions mostly revolve around traditional notions of coreference (Figure 1 Q1), but they can also involve referential phenomena that are more nebulous (Figure 1 Q3). Second, inspired by Dua et al. (2019), we disallow questions that can be answered by an adversary model (uncased base BERT, Devlin et al., 2019, trained on SQuAD 1.1, Rajpurkar et al., 2016) running in the background as the workers write questions. This adversary is not particularly skilled at answering questions requiring coreference, but can follow obvious lexical cues; it thus helps workers avoid writing questions that shortcut coreferential reasoning.
QUOREF contains more than 24K questions whose answers are spans or sets of spans in 4.7K paragraphs from English Wikipedia that can be arrived at by resolving coreference in those paragraphs. We manually analyze a sample of the dataset (§3) and find that 78% of the questions cannot be answered without resolving coreference. We also show (§4) that the best system performance is 70.5% F1, while the estimated human performance is 93.4%. These findings indicate that this dataset is an appropriate benchmark for coreference-aware reading comprehension.

Dataset Construction
Collecting paragraphs We scraped paragraphs from Wikipedia pages about English movies, art and architecture, geography, history, and music. For movies, we followed the list of English language films and extracted plot summaries that are at least 40 tokens long; for the remaining categories, we followed the lists of featured articles. Since movie plot summaries usually mention many characters, it was easier to find hard QUOREF questions for them, and we sampled about 40% of the paragraphs from this category.
Crowdsourcing setup We crowdsourced questions about these paragraphs on Mechanical Turk. We asked workers to find two or more co-referring spans in the paragraph, and to write questions such that answering them would require the knowledge that those spans are coreferential. We did not ask them to explicitly mark the co-referring spans.
Workers were asked to write questions for a random sample of paragraphs from our pool, and we showed them examples of good and bad questions in the instructions (see Appendix A). For each question, the workers were also required to select one or more spans in the corresponding paragraph as the answer, and these spans are not required to be the same as the coreferential spans that triggered the questions. We used an uncased base BERT QA model (Devlin et al., 2019) trained on SQuAD 1.1 (Rajpurkar et al., 2016) as an adversary running in the background that attempted to answer the questions written by workers in real time, and the workers were able to submit their questions only if their answer did not match the adversary's prediction. Appendix A further details the logistics of the crowdsourcing tasks. Some basic statistics of the resulting dataset can be seen in Table 1.
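The submission rule described above can be sketched as a small filter. This is a minimal illustration, not the actual annotation interface: the `adversary` callable stands in for the BERT QA model, and both the SQuAD-style normalization and the rule "reject if the prediction equals any selected span" are assumptions, since the text only says the worker's answer must not match the adversary's prediction.

```python
import re
import string
from typing import Callable, List


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def accept_question(
    question: str,
    paragraph: str,
    worker_answers: List[str],
    adversary: Callable[[str, str], str],
) -> bool:
    """Return True when the adversary's prediction matches none of the worker's spans.

    `adversary` is a hypothetical (question, paragraph) -> answer-string function.
    Matching on normalized strings is an assumption made for this sketch.
    """
    prediction = normalize(adversary(question, paragraph))
    return all(prediction != normalize(answer) for answer in worker_answers)
```

With a toy adversary that always answers "Bristol", the question "Which city was bombed?" with gold answer "Bristol" would be rejected, which mirrors the kind of shortcut question the adversary is meant to filter out.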

Semantic Phenomena in QUOREF
To better understand the phenomena present in QUOREF, we manually analyzed a random sample of 100 paragraph-question pairs. The following are some empirical observations.

Requirement of coreference resolution
We found that 78% of the manually analyzed questions cannot be answered without coreference resolution. The remaining 22% involve some form of coreference, but do not require it to be resolved for answering them. Examples include a paragraph that mentions only one city, "Bristol", and a sentence that says "the city was bombed". The associated question, Which city was bombed?, does not really require coreference resolution from a model that can identify city names, making the content in the question after Which city unnecessary.
Types of coreferential reasoning Questions in QUOREF require resolving pronominal and nominal mentions of entities. Table 2 shows percentages and examples of analyzed questions that fall into these two categories. These are not disjoint sets, since we found that 32% of the questions require both (row 3). We also found that 10% require some form of commonsense reasoning (row 4).

Baseline Model Performance on QUOREF
We evaluated two classes of baseline models on QUOREF: state-of-the-art reading comprehension models that predict single spans (§4.1) and heuristic baselines to look for annotation artifacts (§4.2).
We use two evaluation metrics to compare model performance: exact match (EM), and a (macro-averaged) F1 score that measures overlap between a bag-of-words representation of the gold and predicted answers. We use the same implementation of EM as SQuAD, and we employ the F1 metric used for DROP (Dua et al., 2019).
Table 3: Performance of various baselines on QUOREF, measured by exact match (EM) and F1. Boldface marks the best systems for each metric and split.
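For intuition, the two metrics can be sketched for the single-span case as follows. This is a simplified illustration rather than the official SQuAD or DROP evaluation code: the normalization steps are assumed to follow the common SQuAD convention, and the handling of answers that are sets of spans, which the DROP metric supports, is not reproduced here.

```python
import re
import string
from collections import Counter
from typing import List


def normalize(text: str) -> str:
    """Lowercase, strip punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def bag_of_words_f1(prediction: str, gold: str) -> float:
    """Token-level F1 between bag-of-words views of prediction and gold."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def macro_average(scores: List[float]) -> float:
    """Average per-question scores, giving each question equal weight."""
    return sum(scores) / len(scores)
```

For example, predicting "Emperor Basil I" against the gold answer "Basil I" gives precision 2/3 and recall 1, hence an F1 of 0.8 and an EM of 0.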

Heuristic Baselines
In light of recent work exposing predictive artifacts in crowdsourced NLP datasets (Gururangan et al., 2018; Kaushik and Lipton, 2018, inter alia), we estimate the effect of predictive artifacts by training BERT QA and XLNet QA to predict a single start and end index given only the passage as input (passage-only). Table 3 presents the performance of all baseline models on QUOREF.

Results
The best performing model is XLNet QA, which reaches an F1 score of 70.5 on the test set. However, it is still more than 20 F1 points below human performance (see footnote 9). BERT QA trained on QUOREF under-performs XLNet QA, but still gets a decent F1 score of 66.4. Note that BERT QA trained on SQuAD would have achieved an F1 score of 0, since our dataset was constructed with that model as the adversary. The extent to which BERT QA does well on QUOREF might indicate its capacity for coreferential reasoning that was not exploited when it was trained on SQuAD (for a detailed discussion of this phenomenon, see Liu et al., 2019). Our analysis of model errors in §4.4 shows that some of the improved performance may also be due to artifacts in QUOREF.
We notice smaller improvements from XLNet QA over BERT QA (4.12 in F1 test score, 2.6 in EM test score) on QUOREF compared to other reading comprehension benchmarks: SQuAD and RACE (see Yang et al., 2019). This might indicate that pretraining on more data (XLNet was pretrained on 6 times more plain text, nearly 10 times more wordpieces than BERT) is insufficient for coreferential reasoning.
6 https://github.com/huggingface/pytorch-transformers
7 The large BERT model does not fit in the available GPU memory.
8 https://commoncrawl.org/
9 Human performance was estimated from the authors' answers to 400 questions from the test set, scored with the same metric used for systems.
The passage-only baseline under-performs all other systems; examining its predictions reveals that it almost always predicts the most frequent entity in the passage. Its relatively low performance, despite the tendency for Wikipedia articles and passages to be written about a single entity, indicates that a large majority of questions likely require coreferential reasoning.

Error Analysis
We analyzed the predictions from the baseline systems to estimate the extent to which they really understand coreference.
Since the contexts in QUOREF come from Wikipedia articles, they are often either about a specific entity, or are narratives with a single protagonist. We found that the baseline models exploit this property to some extent. For 51% of the QUOREF questions in the development set that BERT QA answered correctly, the answer was either the first or the most frequent entity in the paragraph, while for the questions it answered incorrectly, this value is 12%. XLNet QA exhibits a similar trend, with the numbers being 48% and 11%, respectively.
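A check of this kind could be sketched as below. The text does not say how entities were identified, so this sketch assumes the entity mentions of each paragraph are already extracted, listed in order of appearance, and canonicalized so that coreferent mentions share one string; the function names are hypothetical.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def is_first_or_most_frequent(answer: str, entity_mentions: Sequence[str]) -> bool:
    """True if `answer` is the first-mentioned or the most frequent entity.

    `entity_mentions` is assumed to hold canonical entity names in paragraph order.
    """
    if not entity_mentions:
        return False
    first = entity_mentions[0]
    most_frequent = Counter(entity_mentions).most_common(1)[0][0]
    return answer in (first, most_frequent)


def heuristic_rate(examples: List[Tuple[str, Sequence[str]]]) -> float:
    """Fraction of (answer, entity_mentions) pairs that satisfy the heuristic."""
    hits = sum(is_first_or_most_frequent(answer, mentions) for answer, mentions in examples)
    return hits / len(examples)
```

Computing `heuristic_rate` separately over correctly and incorrectly answered questions would give a pair of percentages comparable in form to those reported above.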
QA systems trained on QUOREF often need to find entities that occur far from the locations in the paragraph at which the questions are anchored. To assess whether the baseline systems exploited the proximity of answers, we manually analyzed predictions of BERT QA and XLNet QA on 100 questions in the development set, and found that the answers to 17% of the questions correctly answered by XLNet QA are the nearest entities, whereas the number is 4% for those incorrectly answered. For BERT QA, the numbers are 17% and 6%, respectively.

Related Work
Traditional coreference datasets Unlike traditional coreference annotations in datasets like those of Pradhan et al. (2007), Ghaddar and Langlais (2016), Chen et al. (2018) and Poesio et al. (2018), which aim to obtain complete coreference clusters, our questions require understanding coreference between only a few spans. While this means that the notion of coreference captured by our dataset is less comprehensive, it is also less conservative and allows questions about coreference relations that are not marked in OntoNotes annotations. Since the notion is not as strict, it does not require linguistic expertise from annotators, making it more amenable to crowdsourcing. Guha et al. (2015) present the limitations of annotating coreference in newswire texts alone, and like us, built a non-newswire coreference resolution dataset focusing on Quiz Bowl questions. There is some other recent work (Poesio et al., 2019; Aralikatte and Søgaard, 2019) in crowdsourcing coreference judgments that relies on a relaxed notion of coreference as well.
Reading comprehension datasets There are many reading comprehension datasets (Richardson et al., 2013; Rajpurkar et al., 2016; Kwiatkowski et al., 2019; Dua et al., 2019, inter alia). Most of these datasets principally require understanding local predicate-argument structure in a paragraph of text. QUOREF also requires understanding local predicate-argument structure, but makes the reading task harder by explicitly querying anaphoric references, requiring a system to track entities throughout the discourse.

Conclusion
We present QUOREF, a focused reading comprehension benchmark that evaluates the ability of models to resolve coreference. We crowdsourced questions over paragraphs from Wikipedia, and manual analysis confirmed that most cannot be answered without coreference resolution. We show that current state-of-the-art reading comprehension models perform significantly worse than humans. Both these findings provide evidence that QUOREF is an appropriate benchmark for coreference-aware reading comprehension.

The crowdworkers were given the following instructions: "In this task, you will look at paragraphs that contain several phrases that are references to names of people, places, or things. For example, in the first sentence from sample paragraph below, the references Unas and the ninth and final king of Fifth Dynasty refer to the same person, and Pyramid of Unas, Unas's pyramid and the pyramid refer to the same construction. You will notice that multiple phrases often refer to the same person, place, or thing. Your job is to write questions that you would ask a person to see if they understood that the phrases refer to the same entity. To help you write such questions, we provided some examples of good questions you can ask about such phrases. We also want you to avoid questions that can be answered correctly by someone without actually understanding the paragraph. To help you do so, we provided an AI system running in the background that will try to answer the questions you write. You can consider any question it can answer to be too easy. However, please note that the AI system incorrectly answering a question does not necessarily mean that it is good. Please read the examples below carefully to understand what kinds of questions we are interested in."

A.2 Examples of Good Questions
We illustrate examples of good questions for the following paragraph.
"The Pyramid of Unas is a smooth-sided pyramid built in the 24th century BC for the Egyptian pharaoh Unas, the ninth and final king of the Fifth Dynasty. It is the smallest Old Kingdom pyramid, but significant due to the discovery of Pyramid Texts, spells for the king's afterlife incised into the walls of its subterranean chambers. Inscribed for the first time in Unas's pyramid, the tradition of funerary texts carried on in the pyramids of subsequent rulers, through to the end of the Old Kingdom, and into the Middle Kingdom through the Coffin Texts which form the basis of the Book of the Dead. Unas built his pyramid between the complexes of Sekhemket and Djoser, in North Saqqara. Anchored to the valley temple via a nearby lake, a long causeway was constructed to provide access to the pyramid site. The causeway had elaborately decorated walls covered with a roof which had a slit in one section allowing light to enter illuminating the images. A long wadi was used as a pathway. The terrain was difficult to negotiate and contained old buildings and tomb superstructures. These were torn down and repurposed as underlay for the causeway. A significant stretch of Djoser's causeway was reused for embankments. Tombs that were on the path had their superstructures demolished and were paved over, preserving their decorations." The following questions link pronouns: Q1: What is the name of the person whose pyramid was built in North Saqqara? A:

A.4 Worker Pool Management
Beyond training workers with the detailed instructions shown above, we ensured that the questions are of high quality by selecting a good pool of 21 workers using a two-stage selection process, allowing only those workers who clearly understood the requirements of the task to produce the final set of questions. Both the qualification and final HITs had 4 paragraphs per HIT for paragraphs from movie plot summaries, and 5 per HIT for the other domains, from which the workers could choose. For each HIT, workers typically spent 20 minutes, were required to write 10 questions, and were paid US$7.

B Experimental Setup Details
Unless otherwise mentioned, we adopt the original published procedures and hyperparameters used for each baseline.
BERT QA and XLNet QA We use uncased BERT and cased XLNet, but lowercase our data while processing. We train our model with a batch size of 10, a sequence length of 512 wordpieces, and a stride of 128. We use the AdamW optimizer with a learning rate of 3e-5. We train for 10 epochs, checkpointing the model after 19399 steps. We report the performance of the checkpoint that is best on the dev set.
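To make the sequence length and stride settings concrete, the following sketch computes the overlapping windows that would cover a long paragraph. It assumes that "stride" means the offset between the starts of consecutive windows; some tokenizer libraries instead use the term for the size of the overlap, and the text does not say which convention applies. It also ignores the wordpieces taken up by the question and special tokens, so it is an illustration of the windowing idea only.

```python
from typing import List, Tuple


def sliding_windows(num_tokens: int, max_len: int = 512, stride: int = 128) -> List[Tuple[int, int]]:
    """Return half-open [start, end) token ranges that together cover the sequence.

    Assumption: each window starts `stride` tokens after the previous one.
    """
    if max_len <= 0 or stride <= 0:
        raise ValueError("max_len and stride must be positive")
    windows = []
    start = 0
    while True:
        end = min(start + max_len, num_tokens)
        windows.append((start, end))
        if end == num_tokens:
            break  # the last window reaches the end of the sequence
        start += stride
    return windows
```

Under this convention, a 700-wordpiece paragraph yields the windows (0, 512), (128, 640), and (256, 700), while a paragraph of 512 wordpieces or fewer fits in a single window.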
QANet During training, we truncate paragraphs to 400 (word) tokens and questions to 50 tokens. During evaluation, we truncate paragraphs to 1000 tokens and questions to 100 tokens.
Passage-only baseline We keep the hyperparameter setup used for training BERT QA and XLNet QA and replace questions with empty strings.
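The passage-only transformation amounts to blanking the question field before the examples reach the model. A minimal sketch, assuming each example is a dictionary with a `question` key (the field names here are hypothetical and not taken from the released data format):

```python
from typing import Dict, List


def to_passage_only(examples: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Return copies of the examples with the question replaced by an empty string."""
    return [{**example, "question": ""} for example in examples]
```

The function returns new dictionaries rather than editing in place, so the same loaded examples can feed both the full model and the passage-only baseline.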