TORQUE: A Reading Comprehension Dataset of Temporal Ordering Questions

A critical part of reading is being able to understand the temporal relationships between events described in a passage of text, even when those relationships are not explicitly stated. However, current machine reading comprehension benchmarks have practically no questions that test temporal phenomena, so systems trained on these benchmarks have no capacity to answer questions such as "what happened before/after [some event]?" We introduce TORQUE, a new English reading comprehension benchmark built on 3.2k news snippets with 21k human-generated questions querying temporal relationships. Results show that RoBERTa-large achieves an exact-match score of 51% on the test set of TORQUE, about 30% behind human performance.


Introduction
Time is important for understanding events and stories described in natural language text such as news articles, social media, financial reports, and electronic health records (Verhagen et al., 2007, 2010; UzZaman et al., 2013; Minard et al., 2015; Bethard et al., 2016; Laparra et al., 2018). For instance, "he won the championship yesterday" is different from "he will win the championship tomorrow": he may be celebrating if he has already won it, while if he has not, he is probably still preparing for the game tomorrow.
The exact time when some event happens is often implicit in text. For instance, if we read that a woman is "expecting the birth of her first child", we know that the birth is in the future, while if she is "mourning the death of her mother", the death is in the past. These relationships between an event and a time point (e.g., "won the championship yesterday") or between two events (e.g., "expecting" is before "birth" and "mourning" is after "death") are called temporal relations (Pustejovsky et al., 2003).
Example passage (Fig. 1): Heavy snow is causing disruption to transport across the UK, with heavy rainfall bringing flooding to the south-west of England. Rescuers searching for a woman trapped in a landslide at her home in Looe, Cornwall, said they had found a body.
This work studies reading comprehension for temporal relations, i.e., given a piece of text, a computer needs to answer temporal relation questions (Fig. 1). This problem is studied very little in reading comprehension (Rajpurkar et al., 2016, 2018; Dua et al., 2019; Lin et al., 2019), and existing systems are hence brittle when handling questions in TORQUE (Table 1).
Reading comprehension for temporal relationships has the following challenges. First, it requires event understanding, which is rare for reading comprehension datasets. For the example in Fig. 1, SQuAD (Rajpurkar et al., 2016) and other datasets largely only require an understanding of the predicate-argument structure, and would ask questions like "what was a woman trapped in?" But a temporal relation question would be "what started before a woman was trapped?" To answer it, the system needs to identify events (e.g., LANDSLIDE is an event and "body" is not), the time of these events (e.g., LANDSLIDE is a correct answer, while SAID is not because of the time that the two events happen), and look at the entire passage rather than the local predicate-argument structures within a sentence (e.g., SNOW and RAINFALL are correct answers from the sentence before "a woman trapped").
Second, there are many events in a typical passage of text, so temporal relation questions typically query more than one relationship at the same time. This means that a question can have multiple answers (e.g., "what happened after the landslide?"), or no answers, because the question may be beyond the time scope (e.g., "what happened before the snow started?").
Third, temporal relations queried by natural language questions are often sensitive to a few key words such as before, after, and start. Those questions can easily be changed to make contrasting questions with dramatically different answers. Models that are not sensitive to these small changes in question words will perform poorly on this task, as shown in Table 1.
In this paper, we present TORQUE, the first reading comprehension benchmark that specifically targets these challenging phenomena. We trained crowd workers to label events in text, and to write and answer questions that query temporal relationships between these events. During data collection, we had workers write questions with contrasting changes to the temporal key words, to give a comprehensive test of a machine's temporal reasoning ability and minimize the effect of any data collection artifacts (Gardner et al., 2020). We annotated 3.2k text snippets randomly selected from the TempEval3 dataset (UzZaman et al., 2013). In total, TORQUE has 25k events and 21k user-generated and fully answered temporal relation questions. Both the events and question-answer (QA) pairs from 20% of TORQUE were further validated by additional crowd workers, which we use for evaluation. Results show that RoBERTa-large achieves 51% in exact-match (EM) on TORQUE after finetuning, about 30% behind human performance, indicating that more investigation is needed to better solve this problem.

Events
As temporal relations are relationships between events, we first define events. Generally speak-

Events in different modes
The lion had a large meal and slept for 24 hours.
[Negated] The lion didn't sleep after having a large meal.
[Uncertain] The lion may have had a large meal before sleeping.
[Hypothetical] If the lion has a large meal, it will sleep for 24 hours.
[Repetitive] The lion used to sleep for 24 hours after having large meals.
[Generic] After having a large meal, lions may sleep longer.
Figure 3: Various modes of events that prior work needed to categorize. Section 3 shows that they can be handled naturally without explicit categorization.
This work follows this line of event definition and uses event and event trigger interchangeably. We define an event to be either a verb or a noun (e.g., TRAPPED and LANDSLIDE in Fig. 1). Specifically, in copular constructions, we choose to label the verb as the event, instead of an adjective or preposition. This allows us to give a consistent treatment of "she was on the east coast yesterday" and "she was happy", which we can easily teach to crowd workers.
Note that events expressed in text are not always factual. They can be negated, uncertain, hypothetical, or have other associated modalities (see Fig. 3). Prior work dealing with events often tried to categorize and label these various aspects because they were crucial for determining temporal relations. Sometimes certain categories were even dropped due to annotation difficulties (Pustejovsky et al., 2003; O'Gorman et al., 2016; Ning et al., 2018b). In this work, we simply have people label all events, no matter their modality, and use natural language to describe relations between them, as discussed in Sec. 3.

Temporal Relations
Temporal relations describe the relationship between two events with respect to time, or between one event and a fixed time point (e.g., yesterday). We can use a triplet, (A, r, B), to represent this relationship, where A and B are events or time points, and r is a temporal relation. For example, the first sentence in Fig. 3 expresses a temporal relation (HAD, happened before, SLEPT).
In previous works, every event is assumed to be associated with a time interval [t_start, t_end]. When comparing two events, there were 13 possible relations to choose from (see Fig. 4) (Allen, 1984). However, there are still many relations that cannot be expressed because the assumption that every event has a time interval is inaccurate: The time scope of an event may be fuzzy, an event can have a non-factual modality, or events can be repetitive and invoke multiple intervals (see Fig. 5). To better handle these phenomena, we move away from the fixed set of relations used in prior work and instead use natural language to annotate the relationships between events, as described in the next section.
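To make the single-interval assumption concrete, the following is a minimal sketch of how the 13 interval relations of Allen (1984) can be derived for two events once each is given a [t_start, t_end] interval. The function name, the relation labels, and the example hours for the lion sentence are illustrative assumptions, not part of TORQUE.

```python
def allen_relation(a, b):
    """Return the interval relation of a with respect to b.

    a and b are (t_start, t_end) pairs with t_start < t_end.
    The 13 labels follow the usual naming of Allen's interval relations.
    """
    (a0, a1), (b0, b1) = a, b
    if a1 < b0:
        return "before"
    if b1 < a0:
        return "after"
    if a1 == b0:
        return "meets"
    if b1 == a0:
        return "met-by"
    if a0 == b0 and a1 == b1:
        return "equals"
    if a0 == b0:
        return "starts" if a1 < b1 else "started-by"
    if a1 == b1:
        return "finishes" if a0 > b0 else "finished-by"
    if b0 < a0 and a1 < b1:
        return "during"
    if a0 < b0 and b1 < a1:
        return "contains"
    # Only partial overlaps remain at this point.
    return "overlaps" if a0 < b0 else "overlapped-by"


# Hypothetical hours for "The lion had a large meal and slept for 24 hours."
HAD, SLEPT = (0, 1), (2, 26)
print(allen_relation(HAD, SLEPT))
```

The sketch only works when both events have one well-defined interval, which is exactly the assumption that fuzzy, non-factual, and repetitive events violate.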

Natural Language Annotation of Temporal Relations
Confusing relations between the following events
Fuzzy time scope: Heavy snow is causing disruption to transport across the UK, with heavy rainfall bringing flooding to the south-west of England.
"Follow" is negated: Colonel Collins didn't follow a normal progression anymore once she was picked as a NASA astronaut.
"Leaves" is a series of time intervals: The bus leaves at 10 am every day, so we will go to the bus stop at 9 am today.
Figure 5: It is confusing to label these relations using a fixed set of relations: they are not simply before or after, but they can be fuzzy, can have modalities as events, and/or need multiple time intervals to represent.

Motivated by recent works (He et al., 2015; Michael et al., 2017; Levy et al., 2017; Gardner et al., 2019b), we propose using natural language question answering as an annotation format for temporal relations. Recalling that we denote a temporal relation between two events as (A, r, B), we use (?, r, B) to denote a temporal relation question. We instantiate these temporal relation questions using natural language. For instance, (?, happened before, SLEPT) means "what happened before a lion slept?" We then expect as an answer the set of all events A in the passage such that (A, r, B) holds, interpreting any deictic expression in A or B relative to the time point when the passage was written, and assuming that the passage is true.
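The answer-set definition above can be illustrated with a small sketch. It assumes, purely for illustration, that the relations holding in a passage are available as explicit (A, r, B) triplets; in TORQUE itself the relations are expressed only through natural language questions and answers. The triplets below are partly taken from the Fig. 1 discussion in the introduction, and the one involving SAID is a hypothetical addition.

```python
from typing import Set, Tuple

Triplet = Tuple[str, str, str]  # (A, r, B)


def answer_question(relations: Set[Triplet], r: str, b: str) -> Set[str]:
    """Answer the question (?, r, B): all events A such that (A, r, B) holds."""
    return {a for (a, rel, target) in relations if rel == r and target == b}


# Illustrative triplets for the Fig. 1 passage.
relations = {
    ("landslide", "started before", "trapped"),
    ("snow", "started before", "trapped"),
    ("rainfall", "started before", "trapped"),
    ("said", "happened after", "trapped"),  # hypothetical triplet
}

# "What started before a woman was trapped?"
print(sorted(answer_question(relations, "started before", "trapped")))
# "What happened before the snow started?" has no answer in the passage.
print(answer_question(relations, "happened before", "snow"))
```

A question can therefore have several answers or none, which is why the task is framed as selecting a set of events rather than a single span.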

Fuzzy relations
Heavy snow is causing disruption to transport across the UK, with heavy rainfall bringing flooding to the south-west of England.
Q: What happens at about the same time as the disruption? A: flooding
Q: What started after the snow started? A: disruption
Figure 6: Fuzzy relations that used to be difficult to represent using a predefined label set can be captured naturally in a reading comprehension task.

Advantages
Studying temporal relations as a reading comprehension task gives us the flexibility to handle many of the aforementioned difficulties. First, fuzzy relations can be described by natural language questions (after all, the relations are expressed in natural language in the first place). In Fig. 6, DISRUPTION and FLOODING are at about the same time, but we do not know for sure which one is earlier, so we have to choose vague in the predefined label set. Similarly for SNOW and DISRUPTION, we do not know which one ends earlier and have to choose vague. In contrast, the QA pairs in Fig. 6 can naturally capture these fuzzy relations. Second, natural language questions can conveniently incorporate different modes of events. Figure 7 shows examples of not before, probably before, before under some conditions, often before, and generally before, which accurately describe the relation between various modes of "having a meal" and "sleeping" in Fig. 3. In contrast, if we could only choose one label, we would have to choose before for all these relations, although these relations are actually different. For instance, a repetitive event may be a series of intervals rather than a single one, and often before is very different from before (Fig. 8).

Questions that query events in different modes
[Negated] What didn't the lion do after a large meal?
[Uncertain] What might the lion do before sleeping?
[Hypothetical] What will the lion do if it has a large meal?
[Repetitive] What did the lion use to do after large meals?
[Generic] What do lions do after a large meal?
Figure 7: Events in different modes can be distinguished using natural language questions.

Figure 8: "Often before" vs. "before" (example: He used to take a walk after dinner.)
Third, a major issue that prior works wanted to address was deciding when two events should have a relation (Mostafazadeh et al., 2016; O'Gorman et al., 2016; Ning et al., 2018b). To avoid asking for relations that do not exist, prior works needed to explicitly annotate certain properties of events as a preprocessing step, but it still remains difficult to have a theory explaining, for instance, why hit can compare to expected and crisis, but not to gains. Interestingly, when we annotate temporal relations in natural language, the annotator naturally avoids event pairs that do not have relations. For instance, for the sentences in Fig. 9, one will not ask questions like "what happened after the service industries are hardest hit?" or "what happened after a passerby reported the body?" Instead, natural questions will be "what was expected to happen when the crisis hit America?" and "what was supposed to happen after a passerby called the police?" The format of natural language questions bypasses the need for explicit annotation of properties of events or other theories.

When should two events have a relation?
Service industries showed solid job gains, an area expected to be hardest hit when the crisis hit the America economy.
Some pairs have relations: (showed gains), (expected hit), (gains crisis), etc. Some don't: (showed hit), (gains hit)
A passerby called the police to report the body, but the line was busy.
Some pairs have relations: (called report), (called was). Some don't: (report was)
Figure 9: It remains unclear how to determine if two events should have a temporal relation.

Penalize Shortcuts by Contrast Sets
Natural language questions give us many benefits in describing fuzzy relations and incorporating various temporal phenomena. However, this setup also increases the risk of leading to trivial solutions (Gardner et al., 2019a). As Fig. 10 shows, there are two events ATE and WENT in the text. Since ATE is already mentioned in the question, the answer of WENT seems a trivial option without the need to understand the underlying relationship. To address this issue, we create contrast questions which slightly modify the original questions, but dramatically change the answers, so that shortcuts are penalized. Specifically, for an existing question (?, r, B) (e.g., "what happened after he ate his breakfast?"), one should keep using B and change r (e.g., "what happened before/shortly after/... he ate his breakfast?"), or modify it to ask about the start/end time (e.g., "what happened after he started eating his breakfast?" or "what would finish after he ate his breakfast?"). We also instructed workers to make sure that the answers to the new question are different from the original one to avoid trivial modifications (e.g., changing  "what happened" to "what occurred").

Data Collection
We used Amazon Mechanical Turk to build TORQUE, a reading comprehension dataset of temporal ordering questions. Following prior work, we focus on passages that consist of two contiguous sentences, as this is sufficient to capture the vast majority of non-trivial temporal relations (Ning et al., 2017), and it greatly simplifies the annotation. We took all the articles used in the TempEval3 (TE3) workshop (2.8k articles) (UzZaman et al., 2013) and created a pool of 26k two-sentence passages. Given a passage randomly selected from this pool, the annotation process for crowd workers was as follows.
1. Label all the events.
2. Repeatedly do the following:
   (a) Ask a temporal relation question and point out all the answers from the list of events.
   (b) Modify the temporal relation to create one or more new questions and answer them.
The annotation guidelines are public. In the following sections, we further discuss issues of quality control and crowdsourcing cost.

Quality Control
We used three quality control strategies: qualification, pilot, and validation.
Qualification. We designed a separate qualification task where crowd workers were trained and tested on three individual capabilities: labeling events, asking temporal relation questions, and question-answering. They were tested on problems randomly selected from a pool designed by us. A crowd worker was considered level-1 qualified if they could pass the test within three attempts. In practice, roughly 1 out of 3 workers passed our qualification test.
Pilot. We then asked level-1 crowd workers to do a small amount of the real task. We manually checked the annotations and gave feedback to them. Those who passed this inspection were called level-2 workers, and only they could work on the large-scale real task. Roughly 1 out of 3 pilot submissions received a level-2 qualification. In the end, there were 63 level-2 annotators, and 60 of them actually worked on our large-scale task.
Validation. We randomly selected 20% of the articles from TORQUE for further validation. We first validated the events by 4 different level-2 annotators (with the original annotator, there were in total 5 different humans). We also intentionally added noise to the original event list so that the validators must carefully identify wrong events. The final event list was determined by aggregating all 5 humans using majority vote. Second, we validated the answers in the same portion of the data. Two level-2 workers were asked to verify the initial annotator's answers; we again added noise to the answer list as a quality control for the validators. Instead of using majority vote as we did for events, the final answers from all workers are considered correct. We did not do additional validation for the questions themselves, as a manual inspection found the quality to be very high already, with no bad questions in a random sample of 100.
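The two aggregation rules in the validation step can be sketched as follows. This is a minimal illustration, assuming each annotator's output is a set of strings and that "majority" means a strict majority; the function names and the example annotations are hypothetical.

```python
from collections import Counter


def majority_vote(annotations):
    """Keep an item if strictly more than half of the annotators selected it."""
    counts = Counter(item for ann in annotations for item in set(ann))
    return {item for item, c in counts.items() if c > len(annotations) / 2}


def union_of_answers(annotations):
    """Treat an answer as correct if any annotator selected it."""
    return set().union(*annotations)


# Events: the initial annotator plus 4 validators (5 humans in total).
event_annotations = [
    {"trapped", "landslide", "body"},
    {"trapped", "landslide"},
    {"trapped", "landslide", "found"},
    {"trapped", "found"},
    {"trapped", "landslide", "found"},
]
print(sorted(majority_vote(event_annotations)))

# Answers: the initial annotator plus 2 validators.
answer_annotations = [{"snow"}, {"snow", "rainfall"}, {"landslide"}]
print(sorted(union_of_answers(answer_annotations)))
```

With 5 annotators a strict majority never ties, so no tie-breaking rule is needed for events in this sketch.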

Cost
In each job of the main task (also known as a "HIT" on MTurk), we presented 3 passages. The crowd worker could decide to use some or all of them. For each passage a worker decided to use, they needed to label the events, answer 3 hard-coded warm-up questions, and then ask and answer at least 12 questions (including contrast questions). The final reward is a base pay of $6 plus $0.5 for each extra question. Crowd workers thus had the incentive to (1) use fewer passages so that they can do event labeling and warm-up questions fewer times, (2) modify questions instead of asking from scratch, and (3) ask extra questions in each job. All these incentives were for more coverage of the temporal phenomena in each passage. In practice, crowd workers on average used 2 passages in each HIT. Validating the events in each passage and the answers to a specific question both cost $0.1. In total, TORQUE cost $15k to create for an average of $0.70 per question.

TORQUE Statistics
In TORQUE we collected 3.2k passage annotations (∼50 tokens/passage), 24.9k events (7.9 events/passage), and 21.2k user-provided questions (∼half of them were labeled by crowd workers as modifications of existing ones). Every passage also comes with 3 hard-coded warm-up questions asking which events in the passage had already happened, were ongoing, or were still in the future (the 3 warm-up questions form a contrast set, where we treat the first one as "original" and the other two as "modified"). Table 3 shows some basic statistics of TORQUE.
In a random sample of 200 questions in the test set of TORQUE, we found 94 questions querying relations that cannot be directly represented by the previous single-interval-based labels. Table 2 gives example questions capturing these phenomena. More analysis of the event, answer, and workload distributions is in the appendix.

Quality
To validate the event annotations, we took the events provided by the initial annotator, added noise, and asked different workers to validate. We also trained an auxiliary event detection model using RoBERTa-large and added its predictions as event candidates. This tells us about the quality of events in TORQUE in two ways. First, the Worker Agreement with Aggregate (WAWA) F1 here is 94.2%; that is, we compare each annotator against the majority-vote aggregate and micro-average over all instances. Second, if an event candidate is labeled by both the initial annotator and the model, then almost all of them (99.4%) are kept by the validators; if neither the initial annotator nor the model labeled a candidate, the candidate is almost surely removed (only 0.8% are kept). As validators did not know beforehand which candidates were noise, this indicates that the validators could identify noise terms reliably.
Similarly, the WAWA F1 of the answer annotations is 84.7%, slightly lower than that for events, which is expected because temporal relation QA is intuitively harder. Results show that 12.3% of the randomly added answer candidates were labeled as correct answers by the validators. We manually inspected 100 questions and found 11.6% of the added noise terms were correct answers (very close to 12.3%), indicating that the validators were actually doing a good job in answer validation. More details of the metrics and the quality of annotations can be found in the appendix.

Table 2:
Type | Subtype | Example | %
Standard | | "What happened before Bush gave four key speeches?" | 53%
Fuzzy | begin only | "What started before Mr. Fournier was prohibited from organizing his own defense?" | 15%
Fuzzy | overlap only | "What events were occurring during the competition?" | 10%
Fuzzy | end only | "What will end after he is elected?" | 1%
Modality | uncertain | "What might happen after the FTSE 100 index was quoted 9.6 points lower?" | 10%
Modality | negation | "What has not taken place before the official figures show something?" | 5%
Modality | hypothetical | "What event will happen if the scheme is broadened?" | 2%
Modality | repetitive | "What usually happens after common shares are acquired?" | 1%
Misc. | participant | "What did Hass do before he went to work as a spy?" | 4%
Misc. | opinion | "What should happen in the future according to Obama's opinion?" | 3%
Misc. | intention | "What did Morales want to happen after Washington had a program to eradicate coca?" | 1%

Experiment
We split TORQUE into train (80% of all the questions), dev (5%), and test (15%), such that no article appears in more than one split. To solve TORQUE in an end-to-end fashion, the model here takes as input a passage and a question, then looks at every token in the passage and makes a binary classification of whether this token is an answer to the question or not. Specifically, we fine-tuned BERT (Devlin et al., 2019) and RoBERTa (both "base" and "large") on the training set of TORQUE. We fixed batch size = 6 (each instance is a tuple of one passage, one question, and all its answers) with gradient accumulation step = 2 in all experiments. We selected the learning rate (from {1e-5, 2e-5}), the number of training epochs (within 10), and the random seed (from 3 arbitrary ones) based on performance on the dev set of TORQUE. To compute an estimate of human performance, one author answered 100 questions from the test set.
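The per-token formulation can be sketched independently of the encoder. The snippet below is only an illustration of the input/output format: it assumes whitespace tokenization and a 0.5 decision threshold, neither of which is stated in the paper, and it leaves out the BERT/RoBERTa model that would produce the per-token scores.

```python
def token_labels(tokens, answer_indices):
    """Binary target per passage token: 1 if the token is an answer, else 0."""
    return [1 if i in answer_indices else 0 for i in range(len(tokens))]


def decode_answers(scores, threshold=0.5):
    """Turn per-token answer probabilities into predicted answer token indices."""
    return [i for i, s in enumerate(scores) if s >= threshold]


# Part of the Fig. 1 passage, whitespace-tokenized (an assumption).
tokens = "Rescuers searching for a woman trapped in a landslide".split()
question = "What started before a woman was trapped?"

# Gold answer within this snippet: "landslide" (token index 8).
labels = token_labels(tokens, {8})
print(labels)

# Hypothetical model scores, one per token.
scores = [0.1, 0.4, 0.0, 0.0, 0.1, 0.2, 0.0, 0.0, 0.9]
print([tokens[i] for i in decode_answers(scores)])
```

Because every token is classified independently, the same format covers questions with many answers and questions with none (all labels 0).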
Both the human performance and system performances are shown in Table 4. We report the standard macro F1 and exact-match (EM) metrics in question answering, and also EM consistency, the percentage of contrast question sets for which a model's predictions exactly match the gold answers for all questions in a group (Gardner et al., 2020). We see that warm-up questions are easier than user-provided ones because warm-up questions focus on easier phenomena of past/ongoing/future events. In addition, RoBERTa-large is, as expected, the best system, but still far behind human performance, trailing by about 30% in EM.
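The three metrics can be sketched as below. This is a minimal illustration under stated assumptions: predictions and gold answers are sets of event tokens, a question with empty prediction and empty gold set scores F1 = 1, and consistency takes a per-question F1 threshold so that it covers both the exact-match reading in the text (threshold 1.0) and the F1 ≥ 80% reading in the Table 4 caption (threshold 0.8). The function names and example data are hypothetical.

```python
def f1(pred, gold):
    """Set-based F1 for one question."""
    pred, gold = set(pred), set(gold)
    if not pred and not gold:
        return 1.0  # assumption: correctly predicting "no answer" is perfect
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def evaluate(examples, f1_threshold=0.8):
    """examples: dicts with 'group' (contrast set id), 'pred', and 'gold'."""
    f1s = [f1(e["pred"], e["gold"]) for e in examples]
    ems = [float(set(e["pred"]) == set(e["gold"])) for e in examples]
    groups = {}
    for e, score in zip(examples, f1s):
        groups.setdefault(e["group"], []).append(score)
    consistent = [all(s >= f1_threshold for s in g) for g in groups.values()]
    return {
        "macro_f1": sum(f1s) / len(f1s),
        "em": sum(ems) / len(ems),
        "consistency": sum(consistent) / len(consistent),
    }


examples = [
    {"group": 1, "pred": {"went", "left"}, "gold": {"went", "left"}},
    {"group": 1, "pred": {"ate"}, "gold": {"ate", "woke"}},
    {"group": 2, "pred": set(), "gold": set()},
]
print(evaluate(examples))
```

Consistency is computed per contrast group rather than per question, so a model that answers the original question correctly but fails its modified versions receives no credit for that group.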
We further downsampled the training data to test the performance of RoBERTa. We find that with 10% of the original training data, RoBERTa fails to learn anything meaningful and simply predicts "not an answer" for all tokens. With 50% of the training data, RoBERTa's performance is slightly lower than, but already comparable to, that obtained with the entire training set. This suggests that the learning curve on TORQUE is already flat and that the current size of TORQUE may not be the bottleneck for the low performance. More investigations into system modeling are needed to better solve TORQUE.
Table 4: Human/system performance on the test set of TORQUE. System performance is averaged from 3 runs; all std. dev. were ≤ 4% and those in [1%, 4%] are underlined. C (consistency) is the percentage of contrast groups for which a model's predictions have F1 ≥ 80% for all questions in a group (Gardner et al., 2020).

Figure 11: RoBERTa-large with different percentages of training data. Human performance (F1, EM, and C) is shown in dashed lines.

Related Work
The study of time is to understand when, how long, and how often things happen. While how long and how often usually require temporal common sense knowledge (Vempala et al., 2018; Zhou et al., 2019, 2020), the problem of when often boils down to extracting the temporal relations.
Modeling. Research on temporal relations often focuses on algorithmic improvement, such as structured inference (Do et al., 2012; Ning et al., 2018a), structured learning (Leeuwenberg and Moens, 2017; Ning et al., 2017), and neural networks (Tourille et al., 2017; Cheng and Miyao, 2017; Meng and Rumshisky, 2018; Leeuwenberg and Moens, 2018).
Formalisms. The approach that prior works took to handle the aforementioned temporal phenomena was to define formalisms such as the different modes of events (Fig. 3), different time axes for events (Ning et al., 2018b), and specific rules to follow when there is confusion. For example, Bethard et al. (2007) and Ning et al. (2018b) focused on a limited set of temporal phenomena and achieved high inter-annotator agreements (IAA), while Styler IV et al. (2014) and O'Gorman et al. (2016) aimed at covering more phenomena but suffered from low IAAs even between NLP researchers.
QA as annotation. A natural choice is then to cast temporal relation understanding as a reading comprehension (RC) problem. The QA-TempEval workshop, despite its name, does not actually study temporal relations in an RC setting (Llorens et al., 2015). This work is motivated by the philosophy in QA-SRL (He et al., 2015) and QAMR (Michael et al., 2017), where QA pairs were used as representations for predicate-argument structures. In zero-shot relation extraction (RE), Levy et al. (2017) reduced relation slot filling to an RC problem so as to build very large distant training data and improve zero-shot learning performance. However, our work differs from zero-shot RE, which centers around entities, while TORQUE is about events; the way to ask and answer questions in zero-shot RE and in TORQUE is thus significantly different.

Conclusion
Understanding temporal ordering of events is critical in reading comprehension, but existing works have studied very little about it. This paper presents TORQUE, a new English reading comprehension dataset of temporal ordering questions on 3.2k news snippets. These questions include 9.5k hard-coded questions asking which events had happened, were ongoing, or were still in the future, and 21.2k human-generated questions querying more complex phenomena than the hard-coded ones. We argue that studying temporal relations as a reading comprehension task allows for more convenient representation of these temporal phenomena than is possible in conventional formalisms. Results show that even a state-of-the-art language model, RoBERTa-large, falls behind human performance on TORQUE by a large margin, necessitating more investigation on improving reading comprehension on temporal relationships in the future.

A Event Distribution
As we mentioned in Sec. 5, TORQUE has 24.9k events over 3.2k passages. Figure 12 shows the histogram of the number of events in all these passages. We can see it roughly follows a Gaussian distribution with the mean at around 7-8 events per passage. Figure 13 further shows the 50 most common events in TORQUE. Unsurprisingly, the most common events are reporting verbs (e.g., "say", "tell", "report", and "announce") and copular verbs. Other common events such as "meeting", "killed", "visit", and "war" are also expected given that the passages of TORQUE were taken from news articles.
Figure 13: Fifty most common event triggers in TORQUE. Note the y-axis (#appearances in TORQUE) is in log scale.

B Question Prefix Distribution
Figure 14 shows a sunburst visualization of the questions provided by crowd workers in TORQUE, including both their original questions and their modifications. Specifically, Fig. 14a shows that almost all of the questions start with "what." The small portion of questions that do not start with "what" are cases where crowd workers switch the order of how they ask. One example of these was "Before making his statement to the Sunday Mirror, what did the author do?" Figure 14a also shows the most common following words of "what."
Figures 14b-c further show the distribution of questions starting with "what happened" and "what will." We can see that when asking things in the past, people ask more about "what happened before/after" than "what happened while/during," while when asking things in the future, people ask much more about "what will happen after" than "what will happen before."

Figure 14: Prefix distribution of user-provided questions. Panels: (a) overall, (b) "what happened", (c) "what will".

C Answer Distribution
The distribution of the number of answers to each question is shown in the figure below, where we divide the questions into 4 categories: the original warm-up questions, the modified warm-up questions, the original user questions, and the modified user questions. Note that for each passage, there are 3 warm-up questions and they are all hard-coded when crowd workers worked on them. We are treating the first one (i.e., "What events have already finished?") as the original and the other two as modified (i.e., "What events have begun but have not finished?" and "What will happen in the future?").

[Figure: TORQUE, number of answers in different categories (warm-up vs. user-provided questions; original vs. modified).]
We can see that in both the warm-up and the user questions, "modified" has a larger portion of questions with no answers at all as compared to the "original." This effect is very significant for warm-up questions because in news articles, most of the events were in the past. As for the user-provided questions, the percentage of no-answer questions is higher in "modified," but it is not as drastic as for the warm-up questions. This is because we only required that the modified question should have different answers from the original one; many of those questions still have answers after modification.

D Workload Distribution Among Workers
As each annotator may be biased to only ask questions in a certain way, it is important to make sure that the entire dataset is not labeled by only a few annotators (Geva et al., 2019). Figure 15a shows the contribution of each crowd worker to TORQUE and we can see even the rightmost worker only provided 5%. Figure 15b further adopts the notion of Gini Index to show the dispersion (see footnote 5). The Gini index of TORQUE is 0.42.
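For readers unfamiliar with the Gini index, the following is a minimal sketch of one standard way to compute it from per-worker contribution counts. The paper does not state which exact formula was used, so the formula choice and the example counts are assumptions.

```python
def gini(values):
    """Gini index of non-negative contributions (0 = perfectly even).

    Uses the sorted-values formula
        G = 2 * sum_i(i * x_i) / (n * sum_i(x_i)) - (n + 1) / n,
    with x sorted in ascending order and i starting at 1.
    """
    xs = sorted(values)
    n, total = len(xs), sum(xs)
    if n == 0 or total == 0:
        return 0.0
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return 2 * weighted / (n * total) - (n + 1) / n


# Hypothetical numbers of questions contributed by each worker.
print(gini([100, 100, 100, 100]))  # every worker contributes equally
print(gini([0, 0, 0, 400]))        # one worker provides everything
```

A value near 0 means the workload is spread evenly; a value near 1 means a small group of workers produced most of the data.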

E Worker Agreement With Aggregate
In Sec. 5 we described the worker agreement with aggregate (WAWA) metric for measuring the inter-annotator agreement (IAA) between crowd workers of TORQUE. This WAWA metric is explained in the figure below. It first obtains an aggregated answer set from multiple workers (we used majority vote as the aggregate function), then compares each worker with the aggregated answer set, and finally computes the micro-average across multiple workers and multiple questions. Tables 5 and 6 show the quality of event annotations and question-answering annotations, respectively. In both of them, the IAA values use the WAWA metric explained above; the "Init Annotator" rows are a slight modification of WAWA, in which all workers are used when aggregating those answers, but only the first annotator is compared against the aggregated answer set.
Footnote 5: A high Gini Index here means the data were provided by a small group of workers. The Gini Index of family incomes in the United States was 0.49 in 2018 (Semega et al., 2019).
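The WAWA computation described in this appendix can be sketched as follows. This is a minimal illustration, assuming each worker's annotation for an instance is a set, that majority vote means a strict majority, and that micro-averaging pools true-positive, predicted, and aggregate counts over all (worker, instance) pairs before computing F1. The function names and the `first_only` flag (for the "Init Annotator" variant) are hypothetical.

```python
from collections import Counter


def majority(answer_sets):
    """Aggregate by strict majority vote over the workers' sets."""
    counts = Counter(x for s in answer_sets for x in set(s))
    return {x for x, c in counts.items() if c > len(answer_sets) / 2}


def wawa_f1(instances, first_only=False):
    """instances: one list per question/passage, holding one set per worker.

    If first_only is True, all workers still contribute to the aggregate,
    but only the first worker (assumed to be the initial annotator) is
    compared against it.
    """
    tp = n_pred = n_agg = 0
    for worker_sets in instances:
        aggregate = majority(worker_sets)
        compared = worker_sets[:1] if first_only else worker_sets
        for s in compared:
            tp += len(set(s) & aggregate)
            n_pred += len(set(s))
            n_agg += len(aggregate)
    if tp == 0:
        return 0.0
    precision, recall = tp / n_pred, tp / n_agg
    return 2 * precision * recall / (precision + recall)


instances = [[{"a", "b"}, {"a"}, {"a", "b"}]]
print(wawa_f1(instances))
print(wawa_f1(instances, first_only=True))
```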