Modeling Preconditions in Text with a Crowd-sourced Dataset

Preconditions provide a form of logical connection between events that explains why some events occur together, and they provide information that is complementary to the more widely studied relations such as causation, temporal ordering, entailment, and discourse relations. Modeling preconditions in text has been hampered in part by the lack of large-scale labeled data grounded in text. This paper introduces PeKo, a crowd-sourced annotation of preconditions between event pairs in newswire, an order of magnitude larger than prior text annotations. To complement this new corpus, we also introduce two challenge tasks aimed at modeling preconditions: (i) Precondition Identification – a standard classification task defined over pairs of event mentions, and (ii) Precondition Generation – a generative task aimed at testing a more general ability to reason about a given event. Evaluation on both tasks shows that modeling preconditions is challenging even for today’s large language models (LMs). This suggests that precondition knowledge is not easily accessible in LM-derived representations alone. Our generation results show that fine-tuning an LM on PeKo yields better conditional relations than training on raw text or temporally-ordered corpora.


Introduction
Recognizing logical connections between events in text is important for comprehensive document understanding and to improve global coherence in language generation systems. There is a rich body of work in identifying relations between textual events which covers causation (Mirza et al., 2014), temporal relations (Pustejovsky et al., 2003), textual entailment (Dagan et al., 2005), and discourse relations (Blair-Goldensohn and McKeown, 2006).
In this work, we focus on the precondition relation, which offers a general view of why certain events occur together in the world. This is not easily deduced from other event-event relations. Temporal ordering systems can sequence the order in which events occurred (Bethard, 2013; Chambers et al., 2014; Han et al., 2019) but can't explain why they occurred at all. Which events in a sequence were by chance, and which were required? Textual entailment identifies event paraphrases (Berant et al., 2015) and some causation (Girju, 2003a), but their view misses the broader look at enabling events like preconditions. Let the following serve as an example: I heard a bird sing above as I turned the key in the door. It opened with a push.
You can sequence these four events in order, but an ordering does not understand the why of the situation. One of these events (sing) is clearly not relevant to the door opening. How do we know that turning the key is a precondition to opened and not push? Turning the key usually doesn't cause the door to open (perhaps on some doors, but here a push was needed). Turning is simply a precondition. Causation and entailment do not apply to turn either. Preconditions thus provide a unique and still fine-grained understanding of this situation.
How do we build models that can recognize (and learn from) this type of common-sense knowledge in text? Do language models trained on vast amounts of data already capture it? Since there are no large-scale datasets that can effectively answer these questions, we introduce PeKo, the Precondition Knowledge dataset. We also introduce two tasks: one aimed at recognizing preconditions in text, and the other at generating precondition events for any given target event.
The core contribution in this paper is this new publicly available crowd-sourced PeKo dataset. It consists of 28,948 event pairs annotated with precondition relations. We will first present our working definition of preconditions, and then discuss how to practically get crowd workers to identify them in text. We provide analysis of the new corpus and compare it against other existing corpora.
In addition to the corpus, this paper proposes two new challenge tasks. The first is a traditional classification task on the corpus itself. We use it to address critical questions of how to model precondition knowledge. First, do today's large language models (e.g., BERT or XLNet) already capture precondition knowledge, and how do they perform on a precondition prediction task? Second, does textual context assist in precondition prediction? We experiment with varying levels of context and show that identifying preconditions requires careful modeling of the context.
The second proposed task is a precondition generation evaluation: models must generate necessary preconditions for a given target event. This is a test for how well models can reason about the necessary preconditions for a given situation, which is a useful capability for story generation and learning generalized scripts. We show how PeKo can be used to train (fine-tune) standard generative models, such as GPT-2, for this task. Empirical results show that fine-tuning on the PeKo-derived training set generates at least twice as many preconditions as compared to training on general instances.

Related Work
There has been a vast amount of research on extracting different types of relations between events including temporal (Pustejovsky et al., 2003), causal (Girju, 2003b), and paraphrasal relationships (Lin and Pantel, 2001), but relatively less research into precondition relationships. One of the earliest definitions and computational uses of preconditions comes from the STRIPS program (Fikes and Nilsson, 1971). Preconditions were defined as a set of conditions that MUST be met in order for the action (event) to be allowed to take place. Later work focused on aggregating precondition knowledge for a small class of action words, leveraging FrameNet and a text corpus to generate candidate precondition words using a PMI-based heuristic (Sil et al., 2010; Sil and Yates, 2011). Using small amounts of labeled data, they use hand-crafted PMI- and WordNet-based features to learn an SVM-based classifier that scores preconditions for a given action. Branavan et al. (2012) learned domain-specific preconditions from written instructions for the game of Minecraft. The instructions are procedural and well suited for identification. These mostly target preconditions that are event-state relations as opposed to our goal of textual event-event identification.
ATOMIC (Sap et al., 2019) is a related crowdsourced dataset of event-event relations, where given a simple target event (verb phrase and its arguments), crowd workers provided various types of common-sense knowledge. This included 'NEED' events analogous to our precondition events for a target. The main difference is that our work grounds both target and precondition events in news text, whereas ATOMIC elicits general world knowledge, a complementary approach with different tradeoffs. Interestingly, we find that the precondition relations learnt from textually grounded news events generalize to story events in ATOMIC for our generation task.
Annotated Text Corpora: Three existing datasets capture some form of precondition knowledge in their annotations: the Rich Event Description (RED) dataset (O'Gorman et al., 2016), CaTeRS (Mostafazadeh et al., 2016), and Event StoryLine (Caselli and Vossen, 2017). These are generally too small for learning text classifiers, as we briefly describe now.
RED is the most directly related, created to model a broad set of event-event relations in news. Preconditions are not its sole focus, though, so this dataset only contains ~1000 precondition instances. CaTeRS shares a similar problem with RED. It has an enables relation similar to precondition, but since the domain is 5-sentence short stories and preconditions aren't the main focus, it only has 400 instances. The Event StoryLine dataset is small in size too, and it also doesn't have a precise precondition relation. The dataset instead has RISING_ACTION, which includes preconditions in its definition, but the same label captures other concepts like subevents and entailment. There are ~5000 instances, but only a fraction are preconditions and it is not possible to separate them out.
This paper is thus distinct from prior work in annotating grounded written text at a scale large enough to enable machine learning solutions. This enables our target tasks: text classification and generation.

Preconditions as Relations
Our goal is to develop a resource that can help models reason about the necessary preconditions for events mentioned in text. This is useful for planning towards a goal, explaining how a certain situation came about, and predicting what future events are plausible. We make two important design choices in building such a resource: Grounding (the resource is grounded in text, particularly over events in the news domain) and Framing (we construct the resource with preconditions framed as event-to-event relation pairs in a specific context).
Grounding: We ground the resource to text so that we can leverage the full context of the events, and we choose the news domain due to its common use in other event-related tasks such as event extraction, schema generation, and temporal reasoning.
Framing: Broadly speaking, preconditions specify what must exist/happen before something else can exist/happen (Fikes and Nilsson, 1971;Sap et al., 2019). It is natural to think of a precondition as a state of the world that must be satisfied for an event to happen i.e. a state-event relation. However, the state of the world is hard to circumscribe for most real world events, and more importantly the precondition state is often left unsaid in a story. Rather, the author will more often mention an event from which it follows that the precondition state is satisfied. Thus, it makes sense to frame preconditions as relations between two events described in their specific textual context.
We first present a formal definition based on this notion and then describe a crowdsourcing methodology for obtaining this knowledge at scale.
Definition: Given a target event mention t and a candidate event mention p, we assert p is a precondition event for t if p is necessary for t to happen i.e., t likely would not have occurred without p, in the current text context.
Using the example of opening a door from the Introduction, turning the key is a precondition event (for opening the door) because it results in a state where the door is unlocked. The opening event cannot occur without such a state. Importantly, we do not define a precondition event as an absolute requirement for the target (the door opening) to occur in all scenarios. However, we do require that the target event likely would not have occurred without it in the current context. This allows another story with an alternate event, such as "I picked the lock". Both picking-lock and turning-key are preconditions in their own story contexts. Strict logicians might take issue, but language understanding requires a looser definition that uses likelihood of occurrence when interpreting real-world scenarios.

Preconditions Dataset
This section describes our methodology to annotate news articles with the previous section's definitions. One problem with annotating preconditions in text is the large number of event mentions in each article, which means annotation of all possible event pairs is infeasible. The temporal community has struggled with this same dilemma (Chambers et al., 2014;Vashishtha et al., 2019).
We address the question of which pairs to annotate with two approaches. First, instead of attempting a dense annotation of few articles, we sub-sample candidate pairs of events across many articles. Second, we use an automatic temporal relation classifier to filter pairs by identifying possible candidates. We then ask crowd-workers to annotate the resulting pairs for preconditions.

Candidate Event Pair Extraction
Sub-sampling event pairs at random from a document can result in a large number of pairs that are not preconditions. Because precondition event pairs ought to be temporally related (i.e., the precondition should precede the target event), we can filter the candidate event pairs to only those that are in a BEFORE or AFTER relationship.
As a first step, we extract events and their temporal relations from news articles using CAEVO (Chambers et al., 2014), a temporal relation extraction system. Other systems are available, but we chose CAEVO for two main reasons: (1) it automatically extracts both events and their temporal relations, and (2) it extracts events in any form (verbs, nouns, and adjectives), which gives broader coverage than some other recent systems that only consider verbs as events (Ning et al., 2018). We ran CAEVO on a random sample of 6,837 articles in the New York Times Annotated Corpus (Sandhaus, 2008).
On average CAEVO extracted around 63 events per article, which yielded a total of 3,906 possible relation candidates per document. We filtered these to retain only pairs of events that have a BEFORE or AFTER temporal relation between them. We call the temporally preceding event the candidate precondition, and the temporally subsequent event in the pair the target event. We filtered out pairs involving causative preconditions or reporting verb targets to remove trivial context-independent preconditions (see Appendix for examples).
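The candidate count quoted above is consistent with counting every ordered pair of distinct events (63 × 62 = 3,906). The following is a minimal sketch of that arithmetic and of orienting BEFORE/AFTER pairs as (candidate precondition, target). The triple format and the `VAGUE` label in the toy data are assumptions made for illustration, not CAEVO's actual output format.

```python
def ordered_pair_count(num_events):
    """Number of ordered pairs of distinct events: n * (n - 1)."""
    return num_events * (num_events - 1)


def to_candidates(relations):
    """Keep BEFORE/AFTER pairs, oriented as (candidate_precondition, target).

    `relations` is a hypothetical list of (event_a, event_b, label) triples,
    read as "event_a <label> event_b".
    """
    candidates = []
    for event_a, event_b, label in relations:
        if label == "BEFORE":
            # event_a precedes event_b, so event_a is the candidate precondition.
            candidates.append((event_a, event_b))
        elif label == "AFTER":
            # event_a follows event_b, so event_b is the candidate precondition.
            candidates.append((event_b, event_a))
        # Pairs with any other label are dropped.
    return candidates


# Toy relations over the door example from the Introduction.
relations = [
    ("turned", "opened", "BEFORE"),
    ("opened", "heard", "AFTER"),
    ("sing", "turned", "VAGUE"),
]
pairs_per_article = ordered_pair_count(63)
candidates = to_candidates(relations)
```

Here `pairs_per_article` reproduces the 3,906 figure, and only the two temporally ordered pairs survive the filter.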
From the remaining, we randomly sampled 40,500 pairs for annotation. We used the first 500 in a pilot annotation to help us improve the task instructions. We then used the remainder for the actual annotation.

Crowdsourcing
The annotators were presented with a text snippet and two event mentions highlighted. Figure 1 shows two examples. To prune out event extraction errors from CAEVO, the annotators were first asked if the highlighted text denoted valid events. An event was deemed valid only if it describes an action that occurs in the world.¹ If both triggers were deemed valid, then the annotators evaluated whether or not the candidate precondition event was an actual precondition for the target event. Specifically, they checked if the candidate event is necessary for the target event to happen.²
¹ We left the decision on event validity up to the annotators. We asked annotators to consider an event with its context rather than the meaning of the word alone. This includes the negation of an event, which might imply a prevention relation.
² We expected annotators to make decisions on the given CAEVO output, and they were not allowed to suggest a directional change. We limited the number of labeling options to keep the annotation instructions as straightforward as possible.

                     Precond.   Non-Precond.
Disagreement         13.5%      9%
 - Event Validity    1.5%       3.5%
 - Relation          12%        5.5%
Table 1: Expert review of PeKo annotations. "Event Validity" indicates annotation error on validity labels, "Relation" indicates errors on identifying the event-event relation.

We used a pilot task to refine the instructions and the examples to improve consistency amongst the annotators. For the main annotation task, we used four crowd-workers to annotate each instance. For quality control, each HIT included control instances whose labels we knew a priori. We retained only those event pairs where a majority (i.e., at least three) of the annotators agreed on the label and used the majority label as the gold label for each instance.

Dataset Quality and Analysis
The resulting dataset, which we call PeKo, contains more than 30K annotated relations (~10k preconditions, ~20k not).
Annotation Quality: The annotators had fair inter-annotator agreement with a Fleiss Kappa value κ = 0.387. We used 4 Turkers per event pair to ensure accuracy and filter out disagreements. To further measure the quality of the annotation, we randomly sub-sampled 400 instances from the annotated data and re-annotated them using four "expert" graduate students trained to recognize preconditions. A post-analysis of the expert and crowd annotations shows the annotation to be of high quality. Table 1 summarizes the quality statistics. Experts disagree with the crowd-sourced annotations in only 11.75% of the cases, with a slightly higher disagreement for precondition instances at 13.5%. A small percentage of these disagreements are on determining when an event is valid.
We also analyzed the discarded instances that received conflicting votes. Only 10% of these instances can be considered preconditions, and some of them are arguable based on their context. Here is an example: Before he was hired in 2005, before his team upset Texas last season, he educated himself on the college culture.
According to the context with discourse cues, one can reasonably conclude that educated is necessary for the event hired to happen. However, one might also disagree based on the fact that the connection is not perfectly clear.
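The aggregation rule (four annotators, keep an instance only if at least three agree) and the Fleiss Kappa statistic used in this section can be sketched as follows. This is an illustrative implementation of the standard Fleiss formula on toy labels; the label strings and function names are assumptions, and it is not the authors' annotation pipeline.

```python
from collections import Counter


def majority_label(votes, min_agree=3):
    """Return the majority label if at least `min_agree` votes match, else None."""
    label, count = Counter(votes).most_common(1)[0]
    return label if count >= min_agree else None


def fleiss_kappa(rating_rows):
    """Fleiss' kappa for a list of per-item label lists (same rater count per item).

    Raises ZeroDivisionError if every rating falls in a single category.
    """
    n_items = len(rating_rows)
    n_raters = len(rating_rows[0])
    categories = sorted({label for row in rating_rows for label in row})
    # counts[i][j]: number of raters who assigned item i to category j.
    counts = [[row.count(c) for c in categories] for row in rating_rows]
    # Per-item observed agreement.
    p_items = [
        (sum(c * c for c in item) - n_raters) / (n_raters * (n_raters - 1))
        for item in counts
    ]
    p_bar = sum(p_items) / n_items
    # Chance agreement from the overall category proportions.
    p_cat = [
        sum(item[j] for item in counts) / (n_items * n_raters)
        for j in range(len(categories))
    ]
    p_e = sum(p * p for p in p_cat)
    return (p_bar - p_e) / (1 - p_e)


# Toy votes: "P" = precondition, "N" = not a precondition.
kept = majority_label(["P", "P", "P", "N"])     # three of four agree
dropped = majority_label(["P", "P", "N", "N"])  # no label reaches three votes
```

Instances for which `majority_label` returns `None` correspond to the discarded, conflicting-vote cases discussed above.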
Text Position: As with temporal and other event-event relations, one might ask if position in text is an indicator of a precondition relation. We thus tallied our annotations and identified how many intervening verbs occurred between the annotated event pairs, as well as how far apart they are in the document based on token distance. Figure 2 shows these distributions. The negative numbers indicate distance when the precondition event occurs after the target event. As the graphs show, the majority of preconditions occur first in the text, but a sizable number are actually reversed, with an evenly spread out distribution over distance.
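The two position statistics can be sketched as signed quantities, positive when the precondition appears first in the text and negative when it follows the target. The token indices and the list of verb positions below are hypothetical inputs; how the paper tokenizes text and identifies verbs is not specified here.

```python
def signed_token_distance(precond_index, target_index):
    """Token distance, negative when the precondition occurs after the target."""
    return target_index - precond_index


def signed_intervening_verbs(precond_index, target_index, verb_indices):
    """Count verbs strictly between the two events, signed like the token distance.

    Note: a count of zero carries no sign, so direction is lost in that case.
    """
    low, high = sorted((precond_index, target_index))
    count = sum(1 for i in verb_indices if low < i < high)
    return count if precond_index < target_index else -count


# Hypothetical token positions of verbs in a snippet.
verb_positions = [1, 4, 8, 16]
forward = signed_intervening_verbs(1, 16, verb_positions)
reverse = signed_intervening_verbs(16, 1, verb_positions)
```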

Precondition Predicate → Target Predicate
pay → provide, try → get, know → miss, ask → make, use → provide,
love → miss, go → provide, delay → mean, look → find, find → use,
take → get, ask → take, work → make, tell → take, use → find,
know → get, born → die, agree → pay, use → help, touch → miss,
go → find, get → help, move → take, lose → help, leave → take

For further insight into the dataset, Table 2 lists the most frequent verbs that were annotated as precondition-target pairs. While there are a few pairs that can be readily interpreted without other context (e.g., everyone is born before they can die, and you must look before you can find), most other pairs require additional context from the text itself.

Comparison to Other Datasets
Section 2 described how this new dataset differs from prior work. We now include Table 3 to further illustrate the size difference, showing an order of magnitude more precondition instances than prior corpora with specific precondition annotations.
We consider our precondition a broader concept than that in RED. We focus on necessary events, which covers both precondition and causal relations in the RED dataset.
Table 3: Comparison of labeled corpora. The instances are how many total labels, and precondition is how many precondition-related instances. We included causation+precondition labels in the total counts if causation exists. *Event StoryLine mixes preconditions with many other relations, so the 5,519 is an upper bound.

PeKo Tasks and Evaluation
Having created the PeKo annotated corpus, we now propose two tasks that test for the ability to recognize and generate preconditions in textual contexts.
Here we describe evaluations to benchmark the performance of current models on these tasks and to better understand the challenges involved.

PeKo Task 1: Precondition Identification
Given a text snippet with a target and candidate event pair, the task is to classify if the candidate event is a precondition for the target in the context described by the text snippet. This is a standard sentence-level classification task. We evaluate two strong and widely-used large transformer-based language models: fine-tuned BERT (Devlin et al., 2019) and XLNet (Yang et al., 2019) base models. For each model, we take the final representation of each event trigger, concatenate them, and then feed the result into a linear classification layer. We also evaluate a 1-layer GRU sequence model (Cho et al., 2014) with GloVe embeddings (Pennington et al., 2014) to calibrate against a much simpler baseline. See the Appendix for more details on parameters, layer sizes, and training time.
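The classification head described above (trigger representations concatenated and passed through a linear layer) can be illustrated with a small dependency-free sketch. The toy vectors stand in for encoder outputs from BERT or XLNet, and the concatenation order, weights, and function name are assumptions for illustration only.

```python
import math


def classify_pair(hidden_states, precond_idx, target_idx, weights, bias):
    """Concatenate two trigger vectors and apply a linear layer plus softmax.

    hidden_states: per-token vectors (lists of floats) from some encoder.
    weights: one row per class, each of length 2 * hidden_size.
    bias: one value per class.
    """
    # List concatenation: [precondition trigger vector ; target trigger vector].
    features = hidden_states[precond_idx] + hidden_states[target_idx]
    logits = [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, bias)
    ]
    # Numerically stable softmax over the class logits.
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy encoder output: three tokens, hidden size 2.
hidden_states = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
bias = [0.0, 0.0]
probs = classify_pair(hidden_states, 0, 1, weights, bias)
```

In a real system the weights would be learned jointly with (or on top of) the encoder; here they are fixed by hand so the computation is easy to follow.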
Precondition identification is a difficult task. Table 4 shows the results. The GRU-based sequence model trained from scratch on PeKo is better than a prior-based random baseline³ but still leaves considerable room for improvement. BERT and XLNet both perform substantially better (> 71 F1) than the GRU model (63.7 F1), but an F1 score of about 71 illustrates that this is a difficult task not readily solved by simply fine-tuning large LMs.
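For reference, F1 is the harmonic mean of precision and recall. The sketch below computes it for a single positive class; treating the precondition class as positive is an assumption here, since the exact averaging behind the reported scores is not stated in this section.

```python
def f1_score(gold, predicted, positive="precondition"):
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined).
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy labels: one hit, one false alarm, one miss, one correct rejection.
gold = ["precondition", "precondition", "other", "other"]
predicted = ["precondition", "other", "precondition", "other"]
score = f1_score(gold, predicted)
```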
Precondition information is not readily available in BERT.
One premise for our work is that distributional knowledge alone is insufficient to capture precondition relations. We conduct two sets of inoculation-based probing experiments (similar to Liu et al. (2019)) to get at how the information in the pretrained LM representations relates to preconditions. We use BERT in fine-tuning mode and in feature-extractor mode (where the parameters for BERT are fixed and only those in the classification layer are updated) and measure performance with increasing amounts of data. If the performance peaks early with only small amounts of data, then it tells us that most of the information necessary for recognizing preconditions is in a readily accessible form in the original LM representations. On the other hand, if performance keeps increasing, then it suggests that PeKo provides extra information. Figure 3 shows that neither model plateaus quickly. BERT as a feature extractor (dashed line) plateaus around 50% of the data: the fixed features from LM pre-training hit a performance ceiling. In contrast, fine-tuning BERT, which adapts the representation to the PeKo task, provides continuous improvements with increasing amounts of data. Together, these suggest that a substantial amount of precondition knowledge is not easily adapted from the language modeling information captured in BERT but can be learned from PeKo.
Figure 3: Inoculation: Performance of fine-tuning (solid) and feature extractor (dotted) modes of BERT with increasing amounts of PeKo training data. Neither plateaus quickly, suggesting that precondition knowledge is not readily accessible in BERT.

Role of Context.
Table 5 compares the performance of BERT when using different levels of context. Using event triggers alone achieves moderate performance. This suggests that the verb trigger does carry a lot of the precondition knowledge regardless of event arguments (e.g., canceling requires scheduling first, but in most cases it doesn't matter what is canceled). However, if we use event tuples⁴, which also capture the main entities of the event, then we see a significant improvement in performance (+6.9 points). In addition to the tuples of the event pair, adding tuple representations of neighboring events provides an additional gain (+1.5 points). Further inspection of the tuple-based representation shows that automatic tuple extraction sometimes introduces errors and misses critical context and other important discourse cues. The best results come from using the sentence(s) that contain the event pair in their entirety; adding further sentences leads to worse performance.
When is it difficult to identify preconditions? The first plot in Figure 4 shows that the F1 score is highest where the target event is in the same sentence as the precondition event, next highest where the target event is in the sentence that follows the precondition event, and lowest when the target event is in the previous sentence. A similar trend holds for different verb distances as well, as seen in the second plot. As the distance increases, the F1 score decreases in either direction. However, F1 scores on the negative side are lower than on the positive side, showing the difficulty of the task when the target verb precedes the precondition.
Figure 4: Upper: F1 when the target event precedes, is in the same, or follows the precondition's sentence. Lower: F1 for varying # of intervening verbs between the event pair.

PeKo Task 2: Precondition Generation
Here we introduce Precondition Generation as a more general challenge that a dataset like PeKo now enables. Given a target event t, generate an event p that is a precondition for t. We first show how to create instances for this task using the PeKo dataset and then benchmark performance on evaluation instances drawn from both PeKo and an out-of-domain dataset ATOMIC (Sap et al., 2019).
Generation Training Task. We created precondition generation training instances by transforming each PeKo instance as follows. The input is the entire snippet of a PeKo instance (i.e., the entire text snippet with a pair of events where one is marked as a precondition of the other) but with the precondition portion of the snippet replaced by a [BLANK] slot. The output for the generation instance is the entire sentence where the [BLANK] is filled in with text representing a precondition event. See Table 7 for examples. Note that because the precondition portion can occur anywhere (earlier or later) in the sentence, we do not frame this as a typical left-to-right language model completion task. Instead, the models have to generate the entire sentence in addition to filling in the [BLANK] slot with a plausible precondition. We use the text chunk spanned by the precondition trigger node in the constituency parse as the precondition portion.
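The instance transformation described above can be sketched as a simple span replacement. In this sketch the precondition span is assumed to be given as token offsets; in the paper it comes from the constituency parse, which is not reproduced here, and the function name and example sentence are hypothetical.

```python
def make_generation_instance(tokens, precond_span):
    """Build an (input, output) pair for precondition generation.

    tokens: the snippet as a list of tokens.
    precond_span: (start, end) token offsets of the precondition chunk,
        with `end` exclusive.
    """
    start, end = precond_span
    # Input: the snippet with the precondition chunk replaced by one [BLANK] slot.
    source = tokens[:start] + ["[BLANK]"] + tokens[end:]
    # Output: the full original snippet, so the model regenerates everything.
    return " ".join(source), " ".join(tokens)


tokens = "I turned the key and the door opened".split()
model_input, model_output = make_generation_instance(tokens, (0, 4))
```

Because the output is the whole snippet rather than only the missing chunk, the same format works whether the blank falls at the start, middle, or end of the text.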
We benchmark three variations of a large language model, GPT-2 (Radford et al., 2019), to show how much precondition information can be generated directly from general language models and from temporal knowledge in comparison to learning from PeKo: (i) LM-GPT-2: training on instances created from a random collection of sentences, to mimic fine-tuning GPT-2 only for the format of this task but with no special constraint on the relation between the events in the instance. We randomly select sentences with a pair of events, choose at random one event as target and the other as precondition, and then create the generative training instances as described earlier. (ii) Temp-GPT-2: training on instances created from temporally BEFORE events, randomly sampled from the non-precondition portion of the PeKo dataset. (iii) PeKo-GPT-2: training on generation instances created from the training portion of the PeKo dataset. LM-GPT-2 trains on 18,000 instances (since it is not limited by PeKo data), whereas Temp-GPT-2 and PeKo-GPT-2 train on 6,000 instances.
Testing on PeKo. For testing, we transform instances from the testing portion of PeKo. Because precondition instances can sometimes contain strong linguistic and syntactic cues for preconditions, we create test instances only from the non-preconditions in PeKo. This is a stronger test of models' abilities that mitigates some of the confounds of how the sentence is structured.
Testing on ATOMIC. We used the following heuristics to address the peculiarities of ATOMIC and improve compatibility with training. We filtered instances such that they are full sentences, with fully-specified arguments for events, and with single participant instances. We replace Person variable mentions with third-person pronouns.
Table 6: Human evaluation of generation. Sense: Average sensibility rating on a 0-3 scale. Precond.: Percentage of instances with valid precondition outputs. Parenthetical numbers are percentages within instances with sensible score ≥ 2. Bold face indicates best results.
Benchmarking Precondition Generation. Table 6 shows results of a manual evaluation of the generated preconditions⁵. Three of the authors of this paper evaluated 150 instances of generated text snippets from three systems. The snippets from the systems were randomly swapped during the blind evaluation. Each output was first rated for sensibility on a scale of 0 to 3, where 3 means the output is perfectly sensible as English, and 0 means nonsensical. The output, which contains the marked target and precondition event pair, was then rated on a binary scale: 1 if the precondition relation holds; 0 otherwise. The same annotation guidelines described in Section 4.2 were followed to ignore invalid events, hypotheticals, and other noisy output.
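The statistics reported for this evaluation (average sensibility, percentage of valid preconditions, and the same percentage restricted to outputs with sensibility ≥ 2) can be sketched as below. How ratings from the three evaluators were combined is not stated, so this sketch simply treats each (sensibility, precondition) rating as one row; the function name and toy ratings are hypothetical.

```python
def summarize_ratings(ratings, sensible_threshold=2):
    """Summarize (sensibility 0-3, is_precondition 0/1) ratings.

    Returns (average sensibility, % preconditions overall,
    % preconditions among outputs with sensibility >= threshold).
    """
    total = len(ratings)
    avg_sense = sum(sense for sense, _ in ratings) / total
    precond_pct = 100.0 * sum(flag for _, flag in ratings) / total
    sensible = [flag for sense, flag in ratings if sense >= sensible_threshold]
    # Guard against the case where no output reaches the threshold.
    sensible_pct = 100.0 * sum(sensible) / len(sensible) if sensible else 0.0
    return avg_sense, precond_pct, sensible_pct


# Toy ratings, not taken from the paper's results.
ratings = [(3, 1), (2, 0), (1, 0), (0, 0)]
avg_sense, precond_pct, sensible_pct = summarize_ratings(ratings)
```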
Results in Table 6 show that LM-GPT-2, the version that trains on random event pairs, struggles: it produces the fewest precondition outputs. PeKo-GPT-2 generates plausible preconditions nearly twice as often as the Temp-GPT-2 baseline. These results illustrate the need for PeKo, as preconditions do not easily fall out from today's large LMs. The trends also hold for the out-of-domain ATOMIC instances, indicating generalization to everyday events in the ATOMIC dataset. On ATOMIC we see more preconditions than on the original PeKo dataset. We hypothesize that this is in part because in the PeKo test set, we created harder cases where the models have to generate preconditions to fit in text that originally contained a non-precondition event. Table 7 shows some examples that illustrate the differences between training on PeKo and otherwise. As expected, the non-precondition trained model outputs events that temporally precede the target event but are not necessarily preconditions.
⁵ Automatic evaluation against reference preconditions is not informative since there can be multiple preconditions for any given event. We found that using BLEU, for instance, showed no difference between Temp-GPT-2 and PeKo-GPT-2 despite the huge difference in manual evaluation.

Table 7
PeKo
INPUT: [BLANK] that will enable consumers to quickly download videos...
Temp-G2: The company has said that it will be able to sell more phones in the coming months
PeKo-G2: The company also agreed on a plan
INPUT: And finally a third will rebuild homes in the historic older district.
Error Analysis. We evaluated the outputs for 50 instances from PeKo-GPT-2 and found three main categories of failures: (i) Difficulty in handling input context (56%): In some cases the input target event context is too limited, whereas in others the context is too complex, with many intervening entities or a chained set of events after which the model is supposed to generate a precondition. In another set of cases, the sentence structure of the context sets up a hypothetical precondition event or a reporting verb. (ii) Common language generation errors (28%): Cases like repetition, semantically implausible text, and hallucinated new entities whose relation to the original context is not clear. (iii) Temporally related (16%): Cases where the outputs are temporally and topically related but are not preconditions, indicating failures in generalizing precondition knowledge.
Overall, these first results on PeKo suggest that training on this new dataset enables a generative model to learn some common-sense precondition knowledge beyond basic language modeling cues. We see room for improvement both in terms of modeling as well as training approaches.

Conclusions
Knowing what conditions are necessary for an event to happen is critical for understanding and reasoning about events mentioned in text. In this work, we address the lack of a large-scale resource for learning precondition knowledge about events. Our crowdsourcing methodology yielded more than 10,000 precondition event relations (and 20,000 negative examples) from news domain texts. We showed in both classification and generation that these relations are not readily accessible in distributional knowledge encoded by large language models, highlighting the challenges in learning common-sense knowledge from text. We also proposed two new challenge tasks based on PeKo and hope they help drive further research into rich event understanding that touches a variety of areas, from schema learning and information extraction to story generation.

A.1 Candidate Filtering for Crowdsourcing
We discard event pairs that come from the same sentence when the candidate precondition is a causative verb or when the target is a reporting verb. This is because the precondition relation in both cases holds regardless of the context. Consider the following examples:
(A) He said that his birth mother lived nearby.
(B) The president made his secretary create copies of the report.
In (A) a reporting verb ('said') is in the target position, and in (B) a causative ('made') is the candidate precondition; the candidates in both cases are reliable preconditions independent of the context. For instance, in (B), if we use a new context "not create copies of the report", the precondition relation would still hold. Since we aim to collect precondition knowledge that can be obtained at least partially from context, we excluded these reporting-verb and causative-verb instances from our candidate pool.
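The filtering rule in this section can be sketched as a simple predicate over a candidate pair. The two verb sets below are small hypothetical stand-ins; the actual lists of reporting and causative verbs used for PeKo are not given in the text.

```python
# Hypothetical, deliberately tiny verb lists for illustration only.
REPORTING_VERBS = {"say", "said", "tell", "told", "report", "reported"}
CAUSATIVE_VERBS = {"make", "made", "cause", "caused", "let"}


def keep_pair(precond_trigger, target_trigger, same_sentence):
    """Return False for same-sentence pairs that are trivially preconditions."""
    if not same_sentence:
        # The rule only applies to pairs drawn from a single sentence.
        return True
    if precond_trigger.lower() in CAUSATIVE_VERBS:
        return False
    if target_trigger.lower() in REPORTING_VERBS:
        return False
    return True


# Examples (A) and (B) from this section are both filtered out.
keep_a = keep_pair("lived", "said", same_sentence=True)
keep_b = keep_pair("made", "create", same_sentence=True)
```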

A.2.1 Data Split
We split our dataset into train/dev/test sets with a ratio of 6:2:2. Since the classes are imbalanced, we split the data separately for each class and then randomly shuffle the instances in each set together.
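A per-class 6:2:2 split of this kind can be sketched as follows. The seed, the use of integer ratios, and the rounding behavior (leftover instances go to the test set) are assumptions for illustration; the exact procedure used for PeKo may differ.

```python
import random


def stratified_split(instances, ratios=(6, 2, 2), seed=0):
    """Split (item, label) pairs into train/dev/test, class by class."""
    rng = random.Random(seed)
    by_label = {}
    for item, label in instances:
        by_label.setdefault(label, []).append((item, label))
    total_ratio = sum(ratios)
    train, dev, test = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        # Integer arithmetic avoids floating-point rounding surprises.
        n_train = len(group) * ratios[0] // total_ratio
        n_dev = len(group) * ratios[1] // total_ratio
        train += group[:n_train]
        dev += group[n_train:n_train + n_dev]
        test += group[n_train + n_dev:]
    # Mix the classes together within each split.
    for part in (train, dev, test):
        rng.shuffle(part)
    return train, dev, test


# Toy imbalanced data: 10 positive and 20 negative instances.
data = [(i, "pos") for i in range(10)] + [(i, "neg") for i in range(10, 30)]
train, dev, test = stratified_split(data)
```

Splitting each class separately keeps the class proportions roughly equal across the three sets.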

A.2.3 Parameters
Identification Task: All models for the identification task are trained for 50 epochs with a batch size of 16. A model is picked based on its performance (i.e., F1 score) on the dev set among 5 different random seeds. All other parameters are described in Table 8.
Generation Task: All three models use the same GPT-2 architecture, which has 163,047,936 trainable parameters. The number of epochs is set to 100 with a batch size of 16. Models are picked based on loss on the dev set.
We use AdamW (Loshchilov and Hutter, 2019) for the optimizer in both tasks.

A.3 Testing on ATOMIC
We used the following heuristics to address the peculiarities of the ATOMIC dataset and improve compatibility with training: 1) We remove instances that do not have a fully specified argument for the event (referred to as placeholders in their paper (Sap et al., 2019)). 2) We only use 'simple' instances that mention a single participant because the context often contains enough information to fully understand the target event. 3) We only use instances that are complete sentences and not fragments. 4) To make the inputs more natural, we replace the Person variable mentions with a third-person pronoun and add markers around the main verb and the placeholder [BLANK] at the end: "PersonX is in dire need of money" becomes "He <target> is </target> in dire need of money [BLANK]".
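The fourth heuristic can be sketched as a small string transformation that reproduces the example given above. The main verb is passed in by hand here, since how it is identified is not described; the function name is hypothetical, and the sketch ignores details such as lowercasing a pronoun that appears mid-sentence.

```python
def atomic_to_input(event, main_verb, pronoun="He"):
    """Rewrite an ATOMIC event string into the generation input format."""
    output = []
    marked = False
    for word in event.split():
        if word == "PersonX":
            # Replace the Person variable with a third-person pronoun.
            output.append(pronoun)
        elif word == main_verb and not marked:
            # Mark only the first occurrence of the main verb as the target.
            output.extend(["<target>", word, "</target>"])
            marked = True
        else:
            output.append(word)
    # The slot for the generated precondition goes at the end.
    output.append("[BLANK]")
    return " ".join(output)


converted = atomic_to_input("PersonX is in dire need of money", "is")
```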