Defining and Learning Refined Temporal Relations in the Clinical Narrative

We present refinements over existing temporal relation annotations in the Electronic Medical Record clinical narrative. We refined the THYME corpus annotations to more faithfully represent nuanced temporality and nuanced temporal-coreferential relations. The main contributions are in re-defining CONTAINS and OVERLAP relations into CONTAINS, CONTAINS-SUBEVENT, OVERLAP and NOTED-ON. We demonstrate that these refinements lead to substantial gains in learnability for state-of-the-art transformer models as compared to previously reported results on the original THYME corpus. We thus establish a baseline for the automatic extraction of these refined temporal relations. Although our study is done on clinical narrative, we believe it addresses far-reaching challenges that are corpus- and domain-agnostic.


Introduction
Temporal relation extraction and reasoning in the clinical domain continues to be a primary area of interest due to the potential impact on disease understanding and, ultimately, patient care. A significant body of text available for this purpose is the THYME (Temporal Histories of Your Medical Events) corpus (Styler IV et al., 2014), consisting of 594 clinical and pathology notes on colon cancer patients and 600 radiology, oncology and clinical notes on brain cancer patients, all from the Electronic Medical Record (EMR) of a leading US medical center. This dataset has previously undergone a variety of annotation efforts, most notably temporal annotation (Styler IV et al., 2014). It has been part of several SemEval shared tasks such as Clinical TempEval (Bethard et al., 2017), where state-of-the-art results have been established. Our goal was to use the THYME corpus to enable the extraction of more extensive patient timelines by manually creating cross-document links that built on the pre-existing single-file annotations. Wright-Bettner et al. (2019) discuss how a subset of the THYME temporal annotations contributed to incompatible temporal inferences, thus reducing their ability to support meaningful temporal reasoning. Accuracy and informativeness of temporal relation gold annotations are essential for their effectiveness in training a system for temporal relation extraction. We build on this work by offering an in-depth discussion of three key temporal relations, namely CONTAINS, CONTAINS-SUBEVENT (abbreviated as CON-SUB), and NOTED-ON, and of how the addition of the last two types enhances the learnability of the temporal relations by resolving the conflicting temporal information in the original annotations.
While these revisions are corpus-specific, the reasoning behind them has far-reaching implications for automated timeline extraction. Since the cross-document linking task inherently deals with multiple, discrete narratives, it exposes the practical impact of discourse context on word sense (different narratives have different goals, which in turn influences meaning interpretation). This is discussed in detail in Section 4. We empirically found it essential to take changes in discourse context into account and suggest the same would be true for any annotation project that is interested in temporal reasoning, particularly those dealing with longer timelines (i.e., beyond the single-document level).
Recent developments in natural language processing establish neural approaches, and more specifically transformer-based methods, as the state of the art. Pre-trained models such as BERT (Devlin et al., 2019), BioBERT (Lee et al., 2020), XLNet (Yang et al., 2019), ALBERT (Lan et al., 2020), RoBERTa, BART, and SpanBERT (Joshi et al., 2020) report significant gains on multiple tasks. Thus, we demonstrate the learnability of the refined temporal relations in the context of these recent methodological developments.

Dataset
The 594 notes that make up the colon cancer part of the THYME corpus are grouped into sets, each set pertaining to a single patient and consisting of three notes written at different times during the patient's course of care. These notes had been previously annotated for five different intra-document temporal relations (BEFORE, OVERLAP, BEGINS-ON, ENDS-ON and CONTAINS), a subset of the ISO-TimeML temporal link (TLINK) types (Pustejovsky et al., 2010; Styler IV et al., 2014) 1 . To keep annotation manageable and circumvent massively inferential temporal linking, the THYME guidelines constrained TLINK creation to events within the same sentence or adjacent sentences, and specifically prohibited TLINKing across sections (these are clinically-delineated sections separated from each other by numerical section IDs: History of Present Illness is section 20103, Vital Signs is section 20110, etc.). Linguistic evidence for creating these TLINKs included local cues such as temporal prepositions and adjectives (e.g., during, subsequent to, prior to), chronological narrative progression, and so forth.
1 This approach to temporal annotation in fact evolved directly from the ISO-TimeML model, in collaboration with Prof. James Pustejovsky. The Basic Formal Ontology has introduced similar relations (Smith et al., 2005), and a post-hoc mapping could be done between the relations proposed here and those in BFO 2.0. However, we defer to more philosophical scholars to determine the similarities and differences between ISO-TimeML and BFO 2.0.
Additionally, the notes had been separately annotated for intra-document coreference (IDENTICAL) and bridging (SET-SUBSET, WHOLE-PART) relations, which were later merged with the temporal annotations (Wright-Bettner et al., 2019). Temporal relations alone are insufficient for timeline extraction; coreference relations are also necessary (is the tumor seen in September the same as the one seen in March?).
Pursuant to our goal of reasoning over large-scale timelines, we built on these pre-existing within-note annotations by manually adding coreference and bridging links across each set of three notes. In the process, we discovered that a subset of the original CONTAINS relations contributed to temporally-conflicting information, which led to the addition of two new TLINKs: NOTED-ON and CON-SUB (Wright-Bettner et al., 2019). We discuss below how these updates contribute to more accurate and comprehensive temporal relations, which facilitated cross-document linking. As such, this is one of the few studies in clinical NLP on cross-document temporal relation annotations (see Raghavan et al., 2014 and Wright-Bettner et al., 2019; also see Song et al., 2018 for general-domain cross-document temporal annotation discussions).

Refined Temporal Relation: NOTED-ON
The THYME guidelines specified that tests should always be in a CONTAINS relation with their results, as in (1), observing that it is inferential to say that something seen on the test exists before or after the test.

a. CT CONTAINS metastases
We agree with the authors in part, particularly for a project that did not have coreference relations and that relied heavily on explicitly-temporal cues (after, during, before, etc.). However, once the coreference links had been merged with the temporal annotations, more information was revealed about the temporal nature of the events. For example, metastases in (1) might well be IDENTICAL to a later mention, such as:
2. February 20, 2009, liver metastases resected with clear margins.
The merged THYME and coreference annotations 2 for (1) and (2) were as follows: Together, these relations entail that the same liver abnormalities were temporally contained by January 17, 2009 and temporally overlapped February 20, 2009, which is a logical impossibility. This situation was extremely frequent in the data, reducing the informativeness of the CONTAINS links and the timeline as a whole. We therefore manually converted these links to a subtype of the OVERLAP relation, NOTED-ON.
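The conflict described above can be checked mechanically. The following is a minimal sketch, assuming that events and dates are modeled as closed day intervals; the helper names `contained_by` and `overlaps` and the January to March 2009 search range are illustrative assumptions, not part of the THYME tooling.

```python
from datetime import date, timedelta
from itertools import product


def contained_by(event, span):
    """True if the event interval lies entirely inside the span."""
    return span[0] <= event[0] and event[1] <= span[1]


def overlaps(event, span):
    """True if the two closed intervals share at least one day."""
    return event[0] <= span[1] and span[0] <= event[1]


# The two dates mentioned for examples (1) and (2), each as a one-day span.
jan17 = (date(2009, 1, 17), date(2009, 1, 17))
feb20 = (date(2009, 2, 20), date(2009, 2, 20))

# Hypothetical search space: every interval inside January-March 2009.
days = [date(2009, 1, 1) + timedelta(days=i) for i in range(90)]
candidates = [(s, e) for s, e in product(days, days) if s <= e]

# Original reading: contained by the first date AND overlapping the second.
conflicting = [ev for ev in candidates
               if contained_by(ev, jan17) and overlaps(ev, feb20)]

# NOTED-ON reading: the finding merely overlaps the test date.
noted_on = [ev for ev in candidates
            if overlaps(ev, jan17) and overlaps(ev, feb20)]

print(len(conflicting) == 0)  # no interval satisfies the original reading
print(len(noted_on) > 0)      # the overlap-based reading is satisfiable
```

Under these assumptions, no interval can satisfy the original pair of relations, whereas treating the test link as overlap leaves consistent timelines available.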

NOTED-ON conveys temporal overlap between events and additionally says that Event A (result) was observed on/by Event B (test). (1a) was therefore re-annotated as follows:

metastases NOTED-ON CT
The application of this link was so constrained that it was fairly easy to annotate; we therefore added it as a single-annotated post-processing step.

Motivation and annotation
Early in the cross-document annotation pilot, we found the pre-existing intra-document schema categories were insufficient for dealing with cross-narrative phenomena, which in turn led to the addition of the new CON-SUB relation. Consider: pelvis ordered for further staging of this colon cancer.
i. adenocarcinoma IDENT cancer
Both sets of intra-document links are pragmatically appropriate. Discourse contexts can expand or reduce the level of granularity at which a sense is interpreted (Recasens et al., 2011; also see Hobbs, 1985). In Note A, the text supports a coarse-grained interpretation of adenocarcinoma, or what Hovy et al. (2013) term a "wide" reading; it refers generally to the patient's cancer. Note B, however, requires a fine-grained ("narrow") interpretation: adenocarcinoma here refers specifically to the new, inoperable tumor and is contrasted with the original, resected tumor.
The quandary for the cross-document task lies in whether to link adenocarcinoma in A as IDENTICAL to adenocarcinoma in B. An IDENTICAL relation entails logical impossibilities: assuming we also link cancer A as IDENT to cancer B, the combined within- and cross-document relations now say the recurrent adenocarcinoma temporally contains itself and the primary tumor which was removed years earlier. This reduces the meaningfulness of the CONTAINS links and therefore the timeline. On the other hand, leaving the two adenocarcinoma references unlinked fails to capture the significant semantic relation between two identical strings (recurrent adenocarcinoma) that do in fact refer, on some level of granularity, to the same real-world event. In either case, annotators are stymied.
Clearly, the pre-existing intra-document schema categories, specifically the binary coreference choice (A = B or A ≠ B), were insufficient for dealing with the sense variation and nuanced event structure exposed by multiple narratives. We therefore introduced CON-SUB (based on O'Gorman et al., 2016) as a subtype of the CONTAINS TLINK type. While CONTAINS conveys only temporal containment, CON-SUB additionally says that Event B is intrinsically part of the structure of Event A 3 . The difference may be seen in (5):
5. During patient's neoadjuvant treatment, she was in a car accident which delayed cycle 4.
a. treatment CONTAINS accident, delayed
b. treatment CON-SUB cycle
We were then able to re-annotate the intra-document relations for both notes A and B above as cancer CON-SUB adenocarcinoma, which streamlined cross-document decisions (cancer A IDENT cancer B and adenocarcinoma A IDENT adenocarcinoma B) while preserving the "quasi-identical" relation (Hovy et al., 2013) between the cancer and adenocarcinoma terms. While this solution does not fully resolve the problem of binary annotation choices, it does provide more "wiggle room" along the meaning spectrum (Cruse, 1986) by introducing a third value: two mentions may be identical, non-identical, or mereologically (part-whole) related.
In keeping with our primary goal of enabling timeline extraction, we implemented CON-SUB as a TLINK since it conveys true temporal containment. CON-SUB, however, differs from other TLINKs, which were constrained by proximity and lexical cues, as discussed in section 2. The fact that CON-SUB also represents structural information allowed us to treat it like a coreference/bridging relation in terms of permissible textual evidence for link creation: namely, semantic scripts. These may be defined as "a stereotypical sequence of events" (Araki et al., 2014) or "prototypical schematic sequences of events" (Chambers and Jurafsky, 2008). We can expect a surgery, for example, to consist of certain typical subevents (incisions, subprocedures, anesthesia administration, etc.), which enables annotators to look throughout the whole document for lexical items with meanings that fit those subevents. The concept of semantic scripts is what makes long-distance coreference/bridging linking attainable, and therefore long-distance CON-SUB linking as well. This is obviously not the case for non-subevent CONTAINS relations.
We were therefore able to revise the THYME annotations to accommodate CON-SUB in two ways. First, we converted CONTAINS links to CON-SUB as appropriate, e.g., treatment CONTAINS radiation became treatment CON-SUB radiation. Second, we added certain long-distance CON-SUB links for which there were no pre-existing CONTAINS annotations. These changes were made via a double-blind annotation process followed by an adjudication pass. The annotation team for the entire project (including the cross-document stage) consisted of nine annotators, eight of whom either had or were obtaining undergraduate or graduate degrees in linguistics. The ninth annotator was a physician who received on-the-job linguistics training and focused primarily on annotation subtasks that demanded considerable medical knowledge. Additionally, we consulted regularly with an oncologist and a medical coder with a decade of NLP annotation experience.

Inter-annotator Agreement
While the gold intra-document CON-SUB relations enabled high cross-document coreference agreement (93.77%), the IAA score for single-file CON-SUB links themselves was low at 34.14% 4 . Several factors contributed to this, but we focus on one major one here. The size and complexity of the guidelines 5 reflected the size and complexity of the task, augmenting the already heavy cognitive burden on annotators.
The cross-document task exposes greater nuance in event structure, involves greater variability of word sense, and attempts to join narratives that are temporally and linguistically disjunct. Annotation guidelines that set out to accurately represent information that is inherently nuanced, variable, and noncohesive will not be simple (see Savkov et al., 2016).
In determining guidelines for the subevent relation, we found that the best course for handling variability in lexical sense differed depending on the semantic potential (that is, how adaptable a word is to different meanings; see Evans, 2006) of the individual words used most frequently for each event category (i.e., semantic script) 6 . Not surprisingly, lexically-specific rules contributed significantly to the sheer size and complexity of the guidelines. Compare the following:
6. We are seeing the patient for recent diagnosis of colon cancer. The tumor in her colon is quite large.
7. We recommended adjuvant treatment. Patient will return to start chemo in two weeks.
Unmodified, treatment has an impoverished semantic potential (Evans, 2006); the meaning it conveys in itself is sparse, yielding a semantic flexibility that allows it to represent a wide range of referents. It is intuitive to understand it in (7) as coreferential with the much more precise chemo. Cancer, however, has a richer potential; it conveys a temporally-extensive disease that may have multiple manifestations (a primary tumor, recurrent tumors, metastatic tumors, etc.), rendering it less amenable to a coreferential link with tumor.
Therefore, for the cancer semantic script, annotators were asked to distinguish between terms that are defined more generally (cancer, disease) and terms that are more specific (adenocarcinoma, tumor, metastasis, mass, etc.), such that the specific terms were always subevents of the general terms, regardless of pragmatic support for wide or narrow readings for a given term. The reason for this has already been partially discussed in example (4); we add here that the semantic rigidity of cancer (compared to treatment/therapy terms, for example) also informed this choice.
This required a degree of abstraction from the text and an intentional suppression of instinctive linguistic judgments on the single-document level; for example, adenocarcinoma in (4) was re-annotated as a subevent of cancer, in spite of the fact that the text supports the wide reading. 7
On the other hand, for the treatment/therapy semantic script, annotators were to determine relations more intuitively. Compare the resulting impact for cross-document linking in (8) to (4):
8. Note A: We recommended adjuvant treatment. Patient will return for the first day of chemo in two weeks.
a. treatment IDENT chemo
Note B: Adjuvant treatment consisted of four months of chemo and radiation and was without complication.
b. treatment CON-SUB chemo
c. treatment CON-SUB radiation
If we linked treatment A to treatment B and chemo A to chemo B, the product would be the same undesirable entailments discussed in (4): in this case, that radiation is a subevent of chemotherapy (an entirely different treatment), and that the chemo event temporally contains itself. Here, however, our solution differed: rather than re-annotate (8a) as treatment CON-SUB chemo, we simply chose to leave treatment A and treatment B unlinked in cross-document annotation. The only cross-document link for this context was chemo A IDENT chemo B. Again, this is due to the semantic malleability of treatment. Leaving two identical strings (adjuvant treatment) unlinked to each other is less problematic when that string regularly refers to a wide variety of referents. Furthermore, in experiments with re-analysis that paralleled the cancer semantic script approach, we found that attempting to always annotate treatment as an umbrella event proliferated unnecessary nested relations (because of how modifiable it is), increasing disagreement potential. Unsurprisingly, an analysis of CON-SUB disagreements suggests that annotators struggled to remember when to abstract terms away from the context and when to interpret them intuitively; an example like (7) was apt to produce a disagreement, shown here in (9):

9. We recommended adjuvant treatment. Patient will return to start chemo in two weeks.
Annotator A: treatment IDENT chemo
Annotator B: treatment CON-SUB chemo
While Annotator A correctly interpreted the terms as coreferential (based on the context), Annotator B mistakenly followed the approach for the cancer semantic script in analyzing "treatment" as an umbrella event.
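The entailments discussed for example (8) can be reproduced with a small union-find over IDENT links. This is a minimal sketch under stated assumptions: the mention identifiers (`treatment_A`, `chemo_B`, and so on) and the helper names are hypothetical, and the code is not the project's annotation tooling.

```python
def find(parent, x):
    """Return the representative of x's IDENT cluster."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x


def merge_ident(mentions, ident_links):
    """Union mentions joined by IDENT links; map mention -> representative."""
    parent = {m: m for m in mentions}
    for a, b in ident_links:
        parent[find(parent, a)] = find(parent, b)
    return {m: find(parent, m) for m in mentions}


# Hypothetical mention ids for the two notes in example (8).
mentions = ["treatment_A", "chemo_A", "treatment_B", "chemo_B", "radiation_B"]
within_ident = [("treatment_A", "chemo_A")]        # (8a)
con_sub = [("treatment_B", "chemo_B"),             # (8b)
           ("treatment_B", "radiation_B")]         # (8c)


def entailed_con_sub(cross_ident):
    """Rewrite CON-SUB links over IDENT clusters after cross-document linking."""
    rep = merge_ident(mentions, within_ident + cross_ident)
    return {(rep[a], rep[b]) for a, b in con_sub}


def self_containing(links):
    """Links whose two arguments collapsed into the same cluster."""
    return [(a, b) for a, b in links if a == b]


# Linking both pairs across notes collapses treatment and chemo together.
both = entailed_con_sub([("treatment_A", "treatment_B"),
                         ("chemo_A", "chemo_B")])
# The chosen solution links only the chemo mentions.
chosen = entailed_con_sub([("chemo_A", "chemo_B")])

print(len(self_containing(both)))    # a cluster that contains itself
print(len(self_containing(chosen)))  # no self-containment
```

In the first configuration the cluster holding both chemo mentions becomes its own subevent and also acquires radiation as a subevent; linking only the chemo mentions avoids both entailments.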
The annotation task was already demanding, requiring annotators to learn and assimilate specialized medical knowledge and terminology from the clinical texts, which themselves are written in heavy shorthand and with a mix of template language and free text that sometimes conflict. In addition, the non-linguistic nature of the cross-document component forced the creation of several counterintuitive annotation rules, which frequently (but not always) required annotators to ignore real linguistic cues.
Finally, we did not calculate IAA for non-subevent CONTAINS links because they already existed. They did, however, change slightly, along with all the TLINK types, since the auto-merger of the temporal and coreference annotations (discussed in section 2) produced some informational conflicts that we calibrated in a manual single-annotated pre-processing pass. However, as a proxy for annotator agreement, we show below that learnability improved for all temporal relations.

Summary of Refined THYME Relations
The pre-existing CONTAINS annotations were revised in part through the addition of two new links, both of which convey temporal and non-temporal information. The original CONTAINS and OVERLAP relations were therefore reimagined as supertypes 8 , each consisting of two subtypes. All links are described in Table 1. We refer to these fine-grained THYME annotations as THYME+ and to the original THYME annotations as THYME.
In the next section we demonstrate the learnability of the refined THYME+ temporal relations with state-of-the-art transformer methods. We report results that establish baselines for the THYME+ corpus for further methods development, and we offer insights into the challenges we faced, which we view as exciting avenues for future research.

Learning Refined Temporal Relations
Following the same window-based processing (using a span of contiguous tokens disregarding sentence boundaries for generating relational candidates) and argument-marking mechanism developed by the prior study that achieved the state-of-the-art results on THYME (Lin et al., 2019), we tested a series of pre-trained models for extracting both within- and cross-sentence temporal relations (i.e., TLINKs) in a multi-class classification fashion. Figure 1 shows a CON-SUB relation between "cancer" and "adenocarcinoma", and its representation as a token sequence. Special token pairs, "eas" and "eae", "ebs" and "ebe", mark the events of interest in the sequence.
8 It is worth noting that while CONTAINS itself may be thought of as a specific subtype of an overlap temporal relation, we use the OVERLAP category specifically for non-containment temporal overlap or underspecified cases for which we cannot claim containment.
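The argument marking with the "eas"/"eae" and "ebs"/"ebe" token pairs can be sketched as follows. This is an illustration only: the function name `mark_arguments`, the end-exclusive span convention, and the sample sentence are assumptions, and the sketch does not reproduce the preprocessing of Lin et al. (2019).

```python
def mark_arguments(tokens, span_a, span_b):
    """Wrap two event spans in marker tokens.

    Spans are (start, end) token indices with an exclusive end, and are
    assumed not to overlap or share a start index.
    """
    starts = {span_a[0]: "eas", span_b[0]: "ebs"}
    ends = {span_a[1]: "eae", span_b[1]: "ebe"}
    marked = []
    for i, token in enumerate(tokens):
        if i in ends:       # close a span that ended just before token i
            marked.append(ends[i])
        if i in starts:     # open a span that begins at token i
            marked.append(starts[i])
        marked.append(token)
    if len(tokens) in ends:  # a span that runs to the end of the sequence
        marked.append(ends[len(tokens)])
    return marked


# Hypothetical sentence; the events are "cancer" and "adenocarcinoma".
tokens = ["The", "cancer", "was", "an", "adenocarcinoma", "."]
marked = mark_arguments(tokens, span_a=(1, 2), span_b=(4, 5))
print(" ".join(marked))
```

If the markers are to be treated as atomic symbols by a subword tokenizer, they would typically need to be registered with that tokenizer; whether and how the original system did so is not stated here.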

Pre-trained models BERT (Devlin et al., 2019), BioBERT (Lee et al., 2020), XLNet (Yang et al., 2019), ALBERT (Lan et al., 2020), RoBERTa, BART, and SpanBERT (Joshi et al., 2020) were used to encode each input sequence containing a relational candidate. The resulting [CLS] token representation was fed to the classification layer to predict the relation type for every relational pair candidate. For some of the popular models, such as BERT, RoBERTa, and BART, we also tried their large versions in addition to their base releases.
We used NVIDIA GTX Titan Xp GPU and Titan RTX GPU cluster of 7 nodes for fine-tuning the pre-trained models. Fine-tuning was done with HuggingFace's Transformers API (Wolf et al., 2019) and the TensorFlow-based BERT API, with batch size selected from (16, 32), a 60-token sliding window for generating candidate relational pairs, a maximal sequence length of 100 word pieces to accommodate all word pieces from the 60 tokens, and a learning rate selected from (1e-5, 2e-5, 3e-5, 5e-5). Performance was evaluated with the standard Clinical TempEval (Bethard et al., 2017) evaluation script, modified only to accommodate the new categories.
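The window-based candidate generation and the reported hyperparameter choices can be sketched as below. This is a rough illustration under explicit assumptions: a pair counts as a candidate when the two event positions are fewer than 60 tokens apart, pairs are unordered, and the batch sizes and learning rates are combined as a full grid. None of these details is confirmed by the text, and `candidate_pairs` is a hypothetical helper.

```python
from itertools import combinations, product

WINDOW = 60  # tokens, as stated in the text


def candidate_pairs(event_positions, window=WINDOW):
    """Return pairs of event token positions that fit inside one window.

    Assumption: a pair qualifies when the two positions are fewer than
    `window` tokens apart, regardless of sentence boundaries.
    """
    ordered = sorted(event_positions)
    return [(a, b) for a, b in combinations(ordered, 2) if b - a < window]


# Hypothetical token offsets of four events in one note.
pairs = candidate_pairs([3, 10, 70, 75])
print(pairs)

# Assumed full grid over the reported batch sizes and learning rates.
grid = list(product((16, 32), (1e-5, 2e-5, 3e-5, 5e-5)))
print(len(grid))
```

Any pair of events farther apart than the window never becomes a candidate under this scheme, which is relevant to relations whose arguments tend to be far apart.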

Experimental Results
The model that performed best on THYME (BioBERT-base) was trained and evaluated on THYME+ annotations. The first two rows of Table 2 gauge performance purely based on the refinements of the THYME+ annotations. Splitting CONTAINS into CONTAINS and CON-SUB relations and OVERLAP into OVERLAP and NOTED-ON relations leads to better learnability: CONTAINS goes from 0.664 F1 on THYME to 0.748 F1 on THYME+, and OVERLAP goes from 0.179 on THYME to 0.416 on THYME+. The best results for the new categories of CON-SUB and NOTED-ON are 0.072 F1 and 0.744 F1 respectively, results that establish baselines for these two new temporal relations. The performance on all types of relations for THYME+ is 0.625 F1 compared to 0.548 for THYME (Table 2, Overall column, rows 1 and 2).
Lin et al. (2019) report 0.684 F1 for THYME CONTAINS; however, that result was achieved when training on and evaluating for only the CONTAINS links, and when augmenting the training data with automatically generated CONTAINS relations. Thus, it is not a fair comparison for the results reported in Table 2.
Of the models beyond BioBERT that we explored, BART-large was the most successful. The result with BART-large was 0.748 F1 (Table 2, CONTAINS column, row 3). In general, certain pre-trained models, like BioBERT and BART, yield better results than the other models. BioBERT is pre-trained on biomedical text and thus can help encode clinical text better. BART masks a contiguous span of text rather than random tokens, which can be helpful for encoding clinical text where many event and temporal expressions consist of multiple tokens, e.g. "ascending colon cancer". Figure 2, so their low performance could thus be due to their lack of representation. CON-SUB's low performance could be attributed to its coreferential nature and the long-distance relations between the two arguments, which in many cases surpass our 60-token window limit. Table 2, row 4 presents the results with the BART-large model trained and evaluated when excluding CON-SUB links. The 0.750 F1 on CONTAINS is similar to the 0.748 F1 on CONTAINS when training and evaluating on all THYME+ relations. Thus, while the model is not able to accurately predict CON-SUB relations, including them does not appear to cause confusion for the model.

Discussion
Splitting CONTAINS into CONTAINS and CON-SUB categories improved the annotation quality of the CONTAINS class instances in THYME+. The CONTAINS class is the most frequent relation in clinical text and very easy for transitive closure to operate upon. As we already pointed out, the CONTAINS performance on THYME+ is improved from 0.664 F1 to 0.748 F1 using the same BioBERT model (Table 2, CONTAINS column, rows 1 and 2), with improvements in both P and R.
The creation of gold NOTED-ON instances was straightforward and therefore yielded high-quality annotations. NOTED-ON is the second most frequent relation in the corpus. 65.12% of the NOTED-ON relations are within one sentence and very few cases are long-distance. This makes the NOTED-ON class very learnable.
The results on THYME+ BEFORE, BEGINS-ON, ENDS-ON, and OVERLAP also improved compared to their respective THYME results. We attribute this to the improved performance of CONTAINS and NOTED-ON links: as their definitions are tightened, so is the separation between the classes.
An error analysis of the CON-SUB relations showed that the main error consisted of missed links that relied on the semantic-script concept discussed in section 4.2. For example, the system often failed to capture the CON-SUB relation between cancer and adenocarcinoma. These are often long-distance relations: of all gold CON-SUB relations, 67.63% are beyond our 60-token window limit. Even when we focused on the within-window CON-SUB relations (32.37% of the total), the performance was still low (0.441 P, 0.112 R, 0.178 F1), which showed that our models had not captured the peculiarities of the CON-SUB class. The fact that the majority of instances of the CON-SUB class are long-distance is quite different from the other TLINKs and makes them hard for transitive closure to act upon, suggesting they might need a different approach than the other TLINKs.
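As a quick arithmetic check on the within-window CON-SUB scores above, F1 is the harmonic mean of precision and recall. The sketch below assumes the standard definition; the function name `f1_score` is illustrative and is not taken from the Clinical TempEval script.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Within-window CON-SUB precision and recall as reported in the text.
within_window_f1 = f1_score(0.441, 0.112)
print(round(within_window_f1, 4))
```

Recomputing from the rounded P and R gives roughly 0.179, which agrees with the reported 0.178 up to rounding of the inputs. Because F1 is a harmonic mean, the low recall dominates the score despite the moderate precision.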
However, one error category that could be resolved by transitive closure consists of links the system marked that are not present locally in the gold, but are correct by inference via the coreference relations.  In this example, one or both terms are linked as IDENT to earlier mentions in the note. To save time and visual clutter, annotators only linked bridging relations like CON-SUB to first mentions, under the assumption that redundant information could be retrieved from the IDENT chains. Therefore, there is no error here if the IDENT chains are taken into account.
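Taking the IDENT chains into account during scoring can be sketched by mapping every mention to the first mention of its chain before comparing links. The mention identifiers and the `normalize` helper below are hypothetical, and this is not the modified evaluation script used for the reported results.

```python
def normalize(link, first_mention):
    """Map both arguments of a (source, relation, target) link to the
    first mention of their IDENT chain; unchained mentions map to themselves."""
    source, relation, target = link
    return (first_mention.get(source, source),
            relation,
            first_mention.get(target, target))


# Hypothetical chains: later mentions point to the chain's first mention.
first_mention = {
    "cancer_2": "cancer_1",
    "adenocarcinoma_2": "adenocarcinoma_1",
}

# Gold links are attached to first mentions only, as described above.
gold = {("cancer_1", "CON-SUB", "adenocarcinoma_1")}
# A system link between two later mentions of the same events.
predicted = ("cancer_2", "CON-SUB", "adenocarcinoma_2")

exact_match = predicted in gold
gold_normalized = {normalize(g, first_mention) for g in gold}
match_via_ident = normalize(predicted, first_mention) in gold_normalized

print(exact_match)
print(match_via_ident)
```

Under this normalization the predicted link counts as correct even though it is absent from the local gold annotations.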
In short, CON-SUB relations capture both temporality and mereological relations. Thus, representations and methods that combine temporality and coreference are suitable avenues to explore.
Except for CONTAINS, the other types of TLINKs have relatively low numbers of instances. The creation of instances for the low-frequency temporal relation types is limited by two main factors: (1) availability of data due to privacy constraints on EMR clinical narratives, and (2) the time, effort and budget required for such an activity. We have shown that with enough training instances (see CONTAINS), temporal relations are learnable at improved rates with the latest state-of-the-art methods. Although NOTED-ON has a similarly low number of instances as OVERLAP and BEFORE, it is highly learnable (0.744 F1), which we attribute to its semantic-script characteristics as discussed in section 4.2. This suggests that there are several paths to explore, among which are: (1) re-defining and refining the other types of relations, and (2) devising methods for relations with a low number of instances.

Conclusion
In this study, we presented our refinements for temporal relation annotations of the THYME corpus, resulting in the THYME+ corpus. The main modifications are in re-defining CONTAINS and OVERLAP relations into CONTAINS, CONTAINS-SUBEVENT, OVERLAP and NOTED-ON. This strategy is theoretically based and led to better learnability with the latest transformer methods. Our results establish baselines for future methods: CONTAINS 0.750 F1, OVERLAP 0.438 F1, CON-SUB 0.072 F1 and NOTED-ON 0.744 F1.