SemEval-2017 Task 12: Clinical TempEval

Clinical TempEval 2017 aimed to answer the question: how well do systems trained on annotated timelines for one medical condition (colon cancer) perform in predicting timelines on another medical condition (brain cancer)? Nine sub-tasks were included, covering problems in time expression identification, event expression identification and temporal relation identification. Participant systems were evaluated on clinical and pathology notes from Mayo Clinic cancer patients, annotated with an extension of TimeML for the clinical domain. 11 teams participated in the tasks, with the best systems achieving F1 scores above 0.55 for time expressions, above 0.70 for event expressions, and above 0.30 for temporal relations. Most tasks observed about a 20 point drop over Clinical TempEval 2016, where systems were trained and evaluated on the same domain (colon cancer).


Introduction
The TempEval shared tasks have, since 2007, provided a focus for research on temporal information extraction (Verhagen et al., 2007, 2010; UzZaman et al., 2013). In recent years the community has moved toward testing such information extraction systems on clinical data, to address a common need of doctors and clinical researchers to search over timelines of clinical events like symptoms, diseases, and procedures. In the Clinical TempEval shared tasks (Bethard et al., 2015, 2016), participant systems have competed to identify critical components of the timeline of a clinical text: time expressions, event expressions, and temporal relations. For example, Figure 1 shows the annotations that a system is expected to produce when given the text: April 23, 2014: The patient did not have any postoperative bleeding so we'll resume chemotherapy with a larger bolus on Friday even if there is slight nausea.
Clinical TempEval 2017 introduced a new aspect to this problem: domain adaptation. Whereas in Clinical TempEval 2015 and 2016, systems were both trained and tested on notes from colon cancer patients, in 2017, systems were trained on colon cancer patients, but tested on brain cancer patients. The diseases, symptoms, procedures, etc. vary widely across these two patient populations, and the doctors treating these different kinds of cancer make a variety of different linguistic choices when discussing such patients. As a result, systems that participated in Clinical TempEval 2017 were faced with a much more challenging task than systems from 2015 or 2016.

Data
The Clinical TempEval corpus was based on a set of clinical notes and pathology reports from 200 colon cancer patients and 200 brain cancer patients at the Mayo Clinic. These notes were manually de-identified by the Mayo Clinic to replace names, locations, etc. with generic placeholders, but time expressions were not altered. The notes were then manually annotated by the THYME project (thyme.healthnlp.org) using an extension of ISO-TimeML for the annotation of times, events and temporal relations in clinical notes (Styler, IV et al., 2014b). This extension includes additions such as new time expression types (e.g., PREPOSTEXP for expressions like postoperative), new EVENT attributes (e.g., DEGREE=LITTLE for expressions like slight nausea), and an increased focus on temporal relations of type CONTAINS (a.k.a. INCLUDES).
The annotation procedure was as follows: More details on the corpus annotation process are documented in Styler, IV et al. (2014a).
Because the data contained incompletely de-identified clinical data (the time expressions were retained), participants were required to sign a data use agreement with the Mayo Clinic to obtain the raw text of the clinical notes and pathology reports. The event, time and temporal relation annotations were distributed separately from the text, in an open source repository using the Anafora standoff format (Chen and Styler, 2013).
Each corpus (colon cancer and brain cancer) was split into three portions: Train (50%), Dev (25%) and Test (25%). Patients were sorted by patient number (an integer arbitrarily assigned by the de-identification process) and stratified across these splits. Table 1 shows the number of documents, event expressions (EVENT annotations), time expressions (TIMEX3 annotations) and narrative container relations (TLINK annotations with TYPE=CONTAINS attributes) in the Train, Dev, and Test portions of each corpus.
The raw text of both the colon cancer and brain cancer corpora was already released as part of Clinical TempEval 2015 and 2016, as were the time, event, and temporal relation annotations for the colon cancer corpus. However, none of the annotations for the brain cancer corpus were previously released.
Clinical TempEval 2017 ran several phases of evaluation, where different data were released for training and testing sets. In each phase, systems were tested on the annotations of the brain cancer Test set. Systems were again free to use all the raw brain cancer text if they had a way to do so.
Note that across all phases, the only brain cancer data released was the Train-10 set. The remainder of the brain cancer data was reserved for future evaluations.

Tasks
Nine tasks were included (the same as those of Clinical TempEval 2015 and 2016), grouped into three categories:
• Identifying time expressions (TIMEX3 annotations in the THYME corpus) consisting of the following components:
  - The span (character offsets) of the expression in the text
  - Class: DATE, TIME, DURATION, QUANTIFIER, PREPOSTEXP, or SET
• Identifying event expressions (EVENT annotations in the THYME corpus) consisting of the following components:
  - The span (character offsets) of the expression in the text
  - Contextual Modality: ACTUAL, HYPOTHETICAL, HEDGED, or GENERIC
  - Degree: MOST, LITTLE, or N/A
  - Polarity: POS or NEG
  - Type: ASPECTUAL, EVIDENTIAL, or N/A
• Identifying temporal relations between events and times, focusing on the following types:
  - Relations between events and the document creation time (BEFORE, OVERLAP, BEFORE-OVERLAP, or AFTER), represented by DOCTIMEREL annotations.
  - Narrative container relations (Pustejovsky and Stubbs, 2011), which indicate that an event or time is temporally contained in (i.e., occurred during) another event or time, represented by TLINK annotations with TYPE=CONTAINS.
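The annotation types listed above can be pictured as small records. The following is a minimal sketch, not the THYME or Anafora schema: the class and field names (Timex3, Event, Contains, timex_class, and so on) are hypothetical, and only the attribute value sets are taken from the task list. The example sentence is a shortened form of the one in the Introduction, and the PREPOSTEXP label for postoperative follows the example given in the Data section.

```python
from dataclasses import dataclass
from typing import Tuple

Span = Tuple[int, int]  # (begin, end) character offsets into the note text

# Value sets copied from the task list; the Python names are illustrative only.
TIMEX3_CLASSES = {"DATE", "TIME", "DURATION", "QUANTIFIER", "PREPOSTEXP", "SET"}
MODALITIES = {"ACTUAL", "HYPOTHETICAL", "HEDGED", "GENERIC"}
DEGREES = {"MOST", "LITTLE", "N/A"}
POLARITIES = {"POS", "NEG"}
EVENT_TYPES = {"ASPECTUAL", "EVIDENTIAL", "N/A"}
DOCTIMERELS = {"BEFORE", "OVERLAP", "BEFORE-OVERLAP", "AFTER"}


def _check(value, allowed, name):
    """Reject attribute values that are not in the task definition."""
    if value not in allowed:
        raise ValueError(f"{name}={value!r} is not one of {sorted(allowed)}")


@dataclass(frozen=True)
class Timex3:
    span: Span
    timex_class: str

    def __post_init__(self):
        _check(self.timex_class, TIMEX3_CLASSES, "Class")


@dataclass(frozen=True)
class Event:
    span: Span
    modality: str
    degree: str
    polarity: str
    event_type: str
    doctimerel: str

    def __post_init__(self):
        _check(self.modality, MODALITIES, "Contextual Modality")
        _check(self.degree, DEGREES, "Degree")
        _check(self.polarity, POLARITIES, "Polarity")
        _check(self.event_type, EVENT_TYPES, "Type")
        _check(self.doctimerel, DOCTIMERELS, "DocTimeRel")


@dataclass(frozen=True)
class Contains:
    container: Span  # the event or time acting as the narrative container
    contained: Span  # the event or time that occurred during it


text = "The patient did not have any postoperative bleeding"
begin = text.index("postoperative")
timex = Timex3((begin, begin + len("postoperative")), "PREPOSTEXP")
```

Keeping spans as plain (begin, end) tuples makes each annotation directly comparable under the exact-offset matching used in the evaluation.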

Evaluation Metrics
All of the tasks were evaluated using the standard metrics of precision (P), recall (R) and F1:

$$P = \frac{|S \cap H|}{|S|} \qquad R = \frac{|S \cap H|}{|H|} \qquad F_1 = \frac{2 \cdot P \cdot R}{P + R}$$

where S is the set of items predicted by the system and H is the set of items annotated by the humans. Applying these metrics only requires a definition of what is considered an "item" for each task.
• For evaluating the spans of event expressions or time expressions, items were tuples of (begin, end) character offsets. Thus, systems only received credit for identifying events and times with exactly the same character offsets as the manually annotated ones.
• For evaluating the attributes of event expressions or time expressions (Class, Contextual Modality, Degree, Polarity and Type), items were tuples of (begin, end, value) where begin and end are character offsets and value is the value that was given to the relevant attribute. Thus, systems only received credit for an event (or time) attribute if they both found an event (or time) with the correct character offsets and then assigned the correct value for that attribute.
• For relations between events and the document creation time, items were tuples of (begin, end, value), just as if it were an event attribute. Thus, systems only received credit if they found a correct event and assigned the correct relation (BEFORE, OVERLAP, BEFORE-OVERLAP, or AFTER) between that event and the document creation time.
• For narrative container relations, items were tuples of ((begin1, end1), (begin2, end2)), where the begins and ends corresponded to the character offsets of the events or times participating in the relation. Thus, systems only received credit for a narrative container relation if they found both events/times and correctly assigned a CONTAINS relation between them.
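The item-based scoring described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the official scorer: the function name precision_recall_f1 is hypothetical, and returning 0.0 when a set is empty is our own convention. The offsets in the example are made up.

```python
def precision_recall_f1(system, human):
    """Score a set of system items S against a set of human items H.

    Items are hashable tuples, e.g. (begin, end) for spans or
    (begin, end, value) for attributes.
    """
    system, human = set(system), set(human)
    correct = len(system & human)  # |S intersect H|
    # Assumption: an empty prediction or gold set scores 0.0 rather than failing.
    p = correct / len(system) if system else 0.0
    r = correct / len(human) if human else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Span items: the third system span is off by one character, so it earns no credit.
human_spans = {(0, 14), (45, 58), (59, 67)}
system_spans = {(0, 14), (45, 58), (60, 67)}
p, r, f1 = precision_recall_f1(system_spans, human_spans)

# Attribute items: the span must match AND the value must match.
human_polarity = {(59, 67, "NEG")}
system_polarity = {(59, 67, "POS")}
attr_scores = precision_recall_f1(system_polarity, human_polarity)
```

Because attribute items include the offsets, an attribute score can never exceed the corresponding span score for the same system.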
For narrative container relations, the P and R definitions were modified to take into account temporal closure, where additional relations are deterministically inferred from other relations (e.g., A CONTAINS B and B CONTAINS C, so A CONTAINS C):

$$P = \frac{|S \cap \mathrm{closure}(H)|}{|S|} \qquad R = \frac{|\mathrm{closure}(S) \cap H|}{|H|}$$

These definitions follow prior work (UzZaman et al., 2013), following the intuition that precision should measure the fraction of system-predicted relations that can be verified from the human annotations (either the original human annotations or annotations inferred from those through closure), and that recall should measure the fraction of human-annotated relations that can be verified from the system output (either the original system predictions or predictions inferred from those through closure).
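The closure-based precision and recall described above can be sketched as follows. This is a simplified illustration, not the official evaluation code: the function names are hypothetical, relations are (container, contained) pairs, and the closure here applies only the transitivity of CONTAINS given in the example, whereas a full temporal closure may infer relations from other relation types as well.

```python
def contains_closure(relations):
    """Transitive closure of CONTAINS pairs: A>B and B>C imply A>C."""
    closure = set(relations)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def closure_precision_recall_f1(system, human):
    """Precision checks S against closure(H); recall checks H against closure(S)."""
    system, human = set(system), set(human)
    # Assumption: empty sets score 0.0 rather than raising an error.
    p = len(system & contains_closure(human)) / len(system) if system else 0.0
    r = len(contains_closure(system) & human) / len(human) if human else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


# Spans stand in for the events/times A, B and C (offsets are made up).
A, B, C = (0, 14), (45, 58), (59, 67)
human = {(A, B), (B, C)}   # annotators marked A CONTAINS B and B CONTAINS C
system = {(A, B), (A, C)}  # the system predicted A CONTAINS C directly
p, r, f1 = closure_precision_recall_f1(system, human)
```

In this example the system's A CONTAINS C relation is not in the human annotations but is verifiable through closure, so it counts toward precision; the human B CONTAINS C relation cannot be derived from the system output, so recall is penalized.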

Human Agreement
We also provide two types of human agreement on the tasks, measured with the same evaluation metrics as the systems:
ann-ann Inter-annotator agreement between the two independent human annotators who annotated each document. This is the most commonly reported type of agreement, and often considered to be an upper bound on system performance.
adj-ann Inter-annotator agreement between the adjudicator and the two independent annotators. This is usually a better bound on system performance in adjudicated corpora, since the models are trained on the adjudicated data, not on the individual annotator data.
Only F1 is reported in these scenarios since precision and recall depend on the arbitrary choice of one annotator as human (H) and the other as system (S).
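To see why F1 does not depend on this choice, one can substitute the standard set-based definitions of precision, |S ∩ H| / |S|, and recall, |S ∩ H| / |H|, into the harmonic mean. The short derivation below uses only the S and H introduced under Evaluation Metrics:

```latex
F_1 = \frac{2PR}{P + R}
    = \frac{2 \cdot \frac{|S \cap H|}{|S|} \cdot \frac{|S \cap H|}{|H|}}
           {\frac{|S \cap H|}{|S|} + \frac{|S \cap H|}{|H|}}
    = \frac{2\,|S \cap H|}{|S| + |H|}
```

Swapping the roles of the two annotators exchanges P and R but leaves this expression unchanged. The same holds for the closure-based variant, since F1 is symmetric in P and R and the swap simply exchanges the two.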

Baseline Systems
Two rule-based systems were used as baselines to compare the participating systems against.
memorize For all tasks but the narrative container task, a memorization baseline was used.
To train the model, all phrases annotated as either events or times in the training data were collected. All exact character matches for these phrases in the training data were then examined, and only phrases that were annotated as events or times more than 50% of the time were retained.
For each phrase, the most frequently annotated type (event or time) and attribute values for instances of that phrase were determined.
To predict with the model, the raw text of the test data was searched for all exact character matches of any of the memorized phrases, preferring longer phrases when multiple matches overlapped. Wherever a phrase match was found, an event or time with the memorized (most frequent) attribute values was predicted.
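The memorization baseline described above can be sketched as follows. This is our own simplified reading, not the released baseline code: the function names are hypothetical, each annotation carries a single hashable label standing in for the type and attribute values together, and occurrences are counted with non-overlapping exact string matches.

```python
from collections import Counter, defaultdict


def train_memorize(docs):
    """docs: list of (text, annotations); annotations: list of (begin, end, label)."""
    label_counts = defaultdict(Counter)
    for text, annotations in docs:
        for begin, end, label in annotations:
            if end > begin:  # ignore empty spans
                label_counts[text[begin:end]][label] += 1
    model = {}
    for phrase, counts in label_counts.items():
        # All exact character matches of the phrase in the training texts.
        occurrences = sum(text.count(phrase) for text, _ in docs)
        # Keep the phrase only if it was annotated more than 50% of the time.
        if sum(counts.values()) / occurrences > 0.5:
            model[phrase] = counts.most_common(1)[0][0]  # most frequent label
    return model


def predict_memorize(model, text):
    """Return (begin, end, label) predictions, preferring longer matches on overlap."""
    matches = []
    for phrase, label in model.items():
        start = text.find(phrase)
        while start != -1:
            matches.append((start, start + len(phrase), label))
            start = text.find(phrase, start + 1)
    # Longest matches first, then left to right.
    matches.sort(key=lambda m: (-(m[1] - m[0]), m[0]))
    chosen = []
    for begin, end, label in matches:
        if all(end <= b or begin >= e for b, e, _ in chosen):
            chosen.append((begin, end, label))
    return sorted(chosen)


train_docs = [
    ("bleeding noted; bleeding resolved", [(0, 8, "EVENT"), (16, 24, "EVENT")]),
    ("noted here, noted there, noted again", [(0, 5, "EVENT")]),
]
model = train_memorize(train_docs)
predictions = predict_memorize(model, "postoperative bleeding")
```

Here "noted" is annotated in only one of its four training occurrences and is dropped, while "bleeding" is annotated every time and is kept. A baseline of this kind cannot predict any phrase absent from the training data, which is one reason a change of domain hurts it.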
closest For the narrative container task, a proximity baseline was used. Each time expression was predicted to be a narrative container, containing only the closest event expression to it in the text.
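A minimal sketch of the proximity baseline follows. The distance measure is an assumption on our part: the description does not say whether "closest" is measured in characters or tokens, so this sketch uses the number of characters between two spans and breaks ties by list order. The function names and the offsets in the example are hypothetical.

```python
def char_gap(a, b):
    """Number of characters between two (begin, end) spans; 0 if they overlap."""
    return max(a[0] - b[1], b[0] - a[1], 0)


def predict_closest(times, events):
    """Predict one CONTAINS relation per time: (time span, closest event span)."""
    relations = []
    for time in times:
        if not events:
            break  # nothing to contain
        nearest = min(events, key=lambda event: char_gap(time, event))
        relations.append((time, nearest))
    return relations


times = [(0, 14), (120, 126)]   # made-up offsets for two time expressions
events = [(59, 67), (88, 98)]   # made-up offsets for two event expressions
relations = predict_closest(times, events)
```

Each output pair has the same ((begin1, end1), (begin2, end2)) shape as the narrative container items used in the evaluation, with the time expression as the container.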

Participating Systems
11 teams submitted a total of 28 runs, 10 for the unsupervised domain adaptation phase, and 18 for the supervised domain adaptation phase.
ULISBOA (Lamurias et al., 2017) combined conditional random fields and rules with features including character n-grams, words, part-of-speech tags, and UMLS concept types.
XJNLP (Long et al., 2017) combined rules, support vector machines, and recurrent and convolutional neural networks, with features including words, word embeddings, and verb tense.
Several other teams (WuHanNLP, UNICA, UTD, and IIIT) also competed, but did not submit a system description.

Evaluation Results
Tables 2 to 4 show the results of the evaluation. In all tables, the best system score from each column is in bold. Systems marked with † were submitted after the competition deadline, and are thus not considered part of the official evaluation.

Time Expressions
Table 2 shows results on the time expression tasks.
The GUIR system had the top F1 in almost all time expression tasks across both unsupervised and supervised domain adaptation phases, achieving F1s between 0.51 and 0.59. Compared to human agreement, the best systems were more than 0.20 lower than the inter-annotator agreement (and further, of course, from the annotator-adjudicator agreement).
In Clinical TempEval 2016, for comparison, when models were both trained and tested on colon cancer notes, the top system achieved 0.80 F1 for time spans, and 0.77 F1 for time types. This suggests that a time expression system trained on one clinical condition (e.g., colon cancer) can expect a 20+ point drop when tested on another clinical condition (e.g., brain cancer). Providing 30 annotated notes in the target domain narrowed that gap by only a few points.
The drop in performance can probably be partly attributed to differences in time expressions across the two corpora. For example, post-op is 26.5 times more common (212 occurrences in brain cancer data vs. 8 occurrences in colon cancer data), overnight is 13 times more common (148 in brain vs. 11 in colon), and intraoperative is 2.3 times more common (156 in brain vs. 68 in colon). Formatting is also different across the corpora. For example, POST-OP (all capitals) occurs 161 times in all the brain cancer data, but never occurs with this capitalization in any of the colon cancer data.

Table 3: System performance and annotator agreement on EVENT tasks: identifying the event expression's span (character offsets), contextual modality (ACTUAL, HYPOTHETICAL, HEDGED or GENERIC), degree (MOST, LITTLE or N/A), polarity (POS or NEG) and type (ASPECTUAL, EVIDENTIAL or N/A).

Event Expressions
Table 3 shows results on the event expression tasks. The LIMSI-COT system achieved the best F1 on all event expression tasks for both the unsupervised and supervised domain adaptation phases, achieving around 0.70 F1 for most subtasks in the unsupervised setting, and around 0.75 F1 in the supervised setting. Compared to human agreement, the LIMSI-COT system ranged between 0.06 and 0.09 below the inter-annotator agreement.
In Clinical TempEval 2016, for comparison, the top system achieved F1s of 0.92, 0.87, 0.91, 0.90, and 0.89 for event spans, modality, degree, polarity, and type, respectively. This suggests that, much like for time expressions, an event expression system trained on one clinical condition (e.g., colon cancer) can expect a 20+ point drop when tested on another clinical condition (e.g., brain cancer). Providing 30 annotated notes in the target domain again narrowed the gap by only a few points.
The drop in performance can again probably be attributed to differences across the two corpora. Even more so than time expressions, event expressions for brain cancer are very different from event expressions for colon cancer. For example, craniotomy, glioma, glioblastoma, oligoastrocytoma, aphasia, and temozolomide all occur as events more than 150 times in the brain cancer data, but do not occur as events even once in the colon cancer data.

Temporal Relations
Table 4 shows performance on the temporal relation tasks. The LIMSI-COT system had the top F1 in almost all of the temporal relation tasks in both the unsupervised and supervised domain adaptation settings, achieving above 0.50 F1 in linking events to the document creation time, and above 0.30 F1 for linking events to their narrative containers. Compared to humans, the LIMSI-COT system was more than 0.30 below inter-annotator agreement for narrative container relations, but above inter-annotator agreement (though still below annotator-adjudicator agreement) on document time relations when using the additional target domain (brain cancer) training data.
In Clinical TempEval 2016, for comparison, the top system achieved F1s of 0.76 for document time relations, and 0.48 for narrative containers. Again we see a major drop when training on one condition (e.g., colon cancer) and testing on another (e.g., brain cancer): a 20+ point drop for document time relations, and around a 15 point drop for narrative containers.

Discussion
Clinical TempEval 2017 showed that developing clinical timeline extraction tools that generalize across domains is still a challenging problem. Almost across the board, we saw 20+ point drops in performance when systems were trained on one domain (colon cancer) and tested on another (brain cancer), as compared to systems that were trained and tested on a single domain (colon cancer, as in Clinical TempEval 2016). And across the board, providing a small amount of target domain (brain cancer) training data narrowed that gap only by a couple of points. This is an important finding because it stresses how much work remains to build robust clinical information extraction tools that are useful across a wide range of medical applications.
Though the focus in Clinical TempEval 2017 was on domain adaptation, only a small number of fairly simple domain adaptation techniques were applied by participants, probably because producing even an initial system for all the Clinical TempEval sub-tasks is already a significant effort. Two participants (LIMSI-COT and KULeuven-LIIR, two of the top ranking systems) included special handling of unknown words to try to increase generalization power. Other approaches attempted by participants included giving a heavier weight to the target domain (brain cancer) training data, and using pre-trained domain independent word embeddings. A wide variety of more sophisticated domain adaptation techniques exist that were not applied by participants, and we expect that some of these will make future progress in reducing the cross-domain performance degradation that was observed in Clinical TempEval 2017.

Table 1: Number of documents, event expressions, time expressions and narrative container relations in Train, Dev, and Test portions of the THYME data. All colon cancer data was released as part of Clinical TempEval 2015 and 2016. The Train-10 column is the data from the first 10 patients of the brain cancer Train data, which was the only additional training data released in Clinical TempEval 2017.

Table 4: System performance and annotator agreement on temporal relation tasks: identifying relations between events and the document creation time (DOCTIMEREL), and identifying narrative container relations (CONTAINS).