SemEval-2016 Task 12: Clinical TempEval

Clinical TempEval 2016 evaluated temporal information extraction systems on the clinical domain. Nine sub-tasks were included, covering problems in time expression identification, event expression identification and temporal relation identification. Participant systems were trained and evaluated on a corpus of clinical and pathology notes from the Mayo Clinic, annotated with an extension of TimeML for the clinical domain. 14 teams submitted a total of 40 system runs, with the best systems achieving near-human performance on identifying events and times. On identifying temporal relations, there was a gap between the best systems and human performance, but the gap was less than half the gap of Clinical TempEval 2015.


Introduction
The TempEval shared tasks have, since 2007, provided a focus for research on temporal information extraction (Verhagen et al., 2007; Verhagen et al., 2010; UzZaman et al., 2013). Participant systems compete to identify critical components of the timeline of a text, including time expressions, event expressions and temporal relations. However, the TempEval campaigns to date have focused primarily on in-document timelines derived from news articles. In recent years, the community has moved toward testing such information extraction systems on clinical data (Sun et al., 2013) to broaden our understanding of the language of time beyond newswire expressions and structure.
Clinical TempEval focuses on discrete, well-defined tasks which allow rapid, reliable and repeatable evaluation. Participating systems are expected to take as input raw text, for example:
April 23, 2014: The patient did not have any postoperative bleeding so we'll resume chemotherapy with a larger bolus on Friday even if there is slight nausea.
The systems are then expected to output annotations over the text, for example, those shown in Figure 1. That is, the systems should identify the time expressions, event expressions, attributes of those expressions, and temporal relations between them.
Clinical TempEval 2016 addressed one of the major challenges in Clinical TempEval 2015: data distribution. Because Clinical TempEval is based on real patient notes from the Mayo Clinic, participants go through a lengthy authorization process involving a data use agreement and an interview. For Clinical TempEval 2016, we streamlined this process and were able to authorize data access for more than twice as many participants as Clinical TempEval 2015. And since all the training and evaluation data distributed for Clinical TempEval 2015 was used as the training data for Clinical TempEval 2016, participants had more than a year to work on their systems. The result was that more than four times as many teams participated.

Data
The Clinical TempEval corpus was based on a set of 600 clinical notes and pathology reports from cancer patients at the Mayo Clinic. These notes were manually de-identified by the Mayo Clinic to replace names, locations, etc. with generic placeholders, but time expressions were not altered. The notes were then manually annotated by the THYME project (thyme.healthnlp.org) using an extension of ISO-TimeML for the annotation of times, events and temporal relations in clinical notes (Styler et al., 2014b). This extension includes additions such as new time expression types (e.g., PREPOSTEXP for expressions like postoperative), new EVENT attributes (e.g., DEGREE=LITTLE for expressions like slight nausea), and an increased focus on temporal relations of type CONTAINS (a.k.a. INCLUDES).
The annotation procedure was as follows:
1. Annotators identified time and event expressions, along with their attributes
2. Adjudicators revised and finalized the time and event expressions and their attributes
3. Annotators identified temporal relations between pairs of events and between events and times
4. Adjudicators revised and finalized the temporal relations
More details on the corpus annotation process are documented in a separate article (Styler et al., 2014a).
Because the data contained incompletely de-identified clinical data (the time expressions were retained), participants were required to sign a data use agreement with the Mayo Clinic to obtain the raw text of the clinical notes and pathology reports. The event, time and temporal relation annotations were distributed separately from the text, in an open source repository using the Anafora standoff format (Chen and Styler, 2013).
The corpus was split into three portions: Train (50%), Dev (25%) and Test (25%). Patients were sorted by patient number (an integer arbitrarily assigned by the de-identification process) and stratified across these splits. The Train and Dev portions were released to participants for training and tuning their systems. The Test portion was reserved for evaluation of the systems. Table 1 shows the number of documents, event expressions (EVENT annotations), time expressions (TIMEX3 annotations) and narrative container relations (TLINK annotations with TYPE=CONTAINS attributes) in the Train, Dev, and Test portions of the corpus.

Tasks
Nine tasks were included (the same as those of Clinical TempEval 2015), grouped into three categories:
• Identifying time expressions (TIMEX3 annotations in the THYME corpus) consisting of the following components:
  - The span (character offsets) of the expression in the text
  - Class: DATE, TIME, DURATION, QUANTIFIER, PREPOSTEXP or SET
• Identifying event expressions (EVENT annotations in the THYME corpus) consisting of the following components:
  - The span (character offsets) of the expression in the text
  - Contextual Modality: ACTUAL, HYPOTHETICAL, HEDGED or GENERIC
  - Degree: MOST, LITTLE or N/A
  - Polarity: POS or NEG
  - Type: ASPECTUAL, EVIDENTIAL or N/A
• Identifying temporal relations between events and times, focusing on the following types:
  - Relations between events and the document creation time (BEFORE, OVERLAP, BEFORE-OVERLAP or AFTER), represented by DOCTIMEREL annotations.
  - Narrative container relations (Pustejovsky and Stubbs, 2011), which indicate that an event or time is temporally contained in (i.e., occurred during) another event or time, represented by TLINK annotations with TYPE=CONTAINS.
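To make the shape of these annotations concrete, the following is a minimal Python sketch of how the three annotation types could be represented in memory. The class and field names (`Timex3`, `Event`, `Contains`, `span_of`) are hypothetical and are not the Anafora format used for distribution. Only the allowed attribute values, the PREPOSTEXP class of "postoperative" and the LITTLE degree of "slight nausea" come from the task description; the remaining attribute values assigned to "nausea" are illustrative assumptions, not gold annotations.

```python
from dataclasses import dataclass
from typing import Tuple

Span = Tuple[int, int]  # (begin, end) character offsets into the raw text

# Allowed attribute values, as listed in the task description.
TIME_CLASSES = {"DATE", "TIME", "DURATION", "QUANTIFIER", "PREPOSTEXP", "SET"}
MODALITIES = {"ACTUAL", "HYPOTHETICAL", "HEDGED", "GENERIC"}
DEGREES = {"MOST", "LITTLE", "N/A"}
POLARITIES = {"POS", "NEG"}
EVENT_TYPES = {"ASPECTUAL", "EVIDENTIAL", "N/A"}
DOCTIMERELS = {"BEFORE", "OVERLAP", "BEFORE-OVERLAP", "AFTER"}


def _check(value: str, allowed: set, name: str) -> None:
    """Reject attribute values outside the task's inventory."""
    if value not in allowed:
        raise ValueError(f"{name} must be one of {sorted(allowed)}, got {value!r}")


@dataclass(frozen=True)
class Timex3:
    span: Span
    time_class: str

    def __post_init__(self) -> None:
        _check(self.time_class, TIME_CLASSES, "time_class")


@dataclass(frozen=True)
class Event:
    span: Span
    modality: str
    degree: str
    polarity: str
    event_type: str
    doctimerel: str  # relation to the document creation time

    def __post_init__(self) -> None:
        _check(self.modality, MODALITIES, "modality")
        _check(self.degree, DEGREES, "degree")
        _check(self.polarity, POLARITIES, "polarity")
        _check(self.event_type, EVENT_TYPES, "event_type")
        _check(self.doctimerel, DOCTIMERELS, "doctimerel")


@dataclass(frozen=True)
class Contains:
    container: Span  # the event or time that contains ...
    contained: Span  # ... this event or time


text = ("April 23, 2014: The patient did not have any postoperative bleeding "
        "so we'll resume chemotherapy with a larger bolus on Friday "
        "even if there is slight nausea.")


def span_of(phrase: str) -> Span:
    """Hypothetical helper: offsets of the first occurrence of phrase in text."""
    begin = text.index(phrase)
    return (begin, begin + len(phrase))


postoperative = Timex3(span_of("postoperative"), "PREPOSTEXP")
# Only degree="LITTLE" is taken from the paper; the other values are assumed.
nausea = Event(span_of("nausea"), modality="HYPOTHETICAL", degree="LITTLE",
               polarity="POS", event_type="N/A", doctimerel="AFTER")
```

Keeping spans as plain `(begin, end)` tuples means the same objects can be compared directly as evaluation items.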
The evaluation was run in two phases:
1. Systems were provided access only to the raw text, and were asked to identify time expressions, event expressions and temporal relations
2. Systems were provided access to the raw text and the manual event and time annotations, and were asked to identify only temporal relations

Evaluation Metrics
All of the tasks were evaluated using the standard metrics of precision (P), recall (R) and F1:

$$P = \frac{|S \cap H|}{|S|} \qquad R = \frac{|S \cap H|}{|H|} \qquad F_1 = \frac{2 \cdot P \cdot R}{P + R}$$

where S is the set of items predicted by the system and H is the set of items annotated by the humans. Applying these metrics only requires a definition of what is considered an "item" for each task.
• For evaluating the spans of event expressions or time expressions, items were tuples of (begin, end) character offsets. Thus, systems only received credit for identifying events and times with exactly the same character offsets as the manually annotated ones.
• For evaluating the attributes of event expressions or time expressions (Class, Contextual Modality, Degree, Polarity and Type), items were tuples of (begin, end, value) where begin and end are character offsets and value is the value that was given to the relevant attribute. Thus, systems only received credit for an event (or time) attribute if they both found an event (or time) with the correct character offsets and then assigned the correct value for that attribute.
• For relations between events and the document creation time, items were tuples of (begin, end, value), just as if it were an event attribute. Thus, systems only received credit if they found a correct event and assigned the correct relation (BEFORE, OVERLAP, BEFORE-OVERLAP or AFTER) between that event and the document creation time. In the second phase of the evaluation, when manual event annotations were provided as input, only recall (which in this case is equivalent to standard classification accuracy) is reported.
• For narrative container relations, items were tuples of ((begin_1, end_1), (begin_2, end_2)), where the begins and ends corresponded to the character offsets of the events or times participating in the relation. Thus, systems only received credit for a narrative container relation if they found both events/times and correctly assigned a CONTAINS relation between them.
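The item definitions above reduce every task to set comparison. The sketch below is an illustration rather than the official scorer: the function name `precision_recall_f1`, the example offsets, and the convention of returning 0.0 when a denominator is empty are all assumptions.

```python
from typing import Hashable, Iterable, Tuple


def precision_recall_f1(system: Iterable[Hashable],
                        human: Iterable[Hashable]) -> Tuple[float, float, float]:
    """Score system items S against human items H by exact set overlap."""
    s, h = set(system), set(human)
    correct = len(s & h)
    # Assumption: an empty S or H yields 0.0 rather than an error.
    p = correct / len(s) if s else 0.0
    r = correct / len(h) if h else 0.0
    f1 = 2 * p * r / (p + r) if p + r > 0 else 0.0
    return p, r, f1


# Span task: items are (begin, end) tuples. Offsets here are made up.
human_spans = {(0, 14), (42, 50), (65, 77)}
system_spans = {(0, 14), (42, 50), (60, 77), (90, 95)}  # (60, 77) is off by 5
span_scores = precision_recall_f1(system_spans, human_spans)

# Attribute task: items are (begin, end, value) tuples, so a correct value on
# a wrong span, or a wrong value on a correct span, earns no credit.
human_polarity = {(0, 14, "POS"), (42, 50, "NEG"), (65, 77, "POS")}
system_polarity = {(0, 14, "POS"), (42, 50, "POS"),
                   (60, 77, "POS"), (90, 95, "POS")}
polarity_scores = precision_recall_f1(system_polarity, human_polarity)
```

In this made-up example the system finds two of the three human spans exactly, but only one of them also carries the right polarity, so the attribute scores are necessarily no higher than the span scores.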
For event and time attributes, we also measure how accurately a system predicts the attribute values on just those events or times that the system predicted. The goal here is to allow a comparison across systems for assigning attribute values, even when different systems produce different numbers of events and times. This metric is calculated by dividing the F1 on the attribute by the F1 on identifying the spans:

$$A = \frac{\text{attribute } F_1}{\text{span } F_1}$$

For narrative container relations, the P and R definitions were modified to take into account temporal closure, where additional relations are deterministically inferred from other relations (e.g., A CONTAINS B and B CONTAINS C, so A CONTAINS C):

$$P = \frac{|S \cap \mathrm{closure}(H)|}{|S|} \qquad R = \frac{|\mathrm{closure}(S) \cap H|}{|H|}$$

Similar measures were used in prior work (UzZaman and Allen, 2011) and TempEval 2013 (UzZaman et al., 2013), following the intuition that precision should measure the fraction of system-predicted relations that can be verified from the human annotations (either the original human annotations or annotations inferred from those through closure), and that recall should measure the fraction of human-annotated relations that can be verified from the system output (either the original system predictions or predictions inferred from those through closure).
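The closure-based precision and recall and the attribute accuracy described above can be sketched as follows. This is a simplified illustration, not the official scorer: it applies only the transitivity rule for CONTAINS given in the text, and the function names, example offsets and zero-denominator handling are assumptions.

```python
from typing import Set, Tuple

Span = Tuple[int, int]
Relation = Tuple[Span, Span]  # (container, contained)


def closure(relations: Set[Relation]) -> Set[Relation]:
    """Add every relation implied by transitivity of CONTAINS."""
    closed = set(relations)
    changed = True
    while changed:
        changed = False
        for a, b in list(closed):
            for c, d in list(closed):
                # A CONTAINS B and B CONTAINS D imply A CONTAINS D.
                # Assumption: self-relations arising from cycles are skipped.
                if b == c and a != d and (a, d) not in closed:
                    closed.add((a, d))
                    changed = True
    return closed


def closure_precision_recall(system: Set[Relation],
                             human: Set[Relation]) -> Tuple[float, float]:
    """P checks system relations against closure(H); R checks H against closure(S)."""
    p = len(system & closure(human)) / len(system) if system else 0.0
    r = len(closure(system) & human) / len(human) if human else 0.0
    return p, r


def attribute_accuracy(attribute_f1: float, span_f1: float) -> float:
    """A = (F1 on the attribute) / (F1 on the spans)."""
    return attribute_f1 / span_f1 if span_f1 > 0 else 0.0


A, B, C = (0, 5), (10, 15), (20, 25)  # made-up spans

# The system omits A CONTAINS C, but closure recovers it, so recall is full.
full = closure_precision_recall({(A, B), (B, C)}, {(A, B), (B, C), (A, C)})

# The system predicts only the inferred relation: verifiable, but it recovers
# none of the human-annotated relations.
inferred_only = closure_precision_recall({(A, C)}, {(A, B), (B, C)})

# Using the F1 values reported for UTHealth on times (0.772 with class, 0.795 spans).
uthealth_class_accuracy = round(attribute_accuracy(0.772, 0.795), 3)
```

The last line reproduces the 0.971 class accuracy reported for UTHealth from its two F1 scores.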

Baseline Systems
Two rule-based systems were used as baselines to compare the participating systems against.
memorize: For all tasks but the narrative container task, a memorization baseline was used. To train the model, all phrases annotated as either events or times in the training data were collected. All exact character matches for these phrases in the training data were then examined, and only phrases that were annotated as events or times greater than 50% of the time were retained. For each phrase, the most frequently annotated type (event or time) and attribute values for instances of that phrase were determined.
To predict with the model, the raw text of the test data was searched for all exact character matches of any of the memorized phrases, preferring longer phrases when multiple matches overlapped. Wherever a phrase match was found, an event or time with the memorized (most frequent) attribute values was predicted.
closest: For the narrative container task, a proximity baseline was used. Each time expression was predicted to be a narrative container, containing only the closest event expression to it in the text.
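A minimal sketch of the two baselines follows, assuming simplified inputs. It is not the organizers' implementation: each annotation carries a single hashable label standing in for the type and attribute values (the paper picks the most frequent value per attribute), overlap is resolved greedily from longest to shortest match, and "closest" is measured as the character gap between spans. All function names and example texts are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Hashable, Iterator, List, Tuple

Span = Tuple[int, int]
Annotation = Tuple[int, int, Hashable]    # (begin, end, label)
Document = Tuple[str, List[Annotation]]   # (raw text, annotations)


def find_all(text: str, phrase: str) -> Iterator[int]:
    """Yield the start offset of every exact character match of phrase."""
    start = text.find(phrase)
    while start != -1:
        yield start
        start = text.find(phrase, start + 1)


def train_memorize(docs: List[Document]) -> Dict[str, Hashable]:
    """Keep phrases annotated in more than 50% of their training occurrences."""
    labels: Dict[str, Counter] = defaultdict(Counter)
    for text, annotations in docs:
        for begin, end, label in annotations:
            labels[text[begin:end]][label] += 1
    model = {}
    for phrase, counts in labels.items():
        occurrences = sum(1 for text, _ in docs for _ in find_all(text, phrase))
        if sum(counts.values()) / occurrences > 0.5:
            model[phrase] = counts.most_common(1)[0][0]  # most frequent label
    return model


def predict_memorize(model: Dict[str, Hashable], text: str) -> List[Annotation]:
    """Predict memorized labels at every match, preferring longer phrases."""
    candidates = [(start, start + len(phrase), label)
                  for phrase, label in model.items()
                  for start in find_all(text, phrase)]
    candidates.sort(key=lambda c: (-(c[1] - c[0]), c[0]))  # longest first
    chosen: List[Annotation] = []
    for begin, end, label in candidates:
        if all(end <= b or begin >= e for b, e, _ in chosen):  # no overlap
            chosen.append((begin, end, label))
    return sorted(chosen, key=lambda c: c[0])


def gap(a: Span, b: Span) -> int:
    """Characters between two spans (0 if they touch or overlap). Assumed metric."""
    return max(a[0] - b[1], b[0] - a[1], 0)


def predict_closest(times: List[Span], events: List[Span]) -> List[Tuple[Span, Span]]:
    """Each time expression CONTAINS only its closest event expression."""
    return [(t, min(events, key=lambda e: gap(t, e))) for t in times if events]


train_text = "nausea today. no nausea. scan today. will scan later, then scan."
train_docs = [(train_text, [(0, 6, "EVENT"), (17, 23, "EVENT"), (25, 29, "EVENT"),
                            (7, 12, "TIMEX3"), (30, 35, "TIMEX3")])]
model = train_memorize(train_docs)  # "scan" is annotated only 1 time in 3

predictions = predict_memorize(model, "nausea and scan today")
containers = predict_closest(times=[(16, 21)], events=[(0, 6), (11, 15)])
```

In this toy data "scan" is dropped from the model because it is annotated in only one of its three occurrences, which is the filtering step that keeps the baseline from predicting common words that are rarely events.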

Participating Systems
14 research teams submitted a total of 40 runs:
• brundlefly (Fries, 2016) submitted 1 run for phase 1 based on recurrent neural networks, word embeddings, and logistic regression, and 1 run for phase 2 based on the DeepDive framework (http://deepdive.stanford.edu).
• CDE-IIITH (Chikka, 2016) submitted 2 runs for each phase, the first based on deep learning models, and the second based on conditional random fields and support vector machines.
• Cental (Hansart et al., 2016) submitted 1 run for phase 1, based on conditional random fields and lexical resources.
• GUIR (Cohan et al., 2016) submitted 2 runs for phase 1 and 1 run for phase 2, based on conditional random fields and logistic regression with lexical, morphological, syntactic, dependency, and domain specific features, combined with pattern matching rules.
• HITACHI (Sarath P R et al., 2016) submitted 2 runs for the time portion of phase 1, based on ensembles of rule-based and machine learning systems with lexical, syntactic and morphological features. The second run included 50% more training data than the first.
• KULeuven-LIIR (Leeuwenberg and Moens, 2016) submitted 2 runs for phase 2, based on the cTAKES-temporal machine-learning model (Lin et al., 2015), with additional features.
• LIMSI (Grouin and Moriceau, 2016) submitted 2 runs for each phase, based on conditional random fields with lexical, morphological, and word cluster features, and the rule-based HeidelTime (Strötgen and Gertz, 2013).
• [...] 2 runs for each phase, based on conditional random fields with morpho-syntactic, lexical, UMLS, and DBpedia features. The first run was a two-step approach to temporal relations, the second, a one-step approach.

Human Agreement
We also provide two types of human agreement on the task, measured with the same evaluation metrics as the systems:
ann-ann: Inter-annotator agreement between the two independent human annotators who annotated each document. This is the most commonly reported type of agreement, and often considered to be an upper bound on system performance.
adj-ann: Inter-annotator agreement between the adjudicator and the two independent annotators. This is usually a better bound on system performance in adjudicated corpora, since the models are trained on the adjudicated data, not on the individual annotator data.
Precision and recall are not reported in these scenarios since they depend on the arbitrary choice of one annotator as human (H) and the other as system (S). Note that since temporal relations between events and the document creation time were annotated at the same time as the events themselves, agreement for this task is only reported in phase 1 of the evaluation. Similarly, since narrative container relations were only annotated after events and times had been adjudicated, agreement for this task is only reported in phase 2 of the evaluation.

Time Expressions
Table 2 shows results on the time expression tasks. The UTHealth systems achieved the best results on almost all time-related tasks. For finding times, while one system had comparable precision to UTHealth (0.836 UTHealth vs. 0.840 LIMSI), no system had competitive recall (0.757 UTHealth vs. 0.714 from the next best, UtahBMI), and thus the UTHealth system consistently outperformed the other systems in F1. The results were similar for jointly finding times and assigning them a time class, though a couple of systems (HITACHI, GUIR) did have more accurate predictions for the time class when scored only on the times that they were able to find (0.971 UTHealth vs. 0.975 HITACHI vs. 0.989 GUIR).
Compared to human agreement, the UTHealth and UtahBMI systems exceeded the inter-annotator agreement on times of 0.731, but even UTHealth's F1 of 0.795 did not reach the annotator-adjudicator agreement of 0.830, and the results were similar for jointly finding times and assigning their classes (0.772 vs. 0.807). Nonetheless, these 0.025 and 0.035 gaps between the top system and the human agreement are smaller than the 0.051 and 0.038 gaps observed in Clinical TempEval 2015.
Table 2: System performance and annotator agreement on time expression tasks: identifying the time expression's span (character offsets) and class (DATE, TIME, DURATION, QUANTIFIER, PREPOSTEXP or SET). The best system score from each column is in bold. Systems marked with † were submitted after the competition deadline and are not considered official.

Event Expressions
Table 3 shows results on the event expression tasks. Again, UTHealth dominated the field, achieving the highest score on almost every event-related task. However, the gap to the second place team was much smaller for events than it was for times: only a 0.011 gap between UTHealth's 0.903 F1 and UtahBMI's 0.892. The gap was even smaller if we look at precision and recall separately: a 0.007 gap between UTHealth's 0.915 precision and UTA's 0.908, and a 0.005 gap between UTHealth's 0.891 recall and UtahBMI's 0.886. The results were similar for most of the attributes, though the precision gaps were larger (1.1-1.4) and the recall gaps were smaller (0.3-0.7).
Compared to human agreement, UTHealth, UtahBMI, Cental, GUIR, and UTA all exceeded inter-annotator agreement on identifying events, and UTHealth and UtahBMI exceeded inter-annotator agreement on all of the attributes. None of the systems reached the level of annotator-adjudicator agreement: even UTHealth's F1 on events of 0.903 had a gap of 0.019 from the annotator-adjudicator agreement of 0.922, and the results were similar for event attributes: 0.049 for modality, 0.021 for degree, 0.029 for polarity, 0.024 for type. These gaps are almost all bigger than the gaps observed in Clinical TempEval 2015: 0.005 for event spans, 0.031 for modality, 0.007 for degree, 0.012 for polarity, 0.030 for type. However, Clinical TempEval 2016's human agreement was substantially higher, with all annotator-adjudicator agreement above 0.90, while in Clinical TempEval 2015, annotator-adjudicator agreement ranged from 0.853 to 0.880.

Temporal Relations
Table 4 shows performance on the temporal relation tasks. In both phase 1 (where systems were provided only the raw text) and phase 2 (where systems were provided the manually annotated events and times), the UTHealth system was again the top system for most tasks. For relating events to the document creation time, the UTHealth system had the best precision, recall, and F1 (0.766, 0.746, and 0.756) in phase 1, and the second best score (0.835 vs. UtahBMI's 0.843) in phase 2. For finding narrative container relations, the UTHealth system had the best recall (0.471 in phase 1, 0.559 in phase 2), and though other systems (UtahBMI, VUACLTL, LIMSI-COT, and KULeuven-LIIR) had higher precisions, the recall gap from UTHealth to the next system was large (0.203 in phase 1 and 0.088 in phase 2) and thus UTHealth had the best F1 in both phases (0.479 in phase 1, 0.573 in phase 2).
Compared to human agreement, UTHealth and UtahBMI exceeded inter-annotator agreement on relations to the document time (while still leaving a gap of 0.088 to the annotator-adjudicator agreement), but no participant system was near the human agreement for narrative containers (a gap of 0.078 from inter-annotator agreement and a gap of 0.244 from annotator-adjudicator agreement). For relations to the document time, the 0.088 gap between systems and annotator-adjudicator agreement is slightly larger than the 0.059 of Clinical TempEval 2015, but for narrative container relations the 0.244 gap is much smaller than the 0.412 of Clinical TempEval 2015. As with other tasks, human agreement is higher this year (0.844 and 0.817 in 2016 vs. 0.761 and 0.672 in 2015), which may explain the larger gap for document time relations. The smaller gap for narrative container relations despite the increased human agreement suggests that major improvements have been made to the systems for this task.
Table 4: System performance and annotator agreement on temporal relation tasks: identifying relations between events and the document creation time (DOCTIMEREL), and identifying narrative container relations (CONTAINS). The best system score from each column is in bold. Systems marked with † were submitted after the competition deadline and are not considered official.

Discussion
The results of Clinical TempEval 2016 suggest that current state-of-the-art systems are close to solving most event and time related tasks. For all of these tasks, the gap between system performance and human performance was less than 0.05, and for half the tasks (time spans, event spans, event degree, event type) it was 0.025 or less.
The temporal relation tasks were more difficult. Systems trying to predict the temporal relation between an event and the time at which the document was written lagged about 0.09 behind human performance. And systems trying to predict narrative containers (whether one event or time contains another) lagged about 0.25 behind human performance, even when provided human-annotated events and times. Nonetheless, the latter result was a major improvement over Clinical TempEval 2015, where the gap on narrative containers was more than 0.4.
While there was variability across the subtasks in the rankings of teams, UTHealth and UtahBMI were always at the top of the lists. Both of these systems relied on structured learning models (UTHealth used HMM support vector machines; UtahBMI used conditional random fields) with a wide variety of features (lexical, morphological, syntactic, and many others). We can thus infer that such approaches hold promise for temporal information extraction. However, these two teams were also among the first to make it through the data use agreement process, so their success may in part reflect the advantage of having more time for experimentation and feature engineering on the training data.
Overall, Clinical TempEval 2016 represented a major step forward from Clinical TempEval 2015. It saw a much greater breadth of participating systems (14 teams in 2016 vs. 3 teams in 2015), with the top systems maintaining 2015's high performance on the event and time tasks, while making major progress on the harder temporal relation tasks. Future plans for Clinical TempEval target the robustness of these systems: instead of testing on only colon cancer notes from the Mayo Clinic (the same domain as the training set), systems will be tested on other types of medical conditions and notes from other institutions.