UTHealth at SemEval-2016 Task 12: an End-to-End System for Temporal Information Extraction from Clinical Notes

The 2016 Clinical TempEval challenge addresses temporal information extraction from clinical notes. The challenge is composed of six sub-tasks, each of which is to identify: (1) event mention spans, (2) time expression spans, (3) event attributes, (4) time attributes, (5) events' temporal relations to the document creation times (DocTimeRel), and (6) narrative container relations among events and times. In this article, we present an end-to-end system that addresses all six sub-tasks. Our system achieved the best performance for all six sub-tasks when plain texts were given as input. It also performed best for narrative container relation identification when gold standard event/time annotations were given.


Introduction
Temporality is crucial in understanding the course of clinical events from a patient's electronic health records. Since a large part of the information on temporality resides in narrative clinical notes, automatic extraction of temporal information from clinical notes using natural language processing (NLP) techniques has received much attention. Over the years, research community challenges on clinical temporal information extraction have been organized: the 2012 Informatics for Integrating Biology and the Bedside (i2b2) challenge (Sun et al., 2013), the 2013/2014 CLEF/ShARe challenge (Mowery et al., 2014), and the 2015 Clinical TempEval challenge. These challenges provide annotated corpora of temporal entities and relations, which facilitate comparisons of multiple systems and expedite the development of clinical temporal information extraction methodologies.
The 2016 Clinical TempEval challenge is the most recent community challenge that addresses temporal information extraction from clinical notes. Following the 2015 Clinical TempEval challenge, the 2016 challenge consists of six sub-tasks, each of which is to identify: (1) spans of event mentions, (2) spans of time expressions, (3) attributes of events, (4) attributes of times, (5) events' temporal relations to the document creation times (DocTimeRel), and (6) narrative container relations among events and times (TLINK:Contains). 440 annotated clinical notes from the Mayo Clinic (the THYME corpus; Styler IV et al., 2014) were provided as the training data set, and 153 plain text clinical notes were provided as the test set. The participating systems were evaluated in two phases. In phase 1, the systems were evaluated on their results for all six sub-tasks given plain texts as inputs. In phase 2, system predictions on DocTimeRel and TLINK:Contains were evaluated given the gold standard event annotations (EVENT) and time annotations (TIMEX3).
In this article, we describe a comprehensive system that addresses all six sub-tasks. We designed the system by adapting state-of-the-art techniques from previous work on named entity recognition (Tang et al., 2013a; Jiang et al., 2011) and temporal relation identification (Tang et al., 2013b; Lin et al., 2015) in the medical domain. Our end-to-end system achieved top performance for all six sub-tasks in phase 1 and for the TLINK:Contains identification task in phase 2 of the challenge.

Methods
Our temporal information extraction system consists of four modules: the first module identifies the spans of event mentions and time expressions along with their types; the second module identifies attributes of events and times; the third module predicts DocTimeRel; and the last module identifies TLINK:Contains among events and times. The outputs of earlier modules are used by later modules. We describe these modules in detail in the following sections.

Event mentions and temporal expressions recognition
As the first step, our system identifies the spans of event mentions and time expressions along with their types. According to our observations of the corpus, different types of event mentions and time expressions may show characteristics different from one another. For instance, events with EVIDENTIAL type are usually represented with verbs such as 'showed', 'reported', 'confirms', in contrast to the events with N/A type that are usually represented with medical terms such as 'nausea', 'chemotherapy' or 'colonoscopy'. Similarly, times with DATE type appear more often with the preposition 'on', while times with DURATION type appear more often with 'during' or 'since'. Such variations in the characteristics may limit the system's performance, if one tries to identify event mentions or time expressions of all types at once and then identify their types. Therefore, our system identifies the spans of events and times as well as their types simultaneously.
An HMM-SVM sequence tagger (Joachims et al., 2009) is employed to tag each token in the clinical notes as either O (outside of an event mention), B-type (beginning of an event mention of type), or I-type (inside of an event mention of type), where type can be any of the three event types defined by the Clinical TempEval challenge (i.e., N/A, ASPECTUAL, and EVIDENTIAL). Another HMM-SVM tagger is used in a similar manner to identify the spans and types of time expressions.
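The HMM-SVM tagger itself is not shown here, but the BIO-with-type tagging scheme can be illustrated with a short sketch (the function name and the token-index span format are hypothetical, not from our implementation):

```python
def bio_tags(tokens, spans):
    """Map annotated (start, end, type) token spans to B-<type>/I-<type>/O
    tags, one tag per token; `end` is exclusive."""
    tags = ["O"] * len(tokens)
    for start, end, etype in spans:
        tags[start] = "B-" + etype           # first token of the mention
        for i in range(start + 1, end):
            tags[i] = "I-" + etype           # remaining tokens of the mention
    return tags

# 'showed' is an EVIDENTIAL event; 'nausea' is an N/A event.
tokens = ["The", "scan", "showed", "nausea", "."]
spans = [(2, 3, "EVIDENTIAL"), (3, 4, "N/A")]
# bio_tags(tokens, spans) → ["O", "O", "B-EVIDENTIAL", "B-N/A", "O"]
```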
We use various features that have been successfully used for many entity recognition tasks in the clinical domain (Tang et al., 2013b; Lin et al., 2015). In addition, we incorporate the results of SUTime (the Stanford temporal tagger) (Chang and Manning, 2012) into our system as a feature. SUTime is a rule-based tagger that identifies time expressions as defined by TimeML (Mani and Pustejovsky, 2004). The features used are as follows:

Lexical features: n-grams (uni-, bi-, and tri-) of nearby words (window size of +/-2), character n-grams (bi- and tri-) of each word, prefix and suffix of each word (up to three characters), and orthographic forms of each word (obtained by normalizing numbers, uppercase letters, and lowercase letters to '#', 'A', and 'a', respectively, and by regular expression matching)

Syntactic features: POS n-grams (uni-, bi-, and tri-) of nearby words (window size of +/-2)

Discourse-level features: sentence length, sentence type (e.g., whether the sentence ends with a colon or starts with an enumeration mark such as '1.'), and section information

Word representation features: features derived from Brown clustering (Brown et al., 1992), random indexing (Lund and Burgess, 1996), and word embeddings (Tang et al., 2014), trained on the MiPACQ (Albright et al., 2013) and MIMIC II (Saeed et al., 2011) corpora

Features from external resources: dictionary matching results using customized dictionaries of medical/temporal terms, and the temporal expression prediction results from SUTime (TIMEX3 only).
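As an illustration of the orthographic feature, the normalization described above can be sketched as follows (a minimal version; the function name is hypothetical, and our actual feature extractor also applies additional regular expression patterns):

```python
import re

def orthographic_form(word):
    """Normalize digits to '#', uppercase letters to 'A', and lowercase
    letters to 'a'; all other characters are kept unchanged."""
    word = re.sub(r"[0-9]", "#", word)
    word = re.sub(r"[A-Z]", "A", word)
    word = re.sub(r"[a-z]", "a", word)
    return word
```

For example, a drug-name token like "Tylenol500" maps to "Aaaaaaa###", so tokens with the same capitalization/digit shape share a feature value.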

Event attribute identification
Given the spans and types of event mentions, our system further identifies three attributes of the events, i.e., modality, degree, and polarity. We trained three SVM classifiers, one for each of the three attributes, using the LIBLINEAR SVM package (Fan et al., 2008). We used features similar to those described in Section 2.1, extracted from a window of +/-5 tokens around each event mention, along with additional attribute-specific features.
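A minimal sketch of extracting context-token features from such a window follows; the feature-naming convention is our own illustration, not the paper's actual feature template:

```python
def window_features(tokens, index, size=5):
    """Collect lowercased context tokens within +/-`size` positions of
    the event mention head at `index`, keyed by relative offset."""
    feats = {}
    for offset in range(-size, size + 1):
        j = index + offset
        if offset != 0 and 0 <= j < len(tokens):
            feats["w[%d]" % offset] = tokens[j].lower()
    return feats
```

Each dictionary entry becomes one binary feature for the per-attribute SVM classifier.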

DocTimeRel identification
Our system identifies the DocTimeRel of each event mention in a manner similar to that used for event attributes. An SVM classifier was trained using the LIBLINEAR package, with features extracted from a window of +/-5 tokens around each event mention. In addition to a set of features similar to those described in Section 2.1, the following features are used:

DocTimeRel-specific features: tense information of the verbs in the same sentence, event attributes, and information on time expressions in the same sentence (token/POS of time expressions before/after the event mention, and token/POS of words between the closest time expression and the event mention)

TLINK:Contains identification
We divide the task of narrative container relation identification into six sub-problems based on two criteria: (1) whether the target narrative container relation is between two events or between an event and a time, and (2) whether the two event/time mentions are within one sentence, within two adjacent sentences, or across more than two sentences. For each sub-problem, we trained an SVM classifier that identifies whether an ordered pair of two events/times (a candidate pair) forms a TLINK of Contains type, using the LIBLINEAR SVM package. Before training the classifiers, we applied the following steps to take the characteristics of the data distribution into account. First, in the gold standard dataset, a large number of implicit temporal relations are intentionally left unannotated. Since providing implicit relations as negative instances to the SVM learners may harm the learning process, we extended the gold standard set of TLINK:Contains to its transitive closure and used the extended set as the positive instances for training. The transitive closure was generated by applying the Floyd-Warshall algorithm (Floyd, 1962) to the gold standard TLINK set, based on the transitivity of the TLINK:Contains relation (i.e., A contains B ∧ B contains C → A contains C).
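The closure step can be sketched as follows, treating TLINK:Contains as a directed relation over event/time identifiers (a simplified stand-in for the Floyd-Warshall pass described above):

```python
def contains_closure(pairs):
    """Transitive closure of a set of (container, containee) pairs:
    A contains B and B contains C implies A contains C."""
    nodes = {x for pair in pairs for x in pair}
    reach = set(pairs)
    for k in nodes:                  # Floyd-Warshall-style propagation
        for i in nodes:
            for j in nodes:
                if (i, k) in reach and (k, j) in reach:
                    reach.add((i, j))
    return reach
```

For example, from the annotated pairs {(A, B), (B, C)} the closure adds the implicit pair (A, C), which then counts as a positive training instance rather than a spurious negative.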
Second, since any two events/times can form a candidate pair for training a classifier, the number of candidate pairs becomes huge, with only a small portion of positive instances among them. This may not be ideal for training a classifier. In order to reduce the number of prospective negative instances, we filtered out candidate pairs that are highly unlikely to form a TLINK:Contains relation, based on the THYME corpus annotation guideline. We removed a candidate pair either 1) when the two event/time mentions are not in the same section, 2) when one event has ACTUAL modality while the other has HYPOTHETICAL modality, or 3) when one event has BEFORE DocTimeRel while the other has AFTER DocTimeRel. For candidate pairs whose event/time mentions are across more than two sentences, we further filtered out pairs based on heuristic rules, in order to keep only the candidate pairs that are highly likely to form a TLINK:Contains relation. We kept such a candidate pair only when one of the two events/times is mentioned in a section header that includes the keyword 'history' or 'evaluation', or in a section header that ends with a time expression.
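The three removal rules can be sketched as a simple pair filter (the dictionary representation and key names are hypothetical illustrations; our actual system operates on richer annotation objects):

```python
def keep_candidate(m1, m2):
    """Return False for candidate pairs ruled out by the heuristics:
    different sections, ACTUAL vs. HYPOTHETICAL modality, or
    BEFORE vs. AFTER DocTimeRel."""
    if m1["section"] != m2["section"]:
        return False
    if {m1.get("modality"), m2.get("modality")} == {"ACTUAL", "HYPOTHETICAL"}:
        return False
    if {m1.get("doctimerel"), m2.get("doctimerel")} == {"BEFORE", "AFTER"}:
        return False
    return True
```

Only pairs passing this filter are handed to the SVM learners as candidate (mostly negative) instances.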
We also applied cost-sensitive learning in order to counterbalance the effect of the dominating number of negative instances. To each class, we assigned a weight inversely proportional to the class frequency, adjusting the penalty factor in SVM training (Ben-Hur and Weston, 2009). For instance, if there were 20 positive pairs among 100 candidate pairs, we would assign the weight 5 (100/20) to the positive class and the weight 1.25 (100/80) to the negative class.

The features used for the six classifiers are as follows. Note that each event mention was expanded to its covering noun phrase before feature extraction:

Common features: event/time attributes, token and POS features on event/time mentions (as provided by cTAKES), punctuation between event/time mentions, other event/time mentions within the same sentence as the two event/time mentions, number of other event/time mentions between the two event/time mentions, tense of the verbs in the same sentence, section information, sentence type (the same as in Section 2.1), and word embedding representations of the head words of event/time mentions

Features for single-sentence cases: dependency path linking the two event/time mentions (as provided by cTAKES)

Features for multi-sentence cases: line distance between the two event/time mentions, and tokens that are common to the two event/time mentions
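The class weighting can be sketched as follows; the resulting weights correspond to per-class penalty factors in SVM training (the exact LIBLINEAR invocation is omitted here):

```python
def class_weights(labels):
    """Weight each class inversely proportional to its frequency:
    weight(c) = N / count(c)."""
    n = len(labels)
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    return {c: n / k for c, k in counts.items()}
```

With 20 positive pairs among 100 candidates, this reproduces the weights in the text: 5.0 for the positive class and 1.25 for the negative class.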

Results
In this section, we present our system's performance on the test set, along with the top and median results from the challenge.

Conclusion and discussion
In this article, we described a system that achieved the top performance in the 2016 Clinical TempEval challenge. We adapted state-of-the-art techniques for entity recognition and temporal relation identification in the clinical domain, and showed that those techniques are effective for the Clinical TempEval challenge as well.
For time expression identification, we found some error cases in which the system's prediction differs from the gold standard annotation only in the inclusion or exclusion of a preposition. For example, while a DURATION type time is annotated for the phrase "for the past 40 years" in the gold set, our system predicted a DURATION for the phrase "the past 40 years", omitting the preposition 'for' from the gold standard annotation.

Table 4 shows the DocTimeRel identification accuracy for each DocTimeRel value. Accuracy on the value OVERLAP is the highest, which might come from the abundance of training data. Surprisingly, the classifier worked better for the value AFTER than for the value BEFORE, even though there were three times more events with BEFORE DocTimeRel than those with AFTER. We conjecture that explicit keywords that indicate the future tense, such as "will" and "potential", played key roles in identifying AFTER DocTimeRel.

Table 5 shows the 10-fold cross validation results of the six classifiers for TLINK:Contains identification. Temporal relations between an event and a time were predicted more accurately than relations between two events. Classifiers for pairs across more than two sentences showed the best F1 scores, due to the heuristic filtering steps in which we kept only the candidate pairs that are highly likely to form a narrative container relation.

sub-problem        F1
EVENT-EVENT-1      66.9%
EVENT-EVENT-2      69.1%
EVENT-EVENT-3      76.2%
EVENT-TIMEX3-1     79.9%
EVENT-TIMEX3-2     76.3%
EVENT-TIMEX3-3     84.3%

Table 5: F1 scores of the six classifiers for TLINK:Contains identification (10-fold cross validation on the training set).
EVENT-EVENT and EVENT-TIMEX3 represent the sub-problems regarding candidate pairs between two events and the sub-problems regarding pairs between an event and a time, respectively. The suffixes '-1', '-2', and '-3' indicate that the pairs are within one sentence, within two adjacent sentences, and across more than two sentences, respectively.
Based on the observations above, we plan to further improve our system's performance.