GUIR at SemEval-2016 task 12: Temporal Information Processing for Clinical Narratives

Extraction and interpretation of temporal information from clinical text is essential for clinical practitioners and researchers. SemEval 2016 Task 12 (Clinical TempEval) addressed this challenge using the THYME corpus, a corpus of clinical narratives annotated with a schema based on TimeML guidelines. We developed and evaluated approaches for: extraction of temporal expressions (TIMEX3) and EVENTs; TIMEX3 and EVENT attributes; document-time relations; and narrative container relations. Our approach is based on supervised learning (CRF and logistic regression), utilizing various sets of syntactic, lexical and semantic features with the addition of manually crafted rules. Our system demonstrated substantial improvements over the baselines in all the tasks.


Introduction
SemEval-2016 Task 12 (Clinical TempEval) is a direct successor to the 2015 Clinical TempEval (Bethard et al., 2015) and the earlier i2b2 temporal challenge (Sun et al., 2013). Clinical TempEval is designed to address the challenge of understanding the clinical timeline in medical narratives. It is based on the THYME (Temporal Histories of Your Medical Event; https://clear.colorado.edu/TemporalWiki/index.php/Main_Page) corpus (Styler IV et al., 2014), which includes temporal annotations built on TimeML, a standard specification language for events and temporal expressions in natural language (http://www.timeml.org/).
Researchers have explored ways to extract temporal information from clinical text. Velupillai et al. (2015) developed a pipeline based on ClearTK and SVM with lexical features to extract TIMEX3 and EVENT mentions. In the i2b2 2012 temporal challenge, all top performing teams used a combination of supervised classification and rule-based methods for extracting temporal information and relations (Sun et al., 2013). Besides the THYME corpus, there have been other efforts in clinical temporal annotation, including works by Roberts et al. (2008), Savova et al. (2009) and Galescu and Blaylock (2012). Previous work has also investigated extracting temporal relations.
Clinical TempEval 2016 focused on designing approaches for timeline extraction in the clinical domain. It comprised six tasks, which are listed in Table 1. Per TimeML specifications (Pustejovsky et al., 2003), we refer to temporal expressions as TIMEX3 and events as EVENT throughout the paper. Attributes of TIMEX3 and EVENTs are outlined according to the THYME annotations (Styler IV et al., 2014). Sixteen teams participated in Clinical TempEval 2016 (Bethard et al., 2016).
For extracting temporal information from clinical text, we utilize supervised learning algorithms (Conditional Random Fields (CRF) and logistic regression) with diverse sets of features for each task. We also utilize manually-crafted rules to improve the performance of the classifiers, when appropriate. We show the effectiveness of the designed features and the rules for different tasks.
Task  Description
TS    TIMEX3 spans
ES    EVENT spans
TA    Attributes of TIMEX3 (Class: DATE, TIME, DURATION, ...)
Table 1 (only the rows above are recoverable)
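Our system outperforms the baselines across all tasks and is above the median of all participating teams in all tasks but one (CR, in precision). The official ranking of participating teams was unknown at the time of writing and was to be announced at the SemEval workshop.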

Methodology
Our approach to all tasks is based on supervised learning using lexical, syntactic and semantic features extracted from the clinical text. We also designed custom rules for some tasks when appropriate. Details are outlined below:

TIMEX3 and EVENT Span Detection (TS, ES)
To extract TIMEX3 and EVENT spans (TS and ES), we use a combination of linear-chain CRFs (Lafferty et al., 2001) with manually-crafted rules. Linear-chain CRFs are one of the most robust structured prediction approaches in natural language processing. We train the CRF for detecting TIMEX3s and EVENTs using BIO (Begin Inside Outside) labeling. That is, for the TIMEX3 classifier, after tokenizing the text, each token is labeled as either "O," "B-TIMEX3," or "I-TIMEX3". Similarly, the event classifier labels the tokens as either "O" or "B-EVENT," as virtually all EVENT annotations are only one token long. We use the CRF-Suite toolkit (Okazaki, 2007) for our experiments.
Features: lowercase; token letter case; if token is title; if token is numeric; if token is stopword; POS tag; brown cluster; prefix; suffix; noun chunk shape of the token; lemma
Table 2: Base feature set for supervised algorithms
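The BIO labeling scheme described above can be illustrated with a minimal sketch. The whitespace tokenizer, the example sentence, and the function names are assumptions made for illustration and are not taken from the system itself; the sketch also assumes that annotated spans align with token boundaries.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def whitespace_tokenize(text: str) -> List[Span]:
    """Toy tokenizer (an assumption): offsets of whitespace-separated tokens."""
    offsets, pos = [], 0
    for piece in text.split():
        start = text.index(piece, pos)
        offsets.append((start, start + len(piece)))
        pos = start + len(piece)
    return offsets


def bio_labels(tokens: List[Span], spans: List[Span], tag: str) -> List[str]:
    """Label each token as O, B-<tag>, or I-<tag> given annotated spans."""
    labels = []
    for tok_start, tok_end in tokens:
        label = "O"
        for span_start, span_end in spans:
            # A token belongs to a span if it lies entirely inside it.
            if span_start <= tok_start and tok_end <= span_end:
                # The first token of a span gets B-, later tokens get I-.
                label = ("B-" if tok_start == span_start else "I-") + tag
                break
        labels.append(label)
    return labels


# Hypothetical sentence with one TIMEX3 span and one EVENT span.
text = "Seen on March 3, 2010 for colonoscopy"
tokens = whitespace_tokenize(text)

timex_start = text.index("March")
timex_span = (timex_start, timex_start + len("March 3, 2010"))
timex_labels = bio_labels(tokens, [timex_span], "TIMEX3")

event_start = text.index("colonoscopy")
event_span = (event_start, event_start + len("colonoscopy"))
event_labels = bio_labels(tokens, [event_span], "EVENT")
```

In this sketch the three tokens of the date receive B-TIMEX3, I-TIMEX3, I-TIMEX3, while the single-token EVENT only ever needs a B-EVENT label, which mirrors the two label sets described above.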
The main features that we use for the CRF in the TS and ES tasks are outlined in Table 2. Among these features is Brown clustering (Brown et al., 1992), a form of hierarchical clustering based on the contexts in which words appear. Brown clusters mitigate lexical sparsity by representing each word through the cluster it belongs to. We constructed fifty clusters across the train and test datasets and passed the binary identifier of a token's cluster as a feature.
In addition to these features, we use domain-specific features for EVENT span detection. Our domain feature extraction is based on the Unified Medical Language System (UMLS) ontology (Bodenreider, 2004). We use MetaMap (Aronson and Lang, 2010), a tool for mapping text to UMLS concepts, for extracting the concepts. The semantic types of the extracted concepts are then used as features. Since UMLS is very comprehensive, considering all the semantic types causes drift. Thus, we limit semantic types to those indicative of clinical events (e.g. diagnostic procedure, disease or syndrome, and therapeutic procedure). For each feature set, we expand the features by considering a context window of +/-3 tokens (a context window of size 3 yielded the best results on the development set).
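The context-window expansion can be sketched as follows. This is an illustrative example only: the feature names, the small `base_features` function, and the offset-prefix naming scheme are assumptions, and only a few of the Table 2 features are reproduced.

```python
def base_features(token):
    """A few illustrative base features (a subset, chosen for the sketch)."""
    return {
        "lower": token.lower(),
        "is_title": token.istitle(),
        "is_numeric": token.isdigit(),
        "suffix3": token[-3:],
    }


def add_context(token_features, window=3):
    """Copy each neighbor's features into the current token's feature dict.

    Neighbor features are prefixed with their signed offset, e.g. "-1:lower".
    Positions that fall outside the sentence are skipped.
    """
    expanded = []
    n = len(token_features)
    for i in range(n):
        feats = dict(token_features[i])
        for offset in range(-window, window + 1):
            j = i + offset
            if offset == 0 or j < 0 or j >= n:
                continue
            for name, value in token_features[j].items():
                feats[f"{offset:+d}:{name}"] = value
        expanded.append(feats)
    return expanded


# Hypothetical sentence; each token ends up with its own features plus
# those of up to three tokens on either side.
sentence = ["Colonoscopy", "on", "March", "3", "showed", "a", "polyp"]
window_feats = add_context([base_features(t) for t in sentence], window=3)
```

Dictionaries of this shape are the kind of per-token input that CRF toolkits typically accept, although the exact encoding used with CRF-Suite in the system is not specified in the text.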
For EVENT spans, we supplement the CRF output spans with manually crafted rules designed to capture EVENT spans. In particular, we add rules to automatically identify EVENTs relating to standard patient readings. For example, in "Diastolic=55 mm[Hg]", we use simple regular expressions to isolate the word "Diastolic" as an EVENT span.
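A rule of this kind might look like the sketch below. The regular expression is a hypothetical pattern written for this example (a word followed by "=" and a number), not the expression used in the system, and merging by set union is likewise an assumption about how rule output could supplement CRF output.

```python
import re

# Hypothetical pattern: an alphabetic word, "=", then an integer or decimal.
READING_RE = re.compile(r"\b([A-Za-z]+)\s*=\s*\d+(?:\.\d+)?")


def reading_event_spans(text):
    """Return (start, end) offsets of the word that names each reading."""
    return [(m.start(1), m.end(1)) for m in READING_RE.finditer(text)]


def merge_spans(crf_spans, rule_spans):
    """Supplement CRF spans with rule spans, dropping exact duplicates."""
    return sorted(set(crf_spans) | set(rule_spans))


text = "Vitals: Systolic=120 mm[Hg], Diastolic=55 mm[Hg]."
rule_spans = reading_event_spans(text)
rule_words = [text[start:end] for start, end in rule_spans]

# Suppose the CRF had already found "Systolic" (hypothetical output).
systolic = text.index("Systolic")
crf_spans = [(systolic, systolic + len("Systolic"))]
all_spans = merge_spans(crf_spans, rule_spans)
```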
For TIMEX3 spans, we use regular expressions designed to capture dates in standard formats. These rules improved the results of ES and TS considerably, as shown in Section 3.2.1.
Set 1: UMLS semantic type; tense of the related verb in dependency tree; dependency root of the sentence
Set 2: class, text and brown cluster of closest DATE, PREPOSTEXP and TIMEX3; comparison with section time; comparison with document time; sentence tense and modals
Table 3: Additional features for DR extraction

Time and Event Attribute Detection
The main attribute for TIMEX3 mentions is their "class," which can be one of the following six types: DATE, DURATION, PREPOSTEXP, QUANTIFIER, SET, or TIME.
EVENTs have four attributes, each of which includes different types (Table 1). The attribute types are fully described by Styler IV et al. (2014). To properly classify each TIMEX3 and EVENT attribute, we train a separate logistic regression classifier for each TIMEX3 and EVENT attribute value. These classifiers are trained on the TIMEX3 and EVENT spans previously extracted by the span CRFs (Section 2.1), and employ a base feature set similar to the one used for span detection.
In addition to the base feature set, we also incorporate rules as features in our classifier. We consider words that are indicative of certain EVENT attributes. For example, words such as "complete" or "mostly" indicate DEGREE:MOST, "possibly" indicates MODALITY:HEDGED and "never" shows POLARITY:NEG. We add such contextual features for DEGREE, MODALITY, and POLARITY.
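The cue-word features can be sketched as a small lookup. The word lists below contain only the examples given in the text; the actual lists used in the system are not specified, and the feature naming is an assumption.

```python
# Cue words taken from the examples in the text; the full lists are unknown.
CUE_WORDS = {
    "DEGREE:MOST": {"complete", "mostly"},
    "MODALITY:HEDGED": {"possibly"},
    "POLARITY:NEG": {"never"},
}


def cue_features(context_tokens):
    """Binary features: does the EVENT's context contain a cue word?"""
    lowered = {token.lower() for token in context_tokens}
    return {
        f"cue[{label}]": bool(lowered & words)
        for label, words in CUE_WORDS.items()
    }


# Hypothetical context around an EVENT mention.
neg_feats = cue_features(["The", "patient", "never", "had", "chest", "pain"])
hedge_feats = cue_features(["Possibly", "a", "mostly", "resolved", "infection"])
```

These features are added to the classifier input rather than deciding the label directly, so the logistic regression model can still weigh them against the other evidence.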
In addition to the rules mentioned above, we further devise rules that lead to immediate classification as a specific class or attribute value. For example, TIMEX3 annotations in the format "[number] per [number]" are classified as SET automatically. We use the most probable predicted class as the final assigned label.

Document-time Relations (DR)
Table 3 describes the additional features that we use for DR extraction. In addition to the base features, we consider features specific to the EVENT annotation. These features are illustrated as Set 1 in Table 3. We furthermore expanded the features by considering contextual features from the sentence and nearby time and date mentions (Set 2 in Table 3). Medical narratives often follow a chronological order. Therefore, the nearest TIMEX3 mentions, and their comparison with the section timestamp or document timestamp, can be good indicators of DRs. Similarly, verb tense and the modals in the sentence are also indicative of the sentence tense and can help in identifying the document-time relation. These additional features improved the results, as shown in Section 3.2.3.
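The idea behind the nearest-TIMEX3 features can be sketched as below. The sketch assumes that each TIMEX3 has already been normalized to a calendar date and that proximity is measured in character offsets; neither detail is stated in the text, and all names are hypothetical.

```python
from datetime import date


def nearest_timex(event_offset, timexes):
    """timexes: list of (char_offset, date). Return the closest one or None."""
    if not timexes:
        return None
    return min(timexes, key=lambda t: abs(t[0] - event_offset))


def compare(a, b):
    """Order of date a relative to date b."""
    if a < b:
        return "before"
    if a > b:
        return "after"
    return "same"


def dr_context_features(event_offset, timexes, doc_time):
    """Features relating an EVENT to its nearest TIMEX3 and the document time."""
    nearest = nearest_timex(event_offset, timexes)
    if nearest is None:
        return {"has_timex": False}
    return {
        "has_timex": True,
        "timex_vs_doctime": compare(nearest[1], doc_time),
        "timex_distance": abs(nearest[0] - event_offset),
    }


# Hypothetical note written on 2010-06-20 that mentions two dates.
doc_time = date(2010, 6, 20)
timexes = [(5, date(2010, 6, 14)), (80, date(2010, 6, 20))]
dr_feats = dr_context_features(20, timexes, doc_time)
```

An EVENT whose nearest date precedes the document time is more likely to carry a BEFORE relation, which is the intuition these features are meant to expose to the classifier.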

Narrative Container Relations (CR)
To extract narrative container relations (Pustejovsky and Stubbs, 2011), we use the semantic frames of the sentences. We only consider intra-sentence containment relations (appearing in the same sentence) and do not handle inter-sentence relations (crossing sentences). According to the THYME annotation guidelines (Styler IV et al., 2014), both EVENTs and TIMEX3s can provide boundaries of narrative containers. The first step in identifying narrative container relations is to identify the anchor, the EVENT or TIMEX3 span which contains all the other related EVENTs (targets). To learn the anchor, target and containment relation, in addition to the base features for anchor and target, we use Semantic Role Labeling (SRL) and dependency parse tree features of the sentence. SRL assigns semantic roles to different syntactic parts of the sentence. Specifically, following the PropBank guidelines (Palmer et al., 2005), SRL identifies the predicates in a sentence and their semantic arguments. If the anchor or the target falls within a semantic argument of the sentence, we assign the argument label as a feature of the associated anchor or target. Using semantic roles, we capture the semantics of the constituent parts of the sentence as features that help to identify the container relations. For SRL, we use the neural model of Collobert et al. (2011) (SENNA implementation: http://ml.nec-labs.com/senna/). An example of semantic role labels is outlined in Figure 1, in which the labels below the sentence indicate the semantic labels.
Next, we consider the dependency parse tree of the sentence. Given the anchor and the target, we traverse the dependency parse tree of the sentence to identify whether they are related through a common root. In the sample sentence shown in Figure 1, chemotherapy is the anchor and MI is another event, which is the target. As shown, they are connected through the root of the sentence ("occurred").
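The tree traversal can be illustrated with a minimal sketch that finds the lowest common ancestor of two tokens. The head-index representation (with the root pointing to itself) and the four-token parse are assumptions made for this example; the sentence is a hypothetical stand-in built from the words named above, not the sentence of Figure 1.

```python
def path_to_root(index, heads):
    """heads[i] is the index of token i's head; the root points to itself."""
    path = [index]
    while heads[path[-1]] != path[-1]:
        path.append(heads[path[-1]])
    return path


def lowest_common_ancestor(a, b, heads):
    """First node on b's path to the root that also lies on a's path."""
    ancestors_of_a = set(path_to_root(a, heads))
    for node in path_to_root(b, heads):
        if node in ancestors_of_a:
            return node
    return None


# Hypothetical parse: "MI" and "during" attach to "occurred" (the root),
# and "chemotherapy" attaches to "during".
words = ["MI", "occurred", "during", "chemotherapy"]
heads = [1, 1, 1, 2]

anchor = words.index("chemotherapy")
target = words.index("MI")
connector = words[lowest_common_ancestor(anchor, target, heads)]
```

Here the anchor and the target meet at "occurred", so a feature such as the connecting node, or whether that node is the sentence root, can be passed to the relation classifier.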
Per the annotation guidelines, TIMEX3 spans should receive higher priority than EVENTs for being labeled as the anchor. Therefore, we also consider the type of the expression (TIMEX3 or EVENT) as a feature. Additional features such as UMLS semantic types, POS tags, dependency relations and verb tense of the sentence's root are also considered. To extract POS, syntactic and dependency-based features, we use the spaCy toolkit (Honnibal and Johnson, 2015).

Experiments
The 2016 Clinical TempEval task consisted of two evaluation phases. In phase 1, only the plain text was given and the TIMEX3 and EVENT mentions were unknown. In phase 2, which was only for DR and CR tasks, the TIMEX3 and EVENT mentions were revealed. In phase 1, we participated in all tasks, except for CR. In phase 2, we participated in both the DR and CR tasks.

Baselines
The baselines are two rule-based systems (Bethard et al., 2015) that are provided along with the corpus. The memorize baseline, which is the baseline for all tasks except narrative containers, memorizes the EVENT and TIMEX3 mentions and attributes based on the training data. It then uses the memorized model to extract temporal information from new data. For narrative containers, the closest match baseline predicts a time expression to be a narrative container if it is the closest time expression to an EVENT expression.
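To make the memorize baseline concrete, a much simplified sketch is given below. It is not the official baseline implementation: it handles single tokens only, ignores attributes and frequency thresholds, and all names are hypothetical.

```python
from collections import Counter, defaultdict


def train_memorize(annotations):
    """annotations: iterable of (phrase, label) pairs seen in training data."""
    counts = defaultdict(Counter)
    for phrase, label in annotations:
        counts[phrase.lower()][label] += 1
    # Keep the most frequent label for each memorized phrase.
    return {phrase: c.most_common(1)[0][0] for phrase, c in counts.items()}


def predict_memorize(tokens, model):
    """Label every token of new text that exactly matches a memorized phrase."""
    return [
        (i, token, model[token.lower()])
        for i, token in enumerate(tokens)
        if token.lower() in model
    ]


# Hypothetical training annotations and test sentence.
model = train_memorize([
    ("surgery", "EVENT"),
    ("today", "TIMEX3"),
    ("surgery", "EVENT"),
])
predictions = predict_memorize(["Surgery", "scheduled", "today"], model)
```

A baseline of this kind cannot label anything it has not seen in training, which is one reason learned models with generalizing features are expected to improve on it.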

Results
Our system's results on the test set for all tasks are presented in Table 4 (phase 1) and Table 5 (phase 2). Our results outperform the baselines in all tasks and, in all but one case (CR precision), are above the median of all the participating teams.

TIMEX3 and EVENT spans (TS, ES)
For TS and ES, our system achieved F1 scores of 0.735 and 0.881 on the test set, which correspond to improvements of +33.4% and +3.0% over the baseline, respectively. The improvement is thus much larger for TS than for ES. For ES, Table 6 shows the effect of applying manually crafted rules to the output of the CRF. These rules improved F1 by 4.6%. In addition, as illustrated in Table 7, adding domain-specific features (UMLS semantic types) improved on the performance of the base features (+2% F1). Adding manual rules to the output of the CRF resulted in further improvement (an additional +1% F1).

TIMEX3 and EVENT attributes (TA, EA)
For TA and EA, our system achieved an F1 of 0.710 and an average F1 of 0.856, respectively (Table 4). These results improve over the baseline by 33.5% in TA and 4.8% in EA (for the baseline, the average F1 of EA over all attribute types is 0.817). For EA, while performance on all attribute types is comparable, the best performance is on the DEGREE attribute, with an F1 of 0.887. The results of the TA and EA tasks on the development set are also reported in Table 8. Generally, our results on the test set are marginally higher than on the development set, which suggests that we avoided overfitting the training and development sets.
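The percentages above appear to be relative gains over the baseline rather than absolute differences in F1. A quick check with the EA figures quoted in the text (0.856 against a baseline of 0.817) is consistent with that reading; the function name is an assumption made for this example.

```python
def relative_improvement(system, baseline):
    """Relative gain of the system score over the baseline, in percent."""
    return 100.0 * (system - baseline) / baseline


# EA average F1 from the text: system 0.856, baseline 0.817.
ea_gain = relative_improvement(0.856, 0.817)
ea_gain_rounded = round(ea_gain, 1)  # 4.8, matching the reported figure
```

The absolute difference for EA is 3.9 F1 points, so the reported 4.8% only matches under the relative interpretation.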

Document-time relation (DR)
The DR task was included in both evaluation phases.
Its F1 score in phase 1 was 0.711 and in phase 2 was 0.815. Naturally, since in phase 1, the spans of EVENTs were unknown, lower performance is expected in comparison with phase 2. The DR results in both phases show substantial improvements over the baseline (+17.7% F1 in phase 1 and +20.4% recall in phase 2).
The effect of context window size on DR performance on the development set is reported in Table 9. As the window size increases, more contextual features are added and performance therefore increases. However, beyond a certain point, when the window becomes excessively large, performance decreases. We attribute this to overfitting the training data because of too many features. The optimal context window size is 6, which we used for our final submission. As for features, we evaluated three primary feature sets (using a window of 6), the results of which are outlined in Table 10. The features are defined in Tables 2 and 3. As illustrated, the addition of Set 1 and Set 2 features resulted in improvements in all DR types.
Error analysis for DR showed that many of the misclassified examples were for the BEFORE/OVERLAP relations, as also reflected in the low relative performance of BEFORE/OVERLAP relations (Table 10).
In many cases, these relations are wrongly classified as either BEFORE or OVERLAP.

Narrative Container relations (CR)
The CR results are presented in Table 5. Our approach substantially improves over the baseline, especially in terms of recall (a 2.06-fold improvement).
This demonstrates that using the semantic frames of the sentences as well as their dependency structure can be effective in identifying container relations. However, the F1 score of 0.506 shows that there is still plenty of room for improvement on this task. Error analysis showed that many of the false negatives involve inter-sentence relations, whereas our approach is designed to capture only intra-sentence container relations. Some other false negatives were due to dates that were not syntactically part of the sentence. An example is: "{June 14, 2010}: His first [colonoscopy] was positive for [polyp]". In this example, {June 14, 2010} is the anchor and [colonoscopy] and [polyp] are the targets. However, the date is not a syntactic part of the sentence and, consequently, our approach is unable to capture it as the correct anchor of the narrative container.

Discussion and conclusions
SemEval 2016 Task 12 (Clinical TempEval) focused on temporal information extraction from clinical narratives. We developed and evaluated a system for identifying TIMEX3 and EVENT spans, TIMEX3 and EVENT attributes, document-time relations, and narrative container relations. Our system employed a machine learning classification scheme for all the tasks, based on various sets of syntactic, lexical, and semantic features. In all tasks, we showed improvement over the baseline and, in all but one case (CR precision), we placed above the median of all participants (the official ranking of the systems was not announced at the time of writing). While we showed the effectiveness of a diverse set of features along with supervised classifiers, we also illustrated that incorporating manually crafted extraction rules improves results. However, manual rules should be constrained, as some rules interfere with the learning algorithm and negatively affect the results. The strongest rules were those based on consistent patterns, such as dates in a standard format (e.g. MM-DD-YYYY). On the other hand, while some other rules improved recall, they led to much lower precision and F1 score. For example, a rule that matches the word "time" as a TIMEX3 span improved our TS recall considerably, but at the expense of overall precision, and was therefore not included in the final submission.
For narrative containment relations, we showed that the semantic frames and dependency structure of the sentence are helpful in identifying the relations. However, our approach is limited to intra-sentence relations and does not detect relations that cross sentence boundaries. In future work, we aim to expand our approach to detect inter-sentence container relations.