GUIR at SemEval-2017 Task 12: A Framework for Cross-Domain Clinical Temporal Information Extraction

Clinical TempEval 2017 (SemEval 2017 Task 12) addresses the task of cross-domain temporal extraction from clinical text. We present a system for this task that uses supervised learning for the extraction of temporal expression and event spans with corresponding attributes and narrative container relations. Approaches include conditional random fields and decision tree ensembles, using lexical, syntactic, semantic, distributional, and rule-based features. Our system received best or second best scores in TIMEX3 span, EVENT span, and CONTAINS relation extraction.

The focus of Clinical TempEval 2017 is domain adaptation. The source domain consists of clinical text about patients undergoing colon cancer treatments, while the target domain consists of clinical text about those with brain cancer. There are two phases in the task. In phase 1, the shared task provides no annotations for the target domain (unsupervised). In phase 2, the shared task provides a small annotated training set from the target domain (supervised). Both phases evaluate system performance on thirteen tasks via precision, recall, and F1-score.
In Clinical TempEval 2016, the top-performing system employed structural support vector machines (SVM) for entity span extraction and linear support vector machines for attribute and relation extraction (Lee et al., 2016). For the previous iteration, Velupillai et al. (2015) developed a pipeline based on ClearTK and SVM with lexical and rule-based features to extract TIMEX3 and EVENT mentions. In the i2b2 2012 temporal challenge, all top performing teams used a combination of supervised classification and rule-based methods for extracting temporal information and relations (Sun et al., 2013). Other efforts in clinical temporal annotation include works by Roberts et al. (2008), Savova et al. (2009), and Galescu and Blaylock (2012).
Previous work has also investigated extracting temporal relations. Examples of these efforts in the general domain include: classification by SVM (Chambers et al., 2007), Integer Linear Programming (ILP) for temporal ordering (Chambers and Jurafsky, 2008), Markov Logic Networks (Yoshikawa et al., 2009), and SVM with Tree Kernels (Miller et al., 2013).
In this paper, we present a framework for temporal information extraction in clinical narratives. Specifically we utilize Conditional Random Fields (CRFs) and decision tree ensembles for extracting temporal entities and relations from clinical text. The features we use are covered in detail in Section 2. This work can be seen as an extension and refinement of the system used for Clinical TempEval 2016 by Cohan et al. (2016).

Methodology
Our approach uses supervised learning algorithms with lexical, syntactic, semantic, distributional, and rule-based features for span, attribute, and relation extraction.

Span Extraction
Extraction of TIMEX3 and EVENT spans uses linear-chain CRFs.
We use BIO labels for the classification of spans of text from the tokenized source text: "B" indicates that the token begins a span, "I" indicates that the token is inside a span, and "O" indicates that the token is outside all spans. This approach allows a span to cover one or more adjacent tokens. Non-contiguous spans are not supported, but they occur rarely.
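The conversion from annotated character spans to token-level BIO labels can be sketched as follows. This is a minimal illustration, not the system's actual implementation; the token and span representations are hypothetical.

```python
def spans_to_bio(tokens, spans):
    """Convert character-offset spans to token-level BIO labels.

    tokens: list of (text, start_offset) pairs
    spans:  list of (start, end) character offsets of annotated mentions
    """
    labels = ["O"] * len(tokens)
    for start, end in spans:
        inside = False  # have we already emitted "B" for this span?
        for i, (text, offset) in enumerate(tokens):
            if start <= offset < end:
                labels[i] = "I" if inside else "B"
                inside = True
    return labels

# "CT scan" (characters 4-11) is annotated as a single EVENT span
tokens = [("The", 0), ("CT", 4), ("scan", 7), ("was", 12), ("clear", 16)]
labels = spans_to_bio(tokens, [(4, 11)])  # ["O", "B", "I", "O", "O"]
```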
Basic lexical features computed for each token are as follows: lowercase form of the token; uppercase and lowercase flags; prefix and suffix; lemmatized form; shape; punctuation flag; and stop word flag. Syntactic features are coarse- and fine-grained part-of-speech tags. We used spaCy for tokenization and basic features. In addition, we used the Unified Medical Language System (UMLS) ontology (Bodenreider, 2004) via MetaMap to capture semantic concepts and use them as features. We limited the types to those indicative of clinical events (diagnostic procedure, disease or syndrome, and therapeutic procedure).
We also include regular expression-based features to capture more complicated and specialized token properties (summarized in Table 1). While the more generalized features we used (e.g. shape and suffix) capture some of the same information, this approach prioritizes likely generalizations and avoids over-fitting to specific cases. For instance, it allows the algorithm to generalize "Summer 2010" as "[Season] [Year]" instead of a more literal sequence.
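The regex feature idea can be illustrated with a small sketch. The patterns below are hypothetical examples for illustration only; the system's full pattern inventory is summarized in Table 1.

```python
import re

# Illustrative patterns; each yields one boolean feature per token.
REGEX_FEATURES = {
    "is_year":   re.compile(r"^(19|20)\d{2}$"),
    "is_season": re.compile(r"^(spring|summer|fall|autumn|winter)$", re.I),
    "is_time":   re.compile(r"^\d{1,2}:\d{2}$"),
}

def regex_features(token):
    # e.g. "Summer" -> {"is_year": False, "is_season": True, "is_time": False}
    return {name: bool(pat.match(token)) for name, pat in REGEX_FEATURES.items()}
```

With these features, "Summer 2010" is seen by the classifier as a [Season] token followed by a [Year] token rather than as two literal word forms.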
We use distributional features for generalization. We construct Brown clusters (Brown et al., 1992) on the text with fifty clusters. The binary representation of each token's cluster is a feature. We also use word embeddings trained with the Word2Vec model (Mikolov et al., 2013). For each token's feature set, we also include the features from the ±1 adjacent tokens.
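The per-token feature set with the ±1 context window can be sketched as below. This is a simplified illustration (a subset of the features listed above, without the spaCy, UMLS, or distributional features), with hypothetical function names.

```python
def token_features(tokens, i):
    # A subset of the basic lexical features described above.
    tok = tokens[i]
    return {
        "lower": tok.lower(),
        "is_upper": tok.isupper(),
        "prefix3": tok[:3],
        "suffix3": tok[-3:],
        "is_punct": all(not c.isalnum() for c in tok),
    }

def window_features(tokens, i, window=1):
    # Merge the neighbours' features, prefixed with their signed offset,
    # so the CRF sees e.g. "-1:lower" and "+1:lower" alongside "lower".
    feats = token_features(tokens, i)
    for d in range(-window, window + 1):
        if d == 0 or not (0 <= i + d < len(tokens)):
            continue
        for name, value in token_features(tokens, i + d).items():
            feats[f"{d:+d}:{name}"] = value
    return feats

feats = window_features(["seen", "in", "2010"], 1)
```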

Attribute Extraction
We treat the extraction of the attributes of EVENT and TIMEX3 as a classification problem. Our system trains a CRF model for each attribute, with the labels of each model corresponding to the attribute values and the same features used in span extraction. An expanded window of ±3 tokens is used for this task. Our system also treats DOCTIMEREL (the EVENT's temporal relation to the document time) as an attribute extraction task.

Narrative Containers
Our approach trains gradient boosted trees (Friedman, 2001) on candidate relation pairs and uses this model to predict relations. Our system uses XGBoost (Chen and Guestrin, 2016) for this task.
Clinical TempEval 2017 only considers temporal links (TLINK) with a type of CONTAINS; other types of TLINKs are not evaluated due to lower inter-annotator agreements. Our system uses TLINK type labels when the relation exists, and a null label when the candidate relation does not represent an actual relation. We note that our approach extracts all relation types. Our system uses both entity features (describing each relation endpoint) and relation features (describing the relationship between the source and target).
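The labeling of candidate pairs can be sketched as follows, with a null label assigned to candidates that do not correspond to a gold relation. The function name and data shapes are hypothetical.

```python
def label_candidates(candidates, gold_tlinks):
    """Assign a training label to each candidate entity pair.

    candidates:  list of (source, target) entity-id pairs
    gold_tlinks: {(source, target): tlink_type} from the gold annotations;
                 any candidate not in this map gets the null label.
    """
    return [gold_tlinks.get(pair, "null") for pair in candidates]

labels = label_candidates(
    [("e1", "e2"), ("e2", "e1")],
    {("e1", "e2"): "CONTAINS"},
)  # ["CONTAINS", "null"]
```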
Entity features include the entity type, entity attribute values, and the case-folded text value. Additionally, we use each token and related features (e.g. suffix) contained within the entity as features. We apply Semantic Role Labeling (SRL) to the sentence containing the entity, which identifies semantic predicates in the sentence per PropBank guidelines (Palmer et al., 2005). If the entity text is found in a semantic predicate, we use the argument label as a feature for the model. We used the SENNA implementation (http://ml.nec-labs.com/senna/) for SRL tagging.
Relation features capture information about the relationship between two entities. Basic relation features include the character distance between the entities and the pair of entity types. Syntactic features capture the path between the entities along the constituency and dependency trees. Our system uses the spaCy toolkit for dependency parsing. We derive n-gram segments of the path, the full path, and the path length, and use them as features.
We limit candidate relations to permutations of entities belonging to the same sentence. This approach precludes relations that cross sentence boundaries, but limits the extent of negative training samples.
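Candidate generation as described above amounts to taking ordered pairs of entities within each sentence. A minimal sketch, assuming entities are already grouped by sentence:

```python
from itertools import permutations

def candidate_pairs(sentences):
    """Generate ordered (source, target) entity pairs within each sentence.

    sentences: list of lists of entity ids, one list per sentence.
    Cross-sentence pairs are never produced, which bounds the number
    of negative (null-label) training candidates.
    """
    pairs = []
    for entities in sentences:
        pairs.extend(permutations(entities, 2))
    return pairs

# Two sentences: one with entities e1..e3, one with e4 and e5
pairs = candidate_pairs([["e1", "e2", "e3"], ["e4", "e5"]])
# 3*2 + 2*1 = 8 ordered pairs; (e1, e4) is never a candidate
```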

Domain Adaptation
Our system splits the phase 2 text ("train10") into a dev set and a test set. A grid search is performed for span, property, and relation extraction over the applicable hyperparameters. Text from the source domain is used for training, and the dev set from the target domain is used for evaluation. The test set is used after the grid search to verify that the procedure did not overfit hyperparameters.
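The cross-domain hyperparameter search described above can be sketched as an exhaustive grid search that trains on the source domain and scores on the target-domain dev split. The `train_fn` and `score_fn` callables are hypothetical placeholders for the system's actual training and evaluation routines.

```python
from itertools import product

def grid_search(train_fn, score_fn, source_data, target_dev, grid):
    """Pick the hyperparameters that score best on the target-domain
    dev set after training on the source domain.

    grid: {param_name: [candidate values, ...]}
    """
    best_score, best_params = float("-inf"), None
    names = sorted(grid)
    for values in product(*(grid[n] for n in names)):
        params = dict(zip(names, values))
        model = train_fn(source_data, **params)
        score = score_fn(model, target_dev)
        if score > best_score:
            best_score, best_params = score, params
    return best_params, best_score
```

A held-out test split (as described above) would then be scored once with the selected parameters to check for hyperparameter overfitting.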

Experimental Setup
In phase 1, we train our system on all available annotations from the source domain. In phase 2, we train our system on all available data from the source domain and the "train10" dataset from the target domain.
Baselines The baselines are two rule-based systems (Bethard et al., 2015) that the shared task provides along with the corpus. The MEMORIZE baseline, which is the baseline for all tasks except for narrative containers, memorizes the EVENT and TIMEX3 mentions and attributes from the training data. It then uses the memorized model to extract temporal information from new data. For narrative containers, the CLOSEST baseline predicts a TLINK relation with type CONTAINS between every TIMEX3 annotation and its closest EVENT.
Furthermore, we compare our results against the other submissions to Clinical TempEval 2017. We report the median value for each metric, as well as indicators when our system achieves either the top result (†) or the second-highest result (‡). Only the systems that submitted values for a particular task are considered; systems reported as p = 0.00, r = 1.00, and F1 = 0.00 are ignored.
Evaluation metrics Clinical TempEval 2017 evaluates thirteen tasks. Each task reports the precision, recall, and F1-score of the submitted results as compared to a human-annotated and adjudicated ground truth. The following tasks are not reported in this paper for brevity: "All spans & all properties", "All spans only", "Time span & all properties", and "Event span & all properties".

Results and discussion
Our system outperformed other participating systems, receiving best or second best results for extracting TIMEX3 spans, EVENT spans, and CONTAINS relations. Our domain adaptation procedure generally improved results, although it reduced the results for CONTAINS relations. And although we received top scores, we fell short of the single-domain performance achieved in Clinical TempEval 2016.

Table 2 shows the results for TIMEX3 and EVENT span extraction. Our system achieved the top F1 score for TIMEX3 spans and the second-highest F1 score for EVENT spans in both phases. Furthermore, our system met or exceeded the median and the MEMORIZE baseline in all but one metric (TIMEX3 precision), where it instead made significant gains in recall.

Table 3 shows the results for TIMEX3 and EVENT attribute extraction. While our system performs well on some of these categories, on others it underperforms the median results (e.g. EVENT Modality and EVENT Polarity).

Our system performed well on CONTAINS relations, but only achieved median results on DOCTIMEREL relations (see Table 4). In phase 1, our system achieved the top results for CONTAINS precision and F1. Our domain adaptation procedure resulted in a drop in recall for CONTAINS relations; we suspect this is due to overfitting the model to the sample data. We suspect that including more contextual or semantic features would improve the performance of attribute extraction (including DOCTIMEREL).

Error Analysis
We conducted an unsupervised domain adaptation run against the "train10" dataset to get an idea of failure cases. (We could not use the full target-domain test set because these data are not available.) One issue with TIMEX3 extraction is previously unseen or atypical date formats, for instance "12Jun2013" (no hyphens). One way to resolve this issue could be to use a more generalized library for extracting time expressions (e.g. HeidelTime), but even this library does not extract the example shown above. Furthermore, it would not generalize to new and otherwise unknown formats. The supervised training subset could be used in each domain to identify these kinds of conventions, but this is labor-intensive and prone to error.
Another issue is inconsistency in TIMEX3 annotation conventions (e.g. sometimes annotating a date and time separately and sometimes jointly). This complicates the model and leads to otherwise inexplicable annotation absences.
One example of an EVENT extraction failure is the false positive of "Cancer" in the phrase "Cancer Research Hospital". An approach to resolve this would be to use named entity recognition features, or to treat named entities as chunks annotated using a different technique. False positive EVENTs were common in certain sections of the notes (e.g. ongoing care; suggested interventions), indicating that document segmentation by section could be useful. This would only work in a supervised environment, unless domain sections have a great degree of overlap and can be mapped to one another. TLINK error cases include the known limitation that only intra-sentence relations are considered. Other false negatives seemed to be due to domain-specific language (e.g. "temozolomide"), suggesting that lexical features are over-weighted, or that the syntactic and semantic features we use are inadequate.

Conclusions
The results of Clinical TempEval 2017 show that there is still room to explore cross-domain temporal information extraction. We presented a system for both unsupervised and supervised temporal domain adaptation. It performed among the best of the participating teams, receiving best or second best scores in TIMEX3 span, EVENT span, and CONTAINS relation extraction. All teams fell short of meeting the top results for the source domain. Given the modest improvements in phase 2, future work in this area could focus on techniques for using a small number of annotations to tune a system to other domains.