UTH-CCB: The Participation of the SemEval 2015 Challenge – Task 14

This paper describes the system developed by the University of Texas Health Science Center at Houston (UTHealth) for the 2015 SemEval shared task on “Analysis of Clinical Text” (Task 14). We participated in both sub-tasks: Task 1, “Disorder Identification”, which aims to detect disorder entities and encode them to UMLS (Unified Medical Language System) CUIs (Concept Unique Identifiers), and Task 2, “Disorder Slot Filling”, which aims to identify normalized values for the modifiers of disorders. For Task 1, we developed an ensemble approach that combined machine learning-based named entity recognition classifiers with MetaMap, an existing symbolic biomedical NLP system, to recognize disorder entities, and we used a general Vector Space Model-based approach to encode disorders to UMLS CUIs. To identify modifiers of disorders (Task 2), we developed Support Vector Machine-based classifiers for each type of modifier, exploring various types of features. Our system was ranked 3rd for Task 1 and 1st for Task 2 (both 2A and 2B), demonstrating the effectiveness of machine learning-based approaches for extracting clinical entities and their modifiers from clinical narratives.


Introduction
Natural language processing (NLP) plays a critical role in unlocking important patient information from narrative clinical texts to support clinical applications such as decision support systems and translational research. An important task in clinical NLP research is the extraction of clinical concepts such as diseases and treatments. Many clinical NLP systems, such as MedLEE (Friedman et al., 1994), MetaMap (Aronson and Lang, 2010) and cTAKES (Savova et al., 2010), have been developed to extract these concepts from text.
A number of shared tasks on clinical concept extraction have been organized by different entities, including i2b2 (The Center for Informatics for Integrating Biology and the Bedside), the ShARe/CLEF eHealth Evaluation Lab, and SemEval (International Workshop on Semantic Evaluation) (Kelly et al., 2014; Pradhan et al., 2014; Suominen et al., 2013; Uzuner et al., 2011). These challenges have greatly promoted clinical NLP research by building benchmark datasets and encouraging innovative methods. The 2015 SemEval Shared Task 14, entitled "Analysis of Clinical Text", is to identify disorders and their modifiers in clinical text, and is an extension of the SemEval-2014 challenge. The 2015 SemEval challenge consists of two subtasks: Task 1 (disorder recognition), where disorder entities need to be detected and normalized to UMLS CUIs, and Task 2 (disorder slot filling), where the normalized values for nine types of modifiers of disorders are to be identified. Task 2 is further divided into two subtasks: 1) Task 2A, identifying modifiers based on gold standard disorders; and 2) Task 2B, identifying modifiers based on disorders recognized by our system, an end-to-end evaluation. In this paper, we describe our approaches and results for both tasks.

Datasets
For this shared task, the organizers prepared three datasets: 1) a training set of 298 clinical documents, 2) a development set of 133 documents, and 3) a test set of 100 documents. We developed our models using the training set and optimized parameters using the development set. For final submissions on the test set, we combined the training and development sets to build the machine learning classifiers.

Task 1 - Disorder Identification
Disorder identification consists of two subtasks: 1) recognizing disorder entities, and 2) encoding the recognized disorder entities to concept IDs (CUIs) in UMLS (limited to SNOMED-CT). We describe our approaches for both steps below.
Disorder Entity Recognition: The disorder recognition task is a typical named entity recognition (NER) task. We developed two machine learning-based NER models: Conditional Random Fields (CRFs) (Lafferty et al., 2001) and Structural Support Vector Machines (SSVMs). The CRFsuite package (Okazaki, 2007) and SVM^hmm * were used to implement the CRFs and SSVMs, respectively. In addition, we developed hybrid models that combine the two machine learning models with an existing symbolic biomedical NLP system, MetaMap. The hybrid systems adopt two ensemble learning strategies, ensemble_ML and ensemble_MV, which we originally developed for our participation in SemEval-2014 (Zhang et al., 2014). The ensemble_MV approach uses majority voting to combine the three systems. The ensemble_ML approach trains an SVM classifier to combine the predictions from the three systems.
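As an illustration of the majority-voting idea, the following minimal sketch keeps any span proposed by at least two of the three systems. It assumes that votes are counted over exact character spans; the paper does not state the voting granularity, and the function name, the `min_votes` parameter, and the toy spans are hypothetical.

```python
from collections import Counter


def majority_vote(system_outputs, min_votes=2):
    """Keep every span proposed by at least `min_votes` systems.

    Assumption: a "vote" is an exact (start, end) span match.
    """
    votes = Counter()
    for spans in system_outputs:
        votes.update(set(spans))  # each system votes at most once per span
    return sorted(span for span, n in votes.items() if n >= min_votes)


# Toy character-offset spans standing in for the three systems' outputs.
crf_spans = [(10, 18), (40, 52)]
ssvm_spans = [(10, 18), (60, 66)]
metamap_spans = [(40, 52), (60, 66), (80, 85)]

merged = majority_vote([crf_spans, ssvm_spans, metamap_spans])
print(merged)
```

With three voters and a threshold of two, a span survives only when at least two systems agree on it, so a span proposed by a single system (here `(80, 85)`) is dropped.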
We adopted the features engineered for our previous participation in SemEval 2014 (Zhang et al., 2014), including word-level features, such as bag-of-words; linguistic features; and discourse features, such as the section name in the clinical note and the type of the note (e.g. 'DISCHARGE_SUMMARY'). In this challenge, we further explored deep neural network (DNN) based word embeddings. We obtained word embeddings by training a deep neural network (Collobert et al., 2011) on the unlabeled MIMIC II corpus (about 3G clinical notes) provided by the SemEval organizers.
Disorder Entity Encoding: We adopted the same Vector Space Model (VSM) approach developed for SemEval-2014 to encode disorders to UMLS/SNOMED-CT CUIs (Zhang et al., 2014). This is a general approach for encoding clinical entities to UMLS CUIs that does not use the training samples provided by this task.
* http://www.cs.cornell.edu/people/tj/svm_light/svm_hmm.html
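To make the vector-space idea behind the encoding step concrete, here is a minimal sketch of a TF-IDF cosine-similarity lookup over a toy term dictionary. It is not the authors' implementation: the weighting scheme, the whitespace tokenization, and the concept identifiers (`CUI_A`, `CUI_B`, `CUI_C`, which are placeholders rather than real UMLS CUIs) are all assumptions made for illustration.

```python
import math
from collections import Counter


def build_index(concept_terms):
    """Build TF-IDF vectors for a {concept_id: term string} dictionary."""
    ids = list(concept_terms)
    docs = [concept_terms[i].lower().split() for i in ids]
    n = len(docs)
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed inverse document frequency (an assumed weighting choice).
    idf = {t: math.log((1 + n) / (1 + d)) + 1.0 for t, d in df.items()}
    vecs = [{t: c * idf[t] for t, c in Counter(doc).items()} for doc in docs]
    return ids, vecs, idf


def cosine(u, v):
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def encode(mention, index):
    """Return (best concept id, score), or (None, 0.0) if nothing matches."""
    ids, vecs, idf = index
    counts = Counter(mention.lower().split())
    query = {t: c * idf[t] for t, c in counts.items() if t in idf}
    scores = [cosine(query, v) for v in vecs]
    best = max(range(len(ids)), key=scores.__getitem__)
    return (ids[best], scores[best]) if scores[best] > 0 else (None, 0.0)


# Placeholder identifiers, not real UMLS CUIs.
index = build_index({
    "CUI_A": "acute respiratory distress",
    "CUI_B": "chest pain",
    "CUI_C": "respiratory failure",
})
print(encode("respiratory distress", index))
```

Because the lookup relies only on the dictionary terms, it needs no task-specific training data, which matches the "general approach" described above.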

Task 2 - Disorder Slot Filling
The task is to identify eight types of disorder modifiers: negation indicator (NI), subject class (SC), uncertainty indicator (UI), course class (CC), severity class (SV), conditional class (CO), generic class (GC) and body location (BL). For each of the first seven types of modifiers, we built an individual SVM-based classifier. We used the SVM implementation in the LibShortText package (Yu et al., 2013), an open source library for large-scale short-text classification.
We systematically extracted the following features to train the SVM classifiers:
1) N-gram features. All unigrams and bigrams in the sentence were extracted as features.
2) Context words with position and direction (left or right) information. We illustrate these features with the following sentence: "patient said he has no acute distress before". There is one disorder ('distress') in this sentence.
Group-1 features: context words within a window size of 1 of the disorder: ['acute_L1', 'before_R1']
Group-2 features: context words within a window size of 4 of the disorder: ['he_L4', 'has_L4', 'no_L4', 'acute_L4', 'before_R4']
Group-3 features: context words within a window size range of 5 to 8: ['patient_L8', 'said_L8']
3) Lexicon features, including word lists for negation, pseudo-negation, conjunction, condition, uncertainty, subject, severity, and course.
4) Dependency relation features. We used the Stanford Parser to generate the dependency relations of a sentence. We only considered dependency relations in which the target disorder is the governor or the dependent, and extracted all of these syntactic relations as features.
5) Section names, e.g. 'Family History'.
The final set of features was optimized for each modifier based on cross-validation performance on the training set.
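The context-word features in item 2 can be sketched as follows. This is a minimal reconstruction from the worked example in the text, assuming whitespace tokenization and a disorder given as a token range; the function name and the group keys are hypothetical.

```python
def context_features(tokens, start, end):
    """Context-word features for the disorder at tokens[start:end].

    Each feature is "<word>_<L|R><window>", where L/R is the side of the
    disorder the word is on and the window is 1, 4, or 8.
    """
    groups = {"group1": [], "group2": [], "group3": []}
    for i, tok in enumerate(tokens):
        if start <= i < end:
            continue  # skip the disorder tokens themselves
        if i < start:
            dist, side = start - i, "L"
        else:
            dist, side = i - end + 1, "R"
        if dist == 1:
            groups["group1"].append(f"{tok}_{side}1")
        if dist <= 4:
            groups["group2"].append(f"{tok}_{side}4")
        elif dist <= 8:
            groups["group3"].append(f"{tok}_{side}8")
    return groups


tokens = "patient said he has no acute distress before".split()
feats = context_features(tokens, 6, 7)  # 'distress' is token 6
print(feats["group1"])  # ['acute_L1', 'before_R1']
print(feats["group2"])  # ['he_L4', 'has_L4', 'no_L4', 'acute_L4', 'before_R4']
print(feats["group3"])  # ['patient_L8', 'said_L8']
```

Note that, as in the example in the text, a word adjacent to the disorder appears in both Group-1 and Group-2, while Group-3 covers only distances 5 to 8.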
The body location modifiers require specification of the text spans and the corresponding UMLS CUIs. Therefore, we first built an NER system for body location entities and then applied an encoding approach similar to the one used in the disorder identification task. We also constructed a comprehensive body location dictionary from UMLS and WordNet (Miller, 1995). The relative positions of the target disorder and the candidate body location were extracted as features (e.g., whether the body location is part of the target disorder). For body location encoding, we extended the VSM-based lookup method by adding a regression-based re-ranking layer trained on the training corpus.

Submissions and Evaluation
We combined the training and development datasets to build our final models for all tasks. Since each task allowed three submissions, we tried a different strategy for each run. For disorder entity recognition in Task 1, run 0 and run 1 used the ensemble_MV method to obtain a better F1, while run 2 used the ensemble_ML method to obtain higher precision. For Task 2A and Task 2B, run 0 and run 2 used two sets of parameters optimized for weighted performance, while run 1 used a set of parameters optimized for un-weighted performance. For body location recognition, only run 2 of Task 2A used the SSVM model; all other runs used CRF models for better prediction.
The evaluation metrics for this task include the F1 score (strict vs. relaxed), un-weighted accuracy, and weighted accuracy, as defined by the organizers. For more details, please refer to the task description paper or the task website †.
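The difference between strict and relaxed scoring can be illustrated with a small span-matching sketch. It assumes that "strict" means an exact span match and "relaxed" means any character overlap with a gold span, and it ignores CUI correctness; the official definitions are those given by the organizers, and the function names and toy spans here are hypothetical.

```python
def overlaps(a, b):
    """True if half-open spans a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def span_prf(gold, pred, relaxed=False):
    """Precision, recall, and F1 over spans (assumed matching criteria)."""
    match = overlaps if relaxed else (lambda a, b: a == b)
    matched_pred = sum(any(match(p, g) for g in gold) for p in pred)
    matched_gold = sum(any(match(g, p) for p in pred) for g in gold)
    precision = matched_pred / len(pred) if pred else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


gold = [(0, 5), (10, 20), (30, 35)]
pred = [(0, 5), (12, 20), (40, 45)]  # one exact, one partial, one spurious

strict = span_prf(gold, pred)
relaxed = span_prf(gold, pred, relaxed=True)
print(strict, relaxed)
```

In this toy case the partially overlapping prediction `(12, 20)` counts as correct only under the relaxed criterion, which is why relaxed scores are never lower than strict ones.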

Results and Discussion
For Task 1, the main evaluation score was strict F1. Table 1 shows the overall performance of the three runs of our system in Task 1 as reported by the organizers, where 'P', 'R', and 'F' denote precision, recall, and F1 score, respectively. Our best run for Task 1 ranked 3rd among all participants. Our disorder entity recognition step achieved the highest F1, 0.927, under the 'relaxed' criterion (see Table 2). Our disorder encoding did not perform as well as that of the other top-performing teams in Task 1, because we used a general encoding module that was not trained on the CUI annotations in the training and development sets. As reported by the organizers, our system achieved the best performance in Task 2, both for Task 2A (slot filling given gold-standard disorder spans) and for Task 2B (an end-to-end system for disorder span identification and slot filling). Table 2 shows the overall performance of our systems in Task 2A and Task 2B, where 'F', 'A', and 'WA' denote the 'relaxed' F1 score for disorder entity recognition, the overall un-weighted accuracy, and the weighted accuracy, respectively.
† http://alt.qcri.org/semeval2015/task14/index.php

Conclusion
In this paper, we described our participation in the SemEval-2015 challenge, Task 14 "Analysis of Clinical Text". Our system was among the top-ranked systems (ranked 3rd for Task 1 and 1st for both Task 2A and Task 2B). These results show that machine learning-based methods, integrated with medical domain-specific features, can reasonably identify disorders and their associated modifiers in clinical narratives.
1R01GM102282, and CPRIT R1307. The first author is partially supported by NSFC 61203378.