Temporal information extraction from clinical text

In this paper, we present a method for temporal relation extraction from clinical narratives in French and in English. We experiment on two comparable corpora, the MERLOT corpus and the THYME corpus, and show that a common approach can be used for both languages.


Introduction
Temporal information extraction from electronic health records has become a subject of interest, driven by the need for medical staff to access medical information from a temporal perspective (Hirsch et al., 2015). Diagnosis and treatment could indeed be improved by reviewing a synthetic view of the patient history, in the order in which medical events occurred. However, most of this temporal information remains locked within unstructured texts and can only be accessed through NLP methods.
In this paper, we focus on the extraction of temporal relations between medical events (EVENT), temporal expressions (TIMEX3) and document creation time (DCT). More specifically, we address intra-sentence narrative container relation identification between medical events and/or temporal expressions (CR task, for Container Relation) and DCT relation identification between medical events and documents (DR task, for Document creation time Relation).
In the DR task, the objective is to temporally locate EVENT entities according to the Document Creation Time of the document in which they occur. Possible tags are Before, Before-Overlap, Overlap and After.
In the CR task, the objective is to identify temporal inclusion relations between pairs of entities (EVENT and/or TIMEX3) formalized as narrative container relations following Pustejovsky and Stubbs (2011).
In this context, we build on Tourille et al. (2016) and show how this type of model can be applied in the same way to extract temporal relations from clinical texts in two languages. More specifically, we experimented on two corpora: the THYME corpus (Styler IV et al., 2014), a corpus of de-identified clinical notes in English from the Mayo Clinic, and the MERLOT corpus (Campillos et al., to appear), a comparable corpus in French from a group of French hospitals.

Related Work
Temporal information extraction from clinical texts has been the topic of several shared tasks over the past few years.
The i2b2 Challenge for Clinical Records (Sun et al., 2013) addressed the extraction of events, temporal expressions and temporal relations. Participants were challenged to detect clinically relevant events and time expressions and to link them with a temporal relation.
SemEval has offered the Clinical TempEval task on this topic for the past two years (Bethard et al., 2015; Bethard et al., 2016). Its first track focused on extracting clinical events and temporal expressions, while its second track included the DR and CR tasks. The participating teams implemented different approaches, among which SVM classifiers (Lee et al., 2016; Tourille et al., 2016; Cohan et al., 2016; AAl Abdulsalam et al., 2016) and CRF approaches (Caselli and Morante, 2016; AAl Abdulsalam et al., 2016) for the DR task, and CRF, convolutional neural networks (Chikka, 2016) and SVM classifiers (Tourille et al., 2016; Lee et al., 2016; AAl Abdulsalam et al., 2016) for the CR task.

Corpus Presentation
The MERLOT corpus is composed of clinical documents written in French from a Gastroenterology, Hepatology and Nutrition department. These documents have been de-identified and annotated with entities, temporal expressions and relations (Deléger et al., 2014). The THYME corpus is a collection of clinical texts written in English from a cancer department, released during the Clinical TempEval campaigns. This corpus contains documents annotated with medical events and temporal expressions as well as container relations.
The definition of a medical event is slightly different in each corpus. According to the annotation guidelines of the THYME corpus, a medical event is anything that could be of interest on the patient's clinical timeline, for instance a medical procedure, a disease or a diagnosis. Each event is given five attributes: Contextual Modality (Actual, Hypothetical, Hedged or Generic), Degree (Most, Little or N/A), Polarity (Pos or Neg), Type (Aspectual, Evidential or N/A) and DocTimeRel (Before, Before-Overlap, Overlap or After). Each temporal expression is given a Class attribute: Date, Time, Duration, Quantifier, Pre-PostExp or Set.
For the French corpus, medical events are described according to UMLS® (Unified Medical Language System) Semantic Groups and Semantic Types. Several categories are considered as events: disorder, sign or symptom, medical procedure, chemical and drugs, concept or idea and biological process or function. Events carry only one DocTime attribute (Before, Before-Overlap, Overlap or After). As in the THYME corpus, temporal expressions within the French corpus are given a class among Date, Time, Duration or Frequency.
Narrative containers (Pustejovsky and Stubbs, 2011) can be understood as temporal buckets in which several events may be included. These containers are anchored by temporal expressions, medical events or other concepts. Styler IV et al. (2014) argue that the use of narrative containers instead of classical temporal relations (Allen, 1983) yields better annotation while keeping most of the useful temporal information intact. The concept of narrative container is illustrated in the accompanying figure.
The French corpus does not explicitly cover container relations. However, we consider that During relations are equivalent to Contains relations, and that Reveals and Conducted relations imply Contains relations. Furthermore, the corpus does not cover inter-sentence relations (relations that can spread over multiple sentences). In this paper, we focus on intra-sentence container relations (relations whose two arguments occur within the same sentence) and refer to them as CONTAINS relations in the rest of the paper.
Descriptive statistics of the two corpora are provided in Table 1.

Model Description
In our model, we consider both DR and CR tasks as supervised classification problems. Concerning the DR task, each medical event is classified into one category among Before, Before-Overlap, Overlap and After. The number of document creation time relations per class for both corpora is presented in Table 3.
For the CR task, we are dealing with a binary classification problem for each pair of EVENT and/or TIMEX3 entities. However, considering all pairs of entities within a sentence would give us an unbalanced data set with a very large number of negative examples. Thus, to reduce the number of candidate pairs, we transformed the 2-category problem (contains or no-relation) into a 3-category problem (contains, is-contained or no-relation). In other words, instead of considering all permutations of entities within a sentence, we consider all combinations of entities from left to right, changing the contains relations into is-contained relations when necessary. Moreover, this transformation solves the problem of possible contradictory predictions. If we were to consider all ordered pairs of entities within a sentence, the classifier could predict that two entities contain each other (A contains B and B contains A). By considering all combinations instead of all permutations, this problem can never occur during the prediction phase. However, our system does not handle temporal closure, and conflicts could still appear at sentence level (X contains Y, X is contained by Z, Y contains Z).
Furthermore, some entities are more likely than others to be the anchor of a narrative container. For instance, temporal expressions are, by nature, potential anchors and may contain other temporal expressions and/or medical events. This is also the case for some medical events. For instance, a surgical operation may contain other events such as bleeding or suturing, whereas the latter two will not contain other events in most cases.
Following this observation, we have built a model to classify entities as being potential container anchors or not (CONTAINER classifier). This classifier obtains high performance in cross-validation (Table 4a). We use its output as a feature for our CONTAINS relation classifier.
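The left-to-right candidate construction described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `build_candidates`, the `(entity_id, start_offset)` representation and the toy sentence are assumptions introduced here.

```python
from itertools import combinations

def build_candidates(entities, gold_contains):
    """Build one labelled candidate per unordered pair of entities.

    entities: list of (entity_id, start_offset) for a single sentence
              (hypothetical representation).
    gold_contains: set of (container_id, contained_id) gold relations.
    """
    # Order entities by position so that each pair is read left to right.
    ordered = sorted(entities, key=lambda e: e[1])
    candidates = []
    for (left, _), (right, _) in combinations(ordered, 2):
        if (left, right) in gold_contains:
            label = "contains"
        elif (right, left) in gold_contains:
            # The container is on the right: flip the label, not the pair.
            label = "is-contained"
        else:
            label = "no-relation"
        candidates.append((left, right, label))
    return candidates

# Toy sentence: a surgery that contains both bleeding and suturing.
sentence = [("bleeding", 0), ("surgery", 20), ("suturing", 40)]
gold = {("surgery", "bleeding"), ("surgery", "suturing")}
pairs = build_candidates(sentence, gold)
```

With n entities, this yields n(n-1)/2 candidates instead of the n(n-1) ordered pairs, and a given pair can receive only one of the three labels, so A contains B and B contains A cannot both be predicted.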

Preprocessing and Feature Extraction
The THYME corpus has been preprocessed using cTAKES (Savova et al., 2010), an open-source natural language processing system for extraction of information from electronic health records. We extracted several features from the output of cTAKES: sentence boundaries, tokens, part-of-speech (PoS) tags, token types and semantic types of the entities that have been recognized by cTAKES and whose span overlaps with at least one EVENT entity of the THYME corpus.
Concerning the MERLOT corpus, no specific pipeline exists for French medical texts; we thus used the Stanford CoreNLP system (Manning et al., 2014) to segment and tokenize the text. We also extracted PoS tags. As the corpus already provides a type for each EVENT, there is no need to detect other medical information.
For both DR and CR tasks, we used a combination of structural, lexical and contextual features derived from the corpora and the preprocessing steps. These features are presented in Table 2.

Lexical Feature Representation
We implemented two strategies to represent the lexical features in both DR and CR tasks. In the first one, we used the plain forms of the different lexical attributes we mentioned in the previous section. In the second strategy, we substituted the lexical forms with word embeddings. For English, these embeddings have been computed on the Mimic 3 corpus (Saeed et al., 2011). Concerning the French language, we used the whole collection of raw clinical documents from which the MERLOT corpus has been built. In both cases, we computed the word embeddings (see footnote 1) using the word2vec (Mikolov et al., 2013) implementation of gensim (Řehůřek and Sojka, 2010). We used the max of the vectors for multi-word units. Lexical contexts are thus represented by 200-dimensional vectors. When several contexts are considered, e.g. right and left, several vectors are used.
Table 4a: Cross-validation results over the training corpus for all tasks, for MERLOT (fr) and THYME. We report F1-measure for CONTAINER and CONTAINS tasks and accuracy for DCT task. We also report standard deviation for all models.
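The pooling of word vectors for multi-word units can be sketched as below. The text only says that the max of the vectors is used; reading this as an element-wise maximum is an assumption, as are the function names, the toy 3-dimensional vectors (the paper uses 200 dimensions) and the zero-vector fallback for out-of-vocabulary tokens.

```python
def max_pool(vectors):
    """Element-wise maximum over a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    if any(len(v) != dim for v in vectors):
        raise ValueError("all vectors must have the same dimension")
    # zip(*vectors) groups the i-th components of every vector together.
    return [max(component) for component in zip(*vectors)]

def represent(tokens, embeddings, dim):
    """One fixed-size vector for a (multi-)word unit.

    Out-of-vocabulary tokens are skipped; a zero vector is returned when no
    token is known (an assumption, the source does not say how this is handled).
    """
    known = [embeddings[t] for t in tokens if t in embeddings]
    return max_pool(known) if known else [0.0] * dim

# Toy embedding table standing in for a trained word2vec model.
toy = {"chest": [0.1, -0.4, 0.7], "pain": [0.3, 0.2, -0.5]}
unit = represent(["chest", "pain"], toy, dim=3)
```

Whatever the number of tokens in a unit, the result keeps the dimension of a single word vector, which is what allows each lexical context to be represented by one fixed-size vector.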

Experimentation
We randomly divided each corpus into training and test sets following an 80/20 ratio. We performed hyper-parameter optimization using a Tree-structured Parzen Estimator approach (Bergstra et al., 2011), as implemented in the hyperopt library (Bergstra et al., 2013), to select the hyper-parameter C of a linear Support Vector Machine, the lookup window around entities and the percentile of features to keep. For the latter, we used the ANOVA F-value as selection criterion. We used the SVM implementation provided in Scikit-learn (Pedregosa et al., 2011). In each case, we performed 5-fold cross-validation. For the CONTAINER classifier and the CONTAINS relation classifier, we used the F1-measure as the performance measure. For the DCT classifier, we used accuracy.
Footnote 1: Parameters used during the computation of the word embeddings: algorithm = CBOW; min-count = 5; vector size = 200; window = 10.
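The data partitioning described above (a random 80/20 split followed by 5-fold cross-validation on the training part) can be sketched with the standard library alone. The helper names, the fixed seed and the choice of splitting at document level are assumptions made for this illustration; the source does not state the unit of the split.

```python
import random

def train_test_split_ids(doc_ids, test_ratio=0.2, seed=0):
    """Shuffle document ids and hold out a fraction of them as the test set."""
    ids = sorted(doc_ids)               # sort first so the seed fully determines the split
    random.Random(seed).shuffle(ids)
    n_test = round(len(ids) * test_ratio)
    return ids[n_test:], ids[:n_test]   # (train, test)

def k_folds(train_ids, k=5):
    """Yield (fit_ids, validation_ids) for each of the k folds."""
    folds = [train_ids[i::k] for i in range(k)]
    for i in range(k):
        fit = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield fit, folds[i]

# Hypothetical corpus of 50 documents.
docs = [f"doc{i:03d}" for i in range(50)]
train, test = train_test_split_ids(docs)
splits = list(k_folds(train))
```

Each hyper-parameter configuration proposed by the optimizer would then be scored by averaging the validation metric over the five folds. In scikit-learn, `SelectPercentile(f_classif)` followed by `LinearSVC` would be one plausible way to realise the selection and classification steps; the source does not name the classes actually used.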

Results and Discussion
Cross-validation results are presented in Table 4a. Results for the DR and CR tasks are presented in Table 4b and Table 4c, respectively. For both tasks, we present a baseline performance. For the DR task, the baseline predicts the majority class (Overlap) for all EVENT entities. For the CR task, the baseline predicts that all EVENT entities are contained by the closest TIMEX3 entity within the sentence in which they occur.
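The two baselines can be sketched as follows. The entity representation, the use of token distance to define the closest TIMEX3 and the tie-breaking rule (the first TIMEX3 in the list wins) are assumptions introduced for this illustration and are not specified in the source.

```python
from collections import Counter

def majority_class_baseline(train_labels):
    """DR baseline: the label that is most frequent in the training data."""
    return Counter(train_labels).most_common(1)[0][0]

def closest_timex_baseline(entities):
    """CR baseline: each EVENT is contained by the nearest TIMEX3 of its sentence.

    entities: list of (entity_id, kind, token_index) for one sentence, where
              kind is "EVENT" or "TIMEX3" (hypothetical representation).
    Returns a list of (container_id, contained_id) predictions.
    """
    timexes = [e for e in entities if e[1] == "TIMEX3"]
    predictions = []
    for ent_id, kind, pos in entities:
        if kind != "EVENT" or not timexes:
            continue  # no prediction when the sentence has no TIMEX3
        nearest = min(timexes, key=lambda t: abs(t[2] - pos))
        predictions.append((nearest[0], ent_id))
    return predictions

labels = ["Overlap", "Before", "Overlap", "After", "Overlap"]
dr_prediction = majority_class_baseline(labels)

sentence = [("t1", "TIMEX3", 0), ("e1", "EVENT", 2),
            ("e2", "EVENT", 7), ("t2", "TIMEX3", 9)]
cr_predictions = closest_timex_baseline(sentence)
```

In this toy sentence, `e1` is attached to `t1` and `e2` to `t2`, since each event is two tokens away from one temporal expression and seven from the other.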
Concerning the DR task, there is a gap of 0.04 in performance between the French (0.83) and English (0.87) corpora. Results per category are not homogeneous in either case. For the MERLOT corpus, the score obtained for the category Overlap (0.90) is better than the scores obtained for Before-Overlap (0.69), Before (0.69) and After (0.73). For the THYME corpus, the performance for the category Before-Overlap (0.66) is clearly below the others, which are grouped around 0.85 (0.88 for Before, 0.84 for After and 0.89 for Overlap). This may be due to the distribution of categories in the corpora. Typically, the performance is lower for the categories with fewer training examples (Before-Overlap for the THYME corpus and categories other than Overlap for the MERLOT corpus).
Concerning the CR task, results are separated by a gap of 0.12 (0.65 for the MERLOT corpus and 0.53 for the THYME corpus). Results obtained for the THYME corpus are consistent with those obtained by Tourille et al. (2016) on the Clinical TempEval 2016 evaluation corpus. Recall increases in comparison to their results (from 0.436 to 0.47), but it remains the main point to improve.
More generally, the best results of the Clinical TempEval shared task were 0.843 (accuracy) for the DR task and 0.573 (F1-measure) for the CR task, which are comparable to our results (0.87 for the DR task and 0.53 for the CR task).
Table 4a also indicates that replacing lexical forms by word embeddings seems to have a negative impact on performance in every case.
As for the difference in performance between the two languages, several parameters can affect the results. First, the sizes of the corpora are not comparable. The THYME corpus is bigger and has more annotations than the MERLOT corpus. Second, the annotations are more formalized and refined in the MERLOT corpus. This difference can influence the performance, especially for the CR task. Third, the lack of specialized clinical resources for French can negatively influence the performance of all classifiers.
Concerning the quality of annotations, it has to be pointed out that inter-annotator agreement (IAA) for temporal relations is low to moderate: in MERLOT, IAA measured on a subset of the corpus is 0.55 for During relations, 0.32 for Conducted relations and 0.64 for Reveals relations. In THYME, IAA for Contains relations is 0.56. The inter-annotator agreement is comparable in both languages, and suggests that temporal relation extraction is a difficult task even for humans.

Conclusion and Perspectives
In this article, we have presented work on the extraction of temporal relations between medical events, temporal expressions and document creation time from clinical notes. This work, based on a feature engineering approach, obtained results that are competitive with the current state of the art and led to two main conclusions. First, the use of word embeddings in place of lexical features tends to degrade performance. Second, our feature engineering approach can be applied with comparable results to two different languages, English and French in our case.
To follow up on the first conclusion, we would like to test a more integrated approach for using embeddings, either by turning all features into embeddings as in Yang and Eisenstein (2015) or by adopting a neural network architecture as in Chikka (2016).