LIMSI-COT at SemEval-2017 Task 12: Neural Architecture for Temporal Information Extraction from Clinical Narratives

In this paper we present our participation in SemEval 2017 Task 12. We used a neural network based approach for entity and temporal relation extraction, and experimented with two domain adaptation strategies. We achieved competitive performance on both tasks.


Introduction
SemEval 2017 Task 12 offers 6 subtasks addressing medical event recognition and temporal reasoning in the clinical domain using the THYME corpus (Styler IV et al., 2014). Similarly to the two previous editions of the challenge (Bethard et al., 2015, 2016), the first group of subtasks concerns medical event (EVENT) and temporal expression (TIMEX3) extraction from raw text. In a second group of subtasks, participants are challenged to extract containment (CONTAINS) relations between EVENT and/or TIMEX3 entities, as well as Document Creation Time (DCT) relations between EVENT entities and the documents in which they are embedded. The novelty of the 2017 edition lies in the difference of domains between the train and test corpora. More details about the task and the definition of each subtask can be found in Bethard et al. (2017).

Methodology
The EVENT and TIMEX3 entity extraction subtasks can be seen as two sequence labeling problems where each token of a given sentence is assigned a label. Entities can span several tokens and we therefore used the IOB format (Inside, Outside, Beginning) for label representation: each token is at the beginning of an entity (B), inside an entity (I) or outside any entity (O). EVENT entities are characterized by a type attribute that we incorporated into our IOB scheme, resulting in 7 possible labels. Similarly, TIMEX3 entities are characterized by a class attribute, resulting in 13 possible labels.
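The conversion from entity spans to IOB labels can be sketched as follows. This is our own minimal illustration, not the authors' code; the attribute value `"N/A"` and the span representation are assumptions for the example.

```python
# Minimal sketch (not the authors' code): converting entity annotations
# into an IOB label sequence. `entities` maps (start, end) token-index
# spans to an attribute value (e.g. the EVENT type), end exclusive.

def to_iob(num_tokens, entities):
    """Return one IOB label per token, e.g. 'B-N/A', 'I-N/A', 'O'."""
    labels = ["O"] * num_tokens
    for (start, end), attr in entities.items():
        labels[start] = "B-" + attr          # first token of the entity
        for i in range(start + 1, end):
            labels[i] = "I-" + attr          # remaining tokens of the entity
    return labels

# "underwent a colon resection last": 'colon resection' is an EVENT
# (hypothetical type N/A) spanning tokens 2-3.
print(to_iob(5, {(2, 4): "N/A"}))
# ['O', 'O', 'B-N/A', 'I-N/A', 'O']
```

With one `B-`/`I-` pair per attribute value plus `O`, three EVENT types yield the 7 labels mentioned above, and six TIMEX3 classes yield 13.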
The container relation extraction task can be cast as a 3-class classification problem. For each combination E1-E2 of EVENT and/or TIMEX3 entities, taken from left to right, three cases are possible: E1 temporally contains E2, E1 is temporally contained by E2, or there is no relation between E1 and E2. Intra- and inter-sentence relation detection can be seen as two different tasks with specific features. Intra-sentence relations can benefit from intra-sentential clues such as adverbs (e.g. during) or pronouns (e.g. which) which are not available at the inter-sentence level. Furthermore, past work on the topic indicates that this differentiation improves overall performance (Tourille et al., 2016). We adopted this approach by building two separate classifiers, one for intra-sentence relations and one for inter-sentence relations.
Considering all combinations of entities within documents for inter-sentence relations would result in a very large training corpus with very few positive examples. To cope with this issue, we limit our experiments to inter-sentence relations that do not span more than three sentences. By doing so, we obtain a manageable training corpus size with less imbalanced classes while keeping good coverage.
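The candidate generation described above can be sketched as below. This is our illustration rather than the system's code, and the exact window interpretation (sentence distance strictly below three) is our assumption.

```python
from itertools import combinations

# Sketch of candidate pair generation for CONTAINS classification.
# Each entity is a pair (entity_id, sentence_index). Same-sentence pairs
# feed the intra-sentence classifier; other pairs are kept only when the
# two entities lie within a window of `max_sentences` sentences.

def candidate_pairs(entities, max_sentences=3):
    """Split left-to-right entity pairs into intra- and inter-sentence lists."""
    intra, inter = [], []
    for (e1, s1), (e2, s2) in combinations(entities, 2):
        if s1 == s2:
            intra.append((e1, e2))
        elif abs(s1 - s2) < max_sentences:   # drop distant pairs (mostly negatives)
            inter.append((e1, e2))
    return intra, inter

entities = [("A", 0), ("B", 0), ("C", 1), ("D", 5)]
print(candidate_pairs(entities))
# ([('A', 'B')], [('A', 'C'), ('B', 'C')])
```

Dropping distant pairs such as (A, D) is what keeps the class distribution manageable: almost all of those pairs would carry the "no relation" label.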

Corpus Preprocessing
We preprocessed the corpus using cTAKES 3.2.2 (Savova et al., 2010), an open-source natural language processing system for the extraction of information from electronic health records. We extracted sentence and token boundaries, as well as token types and the semantic types of the entities whose span overlaps with at least one gold standard EVENT entity of the THYME corpus. This information was added to the set of gold standard attributes available for EVENT entities in the corpus.
We also preprocessed the corpus using HeidelTime 2.2.1 (Strötgen and Gertz, 2015), a multilingual domain-sensitive temporal tagger, and used the results to further extend our feature set.

Entity Extraction
Our approach relies on Long Short-Term Memory networks (LSTMs) (Hochreiter and Schmidhuber, 1997). The architecture of our model is presented in Figure 1. For a given sequence of tokens, represented as vectors, we compute representations of the left and right contexts of the sequence at every token. These representations are computed with two LSTMs (forward and backward LSTMs in Figure 1). They are then concatenated and linearly projected onto an n-dimensional vector, where n is the number of categories. Finally, following Huang et al. (2015), we add a CRF layer to take the previous label into account during prediction. Following preliminary experiments, we built one specific classifier for each entity type (EVENT or TIMEX3).

Event Attribute and Document Creation Time Relation Extraction
We treated each EVENT attribute (ContextualModality, Degree, Polarity) extraction subtask as a supervised classification problem. We built a common architecture for all attributes based on a linear SVM, and used the same architecture for the DCT relation extraction subtask. We trained a separate classifier for each of the four subtasks, based on lexical, contextual and structural features extracted from the documents: the EVENT type attribute, the EVENT plain lexical form, the EVENT position within the document, the POS tags of the verbs within the right and left contexts of the considered entity, the EVENT POS tag, the type or class of the other entities present within the left and right contexts, and token unigrams and bigrams within a window around the entity.
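A few of the features above can be sketched as a sparse feature dictionary of the kind typically fed to a linear SVM. The feature names, the window size, and the relative-position encoding are our assumptions for illustration, not the authors' exact feature set.

```python
# Illustrative sketch of a feature dictionary for one EVENT occurrence.
# `tokens` and `pos_tags` cover the document; `idx` is the entity's
# token index; `window` bounds the unigram/bigram context.

def attribute_features(tokens, pos_tags, idx, window=2):
    feats = {
        "event_form": tokens[idx].lower(),      # plain lexical form
        "event_pos": pos_tags[idx],             # POS tag of the entity
        "position": idx / len(tokens),          # relative position in document
    }
    lo, hi = max(0, idx - window), min(len(tokens), idx + window + 1)
    context = [t.lower() for i, t in enumerate(tokens[lo:hi], lo) if i != idx]
    for tok in context:                         # context unigrams
        feats["uni=" + tok] = 1
    for a, b in zip(context, context[1:]):      # context bigrams (approximate:
        feats["bi=" + a + "_" + b] = 1          # they skip over the entity token)
    return feats

tokens = ["The", "patient", "underwent", "surgery", "yesterday"]
pos = ["DT", "NN", "VBD", "NN", "NN"]
print(attribute_features(tokens, pos, 3)["event_form"])
# surgery
```

Dictionaries like this are what a vectorizer turns into the sparse input of the linear SVM, one classifier per attribute and one for DCT.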

Temporal Relation Extraction
Similarly to our entity extraction approach, we built a system based on LSTMs for CONTAINS relation extraction. The architecture of our model is presented in Figure 2. For a given sequence of tokens between two entities (EVENT and/or TIMEX3), we compute a representation by scanning the sequence from left to right (forward LSTM in Figure 2). As LSTMs tend to be biased toward the most recent inputs, this model is biased toward the second entity of each pair processed by the network. To counteract this effect, we compute the reverse representation with an LSTM reading the sequence from right to left (backward LSTM in Figure 2). The two final states are then concatenated and linearly transformed into a 3-dimensional vector, one dimension per category (concatenation and projection in Figure 2). Finally, a softmax function is applied.
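The recency bias and the two-direction remedy can be illustrated with a toy recurrence (deliberately not a real LSTM): a state that decays over time is dominated by whichever end of the sequence was read last, so the forward and backward final states carry complementary information before being concatenated and passed through a softmax.

```python
import math

# Toy illustration of the bidirectional design, under the simplifying
# assumption of a scalar exponentially-decaying state (not an LSTM).

def scan(seq, decay=0.5):
    state = 0.0
    for x in seq:
        state = decay * state + x   # recent inputs dominate the state
    return state

def softmax(z):
    m = max(z)                      # shift for numerical stability
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

seq = [1.0, 0.0, 0.0, 4.0]          # entity signals at both ends
fwd = scan(seq)                     # 4.125: dominated by the last token
bwd = scan(seq[::-1])               # 1.5:   dominated by the first token
pair = [fwd, bwd]                   # concatenation fed to the projection
probs = softmax([2.0, 1.0, 0.0])    # 3-way decision over the projected vector
```

A single forward scan would have all but erased the first entity's signal; keeping both final states is what makes each entity of the pair equally visible to the classifier.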

Input Word Embeddings
Input vectors are built differently depending on the subtask. For the entity extraction subtasks, vectors representing tokens are built by concatenating a character-based embedding and a word embedding. Whether we are dealing with EVENT or TIMEX3 entities, we add one embedding per entity attribute. An overview of the embedding computation is presented in Figure 3. Following Lample et al. (2016), the character-based representation is constructed with a Bi-LSTM. First, a random embedding is generated for every character in the training corpus. Token characters are then processed with a forward and backward LSTM architecture similar to the one of our entity extraction model. The final character-based representation results from the concatenation of the forward and backward representations. Since medical terms often include prefixes and suffixes derived from ancient Greek and classical Latin (Namer and Zweigenbaum, 2004), we believe that both entity and containment relation extraction particularly benefit from this character-based representation for tokens that have not been seen during training or that do not have a pretrained word embedding.
We use pretrained word embeddings computed with word2vec (Mikolov et al., 2013) on the Mimic 3 corpus (Johnson et al., 2016) and the colon cancer part of the THYME corpus. To account for unknown tokens during the test phase, we train a special embedding UNK by randomly replacing some singleton tokens with the UNK embedding (probability of replacement = 0.5). In the inter-sentence relation classifier, we introduce a specific token for identifying sentence breaks. This token is composed of one distinctive character and is associated with a specific word embedding. Similarly to the character embeddings, we randomly initialize one embedding per token attribute value, with an embedding size of 4. All these embeddings are then concatenated.
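The UNK-training trick can be sketched as follows; this is our reconstruction of the idea, not the released code.

```python
import random
from collections import Counter

# Sketch of UNK training: tokens occurring only once in the training
# corpus are replaced by the UNK symbol with probability p, so the UNK
# embedding is trained in realistic contexts and can stand in for
# unseen tokens at test time.

def replace_singletons(sentences, p=0.5, rng=random):
    counts = Counter(tok for sent in sentences for tok in sent)
    return [["UNK" if counts[tok] == 1 and rng.random() < p else tok
             for tok in sent]
            for sent in sentences]
```

Passing `rng` explicitly keeps the sketch testable; in practice the replacement is simply drawn once per singleton occurrence during training.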

Network Training
We implemented the two neural network models described in the previous sections using TensorFlow 0.12 (Abadi et al., 2015). We trained our networks with mini-batch Stochastic Gradient Descent using Adam (Kingma and Ba, 2014). To avoid overfitting, we use dropout training, applied on the input embeddings with a rate of 0.5.
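The dropout scheme amounts to the following, shown here as a plain-Python sketch of the standard inverted-dropout formulation (our illustration; TensorFlow applies this element-wise on the embedding tensors).

```python
import random

# Inverted dropout on one input embedding vector: each dimension is
# zeroed with probability `rate` at training time, and survivors are
# rescaled by 1/(1 - rate) so the expected activation matches test time,
# when dropout is disabled.

def dropout(vec, rate=0.5, rng=random):
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in vec]
```

At test time the function is simply not applied, which is why the rescaling during training matters.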
The hyperparameters of the attribute and DCT relation extraction subtasks were optimized with a Tree-structured Parzen Estimator approach (Bergstra et al., 2011), applied to the hyperparameter C of the linear SVM, the lookup window around entities and the percentile of features to keep. For the latter, we used the ANOVA F-value as selection criterion.

Domain Adaptation Strategies
We implemented two strategies for domain adaptation during the first phase. In the first strategy, we blocked further training of the pretrained word embeddings during network training. Since a large number of medical events mentioned in the test set are not seen during training, we believe that our system should rely on untuned word embeddings to make its predictions.
The second strategy relies on the UNK embedding. Since our word embeddings are pretrained on the Mimic 3 corpus and on the colon cancer part of the THYME corpus, a number of tokens (and therefore EVENTs) of the test part of the corpus may not have a specific word embedding. By randomly replacing EVENT tokens with the UNK token, we force our networks to look at other contextual clues within the sentence. Both strategies were applied to the EVENT entity and CONTAINS relation extraction subtasks.

Phase 2 was addressed by implementing two strategies. In the first one, we mixed the 30 texts about brain cancer with the 591 texts about colon cancer. In the second one, we randomly chose 30 texts related to colon cancer and combined them with the 30 texts about brain cancer, resulting in a balanced training corpus. Both strategies were applied to the EVENT, TIMEX3 and CONTAINS extraction subtasks.

Results and Discussion
Results for our four runs are presented in Table 1. The two strategies implemented for Phase 1 yield similar results (0.01 difference in F1-measure at most), with only a very slight advantage for the strategy blocking further training of the word embeddings (STATIC strategy in the table). In Phase 2, the two strategies also yield close results (0.04 difference in F1-measure) for the EVENT entity extraction and temporal relation subtasks. However, the strategy consisting in taking all available annotations (ALL strategy in the table) slightly outperforms training on a balanced corpus, especially for the extraction of CONTAINS relations. The same strategy performs much better on the TIMEX3 entity extraction subtask, where the gap in F1-measure reaches 0.06. This superiority agrees with the general observation that the size of the training corpus often has a greater impact on results than its strict matching with the target domain. Overall, in both phases and for all strategies, results are competitive for entity and temporal relation extraction.
The performance obtained by our system relies in part on corpus tailoring. According to the annotation guidelines, some sections of the test corpus related to medication and diet are not to be annotated. However, these sections are not formally delimited within the documents. To avoid annotating them at test time, we developed a semi-automatic approach for detecting these sections and putting them aside.
Other aspects of the corpus limit the performance. Some sections should not be annotated because they are duplicates of other sections found elsewhere in the corpus. However, we have no information on how to formally identify these sections. Furthermore, a number of temporal expressions are annotated as SECTIONTIME or DOCTIME entities; detecting them as TIMEX3 entities instead decreases the precision of our model.
In future work, we plan to explore additional strategies.For instance, adding a feature predicting whether a given EVENT entity is a container or not has proved useful in previous work (Tourille et al., 2016), but was not implemented in our system due to time constraints.

Figure 1: Neural model for EVENT extraction.

Figure 2: Neural architecture for CONTAINS relation extraction.