Representations of Time Expressions for Temporal Relation Extraction with Convolutional Neural Networks

Token sequences are often used as the input for Convolutional Neural Networks (CNNs) in natural language processing. However, they might not be an ideal representation for time expressions, which are long, highly varied, and semantically complex. We describe a method for representing time expressions with single pseudo-tokens for CNNs. With this method, we establish a new state-of-the-art result for a clinical temporal relation extraction task.


Introduction
Convolutional Neural Networks (CNNs) utilize convolving filters and pooling layers for exploring and subsampling a feature space, and show excellent results in tasks such as semantic parsing (Yih et al., 2014), search query retrieval (Shen et al., 2014), sentence modeling (Kalchbrenner et al., 2014), and many other natural language processing (NLP) tasks (Collobert et al., 2011).
Token sequences are often used as the input for a CNN model in NLP. Each token is represented as a vector. Such vectors can be word embeddings trained on the fly (Kalchbrenner et al., 2014), embeddings pre-trained on a corpus (Pennington et al., 2014; Mikolov et al., 2013), or one-hot vectors that index the token into a vocabulary (Johnson and Zhang, 2014). CNN filters then act as n-gram detectors over these continuous representations. Subsequent network layers learn to combine these n-gram filters to detect patterns in the input sequence.
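To make the n-gram view of a filter concrete, the following is a minimal sketch of one width-2 filter sliding over a sequence of token vectors. The ReLU activation, the max-over-time pooling, the toy 2-dimensional embeddings, and the names `conv_max_pool`, `embedding`, and `bigram_filter` are all illustrative assumptions, not details taken from the models discussed here.

```python
from typing import Dict, List

Vector = List[float]


def conv_max_pool(sentence: List[Vector], filt: List[Vector], bias: float = 0.0) -> float:
    """Apply one filter to every window of tokens, then max-pool the activations."""
    width = len(filt)
    activations = []
    for start in range(len(sentence) - width + 1):
        window = sentence[start:start + width]
        # Dot product of the filter with the vectors in this window.
        score = bias + sum(
            w * x
            for filt_row, token_vec in zip(filt, window)
            for w, x in zip(filt_row, token_vec)
        )
        activations.append(max(0.0, score))  # ReLU nonlinearity (assumed)
    # Max-over-time pooling: keep the strongest response in the sentence.
    return max(activations) if activations else 0.0


# Hypothetical 2-dimensional embeddings; real embeddings are learned.
embedding: Dict[str, Vector] = {"on": [1.0, 0.0], "mar": [0.0, 1.0]}
default: Vector = [0.5, 0.5]

tokens = "surgery is scheduled on mar 11".split()
sentence = [embedding.get(tok, default) for tok in tokens]

# A width-2 filter that responds most strongly to the bigram "on mar".
bigram_filter = [[1.0, 0.0], [0.0, 1.0]]
pooled = conv_max_pool(sentence, bigram_filter)
print(pooled)
```

In this toy setting the pooled value is driven by the single window that best matches the filter, which is the sense in which a filter behaves like a soft n-gram detector.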
This token vector sequence representation has worked for many NLP tasks, but has not been well studied for temporal relation extraction. Time expressions are complex linguistic expressions that are challenging to represent because of their length and variety. For example, among the time expressions in the THYME (Styler IV et al., 2014) colon cancer training corpus, there are 3,833 occurrences of 2,014 unique expressions, of which 1,624 (80.6%) are multi-token, 1,104 span three or more tokens, and some span as many as 10 tokens. CNNs, which represent meaning through fragments of word sequences, might struggle to compose these fragments to represent the meaning of time expressions. For example, can a CNN properly generalize that May 7 as a date is closer to April 30 than May 20? Can it embed years like 2012 and 2040 to recognize that the former was in the past, while the latter is in the future? Time normalization systems can handle such phenomena, but they are complex and language-specific, and often require significant manual effort to re-engineer for a new domain (Strötgen and Gertz, 2013; Bethard, 2013).
Figure 1: CNN with encoded timex
Fortunately, not all tasks require full time normalization, so if the CNN can at least embed a meaningful subset of the time expression semantics, it may still be helpful in such tasks. An open question then, is how to best feed time expressions to the CNN so that it can usefully generalize over them as part of its solution to a larger task.
We propose representing time expressions as single pseudo-tokens, with single vector representations (as in Figure 1), that encode easily extractable information about the time expression that is valuable for the task of temporal relation extraction. The benefits are two-fold: 1) Only minimal linguistic preprocessing is required: off-the-shelf time expression identifiers are available with low overhead (Miller et al., 2015). 2) CNN filters are more effective because they operate over the time expression as one unit. The filter process can thus focus on the informative surrounding context to catch generalizable patterns instead of being trapped within lengthy time expressions.
Figure 2:
A EVENT surgery is scheduled on TIME Mar 11 .
⇓
1: a e surgery /e is scheduled on t mar 11 /t .
2: a e surgery /e is scheduled on t timex /t .
3: a e surgery /e is scheduled on t date /t .
4: a e surgery /e is scheduled on t nn cd /t .
5: a e surgery /e is scheduled on t date nn cd /t .
6: a e surgery /e is scheduled on t index 721 /t .
7: a e surgery /e is scheduled on t mar 11 date /t .
We explored a variety of one-tag representations for time expressions, from very specific to very general. We also experimented with other ways to inject temporal information into the CNN models and compared them with our one-tag representations. To evaluate our proposed representation, we picked a challenging learning task in which time expressions are critical cues: clinical temporal relation extraction. The identification of temporal relations in medical text has been drawing growing attention because of its potential to dramatically increase the understanding of many medical phenomena, such as disease progression, longitudinal effects of medications, and a patient's clinical course, and because of its many clinical applications, such as question answering (Das and Musen, 1995; Kahn et al., 1990), clinical outcomes prediction (Schmidt et al., 2005), and the recognition of temporal patterns and timelines (Zhou and Hripcsak, 2007).
Through experiments, we not only demonstrate the usefulness of one-tag representations for time expressions, but also establish a new state-of-the-art result for clinical temporal relation extraction.

Methods
We trained two CNN-based classifiers for recognizing two types of within-sentence temporal relations, event-event and event-time relations, as they usually call for different temporal cues (Lin et al., 2016a). The input to our classifiers was manually annotated (gold) events and time expressions during both training and testing stages. In this way, we isolated the task of time expression representation for temporal relation extraction from the tasks of event and time expression recognition. We adopted the same xml-tag marked-up token sequence representation and model setup as Dligach et al. (2017). Figure 2(1) illustrates the marked-up token sequence for an event-time instance, in which the event is marked by e and /e and the time expression is marked by t and /t. Event-event instances are handled similarly, e.g., a e1 surgery /e1 is e2 scheduled /e2 on march 11. We tried different ways of representing a time expression as a one-token tag. The coarsest option is to represent all time expressions with one universal tag, timex, as in Figure 2(2). For more granular options, we experimented with these additional representations:
1) The time class (see footnote 1) of a time expression, as in Figure 2(3), where the time expression, Mar 11, is represented by its class, date.
2) The Penn Treebank POS tags of the tokens in a time expression, as in Figure 2(4).
To show the contribution of one-tag representations versus adding new information to the system, we also explored incorporating temporal information by adding time-class tags to the original token sequences (Figure 2(7)) and by adding BIO tags, with or without time classes, for time expressions (Figure 2(8,9)) alongside the original token sequences.
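The representations above can be sketched as a small preprocessing function. This is an illustration under stated assumptions rather than the preprocessing actually used: the function name `encode_event_time`, the angle-bracket spelling of the markers, the underscore used to join multi-part tags into one pseudo-token, and the half-open `(start, end)` span convention are all hypothetical. The time class and POS tags are assumed to be supplied by upstream tools, and the index variant (Figure 2(6)) is omitted because its definition is not given here.

```python
from typing import Dict, List, Sequence, Tuple

Span = Tuple[int, int]  # half-open token span: (start, end)


def encode_event_time(
    tokens: Sequence[str],
    event: Span,
    time: Span,
    time_class: str,
    time_pos: Sequence[str],
) -> Dict[int, List[str]]:
    """Build several Figure 2 style inputs, keyed by the variant number."""

    def render(time_tokens: List[str]) -> List[str]:
        out: List[str] = []
        for i, tok in enumerate(tokens):
            if i == event[0]:
                out.append("<e>")
            if i == time[0]:
                # Emit the replacement for the whole time expression once.
                out.append("<t>")
                out.extend(time_tokens)
            if not (time[0] <= i < time[1]):
                out.append(tok.lower())
            if i == event[1] - 1:
                out.append("</e>")
            if i == time[1] - 1:
                out.append("</t>")
        return out

    original = [t.lower() for t in tokens[time[0]:time[1]]]
    pos = [p.lower() for p in time_pos]
    return {
        1: render(original),                       # plain token sequence
        2: render(["timex"]),                      # one universal tag
        3: render([time_class]),                   # time class tag
        4: render(["_".join(pos)]),                # POS-tag pseudo-token
        5: render(["_".join([time_class] + pos)]), # time class plus POS tags
        7: render(original + [time_class]),        # tokens plus time class
    }


tokens = ["A", "surgery", "is", "scheduled", "on", "Mar", "11", "."]
variants = encode_event_time(tokens, (1, 2), (5, 7), "date", ["NN", "CD"])
for number, sequence in variants.items():
    print(number, " ".join(sequence))
```

Variants 2 to 5 shrink the time expression to a single token between the time markers, while variant 7 keeps the original tokens and only adds information.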
We used the same CNN architecture as Dligach et al. (2017) and focused on extracting the contains relation. The word embeddings were randomly initialized (see footnote 2).
1 We used the standard clinical domain classification (Styler IV et al., 2014), where the classes are date (e.g., next Friday, this month), time (e.g., 3:00 pm), duration (e.g., five years), quantifier (e.g., twice, four times), prepostexp (e.g., preoperative, post-surgery), and set (e.g., twice monthly).
2 Our preliminary experiments showed better results for randomly-initialized embeddings than several pre-trained embeddings. One-hot vectors were too slow for processing.

Evaluation Methodology and Results
We tested our new representations of time expressions on the THYME corpus (Styler IV et al., 2014), following the evaluation setup of Clinical TempEval 2016 (Bethard et al., 2016). The THYME corpus contains a colon cancer set and a brain cancer set. The colon cancer set was our main focus. Models were trained on the colon cancer training set, and hyper-parameters were tuned on the colon cancer development set. The best models were then re-trained with the best hyper-parameters on the combined training and development sets, and tested and compared on the colon cancer test set. As a secondary validation set, we also considered the brain cancer portion of the THYME corpus. The models were re-trained on the brain cancer training and development sets (using the best hyper-parameters found for colon cancer) and tested on the brain cancer test set.
For results on the test sets, we used the official Clinical TempEval evaluation scripts (with closure-enhanced precision, recall, and F1-score). Table 1 shows performance on the colon development set for the THYME system and the various methods of representing time expressions to CNN models. The order of representation settings is identical to that in Figure 2. For event-time relations, all our neural models outperformed the state-of-the-art THYME system's F1. Three one-tag temporal representations with moderate granularity, time class (Table 1(3)), POS tags (Table 1(4)), and time class plus POS tags (Table 1(5)), performed better than the token sequence CNN baseline (Table 1(1)), with the time class tag representation achieving the highest score (Table 1(3)). CNNs were better able to leverage time class information in our tag-based representation (Table 1(3)) than when time class information was added to the original token sequence (Table 1(7)) or supplied as a separate time-class neural embedding (Table 1(9)).
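As a rough illustration of what closure-enhanced scoring means, the sketch below scores contains relations using only transitive closure: a system relation counts toward precision if it can be found in the closure of the gold relations, and a gold relation counts toward recall if it can be found in the closure of the system relations. This is a simplified stand-in, not the official Clinical TempEval script. The function names and the toy relation identifiers are hypothetical.

```python
from typing import Set, Tuple

Relation = Tuple[str, str]  # (container, contained)


def transitive_closure(relations: Set[Relation]) -> Set[Relation]:
    """Add every relation implied by chaining contains relations."""
    closure = set(relations)
    changed = True
    while changed:
        changed = False
        for a, b in list(closure):
            for c, d in list(closure):
                if b == c and (a, d) not in closure:
                    closure.add((a, d))
                    changed = True
    return closure


def closure_prf(gold: Set[Relation], system: Set[Relation]) -> Tuple[float, float, float]:
    """Precision against the gold closure, recall against the system closure."""
    gold_closure = transitive_closure(gold)
    system_closure = transitive_closure(system)
    precision = sum(r in gold_closure for r in system) / len(system) if system else 0.0
    recall = sum(r in system_closure for r in gold) / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Toy example with made-up identifiers: t1 contains e1, and e1 contains e2.
gold = {("t1", "e1"), ("e1", "e2")}
system = {("t1", "e1"), ("t1", "e2"), ("e3", "e4")}
precision, recall, f1 = closure_prf(gold, system)
print(precision, recall, f1)
```

Here the system relation (t1, e2) is not annotated directly but follows from the two gold relations, so it is not penalized in precision.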
For event-event relations, none of the neural models performed as well as the state-of-the-art THYME system. The CNN token-based model performed similarly to some of the one-tag temporal representations (Table 1(3,4,5)). Removing the time expression entirely (Table 1(10)) did not hurt performance much, confirming that time expressions were not critical cues for within-sentence event-event relation reasoning (Xu et al., 2013). Thus, on the colon test set, we evaluated the contribution of encoding time expressions on the event-time CNN model only. For the event-event part, we used the THYME event-event system, so that our results were directly comparable with the outcomes of Clinical TempEval 2016 (Bethard et al., 2016) and the performance of the THYME system (Lin et al., 2016a,b). On the brain cancer data, we evaluated only the event-time CNN models, so that we could directly assess the contribution of encoding time expressions as time class tags.
The top four rows of Table 2 show performance on the colon cancer test set for the best model from Clinical TempEval 2016, the THYME system, our CNN model with tokens only, and our CNN model where time expressions are encoded with time class tags. (To allow comparison with prior work, the event-time relation predictions made by our CNN models were coupled with the event-event relation predictions from the THYME system.) The bottom two rows of Table 2 show performance on the brain cancer test set. On both the colon and brain corpora, the encoded CNN model significantly outperformed the regular CNN model, based on a Wilcoxon signed-rank test over document-by-document comparisons, as in Cherry et al. (2013).
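A document-by-document Wilcoxon signed-rank comparison can be sketched as follows. This is a minimal pure-Python version that assumes an exact two-sided test computed by enumerating sign assignments, which is only practical for a small number of documents. The per-document scores below are made-up integers for illustration and are not results from our experiments. The function name is hypothetical.

```python
from itertools import product
from typing import List, Sequence, Tuple


def wilcoxon_signed_rank(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (W, exact two-sided p-value) for paired samples."""
    # Zero differences carry no sign information and are dropped.
    diffs = [x - y for x, y in zip(xs, ys) if x != y]
    n = len(diffs)

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks: List[float] = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)

    # Exact null distribution: every sign pattern is equally likely.
    total = sum(ranks)
    at_least_as_extreme = 0
    for signs in product((0, 1), repeat=n):
        plus = sum(r for r, s in zip(ranks, signs) if s)
        if min(plus, total - plus) <= w:
            at_least_as_extreme += 1
    return w, at_least_as_extreme / 2 ** n


# Made-up per-document scores (in percent) for two systems on six documents.
encoded_scores = [62, 58, 66, 61, 60, 65]
regular_scores = [60, 59, 61, 58, 56, 59]
w_stat, p_value = wilcoxon_signed_rank(encoded_scores, regular_scores)
print(w_stat, p_value)
```

For realistic numbers of documents, a library implementation with a normal approximation would be the more practical choice.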

Discussion
The CNN filters in the first layers are designed to detect the presence of highly discriminative patterns. For the event-time relation extraction task, one such pattern signaling a contains relation is "on Mar 11, 2014", as in Figure 1. However, a more generalizable pattern is "on DATE". Our time-class tag representation provides exactly this information and thereby contributes to generalizability. A size-two filter can easily capture such a useful pattern, instead of picking up less generalizable patterns like "on March" or "11 ," (shown in Figure 1). For time-sensitive learning tasks, especially event-time relation extraction, our time encoding technique proved effective on two corpora. We hypothesize that the contribution comes from generalizability and efficient filter computation.
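The generalization argument can be illustrated by listing the windows that a size-two filter sees. The sketch below uses two made-up sentences and omits the event and time markers for brevity. The helper name `ngrams` and the example sentences are assumptions for illustration only.

```python
from typing import List, Sequence, Set, Tuple


def ngrams(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return every contiguous window of n tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def shared_bigrams(first: Sequence[str], second: Sequence[str]) -> Set[Tuple[str, ...]]:
    """Bigram windows that occur in both token sequences."""
    return set(ngrams(first, 2)) & set(ngrams(second, 2))


# Raw token sequences: the two dates share no tokens.
raw_a = "surgery is scheduled on mar 11 , 2014".split()
raw_b = "surgery was performed on april 30".split()

# Time-class encoding: each time expression becomes the single token "date".
encoded_a = "surgery is scheduled on date".split()
encoded_b = "surgery was performed on date".split()

shared_raw = shared_bigrams(raw_a, raw_b)
shared_encoded = shared_bigrams(encoded_a, encoded_b)
print(shared_raw)
print(shared_encoded)
```

With raw tokens the two sentences have no bigram in common, whereas after encoding both contain the window ("on", "date"), so a single size-two filter can respond to both.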
Our method did not work for event-event relations because time expressions are not critical cues for such relations. CNN models as a whole did not outperform the conventional THYME event-event system, as confirmed by Dligach et al. (2017). Event-event relations have lower inter-annotator agreement and usually leverage more of the syntactic information and event properties (Xu et al., 2013), which are not perfectly captured by token sequences. The class imbalance issues are also more severe for event-event relations than for event-time relations (Dligach et al., 2017). These factors likely lead to lower performance for event-event CNNs. In the future, we will investigate methods to improve the event-event model, including incorporating syntactic information and event properties into a deep neural framework, and positive instance augmentation (Yu and Jiang, 2016).
Word embeddings trained by conventional methods such as word2vec and GloVe did not prove to be useful in our preliminary experiments. This is likely due to (1) lack of sufficiently large publicly available domain-specific corpora, and (2) inability of the conventional methods to capture the semantic properties of events that are key for the relation extraction task (such as event durations).
Combining our encoded CNN-based event-time model with the THYME event-event model achieved state-of-the-art performance (0.621F) on the colon cancer data. By comparison, the best 2016 Clinical TempEval system achieved 0.573F (Bethard et al., 2016; row 1 of Table 2) and the THYME system achieved 0.594F (Lin et al., 2016b; row 2 of Table 2). Our best combined model was also significantly better (p=0.03) than the combination of a regular CNN event-time model and the THYME event-event model, which reached 0.612F. Note that the number of gold event-time contains relation instances is similar to the number of gold event-event contains relations (Lin et al., 2016a). Having a better event-time model therefore made the difference.
The conventional machine learning world has focused on heavy feature engineering, while the new deep learning world has called for minimalistic pre-processing as input to powerful learners. We propose a new direction that combines the best of both worlds: infusing some knowledge into the learner input. For CNN models, multi-word time expressions are imperfectly represented in the token sequence representation. With a little engineering, we can encapsulate the time expressions in one tag with different granularities. Our experiments show that this small change still requires only minimal linguistic preprocessing but delivers a significant performance boost for a temporal relation extraction task. There are other multi-token named entities (locations, organizations, etc.) where it may be hard to generalize over their multiple tokens. We believe our encoding strategy is likely to benefit tasks where critical linguistic information resides in phrases or multi-word units.