UTA DLNLP at SemEval-2016 Task 12: Deep Learning Based Natural Language Processing System for Clinical Information Identification from Clinical Notes and Pathology Reports

We propose a deep neural network based natural language processing system for clinical information identification (such as time information, event spans and their attributes) from clinical notes and pathology reports.


Introduction
In the past several years, there has been much interest in applying neural network based deep learning techniques to many natural language processing (NLP) tasks, from low-level tasks such as language modeling, POS tagging, named entity recognition, and semantic role labeling (Collobert et al., 2011; Mikolov et al., 2013), to high-level tasks such as machine translation, information retrieval, and semantic analysis (Kalchbrenner and Blunsom, 2013; Socher et al., 2011a; Tai et al., 2015), and to sentence relation modeling tasks such as paraphrase identification and question answering (Socher et al., 2011b; Iyyer et al., 2014; Yin and Schutze, 2015). Deep representation learning has demonstrated its importance for these tasks: all of them gain performance improvements by learning either word-level or sentence-level representations.

* To whom all correspondence should be addressed. This work was partially supported by NSF-IIS 1117965, NSF-IIS 1302675, NSF-IIS 1344152, NSF-DBI 1356628, and NIH R01 AG049371.
In this work, we introduce deep representation learning technologies to electronic medical record research. Specifically, we focus on clinical information extraction, using clinical notes and pathology reports from the Mayo Clinic. Our system is designed to identify event expressions, including components such as:

• The spans (character offsets) of the expression in the raw text

The input of our system consists of raw clinical notes or pathology reports. The following is an example:

April 23, 2014: The patient did not have any postoperative bleeding so we will resume chemotherapy with a larger bolus on Friday even if there is slight nausea.
The output annotations over the text capture the key information such as event mentions and attributes. Table 1 illustrates the output of clinical information extraction in details.
To solve this task, the major challenge is how to precisely identify the spans (character offsets) of event expressions. Prior work (Velupillai et al., 2015) extracted morphological (lemma), lexical (token), and syntactic (part-of-speech) features encoded by cTAKES. Although using domain-specific information extraction tools can improve performance, learning to use such software well for clinical-domain feature engineering is still very time-consuming. In short, a simple and effective method that leverages only basic NLP modules yet achieves high extraction performance is desirable for regular users.
To address this challenge, we propose a deep neural network based method, specifically a convolutional neural network (Collobert et al., 2011), to learn hidden feature representations directly from raw clinical notes. More specifically, our method first extracts a window of surrounding words for each candidate word. We then augment each word with its part-of-speech tag and word shape as extra features. After that, our system deploys a temporal convolutional neural network to learn hidden feature representations. Finally, our system uses a multilayer perceptron (MLP) to predict event spans. Note that we use the same model to predict event attributes.
1 Apache cTAKES is a natural language processing system for extraction of information from electronic medical record clinical free-text.

Constructing High Quality Training Dataset
The major advantage of our system is that we leverage only NLTK 2 tokenization and a POS tagger to preprocess our training dataset. When implementing our neural network based clinical information extraction system, we found that constructing high-quality training data is not easy due to the noisy format of clinical notes. Choosing the proper tokenizer is quite important for span identification. After conducting experiments, we found that "RegexpTokenizer" matches our needs: it can generate spans for each token via a sophisticated regular expression, as in nltk.tokenize.RegexpTokenizer(…). We then use "PerceptronTagger" as our part-of-speech tagger due to its fast tagging speed. Note that when extracting context words, make sure to deploy the same tokenization module rather than simply splitting strings on whitespace.
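The span-generating tokenization described above can be sketched with only the standard library; NLTK's RegexpTokenizer.span_tokenize behaves like matching a token regex and reporting each match's character offsets. The pattern below is illustrative only (the paper does not give the authors' actual regular expression):

```python
import re

# Illustrative token pattern (not the authors' actual regex):
# runs of word characters, or any single non-space punctuation character.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def span_tokenize(text):
    """Return (token, start, end) triples with character offsets,
    mimicking nltk.tokenize.RegexpTokenizer.span_tokenize."""
    return [(m.group(), m.start(), m.end()) for m in TOKEN_RE.finditer(text)]

toks = span_tokenize("April 23, 2014: The patient did not have any bleeding.")
```

Because the offsets come directly from the tokenizer, the same module can be reused when extracting context windows, avoiding the offset mismatches that arise from splitting on whitespace.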

Neural Network Classifier
Event span identification is the task of extracting the character offsets of event expressions in raw clinical notes. This subtask is important because span identification accuracy directly affects attribute identification accuracy. We first run our neural network classifier to identify event spans. Then, given each span, our system tries to identify attribute values.

Temporal Convolutional Neural Network
The way we use a temporal convolutional neural network for event span and attribute classification is similar to the approach proposed by Collobert et al. (2011). Generally speaking, we consider a word as represented by K discrete features w ∈ D_1 × · · · × D_K, where D_k is the dictionary for the k-th feature. In our scenario, we use three features: token mention, POS tag, and word shape. Word shape features represent the abstract letter pattern of a word by mapping lower-case letters to "x", upper-case letters to "X", and digits to "d", while retaining punctuation. We associate a lookup table with each feature. Given a word, a feature vector is obtained by concatenating all lookup table outputs. A clinical snippet is thus transformed into a word embedding matrix, which can be fed to subsequent 1-dimensional convolution and max pooling layers. Below we briefly introduce the core concepts of convolutional neural networks (CNNs).
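The lookup-table construction above can be sketched as follows. Table sizes and embedding dimensions here are illustrative, not the paper's actual dictionary sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# One lookup table (embedding matrix) per discrete feature.
# Sizes are hypothetical placeholders for illustration.
lookup = {
    "token": rng.standard_normal((10_000, 300)),  # word vectors
    "pos":   rng.standard_normal((50, 10)),       # POS-tag vectors
    "shape": rng.standard_normal((100, 5)),       # word-shape vectors
}

def word_shape(word):
    """Abstract letter pattern: lower-case -> x, upper-case -> X,
    digits -> d, punctuation retained (as described in the text)."""
    return "".join(
        "x" if c.islower() else "X" if c.isupper() else
        "d" if c.isdigit() else c
        for c in word
    )

def embed(token_id, pos_id, shape_id):
    """Concatenate the three lookup-table outputs into one feature vector."""
    return np.concatenate([
        lookup["token"][token_id],
        lookup["pos"][pos_id],
        lookup["shape"][shape_id],
    ])

vec = embed(42, 7, 3)   # 300 + 10 + 5 = 315 dimensions
```

Stacking one such vector per word in a snippet yields the word embedding matrix fed to the convolutional layers.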

Temporal Convolution
Temporal convolution applies a one-dimensional convolution over the input sequence. The one-dimensional convolution is an operation between a vector of weights m ∈ R^m and a vector of inputs viewed as a sequence x ∈ R^n. The vector m is the filter of the convolution. Concretely, we think of x as the input sentence and x_i ∈ R as a single feature value associated with the i-th word in the sentence. The idea behind the one-dimensional convolution is to take the dot product of the vector m with each m-gram in the sentence x to obtain another sequence c:

c_j = m^T x_{j−m+1:j}

Usually, x_i is not a single value but a d-dimensional word vector, so that x ∈ R^{d×n}. There exist two types of 1-d convolution operations. One was introduced by (Waibel et al., 1989) and is also known as Time Delay Neural Networks (TDNNs). The other was introduced by (Collobert et al., 2011). In a TDNN, the weights m ∈ R^{d×m} form a matrix, and each row of m is convolved with the corresponding row of x. In the (Collobert et al., 2011) architecture, a sequence of length n is represented as:

x_{1:n} = x_1 ⊕ x_2 ⊕ · · · ⊕ x_n

where ⊕ is the concatenation operation. In general, let x_{i:i+j} refer to the concatenation of words x_i, x_{i+1}, . . . , x_{i+j}. A convolution operation involves a filter w ∈ R^{hd}, which is applied to a window of h words to produce a new feature. For example, a feature c_i is generated from a window of words x_{i:i+h−1} by:

c_i = f(w · x_{i:i+h−1} + b)

where b ∈ R is a bias term and f is a non-linear function such as the hyperbolic tangent. This filter is applied to each possible window of words in the sequence {x_{1:h}, x_{2:h+1}, . . . , x_{n−h+1:n}} to produce the feature map:

c = [c_1, c_2, . . . , c_{n−h+1}]

where c ∈ R^{n−h+1}. We also employ dropout on the penultimate layer with a constraint on the ℓ2-norms of the weight vectors. Dropout prevents co-adaptation of hidden units by randomly dropping a proportion p of the hidden units during forward-backpropagation. That is, given the penultimate layer z = [ĉ_1, . . . , ĉ_m], instead of using

y = w · z + b

for output unit y in forward propagation, dropout uses

y = w · (z ◦ r) + b

where ◦ is the element-wise multiplication operator and r ∈ R^m is a masking vector of Bernoulli random variables with probability p of being 1. Gradients are backpropagated only through the unmasked units. At test time, the learned weight vectors are scaled by p such that ŵ = pw, and ŵ is used to score unseen sentences. We additionally constrain the ℓ2-norms of the weight vectors by re-scaling w to have ||w||_2 = s whenever ||w||_2 > s after a gradient descent step.
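The narrow convolution and max-over-time pooling described above can be sketched in a few lines of NumPy (toy sizes, single filter; a real model would vectorize this over 300 filters):

```python
import numpy as np

def conv1d_maxpool(X, W, b, h):
    """Narrow 1-D convolution in the Collobert-style window-concatenation
    form, followed by max-over-time pooling.
    X: (n, d) sentence matrix; W: (h*d,) filter; b: scalar bias; h: width."""
    n, d = X.shape
    # c_i = tanh(w . x_{i:i+h-1} + b) for each of the n-h+1 windows
    c = np.array([
        np.tanh(W @ X[i:i + h].reshape(-1) + b)
        for i in range(n - h + 1)
    ])                      # feature map c in R^{n-h+1}
    return c.max()          # max pooling keeps the strongest activation

rng = np.random.default_rng(1)
X = rng.standard_normal((7, 4))   # 7 words, 4-dim embeddings (toy sizes)
W = rng.standard_normal(2 * 4)    # kernel width h = 2
feat = conv1d_maxpool(X, W, 0.1, h=2)
```

Each filter thus contributes one pooled scalar; concatenating the outputs of all filters gives the penultimate layer z on which dropout is applied.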

Dataset
We use the Clinical TempEval corpus 3 as the evaluation dataset. This corpus is based on a set of 600 clinical notes and pathology reports from cancer patients at the Mayo Clinic. The notes were manually de-identified by the Mayo Clinic to replace names, locations, etc. with generic placeholders, but time expressions were not altered. The notes were then manually annotated with times, events, and temporal relations. These annotations include time expression types, event attributes, and an increased focus on temporal relations. The event, time, and temporal relation annotations were distributed separately from the text using the Anafora standoff format. Table 2 shows the number of documents and event expressions in the training, development, and testing portions of the 2016 THYME data.

Evaluation Metrics
All of the tasks were evaluated using the standard metrics of precision (P), recall (R), and F1:

P = |S ∩ H| / |S|,  R = |S ∩ H| / |H|,  F1 = 2PR / (P + R)

where S is the set of items predicted by the system and H is the set of items manually annotated by humans. Applying these metrics to the tasks only requires a definition of what is considered an "item" for each task. For evaluating the spans of event expressions, items were tuples of character offsets; thus, systems only received credit for identifying events with exactly the same character offsets as the manually annotated ones. For evaluating the attributes of event expressions, items were tuples of (begin, end, value), where begin and end are character offsets and value is the value given to the relevant attribute; thus, systems only received credit for an event attribute if they both found an event with the correct character offsets and assigned the correct value to that attribute (Bethard et al., 2015).
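These set-based metrics can be computed directly over tuples of character offsets, as a minimal sketch:

```python
def prf1(system, human):
    """Precision, recall, F1 over sets of annotated items.
    An 'item' is a (begin, end) offset tuple for span evaluation,
    or a (begin, end, value) tuple for attribute evaluation."""
    tp = len(system & human)                 # exact matches only
    p = tp / len(system) if system else 0.0
    r = tp / len(human) if human else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# Toy example: two of three predicted spans exactly match the gold spans.
pred = {(0, 5), (6, 8), (10, 14)}
gold = {(0, 5), (6, 8), (15, 20), (21, 27)}
p, r, f1 = prf1(pred, gold)
```

Note that a span off by even one character contributes nothing, which is why tokenizer choice matters so much for this evaluation.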

Objective Function
We want to maximize the likelihood of the correct class, which is equivalent to minimizing the negative log-likelihood (NLL). More specifically, the label ŷ given an input x is predicted by a softmax classifier that takes the hidden state h as input:

p(y | x) = softmax(W h + b),  ŷ = argmax_y p(y | x)

The objective function is then the negative log-likelihood of the true class labels y^(k):

J(θ) = −(1/m) Σ_{k=1}^{m} log p(y^(k) | x^(k))

where m is the number of training examples and the superscript k indicates the k-th example.
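The softmax classifier and NLL objective can be sketched as follows (toy dimensions; W, b, and the hidden states would come from the trained network):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # subtract max for stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(H, W, b, y):
    """Mean negative log-likelihood of true labels y.
    H: (m, d) hidden states; W: (d, C) weights; y: (m,) integer labels."""
    probs = softmax(H @ W + b)              # p(y | x) via softmax classifier
    m = len(y)
    return -np.log(probs[np.arange(m), y]).mean()

rng = np.random.default_rng(2)
H = rng.standard_normal((5, 8))             # 5 examples, hidden dim 8
W = rng.standard_normal((8, 3))             # 3 classes
loss = nll(H, W, np.zeros(3), rng.integers(0, 3, 5))
```

With all-zero inputs the predictive distribution is uniform, so the loss reduces to log C, a handy sanity check.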

Hyperparameters
We use the Lasagne 4 deep learning framework. We first initialize our word representations using the publicly available 300-dimensional GloVe word vectors 5. Our CNN model uses a kernel width (kw) of 2, 300 filters, and a stride of 1; the sequence length is 2 × window size + 1, the resulting feature map has length seqlen − kw + 1, and max pooling is applied over the entire feature map. The CNN activation function is the hyperbolic tangent, the MLP activation function is the sigmoid, and the MLP hidden dimension is 50. We initialize the CNN weights using a uniform distribution. Finally, by stacking a softmax function on top, we obtain normalized log-probabilities. Training is done through stochastic gradient descent over shuffled mini-batches with the AdaGrad update rule (Duchi et al., 2011). The learning rate is set to 0.05, the mini-batch size is 100, and the model parameters are regularized with a per-minibatch L2 regularization strength of 10^−4.

Table 3 shows results on the event expression tasks. Our submitted runs 4 and 5 outperformed the memorization baseline on every metric on every task. The precision of event span identification is close to the best reported result; however, our system obtained lower recall. One reason for the lower recall is that the pretrained word vectors may not cover many domain-specific words. Table 4 shows results on the phase 2 subtask.
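The AdaGrad update with the stated learning rate (0.05) and L2 strength (10^−4) can be sketched per parameter vector; this is a generic AdaGrad step, not Lasagne's internal implementation:

```python
import numpy as np

def adagrad_step(w, grad, cache, lr=0.05, l2=1e-4, eps=1e-8):
    """One AdaGrad update with per-minibatch L2 regularization,
    using the paper's learning rate and regularization strength."""
    g = grad + l2 * w               # add the L2 penalty gradient
    cache += g * g                  # accumulate squared gradients
    w -= lr * g / (np.sqrt(cache) + eps)
    return w, cache

w = np.array([1.0, -2.0])
cache = np.zeros_like(w)            # running sum of squared gradients
w, cache = adagrad_step(w, np.array([0.5, -0.5]), cache)
```

On the first step the per-coordinate denominator equals |g|, so each weight moves by roughly lr in the direction opposite its gradient, after which the accumulated cache shrinks subsequent steps.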

Conclusions
In this paper, we introduced a new clinical information extraction system that leverages only deep neural networks to identify event spans and their attributes from raw clinical notes. We trained deep neural network based classifiers to extract clinical event spans. Our method augments each word with its part-of-speech tag and word shape as extra features, and then employs a temporal convolutional neural network to learn hidden feature representations. The experimental results demonstrate that our approach consistently outperforms the existing baseline methods on standard evaluation datasets. Our research shows that competitive results can be obtained without the help of a domain-specific feature extraction toolkit such as cTAKES; we leverage only basic natural language processing modules such as tokenization and part-of-speech tagging. With the help of deep representation learning, we can dramatically reduce the cost of developing a clinical information extraction system. Since the recall values are still low, we will consider using domain-specific tools to enhance feature engineering.