Relation extraction from clinical texts using domain invariant convolutional neural network

In recent years, extracting relevant information from biomedical and clinical texts such as research articles, discharge summaries, or electronic health records has been the subject of many research efforts and shared challenges. Relation extraction is the process of detecting and classifying the semantic relations among entities in a given piece of text. Existing models for this task in the biomedical domain use either manually engineered features or kernel methods to create feature vectors. These features are then fed to a classifier to predict the correct class. The results of these methods are highly dependent on the quality of user-designed features, and they also suffer from the curse of dimensionality. In this work we focus on extracting relations from clinical discharge summaries. Our main objective is to exploit the power of the convolutional neural network (CNN) to learn features automatically and thus reduce the dependency on manual feature engineering. We evaluate the performance of the proposed model on the i2b2-2010 clinical relation extraction challenge dataset. Our results indicate that a convolutional neural network can be a good model for relation extraction in clinical text without depending on expert knowledge for defining quality features.


Introduction
The increasing amount of biomedical and clinical text, such as research articles, clinical trials, discharge summaries, and other texts created by social network users, represents an immeasurable source of information. Automatic extraction of relevant information from these resources can be useful for many applications such as drug repositioning, medical knowledge base creation etc. The performance of concept entity recognition systems for detecting mentions of proteins, genes, drugs, diseases, tests and treatments has reached a sufficient level of accuracy, which gives us the opportunity to use these data for next level tasks of natural language processing (NLP). Relation extraction is the process of identifying how given entities are related in the considered sentence or text. In the example sentence [S1] below, the entities Lexix and congestive heart failure are related by the "treatment administered for medical problem" relation. These relations are important for other upper level NLP tasks and also in biomedical and clinical research (Shang et al., 2011).
[S1]: He was given Lexix to prevent him from congestive heart failure .
The relation extraction task in unstructured text has been modeled in many different ways. Co-occurrence based methods, due to their simplicity and flexibility, are the most widely used methods in the biomedical and clinical domain. Co-occurrence analysis assumes that if two entities appear together in many sentences, there must be a relation between them (Bunescu et al., 2006; Song et al., 2011). This method cannot differentiate between types of relations and suffers from low precision and recall. To improve its results, different statistical measures such as pointwise mutual information, chi-square or log-likelihood ratio have been used in this approach (Stapley and Benoit, 2000).
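As an illustration of the co-occurrence idea, the following minimal sketch scores entity pairs with pointwise mutual information. The function name `pmi_scores`, the sentence-level probability estimates and the toy entity sets are assumptions made for this example; they are not taken from the cited systems.

```python
import math
from collections import Counter
from itertools import combinations

def pmi_scores(sentences):
    """Return PMI for every entity pair that co-occurs in at least one sentence.

    `sentences` is a list of sets of entity mentions (one set per sentence).
    """
    n = len(sentences)
    single = Counter()
    pair = Counter()
    for entities in sentences:
        single.update(entities)
        pair.update(frozenset(p) for p in combinations(sorted(entities), 2))
    scores = {}
    for key, joint in pair.items():
        a, b = sorted(key)
        # PMI = log2( P(a, b) / (P(a) * P(b)) ), probabilities estimated per sentence
        scores[(a, b)] = math.log2((joint / n) / ((single[a] / n) * (single[b] / n)))
    return scores

# Toy data (hypothetical): each set holds the entities found in one sentence.
toy = [
    {"aspirin", "headache"},
    {"aspirin", "headache"},
    {"aspirin", "fever"},
    {"insulin", "diabetes"},
]
scores = pmi_scores(toy)
```

A high score only says that two entities appear together more often than chance would predict; it carries no information about the type of relation, which is the limitation noted above.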
Rule based methods are another commonly adopted approach for the relation extraction task (Thomas et al., 2000; Park et al., 2001; Leroy et al., 2003). Rules are created by carefully observing the syntactic and semantic patterns in relation instances. The bootstrapping method (Xu, 2008) is used to improve the performance of rule based methods. Bootstrapping uses a small number of known relation pairs of each relation type as seeds and uses these seeds to search for patterns in a huge unannotated text in an iterative fashion (Xu, 2008). Bootstrapping also generates many irrelevant patterns, which can be controlled by a distantly supervised approach. Distantly supervised methods use a large knowledge base such as UMLS or Freebase as input and extract patterns from a huge corpus for all pairs of relations present in the knowledge base (Mintz et al., 2009; Riedel et al., 2010; Roller and Stevenson, 2014). The advantage of bootstrapping and distantly supervised methods over supervised methods is that they do not require large amounts of manually labeled training data, which is generally very hard to obtain.
Feature based methods use sentences with predefined entities to construct a feature vector through feature extraction (Hong, 2005; Minard et al., 2011b; Rink et al., 2011). Feature extraction is mainly based on linguistic and domain knowledge. The extracted feature vectors are passed to a classifier, which decides the class of relation between the entities in the sentence. Kernel methods are an extension of feature based methods that utilize kernel functions to exploit rich syntactic information such as parse trees (Zelenko et al., 2003; Culotta and Sorensen, 2004; Qian and Zhou, 2012; Zeng et al., 2014). State of the art results have been obtained by this class of methods.
However, the performance of feature and kernel based methods is highly dependent on the selection of a suitable feature set, which is not only a tedious and time-consuming task but also requires domain knowledge and depends on other NLP systems. Often such dependencies make existing work less reproducible, simply because the full and finer details of feature extraction are absent. Further, these methods often lead to a huge number of features and may be affected by the curse of dimensionality (Bengio et al., 2003; Collobert et al., 2011). Another issue faced by these methods is that feature extraction has to be adjusted according to the data source. As discussed earlier, we have multiple but diverse information resources such as research articles, discharge summaries, clinical trial outcomes etc. While on the one hand multiple sources bring more information, on the other hand their diverse nature makes it challenging to extract meaningful information automatically. For example, sentences in research articles are well formed and likely to use only well accepted technical terms. Sentences in clinical discharge summaries, in contrast, may not be well formed; they can be fragmented and contain many acronyms or terms used only locally. Similarly, social media articles may use slang or non-technical terms. This makes the task difficult for the methods discussed above.
Motivated by these issues, this work aims to exploit recent advances in machine learning and NLP to reduce such dependencies, and utilizes a convolutional neural network to learn important features with minimal manual intervention. The convolutional neural network has been shown to be a powerful model for image processing and computer vision (Krizhevsky et al., 2012; Karpathy and Fei-Fei, 2014), and subsequently in natural language processing it has given state of the art results in different tasks such as sentence classification (Kim, 2014; Kalchbrenner et al., 2014; Hu et al., 2014; Sharma et al., 2016), relation classification (Zeng et al., 2014; dos Santos et al., 2015) and semantic role labeling (Collobert et al., 2011).
In this paper we propose a new framework for extracting relations among problems, treatments and tests in clinical discharge summaries. In particular, we use the data made available for the clinical relation extraction task organized by Informatics for Integrating Biology and the Bedside (i2b2) in 2010 as part of the i2b2/VA challenge (Uzuner et al., 2011). Extracting relations in clinical texts is more challenging than in research articles, as clinical texts contain incomplete or fragmented sentences and many acronyms. Current state of the art methods depend heavily on manual feature engineering and use hundreds of thousands of features (Minard et al., 2011b; Rink et al., 2011). Our results indicate that the proposed model can outperform the current state of the art models while using only a small fraction of the features. The main observation, however, is that the features used in our model are easy to replicate and to adapt to the data source, compared to the feature sets generally used in these tasks.
For extracting relations between diseases and treatments, Rosario and Hearst (2004) used various graphical and neural network models. They used a variety of lexical, semantic and syntactic features for classification and found that semantic features contributed the most. The dataset used in that study was relatively small and was prepared from biomedical research articles. Li et al. (2008) proposed kernel methods for relation extraction between entities in MEDLINE articles. They modified the tree kernel function by incorporating a trace kernel to capture richer contextual features for classifying the relation. Their results show that the tree kernel outperforms other kernel methods, such as word and sequence kernels, for the considered task.
Conditional random fields (CRF) have been used for relation extraction between disease, treatment and gene by Bundschus et al. (2008). In this experimental setting, they did not assume that entities were given; instead, their model also predicted entities and their types. They developed two variants of CRF, both of which modeled the relation extraction task as a sequence labeling task. Recently, Bravo et al. (2015) proposed a system named BeeFree for identifying associations between drug, disease and target in the EU-ADR dataset (van Mulligen et al., 2012). BeeFree uses a combination of a shallow linguistic kernel and a dependency kernel for identifying relations.
In contrast to the above methods, a few recent works apply convolutional neural network based models (Zeng et al., 2014; dos Santos et al., 2015) to relation classification on the SemEval 2010 relation classification dataset (Hendrickx et al., 2009). The convolutional neural networks used in these models use constant length filters, with word embeddings and distance embeddings as features. Our model additionally leverages linguistic features, and we consider the relation extraction task in clinical notes, which are much more informal, rich in acronyms, and in which the number of samples varies widely across relation types (Uzuner et al., 2011).

CNN for Clinical Relation Extraction
The proposed CNN-based model is first summarized in the next section. Subsequent sections describe it in more detail.

Model Architecture
The proposed model architecture is shown in Figure 1. It takes a complete sentence with its mentioned entities as input and outputs a probability vector over all possible relation types. Each feature has a vector representation, which is initialized randomly except for the word embedding feature. For word embeddings, we used pre-trained word vectors (TH et al., 2015) learned on PubMed articles using the word2vec tool (Mikolov et al., 2013b).
The embedding layer maps every feature value to its corresponding feature vector and concatenates these vectors. To obtain local features from each part of the sentence, we apply multiple filters of different lengths (Kim, 2014) to every contiguous n-gram of the sentence, where n is the length of the filter (Figure 1 shows four filters with constant length three). We use max pooling over time to obtain global features from all filters. Here time refers to the filter running over the length of the sentence. The pooled features are then fed to a fully connected feed-forward neural network to make the inference. In the output layer we use a softmax classifier with the number of outputs equal to the number of possible relations between entities.

Feature Layer
We represent each word in the sentence with 6 discrete features, namely the word itself (W), distance from the first entity ($P_1$), distance from the second entity ($P_2$), part of speech tag of the word (PoS), chunk tag of the word (Chunk) and entity type (T). Each feature is briefly described below:
1. W: The exact word as it appears in the sentence.
2. $P_1$: Distance from the first entity in terms of number of words (Collobert and Weston, 2008). For instance, in our earlier example [S1], He is at distance −3 and prevent is at distance +2 from the first entity Lexix. This value is zero for all words that are part of the first entity.
3. $P_2$: Similar to $P_1$ but considers the distance from the second entity.
4. PoS: Part of speech tag of the considered word. We use the GENIA tagger (http://www.nactem.ac.uk/GENIA/tagger/) to obtain the PoS tag of each word.
5. Chunk: Chunk tag of the considered word. Again we use the GENIA tagger to obtain the chunk tag of each word.
6. T: Type of the considered word. Following the BIO tagging convention, this is an entity type such as B-Prob, I-Prob etc. for entity words and Other for the remaining words.
In this way a word $w \in D_1 \times D_2 \times \cdots \times D_6$, where $D_i$ is the dictionary for the $i$-th local feature.
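The position and type features listed above can be sketched in a few lines of Python for the example sentence [S1]. This is a minimal illustration under stated assumptions: the helper names are hypothetical, the type name `Treat` is illustrative (only `Prob` appears in the text), and measuring the distance of words after a multi-word entity from that entity's last token is a convention chosen here, not one the text specifies. PoS and chunk tags require an external tagger and are left out.

```python
def relative_position(index, start, end):
    """Signed word distance from an entity spanning tokens start..end (inclusive)."""
    if index < start:
        return index - start
    if index > end:
        return index - end
    return 0  # the word is part of the entity

def bio_type(index, entities):
    """BIO-style type tag; `entities` maps (start, end) spans to a type name."""
    for (start, end), name in entities.items():
        if index == start:
            return "B-" + name
        if start < index <= end:
            return "I-" + name
    return "Other"

tokens = "He was given Lexix to prevent him from congestive heart failure .".split()
first, second = (3, 3), (8, 10)               # token spans of the two entities
entities = {first: "Treat", second: "Prob"}   # type names are illustrative

rows = [
    {
        "W": tok,
        "P1": relative_position(i, *first),
        "P2": relative_position(i, *second),
        "T": bio_type(i, entities),
        # PoS and Chunk would come from an external tagger and are omitted here
    }
    for i, tok in enumerate(tokens)
]
```

With these spans, "He" receives a distance of −3 and "prevent" a distance of +2 from the first entity, matching the values given in the description of the first distance feature.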

Embedding Layer
In the lookup or embedding layer, each feature value is mapped to its vector representation using a feature embedding matrix. Let $M_i \in \mathbb{R}^{n \times N}$ be the feature embedding matrix for the $i$-th local feature (here $n$ is the dimension of the feature embedding and $N$ is the number of possible values, i.e. the size of the dictionary, for the $i$-th local feature). Each column of $M_i$ is the vector of the corresponding value of the $i$-th feature. The mapping can be done by taking the product of the one-hot vector of the feature value with its embedding matrix (Collobert and Weston, 2008). Suppose $a^{(i)}_j$ is the one-hot vector for the $j$-th feature value of the $i$-th feature; then
$$f^{(i)}_j = M_i\, a^{(i)}_j, \qquad x_i = f^{(1)} \oplus f^{(2)} \oplus \cdots \oplus f^{(6)} \tag{1}$$
where $f^{(k)}$ denotes the embedded vector of the $k$-th feature value of the word. Here $\oplus$ is the concatenation operation, so $x_i \in \mathbb{R}^{(n_1 + \cdots + n_6)}$ is the feature vector for the $i$-th word in the sentence and $n_k$ is the dimension of the $k$-th feature. For word embeddings we used pre-trained word vectors obtained by running the word2vec tool (Mikolov et al., 2013b; Mikolov et al., 2013a) on a large collection of open-access PubMed articles (TH et al., 2015). The other feature matrices were initialized randomly. Since the number of elements in all feature dictionaries except the word dictionary ($D_1$) is not large, we assume that these vectors will be updated sufficiently during training.
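The lookup described above can be sketched with NumPy: multiplying an embedding matrix by a one-hot vector selects one column, and the six selected columns are concatenated. The dictionary sizes and feature ids below are hypothetical, and all matrices are random here, whereas the paper initializes the word matrix from pre-trained vectors. The embedding sizes (50 for words, 5 for each other feature) are the ones reported in the implementation section.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dictionary sizes for the six features (W, P1, P2, PoS, Chunk, T).
dict_sizes = [1000, 40, 40, 45, 20, 7]
# Embedding sizes reported in the paper: 50 for words, 5 for every other feature.
embed_dims = [50, 5, 5, 5, 5, 5]

# M[i] has shape (n_i, N_i); each column is the vector of one feature value.
M = [rng.normal(size=(n, N)) for n, N in zip(embed_dims, dict_sizes)]

def one_hot(index, size):
    """One-hot vector with a single 1.0 at position `index`."""
    vec = np.zeros(size)
    vec[index] = 1.0
    return vec

def embed_word(feature_ids):
    """Concatenate the six feature vectors of one word."""
    parts = [M[i] @ one_hot(j, dict_sizes[i]) for i, j in enumerate(feature_ids)]
    return np.concatenate(parts)

x = embed_word([12, 3, 7, 5, 2, 1])   # arbitrary feature ids for one word
```

The resulting vector has 50 + 5 × 5 = 75 entries per word. In practice the matrix product is replaced by direct column indexing, which gives the same result without building the one-hot vector.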

Convolution Layer
We apply convolution on the text to obtain local features from each part of the sentence (Collobert and Weston, 2008). Consider $x_1 x_2 \ldots x_m$, the sequence of feature vectors of a sentence, where $x_i \in \mathbb{R}^d$ is the vector obtained by concatenating all feature vectors of the $i$-th word. Let $x_{i:i+j}$ represent the concatenation of the feature vectors $x_i, \ldots, x_{i+j}$. Suppose there is a filter parameterized by a weight vector $w \in \mathbb{R}^{cd}$, where $c$ is the length of the filter (in Figure 1 the filter length is three). Then the output sequence of the convolution layer would be
$$h_i = f(w \cdot x_{i:i+c-1} + b)$$
where $i = 1, 2, \ldots, m - c + 1$, $\cdot$ is the dot product, $f$ is the rectified linear unit (ReLU) function and $b \in \mathbb{R}$ is the bias term. $w$ and $b$ are the learned parameters and remain the same for all $i = 1, 2, \ldots, m - c + 1$.
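A single filter sliding over the sentence can be written out directly. The sketch below is a plain NumPy illustration with random inputs, not the authors' TensorFlow code; the sizes (12 words, 75-dimensional word vectors, filter length 3) are chosen for the example, and the sentence is assumed to have at least as many words as the filter length.

```python
import numpy as np

def convolve(X, w, b, c):
    """Apply one filter of length c to a sentence matrix X of shape (m, d).

    w has shape (c * d,), b is a scalar; returns h of length m - c + 1.
    """
    m, _ = X.shape
    h = np.empty(m - c + 1)
    for i in range(m - c + 1):
        window = X[i:i + c].reshape(-1)          # concatenation of c word vectors
        h[i] = max(0.0, float(w @ window) + b)   # ReLU of the dot product plus bias
    return h

rng = np.random.default_rng(0)
m, d, c = 12, 75, 3
X = rng.normal(size=(m, d))      # stand-in for the embedded sentence
w = rng.normal(size=c * d)       # one filter
h = convolve(X, w, b=0.1, c=c)
```

The same `w` and `b` are reused at every position, so the number of parameters of a filter depends only on its length and the word vector size, not on the sentence length.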

Max Pooling Layer
The length of the convolution layer output ($m - c + 1$) varies with the number of words $m$ in the sentence. We apply max pooling over time (Collobert and Weston, 2008) to obtain fixed length global features for the whole sentence, taking the maximum of the convolution output over all positions, which we denote $z$. The intuition behind max pooling is to keep only the most useful feature from the entire sentence.
We have just explained the process of extracting one feature from a whole sentence using one filter. In Figure 1 we extract four features using four filters of the same length, three. In our experiments we use multiple such filters of variable length (Kim, 2014; Yin and Schütze, 2015). The objective of using filters of different lengths is to accommodate context in windows of varying size around words.
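The effect of max pooling over time with several filter lengths can be sketched as follows: whatever the sentence length, each filter contributes exactly one value. This is a NumPy illustration under assumed sizes (75-dimensional word vectors, 100 filters for each of the lengths 4 and 6); the function names are hypothetical.

```python
import numpy as np

def conv_relu(X, W, b):
    """W: (k, c, d) bank of k filters of length c; returns (m - c + 1, k)."""
    k, c, d = W.shape
    m = X.shape[0]
    windows = np.stack([X[i:i + c].reshape(-1) for i in range(m - c + 1)])
    return np.maximum(0.0, windows @ W.reshape(k, c * d).T + b)

def pooled_features(X, banks):
    """Max over time for each filter, concatenated over all filter lengths."""
    return np.concatenate([conv_relu(X, W, b).max(axis=0) for W, b in banks])

rng = np.random.default_rng(0)
d, k = 75, 100
banks = [(rng.normal(size=(k, c, d)), np.zeros(k)) for c in (4, 6)]

short = pooled_features(rng.normal(size=(9, d)), banks)    # 9-word sentence
long = pooled_features(rng.normal(size=(30, d)), banks)    # 30-word sentence
```

Both sentences yield a vector of 200 values (100 filters × 2 lengths), which is what allows a fixed-size classifier to follow.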

Fully Connected Layer
The output of the max pooling layer is the sequence of values $z$ obtained with the different filters. We call these global features because they are obtained by taking the maximum over the entire sentence. To build a classifier over the extracted global features, we use a fully connected feed-forward layer. Suppose $z_i \in \mathbb{R}^l$ is the output of the max pooling layer over all filters for the $i$-th sentence; then the output of the fully connected layer would be
$$o_i = W_o z_i + b_o$$
Here $W_o \in \mathbb{R}^{[r] \times l}$ and $b_o \in \mathbb{R}^{[r]}$ are parameters of the neural network and $[r]$ denotes the number of classes.

Softmax Layer
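In the output layer we use a softmax classifier, for which the objective function would be minimization of
$$L_i = -\log\left(\frac{e^{o_{i y_i}}}{\sum_k e^{o_{ik}}}\right)$$
for the $i$-th sentence. Here $y_i$ is the correct relation class for the $i$-th instance and $o_{ik}$ is the $k$-th component of the fully connected layer output for that sentence.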
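The last two layers can be sketched together. The example below assumes that the fully connected layer is a plain affine map of the pooled features followed by a softmax, and that the loss is the negative log-likelihood of the correct class; the class count of 6 and all values are illustrative.

```python
import numpy as np

def softmax_loss(z, W_o, b_o, y):
    """Return class probabilities and negative log-likelihood for one sentence.

    z: pooled features (l,), W_o: (r, l), b_o: (r,), y: index of the correct class.
    """
    o = W_o @ z + b_o
    o = o - o.max()                          # shift for numerical stability
    log_probs = o - np.log(np.exp(o).sum())  # log-softmax
    return np.exp(log_probs), -log_probs[y]

rng = np.random.default_rng(0)
l, r = 200, 6                                # r = 6 is an illustrative class count
z = rng.normal(size=l)
W_o = rng.normal(size=(r, l)) * 0.01
b_o = np.zeros(r)
probs, loss = softmax_loss(z, W_o, b_o, y=2)
```

Subtracting the maximum before exponentiating does not change the probabilities but avoids overflow for large activations.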

Implementation
We experiment with filter lengths in two different settings. In the first, we use 100 different filters of a single fixed length in the convolutional layer; in the second, we use filters of several lengths, with 100 different filters for each length. In the first setting we therefore obtain 100 features after max pooling, while in the second we obtain 100 times the number of different filter lengths. For regularization we follow Kim (2014) and apply dropout (Srivastava et al., 2014) to the output of the max pooling layer. Dropout prevents co-adaptation of hidden units by randomly dropping a few nodes. We set the keep probability to 0.5 during training and to 1 while testing. We use the Adam technique (Kingma and Ba, 2014) to optimize our loss function. All neural network parameters and feature vectors are updated during training. We implemented the proposed model in Python using the TensorFlow package (Abadi et al., 2015) and will make it available on request. Results for each filter length are reported in the results section. The dimension of the word vectors is set to 50 and the embedding size of every other feature is set to 5.
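The dropout setting can be illustrated with a short NumPy sketch, assuming the reported value is a keep probability. Rescaling the surviving units by 1/keep_prob (inverted dropout) is common practice and is an assumption here; the text does not state how the authors' implementation scales activations.

```python
import numpy as np

def dropout(z, keep_prob, rng):
    """Inverted dropout: keep each unit with probability keep_prob and rescale."""
    if keep_prob >= 1.0:
        return z                          # test time: nothing is dropped
    mask = rng.random(z.shape) < keep_prob
    return z * mask / keep_prob

rng = np.random.default_rng(0)
z = np.ones(200)                          # e.g. 100 filters x 2 filter lengths
train_out = dropout(z, keep_prob=0.5, rng=rng)
test_out = dropout(z, keep_prob=1.0, rng=rng)
```

During training roughly half of the pooled features are zeroed and the rest are doubled, so the expected activation matches the untouched test-time vector.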

Dataset and Experimental Settings
In recent years several challenges have been organized to automatically extract information from clinical texts (Uzuner et al., 2007; Uzuner et al., 2008; Uzuner et al., 2011; Uzuner et al., 2010; Sun et al., 2013).
For extracting relations among entities, we considered all sentences having more than one entity in each discharge summary and checked whether any relation exists between those entities. In our experiments we assume that entities and their types are already known, as in other existing works (Rink et al., 2011; Minard et al., 2011a; Minard et al., 2011b). We created a data sample for every pair of entities present in a sentence and labeled it with the existing relation type. For example, in sentence [S2] (all continuous bold phrases are entities), the label for the entity pair ("her white count", "elevated") would be "TeRP", the label for the entity pair ("her gcsf", "elevated") would be "TrNAP", and the label for ("her white count", "her G-CSF") would be "None".
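The sample construction described above amounts to enumerating every pair of entities in a sentence and looking up its annotated relation, with "None" as the default. The sketch below uses the three entities and labels quoted in the example; the function name, the entity order and the dictionary-based annotation format are assumptions made for illustration.

```python
from itertools import combinations

def make_instances(entities, gold_relations):
    """Create one labelled instance for every pair of entities in a sentence.

    `entities` is a list of entity strings in sentence order; `gold_relations`
    maps an (entity, entity) pair to its annotated relation type.
    """
    instances = []
    for first, second in combinations(entities, 2):
        label = (
            gold_relations.get((first, second))
            or gold_relations.get((second, first))
            or "None"
        )
        instances.append((first, second, label))
    return instances

# Entity order within the sentence is assumed for this example.
entities = ["her white count", "elevated", "her G-CSF"]
gold = {
    ("her white count", "elevated"): "TeRP",
    ("her G-CSF", "elevated"): "TrNAP",
}
instances = make_instances(entities, gold)
```

A sentence with k entities produces k(k − 1)/2 instances, so sentences with many entities contribute mostly "None" samples.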

Influence of filter lengths
We combined the training and testing data and performed five-fold cross-validation on the limited i2b2 dataset available to us for all our evaluations. First we evaluate the influence of filter lengths. We experiment with the selection of filter length using all features. Results, averaged over the five folds, are shown in Table 2.
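The evaluation protocol can be sketched as a simple index split. The text does not say whether the folds were shuffled or stratified by relation type, so the shuffled, unstratified split below is an assumption, and the function names are hypothetical.

```python
import random

def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) for each fold of a shuffled index list."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test

def mean_score(scores):
    """Average of the per-fold scores, as reported in the result tables."""
    return sum(scores) / len(scores)

splits = list(five_fold_splits(23))   # 23 is an arbitrary sample count
```

Each sample is held out exactly once, and the reported figure for a configuration is the mean of its five per-fold scores.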
In the case of a single filter length, the results indicate that increasing the filter length generally tends to improve the performance. Using a single filter length, the best performance, an F1 score of 70.43%, was obtained with a filter length of 6. However, further increasing the filter length did not improve the result. Intuitively, choosing a filter length that is either too small or too large may not be a good option. The filter length gives the window size used to capture context features. One can expect that a filter length (window size) that is too small may not capture enough useful context, while one that is too large may include noise or irrelevant context.
Further, we used multiple filter lengths together to see whether this improves the result. The results indicate that a combination of small and mid-length filters is perhaps the better choice. For example, the combination of filter lengths 3 and 4 did not improve the performance compared to a single filter length of 3 or 4. On the other hand, the combinations of filter lengths 3 and 5, and 4 and 5, improved the performance compared to using either length alone. The best result, an F1 score of 71.16%, is obtained by using filter lengths 4 and 6 together. Adding more than two filter lengths did not lead to a performance improvement.

Classwise Performance
We took the best combination of filter lengths and looked at the classwise performance. Results are reported in Table 3.
We see from the results that as the number of training examples (see Table 1) increases, the performance of the model also improves.

Contribution of Each Features
To investigate the contribution of each feature to the final result, we gradually added one feature at a time to our model and compared the performance. Table 4 shows the obtained results. First we used only a random vector (RV) representation along with entity types (T) (first row in the table).
We could not compare our results directly with the state of the art results obtained on the i2b2 dataset, as we did not have the complete dataset. Instead, we built a linear SVM classifier, using features similar to those defined in earlier studies (Rink et al., 2011), as a baseline for comparison. In this way we prepared attribute-value and numerical features for each instance. Table 5 compares the best results obtained by the proposed model and by the SVM based model. The linear SVM classifier, with different values of the cost parameter C, was implemented using scikit-learn (Pedregosa et al., 2011). Here again the results shown are averages over the 5 folds. Based on the results, we can make the following observations:

• Other classifiers could also have been used instead of the SVM. We chose the SVM because an SVM based model with similar features obtained the best performance in the 2010 challenge.
• In any case, we would still have to define a huge number of features, of which only a few would have non-zero values in any given sample or instance.
• The proposed model, with a limited number of features (75 × the number of words in the sentence: a 5-dimensional vector for each of the 5 features other than the word embedding, which is a 50-dimensional vector), still gave better performance.
• Consistent with our observations in section 5.1, too many features trying to capture more context adversely affect the performance of the classifier. The features defined above include features that try to capture the context of all possible window sizes between the mentioned entities.

Conclusion
In this work we present a new framework based on CNN for extracting relations among clinical entities in clinical texts. The proposed model shows better performance than the SVM based baseline model while using only a small fraction of the features. Our results indicate that the CNN is able to learn global features that capture contextual information quite well, which helps to improve the performance.
The following features are used for each entity pair instance in the SVM baseline:
• Any word between relation arguments
• Any PoS tags between relation arguments. We used the GENIA tagger for PoS
• Any bigram between relation arguments
• Word preceding first and second argument
• Any three words succeeding the first and second arguments
• Sequence of chunk tags between relation arguments. We used the GENIA tagger for chunk tags
• String of words between relation arguments
• First and second argument type (problem, treatment and test)
• Order of argument type appeared in sentence
• Distance between two arguments in terms of number of words
• Presence of only punctuation sign between arguments.
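A few of the features in this list can be sketched as sparse attribute-value pairs, which illustrates why such feature sets become very large: every distinct word, bigram or string becomes its own dimension. The function name, the feature naming scheme and the span format are hypothetical, and only a subset of the listed features is covered.

```python
def baseline_features(tokens, first, second):
    """A few of the listed features for one entity pair.

    `first` and `second` are (start, end, type) token spans, end inclusive,
    with `first` occurring before `second` in the sentence.
    """
    between = tokens[first[1] + 1:second[0]]
    feats = {f"word_between={w}": 1 for w in between}
    feats.update({f"bigram_between={a}_{b}": 1 for a, b in zip(between, between[1:])})
    feats["string_between=" + " ".join(between)] = 1
    feats[f"type_order={first[2]}_{second[2]}"] = 1
    feats["distance"] = len(between)          # numerical feature
    if first[0] > 0:
        feats["word_before_first=" + tokens[first[0] - 1]] = 1
    return feats

tokens = "He was given Lexix to prevent him from congestive heart failure .".split()
feats = baseline_features(tokens, (3, 3, "treatment"), (8, 10, "problem"))
```

A dictionary like `feats` could then be turned into a sparse vector, for example with scikit-learn's `DictVectorizer`, before training a linear SVM; that step is omitted here.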
During the challenge, the original dataset had 394 documents for training and 477 documents for testing, but when we downloaded the dataset from the i2b2 website we obtained only 170 documents for training and 256 documents for testing. After a preliminary experiment we found that we did not have

Table 2: Comparative performance of the proposed model using filters of different lengths separately and together. Each of the models used all features (WV + P1 + P2 + PoS + Chunk + Type) and 100 filters for each filter length.

Table 1 )
increases, performance of the model also improves.The relation class

Table 5: Comparative performance of SVM and CNN with filter lengths [4, 6], each with 100 filters.