Learning local and global contexts using a convolutional recurrent network model for relation classification in biomedical text

The task of relation classification in the biomedical domain is complex due to the presence of samples obtained from heterogeneous sources such as research articles, discharge summaries, or electronic health records. It is also a constraint for classifiers which employ manual feature engineering. In this paper, we propose a convolutional recurrent neural network (CRNN) architecture that combines RNNs and CNNs in sequence to solve this problem. The rationale behind our approach is that CNNs can effectively identify coarse-grained local features in a sentence, while RNNs are more suited for long-term dependencies. We compare our CRNN model with several baselines on two biomedical datasets, namely the i2b2-2010 clinical relation extraction challenge dataset, and the SemEval-2013 DDI extraction dataset. We also evaluate an attentive pooling technique and report its performance in comparison with the conventional max pooling method. Our results indicate that the proposed model achieves state-of-the-art performance on both datasets.


Introduction
Relation classification is the task of identifying the semantic relation present between a given pair of entities in a piece of text. Since most search queries are some form of binary factoid (Agichtein et al., 2005), modern question-answering systems rely heavily upon relation classification as a preprocessing step (Fleischman et al., 2003; Lee et al., 2007). Accurate relation classification also facilitates discourse processing and precise sentence interpretations. Hence, this task has witnessed a great deal of attention over the last decade (Mintz et al., 2009; Surdeanu et al., 2012). The code can be found at: https://github.com/desh2608/crnn-relation-classification.
In the biomedical domain, in particular, extracting such tuples from data may be essential for identifying protein and drug interactions, symptoms and causes of diseases, among others. Further, since clinical data tends to be obtained from multiple (and diverse) information sources such as journal articles, discharge summaries, and electronic patient records, relation classification becomes a more challenging task.
To identify relations between entities, a variety of lexical, syntactic, or pragmatic cues may be exploited, which results in a challenging variability in the type of features used for classification. Due to this variability, a number of approaches have been suggested, some of which rely on features extracted from POS tagging, morphological analysis, dependency parsing, and world knowledge (Kambhatla, 2004; Santos et al., 2015; Suchanek et al., 2006). Deep learning architectures have recently gathered much interest because of their ability to extract relevant features without the need for explicit feature engineering. For this reason, a number of convolutional and recurrent neural network models (Zeng et al., 2014; Xu et al., 2015b) have been used for this task.
In this paper, we propose a model that uses recurrent neural networks (RNNs) and convolutional neural networks (CNNs) in sequence to learn global and local context, respectively. We refer to this as CRNN, following the naming convention used in (Huynh et al., 2016). We argue that in order for any classification task to be effective, the regression layer must see a complete representation of the sentence, i.e., both short and long-term dependencies must be appropriately represented in the sentence embedding. This argument forms the basis of our approach. In a deep learning framework, since the complete information available to the classifier at the top level is obtained through manipulation of the sentence embedding itself, the task of relation classification essentially emulates other popular objectives such as text classification and sentiment analysis if the representations of the entity types are integrated in the sentence. Although our proposed model uses RNNs and CNNs in sequence, it is only two layers deep, as opposed to the very deep architectures proposed earlier (Conneau et al., 2016). This simplicity allows for intuitive understanding of each level of the model, while still learning a sufficiently complex representation of the input sentence.
In addition to local and global contexts, we also experiment with attention for relation classification. Although attention as a concept is relatively well-known, especially in computational neuroscience (Itti et al., 1998; Desimone and Duncan, 1995), it became popular only recently with applications to image captioning and machine translation (Xu et al., 2015a; Vinyals et al., 2015; Bahdanau et al., 2014). Attention has also been employed with some success in relation classification tasks (Wang et al., 2016a; Zhou et al., 2016a). In our experiments, we use an attention-based pooling strategy and compare the results with those obtained using conventional pooling methods. Our model variants are accordingly named CRNN-Max and CRNN-Att, depending upon the pooling scheme used.
Our model is distinctive in that it does not rely upon any linguistic feature for relation classification. In domains such as biomedicine, texts may not always be written in syntactically or grammatically correct form. Furthermore, the lack of sufficient training data may mean that feature extractors as good as those in generic domains are not available. Hence, we explored only models without any extra features. Of course, adding other features such as part-of-speech tags or dependency parses, if they are easily available, may improve the performance further. Our key contributions in this paper are as follows:
• We propose and validate a two-layer architecture comprising RNNs and CNNs in sequence for relation classification in biomedical text. Our model's performance is comparable to the state-of-the-art on two benchmark datasets, namely the i2b2-2010 clinical relation extraction challenge, and the SemEval-2013 DDI extraction dataset, without any need for handcrafted features.
• We analyze and discuss why such a model effectively captures short and long-term dependencies in a sentence, and demonstrate why this representation facilitates classification.
• We evaluate an attention-based pooling technique and compare its performance with conventional pooling strategies.
• We provide evidence to further the argument in favor of using RNNs to obtain regional embeddings in a sentence.

Related Research
CNNs have been effectively employed in NLP tasks such as text classification (Kim, 2014), sentiment analysis (Dos Santos and Gatti, 2014), relation classification (Zeng et al., 2014; Nguyen and Grishman, 2015b), and so on. RNN models have also been used for similar tasks (Johnson and Zhang, 2016). The improved performance of these models is due to several reasons:
1. Pretrained word vectors are used as inputs for most of these models. These embeddings capture the semantic similarity between words in a global context better than one-hot representations.
2. CNNs are capable of learning local features such as short phrases or recurring n-grams, similar to the way they provide translational, rotational and scale invariance in vision.
3. RNNs utilize the word order in the sentence, and are also able to learn the long-term dependencies.
These observations amply motivate a model which captures both short-term and long-term dependencies using a combination of CNNs and RNNs to form a robust representation of the sentence. Earlier, researchers have proposed RCNN models that compute "regional embeddings" using a CNN at the first level, and these embeddings are then fed into an RNN layer which uses sequence information to generate the sentence representation (Huynh et al., 2016; Wang et al., 2016b; Chen et al., 2017; Nguyen and Grishman, 2015a). These models are similar to ones that have also been employed with some success for visual recognition (Donahue et al., 2015). However, such models are still limited because the RNN may "forget" features that occurred in the past if the sequence is very long.
We solve this problem by obtaining the output of the RNN at each time step (or word), and then pooling small phrases. This method of using a "recurrent+pooling" module for regional embedding is inspired by (Johnson and Zhang, 2016), who showed that for text categorization, embeddings of text regions, which can convey higher-level concepts than single words in isolation, are more useful than word embeddings. We also experiment with attention-pooling to integrate weighted features from discontinuous regions in the sentence.

Proposed Method
Given a sentence $S$ with marked entities $e_1$ and $e_2$, belonging to entity types $t_1$ and $t_2$, respectively, and a set of relation classes $C = \{c_1, \ldots, c_m\}$, we formulate the task of identifying the semantic relation as a supervised classification problem, i.e., we learn a function $f : (\mathcal{S}, E, T) \to C$, where $\mathcal{S}$ is the set of all sentences, $E$ is the set of entity pairs, and $T$ denotes the set of entity types. Our training objective is to learn a joint representation of the sentence and the entity types, such that a softmax regression layer predicts the correct label. To learn such an embedding, we propose a two-layer neural network architecture consisting of a "recurrent+pooling" layer and a "convolutional+pooling" layer in sequence. This architecture is diagrammatically described in Fig. 1, and the remainder of this section explains each of the layers in detail.

Embedding layer
The only features we use from S are the words themselves. The vector representation of these words is obtained using the GloVe method (Pennington et al., 2014).
Pre-trained word vectors are used for the word embeddings and the words not present in the embeddings list are initialized randomly. All the word vectors are updated during training.
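The initialization described above can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `build_embedding_matrix`, the `pretrained` dictionary, the uniform range used for unknown words, and the toy vocabulary are all assumptions made for the example.

```python
import numpy as np

def build_embedding_matrix(vocab, pretrained, dim, seed=0):
    """Return a (len(vocab), dim) matrix: pretrained rows where available,
    random rows otherwise. In the model, every row would stay trainable."""
    rng = np.random.default_rng(seed)
    # Assumption: small uniform random values for words without a pretrained vector.
    matrix = rng.uniform(-0.25, 0.25, size=(len(vocab), dim))
    for index, word in enumerate(vocab):
        if word in pretrained:
            matrix[index] = pretrained[word]
    return matrix

# Toy example: two words have (hypothetical) pretrained vectors, one does not.
vocab = ["patient", "lasix", "TREATMENT_A"]
pretrained = {"patient": np.ones(4), "lasix": np.zeros(4)}
E = build_embedding_matrix(vocab, pretrained, dim=4)
```

Rows for words found in `pretrained` are copied verbatim, while the remaining rows keep their random values until training updates them.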

Recurrent layer
RNNs are a class of artificial neural networks which utilize sequential information and maintain history through their intermediate layers (Graves et al., 2009). We use a long short-term memory (LSTM) based model (Hochreiter and Schmidhuber, 1997), which uses memory and a gating mechanism to compute the hidden state. In particular, we use a bidirectional LSTM model (Bi-LSTM) similar to the ones used in (Graves, 2013; Huang et al., 2015).
Let $h_l^{(t)}$ and $h_r^{(t)}$ be the outputs obtained from the forward and backward directions of the LSTM at time $t$. Then the combined output is given as
$$z_t = h_l^{(t)} : h_r^{(t)}$$
where $:$ denotes the concatenation operation. We obtain the output at each word and pass it to the first pooling layer.
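The per-word concatenation of forward and backward outputs can be illustrated with a small sketch. To keep it short, a plain tanh recurrent cell is used here as a stand-in for the LSTM cell; the helper names, parameter shapes, and random inputs are assumptions for illustration only.

```python
import numpy as np

def run_rnn(inputs, W, U, b):
    """Plain tanh RNN (a stand-in for the LSTM cell); one output per time step."""
    h = np.zeros(U.shape[0])
    outputs = []
    for x_t in inputs:
        h = np.tanh(W @ x_t + U @ h + b)
        outputs.append(h)
    return np.stack(outputs)

def bidirectional_outputs(inputs, fwd, bwd):
    """Concatenate forward and backward outputs at every time step."""
    h_l = run_rnn(inputs, *fwd)                # left-to-right pass
    h_r = run_rnn(inputs[::-1], *bwd)[::-1]    # right-to-left pass, re-aligned to word order
    return np.concatenate([h_l, h_r], axis=1)

def init_params(rng, d, hidden):
    """Hypothetical random parameters (input weights, recurrent weights, bias)."""
    return (rng.normal(size=(hidden, d)),
            rng.normal(size=(hidden, hidden)),
            np.zeros(hidden))

rng = np.random.default_rng(0)
m, d, hidden = 6, 4, 3                         # sentence length, embedding size, units per direction
x = rng.normal(size=(m, d))
fwd, bwd = init_params(rng, d, hidden), init_params(rng, d, hidden)
z = bidirectional_outputs(x, fwd, bwd)         # one vector of size 2 * hidden per word
```

Each row of `z` sees the words to its left through the first half of the vector and the words to its right through the second half.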

First pooling layer
The recurrent layer generates word-level embeddings that incorporate information from the past and future context. Sometimes the word itself may not be important for the sentence representation, and in such cases, it may be better to extract the most important features from short phrases using a pooling technique. If $f_1$ denotes the length of the filter used for pooling, and $(z_1, \ldots, z_m)$ is the sequence of vectors obtained from the previous layer, then the pooled sequence $p$ contains $m - f_1 + 1$ vectors, where $p_i \in \mathbb{R}^{n_O}$ is given as
$$p_i = \max(z_{i+1}, \ldots, z_{i+f_1})$$
i.e. the dimension-wise maximum among all vectors $z_{i+1}$ to $z_{i+f_1}$.
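As a small worked example of this step, the following sketch pools every window of $f_1$ consecutive vectors with stride 1, which is the setting implied by the output length $m - f_1 + 1$. The function name and the toy input are assumptions for illustration.

```python
import numpy as np

def sliding_max_pool(z, f1):
    """Dimension-wise max over every window of f1 consecutive vectors (stride 1).
    z has shape (m, n_O); the result has shape (m - f1 + 1, n_O)."""
    m = z.shape[0]
    return np.stack([z[i:i + f1].max(axis=0) for i in range(m - f1 + 1)])

z = np.array([[1.0, 5.0],
              [3.0, 2.0],
              [0.0, 4.0],
              [2.0, 1.0]])
p = sliding_max_pool(z, f1=2)
# p == [[3, 5], [3, 4], [2, 4]]
```

With $f_1 = 1$ the function returns its input unchanged, which corresponds to using no pooling at this level.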

Convolutional layer
We apply convolution on $p$ to get local features from each part of the sentence (Collobert and Weston, 2008). Consider a convolutional filter parametrized by weight vector $w_c \in \mathbb{R}^{n_O \cdot f_2}$, where $f_2$ is the length of the filter. Then the output sequence of the convolution layer would be
$$h_i^c = f(w_c \cdot p_{i:i+f_2-1} + b_c)$$
where $i = 1, 2, \ldots, m - f_1 - f_2 + 2$, $p_{i:i+f_2-1}$ is the concatenation of the $f_2$ consecutive pooled vectors in the $i$-th window, $\cdot$ is the dot product, $f$ is the rectified linear unit (ReLU) function ($f(x) = \max\{0, x\}$), and $b_c \in \mathbb{R}$ is the bias term. The parameters $w_c$ and $b_c$ are shared across all convolutions $i = 1, 2, \ldots, m - f_1 - f_2 + 2$. On applying $n_c$ such filters, we obtain an output matrix $H_c \in \mathbb{R}^{n_c \times (m - f_1 - f_2 + 2)}$.
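The convolution and the resulting output length can be checked with a short sketch. It stacks the $n_c$ filters into one matrix and applies them to each window of $f_2$ pooled vectors; the function name, the random stand-in for the pooled sequence, and the toy sizes are assumptions for illustration.

```python
import numpy as np

def conv1d_relu(p, W, b):
    """p: (L, n_O) pooled vectors; W: (n_c, f2 * n_O) stacked filters; b: (n_c,) biases.
    Returns H_c with shape (n_c, L - f2 + 1)."""
    L, n_O = p.shape
    f2 = W.shape[1] // n_O
    columns = []
    for i in range(L - f2 + 1):
        window = p[i:i + f2].reshape(-1)                   # concatenate f2 consecutive vectors
        columns.append(np.maximum(0.0, W @ window + b))    # ReLU of filter responses
    return np.stack(columns, axis=1)

m, f1, f2, n_O, n_c = 10, 2, 5, 4, 3
rng = np.random.default_rng(0)
p = rng.normal(size=(m - f1 + 1, n_O))                     # stand-in for the first pooling output
H_c = conv1d_relu(p, rng.normal(size=(n_c, f2 * n_O)), np.zeros(n_c))
# H_c.shape == (n_c, m - f1 - f2 + 2) == (3, 5)
```

For a sentence of 10 words with $(f_1, f_2) = (2, 5)$, the pooled sequence has 9 vectors and the convolution produces 5 positions, matching $m - f_1 - f_2 + 2$.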

Second pooling layer
The output of the convolutional layer is of variable length ($m - f_1 - f_2 + 2$), since it depends on the length $m$ of the input sentence. To obtain fixed-length global features for the entire sentence, we apply pooling over the entire sequence. For this, we experiment with two different pooling schemes, based on which our model has two variations, namely CRNN-Max and CRNN-Att.

3.5.1 Max pooling over time
Max pooling over time (Collobert and Weston, 2008) takes the maximum over the entire sentence, with the assumption that all the relevant information is accumulated in that position. Since the inputs to this layer are the local convolved vectors, this strategy essentially extracts the most important features from several short phrases. The output is then given as
$$z_{pool} = \max_i h_i^c$$
where $z_{pool} \in \mathbb{R}^{n_c}$ is the dimension-wise maximum over all $h_i^c$'s.

3.5.2 Attention-based pooling
A max pooling scheme may fail when important cues are distributed across different clauses in the sentence. We solve this problem by using an attention-based pooling scheme, which obtains an optimal feature dimension-wise by taking weighted linear combinations of the vectors. These weights are trained using an attention mechanism such that more important features are weighted higher (Bahdanau et al., 2014; Yang et al., 2016; Zhou et al., 2016b). The attention mechanism produces a vector $\alpha$ of size $m - f_1 - f_2 + 2$, and the values in this vector are the weights for each phrase obtained from the convolutional layer feature vectors.
Here, $H_c$ is the matrix of CNN output vectors, $W_1^{\alpha}, W_2^{\alpha} \in \mathbb{R}^{n_c \times n_c}$ are the parameter matrices, $\alpha \in \mathbb{R}^{m - f_1 - f_2 + 2}$ are the attention weights, and $z_{att} \in \mathbb{R}^{n_c}$ is the output of the pooling layer. The attention weights are a function of the input sentence, and hence $\alpha$ is different for every sentence.
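The two pooling schemes can be contrasted in a short sketch. Max pooling over time follows directly from its description. For attention pooling, the scoring function used below (a tanh projection followed by a learned vector and a softmax) is an assumption chosen for illustration and is not necessarily the parameterization used in this paper; only the final step, a weighted combination of the columns of $H_c$, is taken from the text.

```python
import numpy as np

def max_pool_over_time(H_c):
    """Dimension-wise maximum over all positions; H_c has shape (n_c, L)."""
    return H_c.max(axis=1)

def attention_pool(H_c, W, v):
    """Weighted combination of the columns of H_c.
    Assumed scoring: score_i = v . tanh(W h_i), alpha = softmax(scores)."""
    scores = v @ np.tanh(W @ H_c)                  # one score per position, shape (L,)
    scores = scores - scores.max()                 # shift for numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()  # attention weights, sum to 1
    return H_c @ alpha, alpha

rng = np.random.default_rng(0)
n_c, L = 3, 5
H_c = rng.normal(size=(n_c, L))                    # stand-in for the convolution output
z_pool = max_pool_over_time(H_c)
z_att, alpha = attention_pool(H_c, rng.normal(size=(n_c, n_c)), rng.normal(size=n_c))
```

Both outputs have size $n_c$ regardless of the sentence length, which is what allows a fixed-size classifier on top. Because the attention weights are non-negative and sum to one, each component of `z_att` is bounded above by the corresponding component of `z_pool`.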

Fully connected and softmax
To obtain a classifier over the extracted global features, we use a fully connected layer consisting of $|C|$ nodes, where $C$ is the set of all possible relation classes, followed by a softmax layer to generate a probability distribution over the set of all possible labels. The final output is given as
$$p(y \mid S) = \mathrm{softmax}(W_o z + b_o)$$
where $W_o$ and $b_o$ are the weight and bias parameters, and $z$ may be either $z_{pool}$ or $z_{att}$, depending on the second pooling layer scheme. The predicted output $\hat{y}$ is obtained as
$$\hat{y} = \arg\max_{y \in C} p(y \mid S)$$

Class    Train size    Test size
TrCP     436           108
TrAP     2131          532
TrWP     109           26
TrIP     165           41
TrNAP    140           34
TeRP     2457          614
TeCP     409           101
PIP      1776          443
None     44588         11146
Total    52211         13045

Table 1: Number of training and testing instances for each relation type in the i2b2 dataset.

i2b2-2010 relation extraction
This dataset contains sentences from discharge summaries collected from three different hospitals and has 8 relation types: treatment caused medical problems (TrCP), treatment administered medical problem (TrAP), treatment worsen medical problem (TrWP), treatment improve or cure medical problem (TrIP), treatment was not administered because of medical problem (TrNAP), test reveal medical problem (TeRP), test conducted to investigate medical problem (TeCP), and medical problem indicates medical problem (PIP). If a sentence has more than two entities, we make an instance for each pair. Since only 170 of the 394 original training documents and 256 of the 477 testing documents were available for download, we combined all the training and testing instances, and then split them in an 80:20 ratio for training and test sets, respectively. The statistics of the dataset are described in Table 1.
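Instance generation and the 80:20 split can be sketched as below. This is only an illustration: the helper names are hypothetical, and shuffling with a fixed seed before splitting is an assumption, since the exact splitting procedure is not specified here.

```python
import itertools
import random

def pair_instances(sentence, entities):
    """One candidate instance per unordered pair of marked entities."""
    return [(sentence, e1, e2) for e1, e2 in itertools.combinations(entities, 2)]

def split_80_20(instances, seed=0):
    """Shuffle (assumption) and split into 80% training and 20% test instances."""
    shuffled = list(instances)
    random.Random(seed).shuffle(shuffled)
    cut = int(0.8 * len(shuffled))
    return shuffled[:cut], shuffled[cut:]

# A sentence with three marked entities yields three candidate instances.
instances = pair_instances("toy sentence", ["e1", "e2", "e3"])
train, test = split_80_20(range(10))
```

The totals in Table 1 (52211 training and 13045 test instances) are consistent with such a split, since 52211 is about 80% of the 65256 combined instances.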

SemEval 2013 DDI extraction
This dataset contains annotated sentences from two sources, Medline abstracts (biomedical research articles) and the DrugBank database (documents written by medical practitioners). The dataset is annotated with the following four kinds of interactions: advice (opinion or consultation related to the simultaneous use of the two drugs), effect (effect of the DDI together with pharmacodynamic effect or mechanism of interaction), mechanism (pharmacokinetic mechanism), and int (drug interaction without any other information). The dataset provides the training and test instances as sentences. Similar to the i2b2 relation extraction dataset, if a sentence has more than two drug names, all possible pairs of drugs in the sentence are separately annotated, such that a single sentence having multiple drug names leads to separate instances of drug pairs and corresponding interactions. Statistics of the dataset (along with negative instance filtering, discussed in Section 4.1.1) are shown in Table 2.

Preprocessing
As a preprocessing step, we replace the entities in the i2b2 dataset with the corresponding entity types. For instance, the sentence: "He was given Lasix to prevent him from congestive heart failure." was converted to: "He was given TREATMENT A to prevent him from PROBLEM B." Similarly, for the DDI extraction dataset, the two targeted drug names are replaced with DRUG-A and DRUG-B respectively, and other drug names in the same sentence are replaced with DRUG-N. Further, all numbers were replaced with the keyword NUM. Similar to the earlier studies (Sahu and Anand, 2017; Liu et al., 2016; Rastegar-Mojarad et al., 2013), negative instances were filtered from the training sets.
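A minimal sketch of this masking step is given below. The function name, the token-span input format, the underscore-joined placeholder tokens (`TREATMENT_A`, `PROBLEM_B`), and the regular expression used to detect numbers are all assumptions made for the example; the released code may handle these details differently.

```python
import re

def mask_sentence(tokens, spans, placeholders):
    """Replace each entity span (start, end), end exclusive, with its placeholder,
    then map purely numeric tokens to the keyword NUM."""
    starts = {start: (end, name) for (start, end), name in zip(spans, placeholders)}
    out, i = [], 0
    while i < len(tokens):
        if i in starts:
            end, name = starts[i]
            out.append(name)      # the whole entity collapses into one placeholder token
            i = end
        else:
            token = tokens[i]
            out.append("NUM" if re.fullmatch(r"\d+(\.\d+)?", token) else token)
            i += 1
    return out

tokens = "He was given Lasix to prevent him from congestive heart failure .".split()
masked = mask_sentence(tokens, [(3, 4), (8, 11)], ["TREATMENT_A", "PROBLEM_B"])
# " ".join(masked) == "He was given TREATMENT_A to prevent him from PROBLEM_B ."
```

Collapsing a multi-word entity into a single placeholder also shortens the sentence, so the entity type occupies exactly one position in the input sequence.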

Implementation details
Pretrained 100-dimensional word vectors in the embedding layer are obtained using the GloVe method (Pennington et al., 2014) trained on a corpus of PubMed open source articles (Muneeb et al., 2015), and are updated during the training process. We use both $\ell_2$ regularization and dropout (Srivastava et al., 2014) for regularization. Dropout is applied only on the output of the second pooling layer, and it prevents co-adaptation of hidden units by randomly dropping a few nodes. After tuning the hyperparameters on a validation set (20% of the training set), the values of 0.01 (0.001) and 0.7 (0.5) were found optimal for the regularization parameter and dropout for the i2b2 (DDI extraction) dataset, respectively. We use the Adam optimizer (Kingma and Ba, 2014) to minimize our loss function, with a learning rate of 0.01. For all the models, $n_O$ and $n_c$ were tuned on the validation set, and values of 200 and 100 were found to be optimal. Hyperparameters of baseline methods were taken from the values suggested in the respective papers. All neural network parameters and feature vectors are updated during training. We have implemented the proposed model in Python using the TensorFlow package (Abadi et al., 2016). We experiment with different filter sizes for $f_1$ and $f_2$ and discuss the results in Section 5.1.

Baseline methods
We compare our models with 5 methods that have earlier been used for relation classification to satisfactory results. These baselines were selected for one of the following three purposes.

Feature-based methods
We selected a feature-based SVM classifier (Rink et al., 2011) that uses several handcrafted features such as distance of word from entities, POS tags, chunk tags, etc., to compare whether our models were able to outperform classifiers with rigorous feature engineering. It is to be noted that we use our own implementation of the SVM classifier (using the scikit-learn (Pedregosa et al., 2011) library), using features as described in (Sahu et al., 2016).

Single-layer neural networks
We selected a multiple-filter CNN with max-pooling (Sahu et al., 2016) and an LSTM model with max and attentive pooling (Sahu and Anand, 2017). In Section 5.5, we compare our models with these single-layer models to justify using a combination of RNN and CNN to learn long-term and short-term dependencies, respectively. To observe the effect of the network model independent of the feature set, we use only the word embeddings as features for each of these models. Further, we used the same hyperparameters as mentioned in the respective papers.

Recurrent convolutional neural network
This model, inspired by (Wang et al., 2016b), obtains regional embeddings using a convolutional layer. These are then fed into a recurrent layer and a single output is obtained after traversing the entire sequence. We compare our models with this RCNN model to observe the effect of obtaining outputs at every word, as opposed to at the end of the sequence.

Effect of filter sizes $f_1$ and $f_2$
We experiment with various combinations of filter sizes $f_1$ and $f_2$ on the i2b2 dataset using our CRNN-Att model. Since $f_1$ denotes the size of the first pooling filter, it essentially represents the amount of information present in a regional embedding that is fed into the convolutional layer. If $f_1$ is too small ($f_1 = 1$, i.e., no pooling), embeddings from seemingly unimportant words may get through, and if it is large ($f_1 \geq 3$), individual embeddings may get pooled such that a few words dominate the majority of regions. For the filter size $f_2$ in the convolutional layer, a midrange value (4 to 6) was found to work well. This may be because this layer learns to identify short phrases which are usually of this length. These observations were common for both datasets. The F1 scores for various combinations of filter sizes on the i2b2 data are shown in Table 3. In the remaining experiments, we choose $(f_1, f_2) = (2, 5)$ for both our model variants.

Initialization and tuning of word embeddings
The only feature used in our models is the word vectors for every word in the sentence. We perform several experiments on the i2b2 data to observe the effect of word vector initialization and update on the model performance. The results are summarized in Table 5. Interestingly, the best performing model uses randomly initialized word embeddings that are not updated during training. This is in contrast to earlier studies (Sahu and Anand, 2017; Collobert and Weston, 2008) where pretrained embeddings usually improved model performances by 3-4%. However, this result aligns with the observations made in (Johnson and Zhang, 2015) and supports the argument for one-hot LSTMs. It may be enlightening to discuss why such a result is obtained. First, we note that in the formulas for LSTM, e.g., $u_t = \tanh(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)})$, if $x_t$ is the one-hot representation of a word, the term $W^{(u)} x_t$ serves as a word embedding. Thus, a one-hot LSTM inherently includes a word embedding in its computation. Further, a word vector lookup is a linear operation, and hence it may be merged into the LSTM layer itself by multiplying the LSTM weights by the word embedding matrix. This means that the expressive power of an LSTM which uses pretrained vectors is the same as that of one which uses randomly initialized word embeddings. It has also been shown in earlier studies that pretrained embeddings do not improve the performance of networks as the number of layers increases. Johnson and Zhang (2015) even argued that the embedding layer can be replaced with a one-hot representation without compromising on the performance. Empirically, inclusion of an embedding layer makes training from scratch more difficult, even with the help of adaptive learning rates. Similar observations have been made regarding CNNs (Kim, 2014; Johnson and Zhang, 2014).

Table 4 shows the results obtained on the i2b2 and DDI extraction datasets using our proposed models, as compared to the baseline methods. Our models outperform the baselines even without the need for explicit feature engineering. It is interesting to note that our CRNN-Max performs better than the CRNN-Att, and a similar result has also been observed earlier in (Sahu and Anand, 2017).
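Returning to the argument about one-hot LSTMs in the discussion of word embeddings, the claim that an embedding lookup can be merged into the recurrent layer's input weights can be verified numerically. The sketch below uses small random matrices; the names `E`, `W`, and `W_merged` and all sizes are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, emb_dim, hidden = 5, 4, 3
E = rng.normal(size=(vocab_size, emb_dim))   # embedding matrix, one row per word
W = rng.normal(size=(hidden, emb_dim))       # input weights of one gate, playing the role of W^(u)

word_id = 2
one_hot = np.zeros(vocab_size)
one_hot[word_id] = 1.0

via_lookup = W @ E[word_id]                  # embed the word first, then apply the gate weights
W_merged = W @ E.T                           # fold the embedding matrix into the gate weights
via_one_hot = W_merged @ one_hot             # same pre-activation computed from the one-hot input
```

Since `via_lookup` and `via_one_hot` coincide for every word, any function computed with an embedding layer can also be computed by a one-hot model with weights `W_merged`, which is the sense in which the two have the same expressive power.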

Class-wise performance analysis
We compare class-wise performance of our models on the i2b2 dataset with some of the baselines, and this is summarized in Table 6. It is evident that performances improve with training size, and from the confusion matrices (not shown here), we found that samples of a lower frequency class were misclassified into a higher frequency class comprising the same entity types. For instance, samples belonging to TrWP (Treatment Worsen medical Problem) were often classified as TrAP (Treatment Administered medical Problem).

Effect of attention-based pooling
Our CRNN-Att model uses an attention-based technique in the final pooling layer, i.e. it obtains a weighted linear combination of different phrases depending upon their relative importance in the sentence embedding. To confirm this, we visualize attention weights in a CRNN-Att model with $(f_1, f_2) = (1, 3)$, for 5 samples in the i2b2 dataset through a heat map as shown in Figure 2. Since weights are assigned to phrases rather than words, to obtain attention for each word we take the mean of weights of all phrases that the word is present in. The figure shows that the attentive pooling scheme is able to select important phrases depending upon the classification label. It is evident that the model assigns a higher weight to semantically relevant words such as "showed," "question," and "revealed".
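The averaging of phrase weights into word weights can be sketched as follows. The sketch assumes stride-1 windows, so that phrase $i$ covers $f_1 + f_2 - 1$ consecutive words starting at word $i$; the function name and the toy weights are illustrative assumptions.

```python
import numpy as np

def word_attention(alpha, m, f1, f2):
    """Average, for each of the m words, the weights of all phrases covering it.
    Phrase i is assumed to span words i .. i + f1 + f2 - 2 (stride-1 windows)."""
    span = f1 + f2 - 1
    totals, counts = np.zeros(m), np.zeros(m)
    for i, weight in enumerate(alpha):
        totals[i:i + span] += weight
        counts[i:i + span] += 1
    return totals / counts

alpha = np.array([0.1, 0.6, 0.3])            # attention weights for 3 phrases
weights = word_attention(alpha, m=5, f1=1, f2=3)
# weights == [0.1, 0.35, 0.333..., 0.45, 0.3]
```

With 5 words and $(f_1, f_2) = (1, 3)$ there are $5 - 1 - 3 + 2 = 3$ phrases, and the middle word is covered by all three of them, so its weight is their plain average.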

Long and short term dependencies
We conjecture that our proposed CRNN models perform better than single layer CNNs or RNNs because they capture both local and global contexts efficiently. To confirm our hypothesis, we determine the average sentence lengths and entity separations for several sets of sentences belonging to classes where our models performed well, and for classes where either the CNN model or the LSTM-Max model performed relatively well, for the i2b2-2010 dataset. These results are visualized in the box plots shown in Fig. 3.
From the figure, we note that our models CRNN-Max and CRNN-Att perform significantly better than a CNN model in classifying long sentences with large entity separation, while CNN models work well with shorter sentences where the entities are less separated. This is evident by observing the median and range of lower to upper quartile values in the figure. This confirms our conjecture that our models learn long-term dependencies better than a simple CNN model. Similarly, our proposed models perform better on a larger range of sentence lengths than LSTMs, which may be due to more effective modeling of local contexts.

Effect of linguistic features
The SVM baseline model described earlier consists of the following features obtained for each word in the sentence: word embedding, part-of-speech (POS) tag, chunk tag, distance from first entity, distance from second entity, and entity type. Of these, the entity type feature is already used in our CRNN model in the preprocessing step by replacing the entities with their corresponding types. Furthermore, we have also described experiments with initialization and update of word embeddings.
In this section, we add the four other linguistic features in our proposed model to observe its performance in comparison with the SVM model. Table 7 summarizes this comparison.
Although the F1 scores for the models are relatively close, the precision (P) and recall (R) vary significantly: P is 67.44 and 61.00, while R is 57.85 and 67.54, for the SVM and CRNN-Max models, respectively. Our CRNN-Max model, therefore, has higher recall (sensitivity), while the SVM classifier has higher precision. Furthermore, it is evident that SVM outperforms our model only on classes with a disproportionately low instance count. We may argue that due to the presence of more features and fewer records, our model gets over-trained only on the larger classes. This problem may then be avoided with better regularization, to achieve even higher performance.

Conclusion
In this work, we proposed and evaluated a two-layer architecture comprising recurrent and convolutional layers in sequence to learn global and local contexts in a sentence, which was then used for relation classification. To the best of our knowledge, this is the first attempt at combining CNNs and RNNs in sequence for a relation classification task in the biomedical domain. Two variants of the model, namely CRNN-Max and CRNN-Att, were evaluated on the i2b2-2010 dataset and the SemEval 2013 DDI extraction dataset, and max-pooling was found to perform better than attentive pooling. Even though our method employed only word embeddings as input features, it was able to outperform state-of-the-art techniques that use extensive feature engineering. Finally, our results indicated that a "recurrent+pooling" layer effectively generates regional embeddings without the need for pretrained word vectors. It would be interesting to see whether one-hot word vectors perform better than randomly initialized embeddings. We may also benefit from probing whether tree-based or non-continuous convolutions work as well as our CRNN models for learning long and short term dependencies for relation classification.