Convolutional neural networks for chemical-disease relation extraction are improved with character-based word embeddings

We investigate the incorporation of character-based word representations into a standard CNN-based relation extraction model. We experiment with two common neural architectures, CNN and LSTM, to learn word vector representations from character embeddings. Through a task on the BioCreative-V CDR corpus, extracting relationships between chemicals and diseases, we show that models exploiting the character-based word representations improve on models that do not use this information, obtaining state-of-the-art results relative to previous neural approaches.


Introduction
Relation extraction, the task of extracting semantic relations between named entities mentioned in text, has become a key research topic in natural language processing (NLP) with a variety of practical applications (Bach and Badaskar, 2007). Traditional approaches to relation extraction are feature-based and kernel-based supervised learning methods that utilize various lexical and syntactic features as well as knowledge base resources; see the comprehensive survey of these traditional approaches in Pawar et al. (2017). Recent research has shown that neural network (NN) models for relation extraction obtain state-of-the-art performance. The two major neural architectures for the task are convolutional neural networks (CNNs) (Zeng et al., 2014; Nguyen and Grishman, 2015; Zeng et al., 2015; Lin et al., 2016; Jiang et al., 2016; Zeng et al., 2017; Huang and Wang, 2017) and long short-term memory networks (LSTMs) (Miwa and Bansal, 2016; Zhang et al., 2017; Katiyar and Cardie, 2017; Ammar et al., 2017), as well as combinations of the two (Nguyen and Grishman, 2016; Raj et al., 2017).
Relation extraction has attracted particular attention in the high-value biomedical domain. Scientific publications are the primary repository of biomedical knowledge, and given their increasing numbers, there is tremendous value in automating the extraction of key discoveries (de Bruijn and Martin, 2002). Here, we focus on the task of understanding relations between chemicals and diseases, which has applications in many areas of biomedical research and healthcare, including toxicology studies, drug discovery and drug safety surveillance (Wei et al., 2015). The importance of chemical-induced disease (CID) relation extraction is also evident from the fact that chemicals, diseases and their relations are among the most searched topics by PubMed users (Islamaj Dogan et al., 2009). In the CID relation extraction task formulation (Wei et al., 2015, 2016), CID relations are typically determined at document level, meaning that relations can be expressed across sentence boundaries and can extend over distances of hundreds of word tokens. As LSTM models can be difficult to apply to very long word sequences (Bradbury et al., 2017), CNN models may be better suited to this task.
New domain-specific terms arise frequently in biomedical text data, requiring the capture of unknown words in practical relation extraction applications in this context. Recent research has shown that character-based word embeddings enable capture of unknown words, helping to improve performance on many NLP tasks (dos Santos and Gatti, 2014;Ma and Hovy, 2016;Lample et al., 2016;Plank et al., 2016;Nguyen et al., 2017). This may be particularly relevant for terms such as gene or chemical names, which often have identifiable morphological structure (Krallinger et al., 2017).
We investigate the value of character-based word embeddings in a standard CNN model for relation extraction (Zeng et al., 2014;Nguyen and Grishman, 2015). To the best of our knowledge, there is no prior work addressing this.
We experiment with two common neural architectures, CNN and LSTM, for learning the character-based embeddings, and evaluate the models on the benchmark BioCreative-V CDR corpus for chemical-induced disease relation extraction (Li et al., 2016a), obtaining state-of-the-art results.

Our modeling approach
This section describes our relation extraction models. They can be viewed as an extension of the well-known CNN model for relation extraction (Nguyen and Grishman, 2015), in which we incorporate character-level representations of words.

Figure 1: Our model architecture. Given the input relation mention marked with the two entities "hemolysis" and "tamoxifen", the convolutional layer uses the window size k = 3 and the number of filters m = 4.
Figure 1 presents our model architecture. Given an input fixed-length sequence (i.e. a relation mention) of n word tokens w_1, w_2, w_3, ..., w_n, marked with two entity mentions, the vector representation layer encodes each i-th word in the input relation mention by a real-valued vector representation v_i ∈ R^d. (We set n to the length of the longest sequence and pad shorter sequences with a special "PAD" token.) The convolutional layer takes the input matrix S = [v_1, v_2, ..., v_n]^T and extracts high-level features. These high-level features are then fed into the max pooling layer to capture the most important features, generating a feature vector for the input relation mention. Finally, the feature vector is fed into a fully-connected neural network with softmax output to produce a probability distribution over relation types. For convenience, we detail the vector representation layer in Section 2.2, while the remaining layers are described in Section 2.1.

CNN layers for relation extraction
Convolutional layer: This layer uses different filters to extract features from the input matrix S = [v_1, v_2, ..., v_n]^T ∈ R^{n×d} by performing convolution operations. Given a window size k, a filter can be formalized as a weight matrix F = [f_1, f_2, ..., f_k]^T ∈ R^{k×d}. For each filter F, the convolution operation is performed to generate a feature map x = [x_1, x_2, ..., x_{n−k+1}] ∈ R^{n−k+1}:

x_j = g(⟨F, S_{j:j+k−1}⟩ + b)

where S_{j:j+k−1} ∈ R^{k×d} consists of the k consecutive row vectors v_j, ..., v_{j+k−1}, ⟨·, ·⟩ denotes the sum of the element-wise product of two matrices, g is some non-linear activation function and b ∈ R is a bias term.
Assume that we use m different weight-matrix filters F^(1), F^(2), ..., F^(m) ∈ R^{k×d}; the process above is then repeated m times, resulting in m feature maps x^(1), x^(2), ..., x^(m) ∈ R^{n−k+1}.

Max pooling layer: This layer aims to capture the most relevant features from each feature map x by applying the popular max-over-time pooling operation: x̂ = max{x} = max{x_1, x_2, ..., x_{n−k+1}}. From the m feature maps, the corresponding outputs are concatenated into a feature vector z = [x̂^(1), x̂^(2), ..., x̂^(m)] ∈ R^m to represent the input relation mention.

Softmax output: The feature vector z is then fed into a fully connected NN followed by a softmax layer for relation type classification. In addition, following Kim (2014), for regularization we apply dropout on z only during training. The softmax output procedure can be formalized as:

p = softmax(W_1(z ∗ r) + b_1)

where p ∈ R^t is the final output of the network, in which t is the number of relation types, and W_1 ∈ R^{t×m} and b_1 ∈ R^t are a transformation weight matrix and a bias vector, respectively. Here ∗ denotes an element-wise product and r ∈ R^m is a vector of independent Bernoulli random variables, each with probability ρ of being 0 (Srivastava et al., 2014).
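The pipeline of the three layers above (convolution, max-over-time pooling, softmax output) can be sketched in plain NumPy. This is an illustrative sketch only: the function names and the choice of ReLU as the activation g are ours, and dropout is omitted since it applies only during training.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def cnn_forward(S, filters, biases, W1, b1):
    """Forward pass: convolution -> max-over-time pooling -> softmax.

    S       : (n, d) input matrix of word vectors v_1..v_n
    filters : m weight matrices F, each of shape (k, d)
    biases  : m scalar bias terms b
    W1, b1  : softmax parameters of shapes (t, m) and (t,)
    """
    n, _ = S.shape
    pooled = []
    for F, b in zip(filters, biases):
        k = F.shape[0]
        # feature map: x_j = g(<F, S[j:j+k]> + b) for j = 1..n-k+1
        x = np.array([relu(np.sum(F * S[j:j + k]) + b)
                      for j in range(n - k + 1)])
        pooled.append(x.max())        # max pooling: one scalar per filter
    z = np.array(pooled)              # feature vector z in R^m
    return softmax(W1 @ z + b1)       # distribution over t relation types
```

Note that z has exactly m entries (one per filter), so W_1 maps R^m to the t relation types, matching the dimensions above.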

Input vector representation
This section presents the vector representation v_i ∈ R^d for each i-th word token in the input relation mention w_1, w_2, w_3, ..., w_n. Let word tokens w_{i1} and w_{i2} be the two entity mentions in the input. We obtain v_i by concatenating word embeddings e_{w_i} ∈ R^{d1}, position embeddings e^(p1)_{i−i1} and e^(p2)_{i−i2} ∈ R^{d2}, and character-level word embeddings e^(c)_{w_i} ∈ R^{d3}:

v_i = e_{w_i} ∘ e^(p1)_{i−i1} ∘ e^(p2)_{i−i2} ∘ e^(c)_{w_i}

where ∘ denotes vector concatenation, so that d = d1 + 2·d2 + d3.

Word embeddings: Each word type w in the training data is represented by a real-valued word embedding e_w ∈ R^{d1}.
Position embeddings: In relation extraction, we focus on assigning relation types to entity pairs. Words close to the target entities are usually informative for identifying a relationship between them. Following Zeng et al. (2014), to specify entity pairs, we use position embeddings e^(p1)_{i−i1} and e^(p2)_{i−i2} ∈ R^{d2} to encode the relative distances i − i1 and i − i2 from each word w_i to the entity mentions w_{i1} and w_{i2}, respectively.

Character-level embeddings: Given a word type w consisting of l characters w = c_1c_2...c_l, where each j-th character in w is represented by a character embedding c_j ∈ R^{d4}, we investigate two approaches for learning the character-based word embedding e^(c)_w:

(1) Using a CNN (dos Santos and Gatti, 2014; Ma and Hovy, 2016): This CNN contains a convolutional layer to generate d3 feature maps from the input c_{1:l}, and a max pooling layer to produce a final vector e^(c)_w from those feature maps for representing the word w.
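Approach (1) can be illustrated with a minimal NumPy sketch, assuming one (k, d4) filter per output dimension so that the pooled vector has d3 entries. The zero-padding of very short words and the omission of a non-linearity are our own simplifications.

```python
import numpy as np

def char_cnn_embedding(chars, filters, k=3):
    """Character-level CNN word embedding.

    chars   : (l, d4) matrix of character embeddings c_1..c_l
    filters : (d3, k, d4) array -- one (k, d4) filter per output dim
    returns : (d3,) character-based word embedding e^(c)_w
    """
    l, d4 = chars.shape
    if l < k:  # pad very short words so at least one window exists
        chars = np.vstack([chars, np.zeros((k - l, d4))])
        l = k
    e = np.empty(len(filters))
    for i, F in enumerate(filters):
        # one feature map per filter, then max pooling over positions
        fmap = [np.sum(F * chars[j:j + k]) for j in range(l - k + 1)]
        e[i] = max(fmap)
    return e
```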
(2) Using a sequence BiLSTM (BiLSTM_seq) (Lample et al., 2016): In the BiLSTM_seq, the input is the sequence of l character embeddings c_{1:l}, and the output is the concatenation of the outputs of a forward LSTM (LSTM_f) reading the input in its regular order and a reverse LSTM (LSTM_r) reading the input in reverse:

e^(c)_w = LSTM_f(c_{1:l}) ∘ LSTM_r(c_{l:1})

where ∘ denotes vector concatenation.
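A minimal NumPy sketch of approach (2), with a hand-rolled single-layer LSTM; the gate ordering, parameter shapes, and function names are our own illustrative choices, not fixed by the model description.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_last_state(inputs, W, U, b, hidden):
    """Run a single-layer LSTM over `inputs` (rows are vectors) and
    return the final hidden state. W: (4*hidden, d4), U: (4*hidden, hidden),
    b: (4*hidden,), with the gates stacked in the order i, f, o, g."""
    h = np.zeros(hidden)
    c = np.zeros(hidden)
    for x in inputs:
        z = W @ x + U @ h + b
        i, f, o, g = np.split(z, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)  # cell state update
        h = sigmoid(o) * np.tanh(c)                   # hidden state
    return h

def bilstm_char_embedding(chars, fwd, rev, hidden):
    """e^(c)_w = LSTM_f(c_1..c_l) concatenated with LSTM_r(c_l..c_1).
    `fwd` and `rev` are (W, U, b) parameter triples."""
    h_f = lstm_last_state(chars, *fwd, hidden)        # regular order
    h_r = lstm_last_state(chars[::-1], *rev, hidden)  # reversed order
    return np.concatenate([h_f, h_r])
```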

Model training
The baseline CNN model for relation extraction (Nguyen and Grishman, 2015) is denoted here as CNN. The extensions incorporating CNN-based and BiLSTM-based character-level word embeddings are denoted CNN+CNNchar and CNN+LSTMchar, respectively. The model parameters, including word, position, and character embeddings, weight matrices and biases, are learned during training by minimizing the model's negative log-likelihood (i.e. cross-entropy loss) with L_2 regularization.

Experimental setup
We evaluate our models using the BC5CDR corpus (Li et al., 2016a), which is the benchmark dataset for the chemical-induced disease (CID) relation extraction task (Wei et al., 2015, 2016). The corpus consists of 1500 PubMed abstracts: 500 each for training, development and test. The training set is used to learn model parameters, the development set to select optimal hyperparameters, and the test set to report final results. We make use of gold entity annotations in each case. We measure CID relation extraction performance with the F1 score. More details of the dataset, evaluation protocol, and implementation are in the Appendix.

Table 1 compares the CID relation extraction results of our models to prior work. The first 11 rows report the performance of models that use the same experimental setup, without using additional training data or features extracted from external knowledge base (KB) resources. The last 6 rows report results of models exploiting various kinds of features based on external relational KBs of chemicals and diseases, of which the last 4 SVM-based models are trained on both the training and development sets.

Main results
The models exploiting more training data and external KB features obtained the best F1 scores. Panyam et al. (2016) and Xu et al. (2016) have shown that without KB features, their model performances (61.7% and 67.2%) decrease by 5 and 11 points of F1 score, respectively. Hence we find that external KB features are essential; we plan to extend our models to incorporate such KB features in future work.
In terms of models not exploiting external data or KB features (i.e. the first 11 rows in Table 1), our CNN+CNNchar and CNN+LSTMchar obtain the highest F1 scores, with absolute F1 improvements of more than 1% over the baseline CNN (p-value < 0.05). In addition, our models obtain F1 scores more than 2% higher than those of models relying on dependency parse features. Zhou et al. (2016) and Gu et al. (2017) used the same post-processing heuristics to handle cases where models could not identify any CID relation between chemicals and diseases in an article, resulting in final F1 scores of 61.3%. These approaches depend on the Stanford dependency parser (Chen and Manning, 2014); however, this parser was trained on the Penn Treebank (in the newswire domain) (Marcus et al., 1993), and training on a domain-specific treebank such as CRAFT (Bada et al., 2012) should help to improve results (Verspoor et al., 2012).
We also achieve slightly better scores than the more complex model BRAN (Verga et al., 2017), a Biaffine Relation Attention Network based on the Transformer self-attention model (Vaswani et al., 2017). BRAN additionally uses byte-pair encoding (Gage, 1994) to construct a vocabulary of subword units for tokenization. Using subword tokens to capture rare or unknown words has been demonstrated to be useful in machine translation (Sennrich et al., 2016) and likely captures information similar to character embeddings. However, Verga et al. (2017) do not provide comparative results using only the original word tokens, so it is difficult to assess the usefulness specifically of byte-pair encoded subword tokens in the CID relation extraction task, as compared to the impact of the full model architecture. We plan to explore the usefulness of subword tokens in the baseline CNN in future work, to enable comparison with the improvement obtained using character-based word embeddings.
It is worth noting that both CNN+CNNchar and CNN+LSTMchar return similar F1 scores, showing that in this case, using either a CNN or a BiLSTM to learn character-based word embeddings produces a similar improvement over the baseline. There does not appear to be any reason to prefer one over the other in our relation extraction application.

Conclusion
In this paper, we have explored the value of integrating character-based word representations into a baseline CNN model for relation extraction. In particular, we investigate the use of two well-known neural architectures, CNN and LSTM, for learning character-based word representations. Experimental results on a benchmark chemical-disease relation extraction corpus show that the character-based representations help improve the baseline to attain state-of-the-art performance. Our models are suitable candidates to serve as future baselines for more complex models in the relation extraction task.


Appendix
Relation mentions: following (2017), these are derived from either (i) a pair of entity mentions that has been positively classified as forming a CID relation based on the document, or (ii) a pair of entity mentions that co-occurs in the document and has been annotated as having a CID relation in a document in the training set.
In an article, a pair of chemical and disease concept identifiers may have multiple entity mention pairs, expressed in different relation mentions.
The longest relation mention has about 400 word tokens; the longest word has 37 characters.
We use the training set to learn model parameters, the development set to select optimal hyperparameters, and the test set to report final results using gold entity annotations. We measure CID relation extraction performance using the F1 score. For all three models, position embeddings are randomly initialized with 50 dimensions, i.e. d2 = 50. Word embeddings are initialized using the 200-dimensional pre-trained word vectors from Chiu et al. (2016), i.e. d1 = 200; word types (including a special "UNK" token representing unknown words) that are not in the pre-trained list are initialized randomly. Following Kiperwasser and Goldberg (2016), the "UNK" word embedding is learned during training by replacing each word token w appearing n_w times in the training set with "UNK" with probability p_unk(w) = 0.25/(0.25 + n_w) (this procedure only affects the word embedding part of the input vector representation layer). We use ReLU for the activation function g, and fix the window size k at 5 and the L_2 regularization value at 0.001.
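The word-dropout rule p_unk(w) = 0.25/(0.25 + n_w) from Kiperwasser and Goldberg (2016) can be sketched as follows; the function and variable names are ours, and the fixed seed is only for reproducibility of the sketch.

```python
import random
from collections import Counter

def unk_replace(tokens, counts, p0=0.25, seed=0):
    """Stochastically replace each training token w with "UNK" with
    probability p_unk(w) = p0 / (p0 + n_w), where n_w is w's
    frequency in the training set (here passed in as `counts`)."""
    rng = random.Random(seed)
    return ["UNK" if rng.random() < p0 / (p0 + counts[w]) else w
            for w in tokens]
```

For a word seen once, p_unk = 0.25/1.25 = 0.2; for a word seen 100 times it drops to about 0.0025, so frequent words are almost never replaced while rare words regularly train the "UNK" embedding.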
We train the models with stochastic gradient descent using the Nadam optimizer (Dozat, 2016), running for 50 epochs.
We perform a grid search to select the optimal hyperparameters by monitoring the F1 score after each training epoch on the development set. Here, we select the initial Nadam learning rate λ ∈ {5e-06, 1e-05, 5e-05, 1e-04, 5e-04}, the number of filters m ∈ {100, 200, 300, 400, 500} and the dropout probability ρ ∈ {0.25, 0.5}. We choose the model with the highest F1 score on the development set, which is then applied to the test set for the final evaluation.
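The grid search amounts to an exhaustive sweep over the three hyperparameters, keeping the configuration with the best development F1. A schematic version, where the `train_eval` callback stands in for a full training-and-evaluation run:

```python
from itertools import product

def grid_search(train_eval, learning_rates, filter_counts, dropouts):
    """Exhaustively sweep (lambda, m, rho); `train_eval` trains one
    model with the given hyperparameters and returns its best
    development-set F1 score."""
    best_f1, best_cfg = float("-inf"), None
    for lr, m, rho in product(learning_rates, filter_counts, dropouts):
        f1 = train_eval(lr, m, rho)
        if f1 > best_f1:  # keep the first configuration with top F1
            best_f1, best_cfg = f1, (lr, m, rho)
    return best_cfg, best_f1
```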