Extracting Drug-Drug Interactions with Attention CNNs

We propose a novel attention mechanism for a Convolutional Neural Network (CNN)-based Drug-Drug Interaction (DDI) extraction model. CNNs have shown great potential on DDI extraction tasks; however, attention mechanisms, which emphasize important words in the sentence of a target-entity pair, have not been investigated with CNNs, even though they have been shown to be effective for general-domain relation classification. We evaluated our model on Task 9.2 of the DDIExtraction-2013 shared task. Our attention mechanism improved the performance of our base CNN-based DDI model, and the model achieved an F-score of 69.12%, which is competitive with the state-of-the-art models.


Introduction
When drugs are concomitantly administered to patients, the effects of the drugs may be enhanced or weakened, which may cause side effects. These kinds of interactions are called Drug-Drug Interactions (DDIs). Several drug databases, such as DrugBank (Law et al., 2014), Therapeutic Target Database, and PharmGKB (Thorn et al., 2013), have been provided to summarize drug and DDI information for researchers and professionals; however, many newly discovered or rarely reported interactions are not covered in the databases, and they are still buried in biomedical texts. Therefore, studies on automatic DDI extraction from texts are expected to support the maintenance of databases with high coverage and quick updates, and to help medical experts deepen their understanding of DDIs.
For DDI extraction, deep neural network-based methods have recently drawn considerable attention (Zhao et al., 2016; Sahu and Anand, 2017). Deep neural networks have been widely used in the NLP field, where they show high performance on several tasks without requiring manual feature engineering. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) are often employed as the network architectures. Among these, CNNs have the advantage that they can be easily parallelized, so computation is fast on recent Graphics Processing Units (GPUs). CNN-based models have been shown to achieve high accuracy on the DDI extraction task. Sahu and Anand (2017) proposed an RNN-based model with an attention mechanism for the DDI extraction task and showed state-of-the-art performance. The integration of an attention mechanism into CNN-based relation extraction has also been proposed and applied to a general-domain relation extraction task, SemEval 2010 Task 8 (Hendrickx et al., 2009), where the model showed state-of-the-art performance. CNNs with attention mechanisms, however, have not been evaluated on the task of DDI extraction.
In this study, we propose a novel attention mechanism that is integrated into a CNN-based DDI extraction model. The mechanism extends an existing input attention mechanism in that it deals with anonymized entities separately from other words and incorporates a smoothing parameter. We implement a CNN-based relation extraction model and integrate the novel mechanism into the model. We evaluate our model on Task 9.2 of the DDIExtraction-2013 shared task.
The contributions of this paper are as follows. First, this paper proposes a novel attention mechanism that can boost the performance of CNN-based DDI extraction. Second, the DDI extraction model with the attention mechanism achieves an F-score of 69.12%, which is competitive with other state-of-the-art DDI extraction models when we compare the performance without negative instance filtering (Chowdhury and Lavelli, 2013).

Figure 1: Overview of our model

Methods
We propose a novel attention mechanism for a CNN-based DDI extraction model. We illustrate the overview of the proposed DDI extraction model in Figure 1. The model extracts interactions from sentences in which the drug mentions are given. In this section, we first present the preprocessing of input sentences. We then introduce the base CNN model and explain the attention mechanism. Finally, we explain the training method.

Preprocessing
Before processing a drug pair in a sentence, we replace the mentions of the target drugs in the pair with "DRUG1" and "DRUG2" according to their order of appearance. We also replace other mentions of drugs with "DRUGOTHER". Table 1 shows an example of preprocessing when the input sentence "Exposure to oral S-ketamine is unaffected by itraconazole but greatly increased by ticlopidine" is given with a target entity pair. This preprocessing prevents the DDI extraction model from specializing to the surface forms of the drugs in the training data set and lets it perform DDI extraction using the information of the whole context.
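The replacement step described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `blind_drugs`, the whitespace tokenization, and the assumption that every drug mention is a single token are ours.

```python
from itertools import combinations

def blind_drugs(tokens, drug_positions, pair):
    """Replace drug mentions with DRUG1 / DRUG2 / DRUGOTHER for one target pair.

    tokens: list of word strings (each drug mention is assumed to be one token).
    drug_positions: indices of all drug mentions in `tokens`.
    pair: indices of the two target mentions.
    """
    first, second = sorted(pair)  # order of appearance decides DRUG1 vs DRUG2
    blinded = list(tokens)
    for pos in drug_positions:
        if pos == first:
            blinded[pos] = "DRUG1"
        elif pos == second:
            blinded[pos] = "DRUG2"
        else:
            blinded[pos] = "DRUGOTHER"
    return blinded

sentence = ("Exposure to oral S-ketamine is unaffected by itraconazole "
            "but greatly increased by ticlopidine").split()
drug_positions = [3, 7, 12]  # S-ketamine, itraconazole, ticlopidine

# One classification instance per candidate pair of drug mentions.
instances = [blind_drugs(sentence, drug_positions, pair)
             for pair in combinations(drug_positions, 2)]
for inst in instances:
    print(" ".join(inst))
```

Each candidate pair yields its own blinded copy of the sentence, so a sentence with three drug mentions produces three instances, as in Table 1.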

Base CNN model
The base CNN model for extracting DDIs is that of Zeng et al. (2014). In addition to their original objective function, we employ a ranking-based objective function by dos Santos et al. (2015). The model consists of four layers: embedding, convolution, pooling, and prediction layers. We show the CNN model in the bottom half of Figure 1.

Embedding layer
In the embedding layer, each word in the input sentence is mapped to a real-valued vector representation using an embedding matrix that is initialized with pre-trained embeddings. Given an input sentence $S = (w_1, \cdots, w_n)$ with drug entities $e_1$ and $e_2$, we first convert each word $w_i$ into a real-valued vector $w^w_i$ by an embedding matrix $W^{emb} \in \mathbb{R}^{d_w \times |V|}$ as follows:

$$w^w_i = W^{emb} v^w_i, \tag{1}$$

where $d_w$ is the number of dimensions of the word embeddings, $V$ is the vocabulary in the training data set and the pre-trained word embeddings, and $v^w_i$ is a one-hot vector that represents the index of the word embedding in $W^{emb}$. $v^w_i$ thus extracts the corresponding word embedding from $W^{emb}$.

Entity1 | Entity2 | Preprocessed input sentence
S-ketamine | itraconazole | Exposure to oral DRUG1 is unaffected by DRUG2 but greatly increased by DRUGOTHER.
S-ketamine | ticlopidine | Exposure to oral DRUG1 is unaffected by DRUGOTHER but greatly increased by DRUG2.
itraconazole | ticlopidine | Exposure to oral DRUGOTHER is unaffected by DRUG1 but greatly increased by DRUG2.

Table 1: An example of preprocessing on the sentence "Exposure to oral S-ketamine is unaffected by itraconazole but greatly increased by ticlopidine" for each target pair.
The word embedding matrix $W^{emb}$ is fine-tuned during training. We also prepare $d_{wp}$-dimensional word position embeddings $w^p_{i,1}$ and $w^p_{i,2}$ that correspond to the relative positions from the first and second target entities, respectively. We concatenate the word embedding $w^w_i$ and these word position embeddings $w^p_{i,1}$ and $w^p_{i,2}$ as in Equation (2), and we use the resulting vector as the input to the subsequent convolution layer:

$$w_i = [w^w_i; w^p_{i,1}; w^p_{i,2}]. \tag{2}$$
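As a rough illustration of the concatenation in Equation (2), the following sketch builds the input vector of every token from a word embedding and two position embeddings. The lookup tables are random toy values, the sizes are illustrative, and sharing one position table `pos_emb` between both entities is an assumption made only to keep the example short.

```python
import random

def relative_positions(n, e1, e2):
    """For each word index i, return its offsets (i - e1, i - e2) from the two target entities."""
    return [(i - e1, i - e2) for i in range(n)]

def build_inputs(tokens, e1, e2, word_emb, pos_emb):
    """Concatenate the word embedding and the two position embeddings of every token."""
    inputs = []
    for tok, (p1, p2) in zip(tokens, relative_positions(len(tokens), e1, e2)):
        inputs.append(word_emb[tok] + pos_emb[p1] + pos_emb[p2])  # list concatenation
    return inputs

# Toy setup: d_w = 4 and d_wp = 2 are illustrative sizes, not the paper's hyperparameters.
rng = random.Random(0)
d_w, d_wp = 4, 2
tokens = "Exposure to oral DRUG1 is unaffected by DRUG2".split()
word_emb = {t: [rng.uniform(-1, 1) for _ in range(d_w)] for t in set(tokens)}
pos_emb = {off: [rng.uniform(-1, 1) for _ in range(d_wp)]
           for off in range(-len(tokens) + 1, len(tokens))}

inputs = build_inputs(tokens, e1=3, e2=7, word_emb=word_emb, pos_emb=pos_emb)
print(len(inputs), len(inputs[0]))  # number of tokens, d_w + 2 * d_wp
```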

Convolution layer
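We define a weight tensor for convolution as $W^{conv}_k \in \mathbb{R}^{d_c \times (d_w + 2d_{wp}) \times k}$, and we represent the $j$-th column of $W^{conv}_k$ as $W^{conv}_{k,j} \in \mathbb{R}^{(d_w + 2d_{wp}) \times k}$. Here, $d_c$ denotes the number of filters for each window size, $k$ is a window size, and $K$ is the set of window sizes of the filters. We also introduce $z_{i,k}$, which is the concatenation of $k$ word embeddings:

$$z_{i,k} = [w_{\lfloor i-(k-1)/2 \rfloor}; \cdots; w_{\lfloor i+(k-1)/2 \rfloor}]. \tag{3}$$

We apply the convolution to the embedding matrix as follows:

$$m_{i,j,k} = f(W^{conv}_{k,j} \odot z_{i,k} + b), \tag{4}$$

where $\odot$ is an element-wise product, $b$ is the bias term, and $f$ is the ReLU function defined as:

$$f(x) = \begin{cases} x, & \text{if } x > 0 \\ 0, & \text{otherwise.} \end{cases} \tag{5}$$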

Pooling layer
We employ max pooling (Boureau et al., 2010) to convert the output of each filter in the convolution layer into a fixed-size vector as follows:

$$c_{j,k} = \max_i m_{i,j,k}, \quad c_k = [c_{1,k}, \cdots, c_{d_c,k}]. \tag{6}$$

We then obtain the $d_p$-dimensional output of this pooling layer, where $d_p$ equals $d_c \times |K|$, by concatenating the obtained outputs $c_k$ for all the window sizes $k_1, \cdots, k_K$ ($\in K$):

$$c = [c_{k_1}; \cdots; c_{k_K}]. \tag{7}$$
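The convolution and max pooling steps can be sketched in pure Python as follows. This is only an illustration under stated assumptions: the element-wise product of a filter and a window is summed into one scalar, windows slide over valid positions only (sentences shorter than the window are zero-padded), and a single scalar bias is shared by all filters. The function name `conv_max_pool` and the toy values are ours.

```python
def relu(x):
    """ReLU activation: x if positive, otherwise 0."""
    return x if x > 0 else 0.0

def conv_max_pool(inputs, filters, bias=0.0):
    """Apply every filter to every window and keep the maximum activation per filter.

    inputs: list of n input vectors, each of size d.
    filters: dict mapping a window size k to a list of d_c filters,
             each given as a flat list of k * d weights.
    Returns a flat list of length d_c * |K| (one value per filter).
    """
    d = len(inputs[0])
    pooled = []
    for k in sorted(filters):
        # Zero-pad so that a sentence shorter than k still yields one window.
        padded = inputs + [[0.0] * d] * max(0, k - len(inputs))
        # Each window is the concatenation of k consecutive input vectors.
        windows = [sum(padded[i:i + k], []) for i in range(len(padded) - k + 1)]
        for w in filters[k]:
            activations = [relu(sum(a * b for a, b in zip(w, z)) + bias)
                           for z in windows]
            pooled.append(max(activations))  # max pooling over positions
    return pooled

inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # n = 3 words, d = 2
filters = {
    2: [[1.0, 1.0, 1.0, 1.0], [-1.0, -1.0, -1.0, -1.0]],
    3: [[1.0] * 6],
}
features = conv_max_pool(inputs, filters)
print(features)
```

The output length depends only on the number of filters, not on the sentence length, which is what allows a fixed-size prediction layer on top.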

Prediction layer
We predict the relation types using the output of the pooling layer. We first convert $c$ into scores using a weight matrix $W^{pred} \in \mathbb{R}^{o \times d_p}$:

$$s = W^{pred} c, \tag{8}$$

where $o$ is the total number of relationships to be classified and $s = [s_1, \cdots, s_o]$. We then employ the following two different objective functions for prediction.
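Softmax We convert $s$ into the probability of possible relations $p$ by a softmax function:

$$p = [p_1, \cdots, p_o], \quad p_j = \frac{\exp(s_j)}{\sum_{l=1}^{o} \exp(s_l)}. \tag{9}$$

The loss function $L_{softmax}$ is defined as in Equation (10) when the gold type distribution $y$ is given:

$$L_{softmax} = -\sum_{j=1}^{o} y_j \log p_j. \tag{10}$$

$y$ is a one-hot vector where the probability of the gold label is 1 and the others are 0.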
Ranking We employ the ranking-based objective function following dos Santos et al. (2015). Using the scores $s$ in Equation (8), the loss is calculated as follows:

$$L_{ranking} = \log(1 + \exp(\gamma(m^+ - s_y))) + \log(1 + \exp(\gamma(m^- + s_c))), \tag{11}$$

where $m^+$ and $m^-$ are margins, $\gamma$ is a scaling factor, $y$ is a gold label, and $c$ ($\neq y$) is the negative label with the highest score in $s$. We set $\gamma$ to 2, $m^+$ to 2.5, and $m^-$ to 0.5 following dos Santos et al. (2015).
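The two objective functions can be compared on a toy score vector with the sketch below. It assumes the standard cross-entropy form for the softmax loss and the pairwise ranking form of dos Santos et al. (2015) with the margins and scaling factor stated above. The function names are ours, and the sketch applies the same formula to every class, which is a simplifying assumption.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_loss(scores, gold):
    """Cross-entropy against a one-hot gold distribution: -log p[gold]."""
    return -math.log(softmax(scores)[gold])

def ranking_loss(scores, gold, gamma=2.0, m_pos=2.5, m_neg=0.5):
    """Push the gold score above m_pos and the best wrong score below -m_neg."""
    s_gold = scores[gold]
    s_neg = max(s for j, s in enumerate(scores) if j != gold)  # most competitive wrong label
    return (math.log1p(math.exp(gamma * (m_pos - s_gold)))
            + math.log1p(math.exp(gamma * (m_neg + s_neg))))

scores = [3.0, -1.0, 0.2, -0.5, -2.0]  # hypothetical scores for o = 5 classes
print(softmax_loss(scores, gold=0), ranking_loss(scores, gold=0))
```

Unlike the softmax loss, the ranking loss only looks at the gold score and the highest-scoring wrong label, so the remaining scores do not contribute to the loss.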

Attention mechanism
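Our attention mechanism is based on an existing input attention mechanism. The proposed attention mechanism differs from the base one in that we prepare separate attentions for entities and we incorporate a bias term to adjust the smoothness of attentions. We illustrate the attention mechanism in the upper half of Figure 1. We define the word indices of the first and second target drug entities in the sentence as $e_1$ and $e_2$, respectively. We also denote by $E = \{e_1, e_2\}$ the set of indices and by $j \in \{1, 2\}$ the index of the entities. We calculate our attentions using these:

$$\beta_{i,j} = w_{e_j} \cdot w_i, \tag{12}$$

$$\alpha_{i,j} = \begin{cases} a_{drug}, & \text{if } i \in E \\ \dfrac{\exp(\beta_{i,j})}{\sum_{1 \le l \le n,\, l \notin E} \exp(\beta_{l,j})}, & \text{otherwise,} \end{cases} \tag{13}$$

$$\alpha_i = \frac{\alpha_{i,1} + \alpha_{i,2}}{2} + b_\alpha. \tag{14}$$

Here, $a_{drug}$ is an attention parameter for entities and $b_\alpha$ is the bias term. $a_{drug}$ and $b_\alpha$ are tuned during training. If we set $E$ to empty and $b_\alpha$ to zero, the attention is the same as the base input attention. We incorporate the attentions $\alpha_i$ into the CNN model by replacing Equation (4) with the following equation:

$$m_{i,j,k} = f(W^{conv}_{k,j} \odot z_{i,k}\, \alpha_i + b). \tag{15}$$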
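To make the role of the entity attention parameter and the bias term concrete, the sketch below computes attention weights for a toy sentence. It assumes the attention takes the following form: the relevance of a word to a target entity is the inner product of their input vectors, these values are normalized with a softmax over the non-entity words only, entity positions receive the fixed weight `a_drug`, and the two per-entity attentions are averaged before `b_alpha` is added. The function name, the toy vectors, and the parameter values are illustrative, and the sentence is assumed to contain at least one non-entity word.

```python
import math

def attention_weights(inputs, e1, e2, a_drug, b_alpha):
    """Return one attention weight per word for the target pair (e1, e2)."""
    entities = {e1, e2}
    others = [i for i in range(len(inputs)) if i not in entities]
    per_entity = []
    for e in (e1, e2):
        # Inner product between each non-entity word and the entity vector.
        beta = {i: sum(x * y for x, y in zip(inputs[i], inputs[e])) for i in others}
        m = max(beta.values())                      # subtract max for stability
        z = sum(math.exp(b - m) for b in beta.values())
        per_entity.append([a_drug if i in entities else math.exp(beta[i] - m) / z
                           for i in range(len(inputs))])
    # Average the two per-entity attentions and add the smoothing bias.
    return [(a1 + a2) / 2 + b_alpha for a1, a2 in zip(*per_entity)]

# Toy sentence of 4 words; words 0 and 3 are the target entities.
inputs = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
alphas = attention_weights(inputs, e1=0, e2=3, a_drug=0.5, b_alpha=0.1)
print(alphas)
```

With `b_alpha = 0` the weights of the non-entity words sum to one; a larger `b_alpha` adds the same constant to every word and therefore flattens the relative differences between them.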

Training method
We use L2 regularization to avoid over-fitting. We minimize the following objective, which adds L2 regularization on the weights to the loss $L_*$ ($L_{softmax}$ or $L_{ranking}$) defined above:

$$L_* + \lambda(\|W^{emb}\|_F^2 + \|W^{conv}\|_F^2 + \|W^{pred}\|_F^2).$$

Here, $\lambda$ is a regularization parameter and $\|\cdot\|_F$ denotes the Frobenius norm. We update all the parameters, including the weights $W^{emb}$, $W^{conv}$, and $W^{pred}$, the biases $b$ and $b_\alpha$, and the attention parameter $a_{drug}$, to minimize this objective. We use adaptive moment estimation (Adam) (Kingma and Ba, 2015) as the optimizer. We randomly shuffle the training data set and divide it into mini-batches in each epoch.
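The regularized objective and the per-epoch batching can be sketched as follows. The sketch assumes that the penalty is the sum of squared Frobenius norms of the weight matrices (the usual L2 form) and that each weight is given as a list of rows. The function names, the batch size, and the toy values are illustrative, and the Adam update itself is omitted.

```python
import random

def frobenius_sq(matrix):
    """Squared Frobenius norm of a matrix given as a list of rows."""
    return sum(x * x for row in matrix for x in row)

def regularized_loss(loss, weights, lam):
    """Add lam times the summed squared Frobenius norms of all weights to the loss."""
    return loss + lam * sum(frobenius_sq(w) for w in weights)

def minibatches(data, batch_size, rng):
    """Shuffle a copy of the data and cut it into consecutive mini-batches."""
    order = list(data)
    rng.shuffle(order)
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

w_pred = [[1.0, 2.0], [3.0, 4.0]]            # toy weight matrix
total = regularized_loss(1.0, [w_pred], lam=0.5)

rng = random.Random(0)
for epoch in range(2):                        # a new shuffle in every epoch
    batches = minibatches(range(10), batch_size=4, rng=rng)
    print(epoch, batches)
```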

Experimental settings
We illustrate the workflow of the DDI extraction in Figure 2. As preprocessing, we performed word segmentation of the input sentences using the GENIA tagger (Tsuruoka et al., 2005). In this section, we explain the settings for the data sets, tasks, initial embeddings, and hyper-parameter tuning.

Data set
We used the data set from the DDIExtraction-2013 shared task (SemEval-2013 Task 9) (Segura  for the evaluation. This data set is composed of documents annotated with drug mentions and their relationships. The data set consists of two parts: MEDLINE and DrugBank. MEDLINE consists of abstracts in PubMed articles, and DrugBank consists of the descriptions of drug interactions in the DrugBank database. This data set annotates the following four types of interactions.
• Mechanism: A sentence describes pharmacokinetic mechanisms of a DDI, e.g., "Grepafloxacine may inhibit the metabolism of theobromine."
• Effect: A sentence represents the effect of a DDI, e.g., "Methionine may protect against the ototoxic effects of gentamicin."
• Advice: A sentence represents a recommendation or advice on the concomitant use of two drugs, e.g., "Alpha-blockers should not be combined with uroxatral."
• Int: A sentence simply states the occurrence of a DDI without any further information about the DDI, e.g., "The interaction of omeprazole and ketoconazole has established."
The statistics of the data set are shown in Table 2. As shown in this table, the number of pairs that have no interaction (negative pairs) is larger than that of pairs that have interactions (positive pairs).

Task settings
We followed the task setting of Task 9.2 in the DDIExtraction-2013 shared task (SemEval-2013 Task 9). The task is to classify a given pair of drugs into the four interaction types or no interaction. We evaluated the performance with precision (P), recall (R), and F-score (F) on each interaction type as well as micro-averaged precision, recall, and F-score on all the interaction types. We used the official evaluation script provided by the task organizers and report the averages of 10 runs. Please note that we averaged precision, recall, and F-scores individually, so the reported F-scores cannot be recomputed from the reported precision and recall.
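The following sketch illustrates micro-averaged scoring over the interaction types and why an averaged F-score differs from the F-score of averaged precision and recall. It is not the official evaluation script and may differ from it in details. The label strings, the function names, and the two hypothetical runs are illustrative.

```python
def f_score(p, r):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    return 2 * p * r / (p + r) if p + r else 0.0

def micro_prf(gold, pred, negative="none"):
    """Micro-averaged P/R/F over all positive (interaction) labels."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != negative)
    fp = sum(1 for g, p in zip(gold, pred) if p != negative and g != p)
    fn = sum(1 for g, p in zip(gold, pred) if g != negative and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f_score(precision, recall)

gold = ["effect", "none", "advice", "mechanism"]
pred = ["effect", "effect", "none", "int"]
print(micro_prf(gold, pred))

# Averaging over runs: P, R, and F are each averaged separately.
runs = [(0.8, 0.5), (0.5, 0.8)]  # hypothetical (precision, recall) of two runs
mean_p = sum(p for p, _ in runs) / len(runs)
mean_r = sum(r for _, r in runs) / len(runs)
mean_f = sum(f_score(p, r) for p, r in runs) / len(runs)
print(mean_f, f_score(mean_p, mean_r))  # the two values differ
```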

Initializing embeddings
Skip-gram (Mikolov et al., 2013) was employed for the pre-training of word embeddings. We used the 2014 MEDLINE/PubMed baseline distribution, and the vocabulary size was 1,630,978. The embeddings of the drug tokens, i.e., "DRUG1", "DRUG2", and "DRUGOTHER", were initialized with the pre-trained embedding of the word "drug". The embeddings of training words that did not appear in the pre-trained embeddings, as well as the word position embeddings, were initialized with random values drawn from a uniform distribution and normalized to unit vectors. Words that occurred only once in the training data were replaced with an "UNK" word during training, and the embeddings of words in the test data set that appeared in neither the training data nor the pre-trained embeddings were set to the embedding of the "UNK" word.
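The handling of rare and unseen words can be sketched as follows. The sketch assumes that, at test time, a word is kept if it is in the retained training vocabulary or in the pre-trained vocabulary and is mapped to "UNK" otherwise. The function names and the toy sentences are illustrative.

```python
from collections import Counter

def build_vocab(train_sentences, unk="UNK"):
    """Keep training words that occur more than once, plus the UNK token."""
    counts = Counter(tok for sent in train_sentences for tok in sent)
    vocab = {tok for tok, c in counts.items() if c > 1}
    vocab.add(unk)
    return vocab

def map_tokens(sentence, known, unk="UNK"):
    """Replace every token outside the known vocabulary with UNK."""
    return [tok if tok in known else unk for tok in sentence]

train = [
    ["DRUG1", "inhibits", "DRUG2"],
    ["DRUG1", "increases", "DRUG2"],
    ["inhibits", "rare"],
]
train_vocab = build_vocab(train)
pretrained_vocab = {"increases"}  # toy stand-in for the pre-trained vocabulary

# Training: singleton words become UNK.
print(map_tokens(train[1], train_vocab))
# Test: words covered by the pre-trained embeddings are kept.
print(map_tokens(["DRUG1", "increases", "novelword"], train_vocab | pretrained_vocab))
```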

Hyperparameter tuning
We split the official training data set into two parts: training and development data sets. We tuned the hyper-parameters on the development data set using the softmax model without attentions. Table 3 shows the best hyperparameters for this model. We applied the same hyperparameters to the other models. The statistics of our development data set are shown in Table 4. We set the sizes of the convolution windows to [3, 4, 5], the same as in Kim (2014).

Results
In this section, we first summarize the performance of the proposed models and compare the performance with existing models. We then compare attention mechanisms and finally illustrate some results for the analysis of the attentions.

Performance analysis
The performance of the base CNN models with two objective functions, as well as with or without the proposed attention mechanism, is summarized in Table 5. The incorporation of the attention mechanism improved the F-scores by about 2 percentage points (pp) on models with both objective functions. Both improvements were statistically significant (p < 0.01) with a t-test. This shows that the attention mechanism is effective for both models. The improvement in F-score from the lowest-performing model (softmax objective function without our attention mechanism) to the best-performing model (ranking objective function with our attention mechanism) is 3.19 pp (69.12% versus 65.93%), which shows that both the objective function and the attention mechanism are key to improving the performance. Looking into the individual types, the ranking function with our attention mechanism achieved the best F-scores on Mechanism, Effect, and Advice, while the base CNN model achieved the best F-score on Int.

Comparison with existing models
We show a comparison with the existing state-of-the-art models in Table 6. We mainly compare the performance without negative instance filtering, which omits some apparent negative instance pairs with rules (Chowdhury and Lavelli, 2013), since we did not incorporate it. We also show the performance of the existing models with negative instance filtering for reference. In the comparison without negative instance filtering, our model outperformed the existing CNN models (Quan et al., 2016; Zhao et al., 2016). The model was competitive with the Joint AB-LSTM model (Sahu and Anand, 2017), which was composed of multiple RNN models.

Methods | P (%) | R (%) | F (%)
No negative instance filtering
CNN | 75.29 | 60.37 | 67.01
MCCNN (Quan et al., 2016) | - | - | 67.80
SCNN (Zhao et al., 2016) | 68

Table 7: Comparison of attention mechanisms on CNN models with ranking objective function
When negative instance filtering is considered, our model showed lower performance than the state-of-the-art. However, we believe our model could reach performance similar to theirs if we incorporated negative instance filtering. Still, the model outperformed several models, such as those of Kim et al. (2015) and Chowdhury and Lavelli (2013) and the SCNN model, even when they use negative instance filtering.

Comparison of attention mechanisms
We compare the proposed attention mechanism with the original input attention to show the effectiveness of our attention mechanism. Table 7 shows the comparison of the attention mechanisms. We also show the base CNN-based model with ranking loss for reference, and the results of ablation tests. As shown in the table, the original input attention did not work in DDI extraction, whereas our attention improved the performance. This result shows that the proposed extensions are crucial for modeling attentions in DDI extraction. The ablation test results show that both of our extensions to the attention mechanism, i.e., separate attentions for entities and incorporation of the bias term, are effective for the task.

Visual analysis
Figure 3 shows a visualization of attentions on some sentences with DDI pairs using our attention mechanism. In the first sentence, "DRUG1" and "DRUG2" have a Mechanism interaction. The attention mechanism successfully highlights the keyword "concentration". In the second sentence, which has an Effect interaction, the attention mechanism puts high weights on "increase" and "effects". The word "necessary" has a high weight in the third sentence, which has an Advice interaction. For an Int interaction in the last sentence, the word "interaction" is the most highlighted.

Related work
Björne et al. (2013) tackled DDI extraction using the Turku Event Extraction System (TEES), which is an event extraction system based on Support Vector Machines (SVMs). Thomas et al. (2013) and Chowdhury and Lavelli (2013) proposed two-phase processing models that first detected DDIs and then classified the extracted DDIs into one of the four proposed types. Thomas et al. (2013) used ensembles of several kernel methods, while Chowdhury and Lavelli (2013) proposed a hybrid kernel-based approach with negative instance filtering. Negative instance filtering is employed by all the subsequent models except for ours. Kim et al. (2015) proposed a two-phase SVM-based approach that employed a linear SVM with rich features including word features, word pairs, dependency relations, parse tree structures, and noun phrase-based constraint features. Our model does not use such features and instead employs CNNs.
Deep learning-based models have recently dominated the DDI extraction task. Among these, CNN-based models have often been employed, and RNNs have received less attention. An early CNN-based model was built on word embeddings and word position embeddings. Zhao et al. (2016) proposed Syntax CNN (SCNN), which employs syntax word embeddings with the syntactic information of a sentence as well as features of POS tags and dependency trees. Quan et al. (2016) tackled DDI extraction using Multi-Channel CNN (MCCNN), which enables the fusion of multiple word embeddings. Our work differs from theirs in that we employed an attention mechanism.
As for RNN-based approaches, Sahu and Anand (2017) proposed an RNN-based model named Joint AB-LSTM (Long Short-Term Memory). Joint AB-LSTM is composed of the concatenation of two RNN-based models: bidirectional LSTM (Bi-LSTM) and attentive pooling Bi-LSTM. The model showed state-of-the-art performance on the DDIExtraction-2013 shared task data set. Our model is a single model with a CNN and an attention mechanism, and it performed comparably to theirs, as shown in Table 6. Multi-level attention CNNs have been proposed and applied to a general-domain relation classification task, SemEval 2010 Task 8 (Hendrickx et al., 2009). Their attention mechanism improved the macro F1 score by 1.9 pp (from 86.1% to 88.0%), and their model achieved state-of-the-art performance on the task.

Conclusions
In this paper, we proposed a novel attention mechanism for the extraction of DDIs. We built base CNN-based DDI extraction models with two different objective functions, softmax and ranking, and we incorporated the attention mechanism into the models. We evaluated the performance on Task 9.2 of the DDIExtraction-2013 shared task, and we showed that both the attention mechanism and the ranking-based objective function are effective for the extraction of DDIs. Our final model achieved an F-score of 69.12%, which is competitive with the state-of-the-art model when we compared the performance without negative instance filtering.
As future work, we would like to incorporate an attention mechanism in the pooling layer and adopt negative instance filtering (Chowdhury and Lavelli, 2013) for further performance improvement and a fair comparison with the state-of-the-art methods.