Deep learning for extracting protein-protein interactions from biomedical literature

State-of-the-art methods for protein-protein interaction (PPI) extraction are primarily feature-based or kernel-based and leverage lexical and syntactic information. How to incorporate such knowledge into recent deep learning methods, however, remains an open question. In this paper, we propose a multichannel dependency-based convolutional neural network model (McDepCNN). It applies one channel to the embedding vector of each word in the sentence, and another channel to the embedding vector of the head of the corresponding word. The model can therefore use richer information obtained from different channels. Experiments on two public benchmarking datasets, AIMed and BioInfer, demonstrate that McDepCNN provides up to 6% F1-score improvement over rich feature-based methods and single-kernel methods. In addition, McDepCNN achieves 24.4% relative improvement in F1-score over the state-of-the-art methods on cross-corpus evaluation and 12% improvement in F1-score over kernel-based methods on “difficult” instances. These results suggest that McDepCNN generalizes more easily over different corpora and is capable of capturing long-distance features in sentences.


Introduction
With the growing amount of biomedical information available in textual form, there has been considerable interest in applying natural language processing (NLP) techniques and machine learning (ML) methods to the biomedical literature (Huang and Lu, 2015; Leaman and Lu, 2016; Singhal et al., 2016; Peng et al., 2016). One of the most important tasks is to extract protein-protein interaction relations (Krallinger et al., 2008).
Protein-protein interaction (PPI) extraction is the task of identifying interaction relations between protein entities mentioned within a document. While PPI relations can span sentences and even cross documents, current work mostly focuses on PPI in individual sentences (Pyysalo et al., 2008; Tikk et al., 2010). For example, "ARFTS" and "XIAP-BIR3" are in a PPI relation in the sentence "ARFTS_PROT1 specifically binds to a distinct domain in XIAP-BIR3_PROT2".
Recently, deep learning methods have achieved notable results in various NLP tasks (Manning, 2015). For PPI extraction, convolutional neural networks (CNN) have been adopted and applied effectively (Zeng et al., 2014; Quan et al., 2016; Hua and Quan, 2016). Compared with traditional supervised ML methods, the CNN model is more generalizable and does not require tedious feature engineering efforts. However, how to incorporate linguistic and semantic information into the CNN model remains an open question. As a result, previous CNN-based methods have not achieved state-of-the-art performance in the PPI task (Zhao et al., 2016a).
In this paper, we propose a multichannel dependency-based convolutional neural network, McDepCNN, to provide a new way to model the syntactic sentence structure in CNN models. In the widely used one-hot CNN model, the shortest-path information is first transformed into a binary vector that is zero in all positions except at the index of this shortest path, and this vector is then fed to the CNN. In contrast, McDepCNN utilizes a separate channel to capture the dependencies of the sentence syntactic structure.
To assess McDepCNN, we evaluated our model on two benchmarking PPI corpora, AIMed (Bunescu et al., 2005) and BioInfer (Pyysalo et al., 2007). Our results show that McDepCNN performs better than the state-of-the-art feature- and kernel-based methods. We further examined McDepCNN in two experimental settings: a cross-corpus evaluation and an evaluation on a subset of "difficult" PPI instances previously reported (Tikk et al., 2013). Our results suggest that McDepCNN is more generalizable and more capable of capturing long-distance information than kernel methods.
The rest of the manuscript is organized as follows. We first present related work. Then, we describe our model in Section 3, followed by an extensive evaluation and discussion in Section 4. We conclude in the last section.

Related work
From the ML perspective, we formulate the PPI task as a binary classification problem where discriminative classifiers are trained with a set of positive and negative relation instances. In the last decade, ML-based methods for the PPI task have been dominated by two main types: feature-based and kernel-based methods. The common characteristic of these methods is that they transform relation instances into a set of features or rich structural representations like trees or graphs, by leveraging linguistic analysis and knowledge resources. A discriminative classifier is then used, such as support vector machines (Vapnik, 1995) or conditional random fields (Lafferty et al., 2001).
While these methods allow relation extraction systems to inherit the knowledge discovered by the NLP community for the pre-processing tasks, they are highly dependent on feature engineering (Fundel et al., 2007; Van Landeghem et al., 2008; Miwa et al., 2009b; Bui et al., 2011). The difficulty with feature-based methods is that data cannot always be easily represented by explicit feature vectors.
Since natural language processing applications involve structured representations of the input data, deriving good features is difficult, time-consuming, and requires expert knowledge. Kernel-based methods attempt to solve this problem by implicitly calculating dot products for every pair of examples (Erkan et al., 2007; Airola et al., 2008; Miwa et al., 2009a; Kim et al., 2010; Chowdhury et al., 2011). Instead of extracting feature vectors from examples, they apply a similarity function between examples and use a discriminative method to label new examples (Tikk et al., 2010). However, this approach also requires manual effort to design a similarity function that can not only encode linguistic and semantic information in the complex structures but also successfully discriminate between examples. Kernel-based methods are also criticized for having higher computational complexity (Collins and Duffy, 2002).
Convolutional neural networks (CNN) have recently achieved promising results in the PPI task (Zeng et al., 2014; Hua and Quan, 2016). CNNs are a type of feed-forward artificial neural network whose layers are formed by a convolution operation followed by a pooling operation (LeCun et al., 1998). Unlike feature- and kernel-based methods, which have been well studied for decades, few studies have investigated how to incorporate syntactic and semantic information into the CNN model. To this end, we propose a neural network model that makes use of automatically learned features (by different CNN layers) together with manually crafted ones (via domain knowledge), such as words, part-of-speech tags, chunks, named entities, and the dependency graph of sentences. Such a combination in feature engineering has also been shown to be effective in other NLP tasks (e.g., Shimaoka et al., 2017).
Furthermore, we propose a multichannel CNN, a model that was suggested to capture different "views" of input data. In image processing, Krizhevsky et al. (2012) applied different RGB (red, green, blue) channels to color images. In NLP research, such models often use separate channels for different word embeddings (Yin and Schütze, 2015; Shi et al., 2016). For example, one could have separate channels for different word embeddings (Quan et al., 2016), or have one channel that is kept static throughout training and another that is fine-tuned via backpropagation (Kim, 2014). Unlike these studies, we utilize the head of the word in a sentence as a separate channel.

Model Architecture Overview
Figure 1 illustrates the overview of our model, which takes a complete sentence with mentioned entities as input and outputs a probability vector (two elements) corresponding to whether there is a relation between the two entities. Our model mainly consists of three layers: a multichannel embedding layer, a convolution layer, and a fully connected layer.

Embedding Layer
In our model, as shown in Figure 1, each word in a sentence is represented by concatenating its word embedding, part-of-speech, chunk, named entity, dependency, and position features.
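As a rough check on the size of this representation, the sketch below adds up the feature dimensionalities reported in the following subsections. It assumes that the features are plainly concatenated and that the position feature contributes 2 × 10 bits (ten for each of d1 and d2); the names `FEATURE_DIMS` and `word_vector_dim` are illustrative only.

```python
# Feature dimensionalities as reported in the subsections of this paper.
FEATURE_DIMS = {
    "word_embedding": 200,
    "part_of_speech": 8,
    "chunk": 18,
    "named_entity": 4,
    "dependency": 101,
    "position": 2 * 10,  # assumption: d1 and d2 each map to a ten-bit vector
}


def word_vector_dim(dims=FEATURE_DIMS):
    """Dimensionality d of one word representation under plain concatenation."""
    return sum(dims.values())


print(word_vector_dim())  # 351
```

Under these assumptions each word in each channel is a 351-dimensional vector.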

Word embedding
Word embedding is a language modeling technique in which words from the vocabulary are mapped to vectors of real numbers. It has been shown to boost performance in NLP tasks. In this paper, we used pre-trained word embedding vectors (Pyysalo et al., 2013) learned on PubMed articles using the word2vec tool (Mikolov et al., 2013). The dimensionality of the word vectors is 200.

Part-of-speech
We used the part-of-speech (POS) feature to extend the word embedding. Similar to Zhao et al. (2016b), we divided POS tags into eight groups. Each group is then mapped to an eight-bit binary vector. In this way, the dimensionality of the POS feature is 8.

Chunk
We used the chunk tags obtained from Genia Tagger for each word (Tsuruoka and Tsujii, 2005). We encoded the chunk features using a one-hot scheme. The dimensionality of chunk tags is 18.

Named entity
To generalize the model, we used four types of named entity encodings for each word. The named entities were provided as input by the task data. In one PPI instance, the types of the two proteins of interest are PROT1 and PROT2, respectively. The type of other proteins is PROT, and the type of other words is O. If a protein mention spans multiple words, we marked each word with the same type (we did not use a scheme such as IOB). The dimensionality of the named entity feature is thus 4.
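The four-way entity typing can be illustrated with a small sketch. It assumes a one-hot encoding over the four types (the text states only that the dimensionality is 4), and the ordering in `NE_TYPES` and the function name are hypothetical.

```python
# Assumed ordering of the four entity types; the paper fixes only the number of types.
NE_TYPES = ["PROT1", "PROT2", "PROT", "O"]


def encode_named_entities(tags):
    """Map each per-word entity tag to a 4-dimensional one-hot vector."""
    vectors = []
    for tag in tags:
        vec = [0] * len(NE_TYPES)
        vec[NE_TYPES.index(tag)] = 1
        vectors.append(vec)
    return vectors


# Tags for "ARFTS specifically binds to a distinct domain in XIAP-BIR3".
tags = ["PROT1", "O", "O", "O", "O", "O", "O", "O", "PROT2"]
vectors = encode_named_entities(tags)
print(vectors[0], vectors[-1])  # [1, 0, 0, 0] [0, 1, 0, 0]
```

A multi-word protein mention would simply repeat the same tag on every word, since no IOB scheme is used.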

Dependency
To add the dependency information of each word, we used the label of the "incoming" edge of that word in the dependency graph. Taking the sentence in Figure 2 as an example, the dependency of "ARFTS" is "nsubj" and the dependency of "binds" is "ROOT". We encoded the dependency features using a one-hot scheme, and their dimensionality is 101.

Position feature
In this work, we consider the relationship of two protein mentions in a sentence. Thus, we used the position feature proposed in Sahu et al. (2016), which consists of two relative distances, d1 and d2, representing the distances of the current word to PROT1 and PROT2, respectively. For example, in Figure 2, the relative distances of the word "binds" to PROT1 ("ARFTS") and PROT2 ("XIAP-BIR3") are 2 and -6, respectively. As in Table S4 of Zhao et al. (2016b), both d1 and d2 are non-linearly mapped to a ten-bit binary vector, where the first bit stands for the sign and the remaining bits for the distance.
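The relative distances in this example can be reproduced with a few lines of code. The sketch assumes whitespace tokenization and single-token protein mentions; it does not reproduce the non-linear ten-bit mapping of Zhao et al. (2016b), and the function name is illustrative.

```python
def relative_distances(tokens, prot1_index, prot2_index):
    """Return (d1, d2) for every token: signed offsets to PROT1 and PROT2."""
    return [(i - prot1_index, i - prot2_index) for i in range(len(tokens))]


tokens = "ARFTS specifically binds to a distinct domain in XIAP-BIR3".split()
dists = relative_distances(
    tokens,
    prot1_index=tokens.index("ARFTS"),
    prot2_index=tokens.index("XIAP-BIR3"),
)
print(dists[tokens.index("binds")])  # (2, -6)
```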

Multichannel Embedding Layer
A novel aspect of McDepCNN is the addition of the "head" word representation of each word as the second channel of the embedding layer. For example, the second channel of the sentence in Figure 2 is "binds binds ROOT binds domain domain binds domain" as shown in Figure 1. There are several advantages of using the "head" of a word as a separate channel.
First, it intuitively incorporates the dependency graph structure into the CNN model. Compared with Hua and Quan (2016), who used the shortest path between two entities as the sole input for the CNN, our model does not discard information outside the scope of the two entities. Such information was reported to be useful (Zhou et al., 2007). Compared with Zhao et al. (2016b), who used the shortest path as a bag-of-words sparse 0-1 vector, our model intuitively reflects the syntactic structure of the dependencies of the input sentence.
Second, together with convolution, our model can capture dependencies that span longer distances than the sliding window size. As shown in Figure 2, the second channel of McDepCNN breaks the dependency graph structure into structural <head word, child word> pairs where each word is a modifier of its previous word. In this way, it reflects the skeleton of a constituent where the second channel shadows the detailed information of all sub-constituents in the first channel. From the perspective of the sentence string, the second channel is similar to a gapped n-gram or a skipped n-gram where the skipped words are based on the structure of the sentence.
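The construction of the second channel can be sketched as follows. The head indices below are a hypothetical parse chosen for illustration (they are not taken from Figure 2 and may differ from the parser output used in the paper); `head_channel` is an illustrative name.

```python
def head_channel(tokens, heads):
    """Build the second channel: the head word of every token.

    heads[i] is the index of the head of tokens[i], or -1 for the root.
    """
    return [tokens[h] if h >= 0 else "ROOT" for h in heads]


tokens = "ARFTS specifically binds to a distinct domain in XIAP-BIR3".split()
# Hypothetical head indices for illustration only.
heads = [2, 2, -1, 2, 6, 6, 2, 6, 6]

second = head_channel(tokens, heads)
pairs = list(zip(second, tokens))  # <head word, child word> pairs

print(" ".join(second))
print(pairs[0])  # ('binds', 'ARFTS')
```

Because both channels have the same length, a window of size k over position i sees k words and their k heads at once, which is how a head that lies outside the window can still contribute to the local feature.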

Convolution
We applied convolution to input sentences to combine the two channels and get local features (Collobert et al., 2011). Consider $x_1, \ldots, x_n$ to be the sequence of word representations in a sentence, where

$$x_i = E_{\mathrm{word}} \oplus \cdots \oplus E_{\mathrm{position}}, \quad i = 1, \ldots, n.$$

Here $\oplus$ is the concatenation operation, so $x_i \in \mathbb{R}^d$ is the embedding vector for the $i$th word with dimensionality $d$. Let $x^c_{i:i+k-1}$ represent a window of size $k$ in the sentence for channel $c$. Then the output sequence of the convolution layer is

$$\mathrm{con}_i = f\Big(\sum_c w^c_k \, x^c_{i:i+k-1} + b_k\Big),$$

where $f$ is a rectified linear unit (ReLU) function and $b_k$ is the bias term. Both $w^c_k$ and $b_k$ are learned parameters.
1-max pooling was then performed over each feature map, i.e., the largest number from each feature map was recorded. In this way, we obtained fixed-length global features for the whole sentence. The underlying intuition is to consider only the most useful feature from the entire sentence.
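The convolution and pooling steps can be illustrated with a dependency-free sketch for a single filter. This is not the TensorFlow implementation used in the experiments; the toy vectors, weights, and function names are made up for illustration, and the filter sums over channels before the ReLU as described above.

```python
def relu(x):
    return max(0.0, x)


def conv_feature_map(channels, filters, bias, k):
    """One filter applied over all channels with window size k.

    channels: list of C sequences, each a list of n word vectors (lists of d floats).
    filters:  list of C filters, each a list of k weight vectors of length d.
    Returns the n - k + 1 outputs relu(sum_c w^c . x^c_{i:i+k-1} + bias).
    """
    n = len(channels[0])
    outputs = []
    for i in range(n - k + 1):
        total = bias
        for seq, filt in zip(channels, filters):
            for j in range(k):
                total += sum(w * x for w, x in zip(filt[j], seq[i + j]))
        outputs.append(relu(total))
    return outputs


def max_pool(feature_map):
    """1-max pooling: keep only the largest activation of the feature map."""
    return max(feature_map)


# Toy input: n = 4 words, d = 2, two channels (words and their heads), k = 3.
words = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]
heads = [[0.0, 1.0], [0.0, 1.0], [0.0, 0.0], [1.0, 1.0]]
w_words = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
w_heads = [[0.0, 1.0], [0.0, 0.0], [0.0, -1.0]]

fmap = conv_feature_map([words, heads], [w_words, w_heads], bias=-1.0, k=3)
print(fmap)            # [3.0, 0.0]
print(max_pool(fmap))  # 3.0
```

With 400 filters per window size, pooling would yield a 400-dimensional global feature vector per window size regardless of sentence length.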

Fully Connected Layer with Softmax
To make a classifier over the extracted global features, we first applied a fully connected layer to the multichannel feature vectors obtained above:

$$O = w_o \, z + b_o,$$

where $z$ denotes the concatenation of the pooled features.
The final softmax then receives this vector $O$ as input and uses it to classify the PPI; here we assume binary classification for the PPI task and hence depict two possible output states:

$$p(\mathrm{ppi} \mid x, \theta) = \frac{e^{O_{\mathrm{ppi}}}}{e^{O_{\mathrm{ppi}}} + e^{O_{\mathrm{other}}}}.$$

Here, $\theta$ is a vector of the parameters of the model, such as $w^c_k$, $b_k$, $w_o$, and $b_o$. Further, we applied the dropout technique to the output of the max pooling layer for regularization (Srivastava et al., 2014). This prevented our method from overfitting by randomly "dropping" neurons with probability $(1 - p)$ during each forward/backward pass while training.
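A minimal sketch of the output layer follows, assuming a plain affine layer followed by a two-way softmax. The weights and input features are toy values, and the function names are illustrative; dropout is omitted.

```python
import math


def fully_connected(features, weights, biases):
    """Affine layer: one output score per row of weights."""
    return [
        sum(w * f for w, f in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    shift = max(scores)
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


features = [3.0, 0.5]               # toy pooled features
weights = [[1.0, 0.0], [0.0, 2.0]]  # toy weights: one row per output state
biases = [0.0, 2.0]

scores = fully_connected(features, weights, biases)
probs = softmax(scores)
print(scores)  # [3.0, 3.0]
print(probs)   # [0.5, 0.5]
```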

Training
To train the parameters, we used the log-likelihood of the parameters on mini-batches of size m. We used the Adam algorithm to optimize the loss function (Kingma and Ba, 2015).
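One common way to write such an objective is the negative log-likelihood averaged over the mini-batch, which an optimizer such as Adam then minimizes. The sketch below assumes this form (the averaging and the sign convention are assumptions) and uses toy probabilities.

```python
import math


def negative_log_likelihood(probabilities, labels):
    """Average negative log-likelihood of the gold labels over a mini-batch.

    probabilities[j] is the predicted distribution over classes for example j,
    labels[j] is the index of its gold class.
    """
    total = sum(math.log(p[y]) for p, y in zip(probabilities, labels))
    return -total / len(labels)


batch_probs = [[0.5, 0.5], [0.25, 0.75]]
batch_labels = [0, 1]
loss = negative_log_likelihood(batch_probs, batch_labels)
print(loss)
```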

Experimental setup
For our experiments, we used the Genia Tagger to obtain the part-of-speech, chunk tags, and named entities of each word (Tsuruoka and Tsujii, 2005). We parsed each sentence using the Bllip parser with the biomedical model (Charniak, 2000; McClosky, 2009). The universal dependencies were then obtained by applying the Stanford dependencies converter on the parse tree with the CCProcessed and Universal options (De Marneffe et al., 2014). We implemented the model using TensorFlow (Abadi et al., 2016). All trainable variables were initialized using the Xavier algorithm (Glorot and Bengio, 2010). We set the maximum sentence length to 160; that is, longer sentences were pruned, and shorter sentences were padded with zeros. We set the learning rate to 0.0007 and the dropping probability to 0.5. In this paper, we experimented with three window sizes: 3, 5 and 7, each of which has 400 filters. Every filter performs convolution on the sentence matrix and generates variable-length feature maps. We got the best results using the single window of size 3 (see Section 4.2).
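The fixed sentence length can be enforced with a small helper such as the one below. It assumes that pruning keeps the first 160 tokens and that padding appends all-zero vectors at the end; the paper does not specify either detail, and the function name is illustrative.

```python
MAX_LEN = 160  # maximum sentence length used in the experiments


def pad_or_truncate(vectors, dim, max_len=MAX_LEN):
    """Return exactly max_len word vectors of size dim.

    Assumption: long sentences keep their first max_len tokens,
    short sentences are right-padded with zero vectors.
    """
    clipped = vectors[:max_len]
    padding = [[0.0] * dim for _ in range(max_len - len(clipped))]
    return clipped + padding


short = pad_or_truncate([[1.0, 2.0]] * 3, dim=2)
long = pad_or_truncate([[1.0, 2.0]] * 200, dim=2)
print(len(short), len(long))  # 160 160
```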

Data
We evaluated McDepCNN on two benchmarking PPI corpora, AIMed (Bunescu et al., 2005) and BioInfer (Pyysalo et al., 2007). These two corpora have different sizes (Table 1) and vary slightly in their definition of PPI (Pyysalo et al., 2008). Tikk et al. (2010) conducted a comparison of a variety of PPI extraction systems on these two corpora (http://mars.cs.utu.fi/PPICorpora). To allow comparison, we followed their experimental setup to evaluate our methods: self-interactions were excluded from the corpora and 10-fold cross-validation (CV) was performed.
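For readers unfamiliar with the protocol, a generic 10-fold split can be sketched as below. This is only an illustration of k-fold cross-validation; the actual fold assignment follows the setup of Tikk et al. (2010) and is not reproduced here, and `k_fold_indices` is a hypothetical helper.

```python
def k_fold_indices(n_items, k=10):
    """Yield (train, test) index lists for k-fold cross-validation.

    Items are assigned to folds round-robin; each fold is held out once.
    """
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


splits = list(k_fold_indices(25, k=10))
print(len(splits))   # 10
print(splits[0][1])  # [0, 10, 20]
```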

Results and discussion
Our system performance, as measured by Precision, Recall, and F1-score, is shown in Table 2.
To compare, we also include the results published in (Tikk et al., 2010; Peng et al., 2015; Van Landeghem et al., 2008; Fundel et al., 2007). Row 2 reports the results of the previous best deep learning system on these two corpora. Rows 3 and 4 report the results of the two previous best single kernel-based methods, an APG kernel (Airola et al., 2008; Tikk et al., 2010) and an edit kernel (Peng et al., 2015). Rows 5-6 report the results of two rule-based systems. As can be seen, McDepCNN achieved the highest results in both precision and overall F1-score on both datasets. Note that we did not compare our results with two recent deep-learning approaches (Hua and Quan, 2016; Quan et al., 2016). This is because, unlike other previous studies, they artificially removed sentences that cannot be parsed and discarded pairs that are in a coordinate structure. Thus, our results are not directly comparable with theirs. Neither did we compare our method with Miwa et al. (2009b), because they combined, in a rich vector, analyses from different parsers and the output of multiple kernels.
To further test the generalizability of our method, we conducted cross-corpus experiments in which we trained the model on one corpus and tested it on the other (Table 3). Here we compared our results with the shallow linguistic model, which is reported as the best kernel-based method in (Tikk et al., 2013).
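For reference, the three metrics reported in this section follow the standard definitions for the positive (interacting) class. The sketch below uses toy labels, not data from the corpora.

```python
def precision_recall_f1(gold, predicted):
    """Precision, recall, and F1 for the positive class (label 1)."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


gold = [1, 1, 1, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0]
precision, recall, f1 = precision_recall_f1(gold, pred)
print(precision, recall, f1)
```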
The cross-corpus results show that McDepCNN achieved 24.4% improvement in F-score when trained on BioInfer and tested on AIMed, and 18.2% vice versa.
To better understand the advantages of McDepCNN over kernel-based methods, we followed the lead of Tikk et al. (2013) and compared the methods' performance on some known "difficult" instances in AIMed and BioInfer. This subset of difficult instances is defined as the 10% of all pairs that the fewest of the 14 kernels are able to classify correctly (Table 4).
Table 5 shows the comparisons between McDepCNN and kernel-based methods on difficult instances. The results of McDepCNN were obtained from the difficult instances combined from AIMed and BioInfer (172 positives and 479 negatives). The results of APG, Edit, and SL were obtained from AIMed, BioInfer, HPRD50, IEPA, and LLL (190 positives and 521 negatives) (Tikk et al., 2013). While the input datasets are different, our results are considerably higher than those of the prior studies. The results show that McDepCNN achieves 17.3% in F1-score on difficult instances, which is more than three times better than the other kernels. Since there are no examples of difficult instances that could not be classified correctly by at least one of the 14 kernel methods, below we only list some examples that McDepCNN can classify correctly.
1. Immunoprecipitation experiments further reveal that the fully assembled receptor complex is composed of two IL-6_PROT1, two IL-6R alpha_PROT2, and two gp130 molecules.
Together with the conclusions in (Tikk et al., 2013) that "positive pairs are more difficult to classify in longer sentences" and "most of the analyzed classifiers fail to capture the characteristics of rare positive pairs in longer sentences", this comparison suggests that McDepCNN is probably capable of better capturing long-distance features from the sentence and is more generalizable than kernel methods.
Finally, Table 6 compares the effects of different parts of McDepCNN. Here we tested McDepCNN using 10-fold cross-validation on AIMed. Row 1 used a single window of length 3, row 2 used two windows, and row 3 used three windows. The reduced performance indicates that adding more windows did not improve the model. This is partially because the multichannel design in McDepCNN has already captured good context features for PPI. Second, we used a single channel and retrained the model with window size 3. The performance then dropped by 1.1%. The results underscore the effectiveness of using the head word as a separate channel.

Conclusion
In this paper, we describe a multichannel dependency-based convolutional neural network for the sentence-based PPI task. Experiments on two benchmarking corpora demonstrate that the proposed model outperformed the current deep learning model and single feature-based or kernel-based models. Further analysis suggests that our model is substantially more generalizable across different datasets. Utilizing the dependency structure of sentences as a separate channel also enables the model to capture global information more effectively.
In the future, we would like to investigate how to assemble different resources into our model, similar to what has been done for rich-feature-based methods (Miwa et al., 2009b), where the current best performance was reported (F-score of 64.0% on AIMed and 66.7% on BioInfer). We are also interested in extending the method to PPIs beyond the sentence boundary. Finally, we would like to test and generalize this approach to other biomedical relations such as chemical-disease relations (Wei et al., 2016).

Figure 1: Overview of the CNN model.
During training, we used a dropping probability of 0.5 and ran 250 epochs over all the training examples. For each epoch, we randomized the training examples and conducted mini-batch training with a batch size of 128 (m = 128).

Table 1: Statistics of the corpora.

Table 2: Evaluation results. Performance is reported in terms of Precision, Recall, and F1-score.

Table 3: Cross-corpus results. Performance is reported in terms of Precision, Recall, and F1-score.

Table 4: Instances that are the most difficult to classify correctly by the collection of kernels using cross-validation (Tikk et al., 2013).

Table 5: Comparisons on the difficult instances with CV evaluation. Performance is reported in terms of Precision, Recall, and F1-score. The results of McDepCNN were obtained on the difficult instances combined from AIMed and BioInfer (172 positives and 479 negatives). The results of the other methods (Tikk et al., 2013) were obtained from AIMed, BioInfer, HPRD50, IEPA, and LLL (190 positives and 521 negatives).

Table 6: Contributions of different parts in McDepCNN. Performance is reported in terms of Precision, Recall, and F1-score.