Amalgamation of protein sequence, structure and textual information for improving protein-protein interaction identification

An in-depth exploration of protein-protein interactions (PPI) is essential to understand metabolism as well as the regulation of biological entities like proteins, carbohydrates, and many more. Most recent PPI tasks in the BioNLP domain have been carried out solely using textual data. In this paper, we argue that incorporating multimodal cues can improve the automatic identification of PPI. As a first step towards enabling the development of multimodal approaches for PPI identification, we have developed two multi-modal datasets which are extensions and multi-modal versions of two popular benchmark PPI corpora (BioInfer and HRPD50). Besides the existing textual modality, two new modalities, 3D protein structure and underlying genomic sequence, are added to each instance. Further, a novel deep multi-modal architecture is implemented to efficiently predict protein interactions from the developed datasets. A detailed experimental analysis reveals the superiority of the multi-modal approach over strong baselines, including unimodal approaches and state-of-the-art methods, on both of the generated multi-modal datasets. The developed multi-modal datasets are available for use at https://github.com/sduttap16/MM_PPI_NLP.


Introduction
Understanding protein-protein interactions (PPI) is indispensable for comprehending different biological processes such as translation, protein functions (Kulmanov et al., 2017), gene functions (Dutta and Saha, 2017; Dutta et al., 2019b), metabolic pathways, etc. PPI information helps researchers to discover disease mechanisms and plays a seminal role in designing therapeutic drugs (Goncearenco et al., 2017). Over the years, a significant amount of protein-protein interaction information has been published in scientific articles in unstructured text format. In recent years, however, there has been an exponential rise in the number of biomedical publications (Khare et al., 2014). It has therefore become imperative, urgent and of great interest to develop an intelligent information extraction system to assist biologists in curating and maintaining PPI databases.
This pressing need has motivated Biomedical Natural Language Processing (BioNLP) researchers to automatically extract PPI information by exploring various AI techniques. Recent advancements in deep learning (LeCun et al., 2015; Bengio et al., 2007) have opened up new avenues for solving different well-known problems, ranging from computational biology (Alipanahi et al., 2015; Dutta et al., 2019a) and machine translation (Cho et al., 2014) to image captioning (Chen et al., 2017). Subsequently, there is a notable trend of using deep learning to solve different natural language processing (NLP) tasks in the biomedical and clinical domains (Asada et al., 2018; Alimova and Tutubalina, 2019), including the identification of protein-protein interactions from biomedical corpora (Yadav et al., 2019; Peng and Lu, 2017). Multi-modal deep learning models, which combine information from multiple sources/modalities, show promising results compared to conventional single-modality models on various NLP tasks like sentiment and emotion recognition (Qureshi et al., 2019, 2020), natural language generation and machine translation (Poria et al., 2018; Zhang et al., 2019; Qiao et al., 2019; Fan et al., 2019), etc. There exist a few popular multi-modal datasets which are extensively used for various NLP problems such as emotion recognition in conversations (Poria et al., 2018; Chen et al., 2018), image captioning (Lin et al., 2014) and sentiment analysis (Zadeh et al., 2016). Compared to single-modality approaches, multi-modal techniques provide a more comprehensive perspective of the dataset under consideration.
Despite the popularity of multi-modal approaches in solving traditional NLP tasks, there is a dearth of multi-modal datasets in the BioNLP domain, especially for the PPI identification task. The available PPI benchmark datasets contain solely the textual knowledge of different protein pairs, which does not help in anticipating the molecular properties of the proteins. Hence, along with the textual information, the incorporation of molecular structure or the underlying genomic sequence can aid in understanding the regulation of protein interactions. The integration of multi-modal features can help in obtaining deeper insights, but the concept of a multi-modal architecture combining textual and biological aspects has not been explored much in the BioNLP domain (Peissig et al., 2012; Jin et al., 2018).

Motivation and Contribution
The main motivation for this research work is to generate multi-modal datasets for the PPI identification task in which, along with the textual information present in the biomedical literature, we also explore the genetic and structural information of the proteins. Biomedical and clinical text databases are an important resource for learning about physical interactions amongst protein molecules; however, they may not be adequate for exploring the biological aspects of these interactions. In the field of bioinformatics, there are various enriched web-based archives that contain multi-omics biological information regarding protein interactions. The integration of multi-omics information from these databases helps in understanding various physiological characteristics (Sun et al., 2019; Ray et al., 2014; Amemiya et al., 2019; Hsieh et al., 2017; Dutta et al., 2020). Hence, in our current work, along with the textual information from biomedical corpora, we also incorporate structural properties of the protein molecules as biological information for solving the PPI task. For structural information, we consider the atomic structure (3D PDB structure) and the underlying nucleotide sequence (FASTA sequence) of the protein molecules. In the BioNLP domain, collecting such biological data (multi-omics information) from a text corpus is somewhat difficult: to obtain the information for the other modalities, we need to exploit different web-based archives meant for biological structures.
Drawing inspiration from these findings, we have generated protein-protein interaction multi-modal datasets which include not only the textual information, but also the structural counterparts of the proteins. Finally, a novel deep multi-modal architecture is developed to efficiently predict protein-protein interactions by considering all modalities. The main contributions of this study are summarized as follows: 1. For this study, we extend and further improve two biomedical corpora containing PPI information for the multi-modal scenario by manually annotating them and web-crawling two different bio-enriched archives.
2. Our proposed multi-modal architecture uses self-attention mechanism to integrate the extracted features of different modalities.
3. This work is a step towards integrating multi-omics information with text mining from biomedical articles to enhance PPI identification. To the best of our knowledge, this is the first attempt in this direction.
4. The results and the comparative study prove the effectiveness of our developed multi-modal datasets along with the proposed multi-modal architecture.

Related Works
There are a few works (Ono et al., 2001; Blaschke et al., 1999; Huang et al., 2004) which focus on rule-based PPI information extraction methods, such as co-occurrence rules (Stapley and Benoit, 1999), applied to biomedical texts. In (Giuliano et al., 2006), relations are extracted from the entire sentence by considering shallow syntactic information. (Erkan et al., 2007) utilize semi-supervised learning and cosine similarity over the shortest dependency path (SDP) between protein entities. Some important kernel-based methods for the PPI extraction task are the graph kernel (Airola et al., 2008a), bag-of-words (BoW) kernel (Saetre et al., 2007), edit-distance kernel (Erkan et al., 2007) and all-path kernel (Airola et al., 2008b). (Yadav et al., 2019) presented an attention-based bidirectional long short-term memory network (BiLSTM) model that uses the SDP between protein pairs along with latent PoS and position features.

Dataset Formation and Preprocessing
In this study, we have extended, improved, and further developed two popular benchmark PPI corpora, namely the BioInfer (http://corpora.informatik.hu-berlin.de/) and HRPD50 (https://goo.gl/M5tEJj) datasets, for the multi-modal scenario. Along with the textual information, these enhanced multi-modal datasets contain the biological counterparts of the interacting or non-interacting protein pairs. The biological information comes from the underlying FASTA sequence and the atomic structures of the interacting protein pairs.

Figure 2: Statistics of positive and negative instances across our developed multi-modal datasets.

Dataset Preparation
Firstly, we extracted the data, primarily consisting of sentences with two or more protein entities, from the XML representations of the two PPI corpora mentioned earlier. To simplify the complex relations among multiple protein entities, we consider only a single protein pair at a time and determine whether they interact. Among these relations, pairs that are directly mentioned in the dataset are treated as positive instances. All other pairs are considered non-interacting proteins, i.e., negative instances.
Consider an instance of the HRPD50 dataset, "Megalin and cubilin: multifunctional endocytic receptors Megalin and cubilin are two structurally different endocytic receptors that interact to serve such functions" (Figure 1). In this particular example, there are four protein entities, but since we consider interactions between two proteins at a time, we arrive at six possible relations (shown in the table of Figure 1). Among these relations, only one pair (Megalin, cubilin) is annotated as interacting in the HRPD50 dataset. Hence, the number of instances in our dataset is much higher than in the original BioInfer and HRPD50 datasets.
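The pairwise instance generation described above can be sketched as follows; the helper name and the two extra entity names are illustrative placeholders, not taken from the corpus:

```python
from itertools import combinations

def make_instances(proteins, gold_interactions):
    """Every unordered pair of protein mentions in a sentence becomes one
    candidate instance; pairs listed as interacting in the corpus are
    positive (label 1), all remaining pairs are negative (label 0)."""
    gold = {frozenset(p) for p in gold_interactions}
    return [((p1, p2), int(frozenset((p1, p2)) in gold))
            for p1, p2 in combinations(proteins, 2)]

# Four entities, as in the Megalin/cubilin sentence, yield C(4,2) = 6 pairs,
# of which exactly one is positive.
pairs = make_instances(
    ["Megalin", "cubilin", "proteinC", "proteinD"],
    [("Megalin", "cubilin")],
)
```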
After generating both positive and negative instances, we next downloaded the other two modalities. To download the genomic sequence and the 3D structure of a protein, its Ensembl ID and PDB ID must be known. However, the biological archives record the relationships between genes and PDB or Ensembl IDs, rather than between proteins and these IDs. Hence, we used manual annotation to find the respective gene name of each protein name, and then Python-based methods to find the Ensembl ID and PDB ID of each of these genes. These IDs allow us to download the underlying genomic sequence (FASTA sequence) from the Ensembl genome browser (https://useast.ensembl.org/index.html) and the structure of each protein (3D PDB structure) from the RCSB Protein Data Bank (http://www.rcsb.org/) archive. The pre-processing and generation of the multi-modal datasets from the biomedical corpora are pictorially depicted in Figure 1. The complete exemplified multi-modal datasets are available at the provided GitHub link.
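The download step can be sketched as below; the endpoint patterns follow the public Ensembl REST and RCSB file-download conventions, and the example IDs are placeholders rather than the actual IDs used in the paper:

```python
# Assumes the Ensembl ID and PDB ID have already been resolved via the
# manual gene-name annotation described above.
ENSEMBL_SEQ = "https://rest.ensembl.org/sequence/id/{}?content-type=text/x-fasta"
RCSB_DOWNLOAD = "https://files.rcsb.org/download/{}.pdb"

def fasta_url(ensembl_id: str) -> str:
    # FASTA record of the underlying genomic sequence
    return ENSEMBL_SEQ.format(ensembl_id)

def pdb_url(pdb_id: str) -> str:
    # 3D atomic structure in PDB format (PDB IDs are case-insensitive)
    return RCSB_DOWNLOAD.format(pdb_id.upper())

# e.g. urllib.request.urlretrieve(pdb_url("1xyz"), "1XYZ.pdb") would fetch
# the structure file; "1xyz" is a placeholder ID.
```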

Dataset Annotation and Statistics
A major challenge in creating the dataset is manually encoding the relationships between genes and proteins, which is a many-to-many mapping for biological reasons. Hence, to find the genes most related to a particular protein, we engaged three annotators with strong biological knowledge. The disagreement between the annotators was less than 1%, and disagreements were resolved by majority voting. The total number of instances in the developed multi-modal datasets is shown in Figure 2.
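The majority-vote adjudication can be sketched as a one-liner over the three annotators' labels (the gene names here are placeholders):

```python
from collections import Counter

def majority_vote(labels):
    # With three annotators, the most common label is the adjudicated one.
    return Counter(labels).most_common(1)[0][0]

adjudicated = majority_vote(["GENE_A", "GENE_A", "GENE_B"])
```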

Problem Formalization
Our goal is to develop a deep multi-modal architecture that can efficiently predict whether two proteins interact with each other from the developed multi-modal datasets. Formally, for each sentence/instance S_i of the multi-modal dataset, let I_i^Text, I_i^Struc and I_i^Seq represent the textual, structural and sequence modality, respectively. The proposed PPI task for an instance S_i is mathematically formulated as

ŷ_i = f_act(f_sa(M_1(I_i^Text), M_2(I_i^Struc), M_3(I_i^Seq)))

Here M_1, M_2, M_3 are three different deep learning based models for the text, structure and sequence modality, respectively. The extracted features are fused by a self-attention mechanism (f_sa), and the fused representation is finally fed to an activation function (f_act) for predicting protein interactions.

Proposed Methodology
The major steps of our proposed multi-modal architecture are shown in Figure 3.

Feature Extraction from Textual Modality
The proposed deep learning model (M_1) for extracting features from the textual modality is described in Figure 4. Firstly, we use the BioBERT v1.1 (Lee et al., 2019) model to obtain a vector representation (u_i ∈ R^d) of the textual instance (I_i^Text). With almost the same architecture as the BERT (Bidirectional Encoder Representations from Transformers) model (Devlin et al., 2018), BioBERT v1.1 is pre-trained on 1M PubMed abstracts. Here, each sentence is embedded as a unique vector of size 768 (i.e., d = 768) by averaging the representations of the first token ([CLS]) from the last four transformer layers of the BioBERT model. Inspired by the effective use of stacked bidirectional long short-term memory (BiLSTM) networks (Yadav et al., 2019), we use one to encode the embedded representation (u_i). In the stacked BiLSTM, the l-th level BiLSTM computes the forward (→h_l) and backward (←h_l) hidden states, which are concatenated and fed to the next, (l+1)-th, BiLSTM layer. The final representation (F_i^Text) of I_i^Text is therefore obtained from the last layer (L) of the stacked BiLSTM model as

F_i^Text = →h_L ⊕ ←h_L
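The sentence-embedding step above can be sketched as follows; the random matrix stands in for the actual BioBERT activations, so only the averaging of the last four layers' [CLS] vectors is illustrated:

```python
import numpy as np

# Toy stand-in for BioBERT: one 768-dim [CLS] vector per transformer layer,
# for the last four layers. The real vectors come from the pre-trained model.
rng = np.random.default_rng(1)
cls_last_four = rng.standard_normal((4, 768))

# Average the four [CLS] vectors into the single sentence embedding u_i.
u_i = cls_last_four.mean(axis=0)
```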

Sequence Feature Extraction
Firstly, we downloaded the FASTA sequence for the protein pair of an instance (S_i) from the Ensembl genome browser. In this modality, each protein (I_i^Seq) is represented as a string over the four nucleotides, i.e., I_i^Seq ∈ {A, T, G, C}^+. The underlying genomic sequence is treated as a separate channel alongside the text modality. Since the molecular properties of protein molecules depend heavily on the order of the nucleotides, we apply a capsule network (Sabour et al., 2017) to capture the spatial information between nucleotides. To this end, we first convert each nucleotide into a one-hot vector representation, so that the protein is represented as a 2D matrix O ∈ {0, 1}^{4×m}, where m is the number of nucleotides in the sequence. Three convolutional layers (f_conv) are then applied to O, the output of the third layer is fed to the primary capsules, and the output of the primary capsules is fed to the secondary capsules, which produce the final sequence representation (F_i^Seq).
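The one-hot step above, mapping a nucleotide string to the matrix O ∈ {0,1}^{4×m}, can be sketched as below; the row order (A, T, G, C) is an arbitrary convention and the toy sequence is illustrative:

```python
import numpy as np

NUCLEOTIDES = "ATGC"  # row order is an arbitrary convention

def one_hot(seq: str) -> np.ndarray:
    """Encode a nucleotide string as the 4 x m binary matrix O."""
    O = np.zeros((4, len(seq)), dtype=np.int8)
    for j, nt in enumerate(seq):
        O[NUCLEOTIDES.index(nt), j] = 1
    return O

O = one_hot("ATGCCA")  # toy sequence, m = 6
```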

Structural Feature Extraction
For the structure modality, we first downloaded each protein's 3D structure from the RCSB Protein Data Bank website and obtained the atomic coordinates from the PDB file. Among all the modalities, the structural modality is the most relevant for inferring biological information. In this modality, we consider the atomic structure of the proteins. Inspired by the inherent capability of graph convolutional neural networks (GCNN) (Kipf and Welling, 2016; Zamora-Resendiz and Crivelli, 2019) to learn effective latent representations of graphs, we use one to learn a local neighborhood representation around each atom of the proteins. For this structural modality, the developed model (Figure 6) learns the chemical bonding information from the atomic structure of the proteins rather than from a corresponding image. Each protein, which consists of a set of atoms {a_1, a_2, ..., a_n}, has an adjacency matrix A ∈ {0, 1}^{n×n} and a node feature matrix X ∈ R^{n×d_v}. In this study, we consider the two proteins (P_1, P_2) of an instance, extract insightful features (y_1, y_2) using the GCNN, and concatenate them for the final representation:

F_i^Struc = y_1 ⊕ y_2

The GCNN takes A and X of each protein as inputs, and the hidden representation at layer i+1 of protein P_j is computed as

H_j^{i+1} = f(σ(A) H_j^i W_j^i)

Here, ⊕, f, σ are the concatenation operator, a non-linear activation function and the propagation rule, respectively. W_j^i is the weight matrix of layer i of protein P_j, and H_j^0 is initialized with the node feature matrix X.
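A single graph-convolution layer over the atomic graph can be sketched with NumPy; the symmetric normalisation used here is the standard Kipf-Welling propagation rule, which may differ in detail from the paper's σ, and the toy graph and dimensions are illustrative:

```python
import numpy as np

def gcn_layer(A: np.ndarray, H: np.ndarray, W: np.ndarray) -> np.ndarray:
    """One GCN layer: H' = ReLU(D^-1/2 (A+I) D^-1/2 H W)."""
    A_hat = A + np.eye(A.shape[0])               # add self-loops
    d_inv_sqrt = np.diag(A_hat.sum(axis=1) ** -0.5)
    return np.maximum(d_inv_sqrt @ A_hat @ d_inv_sqrt @ H @ W, 0.0)

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0],                         # toy 3-atom adjacency matrix
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = rng.standard_normal((3, 4))                  # node features, d_v = 4
W = rng.standard_normal((4, 8))                  # layer weights
H1 = gcn_layer(A, X, W)                          # per-atom hidden features
```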

Attention-based Multi-modal Integration
After extracting the features of the three modalities (textual, protein sequence and protein structure), we fuse them using an attention mechanism. Attention has the ability to focus on the features that are most relevant to a context-specific task. In this study, we use the self-attention mechanism of the transformer model, which concatenates the attention-weighted feature representations of the i-th instance (S_i) into the final integrated representation (F) as

F = W_1 F_i^Text ⊕ W_2 F_i^Seq ⊕ W_3 F_i^Struc    (4)

Here, W_m represents the attention weight of the m-th modality. Finally, this representation (F) is fed to a softmax layer for the final classification.
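A minimal sketch of the weighted fusion in Eq. (4) follows; the scalar softmax weighting stands in for the transformer's multi-head self-attention, so the score vector used here is a simplifying assumption:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max())
    return e / e.sum()

def fuse(features, scores):
    """Scale each modality's feature vector by its attention weight,
    then concatenate, in the spirit of Eq. (4)."""
    w = softmax(np.asarray(scores, dtype=float))
    return np.concatenate([w_m * f for w_m, f in zip(w, features)])

# Toy 4-dim features for the text, sequence and structure modalities.
F_text, F_seq, F_struc = np.ones(4), 2 * np.ones(4), 3 * np.ones(4)
F = fuse([F_text, F_seq, F_struc], [0.1, 0.5, 2.0])
```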

Experimental Results and Analysis
In this section, we describe the hyper-parameter details and present a comparative analysis of the proposed deep multi-modal architecture. To explore the role of the developed multi-modal datasets along with the proposed multi-modal architecture in predicting protein interactions, several experiments are conducted evaluating each modality as well as different combinations of modalities. Additionally, we compare the performance of our multi-modal approach with various state-of-the-art methods.

Details of Hyper-parameters
In our proposed multi-modal architecture, softmax is used for the final classification, and the Adam optimizer is used throughout. In the stacked BiLSTM model for the textual modality, six (i.e., L = 6) BiLSTM layers are used.
For the structural features, a graph convolutional neural network with two hidden layers is used. For the sequence modality, three ReLU convolutional layers followed by a capsule network are used; the developed capsule network has eight primary capsules and two secondary capsules. Finally, the self-attention of the transformer model is utilized for integrating the features of the different modalities. For self-attention, we use three encoders followed by a fully connected network with two hidden layers, whose output is fed to softmax for the final classification.

Comparative analysis with baselines
For baselines, we have compared our multi-modal approach with three uni-modal, three bi-modal and two other multi-modal architectures.
• Textual modality BioBERT and a stacked BiLSTM are utilized for this model.
• Protein sequence modality Capsule network is utilized to understand the underlying features extracted from the protein sequences.
• Protein structural modality Inspired by the effective performance of GCNN in understanding the graph representation, GCNN is applied on atomic structure of proteins.
• 3D structural + sequence modality In this bimodal architecture, GCNN and capsule network are used for structural and sequence modality, respectively. Finally, self-attention is utilized to understand the integrated features of these two modalities.
• Textual + sequence modality In this model, self-attention is applied on the extracted features of textual and sequence modality.
• Textual + 3D structure modality: To learn the different attributes discussed in the text and protein structural modality, self-attention mechanism is applied to fuse them.
• Multi-modal approach 1 The architecture of this baseline is the same as the proposed multi-modal approach, except that the learned features of each modality are simply concatenated instead of being fused with an attention mechanism.
• Multi-modal approach 2 In this model, an attention mechanism is applied to integrate the features of the textual, protein sequence and structural modalities. For extracting the features from text, protein sequence and protein structure, we use BioBERT, BiLSTM and CNN, respectively.

Table 1: Comparative study of our proposed deep multi-modal approach with several baselines in terms of precision, recall and F-measure.
The results reported in Table 1 illustrate the supremacy of the proposed multi-modal approach over other baselines.

Comparison with State-of-the-art
Additionally, along with the baselines, we have compared the performance of our multi-modal approach with several existing works reported in the literature. For the BioInfer dataset, we compare our proposed method with nine state-of-the-art models. These existing methods are based on different techniques, such as kernels (Choi and Myaeng, 2010; Tikk et al., 2010; Qian and Zhou, 2012; Li et al., 2015), deep neural networks (Zhao et al., 2016), a multi-channel dependency-based convolutional neural network model (Peng and Lu, 2017), semantic feature embedding (Choi, 2018) and the shortest dependency path (Hua and Quan, 2016). We have also compared our approach with a recent deep learning-based approach proposed by (Yadav et al., 2019). The comparative performance analysis for the BioInfer dataset is tabulated in Table 2.

Method                   Precision  Recall  F-score
(Qian and Zhou, 2012)    63.61      61.24   62.40
(Peng and Lu, 2017)      62.70      68.20   65.30
(Zhao et al., 2016)      53.90      72.90   61.60
(Tikk et al., 2010)      53.30      70.10   60.00
(Li et al., 2015)        72

Table 2: Comparative analysis of the proposed multi-modal approach with other state-of-the-art approaches for the BioInfer dataset.

We have also compared our approach with nine existing approaches for the HRPD50 dataset. The comparative results for the HRPD50 dataset are presented in Table 3.

Discussion
By analyzing the above comparative study, we can infer that the overall performance of our proposed multi-modal approach surpasses the other baselines and existing methods. Among the baseline models, the proposed multi-modal approach outperforms its unimodal and bimodal counterparts. Among the uni-modal architectures, the structural modality outperforms the other two, which suggests the importance of the structural modality over the textual and sequence modalities. The sequence modality performs poorly because of its huge length (most sequences are approximately 10,000 nucleotides long). Among the bimodal architectures, the (textual + structural) model surpasses the other bimodal and unimodal counterparts. This fusion shows improvements of 5.1% and 5.5% in F-score over the best unimodal architecture for the HRPD50 and BioInfer datasets, respectively. Similarly, our proposed multi-modal architecture shows an improvement over its bi-modal counterparts, as well as an average improvement of 3.87% and 2.24% in F-score over multi-modal approach 1 and multi-modal approach 2, respectively. This improvement indicates that, in addition to the multiple modalities, the underlying deep learning models and the fusion technique contribute significantly to the performance of the overall architecture. In addition, Table 2 and Table 3 indicate that the proposed multi-modal architecture outperforms the best and most recent existing methods for the BioInfer and HRPD50 datasets, respectively. We have performed Welch's t-test to show that the improvements obtained by the proposed approach are statistically significant.

Method                        Precision  Recall  F-score
(Van Landeghem et al., 2008)  60.00      51.00   55.00
(Miwa et al., 2009)           68.50      76.10   70.90
(Airola et al., 2008a) (Co-occ) 38.90    100     55.40
(Pyysalo et al., 2008)        76.00      64.00   69.00

Table 3: Comparative analysis of the proposed multi-modal approach with other state-of-the-art approaches for HRPD50 dataset.
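For reference, the Welch's t statistic used for the significance claim above can be computed as follows; the two per-run F-score samples are illustrative values, not the paper's actual runs:

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and degrees of freedom for two samples
    with (possibly) unequal variances."""
    va, vb = variance(a) / len(a), variance(b) / len(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va**2 / (len(a) - 1) + vb**2 / (len(b) - 1))
    return t, df

# Illustrative per-run F-scores: proposed model vs. a baseline.
t, df = welch_t([78.1, 77.6, 78.4, 77.9, 78.2],
                [75.0, 74.2, 75.5, 74.8, 74.6])
```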
From the above comparative study, it is evident that our proposed multi-modal approach identifies protein interactions effectively, and it can be further improved in several directions.

Error Analysis
After thoroughly analyzing the false positive and false negative instances, it can be inferred that the following are the possible sources of errors:

References

Yu-Lun Hsieh, Yung-Chun Chang, Nai-Wen Chang, and Wen-Lian Hsu. 2017. Identifying protein-protein interactions in biomedical literature using recurrent neural networks with long short-term memory. In Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 2: Short Papers), pages 240-245.

Lei Hua and Chanqin Quan. 2016. A shortest dependency path based convolutional neural network for protein-protein relation extraction. BioMed Research International, 2016.