CRF-LSTM Text Mining Method Unveiling the Pharmacological Mechanism of Off-target Side Effect of Anti-Multiple Myeloma Drugs

Sequence labeling of biomedical entities, e.g., side effects or phenotypes, was a long-term task in BioNLP and MedNLP communities. Thanks to effects made among these communities, adverse reaction NER has developed dramatically in recent years. As an illuminative application, to achieve knowledge discovery via the combination of the text mining result and bioinformatics idea shed lights on the pharmacological mechanism research.


Introduction
Drug genetics aimed to discern the association between drugs and adverse reaction, and allowed to personalized medication (Stephen, 2011). Associating off-target effects with adverse reaction of drugs to discover the new pharmacological effect of them is a daunting task when using experimental method alone. (Eugen et al., 2012).
As a pioneer work, Lountkine et al., (Eugen et al., 2012) explored a computational method to predict novel off-target effects of 656 marketed drugs. By using the chemoinformatics information, like ligand affinity, Similarity Ensemble Approach (SEA) (Keiser et al., 2009) was used to calculate structural similarity of drugs and targets, and the relations between drugs and targets was rebuild. In the meantime, the adverse reaction (ADR) of drug targets were curated from authoritative database including Drug-Bank, GeneGo Metabase, and Thompson Reuters Integrity. Thus, a large scale drug-target-ADR network was built, and the coincidental overlap of ADR among target gene and drugs potential gene gave illuminative explanation for the mechanism of off-targets effect.
Generally, the mechanism of off-target candidate filtering requires the prerequisite of target-drug pair indications. So far, this pair information has been widely predicted by inferring the similarity both in chemical structure and relevance info. Andreas et al., (Andreas et al., 2007) used chemical structure information to infer the drug-target pair, while Monica et al., (Monica et al., 2008) used phenotypic side effect similarities to make the inference. As a largescale bioinformatics attempt, Mohan et al., (Mohan et al., 2008) exploited a huge training set of 10 million compounds with known in-vitro activities, predicted both primary and secondary pharmacology for 1279 molecules, and over 30 thousands possible interactions were predicted for these drugs.
Multiple myeloma (MM) is one of the most common hematological malignancies, the incidence of which ranks second just next to non-Hodgkin lymphoma. Although recent advances in MM treatment has largely improved the patients clinical outcome, it remains an incurable disease due to drug-resistance and relapse which are almost inevitable (Terpos, 2017). Common adverse drug reactions (ADRs) related to anti-MM treatment include hematologic toxic effects (eg. anemia, neutropenia and thrombocytopenia), thrombosis, impaired immune function, pe-ripheral neuropathy, and gastrointestinal toxic effects (eg. mucositis, diarrhea), among many others. These ADRs bring harm to patients health and quality of life, and may result in premature discontinuation of treatment due to intolerance to side effects. Since the underlying mechanisms are largely unclear, currently they are mostly managed with symptomatic and/or supportive care, along with dosage reduction or treatment discontinuation (McCullough et al., 2018). A better understanding on the mechanisms will help us find ways to effectively cope with the above mentioned safety concerns in treating MM.
In this research, we proposed a novel pharmacological knowledge discovery strategy which integrated both Biomedical natural language processing (BioNLP) and medical informatics. The adverse reactions (ADRs) were trained by newly released ADR training data (Demner-Fushman et al., 2018), and were extracted on-line with large-scale of text mining upon 16 anti-MM drugs by using conditioned random field (CRF) and long short term memory (LSTM) neural networks. Subsequently, Human Phenotype Ontology (HPO) (Sebastian et al., 2017) and Ligand Similarity prediction were used to calculate the target phenotypes. Bioinformatics analysis hinted that an off-target gene, SLC7A7, played vital role in the side effect of a combination usage of anti-MM drugs.

Data Resource
Marketed drugs for MM were collected from drugs.com (Drugs). After searching anti-MM chemicals and removing drug synonyms, 16 drugs were extracted from the original pharmaceutical list, and drug targets were collected from SwissTargetPediction (David et al., 2014), as shown in supplementary table, Table S1 (Sixteen anti-MM drugs their possible targets). Meanwhile, drug labels were extracted from DailyMED database (National Library of Medicine and Services, 2005).
Human Phenotype Ontology (HPO) (Sebastian et al., 2017) provides standardized vocabulary of phenotypic abnormalities in human diseases. From HPO, matches of target genes and their corresponding phenotype terms were retrieved, as shown in table S2(Phenotype matching result for specific gene).

Vector representation of tokens
Regarding the input form for a neural network, word embedding, controlled vocabulary -DISORDER, and part of speech (POS) are used for vector representation of tokens.
• Pre-trained Embeddings: Compared with randomly initialized word embeddings, pre-trained word embeddings generally yield better experimental results. 200 dimensional embeddings of GloVe ( (Pennington et al., 2014)) was chosen, instead of word2vec word vectors, as GloVe is more preferable for named entity recognition tasks than word2vec (Ma and Hovy, 2016).
• DISO is a standardized dictionary from Metathesaurus of UMLS. The dictionary consists of the following 12 subtypes, i.e. acquired abnormality, anatomical abnormality, cell or molecular dysfunction, congenital abnormality, disease or syndrome, experimental model of disease, finding, injury or poisoning, mental or behavioral dysfunction, neoplastic process, pathologic function, and sign or symptom.
• The NLTK toolkit is taken into consideration to obtain the POS of each token. Randomly initialized feature weights was assigned to each POS type, and a lookup operation convert each sentence into a POS-embedding vector.

Integration of CRF and LSTM for sequence labeling
For sequence labeling task as ADR extraction, CRF is a popular mathematical method which defines the probability of the annotation of the label sequence L = (l 1 , l 2 , ..., l l ), given the observation sequence is a transition feature function that represents the transition distribution of label pair {l i−1 , l i } based on observation sequence O, while s k (l i , O, i) refers to state feature function that quantify the state distribution of the label y i given the observation sequence O. The mechanism of CRF is to optimize the parameters λ j and µ k , and maximize the probability of P (L|O): (Lafferty et al., 2001).
In the meantime, LSTM is a special Recurrent neural networks(RNNs) which could capture time dynamics via cycles in the graph, and especially, is capable of capturing long-distance dependencies with the employment of a special cell and three grates, i.e. input gate, forget gate, and output gate. Supposing that t represents a time point, x t is the input vector at time where σ is the element-wise sigmoid function and * is the element-wise product. And h t is the hidden state, namely the finally output of LSTM unit at time t.
To achieve a better semantic understanding in biologic domain, a combined BLSTM-CNNs-CRF neural network was put forward by Ma et al.(Ma and Hovy, 2016), where CNNs are utilized to model character-level information, bi-directional LSTM (BLSTM) is used to capture past and future information respectively, and CRF is employed to decode the best label sequence. In order to further improve the labeling accuracy for this specific task, double-BLSTM layer is taken into consideration instead of single-BLSTM layer, namely BLSTM, mentioned in Ma et al.(2016).
The detailed algorithm steps are shown in Figure  1. For each word in training text, the character-level representation vector computed by CNN, the DISO and POS feature got by lookup random initialization weights, concatenated with word embedding vector are designed as the input of the double-BLSTM network. And the output vectors of double-BLSTM are fed to the CRF layers to jointly decode the best label sequence. The flowchart of a specific labeling employment example is presented in the following.
For instance, "I have a cough." where "cough" is the target word. After the sentence being separated into words, the words are broken into letters, which can be embedded into a one-hot vector to compute the character representation vector by CNN. The character-level representation vectors of each words, their DISO and POS representation vector and word embeddings, computed by glove, are combined as the inputs of double-BLSTM, which has double-layer of two processes, i.e. the past(left) and the future(right). The past process takes information only from 'I' to 'cough' while the future process takes information only from 'cough' to 'I'. These two pieces of information was concatenated as the final outputs of double-BLSTM and, simultaneously, the inputs of CRF. With the utilization of CRF, the labels of sentence are tagged as 'O O O B'.

Phenotype matching algorithm
To decide whether two phenotype words match or not, two criteria were applied. First, both phenotypes are available in the database with the same is a · ID; second, the word embedding distance of two terms are small sufficiently. The algorithm is shown in the following.
• If both phenotypes are available in the database with the same is a·ID, the output will be T rue.
• If not, each target phenotype is converted into a word embedding, and if the distance of the two vectors is less than a threshold value t, the two phenotype terms are matched. Otherwise, the two terms are not matched.
Algorithm 1 Phenotype matching algorithm Input: Term A, term B, threshold value t Output: T rue/F alse The purpose of this research is to find the cooccurrence of phenotype from both drug and the related protein, so as to illuminate the pharmacological mechanism of the drug side effect. By using an integration of the CRF and LSTM text mining algorithms, sequence labeling was carried on to extract side effects, SE drug , of anti-MM drugs from DailyMed drug labels. Potential drug targets, T arget drug , were filtered by querying SwissTargetPredcition Database. Meanwhile, related phenotype of T arget drug , i.e., P heno(T arget drug ), was obtained by using Human Phenotyping Ontology (HPO). Subsequently, off-target gene, T arget of f −target drug , of corresponding drugs were filtered out by intersection analysis of SE drug and P heno(T arget drug ).

Database querying result
In total, 48 types of anti-MM drugs are collected by searching drug.com. And with the 48 drug names as searching condition, 16 different drugs and 16 corresponding labels are extracted from 27 drug labels, acquired by DailyMED. Among the 16 drugs, 2 are protein drugs, and the left 14 non-protein drugs are taken to predict their potential targets with the utilization of SwissTargetprediction, where 15 potential targets can be obtained from each drug. Searching the 15 potential targets in HPO, targets, not only our predicted targets but also targets existing in HPO, are achieved. With the application of HPO, drugs, potential target genes, and corresponding phenotype are related with each other. Meanwhile, corresponding ADRs form acquired drug labels can be collected with the strategy of sequence labeling, and improvement of the relationship between drugs and ADRs can be achieved through drugs.com, where related drugs' ADRs are collected. Eventually, drugs, potential targets, and data of overlapping ADRs are acquired via artificial recognition. And it is revealed in the result that under the circumstance of a certain drug, its potential targets have a tight relation with its ADRs.

Phenotype matching and phenotype coincidence
For the trained samples in table S3, F-Score and Matthews Correlation Coefficient (MCC) were calculated, and a best threshold t = 0.57 was obtained.
Here F − score = 2 P recision×Recall P recision+Recall , and M CC = . The selection of t is shown in figure 3. The best F-score and MCC are 0.733, 0.622 separately. Figure 3: threshold selection for phenotype matching By using algorithm 1, the gene whose phenotypes in HPO are highly consistent with drug ADRs were retrieved, and the coincidence were evaluated by Jaccard similarity coefficient. Among all the intersection of phenotype terms, the most prominent output pair is melphalan-SLC7A7 for Jaccard value being 0.280 and melphalan-CA2 for Jaccard value being 0.198. As shown in table 1, phenotype coincidence for melphalan and SLC7A7/CA2 is clear, that hinted that the two genes possibly play roles in the side effects of the drug.

Knowledge discovery of off-target side effect
An illuminative evidence comes from Melphalan, a common anti-MM drug. Through intersection analysis of SE M elphalan and P heno(T arget M elphalan ), anemia, thrombocytopenia and diarrhea were found to be the same phenotypes of the drug Melphalan and the possible target SLC7A . Observing its target genes are NR3C1, NR0B1, ANXA1, NOS2, NR1L2, and its possible target gene is SLC7A, we found that, after taking another anti-MM drug Prednisone, mRNA level of target genes goes down and that of SLC7A  goes up. That made it high chance for off-target event of SLC7A to manifest its off-target side effects: P heno(T arget M elphalan ). Thus SLC7A is with high chance the factor of the off-target effect.

Discussion
Mechanism of off-target effect is illustrated in this section. First, literature evidences are shown to address the side effect after anti-MM drug usage, and then the up/down regulatory mRNA-level tendency of on/off targets are shown.

Literature evidence
It was reported that a combined usage of melphalan, prednisone, and bortezomib (MPV) is regarded as common treatment for the high-risk MM patient, while neutropenia, thrombocytopenia, anemia, and gastrointestinal symptoms were common after MPV treatment (Kyle and Rajkumar., 2009).
Meanwhile, SLC7A is a heterotrimeric amino acid transporter (HAT) y+LAT-1 gene located on chromosome 14q11.2. It was reported that mutation in SLC7A caused Lysinuric Protein Intolerance. Then, delayed physical development, intestinal malabsorption, vomiting, and failure to thrive are the prominent clinical manifestations (Lawson and Loyd, 2013).

Up/Down regulation of target/off-target gene
Drug usage of Prednisone is treated as exposure in comparison analysis, and the connectivity map (CMAP) is used to unveil the up/down regulation by analyzing the before/after mRNA level of patience.
We input the target genes as down regulated genes and the off-target genes as up regulated, and the output off-target gene is Prednisone, with significant P value, 0.01029. As shown in figure 4, after taking Prednisone, as it mentioned above, the expression levels of target genes are down regulated while the off-target genes are up, the steady state is broken. In this condition, more off-target proteins lead to more combination with Melphalan than usual, which contribute to more significant side effects.
Here, we infer that the usage of Prednisone lead to an up regulation of SLC7A, and it arises competition between SLC7A and the drug targets, i.e., NR3C1, NR0B1, ANXA1, NOS2, NR1L2. The binding of SLC7A to Melphalan brings the off-target effect. Therefore, thrombocytopenia, anemia, and gastrointestinal symptoms can be easily observed after combined usage of Melphalan and Prednisone.

Conclusion
Sequence labeling of biomedical entities, e.g., side effects or phenotypes, was a long-term task in BioNLP and MedNLP communities. Thanks to effects made among these communities, adverse reaction NER has developed dramatically in recent years (Demner-Fushman et al., 2018). As an illuminative application, to achieve knowledge discovery via the combination of the text mining result and bioinformatics idea shed lights on the pharmacological mechanism research.