Biomedical Relation Classification by single and multiple source domain adaptation

Relation classification is crucial for inferring semantic relatedness between entities in a piece of text. Such systems can be trained given labelled data. However, relation classification is highly domain-specific, and labelling data for a new domain takes considerable effort. In this paper, we explore domain adaptation techniques for this task. While past work has focused on single-source domain adaptation for biomedical relation classification, we classify relations in an unlabeled target domain by transferring useful knowledge from one or more related source domains. Our model improves the state-of-the-art F1 score on 3 benchmark biomedical corpora in the single-source setting and on 2 out of 3 in the multi-source setting. When used with contextualized embeddings, performance improves further, outperforming neural-network-based domain adaptation baselines in both cases.


Introduction
In the biomedical domain, a relation can exist between various entity types, such as protein-protein, drug-drug, or chemical-protein pairs. Detecting relationships is a fundamental sub-task of automatic Information Extraction, reducing the effort of manual inspection, especially given the growing volume of biomedical articles. However, existing supervised systems are highly data-driven. This poses a challenge since manual labelling is a costly and time-consuming process, and there is a dearth of labelled data in the biomedical domain covering all tasks and new datasets. A system trained on a specific dataset may perform poorly on another for the same task (Mou et al., 2016), due to dataset variance which can arise from sample selection bias (Rios et al., 2018).
Domain Adaptation aims at adapting a model trained on a source domain to a target domain that may differ in its underlying data distribution. Past work on domain adaptation for biomedical relation classification has focused on single-source adaptation (Rios et al., 2018). However, multiple sources from related domains can be beneficial for classification in a low-resource scenario.
In this paper, we perform domain adaptation for biomedical binary relation classification at the sentence level. For single-source single-target (SSST) adaptation, we transfer between different datasets of protein-protein interaction, along with drug-drug interaction. We also explore multi-source single-target (MSST) adaptation to enrich the transferred knowledge, using additional smaller corpora for the protein-protein relation and multiple labels for the chemical-protein relation, respectively. Given an unlabeled target domain, we transfer common useful features from related labelled source domains using adversarial training (Goodfellow et al., 2014). This helps overcome the sampling bias and, through min-max optimization, learn common indistinguishable features that promote generalization. We adopt the Multinomial Adversarial Network integrated with the Shared-Private model (Chen and Cardie, 2018), originally proposed for the task of Multi-Domain Text Classification. Unlike traditional binomial adversarial networks, it can handle multiple source domains at a time. The Shared-Private model (Bousmalis et al., 2016) splits the representation into a private space, which learns features specific to a particular domain, and a shared space, which learns features common to all domains. This split keeps the two spaces uncontaminated, preserving their uniqueness. The contributions of our approach are as follows: 1) We show that using a shared-private model along with adversarial training improves SSST adaptation compared to neural-network baselines. Using multiple source corpora from similar domains leads to further performance enhancement. Moreover, contextualized sentential embeddings lead to better performance than existing baseline methods for both MSST and SSST.
2) We explore the generalizability of our framework using two prominent neural architectures: CNN (Nguyen and Grishman, 2015) and Bi-LSTM (Kavuluru et al., 2017), of which we find the former to be more robust across our experiments.

Methodology
For every labeled source and a single unlabeled target, we have a set of NER-tagged sentences, each of which is represented as X = {e_1, e_2, w_1, ..., w_n}, where e_1 and e_2 are the two tagged entities and w_j is the j-th word in the sentence. A labelled source instance is accompanied by a relation label (True or False). In this section we discuss the input representation, followed by the model description.

Input Representation
We form word and position embeddings for every word in an NER-tagged sentence. We use PubMed-and-PMC-w2v to generate word embeddings; the embedding matrix has size |V| x d_w, where d_w = 200 is the word embedding dimension and |V| is the vocabulary size. The position embedding for the j-th word, relative to the two tagged entities e_1 and e_2, is the tuple (p_e1(j), p_e2(j)), where p_e1(j), p_e2(j) ∈ R^{d_e}, with d_e the position embedding dimension.
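As a concrete illustration, the relative position features for a sentence can be computed as below. This is a minimal numpy sketch; the function name, the clipping range max_dist, and the toy sentence are our own assumptions, not details from the paper.

```python
import numpy as np

def position_features(n_words, e1_idx, e2_idx, max_dist=50):
    """Relative position of each word w.r.t. the two tagged entities,
    clipped to [-max_dist, max_dist] before lookup in the position
    embedding table. The clipping range is an illustrative choice."""
    idx = np.arange(n_words)
    p_e1 = np.clip(idx - e1_idx, -max_dist, max_dist)
    p_e2 = np.clip(idx - e2_idx, -max_dist, max_dist)
    # One (p_e1(j), p_e2(j)) tuple per word, as in the input representation
    return list(zip(p_e1.tolist(), p_e2.tolist()))

# Toy sentence: "@PROT1$ interacts with @PROT2$" -> entities at indices 0 and 3
feats = position_features(4, e1_idx=0, e2_idx=3)
```

Each tuple then indexes two learned position embedding vectors, which are concatenated with the word embedding.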

Model
Figure 1 shows our adaptation of the MAN framework, whose components are discussed below.
Shared & domain feature extractors (F_s, F_d_i): The input representation is fed to both F_d_i and F_s for labeled source domains, whereas unlabeled target instances are fed only to F_s. For SSST, the model is trained on a single labeled source domain and tested on an unlabeled target domain. For MSST, we do not combine the sources into a single corpus, since that leads to a number of false negatives. Instead, we make two different assumptions to obtain multiple sources: 1) following Nguyen et al. (2014), we treat multiple labels of a single corpus as different sources, and 2) we use multiple smaller corpora from a similar domain as additional sources.

Domain discriminator: D is a fully-connected layer with softmax that predicts probabilities over the multiple domains, following the Multinomial Adversarial Network. The output of F_s is fed to D, which is trained adversarially, separately from the rest of the network, with the L2 loss

Loss_D = sum_{i=1}^{N} (d̂_i - 1[i = d])^2,

where d is the index of the true domain and d̂ is the predicted domain distribution, with sum_{i=1}^{N} d̂_i = 1 and ∀i: d̂_i ≥ 0. F_s tries to fool D so that it cannot correctly guess which domain a sample instance comes from; in the process, F_s learns domain-indistinguishable features.
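The multinomial L2 domain loss can be sketched in a few lines of numpy. This is an illustrative re-implementation of the formula from Chen and Cardie (2018), not the authors' code; the function name is ours.

```python
import numpy as np

def man_l2_loss(d_hat, d):
    """MAN-L2 domain loss: squared distance between the predicted
    domain distribution d_hat (non-negative, sums to 1 over the N
    domains) and the one-hot vector of the true domain index d."""
    target = np.zeros_like(d_hat)
    target[d] = 1.0
    return float(np.sum((d_hat - target) ** 2))

# D's prediction over N=3 domains for a sample drawn from domain 0
loss = man_l2_loss(np.array([0.5, 0.3, 0.2]), d=0)
```

D is trained to minimize this loss over labeled source and unlabeled target instances, while F_s is updated in the opposite direction so that D's predictions approach uniform.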
Relation classifier: C is a fully-connected layer with a softmax, used to predict the class probabilities. As additional sentence-level features we use Bio-BERT (Lee et al., 2019) [CLS] embeddings (Geeticka Chauhan, 2019), which have been shown to improve performance in many downstream tasks. These are concatenated with the fixed-size sentence representations from F_s and F_d_i, and together they serve as input to C. For the unlabeled target, no domain-specific features are generated from F_d_i at test time, so that part is set to the zero vector. For binary classification, C is trained with the negative log-likelihood loss

Loss_C = -log ŷ_y,

where y is the true relation label and ŷ is the softmax output. The objective of F_d_i is the same as that of C, and it relies only on labeled data. The objective of the shared feature extractor F_s, on the other hand, is

Loss of F_s = Classifier loss + λ · Domain loss.

It consists of two components: improving the performance of C and enhancing the learning of invariant features across all domains. A hyperparameter λ balances the two.
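The classifier input and the two loss terms can be illustrated with a small numpy sketch. The function names and the λ value are our own; the zeroed private vector mirrors the test-time treatment of the unlabeled target described above.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def nll_loss(logits, y):
    """Negative log likelihood of the true relation label y
    under the softmax over the classifier logits."""
    return float(-np.log(softmax(logits)[y]))

def fs_objective(clf_loss, domain_loss, lam=0.1):
    """Combined objective of the shared extractor F_s: help the
    relation classifier while (weighted by lam) learning features
    that fool the domain discriminator. lam=0.1 is illustrative."""
    return clf_loss + lam * domain_loss

# Input to C: shared features concatenated with private features;
# for an unlabeled target at test time the private part is zeroed.
shared, private = np.ones(4), np.zeros(4)
clf_input = np.concatenate([shared, private])
```

A balanced softmax over two classes gives the expected -log(0.5) loss, which is a quick sanity check for the implementation.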

Datasets
The dataset statistics are summarized in Table 1.

Experiments
Pre-processing: We anonymize the named entities in each sentence by replacing them with predefined tags like @PROT1$ and @DRUG$ (Bhasuran and Natarajan, 2018).
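A minimal sketch of this anonymization step is shown below. The function, the span format, and the toy sentence are our own illustration of the idea, not the preprocessing code of the cited work.

```python
def anonymize(tokens, e1_span, e2_span, tags=("@PROT1$", "@PROT2$")):
    """Replace the two tagged entity mentions with fixed placeholder
    tags (e.g. @PROT1$, @DRUG$) so the model cannot memorize entity
    names. Spans are (start, end) token indices, end exclusive."""
    out, i = [], 0
    while i < len(tokens):
        if i == e1_span[0]:
            out.append(tags[0]); i = e1_span[1]
        elif i == e2_span[0]:
            out.append(tags[1]); i = e2_span[1]
        else:
            out.append(tokens[i]); i += 1
    return out

sent = anonymize("BRCA1 binds the TP53 protein".split(), (0, 1), (3, 4))
```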

Multi-source single target (MSST)
The experiments with the two different assumptions for obtaining multiple sources are as follows. Multiple smaller corpora from a similar domain: For Protein-Protein Interaction there are three smaller standard corpora in the literature, namely LLL (Nedellec, 2005), IEPA (Ding et al., 2001) and HPRD50 (Fundel et al., 2007). All three were considered as additional sources to transfer knowledge. AiMed (AM) and BioInfer (BI) were alternately selected as the unlabeled target in 2 different experiments, while the remaining 4, denoted as 4P, were used as source corpora.
Multiple labels from a single corpus: For the ChemProt corpus, we consider the various labels as different sources, following Nguyen et al. (2014). The five positive labels of ChemProt are CPR:3, CPR:4, CPR:5, CPR:6 and CPR:9, which stand for upregulator, downregulator, agonist, antagonist and substrate, respectively. We evaluate classification performance on the unlabeled targets CPR:6 and CPR:9, taking multi-source labeled input, denoted 3C, from the three sources CPR:3, CPR:4 and CPR:5 as positive instances and the remaining as negative.
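The label-based source split can be sketched as below. This is a simplified illustration of the idea (it omits the handling of negative instances described above); the function name and data layout are our own assumptions.

```python
def split_by_labels(instances, source_labels, target_labels):
    """Treat each positive CPR label as its own source domain;
    instances carrying a target label form the unlabeled target set.
    `instances` is a list of (sentence, cpr_label) pairs; the
    structure is illustrative, not the authors' data format."""
    sources = {lab: [] for lab in source_labels}
    target = []
    for sent, lab in instances:
        if lab in source_labels:
            sources[lab].append((sent, True))  # positive in its source
        elif lab in target_labels:
            target.append(sent)                # labels withheld at train time
    return sources, target

data = [("s1", "CPR:3"), ("s2", "CPR:6"), ("s3", "CPR:4")]
srcs, tgt = split_by_labels(data, {"CPR:3", "CPR:4", "CPR:5"}, {"CPR:6", "CPR:9"})
```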

Baselines
We compare our approach with the following baselines:
- BioBERT (Rios et al., 2018): For SSST, we train it on one dataset and test on another. For MSST, we combine the multiple sources into a single source and test on the labeled target.
- CNN+DANN (Lisheng Fu, 2017): A variant of adversarial training, gradient reversal (RevGrad), used with a CNN (Nguyen and Grishman, 2015).
- Adv Bi-LSTM + Adv CNN (Rios et al., 2018): Two-step training: pre-training on the source, followed by adversarial training with the target. For the MSST experiments, we compare our method with Adv CNN and Adv Bi-LSTM by combining the multiple sources.
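The gradient reversal idea used by the CNN+DANN baseline can be sketched framework-free as below. This is a minimal illustration of the RevGrad mechanism (Ganin and Lempitsky), not the cited implementation; the class name and λ value are ours.

```python
class GradReversal:
    """Gradient reversal layer: identity in the forward pass,
    multiplies the incoming gradient by -lam in the backward pass,
    so the feature extractor is pushed to confuse the domain
    discriminator while the discriminator itself trains normally."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                     # no change to activations

    def backward(self, grad):
        return -self.lam * grad      # flip (and scale) the gradient

layer = GradReversal(lam=0.5)
```

In an autograd framework this would be a custom function with the same forward/backward pair; here the two passes are shown explicitly.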

Results and Discussions
In Table 3 we see that BioInfer generalizes well to the AiMed and DDI corpora using vanilla LSTM or CNN architectures. However, with MAN and contextual embeddings, the gains are less prominent than for the other datasets. This may be due to class imbalance in the data (positive-to-negative instance ratio 1:5.9) (Hsu et al., 2015; Rios et al., 2018). For AiMed and BioInfer, we find that knowledge transfer between the two gives the best performance, strengthening the observation that datasets from the same domain can contribute to performance enhancement and justifying the gains in the MSST experiments. Our model outperforms the other baselines with adversarial training alone, which we attribute to jointly learning better representations from the shared and private feature extractors. Using contextual BERT [CLS] tokens further increases performance, since they encode important relations between the words in a sentence (Vig, 2019; Hewitt and Manning, 2019).
In Table 4, BioBERT performs well for ChemProt. We hypothesize that this is because the same underlying dataset is used during training and testing: although we use different labels as multiple sources, this may not generate enough variance among the sources, since they come from the same dataset. For AiMed and BioInfer, however, three different smaller corpora were used, and there the proposed method outperforms BioBERT. Across all six SSST experiments, the Bi-LSTM-based model lags in performance, possibly due to the absence of an attention mechanism that would help select more relevant context (Chen and Cardie, 2018). We observe that adversarial training together with contextualized BERT sentence embeddings yields performance gains across all datasets.

Conclusions
Our proposed model significantly outperformed existing neural-network-based domain adaptation baselines for SSST. Among the two MSST experiments, we showed that the system gains when multiple source corpora are used. We also experimented with two architectures, of which the CNN performs marginally better than the Bi-LSTM. Our analysis in Section 5 further explains the effect of the sources, adversarial training, and the use of contextualized BERT sentential embeddings.