Syntax-aware Multi-task Graph Convolutional Networks for Biomedical Relation Extraction

In this paper we tackle two unique challenges in biomedical relation extraction. The first challenge is that the contextual information between two entity mentions often involves sophisticated syntactic structures. We propose a novel graph convolutional network model that incorporates dependency parsing and contextualized embeddings to effectively capture comprehensive contextual information. The second challenge is that most of the benchmark data sets for this task are quite imbalanced because more than 80% of mention pairs are negative instances (i.e., no relations). We propose a multi-task learning framework that jointly models the relation identification and classification tasks so that supervision signals propagate between them, and we apply a focal loss to focus training on ambiguous mention pairs. Experiments show that, with these two strategies, our model achieves the state-of-the-art F-score on the 2013 drug-drug interaction extraction task.


Introduction
Relation extraction from biomedical literature has recently attracted increasing interest from the medical language processing research community as an important stage for downstream tasks such as question answering (Hristovski et al., 2015) and decision making (Agosti et al., 2019). Biomedical relation extraction aims to identify relations between two entity mentions and classify them into pre-defined types based on context. In this paper we aim to extract drug-drug interactions (DDIs), which occur when taking two or more drugs within a certain period of time alters the way one or more of the drugs act in the human body, possibly resulting in unexpected side effects (Figure 1). Extracting DDIs provides important clues for research in drug safety and human health care.
Dependency parses are widely used in relation extraction because they shorten the distance between syntactically related words. As shown in Figure 1, the partial dependency path {iron ← cobalt ← interactions} reveals that these two drugs interact, and the path {interactions → absorption → retention} further indicates the mechanism relation between the two mentions. Therefore, capturing the syntactic information involving the word interactions on the dependency path {iron ← cobalt ← interactions → absorption → retention} can effectively help classify the relation between the two mentions cobalt and iron.
To capture indicative information from wide contexts, we adopt graph convolutional networks (GCNs) (Kipf and Welling, 2016; Marcheggiani and Titov, 2017) to obtain syntactic information by encoding the dependency structure over the input sentence with graph convolution operations. To compensate for the loss of local context information in GCNs, we incorporate contextualized word representations from a BERT model (Devlin et al., 2019) pre-trained on large-scale biomedical corpora containing over 200K abstracts from PubMed and over 270K full texts from PMC.
Moreover, data imbalance is another major challenge in biomedical text, as relations among biomedical mentions are usually very sparse. Over 80% of the candidate mention pairs in the DDI 2013 (Herrero-Zazo et al., 2013) training set have no relation. To tackle this problem, we propose binary relation identification as an auxiliary task to facilitate the main multi-class classification task. For instance, detecting a drug interaction on the dependency path {iron ← cobalt ← interactions → absorption → retention} assists the prediction of the relation type mechanism: the signal from binary classification acts as an inductive bias that helps avoid misclassifying the pair as no relation. We also exploit the focal loss to help the multi-class relation classification task by forcing the loss to implicitly focus on ambiguous examples.
To recap, our contributions are twofold. First, we adopt syntax-aware graph convolutional networks that incorporate contextualized representations. Second, we design an auxiliary task to address the data imbalance problem. The resulting model achieves the state-of-the-art micro F-score on the DDIExtraction 2013 shared task.

Contextual and Syntax-aware GCN
As a variant of convolutional neural networks (LeCun et al., 1998), the graph convolutional network (Kipf and Welling, 2016) is designed for graph data and has proven effective in modeling text via syntactic dependency graphs (Marcheggiani and Titov, 2017).
We encode the tokens in a biomedical sentence of length n as $x = \{x_1, \dots, x_n\}$, where $x_i$ is a vector that concatenates the representation of token i with the position embeddings corresponding to its relative positions with respect to the candidate mention pair. We feed the token vectors into an L-layer GCN to obtain a hidden representation of each token that is directly influenced by its neighbors no more than L edges away in the dependency tree. We apply the Stanford dependency parser (Chen and Manning, 2014) to generate the dependency structure:

$$h_i^{(l)} = \sigma\Big(\sum_{j=1}^{n} \tilde{A}_{ij} W^{(l)} h_j^{(l-1)} / d_i + b^{(l)}\Big), \qquad h_i^{(0)} = x_i$$

where $\tilde{A} = A + I$, A is the adjacency matrix of the tokens in the dependency tree, and I is the identity matrix. $W^{(l)}$ is a linear transformation, $b^{(l)}$ is a bias term, and $\sigma$ is a nonlinear function. Following Zhang et al. (2018), $d_i$ is the degree of token i in the dependency tree with an additional self-loop.
We notice that some token representations become more informative by gathering information from syntactically related neighbors through the GCN. For example, the representation of the token interactions from a 2-layer GCN, which operates on neighbors up to two edges away, provides inductive information for predicting a mechanism relation. Thus, we adopt attentive pooling (Zhou et al., 2016) to achieve the optimal pooling:

$$\alpha = \mathrm{softmax}\big(w^{\top} \tanh(h^{(L)})\big), \qquad h_{sent} = h^{(L)} \alpha^{\top}$$

where $h^{(L)}$ stacks the token representations from the last GCN layer and w is a trained parameter that assigns weights based on the importance of each token representation.
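The layer-wise propagation described above can be illustrated with a minimal pure-Python sketch. It assumes the degree-normalized form of Zhang et al. (2018), an undirected dependency graph, ReLU as the nonlinearity σ, and the same weights reused in both layers for brevity; the function and variable names are hypothetical and not taken from the original implementation.

```python
def gcn_layer(h, adj, weight, bias):
    """One graph convolution step over a dependency graph (illustrative sketch).

    h:      n x d_in list of token vectors
    adj:    n x n 0/1 adjacency matrix of the dependency tree, without self-loops
    weight: d_in x d_out matrix (stands in for W^(l))
    bias:   length-d_out vector (stands in for b^(l))
    """
    n = len(h)
    d_in, d_out = len(weight), len(bias)
    out = []
    for i in range(n):
        # A~ = A + I: every token is also treated as its own neighbour.
        neighbours = [j for j in range(n) if adj[i][j] or j == i]
        d_i = len(neighbours)  # degree of token i including the self-loop
        row = []
        for k in range(d_out):
            total = sum(h[j][m] * weight[m][k]
                        for j in neighbours for m in range(d_in))
            # Normalise by degree, add the bias, apply ReLU (assumed sigma).
            row.append(max(0.0, total / d_i + bias[k]))
        out.append(row)
    return out


# Toy chain graph 0 - 1 - 2; only token 2 carries a signal initially.
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
h0 = [[0.0], [0.0], [1.0]]
W, b = [[1.0]], [0.0]
h1 = gcn_layer(h0, adj, W, b)  # token 0 is two edges from token 2: still zero
h2 = gcn_layer(h1, adj, W, b)  # after two layers the signal reaches token 0
```

In this toy example, token 0 only receives information from token 2 after the second layer, which mirrors the statement that an L-layer GCN covers neighbors at most L edges away.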
We obtain the final representation by concatenating the sentence representation from attentive pooling with the mention representations from max pooling. We then obtain the predicted relation type by feeding the final representation into a fully connected neural network followed by a softmax operation.
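The pooling and concatenation steps can be sketched as follows. This is a simplified illustration, assuming the attention scores take the form softmax(w · tanh(h_i)) as in Zhou et al. (2016) and that each mention is given as an inclusive token span; all names are hypothetical, and the fully connected classifier is omitted.

```python
import math


def softmax(scores):
    m = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def attentive_pool(h, w):
    """Weighted sum of token vectors; weights come from w . tanh(h_i)."""
    scores = [sum(wk * math.tanh(x) for wk, x in zip(w, row)) for row in h]
    alpha = softmax(scores)
    return [sum(alpha[i] * h[i][k] for i in range(len(h)))
            for k in range(len(h[0]))]


def max_pool(h, span):
    """Element-wise maximum over the tokens of one mention (inclusive span)."""
    start, end = span
    return [max(h[i][k] for i in range(start, end + 1))
            for k in range(len(h[0]))]


def final_representation(h, w, span1, span2):
    """Concatenate the sentence vector with the two mention vectors."""
    return attentive_pool(h, w) + max_pool(h, span1) + max_pool(h, span2)


# Three tokens with 2-dimensional hidden states; a zero w gives uniform weights.
h = [[1.0, 0.0], [0.0, 2.0], [3.0, 1.0]]
rep = final_representation(h, [0.0, 0.0], (0, 0), (1, 2))
```

The resulting vector has three times the hidden size and would be the input to the fully connected layer and softmax described above.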
Graph neural networks (Zhou et al., 2018b) can learn effective representations but suffer from the loss of local context information. We believe local context information is also crucial for biomedical relation extraction. For example, in the sentence "The response to [Factrel]DRUG may be blunted by [phenothiazines]DRUG and [dopamine antagonists]DRUG", the word order makes it intuitive that Factrel and phenothiazines interact while phenothiazines and dopamine antagonists do not. However, GCNs treat the three drugs as interacting with each other because they are close in the dependency structure, which carries no order information.
BERT (Devlin et al., 2019) is a recently proposed model based on a multi-layer bidirectional Transformer (Vaswani et al., 2017). Pre-trained BERT has proven effective for creating contextualized word embeddings for various NLP tasks (Han et al., 2019; Wang et al., 2019). BioBERT is a biomedical language representation model pre-trained on large-scale biomedical corpora. The output of each encoder layer for an input token can be used as a feature representation of that token. As shown in Figure 2, we encode the input tokens as contextualized embeddings by taking the last hidden layer of the corresponding token in BioBERT. BERT uses WordPiece (Wu et al., 2016) to decompose infrequent words into frequent subwords for unsupervised tokenization; if a token has multiple BERT subword units, we use the first one. After obtaining the contextualized embedding of each token, we feed the embeddings into the GCN layer to make our model context-aware.
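The first-subword rule can be sketched in a few lines. The example assumes that continuation pieces carry the `##` prefix used by BERT's WordPiece vocabulary, that special tokens such as [CLS] and [SEP] have already been removed, and that no token is split on internal punctuation; the particular subword split shown is invented for illustration.

```python
def first_subword_indices(wordpieces):
    """Index of the first WordPiece of every original token.

    Assumes continuation pieces start with '##' and special tokens are removed.
    """
    return [i for i, piece in enumerate(wordpieces)
            if not piece.startswith("##")]


def token_embeddings(subword_vectors, wordpieces):
    """Keep one vector per original token: that of its first subword."""
    return [subword_vectors[i] for i in first_subword_indices(wordpieces)]


# Hypothetical split of "phenothiazines may interact" into WordPieces.
pieces = ["pheno", "##thi", "##azi", "##nes", "may", "interact"]
vectors = [[float(i)] for i in range(len(pieces))]  # stand-ins for BioBERT outputs
per_token = token_embeddings(vectors, pieces)
```

Selecting one vector per token keeps the embedding sequence aligned with the nodes of the dependency tree, which is what the GCN layer requires.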

Auxiliary Task Learning with Focal Loss
In the DDIExtraction 2013 task, all possible interactions between drugs within one sentence are annotated, which means a single sentence with multiple drug mentions leads to several separate instances of candidate mention pairs (Herrero-Zazo et al., 2013). There are 21,012 mention pairs generated from 3,790 sentences in the training set, and over 80% of them have no relation. This data imbalance caused by the sparse relation distribution is a main reason for low recall in the DDI task (Zhou et al., 2018a; Sun et al., 2019).
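To make the pair generation concrete, the following sketch enumerates the candidate pairs of a single sentence. It assumes every unordered pair of mentions yields one instance, so a sentence with n mentions contributes n(n-1)/2 candidates; the function name is hypothetical.

```python
from itertools import combinations


def candidate_pairs(mentions):
    """All unordered pairs of drug mentions in one sentence."""
    return list(combinations(mentions, 2))


mentions = ["Factrel", "phenothiazines", "dopamine antagonists"]
pairs = candidate_pairs(mentions)  # three mentions give three candidate pairs
```

Because the number of pairs grows quadratically with the number of mentions while only a few pairs actually interact, negative instances quickly dominate the data.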
Here we address this relation type imbalance problem by adding an auxiliary task on top of the syntax-aware GCN model. To conduct the auxiliary task learning, we add a separate binary classifier for relation identification as shown in Figure 2. All classifiers share the same GCN representation and contextualized embeddings, and thus they can potentially help each other by propagating their supervision signals.
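One way to picture the auxiliary task is to derive the binary identification label from the relation type and to optimize both losses together. The sketch below is only an assumption about how this could be wired: the label names, the `aux_weight` parameter, and the simple weighted sum are hypothetical, since the text does not specify how the two losses are combined.

```python
# Hypothetical label inventory: four DDI types plus the negative class.
RELATION_TYPES = ["none", "mechanism", "effect", "advice", "int"]


def identification_label(relation_type):
    """Binary target for the auxiliary task: does any relation hold?"""
    return 0 if relation_type == "none" else 1


def joint_loss(classification_loss, identification_loss, aux_weight=1.0):
    """Combine the two task losses (a weighted sum is assumed here)."""
    return classification_loss + aux_weight * identification_loss


binary_targets = [identification_label(t) for t in RELATION_TYPES]
```

Since both classifiers read the same shared representation, gradients from the easier binary task also update the GCN and embedding parameters used by the multi-class classifier.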
Additionally, instead of using the negative log-likelihood loss as the objective function, we optimize the parameters during training by minimizing a focal loss, which focuses on hard relation types. For instance, the int relation indicates a drug interaction without providing any extra information (e.g., "Some [anticonvulsants]DRUG may interact with [Mephenytoin]DRUG"). This relation type accounts for only 0.82% of the training set and is often misclassified as another relation type. Denoting by $t_i$ and $p_i$ the ground truth and the conditional probability of type i among the relation types C, the focal loss can be defined as:

$$J(\theta) = -\sum_{i \in C} \alpha_i \, t_i \, (1 - p_i)^{\gamma} \log p_i + \lambda \lVert \theta \rVert_2^2$$

where $\alpha_i$ is a weighting factor that balances the importance of samples from the various types, and $\gamma$ is the focusing parameter that reduces the influence of well-classified samples on the loss. $\lambda$ is the L2 regularization parameter and $\theta$ is the parameter set.
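The effect of the focusing parameter can be checked numerically with a small sketch. It computes the focal loss of a single example with a one-hot ground truth, assumes per-type weights α (defaulting to 1), and leaves out the L2 regularization term; the function name and signature are hypothetical.

```python
import math


def focal_loss(probs, target, alpha=None, gamma=1.0):
    """Focal loss of one example with a one-hot ground truth.

    probs:  predicted distribution over relation types
    target: index of the gold relation type
    alpha:  optional per-type weights (assumed to default to 1)
    """
    p_t = probs[target]
    a_t = 1.0 if alpha is None else alpha[target]
    # (1 - p_t)^gamma shrinks the loss of confidently correct predictions.
    return -a_t * (1.0 - p_t) ** gamma * math.log(p_t)


easy = focal_loss([0.9, 0.1], target=0, gamma=1.0)  # well-classified example
hard = focal_loss([0.1, 0.9], target=0, gamma=1.0)  # misclassified example
plain = focal_loss([0.9, 0.1], target=0, gamma=0.0)  # reduces to cross-entropy
```

With γ = 1, the well-classified example keeps only about a tenth of its cross-entropy loss, while the misclassified example keeps about nine tenths, so training is dominated by the ambiguous pairs.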
The auxiliary task together with the focal loss enhances our model's ability to handle imbalanced data by leveraging the inductive signal from the easier identification task while down-weighting easily classified instances, thus directing the model to focus on difficult relation types.

System                                   Prec    Rec     F1
CNN (Liu et al., 2016)                   75.70   64.66   69.75
Multi Channel CNN (Quan et al., 2016)    75.99   65.25   70.21
GRU                                      73

We evaluate our model on the DDIExtraction 2013 relation dataset (Herrero-Zazo et al., 2013). The corpus is annotated with drug mentions and four types of interactions: Mechanism (pharmacokinetic mechanism of a DDI), Effect (effect of a DDI), Advice (a recommendation or advice regarding a DDI) and Int (a DDI simply occurs without extra information). We randomly choose 10% of the training dataset as the development set. Following previous work (Liu et al., 2016; Quan et al., 2016; Zhou et al., 2018a; Sun et al., 2019), we use a negative instance filtering strategy to filter out some negative drug pairs based on manually formulated rules. Instances are filtered if the drug pair refers to the same thing or if the two drugs appear in the same coordinate structure with more than two drugs (e.g., drug1, drug2, and drug3). Entity mentions are masked with DRUG for better generalization and to avoid overfitting.
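The masking step and the simplest filtering rule can be sketched as below. The sketch assumes mentions are given as non-overlapping inclusive token spans and approximates "referring to the same thing" by a case-insensitive string match; the actual manually formulated rules are not specified in the text, and the function names are hypothetical.

```python
def mask_mentions(tokens, spans, mask="DRUG"):
    """Replace each mention span (inclusive token indices) with one mask token."""
    out = list(tokens)
    # Work from right to left so earlier indices stay valid after replacement.
    for start, end in sorted(spans, reverse=True):
        out[start:end + 1] = [mask]
    return out


def same_drug(mention1, mention2):
    """Crude stand-in for the 'same thing' rule: identical surface strings."""
    return mention1.lower() == mention2.lower()


sentence = ("The response to Factrel may be blunted by "
            "phenothiazines and dopamine antagonists").split()
# Mask the pair (Factrel, dopamine antagonists).
masked = mask_mentions(sentence, [(3, 3), (10, 11)])
```

A pair for which `same_drug` returns true would be dropped before training, and the remaining pairs are passed to the model in masked form.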
We train the model with a GCN hidden state size of 200, using the SGD optimizer with a learning rate of 0.001, a batch size of 30, and 50 epochs. Dropout is applied with a rate of 0.5 for regularization. The contextual embedding size from BioBERT is 768. The focusing parameter γ is set to 1. All hyperparameters are tuned on the development set.

Results and Analysis
The reported results, shown in Table 1, are from a 2-layer GCN, which achieves the best performance. Our model significantly outperforms all previous methods at the 0.05 significance level. To analyze the contributions of the various components of our model, we also perform ablation tests. The ablated GCN model outperforms the LSTM baseline by 3.6% F1, which demonstrates the effectiveness of GCNs in modeling mention relations through dependency structure. Contextualized embeddings from BioBERT, which encode contextual information involving sequence order and word disambiguation, implicitly help the model learn contextual relation patterns and further improve performance. We obtain a significant F-score improvement (2.7%) by applying multi-task learning. As over 80% of mention pairs are negative samples, multi-task learning effectively mitigates the problem by jointly modeling the relation identification and classification tasks and applying the focal loss to focus on ambiguous mention pairs; we thereby also gain 3.8% absolute recall. Specifically, the F1 score of the int type increases from 54.38% to 59.79%.
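For reference, the micro-averaged scores discussed here can be computed as in the following sketch. It assumes the usual convention that the negative class is excluded from the counts, and the label names are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0 when both are 0)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def micro_prf(gold, pred, negative="none"):
    """Micro-averaged precision, recall and F1 over the positive types."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != negative)
    pred_pos = sum(1 for p in pred if p != negative)
    gold_pos = sum(1 for g in gold if g != negative)
    precision = tp / pred_pos if pred_pos else 0.0
    recall = tp / gold_pos if gold_pos else 0.0
    return precision, recall, f1_score(precision, recall)


gold = ["mechanism", "none", "effect", "int", "none"]
pred = ["mechanism", "effect", "effect", "none", "none"]
p, r, f = micro_prf(gold, pred)

# Consistency check against the CNN row (Liu et al., 2016) of the results table.
cnn_f1 = f1_score(75.70, 64.66)
```

Under this convention a missed int pair lowers recall and a spurious prediction on a negative pair lowers precision, which is why gains on the rare types show up directly in the micro F-score.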
Among the remaining errors, we notice that our model often fails to predict relations when sentences are parsed poorly due to complex content, which suggests seeking more powerful parsing tools. We also observe errors in extremely short sentences. For example, in the sentence "[Calcium]DRUG Supplements/[Antacids]DRUG", our model cannot capture informative representations because the mentions are masked with DRUG and the sentence is too concise to offer indicative evidence.

Related Work
Traditional feature/kernel-based models for biomedical relation extraction rely on engineered features, which suffer from low portability and generalizability (Kim et al., 2015; Zheng et al., 2016; Raihani and Laachfoubi, 2017). To tackle this problem, recent studies apply Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs) to automatically learn feature representations with input words encoded as pre-trained word embeddings (Liu et al., 2016; Quan et al., 2016; Zhang et al., 2017; Zhou et al., 2018a; Sun et al., 2019). Learning representations of graphs is widely studied, and several graph neural networks have been applied in the biomedical domain. Lim et al. (2018) proposed a recursive neural network based model with a subtree containment feature. Asada et al. (2018) encoded drug pairs with CNNs and used an external knowledge base to encode their molecular pairs with two graph neural networks. Here we directly apply syntax-aware GCNs to biomedical text to extract drug-drug interactions.

Conclusions and Future Work
We propose a syntax-aware multi-task learning model for biomedical relation extraction. Our model can effectively extract drug-drug interactions by capturing syntactic information through graph convolution operations and modeling context information via contextualized embeddings. An auxiliary task with focal loss is designed to mitigate the data imbalance by leveraging the inductive signal from binary classification and increasing the influence of decisive relation types. In the future, we plan to explore more informative parsers, such as an abstract meaning representation parser, to create graph structures, and to consider leveraging external knowledge to further enhance extraction quality.