Leveraging Dependency Forest for Neural Medical Relation Extraction

Medical relation extraction discovers relations between entity mentions in text, such as research articles. For this task, dependency syntax has been recognized as a crucial source of features. Yet in the medical domain, 1-best parse trees suffer from relatively low accuracies, diminishing their usefulness. We investigate a method to alleviate this problem by utilizing dependency forests. Forests contain more than one possible decision and therefore have higher recall, but also more noise, compared with 1-best outputs. A graph neural network is used to represent the forests, automatically distinguishing the useful syntactic information from parsing noise. Results on two benchmarks show that our method outperforms standard tree-based methods, giving the state-of-the-art results in the literature.


Introduction
The sheer amount of medical articles and their rapid growth prevent researchers from acquiring comprehensive literature knowledge by direct reading. This can hamper both medical research and clinical diagnosis. NLP techniques have been used to automate the knowledge extraction process from the medical literature (Friedman et al., 2001; Yu and Agichtein, 2003; Hirschman et al., 2005; Xu et al., 2010; Sondhi et al., 2010; Abacha and Zweigenbaum, 2011). Along this line of work, a long-standing task is relation extraction, which mines factual knowledge from free text by labeling relations between entity mentions. As shown in Figure 1, the sub-clause "previously observed cytochrome P450 3A4 ( CYP3A4 ) interaction of the dual orexin receptor antagonist almorexant" contains two entities, namely "orexin receptor" and "almorexant". There is an "adversary" relation between these two entities, denoted as "CPR:6". Previous work has shown that dependency syntax is important for guiding relation extraction (Culotta and Sorensen, 2004; Bunescu and Mooney, 2005; Liu et al., 2015; Gormley et al., 2015; Xu et al., 2015a,b; Miwa and Bansal, 2016), especially in biological and medical domains (Peng et al., 2017; Song et al., 2018b). Compared with sequential surface-level structures, such as POS tags, dependency trees help to model word-to-word relations more easily by drawing direct connections between distant words that are syntactically correlated. Take the phrase "effect on the medicine" for example; "effect" and "medicine" are directly connected in a dependency tree, regardless of how many modifiers are added in between.
Dependency parsing has achieved accuracies over 96% in the news domain (Liu and Zhang, 2017; Kitaev and Klein, 2018). However, in the medical literature domain, parsing accuracies can drop significantly (Lease and Charniak, 2005; McClosky and Charniak, 2008; Sagae et al., 2008; Candito et al., 2011). This can lead to severe error propagation in downstream relation extraction tasks, offsetting much of the benefit that relation extraction models can obtain by exploiting dependency trees as a source of external features.
We address the low-accuracy issue in biomedical dependency parsing by considering dependency forests as external features. Instead of 1-best trees, dependency forests consist of dependency arcs and labels that a parser is relatively confident about, and therefore have better recall of gold-standard arcs by offering more candidate choices, at the cost of extra noise. Our main idea is to let a relation extraction system learn automatically from a forest which arcs are the most relevant through end-task training, rather than relying solely on the decisions of a noisy syntactic parser. To this end, a graph neural network is used for encoding a forest, which in turn provides features for relation extraction. Back-propagation passes loss gradients from the relation extraction layer to the graph encoder, so that the more relevant edges can be chosen automatically for better relation extraction.
Results on BioCreative VI ChemProt (CPR) (Krallinger et al., 2017) and a recent dataset focused on phenotype-gene relations (PGR) show that our method outperforms a strong baseline that uses 1-best dependency trees as features, giving the state-of-the-art accuracies in the literature. To our knowledge, we are the first to study dependency forests for medical information extraction, showing their advantages over 1-best tree structures. Our code is available at http://github.com/freesunshine/dep-forest-re.

Related work
Syntactic forests There have been previous studies leveraging constituent forests for machine translation (Mi et al., 2008;Ma et al., 2018;Zaremoodi and Haffari, 2018), sentiment analysis (Le and Zuidema, 2015) and text generation (Lu and Ng, 2011). However, the usefulness of dependency forests is relatively rarely studied, with one exception being Tu et al. (2010), who use dependency forests to enhance long-range word-to-word dependencies for statistical machine translation. To our knowledge, we are the first to study the usefulness of dependency forests for relation extraction under a strong neural framework.
Graph neural network Graph neural networks (GNNs) have been successful in encoding dependency trees for downstream tasks, such as semantic role labeling, semantic parsing (Xu et al., 2018), machine translation (Bastings et al., 2017), relation extraction (Song et al., 2018b) and sentence ordering (Yin et al., 2019). In particular, Song et al. (2018b) showed that GNNs are more effective than DAG networks (Peng et al., 2017), which lose important structural information, for modeling syntactic trees in relation extraction. We are the first to exploit GNNs for encoding search spaces in the form of dependency forests.

Task
Formally, the input to our task is a sentence s = w_1, w_2, . . . , w_N, where N is the number of words in the sentence and w_i represents the i-th input word. s is annotated with the boundary information (ξ_1:ξ_2 and ζ_1:ζ_2) of the target entity mentions (ξ and ζ). We focus on the classic binary relation extraction setting, where the number of associated mentions is two. The output is a relation from a predefined relation set R = {r_1, . . . , r_M, None}, where "None" means that no relation holds for the entities.
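The task definition above can be made concrete with a minimal sketch. The container and field names below are illustrative (they are not from the paper's released code); spans are inclusive word indices, matching the ξ_1:ξ_2 notation.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical container mirroring the task definition: a sentence
# w_1..w_N, the spans of the two target mentions, and a gold relation.
@dataclass
class Instance:
    words: List[str]                 # w_1 ... w_N
    xi_span: Tuple[int, int]         # (xi_1, xi_2): inclusive indices of mention xi
    zeta_span: Tuple[int, int]       # (zeta_1, zeta_2): inclusive indices of mention zeta
    relation: Optional[str] = None   # one of r_1..r_M, or "None"

# The CPR relation set used later in the paper, plus the "None" class.
RELATIONS = ["CPR:3", "CPR:4", "CPR:5", "CPR:6", "CPR:9", "None"]

inst = Instance(
    words="the dual orexin receptor antagonist almorexant".split(),
    xi_span=(5, 5),      # "almorexant"
    zeta_span=(2, 3),    # "orexin receptor"
    relation="CPR:6",
)
```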
Two steps are taken to predict the correct relation given an input sentence. First, a dependency parser is used to label the syntactic structure of the input. Here our baseline system takes the standard approach, using the 1-best parser output tree D_T as features. In contrast, our proposed model uses the most confident parser forest D_F as features. Given D_T or D_F, the second step is to encode both s and D_T/D_F with a neural network before making a prediction.
We make use of the same graph neural network encoder structure to represent dependency syntax for both the baseline and our model. In particular, a graph recurrent network (GRN) architecture (Beck et al., 2018) is used, which has been shown effective in encoding graph structures, giving competitive results with alternative graph networks such as graph convolutional networks (Bastings et al., 2017). The baseline stacks a Bi-LSTM layer and a GRN, which extract features from the sentence and the dependency tree D_T, respectively. Similar model frameworks have shown highly competitive performances in previous relation extraction studies (Peng et al., 2017; Song et al., 2018b).

Bi-LSTM layer
Given the input sentence w_1, w_2, . . . , w_N, we represent each word with its embedding to generate a sequence of embeddings e_1, e_2, . . . , e_N. A Bi-LSTM layer is used to encode the sentence:

h_i^{→} = LSTM(h_{i-1}^{→}, e_i)
h_i^{←} = LSTM(h_{i+1}^{←}, e_i)

where the state of each word w_i is generated by concatenating the states of both directions:

h_i^{(0)} = [h_i^{→} ; h_i^{←}]

GRN layer
A 1-best dependency tree can be represented as a directed graph D_T = ⟨V, E⟩, where V includes all words w_1, w_2, . . . , w_N and E = {(w_j, l, w_i) | w_j ∈ V, w_i ∈ V} represents all dependency edges. Each triple (w_j, l, w_i) corresponds to a dependency edge, where w_j modifies w_i with an arc label l. Each word w_i is associated with a hidden state that is initialized with the Bi-LSTM output h_i^{(0)}. The state representation of the entire tree consists of all word states. In order to capture non-local interactions between words, the GRN layer adopts a message passing framework that performs iterative information exchange between directly connected words. As a result, each word state is updated by absorbing larger contextual information through the message passing process, yielding a sequence of states h_i^{(0)}, h_i^{(1)}, . . . , h_i^{(T)}, where T is a hyperparameter representing the number of state transitions.
Message passing The message passing framework takes two main steps within each iteration: message calculation and state update. Take w_i and iteration t as the example. In the first step, separate messages m_i^{↑} and m_i^{↓} are calculated by summing up the messages from its children and its parent in the dependency tree, respectively:

m_i^{↑} = Σ_{(w_j, l, w_i) ∈ E_{(·,·,i)}} [h_j^{(t-1)} ; e_l]
m_i^{↓} = Σ_{(w_i, l, w_j) ∈ E_{(i,·,·)}} [h_j^{(t-1)} ; e_{l_rev}]

where E_{(·,·,i)} and E_{(i,·,·)} represent all edges with head word w_i and with modifier word w_i, respectively, and e_{l_rev} represents the embedding of label l_rev, the reverse version of the original label l (e.g., "amod-rev" is the reverse of "amod"). The message from a child or a parent is obtained by simply concatenating its hidden state with the corresponding edge label embedding.
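The message calculation above can be sketched in a few lines of plain Python with toy-sized vectors. Edge triples follow the text's convention, (w_j, l, w_i) meaning w_j modifies head w_i; function and variable names are illustrative only.

```python
# Toy sketch of the per-word message calculation in one GRN iteration.
def vec_add(a, b):
    return [x + y for x, y in zip(a, b)]

def messages(edges, hidden, label_emb, i):
    """Return (m_up, m_down) for word i.

    m_up sums messages from i's children (edges whose head is i);
    m_down sums messages from i's parent (edges whose modifier is i).
    A message concatenates the neighbor's hidden state with the label
    embedding (the reversed label for the parent direction).
    """
    dim = len(next(iter(hidden.values()))) + len(next(iter(label_emb.values())))
    m_up, m_down = [0.0] * dim, [0.0] * dim
    for mod, label, head in edges:       # (w_j, l, w_i): mod modifies head
        if head == i:                    # mod is a child of i
            m_up = vec_add(m_up, hidden[mod] + label_emb[label])
        if mod == i:                     # head is the parent of i
            m_down = vec_add(m_down, hidden[head] + label_emb[label + "-rev"])
    return m_up, m_down
```

With one "amod" edge from word 1 to word 0, word 0 receives a child message built from word 1's state, and word 1 receives a parent message with the reversed label.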
In the second step, GRN uses the standard gated operations of an LSTM (Hochreiter and Schmidhuber, 1997) to update the hidden state h_i^{(t-1)} with the newly integrated messages. In particular, an input gate i_i^{(t)}, an output gate o_i^{(t)}, a forget gate f_i^{(t)} and a cell c_i^{(t)} are used to control information flow from the inputs to the output h_i^{(t)}:

i_i^{(t)} = σ(W_1^{↑} m_i^{↑} + W_1^{↓} m_i^{↓} + b_1)
o_i^{(t)} = σ(W_2^{↑} m_i^{↑} + W_2^{↓} m_i^{↓} + b_2)
f_i^{(t)} = σ(W_3^{↑} m_i^{↑} + W_3^{↓} m_i^{↓} + b_3)
u_i^{(t)} = tanh(W_4^{↑} m_i^{↑} + W_4^{↓} m_i^{↓} + b_4)
c_i^{(t)} = f_i^{(t)} ⊙ c_i^{(t-1)} + i_i^{(t)} ⊙ u_i^{(t)}
h_i^{(t)} = o_i^{(t)} ⊙ tanh(c_i^{(t)})

where W_x^{↑}, W_x^{↓} and b_x (x ∈ {1, 2, 3, 4}) are model parameters, σ is the sigmoid function, ⊙ is element-wise multiplication, and c_i^{(0)} is initialized as a vector of zeros.
The same process repeats for T iterations. Starting from h_i^{(0)} of the Bi-LSTM layer, increasingly more informed hidden states h_i^{(t)} are obtained as the iterations proceed, and h_i^{(T)} is used as the final representation of each word.

Relation prediction
Given the states h^{(T)} of the GRN encoding, we calculate the representation vectors of the two related entity mentions ξ and ζ (such as "almorexant" and "orexin receptor" in Figure 1) with mean pooling:

h_ξ = f_mean(h_{ξ_1}^{(T)}, . . . , h_{ξ_2}^{(T)})
h_ζ = f_mean(h_{ζ_1}^{(T)}, . . . , h_{ζ_2}^{(T)})

where ξ_1:ξ_2 and ζ_1:ζ_2 represent the spans of ξ and ζ, respectively, and f_mean is the mean-pooling function.
Finally, the representations of both mentions are concatenated to form the input of a logistic regression classifier:

r̂ = softmax(W_5 [h_ξ ; h_ζ] + b_5)

where W_5 and b_5 are model parameters.
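The prediction head (mean pooling over each mention span, concatenation, then a softmax classifier) can be sketched as follows; W and b stand in for the paper's W_5 and b_5, with toy shapes.

```python
import math

def mean_pool(states, start, end):
    """Mean of hidden states over an inclusive span [start, end]."""
    span = states[start:end + 1]
    return [sum(col) / len(span) for col in zip(*span)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def predict(states, xi_span, zeta_span, W, b):
    """Concatenate pooled mention vectors and classify.

    states: list of per-word hidden vectors h^(T); W: M x 2d matrix,
    b: length-M bias, for M relation classes.
    """
    feat = mean_pool(states, *xi_span) + mean_pool(states, *zeta_span)
    logits = [sum(w * f for w, f in zip(row, feat)) + bi
              for row, bi in zip(W, b)]
    return softmax(logits)
```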

Model
In this section, we first discuss how to generate high-quality dependency forests, before showing how to adapt GRN to consider the parser probability of each dependency edge.

Forest generation
Given a dependency parser, generating dependency forests with high recall and low noise is a non-trivial problem. On the one hand, keeping the whole search space gives 100% recall, but introduces maximum noise. On the other hand, using the 1-best dependency tree can result in low recall given an imperfect parser. We investigate two algorithms to generate high-quality forests by judging "quality" from different perspectives: one focusing on arcs, and the other focusing on trees.
EDGEWISE This algorithm focuses on the local quality of each individual edge, using parser probabilities as confidence scores. Starting from the whole parser search space, it keeps all the edges with scores greater than a threshold σ. The time complexity is O(N^2), where N represents the sentence length.

KBESTEISNER This algorithm extends the Eisner algorithm (Eisner, 1996) with cube pruning (Huang and Chiang, 2005) to find the K highest-scored tree structures. The Eisner algorithm is a standard method for decoding 1-best trees in graph-based dependency parsing. Based on bottom-up dynamic programming, it stores the 1-best subtree for each span and takes O(N^3) time to decode a sentence of N words.
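The EDGEWISE step reduces to a one-line filter over parser edge probabilities. The sketch below assumes a dictionary layout for the parser's scored arcs (an assumption, not the paper's data format); scanning scores for all head/modifier pairs is what gives the O(N^2) cost.

```python
def edgewise_forest(edge_probs, sigma):
    """Keep every labeled arc whose parser probability exceeds sigma.

    edge_probs: dict mapping (modifier, label, head) -> probability.
    Returns the retained edges with their probabilities.
    """
    return {edge: p for edge, p in edge_probs.items() if p > sigma}
```

Note that nothing here enforces tree-shapedness or connectivity, which is exactly the trade-off discussed below.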
KBESTEISNER keeps a sorted list of K-best hypotheses for each span. Cube pruning (Huang and Chiang, 2005) is adopted to generate the K-best list of each larger span from the K-best lists of its sub-spans. After the bottom-up decoding, we merge the final K-best trees by combining identical dependency edges to make the forest. As a result, KBESTEISNER takes O(N^3 K log K) time.

Discussions EDGEWISE is much simpler and faster than KBESTEISNER. Compared with the O(N^3 K log K) time complexity of KBESTEISNER, EDGEWISE only takes O(N^2) running time, and each step (storing an edge) is cheaper than KBESTEISNER's (making a new hypothesis by combining two from sub-spans). Besides, the forests of EDGEWISE can be denser and provide richer information than those from KBESTEISNER. This is because KBESTEISNER only merges K trees, among which many edges are shared. Also, K cannot be set to a large number (such as 100), because that would cause a dramatic increase in running time.
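The final merging step of KBESTEISNER (combining identical dependency edges across the K trees) can be sketched as a simple union; the input layout is an assumption, and the real algorithm's K-best decoding with cube pruning is not shown here.

```python
def merge_kbest(trees):
    """Union the edges of K decoded trees into one forest.

    trees: list of dicts mapping (modifier, label, head) -> probability.
    Identical edges are merged, keeping each edge's best probability.
    """
    forest = {}
    for tree in trees:
        for edge, p in tree.items():
            forest[edge] = max(p, forest.get(edge, p))
    return forest
```

Because the K trees share many edges, the merged forest is typically far sparser than K disjoint edge sets, which is why these forests are less dense than EDGEWISE ones.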
Compared with KBESTEISNER, EDGEWISE suffers from two potential problems. First, EDGEWISE does not guarantee that a generated forest contains the 1-best tree, as it makes decisions by considering individual edges. Second, it does not guarantee to generate spanning forests; disconnected forests can occur when the threshold σ is high. On the other hand, no previous work has shown that information from the whole tree is crucial for relation extraction. In fact, many previous studies use only the dependency path between the target entity mentions (Bunescu and Mooney, 2005; Airola et al., 2008; Chowdhury et al., 2011; Gormley et al., 2015; Mehryary et al., 2016). We study the effectiveness of both algorithms in our experiments.

GRN encoding with parser confidence
As illustrated by Figure 1(b), our dependency forests are directed graphs that can be consumed by GRN without any structural changes. For fair comparison, we use the same model as the baseline to encode sentences and forests. Thus our model uses the same number of parameters as our baseline taking 1-best trees.
Since forests contain more than one tree, it is intuitive to consider parser confidence scores for potentially better feature extraction. To this end, we slightly adjust the GRN encoding process without introducing additional parameters. In particular, we enhance the original message sum function (Equations 5 and 6) by applying the edge probabilities to calculate weighted message sums:

m_i^{↑} = Σ_{ε ∈ E_{(·,·,i)}} p_ε m_ε
m_i^{↓} = Σ_{ε ∈ E_{(i,·,·)}} p_ε m_ε

where ε (instead of a triple) is used to represent an edge for simplicity, m_ε is the message along edge ε, and p_ε is the parser probability of edge ε. The edge probabilities are not adjusted during end-task training.
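The probability-weighted message sum is identical to the plain message calculation except that each incoming message is scaled by its edge probability p_ε, which stays fixed during training. A toy sketch with illustrative names:

```python
def weighted_messages(edges, hidden, label_emb, probs, i):
    """Like the plain GRN message sum, but each message is scaled by p_eps.

    edges: list of (modifier, label, head) triples; probs maps each
    triple to its parser probability.
    """
    dim = len(next(iter(hidden.values()))) + len(next(iter(label_emb.values())))
    m_up, m_down = [0.0] * dim, [0.0] * dim
    for eps in edges:
        mod, label, head = eps
        if head == i:   # message from a child, scaled by its probability
            msg = hidden[mod] + label_emb[label]
            m_up = [a + probs[eps] * x for a, x in zip(m_up, msg)]
        if mod == i:    # message from the parent, scaled the same way
            msg = hidden[head] + label_emb[label + "-rev"]
            m_down = [a + probs[eps] * x for a, x in zip(m_down, msg)]
    return m_up, m_down
```

Low-confidence arcs thus contribute proportionally weaker signals, without adding any parameters over the baseline encoder.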

Training
Relation loss Given a set of training instances, each containing a sentence s with two target mentions ξ and ζ, and a dependency structure D (tree or forest), we train our models with a cross-entropy loss between the gold-standard relation r and the model distribution:

l_R = -log p(r | s, D; θ)

where θ represents the model parameters.
Using additional NER loss For training on BioCreative VI CPR, we follow previous work (Verga et al., 2018) in taking an NER loss as additional supervision, even though the mention boundaries are known during testing:
l_NER = -Σ_{n=1}^{N} log p(t_n | s; θ)

where t_n is the gold NE tag of w_n under the "BIO" scheme. Both losses are conditionally independent given the deep features produced by our model, and the final loss for BioCreative VI CPR training is l = l_R + l_NER.
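The combined objective is just the sum of the two negative log-likelihoods; a minimal sketch, with the model's predicted probabilities for the gold labels passed in directly:

```python
import math

def relation_loss(p_gold_relation):
    """l_R = -log p(r | s, D; theta) for the gold relation r."""
    return -math.log(p_gold_relation)

def ner_loss(p_gold_tags):
    """l_NER = -sum_n log p(t_n | s; theta) over the gold BIO tags."""
    return -sum(math.log(p) for p in p_gold_tags)

def total_loss(p_gold_relation, p_gold_tags):
    # Final training loss on BioCreative VI CPR: l = l_R + l_NER
    return relation_loss(p_gold_relation) + ner_loss(p_gold_tags)
```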

Experiments
We conduct experiments on two medical benchmarks to test the usefulness of dependency forests.

Data
BioCreative VI CPR (Krallinger et al., 2017) This task focuses on the relations between chemical compounds (such as drugs) and proteins (such as genes). The full corpus contains 1020, 612 and 800 extracted PubMed abstracts for training, development and testing, respectively. All abstracts are manually annotated with the boundaries of entity mentions and the relations. The data provides three types of NEs: "CHEMICAL", "GENE-Y" and "GENE-N", and the relation set R contains 5 regular relations ("CPR:3", "CPR:4", "CPR:5", "CPR:6" and "CPR:9") and the "None" relation.
For efficient generation of dependency structures, we segment each abstract into sentences, keeping only the sentences that contain at least one chemical mention and one protein mention. For any sentence containing several chemical or protein mentions, we keep multiple copies of it, each copy having a different target mention pair. As a result, we only consider the relations of mentions in the same sentence, assigning all cross-sentence chemical-protein pairs the "None" relation. By doing this, we effectively sacrifice cross-sentence relations, which has a negative effect on our systems; but this is necessary for efficient generation of dependency structures, since directly parsing a short paragraph is slow and error-prone. In general, we obtain 16,107 training, 10,030 development and 14,269 testing instances, of which around 23% have regular relations. The highest achievable recalls for relations on our development and test sets are 92.25 and 92.54, respectively, because of the exclusion of cross-sentence relations in preprocessing. We report F1 scores on the full test set for a fair comparison, using all gold regular relations to calculate recall.
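The instance-expansion step described above can be sketched as follows. The input layout (one tuple of words plus chemical and protein mention spans per sentence) is an assumption for illustration.

```python
from itertools import product

def make_instances(sentences):
    """Expand sentences into one instance per (chemical, protein) pair.

    Sentences lacking either mention type are dropped (their pairs
    become cross-sentence "None" cases); a sentence with several
    mentions yields one copy per target pair.
    sentences: list of (words, chemical_spans, protein_spans).
    """
    instances = []
    for words, chems, prots in sentences:
        if not chems or not prots:
            continue
        for c, p in product(chems, prots):
            instances.append({"words": words, "chemical": c, "protein": p})
    return instances
```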
Phenotype-Gene relation (PGR) This dataset concerns the relations between human phenotypes (such as diseases) and human genes, where the relation set is a binary class on whether a phenotype is related to a gene. It has 18,451 silver training instances and 220 high-quality test instances, each containing mention boundary annotations. We separate out the first 15% of the training instances as our development set. Unlike BioCreative VI CPR, almost every relation in PGR is within a single sentence.

Models
We compare the following models:
• TEXTONLY: It does not take dependency structures and directly uses the Bi-LSTM outputs (h^{(0)} in Eq. 3) to make predictions.
• DEPTREE: Our baseline using 1-best dependency trees, as shown in Section 4.
• EDGEWISEPS and EDGEWISE: Our models using the forests generated by our EDGEWISE algorithm with or without parser scores, respectively.
• KBESTEISNERPS and KBESTEISNER: Our models using the forests generated by our KBESTEISNER algorithm with or without parser scores, respectively.

Settings
We take a state-of-the-art deep biaffine parser (Dozat and Manning, 2017) to generate dependency trees and forests. Table 1 demonstrates several characteristics of the forests generated by the EDGEWISE and KBESTEISNER algorithms of Section 5.1, where "#Edge/#Sent" measures forest density as the number of edges divided by the sentence length, "LAS" represents the oracle LAS score on 100 biomedical sentences with manually annotated dependency trees, and "Conn. Ratio (%)" shows the percentage of forests in which the two related entity mentions are connected.
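The "Conn. Ratio" statistic amounts to a reachability test between the two mention spans, treating the forest as an undirected graph. A BFS sketch (function and argument names are illustrative):

```python
from collections import deque

def mentions_connected(edges, xi_span, zeta_span):
    """Test whether any word of mention xi reaches any word of mention zeta.

    edges: (modifier, label, head) triples of the forest, treated as
    undirected; spans are inclusive word-index pairs.
    """
    adj = {}
    for mod, _label, head in edges:
        adj.setdefault(mod, set()).add(head)
        adj.setdefault(head, set()).add(mod)
    targets = set(range(zeta_span[0], zeta_span[1] + 1))
    seen = set(range(xi_span[0], xi_span[1] + 1))
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        if node in targets:
            return True
        for nxt in adj.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False
```

The connectivity ratio is then just the fraction of instances for which this check succeeds.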

Analyses of generated forests
Regarding forest density, forests produced by EDGEWISE generally contain more edges than those from KBESTEISNER. Due to the combinatorial property of forests, EDGEWISE can encode many more candidate trees (and sub-trees) for the whole sentence (and each sub-span). This coincides with the fact that the forests generated by EDGEWISE have higher oracle scores than those generated by KBESTEISNER.
For connectivity, KBESTEISNER is guaranteed to generate spanning forests. On the other hand, the connectivity ratio of the forests produced by EDGEWISE drops as the threshold σ increases. We still have more than 94% of forests being connected with σ = 0.2. Later we will show that good end-task performance can still be achieved at the 94% connectivity ratio. This indicates that losing connectivity for a small portion of the data may not hurt the overall performance.

On the development set, our forest-based models show improvements over DEPTREE and TEXTONLY. Generally, EDGEWISE gives more improvements than KBESTEISNER. The main reason may be that EDGEWISE generates denser forests, providing richer features.

Development results
On the other hand, KBESTEISNER shows only a marginal improvement when increasing K from 5 to 10. This indicates that merging only 10-best trees may be far from sufficient. However, using a much larger K (such as 100) is not practical due to dramatically increased computation time. In particular, the running time of KBESTEISNER with K = 10 is already much longer than that of EDGEWISE. As a result, EDGEWISE serves our goal better than KBESTEISNER. This may sound surprising, as EDGEWISE does not consider tree-level scores. It suggests that relation extraction may not require full dependency tree features. This coincides with previous relation extraction research (Bunescu and Mooney, 2005; Airola et al., 2008), which utilizes only the shortest path connecting the two candidate entities in the dependency tree.
Leveraging parser confidence scores also consistently helps both methods. It is especially effective for EDGEWISE when σ = 0.05, likely because parser confidence scores are useful for distinguishing erroneous dependency arcs when the noise is large (e.g., when σ is too small). Following the development results, we directly report the performances of EDGEWISEPS and KBESTEISNERPS, setting σ and K to 0.2 and 10, respectively, in our remaining experiments. Table 2 shows the main comparison results on the BioCreative CPR testset, with comparisons to the previous state of the art and our baselines.

Main results on BioCreative VI CPR
GRU+Attn stacks a self-attention layer on top of GRU (Cho et al., 2014) and embedding layers; Bran (Verga et al., 2018) adopts a biaffine self-attention model to simultaneously extract the relations of all mention pairs. Both methods use only textual knowledge. TEXTONLY gives a performance comparable with Bran. With 1-best dependency trees, our DEPTREE baseline outperforms the previous state of the art. This confirms the usefulness of dependency structures and the effectiveness of GRN in encoding them. Using dependency forests and parser confidence scores, both KBESTEISNERPS and EDGEWISEPS obtain significantly higher numbers than DEPTREE. Consistent with the development experiments, EDGEWISEPS has a higher testset performance than KBESTEISNERPS.

Analysis
Effectiveness on parsing accuracy We have shown in Sections 7.5 and 7.6 that a dependency parser trained on a domain-general treebank can produce high-quality dependency forests in a target domain (biomedical) that help relation extraction. This rests on the assumption that a high-quality treebank of a decent scale exists, which may not be true for low-resource languages. We simulate this low-resource setting by training our parser on much smaller treebanks of 1K and 5K dependency trees, respectively. The LAS scores of the resulting parsers on our 100 manually annotated biomedical dependency trees are 79.3 and 84.2, respectively, while the LAS score of the parser trained on the full treebank is 86.4, as shown in Table 1. Figure 4 shows the results on the BioCreative CPR development set, where the performance of TEXTONLY is 51.6. DEPTREE fails to outperform TEXTONLY when only 1K or 5K dependency trees are available for training our parser. This is due to the low parsing recall and the subsequent noise caused by the weak parsers. It confirms the previous conclusion that dependency structures are highly influential on the performance of relation extraction. Both EDGEWISEPS and KBESTEISNERPS are still more effective than DEPTREE. In particular, KBESTEISNERPS significantly improves over TEXTONLY with 5K dependency trees, and EDGEWISEPS is helpful even with only 1K dependency trees.
KBESTEISNER shows relatively smaller gaps to EDGEWISE when only a limited number of dependency trees are available. This is probably because considering whole-tree quality helps to better eliminate noise.

Figure 5 illustrates two major types of errors in BioCreative CPR that are caused by inaccurate 1-best dependency trees. As shown in Figure 5(a), the baseline system mistakenly predicts a "None" relation for that instance. This is mainly because "STAT3" is incorrectly linked to the main verb "inhibited" with a "punct" relation, whereas it should be linked to "AKT". In contrast, our forest contains the correct arc with a probability of 0.18, possibly because "AKT and STAT3" fits the common pattern of "A and B" conjoining two nouns. Figure 5(b) shows another type of parsing error that causes end-task mistakes. In this example, the multi-token mention "calcium modulated cyclases" is incorrectly segmented in the 1-best dependency tree, where "modulated" is used as the main verb of the whole sentence, leaving "cyclases" and "calcium" as the object and the modifier of the subject, respectively. However, this mention ought to be a noun phrase with "cyclases" as the head. Our forest helps in this case by providing a more reasonable structure (shown as the yellow dashed arcs), where both "calcium" and "modulated" modify "cyclases". This is likely because "modulated" can be interpreted as an adjective in addition to being a verb. It shows the advantage of keeping multiple candidate syntactic arcs.

Table 3 shows the comparison with previous work on the PGR testset, where our models are significantly better than the existing models. This is likely because the previous models do not utilize all the information in the inputs: BO-LSTM only takes the words (without arc labels) along the shortest dependency path between the target mentions, and the pretrained weights of BioBERT are kept constant during training for relation extraction.
With 1-best trees, DEPTREE is 2.9 points better.

In addition to the biomedical domain, leveraging dependency forests applies to other domains as well. As shown in Table 4, we conduct a preliminary study on SemEval-2010 task 8 (Hendrickx et al., 2009), a widely used benchmark for news-domain relation extraction. It is a public dataset containing 10,717 instances (8000 for training and development, 2717 for testing) with 19 relations: 9 directed relations and a special "Other" class. Both C-GCN and C-AGGCN take a similar network to ours, stacking a graph neural network for encoding trees on top of a Bi-LSTM layer for encoding sentences. DEPTREE achieves similar performance to C-GCN and is slightly worse than C-AGGCN, with one potential reason being that C-AGGCN uses more parameters. Using forests, both KBESTEISNERPS and EDGEWISEPS outperform DEPTREE with the same number of parameters, and they show comparable or slightly better performances than C-AGGCN. Again, EDGEWISEPS is better than KBESTEISNERPS, showing that the former is a better way of generating forests.

Conclusion
We have proposed two algorithms for generating high-quality dependency forests for relation extraction, and studied a graph recurrent network for effectively distinguishing useful features from noise in parsing forests. Experiments on two biomedical relation extraction benchmarks show the superiority of forests over tree structures, without introducing any additional model parameters. Our analyses indicate that the main advantage comes from alleviating out-of-domain parsing errors.