An extended dependency graph for relation extraction in biomedical texts

Kernel-based methods are widely used for relation extraction tasks and obtain good results by leveraging lexical and syntactic information. However, in the biomedical domain these methods are limited by the size of the available datasets and have difficulty coping with variations in text. To address this problem, we propose the Extended Dependency Graph (EDG), which incorporates a few simple linguistic ideas and includes information beyond syntax. We believe the use of EDG will enable machine learning methods to generalize more easily. Experiments confirm that EDG provides up to 10% F-value improvement over the dependency graph using mainstream kernel methods over five corpora. We conducted additional experiments to provide a more detailed analysis of the contributions of the individual modules in EDG construction.


Introduction
With the growing amount of biomedical information available in textual form, there has been considerable interest in applying NLP techniques and machine learning (ML) methods to the biomedical literature. Some of these projects involve extracting relations such as protein-protein interactions (Krallinger et al., 2008).
In the biomedical domain, most relation extraction work is currently applied to the abstracts of articles. These abstracts are by nature dense with information and often use constructions such as appositives and relative clauses. The abundance of textual variations can thus be problematic for ML systems, especially with small training corpora.
One solution to this issue is to find a suitable level of abstraction in the text representation so that ML methods can generalize more easily. The use of syntax and parse information provides one such abstraction, and the use of syntactic dependency information has become prevalent in biomedical relation extraction. It has been suggested that dependency links are close to the semantic relationships needed for the next stage of interpretation (Covington, 2001).
There have been significant advances in the development of machine learning and kernel methods and in the use of sophisticated parameter tuning in the biomedical domain. In this work, we focus on the representation of the text used in learning rather than on the machine learning technique, with the hope that advances in both directions will improve the performance of relation extraction systems. In this paper we propose the Extended Dependency Graph (EDG), which includes information about the text that goes beyond syntax. We define EDG and discuss how to construct it from a given sentence using some simple linguistic notions.
The hypothesis we test here is that EDG allows ML techniques to generalize more easily. To determine the effect of EDG, we conducted experiments on protein-protein interaction (PPI) extraction. For this purpose, we used two kernels: a simple kernel based on edit distance (Erkan et al., 2007) and a more elaborate kernel that is one of the top-performing kernels on the PPI task. We compared the performance of both kernels using the dependency graph and the EDG on five corpora. Our results suggest that EDG provides up to 10% F-value improvement over the dependency graph. On three out of five corpora the results are better than those of the overall best system in the study of (Tikk et al., 2010), as well as an ensemble method that builds on them (Miwa et al., 2009a). We also evaluate the contributions of the individual components included in EDG.

Related work
Many kernel-based relation extraction systems have employed lexical and syntactic information (Zhou et al., 2007; Ning and Qi, 2011). There has been a growth in the use of more complex kernels and sophisticated parameter tuning methods to improve results (Zhang et al., 2006; Choi and Myaeng, 2010). For the PPI task, machine learning methods using rich feature vectors (Miwa et al., 2009b), an edit distance kernel (Erkan et al., 2007), a dependency tree kernel (Chowdhury et al., 2011), an all-paths graph kernel, or their combinations and variations (Miwa et al., 2009a; Zhang et al., 2012) have been proposed. Our focus is on improving the representation of information in natural texts, rather than on developing new kernels. There have been several attempts to leverage syntax and shallow semantic argument structure (Miwa et al., 2010; Van Landeghem et al., 2010; Van Landeghem et al., 2012; Liu et al., 2013; Oepen et al., 2014; Peng et al., 2014; Nguyen et al., 2015). Though the focus of these works was not to utilize this information with machine learning methods, they offer insight into the utility of information beyond syntax. We develop the EDG approach for relation extraction based on these ideas. Figure 1 illustrates the overall architecture with the core component highlighted: EDG construction. The input is a sentence with named entities marked. We use the Charniak-Johnson parser and the Stanford conversion tool to obtain the basic syntactic dependency graph (SDG). Our approach focuses on how to leverage simple linguistic principles and information beyond syntax to construct the EDG from the SDG.

Extended dependency graph (EDG)
In this paper, we use the EDG to represent the structure of a sentence. As in many dependency graph representations used in relation extraction, the vertices in an EDG are labelled with information such as the text, the part-of-speech tag, and the word lemma. If an entity mention spans multiple tokens in a sentence, we merge the corresponding vertices into one vertex (an operation called vertex contraction).
An EDG has two types of dependencies. The first type consists of the syntactic dependencies obtained from the collapsed dependencies output by applying the Stanford dependencies converter to a syntactic parse tree (De Marneffe and Manning, 2008). The second type consists of numbered arguments based on the guidelines of PropBank (Bonial et al., 2012). Because we currently focus on binary relation extraction, we use only arg0 and arg1 (the latter perhaps better stated as not-arg0) in the EDG. Figure 2 shows the EDGs of three text fragments, with syntactic edges appearing above the words and numbered argument edges appearing below. From a relation extraction perspective, the syntactic dependencies in Figure 2 vary across the fragments, but the numbered arguments between the two entity mentions are the same.
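The structure described above can be sketched minimally in code. This is an illustrative data structure, not the paper's implementation; the class and field names (`Vertex`, `EDG`, `syn_edges`, `arg_edges`) are our own, and the `contract` method simplifies vertex contraction by keeping the first index and remapping edges.

```python
class Vertex:
    """A token (or merged entity mention) with its labels."""
    def __init__(self, text, pos, lemma):
        self.text = text      # surface form
        self.pos = pos        # part-of-speech tag
        self.lemma = lemma    # word lemma

class EDG:
    """Extended dependency graph with two edge types."""
    def __init__(self):
        self.vertices = []
        self.syn_edges = []   # syntactic dependencies: (src_idx, dst_idx, label)
        self.arg_edges = []   # numbered arguments:     (src_idx, dst_idx, 'arg0'|'arg1')

    def add_vertex(self, text, pos, lemma):
        self.vertices.append(Vertex(text, pos, lemma))
        return len(self.vertices) - 1

    def contract(self, indices, entity_text):
        """Merge the vertices of a multi-token entity mention into one.

        The merged vertex keeps the first index; edges touching any of the
        merged tokens are remapped to it. (POS 'NN' is a simplification.)
        """
        keep = indices[0]
        self.vertices[keep] = Vertex(entity_text, 'NN', entity_text.lower())

        def remap(i):
            return keep if i in indices else i

        self.syn_edges = [(remap(s), remap(d), l) for s, d, l in self.syn_edges]
        self.arg_edges = [(remap(s), remap(d), l) for s, d, l in self.arg_edges]
```

A two-token entity such as "Presenilin 1" would thus be collapsed into a single vertex before any edges are compared or propagated.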
There are two motivations for using numbered arguments. One is to "provide consistent argument labels across different syntactic realizations of the same verb" (Bonial et al., 2012) with the intention of making generalizations easier downstream. The other is to add/propagate new arg0 and arg1 using reasoning that goes beyond syntax.
Following these two motivations, we first discuss how to capture arg0 and arg1 using the different syntactic dependencies obtained from Stanford dependencies. We then describe relations such as is-a, member-collection, and part-whole, and how to propagate arg0 and arg1 using them.

Syntax based arg0 and arg1
We follow the approaches of SemRep (Rinaldi et al., 2006) and PASMED (Nguyen et al., 2015) to obtain the basic arg0 and arg1 edges from the syntactic dependencies. For example, the EDG will include an arg0 edge from a verb to a noun if the syntactic dependency is nsubj or agent, and an arg1 edge if the dependency is nsubjpass or dobj.
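The basic mapping just described can be written as a simple lookup. This is a sketch under our assumptions about how the rule applies (the actual SemRep/PASMED rules are richer); edges are `(governor, dependent, label)` triples from the SDG.

```python
# Mapping from Stanford dependency labels to numbered-argument labels,
# per the rule stated above: nsubj/agent -> arg0, nsubjpass/dobj -> arg1.
ARG_MAP = {'nsubj': 'arg0', 'agent': 'arg0',
           'nsubjpass': 'arg1', 'dobj': 'arg1'}

def syntax_args(syn_edges):
    """Derive basic arg edges from SDG edges (governor, dependent, label)."""
    return [(gov, dep, ARG_MAP[label])
            for gov, dep, label in syn_edges if label in ARG_MAP]
```

For "A binds B", the SDG edges nsubj(binds, A) and dobj(binds, B) would yield arg0(binds, A) and arg1(binds, B).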
In addition, we consider situations where verbs in gerund form are used as noun modifiers. Figure 3 shows a compound noun phrase. We know that there is a PPI between "retinoblastoma" and "protein", because we can rewrite the phrase as "retinoblastoma binds to protein, RBP1". Therefore, we add an arg1 edge from "binding" to "protein" in Figure 3. This operation introduces cyclicity because the gerund is included in the noun phrase headed by "protein". We posit that these edges are useful when found in combination with other constructions, such as appositives. We discuss later how to propagate arg1 from the gerund "binding" to "RBP1". Next we consider two cases of argument elision.

Elided argument relation
Here we consider cases where the argument of a predicate is implicit rather than explicit. Figure 4 shows a sentence where arg0(interaction, Presenilin 1) can be inferred. The SDG includes a prep_via edge from the first verb, "suppresses", to the nominalized verb "interaction", indicating the PP attachment to the verb. In this case, we add an arg0 edge from the nominalized verb to the arg0 argument of the first verb. In constructing the EDG, we also consider prep_through, as well as prep_by when a gerund verb, rather than a nominalized verb, follows it.
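The elided-argument rule can be sketched as follows. This is an assumption-laden simplification: we restrict the trigger to a fixed set of collapsed prepositions and ignore the gerund-vs-nominalization distinction for prep_by, which the paper handles separately.

```python
def elided_args(syn_edges, arg_edges, preps=('prep_via', 'prep_through')):
    """If verb v has a prep_via/prep_through edge to a nominalized verb n,
    copy v's arg0 onto n (the elided-argument rule sketched above)."""
    new = []
    for v, n, label in syn_edges:
        if label in preps:
            for src, dst, a in arg_edges:
                if src == v and a == 'arg0':
                    new.append((n, dst, 'arg0'))
    return new
```

On the Figure 4 example, prep_via(suppresses, interaction) together with arg0(suppresses, Presenilin 1) yields the inferred edge arg0(interaction, Presenilin 1).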
Reduced relative clauses A relative clause is a clause that modifies a noun phrase. Two types of relative clauses frequently appear in biomedical text. Full relative clauses are introduced by relative pronouns such as "which" and "that". Reduced relative clauses start with a gerund or past participle and have no overt subject. (Figure 5: Sample relative clauses. (a) A sample full relative clause, "serine/threonine kinase that is phosphorylated by Pto"; (b) a sample reduced relative clause.) The PropBank annotation guidelines (Bonial et al., 2012) posit a numbered argument link from the relative clause verb to the trace in the parse tree, which also indicates the referent noun phrase. For full relative clauses, we follow the normal procedure for verbs (Figure 5a). For reduced relative clauses, since we use a dependency structure that includes no traces, we use the vmod edge in the SDG from the head of the noun phrase to the reduced relative clause's verb (Figure 5b). The direction of this edge indicates that the relative clause is syntactically included in the larger noun phrase. For the arg edge, we reverse the direction of vmod and create an edge from the relative clause's verb, as shown in Figure 5b. Compared with Figure 5a, this arg construction unifies the treatment with that of full relative clauses.
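The vmod reversal for reduced relative clauses can be sketched in a few lines. This is illustrative only: whether the new edge is arg0 or arg1 depends on the verb form (a past participle such as "phosphorylated" makes the modified noun its arg1), a decision we elide here and leave to the caller.

```python
def reduced_relative_args(syn_edges, arg_label='arg1'):
    """For each vmod(head_noun, rc_verb) edge in the SDG, emit an argument
    edge in the reverse direction, from the clause's verb to the noun."""
    return [(verb, noun, arg_label)
            for noun, verb, label in syn_edges if label == 'vmod']
```

For a fragment like "kinase phosphorylated by Pto", vmod(kinase, phosphorylated) produces arg1(phosphorylated, kinase), mirroring the treatment of full relative clauses.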
Notice that although in both cases the arg1 edge is not incident on a named entity, it may still lead to the named entity through the propagation of edges discussed in the next subsection.

Going Beyond Syntax
Here we consider the propagation of arg using information that goes beyond syntax.
Co-reference If an arg edge from a vertex v reaches a pronominal node, we add a new arg edge from v to any named entity the pronoun corefers to. To detect coreference we use an implementation of the technique described in (Qiu et al., 2004). We treat acronyms, with their long and short forms, in the same way as coreference: we add an extra arg edge when an arg edge is incident on the long form. We use the acronym detector of (Schwartz and Hearst, 2003) to add acronyms missed in the SDG. Interestingly, the SDG uses appos for both acronyms and appositives.
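The propagation step, once coreference (or acronym) links are known, is straightforward. Detection itself (Qiu et al., 2004; Schwartz and Hearst, 2003) is assumed done upstream; `coref`, a dictionary from a pronoun or long form to its entity, is a hypothetical interface of our own.

```python
def propagate_coref(arg_edges, coref):
    """Copy each arg edge whose target has an antecedent onto that entity.
    coref maps a pronominal/long-form mention to the named entity it
    corefers with."""
    extra = []
    for src, dst, label in arg_edges:
        if dst in coref:
            extra.append((src, coref[dst], label))
    return extra
```

The returned edges are added alongside, not in place of, the originals, so the pronoun's own edge is preserved.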
Appositive Reconsider the fragment in Figure 3. Using the constructions discussed thus far, the arg1 edge will reach "protein". Further, the SDG uses an appos edge from "protein" to "RBP1" for the appositional modifier. We integrate arg1 and appos to construct another arg1 edge from "binding" to the actual named entity "RBP1".
Is-A In addition to appositives, we consider other forms of the is-a relation that are mentioned textually but cannot be found directly from the syntactic dependencies. For example, in Figure 6 there is no edge in the SDG that explicitly captures the is-a relation. It is worth noting that the nsubj edge by itself does not indicate the is-a relation, but together with the two other edges, cop and det, we can infer it. Hence we add a new edge from "oncogene" to "HOX11" to reflect this relation in the EDG (dotted edge). Afterwards, we propagate arg0 from "targets" to "HOX11".
Besides the pattern shown in Figure 6, we also use "known as", "designated as", "considered as", "identified as", and "act as" as patterns that signal is-a relations. These patterns contain and extend the rules in (Snow et al., 2005; Hearst, 1992).
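The copular pattern of Figure 6 can be sketched as a check over SDG edges. This is a simplified reading of the rule (nsubj together with cop and det on the same governor signals is-a); the lexical patterns listed above would need separate handling.

```python
def copular_isa(syn_edges):
    """Detect 'entity is a noun' patterns: nsubj(noun, entity) where the
    same governor noun also carries cop and det edges.
    Returns (generic_noun, entity) pairs, read as 'entity is-a noun'."""
    by_gov = {}
    for gov, dep, label in syn_edges:
        by_gov.setdefault(gov, set()).add(label)
    return [(gov, dep) for gov, dep, label in syn_edges
            if label == 'nsubj' and {'cop', 'det'} <= by_gov.get(gov, set())]
```

On the Figure 6 example, nsubj(oncogene, HOX11) plus cop(oncogene, is) and det(oncogene, an) yields the is-a edge from "oncogene" to "HOX11", over which arg edges can then be propagated.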
Member-collection links a generic reference (called the collection) to a group of entity mentions (called the members). As in Figure 7, typical keywords that identify member-collection relations are "including" and "such as". We consider the cases where the mention group follows the keyword and the generic reference precedes it. After detection, we propagate arg edges from the collection to its members.
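Once a member-collection relation has been detected, propagation mirrors the coreference case but fans out to several members. The `members` dictionary (collection mention to its member entities) is a hypothetical interface; the keyword-based detection itself is not shown.

```python
def propagate_collection(arg_edges, members):
    """Copy each arg edge incident on a collection onto every member.
    members maps a collection mention to its list of member entities."""
    extra = []
    for src, dst, label in arg_edges:
        for m in members.get(dst, []):
            extra.append((src, m, label))
    return extra
```

For "proteins, such as cofilin and coronin", arg1(involves, proteins) would thus spawn arg1 edges to both "cofilin" and "coronin".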
Part-whole links an entity part to its mention, typically denoting the construction of larger entities out of smaller ones. Just as "breaking the glass of the window" can be stated as "breaking the window", in biomedical tasks an action on a larger unit can often be inferred from a mention of the action applied to one of its parts. Thus, in Figure 8, after we detect a part-whole relation, an arg1 edge incident on the part is propagated to the object that contains it.
In this paper, we focus on three types of patterns to recognize part-whole relations. The first is a prepositional phrase such as "domain of e", where "domain" indicates the part and e the larger entity mention that the "domain" belongs to; other keywords indicating parts include "fragment", "portion", and "region". The second pattern is a compound nominal such as "e domain". The third exploits keywords such as "contain", "consist", and "compose". For each part-whole relation, we propagate edges from the part to its entity mention.
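The first pattern, "domain of e", can be sketched over collapsed SDG edges. Only this prepositional pattern is shown; the compound-nominal and keyword patterns would need their own rules, and the part-word list here is just the examples named above.

```python
# Part-indicating keywords from the first pattern described above.
PART_WORDS = {'domain', 'fragment', 'portion', 'region'}

def part_whole_prep(syn_edges):
    """Detect 'domain of e' patterns via collapsed prep_of edges:
    prep_of(part_word, entity) -> (part, whole) pair."""
    return [(part, whole) for part, whole, label in syn_edges
            if label == 'prep_of' and part in PART_WORDS]
```

Each detected `(part, whole)` pair then licenses copying an arg1 edge incident on the part onto the whole, as in Figure 8.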

Experiments
We evaluated our method on the protein-protein interaction (PPI) extraction task, where the system identifies whether or not a given protein pair in a sentence stands in a PPI relationship. We used the SDG or the EDG as the input representation of the sentences, which include the named protein entities.

Kernels
We tested the effect of using EDG on two kernels that have been employed for PPI extraction.
Edit distance kernel is based on the edit distance among the shortest paths between entities in the dependency graph, that is, the minimal number of operations (deletion, insertion, substitution at the word level) needed to transform one path (p1) into the other (p2). Following (Erkan et al., 2007), when comparing two shortest paths we considered the word lemma and the edge labels. We also renamed the candidate pair in the sentence as "E1" and "E2" and the remaining proteins provided in the annotation as "EX". Consider, for example, the shortest paths of Figures 2a, 3, and 8: the edit distance between (a) and (b) is 1 because the predicate verbs differ, and the distance between (b) and (c) is 0, which illustrates the generalizability of the EDG.
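The distance itself is the standard Levenshtein distance over the path's token/label sequence, as sketched below. How the distance is turned into a kernel value (Erkan et al., 2007, use an exponential of the negative distance) is not shown here.

```python
def edit_distance(p1, p2):
    """Levenshtein distance between two shortest-path sequences,
    counting word-level insertions, deletions, and substitutions."""
    m, n = len(p1), len(p2)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if p1[i - 1] == p2[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]
```

Two paths that differ only in the predicate verb, e.g. `E1 arg0 binds arg1 E2` vs. `E1 arg0 interacts arg1 E2`, are at distance 1.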
All-paths graph kernel is a practical instantiation of a graph kernel framework (Gärtner et al., 2003). It counts weighted shared paths of all possible lengths in a graph. The all-paths graph kernel uses two graph representations: (1) a dependency graph in which all edges on the shortest path between the candidate pair receive a weight of 0.9 and all other edges receive a weight of 0.3; and (2) a linear graph in which each word node is connected to its succeeding word node by an edge with weight 0.9.
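The core idea of summing weighted paths of all lengths can be illustrated by accumulating powers of the weighted adjacency matrix, truncated at a maximum length. This is only a sketch of the path-counting machinery; the actual kernel additionally matches vertex labels across the two graphs and uses an infinite series, details we omit.

```python
def matmul(a, b):
    """Multiply two square matrices given as nested lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def all_paths_weights(adj, max_len):
    """Sum of weighted path counts of lengths 1..max_len between every
    vertex pair: total = adj + adj^2 + ... + adj^max_len."""
    n = len(adj)
    total = [[0.0] * n for _ in range(n)]
    power = [row[:] for row in adj]
    for _ in range(max_len):
        total = [[total[i][j] + power[i][j] for j in range(n)]
                 for i in range(n)]
        power = matmul(power, adj)
    return total
```

With the 0.9/0.3 weighting above, a length-2 path through a shortest-path edge (0.9) and an off-path edge (0.3) contributes 0.27, so paths through the candidate pair dominate the kernel value.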
We used words (not lemmas) and edge labels to compute the all-paths graph kernel. As with the edit distance kernel, we replaced the protein names in a sentence with "E1", "E2", and "EX". We used the APG software (http://mars.cs.utu.fi/PPICorpora/GraphKernel.html) to train and test the kernel. The software uses a sparse regularized least squares method instead of an SVM.

Experimental setup
We evaluated our method on five PPI corpora that have been used in the community: AIMed, BioInfer (Pyysalo et al., 2007), HPRD50 (Fundel et al., 2007), IEPA (Ding et al., 2002), and LLL (Nédellec, 2005). These corpora have different sizes (Table 1) and vary slightly in their definition of PPI. (Tikk et al., 2010) conducted a comparison of a variety of PPI extraction systems on these corpora (http://mars.cs.utu.fi/PPICorpora). We used the same experimental setup to evaluate our methods: self-interactions were excluded from the corpora, and 10-fold document-level cross-validation was used for evaluation.
For our experiments, we used the Charniak-Johnson parser (Charniak and Johnson, 2005) and the Stanford conversion tool with "Collapsed" setting to obtain SDG (De Marneffe and Manning, 2008). The edit distance kernel was trained with LIBSVM (Chang and Lin, 2011). The APG kernel was trained with APG software.
Both these kernels have several parameters, whose settings can influence the performance. In this paper, we did not perform exhaustive systematic parameter search and optimization. We believe such parameter tuning techniques might lead to further improvements.
For the edit kernel, we set γ to 4.5, the value used in the original application of the edit kernel on these corpora (Erkan et al., 2007). We set the SVM parameter c to 10, the average best value used in (Tikk et al., 2010). For the APG kernel, we used the default settings of the implementation, which performs a grid parameter search for each iteration of the 10-fold cross-validation. The parameter search selects the best setting based on a random set of 1,000 samples from the training set (9 folds); if there are fewer than 1,000 samples, the software uses the whole training set. Note that the test sets (the remaining fold) were not used for parameter tuning.

Results
Performance, as measured by precision, recall, and F-value, is shown in Table 2. To provide context, we also include the results published in (Tikk et al., 2010) and (Miwa et al., 2009a). The first reports the results of the APG kernel, which was found to be a leading performer on these five corpora in the study reported in (Tikk et al., 2010). The second set of results is that of an ensemble method that combines different systems. Although we use the same corpora as the study of (Tikk et al., 2010) and the same implementation of the APG kernel, the results in Row 1 and Row 6 of the table are not the same. The differences are possibly due to the use of different parsers and to how parameters were chosen. However, we emphasize that all our own measurements (e.g., in Rows 3-5 or Rows 6-8) are directly comparable to each other because the same parameter settings were used for each corpus.
The first part of Table 2 shows results using the edit distance kernel with the original dependency graph (Row 3) and with the complete EDG (Row 4). We also experimented with different configurations of the EDG by dropping one of the extra edge types added in the EDG. The results obtained by the best configuration are reported in Row 5. On three of the corpora, the best results are obtained by using the full EDG. However, better results were obtained on HPRD50 when the member-collection relations were not included, and on LLL when the is-a relations were not included. In the next subsection we address why excluding these relations helps.
Overall, comparing Rows 3 and 4, we obtain F-value improvements using the EDG over the SDG on four corpora (all except LLL), with around 10% gains on AIMed and HPRD50 and a noticeable gain in recall. For three of the corpora (AIMed, HPRD50, and IEPA), there is an increase in both precision and recall. For BioInfer, the gain in precision slightly exceeds the loss in recall, whereas for LLL the gain in precision is slightly lower than the loss in recall. When Row 5 is used for comparison, we obtain an improvement in F-value on all five corpora, with improvements in precision and recall on four corpora (BioInfer being the exception). We now see over 18% F-value improvement on HPRD50.
Despite the weak performance of the edit kernel using the baseline SDG, the performance of this kernel with the full EDG is close to or exceeds the results of the leading PPI systems using kernel methods (Rows 1 and 2) on four corpora, and exceeds them on these four corpora when the results of Row 5 are considered.
The second part of Table 2 (Rows 6-8) shows results using the APG kernel. The EDG (Best) results in Row 8 were achieved on AIMed, BioInfer, and LLL by dropping the is-a relations, and on HPRD50 by not including the member-collection relations. We see F-value gains on four corpora through the use of the EDG.
Comparing the results of the edit distance and APG kernels, we find that the more complex APG kernel (the best one overall in the (Tikk et al., 2010) study) generally obtains better results than the edit kernel using the baseline SDG. However, the use of the EDG not only closes the gap between the kernels; in fact, the edit kernel with the EDG obtains a higher F-value than APG with the SDG or the EDG on four of the five corpora.
To provide a comparison with non-kernel methods, we also include the results published in (Miwa et al., 2009b), which is the state-of-the-art system on the five corpora. That paper develops several systems that use a rich feature vector, combining analyses from different parsers and the values obtained from multiple kernels, including the APG score. L2-SVM and SVM-CW are among the leading SVM-based systems proposed in that paper.
Row 9 shows the results of L2-SVM on these corpora. We observe that both the edit kernel and the APG kernel with EDG (Best) obtain improvements on two of the corpora. Row 10 shows the results of an SVM modified for corpus weighting (SVM-CW). Using one of the corpora as the target corpus, SVM-CW weights the remaining corpora (called the source corpora) by their "goodness" for training on the target corpus, adjusting for their compatibility and incompatibility (Miwa et al., 2009b). Thus, their results are not directly comparable with ours. Nonetheless, we obtain improvements using the edit kernel with EDG (Best) on HPRD50.

Contribution of individual relation
Table 3 compares the effects of the different techniques in the EDG on the five corpora using the edit distance kernel. We first evaluated the SDG obtained from the Stanford conversion tool with the "CCProcessed" setting (Row 2) for processing conjunctions, and next added only the syntax-based arg0 and arg1 (Row 3). After that, we added in succession referential links (including coreference, appositives, and is-a), member-collection detection, and part-whole detection in the EDG construction, step by step (Rows 4-6). Overall, using "CCProcessed" increases the F-values on all five corpora. The EDG constructed using syntax-based arg edges achieves additional increases on four out of five corpora (the exception being IEPA). Every subsequent step generally provides further improvements in F-value. However, we observed that on HPRD50, member-collection detection decreased the F-value. We therefore switched off this component in the EDG construction while including the rest of the relations, and achieved a higher F-value of 79.9% on this corpus (Row 7). This corresponds to the result displayed in Row 5 (EDG Best) of Table 2. On the LLL corpus, as components were successively added, we noticed a drop in F-value when referential linking was added. Similarly, turning off is-a detection while including all other EDG edges enabled us to obtain the EDG best F-value of 84.6% on LLL.

We also identified that is-a detection decreased F-values on IEPA; however, no further improvement could be made by switching it off. We plan to further analyze this result in the future.
Additionally, due to the gap in performance between our system and (Miwa et al., 2009a) on BioInfer, we analyzed the error cases and noticed several cases similar to the following example, in which the candidate pair of named entities is marked in bold.
• This process involves other actin-binding proteins, such as cofilin and coronin.
Using the techniques shown in Figure 3, we can create arg0(binding, actin) and arg1(binding, proteins) in the EDG, and also detect a member-collection relation between "actin-binding proteins" and "cofilin". With propagation, an interaction between "actin" and "cofilin" would be predicted. However, this pair is annotated as a negative; instead, the annotation in BioInfer includes a positive relation between "actin-binding proteins" and "cofilin". Because of similar examples in BioInfer, member-collection and is-a propagation failed to improve the results on BioInfer.

Conclusion
In this paper, we strive to find a level of abstraction that is more suitable for tasks such as relation extraction. For this purpose, we introduced techniques to create a new dependency graph representation (EDG) that goes beyond syntactic dependencies. We evaluated the efficacy of the EDG with the edit distance and APG kernels, applying them to five different PPI datasets, and obtained improvements in F-value by using the EDG. We find that despite the simplicity of the edit kernel and its weak performance with the baseline graph, results comparable to those of state-of-the-art kernel-based systems are obtained on different corpora with the inclusion of the EDG. While the use of the EDG has mostly led to gains in both recall and precision, recall drops on the BioInfer dataset; we would like to analyze this result further in the future. One of our main motivations for developing the EDG is to enable learning with small datasets and to test whether the abstraction captured in the EDG allows for easier generalization. Learning with small datasets and use in the context of active learning will be investigated in future work.
We plan to test the use of EDG on other relation extraction tasks in the biomedical domain. We also plan to investigate richer features and their combinations in conjunction with the use of EDG.