Graph Convolution over Pruned Dependency Trees Improves Relation Extraction

Dependency trees help relation extraction models capture long-range relations between words. However, existing dependency-based models either neglect crucial information (e.g., negation) by pruning the dependency trees too aggressively, or are computationally inefficient because it is difficult to parallelize over different tree structures. We propose an extension of graph convolutional networks that is tailored for relation extraction, which pools information over arbitrary dependency structures efficiently in parallel. To incorporate relevant information while maximally removing irrelevant content, we further apply a novel pruning strategy to the input trees by keeping words immediately around the shortest path between the two entities among which a relation might hold. The resulting model achieves state-of-the-art performance on the large-scale TACRED dataset, outperforming existing sequence and dependency-based neural models. We also show through detailed analysis that this model has complementary strengths to sequence models, and combining them further improves the state of the art.


Introduction
Relation extraction involves discerning whether a relation exists between two entities in a sentence (often termed subject and object, respectively). Successful relation extraction is the cornerstone of applications requiring relational understanding of unstructured text on a large scale, such as question answering (Yu et al., 2017), knowledge base population (Zhang et al., 2017), and biomedical knowledge discovery (Quirk and Poon, 2017).
Models making use of dependency parses of the input sentences, or dependency-based models, have proven to be very effective in relation extraction, because they capture long-range syntactic relations that are obscure from the surface form alone (e.g., when long clauses or complex scoping are present). Traditional feature-based models are able to represent dependency information by featurizing dependency trees as overlapping paths along the trees (Kambhatla, 2004). However, these models face the challenge of sparse feature spaces and are brittle to lexical variations. More recent neural models address this problem with distributed representations built from their computation graphs formed along parse trees. One common approach to leverage dependency information is to perform bottom-up or top-down computation along the parse tree or the subtree below the lowest common ancestor (LCA) of the entities (Miwa and Bansal, 2016). Another popular approach, inspired by Bunescu and Mooney (2005), is to reduce the parse tree to the shortest dependency path between the entities (Xu et al., 2015a,b). However, these models suffer from several drawbacks. Neural models operating directly on parse trees are usually difficult to parallelize and thus computationally inefficient, because aligning trees for efficient batch training is usually nontrivial. Models based on the shortest dependency path between the subject and object are computationally more efficient, but this simplifying assumption has major limitations as well. Figure 1 shows a real-world example where crucial information (i.e., negation) would be excluded when the model is restricted to only considering the dependency path.

Figure 1: A subtree of the original UD dependency tree between the subject ("he") and object ("Mike Cane") is also shown, where the shortest dependency path between the entities is highlighted in bold. Note that negation ("not") is off the dependency path.

* Equal contribution. The order of authorship was decided by a tossed coin.
In this work, we propose a novel extension of the graph convolutional network (Kipf and Welling, 2017; Marcheggiani and Titov, 2017) that is tailored for relation extraction. Our model encodes the dependency structure over the input sentence with efficient graph convolution operations, then extracts entity-centric representations to make robust relation predictions. We also apply a novel path-centric pruning technique to remove irrelevant information from the tree while maximally keeping relevant content, which further improves the performance of several dependency-based models including ours. We test our model on the popular SemEval 2010 Task 8 dataset and the more recent, larger TACRED dataset. On both datasets, our model not only outperforms existing dependency-based neural models by a significant margin when combined with the new pruning technique, but also achieves a 10-100x speedup over existing tree-based models. On TACRED, our model further achieves state-of-the-art performance, surpassing a competitive neural sequence model baseline. This model also exhibits complementary strengths to sequence models on TACRED, and combining these two model types through simple prediction interpolation further improves the state of the art.
To recap, our main contributions are: (i) we propose a neural model for relation extraction based on graph convolutional networks, which allows it to efficiently pool information over arbitrary dependency structures; (ii) we present a new path-centric pruning technique to help dependency-based models maximally remove irrelevant information without damaging crucial content to improve their robustness; (iii) we present detailed analysis on the model and the pruning technique, and show that dependency-based models have complementary strengths with sequence models.

Models
In this section, we first describe graph convolutional networks (GCNs) over dependency tree structures, and then we introduce an architecture that uses GCNs at its core for relation extraction.

Graph Convolutional Networks over Dependency Trees
The graph convolutional network (Kipf and Welling, 2017) is an adaptation of the convolutional neural network (LeCun et al., 1998) for encoding graphs. Given a graph with $n$ nodes, we can represent the graph structure with an $n \times n$ adjacency matrix $A$ where $A_{ij} = 1$ if there is an edge going from node $i$ to node $j$. In an $L$-layer GCN, if we denote by $h_i^{(l-1)}$ the input vector and $h_i^{(l)}$ the output vector of node $i$ at the $l$-th layer, a graph convolution operation can be written as

$$h_i^{(l)} = \sigma\Big(\sum_{j=1}^{n} A_{ij} W^{(l)} h_j^{(l-1)} + b^{(l)}\Big), \tag{1}$$

where $W^{(l)}$ is a linear transformation, $b^{(l)}$ a bias term, and $\sigma$ a nonlinear function (e.g., ReLU). Intuitively, during each graph convolution, each node gathers and summarizes information from its neighboring nodes in the graph.
We adapt the graph convolution operation to model dependency trees by converting each tree into its corresponding adjacency matrix $A$, where $A_{ij} = 1$ if there is a dependency edge between tokens $i$ and $j$. However, naively applying the graph convolution operation in Equation (1) could lead to node representations with drastically different magnitudes, since the degree of a token varies widely. This could bias our sentence representation towards favoring high-degree nodes regardless of the information carried in the node (see details in Section 2.2). Furthermore, the information in $h_i^{(l-1)}$ is never carried over to $h_i^{(l)}$, since nodes never connect to themselves in a dependency tree.
We resolve these issues by normalizing the activations in the graph convolution before feeding them through the nonlinearity, and adding self-loops to each node in the graph:

$$h_i^{(l)} = \sigma\Big(\sum_{j=1}^{n} \tilde{A}_{ij} W^{(l)} h_j^{(l-1)} / d_i + b^{(l)}\Big), \tag{2}$$

where $\tilde{A} = A + I$ with $I$ being the $n \times n$ identity matrix, and $d_i = \sum_{j=1}^{n} \tilde{A}_{ij}$ is the degree of token $i$ in the resulting graph.

Figure 2: Relation extraction with a graph convolutional network. The left side shows the overall architecture, while on the right side, we only show the detailed graph convolution computation for the word "relative" for clarity. A full unlabeled dependency parse of the sentence is also provided for reference.
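Stacking this operation over $L$ layers gives us a deep GCN network, where we set $h_1^{(0)}, \dots, h_n^{(0)}$ to be the input word vectors and use $h_1^{(L)}, \dots, h_n^{(L)}$ as the output word representations. Moreover, the propagation of information between tokens occurs in parallel, and the runtime does not depend on the depth of the dependency tree.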
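The normalized graph convolution with self-loops, stacked over several layers, can be sketched in a few lines of NumPy. This is a minimal illustration rather than the authors' implementation: the function names, the `heads` encoding of the tree (one head index per token, `-1` for the root), and the choice of ReLU are assumptions made for the example.

```python
import numpy as np

def adjacency_from_heads(heads):
    """Build a symmetric adjacency matrix from a head-index list (-1 marks the root)."""
    n = len(heads)
    A = np.zeros((n, n))
    for child, head in enumerate(heads):
        if head >= 0:
            A[child, head] = A[head, child] = 1.0
    return A

def gcn_layer(A, H, W, b):
    """One layer: relu(sum_j A~_ij W h_j / d_i + b), with rows of H as token vectors."""
    A_tilde = A + np.eye(A.shape[0])         # add self-loops
    d = A_tilde.sum(axis=1, keepdims=True)   # degree d_i of each token
    return np.maximum((A_tilde @ H @ W.T) / d + b, 0.0)

def gcn(A, H0, weights, biases):
    """Stack the layer L times; H0 holds the input word vectors."""
    H = H0
    for W, b in zip(weights, biases):
        H = gcn_layer(A, H, W, b)
    return H
```

Every layer is a pair of matrix products over the whole sentence, so no step depends on the depth of the tree.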
Note that the GCN model presented above uses the same parameters for all edges in the dependency graph. We also experimented with: (1) using different transformation matrices $W$ for top-down, bottom-up, and self-loop edges; and (2) adding dependency relation-specific parameters for edge-wise gating, similar to Marcheggiani and Titov (2017). We found that modeling directions does not lead to improvement, and adding edge-wise gating further hurts performance. We hypothesize that this is because the presented GCN model is usually already able to capture dependency edge patterns that are informative for classifying relations, and modeling edge directions and types does not offer additional discriminative power to the network before it leads to overfitting. For example, the relations entailed by "A's son, B" and "B's son, A" can be readily distinguished with "'s" attached to different entities, even when edge directionality is not considered.

Encoding Relations with GCN
We now formally define the task of relation extraction. Let $\mathcal{X} = [x_1, \dots, x_n]$ denote a sentence, where $x_i$ is the $i$-th token. A subject entity and an object entity are identified and correspond to two spans in the sentence: $\mathcal{X}_s = [x_{s_1}, \dots, x_{s_2}]$ and $\mathcal{X}_o = [x_{o_1}, \dots, x_{o_2}]$. Given $\mathcal{X}$, $\mathcal{X}_s$, and $\mathcal{X}_o$, the goal of relation extraction is to predict a relation $r \in \mathcal{R}$ (a predefined relation set) that holds between the entities or "no relation" otherwise.
After applying an $L$-layer GCN over word vectors, we obtain hidden representations of each token that are directly influenced by its neighbors no more than $L$ edges apart in the dependency tree. To make use of these word representations for relation extraction, we first obtain a sentence representation as follows (see also Figure 2 left):

$$h_{sent} = f\big(h^{(L)}\big) = f\big(\mathrm{GCN}(h^{(0)})\big), \tag{3}$$

where $h^{(l)}$ denotes the collective hidden representations at layer $l$ of the GCN, and $f : \mathbb{R}^{d \times n} \to \mathbb{R}^{d}$ is a max pooling function that maps from $n$ output vectors to the sentence vector. We also observe that information close to entity tokens in the dependency tree is often central to relation classification. Therefore, we also obtain a subject representation $h_s$ from $h^{(L)}$ as follows

$$h_s = f\big(h^{(L)}_{s_1:s_2}\big), \tag{4}$$

where $s_1$ and $s_2$ are the start and end positions of the subject span, as well as an object representation $h_o$ similarly. Inspired by recent work on relational learning between entities (Santoro et al., 2017; Lee et al., 2017), we obtain the final representation used for classification by concatenating the sentence and the entity representations, and feeding them through a feed-forward neural network (FFNN):

$$h_{final} = \mathrm{FFNN}\big([h_{sent}; h_s; h_o]\big). \tag{5}$$

This $h_{final}$ representation is then fed into a linear layer followed by a softmax operation to obtain a probability distribution over relations.
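The pooling and classification steps can be sketched as follows. This is an illustrative sketch under stated assumptions: tokens are stored as rows (so max pooling runs over axis 0), spans are inclusive `(start, end)` index pairs, and `ffnn`, `W_out`, and `b_out` are hypothetical stand-ins for the learned feed-forward network and output layer.

```python
import numpy as np

def max_pool(H):
    """Max over tokens (rows), one value per hidden dimension."""
    return H.max(axis=0)

def final_representation(H, subj_span, obj_span, ffnn):
    """Concatenate sentence, subject, and object vectors and apply the FFNN."""
    (s1, s2), (o1, o2) = subj_span, obj_span
    h_sent = max_pool(H)
    h_s = max_pool(H[s1:s2 + 1])   # pool over subject tokens only
    h_o = max_pool(H[o1:o2 + 1])   # pool over object tokens only
    return ffnn(np.concatenate([h_sent, h_s, h_o]))

def relation_probs(h_final, W_out, b_out):
    """Linear layer followed by a numerically stable softmax."""
    logits = W_out @ h_final + b_out
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()
```

Here `H` plays the role of the GCN output $h^{(L)}$; any callable mapping a vector to a vector can be passed as `ffnn`.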

Contextualized GCN
The network architecture introduced so far learns effective representations for relation extraction, but it also leaves a few issues inadequately addressed. First, the input word vectors do not contain contextual information about word order or disambiguation. Second, the GCN highly depends on a correct parse tree to extract crucial information from the sentence (especially when pruning is performed), while existing parsing algorithms produce imperfect trees in many cases.
To resolve these issues, we further apply a Contextualized GCN (C-GCN) model, where the input word vectors are first fed into a bi-directional long short-term memory (LSTM) network to generate contextualized representations, which are then used as $h^{(0)}$ in the original model. This BiLSTM contextualization layer is trained jointly with the rest of the network. We show empirically in Section 5 that this augmentation substantially improves the performance over the original model.
We note that this relation extraction model is conceptually similar to graph kernel-based models (Zelenko et al., 2003), in that it aims to utilize local dependency tree patterns to inform relation classification. Our model also incorporates crucial off-path information, which greatly improves its robustness compared to shortest dependency path-based approaches. Compared to tree-structured models (e.g., Tree-LSTM (Tai et al., 2015)), it not only is able to capture more global information through the use of pooling functions, but also achieves substantial speedup by not requiring recursive operations that are difficult to parallelize. For example, we observe that on a Titan Xp GPU, training a Tree-LSTM model over a minibatch of 50 examples takes 6.54 seconds on average, while training the original GCN model takes only 0.07 seconds, and the C-GCN model 0.08 seconds.

Incorporating Off-path Information with Path-centric Pruning
Dependency trees provide rich structures that one can exploit in relation extraction, but most of the information pertinent to relations is usually contained within the subtree rooted at the lowest common ancestor (LCA) of the two entities. Previous studies (Xu et al., 2015b; Miwa and Bansal, 2016) have shown that removing tokens outside this scope helps relation extraction by eliminating irrelevant information from the sentence. It is therefore desirable to combine our GCN models with tree pruning strategies to further improve performance. However, pruning too aggressively (e.g., keeping only the dependency path) could lead to loss of crucial information and conversely hurt robustness. For instance, the negation in Figure 1 is neglected when a model is restricted to only looking at the dependency path between the entities. Similarly, in the sentence "She was diagnosed with cancer last year, and succumbed this June", the dependency path She ← diagnosed → cancer is not sufficient to establish that cancer is the cause of death for the subject unless the conjunction dependency to succumbed is also present. Motivated by these observations, we propose path-centric pruning, a novel technique to incorporate information off the dependency path. This is achieved by including tokens that are up to distance $K$ away from the dependency path in the LCA subtree. $K = 0$ corresponds to pruning the tree down to the path, $K = 1$ keeps all nodes that are directly attached to the path, and $K = \infty$ retains the entire LCA subtree. We combine this pruning strategy with our GCN model by directly feeding the pruned trees into the graph convolutional layers. We show that pruning with $K = 1$ achieves the best balance between including relevant information (e.g., negation and conjunction) and keeping irrelevant content out of the resulting pruned tree as much as possible.
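Path-centric pruning can be sketched as below. This is a sketch under stated assumptions, not the authors' code: the tree is a `heads` list (`-1` for the root), entities are lists of token indices, and the "path" is taken to be the union of the chains from every entity token up to the LCA. The parse used in the usage note is hand-written for illustration.

```python
import math

def ancestors(heads, i):
    """Chain of tokens from i up to the root, inclusive."""
    chain = [i]
    while heads[chain[-1]] >= 0:
        chain.append(heads[chain[-1]])
    return chain

def path_centric_prune(heads, subj, obj, k):
    """Return sorted indices of tokens within distance k of the dependency path."""
    chains = [ancestors(heads, i) for i in list(subj) + list(obj)]
    common = set(chains[0]).intersection(*chains[1:])
    # The LCA is the first shared ancestor met when walking up from an entity token.
    lca = next(node for node in chains[0] if node in common)
    path = set()
    for chain in chains:
        path.update(chain[:chain.index(lca) + 1])
    kept = []
    for i in range(len(heads)):
        dist, node = 0, i
        while node >= 0 and node not in path:   # climb until the path is reached
            node, dist = heads[node], dist + 1
        if node >= 0 and dist <= k:             # node < 0: outside the LCA subtree
            kept.append(i)
    return kept
```

With the hypothetical parse `heads = [2, 2, -1, 4, 2, 6, 2, 8, 2, 10, 8]` for the tokens "She was diagnosed with cancer last year and succumbed this June", subject `[0]` and object `[4]`, `k=0` keeps only She, diagnosed, and cancer, while `k=1` also keeps "succumbed" (index 8). Passing `math.inf` as `k` keeps the whole LCA subtree.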

Related Work
At the core of fully-supervised and distantly-supervised relation extraction approaches are statistical classifiers, many of which find syntactic information beneficial. For example, Mintz et al. (2009) explored adding syntactic features to a statistical classifier and found them to be useful when sentences are long. Various kernel-based approaches also leverage syntactic information to measure similarity between training and test examples to predict the relation, finding that tree-based kernels (Zelenko et al., 2003) and dependency path-based kernels (Bunescu and Mooney, 2005) are effective for this task.
Recent studies have found neural models effective in relation extraction. Zeng et al. (2014) first applied a one-dimensional convolutional neural network (CNN) with manual features to encode relations. Vu et al. (2016) showed that combining a CNN with a recurrent neural network (RNN) through a voting scheme can further improve performance. Zhou et al. (2016) and Wang et al. (2016) proposed to use attention mechanisms over RNN and CNN architectures for this task.
Apart from neural models over word sequences, incorporating dependency trees into neural models has also been shown to improve relation extraction performance by capturing long-distance relations. Xu et al. (2015b) generalized the idea of dependency path kernels by applying an LSTM network over the shortest dependency path between entities. Liu et al. (2015) first applied a recursive network over the subtrees rooted at the words on the dependency path and then applied a CNN over the path. Miwa and Bansal (2016) applied a Tree-LSTM (Tai et al., 2015), a generalized form of LSTM over dependency trees, in a joint entity and relation extraction setting. They found it to be most effective when applied to the subtree rooted at the LCA of the two entities.
More recently,  and Zhang et al. (2017) have shown that relatively simple neural models (CNN and augmented LSTM, respectively) can achieve comparable or superior performance to dependency-based models when trained on larger datasets. In this paper, we study dependency-based models in depth and show that with a properly designed architecture, they can outperform and have complementary advantages to sequence models, even in a large-scale setting.
Finally, we note that a technique similar to path-centric pruning has been applied to reduce the space of possible arguments in semantic role labeling (He et al., 2018). The authors showed pruning words too far away from the path between the predicate and the root to be beneficial, but reported the best pruning distance to be 10, which almost always retains the entire tree. Our method differs in that it is applied to the shortest dependency path between entities, and we show that in our technique the best pruning distance is 1 for several dependency-based relation extraction models.

Baseline Models
We compare our models with several competitive dependency-based and neural sequence models.
Dependency-based models. In our main experiments we compare with three types of dependency-based models. (1) A logistic regression (LR) classifier which combines dependency-based features with other lexical features. (2) Shortest Dependency Path LSTM (SDP-LSTM) (Xu et al., 2015b), which applies a neural sequence model on the shortest path between the subject and object entities in the dependency tree. (3) Tree-LSTM (Tai et al., 2015), which is a recursive model that generalizes the LSTM to arbitrary tree structures. We investigate the child-sum variant of Tree-LSTM, and apply it to the dependency tree (or part of it). In practice, we find that modifying this model by concatenating dependency label embeddings to the input of forget gates improves its performance on relation extraction, and therefore use this variant in our experiments. Earlier, our group compared (1) and (2) with sequence models (Zhang et al., 2017), and we report these results; for (3) we report results with our own implementation.
Neural sequence model. Our group presented a competitive sequence model that employs a position-aware attention mechanism over LSTM outputs (PA-LSTM), and showed that it outperforms several CNN and dependency-based models by a substantial margin (Zhang et al., 2017). We compare with this strong baseline, and use its open implementation in further analysis.

Experimental Setup
We conduct experiments on two relation extraction datasets: (1) TACRED: Introduced by Zhang et al. (2017), TACRED contains over 106k mention pairs drawn from the yearly TAC KBP challenge. It represents 41 relation types and a special no_relation class when the mention pair does not have a relation between them within these categories. Mentions in TACRED are typed, with subjects categorized into person and organization, and objects into 16 fine-grained types (e.g., date and location). We report micro-averaged F1 scores on this dataset as is conventional. (2) SemEval 2010 Task 8: It contains 19 relation classes over untyped mention pairs: 9 directed relations and a special Other class. On SemEval, we follow the convention and report the official macro-averaged F1 scores.

For fair comparisons on the TACRED dataset, we follow the evaluation protocol used by Zhang et al. (2017), selecting the model with the median dev F1 from 5 independent runs and reporting its test F1. We also use the same "entity mask" strategy where we replace each subject (and object similarly) entity with a special SUBJ-<NER> token. For all models, we also adopt the "multichannel" strategy by concatenating the input word embeddings with POS and NER embeddings.
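The entity mask strategy and the median-run protocol are simple enough to sketch directly. The following is illustrative only: the function names are hypothetical, spans are assumed to be inclusive `(start, end)` pairs, and replacing every token in a span (rather than collapsing the span to one token) is an assumption.

```python
def mask_entities(tokens, subj_span, subj_type, obj_span, obj_type):
    """Replace entity tokens with SUBJ-<NER> / OBJ-<NER> placeholders."""
    masked = list(tokens)
    for (start, end), prefix, ner in (
        (subj_span, "SUBJ", subj_type),
        (obj_span, "OBJ", obj_type),
    ):
        for i in range(start, end + 1):
            masked[i] = f"{prefix}-{ner}"
    return masked

def select_median_run(runs):
    """runs: list of (dev_f1, test_f1) pairs; return the test F1 of the median-dev run."""
    ordered = sorted(runs, key=lambda run: run[0])
    return ordered[len(ordered) // 2][1]
```

Masking keeps the sentence length unchanged, so token positions and dependency heads stay aligned with the original parse.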
Traditionally, evaluation on SemEval is conducted without entity mentions masked. However, as we will discuss in Section 6.4, this method encourages models to overfit to these mentions and fails to test their actual ability to generalize. We therefore report results with two evaluation protocols: (1) with-mention, where mentions are kept for comparison with previous work; and (2) mask-mention, where they are masked to test the generalization of our model in a more realistic setting.
Due to space limitations, we report model training details in the supplementary material.

Results on the TACRED Dataset
We present our main results on the TACRED test set in Table 1. Our GCN model outperforms all dependency-based models by at least 1.6 F1. By using contextualized word representations, the C-GCN model further outperforms the strong PA-LSTM model by 1.3 F1, and achieves a new state of the art. In addition, we find our model improves upon other dependency-based models in both precision and recall. Comparing the C-GCN model with the GCN model, we find that the gain mainly comes from improved recall. We hypothesize that this is because the C-GCN is more robust to parse errors by capturing local word patterns (see also Section 6.2).

As we will show in Section 6.2, we find that our GCN models have complementary strengths when compared to the PA-LSTM. To leverage this result, we experiment with a simple interpolation strategy to combine these models. Given the output probabilities $P_G(r \mid x)$ from a GCN model and $P_S(r \mid x)$ from the sequence model for any relation $r$, we calculate the interpolated probability as

$$P(r \mid x) = \alpha \cdot P_G(r \mid x) + (1 - \alpha) \cdot P_S(r \mid x),$$

where $\alpha \in [0, 1]$ is chosen on the dev set and set to 0.6. This simple interpolation between a GCN and a PA-LSTM achieves an F1 score of 67.1, outperforming each model alone by at least 2.0 F1. An interpolation between a C-GCN and a PA-LSTM further improves the result to 68.2.
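The interpolation amounts to a weighted average of the two models' output distributions. A minimal sketch, assuming each model's output is a dictionary from relation label to probability (the function name and data layout are illustrative):

```python
def interpolate(p_gcn, p_seq, alpha=0.6):
    """Mix two distributions: alpha * P_G(r|x) + (1 - alpha) * P_S(r|x)."""
    return {r: alpha * p_gcn[r] + (1 - alpha) * p_seq[r] for r in p_gcn}

def predict(p_gcn, p_seq, alpha=0.6):
    """Return the relation with the highest interpolated probability."""
    mixed = interpolate(p_gcn, p_seq, alpha)
    return max(mixed, key=mixed.get)
```

Because both inputs sum to one and the weights sum to one, the mixture is itself a valid distribution, so no renormalization is needed.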

Results on the SemEval Dataset
To study the generalizability of our proposed model, we also trained and evaluated our best C-GCN model on the SemEval test set (Table 2). We find that under the conventional with-mention evaluation, our C-GCN model outperforms all existing dependency-based neural models on this separate dataset. Notably, by properly incorporating off-path information, our model outperforms the previous shortest dependency path-based model (SDP-LSTM). Under the mask-mention evaluation, our C-GCN model also outperforms PA-LSTM by a substantial margin, suggesting its generalizability even when entities are not seen.

Effect of Path-centric Pruning
To show the effectiveness of path-centric pruning, we compare the two GCN models and the Tree-LSTM when the pruning distance $K$ is varied. We experimented with $K \in \{0, 1, 2, \infty\}$ on the TACRED dev set, and also include results when the full tree is used. As shown in Figure 3, the performance of all three models peaks when $K = 1$, outperforming their respective dependency path-based counterparts ($K = 0$). This confirms our hypothesis in Section 3 that incorporating off-path information is crucial to relation extraction. Miwa and Bansal (2016) reported that a Tree-LSTM achieves similar performance when the dependency path and the LCA subtree are used respectively. Our experiments confirm this, and further show that the result can be improved by path-centric pruning with $K = 1$.
We find that all three models are less effective when the entire dependency tree is present, indicating that including extra information hurts performance. Finally, we note that contextualizing the GCN makes it less sensitive to changes in the tree structures provided, presumably because the model can use word sequence information in the LSTM layer to recover any off-path information that it needs for correct relation extraction.

Ablation Study
To study the contribution of each component in the C-GCN model, we ran an ablation study on the TACRED dev set (Table 3). We find that: (1) The entity representations and feedforward layers contribute 1.0 F1. (2) When we remove the dependency structure (i.e., setting $\tilde{A}$ to $I$), the score drops by 3.2 F1. (3) F1 drops by 10.3 when we remove the feedforward layers, the LSTM component and the dependency structure altogether. (4) Removing the pruning (i.e., using full trees as input) further hurts the result by another 9.7 F1.

Complementary Strengths of GCNs and PA-LSTMs
To understand what the GCN models are capturing and how they differ from a sequence model such as the PA-LSTM, we compared their performance over examples in the TACRED dev set. Specifically, for each model, we trained it for 5 independent runs with different seeds, and for each example we evaluated the model's accuracy over these 5 runs. For instance, if a model correctly classifies an example for 3 out of 5 times, it achieves an accuracy of 60% on this example. We observe that on 847 (3.7%) dev examples, our C-GCN model achieves an accuracy at least 60% higher than that of the PA-LSTM, while on 629 (2.8%) examples the PA-LSTM achieves 60% higher. This complementary performance explains the gain we see in Table 1 when the two models are combined. We further show that this difference is due to each model's competitive advantage (Figure 4): dependency-based models are better at handling sentences with entities farther apart, while sequence models can better leverage local word patterns regardless of parsing quality (see also Figure 6). We include further analysis in the supplementary material.

Table 4: The three dependency edges that contribute the most to the classification of different relations in the TACRED dev set. For clarity, we removed edges which 1) connect to common punctuation (i.e., commas, periods, and quotation marks), 2) connect to common prepositions (i.e., of, to, by), and 3) connect between tokens within the same entity. We use PER, ORG for entity types of PERSON, ORGANIZATION. We use S- and O- to denote subject and object entities, respectively. We also include edges for more relations in the supplementary material.
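The per-example comparison can be sketched as follows. This is an illustration, not the authors' analysis script: predictions are assumed to be stored as one list of labels per run, and the 60% threshold is read as an absolute difference in per-example accuracy.

```python
def per_example_accuracy(runs, gold):
    """runs: one list of predicted labels per run; returns accuracy for each example."""
    return [
        sum(run[i] == label for run in runs) / len(runs)
        for i, label in enumerate(gold)
    ]

def count_advantaged(acc_a, acc_b, margin=0.6):
    """Number of examples where model A beats model B by at least `margin`."""
    eps = 1e-9  # tolerate floating-point rounding in the subtraction
    return sum(1 for a, b in zip(acc_a, acc_b) if a - b >= margin - eps)
```

Calling `count_advantaged` twice with the arguments swapped gives the two counts compared in the text, one per model.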

Understanding Model Behavior
To gain more insights into the C-GCN model's behavior, we visualized the partial dependency tree it is processing and how much each token's final representation contributed to $h_{sent}$ (Figure 5). We find that the model often focuses on the dependency path, but sometimes also incorporates off-path information to help reinforce its prediction. The model also learns to ignore determiners (e.g., "the") as they rarely affect relation prediction.
To further understand what dependency edges contribute most to the classification of different relations, we scored each dependency edge by summing up the number of dimensions each of its connected nodes contributed to $h_{sent}$. We present the top scoring edges in Table 4. As can be seen in the table, most of these edges are associated with indicative nouns or verbs of each relation.
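One way to read this scoring rule: under max pooling, a token "contributes" a dimension of $h_{sent}$ when it holds the maximum value in that dimension. The sketch below follows that reading, which is an assumption; ties are broken in favor of the earliest token, and the function names are illustrative.

```python
import numpy as np

def token_contributions(H):
    """Count, per token, how many pooled dimensions it supplies (rows of H are tokens)."""
    winners = H.argmax(axis=0)   # index of the max token in each dimension
    return np.bincount(winners, minlength=H.shape[0])

def edge_scores(H, edges):
    """Score each (i, j) edge by the summed contributions of its two endpoints."""
    counts = token_contributions(H)
    return {(i, j): int(counts[i] + counts[j]) for i, j in edges}
```

Sorting the resulting dictionary by value gives a per-sentence edge ranking, which could then be aggregated per relation.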

Entity Bias in the SemEval Dataset
In our study, we observed a high correlation between the entity mentions in a sentence and its relation label in the SemEval dataset. We experimented with PA-LSTM models to analyze this phenomenon. We started by simplifying every sentence in the SemEval training and dev sets to "subject and object", where subject and object are the actual entities in the sentence. Surprisingly, a trained PA-LSTM model on this data is able to achieve 65.1 F1 on the dev set if GloVe is used to initialize word vectors, and 47.9 dev F1 even without GloVe initialization. To further evaluate the model in a more realistic setting, we trained one model with the original SemEval training set (unmasked) and one with mentions masked in the training set, following what we have done for TACRED (masked). While the unmasked model achieves an 83.6 F1 on the original SemEval dev set, F1 drops drastically to 62.4 if we replace dev set entity mentions with a special <UNK> token to simulate the presence of unseen entities. In contrast, the masked model is unaffected by unseen entity mentions and achieves a stable dev F1 of 74.7. This suggests that models trained without entities masked generalize poorly to new examples with unseen entities. Our findings call for more careful evaluation that takes dataset biases into account in future relation extraction studies.

Figure 6: Dev set examples where either the C-GCN (upper) or the PA-LSTM (lower) predicted correctly in five independent runs. For each example, the predicted and pruned dependency tree corresponding to K = 1 in path-centric pruning is shown, and the shortest dependency path is thickened. We omit edges to punctuation for clarity. The first example shows that the C-GCN is effective at leveraging long-range dependencies while reducing noise with the help of pruning (while the PA-LSTM predicts no_relation twice, org:alternate_names twice, and org:parents once in this case). The second example shows that the PA-LSTM is better at leveraging the proximity of the word "migrated" regardless of attachment errors in the parse (while the C-GCN is misled to predict per:country_of_birth three times, and no_relation twice).
Upper example: "ALBA - the Bolivarian Alternative for the Americas - was founded by Venezuelan President Hugo Chavez and Cuban leader Fidel Castro in 2004 and also includes Bolivia, Nicaragua and the Caribbean island of Dominica."
Lower example: "Bashardost was born in 1965 in the southern Ghanzi province and his family migrated to Iran and then to Pakistan after successive coup and factional fighting in Afghanistan."

Conclusion
We showed the success of a neural architecture based on a graph convolutional network for relation extraction. We also proposed path-centric pruning to improve the robustness of dependency-based models by removing irrelevant content without ignoring crucial information. We showed through detailed analysis that our model has complementary strengths to sequence models, and that the proposed pruning technique can be effectively applied to other dependency-based models.