Joint Type Inference on Entities and Relations via Graph Convolutional Networks

We develop a new paradigm for the task of joint entity relation extraction. It first identifies entity spans, then performs joint inference on entity types and relation types. To tackle the joint type inference task, we propose a novel graph convolutional network (GCN) running on an entity-relation bipartite graph. By introducing a binary relation classification task, we are able to utilize the structure of the entity-relation bipartite graph in a more efficient and interpretable way. Experiments on ACE05 show that our model outperforms existing joint models in entity performance and is competitive with the state-of-the-art in relation performance.


Introduction
Extracting entities and relations from plain text is an important and challenging task in natural language processing. Given a sentence, the task aims to detect text spans with specific types (entities) and semantic relations among those text spans (relations). For example, in Figure 1, "Toefting" is a person entity (PER), "teammates" is a person entity (PER), and the two entities have a Person-Social relation (PER-SOC).
To tackle the task of entity relation extraction, various methods have been proposed, which can be divided into two categories: pipeline models and joint models. Pipeline models extract entities and relations in two stages: entities are first extracted by an entity model, and then these extracted entities are used as the inputs of a relation model. Pipeline models often ignore interactions between the two models and suffer from error propagation. Joint models integrate information between entities and relations into a single model through joint training, and have achieved better results than pipeline models. In this paper, we focus on joint models.
More and more joint methods have been applied to this task. Among them, Miwa and Bansal (2016) and Katiyar and Cardie (2017) identify entities with a sequence labelling model and identify relation types with a multi-class classifier. These joint methods perform joint learning by sharing parameters, but they have no explicit interaction in type inference. In addition, some complex joint decoding algorithms (e.g., simultaneously decoding entities and relations in beam search) have been carefully investigated, including Li and Ji (2014), Zhang et al. (2017), Zheng et al. (2017) and Wang et al. (2018). They jointly handle span detection and type inference to achieve more interactions.
By inspecting the performance of existing models (Sun et al., 2018) on ACE05, we find that, for many entities, the spans are correctly identified but the entity types are wrong. In particular, the F1 of extracting typed entities is about 83%, while the F1 of extracting entity spans is about 90%. Thus, with a better type inference model, we may get better joint extraction performance. At the same time, we observe that joint inference on entity and relation types could potentially be better than predicting them independently. For example, in Figure 1, the PER-SOC relation suggests that the type of "Toefting" might be PER, and vice versa. Moreover, the PER entity ("Toefting") and the PER-SOC relation could benefit from other relations such as PHYS.
In this paper, we decompose joint entity relation extraction into two sub-tasks: entity span detection and entity relation type deduction. We treat entity span detection as a sequence labeling problem. For joint type inference, we propose a novel and concise joint model based on graph convolutional networks (GCNs) (Kipf and Welling, 2017). The two sub-models are trained jointly. Specifically, given all detected entity spans in a sentence, we define an entity-relation bipartite graph. For each entity span, we assign an entity node. For each entity-entity pair, we assign a relation node. Edges connect relation nodes and their entity nodes (last part of Figure 1). With efficient graph convolution operations, we can learn representations for entity nodes and relation nodes by recursively aggregating information from their neighborhood over the bipartite graph. This helps us to concisely capture information among entities and relations. For example, in Figure 1, to predict the PER ("Toefting"), our joint model can pool the information of PER-SOC, PHYS, PER ("teammates") and GPE ("capital").
To further utilize the structure of the graph, we also propose assigning different weights to graph edges. In particular, we introduce a binary relation classification task, which is to determine whether two entities form a valid relation. Different from previous GCN-based models (Shang et al., 2018; Zhang et al., 2018), the adjacency matrix of the graph is based on the output of binary relation classification, which makes the proposed adjacency matrix more explanatory. To summarize, the main contributions of this work are:
• We present a novel and concise joint model to handle the joint type inference problem based on graph convolutional networks (GCNs).
• We introduce a binary relation classification task to explore the structure of the entity-relation bipartite graph in a more efficient and interpretable way.
• We show that the proposed joint model achieves the best entity performance on ACE05, and is competitive with the state-of-the-art in relation performance.
Our implementation is available at https://github.com/changzhisun/AntNRE.

Background of GCN
In this section, we briefly describe graph convolutional networks (GCNs). Given a graph with n nodes, the goal of GCNs is to learn structure-aware node representations on the graph. A GCN takes as inputs:
• an n × d input node embedding matrix H, where n is the number of nodes and d is the dimension of the input node embeddings;
• an n × n matrix representation of the graph structure, such as the adjacency matrix A (or some function thereof).
In an L-layer GCN, every layer can be written as a non-linear function
$$H^{(l+1)} = \sigma\left(\hat{A} H^{(l)} W^{(l)}\right)$$
with $H^{(0)} = H$, where $\hat{A} = D^{-\frac{1}{2}} A D^{-\frac{1}{2}}$ is the symmetrically normalized adjacency matrix and $W^{(l)}$ is a parameter matrix for the l-th GCN layer. $D$ is the diagonal node degree matrix, where $D_{ii} = \sum_j A_{ij}$, and $\sigma$ is a non-linear activation function such as ReLU. Finally, we obtain a node-level output $Z = H^{(L)}$, which is an n × d feature matrix.
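The layer-wise propagation described in this section can be illustrated with a minimal sketch. It is written in plain Python to stay dependency-free and is not the authors' implementation; the helper names `matmul`, `normalize_adjacency` and `gcn_layer` are hypothetical, and ReLU is assumed as the activation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_adjacency(adj):
    """Return D^{-1/2} A D^{-1/2}, where D_ii is the sum of row i of A."""
    degrees = [sum(row) for row in adj]
    # Isolated nodes (degree 0) get a zero scaling factor to avoid division by zero.
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0 for d in degrees]
    n = len(adj)
    return [[inv_sqrt[i] * adj[i][j] * inv_sqrt[j] for j in range(n)] for i in range(n)]


def gcn_layer(adj_hat, h, w):
    """One GCN layer: ReLU(adj_hat @ h @ w)."""
    z = matmul(matmul(adj_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]


# Toy usage: two connected nodes with self-loops, one-hot inputs, identity weights.
a_hat = normalize_adjacency([[1.0, 1.0], [1.0, 1.0]])
output = gcn_layer(a_hat, [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

In the toy usage each node ends up with the average of both input rows, which is the neighborhood aggregation the section describes. Stacking L such calls, each with its own weight matrix, gives the L-layer output Z.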

Approach
We first define the joint entity relation extraction task. Given a sentence $s = w_1, \ldots, w_{|s|}$ (where $w_i$ is a word), the task is to extract a set of entity spans $E$ with specific types and a set of relations $R$. An entity span $e \in E$ is a sequence of words labeled with an entity type $y$ (e.g., person (PER), organization (ORG)). A relation $r$ is a quintet $(e_1, y_1, e_2, y_2, l)$, where $e_1$ and $e_2$ are two entity spans with specific types $y_1$ and $y_2$, and $l$ is a relation type describing the semantic relation between the two entities (e.g., organization affiliation relation (ORG-AFF)). Let $T_e$ and $T_r$ be the sets of possible entity types and relation types, respectively.
In this work, we decompose the joint entity relation extraction task into two parts, namely, entity span detection and entity relation type deduction. We first treat entity span detection as a sequence labelling task (Section 3.1), and then construct an entity-relation bipartite graph (Section 3.2) to perform joint type inference on entity nodes and relation nodes (Section 3.3). All sub-models share parameters and are trained jointly. Different from existing joint learning algorithms (Sun et al., 2018; Zhang et al., 2017; Katiyar and Cardie, 2017; Miwa and Bansal, 2016), we propose a concise joint model to perform joint type inference on entities and relations based on GCNs. It considers interactions among multiple entity types and relation types simultaneously in a sentence.

Entity Span Detection
To extract entity spans from a sentence (Figure 2), we adopt the BILOU sequence tagging scheme: B, I, L and O denote the beginning, inside, last and outside of a target span, and U denotes a single-word span. For example, for a person (PER) entity "Patrick McDowell", we assign B to "Patrick" and L to "McDowell".
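As an illustration of the BILOU scheme, the following sketch converts between token-level spans and tag sequences. The function names and the span convention (inclusive token indices, non-overlapping spans) are assumptions made for this example only.

```python
def spans_to_bilou(num_tokens, spans):
    """Encode non-overlapping (start, end) spans, end inclusive, as BILOU tags."""
    tags = ["O"] * num_tokens
    for start, end in spans:
        if start == end:
            tags[start] = "U"  # single-word span
        else:
            tags[start] = "B"
            for i in range(start + 1, end):
                tags[i] = "I"
            tags[end] = "L"
    return tags


def bilou_to_spans(tags):
    """Decode a BILOU tag sequence back into (start, end) spans, end inclusive."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "U":
            spans.append((i, i))
            start = None
        elif tag == "B":
            start = i
        elif tag == "L" and start is not None:
            spans.append((start, i))
            start = None
        elif tag == "O":
            start = None  # drop any unfinished span
    return spans


# "Patrick McDowell" occupying tokens 0-1 of a hypothetical four-token sentence.
tags = spans_to_bilou(4, [(0, 1)])
```

Note that the tags here carry no entity type: in this paradigm the span detector only finds boundaries, and types are inferred later on the graph.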
Given an input sentence s, we use a bidirectional long short-term memory (biLSTM) network (Hochreiter and Schmidhuber, 1997) with parameters $\theta_{seq}$ to incorporate information from both the forward and backward directions of s,
$$h_i = \mathrm{biLSTM}(x_i; \theta_{seq}),$$
where $h_i$ is the concatenation of a forward and a backward LSTM's hidden states at position i, and $x_i$ is the word representation of $w_i$, which contains pre-trained embeddings and character-based word representations generated by running a CNN on the character sequence of $w_i$. Then, we employ a softmax output layer to predict $w_i$'s tag $\hat{t}_i$,
$$P(\hat{t}_i \mid s) = \mathrm{Softmax}(W_{span} h_i),$$
where $W_{span}$ is the parameter. Given an input sentence s and its gold tag sequence $t = t_1, \ldots, t_{|s|}$, the training objective is to minimize
$$\mathcal{L}_{span} = -\frac{1}{|s|} \sum_{i=1}^{|s|} \log P(\hat{t}_i = t_i \mid s).$$
Figure 2: The biLSTM model for entity span detection.

Entity-Relation Bipartite Graph
Given a set of detected entity spans $\hat{E}$ (obtained from the entity span tag sequence $\hat{t}$), we consider all entity span pairs in $\hat{E}$ as candidate relations. Then we build a heterogeneous undirected bipartite graph $G_s$ which contains the entity nodes and relation nodes of a sentence s. In the graph $G_s$, interactions among multiple entity types and relation types can be explicitly modeled. The number of nodes n in the graph $G_s$ is the number of entity spans $|\hat{E}|$ plus the number of all candidate relations, $\frac{|\hat{E}|(|\hat{E}|-1)}{2}$. We have an initial input node embedding matrix H. For a relation $r_{12}$ and its two entities $e_1$, $e_2$, we use $H_{r_{12}}$ to denote the relation embedding of $r_{12}$, and use $H_{e_1}$, $H_{e_2}$ to denote the entity embeddings of $e_1$, $e_2$ respectively.
Next, we build edges between entity nodes and relation nodes. We connect every relation node to its two entity nodes, and we do not directly connect entity nodes to each other or relation nodes to each other. Thus we focus on the bipartite graph. The reasons are twofold. a) We do not think that all the remaining entities in the sentence are helpful. Relation nodes are bridges between entity nodes and vice versa. b) GCNs are not suitable for fully-connected graphs because they reduce to rather trivial operations on such graphs. This means that, for an entity node e, the only way to observe other entities is through the relations which e takes part in. For example, given a relation node $r_{12}$ and its two entity nodes $e_1$, $e_2$, we add two edges: one between $e_1$ and $r_{12}$, and another between $e_2$ and $r_{12}$. We refer to this as the static graph.
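The static graph construction can be sketched as follows. This is an illustrative layout rather than the authors' code: the node ordering (entity nodes first, then one relation node per unordered entity pair) and the function name `build_static_graph` are assumptions, and the self-loops on the diagonal follow the convention stated for the adjacency matrix in this section.

```python
from itertools import combinations


def build_static_graph(num_entities):
    """Build the static entity-relation bipartite graph.

    Nodes 0..num_entities-1 are entity nodes. The k-th candidate pair gets
    the relation node with index num_entities + k.
    Returns (pairs, adjacency matrix as a list of rows).
    """
    pairs = list(combinations(range(num_entities), 2))
    n = num_entities + len(pairs)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1.0  # assumed self-loop so a node keeps its own features
    for k, (e1, e2) in enumerate(pairs):
        r = num_entities + k
        for e in (e1, e2):
            # Undirected edge between a relation node and each of its entities.
            adj[e][r] = 1.0
            adj[r][e] = 1.0
    return pairs, adj


# Three detected spans give three candidate relations and six nodes in total.
pairs, adj = build_static_graph(3)
```

Because there are no entity-entity or relation-relation edges, an entity node can only receive information from another entity through a shared relation node, which takes two propagation steps.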
To further exploit the structure of the graph (a form of prior knowledge) rather than using a static graph, we also investigate a dynamic graph that prunes redundant edges. The key intuition is that if two entities hold a relation, we add two edges between the relation node and its two entity nodes. Conversely, if two entities have no relation, we keep the two entity nodes and the relation node disconnected. To this end, we introduce a binary relation classification task, which aims to predict whether any relation exists between an entity span pair (ignoring the specific relation type). We build a binary relation model that predicts a label in {0, 1} indicating the existence of a candidate relation, based on the relation node embedding. Given a relation node $r_{ij}$ in a sentence s, to get the posterior of the binary relation label $\hat{b}$, we apply a softmax layer on the relation node embedding,
$$P(\hat{b} \mid r_{ij}, s) = \mathrm{Softmax}(W_{bin} H_{r_{ij}}),$$
where $W_{bin}$ is the parameter. The training objective is to minimize
$$\mathcal{L}_{bin} = -\frac{1}{\#\,\text{candidate relations } r_{ij}} \sum_{r_{ij}} \log P(\hat{b} = b \mid r_{ij}, s),$$
where the true binary annotations b are derived from the original typed relation labels. Formally, the adjacency matrix A is defined as follows:
• if $P(\hat{b} = 1 \mid r_{ij}, s) > 0.5$, we set the value of A between entity nodes $e_i$, $e_j$ and relation node $r_{ij}$ to 1.0,
• the diagonal elements of A are set to 1.0,
• all other elements are set to 0.0.
To compare with the hard binary-valued A, we also try a soft-valued A in experiments. In this case, we set the value of A between entity nodes $e_i$, $e_j$ and relation node $r_{ij}$ to the probability $P(\hat{b} = 1 \mid r_{ij}, s)$; the diagonal elements are still set to 1.0.
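The hard and soft variants of the dynamic adjacency matrix can be sketched in a few lines. The function name, argument layout and node ordering (entity nodes first, then one relation node per candidate pair) are assumptions for illustration; `probs` stands in for the binary classifier's predicted probability that each candidate pair holds some relation.

```python
def build_dynamic_adjacency(num_entities, pairs, probs, soft=False, threshold=0.5):
    """Build the dynamic adjacency matrix from binary relation probabilities.

    pairs[k] = (i, j) are the entity indices of the k-th candidate relation,
    whose relation node has index num_entities + k. probs[k] is the predicted
    probability that the k-th pair holds a relation.
    """
    n = num_entities + len(pairs)
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        adj[i][i] = 1.0  # diagonal elements are always 1.0
    for k, ((e1, e2), p) in enumerate(zip(pairs, probs)):
        r = num_entities + k
        if soft:
            weight = p  # soft graph: use the probability itself
        else:
            weight = 1.0 if p > threshold else 0.0  # hard graph: prune or keep
        for e in (e1, e2):
            adj[e][r] = weight
            adj[r][e] = weight
    return adj


# Hypothetical probabilities for three candidate pairs among three entities.
candidate_pairs = [(0, 1), (0, 2), (1, 2)]
hard = build_dynamic_adjacency(3, candidate_pairs, [0.9, 0.2, 0.6])
soft = build_dynamic_adjacency(3, candidate_pairs, [0.9, 0.2, 0.6], soft=True)
```

In the hard graph the second relation node is left isolated apart from its self-loop, while in the soft graph it still exchanges a small amount of information with its two entities.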
Here, we describe how to compute the two types of contextualized node embeddings in the graph $G_s$: entity node embeddings and relation node embeddings.
Entity Node Embedding Given an entity span $e \in \hat{E}$, for each word $w_i \in e$, we first collect $w_i$'s biLSTM hidden vector $h_i$ from the entity span model. Then, we use a CNN (a single convolution layer with a max-pooling layer) followed by a multi-layer perceptron (MLP) on the vectors $\{h_i \mid w_i \in e\}$ to obtain the resulting d-dimensional entity span node embedding $H_e$ (H is the input node embedding matrix introduced in Section 2), as shown in the left part of Figure 4.
Relation Node Embedding Given a candidate relation $r_{12}$, we extract two types of features, namely, features regarding the words in $e_1$, $e_2$ and features regarding the contexts of the entity span pair $(e_1, e_2)$. For features on the words in $e_1$, $e_2$, we simply use the entity node embeddings $H_{e_1}$ and $H_{e_2}$. For context features of the entity span pair $(e_1, e_2)$, we build three feature vectors by looking at the words between $e_1$ and $e_2$, the words to the left of the pair and the words to the right of the pair. As with entity nodes, we build these three features by running another CNN with an MLP. Finally, the five feature vectors are concatenated into a single vector. To get the d-dimensional relation node embedding $H_{r_{12}}$, we apply an MLP on this vector, as shown in the right part of Figure 4.
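The three context regions used for relation node features can be made concrete with a small sketch. It only shows how a token sequence is partitioned; the function name and the inclusive, non-overlapping span convention are assumptions, and in the model the CNN and MLP would operate on the biLSTM vectors of these regions rather than on raw tokens.

```python
def split_context(tokens, span1, span2):
    """Split a sentence into left, middle and right contexts of an entity pair.

    Spans are (start, end) token indices with end inclusive and must not overlap.
    """
    (s1, e1), (s2, e2) = sorted([span1, span2])  # order the two spans by position
    left = tokens[:s1]           # words to the left of the pair
    middle = tokens[e1 + 1:s2]   # words between the two spans
    right = tokens[e2 + 1:]      # words to the right of the pair
    return left, middle, right


# Hypothetical sentence with entity spans at token 1 and tokens 3-4.
tokens = ["a", "b", "c", "d", "e", "f"]
left, middle, right = split_context(tokens, (1, 1), (3, 4))
```

Any of the three regions can be empty, for example when the first entity starts the sentence, so a real feature extractor would need a padding or default vector for that case.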

Joint Type Inference
After building the entity-relation bipartite graph, we feed the graph into a multi-layer GCN to obtain the node-level output Z. Each row of Z (an entity or relation node representation) can gather and summarize information from other nodes in the graph $G_s$, even though there are no direct entity-entity or relation-relation edges in the graph. The final node representation F of graph $G_s$ is then the concatenation of the input node embedding H and the node-level output Z (H, Z and F are matrices).
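Given an entity node $e_i$ and a relation node $r_{ij}$, to predict the corresponding node types, we pass the resulting node representations into two fully connected layers with a softmax function, respectively,
$$P(\hat{y} \mid e_i, s) = \mathrm{Softmax}(W_{ent} F_{e_i}),$$
$$P(\hat{l} \mid r_{ij}, s) = \mathrm{Softmax}(W_{rel} F_{r_{ij}}),$$
where $W_{ent}$, $W_{rel}$ are parameters. The training objective is to minimize
$$\mathcal{L}_{ent} = -\frac{1}{|\hat{E}|} \sum_{e_i \in \hat{E}} \log P(\hat{y} = y \mid e_i, s), \tag{5}$$
$$\mathcal{L}_{rel} = -\frac{1}{\#\,\text{candidate relations } r_{ij}} \sum_{r_{ij}} \log P(\hat{l} = l \mid r_{ij}, s),$$
where the true labels $y$ and $l$ can be read from the annotations, as shown in Figure 3.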

Training
To train the joint model, we optimize the combined objective function $\mathcal{L} = \mathcal{L}_{span} + \mathcal{L}_{bin} + \mathcal{L}_{ent} + \mathcal{L}_{rel}$, where joint training is achieved through the shared parameters. We employ the scheduled sampling strategy (Bengio et al., 2015) in the entity model, similar to Miwa and Bansal (2016). We optimize our model using Adadelta (Zeiler, 2012) with gradient clipping. The network is regularized with dropout. Within a fixed number of epochs, we select the model with the best relation performance on the development set.
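The combined objective can be illustrated with a minimal, framework-free sketch. It assumes each sub-task loss is an average negative log-likelihood over softmax outputs and that the four losses are summed with equal weight, as the formula for the combined objective states; the function names and the toy scores are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mean_nll(score_rows, gold_indices):
    """Average negative log-likelihood of the gold labels under softmax."""
    losses = [-math.log(softmax(row)[g]) for row, g in zip(score_rows, gold_indices)]
    return sum(losses) / len(losses)


def joint_loss(span, binary, entity, relation):
    """Sum of the four sub-task losses; each argument is (score_rows, gold_indices)."""
    return sum(mean_nll(scores, gold) for scores, gold in (span, binary, entity, relation))


# Toy example: every sub-task has one item with two equally scored classes.
uniform = ([[0.0, 0.0]], [0])
total = joint_loss(uniform, uniform, uniform, uniform)
```

In an actual implementation the four score matrices would all be computed from shared biLSTM and CNN parameters, so a single backward pass through this sum updates every sub-model.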

Experiments
We conduct experiments on the ACE05 dataset, which is a standard corpus for the entity relation extraction task. It includes 7 entity types and 6 relation types between entities. We use the same data split of ACE05 documents as previous work (351 training, 80 development and 80 testing) (Miwa and Bansal, 2016). We evaluate performance using precision (P), recall (R) and F1 scores, following Miwa and Bansal (2016) and Sun et al. (2018). Specifically, an output entity (e, y) is correct if its type y and the region of its head e are correct, and an output relation r is correct if its $(e_1, y_1, e_2, y_2, l)$ are all correct (i.e., exact match).
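The exact-match evaluation can be sketched as a set comparison. This is a simplified illustration, not the official scorer: predictions and gold annotations are assumed to be hashable tuples, such as (span, type) for entities or $(e_1, y_1, e_2, y_2, l)$ for relations, and the example items are made up.

```python
def precision_recall_f1(predicted, gold):
    """Exact-match precision, recall and F1 between two collections of tuples."""
    predicted, gold = set(predicted), set(gold)
    correct = len(predicted & gold)  # items that match the gold standard exactly
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical entities as ((start, end), type): one of two predictions is correct.
pred = [((0, 0), "PER"), ((3, 4), "ORG")]
gold = [((0, 0), "PER"), ((3, 4), "GPE"), ((6, 6), "PER")]
p, r, f1 = precision_recall_f1(pred, gold)
```

Under this criterion a relation with the right relation type but a wrong entity type counts as an error, which is why entity typing quality directly bounds relation performance.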
In this paper, the default setting "GCN" is the 1-layer GCN-based joint model with the dynamic hard adjacency matrix, which achieves the best relation performance on ACE05 dataset.

End-to-End Results on ACE05
First, we compare the proposed models with previous work in Table 1. In general, our "GCN" achieves the best entity performance (84.2%) compared with existing joint models. For relation performance, our "GCN" significantly outperforms all joint models except for Sun et al. (2018), which uses a more complex joint decoder. Compared with our basic neural network "NN", our "GCN" yields large improvements on both entities and relations. These observations demonstrate the effectiveness of our "GCN" for capturing information on multiple entity types and relation types from a sentence.
Table 1: [...] 2018) are joint decoding algorithms. Miwa and Bansal (2016) and Katiyar and Cardie (2017) are joint training systems without joint decoding. "NN" is our neural network model without GCN. "GCN" is the dynamic hard GCN-based neural network. We omit pipeline methods, which underperform joint models (see Li and Ji (2014) for details).
Compared to the state-of-the-art method, which adopts minimum risk training (Sun et al., 2018), our "GCN" has better entity performance and comparable relation performance. Different from existing joint decoding systems, we do not use complex joint decoding algorithms such as beam search (Li and Ji, 2014), global normalization (Zhang et al., 2017) and minimum risk training (Sun et al., 2018). Our models only rely on sharing parameters, similar to Miwa and Bansal (2016) and Katiyar and Cardie (2017). It is worth noting that the precision of our "GCN" is high compared to all the other methods. We attribute this to its strong ability to model feature representations of entity nodes and relation nodes.
Next, we evaluate our model with different settings. As mentioned in Section 3.2, we have three types of graph: "GCN (static)", "GCN (dynamic + hard)" and "GCN (dynamic + soft)". The last three rows of Table 3 show their performance. We have three observations regarding Table 3.
1. Compared with the "Sun (NN)" model, which is the base neural network without minimum risk training (Sun et al., 2018), our "NN" performs 0.5 points better on entities. One reason might be that the entity type model and the relation type model share more parameters (the entity CNN+MLP parameters), while "Sun (NN)" only shares biLSTM hidden states. However, our "NN" is only within 0.6 points of "Sun (NN)" on relations. One possible reason might be that we do not use the output entity types as features for relation type classification.
2. After introducing graph convolutional networks, all three GCN-based models improve the performance on entities and relations. Specifically, "GCN (static)" improves slightly on relations. "GCN (dynamic + soft)" achieves a 0.7 percent improvement on relations and has the same entity performance. "GCN (dynamic + hard)" improves the entity performance (0.4 percent) and achieves a large improvement (1.9 percent) in relation performance. It is competitive with the state-of-the-art model (Sun et al., 2018). These observations show that the proposed joint model is effective for joint type inference on entities and relations, and also support the rationale of the proposed dynamic graph, as expected.
3. The performance on entity spans and binary relations is close across all proposed models. One possible reason is that these are more coarse-grained tasks, so effective features can be easily extracted by all models. It is worth noting that the performance on binary relations is not very good. Our dynamic graph relies on the binary relation detection task, and how to improve binary relation performance remains a hard question. We leave it as future work.
Thirdly, we present the influence of the number of GCN layers (Table 2). We take "GCN (dynamic + hard)" as an example. In general, the performance on the four tasks is insensitive to the number of GCN layers. In particular, the performance on entity spans, entities and relations fluctuates by 1.0 points, and the binary relation performance fluctuates by 1.4 points. Interestingly, we find that the one-layer GCN achieves the best relation performance even though its performance on the other three tasks is not the best. One possible reason is that all the models are closely related to each other. However, how they affect each other in this joint setting is still an open question.
Fourthly, we examine the relation performance with respect to the number of relations in each sentence (Figure 5). In general, our GCN-based models outperform "NN" in almost all cases when the number of relations is larger than 2. This suggests that the proposed GCN-based models are more suitable for handling multiple relations in a sentence. We think our method will perform better on datasets with complex multiple relations, which are very common in practice.
Finally, we compare the "NN" model with the "GCN" model on some concrete examples, as shown in Table 5.
For S2, the "NN" does not detect the PART-WHOLE relation while the "GCN" correctly finds it. These two observations show that our "GCN" is good at dealing with situations where multiple relations share common entities, as expected. For S3, our "GCN" identifies a PHYS relation between "[units] PER " and "[capital] GPE ", while the "NN" does not find this relation even though the entities are correct. However, neither model identifies the ART relation between "[units] PER " and "[weapons] WEA ". We think more advanced methods which use more powerful graph neural networks might be helpful in this situation.

Golden Entity Results on ACE05
In order to compare with relation classification methods, we evaluate our models with golden entities on the ACE05 corpus in Table 4. We use the same data split to compare with their models (Miwa and Bansal, 2016; Christopoulou et al., 2018). We do not tune hyperparameters extensively. For example, we use the same setting in both the end-to-end and golden entity experiments rather than tuning parameters for each of them. The baseline systems are Miwa and Bansal (2016) and Christopoulou et al. (2018).
In general, our "NN" is competitive with the dependency tree-based state-of-the-art model (Miwa and Bansal, 2016). This shows that our CNN-based neural networks are able to extract more powerful features to help the relation extraction task. After adding GCN, our GCN-based models achieve better performance. This indicates that the proposed models can achieve large improvements without any external syntactic tools.

Table 5: Examples from the ACE05 dataset with label annotations from the "NN" model and the "GCN" model for comparison. ♥ is the gold standard, and ♣ and ♠ are the outputs of the "NN" and "GCN" models, respectively.
S3: a red line may have been drawn around the ART-2:♥ once u.s. and allied troops cross it .

Related Work
There have been extensive studies of the entity relation extraction task. Early work employs a pipeline of methods that extracts entities first, and then determines their relations (Zelenko et al., 2003; Miwa et al., 2009; Chan and Roth, 2011; Lin et al., 2016). As pipeline approaches suffer from error propagation, researchers have proposed methods for joint entity relation extraction.
Parameter sharing is a basic strategy for joint extraction. For example, Miwa and Bansal (2016) propose a neural method comprised of a sentence-level RNN for extracting entities, and a dependency tree-based RNN to predict relations. Their relation model takes hidden states of the entity model as features (i.e., the shared parameters). Similarly, Katiyar and Cardie (2017) use a simplified relation model based on the entity RNN using the attention mechanism. These joint methods perform joint learning by sharing parameters, but they have no explicit interaction in type inference.
To further explore interactions between the entity decoder and the relation decoder, many studies focus on joint decoding algorithms. An ILP-based joint decoder (Yang and Cardie, 2013), a CRF-based joint decoder (Katiyar and Cardie, 2016), a joint sequence labelling tag set (Zheng et al., 2017), beam search (Li and Ji, 2014), global normalization (Zhang et al., 2017), and a transition system (Wang et al., 2018) have been investigated. Different from these models, we propose a novel and concise joint model to handle joint type inference based on graph convolutional networks, which can capture information between multiple entity types and relation types explicitly. In addition, transfer learning (Sun and Wu, 2019) and multi-task learning (Sanh et al., 2018) have also been studied for this task. In order to make a fair comparison, we do not include these models in our experiments.
Recently, research on graph neural networks (GNNs) has been receiving more and more attention because of the great expressive power of graphs (Cai et al., 2018; Battaglia et al., 2018; Zhou et al., 2018). The graph convolutional network (GCN) is one of the typical variants of GNN (Bruna et al., 2013; Defferrard et al., 2016; Kipf and Welling, 2017). It has been successfully applied to many NLP tasks such as text classification (Yao et al., 2018), semantic role labeling (Marcheggiani and Titov, 2017), relation extraction (Zhang et al., 2018), machine translation (Bastings et al., 2017) and knowledge base completion (Shang et al., 2018). We note that most previous applications of GCN focus on a single task, while joint entity relation extraction consists of multiple sub-tasks. Investigating GCN in joint learning scenarios is the main topic of this work. A closely related work is Christopoulou et al. (2018), which focuses on relation extraction with golden entities. Our work can be viewed as an end-to-end extension of their work.

Conclusion
We propose a novel and concise joint model based on GCN to perform joint type inference for the entity relation extraction task. Compared with existing joint methods, it provides a new way to capture the interactions among multiple entity types and relation types explicitly in a sentence. Experiments on the ACE05 dataset show the effectiveness of the proposed method.

Figure 1: An example from ACE05. The first part contains annotations and the second part is the entity-relation graph of the sentence used in GCN.

Figure 5: F1 scores with respect to the number of relations for each sentence. The numbers in parentheses are counts of sentences in the ACE05 test set.
Figure 3: Our network structure for joint entity and relation extraction based on GCN. The node embedding extractor computes $H_e$ and $H_r$.

Table 2: Results on the ACE05 development set with respect to the number of GCN layers.

Table 4: Results on the ACE05 dataset with golden entities.