Fine-Grained Entity Typing via Hierarchical Multi Graph Convolutional Networks

This paper addresses the problem of inferring the fine-grained type of an entity from a knowledge base. We convert this problem into the task of graph-based semi-supervised classification, and propose the Hierarchical Multi Graph Convolutional Network (HMGCN), a novel deep learning architecture to tackle it. We construct three kinds of connectivity matrices to capture different kinds of semantic correlations between entities. A recursive regularization is proposed to model the subClassOf relations between types in a given type hierarchy. Extensive experiments with two large-scale public datasets show that our proposed method significantly outperforms four state-of-the-art methods.


Introduction
Nowadays, Knowledge Bases (KBs) attract increasing research interest in various areas. One of the fundamental components of a KB is entity type information, which clusters groups of entities with the same properties and is the glue that holds our mental world together (Murphy, 2004). Traditional entity typing focuses on a small set of types, such as Person, Location and Organization (Ratinov and Roth, 2009; Nadeau and Sekine, 2007), while Fine-Grained Entity Typing assigns more specific types to an entity, which normally form a type-path in the type hierarchy of the KB (Ren et al., 2016). For example, Messi is classified with the following type-path: FootballPlayer ⊂ Athlete ⊂ Person. Types in a KB are usually organized in a hierarchical structure, namely the type hierarchy. Unfortunately, most KBs are incomplete and lack type information. For example, in DBpedia the average number of types per entity is only 2.9 (5,044,223 entities with 14,760,728 types), and 36.53% of entities have no type information at all. As Figure 1 shows, for each unlabeled entity we fully utilize its textual description, categories and properties to predict missing types. In this paper, we aim at assigning fine-grained types to entities in KBs such as Wikipedia and DBpedia.

Figure 1: Entity (i.e., Yao Ming) with textual description (with anchor text in red), property (e.g., bornIn Shanghai) and category (e.g., Houston Rockets players). Yao Ming is associated with the type-path: BasketballPlayer ⊂ Athlete ⊂ Person ⊂ Agent ⊂ Thing.

Much research has been carried out in this field. State-of-the-art methods normally learn a distributed representation for each entity and apply a multi-label classification model for type inference. For example, (Neelakantan and Chang, 2015) and (Xu et al., 2016) exploit various kinds of information, such as an entity's textual description, properties and categories, to construct its feature representation.
After that, a prediction function is learned to infer whether entity e is an instance of type t. Other works focus on the names of entities and the context of entity mentions (in text), and design two scoring models for pairs of entities and types. All these works ignore the internal relations among entities and assign types to each entity in isolation. (Jin et al., 2018) view the internal relations among entities as structural information, construct an Entity Graph, and propose a network embedding framework to learn the correlations among entities. It is profitable to combine entity features with entity graph structure, as recent studies have suggested bringing both node features and graph structure together in a convolutional manner (Defferrard et al., 2016; Atwood and Towsley, 2016). Among these works, the graph convolutional network (GCN) is the most widely-used model; it can directly operate on graphs of arbitrary size and shape (Kipf and Welling, 2016). The inputs of a GCN are the feature vectors of nodes and the graph structure. The most significant aspect of a GCN is information diffusion, by which a node's feature vector is enriched with the feature vectors of its neighbors.
Here, we convert the entities in a KB into three semantic graphs, each encoding a specific kind of correlation among entities: A_co, a graph of the topical co-occurrence relations among entities; A_cat, a category-based graph encoding the category proximity between entities; and A_prop, a property-based graph encoding the property proximity between entities. We propose the Hierarchical Multi Graph Convolutional Network (HMGCN), a novel deep learning architecture consisting of three Graph Convolutional Networks (GCNs): GCN_co, GCN_cat and GCN_prop. Each connectivity matrix is fed to its corresponding GCN model, and the three GCN models are learned with shared parameters and a consistency regularization. We adopt a simple and effective recursive regularization to deal with the subClassOf relations between types. The main contributions of this paper are as follows: • A graph convolutional network is applied to the fine-grained entity typing task, which effectively integrates entity features and structural information.
• Multiple connectivity matrices are constructed to encode different kinds of semantic relatedness between entities. A recursive regularization is proposed to model the subClassOf relations between types.
• Extensive experiments show that our proposed method significantly outperforms four state-of-the-art methods, with 1.7% and 1.3% improvements in Mi-F1 and Ma-F1 on the FIGER dataset, respectively.
The rest of this paper is organized as follows. Section 2 formally defines the problem of Fine-Grained Entity Typing in knowledge bases. Section 3 describes the proposed approach in detail. Section 4 reports a number of experiments with evaluations. Section 5 outlines related work. Section 6 concludes our work.

Problem Formulation
In this section, we formally define the problem of Fine-Grained Entity Typing in KB.
Definition 1 Knowledge Base KB = (E, T, R), where E, T and R are sets of entities, types and isA relations, respectively; |E| = N and |T| = K. R = R_i ∪ R_s consists of instanceOf and subClassOf triples.
Note that not every entity in a knowledge base KB is assigned types. We split the entity set E into two subsets E_l and E_u, with E = E_l ∪ E_u and E_l ∩ E_u = ∅. E_l denotes the set of entities that have type information, and E_u denotes the set of entities without type information; |E_l| = N_l and |E_u| = N_u. The types in T form a tree or a directed acyclic graph (DAG) via subClassOf relations, which we refer to as the type hierarchy H = (T, R_s).
For each entity, we distinguish three kinds of features, namely, Text Description, Category, and Property, which are beneficial to Fine-Grained Entity Typing task.
• Text Description is a succinct summary of an entity, providing valuable clues for predicting entity types. As Figure 1 shows, textual descriptions contain a large number of anchor texts.
• Category refers, in Wikipedia, to tags/topics that group entities on similar subjects. For instance, Yao Ming has the Wikipedia category Olympic basketball players of China, which is a topic of the entity rather than its type. Categories are nonetheless useful for inferring missing type information.
• Property of an entity is an important cue for type inference. For instance, from Yao Ming's property playing position, one may infer that Yao Ming is an Athlete.
Besides the entity features above, there are many kinds of semantic correlations between entities, which organize entities into graph structures. We refer to such a structure as an entity graph. Each kind of semantic correlation corresponds to a connectivity matrix. There are many ways to construct an entity graph via different connectivity schemes, which are discussed in Section 3.
Definition 2 Entity Graph G = (A, X, Y). A ∈ R^(N×N) is the connectivity matrix representing a kind of link relation between entities. X ∈ R^(N×M) is the feature matrix representing the inherent features of all entities, where M is the number of features. Each row X_i of X is an entity's feature vector; X can be binary or real-valued. Y ∈ R^(N_l×K) is the type matrix collecting the type information of the entities in E_l. Each row of Y encodes the instanceOf relations between an entity and all types: Y_ij = 1 if e_i has been assigned type t_j, and 0 otherwise.
The task of Fine-Grained Entity Typing can be formally described as follows: given a partially-labeled entity graph G = (A, X, Y), we aim at learning a type predictor from Y that comprehensively takes both graph structure and entity features into account. We then use the learned predictor to predict the missing instanceOf relations in E_u, i.e., whether (e_i, instanceOf, t_j) holds. In this way, we convert the task of Fine-Grained Entity Typing into a graph-based semi-supervised classification task.

Hierarchical Multi Graph Convolutional Network (HMGCN)
State-of-the-art methods convert the entities in a KB into an entity graph and then apply graph-based algorithms. The difficulty is twofold: • Firstly, it is hard to effectively integrate entity features and structural information. Although network embedding methods, in particular attributed network embedding, can take both graph structure and features into consideration (Huang et al., 2017), node features usually serve only as auxiliary information in structure learning (Yang et al., 2016; Liao et al., 2017).
• Secondly, it is not easy to effectively integrate different connectivity matrices that capture different kinds of structural information (Zhuang and Ma, 2018).
We propose the Hierarchical Multi GCN model (HMGCN) to encode heterogeneous structural information in an ensemble manner, as illustrated in Figure 2. We construct three undirected entity graphs to capture different kinds of semantic correlations between entities: the co-occurrence graph A_co, the category-based graph A_cat, and the property-based graph A_prop. A_co encodes the topical relevance between entities and is derived from anchor texts. A_cat is constructed through similarity computation over category information, based on the assumption that entities with similar categories tend to have the same type. The property-based graph A_prop is constructed in the same way. Each entity graph is fed to its corresponding GCN model, namely GCN_co, GCN_cat and GCN_prop. These models share parameters and are tied together by an unsupervised consistency regularization, so that they jointly consider the opinions from the three semantic perspectives. In addition, a simple and effective recursive regularization is adopted to deal with the subClassOf relations between types.

GCN-Based Entity Type Classification
The GCN model consists of multiple stacked GCN layers. Given the input feature matrix X ∈ R^(N×M) and the adjacency matrix A ∈ R^(N×N), the output Z^(i) of the i-th hidden layer is defined as

Z^(i) = σ(D̃^(-1) Ã Z^(i-1) W^(i)),    (1)

where Ã = A + I_N is the adjacency matrix with added self-connections, D̃ is the diagonal degree matrix of Ã, Z^(i-1) is the output of the (i−1)-th layer with Z^(0) = X, W^(i) are the trainable parameters of the network, and σ(·) denotes an activation function (e.g., ReLU, Sigmoid). A detailed explanation of the GCN model is given in the original work (Kipf and Welling, 2016).
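The layer propagation rule can be sketched in a few lines of NumPy. This is an illustrative implementation under the random-walk normalization D̃^(-1)Ã used above; the function name and the toy graph are ours, not from the paper.

```python
import numpy as np

def gcn_layer(A, Z_prev, W, activation=np.tanh):
    """One GCN layer: Z_i = activation(D_inv @ A_tilde @ Z_prev @ W).

    Uses the random-walk normalisation D_inv @ A_tilde with self-loops
    (A_tilde = A + I), matching the 1-hop diffusion described in the text.
    """
    A_tilde = A + np.eye(A.shape[0])             # add self-loops
    D_inv = np.diag(1.0 / A_tilde.sum(axis=1))   # inverse degree matrix
    return activation(D_inv @ A_tilde @ Z_prev @ W)

# toy example: 3 entities, 2 input features, 4 hidden units
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.random.rand(3, 2)
W1 = np.random.rand(2, 4)
Z1 = gcn_layer(A, X, W1)   # shape (3, 4)
```

Stacking two such calls (with a second weight matrix W2) gives the 2-hop diffusion used in our implementation.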
Since fine-grained typing is a multi-label classification problem, the final layer connects to a set of K Sigmoid functions, which correspond to the K types in the type hierarchy. Given the set of labeled entities E_l, our model optimizes the cross-entropy between the true type distribution and the predicted distribution:

L_s = − Σ_{e_i ∈ E_l} Σ_{j=1..K} [ Y_ij log Ẑ_ij + (1 − Y_ij) log(1 − Ẑ_ij) ],    (2)

where Y_ij indicates whether entity e_i belongs to type t_j, and Ẑ_ij is the probability predicted by the GCN.

Figure 2: Framework of HMGCN. We convert the entities in a KB into multiple kinds of semantic graphs. The co-occurrence graph A_co is constructed from the anchor texts in the textual descriptions. The category-based graph A_cat is derived from category proximity, and the property-based graph A_prop from property proximity. Each graph is fed to its corresponding GCN model with the same feature matrix X. The three GCN models are learned with shared parameters and a consistency regularization. A hierarchical regularization is used to deal with the subClassOf relations between types.
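The supervised objective can be written directly as a multi-label cross-entropy; the sketch below uses our own variable names, and the clipping constant is an implementation detail we add for numerical stability.

```python
import numpy as np

def supervised_loss(Y, Z_hat, eps=1e-12):
    """Multi-label cross-entropy over the labeled entities.

    Y     : (N_l, K) binary type matrix
    Z_hat : (N_l, K) sigmoid outputs of the final GCN layer
    """
    Z_hat = np.clip(Z_hat, eps, 1 - eps)  # avoid log(0)
    return -np.sum(Y * np.log(Z_hat) + (1 - Y) * np.log(1 - Z_hat))

# two entities, two types
Y = np.array([[1, 0], [0, 1]], dtype=float)
Z_hat = np.array([[0.9, 0.1], [0.2, 0.8]])
loss = supervised_loss(Y, Z_hat)  # small positive number
```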
The role of D̃^(-1)Ã is to conduct exactly a 1-hop diffusion process in each layer: a node's feature vector is enriched by linearly adding the feature vectors of all its neighbors. With more hidden layers, an entity's features can diffuse to further entities (2-hop, 3-hop and so on). Research shows that there is no significant gain when the number of layers exceeds 3 (Kipf and Welling, 2016). Here we assume the adjacency matrix A and feature matrix X are given; in the remainder of this section, we show how to compute A and X.

Connectivity Matrix Designing
In knowledge graphs, there are large numbers of descriptive anchor texts that are helpful for type inference. For an entity e, CXT(e) is the set of all entities that occur in its textual description. For example, in Figure 1, Houston Rockets is in CXT(YaoMing). If e_i ∈ CXT(e_j), we say that e_i occurs in e_j's context, or that e_j cites e_i in its textual description. This co-occurrence information among entities can be explicitly extracted and represented by the connectivity matrix A_co:

A_co[i, j] = 1 if e_i ∈ CXT(e_j) or e_j ∈ CXT(e_i), and 0 otherwise.    (3)

A_co captures co-occurrence relations, though not precisely enough on its own. For example, Houston Rockets and NBA both occur in Yao Ming's context, but they do not belong to the same type.
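Constructing A_co from the description links is straightforward; the sketch below assumes entities are already indexed (the mapping and the toy indices are hypothetical).

```python
def build_a_co(cxt):
    """Build the symmetric co-occurrence matrix A_co.

    cxt maps each entity index j to the set of entity indices that
    appear (as anchor text) in e_j's textual description; the matrix
    is symmetrized so the resulting graph is undirected.
    """
    n = len(cxt)
    A = [[0] * n for _ in range(n)]
    for j, linked in cxt.items():
        for i in linked:
            A[i][j] = A[j][i] = 1   # e_i occurs in e_j's context
    return A

# entity 0 = Yao Ming, 1 = Houston Rockets, 2 = NBA (hypothetical indices)
A_co = build_a_co({0: {1, 2}, 1: set(), 2: set()})
# A_co[0][1] == 1, while A_co[1][2] == 0
```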
Besides A_co, we calculate the correlation between entities via category information. If entities e_i and e_j share common categories, they are likely to belong to the same type. For example, in Figure 2, Houston Rockets and Shanghai Sharks both have the category Basketball team clubs, and both have the type BasketballTeam. We construct A_cat and use the Jaccard similarity coefficient to calculate each element A_cat[i, j]:

A_cat[i, j] = |Cat(e_i) ∩ Cat(e_j)| / |Cat(e_i) ∪ Cat(e_j)|,    (4)

in which Cat(e) is the category set of entity e. In the same way, we calculate the correlation between entities via property information and construct A_prop, whose elements are calculated as follows:

A_prop[i, j] = |Prop(e_i) ∩ Prop(e_j)| / |Prop(e_i) ∪ Prop(e_j)|,    (5)
in which Prop(e) is the property set of entity e. Each connectivity matrix is fed to its corresponding GCN model, i.e., GCN_co, GCN_cat and GCN_prop, respectively. As mentioned in Section 2, each entity is associated with a text description, categories and properties. Since the category and property information is used to construct connectivity matrices, the words appearing in an entity's text description can be approximately viewed as the entity's features. We apply the fastText method, which treats the average of word/n-gram embeddings as the entity feature (Joulin et al., 2016), to construct the feature matrix X. As shown in Figure 2, GCN_co, GCN_cat and GCN_prop all take X as the input feature matrix.
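The Jaccard-based construction of A_cat (and, identically, A_prop) can be sketched as follows. The toy category sets are illustrative; whether the diagonal is set to 0 or 1 is an implementation choice we make here, not something the paper specifies.

```python
def jaccard(s1, s2):
    """Jaccard coefficient |s1 & s2| / |s1 | s2| (0 if both sets are empty)."""
    union = s1 | s2
    return len(s1 & s2) / len(union) if union else 0.0

def build_similarity_matrix(attr_sets):
    """Entry [i][j] is the Jaccard similarity of the attribute sets of
    e_i and e_j; works for both category and property sets."""
    n = len(attr_sets)
    return [[jaccard(attr_sets[i], attr_sets[j]) if i != j else 0.0
             for j in range(n)] for i in range(n)]

cats = [{"Basketball team clubs", "Texas"},       # Houston Rockets
        {"Basketball team clubs", "Shanghai"}]    # Shanghai Sharks
A_cat = build_similarity_matrix(cats)
# A_cat[0][1] == 1/3 (one shared category out of three distinct ones)
```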

Parameters Sharing and Consistency Regularization
The success of our model largely relies on the strategy that the three GCN models share common parameters (i.e., the neural network weights W^(i) in Eq. 1), as shown in Figure 2. By doing so, our model (characterized by the parameters W^(i)) can simultaneously consider the knowledge encoded in A_co, A_cat and A_prop. The predictions of GCN_co and GCN_cat are denoted Ẑ_co and Ẑ_cat ∈ R^(N×K), respectively. To jointly consider the opinions of GCN_co and GCN_cat, we apply an unsupervised regularizer for the ensemble, minimizing the mean squared difference between Ẑ_co and Ẑ_cat over all N entities:

L_cat = (1/N) Σ_{i=1..N} ||Ẑ_co,i − Ẑ_cat,i||²,    (6)
The prediction of GCN_prop is denoted Ẑ_prop. In the same way, we minimize the mean squared difference between Ẑ_co and Ẑ_prop over all N entities:

L_prop = (1/N) Σ_{i=1..N} ||Ẑ_co,i − Ẑ_prop,i||²,    (7)
In this way, our model jointly considers the opinions of GCN_co, GCN_cat and GCN_prop. Although they share the same parameters W^(i), the three GCN models take different connectivity matrices as input, which capture different semantic correlations between entities. This difference may cause the predictions of the three GCN models to disagree. Our model is expected to give a consistent prediction via the proposed unsupervised consistency regularizations, i.e., by minimizing Eqs. 6 and 7. As a result, the learned parameter matrices W^(i) incorporate the judgments of GCN_co, GCN_cat and GCN_prop.
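The consistency regularizer is a plain mean squared difference between two prediction matrices; a minimal sketch (averaging over entities, which is our reading of "mean squared difference over all N entities"):

```python
import numpy as np

def consistency_loss(Z_a, Z_b):
    """Mean (over entities) of the squared difference between two
    GCNs' prediction vectors."""
    return np.mean(np.sum((Z_a - Z_b) ** 2, axis=1))

# two entities, two types
Z_co  = np.array([[0.9, 0.1], [0.2, 0.8]])
Z_cat = np.array([[0.8, 0.2], [0.3, 0.7]])
l = consistency_loss(Z_co, Z_cat)  # 0.02
```

Both regularizers (Ẑ_co vs. Ẑ_cat and Ẑ_co vs. Ẑ_prop) reuse this same function.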

Recursive Hierarchical Regularization
Leaf nodes in the type hierarchy may have insufficient training examples. In that case, their decisions can be regularized by their parents if a type hierarchy is available. We therefore introduce dependency relations among types to improve classification performance. Similar to (Peng et al., 2018; Gopal and Yang, 2013), we use a recursive regularization over the final layer: two types should have similar embeddings if they are close in the hierarchy; in particular, types related by a subClassOf relation should have similar embeddings. For example, in Figure 1 there is an edge between Athlete and BasketballPlayer, so the parameters of the two types should be similar to each other. Assuming there are n hidden layers, the last-layer parameter matrix W^(n) can be regarded as the type embedding matrix; each row of W^(n) is a type representation. Let C(t) denote the set of sub-types of t (t's children in the hierarchy). We use the following recursive regularization term to regularize the parameters of each type:

L_h = Σ_{t ∈ T} Σ_{t_c ∈ C(t)} (1/2) ||W^(n)_t − W^(n)_{t_c}||²,    (8)

where W^(n)_t is the row of W^(n) corresponding to type t.
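The recursive regularizer sums squared distances between each type's embedding and those of its children; a sketch with hypothetical type names and a 2-dimensional embedding:

```python
import numpy as np

def recursive_reg(W_types, children):
    """Recursive hierarchical regularizer.

    W_types  : dict mapping type -> final-layer weight vector (type embedding)
    children : dict mapping type -> list of its direct sub-types C(t)
    Penalizes the squared distance between each type's embedding and
    those of its children, pulling subClassOf-related types together.
    """
    loss = 0.0
    for t, kids in children.items():
        for c in kids:
            diff = W_types[t] - W_types[c]
            loss += 0.5 * float(diff @ diff)
    return loss

W = {"Person":  np.array([1.0, 0.0]),
     "Athlete": np.array([0.8, 0.1])}
reg = recursive_reg(W, {"Person": ["Athlete"]})  # 0.5 * (0.2^2 + 0.1^2)
```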

Model Training
We calculate the supervised loss over all labeled entities for GCN_co (Eq. 2). Although Ẑ_co,i is regarded as the final type prediction, our model considers the opinions of all three GCN models via parameter sharing and the consistency regularizations. The total loss is the sum of the supervised loss, the two consistency regularizations and the recursive regularization:

L = L_s + λ(t)(L_cat + L_prop) + λ L_h,    (9)

in which λ(t) is a dynamic weight function, and λ is chosen from a fixed set based on performance on the dev set. At the beginning of the training process λ(t) is small, so the loss function is dominated by L_s and HMGCN is inclined to agree with GCN_co, which encodes the co-occurrence relations between entities. As λ(t) increases over time, it forces our model to simultaneously consider the knowledge encoded in GCN_cat and GCN_prop, so these two models play a more important role in decision making. We adopt Adam (Kingma and Ba, 2014) to minimize the loss function. For each e_i ∈ E_u, we use Ẑ_co,i as the type prediction result.
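Combining the four terms with a ramped weight can be sketched as follows. The linear ramp-up is our own illustrative choice of λ(t); the paper's actual weight functions are discussed in Section 4.4.

```python
def total_loss(l_s, l_cat, l_prop, l_hier, epoch, max_epoch, lam=0.2):
    """Total training objective: supervised loss, two consistency
    regularizers scaled by a dynamic weight lambda(t), and the
    recursive hierarchical regularizer scaled by a fixed lambda.

    lambda(t) ramps up from 0 to lam over training, so the supervised
    loss dominates early and the consistency terms gain influence later.
    """
    ramp = min(1.0, epoch / max_epoch)   # simple linear ramp-up
    lam_t = lam * ramp
    return l_s + lam_t * (l_cat + l_prop) + lam * l_hier

early = total_loss(1.0, 0.5, 0.5, 0.1, epoch=0,   max_epoch=100)
late  = total_loss(1.0, 0.5, 0.5, 0.1, epoch=100, max_epoch=100)
# early == 1.02 (only the supervised and hierarchical terms count)
# late  == 1.22 (consistency terms fully weighted)
```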

Datasets and Metrics
Datasets: Two public large-scale datasets are available for Fine-Grained Entity Typing in KBs: FIGER 1 and DBpedia (Zhang et al., 2015). FIGER is a widely used dataset proposed by Yaghoobzadeh et al. The DBpedia dataset was originally constructed by picking 14 non-overlapping types for text classification; we expand these 14 types into a hierarchy. For each dataset, we extract entity features from DBpedia 2. For FIGER and DBpedia, we split the entities into train (50%), dev (20%) and test (30%) sets. Since the DBpedia dataset contains 630,000 entities, which is too large for computation, we divide it equally into three parts (DB-1, DB-2 and DB-3). Table 1 shows statistics of the two datasets. Our source code is available 3 for reference; more detailed information can be found there. Metrics: To evaluate the performance of our method, we use Accuracy (Strict-F1), Micro-averaged F1 (Mi-F1) and Macro-averaged F1 (Ma-F1), which have been used in many fine-grained typing systems (Ling and Weld, 2012; Ren et al., 2016).

Methods for Comparison
Baselines: we compare HMGCN with four state-of-the-art methods and four variants of HMGCN: • CUTE (Xu et al., 2016) utilizes three kinds of entity features (category, property and property-value pairs) and employs a hierarchical multi-label classification method.
• MuLR applies embeddings of words, entities and types to the entity typing task, using multi-level representations of entities via character, word and entity embedding technologies.
• FIGMENT introduces a global model and a context model that provide complementary information for entity typing.
• APE (Jin et al., 2018) applies an attributed and predictive entity embedding method to learn entity representations, thereby taking both graph structure and entity features into account.
• HMGCN variants: HMGCN_no-cat uses only GCN_co and GCN_prop for type inference; HMGCN_no-prop uses only GCN_co and GCN_cat; HMGCN_no-co uses only GCN_cat and GCN_prop; HMGCN_flat ignores the correlations between types, i.e., removes the hierarchical recursive regularization.
Parameter Settings: HMGCN is implemented based on the original GCN model (Kipf and Welling, 2016). In our implementation, GCN_co, GCN_cat and GCN_prop all have two hidden layers, i.e., there are two separate parameter matrices, W^(1) and W^(2), to be trained. Table 2 presents detailed settings for each dataset, including (1) the number of hidden units, (2) the dropout rate, and (3) the learning rate η. λ is chosen from {0.00625, 0.0125, 0.025, 0.05, 0.1, 0.2, 0.4, 0.8}; we found that λ = 0.2 achieves the best performance. We set λ(t) in several ways, as discussed in Section 4.4. We train all models for a maximum of 200 epochs using Adam (Kingma and Ba, 2014) with early stopping (window size 10). Tables 3 and 4 show the overall performance on the two datasets, from which we draw the following conclusions:

Overall Comparison Results
Comparison with Entity Typing methods. Our model significantly outperforms the state-of-the-art entity typing methods over all datasets. HMGCN outperforms the best baseline by 1.7% and 1.3% in Mi-F1 and Ma-F1 on FIGER, by 3.2% and 3.6% on DB-1, by 3.9% and 3.5% on DB-2, and by 3.7% and 3.8% on DB-3, respectively. Compared with the other methods, the performance of CUTE is relatively poor, as it ignores the rich structural information in both the entity graph and the type hierarchy. FIGMENT and MuLR achieve better performance, in part because powerful embedding technologies are applied and decisions are made from both a global view and a context view. APE considers graph structure and entity features in an attributed network embedding manner, while HMGCN does so in a convolutional manner, which can embed the graph knowledge more thoroughly (Kipf and Welling, 2016; Zhuang and Ma, 2018).
Comparison with Variants. HMGCN consistently outperforms all four variants, because the three GCN models in HMGCN provide complementary information. The key to HMGCN is parameter sharing (W^(i) in GCN_co, GCN_cat and GCN_prop) and the unsupervised regularizer for the ensemble. The three GCNs encode different kinds of semantic correlations between entities, so HMGCN can trade their opinions off in decision making; as a result, the trained parameters W^(i) incorporate the opinions of all three GCNs. HMGCN_no-co and HMGCN_no-prop perform better than HMGCN_no-cat, which suggests that category information is more important than co-occurrence and property information. HMGCN_flat ignores the correlations between types (i.e., the subClassOf relations), while HMGCN takes the structure of the type hierarchy as prior knowledge. Infrequent types benefit from the recursive regularization in HMGCN: when an infrequent type has few training examples, its decision can be regularized by its parent (super-type).

Result Analysis
Effects of Labeled Data. Intuitively, our model can learn better embeddings and achieve better performance with more labeled entities. We investigate the effect of the labeled-data proportion.
The results show that all three metrics increase strikingly as more labeled data is added, and gradually become stable when the proportion exceeds 0.5, as illustrated in Figure 3(a). This shows that our model can achieve satisfactory results even without abundant labeled data, an advantage that stems from the information diffusion of the GCN model, i.e., similar entities share information with each other.

Results on Frequent/Infrequent Types. We evaluate the performance on frequent types (frequency > 3,000; 15 types) and infrequent types (frequency < 200; 36 types). The classification measure is the type macro-average F1 (F1 of the entities assigned to a type, averaged over types) (Yosef et al., 2012); note that it differs from the Ma-F1 reported in Table 3. Generally, the performance on infrequent types is worse than on frequent ones. Our model consistently outperforms the other methods on infrequent types, which demonstrates its ability to deal with rare types. The results for infrequent and frequent types on the FIGER dataset are illustrated in Figure 3(b).

Effect of the Regularization Weight λ(t). Our model uses a weight function λ(t) to balance the trade-off between the supervised loss and the unsupervised consistency regularizer. We devised several different weight functions (Zhuang and Ma, 2018), as shown in Figure 3(c); here t is the number of epochs. f_i (1 ≤ i ≤ 4) increase at different speeds as the number of epochs increases, while f_5 decreases. f_i (1 ≤ i ≤ 4) strikingly outperform f_5, as shown in Figure 3(d), which indicates that the knowledge embedded in GCN_cat and GCN_prop is beneficial to the entity typing task. At the beginning of the training process the supervised loss plays the leading role; after several epochs, λ(t) forces our model to simultaneously consider the knowledge encoded in GCN_cat and GCN_prop.
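The paper's exact f_1..f_4 are not reproduced here; as an illustration, an increasing/decreasing weight pair of the kind compared above (a Gaussian-style ramp-up, common in consistency training, standing in for f_1..f_4, and a linear ramp-down standing in for f_5) might look like:

```python
import math

def gaussian_rampup(t, T):
    """Increasing weight: near 0 at the start of training, 1.0 at epoch T."""
    if t >= T:
        return 1.0
    x = 1.0 - t / T
    return math.exp(-5.0 * x * x)

def linear_rampdown(t, T):
    """Decreasing weight, analogous to the inferior f_5 variant."""
    return max(0.0, 1.0 - t / T)

w0, wT = gaussian_rampup(0, 100), gaussian_rampup(100, 100)
# w0 is close to 0, wT == 1.0
```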

Related Work
Fine-grained entity typing in KBs is an important sub-task of knowledge base completion. Most existing methods perform type inference by utilizing an entity's inherent features (textual description, properties and categories) or its mentions in text (anchor texts). (Neelakantan and Chang, 2015) use the text description as the entity feature representation and design a global objective function to predict missing entity types in a KB. (Xu et al., 2016) use property and category information, along with a multi-label hierarchical classifier, to assign DBpedia types to Chinese entities. (Yaghoobzadeh and Schütze, 2015) first propose FIGMENT to address this problem, using only contextual information to assign types to entities in a KB. They later present FIGMENT-Multi, which learns multi-level representations of entities on three complementary levels (character, word and entity) and predicts whether an entity is a member of a type based on the learned embeddings. Finally, they propose an embedding-based method that combines a global model, which scores aggregated context information, with a context model, which aggregates the scores of individual contexts. Recent research converts the entities in a KB into an entity graph and applies graph-based algorithms on it. (Jin et al., 2018) use links between entities to construct an entity graph and jointly utilize entity features and graph structure for type inference; they apply an attributed and predictive network embedding method to encode entity features and graph structure, after which a multi-label classifier carries out type classification.

Conclusion
We convert the task of Fine-Grained Entity Typing in KBs into a graph-based semi-supervised classification task, and propose the Hierarchical Multi Graph Convolutional Network (HMGCN), which fully utilizes entity features, entity graph structure and type hierarchy structure to address it. Three GCN models are devised to embed multiple kinds of semantic correlations between entities, and a recursive regularization is proposed to make the model aware, to a certain extent, of the structure of the type hierarchy. Experiments on two real-world datasets demonstrate the effectiveness of the proposed model.