Fine-Grained Entity Typing in Hyperbolic Space

How can we represent hierarchical information present in large type inventories for entity typing? We study the suitability of hyperbolic embeddings to capture hierarchical relations between mentions in context and their target types in a shared vector space. We evaluate on two datasets and propose two different techniques to extract hierarchical information from the type inventory: from an expert-generated ontology and by automatically mining the dataset. The hyperbolic model shows improvements in some but not all cases over its Euclidean counterpart. Our analysis suggests that the adequacy of this geometry depends on the granularity of the type inventory and the representation of its distribution.


Introduction
Entity typing classifies textual mentions of entities according to their semantic class. The task has progressed from finding company names (Rau, 1991), to recognizing coarse classes (person, location, organization, and other; Tjong Kim Sang and De Meulder, 2003), to fine-grained inventories of about one hundred types, with finer-grained types proving beneficial in applications such as relation extraction (Yaghoobzadeh et al., 2017) and question answering (Yavuz et al., 2016). The trend towards larger inventories has culminated in ultra-fine and open entity typing with thousands of classes (Choi et al., 2018; Zhou et al., 2018).
However, large type inventories pose a challenge for the common approach of casting entity typing as a multi-label classification task (Yogatama et al., 2015; Shimaoka et al., 2016), since exploiting inter-type correlations becomes more difficult as the number of types increases. A natural solution for dealing with a large number of types is to organize them in a hierarchy ranging from general, coarse types such as "person" near the top, to more specific, fine types such as "politician" in the middle, to even more specific, ultra-fine entity types such as "diplomat" at the bottom (see Figure 1). By virtue of such a hierarchy, a model learning about diplomats will be able to transfer this knowledge to related entities such as politicians.
Prior work integrated hierarchical entity type information by formulating a hierarchy-aware loss (Ren et al., 2016; Murty et al., 2018; Xu and Barbosa, 2018) or by representing words and types in a joint Euclidean embedding space (Shimaoka et al., 2017; Abhishek et al., 2017). Noting that it is impossible to embed arbitrary hierarchies in Euclidean space, Nickel and Kiela (2017) propose hyperbolic space as an alternative and show that hyperbolic embeddings accurately encode hierarchical information. Intuitively (and as explained in more detail in Section 2), this is because distances in hyperbolic space grow exponentially as one moves away from the origin, just like the number of elements in a hierarchy grows exponentially with its depth.
While the intrinsic advantages of hyperbolic embeddings are well-established, their usefulness in downstream tasks is, so far, less clear. We be- In this work, we address both of these issues. Using ultra-fine grained entity typing (Choi et al., 2018) as a test bed, we first show how to incorporate hyperbolic embeddings into a neural model (Section 3). Then, we examine the impact of the hierarchy, comparing hyperbolic embeddings of an expert-generated ontology to those of a large, automatically-generated one (Section 4). As our experiments on two different datasets show (Section 5), hyperbolic embeddings improve entity typing in some but not all cases, suggesting that their usefulness depends both on the type inventory and its hierarchy. In summary, we make the following contributions:
1. We develop a fine-grained entity typing model that embeds both entity types and entity mentions in hyperbolic space.
2. We compare two different entity type hierarchies, one created by experts (WordNet) and one generated automatically, and find that their adequacy depends on the dataset.
3. We study the impact of replacing the Euclidean geometry with its hyperbolic counterpart in an entity typing model, finding that the improvements of the hyperbolic model are noticeable on ultra-fine types.

Background: Poincaré Embeddings
If we consider the origin O and two points, x and y, moving towards the outside of the disk, i.e. ‖x‖, ‖y‖ → 1, the distance d_H(x, y) tends to d_H(x, O) + d_H(O, y). That is, the path between x and y converges to a path through the origin. This behaviour can be seen as the continuous analogue to a (discrete) tree-like hierarchical structure, where the shortest path between two sibling nodes goes through their common ancestor.
As an alternative intuition, note that the hyperbolic distance between points grows exponentially as points move away from the center. This mirrors the exponential growth of the number of nodes in trees with increasing depths, thus making hyperbolic space a natural fit for representing trees and hence hierarchies (Krioukov et al., 2010; Nickel and Kiela, 2017).
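As a rough numerical illustration of both intuitions, the sketch below evaluates the standard Poincaré-ball distance, arcosh(1 + 2‖u − v‖² / ((1 − ‖u‖²)(1 − ‖v‖²))), for two points on orthogonal axes as their norm approaches one. The helper names (`poincare_distance`, `through_origin_ratio`) and the choice of points are illustrative assumptions, not part of the original experiments.

```python
import math

def poincare_distance(u, v):
    """Hyperbolic distance between two points of the open unit ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))

ORIGIN = (0.0, 0.0)

def through_origin_ratio(r):
    """d(x, y) divided by d(x, O) + d(O, y) for two points of norm r on orthogonal axes."""
    x, y = (r, 0.0), (0.0, r)
    direct = poincare_distance(x, y)
    via_origin = poincare_distance(x, ORIGIN) + poincare_distance(ORIGIN, y)
    return direct / via_origin

for r in (0.5, 0.9, 0.99, 0.999):
    # The distance to the origin keeps growing as r approaches 1,
    # and the direct path gets relatively closer to the path through the origin.
    print(r, poincare_distance((r, 0.0), ORIGIN), through_origin_ratio(r))
```

In this toy setting the ratio increases towards one as the points approach the boundary, which is the sense in which the direct path resembles a path through the origin.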
By embedding hierarchies in the Poincaré ball so that items near the top of the hierarchy are placed near the origin and lower items near infinity (intuitively, embedding the "vertical" structure), and so that items sharing a parent in the hierarchy are close to each other (embedding the "horizontal" structure), we obtain Poincaré embeddings (Nickel and Kiela, 2017). More formally, this means that the embedding norm represents depth in the hierarchy, and the distance between embeddings represents the similarity of the respective items.
Figure 2 shows the results of embedding the WordNet noun hierarchy in two-dimensional Euclidean space (left) and the Poincaré disk (right). In the hyperbolic model, the types tend to be located near the boundary of the disk. In this region the space grows exponentially, which allows related types to be placed near one another and far from unrelated ones. The actual distance in this model is not the one visualized in the figure but the one given by Equation 1.

Entity Typing in Hyperbolic Space

Task Definition
The task we consider is, given a context sentence c containing an entity mention m, predict the correct type labels t_m that describe m from a type inventory T, which includes more than 10,000 types (Choi et al., 2018). The mention m can be a named entity, a nominal, or a pronoun. The ground-truth type set t_m may contain multiple types, making the task a multi-label classification problem.

Objective
We aim to analyze the effects of hyperbolic and Euclidean spaces when modeling hierarchical information present in the type inventory, for the task of fine-grained entity typing. Since hyperbolic geometry is naturally equipped to model hierarchical structures, we hypothesize that this enhanced representation will result in superior performance. With the goal of examining the relation between the metric space and the hierarchy, we propose a regression model. We learn a function that maps feature representations of a mention and its context onto a vector space such that the instances are embedded closer to their target types. The ground-truth type set contains a varying number of types per instance. In our regression setup, however, we aim to predict a fixed number of labels for all instances. This imposes strong upper bounds on the performance of our proposed model. Nonetheless, as the strict accuracy of state-of-the-art methods for the Ultra-Fine dataset is below 40% (Choi et al., 2018; Xiong et al., 2019), the evaluation we perform is still informative in qualitative terms, and enables us to gain better intuitions with regard to embedding hierarchical structures in different metric spaces.

Method
Given the encoded feature representations of a mention m and its context c, denoted e(m, c) ∈ R^n, our goal is to learn a mapping function f : R^n → S^n, where S^n is the target vector space. We intend to approximate the embeddings of the type labels t_m, previously projected into the space. Subsequently, we search for the type embeddings nearest to the projected representation in order to assign the categorical labels corresponding to the mention within that context. Figure 3 presents an overview of the model.
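The nearest-neighbor step can be sketched as follows. This is a minimal illustration under stated assumptions: the type names, the two-dimensional coordinates, and the helper `nearest_types` are hypothetical, and the distance shown is the standard Poincaré-ball distance; a Euclidean distance could be passed in its place.

```python
import math

def poincare_distance(u, v):
    """Hyperbolic distance between two points of the open unit ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))

def nearest_types(query, type_embeddings, k=1, distance=poincare_distance):
    """Return the k type names whose embeddings are closest to the projected query."""
    ranked = sorted(type_embeddings, key=lambda name: distance(query, type_embeddings[name]))
    return ranked[:k]

# Hypothetical, hand-placed type embeddings (general types near the origin).
toy_types = {
    "person": (0.10, 0.00),
    "politician": (0.60, 0.10),
    "diplomat": (0.85, 0.20),
    "location": (-0.10, 0.05),
}

projected_mention = (0.80, 0.15)  # stand-in for f(e(m, c))
print(nearest_types(projected_mention, toy_types, k=2))
```

Passing the distance as a parameter keeps the lookup identical across geometries, so only the metric changes between the hyperbolic and Euclidean settings.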
The label distribution of the dataset is diverse and fine-grained. Each instance is annotated with three levels of granularity, namely coarse, fine and ultra-fine, and on the development and test sets there are, on average, five labels per item. This poses a challenging problem for learning and predicting with only one projection. As a solution, we propose three different projection functions, f_coarse, f_fine, and f_ultra, each of them fine-tuned to predict labels of a specific granularity.
We hypothesize that the complexity of the projection increases as the granularity becomes finer, given that the target label space per granularity increases. Inspired by Sanh et al. (2019), we arrange the three projections in a hierarchical manner that reflects these difficulties. The coarse projection task is set at the bottom layer of the model and more complex (finer) interactions at higher layers. With the projected embedding of each layer, we aim to introduce an inductive bias in the next projection that will help to guide it into the correct region of the space. Nevertheless, we use shortcut connections so that top layers can have access to the encoder layer representation.

Mention and Context Representations
To encode the context c containing the mention m, we apply the encoder schema of Choi et al. (2018) based on Shimaoka et al. (2016). We replace the location embedding of the original encoder with a word position embedding p_i to reflect relative distances between the i-th word and the entity mention. This modification induces a bias on the attention layer to focus less on the mention and more on the context. Finally, we apply a standard Bi-LSTM and a self-attentive encoder (McCann et al., 2017) on top to get the context representation C ∈ R^{d_c}.
For the mention representation we derive features from a character-level CNN, concatenate them with the GloVe word embeddings (Pennington et al., 2014) of the mention, and combine them with a similar self-attentive encoder. The mention representation is denoted as M ∈ R^{d_m}. The final representation is obtained by concatenating mention and context, [M ; C] ∈ R^{d_m + d_c}.

Projecting into the Ball
To learn a projection function that embeds our feature representation in the target space, we apply a variation of the re-parameterization technique introduced in Dhingra et al. (2018). The re-parameterization involves computing a direction vector r and a norm magnitude λ from e(m, c) as follows:

$$\bar{r} = \phi_{dir}(e(m, c)), \quad r = \frac{\bar{r}}{\|\bar{r}\|}, \qquad \bar{\lambda} = \phi_{norm}(e(m, c)), \quad \lambda = \sigma(\bar{\lambda})$$

where ϕ_dir : R^n → R^n and ϕ_norm : R^n → R can be arbitrary functions, whose parameters will be optimized during training, and σ is the sigmoid function that ensures the resulting norm λ ∈ (0, 1). The re-parameterized embedding is defined as v = λr, which lies in S^n.
By making use of this simple technique, the embeddings are guaranteed to lie in the Poincaré ball. This avoids the need to correct the gradient or to use Riemannian SGD (Bonnabel, 2011). Instead, it allows the use of any optimization method in deep learning, such as Adam (Kingma and Ba, 2014).
We parameterize the direction function ϕ_dir : R^{d_m + d_c} → R^n as a multi-layer perceptron (MLP) with a single hidden layer, using rectified linear units (ReLU) as nonlinearity, and dropout. We do not apply the ReLU function after the output layer in order to allow negative values as components of the direction vector. For the norm magnitude function ϕ_norm : R^{d_m + d_c} → R we use a single linear layer.
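A minimal forward-pass sketch of this re-parameterization is shown below, assuming the direction vector is normalized to unit length before scaling. The function name `project_into_ball`, the tiny layer sizes, and the hand-written weights are hypothetical; dropout and training are omitted.

```python
import math

def linear(x, weights, bias):
    """Affine map: one output per row of the weight matrix."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, xi) for xi in x]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def project_into_ball(features, dir_hidden, dir_out, norm_layer):
    """Map features to v = lambda * r with ||r|| = 1 and lambda in (0, 1)."""
    hidden = relu(linear(features, *dir_hidden))
    raw_direction = linear(hidden, *dir_out)  # no ReLU here: components may be negative
    length = max(math.sqrt(sum(c * c for c in raw_direction)), 1e-12)
    direction = [c / length for c in raw_direction]
    lam = sigmoid(linear(features, *norm_layer)[0])  # single linear layer, then sigmoid
    return [lam * c for c in direction], lam

# Hypothetical toy parameters: 3 input features, 2 hidden units, 2 output dimensions.
dir_hidden = ([[1.0, 0.0, 0.5], [0.0, 1.0, -0.5]], [0.0, 0.0])
dir_out = ([[1.0, 1.0], [-2.0, 0.5]], [0.0, 0.0])
norm_layer = ([[0.2, 0.1, 0.3]], [0.0])

v, lam = project_into_ball([0.5, -1.0, 2.0], dir_hidden, dir_out, norm_layer)
print(v, lam, math.sqrt(sum(c * c for c in v)))
```

Because the norm of v equals the sigmoid output, the point stays strictly inside the unit ball for any parameter values, which is what lets a standard optimizer update the weights without a projection step.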

Optimization of the Model
We aim to find projection functions f_i that embed the instance representations closer to the respective target types, in a given vector space S^n. As target space S^n we use the Poincaré ball B^n and compare it with the Euclidean unit ball R^n. Both B^n and R^n are metric spaces, therefore they are equipped with a distance function, namely the hyperbolic distance d_H defined in Equation 1 and the Euclidean distance d_E respectively, which we intend to minimize. Moreover, since the Poincaré model is a conformal model of the hyperbolic space, i.e. the angles between Euclidean and hyperbolic vectors are equal, the cosine distance d_cos can be used as well.
We propose to minimize a combination of the distance defined by each metric space and the cosine distance to approximate the embeddings. Although formally this combination is not a distance metric, since it does not satisfy the triangle inequality, it provides a very strong signal to approximate the target embeddings, accounting for the main concepts modeled in the representation: relatedness, captured via the distance and orientation in the space, and generality, via the norm of the embeddings.
To mitigate the instability in the derivative of the hyperbolic distance, we follow the approach proposed in Sala et al. (2018) and minimize the square of the distance, which does have a continuous derivative in B^n. Thus, in the Poincaré model we minimize the distance for two points u, v ∈ B^n defined as:

$$d_B(u, v) = \alpha \, d_H(u, v)^2 + \beta \, d_{cos}(u, v)$$

whereas in the Euclidean space, for x, y ∈ R^n we minimize:

$$d_R(x, y) = \alpha \, d_E(x, y) + \beta \, d_{cos}(x, y)$$

The hyperparameters α and β are added to compensate for the bounded image of the cosine distance function in [0, 1].
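One possible reading of the two objectives, written as a small sketch: a weighted sum of the (squared, in the hyperbolic case) metric distance and the cosine distance. The cosine distance is taken here as one minus the cosine similarity, which is an assumption since the exact normalization is not shown; the function names and the default weights are likewise illustrative.

```python
import math

def poincare_distance(u, v):
    """Hyperbolic distance between two points of the open unit ball."""
    sq_norm_u = sum(a * a for a in u)
    sq_norm_v = sum(b * b for b in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_norm_u) * (1.0 - sq_norm_v)))

def euclidean_distance(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(u, v):
    """Assumed definition: 1 - cosine similarity (both vectors must be non-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

def hyperbolic_loss(u, v, alpha=1.0, beta=1.0):
    """Squared hyperbolic distance plus cosine distance, weighted by alpha and beta."""
    return alpha * poincare_distance(u, v) ** 2 + beta * cosine_distance(u, v)

def euclidean_loss(x, y, alpha=1.0, beta=1.0):
    """Euclidean distance plus cosine distance, weighted by alpha and beta."""
    return alpha * euclidean_distance(x, y) + beta * cosine_distance(x, y)

prediction = (0.30, 0.40)  # projected mention
target = (0.60, 0.20)      # embedding of a gold type
print(hyperbolic_loss(prediction, target), euclidean_loss(prediction, target))
```

In practice these terms would be computed on tensors inside an automatic differentiation framework; the plain-Python version only illustrates how the terms combine.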

Hierarchical Type Inventories
In this section, we investigate two methods for deriving a hierarchical structure for a given type inventory. First, we introduce the datasets on which we perform our study, since we exploit some of their characteristics to construct a hierarchy.

Data
We focus our analysis on the Ultra-Fine entity typing dataset introduced in Choi et al. (2018). To gain a better understanding of the proposed model under different geometries, we also experiment on the OntoNotes dataset (Gillick et al., 2014), as it is a standard benchmark for entity typing.

Deriving the Hierarchies
The two methods we analyze to derive a hierarchical structure from the type inventory are the following.
Knowledge base alignment: Hierarchical information can be provided explicitly, by aligning the type labels to a knowledge base schema. In this case the types follow the tree-like structure of the ontology curated by experts. On the Ultra-Fine dataset, the type vocabulary T (i.e. noun phrases) is extracted from WordNet (Miller, 1992). Nouns in WordNet are organized into a deep hierarchy, defined by hypernym or "IS A" relationships. By aligning the type labels to the hypernym structure existing in WordNet, we obtain a type hierarchy. In this case, all paths lead to the root type entity. In the OntoNotes dataset the annotations follow a pre-established, much smaller, hierarchical taxonomy based on "IS A" relations as well.
Type co-occurrences: Although in practical scenarios hierarchical information may not always be available, the distribution of types has an implicit hierarchy that can be inferred automatically. If we model the ground-truth labels as nodes of a graph, its adjacency matrix can be drawn and weighted by considering the co-occurrences on each instance. That is, if t_1 and t_2 are annotated as true types for a training instance, we add an edge between both types. To weigh the edge we explore two variants: the frequency of observed instances where this co-occurrence holds, and the pointwise mutual information (PMI), as a measure of the association between the two types. By mining type co-occurrences present in the dataset as an affinity score, the hierarchy can be inferred. This method alleviates the need for a type inventory explicitly aligned to an ontology or pre-defined label correlations.
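The co-occurrence variant can be sketched as below. This is an illustrative construction under stated assumptions: probabilities for PMI are estimated as the fraction of training instances in which a type (or a pair of types) appears, and the toy label sets and the function name `cooccurrence_edges` are hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_edges(label_sets):
    """Build weighted edges between types that are annotated on the same instance."""
    type_counts = Counter()
    pair_counts = Counter()
    for labels in label_sets:
        unique = sorted(set(labels))
        type_counts.update(unique)
        pair_counts.update(combinations(unique, 2))  # unordered pairs, sorted by name
    n = len(label_sets)
    edges = {}
    for (a, b), joint in pair_counts.items():
        # PMI = log( p(a, b) / (p(a) * p(b)) ), with instance-level relative frequencies.
        pmi = math.log((joint / n) / ((type_counts[a] / n) * (type_counts[b] / n)))
        edges[(a, b)] = {"freq": joint, "pmi": pmi}
    return edges

# Hypothetical training annotations.
toy_annotations = [
    ["person", "politician", "diplomat"],
    ["person", "politician"],
    ["person", "artist"],
    ["location", "city"],
]

for pair, weights in sorted(cooccurrence_edges(toy_annotations).items()):
    print(pair, weights)
```

The resulting weighted edge list is the kind of graph input that a graph-embedding library can consume; frequency favors common pairs, whereas PMI discounts types that co-occur with everything.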
To embed the target type representations into the different metric spaces we make use of the library Hype (Nickel and Kiela, 2018). This library allows us to embed graphs into low-dimensional continuous spaces with different metrics, such as hyperbolic or Euclidean, ensuring that related objects are closer to each other in the space. The learned embeddings capture notions of both similarity, through the relative distance among each other, and hierarchy, through the distance to the origin, i.e. the norm. The projection of the hierarchy derived from WordNet is depicted in Figure 2.

Experiments
We perform experiments on the Ultra-Fine (Choi et al., 2018) and OntoNotes (Gillick et al., 2014) datasets to evaluate which kind of hierarchical information is better suited for entity typing, and under which geometry the hierarchy can be better exploited.

Setup
For evaluation we run experiments on the Ultra-Fine dataset with our model projecting onto the hyperbolic space, and compare to the same setting in Euclidean space. The type embeddings are  We compare our model to the multi-task model of Choi et al. (2018) trained on the open-source version of their dataset (MULTITASK). The final type predictions consist of the closest neighbor from the coarse and fine projections, and the three closest neighbors from the ultra-fine projection. We report Loose Macro-averaged and Loose Micro-averaged F1 metrics computed from the precision/recall scores over the same three granularities established by Choi et al. (2018). For all models we optimize Macro-averaged F1 on coarse types on the validation set, and evaluate on the test set. All experiments project onto a target space of 10 dimensions. The complete set of hyperparameters is detailed in the Appendix.

Comparison of the Hierarchies
Results on the test set are reported in Table 2. From comparing the different strategies to derive the hierarchies, we can see that FREQ and PMI substantially outperform MULTITASK on the ultra-fine granularity (17.5% and 29.8% relative improvement in Macro F1 and Micro F1, respectively, with the hyperbolic model). Both hierarchies show a substantially better performance over the WORDNET hierarchy on this granularity as well (MaF1 16.0 and MiF1 15.4 for PMI vs 7.0 and 6.7 for WORDNET on the hyperbolic model), indicating that these structures, created solely from the dataset statistics, better reflect the type distribution in the annotations. On FREQ and PMI, types that frequently co-occur on the training set are located closer to each other, improving the prediction based on nearest neighbor.
All the hierarchies show very low performance on fine when compared to the MULTITASK model. This exhibits a weakness of our regression setup. On the test set there are 1,998 instances but only 1,318 fine labels as ground truth (see Table 1). By forcing a prediction on the fine level for all instances, precision decreases notably. More details are given in Section 6.3.
The combined hierarchy WORDNET + FREQ achieves marginal improvements on coarse and fine granularities, while it degrades the performance on ultra-fine when compared to FREQ.
By imposing a hierarchical structure over the type vocabulary we can infer types that are located higher up in the hierarchy from the predictions of the lower ones. To analyze this, we add the closest coarse label to the ultra-fine prediction of each instance. Results are reported in Table 2b. The improvements are noticeable on the Macro score (up to 3.9 F1 points difference on FREQ) whereas Micro decreases. Since we are adding types to the prediction, this technique improves recall and penalizes precision. Macro is computed on the entity level, while Micro provides an overall score, showing that per instance the prediction tends to be better. The improvements can be observed on FREQ and PMI given that their predictions over ultra-fine types are better.
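To make the difference between the two scores concrete, the sketch below computes loose macro- and micro-averaged precision, recall and F1 following the usual definitions for fine-grained entity typing (per-instance averaging versus pooled counts). The function names and toy label sets are hypothetical, and the handling of empty predictions is an assumption rather than the exact evaluation script used in the experiments.

```python
def f1(precision, recall):
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)

def loose_macro(gold, pred):
    """Average per-instance precision and recall, then combine into F1."""
    precisions = [len(g & p) / len(p) for g, p in zip(gold, pred) if p]
    recalls = [len(g & p) / len(g) for g, p in zip(gold, pred) if g]
    precision = sum(precisions) / len(precisions) if precisions else 0.0
    recall = sum(recalls) / len(recalls) if recalls else 0.0
    return precision, recall, f1(precision, recall)

def loose_micro(gold, pred):
    """Pool overlap counts over all instances before computing precision and recall."""
    overlap = sum(len(g & p) for g, p in zip(gold, pred))
    n_pred = sum(len(p) for p in pred)
    n_gold = sum(len(g) for g in gold)
    precision = overlap / n_pred if n_pred else 0.0
    recall = overlap / n_gold if n_gold else 0.0
    return precision, recall, f1(precision, recall)

# Hypothetical gold and predicted type sets for two instances.
gold = [{"person", "politician", "diplomat"}, {"location"}]
pred = [{"diplomat"}, {"city", "location"}]

print("macro:", loose_macro(gold, pred))
print("micro:", loose_micro(gold, pred))
```

Adding one extra predicted label per instance enlarges the pooled denominator of micro precision for every instance, while macro averaging rewards each instance where the extra label is correct.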

Comparison of the Spaces
When comparing performances with respect to the metric spaces, the hyperbolic models for PMI and FREQ outperform all other models on ultra-fine granularity. Compared to its Euclidean counterpart, PMI brings considerable improvements (16.0 vs 12.2 and 15.4 vs 11.5 for Macro and Micro F1 respectively). This can be explained by the exponential growth of this space towards the boundary of the ball, combined with a representation that reflects the type co-occurrences in the dataset.
Figure 4 shows a histogram of the distribution of ground-truth types as closest neighbors to the prediction.
On both Euclidean and hyperbolic models, the type embeddings for coarse and fine labels are located closer to the origin of the space. In this region, the spaces show a much more similar behavior in terms of the distance calculation, and this similarity is reflected in the results as well.
The low performance of the hyperbolic model of WORDNET on coarse can be explained by the fact that entity is the root node of the hierarchy, therefore it is located closer to the center of the space. Elements placed in the vicinity of the origin have a norm closer to zero, thus their distance to other types tends to be shorter (it does not grow exponentially). This often misleads the model into assigning entity as the coarse type. See Table 3c for an example.
This issue is alleviated on WORDNET + FREQ. Nevertheless, it appears again when using the ultra-fine prediction to infer the coarse label. The drop in performance can be seen in Table 2b: Macro F1 decreases by 8.0 and Micro F1 by 12.2.

Error analysis
We perform an error analysis on samples from the development set and predictions from two of our proposed hyperbolic models. We show three examples in Table 3. Overall we can see that predictions are reasonable, suggesting synonyms or related words.
In the proposed regression setup, we predict a fixed number of labels per instance. This schema has drawbacks, as shown in example a), where all types predicted by the FREQ model are correct but we cannot predict more, and in example b), where we predict additional related types that are not part of the annotations.
In examples b) and c) we see how the FREQ model predicts the coarse type correctly whereas the model that uses the WordNet hierarchy predicts group and entity since these labels are considered more general (organization IS A group) thus located closer to the origin of the space.
To analyse precision and recall more accurately, we compare our model to the one of Shimaoka et al. (2016) (ATTNER) and the multi-task model of Choi et al. (2018) (MULTI). We show the results for macro-averaged metrics in Table 4. Our model is able to achieve higher recall but lower precision. Nonetheless, we are able to outperform ATTNER with a regression model even though they apply a classifier to the task.

Analysis Case: OntoNotes
To better understand the effects of the hierarchy and the metric spaces we also perform an evaluation on OntoNotes (Gillick et al., 2014). We compare the original hierarchy of the dataset (ONTO), and one derived from the type co-occurrence frequency extracted from the data augmented by Choi et al. (2018) with this type inventory. The results for the three granularities are presented in Table 5.
The FREQ model on the hyperbolic geometry achieves the best performance for the ultra-fine granularity, in accordance with the results on the Ultra-Fine dataset. In this case the improvements of the frequency-based hierarchy are not so remarkable when compared to the ONTO model, given that the type inventory is much smaller and the annotations follow a hierarchy where there is only one possible path for every label to its coarse type.
The low results on the ultra-fine granularity are due to the reduced multiplicity of the annotated types (see Table 8). Most instances have only one or two types, setting very restrictive upper bounds for this setup.

Related Work
Type inventories for the task of fine-grained entity typing (Ling and Weld, 2012; Gillick et al., 2014; Yosef et al., 2012) have grown in size and complexity (Del Corro et al., 2015; Murty et al., 2017; Choi et al., 2018). Systems have tried to incorporate hierarchical information on the type distribution in different manners. Shimaoka et al. (2017) encode the hierarchy through a sparse matrix. Xu and Barbosa (2018) model the relations through a hierarchy-aware loss function. Ma et al. (2016) and Abhishek et al. (2017) learn embeddings for labels and feature representations into a joint space in order to facilitate information sharing among them. Our work resembles Xiong et al. (2019) since they derive hierarchical information in an unrestricted fashion, through type co-occurrence statistics from the dataset. These models operate under Euclidean assumptions. Instead, we impose a hyperbolic geometry to enrich the hierarchical information.
Hyperbolic spaces have been applied mostly to complex and social network modeling (Krioukov et al., 2010; Verbeek and Suri, 2016). In the field of Natural Language Processing, they have been employed to learn embeddings for Question Answering (Tay et al., 2018), in Neural Machine Translation (Gulcehre et al., 2019), and to model language (Leimeister and Wilson, 2018; Tifrea et al., 2019). We build upon the work of Nickel and Kiela (2017) on modeling hierarchical link structure of symbolic data and adapt it with the parameterization method proposed by Dhingra et al. (2018) to cope with feature representations of text.

Conclusions
Incorporation of hierarchical information from large type inventories into neural models has become critical to improve performance. In this work we analyze expert-generated and data-driven hierarchies, and the geometrical properties provided by the choice of the vector space, in order to model this information. Experiments on two different datasets show consistent improvements of hyperbolic embedding over Euclidean baselines on very fine-grained labels when the hierarchy reflects the annotated type distribution.

Figure 1: Examples of annotations and hierarchical type inventory with co-occurrence frequencies.

Figure 2: Type inventory of the Ultra-Fine dataset aligned to the WordNet noun hierarchy and projected on two dimensions in different spaces.

Figure 3: Overview of the proposed model to predict types of a mention within its context. Example input: "A list of novels by Agatha Christie published in the year..."

Footnote 3: Choi et al. (2018) use the licensed Gigaword to build part of the dataset, resulting in about 25.2M training samples.

Figure 4: Histogram of ground-truth type neighbor positions for ultra-fine predictions in Hyperbolic and Euclidean spaces on the test set.
Hyperbolic geometry studies non-Euclidean spaces with constant negative curvature. Two-dimensional hyperbolic space can be modelled as the open unit disk, the so-called Poincaré disk, in which the unit circle represents infinity, i.e., as a point approaches infinity in hyperbolic space, its norm approaches one in the Poincaré disk model. In the general n-dimensional case, the disk model becomes the Poincaré ball (Chamberlain et al., 2017) B^n = {x ∈ R^n | ‖x‖ < 1}, where ‖·‖ denotes the Euclidean norm. In the Poincaré model the distance between two points u, v ∈ B^n is given by:

$$d_H(u, v) = \operatorname{arcosh}\left(1 + 2 \, \frac{\|u - v\|^2}{(1 - \|u\|^2)(1 - \|v\|^2)}\right) \qquad (1)$$

Table 1: Type instances in the dataset grouped by split and granularity.

The design goals of the Ultra-Fine dataset (Choi et al., 2018) were to increase the diversity and coverage of entity type annotations. It contains 10,331 target types defined as free-form noun phrases and divided into three levels of granularity: coarse, fine and ultra-fine. The data consist of 6,000 crowdsourced examples and approximately 6M training samples in the open-source version (see footnote 3), automatically extracted with distant supervision, by entity linking and nominal head word extraction. Our evaluation is done on the original crowdsourced dev/test splits.

Table 2: Results on the test set for different hierarchies and spaces. The best results of our models are marked in bold. On (b) we report the comparison of adding the closest coarse label to the ultra-fine with respect to the coarse results on (a).

a) Example: Rin, Kohaku and Sesshomaru Rin befriends Kohaku, the demonslayer Sango's younger brother, while Kohaku acts as her guard when Naraku is using her for bait to lure Sesshomaru into battle.
   Annotation: event, conflict, war, fight, battle, struggle, dispute, group action
   Prediction: FREQ: event, conflict, war, fight, battle; WORDNET: event, conflict, difference, engagement, assault
b) Example: The UN mission in Afghanistan dispatched its own investigation, expressing concern about reports of civilian casualties and calling for them to be properly cared for.

Table 3: Qualitative analysis of instances taken from the development set. The predictions are generated with the hyperbolic models of FREQ and WORDNET. Correct predictions are marked in blue.

Table 5: Macro and micro F1 results on OntoNotes.