A Fully Hyperbolic Neural Model for Hierarchical Multi-class Classification

Label inventories for fine-grained entity typing have grown in size and complexity. Nonetheless, they exhibit a hierarchical structure. Hyperbolic spaces offer a mathematically appealing approach for learning hierarchical representations of symbolic data. However, it is not clear how to integrate hyperbolic components into downstream tasks. This is the first work to propose a fully hyperbolic model for multi-class multi-label classification, which performs all operations in hyperbolic space. We evaluate the proposed model on two challenging datasets and compare it to different baselines that operate under Euclidean assumptions. Our hyperbolic model infers the latent hierarchy from the class distribution, captures implicit hyponymic relations in the inventory, and shows performance on par with state-of-the-art methods on fine-grained classification, with a remarkable reduction of the parameter size. A thorough analysis sheds light on the impact of each component in the final prediction and showcases the ease of integration with Euclidean layers.


Introduction
Entity typing classifies textual mentions of entities according to their semantic class, within a set of labels (or classes) organized in an inventory. The task has progressed from recognizing a few coarse classes (Sang and De Meulder, 2003) to extremely large inventories with hundreds (Gillick et al., 2014) or thousands of labels (Choi et al., 2018). Exploiting inter-label correlations has therefore become critical to improve performance.
Large inventories tend to exhibit a hierarchical structure, either through an explicit tree-like arrangement of the labels (coarse labels at the top, fine-grained ones at the bottom), or implicitly through the label distribution in the dataset (coarse labels appear more frequently than fine-grained ones). Prior work has integrated only explicit hierarchical information, by formulating a hierarchy-aware loss (Murty et al., 2018; Xu and Barbosa, 2018) or by representing instances and labels in a joint Euclidean embedding space (Shimaoka et al., 2017; Abhishek et al., 2017). However, the resulting space is hard to interpret, and these methods fail to capture implicit relations in the label inventory. Hyperbolic space is naturally equipped for embedding symbolic data with hierarchical structures (Nickel and Kiela, 2017). Intuitively, this is because the amount of space grows exponentially as points move away from the origin. This mirrors the exponential growth of the number of nodes in trees with increasing distance from the root (Cho et al., 2019) (see Figure 1).

Figure 1: Tree embedded in hyperbolic space. Items at the top of the hierarchy are placed near the origin of the space, and lower items near the boundary. Moreover, the hyperbolic distance (Eq. 1) between sibling nodes resembles the distance through their common ancestor, analogous to the distance in the tree.
In this work, we propose a fully hyperbolic neural model for fine-grained entity typing. Noticing a perfect match between hierarchical label inventories in the linguistic task and the benefits of hyperbolic spaces, we endow a classification model with a suitable geometry to capture this fundamental property of the data distribution. By virtue of the hyperbolic representations, the proposed approach automatically infers the latent hierarchy arising from the class distribution and achieves a meaningful and interpretable organization of the label space. This arrangement captures implicit hyponymic relations (is-a) in the inventory and enables the model to excel at fine-grained classification. To the best of our knowledge, this work is the first to apply hyperbolic geometry from beginning to end to perform multi-label classification on real NLP datasets.
Recent work has proposed hyperbolic neural components, such as word embeddings (Tifrea et al., 2019), recurrent neural networks (Ganea et al., 2018) and attention layers (Gulcehre et al., 2019). However, researchers have incorporated these isolated components into neural models whose remaining layers and algorithms operate under Euclidean assumptions. This prevents models from fully exploiting the properties of hyperbolic geometry. Furthermore, there are different analytic models of hyperbolic space, and not all previous work operates in the same one, which hinders their combination and hampers their adoption for downstream tasks (e.g. Tifrea et al. (2019) learn embeddings in the Poincaré model, Gulcehre et al. (2019) aggregate points in the Klein model, and Nickel and Kiela (2018) perform optimization in the Lorentz model). We address these issues. Our model encodes textual inputs, applies a novel attention mechanism, and performs multi-class multi-label classification, executing all operations in the Poincaré model of hyperbolic space (§4). We evaluate the model on two datasets, namely Ultra-Fine (Choi et al., 2018) and OntoNotes (Gillick et al., 2014), and compare to Euclidean baselines as well as to state-of-the-art methods for the task (Xiong et al., 2019; Onoe and Durrett, 2019). The hyperbolic system has competitive performance when compared to an ELMo model (Peters et al., 2018) and a BERT model (Devlin et al., 2019) on very fine-grained types, with a remarkable reduction of the parameter size (§6). Instead of relying on large pre-trained models, we impose a suitable inductive bias by choosing an adequate metric space to embed the data, which does not introduce an extra burden on the parameter footprint.
By means of the exponential and logarithmic maps (explained in §2) we are able to mix hyperbolic and Euclidean components into one model, aiming to exploit their strengths at different levels of the representation. We perform a thorough ablation that allows us to understand the impact of each hyperbolic component in the final performance of the system ( §6.1.1 and §6.1.2), and showcases its ease of integration with Euclidean layers.

Hyperbolic Neural Networks
In this section we briefly recall the necessary background on hyperbolic neural components. The terminology and formulas used throughout this work follow the formalism of Möbius gyrovector spaces (Ungar, 2008a,b), and the definitions of hyperbolic neural components of Ganea et al. (2018). For more information about Riemannian geometry and Möbius operations see Appendices A and B. In the following, ⟨·, ·⟩ and ‖·‖ are the inner product and norm inherited from the Euclidean space.
Hyperbolic space: It is a non-Euclidean space with constant negative curvature. We adopt the Poincaré ball model of hyperbolic space (Cannon et al., 1997). In the general n-dimensional case, it is defined as D^n = {x ∈ R^n | ‖x‖ < 1}.² The Poincaré model is a Riemannian manifold equipped with the Riemannian metric g_x^D = λ_x² g^E, where λ_x := 2 / (1 − ‖x‖²) is called the conformal factor and g^E = I_n is the Euclidean metric tensor. The distance between two points x, y ∈ D^n is given by:

d_D(x, y) = cosh⁻¹( 1 + 2‖x − y‖² / ((1 − ‖x‖²)(1 − ‖y‖²)) )   (1)

Möbius addition: It is the hyperbolic analogue of vector addition in Euclidean space. Given two points x, y ∈ D^n, it is defined as:

x ⊕ y = ( (1 + 2⟨x, y⟩ + ‖y‖²) x + (1 − ‖x‖²) y ) / ( 1 + 2⟨x, y⟩ + ‖x‖²‖y‖² )   (2)

Möbius matrix-vector multiplication: Given a linear map M : R^n → R^m, which we identify with its matrix representation, and a point x ∈ D^n with Mx ≠ 0, it is defined as:

M ⊗ x = tanh( (‖Mx‖ / ‖x‖) tanh⁻¹(‖x‖) ) Mx / ‖Mx‖   (3)

and M ⊗ x := 0 if Mx = 0.

Pointwise non-linearity: If we model it as ϕ : R^n → R^n, then its Möbius version ϕ⊗ can be applied using the same formulation as the matrix-vector multiplication. A visualization of the aforementioned operations can be seen in Figure 2. By combining these operations we obtain a one-layer feed-forward neural network (FFNN) in hyperbolic space, described as y = ϕ⊗(M ⊗ x ⊕ b), with M ∈ R^{m×n} and b ∈ D^m as trainable parameters. Note that the parameter b lies in the hyperbolic space, thus its updates during training need to be corrected for this geometry.

² Ganea et al. (2018) define the ball as D^n = {x ∈ R^n | c‖x‖² < 1} with a parameter c related to the radius of the Poincaré ball r = 1/√c. In this work we assume c = 1 and therefore omit this parameter.

Figure 3: Overview of the proposed model. The mention encoder extracts word- and char-level entity representations. The context encoder is based on a bidirectional GRU with attention. The outputs of both encoders are concatenated and passed to a classifier based on a multinomial logistic regression.
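As a concrete illustration of these operations, the following is a minimal NumPy sketch (the helper names are ours, not the authors' implementation):

```python
import numpy as np

def mobius_add(x, y):
    # Mobius addition: the hyperbolic analogue of vector addition.
    xy, x2, y2 = x @ y, x @ x, y @ y
    num = (1 + 2 * xy + y2) * x + (1 - x2) * y
    return num / (1 + 2 * xy + x2 * y2)

def poincare_dist(x, y):
    # Hyperbolic distance between two points of the unit ball (Eq. 1).
    d2 = np.sum((x - y) ** 2)
    return np.arccosh(1 + 2 * d2 / ((1 - x @ x) * (1 - y @ y)))

def mobius_matvec(M, x):
    # Mobius matrix-vector multiplication; M acts as a linear map
    # in the tangent space at the origin.
    Mx = M @ x
    n_x, n_Mx = np.linalg.norm(x), np.linalg.norm(Mx)
    if n_x == 0 or n_Mx == 0:
        return np.zeros(M.shape[0])
    return np.tanh((n_Mx / n_x) * np.arctanh(n_x)) * Mx / n_Mx

rng = np.random.default_rng(0)
x = rng.normal(size=4); x *= 0.4 / np.linalg.norm(x)   # points inside the ball
y = rng.normal(size=4); y *= 0.7 / np.linalg.norm(y)
M = rng.normal(size=(4, 4))
```

Note that Möbius addition of two ball points always stays inside the ball, and the origin acts as the identity element, mirroring the role of the zero vector in Euclidean space.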
Exponential and logarithmic maps: For each point x ∈ D^n, let T_x D^n denote the associated tangent space, which is always a subset of Euclidean space (Liu et al., 2019). We make use of the exponential map exp_x : T_x D^n → D^n and the logarithmic map log_x : D^n → T_x D^n to map points in the hyperbolic space to the Euclidean space, and vice versa. At the origin of the space, they are given for v ∈ T_0 D^n \{0} and y ∈ D^n \{0} by:

exp_0(v) = tanh(‖v‖) v / ‖v‖   (4)

log_0(y) = tanh⁻¹(‖y‖) y / ‖y‖   (5)

To map a point y ∈ D^n onto the Euclidean space we apply log_0(y). Conversely, to map a point v ∈ R^n onto the hyperbolic space, we identify R^n = T_0 D^n and apply exp_0(v). This allows us to mix hyperbolic and Euclidean neural layers, as shown in §6.1.2.
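The mixing of geometries described above can be sketched in a few lines (an illustrative sketch under our own helper names, not the paper's code): lift a ball point to the tangent space at the origin, apply an ordinary Euclidean layer there, and map the result back.

```python
import numpy as np

def exp0(v):
    # Tangent space at the origin -> Poincare ball.
    n = np.linalg.norm(v)
    return v if n == 0 else np.tanh(n) * v / n

def log0(y):
    # Poincare ball -> tangent space at the origin.
    n = np.linalg.norm(y)
    return y if n == 0 else np.arctanh(n) * y / n

def mobius_fn(phi, x):
    # Mobius version of a pointwise map: phi_mob(x) = exp0(phi(log0(x))).
    return exp0(phi(log0(x)))

# Mixing layers: run a Euclidean linear layer on a hyperbolic point.
rng = np.random.default_rng(1)
x_hyp = exp0(rng.normal(size=5))            # a point in the ball
W = rng.normal(size=(3, 5)) * 0.1
y_hyp = exp0(np.tanh(W @ log0(x_hyp)))      # result is back in the ball
```

The round trip exp_0(log_0(y)) = y is exact, which is what makes this interleaving of Euclidean and hyperbolic layers well-defined.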

Fine-grained Entity Typing
Given a context sentence s containing an entity mention m, the goal of entity typing is to predict the correct type labels t_m that describe m from a type inventory T. The ground-truth type set t_m may contain multiple types, making the task a multi-class multi-label classification problem.
For fine-grained entity typing the type inventory T tends to contain hundreds to thousands of labels. Encoding hierarchical information from large type inventories has been proven critical to improve performance (López et al., 2019). Thus we hypothesize that our proposed hyperbolic model will benefit from this representation.

Hyperbolic Classification Model
In this section we propose a general hyperbolic neural model for classification with sequential data as input. The building blocks are defined in a generic manner such that they can be applied to different tasks, or integrated with regular Euclidean layers. Our proposed architecture resembles recent neural models applied to entity typing (Choi et al., 2018). For the encoders we employ the neural networks introduced in Ganea et al. (2018); we propose a novel attention mechanism operating entirely in the Poincaré model; and we extend the hyperbolic classifier to multi-class multi-label setups. An overview of the model can be seen in Figure 3.

Mention Encoder
To represent the mention, we combine word- and char-level features, similar to Lee et al. (2017). Given a sequence of k tokens in a mention span, we represent them using pre-trained word embeddings w_i ∈ D^n, which we assume to lie in hyperbolic space. We apply a hyperbolic FFNN, described as:

m_i = ϕ⊗(M ⊗ w_i ⊕ b)

where M ∈ R^{d_M×n} and b ∈ D^{d_M} are parameters of the model. We combine the resulting m_1, ..., m_k into a single mention representation m ∈ D^{d_M} by computing a weighted sum of the token representations in hyperbolic space with the attention mechanism explained in §4.4. Moreover, we extract features from the sequence of characters in the mention span with a recurrent neural network (RNN) (Lample et al., 2016). We represent each character with a char-embedding c_i ∈ D^{d_C} that we train in the Poincaré ball. An RNN operating in hyperbolic space is defined by:

h_{i+1} = ϕ⊗(W ⊗ h_i ⊕ U ⊗ c_i ⊕ b)   (6)

where W, U ∈ R^{d_C×d_C} and b ∈ D^{d_C} are parameters, and ϕ is a pointwise non-linearity function. Finally, we obtain a single representation c ∈ D^{d_C} by taking the midpoint of the states h_i using Equation 9.
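A single step of such a hyperbolic RNN can be sketched as follows (a minimal NumPy illustration under our own helper names; the actual model trains the parameters with Riemannian optimization):

```python
import numpy as np

def mobius_add(x, y):
    xy, x2, y2 = x @ y, x @ x, y @ y
    return ((1 + 2 * xy + y2) * x + (1 - x2) * y) / (1 + 2 * xy + x2 * y2)

def mobius_matvec(M, x):
    Mx = M @ x
    n_x, n_Mx = np.linalg.norm(x), np.linalg.norm(Mx)
    if n_x == 0 or n_Mx == 0:
        return np.zeros(M.shape[0])
    return np.tanh((n_Mx / n_x) * np.arctanh(n_x)) * Mx / n_Mx

def exp0(v):
    n = np.linalg.norm(v)
    return v if n == 0 else np.tanh(n) * v / n

def log0(y):
    n = np.linalg.norm(y)
    return y if n == 0 else np.arctanh(n) * y / n

def hyp_rnn_step(h, c, W, U, b):
    # h_next = phi_mob(W (mobius) h  (+)  U (mobius) c  (+)  b), phi = tanh,
    # where the Mobius non-linearity is applied via the exp/log maps.
    pre = mobius_add(mobius_add(mobius_matvec(W, h), mobius_matvec(U, c)), b)
    return exp0(np.tanh(log0(pre)))

# Run over a toy character sequence; the state never leaves the ball.
rng = np.random.default_rng(2)
d = 4
W, U = rng.normal(size=(d, d)) * 0.3, rng.normal(size=(d, d)) * 0.3
b = exp0(rng.normal(size=d) * 0.1)
h = np.zeros(d)
for _ in range(6):
    h = hyp_rnn_step(h, exp0(rng.normal(size=d) * 0.2), W, U, b)
```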

Context Encoder
To encode the context we apply a hyperbolic version of gated recurrent units (GRU) (Cho et al., 2014) proposed in Ganea et al. (2018)³ (the full formulation is given in Appendix C). Given a sequence of l tokens, we represent them with pre-trained word embeddings w_i ∈ D^n, and apply a forward and a backward GRU, producing contextualized representations for each token. We concatenate the resulting forward and backward states into a single embedding s_i ∈ D^{2d_S} using the concatenation operation defined below. Ultimately, we combine s_1, ..., s_l into a single context representation s ∈ D^{2d_S} with the distance-based attention mechanism.

Concatenation
If we model the concatenation of two vectors in the Poincaré ball as appending one to the other, the result is not guaranteed to remain inside the ball. Thus, we apply a generalized version of the concatenation operation. For x ∈ D^k and y ∈ D^l, concat : D^k × D^l → D^n is defined as:

concat(x, y) = M_1 ⊗ x ⊕ M_2 ⊗ y ⊕ b   (7)

where M_1 ∈ R^{n×k}, M_2 ∈ R^{n×l} and b ∈ D^n are parameters of the model. In Euclidean architectures, the concatenation of vectors is usually followed by a linear layer, which takes the form of Equation 7 when written explicitly.
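The generalized concatenation amounts to two Möbius matrix-vector products joined by Möbius addition; a small sketch (helper names are ours):

```python
import numpy as np

def mobius_add(x, y):
    xy, x2, y2 = x @ y, x @ x, y @ y
    return ((1 + 2 * xy + y2) * x + (1 - x2) * y) / (1 + 2 * xy + x2 * y2)

def mobius_matvec(M, x):
    Mx = M @ x
    n_x, n_Mx = np.linalg.norm(x), np.linalg.norm(Mx)
    if n_x == 0 or n_Mx == 0:
        return np.zeros(M.shape[0])
    return np.tanh((n_Mx / n_x) * np.arctanh(n_x)) * Mx / n_Mx

def hyp_concat(x, y, M1, M2, b):
    # concat(x, y) = M1 (mobius) x  (+)  M2 (mobius) y  (+)  b  (Eq. 7)
    return mobius_add(mobius_add(mobius_matvec(M1, x), mobius_matvec(M2, y)), b)

# Joining a 3-d and a 5-d ball point into a 4-d ball point.
rng = np.random.default_rng(3)
x = rng.normal(size=3); x *= 0.5 / np.linalg.norm(x)
y = rng.normal(size=5); y *= 0.6 / np.linalg.norm(y)
M1, M2 = rng.normal(size=(4, 3)), rng.normal(size=(4, 5))
z = hyp_concat(x, y, M1, M2, np.zeros(4))
```

Unlike naive appending, the output dimension n is a free choice here, and the result is guaranteed to stay inside the ball.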

Distance-based Attention
Previous approaches to hyperbolic attention (Gulcehre et al., 2019;Chami et al., 2019) require mappings of points to different spaces, which hinders their adoption into neural models. We propose a novel attention mechanism in the Poincaré model of hyperbolic space. We cast attention as a weighted sum of vectors in this geometry, without requiring any extra mapping of the inputs. In this manner, we make consistent use of the same analytical model of hyperbolic space across all components, which eases their integration.
To obtain the attention weights, we exploit the hyperbolic distance between points (Gulcehre et al., 2019). Given a sequence of states x_i ∈ D^n, we combine them with a trainable position embedding p_i ∈ D^n such that r_i = x_i ⊕ p_i. We use addition as the standard method to enrich the states with positional information (Vaswani et al., 2017; Devlin et al., 2019). We apply two different linear transformations on r_i to obtain the vectors q_i = W_Q ⊗ r_i ⊕ b_Q and k_i = W_K ⊗ r_i ⊕ b_K, both lying in the Poincaré ball. We compute the distance between these two points and finally obtain the weights by applying a softmax over the sequence in the following manner:

α_i = softmax_i( −β d_D(q_i, k_i) )   (8)

where W_Q, W_K ∈ R^{n×n}, b_Q, b_K ∈ D^n and β ∈ R are parameters of the model. Attention weights are higher for elements whose q_i and k_i vectors are placed close to each other. The positional embeddings are trained along with the model as hyperbolic parameters. For the context encoder, they reflect relative distances between the i-th word and the entity mention. For the mention encoder, they represent the absolute position of the word inside the mention span.
To aggregate the points as a weighted summation in hyperbolic space we propose to apply the Möbius midpoint, which obeys many of the properties that we expect from a weighted average in Euclidean space (Ungar (2010), Theorem 4.6):

x̄ = (1/2) ⊗ ( Σ_i α_i γ(x_i)² x_i / Σ_i α_i (γ(x_i)² − 1/2) )   (9)

where x_i are the states in the sequence, α_i the weights corresponding to each state, and γ(x_i) = 1/√(1 − ‖x_i‖²) the Lorentz factors. By applying the Möbius midpoint we develop an attention mechanism that operates entirely in the Poincaré model of hyperbolic space. Detailed formulas and experimental observations can be found in Appendix D.
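The full mechanism, distance-based weights followed by midpoint aggregation, can be sketched as follows (a NumPy illustration with our own helper names; the query/key transforms of the actual model are omitted here and the states are used directly):

```python
import numpy as np

def mobius_scalar(r, x):
    # Mobius scalar multiplication r (mobius) x.
    n = np.linalg.norm(x)
    return np.zeros_like(x) if n == 0 else np.tanh(r * np.arctanh(n)) * x / n

def poincare_dist(x, y):
    d2 = np.sum((x - y) ** 2)
    return np.arccosh(1 + 2 * d2 / ((1 - x @ x) * (1 - y @ y)))

def lorentz_factor(x):
    return 1.0 / np.sqrt(1.0 - x @ x)

def mobius_midpoint(xs, alphas):
    # Weighted Mobius gyromidpoint (Eq. 9).
    g2 = np.array([lorentz_factor(x) ** 2 for x in xs])
    num = sum(a * g * x for a, g, x in zip(alphas, g2, xs))
    den = float(sum(a * (g - 0.5) for a, g in zip(alphas, g2)))
    return mobius_scalar(0.5, num / den)

def distance_attention(qs, ks, xs, beta=1.0):
    # Weights from hyperbolic distances between query/key points,
    # then aggregation with the Mobius midpoint: all inside the ball.
    scores = np.array([-beta * poincare_dist(q, k) for q, k in zip(qs, ks)])
    alphas = np.exp(scores - scores.max())
    alphas /= alphas.sum()
    return alphas, mobius_midpoint(xs, alphas)

rng = np.random.default_rng(4)
pts = [v * 0.5 / np.linalg.norm(v) for v in rng.normal(size=(5, 3))]
alphas, ctx = distance_attention(pts, pts, pts)
```

With identical queries and keys all distances vanish and the weights are uniform; the midpoint of copies of the same point recovers that point, the behaviour one expects from a weighted average.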

Classification in the Poincaré Ball
The input of the classifier is the concatenation of the mention and context features. To perform multi-class classification in the Poincaré ball, we adapt the generalized multinomial logistic regression (MLR) of Ganea et al. (2018). Given K classes and k ∈ {1, ..., K}, p_k ∈ D^m, a_k ∈ T_{p_k} D^m \{0}, the formula for the hyperbolic MLR is:

p(y = k | x) = f( λ_{p_k} ‖a_k‖ sinh⁻¹( 2⟨−p_k ⊕ x, a_k⟩ / ((1 − ‖−p_k ⊕ x‖²) ‖a_k‖) ) )   (10)

where x ∈ D^m, f is a normalizing function, and p_k and a_k are trainable parameters. The formulation is based on expressing logits as distances to margin hyperplanes. The hyperplanes in hyperbolic space are defined by the union of all geodesics passing through p_k and orthogonal to a_k.
Although this formulation was conceived for single-label classification, the underlying notion also holds for multi-label setups. In that case, we need to be able to select several classes by considering the distances (logits) to all hyperplanes. To achieve this we employ the sigmoid function as f, instead of a softmax, and predict the class k if p(y = k|x) > 0.5. More details can be found in Appendix E. Figure 4 shows examples of the hyperbolic definition of multiple hyperplanes, which follow the curvature of the space.
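The multi-label decision rule can be sketched as follows (an illustrative NumPy version under our own names; a point lying exactly on a class hyperplane gets logit 0, i.e. probability 0.5):

```python
import numpy as np

def mobius_add(x, y):
    xy, x2, y2 = x @ y, x @ x, y @ y
    return ((1 + 2 * xy + y2) * x + (1 - x2) * y) / (1 + 2 * xy + x2 * y2)

def mlr_logit(x, p, a):
    # Hyperbolic MLR logit: lambda_p * ||a|| *
    # asinh( 2 <(-p)(+)x, a> / ((1 - ||(-p)(+)x||^2) ||a||) )
    lam_p = 2.0 / (1.0 - p @ p)
    z = mobius_add(-p, x)
    n_a = np.linalg.norm(a)
    return lam_p * n_a * np.arcsinh(2 * (z @ a) / ((1 - z @ z) * n_a))

def predict_labels(x, ps, As, threshold=0.5):
    # Multi-label decision: sigmoid over each class logit instead of softmax.
    logits = np.array([mlr_logit(x, p, a) for p, a in zip(ps, As)])
    probs = 1.0 / (1.0 + np.exp(-logits))
    return probs, probs > threshold
```

Moving off a hyperplane along its normal direction a_k pushes the logit positive, so the corresponding class crosses the 0.5 threshold.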

Optimization
With the proposed classification model, we aim to minimize variants of the binary cross-entropy loss function as the training objective.
The model has trainable parameters in both Euclidean and hyperbolic space. We apply the Geoopt implementation of Riemannian Adam (Kochurov et al., 2020) as a Riemannian adaptive optimization method (Bécigneul and Ganea, 2019) to carry out a gradient-based update of the parameters in their respective geometry.
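The essence of a Riemannian update can be sketched in NumPy (a schematic RSGD-style step under our own names; geoopt's RiemannianAdam additionally maintains Adam moment statistics, which are omitted here): rescale the Euclidean gradient by the inverse metric and move along the geodesic, so the parameter never leaves the ball.

```python
import numpy as np

def mobius_add(x, y):
    xy, x2, y2 = x @ y, x @ x, y @ y
    return ((1 + 2 * xy + y2) * x + (1 - x2) * y) / (1 + 2 * xy + x2 * y2)

def exp_x(x, v):
    # Exponential map at x: follow the geodesic from x with velocity v.
    lam = 2.0 / (1.0 - x @ x)
    n = np.linalg.norm(v)
    if n == 0:
        return x
    return mobius_add(x, np.tanh(lam * n / 2.0) * v / n)

def riemannian_step(x, euclid_grad, lr=0.1):
    # Riemannian gradient = Euclidean gradient / lambda_x^2, then retract.
    lam = 2.0 / (1.0 - x @ x)
    return exp_x(x, -lr * euclid_grad / lam ** 2)

# Toy objective: pull a hyperbolic parameter toward the origin.
x = np.array([0.5, 0.0])
for _ in range(50):
    x = riemannian_step(x, 2.0 * x)   # Euclidean gradient of ||x||^2
```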

Experiments
We evaluate the proposed hyperbolic model on two different datasets for fine-grained entity typing, and compare to Euclidean baselines as well as state-of-the-art models.

Data
For analysis and evaluation of the model, we focus on the Ultra-Fine entity typing dataset (Choi et al., 2018). It contains 10,331 target types defined as free-form noun phrases and divided into three levels of granularity: coarse, fine and ultra-fine. Besides this division, the dataset does not provide any further explicit information about the relations among the types. The data consist of 6,000 crowdsourced examples and 6M training samples in the open-source version, automatically extracted with distant supervision. Our evaluation is done on the original crowdsourced dev/test splits.
To gain a better understanding of the proposed model, we also experiment on the OntoNotes dataset (Gillick et al., 2014) as it is a standard benchmark for entity typing.

Setup
The MLR classifier operates in a hyperbolic space of m dimensions with m = d M + d C + 2d S . By setting different values, we experiment with three models: BASE (m = 100), LARGE (m = 250) and XLARGE (m = 500).
As word embeddings we employ Poincaré GloVe embeddings (Tifrea et al., 2019), which are pre-trained in the Poincaré model. Hence, the input to the encoders is already in hyperbolic space and all operations can be performed in this geometry. These embeddings are not updated during training. We use low dropout values, since the model is very sensitive to this parameter given the behaviour of the hyperbolic distance.
On the Ultra-Fine dataset, for each epoch, we train over the entire training set, and we run extra iterations over the crowdsourced split before evaluating. In this way, the model benefits from the large amount of noisy, automatically-generated data, and is fine-tuned with high-quality crowdsourced samples. Following previous work (Xiong et al., 2019; Onoe and Durrett, 2019), we optimize the multi-task objective proposed by Choi et al. (2018). For evaluation we report Macro-averaged and Micro-averaged F_1 metrics computed from the precision/recall scores over the same three granularities established by Choi et al. (2018). For all models we optimize Total Macro-averaged F_1 on the validation set, and evaluate on the test set. Following Ganea et al. (2018), we report the average of three runs given the highly non-convex spectrum of hyperbolic neural networks. Hyperparameters are detailed in Appendix F along with other practical aspects to ensure numerical stability.
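For reference, a minimal sketch of how Macro- and Micro-averaged P/R/F_1 are typically computed for multi-label typing (example-averaged macro, pooled micro; illustrative only, not the official scorer):

```python
def f1(p, r):
    # Harmonic mean of precision and recall.
    return 2 * p * r / (p + r) if p + r else 0.0

def typing_scores(golds, preds):
    # golds, preds: lists of label sets, one pair per example.
    # Macro: average per-example precision/recall; Micro: pooled counts.
    mp = [len(g & p) / len(p) for g, p in zip(golds, preds) if p]
    mr = [len(g & p) / len(g) for g, p in zip(golds, preds) if g]
    macro_p, macro_r = sum(mp) / len(mp), sum(mr) / len(mr)
    inter = sum(len(g & p) for g, p in zip(golds, preds))
    micro_p = inter / sum(len(p) for p in preds)
    micro_r = inter / sum(len(g) for g in golds)
    return f1(macro_p, macro_r), f1(micro_p, micro_r)

macro, micro = typing_scores([{"person", "artist"}, {"place"}],
                             [{"person"}, {"place", "city"}])
```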

Baselines
Euclidean baseline: We replace all operations of the hyperbolic model with their Euclidean counterparts. To map the Poincaré GloVe embeddings to the Euclidean space we apply log_0. We do not apply any normalization or correction of the weights to constrain them to the unit ball; on the contrary, we grant them freedom over the entire Euclidean space to establish a fair comparison. MULTITASK: the model proposed by Choi et al. (2018) along with the Ultra-Fine dataset.

Results and Discussion
Following previous work (Choi et al., 2018; Onoe and Durrett, 2019), we report results on the development set in Table 1. All hyperbolic models outperform the MULTITASK and LABELGCN baselines on Total Macro F_1. The DENOISED and BERT systems, based on large pre-trained models, show the best Total performance. Nonetheless, HY XLARGE is competitive, and surpasses both systems on ultra-fine F_1. In the hyperbolic model, fine-grained types are placed near the boundary of the ball, where the amount of space grows exponentially. Furthermore, the underlying structure of the type inventory is hierarchical, thus the hyperbolic definition of the hyperplanes is well-suited to improve the classification in this case (see the comparison with Euclidean classifiers in Figure 4). These properties combined enable the hyperbolic model to excel at classifying hierarchical labels, with outstanding improvements on very fine-grained types.
The reduction of the parameter size is also remarkable: 70% and 91% versus DENOISED and BERT, respectively. This emphasizes the importance of choosing a suitable metric space that fits the data distribution (hierarchical in this case) as a powerful and efficient inductive bias. Through adequate tools and formulations, we are able to exploit this bias without introducing an overload on the parameter cost.
The correspondence of results between HY BASE and LABELGCN suggests that both models capture similar information. LABELGCN requires label co-occurrence statistics represented as a weighted graph, from which a hierarchy can be easily derived (Krioukov et al., 2010). The similarity of results indicates that the hyperbolic model is able to implicitly encode the latent hierarchical information in the label co-occurrences without additional inputs or the burden of the graph layer.
To shed light on this aspect, we inspect the points p k learned by HY BASE to define the hyperplanes of Equation 10. Table 2 shows the types corresponding to the closest points to the label person and its subtypes, measured by hyperbolic distance. The types are highly correlated given that they often co-occur in similar contexts. Moreover, the model captures hyponymic relations (is-a) present in the label co-occurrences. An analogous behaviour is observed for other types (see tables in Appendix G). The inductive bias given by the hyperbolic geometry allows the model to capture the hierarchy, deriving a meaningful and interpretable representation of the label space: coarse labels near the origin, fine-grained labels near the boundary, and hyponymic relations are preserved. It is also noteworthy that the model learns these relations automatically without requiring the explicit data encoded in the graph.

Comparison of the Spaces
A comparison of the metric spaces for different models on the test set is shown in Table 3. It can be seen that the hyperbolic model outperforms its Euclidean variants in all settings. Notably, this trend holds even in high-dimensional spaces (500 dimensions for XLARGE). Since the label inventory exhibits a clearly hierarchical structure, it perfectly suits the hyperbolic classification method. The hyperbolic model brings considerable gains as the granularity becomes finer: 5.1% and 16.2% relative improvement in fine and ultra-fine Macro F_1, respectively, for the BASE model over its Euclidean counterpart. We also observe that as the size of the model increases, the Euclidean baseline becomes more competitive on ultra-fine. This is due to the Euclidean model gaining enough capacity in higher dimensions to accommodate the separation hyperplanes, thus reducing the gap.
It is noticeable that the BASE model outperforms the larger ones on coarse and fine granularities. That is due to the larger models overfitting given the low dropout applied. Moreover, Euclidean and hyperbolic models exhibit a similar performance on the coarse granularity when compared to each other. A possible explanation is that the separation planes for these labels are located closer to the origin of the space. In this region, the spaces behave alike in terms of the distance calculation, and this similarity is reflected in the results as well.

Word Embeddings Ablation
The input to both the Euclidean and hyperbolic models are Poincaré GloVe embeddings, which are originally trained in hyperbolic space (Tifrea et al., 2019). This might favor the hyperbolic model, despite the application of the log_0 map in the Euclidean case. Thus, we replace the hyperbolic embeddings with regular GloVe embeddings (Pennington et al., 2014), and use exp_0 in the hyperbolic model to project them into the ball. The results show that the advantage of the hyperbolic model still holds, and that the improvement does not come from the embeddings. Also, in this way we showcase how the hyperbolic model can be easily integrated with regular word embeddings.

Component Ablation
With the aim of analyzing the contribution of the different hyperbolic components, we perform an ablation study on the BASE model. We divide the system into encoder, attention (in both the mention and context encoders), concatenation, and MLR, and replace them, one at a time, with their Euclidean counterparts. Note that when Euclidean and hyperbolic components are mixed, we convert the internal representations from one manifold to the other with the exp_0 and log_0 maps. As we see in Table 5, the MLR is the component that contributes the most to ultra-fine classification. The hierarchical structure of the type inventory, combined with the hyperbolic definition of the hyperplanes, is the reason for this (see Figure 4).
Hyperbolic attention and concatenation are relevant for coarse and fine-grained classification (the biggest drops appear when they are removed), but do not play a major role in the ultra-fine granularity.
Finally, the encoders do not benefit from the hyperbolic representation. We attribute this to the model not being able to capture tree-like relations among the input tokens in a way that can be exploited for the task.
This ablation suggests that the main benefits of hyperbolic layers arise when they are incorporated at deeper levels of representation in the model, and not over low-level features or raw text.
Computing time: Möbius operations are more expensive than their Euclidean counterparts. As a consequence, in our experiments we found the hyperbolic encoder to be twice as slow, and the MLR 1.5 times as slow, as their Euclidean versions.

OntoNotes Dataset
To further understand the capabilities of the proposed model we also perform an evaluation on the OntoNotes dataset (Gillick et al., 2014). In this case, we apply the standard binary cross-entropy loss, since fine-grained labels are scarce in this dataset. Following previous work (Xiong et al., 2019), we train on the dataset augmented by Choi et al. (2018). Results for the three granularities for the BASE and LARGE models are presented in Table 6. The hyperbolic models outperform the Euclidean baselines in both cases, and the difference is noticeable for fine and ultra-fine (42.0 vs 38.2 and 24.0 vs 18.9 Macro F_1 for the LARGE model), in accordance with the results on Ultra-Fine. We report a comparison with neural systems in Table 7. The hyperbolic model achieves competitive performance without requiring the explicit hierarchy provided in this dataset. Nonetheless, the advantages of the hyperbolic model are mitigated by the low multiplicity of fine-grained labels and the shallower hierarchy.

Related Work
Type inventories for the task of fine-grained entity typing (Ling and Weld, 2012; Yosef et al., 2012) have grown in size and complexity (Del Corro et al., 2015; Choi et al., 2018). Researchers have tried to incorporate hierarchical information on the type distribution in different manners (Shimaoka et al., 2016; Ren et al., 2016a). Shimaoka et al. (2017) encode the hierarchy through a sparse matrix. Xu and Barbosa (2018) formulate a hierarchy-aware loss. Hyperbolic geometry has been applied in question answering (Tay et al., 2018), in machine translation (Gulcehre et al., 2019), and in language modeling (Dhingra et al., 2018; Tifrea et al., 2019). We build upon the hyperbolic neural layers introduced in Ganea et al. (2018), and develop the missing components to perform not binary, but multi-class multi-label text classification. We test the proposed model not on a synthetic dataset, but on a concrete downstream task, namely entity typing. Our work resembles López et al. (2019) and Chen et al. (2019), though they separately learn embeddings for type labels and text representations in hyperbolic space, whereas we do it in an integrated fashion.

[Table 7 (excerpt): Comparison on OntoNotes (Model / Acc. / Ma-F_1 / Mi-F_1). Shimaoka et al. (2017): 51.7 / 70.9 / 64.9. AFET (Ren et al., 2016a): …]

Conclusions
Incorporating hierarchical information from the label inventory into neural models has become critical to improve performance. Hyperbolic spaces are an exciting approach since they are naturally equipped to model hierarchical structures. However, previous work only integrated isolated hyperbolic components into neural systems. In this work we propose a fully hyperbolic model and showcase its effectiveness on challenging datasets. Our hyperbolic model automatically infers the latent hierarchy from the class distribution, captures implicit hyponymic relations in the inventory, and achieves a performance comparable to state-of-the-art systems on very fine-grained labels, with a remarkable reduction of the parameter size. This emphasizes the importance of choosing a metric space suitable to the data distribution as an effective inductive bias to capture fundamental properties, such as hierarchical structure.
Moreover, we illustrate ways to integrate different components with Euclidean layers, showing their strengths and drawbacks. An interesting future direction is to employ hyperbolic representations in combination with contextualized word embeddings. We release our implementation with the aim to ease the adoption of hyperbolic components into neural models, yielding lightweight and efficient systems.

A Basics of Riemannian Geometry
Manifold: an n-dimensional manifold M is a space that can locally be approximated by R^n. It generalizes the notion of a 2D surface to higher dimensions. More concretely, for each point x on M, we can find a homeomorphism (a continuous bijection with continuous inverse) between a neighbourhood of x and R^n.
Tangent space: the tangent space T_x M at a point x on M is an n-dimensional hyperplane in R^{n+1} that best approximates M around x. It is the first-order linear approximation.
Riemannian metric: a Riemannian metric g = (g_x)_{x∈M} on M is a collection of inner products g_x : T_x M × T_x M → R on tangent spaces, varying smoothly with x. Riemannian metrics can be used to measure distances on manifolds.
Riemannian manifold: a pair (M, g), where M is a smooth manifold and g = (g_x)_{x∈M} is a Riemannian metric.
Geodesics: γ : [0, 1] → M are the generalizations of straight lines to Riemannian manifolds, i.e., constant speed curves that are locally distance minimizing. In the Poincaré disk model, geodesics are circles that are orthogonal to the boundary of the disc as well as diameters.
Parallel transport: defined as P x→y : T x M → T y M, is a linear isometry between tangent spaces that corresponds to moving tangent vectors along geodesics. It is a generalization of translation to non-Euclidean geometry, and it defines a canonical way to connect tangent spaces.

B Möbius Operations
Möbius scalar multiplication: for x ∈ D^n \{0}, the Möbius scalar multiplication by r ∈ R is defined as:

r ⊗ x = tanh( r tanh⁻¹(‖x‖) ) x / ‖x‖   (11)

and r ⊗ 0 := 0. By making use of the exp and log maps, this expression reduces to:

r ⊗ x = exp_0(r log_0(x)), ∀r ∈ R, x ∈ D^n   (12)

Exponential and logarithmic maps: The mapping between the tangent space and the hyperbolic space is done by the exponential map exp_x : T_x D^n → D^n and the logarithmic map log_x : D^n → T_x D^n. They are given for v ∈ T_x D^n \{0} and y ∈ D^n \{0}, y ≠ x, by:

exp_x(v) = x ⊕ ( tanh( λ_x ‖v‖ / 2 ) v / ‖v‖ )

log_x(y) = (2 / λ_x) tanh⁻¹(‖−x ⊕ y‖) (−x ⊕ y) / ‖−x ⊕ y‖

These expressions become simpler when x = 0, that is, at the origin of the space. It can be seen that the matrix-vector multiplication formula is derived from M ⊗ y = exp_0(M log_0(y)): the point y ∈ D^n is mapped to the tangent space T_0 D^n, the linear map M is applied in the Euclidean subspace, and the result is mapped back into the ball. A similar approach holds for the Möbius scalar multiplication and for the application of pointwise non-linearity functions to elements in the Poincaré ball (see Ganea et al. (2018), Section 2.4).
Parallel transport with exp and log maps: By applying the exp and log maps, the parallel transport in the Poincaré ball of a vector v ∈ T_0 D^n to another tangent space T_x D^n is given by:

P_{0→x}(v) = log_x(x ⊕ exp_0(v)) = (λ_0 / λ_x) v

This result is used to define and optimize the parameters a_k = (λ_0 / λ_{p_k}) a′_k in the hyperbolic MLR (Appendix E).
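The identity in Eq. (12) can be checked numerically, along with the gyrovector analogue of 2x = x + x (a small sketch with our own helper names):

```python
import numpy as np

def exp0(v):
    n = np.linalg.norm(v)
    return v if n == 0 else np.tanh(n) * v / n

def log0(y):
    n = np.linalg.norm(y)
    return y if n == 0 else np.arctanh(n) * y / n

def mobius_scalar(r, x):
    # Closed form of Mobius scalar multiplication (Eq. 11).
    n = np.linalg.norm(x)
    return np.zeros_like(x) if n == 0 else np.tanh(r * np.arctanh(n)) * x / n

def mobius_add(x, y):
    xy, x2, y2 = x @ y, x @ x, y @ y
    return ((1 + 2 * xy + y2) * x + (1 - x2) * y) / (1 + 2 * xy + x2 * y2)

x = np.array([0.3, -0.2, 0.1])
```

Even large scalars cannot push a point out of the ball: the norm saturates toward 1 through the tanh.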

C Hyperbolic Gated Recurrent Unit
To encode the context we apply a hyperbolic version of gated recurrent units (GRU) (Cho et al., 2014) proposed in Ganea et al. (2018):

r_t = σ(log_0(W^r ⊗ h_{t−1} ⊕ U^r ⊗ x_t ⊕ b^r))
z_t = σ(log_0(W^z ⊗ h_{t−1} ⊕ U^z ⊗ x_t ⊕ b^z))
h̃_t = φ^⊗((W diag(r_t)) ⊗ h_{t−1} ⊕ U ⊗ x_t ⊕ b)
h_t = h_{t−1} ⊕ diag(z_t) ⊗ (−h_{t−1} ⊕ h̃_t)

where W ∈ R^{d_S×d_S}, U ∈ R^{d_S×n}, x_t ∈ D^n and b ∈ D^{d_S} (superscripts are omitted). r_t is the reset gate, z_t is the update gate, φ^⊗ is the Möbius version of the pointwise non-linearity, diag(x) denotes a diagonal matrix with each element of the vector x on its diagonal, and σ is the sigmoid function.
In the case of Gulcehre et al. (2019), the application of the Einstein midpoint (Ungar, 2010, Theorem 4.4) requires the mapping of the points onto the Klein model. By applying the Möbius midpoint, we avoid this mapping, and achieve an attention mechanism that operates only in one model of hyperbolic space.
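To illustrate the roundtrip that the Einstein midpoint requires, here is a minimal sketch (our names; the model-change maps and midpoint formula are the standard ones from Ungar, 2010). The Möbius midpoint avoids precisely the two `poincare_to_klein` / `klein_to_poincare` conversions below:

```python
import numpy as np

def poincare_to_klein(p):
    """Map a point of the Poincare ball onto the Klein model."""
    return 2.0 * p / (1.0 + np.dot(p, p))

def klein_to_poincare(k):
    """Map a point of the Klein model back onto the Poincare ball."""
    return k / (1.0 + np.sqrt(1.0 - np.dot(k, k)))

def einstein_midpoint(points, weights):
    """Weighted Einstein midpoint: m = sum_i w_i g_i x_i / sum_i w_i g_i,
    with Lorentz factors g_i = 1 / sqrt(1 - ||x_i||^2) in the Klein model."""
    ks = [poincare_to_klein(p) for p in points]
    gammas = [1.0 / np.sqrt(1.0 - np.dot(k, k)) for k in ks]
    num = sum(w * g * k for w, g, k in zip(weights, gammas, ks))
    den = sum(w * g for w, g in zip(weights, gammas))
    return klein_to_poincare(num / den)
```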

D.2 Experimental Observations
To obtain the weights for the attention mechanism, Equation 8 was initially given by f(−β d_D(q_i, k_i) − c), following Gulcehre et al. (2019). We experimented with replacing f with the sigmoid and softmax functions, and found better performance with the latter. Moreover, empirical observations led us to remove the c value, since it converged to zero in all experiments. We believe that the biases b_Q and b_K from Equation 8 compensate for this c.
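The resulting weighting scheme, with f as a softmax and the c term dropped, can be sketched as follows (a simplified illustration, not the paper's exact Eq. 8; β is the inverse-temperature scalar and the names are ours):

```python
import numpy as np

def d_poincare(x, y):
    """Poincare-ball distance (c = 1)."""
    sq = np.sum((x - y) ** 2)
    return np.arccosh(1.0 + 2.0 * sq / ((1.0 - np.dot(x, x)) * (1.0 - np.dot(y, y))))

def attention_weights(queries, keys, beta=1.0):
    """Softmax over negative hyperbolic distances:
    closer query/key pairs receive higher attention weight."""
    scores = np.array([-beta * d_poincare(q, k) for q, k in zip(queries, keys)])
    e = np.exp(scores - scores.max())  # shift for numerical stability
    return e / e.sum()
```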

D.3 Queries and Keys
To further analyze the attention mechanism we plot the query q_i and key k_i points of Equation 8 for both models in Figure 5. Recall that the shorter the distance between points, the higher the attention weight assigned to the word. We observed that the attention is prominently centered on the mention in both models, assigning very low weights to the rest of the words in the context. In the Euclidean space we can clearly distinguish two clusters, which makes the distance-based attention give very low weights to most words of the context. The small red cluster on the top right of the image corresponds to points for words in the mention span. These words are projected very close to the key vector in order to minimize the distance and increase the attention weight. In the hyperbolic model, the queries cluster at the bottom of the plot, whereas the keys are the points that adjust the distance to define the weight of each word.

Figure 5: Query and key points of Equation 8. (a) Euclidean space. (b) Hyperbolic space.

E.1 Hyperbolic MLR
The original formulation from Ganea et al. (2018) for MLR in the Poincaré ball: given K classes and k ∈ {1, ..., K}, p_k ∈ D^n, a_k ∈ T_{p_k} D^n \ {0}, the hyperbolic MLR is:

p(y = k | x) ∝ exp(λ_{p_k} ‖a_k‖ sinh⁻¹(2⟨−p_k ⊕ x, a_k⟩ / ((1 − ‖−p_k ⊕ x‖²) ‖a_k‖)))

where x ∈ D^n, p_k and a_k are trainable parameters, and c is a parameter related to the radius of the Poincaré ball r = 1/√c, which in this work we assume to be c = 1; hence it is omitted from the formulations. Since a_k ∈ T_{p_k} D^n and therefore depends on p_k, it is unclear how to perform optimization. The solution proposed by Ganea et al. (2018) is to re-express it as:

a_k = P_{0→p_k}(a'_k) = (λ_0 / λ_{p_k}) a'_k

where a'_k ∈ T_0 D^n = R^n. In this way we can optimize a'_k as a Euclidean parameter. Finally, when we use a'_k instead of a_k, the formula for the MLR becomes:

p(y = k | x) ∝ exp(2 ‖a'_k‖ sinh⁻¹(2⟨−p_k ⊕ x, a'_k⟩ / ((1 − ‖−p_k ⊕ x‖²) ‖a'_k‖)))
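The per-class logit of the re-expressed MLR can be sketched directly from the formula above (a minimal NumPy version for c = 1, assuming a_k is the Euclidean tangent-space parameter; names are ours):

```python
import numpy as np

def mobius_add(x, y):
    """Mobius addition in the Poincare ball (c = 1)."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    return ((1 + 2 * xy + y2) * x + (1 - x2) * y) / (1 + 2 * xy + x2 * y2)

def hyperbolic_mlr_score(x, p_k, a_k):
    """Logit of class k: 2 ||a_k|| * asinh(2 <(-p_k (+) x), a_k> /
    ((1 - ||(-p_k (+) x)||^2) ||a_k||)), with a_k in T_0 D^n."""
    z = mobius_add(-p_k, x)
    a_norm = np.linalg.norm(a_k)
    return 2.0 * a_norm * np.arcsinh(
        2.0 * np.dot(z, a_k) / ((1.0 - np.dot(z, z)) * a_norm))
```

The score vanishes on the decision hyperplane through p_k (in particular at x = p_k) and its sign is that of the inner product with a_k, as in Euclidean logistic regression.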

E.2 Euclidean MLR
The Euclidean formulation of the MLR is given by:

p(y = k | x) ∝ exp(4⟨x − p_k, a_k⟩)

This equation arises from taking the limit c → 0 in Equation 18. In that case, −p_k ⊕ x reduces to x − p_k and sinh⁻¹ reduces to its argument.
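The limit can be made explicit by reinstating the curvature c (set to 1 elsewhere in this appendix); a sketch, using λ^c_x = 2/(1 − c‖x‖²) → 2, −p_k ⊕_c x → x − p_k, and sinh⁻¹(z) = z + O(z³) as c → 0:

```latex
\[
\lim_{c \to 0}
\frac{\lambda^{c}_{p_k}\,\lVert a_k\rVert}{\sqrt{c}}
\sinh^{-1}\!\left(
  \frac{2\sqrt{c}\,\langle -p_k \oplus_c x,\; a_k\rangle}
       {\bigl(1 - c\,\lVert -p_k \oplus_c x\rVert^2\bigr)\,\lVert a_k\rVert}
\right)
= 2\,\lVert a_k\rVert \cdot
  \frac{2\,\langle x - p_k,\; a_k\rangle}{\lVert a_k\rVert}
= 4\,\langle x - p_k,\; a_k\rangle .
\]
```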

F Experimental Details
For the context-GRU we use tanh as the non-linearity, to establish a fair comparison against the classical GRU (Cho et al., 2014). For the char-RNN we use the identity (no non-linearity). The MLR is fed with the final representation, obtained by concatenating the mention and context features. In the XLARGE model, we use the Euclidean encoder in all experiments given time constraints.
Hyperparameters: Both hyperbolic and Euclidean models were trained with the hyperparameters detailed in Table 8.
Dropout: We apply low values of dropout given that the model was very sensitive to this parameter. We consider this logical behaviour, since distances in hyperbolic space grow exponentially with the norm of the points, making the model very responsive to perturbations of the representations.
Numerical Errors: they appear when the norm of the hyperbolic vectors is very close to 1 or 0.
To avoid them we follow the recommendations reported in Ganea et al. (2018). The result of every hyperbolic operation is projected back into the ball of radius 1 − ε, with ε = 10⁻⁵. When vectors are very close to 0, they are perturbed with ε = 10⁻¹⁵ before they are used in any of the above operations. Arguments of the tanh function are clipped to ±15, while arguments of tanh⁻¹ are clipped to the interval [−1 + 10⁻¹⁵, 1 − 10⁻¹⁵]. Finally, following the recommendations of the Geoopt developers (Kochurov et al., 2020), we operate on 64-bit floating point.
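These safeguards amount to a few small helpers; a minimal sketch with the constants quoted above (function names are ours):

```python
import numpy as np

BALL_EPS = 1e-5    # keep results inside the ball of radius 1 - BALL_EPS
ZERO_EPS = 1e-15   # clipping margin for artanh near +/- 1
TANH_CLIP = 15.0   # clip arguments of tanh to +/- 15

def project(x):
    """Rescale a point back inside the ball of radius 1 - BALL_EPS."""
    norm = np.linalg.norm(x)
    max_norm = 1.0 - BALL_EPS
    return x * (max_norm / norm) if norm > max_norm else x

def safe_tanh(v):
    """tanh with its argument clipped to avoid saturation issues."""
    return np.tanh(np.clip(v, -TANH_CLIP, TANH_CLIP))

def safe_artanh(v):
    """artanh with its argument clipped away from +/- 1."""
    return np.arctanh(np.clip(v, -1.0 + ZERO_EPS, 1.0 - ZERO_EPS))
```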
Initialization: we initialize character and positional embeddings randomly from the uniform distribution U (−0.0001, 0.0001). In the case of the hyperbolic model, we map them into the ball with the exp 0 map. We initialize all layers in the model using Glorot uniform initialization.
Exponential and logarithmic map: In the case of the GloVe embedding ablation (Section 6.1.1), we used the 100d version, trained on Wikipedia and Gigaword. By directly applying the exponential map, the embeddings were projected close to the border of the ball, making the model very unstable. To overcome this, we use the parameter c described in Ganea et al. (2018) to adjust the radius of the ball, which helps project the embeddings closer to the origin of the space.
Hardware: All experiments for the hyperbolic and Euclidean models were performed using 2 NVIDIA P40 GPUs, with the batch sizes specified in Table 8.

G Closest Types
We report the points p k learned by the model to define the hyperplanes of Equation 10. Table 9 shows the types corresponding to the closest points, measured by their hyperbolic distance d D (see Eq 1), to the coarse labels. We observe that the types are highly correlated given that they often co-occur in the same context.

H More Experimental Observations
Text vector norms: By "text vector" we refer to the concatenation of the context, mention, and char-level mention representations before the MLR layer. We report the average norm of these vectors per training epoch, for the 20D Euclidean and hyperbolic models, in Figure 6. The norm of the vectors of the hyperbolic model is measured according to the hyperbolic distance d_D (see Equation 1). That is, we take the hyperbolic distance from the origin to the point, thus the values can be above one. The norm in the Euclidean model is measured with the Euclidean norm. We observe that both models learn to reduce the norm of the vectors, and the convergence value for the Euclidean model is noticeably higher than for the hyperbolic model.

Table 9: Closest p_k points in the Poincaré ball to coarse entity types, with their hyperbolic distance. In many cases, a hierarchical relation holds with the closest type; for example: firm is-a institution is-a organization.
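The hyperbolic "norm" reported in Figure 6 is simply the distance from the origin, which at the origin has the closed form below (a one-line sketch; the function name is ours):

```python
import numpy as np

def hyperbolic_norm(x):
    """d_D(0, x) = 2 * artanh(||x||): the hyperbolic distance from the
    origin to x. It always exceeds the Euclidean norm ||x|| and diverges
    as x approaches the boundary of the ball."""
    return 2.0 * np.arctanh(np.linalg.norm(x))
```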