Low-Dimensional Hyperbolic Knowledge Graph Embeddings

Knowledge graph (KG) embeddings learn low-dimensional representations of entities and relations to predict missing facts. KGs often exhibit hierarchical and logical patterns which must be preserved in the embedding space. For hierarchical data, hyperbolic embedding methods have shown promise for high-fidelity and parsimonious representations. However, existing hyperbolic embedding methods do not account for the rich logical patterns in KGs. In this work, we introduce a class of hyperbolic KG embedding models that simultaneously capture hierarchical and logical patterns. Our approach combines hyperbolic reflections and rotations with attention to model complex relational patterns. Experimental results on standard KG benchmarks show that our method improves over previous Euclidean- and hyperbolic-based efforts by up to 6.1% in mean reciprocal rank (MRR) in low dimensions. Furthermore, we observe that different geometric transformations capture different types of relations while attention-based transformations generalize to multiple relations. In high dimensions, our approach yields new state-of-the-art MRRs of 49.6% on WN18RR and 57.7% on YAGO3-10.


Introduction
Knowledge graphs (KGs), consisting of (head entity, relationship, tail entity) triples, are popular data structures for representing factual knowledge to be queried and used in downstream applications such as word sense disambiguation, question answering, and information extraction. Real-world KGs such as YAGO (Suchanek et al., 2007) or WordNet (Miller, 1995) are usually incomplete, so a common approach to predicting missing links in KGs is via embedding into vector spaces. Embedding methods learn representations of entities and relationships that preserve the information found in the graph, and have achieved promising results for many tasks.
Relations found in KGs have differing properties: for example, (Michelle Obama, married to, Barack Obama) is symmetric, whereas hypernym relations like (cat, specific type of, feline) are not (Figure 1). These distinctions present a challenge to embedding methods: preserving each type of behavior requires producing a different geometric pattern in the embedding space. One popular approach is to use extremely high-dimensional embeddings, which offer more flexibility for such patterns. However, given the large number of entities found in KGs, doing so yields very high memory costs.
For hierarchical data, hyperbolic geometry offers an exciting approach to learn low-dimensional embeddings while preserving latent hierarchies. Hyperbolic space can embed trees with arbitrarily low distortion in just two dimensions. Recent research has proposed embedding hierarchical graphs into these spaces instead of conventional Euclidean space (Nickel and Kiela, 2017;Sala et al., 2018). However, these works focus on embedding simpler graphs (e.g., weighted trees) and cannot express the diverse and complex relationships in KGs.
We propose a new hyperbolic embedding approach that captures such patterns to achieve the best of both worlds. Our proposed approach produces the parsimonious representations offered by hyperbolic space, especially suitable for hierarchical relations, and is effective even with low-dimensional embeddings. It also uses rich transformations to encode logical patterns in KGs, previously only defined in Euclidean space. To accomplish this, we (1) train hyperbolic embeddings with relation-specific curvatures to preserve multiple hierarchies in KGs; (2) parameterize hyperbolic isometries (distance-preserving operations) and leverage their geometric properties to capture relations' logical patterns, such as symmetry or anti-symmetry; and (3) use a notion of hyperbolic attention to combine geometric operators and capture multiple logical patterns.
We evaluate the performance of our approach, ATTH, on the KG link prediction task using the standard WN18RR (Dettmers et al., 2018; Bordes et al., 2013), FB15k-237 (Toutanova and Chen, 2015) and YAGO3-10 (Mahdisoltani et al., 2013) benchmarks. (1) In low (32) dimensions, we improve over Euclidean-based models by up to 6.1% in the mean reciprocal rank (MRR) metric. In particular, we find that hierarchical relationships, such as WordNet's hypernym and member meronym, significantly benefit from hyperbolic space; we observe a 16% to 24% relative improvement versus Euclidean baselines. (2) We find that geometric properties of hyperbolic isometries directly map to logical properties of relationships. We study symmetric and anti-symmetric patterns and find that reflections capture symmetric relations while rotations capture anti-symmetry. (3) We show that attention-based transformations have the ability to generalize to multiple logical patterns. For instance, we observe that ATTH recovers reflections for symmetric relations and rotations for anti-symmetric ones.
In high (500) dimensions, we find that both hyperbolic and Euclidean embeddings achieve similar performance, and our approach achieves new state-of-the-art (SotA) results, obtaining 49.6% MRR on WN18RR and 57.7% on YAGO3-10. Our experiments show that trainable curvature is critical to generalize hyperbolic embedding methods to high dimensions. Finally, we visualize embeddings learned in hyperbolic spaces and show that hyperbolic geometry effectively preserves hierarchies in KGs.

Related Work
Previous methods for KG embeddings also rely on geometric properties. Improvements have been obtained by exploiting either more sophisticated spaces (e.g., going from Euclidean to complex or hyperbolic space) or more sophisticated operations (e.g., from translations to isometries, or to learning graph neural networks). In contrast, our approach takes a step forward in both directions.
Euclidean embeddings In the past decade, there has been a rich literature on Euclidean embeddings for KG representation learning. These include translation approaches (Bordes et al., 2013; Ji et al., 2015; Wang et al., 2014; Lin et al., 2015) or tensor factorization methods such as RESCAL (Nickel et al., 2011) or DistMult. While these methods are fairly simple and have few parameters, they fail to encode important logical properties (e.g., translations can't encode symmetry).
Complex embeddings Recently, there has been interest in learning embeddings in complex space, as in the ComplEx (Trouillon et al., 2016) and RotatE (Sun et al., 2019) models. RotatE learns rotations in complex space, which are very effective in capturing logical properties such as symmetry, anti-symmetry, composition or inversion. The recent QuatE model (Zhang et al., 2019) learns KG embeddings using quaternions. However, a downside is that these embeddings require very high-dimensional spaces, leading to high memory costs.
Deep neural networks Another family of methods uses neural networks to produce KG embeddings. For instance, R-GCN (Schlichtkrull et al., 2018) extends graph neural networks to the multi-relational setting by adding a relation-specific aggregation step. ConvE and ConvKB (Dettmers et al., 2018; Nguyen et al., 2018) leverage the expressiveness of convolutional neural networks to learn entity embeddings and relation embeddings. More recently, the KBGAT (Nathani et al., 2019) and A2N (Bansal et al., 2019) models use graph attention networks for knowledge graph embeddings. A downside of these methods is that they are computationally expensive, as they usually require pre-trained KG embeddings as input for the neural network.
Hyperbolic embeddings To the best of our knowledge, MuRP (Balažević et al., 2019) is the only method that learns KG embeddings in hyperbolic space in order to target hierarchical data. MuRP minimizes hyperbolic distances between a re-scaled version of the head entity embedding and a translation of the tail entity embedding. It achieves promising results using hyperbolic embeddings with fewer dimensions than its Euclidean analogues. However, MuRP is a translation model and fails to encode some logical properties of relationships. Furthermore, embeddings are learned in a hyperbolic space with fixed curvature, potentially leading to insufficient precision, and training relies on cumbersome Riemannian optimization. Instead, our proposed method leverages expressive hyperbolic isometries to simultaneously capture logical patterns and hierarchies. Furthermore, embeddings are learned using tangent space (i.e., Euclidean) optimization methods and trainable hyperbolic curvatures per relationship, avoiding precision errors that might arise when using a fixed curvature, and providing flexibility to encode multiple hierarchies.

Problem Formulation and Background
We describe the KG embedding problem setting and give some necessary background on hyperbolic geometry.

Knowledge graph embeddings
In the KG embedding problem, we are given a set of triples $(h, r, t) \in \mathcal{E} \subseteq \mathcal{V} \times \mathcal{R} \times \mathcal{V}$, where $\mathcal{V}$ and $\mathcal{R}$ are entity and relationship sets, respectively. The goal is to map entities $v \in \mathcal{V}$ to embeddings $e_v \in \mathcal{U}^{d_\mathcal{V}}$ and relationships $r \in \mathcal{R}$ to embeddings $r_r \in \mathcal{U}^{d_\mathcal{R}}$, for some choice of space $\mathcal{U}$ (traditionally $\mathbb{R}$), such that the KG structure is preserved.
Concretely, the data is split into $\mathcal{E}_{Train}$ and $\mathcal{E}_{Test}$ triples. Embeddings are learned by optimizing a scoring function $s : \mathcal{V} \times \mathcal{R} \times \mathcal{V} \to \mathbb{R}$, which measures triples' likelihoods. $s(\cdot, \cdot, \cdot)$ is trained using triples in $\mathcal{E}_{Train}$, and the learned embeddings are then used to predict scores for triples in $\mathcal{E}_{Test}$. The goal is to learn embeddings such that the scores of triples in $\mathcal{E}_{Test}$ are high compared to triples that are not present in $\mathcal{E}$.

Hyperbolic geometry
We briefly review key notions from hyperbolic geometry; a more in-depth treatment is available in standard texts (Robbin and Salamon). Hyperbolic geometry is a non-Euclidean geometry with constant negative curvature. In this work, we use the $d$-dimensional Poincaré ball model with negative curvature $-c$ ($c > 0$): $\mathcal{B}^{d,c} = \{x \in \mathbb{R}^d : \|x\|^2 < \frac{1}{c}\}$. For each point $x \in \mathcal{B}^{d,c}$, the tangent space $\mathcal{T}_x^c$ is a $d$-dimensional vector space containing all possible directions of paths in $\mathcal{B}^{d,c}$ leaving from $x$.
The tangent space $\mathcal{T}_x^c$ maps to $\mathcal{B}^{d,c}$ via the exponential map (Figure 2), and conversely, the logarithmic map maps $\mathcal{B}^{d,c}$ to $\mathcal{T}_x^c$. In particular, we have closed-form expressions for these maps at the origin:

$$\exp_0^c(v) = \tanh(\sqrt{c}\,\|v\|)\,\frac{v}{\sqrt{c}\,\|v\|}, \quad (1)$$
$$\log_0^c(y) = \mathrm{arctanh}(\sqrt{c}\,\|y\|)\,\frac{y}{\sqrt{c}\,\|y\|}. \quad (2)$$

Vector addition is not well-defined in hyperbolic space (adding two points in the Poincaré ball might result in a point outside the ball). Instead, Möbius addition $\oplus_c$ (Ganea et al., 2018) provides an analogue to Euclidean addition for hyperbolic space. We give its closed-form expression in Appendix A.1. Finally, the hyperbolic distance on $\mathcal{B}^{d,c}$ has the explicit formula:

$$d^c(x, y) = \frac{2}{\sqrt{c}}\,\mathrm{arctanh}(\sqrt{c}\,\|{-x} \oplus_c y\|). \quad (3)$$
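These closed-form maps are easy to sketch in code. Below is a minimal NumPy sketch (our own illustration, not the paper's released implementation; the function names are ours) of the origin exponential and logarithmic maps, Möbius addition, and the hyperbolic distance on the Poincaré ball:

```python
import numpy as np

def expmap0(v, c):
    """Exponential map at the origin of the Poincare ball with curvature -c."""
    sqrt_c = np.sqrt(c)
    norm = np.linalg.norm(v)
    if norm < 1e-15:
        return v
    return np.tanh(sqrt_c * norm) * v / (sqrt_c * norm)

def logmap0(y, c):
    """Logarithmic map at the origin (inverse of expmap0)."""
    sqrt_c = np.sqrt(c)
    norm = np.linalg.norm(y)
    if norm < 1e-15:
        return y
    return np.arctanh(sqrt_c * norm) * y / (sqrt_c * norm)

def mobius_add(x, y, c):
    """Mobius addition, the hyperbolic analogue of vector addition."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    denom = 1 + 2 * c * xy + c ** 2 * x2 * y2
    return num / denom

def hyp_distance(x, y, c):
    """Hyperbolic distance on the Poincare ball."""
    sqrt_c = np.sqrt(c)
    diff = mobius_add(-x, y, c)
    return (2.0 / sqrt_c) * np.arctanh(sqrt_c * np.linalg.norm(diff))
```

As a sanity check, the two maps are mutual inverses at the origin, and the distance is symmetric with zero self-distance.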

Methodology
The goal of this work is to learn parsimonious hyperbolic embeddings that can encode complex logical patterns such as symmetry, anti-symmetry, or inversion while preserving latent hierarchies. Our model, ATTH, (1) learns KG embeddings in hyperbolic space in order to preserve hierarchies (Section 4.1), (2) uses a class of hyperbolic isometries parameterized by compositions of Givens transformations to encode logical patterns (Section 4.2), (3) combines these isometries with hyperbolic attention (Section 4.3). We describe the full model in Section 4.4.

Hierarchies in hyperbolic space
As described, hyperbolic embeddings enable us to represent hierarchies even when we limit ourselves to low-dimensional spaces. In fact, two-dimensional hyperbolic space can represent any tree with arbitrarily small error (Sala et al., 2018). It is important to set the curvature of the hyperbolic space correctly. This parameter provides flexibility to the model, as it determines whether to embed relations into a more curved hyperbolic space (more "tree-like") or into a flatter, more "Euclidean-like" geometry. For each relation, we learn a relation-specific absolute curvature $c_r$, enabling us to represent a variety of hierarchies. As we show in Section 5.5, fixing, rather than learning, curvatures can lead to significant performance degradation.

Hyperbolic isometries
Relationships often satisfy particular properties, such as symmetry: e.g., if (Michelle Obama, married to, Barack Obama) holds, then (Barack Obama, married to, Michelle Obama) does as well. These rules are not universal. For instance, (Barack Obama, born in, Hawaii) is not symmetric.
Creating and curating a set of deterministic rules is infeasible for large-scale KGs; instead, embedding methods represent relations as parameterized geometric operations that directly map to logical properties. We use two such operations in hyperbolic space: rotations, which effectively capture compositions or anti-symmetric patterns, and reflections, which naturally encode symmetric patterns.
Rotations Rotations have been successfully used to encode compositions in complex space with the RotatE model (Sun et al., 2019); we lift these to hyperbolic space. Compared to translations or tensor factorization approaches which can only infer some logical patterns, rotations can simultaneously model and infer inversion, composition, symmetric or anti-symmetric patterns.
Reflections These isometries reflect along a fixed subspace. While some rotations can represent symmetric relations (more specifically, π-rotations), any reflection can naturally represent symmetric relations, since their second power is the identity. They provide a way to fill in missing entries in symmetric triples, by applying the same operation to both the tail and the head entity. For instance, by modelling sibling of with a reflection, we can directly infer (Bob, sibling of, Alice) from (Alice, sibling of, Bob) and vice versa.

In hyperbolic space, the distance between start and end points after applying rotations or reflections is much larger than the Euclidean distance; it approaches the sum of the distances between the points and the origin, giving more "room" to separate embeddings. This is similar to trees, where the shortest path between two points goes through their nearest common ancestor.
Parameterization Unlike RotatE, which models rotations via unitary complex numbers, we learn relationship-specific isometries using Givens transformations, 2 × 2 matrices commonly used in numerical linear algebra. Let $\Theta_r := (\theta_{r,1}, \ldots, \theta_{r,d/2})$ and $\Phi_r := (\phi_{r,1}, \ldots, \phi_{r,d/2})$ denote relation-specific parameters. Using an even number of dimensions $d$, our model parameterizes rotations and reflections with block-diagonal matrices of the form:

$$\mathrm{Rot}(\Theta_r) = \mathrm{diag}(G^+(\theta_{r,1}), \ldots, G^+(\theta_{r,d/2})), \quad (4)$$
$$\mathrm{Ref}(\Phi_r) = \mathrm{diag}(G^-(\phi_{r,1}), \ldots, G^-(\phi_{r,d/2})), \quad (5)$$

$$\text{where } G^{\pm}(\theta) := \begin{pmatrix} \cos(\theta) & \mp\sin(\theta) \\ \sin(\theta) & \pm\cos(\theta) \end{pmatrix}. \quad (6)$$

Rotations and reflections of this form are hyperbolic isometries (distance-preserving). We can therefore directly apply them to hyperbolic embeddings while preserving the underlying geometry. Additionally, these transformations are computationally efficient and can be computed in linear time in the dimension. We illustrate two-dimensional isometries in both Euclidean and hyperbolic spaces in Figure 3.
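The Givens parameterization can be illustrated with a short NumPy sketch (the helper names are ours, not the paper's code). It builds the block-diagonal rotation and reflection matrices and makes the properties used above concrete: both are orthogonal (hence norm-preserving), and a reflection squared is the identity, which is what lets reflections encode symmetric relations:

```python
import numpy as np

def givens_block(theta, reflection=False):
    """2x2 Givens transformation G+/-(theta): rotation (G+) or reflection (G-)."""
    c, s = np.cos(theta), np.sin(theta)
    if reflection:
        return np.array([[c, s], [s, -c]])   # G-(theta)
    return np.array([[c, -s], [s, c]])       # G+(theta)

def block_diag_isometry(thetas, reflection=False):
    """Block-diagonal matrix of d/2 Givens blocks acting on R^d (d even)."""
    d = 2 * len(thetas)
    M = np.zeros((d, d))
    for i, t in enumerate(thetas):
        M[2 * i:2 * i + 2, 2 * i:2 * i + 2] = givens_block(t, reflection)
    return M
```

Only the d/2 angles are stored per relation, so applying the transformation costs linear time in the dimension, as stated above.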

Hyperbolic attention
Of our two classes of hyperbolic isometries, one or the other may better represent a particular relation.
To handle this, we use an attention mechanism to learn the right isometry. Thus we can represent symmetric, anti-symmetric or mixed-behaviour relations (i.e., neither symmetric nor anti-symmetric) as a combination of rotations and reflections. Let $x^H$ and $y^H$ be hyperbolic points (e.g., reflection and rotation embeddings), and $a$ be an attention vector. Our approach maps hyperbolic representations to tangent space representations, $x^E = \log_0^c(x^H)$ and $y^E = \log_0^c(y^H)$, and computes attention scores:

$$(\alpha_x, \alpha_y) = \mathrm{Softmax}(a^T x^E, a^T y^E).$$

We then compute a weighted average using the recently proposed tangent space average (Chami et al., 2019):

$$\mathrm{Att}(x^H, y^H; a) := \exp_0^c(\alpha_x x^E + \alpha_y y^E). \quad (7)$$
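The attention step above can be sketched as follows (a minimal NumPy illustration with our own function names; the exp/log maps are repeated here so the snippet is self-contained): map both hyperbolic points to the tangent space at the origin, take a softmax over the two scores, average, and map back.

```python
import numpy as np

def expmap0(v, c):
    n = np.linalg.norm(v)
    return v if n < 1e-15 else np.tanh(np.sqrt(c) * n) * v / (np.sqrt(c) * n)

def logmap0(y, c):
    n = np.linalg.norm(y)
    return y if n < 1e-15 else np.arctanh(np.sqrt(c) * n) * y / (np.sqrt(c) * n)

def hyp_attention(x_h, y_h, a, c):
    """Combine two hyperbolic points via attention in the tangent space at 0."""
    x_e, y_e = logmap0(x_h, c), logmap0(y_h, c)
    scores = np.array([a @ x_e, a @ y_e])
    scores = np.exp(scores - scores.max())
    alpha = scores / scores.sum()              # softmax weights
    return expmap0(alpha[0] * x_e + alpha[1] * y_e, c)
```

Note that when the two inputs coincide, the weighted tangent average is exact, so the output is the input point itself.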

The ATTH model
We have all of the building blocks for ATTH, and can now describe the model architecture. Let $(e_v^H)_{v\in\mathcal{V}}$ and $(r_r^H)_{r\in\mathcal{R}}$ denote entity and relationship hyperbolic embeddings, respectively. For a triple $(h, r, t) \in \mathcal{V} \times \mathcal{R} \times \mathcal{V}$, ATTH applies relation-specific rotations (Equation 4) and reflections (Equation 5) to the head embedding:

$$q_{Rot}^H = \mathrm{Rot}(\Theta_r)\, e_h^H, \qquad q_{Ref}^H = \mathrm{Ref}(\Phi_r)\, e_h^H. \quad (8)$$

ATTH then combines the two representations using hyperbolic attention (Equation 7) and applies a hyperbolic translation:

$$Q(h, r) = \mathrm{Att}(q_{Rot}^H, q_{Ref}^H; a_r) \oplus_{c_r} r_r^H. \quad (9)$$

Intuitively, rotations and reflections encode logical patterns while translations capture tree-like structures by moving between levels of the hierarchy. Finally, query embeddings are compared to target tail embeddings via the hyperbolic distance (Equation 3). The resulting scoring function is:

$$s(h, r, t) = -d^{c_r}(Q(h, r), e_t^H)^2 + b_h + b_t, \quad (10)$$

where $(b_v)_{v\in\mathcal{V}}$ are entity biases which act as margins in the scoring function (Tifrea et al., 2019; Balažević et al., 2019).
Note that the total number of parameters in ATTH is O(|V|d), similar to traditional models that do not use attention or geometric operations. The extra cost is proportional to the number of relations, which is usually much smaller than the number of entities.

Table 1: Datasets statistics. The lower the metric ξ_G is, the more tree-like the knowledge graph is.

Experiments
In low dimensions, we hypothesize (1) that hyperbolic embedding methods obtain better representations and allow for improved downstream performance for hierarchical data (Section 5.2). (2) We expect the performance of relation-specific geometric operations to vary based on the relation's logical patterns (Section 5.3).
(3) In cases where the relations are neither purely symmetric nor anti-symmetric, we anticipate that hyperbolic attention outperforms models based solely on reflections or rotations (Section 5.4). Finally, in high dimensions, we expect hyperbolic models with trainable curvature to learn the best geometry, and perform similarly to their Euclidean analogues (Section 5.5).

Experimental setup
Datasets We evaluate our approach on the link prediction task using three standard competition benchmarks, namely WN18RR (Bordes et al., 2013; Dettmers et al., 2018), FB15k-237 (Bordes et al., 2013; Toutanova and Chen, 2015) and YAGO3-10 (Mahdisoltani et al., 2013). WN18RR is a subset of WordNet containing 11 lexical relationships between 40,943 word senses, and has a natural hierarchical structure, e.g., (car, hypernym of, sedan). FB15k-237 is a subset of Freebase, a collaborative KB of general world knowledge. FB15k-237 has 14,541 entities and 237 relationships, some of which are non-hierarchical, such as born-in or nationality, while others have natural hierarchies, such as part-of (for organizations). YAGO3-10 is a subset of YAGO3, containing 123,182 entities and 37 relations, where most relations provide descriptions of people. Some relationships have a hierarchical structure such as playsFor or actedIn, while others induce logical patterns, like isMarriedTo.
For each KG, we follow the standard data augmentation protocol by adding inverse relations (Lacroix et al., 2018). We also estimate the global graph curvature ξ_G (Gu et al., 2019), which is a distance-based measure of how close a given graph is to being a tree. We summarize the datasets' statistics in Table 1.
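The inverse-relation augmentation step can be sketched in a few lines of Python (a hypothetical helper of our own, assuming integer-encoded triples with relation ids in [0, num_relations)): each triple (h, r, t) gains a companion (t, r + num_relations, h).

```python
def add_inverse_relations(triples, num_relations):
    """Augment (h, r, t) triples with inverse triples (t, r + num_relations, h)."""
    return triples + [(t, r + num_relations, h) for (h, r, t) in triples]
```

This doubles both the relation vocabulary and the training set, and lets head prediction be cast as tail prediction over the inverse relation.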
Baselines We compare our method to SotA models, including MuRP (Balažević et al., 2019), MuRE (the Euclidean analogue of MuRP), RotatE (Sun et al., 2019), ComplEx-N3 (Lacroix et al., 2018) and TuckER (Balažević et al., 2019). Baseline numbers in high dimensions (Table 5) are taken from the original papers, while baseline numbers in the low-dimensional setting (Table 2) are computed using open-source implementations of each model. In particular, we run hyperparameter searches over the same parameters as the ones in the original papers to compute baseline numbers in the low-dimensional setting.
Ablations To analyze the benefits of hyperbolic geometry, we evaluate the performance of ATTE, which is equivalent to ATTH with curvatures set to zero. Additionally, to better understand the role of attention, we report scores for variants of ATTE/H using only rotations (ROTE/H) or reflections (REFE/H).
Evaluation metrics At test time, we use the scoring function in Equation 10 to rank the correct tail or head entity against all possible entities, using inverse relations for head prediction (Lacroix et al., 2018). Similar to previous work, we compute two ranking-based metrics: (1) mean reciprocal rank (MRR), which measures the mean of inverse ranks assigned to correct entities, and (2) hits at K (H@K, K ∈ {1, 3, 10}), which measures the proportion of correct triples among the top K predicted triples. We follow the standard evaluation protocol in the filtered setting (Bordes et al., 2013): all true triples in the KG are filtered out during evaluation, since predicting a low rank for these triples should not be penalized.
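The filtered ranking metrics can be sketched as follows (our own illustrative helper, not the paper's evaluation script): other known-true candidates are masked out before computing the rank of the correct entity, from which MRR and H@K follow directly.

```python
import numpy as np

def ranking_metrics(scores, target_idx, filter_idx, ks=(1, 3, 10)):
    """Filtered rank of the correct entity given scores over all entities.

    scores: 1-D array of scores for every candidate entity,
    target_idx: index of the correct entity,
    filter_idx: indices of other known-true entities to mask out.
    """
    s = scores.astype(float).copy()
    s[list(filter_idx)] = -np.inf          # filtered setting: ignore true triples
    rank = 1 + int(np.sum(s > s[target_idx]))
    out = {"rank": rank, "mrr": 1.0 / rank}
    for k in ks:
        out[f"hits@{k}"] = float(rank <= k)
    return out
```

Dataset-level MRR and H@K are then simple averages of these per-query values over all test triples (in both tail- and head-prediction directions).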
Training procedure and implementation We train ATTH by minimizing the full cross-entropy loss with uniform negative sampling, where negative examples for a triple (h, r, t) are sampled uniformly from all possible triples obtained by perturbing the tail entity:

$$\mathcal{L} = \sum_{t'} \log(1 + \exp(y_{t'}\, s(h, r, t'))), \quad (11)$$

where $y_{t'} = -1$ if $t' = t$ and $y_{t'} = 1$ otherwise. Since optimization in hyperbolic space is practically challenging, we instead define all parameters in the tangent space at the origin, optimize embeddings using standard Euclidean techniques, and use the exponential map to recover the hyperbolic parameters (Chami et al., 2019). We provide more details on tangent space optimization in Appendix A.4. We conducted a grid search to select the learning rate, optimizer, negative sample size, and batch size, using the validation set to select the best hyperparameters. Our best model hyperparameters are detailed in Appendix A.3. We conducted all our experiments on NVIDIA Tesla P100 GPUs and make our implementation publicly available.*

Results in low dimensions
We first evaluate our approach in the low-dimensional setting for d = 32, which is approximately one order of magnitude smaller than SotA Euclidean methods. To understand the role of dimensionality, we also conduct experiments on WN18RR against SotA methods under varied low-dimensional settings (Figure 4). We include error bars for our method with average MRR and standard deviation computed over 10 runs. Our approach consistently outperforms all baselines, suggesting that hyperbolic embeddings attain high accuracy across a broad range of dimensions.
Additionally, we measure performance per relation on WN18RR in Table 3 to understand the benefits of hyperbolic geometry on hierarchical relations. We report the Krackhardt hierarchy score (Khs_G) (Balažević et al., 2019) and estimated curvature per relation (see Appendix A.2 for more details). We consider a relation to be hierarchical when its corresponding graph is close to tree-like (low curvature, high Khs_G). We observe that hyperbolic embeddings offer much better performance on hierarchical relations such as hypernym or has part, while Euclidean and hyperbolic embeddings have similar performance on non-hierarchical relations such as verb group. We also plot the learned curvature per relation versus the embedding dimension in Figure 5b. We note that the learned curvature in low dimensions directly correlates with the estimated graph curvature ξ_G in Table 3, suggesting that the model with learned curvatures learns more "curved" embedding spaces for tree-like relations. Finally, we observe that MuRP achieves lower performance than MuRE on YAGO3-10, while ATTH improves over ATTE by 2.3% in MRR. This suggests that trainable curvature is critical to learn embeddings with the right amount of curvature, while fixed curvature might degrade performance. We elaborate further on this point in Section 5.5.

* Code available at https://github.com/tensorflow/neural-structured-learning/tree/master/research/kg_hyp_emb

Hyperbolic rotations and reflections
In our experiments, we find that rotations work well on WN18RR, which contains multiple hierarchical and anti-symmetric relations, while reflections work better for YAGO3-10 (Table 5). To better understand the mechanisms behind these observations, we analyze two specific patterns: relation symmetry and anti-symmetry. We report performance per relation on a subset of YAGO3-10 relations in Table 4. We categorize relations as symmetric, anti-symmetric, or neither symmetric nor anti-symmetric using data statistics. More concretely, we consider a relation to satisfy a logical pattern when the logical condition is satisfied by most of the triples (e.g., a relation r is symmetric if for most KG triples (h, r, t), (t, r, h) is also in the KG). We observe that reflections encode symmetric relations particularly well, while rotations are well suited for anti-symmetric relations. This confirms our intuition, and the motivation for our approach, that particular geometric properties capture different kinds of logical properties.

Attention-based transformations
One advantage of using relation-specific transformations is that each relation can learn the right geometric operators based on the logical properties it has to satisfy. In particular, we observe that in both low- and high-dimensional settings, attention-based models can recover the performance of the best transformation on all datasets (Tables 2 and 5). Additionally, per-relationship results on YAGO3-10 in Table 4 suggest that ATTH indeed recovers the best geometric operation.
Furthermore, for relations that are neither symmetric nor anti-symmetric, we find that ATTH can outperform rotations and reflections, suggesting that combining multiple operators with attention can learn more expressive operators to model mixed logical patterns. In other words, attention-based transformations alleviate the need to conduct experiments with multiple geometric transformations by simply allowing the model to choose which one is best for a given relation.

Results in high dimensions
In high dimensions (Table 5), we compare against a variety of other models and achieve new SotA results on WN18RR and YAGO3-10, and third-best results on FB15k-237. As expected, when the embedding dimension is large, Euclidean and hyperbolic embedding methods perform similarly across all datasets. We explain this behavior by noting that when the dimension is sufficiently large, both Euclidean and hyperbolic spaces have enough capacity to represent complex hierarchies in KGs. This is further supported by Figure 5b, which shows the learned absolute curvature versus the dimension. We observe that curvatures are close to zero in high dimensions, confirming our expectation that ROTH with trainable curvatures learns a roughly Euclidean geometry in this setting.
In contrast, fixed curvature degrades performance in high dimensions (Figure 5a), confirming the importance of trainable curvatures and their impact on precision and capacity (previously studied by Sala et al. (2018)). Additionally, we show the distribution of embedding norms in the Appendix (Figure 7). Fixed curvature results in embeddings being clustered near the boundary of the ball, while trainable curvature adjusts the embedding space to better distribute points throughout the ball. Precision issues that might arise with fixed curvature could also explain MuRP's low performance in high dimensions. Trainable curvatures allow ROTH to perform as well as or better than previous methods in both low and high dimensions.

Visualizations
In Figure 6, we visualize the embeddings learned by ROTE versus ROTH for a sub-tree of the organism entity in WN18RR. To better visualize the hierarchy, we apply k inverse rotations for all nodes at level k in the tree.
By contrast to ROTE, ROTH preserves the tree structure in the embedding space. Furthermore, we note that ROTE cannot simultaneously preserve the tree structure and make non-neighboring nodes far from each other. For instance, virus should be far from male, but preserving the tree structure (by going one level down in the tree) while making these two nodes far from each other is difficult in Euclidean space. In hyperbolic space, however, we observe that going one level down in the tree is achieved by translating embeddings towards the left. This pattern essentially illustrates the translation component in ROTH, allowing the model to simultaneously preserve hierarchies while making non-neighbouring nodes far from each other.

[Table caption fragment: (Dettmers et al., 2018). Best score in bold and best published underlined. ATTE and ATTH have similar performance in the high-dimensional setting, performing competitively with or better than state-of-the-art methods on WN18RR, FB15k-237 and YAGO3-10.]

Conclusion
We introduce ATTH, a hyperbolic KG embedding model that leverages the expressiveness of hyperbolic space and attention-based geometric transformations to learn improved KG representations in low dimensions. ATTH learns embeddings with trainable hyperbolic curvatures, allowing it to learn the right geometry for each relationship and generalize across multiple embedding dimensions. ATTH achieves new SotA on WN18RR and YAGO3-10, real-world KGs which exhibit hierarchical structures. Future directions for this work include exploring other tasks that might benefit from hyperbolic geometry, such as hypernym detection. The proposed attention-based transformations can also be extended to other geometric operations.

A Appendix
Below, we provide additional details. We start by providing the formula for the hyperbolic analogue of addition that we use, along with additional hyperbolic geometry background. Next, we provide more information about the metrics that are used to determine how hierarchical a dataset is. Afterwards, we give additional experimental details, including the table of hyperparameters and further details on tangent space optimization. Lastly, we include an additional comparison against the Dihedral model (Xu and Li, 2019).
A.1 Möbius addition

The Möbius addition on $\mathcal{B}^{d,c}$ has the closed-form expression (Ganea et al., 2018):

$$x \oplus_c y = \frac{(1 + 2c\langle x, y \rangle + c\|y\|^2)\,x + (1 - c\|x\|^2)\,y}{1 + 2c\langle x, y \rangle + c^2\|x\|^2\|y\|^2}.$$

In contrast to Euclidean addition, it is neither commutative nor associative. However, it provides an analogue through the lens of parallel transport: given two points x, y and a vector v in $\mathcal{T}_x^c$, there is a unique vector in $\mathcal{T}_y^c$ which creates the same angle as v with the direction of the geodesic (shortest path) connecting x to y. This map is the parallel transport $P_{x\to y}^c(\cdot)$; in Euclidean space, this construction recovers standard addition. Analogously, the Möbius addition satisfies (Ganea et al., 2018):

$$x \oplus_c y = \exp_x^c(P_{0\to x}^c(\log_0^c(y))).$$

A.2 Hierarchy estimates
We use two metrics to estimate how hierarchical a relation is: the curvature estimate ξ_G and the Krackhardt hierarchy score Khs_G. While the curvature estimate captures global hierarchical behaviours (how much the graph is tree-like when zooming out), the Krackhardt score captures a more local behaviour (how many small loops the graph has). See Figure 8 for examples.
Curvature estimate To estimate the curvature of a relation r, we restrict to the undirected graph $G_r$ spanned by the edges labeled as r. Following (Gu et al., 2019), we use the triangle curvature estimate, which is given by:

$$\xi_{G_r}(a, b, c) = \frac{1}{2 d_{G_r}(a, m)}\left(d_{G_r}(a, m)^2 + \frac{d_{G_r}(b, c)^2}{4} - \frac{d_{G_r}(a, b)^2 + d_{G_r}(a, c)^2}{2}\right),$$

where m is the midpoint of the shortest path connecting b to c. This estimate is positive for triangles in circles, negative for triangles in trees, and zero for triangles in lines. Moreover, for a triangle in a Riemannian manifold M, $\xi_M(a, b, c)$ estimates the sectional curvature of the plane on which the triangle lies (see (Gu et al., 2019) for more details). Let $m_r$ be the total number of connected components in $G_r$. We sample $1000\, w_{i,r}$ triangles from each connected component $c_{i,r}$ of $G_r$, where the weight $w_{i,r}$ depends on $N_{i,r}$, the number of nodes in the component $c_{i,r}$. $\xi_{G_r}$ is the mean of the estimated curvatures of the sampled triangles. For the full graph, we take the weighted average of the relation curvatures $\xi_{G_r}$ with respect to the same weights.

Krackhardt hierarchy score For the directed graph $G_r$ spanned by the relation r, we let R be the adjacency matrix ($R_{i,j} = 1$ if there is an edge from node i to node j and 0 otherwise). Then:

$$\mathrm{Khs}_{G_r} = \frac{\sum_{i,j} R_{i,j}(1 - R_{j,i})}{\sum_{i,j} R_{i,j}}.$$

See (Krackhardt, 1994) for more details. We note that for fully observed symmetric relations (each edge is in a two-edge loop), $\mathrm{Khs}_{G_r} = 0$, while for anti-symmetric relations (no small loops), $\mathrm{Khs}_{G_r} = 1$.
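The Krackhardt score is straightforward to compute from a binary adjacency matrix; a small NumPy sketch (the helper name is ours):

```python
import numpy as np

def krackhardt_score(R):
    """Krackhardt hierarchy score from a binary adjacency matrix R.

    Counts the fraction of edges (i, j) that are NOT reciprocated by (j, i):
    0 for fully symmetric relations, 1 for anti-symmetric ones.
    """
    total = R.sum()
    if total == 0:
        return 0.0
    return float((R * (1 - R.T)).sum() / total)
```

For example, a two-node cycle (a symmetric relation) scores 0, while a single directed edge (anti-symmetric) scores 1, matching the limits described above.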

A.3 Experimental details
For all our Euclidean and hyperbolic models, we conduct a hyperparameter search for the learning rate, optimizer (Adam (Kingma and Ba, 2015) or Adagrad (Duchi et al., 2011)), negative sample size and batch size. We train each model for 500 epochs and use early stopping after 100 epochs if the validation MRR stops increasing. We report the best hyperparameters for each dataset in Table 7.

A.4 Tangent space optimization
Optimization in hyperbolic space normally requires Riemannian Stochastic Gradient Descent (RSGD) (Bonnabel, 2013), as was used in MuRP. RSGD is challenging in practice. Instead, we use tangent space optimization (Chami et al., 2019). We define all the ATTH parameters in the tangent space at the origin (our parameter space), optimize embeddings using standard Euclidean techniques, and use the exponential map to recover the hyperbolic parameters. Note that tangent space optimization is an exact procedure, which does not incur losses in representational power. This is the case in hyperbolic space specifically because of a completeness property: there is always a global bijection between the tangent space and the manifold.
Concretely, ATTH optimizes the entity and relationship embeddings $(e_v^E)_{v\in\mathcal{V}}$ and $(r_r^E)_{r\in\mathcal{R}}$ in the tangent space, which are mapped to the Poincaré ball with:

$$e_v^H = \exp_0^{c_r}(e_v^E) \quad \text{and} \quad r_r^H = \exp_0^{c_r}(r_r^E).$$

The trainable model parameters are then the tangent-space embeddings, together with the entity biases $(b_v)_{v\in\mathcal{V}}$, the Givens rotation and reflection parameters $(\Theta_r, \Phi_r)_{r\in\mathcal{R}}$, the attention vectors $(a_r)_{r\in\mathcal{R}}$, and the relation-specific curvatures $(c_r)_{r\in\mathcal{R}}$.
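Tangent space optimization can be illustrated with a toy NumPy sketch (entirely illustrative and ours: a single embedding is fitted with numerical gradients rather than the paper's Adam/Adagrad setup). The trainable parameter lives in the tangent space at the origin, a plain Euclidean gradient step updates it, and the exponential map recovers the hyperbolic point; since the exp map is a global bijection here, no representational power is lost.

```python
import numpy as np

def expmap0(v, c):
    n = np.linalg.norm(v)
    return v if n < 1e-15 else np.tanh(np.sqrt(c) * n) * v / (np.sqrt(c) * n)

def fit_tangent_param(target_h, c=1.0, lr=0.1, steps=500):
    """Pull a hyperbolic embedding toward a target point in the ball by
    doing plain gradient descent on its Euclidean (tangent) parameter."""
    v = np.zeros_like(target_h)            # Euclidean parameter at the origin
    eps = 1e-5
    for _ in range(steps):
        # numerical gradient of the squared gap after the exp map
        grad = np.zeros_like(v)
        for i in range(len(v)):
            vp, vm = v.copy(), v.copy()
            vp[i] += eps
            vm[i] -= eps
            grad[i] = (np.sum((expmap0(vp, c) - target_h) ** 2)
                       - np.sum((expmap0(vm, c) - target_h) ** 2)) / (2 * eps)
        v -= lr * grad                     # standard Euclidean update
    return expmap0(v, c)                   # recover the hyperbolic point
```

In the full model, the same pattern applies to every embedding: gradients are taken with respect to the tangent-space parameters, and the exp map is applied only when hyperbolic quantities (distances, Möbius additions) are needed.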