DyERNIE: Dynamic Evolution of Riemannian Manifold Embeddings for Temporal Knowledge Graph Completion

There has recently been increasing interest in learning representations of temporal knowledge graphs (KGs), which record the dynamic relationships between entities over time. Temporal KGs often exhibit multiple simultaneous non-Euclidean structures, such as hierarchical and cyclic structures. However, existing embedding approaches for temporal KGs typically learn entity representations and their dynamic evolution in Euclidean space, which might not capture such intrinsic structures very well. To this end, we propose DyERNIE, a non-Euclidean embedding approach that learns evolving entity representations in a product of Riemannian manifolds, where the composed spaces are estimated from the sectional curvatures of the underlying data. Product manifolds enable our approach to better reflect a wide variety of geometric structures in temporal KGs. Moreover, to capture the evolutionary dynamics of temporal KGs, we let the entity representations evolve according to a velocity vector defined in the tangent space at each timestamp. We analyze in detail the contribution of geometric spaces to representation learning of temporal KGs and evaluate our model on temporal knowledge graph completion tasks. Extensive experiments on three real-world datasets demonstrate significantly improved performance, indicating that the dynamics of multi-relational graph data can be more properly modeled by the evolution of embeddings on Riemannian manifolds.


Introduction
Learning from relational data has long been considered a key challenge in artificial intelligence. In recent years, several sizable knowledge graphs (KGs), e.g. Freebase (Bollacker et al., 2008) and Wikidata (Vrandečić and Krötzsch, 2014), have been developed, providing widespread availability of such data and enabling improvements to a plethora of downstream applications such as recommender systems (Hildebrandt et al., 2019) and question answering (Zhang et al., 2018). KGs are multi-relational, directed graphs with labeled edges, where each edge corresponds to a fact and can be represented as a triple, such as (John, lives in, Vancouver). Common knowledge graphs are static and store facts at their current state. In reality, however, multi-relational data are often time-dependent. For example, the political relationship between two countries might intensify because of trade disputes. Thus, temporal knowledge graphs were introduced, such as ICEWS (Boschee et al., 2015) and GDELT (Leetaru and Schrodt, 2013), that capture temporal aspects of facts in addition to their multi-relational nature. In these datasets, temporal facts are represented as quadruples that extend the static triple with a timestamp describing when the fact occurred, e.g. (Barack Obama, inaugurated, as president of the US, 2009). Since real-world temporal KGs are usually incomplete, the task of link prediction on temporal KGs has gained growing interest: inferring missing facts at specific timestamps based on the existing ones by answering queries such as (US, president, ?, 2015).
Many facts in temporal knowledge graphs induce geometric structures over time. For instance, increasing trade exchanges and economic cooperation between two major economies might promote the trade exports and economic growth of a series of countries in the downstream supply chain, which exhibits a tree-like structure over time. Moreover, the establishment of diplomatic relations between two countries might lead to regular official visits between them, which produces a cyclic structure over time. Embedding methods in Euclidean space have limitations and suffer from large distortion when representing large-scale hierarchical data. Recently, hyperbolic geometry has been exploited in several works (Nickel and Kiela, 2017; Ganea et al., 2018) as an effective method for learning representations of hierarchical data, where the exponential growth of distance toward the boundary of the hyperbolic space naturally allows representing hierarchical structures in a compact form. However, most graph-structured data has a wide variety of inherent geometric structures, e.g. partially tree-like and partially cyclical, while the above studies model the latent structures in a single geometry with constant curvature, limiting the flexibility of the model to match the hypothetical intrinsic manifold. Thus, using a product of constant curvature spaces with different curvatures (Gu et al., 2018) might help to match the underlying geometries of temporal knowledge graphs and provide high-quality representations.
Existing non-Euclidean approaches for knowledge graph embeddings (Balazevic et al., 2019; Kolyvakis et al., 2019) lack the ability to capture the temporal dynamics present in the data represented by temporal KGs. The difficulty with representing the evolution of temporal KGs in non-Euclidean spaces lies in finding a way to integrate temporal information into the geometric representations of entities. In this work, we propose the dynamic evolution of Riemannian manifold embeddings (DyERNIE), a theoretically founded approach to embed multi-relational data with dynamic relationships on a product of Riemannian manifolds with different curvatures. To capture both the stationary and dynamic characteristics of temporal KGs, we characterize the time-dependent representation of an entity as a movement on manifolds. For each entity, we define an initial embedding (at $t_0$) on each manifold and a velocity vector residing in the tangent space of the initial embedding, from which we generate a temporal representation at each timestamp. In particular, the initial embeddings represent the stationary structural dependencies across facts, while the velocity vectors capture the time-varying properties of entities.
Our contributions are the following: (i) We introduce Riemannian manifolds as embedding spaces to capture geometric features of temporal KGs. (ii) We characterize the dynamics of temporal KGs as movements of entity embeddings on Riemannian manifolds guided by velocity vectors defined in the tangent space. (iii) We show how the product space can be approximately identified from the sectional curvatures of temporal KGs, and how to choose the dimensionality and curvature of each component space accordingly. (iv) Our approach significantly outperforms current benchmarks on link prediction on temporal KGs in both low- and high-dimensional settings. (v) We analyze our model's properties, i.e. the influence of the embedding dimensionality and the correlation between node degrees and the norms of velocity vectors.

Riemannian Manifold
An $n$-dimensional Riemannian manifold $\mathcal{M}^n$ is a real, smooth manifold with a locally Euclidean structure. For each point $x \in \mathcal{M}^n$, the metric tensor $g(x)$ defines a positive-definite inner product $g(x) = \langle \cdot, \cdot \rangle_x : T_x\mathcal{M}^n \times T_x\mathcal{M}^n \to \mathbb{R}$, where $T_x\mathcal{M}^n$ is the tangent space of $\mathcal{M}^n$ at $x$. From the tangent space $T_x\mathcal{M}^n$, there exists a mapping function $\exp_x(v): T_x\mathcal{M}^n \to \mathcal{M}^n$ that maps a tangent vector $v$ at $x$ onto the manifold, also known as the exponential map. The inverse of the exponential map is referred to as the logarithmic map $\log_x(\cdot)$.

Constant Curvature Spaces
The sectional curvature $K(\tau_x)$ is a fine-grained notion of curvature defined over a two-dimensional subspace $\tau_x$ of the tangent space at the point $x$ (Berger, 2012). If all sectional curvatures in a manifold $\mathcal{M}^n$ are equal, the manifold is a space of constant curvature $K$. Three types of constant curvature spaces can be distinguished by the sign of the curvature: a positively curved space, a flat space, and a negatively curved space. There are different models for each constant curvature space. To unify them, in this work we choose the stereographically projected hypersphere $\mathbb{S}^n_K$ for positive curvatures ($K > 0$), while for negative curvatures ($K < 0$) we choose the Poincaré ball $\mathbb{P}^n_K$, which is the stereographic projection of the hyperboloid model. Both spaces are equipped with the Riemannian metric $g^{\mathbb{S}_K}_x = g^{\mathbb{P}_K}_x = (\lambda^K_x)^2 g^{\mathbb{E}}$, which is conformal to the Euclidean metric $g^{\mathbb{E}}$ with the conformal factor $\lambda^K_x = 2/(1 + K\|x\|^2_2)$ (Ganea et al., 2018). As explained in Skopek et al. (2019), $\mathbb{S}_K$ and $\mathbb{P}_K$ have a convenient property: their distance and metric tensors converge to the Euclidean counterparts as the curvature goes to 0, which makes both spaces suitable for learning sign-agnostic curvatures.
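As a concrete numerical illustration (our own sketch, not part of the paper's implementation), the conformal factor can be evaluated for either sign of the curvature:

```python
import numpy as np

def conformal_factor(x, K):
    """lambda_x^K = 2 / (1 + K * ||x||_2^2), the conformal factor of the
    projected hypersphere (K > 0) or the Poincare ball (K < 0) at point x."""
    return 2.0 / (1.0 + K * np.dot(x, x))
```

At $K = 0$ the factor is the constant 2 everywhere, so the metric reduces to a rescaled Euclidean metric; for $K < 0$ it diverges as $x$ approaches the ball's boundary $\|x\|_2 = 1/\sqrt{|K|}$.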

Gyrovector Spaces
An important analogue of vector spaces (vector addition and scalar multiplication) in non-Euclidean geometry is the notion of gyrovector spaces (Ungar, 2008). Both the projected hypersphere and the Poincaré ball share the following definition of Möbius addition:
$$x \oplus_K y = \frac{(1 - 2K\langle x, y\rangle_2 - K\|y\|^2_2)\,x + (1 + K\|x\|^2_2)\,y}{1 - 2K\langle x, y\rangle_2 + K^2\|x\|^2_2\|y\|^2_2},$$
where we denote the Euclidean norm and inner product by $\|\cdot\|_2$ and $\langle\cdot,\cdot\rangle_2$, respectively. Skopek et al. (2019) show that the distance between two points in $\mathbb{S}_K$ or $\mathbb{P}_K$ is equivalent to its gyrovector-space variant, which is defined as
$$d_K(x, y) = \frac{2}{\sqrt{|K|}} \tan_K^{-1}\!\big(\sqrt{|K|}\,\|{-x} \oplus_K y\|_2\big),$$
where $\tan_K = \tan$ if $K > 0$ and $\tan_K = \tanh$ if $K < 0$. The same gyrovector operations can be used to define the exponential and logarithmic maps in the Poincaré ball and the projected hypersphere; we list these mapping functions in Table 8 in the appendix. As Ganea et al. (2018) use the exponential and logarithmic maps to obtain the Möbius matrix-vector multiplication $M \otimes_K x = \exp^K_0(M \log^K_0(x))$, we reuse this operation in hyperbolic space; it is defined analogously in the projected hyperspherical space.
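A minimal NumPy sketch of these gyrovector operations (ours, for illustration; the function names are not from the paper) shows the sign-agnostic behavior, with $\tan$ for $K > 0$ and $\tanh$ for $K < 0$:

```python
import numpy as np

def mobius_add(x, y, K):
    """Moebius addition on the projected hypersphere (K > 0) or Poincare ball (K < 0)."""
    xy, x2, y2 = np.dot(x, y), np.dot(x, x), np.dot(y, y)
    num = (1 - 2 * K * xy - K * y2) * x + (1 + K * x2) * y
    den = 1 - 2 * K * xy + K ** 2 * x2 * y2
    return num / den

def dist_K(x, y, K):
    """Gyrovector-space distance; as K -> 0 it converges to 2 * ||y - x||
    (the factor 2 comes from the conformal-metric convention)."""
    s = np.sqrt(abs(K))
    r = np.linalg.norm(mobius_add(-x, y, K))
    inv_tan_K = np.arctan if K > 0 else np.arctanh
    return 2.0 / s * inv_tan_K(s * r)
```

Taking $K$ toward 0 numerically recovers (twice) the Euclidean distance, in line with the convergence property noted above.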

Product Manifold
We further generalize the embedding space of latent representations from a single manifold to a product of Riemannian manifolds with constant curvatures. Given a sequence of such manifolds, the product manifold is defined as the Cartesian product of $k$ component manifolds $\mathcal{M}^n = \prod_{i=1}^{k} \mathcal{M}^{n_i}_{K_i}$, where $n_i$ is the dimensionality of the $i$-th component, $K_i$ indicates its curvature, and $\mathcal{M}^{n_i}_{K_i} \in \{\mathbb{P}^{n_i}_{K_i}, \mathbb{E}^{n_i}, \mathbb{S}^{n_i}_{K_i}\}$. We call $\{(n_i, K_i)\}_{i=1}^{k}$ the signature of a product manifold. Note that the notation $\mathbb{E}^{n_i}$ is redundant in the Euclidean case, since the Cartesian product of Euclidean spaces with different dimensions combines into a single space, i.e. $\mathbb{E}^n = \prod_{i=1}^{k} \mathbb{E}^{n_i}$; this equality does not hold for the projected hypersphere or the Poincaré ball. For each point $x \in \mathcal{M}^n$ on a product manifold, we decompose its coordinates into the corresponding coordinates on the component manifolds, $x = (x^{(1)}, \dots, x^{(k)})$ with $x^{(i)} \in \mathcal{M}^{n_i}_{K_i}$. The squared distance decomposes accordingly: $d^2_{\mathcal{M}^n}(x, y) = \sum_{i=1}^{k} d^2_{\mathcal{M}^{n_i}_{K_i}}(x^{(i)}, y^{(i)})$. Similarly, we decompose the metric tensor and the exponential and logarithmic maps on a product manifold into the component manifolds: we split the embedding vector into parts $x^{(i)}$, apply the desired operation $f^{n_i}_{K_i}(x^{(i)})$ to each part, and concatenate the resulting parts back together (Skopek et al., 2019).
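The coordinate split and the distance decomposition can be sketched as follows (an illustrative snippet of ours; `component_dist` stands in for a per-component distance such as the gyrovector distance, and `None` marks a Euclidean component):

```python
import numpy as np

def split(x, signature):
    """Split a product-manifold point into component coordinates x^(i);
    signature is a list of (dimension n_i, curvature K_i) pairs."""
    parts, offset = [], 0
    for n_i, _ in signature:
        parts.append(x[offset:offset + n_i])
        offset += n_i
    return parts

def product_dist(x, y, signature, component_dist):
    """d^2 on the product manifold is the sum of squared component distances."""
    total = 0.0
    for x_i, y_i, (_, K_i) in zip(split(x, signature), split(y, signature), signature):
        if K_i is None:
            d_i = np.linalg.norm(x_i - y_i)      # Euclidean component
        else:
            d_i = component_dist(x_i, y_i, K_i)  # curved component
        total += d_i ** 2
    return np.sqrt(total)
```

With an all-Euclidean signature this reduces, as expected, to the ordinary Euclidean distance on the concatenated coordinates.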

Temporal Knowledge Graph Completion
Temporal knowledge graphs (KGs) are multi-relational, directed graphs with labeled, timestamped edges between entities. Let $\mathcal{E}$, $\mathcal{P}$, and $\mathcal{T}$ represent a finite set of entities, predicates, and timestamps, respectively. Each fact can be denoted by a quadruple $q = (e_s, p, e_o, t)$, where $p \in \mathcal{P}$ represents a timestamped, labeled edge between a subject entity $e_s \in \mathcal{E}$ and an object entity $e_o \in \mathcal{E}$ at a timestamp $t \in \mathcal{T}$. Let $\mathcal{F}$ represent the set of all quadruples that are facts, i.e. real events in the world; temporal knowledge graph completion (tKGC) is the problem of inferring $\mathcal{F}$ from a set of observed facts $\mathcal{O} \subset \mathcal{F}$. Concretely, the task is to predict either a missing subject entity $(?, p, e_o, t)$ or a missing object entity $(e_s, p, ?, t)$ given the other three components. Taking object prediction as an example, we consider all entities in $\mathcal{E}$ and learn a score function $\phi: \mathcal{E} \times \mathcal{P} \times \mathcal{E} \times \mathcal{T} \to \mathbb{R}$. Since the score function assigns a score to each quadruple, the correct object can be inferred by ranking the scores of all quadruples $\{(e_s, p, e_{o_i}, t) \mid e_{o_i} \in \mathcal{E}\}$ formed with the candidate entities.

Static KG Embedding Embedding approaches for static KGs can generally be categorized into bilinear and translational models. Bilinear approaches use a bilinear score function that represents predicates as linear transformations acting on entity embeddings (Nickel et al., 2011; Trouillon et al., 2016; Yang et al., 2014; Ma et al., 2018a). Translational approaches measure the plausibility of a triple as the distance between the translated subject and object entity embeddings; they include TransE (Bordes et al., 2013) and its variations (Sun et al., 2019; Kazemi and Poole, 2018).
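For object prediction, the ranking procedure can be sketched with a toy score function (all embeddings and the score below are hypothetical stand-ins for the learned $\phi$):

```python
import numpy as np

rng = np.random.default_rng(0)
num_entities, dim = 5, 4
E = rng.normal(size=(num_entities, dim))  # toy entity embeddings

def score(s_idx, p_vec, o_idx, t):
    """Illustrative stand-in for phi(e_s, p, e_o, t); higher = more plausible."""
    return -np.linalg.norm(E[s_idx] + p_vec + 0.01 * t - E[o_idx])

def rank_objects(s_idx, p_vec, true_o, t):
    """Answer (e_s, p, ?, t): score every candidate object and return the
    1-based rank of the true object among all entities."""
    scores = np.array([score(s_idx, p_vec, o, t) for o in range(num_entities)])
    order = np.argsort(-scores)  # candidate indices by descending score
    return int(np.where(order == true_o)[0][0]) + 1
```

A perfect model assigns the highest score to the true object, yielding rank 1 for that query.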
Additionally, several models are based on deep learning approaches (Nguyen et al., 2017;Dettmers et al., 2018;Schlichtkrull et al., 2018;Nathani et al., 2019;Hildebrandt et al., 2020) that apply (graph) convolutional layers on top of embeddings and design a score function as the last layer of the neural network.
Temporal KG Embedding Recently, there have been several attempts to incorporate time information into temporal KG embeddings to improve link prediction. Ma et al. (2018b) developed extensions of static knowledge graph models by adding a timestamp embedding to their score functions. Leblay and Chekol (2018) proposed TTransE by incorporating time representations into the score function of TransE in several ways. HyTE (Dasgupta et al., 2018) embeds time information in the entity-relation space by associating a temporal hyperplane with each timestamp. Inspired by the canonical decomposition of tensors, Lacroix et al. (2020) proposed an extension of ComplEx (Trouillon et al., 2016) for temporal KG completion by decomposing a 4-way tensor. The number of parameters of these models scales with the number of timestamps, leading to overfitting when the number of timestamps is very large. Additionally, a considerable number of models (Trivedi et al., 2017; Jin et al., 2019; Han et al., 2020) have been developed for forecasting on temporal knowledge graphs, predicting future links based only on past events.

Graph Embedding Approaches in non-Euclidean Geometries
There has been growing interest in embedding graph data in non-Euclidean spaces. Nickel and Kiela (2017), for instance, learn embeddings of hierarchical data in the Poincaré ball.

Temporal Knowledge Graph Completion in Riemannian Manifold
Entities in a temporal KG might form different geometric structures under different relations, and these structures could evolve with time. To capture heterogeneous and time-dependent structures, we propose the DyERNIE model to embed entities of temporal knowledge graphs on a product of Riemannian manifolds and model time-dependent behavior of entities with dynamic entity representations.

Entity Representation
In temporal knowledge graphs, entities may have some features that change over time and others that remain fixed. Thus, we represent the embedding of an entity $e_j \in \mathcal{E}$ at timestamp $t$ as a combination of low-dimensional vectors $\mathbf{e}_j(t) = (\mathbf{e}^{(1)}_j(t), \dots, \mathbf{e}^{(k)}_j(t))$, where $\mathbf{e}^{(i)}_j(t)$ lies on the $i$-th component manifold, and $K_i$ and $n_i$ denote the curvature and the dimension of this manifold, respectively. Each component embedding $\mathbf{e}^{(i)}_j(t)$ is derived from an initial embedding and a velocity vector to encode both the stationary properties of the entity and its time-varying behavior, namely
$$\mathbf{e}^{(i)}_j(t) = \exp^{K_i}_0\!\big(\log^{K_i}_0(\bar{\mathbf{e}}^{(i)}_j) + \mathbf{v}^{(i)}_{e_j}\, t\big),$$
where $\bar{\mathbf{e}}^{(i)}_j \in \mathcal{M}^{n_i}_{K_i}$ is the initial embedding (at $t_0$) and $\mathbf{v}^{(i)}_{e_j}$ represents an entity-specific velocity vector that is defined in the tangent space at the origin $\mathbf{0}$ and captures the evolutionary dynamics of the entity $e_j$ in its vector-space representation over time. As shown in Figure 1 (a), we project the initial embedding to the tangent space $T_0\mathcal{M}^{n_i}_{K_i}$ using the logarithmic map $\log^{K_i}_0$, translate it along the velocity vector to obtain the embedding at the next timestamp, and finally project it back to the manifold with the exponential map $\exp^{K_i}_0$. Note that in the Euclidean case, the exponential and logarithmic maps are the identity function. By learning both the initial embedding and the velocity vector, the model captures the stationary structural dependencies across facts as well as the time-varying behavior of each entity.
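The evolution of a single component embedding (project to the tangent space at the origin, translate along the velocity, map back) can be sketched as follows; the closed-form origin maps below follow the κ-stereographic model, and all helper names are ours:

```python
import numpy as np

def exp0(v, K):
    """Exponential map at the origin of S^n_K (K > 0) or P^n_K (K < 0)."""
    s, nv = np.sqrt(abs(K)), np.linalg.norm(v)
    if nv < 1e-12:
        return np.zeros_like(v)
    t = np.tan(s * nv) if K > 0 else np.tanh(s * nv)
    return t * v / (s * nv)

def log0(x, K):
    """Logarithmic map at the origin (inverse of exp0)."""
    s, nx = np.sqrt(abs(K)), np.linalg.norm(x)
    if nx < 1e-12:
        return np.zeros_like(x)
    a = np.arctan(s * nx) if K > 0 else np.arctanh(s * nx)
    return a * x / (s * nx)

def entity_embedding(e0, v, t, K):
    """e_j(t): translate the initial embedding e0 by the velocity v (scaled
    by t) in the tangent space at the origin, then map back to the manifold."""
    return exp0(log0(e0, K) + v * t, K)
```

For $K < 0$ the result stays inside the Poincaré ball for any $t$, since $\tanh$ is bounded by 1.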

Score Function
Bilinear models have proved to be an effective approach to KG completion (Nickel et al., 2011; Lacroix et al., 2018), where the score function is a bilinear product between subject entity, predicate, and object entity embeddings. However, there is no clear correspondence to the Euclidean inner product in non-Euclidean spaces. We follow the method suggested in Poincaré GloVe (Tifrea et al., 2018) and reformulate the inner product as a function of distance, i.e. $\langle x, y\rangle = \frac{1}{2}\left(-d(x, y)^2 + \|x\|^2 + \|y\|^2\right)$, replacing the squared norms with biases $b_x$ and $b_y$. In addition, to capture different hierarchical structures under different relations simultaneously, Balazevic et al. (2019) apply relation-specific transformations to entities: a stretch by a diagonal predicate matrix $P \in \mathbb{R}^{n\times n}$ to subject entities and a translation by a vector offset $\mathbf{p} \in \mathbb{P}^n$ to object entities.
Inspired by these two ideas, we define the score function of DyERNIE as
$$\phi(e_s, p, e_o, t) = \sum_{i=1}^{k} -d_{K_i}\!\big(P^{(i)} \otimes_{K_i} \mathbf{e}^{(i)}_s(t),\; \mathbf{e}^{(i)}_o(t) \oplus_{K_i} \mathbf{p}^{(i)}\big)^2 + b_s + b_o,$$
where $\mathbf{e}^{(i)}_s(t), \mathbf{e}^{(i)}_o(t) \in \mathcal{M}^{n_i}_{K_i}$ are the embeddings of the subject and object entities $e_s$ and $e_o$ in the $i$-th component manifold, $\mathbf{p}^{(i)} \in \mathcal{M}^{n_i}_{K_i}$ is a translation vector of predicate $p$, $P^{(i)} \in \mathbb{R}^{n_i\times n_i}$ represents a diagonal predicate matrix defined in the tangent space at the origin, and $b_s, b_o$ are the entity-specific biases introduced above. Since multi-relational data often exhibits different structures under different predicates, we use the predicate-specific transformations $P$ and $\mathbf{p}$ to determine predicate-adjusted embeddings of entities in different predicate-dependent structures, e.g. multiple hierarchies. The distance between the predicate-adjusted embeddings of $e_s$ and $e_o$ measures their relatedness in terms of the predicate $p$.
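In a purely Euclidean component, $\otimes$ and $\oplus$ reduce to ordinary matrix multiplication and vector addition, so a single summand of the score can be sketched as (our illustration, not the released implementation):

```python
import numpy as np

def score_component_euclidean(e_s, e_o, P_diag, p, b_s, b_o):
    """One Euclidean summand of phi: stretch the subject by the diagonal
    predicate matrix, translate the object by the predicate offset, and
    score by negative squared distance plus the entity biases."""
    adj_s = P_diag * e_s  # diag(P) acting on the subject embedding
    adj_o = e_o + p       # predicate-specific translation of the object
    return -np.linalg.norm(adj_s - adj_o) ** 2 + b_s + b_o
```

The score is maximal (zero, with zero biases) when the predicate-adjusted subject and object coincide, and decreases with their squared distance.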

Learning
The genuine quadruples in a temporal KG $\mathcal{G}$ are split into train, validation, and test sets. We add reciprocal relations for every quadruple, a standard data augmentation technique in the literature (Balazevic et al., 2019; Goel et al., 2019), i.e. we add $(e_o, p^{-1}, e_s, t)$ for every $(e_s, p, e_o, t)$. Besides, for each fact $(e_s, p, e_o, t)$ in the training set, we generate $n$ negative samples by corrupting either the object $(e_s, p, e'_o, t)$ or the subject $(e_o, p^{-1}, e'_s, t)$ with a randomly selected entity from $\mathcal{E}$. We use the binary cross-entropy as the loss function, which is defined as
$$\mathcal{L} = -\frac{1}{N}\sum_{m=1}^{N}\big(y_m \log p_m + (1 - y_m)\log(1 - p_m)\big),$$
where $N$ is the number of training samples, $y_m$ represents the binary label indicating whether a quadruple $q_m$ is genuine, $p_m = \sigma(\phi(q_m))$ denotes the predicted probability, and $\sigma(\cdot)$ represents the sigmoid function. Model parameters are learned using Riemannian stochastic gradient descent (RSGD) (Bonnabel, 2013), where the Riemannian gradient $\nabla_{\mathcal{M}^n}\mathcal{L}$ is obtained by multiplying the Euclidean gradient $\nabla_{\mathbb{E}}\mathcal{L}$ by the inverse of the Riemannian metric tensor.
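The objective and the RSGD gradient rescaling can be sketched as follows (our own helpers; for the conformal metric $g_x = (\lambda^K_x)^2 g^{\mathbb{E}}$, multiplying by the inverse metric tensor amounts to a scalar rescaling):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(scores, labels):
    """Binary cross-entropy over genuine (y = 1) and corrupted (y = 0) quadruples."""
    p = sigmoid(scores)
    eps = 1e-12  # numerical guard for log(0)
    return -np.mean(labels * np.log(p + eps) + (1 - labels) * np.log(1 - p + eps))

def riemannian_grad(x, euclidean_grad, K):
    """Rescale the Euclidean gradient by the inverse metric tensor, which for
    the conformal metric is 1 / (lambda_x^K)^2 times the identity."""
    lam = 2.0 / (1.0 + K * np.dot(x, x))
    return euclidean_grad / lam ** 2
```

At the origin of the Poincaré ball the conformal factor is 2, so the Riemannian gradient there is a quarter of the Euclidean one.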

Signature Estimation
To better capture a broad range of structures in temporal KGs, we need to choose an appropriate signature of the product manifold $\mathcal{M}^n$, including the number of component spaces, their dimensions, and their curvatures. Although we could learn the embeddings and the curvature of each component simultaneously during training using gradient-based optimization, we have empirically found that treating curvature as a trainable parameter interferes with the training of the other model parameters. Thus, we treat the curvature and dimension of each component as hyperparameters selected a priori. In particular, we use the parallelogram law deviation (Gu et al., 2018) to estimate both the graph curvature of a given temporal KG and the number of components. Details about this algorithm can be found in Appendix A. Figure 2 shows the curvature histograms of the ICEWS14 and ICEWS05-15 datasets introduced in Section 5.1. Notably, the curvatures are mostly non-Euclidean, offering good motivation to learn embeddings on a product manifold. Taking the ICEWS05-15 dataset as an example, we see that most curvatures are negative. In this case, we initialize the product manifold with three hyperbolic components of different dimensions. We then conduct a Bayesian optimization around the initial values of the dimension and curvature of each component to fine-tune them, and select the best-performing signature according to performance on the validation set.

Baselines Our baselines include both static and temporal KG embedding models. From the static KG embedding models, we use TransE (Bordes et al., 2013), DistMult (Yang et al., 2014), and ComplEx (Trouillon et al., 2016), where we compress temporal knowledge graphs into a static, cumulative graph by ignoring the time information.
From the temporal KG embedding models, we compare the performance of our model with several state-of-the-art methods, including TTransE (Leblay and Chekol, 2018), TDistMult/TComplEx (Ma et al., 2018b), and HyTE (Dasgupta et al., 2018).
Evaluation protocol For each quadruple $q = (e_s, p, e_o, t)$ in the test set $\mathcal{G}_{test}$, we create two queries: $(e_s, p, ?, t)$ and $(e_o, p^{-1}, ?, t)$. For each query, the model ranks all possible entities $\mathcal{E}$ according to their scores. Following the commonly used filtered setting in the literature (Bordes et al., 2013), we remove from the candidate list all entity candidates that correspond to true quadruples, apart from the current test entity. Let $\psi_{e_s}$ and $\psi_{e_o}$ represent the ranks of $e_s$ and $e_o$ for the two queries, respectively. We evaluate our models using standard metrics from the link prediction literature: mean reciprocal rank (MRR), $\frac{1}{2|\mathcal{G}_{test}|}\sum_{q\in\mathcal{G}_{test}}\left(\frac{1}{\psi_{e_s}} + \frac{1}{\psi_{e_o}}\right)$, and Hits@$k$ ($k \in \{1, 3, 10\}$), the percentage of times that the true entity candidate appears in the top $k$ ranked candidates.
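Given the filtered ranks $\psi$ of all test queries (both directions pooled), the two metrics can be computed as in this short sketch of ours:

```python
import numpy as np

def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """ranks: 1-based filtered ranks for all test queries.
    Returns the mean reciprocal rank and Hits@k for each k."""
    ranks = np.asarray(ranks, dtype=float)
    mrr = float(np.mean(1.0 / ranks))
    hits = {k: float(np.mean(ranks <= k)) for k in ks}
    return mrr, hits
```

For example, ranks of 1, 2, and 4 yield an MRR of (1 + 1/2 + 1/4)/3 and Hits@1 of 1/3.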
Implementations We implemented our model and all baselines in PyTorch (Paszke et al., 2019). For fairness of comparison, we use Table 2 in the supplementary materials to compute the embedding dimension for each (baseline, dataset) pair that matches the number of parameters of our model with an embedding dimension of 100. Taking HyTE as an example, its embedding dimension is 193 and 151 on the ICEWS14 and GDELT datasets, respectively. Also, we use the datasets augmented with reciprocal relations to train all baseline models. We tune the hyperparameters of our models using quasi-random search followed by Bayesian optimization (Ruffinelli et al., 2020) and report the best configuration in Appendix E. We implement TTransE, TComplEx, and TDistMult based on the implementations of TransE, DistMult, and ComplEx, respectively. We use the binary cross-entropy loss and RSGD to train these baselines and optimize hyperparameters by early stopping according to the MRR on the validation set. Additionally, we use the reference implementation of HyTE. We provide the detailed hyperparameter settings of each baseline model in Appendix B. These choices are consistent with Figure 2, where most sectional curvatures on the ICEWS14 and ICEWS05-15 datasets are negative.

Comparative Study
Ablation study We show ablation studies of the distance function and the entity representations in Tables 2 and 3, respectively. For the distance function, we use $\mathbf{p}$ and $P$ to obtain predicate-adjusted subject and object embeddings and compute the distance between them. We find that any change to the distance function causes performance degradation; in particular, removing the translation vector $\mathbf{p}$ decreases the performance most strongly. For the entity representations, we measure the importance of a linear trend component and a non-linear periodic component. We attempt to add trigonometric functions to the entity representations, since a combination of trigonometric functions can capture more complicated non-linear dynamics (Rahimi and Recht, 2008). However, the experimental results in Table 3 show that using only a linear transformation works best, which indicates that finding the correct manifold for the embedding space is more important than designing complicated non-linear evolution functions of the entity embeddings. Additionally, we find that the performance degrades significantly if the dynamic part of the entity embeddings is removed. For example, on the ICEWS05-15 dataset, the Hits@1 metric in the static case is only about half of that in the dynamic case, clearly showing the gain from the dynamism. Details of this ablation study are provided in Appendix G.

Figure 3: Scatter plot of distances between entity embeddings and the manifold's origin vs. node degrees on ICEWS05-15. Each point denotes an entity $e_j$; the x-coordinate gives its degree accumulated over all timestamps, and the y-coordinate represents $d_{\mathcal{M}}(e_j, \mathbf{0})$.

Intrinsic hierarchical structures of temporal KGs To illustrate geometric, especially hierarchical, structures of temporal KGs, we focus on the Poincaré ball model with a dimension of 20 and plot in Figure 3 the geodesic distance $d_{\mathcal{M}}(\cdot, \mathbf{0})$ of the learned entity embeddings to the origin of the Poincaré ball versus the degree of each entity. Note that the distance is averaged over all timestamps, since entity embeddings are time-dependent. We observe that entities with high degrees, i.e. entities involved in many facts, are generally located close to the origin. This makes sense because such entities often lie at the top hierarchical levels and thus should stand close to the root.
Under the same settings, we plot the velocity norm of each entity versus the entity degree in Figure 4. Similarly, we see that entities with high degrees have small velocity norms and thus stay near the origin of the manifold. Figure 5 shows two-dimensional hyperbolic entity embeddings of the ICEWS05-15 dataset at two timestamps, 2005-01-01 and 2015-12-31. Specifically, we highlight a former US president (in orange) and a former prime minister of Russia (in purple). We found that the interaction between these two entities decreased between 2005 and 2015, as shown in Figure 9 in the appendix. Accordingly, we observe that the embeddings of these two entities moved away from each other. More examples of learned embeddings are relegated to Appendix F.

Conclusion
In this paper, we propose an embedding approach for temporal knowledge graphs on a product of Riemannian manifolds with heterogeneous curvatures. To capture the temporal evolution of temporal KGs, we use velocity vectors defined in tangent spaces to learn time-dependent entity representations. We show that our model significantly outperforms its Euclidean counterpart and other state-of-the-art approaches on three benchmark datasets of temporal KGs, which demonstrates the significance of geometrical spaces for the temporal knowledge graph completion task.

Appendices A Graph Curvature Estimation Algorithm
We use Algorithm 1, developed by Bachmann et al. (2019), to estimate the sectional curvatures of a dataset.
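As an illustration of the parallelogram-law deviation that Algorithm 1 samples and aggregates, the following is our own minimal sketch with BFS graph distances (the helper names are hypothetical, not from the paper's code):

```python
from collections import deque

def shortest_paths(adj, src):
    """BFS distances from src in an unweighted graph given as an adjacency dict."""
    dist = {src: 0}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for w in adj[u]:
            if w not in dist:
                dist[w] = dist[u] + 1
                queue.append(w)
    return dist

def parallelogram_deviation(adj, m, b, c, a):
    """One sample of the discrete curvature estimate at node m, where b and c
    are neighbors of m and a is a reference node (Gu et al., 2018)."""
    d = {v: shortest_paths(adj, v) for v in (a, b)}
    dam, dbc = d[a][m], d[b][c]
    dab, dac = d[a][b], d[a][c]
    return (dam ** 2 + dbc ** 2 / 4 - (dab ** 2 + dac ** 2) / 2) / (2 * dam)
```

On a 4-cycle such a sample is positive, while on a small tree it is negative, matching the intuition that cyclic structures look spherical and tree-like structures hyperbolic.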

B Implementation Details of Baselines
Note that the embedding dimension for each (baseline, dataset) pair matches the number of parameters of our models with an embedding dimension of 100. We use Tables 4 and 12 to compute the rank for each (baseline, dataset) pair. Besides, for fairness of comparison, we use the datasets augmented with reciprocal relations to train all baseline models.

Static knowledge graph embedding models
We use TransE (Bordes et al., 2013), DistMult (Yang et al., 2014), and ComplEx (Trouillon et al., 2016) as static baselines, where we compress temporal knowledge graphs into a static, cumulative graph by ignoring the time information. We use the cross-entropy loss and the Adam optimizer with a batch size of 128 to train the static baselines. Besides, we use uniform sampling to initialize the embeddings of entities and predicates. Other hyperparameters of the above baselines are shown in Table 5.

Temporal knowledge graph embedding models We compare our model's performance with several state-of-the-art temporal knowledge graph embedding methods, including TTransE (Leblay and Chekol, 2018), TDistMult/TComplEx (Ma et al., 2018b), and HyTE (Dasgupta et al., 2018). We use the Adam optimizer (Kingma and Ba, 2014) and the cross-entropy loss to train the temporal KG models. We set the learning rate to 0.001, the number of negative samples per fact to 500, the number of epochs to 500, and the batch size to 256, and validate every 50 epochs to select the model giving the best validation MRR. For the GDELT dataset, we use a similar setting but with 50 negative samples per fact due to the large size of the dataset. The embedding dimensions of the above dynamic baselines on each dataset are shown in Table 6.

C Datasets
Dataset statistics are described in Table 12. Since the timestamps in the ICEWS dataset are dates rather than numbers, we sort them chronologically and encode them into consecutive numbers.

D Evaluation metrics
Let $\psi_{e_s}$ and $\psi_{e_o}$ represent the ranks of $e_s$ and $e_o$ for the two queries, respectively. We evaluate our models using standard metrics from the link prediction literature: mean reciprocal rank (MRR), $\frac{1}{2|\mathcal{G}_{test}|}\sum_{q\in\mathcal{G}_{test}}\left(\frac{1}{\psi_{e_s}} + \frac{1}{\psi_{e_o}}\right)$, and Hits@$k$ ($k \in \{1, 3, 10\}$), the percentage of times that the true entity candidate appears in the top $k$ ranked candidates.

E Implementation Details of DyERNIE
Signature search On the ICEWS subsets, we try all manifold combinations with $\{1, 2, 3\}$ components. Due to the large number of data samples in the GDELT dataset, we only try manifold combinations with $\{1, 2\}$ components. Specifically, the candidates are $\{\mathbb{P}^n, \mathbb{S}^n, \mathbb{E}^n\}$ for single manifolds; $\{\mathbb{P}^{n_i} \times \mathbb{S}^{n_i}, \mathbb{P}^{n_i} \times \mathbb{P}^{n_i}, \mathbb{S}^{n_i} \times \mathbb{S}^{n_i}, \mathbb{P}^{n_i} \times \mathbb{E}^{n_i}, \mathbb{S}^{n_i} \times \mathbb{E}^{n_i}\}$ for products of two component manifolds; and $\{\mathbb{P}^{n_i} \times \mathbb{P}^{n_i} \times \mathbb{P}^{n_i}, \mathbb{P}^{n_i} \times \mathbb{S}^{n_i} \times \mathbb{E}^{n_i}, \mathbb{S}^{n_i} \times \mathbb{S}^{n_i} \times \mathbb{S}^{n_i}, \mathbb{P}^{n_i} \times \mathbb{P}^{n_i} \times \mathbb{S}^{n_i}, \mathbb{P}^{n_i} \times \mathbb{S}^{n_i} \times \mathbb{S}^{n_i}, \mathbb{P}^{n_i} \times \mathbb{P}^{n_i} \times \mathbb{E}^{n_i}, \mathbb{S}^{n_i} \times \mathbb{S}^{n_i} \times \mathbb{E}^{n_i}\}$ for products of three component manifolds. For each combination, we use the Ax framework (https://ax.dev) to optimize the assignment of dimensions and curvatures to each component manifold. The assignments of the best-performing models are shown in Tables 9, 10, and 11. We report the best results on each dataset in Table 1 in the main body.
Hyperparameter configurations for best-performing models We select the loss function from binary cross-entropy (BCE), margin ranking loss, and cross-entropy (CE). BCE and CE give similar performance and outperform the margin ranking loss. However, with the BCE loss we can use a large learning rate (lr > 10) to speed up the training procedure, whereas models with the CE loss tend to overfit at large learning rates. Given the BCE loss, we found that a learning rate of 50 works best for all model configurations. Furthermore, increasing the number of negative samples improves performance to some extent, but this effect weakens gradually as the number of negative samples grows, while the number of negative samples strongly affects the runtime of the training procedure. We empirically found that 50 negative samples is a good compromise between model performance and training speed. Besides, there is no statistically significant difference in model performance between different optimizers, such as Riemannian Adam (RADAM) and Riemannian stochastic gradient descent (RSGD); thus, for the model's simplicity, we use RSGD.
Average runtime for each approach & Number of parameters in each model Table 13 shows the number of parameters and the average runtime for each model.

F Visualization
We plot the geodesic distance $d_{\mathcal{M}}(e_j, \mathbf{0})$ of the learned entity embeddings (with a dimension of 20) to the manifold's origin versus the degree of each entity in Figure 6, where $d_{\mathcal{M}}(e_j, \mathbf{0})$ is averaged over all timestamps since $e_j$ is time-dependent. Also, the degree of each entity is accumulated over all timestamps. Each point in the upper plot represents an entity, where the x-coordinate gives its degree and the y-coordinate gives its average distance to the origin. The plot clearly shows the tendency of entities with high degrees to lie close to the origin. The bottom plot shows the same content with a 20% sample of the points; the gray bar around each point shows the variance of the distance between the entity embedding and the origin over time. Figure 7 shows two-dimensional hyperbolic entity embeddings of the ICEWS05-15 dataset at four timestamps. We highlight some entities to show the relative movements between them. The number of interactions between the selected entities, which evolves over time, is depicted in Figures 8 and 9. Specifically, in the first row of subplots we highlight Nigerian citizens, the Nigerian government, the head of the Nigerian government, other authorities in Nigeria, and a Nigerian minister. We can see that two entities move closer in the Poincaré disc if the number of interactions between them increases.

G Additional Ablation Study
To assess the contribution of the dynamic part of the entity embeddings, we remove the dynamic part and run the model variant on static knowledge graphs. Specifically, we compress ICEWS05-15 into a static, cumulative graph by ignoring the time information. As shown in Table 7, the performance degrades significantly if the entity embeddings have only the static part. For example, on the ICEWS05-15 dataset, the Hits@1 metric of DyERNIE-Sgl in the static case is less than half of that in the dynamic case, clearly showing the gain from the dynamism.

Figure 6: Each point in the upper plot represents an entity whose x-coordinate gives its degree accumulated over all timestamps and whose y-coordinate gives its distance to the origin averaged over all timestamps. The plot clearly shows the tendency of entities with high degrees to lie close to the origin. The bottom plot shows the same content with a 20% sample of the points; the gray bar around each point shows the variance of the distance over all timestamps.

Figure 8: Interactions between Nigerian entities. Subtitles show the names of the given entity pair. Red lines give the geodesic distance between two entities. Blue dots represent the number of interactions between two entities (relative degree) at each timestamp, and blue lines are regressions of the relative degree between two entities over time.

Figure 9: Interactions between Barack Obama and Dmitry Anatolyevich Medvedev. Red lines give the geodesic distance between two entities. Blue dots represent the number of interactions between two entities (relative degree) at each timestamp, and blue lines are regressions of the relative degree between two entities over time.
Algorithm 1: Curvature Estimation
Input: number of iterations $n_{iter}$, number of timestamps $n_{time}$, graph slices $\{G_i\}_{i=1}^{n_{time}}$ of a temporal knowledge graph, neighbor dictionary $\mathcal{N}$.
Output: $\{K_i\}_{i=1}^{n_{time}}$
for $i = 1$ to $n_{time}$ do
    for $m \in G_i$ do
        for $j = 1$ to $n_{iter}$ do
            sample $b, c \sim \mathcal{U}(\mathcal{N}(m))$ and $a \sim \mathcal{U}(G_i \setminus \{m\})$
            $\psi_j(m; b, c; a) = \frac{1}{2 d_{G_i}(a, m)}\Big(d^2_{G_i}(a, m) + \frac{d^2_{G_i}(b, c)}{4} - \frac{d^2_{G_i}(a, b)}{2} - \frac{d^2_{G_i}(a, c)}{2}\Big)$
        end
        $\psi_i(m) = \sum_{j=1}^{n_{iter}} \psi_j(m; b, c; a)$
    end
    $K_i = \sum_{m \in G_i} \psi_i(m)$
end

Table 9: Hyperparameter configurations for best-performing models on the ICEWS14 dataset.