TeRo: A Time-aware Knowledge Graph Embedding via Temporal Rotation

In the last few years, there has been a surge of interest in learning representations of entities and relations in knowledge graphs (KGs). However, the recent availability of temporal knowledge graphs (TKGs) that contain time information for each fact has created the need for reasoning over time in such TKGs. In this regard, we present a new approach to TKG embedding, TeRo, which defines the temporal evolution of an entity embedding as a rotation from the initial time to the current time in the complex vector space. Specifically, for facts involving time intervals, each relation is represented as a pair of dual complex embeddings to handle the beginning and the end of the relation, respectively. We show that our proposed model overcomes the limitations of existing KG embedding models and TKG embedding models and is able to learn and infer various relation patterns over time. Experimental results on four different TKGs show that TeRo significantly outperforms existing state-of-the-art models for link prediction. In addition, we analyze the effect of time granularity on link prediction over TKGs, which, as far as we know, has not been investigated in previous literature.


Introduction
In recent years, a number of sizable Knowledge Graphs (KGs) have been constructed, including DBpedia (Auer et al., 2007), YAGO (Suchanek et al., 2007), Nell (Carlson et al., 2010) and Freebase (Bollacker et al., 2008). In these KGs, a fact is represented as a triple (s, r, o), where s (subject) and o (object) are entities (nodes), and r (relation) is the relation (edge) between them.
Several KG embedding (KGE) models have been developed to perform learning and inference over these KGs (Bordes et al., 2013; Yang et al., 2014; Trouillon et al., 2016; Sun et al., 2019). The most common learning task for these models is link prediction, which is to complete a fact with the missing entity. For instance, one can use a KGE model to answer an object query like (Barack Obama, visits, ?). If the time factor is disregarded, there are several valid answers to this query. Including time information makes the query more specific, e.g., (Barack Obama, visits, ?, 2014-07-08).
Some temporal KGs (TKGs), including ICEWS (Lautenschlager et al., 2015), GDELT (Leetaru and Schrodt, 2013), YAGO3 (Mahdisoltani et al., 2013) and Wikidata (Erxleben et al., 2014), store billions of time-aware facts as quadruples (s, r, o, t), where t is the time annotation, e.g., (Barack Obama, visits, Ukraine, 2014-07-08). The availability of these TKGs, which exhibit complex temporal dynamics in addition to their multi-relational nature, has created the need for approaches that can characterize and reason over them. Traditional KGE models disregard time information, which makes them ineffective at link prediction on TKGs involving temporary relations (e.g., visits, live in, etc.).
This work is licensed under a Creative Commons Attribution 4.0 International License. License details: http://creativecommons.org/licenses/by/4.0/.
In this paper, we propose a novel approach for TKGEs, TeRo, which defines the temporal evolution of an entity embedding as a rotation from the initial time to the current time in the complex vector space. We show the limitation of the existing TKGE models and the advantage of our proposed model on learning various relation patterns over time.
Specifically, for facts involving time intervals, each relation is represented as a pair of dual complex embeddings which are used to handle the beginning and the end of the relation, respectively. In this way, TeRo can adapt well to datasets where time annotations are represented in various forms: time points, beginning or end times, and time intervals.
As far as we know, most previous TKGE-related works use a specific time granularity for each TKG. For example, the time granularity of the ICEWS datasets is fixed at 24 hours in (Dasgupta et al., 2018). In this work, we adopt various time-division approaches for different TKG datasets and investigate the effect of the length of the time steps on the performance of our model.
To verify our approach, we compare the performance of our proposed models on link prediction and time prediction tasks over four different TKGs with the state-of-the-art KGE models and the existing TKGE models. The experimental results demonstrate that our proposed model outperforms other baseline models significantly by inferring various relation patterns and encoding time information.

Related Work
KGE models can be roughly classified into distance-based models and semantic matching models.
Distance-based models measure the plausibility of a fact as the distance between the two entities, usually after a translation or rotation carried out by the relation. A typical example of distance-based models is TransE (Bordes et al., 2013). TransE exhibits deficiencies when learning 1-n relations. Thus, various extensions of TransE (Wang et al., 2014; Ji et al., 2015; Lin et al., 2015) were proposed to tackle this problem. They use different mapping methods to project entities from the entity space to the relation space. In particular, RotatE (Sun et al., 2019) defines each relation as a rotation from the subject to the object. Nevertheless, these distance-based models are still unable to capture reflexive relations, i.e., relations r through which an entity is related to itself. In distance-based models, the values or the phases of the vectors of all reflexive relations are forced to be 0, which prevents the models from fully expressing the semantic characteristics of these relations.
Semantic matching models measure the plausibility of facts by matching the latent semantics of entities and relations embodied in their embedding representations. A few examples of such models include RESCAL (Nickel et al., 2011), DistMult (Yang et al., 2014), ComplEx (Trouillon et al., 2016), QuatE and GeomE (Xu et al., 2020). RESCAL and DistMult cannot capture asymmetric relations since the score of the triple (s, r, o) is always equal to the score of its symmetric triple (o, r, s). ComplEx, QuatE and GeomE have been proven to be able to capture various relation patterns for static KGs, but cannot model temporary relations in TKGs because they ignore time information.
Recent research has shown that the performance of KGE models can be further improved by incorporating the time information in TKGs. Some TKGE models are extended from TransE, e.g., TTransE (Leblay and Chekol, 2018), TA-TransE (García-Durán et al., 2018), HyTE (Dasgupta et al., 2018) and ATiSE. Other TKGE models are temporal extensions of DistMult, e.g., Know-Evolve (Trivedi et al., 2017), TDistMult (Ma et al., 2018) and TA-DistMult (García-Durán et al., 2018). Similar to TransE and DistMult, these TKGE models have issues with capturing reflexive relations or asymmetric relations. In contrast, DE-SimplE (Goel et al., 2020) incorporates time information into diachronic entity embeddings and is capable of modeling various relation patterns. However, this approach only focuses on event-based TKG datasets, and cannot model facts involving time intervals.

A Novel TKGE Approach based on Temporal Rotation
Although various KGE models have been developed to learn multi-relational interactions between entities, all of them have problems with inferring temporary relations, which are only valid at a certain time point or last for a certain time period. To illustrate this by example, assume we are given a quadruple (Barack Obama, visits, France, 2014-02-12) as a training sample, where the relation visits is a temporary relation. If we query (Barack Obama, visits, ?, 2014-07-08), a trained static KGE model probably returns the incorrect answer France due to the validity of the triple (Barack Obama, visits, France), while the correct answer is Ukraine considering the given time constraint. On the other hand, most of the existing TKGE models, which were extended from TransE (Bordes et al., 2013) and DistMult (Yang et al., 2014), incorporate time information in the embedding space, but have limitations on learning reflexive relations or asymmetric relations as discussed in Section 2.
To overcome the limitations of these existing KGE and TKGE models on learning and inferring over TKGs, we propose a new TKGE model, TeRo, which defines the temporal evolution of an entity embedding as a rotation in the complex vector space. Let E denote the set of entities, R denote the set of relations, and T denote the set of time steps. Then a TKG is a collection of factual quadruples (s, r, o, t), where s, o ∈ E are the subject and object entities, r ∈ R is the relation, and t denotes the actual time when the fact occurs. For any time t, we have a time step τ ∈ T representing this actual time. We map s, r, o to their complex embeddings, i.e., $\mathbf{s}, \mathbf{r}, \mathbf{o} \in \mathbb{C}^k$; then we define the functional mapping induced by each time step τ as an element-wise rotation from the time-independent entity embeddings $\mathbf{s}$ and $\mathbf{o}$ to the time-specific entity embeddings $\mathbf{s}_t$ and $\mathbf{o}_t$. The mapping function is defined as follows:

$$\mathbf{s}_t = \mathbf{s} \circ \boldsymbol{\tau}, \qquad \mathbf{o}_t = \mathbf{o} \circ \boldsymbol{\tau},$$

where $\circ$ denotes the element-wise (Hadamard) product between complex vectors. Here, we constrain the modulus of each element of $\boldsymbol{\tau} \in \mathbb{C}^k$, i.e., $\tau_j \in \mathbb{C}$, to be $|\tau_j| = 1$. By doing this, $\tau_j$ is of the form $e^{i\theta_{\tau,j}}$, which corresponds to a counter-clockwise rotation by $\theta_{\tau,j}$ radians around the origin of the complex plane, and only affects the phases of the entity embeddings in the complex vector space. This idea is motivated by Euler's formula $e^{i\theta} = \cos\theta + i\sin\theta$, which indicates that a complex number of unit modulus can be regarded as a rotation in the complex plane. We regard the relation embedding $\mathbf{r}$ as a translation from the time-specific subject embedding $\mathbf{s}_t$ to the conjugate of the time-specific object embedding, $\overline{\mathbf{o}}_t$, for a single quadruple $(s, r, o, t) \in Q^+$, where $r \in R_b \cup R_e$ and $Q^+$ denotes the set of all positive quadruples.
The score function is defined as:

$$f(s, r, o, t) = \lVert \mathbf{s}_t + \mathbf{r} - \overline{\mathbf{o}}_t \rVert.$$

For a fact (s, r, o, t) occurring in a certain time interval, i.e., $t = [t_b, t_e]$, where $t_b$ and $t_e$ denote the beginning time and the end time of the fact, we separate this fact into two quadruples, namely $(s, r_b, o, t_b)$ and $(s, r_e, o, t_e)$. Here, we extend the relation set R of a TKG that involves time intervals to a pair of dual relation sets, $R_b$ and $R_e$. A relation $r_b \in R_b$ is used to handle the beginning of relation r, while a relation $r_e \in R_e$ is used to handle the end of relation r. By doing this, we score a fact $(s, r, o, [t_b, t_e])$ as the mean of the scores of the two quadruples $(s, r_b, o, t_b)$ and $(s, r_e, o, t_e)$, which represent the beginning and the end of this fact, respectively:

$$f(s, r, o, [t_b, t_e]) = \frac{1}{2}\left( f(s, r_b, o, t_b) + f(s, r_e, o, t_e) \right).$$
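The rotation and the distance-based score described above can be illustrated with a minimal sketch in plain Python. It assumes that entity, relation and time embeddings are lists of complex numbers, that each time step is stored as a list of phases, and that the norm is the L1 norm; the function names and the toy values are hypothetical and are not taken from the released code.

```python
import cmath
import math


def rotate(entity, phases):
    """Rotate each complex coordinate by its phase: e_j * exp(i * theta_j)."""
    return [e * cmath.exp(1j * theta) for e, theta in zip(entity, phases)]


def score(s, r, o, phases):
    """Distance ||s_t + r - conj(o_t)||; using the L1 norm is an assumption."""
    s_t = rotate(s, phases)
    o_t = rotate(o, phases)
    return sum(abs(a + b - c.conjugate()) for a, b, c in zip(s_t, r, o_t))


def interval_score(s, r_b, r_e, o, phases_b, phases_e):
    """Mean of the scores of the begin and end quadruples of an interval fact."""
    return 0.5 * (score(s, r_b, o, phases_b) + score(s, r_e, o, phases_e))


# Toy embeddings with k = 2 (illustrative values only).
s = [1 + 1j, 0.5 - 2j]
o = [2 - 1j, -1 + 0.5j]
phases = [math.pi / 2, math.pi]

s_t = rotate(s, phases)
o_t = rotate(o, phases)
# Choose r so that the quadruple is fit exactly at this time step.
r = [c.conjugate() - a for a, c in zip(s_t, o_t)]

print([round(abs(x), 6) for x in s_t])       # moduli of the rotated subject
print(round(score(s, r, o, phases), 6))      # score at the fitted time step
print(round(score(s, r, o, [0.0, 0.0]), 6))  # score at a different time step
```

Because every time element has unit modulus, the rotation changes only the phases of the entity coordinates and leaves their moduli untouched, while the same relation embedding yields a different score at a different time step.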
Specifically, for a fact missing either the beginning time or the end time, e.g., $(s, r, o, [t_b, -])$ or $(s, r, o, [-, t_e])$, the score of this fact is equal to the score of the quadruple involving the known time, i.e.,

$$f(s, r, o, [t_b, -]) = f(s, r_b, o, t_b), \qquad f(s, r, o, [-, t_e]) = f(s, r_e, o, t_e).$$

In this paper, we use the same loss function as the negative sampling loss proposed in (Sun et al., 2019) for optimizing our model. This loss function has proved to be very effective for optimizing some distance-based KGE models, e.g., TransE, RotatE (Sun et al., 2019) and ATiSE:

$$\mathcal{L} = -\log \sigma\left(\gamma - f(\xi)\right) - \sum_{i=1}^{\eta} \frac{1}{\eta} \log \sigma\left(f(\xi'_i) - \gamma\right),$$
where $\xi \in Q^+$ is a positive training quadruple, $\xi'_i$ is the i-th negative sample corresponding to $\xi$, generated by randomly corrupting the subject or the object of $\xi$, e.g., $(s', r, o, t)$ or $(s, r, o', t)$, $\sigma(\cdot)$ denotes the sigmoid function, $\gamma$ is a fixed margin and $\eta$ is the ratio of negative to positive training samples.
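As a minimal sketch of this loss for a single positive quadruple, assuming uniformly weighted negatives (i.e., without the self-adversarial weighting that Sun et al. (2019) also describe) and assuming the scores have already been computed by a distance-based score function, one could write the following. The function names are hypothetical.

```python
import math


def log_sigmoid(x):
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def negative_sampling_loss(pos_score, neg_scores, gamma):
    """-log sigma(gamma - f(xi)) - (1/eta) * sum_i log sigma(f(xi'_i) - gamma)."""
    eta = len(neg_scores)
    pos_term = -log_sigmoid(gamma - pos_score)
    neg_term = -sum(log_sigmoid(f - gamma) for f in neg_scores) / eta
    return pos_term + neg_term


gamma = 10.0
# A well-fit positive (small distance) with distant negatives gives a small loss.
good = negative_sampling_loss(0.0, [30.0, 40.0], gamma)
# A badly fit positive with close negatives gives a large loss.
bad = negative_sampling_loss(30.0, [0.0, 0.0], gamma)
print(good < bad)
```

The margin γ separates the two regimes: the loss pushes positive scores below γ and negative scores above it.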

Learning Various Relation Patterns
Static KGE models and some existing TKGE models which are the temporal extensions of TransE or DistMult have limitations on capturing some key relation patterns which are defined as follows.
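Definition 1. A relation r is a temporary relation if ∃ s, o, $t_1$, $t_2$ such that $r(s, o, t_1) \wedge \neg r(s, o, t_2)$ holds true.
Definition 2. A relation r is an asymmetric relation if ∃ s, o, t such that $r(s, o, t) \wedge \neg r(o, s, t)$ holds true.
Definition 3. A relation r is a reflexive relation if ∃ s, t such that r(s, s, t) holds true.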
As mentioned in Section 2, static KGE models cannot model temporary relations, e.g., 'visits', since they ignore time information. Temporal extensions of DistMult (denoted as T-DistMult), including TDistMult, TA-DistMult and Know-Evolve, cannot model asymmetric relations.
By defining each time step as a rotation in the complex vector space, TeRo can capture all three of the above relation patterns. Given an observed fact $(s, r, o, t_1)$ where $\mathbf{s}_{t_1} + \mathbf{r} = \overline{\mathbf{o}}_{t_1}$:
• as shown in Figure 1(b), if r is a temporary relation, we can have $\mathbf{s}_{t_2} + \mathbf{r} \neq \overline{\mathbf{o}}_{t_2}$ for TeRo to make $r(s, o, t_1) \wedge \neg r(s, o, t_2)$ hold true.
• as shown in Figure 1(c), if r is an asymmetric relation, we can have $\mathbf{o}_{t_1} + \mathbf{r} \neq \overline{\mathbf{s}}_{t_1}$ for TeRo to make $r(s, o, t_1) \wedge \neg r(o, s, t_1)$ hold true.
• as shown in Figure 1(d), if r is a reflexive relation, we have $\mathrm{Im}(\mathbf{r}) = -2\,\mathrm{Im}(\mathbf{s}_{t_1})$ for TeRo. Thus, TeRo can represent multiple reflexive relations as different embeddings due to the conjugate operation on object embeddings.
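The three cases can be checked numerically with a one-dimensional toy example (k = 1). The sketch below assumes the score |s_t + r - conj(o_t)| with each time step given by a single rotation angle; all values are arbitrary and purely illustrative.

```python
import cmath


def score(s, r, o, theta):
    """|s_t + r - conj(o_t)| for one complex coordinate rotated by angle theta."""
    tau = cmath.exp(1j * theta)
    return abs(s * tau + r - (o * tau).conjugate())


s, o = 1 + 2j, 3 - 1j
theta1, theta2 = 0.3, 1.1  # two different time steps

# Fit r so that the fact (s, r, o, t1) holds exactly.
tau1 = cmath.exp(1j * theta1)
r = (o * tau1).conjugate() - s * tau1

holds_t1 = score(s, r, o, theta1)    # close to 0: the fact holds at t1
temporary = score(s, r, o, theta2)   # positive: the same fact fails at t2
asymmetric = score(o, r, s, theta1)  # positive: the reversed fact fails at t1

# Reflexive case: s_t + r = conj(s_t) gives Re(r) = 0 and Im(r) = -2 Im(s_t).
s_t1 = s * tau1
r_reflexive = s_t1.conjugate() - s_t1

print(round(holds_t1, 6), round(temporary, 3), round(asymmetric, 3))
print(r_reflexive)
```

A distance of (numerically) zero indicates that the quadruple is fit exactly, so the same relation embedding can hold at one time step and fail at another, or hold in one direction and fail in the other.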

Complexity
In Table 1, we summarize the scoring functions and the space complexities of several state-of-the-art TKGE approaches and our model, as well as TransE. $n_e$, $n_r$, $n_\tau$ and $n_{token}$ are the numbers of entities, relations, time steps and temporal tokens used in (García-Durán et al., 2018); d is the dimensionality of the embeddings. $P_t(\cdot)$ denotes the temporal projection for embeddings (Dasgupta et al., 2018). LSTM(·) denotes an LSTM neural network; $[r; t_{seq}]$ denotes the concatenation of the relation embedding and the sequence of temporal tokens (García-Durán et al., 2018). $\overrightarrow{\cdot}$ and $\overleftarrow{\cdot}$ denote the temporal part and the non-temporal part of a time-specific diachronic entity embedding (Goel et al., 2020); $r^{-1}$ denotes the inverse relation embedding of r, i.e., $(s, r, o, t) \leftrightarrow (o, r^{-1}, s, t)$. $D_{KL}(\cdot)$ denotes the KL divergence between two Gaussian distributions; $P_{s,t}$, $P_{o,t}$, $P_{r,t}$ denote the Gaussian embeddings of s, o and r at time t.
As shown in Table 1, the space complexities of TeRo and TransE are close if $n_\tau < n_e$. In practice, this condition can be met by tuning the time granularity.
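To make this comparison concrete, the following sketch counts embedding parameters under the assumption that a static model stores one d-dimensional vector per entity and relation, and that a temporal model additionally stores one per time step. The sizes used are hypothetical and are not dataset statistics from this paper.

```python
import math


def num_time_steps(span_days, unit_days):
    """Number of fixed-length time steps needed to cover a time span."""
    return math.ceil(span_days / unit_days)


def embedding_parameters(n_e, n_r, d, n_tau=0):
    """(n_e + n_r + n_tau) * d; n_tau = 0 corresponds to a static model."""
    return (n_e + n_r + n_tau) * d


# Hypothetical sizes for illustration only.
n_e, n_r, d = 7000, 230, 500
static = embedding_parameters(n_e, n_r, d)
for u in (1, 7, 30):
    n_tau = num_time_steps(365, u)
    temporal = embedding_parameters(n_e, n_r, d, n_tau)
    print(u, n_tau, round(temporal / static, 4))
```

Under these assumptions, a coarser time unit shrinks the number of time steps and moves the ratio of temporal to static parameters toward 1.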

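Table 1: Scoring functions and space complexities of TransE, several state-of-the-art TKGE approaches and TeRo (columns: Model, Scoring Function, Space Complexity).

We list the statistics of the four datasets we use in Table 2.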

Time Granularity
In some recent work (Dasgupta et al., 2018), the time span of a TKG dataset was split into a number of time steps. For ICEWS14 and ICEWS05-15, the time granularity was fixed at 1 day. For YAGO11k and Wikidata12k, month and day information was dropped, and less frequent year mentions were clubbed into the same time steps, while years with high frequency formed individual time steps, in order to alleviate the effect of the long-tail property of time data. In other words, the lengths of the time steps differed so as to balance the numbers of triples across time steps. However, it has not been investigated whether the lengths of time steps affect the performance of TKGE models. In this work, we test our model with different time units, denoted as u, in the range {1, 2, 3, 7, 14, 30, 90, 365} days for the ICEWS datasets. Dasgupta et al. (2018) applied a minimum threshold of 300 triples per interval during construction for YAGO11k and Wikidata12k. We follow their time-division approach for these two datasets and test different minimum thresholds, denoted as thre, in the range {1, 3, 10, 30, 100, 300, 1000, 3000, 10000, 30000}.
Changing the time granularity reconstructs the set of time steps T. For ICEWS14, when the time unit u is 1 day, we have 365 time steps in total and the date 2014-01-02 is represented by the second time step, i.e., $\tau_1$. If the time unit is changed to 2 days, the total number of time steps is 183 and the date 2014-01-02 is denoted as $\tau_0$. For YAGO11k, when the minimum threshold thre = 1, we have 396 time steps since 396 different years appear as timestamps in YAGO11k. Years like -453, 100 and 2008 are all taken as independent time steps. When thre for YAGO11k rises to 300, the number of time steps drops to 127 and the years between -431 and 100 are clubbed into the same time step.
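Both time-division schemes can be sketched in a few lines of Python. The fixed-unit mapping reproduces the ICEWS14 example above. The threshold-based binning is only one plausible greedy reading of the clubbing procedure (walk through the years in order and close a time step once it holds at least thre facts) and should be treated as an assumption rather than the exact procedure of Dasgupta et al. (2018). The function names and the year counts are hypothetical.

```python
import math
from datetime import date


def time_step_index(day, start, unit_days):
    """Zero-based index of the fixed-length time step containing `day`."""
    return (day - start).days // unit_days


def num_time_steps(start, end, unit_days):
    """Number of time steps covering the inclusive range [start, end]."""
    return math.ceil(((end - start).days + 1) / unit_days)


def bin_years_by_threshold(year_counts, thre):
    """Greedy sketch: close a bin once it holds at least `thre` facts."""
    bins, current, total = [], [], 0
    for year in sorted(year_counts):
        current.append(year)
        total += year_counts[year]
        if total >= thre:
            bins.append(current)
            current, total = [], 0
    if current:  # fold leftover sparse years into the last bin
        if bins:
            bins[-1].extend(current)
        else:
            bins.append(current)
    return bins


start, end = date(2014, 1, 1), date(2014, 12, 31)
print(num_time_steps(start, end, 1), time_step_index(date(2014, 1, 2), start, 1))
print(num_time_steps(start, end, 2), time_step_index(date(2014, 1, 2), start, 2))

toy_counts = {-453: 1, 100: 2, 2007: 400, 2008: 500}  # made-up counts
print(bin_years_by_threshold(toy_counts, 1))
print(bin_years_by_threshold(toy_counts, 300))
```

With thre = 1 every year that appears forms its own time step, while larger thresholds merge sparse years into shared time steps.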

Evaluation Metrics
We evaluate our model on the link prediction task over TKGs under the time-wise filtered setting defined in (Goel et al., 2020). This task is to complete a time-wise fact with a missing entity. For a test quadruple (s, r, o, t), we first generate candidate quadruples $C = \{(s, r, o', t) : o' \in E\} \cup \{(s', r, o, t) : s' \in E\}$ by replacing s or o with all possible entities. Different from the time-unwise filtered setting (Bordes et al., 2013), which filters from the candidate list the triples appearing in the training, validation or test set, we only filter the quadruples $\xi \in Q^+$ existing in the dataset. This ensures that facts which do not hold at time t are still considered as candidates when evaluating the given test quadruple. We obtain the final rank of the test quadruple among the filtered candidate quadruples $C' = \{\xi' : \xi' \in C, \xi' \notin Q^+\}$ by sorting their scores. Two commonly used evaluation metrics are used here, i.e., Mean Reciprocal Rank and Hits@k. The Mean Reciprocal Rank (MRR) is the mean of the reciprocal values of all computed ranks, and Hits@k is the fraction of test quadruples ranked in the top k.
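The time-wise filtered ranking and the two metrics can be sketched as follows. The sketch assumes a distance-based score (lower is better), handles only object queries (subject queries are symmetric), and uses made-up entity names and scores.

```python
def filtered_rank(test, scores, known_quadruples):
    """Rank of the true object among candidates under time-wise filtering.

    test: (s, r, o, t); scores: candidate object -> distance (lower is better);
    known_quadruples: all quadruples in the training, validation and test sets.
    """
    s, r, o, t = test
    rank = 1
    for candidate, value in scores.items():
        if candidate == o:
            continue
        if (s, r, candidate, t) in known_quadruples:
            continue  # another correct answer at the same time: filtered out
        if value < scores[o]:
            rank += 1
    return rank


def mrr(ranks):
    """Mean of the reciprocal ranks."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of test quadruples ranked within the top k."""
    return sum(rank <= k for rank in ranks) / len(ranks)


known = {
    ("e1", "visits", "e2", "2014-02-12"),
    ("e1", "visits", "e3", "2014-07-08"),
    ("e1", "visits", "e4", "2014-07-08"),
}
test = ("e1", "visits", "e4", "2014-07-08")
scores = {"e2": 0.2, "e3": 0.3, "e4": 0.5, "e5": 0.9}

rank = filtered_rank(test, scores, known)
print(rank, mrr([2, 1, 4]), hits_at_k([2, 1, 4], 3))
```

In this toy case e3 is filtered because (e1, visits, e3) also holds at the query time, whereas e2 stays in the candidate list because the corresponding fact holds only at a different time.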

Baselines
We compare our approach with several state-of-the-art KGE approaches and existing TKGE approaches, including TransE (Bordes et al., 2013), DistMult (Yang et al., 2014), ComplEx-N3 (Lacroix et al., 2018), RotatE (Sun et al., 2019), QuatE, TTransE (Leblay and Chekol, 2018), TA-TransE, TA-DistMult (García-Durán et al., 2018), DE-SimplE (Goel et al., 2020) and ATiSE. The results of most baselines are taken from recent work (Goel et al., 2020) which used the same evaluation protocol as ours. DE-SimplE, which mainly focuses on event-based datasets, cannot model time intervals or time annotations missing month and day information, which are common in YAGO and Wikidata. Thus, its results on YAGO11k and Wikidata12k are unobtainable. Since the original source code of TA-TransE and TA-DistMult (García-Durán et al., 2018) has not been released, we reimplemented these models according to the implementation details reported in the original paper in order to obtain their results on YAGO11k and Wikidata12k.

Experimental Setup
We implement our proposed model in PyTorch. The code is available at https://github.com/soledad921/ATISE.
We select the optimal hyperparameters by early stopping according to MRR on the validation set. We restrict the number of iterations to 5000. Following the setup of previous work, the batch size b = 512 is kept for all datasets, the embedding dimensionality k is tuned in {100, 200, 300, 400, 500}, the ratio of negative to positive training samples η is tuned in {1, 3, 5, 10} and the margin γ is tuned in {1, 2, 3, 5, 10, 20, ..., 120}. As the optimizer, we choose Adagrad for TeRo and tune the learning rate r in the range {1, 0.3, 0.1, 0.03, 0.01}. In addition, the time granularity parameters u and thre are also regarded as hyperparameters for TeRo, as mentioned in Section 4.2.

Ablation Study
In this work, we analyze the effect of changing the time granularity on the performance of our model. As mentioned in Section 4.2, we adopt two different time-division approaches for event-based datasets, i.e., the ICEWS datasets, and for time-wise KGs involving time intervals, i.e., YAGO11k and Wikidata12k. For ICEWS14 and ICEWS05-15, we use time steps of fixed length since the distribution of the numbers of facts over time in the ICEWS datasets is relatively uniform, as shown in Figure 2.
In ICEWS14, the time distribution is relatively uniform and thus representing time with a small time granularity can provide more abundant time information. As shown in Figure 3, TeRo with small time granularities, e.g., 1 day, 2 days and 3 days, performed better on ICEWS14 than TeRo with big time granularities regarding MRR and Hits@3. Likewise, the optimal time unit for TeRo on ICEWS05-15 was found in our experiments to be 2 days. For Wikidata12k, using a very small time granularity was non-optimal due to the long-tail property of time data. On the other hand, using an overly big time granularity prevented time information from being incorporated effectively. Figure 3 demonstrates the low performance of TeRo with big time granularities. More concretely, when the time unit u was 1 year, all time annotations in ICEWS14 were represented by a single time embedding, which meant this time embedding was temporally meaningless. Table 5 demonstrates a few examples of link prediction results on ICEWS14 of TeRo models with time units u of two days and one year.
As shown in Table 5, in many cases, TeRo with u = 1 predicted correctly, while TeRo with u = 365 gave wrong predictions. We notice that these predictions of TeRo with u = 365 in Table 5 would be valid if we disregarded the time constraint. For instance, (Colombia, Host a visit, John F. Kelly) happened on 2014-03-27, and (UN Security Council, Criticize or denounce, Armed Band (South Sudan)) was true on 2014-08-07. As mentioned in Section 3, Host a visit and Criticize or denounce are temporary relations. The above results show that using a reasonable time granularity helps TeRo to incorporate time information effectively, and that the inclusion of time information enables TeRo to capture temporary relations and improves its performance on link prediction over TKGs.

Efficiency Study
TeRo has the same space complexity as TTransE (Leblay and Chekol, 2018).
It is also noteworthy that representing each relation as a pair of dual complex embeddings helps to save training time on TKGs involving time intervals. Given a fact $(s, r, o, [t_b, t_e])$, some TKGE models, e.g., HyTE and ATiSE, discretize this fact into several quadruples involving consecutive time points, i.e., $[(s, r, o, t_b), (s, r, o, t_b + 1), \cdots, (s, r, o, t_e)]$. When thre = 300, each fact lasts on average for around 15 and 8 time steps in YAGO11k and Wikidata12k, respectively. In other words, a method that discretizes facts involving time intervals expands the sizes of the two datasets by about 15 and 8 times, respectively. In our model, we propose a more efficient method to handle time intervals by using two different quadruples, $(s, r_b, o, t_b)$ and $(s, r_e, o, t_e)$, to represent the beginning and the end of each fact. In this way, we expand the sizes of the datasets to less than twice their original sizes.
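The difference between the two ways of handling an interval fact can be sketched as follows, assuming integer time-step indices and representing the dual relations by hypothetical string suffixes "_b" and "_e". The example fact and its time steps are made up.

```python
def discretize(fact):
    """One quadruple per time step in [t_b, t_e], as in interval discretization."""
    s, r, o, (t_b, t_e) = fact
    return [(s, r, o, t) for t in range(t_b, t_e + 1)]


def dual_quadruples(fact):
    """At most two quadruples: the beginning and the end of the fact."""
    s, r, o, (t_b, t_e) = fact
    quads = []
    if t_b is not None:
        quads.append((s, r + "_b", o, t_b))
    if t_e is not None:
        quads.append((s, r + "_e", o, t_e))
    return quads


fact = ("s", "isMarriedTo", "o", (10, 24))  # spans 15 time steps
print(len(discretize(fact)))
print(dual_quadruples(fact))
print(dual_quadruples(("s", "isMarriedTo", "o", (10, None))))
```

A fact with a missing endpoint yields a single quadruple, which is why the expanded dataset stays below twice its original size.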
For relations r in YAGO11k, we analyze the similarities between the embeddings $\mathbf{r}_b$ and $\mathbf{r}_e$. As shown in Figure 4, for short-term relations, e.g., deadIn, the real parts of $\mathbf{r}_b$ and $\mathbf{r}_e$, as well as their imaginary parts, are highly similar since $r_b$ and $r_e$ always happen at the same time and have the same semantics. By contrast, for long-term relations, e.g., isMarriedTo, the real parts of $\mathbf{r}_b$ and $\mathbf{r}_e$ show their semantic similarity and the imaginary parts capture their temporal dissimilarity.

Conclusion
In this work, we introduce TeRo, a new TKGE model which represents entities or relations as single or dual complex embeddings and temporal changes as rotations of entity embeddings in the complex vector space. Our model is advantageous in its capability of modelling several key relation patterns and handling time annotations in various forms. Experimental results show that TeRo remarkably outperforms the existing state-of-the-art KGE models and TKGE models on link prediction over four well-established TKG datasets. In addition, we adopt two different time-division approaches for various datasets and investigate the effect of the time granularity on the performance of our model.