RTFE: A Recursive Temporal Fact Embedding Framework for Temporal Knowledge Graph Completion

Static knowledge graph (SKG) embedding (SKGE) has been studied intensively in the past years. Recently, temporal knowledge graph (TKG) embedding (TKGE) has emerged. In this paper, we propose a Recursive Temporal Fact Embedding (RTFE) framework to transplant SKGE models to TKGs and to enhance the performance of existing TKGE models for TKG completion. Different from previous work, which ignores the continuity of TKG states over time, we treat the sequence of graphs as a Markov chain that transitions from the previous state to the next. RTFE uses SKGE to initialize the embeddings of the TKG. It then recursively tracks the state transition of the TKG by passing updated parameters/features between timestamps. Specifically, at each timestamp we approximate the state transition as a gradient update process. Since RTFE learns each timestamp recursively, it naturally extends to future timestamps. Experiments on five TKG datasets show the effectiveness of RTFE.


Introduction
A temporal knowledge graph (TKG) is an extension of static knowledge graphs (SKGs) that introduces the time dimension. In SKGs, facts are considered time-invariant (Sil and Cucerzan, 2014). In reality, facts are not always true. For example, the triple (Obama, President, United States) was true only from 2009 to 2016, and (Obama, married, Michelle) has held since 1992. However, SKGs do not reflect the change of facts over time. An example of a TKG is shown in Figure 1. Likewise, facts on social networks, e-commerce platforms and trading platforms also change over time. Therefore, TKGs have the potential to improve the performance of KG-based question answering, search, recommendation and prediction (Huang et al., 2020; Garg et al., 2020).

* Corresponding author: Haihong E

A TKG can be expressed as a set of quadruples (subject, relation, object, timestamp). Different from SKGs, which ignore the time attribute of facts, the facts of a TKG are distributed over timestamps, which can reflect the dynamic change of entities and relations over time. Due to the limited coverage of KGs, TKGs are also incomplete. By completing a TKG, missing and potential knowledge under specific timestamps can be found.
In recent years, a lot of work (Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015; Kazemi and Poole, 2018; Schlichtkrull et al., 2018; Sun et al., 2019; Zhang et al., 2020) has focused on KG completion via graph embedding. These efforts have yielded good results, but most of them focused on SKGs and required training on a large number of triples.
However, a TKG under a certain timestamp is a sparse multi-relation graph (Esteban et al., 2016), so it is necessary to absorb information from other timestamps. What's more, SKGE methods lack modeling of the time attribute of relations and were proposed under the assumption that all facts occur at the same time, so they cannot reflect the temporal dependencies of facts. To handle these two problems, RTFE passes parameters and features between timestamps in a recursive manner, which not only alleviates the sparsity problem of TKGs but also takes advantage of the continuity and relevance of facts.
Existing TKG completion methods (Dasgupta et al., 2018; Goel et al., 2019; Lacroix et al., 2020) follow the training pattern of SKGE, which randomly shuffles facts (s, r, o, t) from different times and learns all facts in arbitrary temporal order by mini-batch gradient descent. In other words, they take time merely as a parameter and ignore the correlations in time evolution.
[Figure 1: A toy example of TKG where solid edges represent observed edges and red edges represent new facts that occurred at that timestamp. Besides, dotted edges represent missing or potential facts.]

However, an early state may affect later ones, and later facts tend to depend on earlier ones. In particular, the state at t_i directly influences that at t_{i+1}. E.g., (Obama, Campaign, President, 2008) directly influences (Obama, Inaugurated as, President, 2009). It has been verified that the chronological order of events can be used to improve the performance of link prediction (Jiang et al., 2016a; Jiang et al., 2016b). Based on this, we further find that an early training state can improve a later one if we train facts in their chronological order. In order to capture changes in the TKG's state transition, we think of a TKG as a sequence of dynamic graphs, not as a whole graph labeled with time information.
Besides, since new facts at future timestamps can be added to a TKG, the TKG expands dynamically, and the graphs of new timestamps may still be incomplete. However, existing TKG completion methods provide no solution for completing unseen future graphs. In their training pattern, facts of all timestamps are trained jointly to complete the graphs that have already appeared, so models may need to be retrained on facts of all timestamps when a new timestamp appears. In contrast, RTFE embeds and completes each timestamp of the TKG in a recursive way. By using the information of previous timestamps, RTFE naturally extends to future timestamps during the state transition of parameters/features. RTFE only needs to be trained on newly emerging facts, which is lightweight and immediate.
SKGE has been studied for many years while TKGE is still in its infancy. Problems encountered in SKGE (e.g., diverse relation patterns) can also occur in TKGE. Thus, advances in SKG research can be used to accelerate the development of TKGE if we bridge the gap between them. RTFE provides a way to migrate SKGE methods to TKGs while preserving their excellent performance.
Further, existing TKG completion methods designed specifically for the characteristics of TKGs can also be enhanced using the training pattern of RTFE. To sum up, we make the following contributions:
1. We propose a training pattern to bridge the gap between SKGE and TKGE, so that state-of-the-art SKGE models can be used to accelerate the development of TKGE.
2. Existing TKGE models can be further enhanced with our framework RTFE, after finishing their own regular training.
3. To the best of our knowledge, we are the first to deal with the TKG evolution problem (i.e., new future timestamps are added to TKGs) in the TKG completion task.
4. Experimental results on 5 TKG datasets show that RTFE preserves SKGE models' excellent performance, and that the predictive performance of state-of-the-art TKGE models is further enhanced using RTFE.

Problem definition
A temporal knowledge graph (TKG) can be represented as a sequence of graphs, i.e., G = {G_{t_1}, . . . , G_{t_n}}, where G_{t_i} = {(s, r, o, t_i)} is the set of quadruples that occurred at timestamp t_i; V is the set of G's entities with s, o ∈ V, and R is the set of G's relations with r ∈ R. We focus on the following task: given a training TKG G_train = {G_{t_1}, . . . , G_{t_n}}, infer the missing quadruples (s, r, o, t) in the test set G_test = {G_{t_1}, . . . , G_{t_n}} (i.e., assign high scores to true quadruples and low scores to false ones). As shown in Figure 1, missing facts with high probability are dotted.
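The sequence-of-graphs view above can be sketched with a minimal data model; this is an illustrative sketch, not the paper's code, and all names (`build_tkg`, the example facts) are hypothetical.

```python
from collections import defaultdict

def build_tkg(quadruples):
    """Group (s, r, o, t) facts into per-timestamp graphs G_{t_i},
    returned in chronological order of timestamps."""
    tkg = defaultdict(set)
    for s, r, o, t in quadruples:
        tkg[t].add((s, r, o, t))
    return dict(sorted(tkg.items()))

facts = [
    ("Obama", "Campaign", "President", 2008),
    ("Obama", "Inaugurated_as", "President", 2009),
    ("Obama", "President_of", "United_States", 2009),
]
tkg = build_tkg(facts)  # {2008: {1 fact}, 2009: {2 facts}}
```

Completion then amounts to scoring candidate quadruples (s, r, o, t) within each G_{t_i}.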

Recursive Temporal Fact Embedding (RTFE) Framework
The state of a TKG changes as its entities and relations change over time. SKGE models fail to capture correlations during state transition, and existing TKGE models for TKG completion capture them only implicitly. It can be observed that the TKG after the current change builds on the closest former state, which is similar to a first-order Markov chain (given the state at the current moment, the state at the next moment is independent of the states at past moments). Inspired by Markov analysis, we use the time granularity of the TKG to discretely divide states. Then the basic model of RTFE can be expressed as:

S_{t_{i+1}} = S_{t_i} P_{t_i}    (1)

where S_{t_i} represents the state of G_{t_i} and P_{t_i} represents the probability transition matrix that transforms S_{t_i} to the state at t_{i+1}. A typical KG embedding learner uses its parameters θ and features X to represent the semantic information of the KG. Thus we approximate the state vectors as:

S_{t_i} ≈ (θ_{t_i}, X_{t_i})

The idea of RTFE is to dynamically adjust θ and X as the TKG changes while passing on the information of each timestamp's graph. We simply assume the features and parameters satisfy the Markov property:

P(θ_{t_{i+1}}, X_{t_{i+1}} | θ_{t_i}, X_{t_i}, . . . , θ_{t_1}, X_{t_1}) = P(θ_{t_{i+1}}, X_{t_{i+1}} | θ_{t_i}, X_{t_i})

where X_{t_i} and θ_{t_i} denote the features and parameters at time t_i.
RTFE does not specify a model, but rather a training method for TKG completion. Existing SKGE methods, as well as TKGE methods that follow the SKGE training pattern such as DE-SimplE (Goel et al., 2019) and TComplEx (Lacroix et al., 2020), can be utilized as the embedding component. The RTFE framework is illustrated in Figure 2. In section 3, we specify how RTFE uses SKGE models for TKG completion. In section 4, we generalize RTFE to existing TKGE models to enhance their performance.

Preliminary training for static features
Instead of training from scratch, RTFE uses SKGE to produce the input to the first timestamp. In order to obtain the input features, the TKG is transformed into an SKG G_static by merging the facts of every timestamp:

G_static = {(s, r, o) | (s, r, o, t_i) ∈ G_{t_i}, 1 ≤ i ≤ n}

Let the SKG embedding learner be parameterized by θ; it takes the knowledge graph G (the facts of G) and the features X (which can be predefined or randomly initialized) as inputs. We feed G_static and X to the learner and, after training, obtain the updated features X, which become the input to the first timestamp.
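The merging step can be sketched as follows; `to_static` is an illustrative name, and the toy TKG is invented for the example.

```python
def to_static(tkg):
    """Collapse a TKG into G_static: drop the timestamp and merge the
    facts of every timestamp (a fact repeated at several timestamps
    becomes a single static triple)."""
    return {(s, r, o) for quads in tkg.values() for (s, r, o, _) in quads}

tkg = {
    2008: {("Obama", "Campaign", "President", 2008)},
    2009: {("Obama", "Campaign", "President", 2009),
           ("Obama", "Inaugurated_as", "President", 2009)},
}
g_static = to_static(tkg)  # two distinct triples after merging
```

The resulting G_static is what the SKG embedding learner is pre-trained on.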

Learning each timestamp recursively
In TKGs, parameters θ and features X should change with time (i.e., with the change of the TKG). We find that, due to the continuity of facts, most facts are the same in adjacent timestamps, while only a small number of facts change. As for discrete events, they influence the states of the surrounding entities, so that these entities may produce new facts. Therefore, model parameters and features that fit a certain timestamp provide a good starting point for learning the next timestamp.
Different from most neural network-based SKGE models (Schlichtkrull et al., 2018; Wu et al., 2019), which only update θ during training and leave the input features X unchanged, we let X be updated as well, to capture the temporal dynamics of entities and relations.
Therefore, in our framework RTFE, model parameters θ and input features X are both updated during the state transition in a way similar to equation (1):

(θ_{t_{i+1}}, X_{t_{i+1}}) = (θ_{t_i}, X_{t_i}) P_{t_i}

where θ_{t_i} and X_{t_i} denote the state vectors of θ and X at time t_i, and P_{t_i} represents the probability transition matrix. To transform the state vectors at t_i to those at t_{i+1}, we approximate the state transition P_{t_i} as the gradient update process of learning G_{t_i} (i.e., updating according to the gradient of the loss function for several epochs):

θ_{t_{i+1}} = θ_{t_i} − α ∇_θ l(G_{t_i}; θ_{t_i}, X_{t_i})    (6)
X_{t_{i+1}} = X_{t_i} − α ∇_X l(G_{t_i}; θ_{t_i}, X_{t_i})    (7)

where α is the learning rate, l is the loss function defined by the specified embedding learner, and ∇_θ (resp. ∇_X) is the gradient of l with respect to θ (resp. X).
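The recursive training pattern of equations (6)-(7) can be sketched schematically; this is a framework skeleton under assumptions, not the paper's implementation, and `loss_grad` is a hypothetical stand-in for whatever gradients the chosen embedding learner provides.

```python
def transition(theta, X, graph_t, loss_grad, lr=0.01, epochs=10):
    """Approximate the state transition P_{t_i} by several epochs of
    gradient descent on G_{t_i}, starting from the previous state."""
    for _ in range(epochs):
        g_theta, g_X = loss_grad(graph_t, theta, X)
        theta = [p - lr * gp for p, gp in zip(theta, g_theta)]
        X = [x - lr * gx for x, gx in zip(X, g_X)]
    return theta, X  # the state (theta_{t_{i+1}}, X_{t_{i+1}})

def rtfe_train(tkg, theta, X, loss_grad):
    """Pass the updated state recursively across timestamps in
    chronological order; the state reached at t_i is what is used
    to test G_{t_i} before moving on to t_{i+1}."""
    states = {}
    for t in sorted(tkg):
        theta, X = transition(theta, X, tkg[t], loss_grad)
        states[t] = (theta, X)
    return states
```

Only the latest (θ, X) must be kept in memory during training, which is why the scheme scales to large TKGs.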
It must be pointed out that the state transition matrix in classical Markov analysis is fixed, so that analysis is generally applicable only to short-term prediction. In our setting, the state vectors differ across states and the gradient between states also differs. Since the state vector is fixed within a specific state, a model can be established for each discrete state according to the time interval of the TKG; the gradient update between states then adapts to the difference of each state vector, which allows our framework to proceed.
RTFE recursively trains each timestamp according to equation (6) and (7) and uses θ t i and X t i to test G t i . Since RTFE is trained and tested by timestamp, only the latest parameters and features need to be stored, which shows good scalability for large TKGs. The framework RTFE is illustrated in Figure 2 and the overall training and testing algorithm is shown in Algorithm 1.
For graph neural network-based methods like RGCN (Schlichtkrull et al., 2018), we let the input features be updated by gradient descent as well, so that they encode the information of each timestamp and enhance the information transfer between timestamps. In addition, a residual connection is added between the network inputs and outputs at each timestamp.
For RDGCN (Wu et al., 2019), which was designed for entity alignment, we design a distance function consisting of a type distance and a semantic distance in order to measure the plausibility of a triple (s, r, o) for SKG completion, where X_E ∈ R^{|V|×d} and X_R ∈ R^{|R|×2d} denote the output entity and relation representations.

Enhancing TKGE models
Since existing TKGE models for TKG completion, such as DE-SimplE (Goel et al., 2019) and TComplEx (Lacroix et al., 2020), follow the training pattern of SKGE models (i.e., treat the TKG as a whole graph, not as a sequence of graphs), we can use them as the embedding learner of RTFE. Specifically, we treat their own training process as the preliminary training of RTFE. After a TKGE model finishes its own training process, we use the obtained features and parameters as the input to the learning of the first timestamp. Then RTFE trains the TKGE model recursively by equations (6) and (7).

Extensibility for future timestamps
Since RTFE embeds each timestamp recursively, transforming from the current state to the next state, it provides a way to complete upcoming future timestamps. Specifically, given a sequence of observed graphs of a TKG, G_obs = {G_{t_1}, . . . , G_{t_n}}, and a sequence of upcoming future graphs, G_fut = {G_{t_{n+1}}, . . . , G_{t_{n+j}}}, we pre-train RTFE on G_obs, then embed each timestamp recursively to obtain the latest features X_{t_n} and parameters θ_{t_n}. To complete G_{t_{n+1}}, we use equations (6) and (7) to obtain X_{t_{n+1}} and θ_{t_{n+1}} similarly. The graphs of G_fut can then be completed in this recursive way, without retraining on G_obs.


Experiments

Datasets: We evaluate on five TKG datasets (Goel et al., 2019); their details are given in the appendix.

Evaluation settings and metrics: For entity prediction, we used mean reciprocal rank (MRR) and Hits@1, Hits@3, Hits@10 as metrics. Hits@n is defined as:

Hits@n = #{fact ∈ test_set | rank(fact) ≤ n} / #test_set

For relation prediction, we used mean rank (MR) and Hits@1 as metrics, since the number of relations is small. The rank of a test triple is obtained by replacing its head/tail/relation with the remaining negative samples, and then evaluating the rank of the original triple's score among all the replacement samples. Mean rank (MR) is the average rank of all test triples, and mean reciprocal rank (MRR) is the average of the reciprocal ranks.
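The MRR and Hits@n metrics described above reduce to simple arithmetic over the 1-indexed ranks of the test facts; a minimal sketch (`mrr_hits` is an illustrative name):

```python
def mrr_hits(ranks, ns=(1, 3, 10)):
    """MRR and Hits@n from the 1-indexed ranks of test facts:
    MRR averages 1/rank; Hits@n is the fraction of ranks <= n."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = {n: sum(r <= n for r in ranks) / len(ranks) for n in ns}
    return mrr, hits
```

For three test facts ranked 1, 2 and 10, MRR is (1 + 0.5 + 0.1) / 3 and Hits@10 is 1.0.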
For our RTFE framework, a timestamp-by-timestamp train-test mode was adopted. The total test result is a weighted average of all timestamp test results, weighted by the number of test facts per timestamp. For example, the final MRR was calculated as:

MRR = Σ_i |G_test^{t_i}| · MRR_{t_i} / Σ_i |G_test^{t_i}|

Baselines: We compared our framework RTFE to state-of-the-art TKGE baselines (e.g., t-TransE, Hyte, ATiSE, and TComplEx). We also used SKGE models such as TransE, RotatE, HAKE (Zhang et al., 2020) and RDGCN as the embedding learner of RTFE to perform TKG completion. Finally, we used TKGE models as the embedding learner of RTFE to show the gain in performance.
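The weighted average over timestamps can be sketched as follows; `overall_mrr` and the input shape are illustrative choices, not the paper's code.

```python
def overall_mrr(per_timestamp):
    """per_timestamp maps t -> (MRR_t, number of test facts at t);
    the total MRR weights each timestamp's MRR by its test-set size."""
    total = sum(n for _, n in per_timestamp.values())
    return sum(m * n for m, n in per_timestamp.values()) / total
```

E.g., MRR 0.5 over 10 facts and 0.25 over 30 facts average to (5 + 7.5) / 40 = 0.3125.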

Entity prediction
Entity prediction: given a quadruple (s, r, o, t), we perform head entity prediction (i.e., predict the plausibility of (?, r, o, t)) and tail entity prediction (i.e., predict the plausibility of (s, r, ?, t)). The plausibility of (s, r, o, t) is ranked among all corrupted quadruples, while all true quadruples are excluded according to TransE's filtering protocol.
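The filtered ranking protocol can be sketched as below; this is a simplified illustration, where `score` is a hypothetical callable (higher means more plausible) standing in for whichever embedding learner is used.

```python
def filtered_rank(score, quad, entities, known_true, position="tail"):
    """Rank of the true quadruple among corrupted candidates, excluding
    other known-true quadruples per the filtering protocol."""
    s, r, o, t = quad
    true_score = score(quad)
    rank = 1
    for e in entities:
        cand = (s, r, e, t) if position == "tail" else (e, r, o, t)
        if cand == quad or cand in known_true:
            continue  # skip the test fact itself and filtered true facts
        if score(cand) > true_score:
            rank += 1
    return rank
```

Head prediction uses the same routine with `position="head"`; the resulting ranks feed directly into MRR and Hits@n.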
The experimental results are shown in Tables 1 and 2, where results marked (*) are taken from the reported results of Hyte and ATiSE.

[Table 1: Entity prediction on continuous fact datasets: YAGO11k and Wikidata12k. Since the relations of these 2 datasets have a typical "one-to-many" nature, the performance of tail prediction is better than that of head prediction.]

Table 1 shows that on continuous fact datasets, both translation-based and graph neural network-based methods can be transplanted to RTFE, and the results are better than Hyte, indicating the generality and superiority of RTFE. For the two TransE-based approaches, RTFE-TransE outperforms Hyte on all metrics (e.g., with an improvement of 15.0% in tail MRR on Wikidata12k) because RTFE takes advantage of the continuity of facts directly.
RotatE and HAKE are the most advanced translation-based approaches, and RTFE-RotatE or RTFE-HAKE outperforms the other methods, which demonstrates that our framework can preserve the excellent results of these methods on SKGs. Besides, the performance of the state-of-the-art TKGE model TComplEx is enhanced by RTFE, which shows the gain from our recursive training pattern. Table 2 shows that on discrete event datasets, RTFE also significantly improves the performance of TKGE models. Besides, on a dense dataset (i.e., with a small number of entities and a large number of facts) like GDELT, RTFE can take advantage of SKGE models such as HAKE.

Relation prediction
Relation prediction: given a quadruple (s, r, o, t), we evaluate the plausibility of (s, ?, o, t). The experimental results are shown in Table 3. RTFE-RDGCN_type outperforms Hyte on YAGO11k, which has only 10 relations (e.g., with an improvement of 12.5% in Hits@1), which implies that type information plays an important role in this task. Since the number of relations of these two datasets is relatively small (10 and 24), the performance improvement after adding semantic information is not obvious (e.g., an improvement of 1.3% in Hits@1).
Hyte performed well on Wikidata12k. This may be attributed to its SKGE training pattern, which helps to capture applicable relation types between two entities from all the facts. In contrast, RTFE is trained timestamp by timestamp, so only the facts of the current timestamp and the information from the last timestamp are directly utilized. To provide RTFE-RDGCN with more training data about relations, we added an additional 30% negative samples, obtained by replacing the relations of quadruples, into the negative sample set: {(s, r′, o) | (s, r, o) ∈ G_t, (s, r′, o) ∉ G}. We call this variant RTFE-RDGCN_rel, which improves the performance of relation prediction on Wikidata12k compared with RTFE-RDGCN (with an improvement of 9.1% in Hits@1).

Extensibility validation
In order to verify the influence of pre-trained static features on RTFE's completion of the entire TKG, we divide the timestamps into four time intervals and perform RTFE's pre-training on each of them respectively. Then the pre-trained static features of each time interval are used as inputs to RTFE to test entity prediction performance at all timestamps.
The experimental results are presented in Figure 3. Although a complete SKG is not provided for pre-training, RTFE still maintains similar performance, which verifies the framework's extensibility for future timestamps. So RTFE can be extended to future timestamps to some extent without retraining on former timestamps, which makes it lightweight and immediate.

Ablation study
In this subsection, we explore the effects of preliminary training and recursive training. w/o pretrain refers to RTFE without preliminary training for static features (i.e., only recursive training). The results are shown in Table 4.


Related work

In recent years, some work (Jiang et al., 2016a; Esteban et al., 2016; Tresp et al., 2017; Trivedi et al., 2017; García-Durán et al., 2018; Jain et al., 2020; Ma et al., 2019; Xu et al., 2019; Jin et al., 2019; Wang and Li, 2019; Tang et al., 2020; Goel et al., 2019; Xu et al., 2020; Lacroix et al., 2020) began to use time information to improve KG completion or to directly complete TKGs. Based on the facts or events they deal with, we review representative TKGE methods as follows.
(1) Event completion: DE (Goel et al., 2019) turned the entity embedding into a function (DEEMB) that takes the time point as a variable. While DE transplanted SKG embedding methods to TKGs, it did not cover recent GNN-based SKG embedding methods. TComplEx (Lacroix et al., 2020) presented an extension of ComplEx (Trouillon et al., 2016) by adding timestamp embeddings to the decomposition of order-4 tensors. ATiSE (Xu et al., 2019) incorporated time information into entity/relation representations by using additive time series decomposition.
(2) Event prediction: (Esteban et al., 2016) trained an event prediction model using background information provided by a KG and recent events. RE-NET (Jin et al., 2019) modeled the event sequence as a temporal joint probability distribution; trained on historical data, it predicts the events of a future timestamp's graph by sampling from the distribution. GHNN (Han et al., 2020) used a Hawkes process to capture the dynamics of evolving graph sequences. Glean (Deng et al., 2020) incorporated both relational and world contexts to capture historical information.
(3) Continuous fact completion: (Jiang et al., 2016a; Jiang et al., 2016b) used the order of relations and temporal consistency constraints to improve completion, but did not make the embedding space directly contain time information. (García-Durán et al., 2018) used an RNN to learn representations of temporal relations, but did not consider that the embeddings of entities should also change over time. Hyte (Dasgupta et al., 2018) represented timestamps as hyperplanes and projected the entities and relations onto these hyperplanes; the facts of all timestamps were then learned jointly using a translation-based score function.

Conclusion
We propose RTFE, a framework for TKG completion. We have transplanted SKGE models to TKGs and enhanced the performance of existing TKGE models. Experiments on five TKG datasets show that RTFE outperforms baselines and is extensible to future timestamps to some extent.
In the future, we will further deal with discrete events. Since events with adjacent timestamps are correlated, we plan to modify RTFE so that it can learn correlations (especially causality) between events. By modeling the spatio-temporal dependency of TKGs, events in future timestamps can be forecast. Besides, we plan to address the task of predicting the time validity of facts (Leblay and Chekol, 2018).