TeMP: Temporal Message Passing for Temporal Knowledge Graph Completion

Inferring missing facts in temporal knowledge graphs (TKGs) is a fundamental and challenging task. Previous works have approached this problem by augmenting methods for static knowledge graphs to leverage time-dependent representations. However, these methods do not explicitly leverage multi-hop structural information and temporal facts from recent time steps to enhance their predictions. Additionally, prior work does not explicitly address the temporal sparsity and variability of entity distributions in TKGs. We propose the Temporal Message Passing (TeMP) framework to address these challenges by combining graph neural networks, temporal dynamics models, data imputation and frequency-based gating techniques. Experiments on standard TKG tasks show that our approach provides substantial gains compared to the previous state of the art, achieving a 10.7% average relative improvement in Hits@10 across three standard benchmarks. Our analysis also reveals important sources of variability both within and across TKG datasets, and we introduce several simple but strong baselines that outperform the prior state of the art in certain settings.


Introduction
The ability to infer missing facts in temporal knowledge graphs is essential for applications such as event prediction (Leblay and Chekol, 2018; De Winter et al., 2018), question answering (Jia et al., 2018), social network analysis (Zhou et al., 2018; Trivedi et al., 2019) and recommendation systems (Kumar et al., 2018).
Whereas static knowledge graphs (KGs) represent facts as triples (e.g., (Obama, visit, China)), temporal knowledge graphs (TKGs) additionally associate each triple with a timestamp (e.g., (Obama, visit, China, 2014)). Figure 1 shows a subgraph of such a TKG. Usually, TKGs are assumed to consist of discrete timestamps (Jiang et al., 2016), meaning that they can be represented as a sequence of static KG snapshots, and the task of inferring missing facts across these snapshots is referred to as temporal knowledge graph completion (TKGC).
Recent works on TKGC have largely focused on developing time-dependent scoring functions, which score the likelihood of missing facts and build closely upon popular representation learning methods for static KGs (Dasgupta et al., 2018; Jiang et al., 2016; Goel et al., 2019; Xu et al., 2019; Lacroix et al., 2020). However, while powerful, these existing methods do not properly account for multi-hop structural information in TKGs, and they lack the ability to explicitly leverage temporal facts in nearby KG snapshots to answer queries. For example, knowing facts like (Obama, make agreement with, China, 2013) or (Obama, visit, China, 2012) is useful for answering the query (Obama, visit, ?, 2014).
Moreover, and perhaps more importantly, there are also serious challenges regarding temporal variability and temporal sparsity, which previous works fail to address. In real-world TKGs, models have access to variable amounts of reference temporal information in nearby KG snapshots when answering different queries (Figure 2 and Figure 6 in the Appendix). For example, in a political event dataset, there are likely to be more quadruples with the subject-relation pair (Obama, visit) than (Trump, visit) from 2008 to 2013. Hence the model can access more reference information to answer where Obama visited in 2014.
The temporal sparsity problem refers to the fact that only a small fraction of entities are active at each time step (Figure 7 in the Appendix). Previous methods usually assign the same embedding to an inactive entity at different time steps, which fails to capture its time-sensitive features.

Present work. To address these issues, we introduce the Temporal Message Passing (TeMP) framework, which combines neural message passing and temporal dynamics models. We then propose frequency-based gating and data imputation techniques to counter the temporal sparsity and variability issues. We achieve state-of-the-art performance on standard TKGC benchmarks. In particular, on the standard ICEWS14, ICEWS05-15, and GDELT datasets, TeMP is able to provide a 7.3% average relative improvement in Hits@10 compared to the next-best model. Fine-grained error analysis on these three datasets demonstrates the unique contributions made by each of the different components of TeMP. Our analysis also highlights important sources of variability, in particular variations in temporal sparsity both within and across TKG datasets, and how the effects of the different components are influenced by such variability.

Related Work
Static KG representation learning. Much research exists on representation learning methods for static KGs, in which entities and relations are represented as low-dimensional embeddings (Nickel et al., 2011; Yang et al., 2014; Trouillon et al., 2016; Nickel et al., 2016). Generally, these methods involve a decoding method, which scores candidate facts based on entity and relation embeddings, and the models are optimized so that valid triples receive higher scores than random negative examples. While these methods typically rely on shallow encoders to generate the embeddings, i.e., single embedding-lookup layers (Hamilton et al., 2017), message passing (or graph neural network; GNN) approaches have also been proposed (Schlichtkrull et al., 2018; Vashishth et al., 2019; Busbridge et al., 2019) to leverage multi-hop information around entities.
Temporal KG representation learning. Recent works endeavor to extend static KGC models to the temporal domain. Typically, such approaches employ embedding methods with a shallow encoder and design time-sensitive quadruple decoding functions (Dasgupta et al., 2018; Jiang et al., 2016; Goel et al., 2019; Xu et al., 2019; Lacroix et al., 2020). While time-specific information is considered by these methods, entity-level temporal patterns such as event periodicity are not explicitly captured.
Another line of work on temporal (knowledge) graph reasoning uses message passing networks to capture intra-graph neighborhood information, which is sometimes combined with temporal recurrence or attention mechanisms (Manessi et al., 2020; Kumar et al., 2018; Pareja et al., 2019; Chen et al., 2018; Jin et al., 2019; Sankar et al., 2020; Hajiramezanali et al., 2019). Orthogonal to our work, Trivedi et al. (2017, 2019) and Han et al. (2020) explore using temporal point processes; however, their focus is on continuous-time TKGC. The prior works that most resemble our framework are Recurrent Event Networks (RE-NET) (Jin et al., 2019) and DySAT (Sankar et al., 2020). RE-NET uses multi-level RNNs to model entity interactions, while DySAT uses self-attention to learn latent node representations on dynamic graphs. However, both of these works were proposed for the task of graph extrapolation (i.e., inferring the next time step in a sequence), so they are not directly compatible with the TKGC setting.

Proposed Approach
We first define our key notation and provide an overview of our TeMP framework, before describing the individual components in detail in the following sections.

Notation and task definition. Our goal is to predict missing facts in a temporal knowledge graph (TKG), represented as a sequence of KG snapshots $\{\mathcal{G}^{(1)}, \dots, \mathcal{G}^{(T)}\}$. Here, $\mathcal{E}$ and $\mathcal{R}$ stand for the unions of the sets of entities and relations across all time steps and are known in advance. $\mathcal{D}^{(t)}$ denotes the set of all observed triples $(s, r, o)$ at time $t$, with subjects $s \in \mathcal{E}$, objects $o \in \mathcal{E}$ and relations $r \in \mathcal{R}$. Let $\bar{\mathcal{D}}^{(t)}$ denote the set of true triples at time $t$, such that $\mathcal{D}^{(t)} \subseteq \bar{\mathcal{D}}^{(t)}, \forall t$. The temporal knowledge graph completion (TKGC) problem is then defined as ranking the object and subject entities for object queries $(s, r, ?, t)$ and subject queries $(?, r, o, t)$, respectively.

Overview of TeMP. Following common practice, we structure our TeMP framework around the notion of an encoder and a decoder. The encoder maps each entity $e_i \in \mathcal{E}$ to a time-dependent low-dimensional embedding $z_{i,t}$ at each time step $t$, while the decoder uses these entity embeddings to score the likelihood of a temporal fact. Figure 3 depicts the architecture of our model. A key insight in TeMP is that we use an encoder that combines structural and temporal entity representations. The structural encoder (SE), based on a multi-relational message passing network, produces the entity representation $x_{i,t} = \mathrm{SE}(e_i, \mathcal{D}^{(t)})$, while the temporal encoder (TE) integrates the outputs of SE at previous time steps to induce $z_{i,t} = \mathrm{TE}(x_{i,t-\tau}, \dots, x_{i,t})$. Here $\tau$ stands for the number of temporal KG snapshots given as input to the model.
In addition, in Section 3.3, we propose a series of augmentations to TeMP that are designed to address the temporal sparsity and variability issues of real-world TKGs. Finally, in Section 3.4, we discuss how TeMP can leverage existing decoders from the static KG setting in order to train a model.

Figure 3: Architecture of the TeMP framework. TeMP combines a structural graph encoder and a temporal encoder to induce entity representations. Given a query $(s, r, ?, t)$ at time $t$, TeMP takes the graphs from time step $t-\tau$ to $t$ as input to compute the structural embedding $x_{s,t}$ and temporal embedding $z_{s,t}$ for the center entity $s$. The final representation $\tilde{z}_{s,t}$ is obtained by further applying frequency-based gating, as illustrated in the upper rectangle. The red dotted arrow at the bottom indicates the imputation process for an inactive entity at time step $t$.

Structural Encoder
The first key component of TeMP is the structural encoder, which generates entity embeddings based on the graph $\mathcal{G}^{(t)}$ within each time step. We build our structural encoder by adapting existing techniques for message passing on static knowledge graphs (Schlichtkrull et al., 2018).
Layer-wise message passing on a snapshot $\mathcal{G}^{(t)}$ is defined as

$$h_i^{(l+1)} = \sigma\Big( \sum_{r \in \mathcal{R}} \sum_{j \in \mathcal{N}_i^r} \frac{1}{|\mathcal{N}_i^r|}\, W_r^{(l)} h_j^{(l)} + W_s^{(l)} h_i^{(l)} \Big), \qquad h_i^{(0)} = W_0\, u_i.$$

Here, $u_i$ denotes a one-hot embedding indicating entity $e_i$, $W_0$ is an entity embedding matrix, and $W_r^{(l)}$ and $W_s^{(l)}$ are transformation matrices specific to each layer of the model. These matrices are shared across all discrete time stamps. We use $\mathcal{N}_i^r$ to denote the set of neighboring entities of $e_i$ connected by relation $r$, whose size acts as a normalizing constant for averaging the neighborhood information. After running $L$ layers of this message-passing approach on a snapshot $\mathcal{G}^{(t)}$, we use $x_{i,t} = h_i^{(L)}$ as the structural embedding of $e_i$ at time $t$. While we focus on RGCN as the structural encoder, our framework is not tied to any specific multi-relational message passing network. One can swap RGCN with any multi-relational graph encoder, e.g., CompGCN (Vashishth et al., 2019) or EdgeGAT (Busbridge et al., 2019).
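To make this concrete, the following is a minimal PyTorch sketch of one such relation-aware message passing layer. It is a simplification rather than the paper's DGL-based implementation: in particular, the per-relation normalization $1/|\mathcal{N}_i^r|$ is approximated here by the overall in-degree of each node.

```python
import torch
import torch.nn as nn

class SimpleRGCNLayer(nn.Module):
    """One layer of relation-specific message passing over a single snapshot."""
    def __init__(self, num_rels, dim):
        super().__init__()
        self.w_rel = nn.Parameter(torch.randn(num_rels, dim, dim) * 0.01)  # W_r^(l)
        self.w_self = nn.Linear(dim, dim, bias=False)                      # W_s^(l)

    def forward(self, h, src, rel, dst):
        # src, rel, dst: edge lists (LongTensors) of the snapshot D^(t)
        msg = torch.bmm(self.w_rel[rel], h[src].unsqueeze(-1)).squeeze(-1)
        agg = torch.zeros_like(h).index_add_(0, dst, msg)
        # simplified normalization: overall in-degree instead of |N_i^r|
        deg = torch.zeros(h.size(0), device=h.device).index_add_(
            0, dst, torch.ones(dst.size(0), device=h.device)).clamp(min=1)
        return torch.relu(agg / deg.unsqueeze(-1) + self.w_self(h))
```

Stacking $L$ such layers and reading out the final hidden states yields the structural embeddings $x_{i,t}$.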

Temporal Encoder
The second key component of TeMP is the temporal encoder, which seeks to integrate information across time into the entity representations. We investigate two approaches to compute the entity representation $z_{i,t}$ leveraging temporal information: a recurrent architecture (inspired by Jin et al. (2019)) and a self-attention approach (inspired by Sankar et al. (2020)).

Temporal recurrence model (TeMP-GRU). We propose to couple a traditional recurrence mechanism with a weight decay, in order to account for the diminishing effect of historical facts. Let $t^-$ denote the last time step at which entity $e_i$ was active before $t$; the down-weighted entity representation $\hat{z}_{i,t^-}$ is defined as follows:

$$\gamma_z = \exp\{-\max(0,\; \lambda_z (t - t^-) + b_z)\} \qquad (1)$$

$$\hat{z}_{i,t^-} = \gamma_z \cdot z_{i,t^-} \qquad (2)$$

where $\gamma_z$ denotes the decay rate, with $\lambda_z$ and $b_z$ as learnable parameters. This design is inspired by Che et al. (2018) and ensures that $\gamma_z$ is monotonically decreasing with respect to the temporal difference and ranges from 0 to 1. We ensure that $\hat{z}_{i,t^-}$ is only nonzero if $t^- \in \{t-\tau, \dots, t-1\}$; otherwise it is assigned a zero vector. Finally, we use a gated recurrent unit (GRU) to obtain the entity embedding $z_{i,t}$ based on $\hat{z}_{i,t^-}$ and the static representation $x_{i,t}$:

$$z_{i,t} = \mathrm{GRU}(x_{i,t},\; \hat{z}_{i,t^-}) \qquad (3)$$

where GRU denotes the standard cell defined by Cho et al. (2014).

Temporal self-attention model (TeMP-SA). Another way to incorporate historical information is to selectively attend to the sequence of active temporal entity representations. We use the following equations, inspired by the transformer architecture (Vaswani et al., 2017), to perform attentive pooling over the entity embeddings $x_{i,j}$ at each time step $j \in \{t-\tau, \dots, t\}$, in order to generate the time-dependent embedding $z_{i,t}$:

$$q_i = x_{i,t} W_q, \qquad k_{i,j} = x_{i,j} W_k, \qquad v_{i,j} = x_{i,j} W_v \qquad (4)$$

$$e_{ij} = \frac{q_i\, k_{i,j}^\top}{\sqrt{d}}\, \gamma_j + M_{ij}, \qquad \gamma_j = \exp\{-\max(0,\; \lambda_z (t - j) + b_z)\} \qquad (5)$$

$$\beta_{ij} = \frac{\exp(e_{ij})}{\sum_{j'=t-\tau}^{t} \exp(e_{ij'})} \qquad (6)$$

$$z_{i,t} = \sum_{j=t-\tau}^{t} \beta_{ij}\, v_{i,j} \qquad (7)$$

where $W_q, W_k, W_v \in \mathbb{R}^{d \times d}$ denote linear projection matrices, as in a transformer layer (Vaswani et al., 2017), $\beta \in \mathbb{R}^{|\mathcal{E}| \times \tau}$ denotes the attention weight matrix obtained by a multiplicative attention function, and $\{\lambda_z, b_z\}$ are the learnable parameters of the down-weighting function. The matrix $M \in \mathbb{R}^{|\mathcal{E}| \times \tau}$ is a mask defined as

$$M_{ij} = \begin{cases} 0 & \text{if } e_i \text{ is active at time step } j, \\ -\infty & \text{otherwise.} \end{cases} \qquad (8)$$

As $M_{ij} \to -\infty$, the attention weight $\beta_{ij} \to 0$, which ensures that only active temporal entity representations are assigned non-zero weights. Finally, note that the full self-attention model can be generalized to use multiple attention heads, as in Vaswani et al. (2017).

Incorporating future information. Note that in the TKGC setting, we assume that the model has access to all the time steps during training. In particular, we assume there is missing data within each time step but that all the (incomplete) snapshots $\mathcal{D}^{(t)}$ are available during training. Thus, in both the attention and recurrence-based approaches, it is worthwhile to integrate temporal information from both the past and the future. We do so by employing a bi-directional GRU in the recurrent approach, and by attending over both past and future time steps in the attention-based approach.
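As an illustration, here is a minimal PyTorch sketch of the TeMP-GRU update in Equations (1)-(3); the tensor layout and parameter initialization are illustrative assumptions rather than the paper's exact implementation.

```python
import torch
import torch.nn as nn

class DecayedGRUEncoder(nn.Module):
    """Down-weight the last active temporal embedding by an exponential
    decay (Eqs. 1-2), then fuse it with the structural embedding via a
    GRU cell (Eq. 3)."""
    def __init__(self, dim):
        super().__init__()
        self.lambda_z = nn.Parameter(torch.tensor(0.1))  # lambda_z
        self.b_z = nn.Parameter(torch.tensor(0.0))       # b_z
        self.gru = nn.GRUCell(input_size=dim, hidden_size=dim)

    def forward(self, x_t, z_last, dt, in_window):
        # gamma_z = exp(-max(0, lambda_z * (t - t^-) + b_z))
        gamma = torch.exp(-torch.clamp(self.lambda_z * dt + self.b_z, min=0.0))
        z_hat = gamma.unsqueeze(-1) * z_last
        # zero vector when t^- falls outside {t - tau, ..., t - 1}
        z_hat = z_hat * in_window.float().unsqueeze(-1)
        return self.gru(x_t, z_hat)  # z_{i,t} = GRU(x_{i,t}, z_hat)
```

Here `x_t` holds the structural embeddings $x_{i,t}$, `z_last` the embeddings $z_{i,t^-}$ from each entity's last active step, `dt` the gaps $t - t^-$, and `in_window` a boolean mask for entities active within the window.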

Tackling Temporal Heterogeneities
Although TeMP jointly models structural and temporal information, the encoder alone is insufficient to deal with the temporal heterogeneity in real-world TKGs, namely the sparsity and variability of entity occurrences. We explore data imputation and frequency-based gating techniques to address these temporal heterogeneities. Because the degrees of temporal heterogeneity vary drastically across datasets (Appendix A.5), our proposed techniques are optional model variations that may improve model performance depending on the dataset characteristics.

Imputation of inactive entities. Recall that the structural encoder only encodes neighboring entities within the same KG snapshot. For an entity $e_i$ that is inactive at time step $t$, the static representation $x_{i,t}$ is hence not informed by any structural neighbors, resulting in stale representations shared across multiple time steps. We propose an imputation (IM) approach that integrates the stale representation with the temporal representation for inactive entities, i.e., $\tilde{x}_{i,t} = \mathrm{IM}(x_{i,t}, z_{i,t^-})$. Without loss of generality, we define the imputation for a uni-directional model and refer the bidirectional case to Appendix A.2. We define IM as a weighted sum, with an exponential decay mechanism similar to that used in Equation (1):

$$\gamma_x = \exp\{-\max(0,\; \lambda_x (t - t^-) + b_x)\} \qquad (9)$$

The imputed representation is defined as follows:

$$\tilde{x}_{i,t} = (1 - \gamma_x)\, x_{i,t} + \gamma_x\, z_{i,t^-} \qquad (10)$$

This model-agnostic approach is applied by replacing $x_{i,t}$ in the temporal models with $\tilde{x}_{i,t}$.

Frequency-based gating. In addition to imputation, we also implement an approach to perform frequency-based gating (FG). The encoded representation of an entity is modulated depending on how many recent temporal facts it participates in.
In particular, we propose to learn a gating term in order to fuse the embeddings $x_{i,t}$ from the output of the structural encoder (Section 3.1) with the temporal embeddings $z_{i,t}$ (Section 3.2) in a frequency-dependent way. We differentiate the weights by the query type (subject or object query) and entity position (whether $e_i$ is the subject or object in the queried fact) in order to contextualize the entities in their role within a quadruple.
In what follows, we use the term pattern to denote a non-empty subset of the quadruple $(s, r, o, t)$ not containing the time $t$. The temporal frequency of a pattern is the number of facts matching that pattern within a defined time window. For example, for the quadruple (Obama, visit, China, 2014), the temporal frequency of the pattern (Obama, visit) is the number of quadruples (Obama, visit, *, t') with $t'$ in the time window (e.g., from 2000 to 2014).
We define the following temporal pattern frequencies (TPFs) associated with the quadruple $(s, r, o, t)$: (1) subject frequency $f^t_s$, (2) object frequency $f^t_o$, (3) relation frequency $f^t_r$, (4) subject-relation frequency $f^t_{s,r}$, and (5) relation-object frequency $f^t_{r,o}$. Without loss of generality, we define our gating mechanism from the perspective of object queries $(s, r, ?, t)$, where the goal is to predict the missing object in a quadruple. The definition for subject queries is analogous and detailed in Appendix A.3.
When answering the object query $(s, r, ?, t)$, the model only has access to the frequencies that do not involve the unknown object, i.e., $f^t_s$, $f^t_r$ and $f^t_{s,r}$. Thus, we use the frequency vector $F_s = [f^t_s, f^t_r, f^t_{s,r}]$ to define a gating term over the embeddings in the query:

$$\tilde{z}_{s,t} = \alpha_{os}\, x_{s,t} + (1 - \alpha_{os})\, z_{s,t} \qquad (11)$$

$$\tilde{z}_{o,t} = \alpha_{oo}\, x_{o,t} + (1 - \alpha_{oo})\, z_{o,t} \qquad (12)$$

where $\alpha_{os} = \mathrm{MLP}_{os}(F_s)$ and $\alpha_{oo} = \mathrm{MLP}_{oo}(F_s)$ are weights in the range $[0, 1]$ learned via two-layer dense neural networks. Here the calculation of the object embedding $\tilde{z}_{o,t}$ covers all candidate entities.
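A minimal PyTorch sketch of this gating step follows; the MLP widths, and the convention that $\alpha$ weights the structural embedding, are assumptions consistent with Equations (11)-(12).

```python
import torch
import torch.nn as nn

class FrequencyGate(nn.Module):
    """Fuse structural (x) and temporal (z) embeddings with weights
    predicted from the temporal pattern frequencies F_s (Eqs. 11-12)."""
    def __init__(self, num_freqs=3, hidden=32):
        super().__init__()
        def gate_mlp():
            return nn.Sequential(nn.Linear(num_freqs, hidden), nn.ReLU(),
                                 nn.Linear(hidden, 1), nn.Sigmoid())
        self.mlp_os = gate_mlp()  # alpha_os: gate for the subject embedding
        self.mlp_oo = gate_mlp()  # alpha_oo: gate for all candidate objects

    def forward(self, F_s, x_s, z_s, x_cand, z_cand):
        # F_s: (batch, num_freqs); x_s, z_s: (batch, dim)
        # x_cand, z_cand: (batch, num_entities, dim)
        a_os = self.mlp_os(F_s)                # in [0, 1]
        a_oo = self.mlp_oo(F_s).unsqueeze(1)
        z_s_tilde = a_os * x_s + (1 - a_os) * z_s
        z_o_tilde = a_oo * x_cand + (1 - a_oo) * z_cand
        return z_s_tilde, z_o_tilde
```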

Decoder and Training
Let $\phi(\cdot)$ denote the score for a quadruple and let DEC denote any proper decoding function for static KGs, e.g., the TransE decoder (Bordes et al., 2013). The score for a quadruple is defined as follows:

$$\phi(s, r, o, t) = \mathrm{DEC}(\tilde{z}_{s,t},\; z_r,\; \tilde{z}_{o,t}) \qquad (13)$$

Here, $\tilde{z}_{s,t}$ and $\tilde{z}_{o,t}$ are the subject and object embeddings (as defined in Sections 3.1-3.3), while $z_r$ is a learned embedding of the relation $r$. To train a model using this score function, the model parameters are learned using gradient-based optimization in mini-batches. For each triple $\eta = (s, r, o) \in \mathcal{D}^{(t)}$, we sample a negative set of entities $\mathcal{D}^-_\eta = \{o' \mid (s, r, o') \notin \mathcal{D}^{(t)}\}$ and define the cross-entropy loss as follows:

$$\mathcal{L}_{obj} = -\sum_{t=1}^{T} \sum_{(s,r,o) \in \mathcal{D}^{(t)}} \log \frac{\exp(\phi(s, r, o, t))}{\exp(\phi(s, r, o, t)) + \sum_{o' \in \mathcal{D}^-_\eta} \exp(\phi(s, r, o', t))} \qquad (14)$$
Note that, without loss of generality, we defined the above loss over object queries (as in Section 3.3); an analogous loss and negative sampling procedure for subject queries is defined in Appendix A.3.
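To illustrate Equations (13)-(14), here is a short PyTorch sketch using a DistMult-style decoder as a stand-in for DEC (the experiments in Appendix A.7 use ComplEx); tensor layouts are assumptions.

```python
import torch
import torch.nn.functional as F

def score(z_s, z_r, z_o):
    """DistMult-style stand-in for DEC in Eq. (13)."""
    return (z_s * z_r * z_o).sum(-1)

def object_query_loss(z_s, z_r, z_o_pos, z_o_neg):
    """Cross-entropy over one positive object and sampled negatives (Eq. 14).
    z_s, z_r, z_o_pos: (batch, dim); z_o_neg: (batch, num_neg, dim)."""
    pos = score(z_s, z_r, z_o_pos).unsqueeze(-1)              # (batch, 1)
    neg = score(z_s.unsqueeze(1), z_r.unsqueeze(1), z_o_neg)  # (batch, num_neg)
    logits = torch.cat([pos, neg], dim=-1)
    # the true object sits at index 0 of each candidate list
    target = torch.zeros(logits.size(0), dtype=torch.long, device=logits.device)
    return F.cross_entropy(logits, target)
```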

Experiments
We evaluate the performance of TeMP models on three standard TKGC benchmark datasets and analyze their strengths and shortcomings when answering queries with different characteristics.

Evaluation Metrics
For each quadruple $(s, r, o, t)$ in the test set, we evaluate two queries: $(s, r, ?, t)$ and $(?, r, o, t)$.
For the object query $(s, r, ?, t)$ we calculate scores for $(s, r, o', t)$ for all $o' \in \mathcal{E}$ using Equation (13); an analogous procedure applies to the subject query. We then calculate the metrics based on the rank of the ground-truth quadruple $(s, r, o, t)$ in each query. Evaluation is performed under the filtered setting defined by Bordes et al. (2013). We report Hits@1, Hits@3, Hits@10 and MRR (mean reciprocal rank); see Appendix A.6 for detailed definitions.

Baseline Methods
We compare TeMP against a broad spectrum of existing approaches, including a novel rule-based baseline, static embedding methods, and existing state-of-the-art approaches for TKGC.

TED model. We propose a rule-based baseline, denoted the temporal exponential decay (TED) model, that directly copies facts from quadruples in the recent past and future. The basic idea of this approach is to predict missing facts by simply copying facts from nearby time steps. The probability of copying each fact depends on (1) the number of elements overlapping with the queried quadruple and (2) the temporal distance to the current time step. For a detailed description of this baseline, please refer to Appendix A.4.

Static KGC methods.
We include TransE (Bordes et al., 2013), DistMult (Yang et al., 2014) and ComplEx (Trouillon et al., 2016) as shallow static embedding baselines, along with a static RGCN (SRGCN) baseline that applies the structural encoder of Section 3.1 to each snapshot independently.

Implementation and Hyperparameters
All the models except TED are implemented in PyTorch, making use of PyTorch Lightning and the Deep Graph Library (Wang et al., 2019). We set the negative sampling ratio to 500, i.e., 500 negative samples per positive triple. Because we corrupt subjects and objects separately, a total of 1000 negative samples are collected to estimate the probability of a factual triple. For full details on all the model hyperparameters for TeMP and the baselines, refer to Appendix A.7.
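For concreteness, a minimal sketch of this negative sampling step might look as follows; uniform sampling and single-pass collision resampling are assumptions, and a full implementation would also filter out other known true triples.

```python
import torch

def sample_negative_objects(obj, num_entities, num_neg=500):
    """Corrupt the object slot of each positive triple with uniformly
    drawn entities (subjects are corrupted analogously, giving 1000
    negatives per positive in total)."""
    neg = torch.randint(0, num_entities, (obj.size(0), num_neg))
    # resample any draw that collides with the true object
    collision = neg.eq(obj.unsqueeze(1))
    neg[collision] = torch.randint(0, num_entities, (int(collision.sum()),))
    return neg
```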

Comparative Study
We compare the baseline models with two instantiations of the TeMP framework: TeMP-GRU and TeMP-SA, corresponding to the GRU and self-attention variants discussed in Section 3.2. Incorporating imputation or frequency-based gating is treated as optional, and we explore different model variants in Section 4.5.2. Results on each dataset are given by the model variant that achieves the best validation set performance. The core experimental results are summarized in Table 1.

TeMP achieves a new state of the art. We find that TeMP-SA and TeMP-GRU achieve state-of-the-art results on all three datasets in terms of Hits@10. Compared to the most recent work (Lacroix et al., 2020), which achieves the best performance to date on the ICEWS datasets, our results are 8.0% and 10.7% higher on the Hits@10 evaluation, though they are slightly worse on Hits@1. Additionally, our model achieves a 3.7% improvement on GDELT compared with DE, the prior state of the art on that dataset. The results of the ATiSE and TNTComplEx methods on the GDELT dataset are not available.

Strong baseline performance. Interestingly, we find that two of our proposed baseline models also achieve surprisingly strong performance, even outperforming the prior state of the art in some settings. For example, our rule-based TED baseline achieves relatively strong performance on all three datasets, in particular on GDELT, where it is better than all existing neural models by all measures. This highlights the power of simply copying temporal facts with the same patterns as the queried quadruples. Similarly, our static RGCN baseline (SRGCN) also achieves very strong performance, with the next-best Hits@10 results behind the TeMP framework. We hypothesize that the message-passing procedure in SRGCN allows the model to leverage multi-hop structural information that is specific to each time step, enabling strong performance.

Exploration of Model Variations
We study the effect of the imputation and frequency-based gating approaches proposed in Section 3.3 by running model variants on the three datasets. We highlight the performance comparison as well as the implications of dataset characteristics for the performance variations.
Our results are reported on the corresponding validation sets of these benchmarks. The results regarding the incorporation of imputation (IM) and frequency-based gating (FG) are shown in Table 2. We use a check mark to indicate that a certain component is used in the experiment, and a blank for the absence of the corresponding component.

ICEWS14. On the ICEWS14 dataset, we find that combining the TeMP-GRU and TeMP-SA models with both imputation and gating achieves the best results on the validation set (3.3% improvement). Additionally, each individual component improves the overall model performance by about 1%.

ICEWS05-15. On ICEWS05-15, models with gating improve performance by more than 1% compared to those without gating. However, the additional incorporation of imputation does not further improve the results.

GDELT. On the GDELT dataset, we find that neither imputation nor gating is significant for model performance. However, it is evident from the dataset characteristics that GDELT does not exhibit the same temporal variability and sparsity as the ICEWS datasets. The discussion in Appendix A.5 shows that all entities are active at every time step in GDELT.

Fine-grained Error Analysis
To assess how models perform on TKGC queries with different temporal pattern frequencies (TPFs; see Section 3.3), we group queried quadruples by their TPFs and calculate the Hits@10 metric within each group. We plot the temporal subject-relation frequency $f^t_{s,r}$ (defined in Section 3.3) against the model performance on subject and object queries to study the replication and reference effects of temporal facts, respectively. Here, we use the term replication effect to denote the situation where the model can make predictions by copying the exact correct answer to a query from temporal facts, for example, copying China from (Biden, visit, China, 2013) to answer the query (Obama, visit, ?, 2014). We use the term reference effect to denote the effect of having facts in the temporal context that are related to the query fact but do not contain the answer entity, for example, selecting China from a set of countries that Obama visited in the year 2013.
We compare the performance of static models (DE and SRGCN) and temporal models (TeMP-GRU) across different TPFs. TeMP-GRU-Vanilla denotes the vanilla version of the model, and TeMP-GRU-Gating refers to the TeMP-GRU model combined with the gating technique. Detailed analyses of TKGC performance versus the other TPFs are given in Appendix A.8.

Replication effect analysis
Here, we examine how the subject-relation TPF correlates with model performance on subject queries. Figure 4 illustrates that temporal models exhibit a positive correlation between subject-relation TPF and subject query performance, while static models show a relatively negative correlation between the two quantities. This suggests that the replication effect is stronger in TeMP, indicating that the TeMP model is better at utilizing temporal information for TKGC queries. Additionally, gating improves over the vanilla version by a slight margin across all subject-relation frequency values. On the other hand, SRGCN achieves better performance on low-TPF queries than the temporal models. However, coupling the TeMP model with gating helps close the gap, sometimes surpassing SRGCN on such queries.

Reference effect analysis. Here, we examine how the occurrence of related facts (not containing the answer) in the temporal context impacts performance. We find that the temporal models exhibit non-linear correlations between object query performance and subject-relation TPF (Figure 5). In particular, on the ICEWS datasets the performance increases as the log-frequency grows from $-\infty$ to 2 and drops at higher frequency values. We hypothesize that it is harder for the temporal models to select the answer from a very large set of object candidates, e.g., choosing China from more than 100 countries that Obama visited from 2008 to 2013. In terms of model comparisons, we find that gating helps TeMP-GRU surpass its vanilla version and SRGCN on most TPF values. The margin of improvement is especially significant on queries with high TPF in ICEWS05-15.
The null effect of frequency-based gating on GDELT can be attributed to the same reason as discussed in Section 4.5.2.

Conclusion
In this work, we present a novel framework named TeMP for temporal knowledge graph completion (TKGC). TeMP computes entity representations by jointly modelling multi-hop structural information and temporal facts from nearby time steps.
Additionally, we introduce novel frequency-based gating and data imputation techniques to address the temporal variability and sparsity problems in TKGC. We show that our model achieves superior performance (10.7% relative improvement) over the state of the art on three benchmark datasets. Our work is potentially beneficial to other tasks, such as temporal information extraction and temporal question answering, by providing beliefs about the likelihood of facts at particular points in time. Future work involves exploring the generalization of TeMP to continuous TKGC and better imputation techniques to induce representations for infrequent and inactive entities.

A Appendix
A.1 Architecture Details

Temporal Edge Dropout. The replication effect illustrated in Figures 4 and 8 shows that TeMP becomes increasingly capable of copying from temporal facts as TPFs increase. We refer to this as "overfitting" to the temporal facts. To alleviate this problem, we propose temporal edge dropout: randomly dropping facts that occur in the time window used to induce the entity representations.
Rong et al. (2019) propose dropping a proportion of edges in the local graph context to combat over-fitting and over-smoothing. We extend this technique to TKGs by either (1) randomly dropping a certain percentage of quadruples in each temporal snapshot or (2) dropping quadruples with different probabilities based on certain quadruple characteristics. Details of the second method are omitted, since we find the two methods work equally well. We use a temporal edge dropout rate of 0.2 in all experiments.

Positional Embedding. We capture time-sensitive information in the TKG by combining the entity representations with positional embeddings. The positional embeddings are denoted $\{p_1, p_2, \dots, p_T\}$, embedding the absolute positional information of each time step. The set of representations for entity $e_i$ at all time steps is $\{p_1 + z_{i,1},\, p_2 + z_{i,2},\, \dots,\, p_T + z_{i,T}\}$, which are used as input entity representations to the decoding function.
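A minimal sketch of method (1), uniform temporal edge dropout over one snapshot's edge list:

```python
import torch

def temporal_edge_dropout(src, rel, dst, p=0.2):
    """Uniformly drop a fraction p of the quadruples in one temporal
    snapshot before running the structural encoder."""
    keep = torch.rand(src.size(0)) >= p
    return src[keep], rel[keep], dst[keep]
```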

A.2 Extended Imputation Formulation
For the bidirectional temporal recurrent model, we define the imputed representation analogously to Equations (9) and (10). We use $t^+$ to denote the very next time step at which entity $e_i$ is active after $t$. The decay rate for imputing from future representations is as follows:

$$\gamma^+_x = \exp\{-\max(0,\; \lambda_x (t^+ - t) + b_x)\}$$

To calculate the imputed representation of $e_i$ at time $t$, we divide both exponential decay rates by two and renormalize:

$$\tilde{x}_{i,t} = (1 - \bar{\gamma})\, x_{i,t} + \bar{\gamma} \left( \frac{\gamma^-_x\, z_{i,t^-} + \gamma^+_x\, z_{i,t^+}}{\gamma^-_x + \gamma^+_x} \right), \qquad \bar{\gamma} = \frac{\gamma^-_x + \gamma^+_x}{2}$$

Intrinsic imputation for TeMP-SA. We use Equations (4)-(7) to derive entity representations for both active and inactive entities, and view this as an intrinsic form of imputation. Hence imputation is marked for all TeMP-SA results in Table 2.
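Under the reconstruction above, a sketch of the bidirectional imputation might look like this; the exact renormalization scheme is an assumption consistent with the surrounding text.

```python
import torch

def impute_bidirectional(x_t, z_past, z_future, dt_past, dt_future, lam, b):
    """Blend a stale structural embedding x_t with the temporal embeddings
    from the entity's last (z_past) and next (z_future) active steps,
    weighted by halved, renormalized exponential decay rates."""
    g_past = torch.exp(-torch.clamp(lam * dt_past + b, min=0.0))
    g_future = torch.exp(-torch.clamp(lam * dt_future + b, min=0.0))
    g_bar = (g_past + g_future) / 2.0                      # averaged decay
    z_mix = (g_past.unsqueeze(-1) * z_past + g_future.unsqueeze(-1) * z_future)
    z_mix = z_mix / (g_past + g_future).unsqueeze(-1)      # renormalized mix
    return (1 - g_bar).unsqueeze(-1) * x_t + g_bar.unsqueeze(-1) * z_mix
```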

A.3 Analogous Definition of Frequency-Based Gating and Training Loss
We define the process for deriving entity representations for subject queries $(?, r, o, t)$ analogously to Equations (11) and (12). The model only has access to the frequencies that do not involve the unknown subject, i.e., $f^t_o$, $f^t_r$ and $f^t_{r,o}$; we use the frequency vector $F_o = [f^t_o, f^t_r, f^t_{r,o}]$ to define a similar gating over the static and temporal entity representations:

$$\tilde{z}_{o,t} = \alpha_{so}\, x_{o,t} + (1 - \alpha_{so})\, z_{o,t}, \qquad \tilde{z}_{s,t} = \alpha_{ss}\, x_{s,t} + (1 - \alpha_{ss})\, z_{s,t}$$

where $\alpha_{so} = \mathrm{MLP}_{so}(F_o)$ and $\alpha_{ss} = \mathrm{MLP}_{ss}(F_o)$. With the negative subject entity set $\mathcal{D}^-_{\eta,s} = \{s' \mid (s', r, o) \notin \mathcal{D}^{(t)}\}$, the training loss for subject queries is defined as follows:

$$\mathcal{L}_{sub} = -\sum_{t=1}^{T} \sum_{(s,r,o) \in \mathcal{D}^{(t)}} \log \frac{\exp(\phi(s, r, o, t))}{\exp(\phi(s, r, o, t)) + \sum_{s' \in \mathcal{D}^-_{\eta,s}} \exp(\phi(s', r, o, t))}$$
The final training loss is the sum of the losses for the two types of queries: $\mathcal{L} = \mathcal{L}_{sub} + \mathcal{L}_{obj}$.

A.4 Detailed TED Formulation and Analysis
TED Model Definition. We hypothesize that quadruples occurring more frequently and in more recent time steps are informative for current-step KGC. For each query, we construct a set of reference entities from the training data. Similar to the down-weighting mechanism of the temporal encoder (Section 3.2), we score each entity with an exponential decay mechanism with respect to the temporal distance to the current time step. We then rank the entities in the reference set according to these scores.
For each queried quadruple $(s, r, o, t)$, we collect reference entity sets consisting of tuples $\{(e, t') \mid t' \neq t\}$, where $e$ is a subject or object entity and $t'$ is the corresponding time of occurrence. The tuples are extracted from the temporal facts sharing at least one element with $(s, r, o, t)$. We divide them into subject and object reference sets for the two types of queries. The subject reference set consists of: (1) subjects with a shared relation-object pair, i.e., $\{(s', t') \mid t' \neq t,\; (s', r, o) \in \mathcal{D}^{(t')}_{train}\}$; (2) subjects with a shared object, i.e., $\{(s', t') \mid t' \neq t,\; \exists r',\; (s', r', o) \in \mathcal{D}^{(t')}_{train}\}$; (3) subjects with a shared relation, i.e., $\{(s', t') \mid t' \neq t,\; \exists o',\; (s', r, o') \in \mathcal{D}^{(t')}_{train}\}$. Symmetrically, the object reference set consists of: (1) objects with a shared subject-relation pair, i.e., $\{(o', t') \mid t' \neq t,\; (s, r, o') \in \mathcal{D}^{(t')}_{train}\}$; (2) objects with a shared subject, i.e., $\{(o', t') \mid t' \neq t,\; \exists r',\; (s, r', o') \in \mathcal{D}^{(t')}_{train}\}$; (3) objects with a shared relation, i.e., $\{(o', t') \mid t' \neq t,\; \exists s',\; (s', r, o') \in \mathcal{D}^{(t')}_{train}\}$. We do not collect triples from the current time step $t$, since the queried facts at time $t$ are assumed to be unobserved. Note that (1) is a subset of both (2) and (3), and that (2) and (3) also contain overlapping tuples. We define the priority to be (1) > (2) > (3), such that if a tuple is present in (1), it is removed from both (2) and (3). This is based on the assumption that entities matching the same subject-relation (or relation-object) pair as the current triple are the most suitable candidates. For example, because of the characteristics of police, the fact (police, arrest, citizen) occurs multiple times across the dataset. Entities with a shared subject but a different relation come second; e.g., (Obama, visit, China, 2013) and (Obama, visit, Russia, 2014) are important information for predicting (Obama, make announcement to, ?, 2015).
Let $S$ be one of the reference sets defined above. The score for an entity $e$ is the sum over all tuples in $S$ containing $e$:

$$\mathrm{score}(e) = \sum_{(e, t') \in S} \exp(-\sigma\, |t - t'|)$$

TED Results and Analysis. Table 3 shows the sensitivity analysis for the parameter $\sigma$ on the validation set. We notice that performance is low when $\sigma$ is either extremely large or extremely small, and peaks at $\sigma = 0.1$ on the ICEWS datasets and $\sigma = 1$ on the GDELT dataset. This suggests a trade-off between recency and frequency heuristics. The TED results also expose the bias toward recurring events in political event datasets, particularly in GDELT. TED should be considered by future work as an important baseline to gauge relative model performance. Additionally, the results suggest the potential of pointer-style TKGC: deciding between copying an entity from historical facts and selecting an entity from the current snapshot to answer a query.
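A short sketch of this scoring rule under the reconstruction above (the exact functional form of the decay is an assumption):

```python
import math
from collections import defaultdict

def ted_scores(reference_set, t, sigma=0.1):
    """Accumulate exp(-sigma * |t - t'|) over occurrences (e, t') in the
    reference set; small sigma favors frequency, large sigma recency."""
    scores = defaultdict(float)
    for e, t_prime in reference_set:
        scores[e] += math.exp(-sigma * abs(t - t_prime))
    return sorted(scores.items(), key=lambda kv: -kv[1])  # ranked candidates
```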

A.5 Dataset Statistics and Characteristics
The dataset statistics are summarized in Table 4. The numbers of entities are 7,128, 10,488 and 500 in ICEWS14, ICEWS05-15 and GDELT respectively, indicating that the temporal sparsity issue is severe in the ICEWS datasets but trivial in GDELT. The temporal variability of the three datasets is demonstrated in Figure 7. The average number of associated temporal facts per entity is much lower in the ICEWS datasets than in GDELT. The difference can be attributed to the fact that the GDELT dataset is constructed by extracting facts among the 500 most frequent entities in the entire dataset, which intrinsically eliminates the sparsity and variability bias present in the original data.

A.6 Definitions for Evaluation Metrics
We use MRR, Hits@1, Hits@3 and Hits@10 to evaluate model performance. MRR is defined as:

$$\mathrm{MRR} = \frac{1}{2|\mathcal{D}_{test}|} \sum_{(s,r,o,t) \in \mathcal{D}_{test}} \left( \frac{1}{\mathrm{rank}(o \mid s, r, t)} + \frac{1}{\mathrm{rank}(s \mid r, o, t)} \right)$$

Hits@k is the percentage of test queries for which the $k$ highest-ranked predictions contain the correct entity:

$$\mathrm{Hits@}k = \frac{1}{2|\mathcal{D}_{test}|} \sum_{(s,r,o,t) \in \mathcal{D}_{test}} \left( \mathbb{I}[\mathrm{rank}(o \mid s, r, t) \le k] + \mathbb{I}[\mathrm{rank}(s \mid r, o, t) \le k] \right), \qquad k = 1, 3, 10,$$

where $\mathbb{I}$ is the indicator function.
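Both metrics reduce to simple aggregations over the filtered ranks of the true entities; a minimal sketch:

```python
def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """Compute MRR and Hits@k from the filtered ranks of the true entity,
    collected over both subject and object queries for every test fact."""
    n = len(ranks)
    mrr = sum(1.0 / r for r in ranks) / n
    hits = {k: sum(r <= k for r in ranks) / n for k in ks}
    return mrr, hits

print(mrr_and_hits([1, 4, 2, 11]))
# -> (0.4602..., {1: 0.25, 3: 0.5, 10: 0.75})
```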

A.7 Detailed Implementation and Hyperparameters
We use the Adam optimizer and set the learning rate to 0.001. The batch size is set to 8 for ICEWS14 and ICEWS05-15, i.e., each batch contains the facts in 8 snapshots. We additionally subsample 3,000 quadruples in each snapshot to avoid out-of-memory issues.
The embedding size and hidden sizes for both the recurrent and self-attentive models are set to 128. We use 8 attention heads in TeMP-SA to model the multi-faceted evolution of the TKG. As required by the reproducibility checklist, the complete hyperparameter settings and run-time information for the TeMP-GRU model on all benchmark datasets are summarized in Table 5. Following the ablation study in Jin et al. (2019), we set the number of relational convolution layers to 2 to encode two-hop neighbors. We apply the temporal edge dropout technique to the TKG: in each training epoch we randomly drop 50% of the quadruples in the current KG and 20% of the quadruples in each temporal reference KG to combat over-fitting and over-smoothing. We experimented with TransE, DistMult and ComplEx on the validation set and found that ComplEx (Trouillon et al., 2016) yields the best performance overall. Hence ComplEx is used as the decoding function to score head or tail entities given queries. The parameter $\tau$ stands for the number of KG snapshots available for answering a query, applied to the temporal models as a budget. Single-direction models take temporal entity embeddings from the past $\tau$ graphs, while bidirectional models attend to $\tau/2$ historical and $\tau/2$ future snapshots. We use early stopping with patience 10 with respect to the average MRR on the validation set. All ablation studies are conducted on the validation set. For the best model variants, we use the model checkpoint that achieves the best MRR score on the validation set to perform the final evaluation on the test set.

A.8 Detailed Analysis of Performances versus TPFs
We studied the correlation between subject-relation TPF and query-answering performance in Section 4.5.3. Here, we first define a complete set of TPFs that covers all possible subsets of a quadruple. In Section 3.3 we defined (1) subject frequency $f^t_s$, (2) object frequency $f^t_o$, (3) relation frequency $f^t_r$, (4) subject-relation frequency $f^t_{s,r}$ and (5) relation-object frequency $f^t_{r,o}$ for a quadruple $(s, r, o, t)$. We additionally define (6) subject-object frequency $f^t_{s,o}$ and (7) triple frequency $f^t_{s,r,o}$. We use the following combinations of TPFs and query types to study the replication and reference effects, respectively. For the replication effect, we compare subject query results against (1), then compare object query results against (2) and (5). Values of (6) and (7) are compared with the results of both subject and object queries. For the reference effect, we compare object query results against (1), and subject query results against (2) and (5). Results are summarized in Figure 8 and Figure 9, respectively.
The general observations are similar to the discussion in Section 4.5.3. In the replication analysis, the TeMP-GRU models show significantly more positive trends than the static models (SRGCN and DE). However, we witness drops in performance when TPFs become large in the reference effect analysis. The performance of the TeMP-GRU-Vanilla model improves with the help of gating on the ICEWS datasets across TPFs. The benefit is less obvious on the GDELT dataset, consistent with the observation that GDELT is less affected by the temporal sparsity and variability problems (Appendix A.5).
We conclude that TeMP models are significantly more advantageous in utilizing temporal facts for the TKGC task. In addition, frequency-based gating improves the overall performance with respect to all the different TPFs.