Recurrent Event Network: Autoregressive Structure Inference over Temporal Knowledge Graphs

Knowledge graph reasoning is a critical task in natural language processing. The task becomes more challenging on temporal knowledge graphs, where each fact is associated with a timestamp. Most existing methods focus on reasoning at past timestamps and are not able to predict facts happening in the future. This paper proposes Recurrent Event Network (RE-NET), a novel autoregressive architecture for predicting future interactions. The occurrence of a fact (event) is modeled as a probability distribution conditioned on temporal sequences of past knowledge graphs. Specifically, our RE-NET employs a recurrent event encoder to encode past facts, and uses a neighborhood aggregator to model the connection of facts at the same timestamp. Future facts can then be inferred in a sequential manner based on the two modules. We evaluate our proposed method via link prediction at future times on five public datasets. Through extensive experiments, we demonstrate the strength of RE-NET, especially on multi-step inference over future timestamps, and achieve state-of-the-art performance on all five datasets.


Introduction
Knowledge graphs (KGs), which store real-world facts, are vital in various natural language processing applications (Bordes et al., 2013; Schlichtkrull et al., 2018; Kazemi et al., 2019). Due to the high cost of annotating facts, most knowledge graphs are far from complete, and thus predicting missing facts (a.k.a., knowledge graph reasoning) becomes an important task. Most existing efforts study reasoning on standard knowledge graphs, where each fact is represented as a triple of subject entity, object entity, and the relation between them. However, in practice, a fact may not be true forever, and hence it is useful to associate each fact with a timestamp as a constraint, yielding a temporal knowledge graph (TKG). Fig. 1 shows example subgraphs of a temporal knowledge graph. Despite the ubiquity of TKGs, methods for reasoning over such data are relatively unexplored.

[Figure 1: Each edge or interaction between entities is associated with temporal information, and a set of interactions builds a multi-relational graph at each time. Our task is to predict interactions and build graphs at future times.]
Given a temporal knowledge graph with timestamps varying from t_0 to t_T, TKG reasoning primarily has two settings: interpolation and extrapolation. In the interpolation setting, new facts are predicted for time t such that t_0 ≤ t ≤ t_T (García-Durán et al., 2018; Leblay and Chekol, 2018; Dasgupta et al., 2018). In contrast, extrapolation reasoning, a less studied setting, focuses on predicting new facts (e.g., unseen events) over timestamps t greater than t_T (i.e., t > t_T). The extrapolation setting is of particular interest in TKG reasoning, as it helps populate the knowledge graph over future timestamps and facilitates forecasting emerging events (Muthiah et al., 2015; Phillips et al., 2017; Korkmaz et al., 2015).
Recent attempts to solve the extrapolation TKG reasoning problem are Know-Evolve (Trivedi et al., 2017) and its extension DyRep (Trivedi et al., 2019), which predict future events assuming ground truths of the preceding events are given at inference time. As a result, these methods are unable to predict events sequentially over future timestamps without ground truths of the preceding events, which is a practical requirement when deploying such reasoning systems for event forecasting (Morstatter et al., 2019). Moreover, these approaches do not model concurrent events occurring within the same time window (e.g., a day, or 12 hours), despite their prevalence in real-world event data (Boschee et al., 2015; Leetaru and Schrodt, 2013). Thus, it is desirable to have a principled method that can extrapolate graph structures over future timestamps by modeling the concurrent events within a time window as a local graph.
To this end, we propose an autoregressive architecture, called Recurrent Event Network (RE-NET), for modeling temporal knowledge graphs. Key ideas of RE-NET are based on: (1) predicting future events over multiple time stamps can be formulated as a sequential and multi-step inference problem; (2) temporally adjacent events may carry related semantics and informative patterns, which can further help predict future events (i.e., temporal information); and (3) multiple events may co-occur within the same time window and exhibit structural dependencies between entities (i.e., local graph structural information).
Given these observations, RE-NET defines the joint probability distribution of all events in a TKG in an autoregressive fashion. The probability distribution of the concurrent events at the current time step is conditioned on all the preceding events (see Fig. 2 for an illustration). Specifically, a recurrent event encoder summarizes information of the past event sequences, and a neighborhood aggregator aggregates the information of concurrent events within the same time window. With the summarized information, our decoder defines the joint probability of a current event. Inference for predicting future events can be achieved by sampling graphs over time in a sequential manner.
We evaluate our proposed method on five public TKG datasets via a temporal (extrapolation) link prediction task, by testing the performance of multi-step inference over time. Experimental results demonstrate that RE-NET outperforms state-of-the-art models of both static and temporal knowledge graph reasoning, showing its better capability to model temporal, multi-relational graph data. We further show that RE-NET can perform effective multi-step inference to predict unseen entity relationships in a distant future.

[Figure 2: Illustration of the Recurrent Event Network architecture. The aggregator encodes the global graph structure and the local neighborhood, capturing global and local information respectively. The recurrent event encoder updates its state with the sequence of encoded representations of graph structures. The MLP decoder defines the probability of a current graph.]

Problem Formulation
We first describe the notations for building our model and the problem definition, and then we define the joint distribution of temporal events.

Notations and Problem Definition. We consider a temporal knowledge graph as a multi-relational, directed graph with time-stamped edges between nodes (entities). An event is defined as a time-stamped edge, i.e., (subject entity, relation, object entity, time), and is denoted by a quadruple (s, r, o, t) or (s_t, r_t, o_t). We denote the set of events at time t as G_t. In our setup, the timestamps are discrete integers used to define the relative order of graphs or events. A TKG is built upon a sequence of event quadruples ordered ascending by their timestamps, where each time-stamped edge has a direction pointing from the subject entity to the object entity. The goal of learning generative models of events is to learn a distribution p(G) over TKGs, based on a set of observed event sets {G_1, ..., G_t}.

Approach Overview. The key idea of our approach is to learn temporal dependency from the sequence of graphs and local structural dependency from the neighborhood (Fig. 2). Formally, we represent TKGs as sequences, and then build an autoregressive generative model on the sequences. To this end, RE-NET defines the joint probability of concurrent events (or a graph), which is conditioned on all the previous events. Specifically, RE-NET consists of a Recurrent Neural Network (RNN) as a recurrent event encoding module and a neighborhood aggregation module to capture the information of graph structures. We first start with the definition of the joint distribution of temporal events.

Modeling Joint Distribution of Temporal Events. We define the joint distribution of all the events G = {G_1, ..., G_T} in an autoregressive manner.
Basically, we decompose the joint distribution into a sequence of conditional distributions, p(G_t | G_{t−m:t−1}), where we assume the probability of the events at a time step, G_t, depends on the events at the previous m steps, G_{t−m:t−1}. For each conditional distribution p(G_t | G_{t−m:t−1}), we further assume that the events in G_t are mutually independent given the previous events G_{t−m:t−1}. In this way, the joint distribution can be rewritten as follows:

p(G) = ∏_t p(G_t | G_{t−m:t−1}) = ∏_t ∏_{(s_t, r_t, o_t) ∈ G_t} p(s_t | G_{t−m:t−1}) · p(r_t | s_t, G_{t−m:t−1}) · p(o_t | s_t, r_t, G_{t−m:t−1}).   (1)

From these probabilities, we generate triplets as follows. Given all the past events G_{t−m:t−1}, we first sample a subject entity s_t through p(s_t | G_{t−m:t−1}). Then we generate a relation r_t with p(r_t | s_t, G_{t−m:t−1}), and finally the object entity o_t is generated by p(o_t | s_t, r_t, G_{t−m:t−1}). (We could equivalently sample an object entity first in this process; details are omitted for brevity.) Next, we introduce how these probabilities are defined and parameterized in our method.
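As a concrete illustration, the subject-then-relation-then-object sampling order can be sketched as follows. The toy probability tables and entity names here are hypothetical stand-ins for the learned conditionals, not part of RE-NET itself:

```python
import random

def sample_event(p_s, p_r_given_s, p_o_given_sr, rng):
    """Draw one (s, r, o) triple by sampling each factor in turn:
    s ~ p(s | history), r ~ p(r | s, history), o ~ p(o | s, r, history)."""
    s = rng.choices(list(p_s), weights=p_s.values())[0]
    r_dist = p_r_given_s[s]
    r = rng.choices(list(r_dist), weights=r_dist.values())[0]
    o_dist = p_o_given_sr[(s, r)]
    o = rng.choices(list(o_dist), weights=o_dist.values())[0]
    return s, r, o

# Made-up conditional tables standing in for the trained model.
p_s = {"A": 0.7, "B": 0.3}
p_r_given_s = {"A": {"visits": 1.0}, "B": {"sanctions": 1.0}}
p_o_given_sr = {("A", "visits"): {"B": 1.0}, ("B", "sanctions"): {"A": 1.0}}

rng = random.Random(0)
event = sample_event(p_s, p_r_given_s, p_o_given_sr, rng)
```

In the full model each table is replaced by the parameterized conditionals introduced in the next section, conditioned on the encoded history.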

Recurrent Event Network
In this section, we introduce our proposed method, Recurrent Event Network (RE-NET). RE-NET consists of a Recurrent Neural Network (RNN) as a recurrent event encoder (Sec. 3.1) for temporal dependency and a neighborhood aggregator (Sec. 3.2) for graph structural dependency. We also discuss parameter learning of RE-NET and define multi-step inference for the distant future by sampling intermediate graphs in a sequential manner (Sec. 3.3).

Recurrent Event Encoder
To parameterize the probability of each event, RE-NET introduces a set of global representations as well as local representations. The global representation H_t summarizes the global information from the entire graph up to time stamp t, which reflects the global preference on the upcoming events. In contrast, the local representations focus more on each subject entity s or each pair of subject entity and relation (s, r), which capture the knowledge specifically related to those entities and relations. We denote these local representations as h_t(s) and h_t(s, r), respectively. The global and local representations capture different aspects of knowledge from the knowledge graph, which are naturally complementary, allowing us to model the generative process of the graph in a more effective way.
Based on the above representations, RE-NET parameterizes p(o_t | s, r, G_{t−m:t−1}) in the following way:

p(o_t | s, r, G_{t−m:t−1}) ∝ exp([e_s : e_r : h_{t−1}(s, r)]^T · w_{o_t}),   (2)

where e_s, e_r ∈ R^d are learnable embedding vectors specified for subject entity s and relation r, and h_{t−1}(s, r) ∈ R^d is the local representation for (s, r) obtained at time stamp (t−1). Intuitively, e_s and e_r can be understood as static embedding vectors for subject entity s and relation r, whereas h_{t−1}(s, r) is dynamically updated at each time stamp. By concatenating both the static and dynamic representations, RE-NET can effectively capture the semantics of (s, r) up to time stamp (t−1). Based on that, we further compute the probability of different object entities o_t by passing the encoding into our multi-layer perceptron (MLP) decoder. We define the MLP decoder as a linear softmax classifier parameterized by {w_{o_t}}. Similarly, we define probabilities for relations and subjects as follows:

p(r_t | s, G_{t−m:t−1}) ∝ exp([e_s : h_{t−1}(s)]^T · w_{r_t}),   (3)
p(s_t | G_{t−m:t−1}) ∝ exp(H_{t−1}^T · w_{s_t}),   (4)

where h_{t−1}(s) focuses on the local information about s in the past, and H_{t−1} ∈ R^d is a vector representation encoding the global graph structures G_{t−m:t−1}. To predict which relations a subject entity will interact with, i.e., p(r_t | s, G_{t−m:t−1}), we treat the static representation e_s as well as the dynamic representation h_{t−1}(s) as features, and feed them into an MLP decoder parameterized by w_{r_t}. Besides, to predict the distribution of subject entities at time stamp t (i.e., p(s_t | G_{t−m:t−1})), we treat the global representation H_{t−1} as a feature, as it summarizes the global information from all the past graphs until time stamp t − 1, which reflects the global preference on the upcoming events at time stamp t.

The global representation H_t is expected to preserve the global information about all the graphs up to time stamp t. The local representations h_t(s, r) and h_t(s) emphasize more the local events related to each entity and relation.
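A minimal sketch of this linear-softmax decoder, following the form p(o | s, r, ·) ∝ exp([e_s : e_r : h_{t−1}(s, r)]^T w_o). All weights and embeddings below are random stand-ins, not trained parameters:

```python
import numpy as np

d, num_objects = 4, 5
rng = np.random.default_rng(0)
e_s = rng.normal(size=d)             # static subject embedding
e_r = rng.normal(size=d)             # static relation embedding
h_sr = rng.normal(size=d)            # dynamic state h_{t-1}(s, r)
W_o = rng.normal(size=(num_objects, 3 * d))  # one row w_o per candidate object

# Concatenate static and dynamic features, then apply a linear softmax.
logits = W_o @ np.concatenate([e_s, e_r, h_sr])
p_o = np.exp(logits - logits.max())
p_o /= p_o.sum()                     # distribution over all candidate objects
```

The relation and subject decoders (equations (3) and (4)) follow the same pattern with [e_s : h_{t−1}(s)] and H_{t−1} as features, respectively.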
Thus we define them as follows:

H_t = RNN^1(g(G_t), H_{t−1}),
h_t(s) = RNN^2(g(N_t^{(s)}), H_t, h_{t−1}(s)),
h_t(s, r) = RNN^3(g(N_t^{(s)}), H_t, h_{t−1}(s, r)),

where g is an aggregate function, which will be discussed in Section 3.2, and N_t^{(s)} stands for all the events related to s at the current time step t. We leverage a recurrent model, RNN (Cho et al., 2014), to update the representations. The global representation takes the global graph structure g(G_t) as an input, where g(G_t) is an aggregation over all the events G_t at time t. We define g(G_t) = max({g(N_t^{(s)})}_s), an element-wise max-pooling operation over all g(N_t^{(s)}), each of which captures the local graph structure for subject entity s. The local representations differ from the global representation in two ways. First, the local representations focus more on each entity and relation, and hence we aggregate information from the events N_t^{(s)} that are related to the entity. Second, to allow RE-NET to better characterize the relationships between different entities, we treat the global representation H_t as an extra feature in the definition, which acts as a bridge to connect different entities.
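The element-wise max-pooling step g(G_t) = max({g(N_t^{(s)})}_s) can be illustrated directly; the per-subject vectors below are random stand-ins for the aggregator outputs:

```python
import numpy as np

rng = np.random.default_rng(1)
per_subject = rng.normal(size=(6, 4))   # g(N_t^{(s)}) for 6 subjects, dimension 4

# Element-wise max over the subject axis gives the global summary g(G_t).
g_Gt = per_subject.max(axis=0)
```

Each coordinate of g_Gt is the largest value that any subject's local summary takes in that dimension, so the global vector dominates every local one element-wise.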
In the next section, we introduce how we design g in RE-NET.

Neighborhood Aggregators
In this section, we first introduce two simple aggregation functions: a mean pooling aggregator and an attentive pooling aggregator. These two simple aggregators only collect neighboring entities under the same relation r. Then we introduce a more powerful aggregation function: a multi-relational aggregator. We compare the aggregators in Fig. 3.

Mean Pooling Aggregator. The mean pooling aggregator simply averages the embeddings of the neighboring object entities, i.e., g(N_t^{(s,r)}) = (1/|N_t^{(s,r)}|) Σ_{o ∈ N_t^{(s,r)}} e_o. It treats all neighboring entities equally, and thus ignores the different importance of each neighbor entity.

Attentive Pooling Aggregator. We define an attentive aggregator based on the additive attention introduced in (Bahdanau et al., 2015) to distinguish the important entities for (s, r). The aggregate function is defined as g(N_t^{(s,r)}) = Σ_{o ∈ N_t^{(s,r)}} α_o e_o, where α_o = softmax(v^T tanh(W(e_s : e_r : e_o))). v ∈ R^d and W ∈ R^{d×3d} are trainable weight matrices. By attending over the subject and the relation, the weights can determine how relevant each object entity is to the subject and the relation.
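A sketch of the attentive pooling aggregator under these definitions; v, W, and all embeddings are random stand-ins for trained parameters:

```python
import numpy as np

d, n_neighbors = 4, 3
rng = np.random.default_rng(0)
e_s, e_r = rng.normal(size=d), rng.normal(size=d)
E_o = rng.normal(size=(n_neighbors, d))   # neighbor object embeddings
W = rng.normal(size=(d, 3 * d))
v = rng.normal(size=d)

# alpha_o = softmax(v^T tanh(W [e_s : e_r : e_o])) over the neighbors.
scores = np.array([v @ np.tanh(W @ np.concatenate([e_s, e_r, e_o]))
                   for e_o in E_o])
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()                      # attention weights sum to 1

g = alpha @ E_o                           # weighted sum of neighbor embeddings
```

Replacing alpha with uniform weights 1/|N| recovers the mean pooling aggregator.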

Multi-Relational Graph (RGCN) Aggregator.
We introduce a multi-relational graph aggregator from (Schlichtkrull et al., 2018). This is a general aggregator that can incorporate information from multi-relational and multi-hop neighbors. Formally, the aggregator is defined as follows:

h_s^{(l+1)} = σ( Σ_{r ∈ R} Σ_{o ∈ N_t^{(s,r)}} (1/c_s) W_r^{(l)} h_o^{(l)} + W_0^{(l)} h_s^{(l)} ),

where the initial hidden representation for each node (h_o^{(0)}) is set to its trainable embedding vector (e_o) and c_s is a normalizing factor. Details are described in Section B of the appendix.

Parameter Learning and Inference
In this section, we discuss how RE-NET is trained and how it infers events over multiple time stamps.

Parameter Learning via Event Predictions. An object entity prediction given (s, r) can be viewed as a multi-class classification task, where each class corresponds to one object entity. Similarly, relation prediction given s and subject entity prediction can be considered as multi-class classification tasks. Here we omit the notations for preceding events for brevity. Thus, the loss function is as follows:

L = −Σ_{(s,r,o,t) ∈ G} ( log p(o_t | s, r) + λ_1 log p(r_t | s) + λ_2 log p(s_t) ),

where G is the set of events, and λ_1 and λ_2 are importance parameters that control the importance of each loss term. λ_1 and λ_2 can be chosen depending on the task; if a task aims to predict o given (s, r), then we can give small values to λ_1 and λ_2.

Multi-step Inference over Time. RE-NET seeks to predict the forthcoming events based on the previous observations. Suppose that the current time is t and we aim to predict events at time t + Δt where Δt > 0. Then the problem of multi-step inference can be formalized as inferring the conditional probability p(G_{t+Δt} | G_{:t}). The problem is nontrivial as we need to integrate over all G_{t+1:t+Δt−1}. To achieve efficient inference, we draw a sample of G_{t+1:t+Δt−1}, and estimate the conditional probability as follows:

p(G_{t+Δt} | G_{:t}) ≈ p(G_{t+Δt} | Ĝ_{t+1:t+Δt−1}, G_{:t}).

Intuitively, one starts with computing p(G_{t+1} | G_{:t}) and drawing a sample Ĝ_{t+1} from this conditional distribution. With this sample, one can further compute p(G_{t+2} | Ĝ_{t+1}, G_{:t}). By iteratively computing the conditional distribution for G_t and drawing a sample from it, one can eventually estimate p(G_{t+Δt} | G_{:t}) as p(G_{t+Δt} | Ĝ_{t+1:t+Δt−1}, G_{:t}). Although we could improve the estimation by drawing multiple graph samples at each step, RE-NET already performs very well with a single sample, and thus we only draw one sample graph at each step for better efficiency. Based on the estimation of the conditional distribution, we can further predict events that are likely to form in the future. We summarize the detailed inference procedure in Algorithm 1: we first sample M subjects (line 3), pick the top-k triples (line 4), and then build a graph at time t from them (line 5). The time complexity of the algorithm is described in Section C of the appendix.

[Algorithm 1: Inference algorithm of RE-NET. Input: observed graph sequence {G_1, ..., G_t}; number of events to sample at each step, M. Output: an estimation of the conditional distribution p(G_{t+Δt} | G_{:t}).]
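The sampling-based multi-step rollout can be sketched as a simple loop; `sample_graph` below is a hypothetical stand-in for RE-NET's per-step sampler (sample M subjects, keep the top-k triples), and the dummy graphs just record history length:

```python
def multi_step_infer(history, delta_t, sample_graph):
    """Roll the model forward delta_t steps, feeding each sampled graph
    back into the history so later steps condition on samples,
    not on ground truth."""
    history = list(history)
    for _ in range(delta_t):
        g_hat = sample_graph(history)   # one sampled graph per step
        history.append(g_hat)
    return history[-1]                  # estimate of G_{t+delta_t}

# Dummy sampler: each "graph" records how long the history was when sampled.
last = multi_step_infer([{"len": 0}], 3, lambda h: {"len": len(h)})
```

The key property shown is that step t + 2 sees the sampled Ĝ_{t+1}, never a ground-truth test graph, which is exactly what separates RE-NET's evaluation protocol from prior work.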

Experiments
Evaluating the quality of generated graphs is nontrivial, especially for knowledge graphs (Theis et al., 2015). In our experiments, we evaluate the proposed method on an extrapolation link prediction task on TKGs. The task of predicting future links aims to predict unseen relationships with object entities given (s, r, ?, t) (or subject entities given (?, r, o, t)) at a future time t, based on the past observed events in the TKG. Essentially, the task is a ranking problem over all the events (s, r, ?, t) (or (?, r, o, t)). RE-NET approaches this problem by computing the probability of each event in a distant future with the inference procedure in Algorithm 1, and then ranking all the events according to their probabilities. Note that at inference we are only given the training set as ground truth, and we do not use any ground truth from the test set for next-time-step predictions when performing multi-step inference. This is the main difference from previous work, which uses preceding ground truth from the test set.
We evaluate our proposed method with three sets of experiments: (1) predicting future events on three event-based datasets; (2) predicting future facts on two knowledge graphs that include facts with time spans; and (3) an ablation study of our proposed method. Section 4.1 summarizes the datasets. In all these experiments, we perform predictions on time stamps that are not observed during training.

Experimental Setup
We compare the performance of our model against various traditional models for knowledge graphs, as well as some recent temporal reasoning models on five public datasets.

Datasets.
We use five TKG datasets in our experiments: 1) three event-based TKGs (ICEWS18, ICEWS14, and GDELT) and 2) two knowledge graphs with time-span facts (WIKI and YAGO); details are given in Section D of the appendix.

Evaluation Setting and Metrics. For each dataset except ICEWS14 (for which we use the splits provided in (Trivedi et al., 2017)), we split it into three subsets, i.e., train (80%) / valid (10%) / test (10%), by time stamps. Thus, (time stamps of train) < (time stamps of valid) < (time stamps of test). We report a filtered version of Mean Reciprocal Rank (MRR) and Hits@3/10. Similar to the filtered setting defined in (Bordes et al., 2013), during evaluation we remove all the valid triplets that appear in the train, valid, or test sets from the list of corrupted triplets.
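The temporal split can be sketched as follows. This is a simplified version that cuts the sorted event list at fixed fractions; the actual split is made at timestamp boundaries so no time step straddles two subsets:

```python
def temporal_split(events, train_frac=0.8, valid_frac=0.1):
    """Split (s, r, o, t) quadruples by time rather than at random,
    so every training timestamp precedes every valid/test timestamp."""
    events = sorted(events, key=lambda e: e[3])   # sort by timestamp
    n = len(events)
    n_train = int(n * train_frac)
    n_valid = int(n * valid_frac)
    return (events[:n_train],
            events[n_train:n_train + n_valid],
            events[n_train + n_valid:])

quads = [("s", "r", "o", t) for t in range(10)]
train, valid, test = temporal_split(quads)
```

A random split would leak future information into training; the time-based cut is what makes the task extrapolation rather than interpolation.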
Baselines. We compare our approach to baselines for static graphs and temporal graphs as follows. (1) Static Methods. By ignoring the edge time stamps, we construct a static, cumulative graph from all the training events, and apply multi-relational graph representation methods (such as DistMult, ConvE, and RGCN; see Section D of the appendix). (2) Temporal Reasoning Methods. We also compare against state-of-the-art temporal reasoning methods for knowledge graphs, including TA-DistMult, HyTE, Know-Evolve, and dynamic methods on homogeneous graphs. The latter were proposed to predict interactions at a future time on homogeneous graphs, so we modified them to apply to multi-relational graphs.
(3) Variants of RE-NET. To evaluate the importance of different components of RE-NET, we vary our model in different ways: RE-NET w/o multi-step, which does not update the history during inference; RE-NET without the aggregator (RE-NET w/o agg.), which takes a zero vector instead of the aggregator output; RE-NET with a mean aggregator (RE-NET w. mean agg.); and RE-NET with an attentive aggregator (RE-NET w. attn agg.). RE-NET w. GT denotes RE-NET with ground truth history.
Please refer to Section D of appendix for detailed experimental settings.

Performance Comparison on TKGs.
We compare our proposed method with the baselines. The reported test results are metrics averaged over 5 runs on the entire test set of each dataset.
Results on Event-based TKGs. Table 1 summarizes results on all datasets. Our proposed RE-NET outperforms all other baselines on ICEWS18 and GDELT. Static methods underperform compared to our method since they do not consider temporal factors. Also, RE-NET outperforms all other temporal methods, including TA-DistMult, HyTE, and the dynamic methods on homogeneous graphs. Know-Evolve+MLP significantly improves over Know-Evolve, which shows the effectiveness of our MLP decoder. However, there is still a large gap to our model, which also indicates the effectiveness of our recurrent event encoder. R-GCRN+MLP has a structure similar to ours in that it has a recurrent encoder and an RGCN aggregator, but it lacks multi-step inference, global information, and our sophisticated modeling of the recurrent encoder; thus, it underperforms compared to our method. More importantly, none of the prior temporal methods are capable of multi-step inference, while RE-NET can sequentially infer multi-step events (details in Section 4.3).
Results on Public KGs. Previous results have demonstrated the effectiveness of RE-NET on event-based KGs. In Table 1 we compare RE-NET with the other baselines on the public KGs WIKI and YAGO. Our proposed RE-NET outperforms all other baselines on these datasets. Here, the baselines show better results than on the event-based TKGs, which is due to the characteristics of these datasets: they contain facts that are valid within a time span. Nevertheless, our proposed method consistently outperforms the static and temporal methods, which implies that RE-NET effectively infers new events using a powerful event encoder and an aggregator, and provides accurate prediction results.
Performance of Prediction over Time. Next, we further study the performance of RE-NET over time. Fig. 4 shows the performance comparisons over different time stamps on the ICEWS18, GDELT, WIKI, and YAGO datasets with the filtered Hits@3 metric. RE-NET consistently outperforms the baseline methods for all time stamps. The performance of each method fluctuates since the testing entities differ at each time step. We notice that with increasing time steps, the difference between RE-NET and ConvE gets smaller, as shown in Fig. 4. This is expected since events further in the future are harder to predict. To estimate the joint probability distribution of events in a distant future, RE-NET needs to generate a long graph sequence, and the quality of the generated graphs deteriorates as the sequence grows.

Ablation Study
In this section, we study the effect of variations of RE-NET on the ICEWS18 dataset. We present the results in Tables 1 and 2.

Different Aggregators. In Table 2, we observe that RE-NET w/o agg. hurts model quality, suggesting that introducing an aggregator makes the model capable of dealing with concurrent events and improves performance. Table 1 and Fig. 5a show the performance of RE-NET with different aggregators. Among them, the RGCN aggregator outperforms the others; it has the advantage of exploring multi-relational neighbors. Also, RE-NET with an attentive aggregator shows better performance than RE-NET with a mean aggregator, which implies that giving different attention weights to each neighbor helps prediction.
Multi-step Inference. In Table 2, we observe that RE-NET outperforms RE-NET w/o multi-step. The latter does not update the history during inference; it keeps its last history from the training set and is therefore not affected by time stamps. Without multi-step inference, the performance of RE-NET decreases, as shown. Furthermore, RE-NET w. GT shows a significant improvement, as expected, since it uses the ground truth triples at the previous time step, which is not allowed in our setup.
Empirical Probabilities. Here, we study the role of p(s_t | G_{t−m:t−1}) and p(r_t | s, G_{t−m:t−1}), denoted p(s) and p(r) for brevity; p(s_t, r_t | G_{t−m:t−1}) (or simply p(s, r)) is equivalent to p(s)p(r). In Fig. 5b, emp. p(s) (or p_e(s)) denotes a model with the empirical p(s), defined as p_e(s) = (# of s-related triples) / (total # of triples). emp. p(s, r) (or p_e(s, r)) denotes a model with p_e(s) and p_e(r), where p_e(r) = (# of r-related triples) / (total # of triples), so that p_e(s, r) = p_e(s)p_e(r).
Note that RE-NET uses a trained p(s) and p(r). The results show that the trained p(s) and p(r) help RE-NET with multi-step predictions: p_e(s) underperforms RE-NET, and p_e(s, r) = p_e(s)p_e(r) shows the worst performance, suggesting that training each part of the probability in equation (1) improves performance.
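Computing these empirical baselines is just frequency counting; a minimal sketch over a few made-up triples:

```python
from collections import Counter

def empirical_probs(triples):
    """Return p_e(s) and p_e(r): relative frequencies of subjects and
    relations over the observed triples."""
    n = len(triples)
    s_counts = Counter(s for s, r, o in triples)
    r_counts = Counter(r for s, r, o in triples)
    p_s = {s: c / n for s, c in s_counts.items()}
    p_r = {r: c / n for r, c in r_counts.items()}
    return p_s, p_r

p_s, p_r = empirical_probs([("A", "visits", "B"),
                            ("A", "meets", "C"),
                            ("B", "visits", "A")])
```

These frequency tables replace the trained p(s) and p(r) in the ablated models, which is why they ignore all temporal and structural context.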

Related Work
Temporal KG Reasoning. There have been some recent attempts to incorporate temporal information when modeling dynamic knowledge graphs, broadly categorized into two settings: interpolation and extrapolation. Among dynamic methods on homogeneous graphs, some predict by reconstructing an adjacency matrix with an autoencoder (Goyal et al., 2018, 2019). These methods predict on single-relational graphs and are designed to predict future edges one step ahead (i.e., for t + 1). In contrast, our work focuses on multi-relational knowledge graphs and aims for multi-step prediction.
Deep Autoregressive Models. Deep autoregressive models define joint probability distributions as a product of conditionals. DeepGMG (Li et al., 2018) and GraphRNN (You et al., 2018) are deep generative models of graphs that focus on generating static homogeneous graphs with a single type of edge. In contrast to these studies, our work focuses on generating heterogeneous graphs, in which multiple types of edges exist, making our problem more challenging. To the best of our knowledge, this is the first paper to formulate the structure inference (prediction) problem for temporal, multi-relational (knowledge) graphs in an autoregressive fashion.

Conclusion
To tackle the extrapolation problem, we proposed Recurrent Event Network (RE-NET) to model temporal, multi-relational, and concurrent interactions between entities. RE-NET defines the joint probability of all events and is thus capable of inferring graphs in a sequential manner. Our experiments show that RE-NET outperforms all the static and temporal methods, and our extensive analysis demonstrates its strength. Interesting future work includes developing a fast and efficient version of RE-NET, as well as modeling lasting events and performing inference over long-lasting graph structures.
A Details of Recurrent Event Encoder

We use Gated Recurrent Units (Cho et al., 2014) as the RNN:

z_t = σ(W_z x_t + U_z h_{t−1}),
r_t = σ(W_r x_t + U_r h_{t−1}),
h̃_t = tanh(W_h x_t + U_h (r_t ∘ h_{t−1})),
h_t = (1 − z_t) ∘ h_{t−1} + z_t ∘ h̃_t,

where : denotes concatenation, σ(·) is the sigmoid activation function, and ∘ is the Hadamard (element-wise) product. For h_t(s, r), the input x_t is a concatenation of four vectors: the subject embedding, the relation embedding, the aggregation of neighborhood representations, and the global information vector, i.e., (e_s : e_r : g(N_t(s)) : H_t). h_t(s) and H_t are defined similarly: for h_t(s), the input is the concatenation (e_s : g(N_t(s)) : H_t) of the subject embedding, the aggregation of neighborhood representations, and the global information vector; for H_t, the input is the aggregation of the whole graph representation, g(G_t).
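The GRU update above can be sketched directly; all weight matrices are random stand-ins, and the input x would be the concatenation (e_s : e_r : g(N_t(s)) : H_t):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One standard GRU update: gates, candidate state, interpolation."""
    z = sigmoid(Wz @ x + Uz @ h)               # update gate
    r = sigmoid(Wr @ x + Ur @ h)               # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))   # candidate state
    return (1 - z) * h + z * h_tilde           # element-wise interpolation

d_in, d_h = 8, 4
rng = np.random.default_rng(0)
Ws = [rng.normal(size=(d_h, d_in)) for _ in range(3)]
Us = [rng.normal(size=(d_h, d_h)) for _ in range(3)]
h_new = gru_step(rng.normal(size=d_in), np.zeros(d_h),
                 Ws[0], Us[0], Ws[1], Us[1], Ws[2], Us[2])
```

In practice one would use a framework GRU cell; the sketch only makes the gate arithmetic explicit.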

B Details of RGCN Aggregator
The RGCN aggregator is defined as follows:

g(N_t^{(s)}) = h_s^{(l+1)} = σ( Σ_{r ∈ R} Σ_{o ∈ N_t^{(s,r)}} (1/c_s) W_r^{(l)} h_o^{(l)} + W_0^{(l)} h_s^{(l)} ),   (10)

where the initial hidden representation for each node (h_o^{(0)}) is set to its trainable embedding vector (e_o) and c_s is a normalizing factor. Basically, each relation derives a local graph structure between entities, which yields a message on each entity by aggregating information from its neighbors, i.e., Σ_{o ∈ N_t^{(s,r)}} W_r^{(l)} h_o^{(l)}. The overall message on each entity is then computed by aggregating all the relation-specific messages, i.e., Σ_{r ∈ R} Σ_{o ∈ N_t^{(s,r)}} W_r^{(l)} h_o^{(l)}. Finally, the aggregator g(N_t^{(s)}) is defined by combining the overall message with the entity's own representation from the previous layer, as in equation (10). To distinguish weights between different relations, we adopt independent weight matrices {W_r^{(l)}} for each relation r. Furthermore, the aggregator collects representations of multi-hop neighbors by stacking multiple layers of the neural network, each indexed by l. The number of layers determines the depth to which a node aggregates information from its local neighborhood.
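One layer of this aggregation can be sketched as follows; the weights and node states are random stand-ins, the neighbor lists are hypothetical, and tanh stands in for the nonlinearity σ:

```python
import numpy as np

def rgcn_layer(h, neighbors, W_rel, W_self, s):
    """One RGCN step for subject s.
    h: node states; neighbors: {relation id: [neighbor node ids]}."""
    c_s = max(sum(len(objs) for objs in neighbors.values()), 1)  # normalizer
    # Sum relation-specific messages W_r h_o over all neighbors.
    msg = sum(W_rel[r] @ h[o] for r, objs in neighbors.items() for o in objs)
    # Normalize, add the self-loop term, apply the nonlinearity.
    return np.tanh(msg / c_s + W_self @ h[s])

d = 4
rng = np.random.default_rng(0)
h = rng.normal(size=(5, d))                       # states for 5 nodes
W_rel = {0: rng.normal(size=(d, d)), 1: rng.normal(size=(d, d))}
W_self = rng.normal(size=(d, d))
h_s_new = rgcn_layer(h, {0: [1, 2], 1: [3]}, W_rel, W_self, s=0)
```

Stacking this layer l times lets the subject's representation absorb information from its l-hop neighborhood.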
The major issue with this aggregator is that the number of parameters grows rapidly with the number of relations. In practice, this can easily lead to overfitting on rare relations and to models of very large size. Thus, we adopt the block-diagonal decomposition (Schlichtkrull et al., 2018), where each relation-specific weight matrix is decomposed into a block-diagonal matrix composed of low-dimensional matrices: W_r^{(l)} in equation (10) is defined as the block-diagonal matrix diag(A_1, ..., A_B), where B is the number of basis matrices. The block decomposition reduces the number of parameters and helps prevent overfitting.
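Assembling a relation-specific weight from B small blocks can be sketched as follows; the 2 × 2 block size matches the setting reported in Section D, and the block contents are random stand-ins:

```python
import numpy as np

def block_diag_weight(blocks):
    """Assemble diag(A_1, ..., A_B) from a list of square blocks, so a
    relation weight costs B * b^2 parameters instead of (B*b)^2."""
    B, b = len(blocks), blocks[0].shape[0]
    W = np.zeros((B * b, B * b))
    for i, blk in enumerate(blocks):
        W[i * b:(i + 1) * b, i * b:(i + 1) * b] = blk
    return W

rng = np.random.default_rng(0)
blocks = [rng.normal(size=(2, 2)) for _ in range(4)]  # four 2x2 blocks
W_r = block_diag_weight(blocks)                        # an 8x8 weight
```

Here a dense 8 × 8 matrix would need 64 parameters, while the block-diagonal form needs only 16.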

C Computational Complexity Analysis.
We analyze the time complexity of the graph generation process in Algorithm 1. Computing p(s_t | G_{t−m:t−1}) (equation (4))

D Detailed Experimental Settings
Datasets. We use five datasets: 1) three event-based temporal knowledge graphs and 2) two knowledge graphs where temporally associated facts have meta-facts of the form (s, r, o, [t_s, t_e]), where t_s is the starting time point and t_e is the ending time point. The first group includes the Integrated Crisis Early Warning System (ICEWS18 (Boschee et al., 2015) and ICEWS14 (Trivedi et al., 2017)) and the Global Database of Events, Language, and Tone (GDELT) (Leetaru and Schrodt, 2013). The second group includes WIKI (Leblay and Chekol, 2018) and YAGO (Mahdisoltani et al., 2014). Dataset statistics are given in Table 3, where N_train, N_valid, and N_test are the sizes of the train, valid, and test sets, respectively, and N_ent and N_rel are the numbers of entities and relations. The time gap represents the time granularity between adjacent events. ICEWS18 covers 1/1/2018 to 10/31/2018, ICEWS14 covers 1/1/2014 to 12/31/2014, and GDELT covers 1/1/2018 to 1/31/2018. The ICEWS14 data is from (Trivedi et al., 2017). We did not use their version of the GDELT dataset since they did not release it.
The WIKI and YAGO datasets contain temporally associated facts (s, r, o, [t_s, t_e]). We preprocess the datasets such that each fact is converted to {(s, r, o, t_s), (s, r, o, t_s + 1_t), ..., (s, r, o, t_e)}, where 1_t is a unit of time, ensuring each fact yields a sequence of events. Noisy events of early years are removed (before 1786 for WIKI and 1830 for YAGO).
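This per-step expansion can be sketched in a few lines; the example fact below is hypothetical and the unit time is 1:

```python
def expand_fact(s, r, o, ts, te, unit=1):
    """Convert a time-span fact (s, r, o, [ts, te]) into one event
    quadruple per time step, inclusive of both endpoints."""
    return [(s, r, o, t) for t in range(ts, te + 1, unit)]

events = expand_fact("Einstein", "worksAt", "ETH", 1912, 1914)
```

After expansion, a long-lasting fact contributes one event to every G_t in its validity span, matching the event-based format used for ICEWS and GDELT.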
The difference between the two groups is that in the first group (event-based knowledge graphs) facts happen multiple times (even periodically), whereas in the second group facts last a long time but are not likely to occur multiple times.
Model details of RE-NET. We use Gated Recurrent Units (Cho et al., 2014) as our recurrent event encoder, where the length of history is set to m = 10, meaning that the past 10 event sequences are kept. If the events related to s are sparse, we look further back until we collect m previous time steps related to the entity s. We pretrain the parameters related to equations 4 and 5 due to the large size of the training graphs. We use a multi-relational aggregator to compute H_t: the aggregator provides hidden representations for each node, and we max-pool over all hidden representations to get H_t. We apply teacher forcing for model training over historical data, i.e., during training we use the ground truth rather than the model's own prediction as the input to the next time step. At inference time, RE-NET performs multi-step prediction across the time stamps in the dev and test sets. At each time step, we sample M = 1000 subjects and save the top k = 1000 triples to use as the generated graph. We set the size of entity/relation embeddings to 200; embeddings of unobserved entities are randomly initialized. We use a two-layer RGCN as the RGCN aggregator with block dimension 2 × 2. The model is trained with the Adam optimizer (Kingma and Ba, 2014). We set λ_1 to 0.1, the learning rate to 0.001, and the weight decay rate to 0.00001. All experiments were run on a GeForce GTX 1080 Ti.
Experimental Settings for Baseline Methods. In this section, we provide detailed settings for the baselines. We use the released implementation of DistMult 6 and implemented TA-DistMult on top of it. For TA-DistMult, we use temporal tokens with a vocabulary of year, month, and day on the ICEWS dataset, and of year, month, day, hour, and minute on the GDELT dataset. We use a binary cross-entropy loss for DistMult and TA-DistMult. We validate the embedding size over {100, 200}. We set the batch size to 1024, the margin to 1.0, and the negative sampling ratio to 1, and use the Adam optimizer.
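The temporal token vocabulary used by TA-DistMult can be sketched as follows (the digit-level year tokens and suffixed month/day tokens follow the original TA-DistMult formulation of García-Durán et al., 2018; the function itself is illustrative):

```python
def temporal_tokens(date_str):
    """Split a date like '2018-01-06' into TA-DistMult-style temporal
    tokens: one token per year digit, plus a month token and a day token."""
    year, month, day = date_str.split("-")
    tokens = [c + "y" for c in year]  # e.g. '2y', '0y', '1y', '8y'
    tokens.append(month + "m")        # e.g. '01m'
    tokens.append(day + "d")          # e.g. '06d'
    return tokens
```

These tokens are appended to the relation token sequence, so timestamps with shared components (same year or month) share embeddings.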
We use the implementation of HyTE 7 , treating every timestamp as a hyperplane. The embedding size is set to 128, the negative sampling ratio to 5, and the margin to 1.0. We use time-agnostic negative sampling (TANS) for entity prediction, and the Adam optimizer.
We use the released code for ConvE 8 and the implementation by Deep Graph Library 9 . Embedding sizes are 200 for both methods. We use 1-to-all negative sampling for ConvE and a negative sampling ratio of 10 for RGCN, and use the Adam optimizer for both methods. We use the released codes for the dynamic graph baselines. These methods were proposed to predict interactions at a future time on homogeneous graphs, whereas our proposed method predicts interactions on multi-relational graphs (or knowledge graphs). Furthermore, those methods predict links at a single future time stamp, whereas our method seeks to predict interactions at multiple future time stamps. We modified some of the methods to apply them to multi-relational graphs as follows. We adopt RGCN (Schlichtkrull et al., 2018) for EvolveGCN-O and call it EvolveRGCN. We convert knowledge graphs into homogeneous graphs for dyn-graph2vecAE. The idea of this method is to reconstruct an adjacency matrix with an auto-encoder and regard it as a future adjacency matrix. If we kept relations, the relation-specific adjacency matrices would be extremely sparse, and the method would learn to reconstruct near-zero adjacency matrices. tNodeEmbed is a temporal method on homogeneous graphs. To use it on multi-relational graphs, we first train entity embeddings with DistMult and set these as initial embeddings for entities in tNodeEmbed. We also feed the entity embeddings as input to the LSTM of tNodeEmbed, and concatenate the LSTM output with relation embeddings to predict objects. We did not modify the other methods since extending them is not trivial. Table 4 shows the performance comparison on ICEWS18, GDELT, and ICEWS14 with raw settings.

10 https://github.com/rstriv/Know-Evolve
11 https://github.com/DeepGraphLearning/KnowledgeGraphEmbedding
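The conversion of a knowledge graph to a homogeneous graph for dyn-graph2vecAE can be sketched as follows (the text only states that relations are dropped; treating the result as undirected is our assumption for illustration):

```python
def to_homogeneous(triples):
    """Drop relation types to turn a multi-relational edge list into
    a homogeneous undirected edge set, as done for dyn-graph2vecAE.
    Duplicate edges arising from different relations collapse to one."""
    edges = set()
    for s, r, o in triples:
        edges.add((min(s, o), max(s, o)))  # canonical order = undirected
    return edges
```

The resulting single adjacency matrix is dense enough for the auto-encoder to reconstruct, unlike the near-zero relation-specific matrices discussed above.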

E.1 Results with Raw Metrics
Our proposed RE-NET outperforms all other baselines.

E.2 Sensitivity Analysis
In this section, we study the parameter sensitivity of RE-NET, including the length of history in the event encoder, the cutoff position k for events used to generate a graph, the number of layers of the RGCN aggregator, and the effect of the global representation from the global graph structure. We report the performance change of RE-NET on the ICEWS18 dataset while varying these hyper-parameters (Fig. 7).
Length of Past History in Recurrent Event Encoder. The recurrent event encoder takes the sequence of past interactions up to m graph sequences, or previous histories. Fig. 7a shows performance for various history lengths. MRR improves as RE-NET uses longer histories, but plateaus once the history length reaches 5.
Cut-off Position k at Inference. To generate a graph at each time step, we keep the top-k triples from the ranking results. In Fig. 7b, when k is 0, RE-NET does not generate graphs for estimating p(G_{t+∆t} | G_{:t}), i.e., it performs single-step prediction, and this yields the lowest result. Performance improves as k grows and saturates beyond 500. This suggests that with a sufficiently large cutoff position, the conditional distribution p(G_{t+∆t} | G_{:t}) can be approximated by p(G_{t+∆t} | Ĝ_{t+1:t+∆t−1}, G_{:t}), where Ĝ denotes the generated graphs.
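The top-k graph generation at each inference step can be sketched as follows (an illustrative helper operating on pre-computed candidate scores; names are ours):

```python
def generate_graph(scored_triples, k):
    """Keep the top-k highest-scoring candidate triples as the generated
    graph for the next inference step. `scored_triples` is a list of
    ((s, r, o), score) pairs; k = 0 means no graph is generated
    (single-step prediction)."""
    if k == 0:
        return []
    ranked = sorted(scored_triples, key=lambda x: x[1], reverse=True)
    return [triple for triple, score in ranked[:k]]
```

The kept triples then serve as the (estimated) graph Ĝ at that time step, conditioning the prediction at the next step.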
Layers of RGCN Aggregator. The number of layers in the aggregator determines the neighborhood depth each node reaches. Fig. 7c shows performance for different numbers of RGCN layers. A 2-layered RGCN improves performance considerably over a 1-layered RGCN since it aggregates more information. However, RE-NET with a 3-layered RGCN underperforms the 2-layered variant; we conjecture that the larger parameter space leads to overfitting.

Global Information. We further observe that representations of the global graph structure help prediction. Fig. 7d shows the effectiveness of a global graph representation. The improvement is marginal, but we believe global representations at different time steps provide distinct information beyond local graph structures.

F Case Study
In this section, we study RE-NET's predictions, which depend on interaction histories. We categorize histories into three cases: (1) consistent interactions with an object, (2) a specific temporal pattern, and (3) irrelevant history (Fig. 8). RE-NET can learn cases (1) and (2), and thus achieves high performance on them. In the first case, RE-NET predicts the answer because the subject consistently interacts with the object, whereas static methods are prone to predicting other entities observed under the relation "Accuse" in the training set. The second case shows a specific temporal pattern over relations: (Arrest, o) → (Use force, o). Without knowing this pattern, a model might predict "Businessman" instead of "Men"; RE-NET learns such temporal patterns and thus predicts the second case correctly.
Lastly, the third case involves a history irrelevant to the answer, which does not help prediction; RE-NET fails on this case.

G Implementation Issues of Know-Evolve
We found a problematic formulation in the Know-Evolve model and its code. The intensity function (equation 3 in (Trivedi et al., 2017)) is defined as λ^{s,o}_r(t | t̄) = f(g^{s,o}_r(t̄))(t − t̄), where g(·) is a score function, t is the current time, and t̄ is the most recent time point at which either the subject or object entity was involved in an event. This intensity function is used at inference to rank entity candidates. However, the formulation does not consider concurrent events at the same time stamp, so t̄ becomes t after one event. For example, suppose we have events e_1 = (s, r, o_1, t_1) and e_2 = (s, r, o_2, t_1). After e_1, t̄ becomes t_1 (subject s's most recent time point), and thus the intensity for e_2 is 0. This is problematic at inference: whenever t = t̄, the intensity function is 0 regardless of the entity candidate. At inference, all object candidates are ranked by the intensity function, but all their scores are 0 since t = t̄, i.e., all candidates tie with a score of 0. In their code, all entities, including the ground truth object, are given the highest (first) rank in this case. For a fair comparison, we fixed their code to assign the average rank to entities with the same score.

Figure 8: Case study of RE-NET's predictions. RE-NET's predictions depend on interaction histories, categorized into three cases: (1) consistent interactions with an object, (2) a specific temporal pattern, and (3) irrelevant history. RE-NET achieves good performance on the first two cases, and poor performance on the third.
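The average-rank fix for tied scores can be sketched as follows (a minimal illustration; the function name and input layout are ours):

```python
def average_rank(scores, target):
    """Rank of `target` among candidates when tied candidates receive
    the average of the rank positions they span, instead of all
    sharing the top rank. `scores` maps candidate -> score
    (higher is better)."""
    target_score = scores[target]
    higher = sum(1 for s in scores.values() if s > target_score)
    tied = sum(1 for s in scores.values() if s == target_score)
    # the tie group occupies ranks higher+1 .. higher+tied
    return higher + (tied + 1) / 2
```

Under the degenerate all-zero-intensity case described above, every candidate ties, so the ground truth gets the average rank (n + 1)/2 over n candidates rather than an artificially perfect rank of 1.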