Temporal Knowledge Graph Completion using a Linear Temporal Regularizer and Multivector Embeddings

Representation learning approaches for knowledge graphs have been mostly designed for static data. However, many knowledge graphs involve evolving data, e.g., the fact (The President of the United States is Barack Obama) is valid only from 2009 to 2017. This introduces important challenges for knowledge representation learning, since the knowledge graphs change over time. In this paper, we present a novel time-aware knowledge graph embedding approach, TeLM, which performs 4th-order tensor factorization of a Temporal knowledge graph using a Linear temporal regularizer and Multivector embeddings. Moreover, we investigate the effect of a temporal dataset's time granularity on temporal knowledge graph completion. Experimental results demonstrate that our proposed models trained with the linear temporal regularizer achieve state-of-the-art performance on link prediction over four well-established temporal knowledge graph completion benchmarks.


Introduction
Numerous large-scale knowledge graphs (KGs) including DBpedia (Auer et al., 2007), Freebase (Bollacker et al., 2008) and WordNet (Miller, 1995) have been established in recent years. Such KGs abstract knowledge from the real world into a complex network graph consisting of billions of triples. Each triple is denoted as (s, r, o), where s is the subject entity, o is the object entity, and r is the relation between the entities.
Knowledge graph completion is one of the main challenges in the KG field since most KGs are incomplete. To tackle this problem, knowledge graph embedding (KGE) approaches embed entities and relations into a low-dimensional embedding space and measure the plausibility of triples by inputting embeddings of the entities and their relation to a score function (Wang et al., 2017). For instance, ComplEx (Trouillon et al., 2016) has been proven to be a highly effective KGE model, where entities and relations are represented as complex embeddings, and the score of a triple (s, r, o) is computed with the asymmetric Hermitian dot product.
In this paper, we present a novel temporal KG embedding approach, TeLM. We move beyond complex-valued representations and introduce more expressive multivector embeddings from 2-grade geometric algebras to model entities, relations, and timestamps for TKGE. At a high level, our approach performs 4th-order tensor factorization of a temporal KG using the asymmetric geometric product. The geometric product provides a greater extent of expressiveness compared to the complex Hermitian operator.
Specifically, each relation is represented as a pair of dual multivector embeddings which handle the beginning and the end of the relation. In this way, TeLM adapts well to datasets where time annotations are represented in various forms: time points, begin or end times, and time intervals.
Moreover, we develop a new linear temporal regularization function for time representation learning, which introduces a bias component into the temporal smoothing function, and we empirically study the effect of the time granularity of a TKG dataset on the performance of our models.
Experimental results on four well-established TKG datasets show that our approach outperforms the state-of-the-art TKGE models, and the linear temporal regularization function improves the performance of our model compared to three common temporal regularization functions.

Related Work
Tensor decomposition-based KGE approaches have led to good results in static KG completion. Such approaches (Yang et al., 2014; Trouillon et al., 2016; Kazemi and Poole, 2018; Zhang et al., 2019; Xu et al., 2020b) model a static KG as a low-dimensional 3rd-order tensor and treat knowledge graph completion as a tensor decomposition problem. A typical tensor decomposition model, ComplEx (Trouillon et al., 2016), has been proven to be fully expressive with complex embeddings. Apart from tensor decomposition approaches, distance-based KGE models are also commonly used for KG completion. However, distance-based models like TransE (Bordes et al., 2013) and its variants (Wang et al., 2014; Lin et al., 2015; Nayyeri et al., 2020) have proven limitations in modeling various relation patterns and thus do not achieve state-of-the-art results on current benchmarks. The above KGE approaches achieve satisfactory results on link prediction over static KGs.

Recent research on TKG completion shows that the inclusion of time information can improve the performance of KGE models on TKGs. TTransE (Leblay and Chekol, 2018), HyTE (Dasgupta et al., 2018), ATiSE and TeRo (Xu et al., 2019, 2020a) propose scoring functions which incorporate time representations into a distance-based score function in different ways. Furthermore, RTGE (Xu et al., 2020c) introduces the concept of temporal smoothness to jointly optimize the hyperplanes of adjacent time intervals on the basis of HyTE. García-Durán et al. (2018) utilize recurrent neural networks to learn time-aware representations of relations and use standard scoring functions from existing KG embedding models, e.g., TransE and DistMult. DE-SimplE (Goel et al., 2020) uses diachronic entity embeddings to represent entities at different time steps and exploits the same score function as SimplE to score the plausibility of a quadruple.
TIMEPLEX (Jain et al., 2020) and TComplEx (Lacroix et al., 2020) extend the time-agnostic ComplEx model in different ways. Among them, TComplEx performs a 4th-order tensor decomposition of a TKG using the quadranomial Hermitian product, which involves the embedding of the timestamp. Similarly to RTGE, TComplEx also uses temporal smoothness to improve its performance. Thanks to the strong expressiveness provided by complex embeddings and the 4th-order tensor decomposition, TComplEx achieves state-of-the-art results on TKG completion.

Geometric Algebras
In this section, we provide a brief introduction to the 2-grade geometric algebra $G_2$. The contents are sufficient to understand the rest of the work.
A 2-grade multivector $M \in G_2$ has the general form $M = a_0 + a_1e_1 + a_2e_2 + a_{12}e_1e_2$, where $e_1, e_2$ are orthonormal basis vectors and $e_1e_2$ is the unit bivector. The complex numbers $C \in \mathbb{C}$ can thus be embedded into the subalgebra of $G_2$ formed by scalars and bivectors: a 2-grade multivector $M = a_0 + a_{12}e_1e_2$ consisting of a scalar plus a bivector is isomorphic to a complex number $C = a_0 + a_{12}i$.
The norm of a multivector $M \in G_2$ is the square root of the sum of the squares of all its real-valued components. Taking a 2-grade multivector as an example, its norm is defined as $||M|| = \sqrt{a_0^2 + a_1^2 + a_2^2 + a_{12}^2}$. Geometric algebra also introduces a new product, the geometric product, denoted as $\otimes_n$ where n is the grade of the multivectors, as well as three multivector involutions: space inversion, reversion and Clifford conjugation.
The geometric product of two 2-grade multivectors comprises multiplications between scalars, vectors and bivectors. The product of two 2-grade multivectors $M_a = a_0 + a_1e_1 + a_2e_2 + a_{12}e_1e_2$ and $M_b = b_0 + b_1e_1 + b_2e_2 + b_{12}e_1e_2$ from $G_2$ is equal to

$$M_a \otimes_2 M_b = (a_0b_0 + a_1b_1 + a_2b_2 - a_{12}b_{12}) + (a_0b_1 + a_1b_0 - a_2b_{12} + a_{12}b_2)e_1 + (a_0b_2 + a_1b_{12} + a_2b_0 - a_{12}b_1)e_2 + (a_0b_{12} + a_1b_2 - a_2b_1 + a_{12}b_0)e_1e_2.$$

The Clifford conjugation of a 2-grade multivector $M$ is the subsequent composition of space inversion $M^*$ and reversion $M^\dagger$, i.e., $\bar{M} = (M^\dagger)^*$, where space inversion $M^*$ is obtained by changing $e_i$ to $-e_i$ and reversion $M^\dagger$ is obtained by reversing the order of all products, i.e., changing $e_1e_2$ to $e_2e_1$. Thus, the Clifford conjugation of a 2-grade multivector $M_a = a_0 + a_1e_1 + a_2e_2 + a_{12}e_1e_2$ is computed as $\bar{M}_a = a_0 - a_1e_1 - a_2e_2 - a_{12}e_1e_2$. Note that the product of a multivector $M_a$ and its conjugation $\bar{M}_a$ is a scalar: given the 2-grade multivector $M_a$ above, we have $M_a \otimes_2 \bar{M}_a = a_0^2 - a_1^2 - a_2^2 + a_{12}^2$, a real number.

TeLM

A temporal fact is denoted as $(s, r, o, [t_b, t_e])$, where $t_b$ and $t_e$ are the begin and end times; a fact valid at a single time point corresponds to $t = t_b = t_e$. We extend the relation set R of a TKG to a pair of dual relation sets, $R_b$ and $R_e$. A relation $r_b \in R_b$ is used to handle the begin of relation r, while a relation $r_e \in R_e$ is used to handle the end of relation r. In this way, we score a fact $(s, r, o, [t_b, t_e])$ as the mean of the scores of the two quadruples $(s, r_b, o, t_b)$ and $(s, r_e, o, t_e)$, which represent the begin and the end of this fact respectively, i.e., $f(s, r, o, [t_b, t_e]) = \frac{1}{2}\big(f(s, r_b, o, t_b) + f(s, r_e, o, t_e)\big)$. For a fact missing the begin time or the end time, the score of this fact is equal to the score of the quadruple involving the known time. We construct a set of time steps T for a TKG: for any time t appearing in the TKG, we can find a time step $\tau \in T$ representing t. The time step set T changes with the time granularity of the TKG. Our approach TeLM embeds a TKG in a k-dimensional 2-grade multivector space $G_2^k$, where k is the dimension of embeddings, and scores a quadruple with an element-wise geometric product.
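As a concrete sketch of these operations (our illustration, not the paper's code), the 2-grade geometric product and Clifford conjugation can be implemented for a single multivector stored as a 4-tuple:

```python
# Sketch: a 2-grade multivector M = a0 + a1*e1 + a2*e2 + a12*e1*e2
# is stored as the 4-tuple (a0, a1, a2, a12).

def geometric_product(a, b):
    """Geometric product of two 2-grade multivectors in G2."""
    a0, a1, a2, a12 = a
    b0, b1, b2, b12 = b
    return (
        a0 * b0 + a1 * b1 + a2 * b2 - a12 * b12,   # scalar part
        a0 * b1 + a1 * b0 - a2 * b12 + a12 * b2,   # e1 part
        a0 * b2 + a1 * b12 + a2 * b0 - a12 * b1,   # e2 part
        a0 * b12 + a1 * b2 - a2 * b1 + a12 * b0,   # e1e2 (bivector) part
    )

def clifford_conjugate(a):
    """Clifford conjugation: negate the vector and bivector parts."""
    a0, a1, a2, a12 = a
    return (a0, -a1, -a2, -a12)

# The product of a multivector with its Clifford conjugate is a pure scalar:
m = (1.0, 2.0, 3.0, 4.0)
print(geometric_product(m, clifford_conjugate(m)))  # -> (4.0, 0.0, 0.0, 0.0)
```

The last line illustrates the identity above: $1^2 - 2^2 - 3^2 + 4^2 = 4$, with all non-scalar parts vanishing.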
TeLM embeds each entity, relation and time step as a k-dimensional 2-grade multivector embedding $\mathbf{M} = [M_1, \ldots, M_k]$, where each component $M_i \in G_2$, $i = 1, \ldots, k$, is a multivector. We define the score function of TeLM as

$$f(s, r, o, t) = \langle Sc(\mathbf{M}_s \otimes_2 \mathbf{M}_{r\tau} \otimes_2 \bar{\mathbf{M}}_o), \mathbf{1} \rangle,$$

where $\tau$ is the time step corresponding to time t; $\mathbf{M}_{r\tau} = \mathbf{M}_r \otimes_2 \mathbf{M}_\tau$; $\mathbf{M}_s$, $\mathbf{M}_r$, $\mathbf{M}_o$ and $\mathbf{M}_\tau$ denote the k-dimensional multivector embeddings of s, r, o and $\tau$ respectively; $\otimes_2$ denotes the element-wise geometric product between 2-grade multivector embeddings; $Sc(\cdot)$ denotes the real-valued vector of scalar components of a multivector embedding; $\mathbf{1}$ denotes a $k \times 1$ vector having all k elements equal to one; $\bar{\mathbf{M}} = [\bar{M}_1, \ldots, \bar{M}_k]$ denotes the element-wise Clifford conjugation of multivectors; and $\langle a, b \rangle := \sum_k a_k b_k$ is the dot product.
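As an illustrative sketch of this score function (written in NumPy with assumed array shapes rather than the paper's PyTorch implementation; all names are ours), each embedding is a (k, 4) array of 2-grade multivector components (a0, a1, a2, a12):

```python
import numpy as np

def geo_product(A, B):
    """Element-wise geometric product of two (k, 4) multivector embeddings."""
    a0, a1, a2, a12 = A.T
    b0, b1, b2, b12 = B.T
    return np.stack([
        a0*b0 + a1*b1 + a2*b2 - a12*b12,   # scalar parts
        a0*b1 + a1*b0 - a2*b12 + a12*b2,   # e1 parts
        a0*b2 + a1*b12 + a2*b0 - a12*b1,   # e2 parts
        a0*b12 + a1*b2 - a2*b1 + a12*b0,   # bivector parts
    ], axis=1)

def conj(A):
    """Element-wise Clifford conjugation of a (k, 4) embedding."""
    return A * np.array([1.0, -1.0, -1.0, -1.0])

def score(M_s, M_r, M_o, M_tau):
    """f(s, r, o, t) = <Sc(M_s x M_rt x conj(M_o)), 1>, with M_rt = M_r x M_tau."""
    M_rt = geo_product(M_r, M_tau)       # time-dependent relation embedding
    full = geo_product(geo_product(M_s, M_rt), conj(M_o))
    return full[:, 0].sum()              # scalar components dotted with ones

k = 8
rng = np.random.default_rng(0)
M_s, M_r, M_o, M_t = (rng.normal(size=(k, 4)) for _ in range(4))
print(score(M_s, M_r, M_o, M_t))  # a single real-valued plausibility score
```

The asymmetry of the geometric product makes the score sensitive to the roles of subject and object, mirroring the asymmetric Hermitian product of ComplEx.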
In our approach, the total number of parameters increases linearly with the embedding dimension k, i.e., the space complexity of a TeLM model is O(k). Since the score is computed with an asymmetric quadranomial geometric product between k-dimensional multivector embeddings, the time complexity is also O(k), the same as for common KGE models such as TransE and DistMult.

Loss Function
Using the full multiclass log-softmax loss function together with N3 regularization has been proven helpful in boosting the performance of tensor decomposition-based (T)KGE models (Lacroix et al., 2018; Xu et al., 2020b; Lacroix et al., 2020; Jain et al., 2020). In this work, we follow this setting for TeLM and utilize reciprocal learning to simplify the training process.
For each relation r, we create an inverse relation $r^{-1}$ and create a quadruple $(o, r^{-1}, s, t)$ for each training quadruple $(s, r, o, t)$. At the evaluation phase, queries of the form $(?, r, o, t)$ are answered as $(o, r^{-1}, ?, t)$. The multiclass log-loss of a training quadruple $\omega = (s, r, o, t)$ can then be defined as

$$L(\omega) = -\log\frac{\exp(f(s, r, o, t))}{\sum_{o' \in E}\exp(f(s, r, o', t))} + \lambda_\omega \left( \|\mathbf{M}_s\|_3^3 + \|\mathbf{M}_{r\tau}\|_3^3 + \|\mathbf{M}_o\|_3^3 \right),$$

where $\lambda_\omega$ denotes the N3 regularization weight.
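A minimal sketch of this loss for a single query, assuming scores for all candidate objects have already been computed (the function name and the list of regularized factors are our illustration):

```python
import numpy as np

def log_softmax_loss(scores, true_idx, factors, lam):
    """Multiclass log-softmax loss with weighted N3 regularization.

    scores: 1-d array of plausibilities f(s, r, o', t) over all candidates o';
    true_idx: index of the gold object; factors: embedding arrays entering
    the score (subject, time-dependent relation, object); lam: N3 weight."""
    # numerically stable log-sum-exp for the softmax normalizer
    m = scores.max()
    log_z = np.log(np.sum(np.exp(scores - m))) + m
    nll = log_z - scores[true_idx]
    # N3 regularizer: sum of cubed absolute values of all factor entries
    n3 = lam * sum(np.sum(np.abs(f) ** 3) for f in factors)
    return nll + n3

scores = np.array([2.0, 0.5, -1.0])
loss = log_softmax_loss(scores, true_idx=0, factors=[np.ones(4)], lam=0.01)
print(round(loss, 4))
```

In practice the same loss is also computed for the reciprocal query $(o, r^{-1}, ?, t)$ and the two are summed.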

Temporal Regularization
A common approach to leveraging the temporal aspect of temporal graphs is to use time as a regularizer that imposes a smoothness constraint on time embeddings. RTGE (Xu et al., 2020c) and TComplEx (Lacroix et al., 2020) introduce temporal smoothness between hyperplanes and embeddings of adjacent time steps, respectively, based on the assumption that neighboring time steps should have close representations. The smoothing temporal regularizer is defined as

$$\Lambda_s = \frac{1}{n_\tau - 1} \sum_{i=1}^{n_\tau - 1} \left\| \mathbf{M}_{\tau_{i+1}} - \mathbf{M}_{\tau_i} \right\|_p^p,$$

where $n_\tau$ is the number of time steps and p = 3 in this work since we use N3 regularization. Apart from this basic temporal smoothness, various other temporal regularization methods have been used for learning temporal embeddings. Singer et al. (2019) add a rotation projection to align neighboring temporal embeddings. The loss of such a projective temporal regularizer can be defined as

$$\Lambda_p = \frac{1}{n_\tau - 1} \sum_{i=1}^{n_\tau - 1} \left\| \mathbf{M}_{\tau_{i+1}} - \mathbf{M}_w \otimes_2 \mathbf{M}_{\tau_i} \right\|_p^p,$$

where $\mathbf{M}_w$ is the rotation embedding. Yu et al. (2016) propose an autoregressive temporal regularizer based on the assumption that the change of temporal embeddings fits an AR model:

$$\Lambda_a = \frac{1}{n_\tau - m} \sum_{i=1}^{n_\tau - m} \left\| \mathbf{M}_{\tau_{i+m}} - \sum_{j=1}^{m} \mathbf{M}_j \otimes_2 \mathbf{M}_{\tau_{i+m-j}} \right\|_p^p,$$

where m = 3 is the order of the AR model used in our work, and the $\mathbf{M}_j$ denote the weights of the embeddings of previous time steps, which are learned during the training process.
In this work, we develop a novel linear temporal regularizer by adding a bias component between neighboring temporal embeddings, defined as

$$\Lambda_l = \frac{1}{n_\tau - 1} \sum_{i=1}^{n_\tau - 1} \left\| \mathbf{M}_{\tau_{i+1}} - \mathbf{M}_{\tau_i} - \mathbf{M}_b \right\|_p^p,$$

where $\mathbf{M}_b$ denotes the bias embedding, which is randomly initialized and then learned during the training process. This linear regularizer encourages the difference between embeddings of two adjacent time steps to be smaller than the difference between embeddings of two distant time steps, i.e., $\|\mathbf{M}_{\tau_{i+m}} - \mathbf{M}_{\tau_i}\| > \|\mathbf{M}_{\tau_{i+1}} - \mathbf{M}_{\tau_i}\|$ when $m \gg 1$, which helps to effectively cluster and order the time embeddings $\mathbf{M}_{\tau_i}$.
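A sketch of this linear temporal regularizer under assumed shapes (names illustrative, not the authors' code); note that time embeddings evolving by a constant step equal to the learned bias incur zero penalty, whereas a plain smoothing regularizer would still penalize them:

```python
import numpy as np

def linear_temporal_reg(T, bias, p=3):
    """T: (n_tau, d) time-step embeddings; bias: (d,) learned bias embedding.
    Penalizes || T[i+1] - T[i] - bias ||_p^p, averaged over adjacent pairs."""
    diffs = T[1:] - T[:-1] - bias
    return np.sum(np.abs(diffs) ** p) / (len(T) - 1)

n_tau = 5
step = np.array([1.0, 2.0, 0.5])
T = np.arange(n_tau)[:, None] * step             # T[i] = i * step: "linear" time
print(linear_temporal_reg(T, bias=step))          # -> 0.0
print(linear_temporal_reg(T, bias=np.zeros(3)))   # -> 9.125 (smoothing-style penalty)
```

Setting `bias` to the zero vector recovers the plain smoothing regularizer, which makes the two easy to compare.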
The total loss $L_b$ of a training batch $\Omega_b$ is the sum of the quadruple losses and the temporal regularization term, i.e.,

$$L_b = \sum_{\omega \in \Omega_b} L(\omega) + \lambda_T \Lambda,$$

where $\lambda_T$ denotes the coefficient of the temporal regularizer. In this work, we use the linear temporal regularizer for TeLM and compare its performance with the three other temporal regularizers.

Datasets

We evaluate our approach on four TKG benchmarks: ICEWS14, ICEWS05-15, YAGO11k and Wikidata12k. The statistics of the datasets are listed in Table 1. All datasets can be downloaded from https://github.com/soledad921/ATISE.

Time granularity
In previous work (García-Durán et al., 2018; Goel et al., 2020; Lacroix et al., 2020), the time granularity of ICEWS14 and ICEWS05-15 was set to 1 day. For YAGO11k and Wikidata12k, Dasgupta et al. (2018) and Xu et al. (2019) dropped the month and day information. They handled the imbalance in the number of facts per interval by clubbing less frequently mentioned neighboring years into the same time step and applying a minimum threshold of 300 facts per interval during construction. To illustrate, in Wikidata12k some time steps span many years because facts occurring in those years are relatively rare in the KG, while highly frequent years like 2013 form their own time steps. This setting alleviates the effect of the long-tail property of time data in YAGO11k and Wikidata12k. As shown in Figure 1, the time distribution of facts in ICEWS14 is relatively uniform, while the frequency distribution of time data in YAGO11k has a long tail.

In this work, we study the effect of time granularity on TKG completion. For the ICEWS datasets, we test our model with different time units, denoted as u, in the range {1, 2, 3, 7, 14, 30, 90, 365} days. For YAGO11k and Wikidata12k, we follow the time-division approach of Dasgupta et al. (2018) and test different minimum thresholds, denoted as tr, amongst {1, 10, 100, 1000, 10000} for grouping years into different time steps. Changing the time granularity reconstructs the set of time steps T. To illustrate, the total number of time steps in ICEWS14 is 365 with u = 1; when the time unit u changes from 1 to 2, the set of time steps T is reconstructed and includes 188 different time steps. In YAGO11k, there are 388 different time steps in total when tr = 1, with years like -453, 100 and 2008 taken as independent time steps.
When tr for YAGO11k rises to 100, the number of time steps drops to 118, and the years between -431 and 100 are clubbed into the same time step.
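The threshold-based grouping of years into time steps can be sketched as follows (toy counts and illustrative function names; the benchmarks' actual construction may differ in detail):

```python
from collections import Counter

def build_time_steps(year_counts, tr):
    """year_counts: Counter mapping year -> number of facts.
    Greedily groups consecutive years into (begin_year, end_year) time steps
    so each step covers at least tr facts; trailing sparse years are clubbed
    into the last step."""
    steps, begin, acc = [], None, 0
    for year in sorted(year_counts):
        if begin is None:
            begin = year
        acc += year_counts[year]
        if acc >= tr:                       # threshold reached: close the step
            steps.append((begin, year))
            begin, acc = None, 0
    if begin is not None:                   # leftover sparse years at the end
        if steps:
            steps[-1] = (steps[-1][0], max(year_counts))
        else:
            steps.append((begin, max(year_counts)))
    return steps

counts = Counter({1900: 2, 1950: 3, 1990: 40, 2000: 120, 2008: 90, 2013: 300})
print(build_time_steps(counts, tr=1))    # every year becomes its own time step
print(build_time_steps(counts, tr=100))  # -> [(1900, 2000), (2008, 2013)]
```

Raising `tr` merges sparse years into wider steps, exactly the behavior described above for YAGO11k.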

Evaluation Metrics
We evaluate our models on link prediction over the above-mentioned TKG benchmarks. For a time-aware link prediction query (s, r, ?, t), we first generate the candidate list C = {(s, r, o', t) : o' ∈ E}. Following the time-wise filtered setting used in most previous TKGE work, e.g., TComplEx (Lacroix et al., 2020), we then remove from the candidate list all candidate quadruples, other than the test quadruple itself, that appear in the training set Ω_train, validation set Ω_valid or test set Ω_test; the filtered candidate list is thus C' = {ω : ω ∈ C, ω ∉ Ω_train ∪ Ω_valid ∪ Ω_test} ∪ {(s, r, o, t)}. We obtain the rank of the test quadruple (s, r, o, t) among the filtered candidates C' by sorting their scores. We use Mean Reciprocal Rank (MRR) and Hits@N as evaluation metrics: MRR is the average of the reciprocal ranks over all test queries, and Hits@N is the proportion of test quadruples ranked within the top N.
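The filtered ranking protocol can be sketched as follows (toy scores and illustrative names, not the evaluation code):

```python
def filtered_rank(scores, true_obj, known_objs):
    """scores: dict object -> score for the query (s, r, ?, t);
    known_objs: objects that complete (s, r, ?, t) anywhere in
    train/valid/test, excluding the test object itself."""
    true_score = scores[true_obj]
    # count unfiltered candidates scoring strictly higher than the true object
    higher = sum(1 for o, sc in scores.items()
                 if o != true_obj and o not in known_objs and sc > true_score)
    return higher + 1

def mrr_and_hits(ranks, n=10):
    """Mean Reciprocal Rank and Hits@N over a list of ranks."""
    mrr = sum(1.0 / r for r in ranks) / len(ranks)
    hits = sum(r <= n for r in ranks) / len(ranks)
    return mrr, hits

scores = {"obama": 9.1, "bush": 9.5, "merkel": 4.0, "macron": 8.7}
# "bush" outscores the true answer but is itself a known true answer -> filtered
rank = filtered_rank(scores, true_obj="obama", known_objs={"bush"})
print(rank)                      # -> 1
print(mrr_and_hits([rank, 4]))   # -> (0.625, 1.0)
```

The filtering step prevents a model from being penalized for ranking other valid answers above the test answer.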

Baselines
We compare our models with the state-of-the-art static KGE model ComplEx-N3 (Lacroix et al., 2018) and several existing TKGE approaches, including TTransE (Leblay and Chekol, 2018), DE-SimplE (Goel et al., 2020), TIMEPLEX(base) (Jain et al., 2020) and TComplEx (Lacroix et al., 2020). We do not use the complete TIMEPLEX model and the TNTComplEx model as baselines, since the former incorporates additional temporal constraints for some specific relations and the latter is designed for modeling a KG where some facts involve time information and others do not. Among the existing TKGE approaches, TComplEx achieves state-of-the-art results on TKG completion.

Experimental Setup
We implement our proposed model TeLM in PyTorch. We use the Adagrad optimizer with a learning rate of 0.1 to train the models. The batch size b is fixed at 1000. The regularization weights λ_ω and λ_T are tuned in the range {0, 0.001, 0.0025, 0.005, 0.0075, 0.01, ..., 0.1}. To avoid excessive memory consumption, we follow the setting in (Lacroix et al., 2020) and cap the embedding dimension at 2000. The same experimental setup is used for evaluating TComplEx on YAGO11k and Wikidata12k. Notably, the time granularity parameters u and tr are also regarded as hyperparameters for TeLM, as mentioned in the previous section. The optimal hyperparameters for TeLM are as follows: λ_ω = 0.0075, λ_T = 0.01, u = 1 on ICEWS14; λ_ω = 0.0025, λ_T = 0.1, u = 1 on ICEWS05-15; λ_ω = 0.025, λ_T = 0.001, tr = 100 on YAGO11k; λ_ω = 0.025, λ_T = 0.0025, tr = 1 on Wikidata12k. The optimal embedding dimension is k = 2000 in all cases. Training a TeLM model with k = 2000 on ICEWS14, YAGO11k or Wikidata12k takes less than half an hour on a GeForce RTX 2080 GPU; on ICEWS05-15, it takes about 2 hours.
Results and Analysis

Link Prediction

We do not report the results of TA-TransE in Table 3 since there is no literature reporting its results on YAGO11k and Wikidata12k, and its performance is worse than that of most baseline models on the other TKG datasets. The results of DE-SimplE on YAGO11k and Wikidata12k cannot be obtained since DE-SimplE mainly focuses on event-based datasets and cannot model time intervals or time annotations missing month and day information, which are common in YAGO and Wikidata. On YAGO11k, TeLM outperforms all baseline models other than TA-TransE and DE-SimplE (whose results are unavailable) regarding MRR, Hits@1 and Hits@10, though it performs slightly worse than TeRo on Hits@3. On the remaining datasets, TeLM also achieves state-of-the-art results, except that the Hits@1 of TComplEx is 0.1 points higher than that of TeLM.

Effect of Linear Temporal Regularizer
We compare the performance of TeLM trained with each of the temporal regularizers mentioned above, i.e., the smoothing temporal regularizer, the projective temporal regularizer, the 3rd-order autoregressive temporal regularizer, and our proposed linear temporal regularizer. As shown in Figure 2, the TeLM model trained with the linear temporal regularizer outperforms TeLM trained with the other temporal regularizers on ICEWS14. Compared to the smoothing temporal regularizer, the linear temporal regularizer improves MRR by 0.2 points and Hits@1 by 0.3 points. The linear temporal regularizer is also less sensitive to the temporal regularization weight λ_T across the range {0.001, ..., 0.1}, since its bias component is learned during training and can thus partly adapt to different λ_T.
Figure 3 shows 2-d PCA projections of the 2000-dimensional time embeddings of TeLM models trained with and without the linear temporal regularizer. Adjacent time embeddings of TeLM trained without temporal regularization naturally come together; however, the time embeddings representing time points in different months are not well separated. By contrast, the time embeddings of TeLM trained with the linear temporal regularizer form clear clusters in chronological order. Overall, the linear temporal regularizer gives time embeddings a meaningful geometric structure by effectively retaining the time sequence information in temporal KGs, and thus improves the performance of TeLM.

Effect of Time Granularity and Embedding Dimension
In this work, we analyze the effect of the time granularity on the performance of our model. As mentioned in the previous section, we adopt two different time-division approaches for event-based datasets, i.e., the ICEWS datasets, and for time-wise KGs involving time intervals, i.e., YAGO11k and Wikidata12k. As shown in Figure 4(a), on ICEWS14, where the time distribution of facts is relatively uniform, the performance of TeLM decreases as the time unit u increases, since representing time with a small time granularity provides richer time information.
On the other hand, Figure 4(b) illustrates that using the smallest time granularity is non-optimal for YAGO11k due to the long-tail property of its time data. An appropriate minimum threshold for generating time steps, e.g., tr = 100, improves the link prediction results of TeLM by alleviating the effect of the long-tail property, and decreases memory usage by requiring fewer time steps. Meanwhile, overly coarse-grained time steps always lead to poor performance, since the time information is then not fully expressed.
Figure 4(c) and (d) show that the performance of TeLM on ICEWS14 and YAGO11k improves as the embedding dimension increases over the range k = {20, 50, 100, 200, 500, 1000, 2000}. TeLM with k = 500 has fewer adjustable parameters than TComplEx with k = 1740 as used in (Lacroix et al., 2020), but performs comparably (0.612 vs. 0.61 MRR). It would still be interesting to explore TeLM models with higher-dimensional embeddings, e.g., Ebisu et al. (2018) use 10000-dimensional embeddings for TorusE, although this would increase memory pressure.

Conclusion
We propose a new time-aware approach for TKG completion, TeLM, which performs 4th-order tensor factorization of a temporal knowledge graph using multivector embeddings for knowledge graph representation and a linear temporal regularizer for learning time embeddings. Compared to real-valued and complex-valued embeddings, multivector embeddings provide better generalization capacity and richer expressiveness, with a higher degree of freedom, for TKGE. Moreover, the linear temporal regularizer provides better geometric meaning for time embeddings and improves the performance of TeLM compared to temporal smoothness. Additionally, two time-division methods are used for different types of TKG datasets to study the effect of time granularity on TKG completion. Our proposed models trained with the linear temporal regularizer achieve state-of-the-art results on time-wise link prediction over four well-known datasets involving various forms of time information, e.g., time points, begin or end times, and time intervals. Experimental results also show that choosing a reasonable time-division method with an appropriate time granularity is helpful for TKG completion.