Learning Sequence Encoders for Temporal Knowledge Graph Completion

Research on link prediction in knowledge graphs has mainly focused on static multi-relational data. In this work we consider temporal knowledge graphs where relations between entities may only hold for a time interval or a specific point in time. In line with previous work on static knowledge graphs, we propose to address this problem by learning latent entity and relation type representations. To incorporate temporal information, we utilize recurrent neural networks to learn time-aware representations of relation types which can be used in conjunction with existing latent factorization methods. The proposed approach is shown to be robust to common challenges in real-world KGs: the sparsity and heterogeneity of temporal expressions. Experiments show the benefits of our approach on four temporal KGs. The data sets are available under a permissive BSD-3 license.


Introduction
Knowledge graphs (KGs) are used to organize, manage, and retrieve structured information. The incompleteness of most real-world KGs has stimulated research on predicting missing relations between entities. A KG is of the form G = (E, R), where E is a set of entities and R is a set of relation types or predicates. One can represent G as a set of triples of the form (subject, predicate, object), denoted as (s, p, o). The link prediction problem seeks the most probable completion of a triple (s, p, ?) or (?, p, o) (Nickel et al., 2016). We focus on temporal KGs where some triples are augmented with time information and the link prediction problem asks for the most probable completion given time information. More formally, a temporal KG G = (E, R, T) is a KG where facts can also have the form (subject, predicate, object, timestamp) or (subject, predicate, object, time predicate, timestamp), in addition to (s, p, o) triples. For instance, facts such as (Barack Obama, born, US, 1961) or (Barack Obama, president, US, occursSince, 2009-01) express temporal information about the facts associated with Barack Obama. While the former expresses that a relation type occurred at a specific point in time, the latter expresses an (open) time interval using the time predicate "occursSince." The latter example also illustrates a common challenge posed by the heterogeneity of time expressions due to variations in language and serialization standards.
Most approaches to link prediction are characterized by a scoring function that operates on the entity and relation type embeddings of a triple (Bordes et al., 2013; Yang et al., 2014; Guu et al., 2015). Learning representations that carry temporal information is challenging due to the sparsity and irregularities of time expressions. It is possible, however, to turn time expressions into sequences of tokens expressing said temporal information. Moreover, character-level architectures for language modeling (Zhang et al., 2015; Kim et al., 2016) operate on characters as atomic units to derive word embeddings. Inspired by these models, we propose a method to incorporate time information into standard embedding approaches for link prediction. We learn time-aware representations by training a recurrent neural network with sequences of tokens representing the relation type, the time predicate and the digits of the timestamp, if they exist. The last hidden state of the recurrent network is combined with standard scoring functions from the KG completion literature.

Related Work
Reasoning with temporal information in knowledge bases has a long history and has resulted in numerous temporal logics (van Benthem, 1995). Several recent approaches extend statistical relational learning frameworks with temporal reasoning capabilities (Chekol and Stuckenschmidt, 2018; Dylla et al., 2013).
There is also prior work on incorporating temporal information in knowledge graph completion methods. Jiang et al. (2016) capture the temporal ordering that exists between some relation types as well as additional common-sense constraints to generate more accurate link predictions. Esteban et al. (2016) introduce a link prediction model that assumes that changes to a KG are introduced by incoming events. These events are modeled as a separate event graph and used to predict the existence of links in the future. Trivedi et al. (2017) model the occurrence of a fact as a point process whose intensity function is influenced by the score assigned to the fact by an embedding function. Leblay and Chekol (2018) develop scoring functions that incorporate time representations into a TransE-type scoring function. Prior work has also incorporated numerical but non-temporal entity information for knowledge base completion (Garcia-Duran and Niepert, 2017).
Contrary to all previous approaches, we encode sequences of temporal tokens with an RNN. This facilitates the encoding of relation types with temporal tokens such as "since," "until," and the digits of timestamps. Moreover, the RNN encoding provides an inductive bias for parameter sharing among similar timestamps (e.g., those occurring in the same century). Finally, our method can be combined with all existing scoring functions.

Time-Aware Representations
Embedding approaches for KG completion learn a scoring function f that operates on the embeddings of the subject e_s, the object e_o, and the predicate e_p of a triple (s, p, o). Two popular scoring functions are
• TRANSE (Bordes et al., 2013):
$$f(s, p, o) = \lVert e_s + e_p - e_o \rVert_2 \qquad (1)$$
• DISTMULT (Yang et al., 2014):
$$f(s, p, o) = (e_s \circ e_o)\, e_p^{\top}$$
where e_s, e_o ∈ R^d are the embeddings of the subject and object entities, e_p ∈ R^d is the embedding of the relation type predicate, and ∘ is the element-wise product. These scoring functions do not take temporal information into account.
Given a temporal KG where some triples are augmented with temporal information, we can decompose a given (possibly incomplete) timestamp into a sequence consisting of some of the following temporal tokens:
year: 0 · 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 · 9
month: 01 · 02 · 03 · 04 · 05 · 06 · 07 · 08 · 09 · 10 · 11 · 12
day: 0 · 1 · 2 · 3 · 4 · 5 · 6 · 7 · 8 · 9
Hence, temporal tokens have a vocabulary size of 32 (10 year digits, 12 months and 10 day digits). Moreover, for each triple we can extract a sequence of predicate tokens that always consists of the relation type token and, if available, a temporal modifier token such as "since" or "until." We refer to the concatenation of the predicate token sequence and (if available) the sequence of temporal tokens as the predicate sequence p_seq. Now, a temporal KG can be represented as a collection of triples of the form (s, p_seq, o), wherein the predicate sequence may include temporal information. Table 1 lists some examples of such facts from a temporal KG and their corresponding predicate sequence. We use the suffixes y, m and d to indicate whether a token corresponds to year, month or day information. It is these sequences of tokens that are used as input to a recurrent neural network.
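The decomposition of a fact into a predicate sequence can be sketched in a few lines of Python. This is only an illustration under stated assumptions: timestamps are assumed to be serialized as "YYYY", "YYYY-MM" or "YYYY-MM-DD", and the function names temporal_tokens and predicate_sequence are hypothetical rather than taken from a released implementation.

```python
def temporal_tokens(date):
    """Split a (possibly incomplete) date into temporal tokens.

    Assumption: `date` is formatted as 'YYYY', 'YYYY-MM' or 'YYYY-MM-DD'.
    """
    parts = date.split("-")
    # One token per year digit, marked with the suffix 'y'.
    tokens = [digit + "y" for digit in parts[0]]
    if len(parts) > 1:
        # The month is a single token from 01 to 12, marked with 'm'.
        tokens.append(parts[1] + "m")
    if len(parts) > 2:
        # One token per day digit, marked with the suffix 'd'.
        tokens.extend(digit + "d" for digit in parts[2])
    return tokens


def predicate_sequence(relation, modifier=None, date=None):
    """Concatenate relation type, optional modifier and optional temporal tokens."""
    sequence = [relation]
    if modifier is not None:
        sequence.append(modifier)
    if date is not None:
        sequence.extend(temporal_tokens(date))
    return sequence


print(predicate_sequence("born", date="1961"))
print(predicate_sequence("president", "occursSince", "2009-01"))
```

A fact without time information simply yields the one-element sequence containing its relation type, so facts with and without timestamps share the same input format.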

LSTMs for Time-Encoding Sequences
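A long short-term memory (LSTM) is a neural network architecture particularly suited for modeling sequential data. The equations defining an LSTM are
$$
\begin{aligned}
i &= \sigma_g(h_{n-1} U_i + x_n W_i) \\
f &= \sigma_g(h_{n-1} U_f + x_n W_f) \\
o &= \sigma_g(h_{n-1} U_o + x_n W_o) \\
g &= \sigma_c(h_{n-1} U_g + x_n W_g) \\
c_n &= f \circ c_{n-1} + i \circ g \\
h_n &= o \circ \sigma_h(c_n)
\end{aligned}
$$
where i, f, o and g are the input, forget, output and input modulation gates, respectively, and U and W are the recurrent and input weight matrices of each gate. c and h are the cell and hidden state, respectively. All vectors are in R^h. x_n ∈ R^d is the representation of the n-th element of a sequence. In this paper we set h = d. σ_g, σ_c and σ_h are activation functions.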
Each token of the input sequence p_seq is first mapped to its corresponding d-dimensional embedding via a linear layer, and the resulting sequence of embeddings is used as input to the LSTM. Each predicate sequence of length N is represented by the last hidden state of the LSTM, that is, e_pseq = h_N. The predicate sequence representation, which carries time information, can now be used in conjunction with subject and object embeddings in standard scoring functions. For instance, temporal-aware versions of TRANSE and DISTMULT, which we refer to as TA-TRANSE and TA-DISTMULT, have the following scoring functions for triples (s, p_seq, o):
$$\text{TA-TRANSE:} \quad f(s, p_{seq}, o) = \lVert e_s + e_{p_{seq}} - e_o \rVert_2$$
$$\text{TA-DISTMULT:} \quad f(s, p_{seq}, o) = (e_s \circ e_o)\, e_{p_{seq}}^{\top}$$
All parameters of the scoring functions are learned jointly with the parameters of the LSTMs using stochastic gradient descent.
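As a rough illustration of how a predicate sequence is turned into e_pseq and scored, the following dependency-free Python sketch runs an LSTM forward pass and evaluates both scoring functions. It makes several assumptions: weights are randomly initialized rather than trained, bias terms are omitted, σ_g is the sigmoid, σ_c and σ_h are linear, and all names (init_params, encode, ta_transe, ta_distmult) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def vec_mat(v, m):
    """Multiply a row vector v (length a) by a matrix m (a x b)."""
    return [sum(v[i] * m[i][j] for i in range(len(v))) for j in range(len(m[0]))]


def init_params(vocab, d, seed=0):
    """Random (untrained) token embeddings and gate matrices, for illustration only."""
    rng = random.Random(seed)

    def rand_matrix():
        return [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(d)]

    params = {"emb": {tok: [rng.uniform(-0.1, 0.1) for _ in range(d)] for tok in vocab}}
    for name in "ifog":
        params["U" + name] = rand_matrix()  # recurrent weights of gate `name`
        params["W" + name] = rand_matrix()  # input weights of gate `name`
    return params


def encode(p_seq, params, d):
    """Return the last hidden state h_N of the LSTM run over p_seq."""
    h, c = [0.0] * d, [0.0] * d
    for token in p_seq:
        x = params["emb"][token]
        pre = {
            name: [a + b for a, b in zip(vec_mat(h, params["U" + name]),
                                         vec_mat(x, params["W" + name]))]
            for name in "ifog"
        }
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        g = pre["g"]  # sigma_c assumed linear
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * ck for ok, ck in zip(o, c)]  # sigma_h assumed linear
    return h


def ta_transe(e_s, e_pseq, e_o):
    """Distance-based score: smaller values indicate a more plausible fact."""
    return math.sqrt(sum((s + p - o) ** 2 for s, p, o in zip(e_s, e_pseq, e_o)))


def ta_distmult(e_s, e_pseq, e_o):
    """Trilinear score: larger values indicate a more plausible fact."""
    return sum(s * o * p for s, p, o in zip(e_s, e_pseq, e_o))
```

In a trainable implementation the same computation would typically be expressed with an automatic-differentiation framework, so that the entity embeddings and the LSTM parameters receive gradients from a single loss.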
The advantages of character-level models for encoding time information in link prediction are: (1) the usage of digits and modifiers such as "since" as atomic tokens facilitates the transfer of information across similar timestamps and keeps the vocabulary small, leading to higher efficiency; (2) at test time, one can obtain a representation for a timestamp even if it is not part of the training set; (3) the model can use triples with and without temporal information as training data. Figure 1 illustrates the overall approach.

Experiments
We conducted experiments on four different KG completion data sets where a subset of the facts are augmented with time information.

Datasets
Integrated Crisis Early Warning System (ICEWS) is a repository that contains political events with a specific timestamp. These political events relate entities (e.g. countries, presidents) to a number of other entities via logical predicates (e.g. 'Make a visit' or 'Express intent to meet or negotiate'). Additional information can be found at http://www.icews.com/. The repository is organized in dumps that contain the events that occurred each year from 1995 to 2015. We created two temporal KGs out of this repository: i) a short-range version that contains all events in 2014, and ii) a long-range version that contains all events occurring between 2005 and 2015. We refer to these two data sets as ICEWS 2014 and ICEWS 2005-15, respectively. Due to the large number of entities, we selected a subset of the most frequently occurring entities in the graph and kept all facts where both the subject and object are part of this subset of entities. We split the facts into training, validation and test sets in a proportion of 80%/10%/10%, respectively. The protocol for the creation of these data sets is identical to the one followed in previous work (Bordes et al., 2013).
To create YAGO15K, we used FREEBASE15K (Bordes et al., 2013) (FB15K) as a blueprint. We aligned entities from FB15K to YAGO (Hoffart et al., 2013) with SAMEAS relations contained in a YAGO dump, and kept all facts involving those entities. Finally, we augmented this collection of facts with time information from the "yagoDateFacts" dump. Contrary to the ICEWS data sets, YAGO15K does contain temporal modifiers, namely 'occursSince' and 'occursUntil'. Contrary to previous work (Leblay and Chekol, 2018), all facts maintain time information at the same level of granularity as in the original dumps these data sets come from.
We also experimented with the temporal facts from the WIKIDATA data set extracted in (Leblay and Chekol, 2018) and available at http://staff.aist.go.jp/julien.leblay/datasets. Only information regarding the year is available for these facts, since the authors discarded information of finer granularity. All facts are framed in a time interval (i.e. they contain the temporal modifiers 'occursSince' and 'occursUntil'). Facts annotated with a single point in time are associated with that time point as both start and end time. Due to the large number of entities of this data set, which hinders the computation of standard KG completion metrics, we selected a subset of the most frequent entities and kept all facts where both the subject and object are part of this subset of entities. This set of filtered facts was split into training, validation and test sets in the same proportion as before. Table 2 lists some statistics of the temporal KGs. All four data sets, with their corresponding training, validation, and test splits, are available at https://github.com/nle-ml/mmkb.
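The entity filtering and 80%/10%/10% split described above can be sketched as follows. This is a hypothetical reconstruction, not the script used to build the released data sets: facts are assumed to be tuples whose first and third fields are subject and object, the number of retained entities is a free parameter, and the random shuffle with a fixed seed is our assumption.

```python
import random
from collections import Counter


def filter_and_split(facts, num_entities, seed=0):
    """Keep facts among the most frequent entities, then split 80/10/10."""
    counts = Counter()
    for fact in facts:
        counts[fact[0]] += 1  # subject occurrence
        counts[fact[2]] += 1  # object occurrence
    frequent = {entity for entity, _ in counts.most_common(num_entities)}

    # A fact survives only if both its subject and its object are frequent.
    kept = [f for f in facts if f[0] in frequent and f[2] in frequent]

    random.Random(seed).shuffle(kept)  # assumption: random split with fixed seed
    n_train = len(kept) * 8 // 10
    n_valid = len(kept) // 10
    train = kept[:n_train]
    valid = kept[n_train:n_train + n_valid]
    test = kept[n_train + n_valid:]
    return train, valid, test
```

Because the remaining facts after integer division all go to the test portion, the three parts always cover the filtered set exactly once.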

General Set-up
We evaluate various methods by their ability to answer completion queries where i) all the arguments of a fact are known except the subject entity, and ii) all the arguments of a fact are known except the object entity. For the former, we replace the subject by each of the KG's entities E in turn, sort the triples based on the scores returned by the different methods, and compute the rank of the correct entity. We repeat the same process for the second completion task and average the results. This is standard procedure in the KG completion literature. We also report the filtered setting as described in (Bordes et al., 2013). The mean of all computed ranks is the Mean Rank (lower is better) and the fraction of correct entities ranked in the top n is called hits@n (higher is better). We also compute the Mean Reciprocal Rank (higher is better), which is less susceptible to outliers.
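The ranking metrics can be made concrete with a short Python sketch. It assumes that a higher score means a more plausible candidate (distance-based scores such as TRANSE would be negated first), that ties are resolved in favor of the correct entity, and that the function names are illustrative only.

```python
def rank_of_correct(scores, correct, known=()):
    """Rank of the correct entity among all candidate entities.

    scores:  dict mapping each candidate entity to its score
    correct: the entity that completes the test fact
    known:   other entities known to complete the query correctly; they are
             skipped, which corresponds to the filtered setting
    """
    target = scores[correct]
    better = sum(
        1
        for candidate, score in scores.items()
        if candidate != correct and candidate not in known and score > target
    )
    return better + 1


def summarize(ranks, n=10):
    """Mean Rank, Mean Reciprocal Rank and hits@n over a list of ranks."""
    total = len(ranks)
    return {
        "MR": sum(ranks) / total,
        "MRR": sum(1.0 / r for r in ranks) / total,
        "hits@%d" % n: sum(1 for r in ranks if r <= n) / total,
    }
```

For example, a single rank of 1000 among otherwise perfect predictions shifts the Mean Rank substantially but changes the Mean Reciprocal Rank only slightly, which is why the latter is less sensitive to outliers.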
Recent work (Leblay and Chekol, 2018) evaluates different approaches for performing link prediction in temporal KGs. The approach that learns independent representations for each timestamp and uses these representations as translation vectors, similarly to (Bordes et al., 2013), leads to the best results. This approach is called VECTOR-BASED TTRANSE, though for the sake of simplicity we refer to it as TTRANSE in this paper. We compare our approaches TA-TRANSE and TA-DISTMULT against TTRANSE and the standard embedding methods TRANSE and DISTMULT. For all approaches, we used ADAM (Kingma and Ba, 2014) for parameter learning in a mini-batch setting with a learning rate of 0.001, the categorical cross-entropy (Kadlec et al., 2017) as loss function, and a maximum of 500 epochs. We validated every 20 epochs and stopped learning whenever the MRR values on the validation set decreased. The batch size was set to 512 and the number of negative samples to 500 for all experiments. The embedding size is d = 100. We apply dropout (Srivastava et al., 2014) to all embeddings and selected the dropout rate from the values {0, 0.4} on the validation set for all experiments. For TA-TRANSE and TA-DISTMULT, the gate activation function σ_g is the sigmoid function; σ_c and σ_h were chosen to be linear activation functions.
Figure 3 shows a comparison of the training loss of TRANSE and TA-TRANSE for YAGO15K. Under the same set-up, TA-TRANSE's ability to learn from time information leads to a training loss lower than that of TRANSE. Figure 2 shows a t-SNE (Maaten and Hinton, 2008) visualization of the embeddings learned for the predicate sequence p_seq = [playsFor, occursSince, date], where date corresponds to the date token sequence. This illustrates that the learned relation type embeddings carry temporal information.
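The categorical cross-entropy objective named in the set-up can be illustrated for a single training fact and its sampled negatives. The exact formulation is not spelled out in the text, so the sketch below assumes a softmax over the score of the true fact and the scores of its corrupted versions, with higher scores meaning more plausible (distance-based scores would be negated); the function name is hypothetical.

```python
import math


def cross_entropy_loss(pos_score, neg_scores):
    """Negative log-probability of the true fact under a softmax
    over the true fact and its negative samples."""
    logits = [pos_score] + list(neg_scores)
    # Subtract the maximum before exponentiating for numerical stability.
    max_logit = max(logits)
    log_partition = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return log_partition - pos_score


# With uninformative scores, the loss equals log(1 + number of negatives).
print(cross_entropy_loss(0.0, [0.0] * 500))
```

The loss decreases as the true fact is scored above its negatives and approaches zero when the margin becomes large.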

Conclusions
We propose a digit-level LSTM to learn representations for time-augmented KG facts that can be used in conjunction with existing scoring functions for link prediction. Experiments on four temporal knowledge graphs show the effectiveness of the approach.