TuckER: Tensor Factorization for Knowledge Graph Completion

Knowledge graphs are structured representations of real world facts. However, they typically contain only a small subset of all possible facts. Link prediction is the task of inferring missing facts based on existing ones. We propose TuckER, a relatively straightforward but powerful linear model based on Tucker decomposition of the binary tensor representation of knowledge graph triples. TuckER outperforms previous state-of-the-art models across standard link prediction datasets, acting as a strong baseline for more elaborate models. We show that TuckER is a fully expressive model, derive sufficient bounds on its embedding dimensionalities and demonstrate that several previously introduced linear models can be viewed as special cases of TuckER.


Introduction
Vast amounts of information available in the world can be represented succinctly as entities and relations between them. Knowledge graphs are large, graph-structured databases which store facts in triple form (e s , r, e o ), with e s and e o representing subject and object entities and r a relation. However, far from all available information is currently stored in existing knowledge graphs and manually adding new information is costly, which creates the need for algorithms that are able to automatically infer missing facts.
Knowledge graphs can be represented as a third-order binary tensor, where each element corresponds to a triple, 1 indicating a true fact and 0 indicating the unknown (either a false or a missing fact). The task of link prediction is to predict whether two entities are related, based on known facts already present in a knowledge graph, i.e. to infer which of the 0 entries in the tensor are indeed false, and which are missing but actually true.
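The binary tensor view can be illustrated with a few lines of NumPy (entity and relation names here are invented purely for the example):

```python
import numpy as np

# Toy knowledge graph: 3 entities, 2 relations (names are illustrative only).
entities = {"alice": 0, "bob": 1, "carol": 2}
relations = {"knows": 0, "parent_of": 1}

# Known true triples (subject, relation, object).
triples = [("alice", "knows", "bob"), ("bob", "parent_of", "carol")]

# Third-order binary tensor X[s, r, o]: 1 = known true fact,
# 0 = unknown (either false or missing).
X = np.zeros((len(entities), len(relations), len(entities)))
for s, r, o in triples:
    X[entities[s], relations[r], entities[o]] = 1.0

# Link prediction asks which of the 0 entries are actually true.
```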
A large number of approaches to link prediction so far have been linear, based on various methods of factorizing the third-order binary tensor (Nickel et al., 2011; Yang et al., 2015; Trouillon et al., 2016; Kazemi and Poole, 2018). Recently, state-of-the-art results have been achieved using non-linear convolutional models (Dettmers et al., 2018; Balažević et al., 2019). Despite achieving very good performance, the fundamental problem with deep, non-linear models is that they are non-transparent and poorly understood, as opposed to more mathematically principled and widely studied tensor decomposition models.
In this paper, we introduce TuckER (E stands for entities, R for relations), a straightforward linear model for link prediction on knowledge graphs, based on Tucker decomposition (Tucker, 1966) of the binary tensor of triples, acting as a strong baseline for more elaborate models. Tucker decomposition, used widely in machine learning (Schein et al., 2016; Ben-Younes et al., 2017; Yang and Hospedales, 2017), factorizes a tensor into a core tensor multiplied by a matrix along each mode. It can be thought of as a form of higher-order SVD in the special case where matrices are orthogonal and the core tensor is "all-orthogonal" (Kroonenberg and De Leeuw, 1980). In our case, rows of the matrices contain entity and relation embeddings, while entries of the core tensor determine the level of interaction between them. Subject and object entity embedding matrices are assumed equivalent, i.e. we make no distinction between the embeddings of an entity depending on whether it appears as a subject or as an object in a particular triple. Due to the low rank of the core tensor, TuckER benefits from multi-task learning by parameter sharing across relations. A link prediction model should have enough expressive power to represent all relation types (e.g. symmetric, asymmetric, transitive). We thus show that TuckER is fully expressive, i.e. given any ground truth over the triples, there exists an assignment of values to the entity and relation embeddings that accurately separates the true triples from false ones. We also derive a dimensionality bound which guarantees full expressiveness.
In summary, the key contributions of this paper are:
• proposing TuckER, a new linear model for link prediction on knowledge graphs, that is simple, expressive and achieves state-of-the-art results across all standard datasets;
• proving that TuckER is fully expressive and deriving a bound on the embedding dimensionality for full expressiveness; and
• showing how TuckER subsumes several previously proposed tensor factorization approaches to link prediction.

Related Work
Several linear models for link prediction have previously been proposed: RESCAL (Nickel et al., 2011) optimizes a scoring function containing a bilinear product between subject and object entity vectors and a full rank relation matrix. Although a very expressive and powerful model, RESCAL is prone to overfitting due to its large number of parameters, which grows quadratically in the entity embedding dimensionality for every relation in a knowledge graph.
DistMult (Yang et al., 2015) is a special case of RESCAL with a diagonal matrix per relation, which reduces overfitting. However, the linear transformation performed on entity embedding vectors in DistMult is limited to a stretch. The binary tensor learned by DistMult is symmetric in the subject and object entity modes, and thus DistMult cannot model asymmetric relations.
ComplEx (Trouillon et al., 2016) extends DistMult to the complex domain; by representing the object entity embedding as the complex conjugate in the scoring function, it introduces asymmetry into the factorization and can thus model asymmetric relations. SimplE (Kazemi and Poole, 2018) is based on Canonical Polyadic (CP) decomposition (Hitchcock, 1927), in which subject and object entity embeddings for the same entity are independent (note that DistMult is a special case of CP). SimplE's scoring function alters CP to make subject and object entity embedding vectors dependent on each other by computing the average of two terms, the first of which is a bilinear product of the subject entity head embedding, relation embedding and object entity tail embedding, and the second a bilinear product of the object entity head embedding, inverse relation embedding and subject entity tail embedding.
Recently, state-of-the-art results have been achieved with non-linear models: ConvE (Dettmers et al., 2018) performs a global 2D convolution operation on the subject entity and relation embedding vectors, after they are reshaped to matrices and concatenated. The obtained feature maps are flattened, transformed through a linear layer, and the inner product is taken with all object entity vectors to generate a score for each triple. Whilst results achieved by ConvE are impressive, its reshaping and concatenating of vectors, as well as using 2D convolution on word embeddings, is unintuitive. HypER (Balažević et al., 2019) is a simplified convolutional model that uses a hypernetwork to generate 1D convolutional filters for each relation, extracting relation-specific features from subject entity embeddings. The authors show that convolution is a way of introducing sparsity and parameter tying, and that HypER can be understood in terms of tensor factorization up to a non-linearity, thus placing HypER closer to the well-established family of factorization models. The drawback of HypER is that it sets most elements of the core weight tensor to 0, which amounts to hard regularization, rather than letting the model learn which parameters to use via soft regularization.
Scoring functions of all models described above and TuckER are summarized in Table 1.

Background
Let E denote the set of all entities and R the set of all relations present in a knowledge graph. A triple is represented as (e s , r, e o ), with e s , e o ∈ E denoting subject and object entities respectively and r ∈ R the relation between them.

Link Prediction
In link prediction, we are given a subset of all true triples and the aim is to learn a scoring function φ

Model                              Scoring Function                                                    Relation Parameters    Space Complexity
RESCAL (Nickel et al., 2011)       e_s^T W_r e_o                                                       W_r ∈ R^(d_e^2)        O(n_e d_e + n_r d_e^2)
DistMult (Yang et al., 2015)       <e_s, w_r, e_o>                                                     w_r ∈ R^(d_e)          O(n_e d_e + n_r d_e)
ComplEx (Trouillon et al., 2016)   Re(<e_s, w_r, ē_o>)                                                 w_r ∈ C^(d_e)          O(n_e d_e + n_r d_e)
ConvE (Dettmers et al., 2018)      f(vec(f([ē_s; w̄_r] * w))W) e_o                                     w_r ∈ R^(d_r)          O(n_e d_e + n_r d_r)
SimplE (Kazemi and Poole, 2018)    1/2 (<h_{e_s}, w_r, t_{e_o}> + <h_{e_o}, w_{r^-1}, t_{e_s}>)        w_r ∈ R^(d_e)          O(n_e d_e + n_r d_e)
HypER (Balažević et al., 2019)     f(vec(e_s * vec^-1(w_r H))W) e_o                                    w_r ∈ R^(d_r)          O(n_e d_e + n_r d_r)
TuckER (ours)                      W ×_1 e_s ×_2 w_r ×_3 e_o                                           w_r ∈ R^(d_r)          O(n_e d_e + n_r d_r)

Table 1: Scoring functions of state-of-the-art link prediction models, the dimensionality of their relation parameters, and significant terms of their space complexity. d_e and d_r are the dimensionalities of entity and relation embeddings, while n_e and n_r denote the number of entities and relations respectively. ē_o ∈ C^(d_e) is the complex conjugate of e_o; ē_s and w̄_r ∈ R^(d_w×d_h) denote a 2D reshaping of e_s and w_r respectively; h_{e_s}, t_{e_s} ∈ R^(d_e) are the head and tail entity embeddings of entity e_s; and w_{r^-1} ∈ R^(d_r) is the embedding of relation r^-1 (the inverse of relation r). * is the convolution operator, <·> denotes the dot product, ×_n denotes the tensor product along the n-th mode, f is a non-linear function, and W ∈ R^(d_e×d_r×d_e) is the core tensor of the Tucker decomposition.
that assigns a score s = φ(e_s, r, e_o) ∈ R indicating whether a triple is true, with the ultimate goal of being able to correctly score all missing triples. The scoring function is either a specific form of tensor factorization in the case of linear models, or a more complex (deep) neural network architecture for non-linear models. Typically, a positive score for a particular triple indicates a true fact predicted by the model, while a negative score indicates a false one. With most recent models, a non-linearity such as the logistic sigmoid is applied to the score to give a corresponding probability p = σ(s) ∈ [0, 1] that the fact is true.

Tucker Decomposition
Tucker decomposition, named after Ledyard R. Tucker (Tucker, 1964), decomposes a tensor into a set of matrices and a smaller core tensor. In the three-mode case, given the original tensor X ∈ R^(I×J×K), Tucker decomposition outputs a core tensor Z ∈ R^(P×Q×R) and three matrices A ∈ R^(I×P), B ∈ R^(J×Q), C ∈ R^(K×R):

X ≈ Z ×_1 A ×_2 B ×_3 C,  (1)

with ×_n indicating the tensor product along the n-th mode. Factor matrices A, B and C, when orthogonal, can be thought of as the principal components in each mode. Elements of the core tensor Z show the level of interaction between the different components. Typically, P, Q and R are smaller than I, J and K respectively, so Z can be thought of as a compressed version of X. Tucker decomposition is not unique, i.e. we can transform Z without affecting the fit if we apply the inverse transformation to A, B and C (Kolda and Bader, 2009).
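A minimal NumPy sketch of the three-mode Tucker product (shapes and values are illustrative only):

```python
import numpy as np

def mode_n_product(T, M, n):
    """Tensor-matrix product along mode n: contracts T's n-th axis with M's columns."""
    return np.moveaxis(np.tensordot(T, M, axes=(n, 1)), -1, n)

I, J, K = 4, 3, 4   # dimensions of the original tensor
P, Q, R = 2, 2, 2   # (smaller) dimensions of the core tensor
rng = np.random.default_rng(0)
Z = rng.normal(size=(P, Q, R))   # core tensor
A = rng.normal(size=(I, P))      # factor matrix, mode 1
B = rng.normal(size=(J, Q))      # factor matrix, mode 2
C = rng.normal(size=(K, R))      # factor matrix, mode 3

# X ≈ Z ×_1 A ×_2 B ×_3 C
X = mode_n_product(mode_n_product(mode_n_product(Z, A, 0), B, 1), C, 2)
```

The chain of mode-n products is equivalent to a single `einsum` contraction, which is a convenient correctness check.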

Tucker Decomposition for Link Prediction
We propose a model that uses Tucker decomposition for link prediction on the binary tensor representation of a knowledge graph, with an entity embedding matrix E that is shared between subject and object entities, i.e. E = A = C ∈ R^(n_e×d_e), and a relation embedding matrix R = B ∈ R^(n_r×d_r), where n_e and n_r represent the number of entities and relations and d_e and d_r the dimensionality of entity and relation embedding vectors. We define the scoring function for TuckER as:

φ(e_s, r, e_o) = W ×_1 e_s ×_2 w_r ×_3 e_o,  (2)

where e_s, e_o ∈ R^(d_e) are rows of E representing the subject and object entity embedding vectors, w_r ∈ R^(d_r) is a row of R representing the relation embedding vector, and W ∈ R^(d_e×d_r×d_e) is the core tensor. We apply the logistic sigmoid to each score φ(e_s, r, e_o) to obtain the predicted probability p of a triple being true. A visualization of the TuckER architecture can be seen in Figure 1. As proven in Section 5.1, TuckER is fully expressive. Further, its number of parameters increases linearly with respect to entity and relation embedding dimensionality d_e and d_r as the number of entities and relations increases, since the number of parameters of W depends only on the entity and relation embedding dimensionality and not on the number of entities or relations. By having the core tensor W, unlike simpler models such as DistMult, ComplEx and SimplE, TuckER does not encode all the learned knowledge into the embeddings; some is stored in the core tensor and shared between all entities and relations through multi-task learning.
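The scoring function can be sketched in a few lines of NumPy (randomly initialized toy embeddings, not a trained model; the paper's actual implementation is in PyTorch):

```python
import numpy as np

rng = np.random.default_rng(0)
n_e, n_r, d_e, d_r = 5, 3, 4, 2
E = rng.normal(size=(n_e, d_e))       # shared entity embedding matrix
R = rng.normal(size=(n_r, d_r))       # relation embedding matrix
W = rng.normal(size=(d_e, d_r, d_e))  # core tensor

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def score(s, r, o):
    # phi(e_s, r, e_o) = W ×_1 e_s ×_2 w_r ×_3 e_o
    return np.einsum('abc,a,b,c->', W, E[s], R[r], E[o])

p = sigmoid(score(0, 1, 2))  # predicted probability of the triple being true
```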
Rather than learning distinct relation-specific matrices, the core tensor of TuckER can be viewed as containing a shared pool of "prototype" relation matrices, which are linearly combined according to the parameters in each relation embedding.

Training
Since the logistic sigmoid is applied to the scoring function to approximate the true binary tensor, the implicit underlying tensor consists of −∞ and ∞ entries. As this prevents an explicit analytical factorization, we use numerical methods to train TuckER. We use the standard data augmentation technique, first used by Dettmers et al. (2018) and formally described by Lacroix et al. (2018), of adding reciprocal relations for every triple in the dataset, i.e. we add (e_o, r^-1, e_s) for every (e_s, r, e_o). Following the training procedure introduced by Dettmers et al. (2018) to speed up training, we use 1-N scoring, i.e. we simultaneously score entity-relation pairs (e_s, r) and (e_o, r^-1) with all entities e_o ∈ E and e_s ∈ E respectively, in contrast to 1-1 scoring, where individual triples (e_s, r, e_o) and (e_o, r^-1, e_s) are trained one at a time. The model is trained to minimize the Bernoulli negative log-likelihood loss function. A component of the loss for one entity-relation pair with all other entities is defined as:

L = -(1/n_e) Σ_{i=1}^{n_e} ( y^(i) log(p^(i)) + (1 − y^(i)) log(1 − p^(i)) ),  (3)

where p ∈ R^(n_e) is the vector of predicted probabilities and y ∈ R^(n_e) is the binary label vector.
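A sketch of 1-N scoring and the loss in Equation 3 for a single entity-relation pair (toy dimensions and untrained random parameters, purely for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
n_e, d_e, d_r = 6, 4, 3
E = rng.normal(size=(n_e, d_e))       # entity embeddings
W = rng.normal(size=(d_e, d_r, d_e))  # core tensor
w_r = rng.normal(size=d_r)            # one relation embedding
e_s = E[0]                            # one subject entity embedding

# 1-N scoring: score (e_s, r) against every candidate object at once.
W_r = np.einsum('abc,b->ac', W, w_r)  # relation-specific matrix
scores = (e_s @ W_r) @ E.T            # one score per candidate entity
p = 1.0 / (1.0 + np.exp(-scores))     # predicted probabilities

# Bernoulli negative log-likelihood over all candidate objects.
y = np.zeros(n_e)
y[2] = 1.0                            # entity 2 is the true object here
eps = 1e-12                           # numerical stability
loss = -np.mean(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
```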

Full Expressiveness and Embedding Dimensionality
A tensor factorization model is fully expressive if for any ground truth over all entities and relations, there exist entity and relation embeddings that accurately separate true triples from false ones. As shown in (Trouillon et al., 2017), ComplEx is fully expressive with the embedding dimensionality bound d_e = d_r = n_e · n_r. Similarly to ComplEx, Kazemi and Poole (2018) show that SimplE is fully expressive with entity and relation embeddings of size d_e = d_r = min(n_e · n_r, γ + 1), where γ represents the number of true facts. They further prove other models are not fully expressive: DistMult, because it cannot model asymmetric relations; and translational models such as TransE (Bordes et al., 2013) and its variants FTransE (Feng et al., 2016) and STransE (Nguyen et al., 2016), because of certain contradictions that they impose between different relation types. By Theorem 1, we establish the bound on entity and relation embedding dimensionality (i.e. decomposition rank) that guarantees full expressiveness of TuckER.
Theorem 1. Given any ground truth over a set of entities E and relations R, there exists a TuckER model with entity embeddings of dimensionality d e = n e and relation embeddings of dimensionality d r = n r , where n e = |E| is the number of entities and n r = |R| the number of relations, that accurately represents that ground truth.
Proof. Let e_s and e_o be the n_e-dimensional one-hot binary vector representations of subject and object entities e_s and e_o respectively, and w_r the n_r-dimensional one-hot binary vector representation of relation r. For each subject entity e_s^(i), relation r^(j) and object entity e_o^(k), we let the i-th, j-th and k-th element respectively of the corresponding vectors e_s, w_r and e_o be 1 and all other elements 0. Further, we set the ijk-th element of the tensor W ∈ R^(n_e×n_r×n_e) to 1 if the fact (e_s, r, e_o) holds and −1 otherwise. Thus the product of the entity embeddings and the relation embedding with the core tensor, after applying the logistic sigmoid, accurately represents the original tensor.
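The construction in the proof can be verified numerically on a toy ground truth (sizes are arbitrary; this is a sketch, not part of the model):

```python
import numpy as np

n_e, n_r = 3, 2
# An arbitrary ground truth over all possible triples.
truth = np.zeros((n_e, n_r, n_e), dtype=bool)
truth[0, 0, 1] = truth[1, 1, 2] = True

# One-hot embeddings of dimensionality d_e = n_e and d_r = n_r.
E = np.eye(n_e)
R = np.eye(n_r)
# Core tensor: +1 where the fact holds, -1 otherwise.
W = np.where(truth, 1.0, -1.0)

scores = np.einsum('abc,ia,jb,kc->ijk', W, E, R, E)
probs = 1.0 / (1.0 + np.exp(-scores))
# Thresholding the sigmoid output at 0.5 recovers the ground truth exactly.
assert np.array_equal(probs > 0.5, truth)
```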
The purpose of Theorem 1 is to prove that TuckER is capable of potentially capturing all information (and noise) in the data. In practice however, we expect the embedding dimensionalities needed for full reconstruction of the underlying binary tensor to be much smaller than the bound stated above, since the assignment of values to the tensor is not random but follows a certain structure; otherwise nothing unknown could be predicted. Even more so, low decomposition rank is actually a desired property of any bilinear link prediction model, forcing it to learn that structure and generalize to new data, rather than simply memorizing the input. In general, we expect TuckER to perform better than ComplEx and SimplE with embeddings of lower dimensionality due to parameter sharing in the core tensor (shown empirically in Section 6.4), which could be of importance for efficiency in downstream tasks.

Relation to Previous Linear Models
Several previous tensor factorization models can be viewed as special cases of TuckER:

RESCAL (Nickel et al., 2011) Following the notation introduced in Section 3.2, the RESCAL factorization has the form:

X ≈ Z ×_1 A ×_3 C,  (4)

which corresponds to Equation 1 with I = K = n_e, P = R = d_e, Q = J = n_r and B = I_J, the J × J identity matrix. This is also known as Tucker2 decomposition (Kolda and Bader, 2009). As is the case with TuckER, the entity embedding matrix of RESCAL is shared between subject and object entities, i.e. E = A = C ∈ R^(n_e×d_e), and the relation matrices W_r ∈ R^(d_e×d_e) are the d_e × d_e slices of the core tensor Z. As mentioned in Section 2, the drawback of RESCAL compared to TuckER is that its number of parameters grows quadratically in the entity embedding dimensionality as the number of relations increases.

DistMult (Yang et al., 2015) The scoring function of DistMult (see Table 1) can be viewed as equivalent to that of TuckER (see Equation 1) with a core tensor Z ∈ R^(P×Q×R), P = Q = R = d_e, which is superdiagonal with 1s on the superdiagonal, i.e. all elements z_pqr with p = q = r are 1 and all other elements are 0 (as shown in Figure 2a). Rows of E = A = C ∈ R^(n_e×d_e) contain subject and object entity embedding vectors e_s, e_o ∈ R^(d_e) and rows of R = B ∈ R^(n_r×d_e) contain relation embedding vectors w_r ∈ R^(d_e). It is interesting to note that the TuckER interpretation of the DistMult scoring function, given that matrices A and C are identical, can alternatively be interpreted as a special case of CP decomposition (Hitchcock, 1927), since Tucker decomposition with a superdiagonal core tensor is equivalent to CP decomposition. Due to the enforced symmetry in the subject and object entity modes, DistMult cannot learn to represent asymmetric relations.
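The DistMult special case can be checked numerically: with a superdiagonal core tensor, the Tucker contraction collapses to the three-way dot product <e_s, w_r, e_o> (values below are illustrative random vectors):

```python
import numpy as np

rng = np.random.default_rng(2)
d_e = 4
e_s, w_r, e_o = rng.normal(size=(3, d_e))  # toy subject, relation, object vectors

# Superdiagonal core tensor: z_pqr = 1 iff p == q == r.
Z = np.zeros((d_e, d_e, d_e))
for i in range(d_e):
    Z[i, i, i] = 1.0

tucker_score = np.einsum('pqr,p,q,r->', Z, e_s, w_r, e_o)
distmult_score = np.sum(e_s * w_r * e_o)   # <e_s, w_r, e_o>
assert np.isclose(tucker_score, distmult_score)
```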
ComplEx (Trouillon et al., 2016) Bilinear models represent subject and object entity embeddings as vectors e_s, e_o ∈ R^(d_e), a relation as a matrix W_r ∈ R^(d_e×d_e), and the scoring function as a bilinear product φ(e_s, r, e_o) = e_s^T W_r e_o. It is trivial to show that both RESCAL and DistMult belong to the family of bilinear models. As explained by Kazemi and Poole (2018), ComplEx can be considered a bilinear model with the real and imaginary parts of an embedding for each entity concatenated in a single vector, [Re(e_s); Im(e_s)] ∈ R^(2d_e) for the subject, [Re(e_o); Im(e_o)] ∈ R^(2d_e) for the object, and a relation matrix W_r ∈ R^(2d_e×2d_e) constrained so that its leading diagonal contains duplicated elements of Re(w_r), its d_e-diagonal the elements of Im(w_r) and its −d_e-diagonal the elements of −Im(w_r), with all other elements set to 0, where d_e and −d_e represent offsets from the leading diagonal.
Similarly to DistMult, we can regard the scoring function of ComplEx (see Table 1) as equivalent to the scoring function of TuckER (see Equation 1), with core tensor Z ∈ R P ×Q×R , P = Q = R = 2d e , where 3d e elements on different tensor diagonals are set to 1, d e elements on one tensor diagonal are set to -1 and all other elements are set to 0 (see Figure 2b). This shows that the scoring function of ComplEx, which computes a bilinear product with complex entity and relation embeddings and disregards the imaginary part of the obtained result, is equivalent to a hard regularization of the core tensor of TuckER in the real domain.
SimplE (Kazemi and Poole, 2018) The authors show that SimplE belongs to the family of bilinear models by concatenating embeddings for head and tail entities for both subject and object into vectors [h es ; t es ] ∈ R 2de and [h eo ; t eo ] ∈ R 2de and constraining the relation matrix W r ∈ R 2de×2de so that it contains the relation embedding vector 1 2 w r on its d e -diagonal and the inverse relation embedding vector 1 2 w r −1 on its -d e -diagonal and 0s elsewhere. The SimplE scoring function (see Table 1) is therefore equivalent to that of TuckER (see Equation 1), with core tensor Z ∈ R P ×Q×R , P = Q = R = 2d e , where 2d e elements on two tensor diagonals are set to 1 2 and all other elements are set to 0 (see Figure 2c).
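The SimplE correspondence can likewise be checked numerically. The exact placement of the two half-diagonals below follows the description above (1/2 w_r on the d_e-diagonal, 1/2 w_{r^-1} on the −d_e-diagonal); the index convention is an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
d = 3
h_s, t_s, h_o, t_o = rng.normal(size=(4, d))  # head/tail embeddings per entity
w_r, w_rinv = rng.normal(size=(2, d))         # relation and inverse-relation embeddings

e_s = np.concatenate([h_s, t_s])              # subject entity embedding [h; t]
e_o = np.concatenate([h_o, t_o])              # object entity embedding [h; t]
w = np.concatenate([w_r, w_rinv])             # relation embedding

# Core tensor: 1/2 on two tensor diagonals, 0 elsewhere.
Z = np.zeros((2 * d, 2 * d, 2 * d))
for i in range(d):
    Z[i, i, d + i] = 0.5          # pairs h_s, w_r, t_o
    Z[d + i, d + i, i] = 0.5      # pairs t_s, w_{r^-1}, h_o

tucker_score = np.einsum('pqr,p,q,r->', Z, e_s, w, e_o)
simple_score = 0.5 * (np.sum(h_s * w_r * t_o) + np.sum(h_o * w_rinv * t_s))
assert np.isclose(tucker_score, simple_score)
```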

Representing Asymmetric Relations
Each relation in a knowledge graph can be characterized by a certain set of properties, such as symmetry, reflexivity, transitivity. So far, there have been two possible ways in which linear link prediction models introduce asymmetry into factorization of the binary tensor of triples:
• distinct (although possibly related) embeddings for subject and object entities and a diagonal matrix (or equivalently a vector) for each relation, as is the case with models such as ComplEx and SimplE; or
• equivalent subject and object entity embeddings and each relation represented by a full rank matrix, which is the case with RESCAL.
The latter approach appears more intuitive, since asymmetry is a property of the relation, rather than the entities. However, its drawback is quadratic growth in the number of parameters with the number of relations, which often leads to overfitting, especially for relations with a small number of training triples. TuckER overcomes this by representing relations as vectors w_r, which makes the number of parameters grow linearly with the number of relations, while still keeping the desirable property of allowing relations to be asymmetric by having an asymmetric, relation-agnostic core tensor W, rather than encoding the relation-specific information in the entity embeddings. Multiplying W ∈ R^(d_e×d_r×d_e) with w_r ∈ R^(d_r) along the second mode, we obtain a full rank relation-specific matrix W_r ∈ R^(d_e×d_e), which can perform all possible linear transformations on the entity embeddings, i.e. rotation, reflection or stretch, and is thus also capable of modeling asymmetry. Regardless of what kind of transformation is needed for modeling a particular relation, TuckER can learn it from the data.
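This contraction can be sketched in NumPy (toy dimensions, random values): multiplying the core tensor with a relation embedding along the second mode yields a dense, generally asymmetric relation-specific matrix.

```python
import numpy as np

rng = np.random.default_rng(4)
d_e, d_r = 4, 2
W = rng.normal(size=(d_e, d_r, d_e))  # relation-agnostic core tensor
w_r = rng.normal(size=d_r)            # one relation embedding

# W ×_2 w_r: a linear combination of the d_r "prototype" relation matrices.
W_r = np.einsum('abc,b->ac', W, w_r)
assert W_r.shape == (d_e, d_e)

# W_r is generally not symmetric, so swapping subject and object changes
# the score: the model can represent asymmetric relations.
e1, e2 = rng.normal(size=(2, d_e))
assert not np.isclose(e1 @ W_r @ e2, e2 @ W_r @ e1)
```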
To demonstrate this, we show sample heatmaps of learned relation matrices W_r for the symmetric WordNet relation "derivationally related form" and the asymmetric relation "hypernym" in Figure 3, where one can see that TuckER learns to model the symmetric relation with a relation matrix that is approximately symmetric about the main diagonal, whereas the matrix belonging to the asymmetric relation exhibits no obvious structure.


Experiments and Results

Datasets
We evaluate TuckER using four standard link prediction datasets (see Table 2): FB15k (Bordes et al., 2013) is a subset of Freebase, a large database of real world facts. FB15k-237 (Toutanova et al., 2015) was created from FB15k by removing the inverse of many relations that are present in the training set from validation and test sets, making it more difficult for simple models to do well. WN18 (Bordes et al., 2013) is a subset of WordNet, a hierarchical database containing lexical relations between words. WN18RR (Dettmers et al., 2018) is a subset of WN18, created by removing the inverse relations from validation and test sets.

Implementation and Experiments
We implement TuckER in PyTorch (Paszke et al., 2017) and make our code available on GitHub. 1 We choose all hyper-parameters by random search based on validation set performance. For FB15k and FB15k-237, we set entity and relation embedding dimensionality to d_e = d_r = 200. For WN18 and WN18RR, which both contain a significantly smaller number of relations relative to the number of entities, as well as fewer relations overall than FB15k and FB15k-237, we set d_e = 200 and d_r = 30. We use batch normalization (Ioffe and Szegedy, 2015) and dropout (Srivastava et al., 2014) to speed up training. We find that lower dropout values (0.1, 0.2) are required for datasets with a higher number of training triples per relation and thus less risk of overfitting (WN18 and WN18RR), whereas higher dropout values (0.3, 0.4, 0.5) are required for FB15k and FB15k-237. We choose the learning rate from {0.01, 0.005, 0.003, 0.001, 0.0005} and learning rate decay from {1, 0.995, 0.99}. We find the following combinations of learning rate and learning rate decay to give the best results: (0.003, 0.99) for FB15k, (0.0005, 1.0) for FB15k-237, (0.005, 0.995) for WN18 and (0.01, 1.0) for WN18RR (see Table 5 in Appendix A for a complete list of hyper-parameter values on each dataset). We train the model using Adam (Kingma and Ba, 2015) with a batch size of 128.
At evaluation time, for each test triple we generate n_e candidate triples by combining the test entity-relation pair with all possible entities E, ranking the obtained scores. We use the filtered setting (Bordes et al., 2013), i.e. all known true triples are removed from the candidate set except for the current test triple. We use evaluation metrics standard across the link prediction literature: mean reciprocal rank (MRR) and hits@k, k ∈ {1, 3, 10}. Mean reciprocal rank is the average of the inverse of the rank assigned to the true triple, over all test triples. Hits@k measures the percentage of times the true triple is ranked within the top k candidate triples.
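The two metrics can be computed as follows, given the filtered ranks of the true triples (a minimal illustration, not the paper's evaluation code):

```python
import numpy as np

def mrr_and_hits(ranks, ks=(1, 3, 10)):
    """Filtered ranks of the true triples -> MRR and hits@k."""
    ranks = np.asarray(ranks, dtype=float)
    mrr = np.mean(1.0 / ranks)                   # average inverse rank
    hits = {k: np.mean(ranks <= k) for k in ks}  # fraction ranked in top k
    return mrr, hits

# Example: true triples ranked 1st, 2nd and 5th among their candidates.
mrr, hits = mrr_and_hits([1, 2, 5])
# mrr = (1 + 1/2 + 1/5) / 3
```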

Link Prediction Results
Link prediction results on all datasets are shown in Tables 3 and 4. Overall, TuckER outperforms previous state-of-the-art models on all metrics across all datasets (apart from hits@10 on WN18, where a non-linear model, R-GCN, does better). Results achieved by TuckER are not only better than those of other linear models, such as DistMult, ComplEx and SimplE, but also better than the results of many more complex deep neural network and reinforcement learning architectures, e.g. R-GCN, MINERVA, ConvE and HypER, demonstrating the expressive power of linear models and supporting our claim that simple linear models should serve as a baseline before moving onto more elaborate models. Even with fewer parameters than ComplEx and SimplE at d_e = 200 and d_r = 30 on WN18RR (∼9.4 vs ∼16.4 million), TuckER consistently obtains better results than any of those models. We believe this is because TuckER exploits knowledge sharing between relations through the core tensor, i.e. multi-task learning. This is supported by the fact that the margin by which TuckER outperforms other linear models is notably increased on datasets with a large number of relations. For example, improvement on FB15k is +14% over ComplEx and +8% over SimplE on the toughest hits@1 metric. To our knowledge, ComplEx-N3 (Lacroix et al., 2018) is the only other linear link prediction model that benefits from multi-task learning. There, rank regularization of the embedding matrices is used to encourage a low-rank factorization, thus forcing parameter sharing between relations. We do not include their published results in Tables 3 and 4, since they use the highly non-standard d_e = d_r = 2000 and thus a far larger number of parameters (18x more parameters than TuckER on WN18RR; 5.5x on FB15k-237), making their results incomparable to those typically reported, including our own.
However, running their model with a number of parameters equivalent to TuckER's shows comparable performance, supporting our belief that the two models both attain the benefits of multi-task learning, although by different means.

Influence of Parameter Sharing
The ability to share knowledge through the core tensor suggests that TuckER should need fewer parameters than ComplEx or SimplE to obtain good results. To test this, we re-implement ComplEx and SimplE with reciprocal relations, 1-N scoring, batch normalization and dropout for fair comparison, perform random search to choose the best hyper-parameters (see Table 6 in Appendix A for the exact hyper-parameter values used) and train all three models on FB15k-237 with embedding sizes d_e = d_r ∈ {20, 50, 100, 200}. Figure 4 shows the obtained MRR on the test set for each model. It is important to note that at embedding dimensionalities 20, 50 and 100, TuckER has fewer parameters than ComplEx and SimplE (e.g. ComplEx and SimplE have ∼3 million and TuckER has ∼2.5 million parameters at embedding dimensionality 100). We can see that the difference between the MRRs of ComplEx, SimplE and TuckER is approximately constant for embedding sizes 100 and 200. However, for lower embedding sizes, the difference in MRR increases: by 0.7% at embedding size 50 and by 4.2% at embedding size 20 for ComplEx, and by 3% at embedding size 50 and by 9.9% at embedding size 20 for SimplE. At embedding size 20 (∼300k parameters), the performance of TuckER is almost as good as the performance of ComplEx and SimplE at embedding size 200 (∼6 million parameters), which supports our initial assumption.

Conclusion
In this work, we introduce TuckER, a relatively straightforward linear model for link prediction on knowledge graphs, based on the Tucker decomposition of a binary tensor of known facts. TuckER achieves state-of-the-art results on standard link prediction datasets, in part due to its ability to perform multi-task learning across relations. Whilst being fully expressive, TuckER's number of parameters grows linearly with the number of entities and relations in the knowledge graph. We further show that previous linear state-of-the-art models, RESCAL, DistMult, ComplEx and SimplE, can be interpreted as special cases of our model. Future work might include exploring how to incorporate background knowledge on individual relation properties into the existing model.


Appendix A

Table 5 shows the best performing hyper-parameter values for TuckER across all datasets, where lr denotes learning rate, dr decay rate, ls label smoothing, and d#k, k ∈ {1, 2, 3}, the dropout values applied to the subject entity embedding, the relation matrix, and the subject entity embedding after it has been transformed by the relation matrix, respectively.