On Evaluating Embedding Models for Knowledge Base Completion

Knowledge graph embedding models have recently received significant attention in the literature. These models learn latent semantic representations for the entities and relations in a given knowledge base; the representations can be used to infer missing knowledge. In this paper, we study the question of how well recent embedding models perform for the task of knowledge base completion, i.e., the task of inferring new facts from an incomplete knowledge base. We argue that the entity ranking protocol, which is currently used to evaluate knowledge graph embedding models, is not suitable to answer this question since only a subset of the model predictions are evaluated. We propose an alternative entity-pair ranking protocol that considers all model predictions as a whole and is thus more suitable to the task. We conducted an experimental study on standard datasets and found that the performance of popular embedding models was unsatisfactory under the new protocol, even on datasets that are generally considered to be too easy. Moreover, we found that a simple rule-based model often provided superior performance. Our findings suggest that there is a need for more research into embedding models as well as their training strategies for the task of knowledge base completion.


Introduction
A knowledge base (KB) is a collection of relational facts, often represented as (subject, relation, object)-triples. KBs provide rich information for NLP tasks such as question answering (Abujabal et al., 2017) or entity linking (Shen et al., 2015). Since knowledge bases are inherently incomplete (West et al., 2014), there has been considerable interest in methods that infer missing knowledge.
In particular, a large number of knowledge graph embedding (KGE) models have been proposed in the recent literature (Nickel et al., 2016a). These models embed the entities and relations of a given knowledge base into a low-dimensional latent space such that the structure of the knowledge base is captured. The embeddings are subsequently used to assess whether unobserved triples constitute missing facts or are likely to be false.
To evaluate the performance of a KGE model, the most commonly adopted protocol is the entity ranking (ER) protocol. The protocol takes as input a set of previously unobserved test triples, such as (Einstein, bornIn, Ulm), and uses the embedding model to rank all possible answers to the questions (?, bornIn, Ulm) and (Einstein, bornIn, ?). Model performance is then assessed based on the rank of the answer present in the test triple (Einstein and Ulm, resp.). Since each question is constructed from a test triple, the protocol ensures that questions are meaningful and always have a correct answer. Throughout this paper, we refer to the task of answering such questions as question answering (QA). The ER protocol is, in principle, well-suited to evaluate the performance of KGE models for QA, although concerns about the benchmark datasets (Toutanova and Chen, 2015), the considered models (Kadlec et al., 2017), and the evaluation (Joulin et al., 2017) have been raised.
In this paper, we aim to study the performance of popular embedding models for the task of knowledge base completion (KBC): given a relation of a knowledge base (bornIn), infer true missing facts ((Einstein, bornIn, Ulm)). This task is different from QA (as defined above) since no information about potential missing triples is provided upfront. We argue that the ER protocol is not well-suited to assess model performance for KBC. To see this, observe that models that assign high confidence scores to incorrect triples such as (Ulm, bornIn, Einstein) are not penalized by the ER protocol because the corresponding questions (e.g., (Ulm, bornIn, ?)) are never asked. Thus a model that performs well on ER may still not perform well for KBC. In fact, we argue here that some commonly used KGE models are inherently not well-suited to KBC.
We propose a simple entity-pair ranking (PR) protocol, which is more suitable to assess model performance for KBC. Given a relation such as bornIn, PR uses the KGE model to rank all possible answers, i.e., all entity pairs, to the question (?, bornIn, ?), and subsequently assesses model performance based on the rank of the test triples for relation bornIn in the answer. The protocol ensures that a model's performance is negatively affected if the model assigns high scores to false triples such as (Ulm, bornIn, Einstein).
We conducted an experimental study on commonly used benchmark datasets under the ER and the PR protocols. We found that the performance of popular embedding models was often good under the ER but unsatisfactory under the PR protocol, even on "simple" datasets that are generally considered to be too easy. Moreover, we found that a simple rule-based model often provided superior performance for PR. Our findings suggest that there is a need for more research into embedding models as well as their training strategies for the task of knowledge base completion.

Preliminaries
Given a set of entities E and a set of relations R, a knowledge base K ⊆ E × R × E is a set of triples (i, k, j), where i, j ∈ E and k ∈ R. We refer to i, k, and j as the subject, relation, and object of the triple, respectively.
Embedding models. A KGE model associates an embedding e_i ∈ R^{d_e} and r_k ∈ R^{d_r} with each entity i and relation k, respectively. We refer to d_e, d_r ∈ N+ as the size of the embeddings. Each KGE model uses a scoring function s : E × R × E → R to associate a score s(i, k, j) with each triple (i, k, j) ∈ E × R × E. The scores induce a ranking: triples with high scores are considered more likely to be true than triples with low scores. Roughly speaking, the models try to find embeddings that capture the structure of the entire knowledge graph well. In this work, we consider a popular family of embedding models called bilinear models.
Bilinear KGE models. Bilinear models use the relation embedding r_k ∈ R^{d_r} to construct a mixing matrix R_k ∈ R^{d_e × d_e}, and they use the scoring function s(i, k, j) = e_i^T R_k e_j. The models differ mainly in how R_k is constructed. Unless stated otherwise, the models use the same embedding sizes for entities and relations (i.e., d_r = d_e).
RESCAL (Nickel et al., 2011) is the most general bilinear model: it sets d_r = d_e^2 and stores in r_k the values of each entry of R_k. Analogy (Liu et al., 2017) constrains R_k ∈ R^{d_e × d_e} to a block-diagonal matrix in which each block is either (i) a real scalar or (ii) a 2 × 2 matrix of form [[x, −y], [y, x]] with x, y ∈ R. DistMult (Carroll and Chang, 1970; Yang et al., 2014) is a symmetric factorization model with R_k = diag(r_k) or, equivalently, considers only case (i) of Analogy. ComplEx (Trouillon et al., 2016) and HolE (Nickel et al., 2016b) are equivalent to a model that restricts R_k to case (ii). TransE (Bordes et al., 2013) is a translation-based model with scoring function s(i, k, j) = −‖e_i + r_k − e_j‖_2 (or with the ‖·‖_1 norm); it can also be written in bilinear form.
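These constructions can be made concrete with a small sketch (hypothetical toy code, not the implementation used in the experiments): all bilinear models share the scoring function s(i, k, j) = e_i^T R_k e_j and differ only in how the mixing matrix R_k is built from r_k.

```python
import numpy as np

def bilinear_score(e_i, R_k, e_j):
    # Shared scoring function of the bilinear family: s(i, k, j) = e_i^T R_k e_j.
    return float(e_i @ R_k @ e_j)

def distmult_matrix(r_k):
    # DistMult: R_k = diag(r_k); this forces s(i, k, j) = s(j, k, i) for all i, j.
    return np.diag(r_k)

def rotation_block_matrix(xs, ys):
    # ComplEx/HolE-equivalent construction: block-diagonal R_k made of 2x2
    # blocks [[x, -y], [y, x]]; the off-diagonal -y/y entries break the
    # symmetry that DistMult imposes.
    d = 2 * len(xs)
    R = np.zeros((d, d))
    for b, (x, y) in enumerate(zip(xs, ys)):
        R[2 * b:2 * b + 2, 2 * b:2 * b + 2] = [[x, -y], [y, x]]
    return R
```

With a diagonal R_k the score is invariant under swapping subject and object, while a rotation-block R_k is not; this is exactly the distinction exploited in the discussion of DistMult in Section 3.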
Rule learning. Rule learning methods derive logical rules that encode dependencies found in the KB (Galárraga et al., 2013). We consider a simple rule-based model called RuleN (Meilicke et al., 2018) as baseline. The model learns (weighted) implication rules of form r(i, j) ← r_1(i, z_1) ∧ · · · ∧ r_n(z_n, j), where the r_i are relations and i, j, and the z_i are variables quantified over entities; rules may also contain a constant entity c in place of a variable. To perform KBC, rule-based models query the KB for instances of the body of each rule and interpret the corresponding head as a (weighted) predicted fact.
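As an illustration, applying a single length-2 path rule to a toy KB can be sketched as follows (the rule, relation names, and weight here are hypothetical; a system like RuleN learns the rules and their confidences from data):

```python
def apply_path_rule(kb, head_rel, body_rels, weight):
    """Apply a length-2 path rule r(i, j) <- r1(i, z) ^ r2(z, j): find all
    instances of the body in the KB and emit the corresponding unobserved
    heads as weighted candidate facts."""
    r1, r2 = body_rels
    by_rel = {}
    for s, r, o in kb:  # index the KB by relation for the join over z
        by_rel.setdefault(r, []).append((s, o))
    preds = {}
    for i, z in by_rel.get(r1, []):
        for z2, j in by_rel.get(r2, []):
            if z == z2 and (i, head_rel, j) not in kb:
                preds[(i, head_rel, j)] = weight
    return preds
```

For example, with the (made-up) rule bornInCountry(i, j) ← livedIn(i, z) ∧ locatedIn(z, j) and the triples (einstein, livedIn, ulm) and (ulm, locatedIn, germany), the sketch predicts (einstein, bornInCountry, germany) with the rule's weight.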

Evaluation Protocols
We first review two widely used evaluation protocols for QA. We then argue that these protocols are not well-suited for assessing KBC performance, because they focus on a small subset of all possible facts for a given relation. We then introduce the entity-pair ranking (PR) protocol and discuss its advantages and potential shortcomings.

Current Evaluation Protocols
The triple classification (TC) or the entity ranking (ER) protocols are commonly used to assess KGE model performance, where ER is arguably the most widely adopted protocol. We assume throughout that only true but no false triples are available (as is commonly the case), and that the available true triples are divided into training, validation, and test triples.
Triple classification (TC) The goal of triple classification is to test the model's ability to discriminate between true and false triples (Socher et al., 2013). Since only true triples are available in practice, pseudo-negative triples are generated by randomly replacing either the subject or the object of each test triple by a random entity (that appears as a subject or object in the considered relation). All triples are then classified as positive or negative according to the KGE scores. In particular, triple (i, k, j) is classified as positive if its score s(i, k, j) exceeds a relation-specific decision threshold τ k (learned on validation data using the same procedure). Model performance is assessed by classification accuracy.
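The TC decision procedure can be sketched as follows (a minimal illustration; in practice the relation-specific threshold is chosen the same way, by maximizing accuracy on validation data):

```python
import numpy as np

def tc_accuracy(scores_pos, scores_neg, tau):
    """Triple classification: a triple is classified positive iff its score
    exceeds the relation-specific threshold tau."""
    correct = np.sum(np.asarray(scores_pos) > tau) + np.sum(np.asarray(scores_neg) <= tau)
    return correct / (len(scores_pos) + len(scores_neg))

def fit_threshold(val_pos, val_neg):
    # Pick tau maximizing validation accuracy; it suffices to try the
    # observed scores themselves as candidate cut points.
    candidates = np.concatenate([val_pos, val_neg])
    return max(candidates, key=lambda t: tc_accuracy(val_pos, val_neg, t))
```

Because the pseudo-negative for each test triple is a single random corruption, the two score populations are usually well separated, which is exactly why TC accuracies come out high.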
Entity ranking (ER) ER assesses model performance by testing its ability to perform QA (as defined before). In particular, for each test triple t = (i, k, j), two questions q_s = (?, k, j) and q_o = (i, k, ?) are generated. For question q_s, all entities i' ∈ E are ranked based on the score s(i', k, j). To avoid misleading results, entities i' ≠ i that correspond to observed triples in the dataset, i.e., for which (i', k, j) occurs in the training/validation/test triples, are discarded to obtain a filtered ranking. The same process is applied for question q_o. Model performance is evaluated based on the recorded positions of the test triples in the filtered ranking. Models that tend to rank test triples (known to be true) higher than unknown triples (assumed to be false) are thus considered superior. Usually, the micro-average of filtered Hits@K, i.e., the proportion of test triples ranking in the top-K, and filtered MRR, i.e., the mean reciprocal rank of the test triples, are used to assess model performance.

Prior studies found that most models achieve a TC accuracy of at least 93% on a benchmark dataset. This is because each test triple is compared against a single negative triple, and due to the high number of possible negative triples, it is unlikely that the chosen triple has a high predicted score, rendering most classification tasks "easy". Consequently, the accuracy of triple classification overestimates model performance, and this protocol is less adopted in recent work.
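The filtered ranking used by ER and the resulting metrics can be sketched as follows (illustrative code, assuming raw model scores over all candidate entities for one question):

```python
import numpy as np

def filtered_rank(scores, true_idx, known_idx):
    """Filtered rank of the test answer: competitors that are themselves
    observed true triples (other than the test answer itself) are discarded
    before counting how many candidates outscore the true answer."""
    s_true = scores[true_idx]
    mask = np.ones(len(scores), dtype=bool)
    mask[list(known_idx - {true_idx})] = False
    return 1 + int(np.sum(scores[mask] > s_true))

def mrr_and_hits(ranks, K):
    # Micro-averaged filtered MRR and Hits@K over all generated questions.
    ranks = np.asarray(ranks, dtype=float)
    return float(np.mean(1.0 / ranks)), float(np.mean(ranks <= K))
```

Note that the filter only removes known-true competitors for the one question being asked; it never asks questions like (Ulm, bornIn, ?), which is the blind spot discussed in Section 3.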

Discussion
We argue that ER is appropriate to evaluate QA performance, but may overestimate model performance for KBC. Since ER generates questions from true test triples, it only asks questions that are known to have a correct answer. The question itself thus provides useful information. This perfectly matches QA, but it does not match KBC where such information is not available.
To better illustrate why ER can lead to misleading assessment of a model's KBC performance, consider the DistMult model and the asymmetric relation nominatedFor. As described in Sec. 2, DistMult models all relations as symmetric in that s(i, k, j) = s(j, k, i). Now consider triple t = (H. Simon, nominatedFor, Nobel Prize), and let us suppose that the model correctly assigns t a high score s(t). Then the inverse triple t = (Nobel Prize, nominatedFor, H. Simon) will also obtain a high score since s(t ) = s(t). Thus the score produced by DistMult does not discriminate between the true triple t and the false triple t . In ER, however, questions about t are never asked; there is no test triple for this relation containing either Nobel Prize as subject or H. Simon as object. The symmetry of DistMult's prediction thus barely affects its performance under the ER protocol.
For another example, consider TransE and the relation k = marriedTo, which is symmetric but not reflexive. One can show that for all (i, k, j), the TransE scores satisfy s(i, k, i) = −‖r_k‖ ≥ (s(i, k, j) + s(j, k, i))/2 by the triangle inequality. For symmetric relations, TransE should aim to assign high scores to both (i, k, j) and (j, k, i). To do so, TransE has the tendency to push the relation embedding r_k towards 0 as well as e_i and e_j towards each other. But when r_k ≈ 0, then s(i, k, i) is high for all i, so that the relation is treated as if it were reflexive. Again, in ER, this property only slightly influences the results: there is only one "reflexive" tuple in each filtered entity list, so that the correct answer i for question (?, k, j) ranks at most one position lower.
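This property can be checked numerically. Assuming the TransE scoring function with the L2 norm, s(i, k, i) = −‖r_k‖ for every entity i, and the triangle inequality gives 2·s(i, k, i) ≥ s(i, k, j) + s(j, k, i): the "reflexive" triple always scores at least as high as the average of a triple and its inverse.

```python
import numpy as np

def transe(e_i, r_k, e_j):
    # TransE score with the L2 norm: s(i, k, j) = -||e_i + r_k - e_j||_2.
    return -np.linalg.norm(e_i + r_k - e_j)

rng = np.random.default_rng(0)
for _ in range(200):
    e_i, r_k, e_j = rng.normal(size=(3, 16))
    # s(i, k, i) reduces to -||r_k|| regardless of the entity embedding.
    assert np.isclose(transe(e_i, r_k, e_i), -np.linalg.norm(r_k))
    # Triangle inequality: 2*s(i,k,i) >= s(i,k,j) + s(j,k,i).
    assert 2 * transe(e_i, r_k, e_i) >= transe(e_i, r_k, e_j) + transe(e_j, r_k, e_i) - 1e-9
```

So whenever training pushes r_k towards 0 to accommodate a symmetric relation, every reflexive triple (i, k, i) attains a near-maximal score.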
More expressive models such as RESCAL or ComplEx do not have such inherent limitations. Nevertheless, our experimental study shows that these models (at least in the way they are currently trained) also tend to assign high scores to false triples.

Entity-Pair Ranking Protocol
We propose a simple alternative protocol called entity-pair ranking (PR). The protocol is more suitable to assess a model's KBC performance (although it is not without flaws either; see below). PR proceeds as follows: for each relation k, we use the KGE model to rank all triples for relation k, i.e., all answers to the question (?, k, ?). As in ER, we filter out all triples that occur in the training and validation data to obtain a filtered ranking, i.e., we only rank triples that were not used during model training. If a model tends to assign high scores to negative triples, its performance is likely to be negatively affected because it becomes harder for true triples to rank high.
Note that the number of candidate answers considered by PR is much larger than in ER. Consider a relation k and let T_k be the set of test triples for relation k. Then ER considers 2|T_k| questions with |E| candidate answers each, i.e., 2|T_k||E| candidates in total, while PR considers |E|^2 candidates. Moreover, PR considers all test triples in T_k simultaneously instead of sequentially. For this reason, we cannot use the MRR metric commonly used in ER. Instead, we assess model performance using weighted MAP@K, i.e., the weighted mean average precision in the top-K filtered results, and weighted Hits@K, i.e., the weighted percentage of test triples in the top-K filtered results. We weight the influence of relation k proportionally to its number of test triples (capped at K), thereby closely following ER:

MAP@K = Σ_k w_k AP_k@K / Σ_k w_k,   Hits@K = Σ_k w_k Hits_k@K / Σ_k w_k,   where w_k = min(|T_k|, K).

Here AP_k@K is the average precision of the top-K list (w.r.t. test triples T_k) and Hits_k@K refers to the fraction of test triples in the top-K list. Note that K should be chosen much larger for PR than for ER since it roughly corresponds to the number of triples we aim to predict for relation k.
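The weighted metrics can be sketched as follows (an illustrative reimplementation under the assumption that each relation is weighted by w_k = min(|T_k|, K) and that per-relation scores are normalized by the same cap so a perfect top-K list attains 1):

```python
def ap_at_k(topk_is_test, num_test):
    """Average precision of one relation's top-K list w.r.t. its test triples,
    normalized by min(|T_k|, K) so a perfect list scores 1 (assumption)."""
    hits, prec_sum = 0, 0.0
    for pos, is_test in enumerate(topk_is_test, start=1):
        if is_test:
            hits += 1
            prec_sum += hits / pos
    denom = min(num_test, len(topk_is_test))
    return prec_sum / denom if denom else 0.0

def weighted_pr_metrics(per_relation, K):
    """per_relation maps relation -> (top-K membership flags, |T_k|);
    returns (weighted MAP@K, weighted Hits@K) with weights w_k = min(|T_k|, K)."""
    map_num = hits_num = den = 0.0
    for flags, n_test in per_relation.values():
        w = min(n_test, K)
        map_num += w * ap_at_k(flags, n_test)
        hits_num += w * (sum(flags) / w if w else 0.0)
        den += w
    return map_num / den, hits_num / den
```

The membership flags mark which positions of the filtered top-K list are test triples; everything else in the list counts against the model, which is precisely how PR penalizes high-scoring false triples.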
The PR protocol is better suited to evaluate KBC performance because it considers all model predictions. The protocol also has some disadvantages, however. First, like ER, the PR protocol may underestimate model performance due to unobserved true triples ranked high by the model. Since a larger number of candidates is considered, PR may be more sensitive to this problem than ER. We explore the effect of underestimation in our empirical study in Sec. 4.4. Another concern with PR is its potentially high computational cost. For current benchmark datasets, we found that the PR evaluation is feasible. Generally, one may argue that an embedding model is suitable for KBC only if it is feasible to determine high-scoring triples in a sufficiently efficient way. Since PR only requires the computation of the top-K predictions, performance can potentially be improved using techniques such as maximum inner-product search (Shrivastava and Li, 2014).

Experimental Study
We conducted an experimental study to assess the performance of various bilinear embedding models for KBC; some other KGE models, such as ConvE (Dettmers et al., 2018), do not support KBC directly due to their architecture. All datasets, experimental results, and source code are publicly available at http://www.uni-mannheim.de/dws/research/resources/kge-eval/. For all models, we performed evaluation under both the ER and PR protocols in order to assess their performance for the QA and KBC tasks, respectively. We found that many KGE models provided good ER but low PR performance. We also considered RuleN (Meilicke et al., 2018), a simple rule-based system known to perform well under ER, and found that it often outperformed the embedding models under both protocols. Our results imply that more research into KGE models for KBC is needed.
We also investigated the extent to which PR underestimates model performance due to unobserved true triples. We found that underestimation is not the main reason for the low PR performance of many KGE models; in fact, many KGE models ranked clearly wrong tuples (e.g., with incorrect types) highly.

Experimental Setup
Datasets. We used the four common KBC benchmark datasets: FB15K, WN18, FB-237, and WNRR. The latter two datasets were created since the former two datasets were considered too easy for embedding models (based on ER). Key dataset statistics are summarized in Table 1.
Negative sampling. We trained the embedding models using negative sampling to obtain pseudo-negative triples. We consider three sampling strategies in our experiments:
- Perturb 1: For each training triple t = (i, k, j), sample pseudo-negative triples by randomly replacing either i or j with a random entity (such that the resulting triple is unobserved). This strategy matches ER, which is based on questions (?, k, j) and (i, k, ?).
- Perturb 1-R: For each training triple t = (i, k, j), sample pseudo-negative triples by randomly replacing either i, k, or j. The generated negative samples are not checked against the training set (Liu et al., 2017).
- Perturb 2: For each training triple t = (i, k, j), obtain pseudo-negative triples by randomly sampling unobserved entity pairs for relation k. This strategy appears more suited to PR.
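The three strategies can be sketched as follows (illustrative code; entity and relation vocabularies and the observed-triple check are simplified, and the sampling loops assume unobserved candidates exist):

```python
import random

def perturb1(triple, entities, kb, rng):
    """Perturb 1: replace subject or object so the result is unobserved."""
    i, k, j = triple
    while True:  # assumes at least one unobserved corruption exists
        if rng.random() < 0.5:
            cand = (rng.choice(entities), k, j)
        else:
            cand = (i, k, rng.choice(entities))
        if cand not in kb:
            return cand

def perturb1_r(triple, entities, relations, rng):
    """Perturb 1-R: replace subject, relation, or object; no check against the KB."""
    i, k, j = triple
    slot = rng.randrange(3)
    if slot == 0:
        return (rng.choice(entities), k, j)
    if slot == 1:
        return (i, rng.choice(relations), j)
    return (i, k, rng.choice(entities))

def perturb2(triple, entities, kb, rng):
    """Perturb 2: sample an unobserved entity pair for the same relation."""
    _, k, _ = triple
    while True:
        cand = (rng.choice(entities), k, rng.choice(entities))
        if cand not in kb:
            return cand
```

Perturb 1 corrupts exactly one slot of a question the ER protocol would ask, while Perturb 2 draws negatives from the full (?, k, ?) space that PR ranks over, which is why it is the natural match for PR.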
Training and implementation. We trained DistMult, ComplEx, Analogy, and RESCAL with AdaGrad (Duchi et al., 2011) using binary cross-entropy loss. We used a pairwise ranking loss for TransE (as it always produces negative scores). All embedding models are implemented on top of the code of Liu et al. (2017) in C++ using OpenMP. For RuleN, we used the original implementation provided by the authors. The evaluation protocols were written in Python, with Bottleneck used for efficiently obtaining the top-K entries for PR. We found that PR evaluation (which took ≈30-90 minutes) was about 3-4 times slower than ER.
Hyperparameters. The best hyperparameters were selected based on MRR (for ER) and MAP@100 (for PR) on the validation data. For both protocols, we performed an exhaustive grid search over the following hyperparameter settings: d_e ∈ {100, 150, 200}, weight of l2-regularization λ ∈ {0.1, 0.01, 0.001}, learning rate η ∈ {0.01, 0.1}, negative sampling strategies Perturb 1, Perturb 2, and Perturb 1-R, and margin hyperparameter γ ∈ {0.5, 1, 2, 3, 4} for TransE. For each training triple, we sampled 6 pseudo-negative triples. To keep effort tractable, we used only the most frequent relations from each dataset for hyperparameter tuning (top-5, top-5, top-15, and top-30 most frequent relations for WN18, WNRR, FB-237, and FB15K, respectively). We trained each model for up to 500 epochs during grid search. In all cases, we evaluated model performance every 50 epochs and used the overall best-performing model. For RuleN, we used the best settings reported by the authors for ER (Meilicke et al., 2018). For PR, we learned path rules of length 2 with a sampling size of 500 for FB15K and FB-237. For WN18 and WNRR, we learned path rules of length 3 with a sampling size of 500.

Performance Results with ER
Table 2 summarizes the ER results. Embedding models perform competitively with respect to RuleN on all datasets, except for their MRR performance on FB15K. Notice that this generally holds even for the more restricted models (TransE and DistMult) on the more challenging datasets, which were created after FB15K and WN18 were criticized as too easy (Toutanova and Chen, 2015; Dettmers et al., 2018). In particular, although DistMult can only model symmetric relations, and although most relations in these datasets are asymmetric, DistMult has good ER performance. Likewise, TransE achieved strong Hits@10 performance on all datasets, including WN18, which contains a large number of symmetric relations that are not easily modeled by TransE.

Performance Results with PR
The evaluation results of PR with K = 100 are summarized in Table 3. Note that Tables 2 and 3 are not directly comparable: they measure different tasks. Also note that we use a different value of K, which in PR corresponds to the number of predicted facts per relation. We discuss the effect of the choice of K later. For the embedding models, observe that with the exception of Analogy and ComplEx on WN18, the performance of all models is unsatisfactory on all datasets, especially when compared with RuleN on FB15K and WN18, which were previously considered to be too easy for embedding models. Specifically, DistMult's Hits@100 is slightly less than 10% on WN18, meaning that if we add the top 100 ranked triples to the KB, over 90% of what is added is likely false. Even when using ComplEx, the best model on FB15K, we would potentially add more than 50% false triples. This implies that embedding models cannot capture simple rules successfully. The notable exceptions are ComplEx and Analogy on WN18, although both are still behind RuleN. TransE and DistMult did not achieve competitive results on WN18. In addition, DistMult did not achieve competitive results on FB15K and FB-237, and TransE did not achieve competitive results on WNRR. In general, ComplEx and Analogy performed consistently better than the other models across datasets. When compared with the RuleN baseline, however, the performance of these models was often not satisfactory. This suggests that better KGE models and/or training strategies are needed for KBC.
RuleN did not perform well on FB-237 and WNRR, likely because the way these datasets were constructed makes them intrinsically difficult for rule-based methods (Meilicke et al., 2018). This is reflected in both ER and PR results.
To better understand the change in performance of TransE and DistMult, we investigated their predictions for the top-5 most frequent relations on WN18. Table 4 shows the number of test triples appearing in the top-100 for each relation (after filtering triples from the training and validation sets). The numbers in parentheses are discussed in Section 4.4.
We found that DistMult worked well on the symmetric relation derivationally related form, where its symmetry assumption clearly helps. Here 93% of the training data consists of symmetric pairs (i.e., (i, k, j) and (j, k, i)), and 88% of the test triples have their symmetric counterpart in the training set. In contrast, TransE's top-100 list contained no test triples for derivationally related form. We found that the norm of the embedding vector of this relation was 0.1, which was considerably smaller than for the other relations (avg. 1.4). This supports our argument that TransE tends to push symmetric relation embeddings towards 0.
Note that while hyponymy, hypernymy, member meronym, and member holonym are semantically transitive, the dataset contains almost exclusively their transitive core, i.e., the dataset (both train and test) does not contain many of the transitive links of the relations. As a result, models that cannot handle transitivity well may still produce good results. This might explain why TransE performed better for these relations than for derivationally related form. DistMult did not perform well on these relations, which are asymmetric and thus a poor fit for its symmetry assumption.

Influence of Unobserved True Triples
Since all datasets are based on incomplete knowledge bases, all evaluation protocols may systematically underestimate model performance. In particular, any true triple t that is neither in the training, nor validation, nor test data is treated as negative during ranking-based evaluations. A model which correctly ranks t high is thus penalized. PR might be particularly sensitive to this due to the large number of candidates considered.
It is generally unclear how to design an automatic evaluation strategy that avoids this problem. Manual labeling can be used to address this, but it may sometimes be infeasible given the large number of relations, entities, and models for KBC.
To explore this underestimation effect in PR, we inspected the unobserved triples in the top-100 predictions of the 5 most frequent relations of WN18. We then checked whether those triples are implied by the symmetry and transitivity properties of each relation. In Table 4, we give the resulting number of triples in parentheses (i.e., number of test triples + implied triples). We observed that underestimation did indeed occur. TransE was most affected, but still did not achieve competitive results when compared to ComplEx and Analogy. RuleN achieved the best possible results on all 5 relations. These results suggest that (1) underestimation is indeed a concern, and (2) the results in PR can nevertheless give an indication of relative model performance.

Type Filtering
When background knowledge (BK) is available, embedding models only need to score triples consistent with the BK. We explored whether their performance can be improved by filtering out type-inconsistent triples from each model's predictions. Notice that this is inherently what rule-based approaches do, since all predicted candidates are type-consistent. In particular, we investigated how model performance is affected when we filter out predictions that violate type constraints (the domain and range of each relation). If a model's performance improves with such type filtering, it must have ranked tuples with incorrect types high in the first place. We can thus assess to what extent models capture entity types as well as the domains and ranges of the relations.
We extracted from Freebase type definitions for entities and domain and range constraints for relations. We also added the domain (or range) of a relation k to the type set of each subject (or object) entity which appeared in k. We obtained types for all entities in both FB datasets, and domain/range specifications for roughly 93% of relations in FB15K and 97% of relations in FB-237. The remaining relations were evaluated as before.
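Type filtering itself is straightforward; the following is a minimal sketch under the simplifying assumption that each relation has at most one domain and one range type:

```python
def type_consistent(triple, entity_types, domain, range_):
    """Keep a predicted triple only if the subject/object types satisfy the
    relation's domain and range constraints (relations without a declared
    constraint are left unconstrained)."""
    i, k, j = triple
    if k in domain and domain[k] not in entity_types.get(i, set()):
        return False
    if k in range_ and range_[k] not in entity_types.get(j, set()):
        return False
    return True

def filter_predictions(ranked, entity_types, domain, range_, K):
    # ranked is best-first; keep the top-K type-consistent predictions.
    out = []
    for t in ranked:
        if type_consistent(t, entity_types, domain, range_):
            out.append(t)
        if len(out) == K:
            break
    return out
```

For instance, with domain(bornIn) = person and range(bornIn) = city, the inverted prediction (Ulm, bornIn, Einstein) is discarded while (Einstein, bornIn, Ulm) survives, which is exactly the kind of error this filter removes from DistMult's top-K lists.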
We report in Table 5 the Hits@100 and MAP@100 as well as their absolute improvement (in parentheses) w.r.t. Table 3. We also include the results of RuleN from Table 3, which are already type-consistent. The results show that all KGE models improve with type filtering; thus all models do predict triples with incorrect types. In particular, DistMult shows considerable improvement on both datasets. Indeed, about 90% of the relations in FB15K (about 85% for FB-237) have a different type for their domain and range. As DistMult treats all relations as symmetric, it introduces a wrong triple for each true triple into the top-K list on these relations; type filtering allows us to ignore these wrong tuples. This is also consistent with DistMult's better performance under ER, where type constraints are implicitly used since only questions with correct types are asked. Interestingly, ComplEx and Analogy improved considerably on FB15K, suggesting that even the best-performing embedding models on this dataset still make a considerable number of type-inconsistent predictions. On FB15K, the relative ranking of the models with type filtering is roughly the same as without type filtering. On the harder FB-237 dataset, all models now perform similarly. Notice that when compared with RuleN, embedding models are still behind on FB15K, but are no longer behind on FB-237.

Conclusion
We investigated whether current embedding models provide good results for knowledge base completion, i.e., the task of inferring new facts from an incomplete knowledge base. We argued that the commonly used ER evaluation protocol is not suited to answer this question, and proposed the PR evaluation protocol as an alternative. We evaluated a number of popular KGE models under the ER and PR protocols and found that most KGE models obtained good results under the ER but not the PR protocol. Therefore, more research into embedding models and their training is needed to assess whether, when, and how KGE models can be exploited for knowledge base completion.