A Re-evaluation of Knowledge Graph Completion Methods

Knowledge Graph Completion (KGC) aims to automatically predict missing links in large-scale knowledge graphs. A vast number of state-of-the-art KGC techniques have been published at top conferences in several research fields, including data mining, machine learning, and natural language processing. However, we notice that several recent papers report very high performance, which largely outperforms previous state-of-the-art methods. In this paper, we find that this can be attributed to the inappropriate evaluation protocol they use, and we propose a simple evaluation protocol to address this problem. The proposed protocol is robust to bias in the model, which can substantially affect the final results. We conduct extensive experiments and report the performance of several existing methods using our protocol. The reproducible code has been made publicly available.


Introduction
Real-world knowledge bases are usually expressed as multi-relational graphs, which are collections of factual triplets, where each triplet (h, r, t) represents a relation r between a head entity h and a tail entity t. However, real-world knowledge bases are usually incomplete (Dong et al., 2014), which motivates research on automatically predicting missing links. A popular approach to Knowledge Graph Completion (KGC) is to embed entities and relations into a continuous vector or matrix space, and to use a well-designed score function f(h, r, t) to measure the plausibility of the triplet (h, r, t). Most previous methods use translation distance based (Bordes et al., 2013; Wang et al., 2014; Xiao et al., 2016; Sun et al., 2019) and semantic matching based (Nickel and Tresp, 2013; Yang et al., 2014; Nickel et al., 2016; Trouillon et al., 2016; Liu et al., 2017) scoring functions, which are easy to analyze.
* Equal contribution.
Recently, however, a vast number of neural network-based methods have been proposed. They have complex score functions which utilize black-box neural networks including Convolutional Neural Networks (CNNs) (Dettmers et al., 2018; Nguyen et al., 2018), Recurrent Neural Networks (RNNs) (Lin et al., 2015; Wang et al., 2018), Graph Neural Networks (GNNs) (Schlichtkrull et al., 2017; Shang et al., 2019), and Capsule Networks (Nguyen et al., 2019). While some of them report state-of-the-art performance on several benchmark datasets that is competitive with previous embedding-based approaches, a considerable portion of recent neural network-based papers report very high performance gains which are not consistent across different datasets. Moreover, most of these unusual behaviors are not analyzed at all. Such a pattern has become prominent and is misleading the whole community.
In this paper, we investigate this problem and find that it can be attributed to the inappropriate evaluation protocol used by these approaches. We demonstrate that their evaluation protocol gives a perfect score to a model that always outputs a constant irrespective of the input. This has led to artificial inflation of the performance of several models. To address this, we propose a simple evaluation protocol that creates a fair comparison environment for all types of score functions. We conduct extensive experiments to re-examine some recent methods and fairly compare them with existing approaches. The source code of the paper is publicly available at http://github.com/svjan5/kg-reeval.

Background
Knowledge Graph Completion. Given a Knowledge Graph G = (E, R, T), where E and R denote the sets of entities and relations and T = {(h, r, t) | h, t ∈ E, r ∈ R} is the set of triplets (facts), the task of Knowledge Graph Completion (KGC) involves inferring missing facts based on the known facts. Most of the existing methods define an embedding for each entity and relation in G, i.e., e_h, e_r ∀h ∈ E, r ∈ R, and a score function f(h, r, t) : E × R × E → ℝ which assigns a higher score to valid triplets than to invalid ones.
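As an illustration of what such a score function can look like, the following is a minimal sketch of a translation-distance score in the style of TransE (Bordes et al., 2013), f(h, r, t) = −‖e_h + e_r − e_t‖. The entity names, relation name, and 2-dimensional embeddings are hypothetical values chosen by hand, not learned parameters from any of the models discussed here.

```python
import math

# Hypothetical hand-picked embeddings, for illustration only.
entity_emb = {
    "paris": [1.0, 0.0],
    "france": [1.0, 1.0],
    "tokyo": [4.0, 0.0],
}
relation_emb = {"capital_of": [0.0, 1.0]}


def score(h, r, t):
    """Translation-distance score: negative L2 norm of e_h + e_r - e_t.

    A higher score (closer to zero) means the triplet is more plausible.
    """
    e_h, e_r, e_t = entity_emb[h], relation_emb[r], entity_emb[t]
    return -math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(e_h, e_r, e_t)))


valid = score("paris", "capital_of", "france")   # e_h + e_r equals e_t exactly
invalid = score("paris", "capital_of", "tokyo")  # e_h + e_r is far from e_t
print(valid, invalid)
```

With these toy vectors the valid triplet receives a strictly higher score than the invalid one, which is the property the definition above asks of f.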
KGC Evaluation. During KGC evaluation, for predicting t in a given triplet (h, r, t), a KGC model scores all the triplets in the set T' = {(h, r, t') | t' ∈ E}. Based on the scores, the model first sorts all the triplets and subsequently finds the rank of the valid triplet (h, r, t) in the list. In a more relaxed setting called the filtered setting, all the known correct triplets (from the train, valid, and test sets) are removed from T' except the one being evaluated (Bordes et al., 2013). The triplets in T' − {(h, r, t)} are called negative samples.
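The filtered setting can be sketched as follows. This is a minimal illustration and not the evaluation code of any particular method: the function names, entities, and scores are hypothetical, higher scores are assumed to be better, and the toy scores are all distinct so that ties play no role here.

```python
def filtered_candidates(h, r, t, entities, known_triplets):
    """Candidate tails for the query (h, r, ?) in the filtered setting.

    Known correct tails are dropped, except the tail t being evaluated.
    """
    return [c for c in entities if c == t or (h, r, c) not in known_triplets]


def rank_of(target, candidates, scores):
    """1 + number of candidates scoring strictly higher (assumes no ties)."""
    return 1 + sum(scores[c] > scores[target] for c in candidates)


# Hypothetical toy data: "alice" speaks both English and French.
entities = ["english", "french", "german", "alice"]
known_triplets = {("alice", "speaks", "english"), ("alice", "speaks", "french")}
scores = {"english": 0.9, "french": 0.8, "german": 0.3, "alice": 0.1}

candidates = filtered_candidates("alice", "speaks", "french", entities, known_triplets)
raw_rank = rank_of("french", entities, scores)         # "english" outranks it
filtered_rank = rank_of("french", candidates, scores)  # "english" was filtered out
print(candidates, raw_rank, filtered_rank)
```

In this example the other known correct tail ("english") no longer counts against the evaluated triplet once it is filtered out.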
Related Work. Prior to our work, Kadlec et al. (2017) cast doubt on the claim that the performance improvement of several models is due to architectural changes as opposed to hyperparameter tuning or a different training objective. In our work, we raise similar concerns, but from a different angle, by highlighting issues with the evaluation procedure used by several recent methods. Chandrahas et al. (2018) analyze the geometry of KG embeddings and its correlation with task performance, while Nayyeri et al. (2019) examine the effect of different loss functions on performance. However, their analysis is restricted to non-neural approaches.

Figure 1 (axis labels: Triplet, Score): Sorted score distribution of ConvKB for an example valid triplet and its negative samples. The score is normalized into [0, 1] (lower is better). The dotted line indicates the score of the valid triplet. We find that in this example, around 58.5% of the negatively sampled triplets obtain the exact same score as the valid triplet.

Observations
In this section, we first describe our observations and concerns and then investigate the reasons behind them.

Inconsistent Improvements over Benchmark Datasets
Several recently proposed methods report high performance gains on a particular dataset. However, their performance on another dataset is not consistently improved. In Table 1, we report the change in MRR score on FB15k-237 (Toutanova and Chen, 2015) and WN18RR (Dettmers et al., 2018). Overall, we find that for a few recent NN-based methods, the gains on these two datasets are inconsistent. For instance, ConvKB shows a 21.8% improvement over ConvE on FB15k-237 but a degradation of 42.3% on WN18RR, which is surprising given that the method is claimed to be better than ConvE. On the other hand, methods like RotatE and TuckER give consistent improvements across both benchmark datasets.

Observations on Score Functions
Score distribution. When evaluating KGC methods, for a given triplet (h, r, t), the ranking of t given h and r is computed by scoring all the triplets of the form {(h, r, t') | t' ∈ E}, where E is the set of all entities. On investigating a few recent NN-based approaches, we find that they have an unusual score distribution, where some negatively sampled triplets have the same score as the valid triplet. An instance from the FB15k-237 dataset is presented in Figure 1. Here, out of 14,541 negatively sampled triplets, 8,520 have the exact same score as the valid triplet.
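A check of this kind can be sketched in a few lines. The helper below is a hypothetical illustration, not the analysis script used for Figure 1; the `tol` parameter is an assumption added for convenience, and its default of 0.0 corresponds to the exact-equality comparison described above. The scores are made-up toy values.

```python
def count_ties(valid_score, negative_scores, tol=0.0):
    """Number of negative samples whose score equals the valid triplet's score.

    tol=0.0 means exact equality; a small positive tol would also count
    near-identical scores (an assumption, not part of the text above).
    """
    return sum(abs(s - valid_score) <= tol for s in negative_scores)


# Hypothetical toy scores: four of the six negatives collapse to the same value.
valid_score = 0.0
negative_scores = [0.0, 0.0, 0.0, 0.7, 0.2, 0.0]

n_ties = count_ties(valid_score, negative_scores)
tie_fraction = n_ties / len(negative_scores)
print(n_ties, tie_fraction)
```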
Statistics on the whole dataset. In Figure 2, we report the total number of triplets with the exact same score over the entire dataset for ConvKB (Nguyen et al., 2018)

Evaluation Protocols for KGC
In this section, we present different evaluation protocols that can be adopted in knowledge graph completion. We further show that an inappropriate evaluation protocol is the key reason behind the unusual behavior of some recent NN-based methods.
How to deal with the same scores? An essential aspect of the evaluation method is deciding how to break ties among triplets with the same score. More concretely, if the model assigns the same score to multiple triplets in the candidate set T' = {(h, r, t') | t' ∈ E}, one must decide where the correct triplet is placed among them. Assuming that the triplets are sorted in a stable manner, we design a general evaluation scheme for KGC, which consists of the following three protocols:
• TOP: The correct triplet is inserted at the beginning of T'.
• BOTTOM: The correct triplet is inserted at the end of T'.
• RANDOM: The correct triplet is placed randomly in T'.
Discussion. Based on the definition of the three evaluation protocols, it is clear that the TOP protocol does not evaluate the model rigorously. It gives an inappropriate advantage to models that are biased toward assigning the same score to different triplets. On the other hand, the BOTTOM protocol can be unfair to the model at inference time because it penalizes the model for giving the same score to multiple triplets: if many triplets have the same score as the correct triplet, the correct triplet gets the worst rank possible among them. As a result, RANDOM is the best evaluation technique, being both rigorous and fair to the model. It is in line with the situation we meet in the real world: given several candidates with the same score, the only option is to select one of them randomly. Hence, we propose to use the RANDOM evaluation scheme for all model performance comparisons.
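The three protocols can be sketched as a single ranking function. This is a minimal illustration under stated assumptions rather than the released implementation: the function name is hypothetical, higher scores are taken to be better (for a model where lower is better, the comparison would be flipped), and ties are detected by exact equality.

```python
import random


def rank_under_protocol(valid_score, negative_scores, protocol, rng=None):
    """Rank of the valid triplet (1 = best) under a tie-breaking protocol."""
    better = sum(s > valid_score for s in negative_scores)  # strictly better negatives
    ties = sum(s == valid_score for s in negative_scores)   # negatives with equal score
    if protocol == "TOP":
        return better + 1            # placed ahead of every tied negative
    if protocol == "BOTTOM":
        return better + ties + 1     # placed behind every tied negative
    if protocol == "RANDOM":
        if rng is None:
            rng = random.Random()
        # Uniform over the ties + 1 possible positions within the tied block.
        return better + rng.randint(0, ties) + 1
    raise ValueError(f"unknown protocol: {protocol}")


# A degenerate model that outputs the same constant for every triplet.
constant_negatives = [0.0] * 9
print(rank_under_protocol(0.0, constant_negatives, "TOP"))     # -> 1
print(rank_under_protocol(0.0, constant_negatives, "BOTTOM"))  # -> 10
print(rank_under_protocol(0.0, constant_negatives, "RANDOM", random.Random(0)))
```

Under TOP, the constant model is ranked first on every query even though it carries no information, whereas under RANDOM its rank is spread uniformly over all ten positions.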

Experiments
In this section, we conduct extensive experiments using our proposed evaluation protocols and make a fair comparison of several existing methods.

Datasets
We evaluate the proposed protocols on the FB15k-237 dataset¹ (Toutanova and Chen, 2015), which is a subset of FB15k (Bordes et al., 2013) with inverse relations deleted to prevent direct inference of test triples from training.

Methods Analyzed
In our experiments, we categorize existing KGC methods into the following two categories:
• Non-Affected: This includes methods which give consistent performance under different evaluation protocols. For experiments in this paper, we consider three such methods: ConvE, RotatE, and TuckER.
• Affected: This category consists of recently proposed neural-network-based methods whose performance is affected by different evaluation protocols. ConvKB, CapsE, TransGate², and KBAT are methods in this category.
¹ We also report our results on the WN18RR dataset (Dettmers et al., 2018) in the appendix.

Evaluation Metrics
For all the methods, we use the code and the hyperparameters provided by the authors in their respective papers. Model performance is evaluated by Mean Reciprocal Rank (MRR), Mean Rank (MR), and Hits@10 (H@10) in the filtered setting (Bordes et al., 2013).
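For reference, the three metrics can be computed from a list of per-triplet ranks as sketched below. The function names and the example ranks are hypothetical and do not correspond to any result reported in this paper.

```python
def mean_rank(ranks):
    """MR: average rank of the valid triplets (lower is better)."""
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """MRR: average of 1 / rank (higher is better, at most 1)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k=10):
    """H@k: fraction of valid triplets ranked within the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


# Hypothetical filtered ranks for four test triplets.
ranks = [1, 2, 4, 20]
print(mean_rank(ranks))             # (1 + 2 + 4 + 20) / 4 = 6.75
print(mean_reciprocal_rank(ranks))  # (1 + 0.5 + 0.25 + 0.05) / 4, about 0.45
print(hits_at_k(ranks, k=10))       # 3 of 4 ranks are <= 10, i.e. 0.75
```

Because MRR weights the top positions most heavily, it is the metric most sensitive to where the valid triplet is placed among tied candidates.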

Evaluation Results
To analyze the effect of different evaluation protocols described in Section 4, we study the performance variation of the models listed in Section 5.2.
We study the effect of using the TOP and BOTTOM protocols and compare them to the RANDOM protocol. In their original papers, ConvE, RotatE, and TuckER use a strategy similar to the proposed RANDOM protocol, while ConvKB, CapsE, and KBAT use the TOP protocol. We also study the random error in the RANDOM protocol with multiple runs, where we report the average and standard deviation over 5 runs with different random seeds. The results are presented in Table 2.
We observe that for Non-Affected methods like ConvE, RotatE, and TuckER, the performance remains consistent across different evaluation protocols. However, with Affected methods, there is a considerable variation in performance. Specifically, we can observe that these models perform best when evaluated using TOP and worst when evaluated using BOTTOM³. Finally, we find that the proposed RANDOM protocol is very robust to different random seeds. Although the theoretical upper and lower bounds of a RANDOM score are the TOP and BOTTOM scores respectively, when we evaluate knowledge graph completion on real-world large-scale knowledge graphs, the randomness does not affect the evaluation results much.
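The bounding relationship between the protocols can be illustrated with a small simulation. This is a sketch on synthetic data only: the (better, ties) counts are randomly generated placeholders rather than outputs of any model above, the helper names are hypothetical, and the numbers it prints say nothing about the results in Table 2.

```python
import random
import statistics


def rank(better, ties, protocol, rng):
    """Rank of the valid triplet given counts of better and tied negatives."""
    if protocol == "TOP":
        return better + 1
    if protocol == "BOTTOM":
        return better + ties + 1
    return better + rng.randint(0, ties) + 1  # RANDOM


def mrr(cases, protocol, seed=0):
    """MRR over all cases; the seed only matters for the RANDOM protocol."""
    rng = random.Random(seed)
    return statistics.mean([1.0 / rank(b, t, protocol, rng) for b, t in cases])


# Synthetic (better, ties) counts for 2,000 hypothetical test triplets.
gen = random.Random(1234)
cases = [(gen.randint(0, 50), gen.randint(0, 200)) for _ in range(2000)]

top = mrr(cases, "TOP")
bottom = mrr(cases, "BOTTOM")
random_runs = [mrr(cases, "RANDOM", seed=s) for s in range(5)]

print("TOP:", top, "BOTTOM:", bottom)
print("RANDOM mean:", statistics.mean(random_runs),
      "std:", statistics.stdev(random_runs))
```

Every RANDOM rank lies between the corresponding TOP and BOTTOM ranks, so each RANDOM run is guaranteed to fall between the two bounds; averaging over many test triplets is what keeps the spread across seeds small.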

Conclusion
In this paper, we performed an extensive re-examination study of recent neural network-based KGC techniques. We find that many such models have issues with their score functions. Combined with an inappropriate evaluation protocol, such methods reported inflated performance. Based on our observations, we propose the RANDOM evaluation protocol, which can clearly distinguish these affected methods from others. We also strongly encourage the research community to follow the RANDOM evaluation protocol for all KGC evaluation purposes.

Table 3: Performance comparison under different evaluation protocols on the WN18RR dataset. For TOP and BOTTOM, we report changes in performance with respect to the RANDOM protocol. ‡: CapsE uses the pre-trained 100-dimensional GloVe (Pennington et al., 2014) word embeddings for initialization on the WN18RR dataset, which makes the comparison on WN18RR still unfair. †: KBAT has test data leakage in its original implementation, which is fixed in our experiments.

Figure 4 (axis: Number of Triplets with Same Score, binned as 1-4, 5-16, 17-64, 65-256, 257-1024, 1025-4096, 4097-16384, 16385-65536; series: ConvKB, CapsE, ConvE): Plot shows the frequency of the number of negative triplets with the same assigned score as the valid triplet during evaluation on the WN18RR dataset. The results show that, unlike FB15k-237, in this dataset only ConvKB has a large number of negative triplets that get the same score as the valid triplets.