CoDEx: A Comprehensive Knowledge Graph Completion Benchmark

We present CoDEx, a set of knowledge graph completion datasets extracted from Wikidata and Wikipedia that improve upon existing knowledge graph completion benchmarks in scope and level of difficulty. In terms of scope, CoDEx comprises three knowledge graphs varying in size and structure, multilingual descriptions of entities and relations, and tens of thousands of hard negative triples that are plausible but verified to be false. To characterize CoDEx, we contribute thorough empirical analyses and benchmarking experiments. First, we analyze each CoDEx dataset in terms of logical relation patterns. Next, we report baseline link prediction and triple classification results on CoDEx for five extensively tuned embedding models. Finally, we differentiate CoDEx from the popular FB15K-237 knowledge graph completion dataset by showing that CoDEx covers more diverse and interpretable content, and is a more difficult link prediction benchmark. Data, code, and pretrained models are available at https://bit.ly/2EPbrJs.


Introduction
Knowledge graphs are multi-relational graphs that express facts about the world by connecting entities (people, places, things, concepts) via different types of relationships. The field of automatic knowledge graph completion (KGC), which is motivated by the fact that knowledge graphs are usually incomplete, is an active research direction spanning several subfields of artificial intelligence (Nickel et al., 2015; Wang et al., 2017; Ji et al., 2020).
As progress in artificial intelligence depends heavily on data, a relevant and high-quality benchmark is imperative to evaluating and advancing the state of the art in KGC. However, the field has largely remained static in this regard over the past decade. Outdated subsets of Freebase (Bollacker et al., 2008) are most commonly used for evaluation in KGC, even though Freebase had known quality issues (Tanon et al., 2016) and was eventually deprecated in favor of the more recent Wikidata knowledge base (Vrandečić and Krötzsch, 2014).
Indeed, KGC benchmarks extracted from Freebase like FB15K and FB15K-237 (Bordes et al., 2013; Toutanova and Chen, 2015) are questionable in quality. For example, FB15K was shown to have train/test leakage (Toutanova and Chen, 2015). Later in this paper (§ 6.2), we will show that a relatively large proportion of relations in FB15K-237 can be covered by a trivial frequency rule.
To address the need for a solid benchmark in KGC, we present CODEX, a set of knowledge graph COmpletion Datasets EXtracted from Wikidata and its sister project Wikipedia. Inasmuch as Wikidata is considered the successor of Freebase, CODEX improves upon existing Freebase-based KGC benchmarks in terms of scope and level of difficulty (Table 1). Our contributions include:
Foundations We survey evaluation datasets in encyclopedic knowledge graph completion to motivate a new benchmark (§ 2 and Appendix A).
Data We introduce CODEX, a benchmark consisting of three knowledge graphs varying in size and structure, entity types, multilingual labels and descriptions, and, unique to CODEX, manually verified hard negative triples (§ 3). To better understand CODEX, we analyze the logical relation patterns in each of its datasets (§ 4).
Benchmarking We conduct large-scale model selection and benchmarking experiments, reporting baseline link prediction and triple classification results on CODEX for five widely used embedding models from different architectural classes (§ 5).
Comparative analysis Finally, to demonstrate the unique value of CODEX, we differentiate CODEX from FB15K-237 in terms of both content and difficulty (§ 6). We show that CODEX covers more diverse and interpretable content, and is a more challenging link prediction benchmark.

Table 1: Freebase-based benchmarks versus CODEX.
Scope (domains)
  CODEX: Multi-domain, with focuses on writing, entertainment, music, politics, journalism, academics, and science (§ 6.1 and Appendix E)
Scope (auxiliary data)
  Freebase-based benchmarks: Various decentralized versions of FB15K with, e.g., entity types (Xie et al., 2016), sampled negatives (Socher et al., 2013), and more (Table 8)
  CODEX: Centralized repository of three datasets with entity types, multilingual text, and manually annotated hard negatives (§ 3)
Level of difficulty
  Freebase-based benchmarks: FB15K has severe train/test leakage from inverse relations (Toutanova and Chen, 2015); while removal of inverse relations makes FB15K-237 harder than FB15K, FB15K-237 still has a high proportion of easy-to-predict relational patterns (§ 6.2)
  CODEX: Inverse relations removed from all datasets to avoid train/test leakage (§ 3.2); manually annotated hard negatives for the task of triple classification (§ 3.4); few trivial patterns for the task of link prediction (§ 6.2)

Existing datasets
We begin by surveying existing KGC benchmarks. Table 8 in Appendix A provides an overview of evaluation datasets and tasks on a per-paper basis across the artificial intelligence, machine learning, and natural language processing communities. Because we focus on data rather than models, we only review relevant evaluation benchmarks here. For more on existing KGC models, both neural and symbolic, we refer the reader to Meilicke et al. (2018) and Ji et al. (2020).

Freebase extracts
These datasets, extracted from the Freebase knowledge graph (Bollacker et al., 2008), are the most popular for KGC (see Table 8 in Appendix A).
FB15K was introduced by Bordes et al. (2013). It contains 14,951 entities, 1,345 relations, and 592,213 triples covering several domains, with a strong focus on awards, entertainment, and sports.
FB15K-237 was introduced by Toutanova and Chen (2015) to remedy data leakage in FB15K, which contains many test triples that invert triples in the training set. FB15K-237 contains 14,541 entities, 237 relations, and 310,116 triples. We compare FB15K-237 to CODEX in § 6 to assess each dataset's content and relative difficulty.
NELL-995 (Xiong et al., 2017) was taken from the Never Ending Language Learner (NELL) system (Mitchell et al., 2018), which continuously reads the web to obtain and update its knowledge. NELL-995, a subset of the 995th iteration of NELL, contains 75,492 entities, 200 relations, and 154,213 triples. While NELL-995 is general and covers many domains, NELL's mean average precision was less than 50% around its 1000th iteration (Mitchell et al., 2018). A cursory inspection reveals that many of the triples in NELL-995 are nonsensical or overly generic, suggesting that NELL-995 is not a meaningful dataset for KGC evaluation.
YAGO3-10 (Dettmers et al., 2018) is a subset of YAGO3 (Mahdisoltani et al., 2014), which covers portions of Wikipedia, Wikidata, and WordNet. YAGO3-10 has 123,182 entities, 37 relations, and 1,089,040 triples mostly limited to facts about people and locations. While YAGO3-10 is a high-precision dataset, it was recently shown to be too easy for link prediction because it contains a large proportion of duplicate relations (Akrami et al., 2020; Pezeshkpour et al., 2020).

Domain-specific datasets
In addition to large encyclopedic knowledge graphs, it is common to evaluate KGC methods on at least one smaller, domain-specific dataset, typically drawn from the WordNet semantic network (Miller, 1998;Bordes et al., 2013). Other choices include the Unified Medical Language System (UMLS) database (McCray, 2003), the Alyawarra kinship dataset (Kemp et al., 2006), the Countries dataset (Bouchard et al., 2015), and variants of a synthetic "family tree" (Hinton, 1986). As our focus in this paper is encyclopedic knowledge, we do not cover these datasets further.

Data collection
In this section we describe the pipeline used to construct CODEX. For reference, we define a knowledge graph G as a multi-relational graph consisting of a set of entities E, relations R, and factual statements in the form of (head, relation, tail) triples (h, r, t) ∈ E × R × E.

Seeding the collection
We collected an initial set of triples using a type of snowball sampling (Goodman, 1961). We first manually defined a broad seed set of entity and relation types common to 13 domains: Business, geography, literature, media and entertainment, medicine, music, news, politics, religion, science, sports, travel, and visual art. Examples of seed entity types include airline, journalist, and religious text; corresponding seed relation types in each respective domain include airline alliance, notable works, and language of work or name. Table 9 in Appendix B gives all seed entity and relation types. Using these seeds, we retrieved an initial set of 380,038 entities, 75 relations, and 1,156,222 triples by querying Wikidata for statements of the form (head entity of seed type, seed relation type, ?).

Filtering the collection
To create smaller data snapshots, we filtered the initial 1,156,222 triples to k-cores, which are maximal subgraphs G′ of a given graph G such that every node in G′ has a degree of at least k (Batagelj and Zaveršnik, 2011). We constructed three CODEX datasets (Table 2):
• CODEX-S (k = 15), which has 36k triples. Because of its smaller size, we recommend that CODEX-S be used for model testing and debugging, as well as evaluation of methods that are less computationally efficient (e.g., symbolic search-based approaches).
• CODEX-M (k = 10), which has 206k triples. CODEX-M is all-purpose, being comparable in size to FB15K-237 (§ 2.1), one of the most popular benchmarks for KGC evaluation.
• CODEX-L (k = 5), which has 612k triples. CODEX-L is comparable in size to FB15K (§ 2.1), and can be used for both general evaluation and "few-shot" evaluation.
We also release the raw dump that we collected via snowball sampling, but focus on CODEX-S through L for the remainder of this paper.
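The k-core filtering step can be illustrated with a minimal sketch. It assumes that an entity's degree is the number of triples in which it appears as head or tail (the text above does not specify how multi-edges are counted), and the function name `k_core_triples` is hypothetical rather than part of the released code.

```python
from collections import defaultdict


def k_core_triples(triples, k):
    """Keep only triples whose head and tail both survive k-core peeling.

    Assumption: degree = number of triples an entity participates in.
    """
    triples = set(triples)
    while True:
        degree = defaultdict(int)
        for h, _, t in triples:
            degree[h] += 1
            degree[t] += 1
        # Entities below the degree threshold are peeled away this round.
        weak = {e for e, d in degree.items() if d < k}
        if not weak:
            return triples
        triples = {
            (h, r, t) for h, r, t in triples
            if h not in weak and t not in weak
        }


toy = [("a", "r", "b"), ("b", "r", "c"), ("c", "r", "a"), ("a", "r", "d")]
core2 = k_core_triples(toy, 2)  # the pendant entity "d" is peeled away
core3 = k_core_triples(toy, 3)  # nothing survives at k = 3
```

Peeling is repeated until no entity falls below the threshold, because removing one low-degree entity can push its neighbors below k as well.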
To minimize train/test leakage, we removed inverse relations from each dataset (Toutanova and Chen, 2015). We computed (head, tail) and (tail, head) overlap between all pairs of relations, and removed each relation whose entity pair set overlapped with that of another relation more than 50% of the time. Finally, we split each dataset into 90/5/5 train/validation/test triples such that the validation and test sets contained only entities and relations seen in the respective training sets.
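The overlap test for inverse (and duplicate) relations might look like the following sketch. The normalization by the size of each relation's own pair set and the name `overlapping_relations` are assumptions made for illustration; only the 50% threshold comes from the text.

```python
from collections import defaultdict
from itertools import permutations


def overlapping_relations(triples, threshold=0.5):
    """Flag relations whose (head, tail) pairs overlap with another relation.

    Both direct (head, tail) overlap and inverted (tail, head) overlap are
    checked; the ratio is taken over the flagged relation's own pair set.
    """
    pairs = defaultdict(set)
    for h, r, t in triples:
        pairs[r].add((h, t))
    flagged = set()
    for r1, r2 in permutations(pairs, 2):
        direct = pairs[r1] & pairs[r2]
        inverse = pairs[r1] & {(t, h) for h, t in pairs[r2]}
        overlap = max(len(direct), len(inverse)) / len(pairs[r1])
        if overlap > threshold:
            flagged.add(r1)
    return flagged


toy = [
    ("a", "parent_of", "b"), ("c", "parent_of", "d"),
    ("b", "child_of", "a"), ("d", "child_of", "c"),
    ("a", "born_in", "x"),
]
flagged = overlapping_relations(toy)
```

In this toy graph, `parent_of` and `child_of` are exact inverses, so both are flagged, while `born_in` is kept.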

Auxiliary information
An advantage of Wikidata is that it links entities and relations to various sources of rich auxiliary information. To enable tasks that involve joint learning over knowledge graph structure and such additional information, we collected:
• Entity types for each entity as given by Wikidata's instance of and subclass of relations;
• Wikidata labels and descriptions for entities, relations, and entity types; and
• Wikipedia page extracts (introduction sections) for entities and entity types.
For the latter two, we collected text where available in Arabic, German, English, Spanish, Russian, and Chinese. We chose these languages because they are all relatively well-represented on Wikidata (Kaffee et al., 2017). Table 2 provides the coverage by language for each CODEX dataset.

Hard negatives for evaluation
Knowledge graphs are unique in that they only contain positive statements, meaning that triples not observed in a given knowledge graph are not necessarily false, but merely unseen; this is called the Open World Assumption (Galárraga et al., 2013). However, most machine learning tasks on knowledge graphs require negatives in some capacity. While different negative sampling strategies exist (Cai and Wang, 2018), the most common approach is to randomly perturb observed triples to generate negatives, following Bordes et al. (2013). While random negative sampling is beneficial and even necessary in the case where a large number of negatives is needed (i.e., training), it is not necessarily useful for evaluation. For example, in the task of triple classification, the goal is to discriminate between positive (true) and negative (false) triples. As we show in § 5.5, triple classification over randomly generated negatives is trivially easy for state-of-the-art models because random negatives are generally not meaningful or plausible. Therefore, we generate and manually evaluate hard negatives for KGC evaluation.
Generation To generate hard negatives, we used each pre-trained embedding model from § 5.2 to predict tail entities of triples in CODEX. For each model, we took as candidate negatives the triples (h, r, t̂) for which (i) the type of the predicted tail entity t̂ matched the type of the true tail entity t; (ii) t̂ was ranked in the top-10 predictions by that model; and (iii) (h, r, t̂) was not observed in G.
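The three candidate criteria can be sketched as a simple filter. This sketch assumes one type label per entity and a precomputed ranking of tail predictions; the function and argument names are hypothetical, and the manual annotation step that follows is not modeled.

```python
def candidate_negatives(h, r, true_tail, ranked_tails, entity_type,
                        known_triples, top_k=10):
    """Return candidate hard negatives (h, r, t_hat) for one positive triple.

    ranked_tails: tail entities sorted by model score, best first.
    entity_type: dict mapping entity -> a single type label (assumption).
    known_triples: set of all triples observed in the graph.
    """
    target_type = entity_type[true_tail]
    candidates = []
    for t_hat in ranked_tails[:top_k]:              # (ii) top-k prediction
        same_type = entity_type.get(t_hat) == target_type   # (i) type match
        unseen = (h, r, t_hat) not in known_triples          # (iii) unobserved
        if same_type and unseen:
            candidates.append((h, r, t_hat))
    return candidates


types = {"North America": "continent", "Europe": "continent",
         "Asia": "continent", "Ottawa": "city"}
known = {("Canada", "continent", "North America")}
ranking = ["North America", "Europe", "Ottawa", "Asia"]
cands = candidate_negatives("Canada", "continent", "North America",
                            ranking, types, known)
```

Candidates produced this way are only plausible, not verified; each one still needs a true or false label before it can serve as an evaluation negative.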
Annotation We manually labeled all candidate negative triples generated for CODEX-S and CODEX-M as true or false using the guidelines provided in Appendix C. (We are currently investigating methods for obtaining high-quality crowdsourced annotations of negatives for CODEX-L.) We randomly selected among the triples labeled as false to create validation and test negatives for CODEX-S and CODEX-M, examples of which are given in Table 3. To assess the quality of our annotations, we gathered judgments from two independent native English speakers on a random selection of 100 candidate negatives. The annotators were provided the instructions from Appendix C. On average, our labels agreed with those of the annotators 89.5% of the time. Among the disagreements, 81% of the time we assigned the label true whereas the annotator assigned the label false, meaning that we were comparatively conservative in labeling negatives.

Analysis of relation patterns
To give an idea of the types of reasoning necessary for models to perform well on CODEX, we analyze the presence of learnable binary relation patterns within CODEX. The three main types of such patterns in knowledge graphs are symmetry, inversion, and compositionality (Trouillon et al., 2019; Sun et al., 2019). We address symmetry and compositionality here, and omit inversion because we specifically removed inverse relations to avoid train/test leakage ( § 3.2).

Symmetry
Symmetric relations are relations r for which (h, r, t) ∈ G implies (t, r, h) ∈ G. For each relation, we compute the number of its (head, tail) pairs that overlap with its (tail, head) pairs, divided by the total number of pairs, and take those with 50% overlap or higher as symmetric. CODEX datasets have five such relations: diplomatic relation, shares border with, sibling, spouse, and unmarried partner. Table 4 gives the proportion of triples containing symmetric relations per dataset. Symmetric patterns are more prevalent in CODEX-S, whereas the larger datasets are mostly antisymmetric, i.e., (h, r, t) ∈ G implies (t, r, h) ∉ G.
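As a minimal sketch of the symmetry test described above, the overlap ratio can be computed per relation as follows. The function names are hypothetical; the 50% threshold is the one stated in the text.

```python
from collections import defaultdict


def symmetry_ratio(triples):
    """Fraction of each relation's (head, tail) pairs whose reverse also occurs."""
    pairs = defaultdict(set)
    for h, r, t in triples:
        pairs[r].add((h, t))
    return {
        r: sum((t, h) in p for h, t in p) / len(p)
        for r, p in pairs.items()
    }


def symmetric_relations(triples, threshold=0.5):
    """Relations whose reverse-pair overlap is at least the threshold."""
    return {r for r, ratio in symmetry_ratio(triples).items() if ratio >= threshold}


toy = [
    ("a", "spouse", "b"), ("b", "spouse", "a"), ("c", "spouse", "d"),
    ("a", "occupation", "x"),
]
ratios = symmetry_ratio(toy)
symmetric = symmetric_relations(toy)
```

Here two of the three `spouse` pairs have their reverse in the graph, so the relation clears the threshold even though the graph is incomplete.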

Composition
Compositionality captures path rules of the form $(h, r_1, x_1), \ldots, (x_n, r_n, t) \rightarrow (h, r, t)$. To learn these rules, models must be capable of "multi-hop" reasoning on knowledge graphs (Guu et al., 2015). To identify compositional paths of lengths two and three, we use the AMIE3 system (Lajus et al., 2020), which outputs rules with confidence scores that capture how many times those rules are seen versus violated; we omit longer paths as they are relatively costly to compute. We identify 26, 44, and 93 rules in CODEX-S, CODEX-M, and CODEX-L, respectively, with average confidence (out of 1) of 0.630, 0.556, and 0.459. Table 4 gives the percentage of triples per dataset participating in a discovered rule.
Evidently, composition is especially prevalent in CODEX-L. An example rule in CODEX-L is "if X was founded by Y, and Y's country of citizenship is Z, then the country [i.e., of origin] of X is Z" (confidence 0.709). We release these rules as part of CODEX for further development of KGC methodologies that incorporate or learn rules.
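To make the notion of rule confidence concrete, the sketch below computes a simple "seen versus violated" ratio for a length-two path rule. This is an illustrative assumption: AMIE3 reports more than one confidence measure, and this ratio is not necessarily the one behind the numbers above. The function name and toy relation names are hypothetical.

```python
from collections import defaultdict


def path_rule_confidence(triples, r1, r2, r_head):
    """Confidence of the rule (h, r1, x), (x, r2, t) -> (h, r_head, t).

    Computed as: body matches whose head triple is in the graph,
    divided by all body matches.
    """
    triples = set(triples)
    tails_of = defaultdict(set)  # (relation, head) -> set of tails
    for h, r, t in triples:
        tails_of[(r, h)].add(t)
    body_pairs = set()
    for h, r, x in triples:
        if r == r1:
            for t in tails_of.get((r2, x), ()):
                body_pairs.add((h, t))
    if not body_pairs:
        return 0.0
    support = sum((h, r_head, t) in triples for h, t in body_pairs)
    return support / len(body_pairs)


toy = [
    ("acme", "founded_by", "ann"), ("ann", "citizenship", "US"),
    ("acme", "country", "US"),
    ("beta", "founded_by", "bob"), ("bob", "citizenship", "FR"),
]
conf = path_rule_confidence(toy, "founded_by", "citizenship", "country")
```

In the toy graph the rule body matches twice but the head triple is present only once, giving a confidence of 0.5.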

Benchmarking
Next, we benchmark performance on CODEX for the tasks of link prediction and triple classification. To ensure that models are fairly and accurately compared, we follow Ruffinelli et al. (2020), who conducted what is (to the best of our knowledge) the largest-scale hyperparameter tuning study of knowledge graph embeddings to date.
Note that CODEX can be used to evaluate any type of KGC method. However, we focus on embeddings in this section due to their widespread usage in modern NLP (Ji et al., 2020).

Tasks
Link prediction The link prediction task is conducted as follows: Given a test triple (h, r, t), we construct queries (?, r, t) and (h, r, ?). For each query, a model scores candidate head (tail) entities ĥ (t̂) according to its belief that ĥ (t̂) completes the triple (i.e., answers the query). The goal of link prediction is to rank true triples (ĥ, r, t) or (h, r, t̂) higher than false and unseen triples.
Link prediction performance is evaluated with mean reciprocal rank (MRR) and hits@k. MRR is the average reciprocal of each ground-truth entity's rank over all (?, r, t) and (h, r, ?) test queries. Hits@k measures the proportion of test queries for which the ground-truth entity is ranked in the top-k predicted entities. In computing these metrics, we exclude the predicted entities for which (ĥ, r, t) ∈ G or (h, r, t̂) ∈ G so that known positive triples do not artificially lower ranking scores. This is called "filtering" (Bordes et al., 2013).
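The filtered ranking metrics can be sketched in a few lines. For simplicity this sketch breaks ties optimistically (only strictly higher scores count against the ground truth), which is an assumption and may differ from a given framework's tie-breaking policy; all function names are hypothetical.

```python
def filtered_rank(scores, true_entity, known_answers):
    """1-based rank of true_entity, ignoring other known correct answers.

    scores: dict mapping candidate entity -> model score (higher is better).
    known_answers: other entities that complete the query with a triple in G.
    """
    true_score = scores[true_entity]
    better = sum(
        1 for e, s in scores.items()
        if s > true_score and e != true_entity and e not in known_answers
    )
    return 1 + better


def mrr(ranks):
    """Mean reciprocal rank over a list of 1-based ranks."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Proportion of queries whose ground truth is ranked in the top k."""
    return sum(r <= k for r in ranks) / len(ranks)


scores = {"a": 0.9, "b": 0.8, "c": 0.7, "d": 0.1}
raw = filtered_rank(scores, "c", known_answers=set())
filtered = filtered_rank(scores, "c", known_answers={"a"})
```

In the example, entity "a" is itself a known correct answer, so filtering it out improves the rank of "c" from 3 to 2.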
Triple classification Given a triple (h, r, t), the goal of triple classification is to predict a corresponding label y ∈ {−1, 1}. Since knowledge graph embedding models output real-valued scores for triples, we convert these scores into labels by selecting a decision threshold per relation on the validation set such that validation accuracy is maximized for the model in question. A similar approach was used by Socher et al. (2013). We compare results on three sets of evaluation negatives: (1) We generate one negative per positive by replacing the positive triple's tail entity with a tail entity t′ sampled uniformly at random; (2) We generate negatives by sampling tail entities according to their relative frequency in the tail slot of all triples; and (3) We use the CODEX hard negatives. We measure accuracy and F1 score.
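Per-relation threshold selection can be sketched as an exhaustive search over the observed validation scores. The candidate set (each distinct score, plus one value above the maximum) and the function names are assumptions for illustration, not the released implementation.

```python
def best_threshold(scores, labels):
    """Pick the threshold maximizing accuracy; predict 1 if score >= threshold.

    labels are in {-1, 1}. Returns (threshold, accuracy).
    """
    candidates = sorted(set(scores))
    candidates.append(candidates[-1] + 1.0)  # allows "predict all negative"
    best_thr, best_acc = None, -1.0
    for thr in candidates:
        correct = sum((1 if s >= thr else -1) == y
                      for s, y in zip(scores, labels))
        acc = correct / len(scores)
        if acc > best_acc:
            best_thr, best_acc = thr, acc
    return best_thr, best_acc


def fit_thresholds(validation):
    """validation: dict relation -> (scores, labels); returns relation -> threshold."""
    return {r: best_threshold(s, y)[0] for r, (s, y) in validation.items()}


def classify(score, relation, thresholds):
    return 1 if score >= thresholds[relation] else -1


validation = {
    "spouse": ([0.9, 0.8, 0.3, 0.1], [1, 1, -1, -1]),
    "occupation": ([2.0, 1.5, 1.0], [1, -1, -1]),
}
thresholds = fit_thresholds(validation)
```

Fitting one threshold per relation matters because score scales can differ widely across relations, as the two toy relations here show.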

Models
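We compare the following embedding methods: RESCAL (Nickel et al., 2011), TransE (Bordes et al., 2013), ComplEx (Trouillon et al., 2016), ConvE (Dettmers et al., 2018), and TuckER (Balazevic et al., 2019b). These models represent several classes of architecture, from linear (RESCAL, TuckER, ComplEx) to translational (TransE) to nonlinear/learned (ConvE). Appendix D provides more specifics on each model.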

Model selection
As recent studies have observed that training strategies are equally, if not more, important than architecture for link prediction (Kadlec et al., 2017; Lacroix et al., 2018; Ruffinelli et al., 2020), we search across a large range of hyperparameters to ensure a truly fair comparison. To this end we use the PyTorch-based LibKGE framework for training and selecting knowledge graph embeddings. In the remainder of this section we outline the most important parameters of our model selection process.
Training negatives Given a set of positive training triples {(h, r, t)}, we compare three types of negative sampling strategy implemented by LibKGE: (a) NegSamp, or randomly corrupting head entities h or tail entities t to create negatives; (b) 1vsAll, or treating all possible head/tail corruptions of (h, r, t) as negatives, including the corruptions that are actually positives; and (c) KvsAll, or treating batches of head/tail corruptions not seen in the knowledge graph as negatives.
Loss functions We consider the following loss functions: (i) MR or margin ranking, which aims to maximize a margin between positive and negative triples; (ii) BCE or binary cross-entropy, which is computed by applying the logistic sigmoid to triple scores; and (iii) CE or cross-entropy between the softmax over the entire distribution of triple scores and the label distribution over all triples, normalized to sum to one.
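The three loss families can be written out for a single training example as in the sketch below. It is a simplified illustration in plain Python: the CE variant assumes a single correct answer (a one-hot label distribution), whereas the text describes a label distribution normalized over all true triples, and none of this is the LibKGE implementation.

```python
import math


def _softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def margin_ranking_loss(pos_score, neg_score, margin=1.0):
    """MR: penalize negatives scored within `margin` of the positive."""
    return max(0.0, margin - pos_score + neg_score)


def bce_loss(score, label):
    """BCE on sigmoid(score); label is 1 for positives and 0 for negatives."""
    return label * _softplus(-score) + (1 - label) * _softplus(score)


def ce_loss(scores, true_index):
    """CE between softmax(scores) and a one-hot label at true_index."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_z - scores[true_index]


mr_satisfied = margin_ranking_loss(2.0, 0.5)  # margin already met
mr_violated = margin_ranking_loss(0.5, 1.0)   # negative outscores positive
```

MR compares one positive against one negative at a time, BCE treats each triple as an independent binary decision, and CE normalizes over all candidates for a query at once.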
Search strategies We select models using the Ax platform, which supports hyperparameter search using both quasi-random sequences of generated configurations and Bayesian optimization (BO) with Gaussian processes. At a high level, for each dataset and model, we generate both quasi-random and BO trials per negative sampling and loss function combination, ensuring that we search over a wide range of hyperparameters for different types of training strategy. Appendix F provides specific details on the search strategy for each dataset, which was determined according to resource constraints and observed performance patterns.

Link prediction results
TuckER is strongest at modeling compositional relations, so it performs best on CODEX-L, which has a high degree of compositionality. For example, on the most frequent compositional relation in CODEX-L (languages spoken, written, or signed), TuckER achieves 0.465 MRR, compared to 0.464 for RESCAL, 0.463 for ConvE, 0.456 for ComplEx, and 0.385 for TransE. By contrast, since CODEX-M is mostly asymmetric and non-compositional, ComplEx performs best because of its ability to model asymmetry.
As shown in Figure 1, hyperparameters have a strong impact on link prediction performance: validation MRR for all models varies by over 30 percentage points depending on the training strategy and input configuration. This finding is consistent with previous observations in the literature (Kadlec et al., 2017; Ruffinelli et al., 2020). Appendix F provides the best configurations for each model.
Overall, we find that the choice of loss function in particular significantly impacts model performance. Each model consistently achieved its respective peak performance with cross-entropy (CE) loss, a finding that is corroborated by several other KGC comparison papers (Kadlec et al., 2017; Ruffinelli et al., 2020; Jain et al., 2020).
As for negative sampling techniques, we do not find that a single strategy is dominant, suggesting that the choice of loss function is more important.

Triple classification results
Table 6 gives triple classification results. Evidently, triple classification on randomly generated negatives is a nearly solved task. On negatives generated uniformly at random, performance scores are nearly identical at almost 100% accuracy. Even with a negative sampling strategy "smarter" than uniform random, all models perform well.
Hard negatives Classification performance degrades considerably on our hard negatives, dropping by around 8 to 11 percentage points compared to relative frequency-based sampling and by 13 to 19 percentage points compared to uniformly random sampling. Relative performance also varies: In contrast to our link prediction task in which ComplEx and TuckER were by far the strongest models, RESCAL is slightly stronger on the CODEX-S hard negatives, whereas ConvE performs best on the CODEX-M hard negatives. These results indicate that triple classification is indeed a distinct task that requires different architectures and, in many cases, different training strategies (Appendix F).
We believe that few recent works use triple classification as an evaluation task because of the lack of true hard negatives in existing benchmarks. Early works reported high triple classification accuracy on sampled negatives (Socher et al., 2013;Wang et al., 2014), perhaps leading the community to believe that the task was nearly solved. However, our results demonstrate that the task is far from solved when the negatives are plausible but truly false.

Comparative case study
Finally, we conduct a comparative analysis between CODEX-M and FB15K-237 ( § 2.1) to demonstrate the unique value of CODEX. We choose FB15K-237 because it is the most popular encyclopedic KGC benchmark after FB15K, which was already shown to be an easy dataset by Toutanova and Chen (2015). We choose CODEX-M because it is the closest in size to FB15K-237.

Content
We first compare the content in CODEX-M, which is extracted from Wikidata, with that of FB15K-237, which is extracted from Freebase. For brevity, Figure 2 compares the top-15 relations by mention count in the two datasets. Appendix E provides more content comparisons.
Diversity The most common relation in CODEX-M is occupation, because most people on Wikidata have multiple occupations listed. By contrast, the frequent relations in FB15K-237 are mostly related to awards and film. In fact, over 25% of all triples in FB15K-237 belong to the /award relation domain, suggesting that CODEX covers a more diverse selection of content.
Interpretability The Freebase-style relations are also arguably less interpretable than those in Wikidata. Whereas Wikidata relations have concise natural language labels, the Freebase relation labels are hierarchical, often at five or six levels of hierarchy (Figure 2). Moreover, all relations in Wikidata are binary, whereas some Freebase relations are n-ary (Tanon et al., 2016), meaning that they connect more than two entities. The relations containing a dot (".") are such n-ary relations, and are difficult to reason about without understanding the structure of Freebase, which has been deprecated. We further discuss the impact of such n-ary relations for link prediction in the following section.

Difficulty
Next, we compare the datasets in a link prediction task to show that CODEX-M is more difficult.
Baseline We devise a "non-learning" link prediction baseline. Let (h, r, ?) be a test query. Our baseline scores candidate tail entities by their relative frequency in the tail slot of all training triples mentioning r, filtering out tail entities t for which (h, r, t) is already observed in the training set. If all tail entities t are filtered out, we score entities by frequency before filtering. The logic of our approach works in reverse for (?, r, t) queries. In evaluating our baseline, we follow LibKGE's protocol for breaking ties in ranking (i.e., for entities that appear with equal frequency) by taking the mean rank of all entities with the same score.
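The frequency baseline for (h, r, ?) queries can be sketched as follows. The function names are hypothetical, and the mean-rank helper only ranks entities that received a score, which is a simplification of a full evaluation over all entities.

```python
from collections import Counter, defaultdict


def build_tail_baseline(train_triples):
    """Count tail frequencies per relation and record known tails per (h, r)."""
    tail_counts = defaultdict(Counter)
    known_tails = defaultdict(set)
    for h, r, t in train_triples:
        tail_counts[r][t] += 1
        known_tails[(h, r)].add(t)
    return tail_counts, known_tails


def score_tails(h, r, tail_counts, known_tails):
    """Score tails by relative frequency, filtering tails already seen with (h, r)."""
    counts = tail_counts[r]
    total = sum(counts.values())
    seen = known_tails[(h, r)]
    scores = {t: c / total for t, c in counts.items() if t not in seen}
    if not scores:  # everything was filtered: fall back to unfiltered frequencies
        scores = {t: c / total for t, c in counts.items()}
    return scores


def mean_rank(scores, true_entity):
    """Rank with ties resolved as the mean rank of all equally scored entities."""
    s = scores[true_entity]
    higher = sum(1 for v in scores.values() if v > s)
    ties = sum(1 for v in scores.values() if v == s)  # includes true_entity
    return higher + (ties + 1) / 2


train = [
    ("p1", "occupation", "actor"), ("p2", "occupation", "actor"),
    ("p3", "occupation", "actor"), ("p1", "occupation", "singer"),
    ("p4", "occupation", "writer"),
]
tail_counts, known_tails = build_tail_baseline(train)
p4_scores = score_tails("p4", "occupation", tail_counts, known_tails)
```

The same logic applies in reverse for (?, r, t) queries by counting head frequencies per relation instead.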
Setup We compare our baseline to the best pretrained embedding model per dataset: RESCAL for FB15K-237, which was released by Ruffinelli et al. (2020), and ComplEx for CODEX-M. We evaluate performance with MRR and Hits@10. Beyond overall performance, we also compute per-relation improvement of the respective embedding over our baseline in terms of percentage points MRR. This measures the amount of learning beyond frequency statistics necessary for each relation.

Results and discussion
Table 7 compares the overall performance of our baseline versus the best embedding per dataset, and Figure 3 shows the improvement of the respective embedding over our baseline per relation type on each dataset. The improvement of the embedding is much smaller on FB15K-237 than on CODEX-M, and in fact our baseline performs on par with or even outperforms the embedding on FB15K-237 for some relation types.
To further explore these cases, Figure 4 gives the empirical cumulative distribution function of improvement, which shows the percentage of test triples for which the level of improvement is less than or equal to a given value on each dataset. Surprisingly, the improvement for both MRR and Hits@10 is less than five percentage points for nearly 40% of FB15K-237's test set, and is zero or negative 15% of the time. By contrast, our baseline is significantly weaker than the strongest embedding method on CODEX-M.
The disparity in improvement is due to two relation patterns prevalent in FB15K-237:
We conclude that while FB15K-237 is a valuable dataset, CODEX is more appropriately difficult for link prediction. Additionally, we note that in FB15K-237, all validation and test triples containing entity pairs directly linked in the training set were deleted (Toutanova and Chen, 2015), meaning that symmetry cannot be tested for in FB15K-237. Given that CODEX datasets contain both symmetry and compositionality, CODEX is more suitable for assessing how well models can learn relation patterns that go beyond frequency.

Conclusion and outlook
We present CODEX, a set of knowledge graph COmpletion Datasets EXtracted from Wikidata and Wikipedia, and show that CODEX is suitable for multiple KGC tasks. We release data, code, and pretrained models for use by the community at https://bit.ly/2EPbrJs. Some promising future directions on CODEX include:
• Better model understanding CODEX can be used to analyze the impact of hyperparameters, training strategies, and model architectures in KGC tasks.
• Revival of triple classification We encourage the use of triple classification on CODEX in addition to link prediction because it directly tests discriminative power.
• Fusing text and structure Including text in both the link prediction and triple classification tasks should substantially improve performance (Toutanova et al., 2015). Furthermore, text can be used for few-shot link prediction, an emerging research direction (Xiong et al., 2017;Shi and Weninger, 2017).
Overall, we hope that CODEX will provide a boost to research in KGC, which will in turn impact many other fields of artificial intelligence.

Table 8 provides an overview of knowledge graph embedding papers with respect to datasets and evaluation tasks. In our review, we only consider papers published between 2014 and 2020 in the main proceedings of conferences where KGC embedding papers are most likely to appear: Artificial intelligence (AAAI, IJCAI), machine learning (ICML, ICLR, NeurIPS), and natural language processing (ACL, EMNLP, NAACL). The main evaluation benchmarks are FB15K

Table 9 provides all seed entity and relation types used to collect CODEX. Each type is given first by its natural language label and then by its Wikidata unique ID: Entity IDs begin with Q, whereas relation (property) IDs begin with P. For the entity types that apply to people (e.g., actor, musician, journalist), we retrieved seed entities by querying Wikidata using the occupation relation. For the entity types that apply to things (e.g., airline, disease, tourist attraction), we retrieved seed entities by querying Wikidata using the instance of and subclass of relations.

C Negative annotation guidelines
We provide the annotation guidelines we used to label candidate negative triples ( § 3.4).
Task You must label each triple as either true or false. To help you find the answer, we have provided you with Wikipedia and Wikidata links for the entities and relations in each triple. You may also search on Google for the answer, although most claims should be resolvable using Wikipedia and Wikidata alone. If you are not able to find any reliable, specific, clear information supporting the claim, choose false. You may explain your reasoning if need be or provide sources to back up your answer in the optional explanation column.
Examples False triples may have problems with grammar, factual content, or both. Examples of grammatically incorrect triples are those whose entity or relation types do not make sense, for example:
• (United States of America, continent, science fiction writer)
• (Mohandas Karamchand Gandhi, medical condition, British Raj)
• (Canada, foundational text, Vietnamese cuisine)
Examples of grammatically correct but factually false triples include:
• (United States of America, continent, Europe)
• (Mohandas Karamchand Gandhi, country of citizenship, Argentina)
• (Canada, foundational text, Harry Potter and the Goblet of Fire)
• (Alexander Pushkin, influenced by, Leo Tolstoy). Pushkin died only a few years after Tolstoy was born, so this statement is unlikely.
Notice that in the latter examples, the entity types match up, but the statements are still false.
Tips For triples about people's occupation and genre, try to be as specific as possible. For example, if the triple says (<person>, occupation, guitarist) but that person is mainly known for their singing, choose false, even if that person plays the guitar. Likewise, if a triple says (<person>, genre, classical) but they are mostly known for jazz music, choose false even if, for example, that person had classical training in their childhood.

D Embedding models
We briefly overview the five models compared in our link prediction and triple classification tasks.
RESCAL (Nickel et al., 2011) was one of the first knowledge graph embedding models. Although it is not often used as a baseline, Ruffinelli et al. (2020) showed that it is competitive when appropriately tuned. RESCAL treats relational learning as tensor decomposition, scoring entity embeddings $\mathbf{h}, \mathbf{t} \in \mathbb{R}^{d_e}$ and relation embeddings $\mathbf{R} \in \mathbb{R}^{d_e \times d_e}$ with the bilinear form $\mathbf{h}^\top \mathbf{R} \mathbf{t}$.
TransE (Bordes et al., 2013) treats relations as translations between entities, i.e., $\mathbf{h} + \mathbf{r} \approx \mathbf{t}$ for $\mathbf{h}, \mathbf{r}, \mathbf{t} \in \mathbb{R}^{d_e}$, and scores embeddings with the negative Euclidean distance $-\lVert \mathbf{h} + \mathbf{r} - \mathbf{t} \rVert$. TransE is likely the most popular baseline for KGC tasks and the most influential of all KGC embedding papers.
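The two scoring functions above can be written down in a few lines. The following is a minimal NumPy sketch for a single triple, not the LibKGE implementation used in our experiments; the embedding size and the random vectors are arbitrary values chosen for illustration.

```python
import numpy as np

def rescal_score(h, R, t):
    """RESCAL: bilinear form h^T R t with a full relation matrix R."""
    return float(h @ R @ t)

def transe_score(h, r, t):
    """TransE: negative Euclidean distance between h + r and t."""
    return float(-np.linalg.norm(h + r - t))

rng = np.random.default_rng(0)
d_e = 4  # embedding size, arbitrary for this illustration
h, r, t = rng.normal(size=(3, d_e))  # one row each for head, relation, tail
R = rng.normal(size=(d_e, d_e))      # RESCAL relation matrix

print(rescal_score(h, R, t))
print(transe_score(h, r, t))
# A tail that is an exact translation of the head (t = h + r) reaches
# the largest possible TransE score, which is zero.
print(transe_score(h, r, h + r))
```

Note that TransE scores are never positive, while RESCAL scores are unbounded in both directions.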
ComplEx (Trouillon et al., 2016) uses a bilinear function to score triples with a diagonal relation embedding matrix and complex-valued embeddings. Its scoring function is $\mathrm{Re}(\mathbf{h}^\top \mathrm{diag}(\mathbf{r})\, \overline{\mathbf{t}})$, where $\overline{\mathbf{t}}$ is the complex conjugate of $\mathbf{t}$ and $\mathrm{Re}$ denotes the real part of a complex number.
ConvE (Dettmers et al., 2018) is one of the first and most popular nonlinear models for KGC. It concatenates head and relation embeddings $\mathbf{h}$ and $\mathbf{r}$ into a two-dimensional "image", applies a pointwise nonlinearity over convolutional and fully connected layers, and multiplies the result with the tail embedding $\mathbf{t}$ to obtain a score. Formally, its scoring function is given as $f(\mathrm{vec}(f([\mathbf{h}; \mathbf{r}] * \omega))\mathbf{W})\mathbf{t}$, where $f$ is a nonlinearity (originally, ReLU), $[\mathbf{h}; \mathbf{r}]$ denotes a concatenation and two-dimensional reshaping of the head and relation embeddings, $\omega$ denotes the filters of the convolutional layer, and $\mathrm{vec}$ denotes the flattening of a two-dimensional matrix.
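As a rough illustration of these two scoring functions, the sketch below evaluates them for one triple in NumPy. It is a simplification under stated assumptions: the ConvE part omits the biases, dropout, and batch normalization of the original model, uses a naive loop for the convolution, and the helper names, embedding size, image shape, and filter sizes are all hypothetical choices made for this example.

```python
import numpy as np

def complex_score(h, r, t):
    """ComplEx: Re(h^T diag(r) conj(t)) for complex-valued vectors."""
    return float(np.real(np.sum(h * r * np.conj(t))))

def conv2d_valid(image, filters):
    """Naive 'valid' 2D cross-correlation of one image with a filter bank."""
    n_f, k_h, k_w = filters.shape
    out_h = image.shape[0] - k_h + 1
    out_w = image.shape[1] - k_w + 1
    out = np.empty((n_f, out_h, out_w))
    for f in range(n_f):
        for i in range(out_h):
            for j in range(out_w):
                out[f, i, j] = np.sum(image[i:i + k_h, j:j + k_w] * filters[f])
    return out

def conve_score(h, r, t, filters, W, shape):
    """ConvE: f(vec(f([h; r] * filters)) W) t with f = ReLU."""
    # Reshape both embeddings to 2D and stack them into one "image".
    image = np.concatenate([h.reshape(shape), r.reshape(shape)], axis=0)
    feature_maps = np.maximum(conv2d_valid(image, filters), 0.0)
    # Flatten, project back to the embedding size, apply ReLU again.
    hidden = np.maximum(feature_maps.reshape(-1) @ W, 0.0)
    return float(hidden @ t)

rng = np.random.default_rng(0)
d_e, shape = 6, (2, 3)  # illustrative sizes; 2 * 3 must equal d_e

# ComplEx with complex-valued embeddings
hc, rc, tc = rng.normal(size=(3, d_e)) + 1j * rng.normal(size=(3, d_e))
print(complex_score(hc, rc, tc))

# ConvE with real-valued embeddings, two 2x2 filters on a 4x3 image
h, r, t = rng.normal(size=(3, d_e))
filters = rng.normal(size=(2, 2, 2))
W = rng.normal(size=(2 * 3 * 2, d_e))  # flattened feature maps -> d_e
print(conve_score(h, r, t, filters, W, shape))
```

The complex conjugate is what lets ComplEx model asymmetric relations: if the relation embedding is purely real, swapping head and tail leaves the score unchanged.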
TuckER (Balazevic et al., 2019b) is a linear model based on the Tucker tensor decomposition, which factorizes a tensor into three lower-rank matrices and a core tensor. The TuckER scoring function for a single triple $(h, r, t)$ is given as $\mathcal{W} \times_1 \mathbf{h} \times_2 \mathbf{r} \times_3 \mathbf{t}$, where $\mathcal{W}$ is the mode-three core tensor that is shared among all entity and relation embeddings, and $\times_n$ denotes the tensor product along the $n$th mode of the tensor. TuckER can be seen as a generalized form of other linear KGC embedding models like RESCAL and ComplEx.
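The TuckER score is a contraction of the core tensor with one vector per mode, which can be sketched with `numpy.einsum`. The example also shows one way to see the RESCAL case: if the relation embedding is a one-hot vector and each relation slice of the core tensor holds a RESCAL relation matrix, the two scores coincide. The sizes and random values are illustrative assumptions, not settings from our experiments.

```python
import numpy as np

def tucker_score(W, h, r, t):
    """TuckER: contract mode 1 of W with h, mode 2 with r, mode 3 with t."""
    return float(np.einsum("ijk,i,j,k->", W, h, r, t))

rng = np.random.default_rng(0)
d_e, n_rel = 4, 3  # illustrative sizes
h, t = rng.normal(size=(2, d_e))

# One RESCAL relation matrix per relation, shape (n_rel, d_e, d_e).
relation_matrices = rng.normal(size=(n_rel, d_e, d_e))

# Reorder axes so the core tensor has modes (head, relation, tail).
W = np.transpose(relation_matrices, (1, 0, 2))
r_onehot = np.eye(n_rel)[1]  # one-hot "embedding" selecting relation 1

tucker = tucker_score(W, h, r_onehot, t)
rescal = float(h @ relation_matrices[1] @ t)
print(tucker, rescal)
```

In practice TuckER learns dense relation embeddings and a smaller core tensor, so parameters are shared across relations instead of being stored in one matrix per relation.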

E Content comparison
We provide an additional comparison of the contents of CODEX-M and FB15K-237. Figure 5, which plots the top-30 entities by frequency in the two benchmarks, demonstrates that both datasets are biased toward developed Western countries and cultures. However, CODEX-M is more diverse in domain. It covers academia, entertainment, journalism, politics, science, and writing, whereas FB15K-237 covers mostly entertainment and sports. FB15K-237 is also much more biased toward the United States in particular, as five of its top-30 entities are specific to the US: United States of America, United States dollar, New York City, Los Angeles, and the United States Department of Housing and Urban Development. Figure 6 compares the top-15 entity types in CODEX-M and FB15K-237. Again, CODEX-M is diverse, covering people, places, organizations, movies, and abstract concepts, whereas FB15K-237 has many overlapping entity types mostly about entertainment.

F Hyperparameter search
Table 10 gives our hyperparameter search space. Tables 11, 12, and 13 report the best hyperparameter configurations for link prediction on CODEX-S, CODEX-M, and CODEX-L, respectively. Tables 14 and 15 report the best hyperparameter configurations for triple classification on the hard negatives in CODEX-S and CODEX-M, respectively.
Terminology For embedding initialization, Xv refers to Xavier initialization (Glorot and Bengio, 2010). The reciprocal relations model refers to learning separate relation embeddings for queries in the direction of (h, r, ?) versus (?, r, t) (Kazemi and Poole, 2018). The frequency weighting regularization technique refers to regularizing embeddings by the relative frequency of the corresponding entity or relation in the training data.
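To make the reciprocal relations idea concrete, the sketch below keeps two embeddings per relation and picks one according to the query direction. It uses a TransE-style score purely for illustration and treats a (?, r, t) query as a tail-style query that starts from t with the second embedding of r, which is a common way to implement the technique and an assumption here, not a description of the LibKGE code. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
n_ent, n_rel, d_e = 5, 2, 4  # illustrative sizes
E = rng.normal(size=(n_ent, d_e))

# Two embeddings per relation: rows [0, n_rel) answer (h, r, ?) queries,
# rows [n_rel, 2 * n_rel) answer (?, r, t) queries.
R = rng.normal(size=(2 * n_rel, d_e))

def score_all(known_entity, rel_row):
    """TransE-style scores of (known_entity, relation) against every entity."""
    return -np.linalg.norm(E[known_entity] + R[rel_row] - E, axis=1)

def predict_tails(h, r):
    """Scores for (h, r, ?) using the forward embedding of r."""
    return score_all(h, r)

def predict_heads(r, t):
    """Scores for (?, r, t) using the separate reciprocal embedding of r."""
    return score_all(t, r + n_rel)

print(predict_tails(0, 1))
print(predict_heads(1, 0))
```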
Search strategies Recall that we select models using Ax, which supports hyperparameter search using both quasi-random sequences of generated configurations and Bayesian optimization (BO). The search strategy for each CODEX dataset is as follows:
• CODEX-S: Per negative sampling type/loss combination, we generate 30 quasi-random trials followed by 10 BO trials. We select the best-performing model by validation MRR over all such combinations. In each trial, the model is trained for a maximum of 400 epochs with an early stopping patience of 5. We also terminate a trial after 50 epochs if the model does not reach ≥ 0.05 MRR.
• CODEX-M: Per negative sampling type/loss combination, we generate 20 quasi-random trials. The maximum number of epochs and early stopping criteria are the same as for CODEX-S.
• CODEX-L: Per negative sampling type/loss combination, we generate 10 quasi-random trials of 20 training epochs instead of 400. We reduce the number of epochs to limit resource usage. In most cases, MRR plateaus after 20-30 epochs, an observation that is consistent with Ruffinelli et al. (2020). Then, we take the best-performing model by validation MRR over all such combinations, and retrain that model for a maximum of 400 epochs.
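The per-trial stopping rules described above for CODEX-S and CODEX-M can be sketched as a small function. This is an illustration under assumptions, not the LibKGE logic: it assumes one validation MRR value per epoch, counts patience in consecutive validations without improvement, and reads the 50-epoch rule as a check on the best validation MRR seen so far. The function name and the toy MRR curves are hypothetical.

```python
def epochs_run(val_mrr_per_epoch, max_epochs=400, patience=5,
               check_epoch=50, min_mrr=0.05):
    """Return the number of epochs a trial runs before it stops."""
    best, since_best = float("-inf"), 0
    for epoch, mrr in enumerate(val_mrr_per_epoch[:max_epochs], start=1):
        if mrr > best:
            best, since_best = mrr, 0
        else:
            since_best += 1
        # Early stopping: no improvement for `patience` validations.
        if since_best >= patience:
            return epoch
        # Unpromising trial: still below the MRR floor at the check epoch.
        if epoch == check_epoch and best < min_mrr:
            return epoch
    return min(len(val_mrr_per_epoch), max_epochs)

# Toy curve that improves for 10 epochs and then stays flat.
plateau = [i / 100 for i in range(1, 11)] + [0.10] * 20
print(epochs_run(plateau))

# Toy curve that keeps improving but never gets near an MRR of 0.05.
too_slow = [0.0001 * i for i in range(1, 101)]
print(epochs_run(too_slow))
```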
Note that we search using MRR as our metric, but the triple classification task measures 0/1 accuracy, not ranking performance. For triple classification, we choose the model with the highest validation accuracy among the pretrained models across all negative sampling type/loss function combinations. We release all pretrained LibKGE models and accompanying configuration files in the centralized CODEX repository.
Table 10: Our hyperparameter search space. We follow the naming conventions and ranges given by Ruffinelli et al. (2020), and explain the meanings of selected hyperparameter settings in Appendix F. As most KGC embedding models have a wide range of configuration options, we encourage future work to follow this tabular scheme for transparent reporting of implementation details.