Can We Predict New Facts with Open Knowledge Graph Embeddings? A Benchmark for Open Link Prediction

Open Information Extraction systems extract ("subject text", "relation text", "object text") triples from raw text. Some triples are textual versions of facts, i.e., non-canonicalized mentions of entities and relations. In this paper, we investigate whether it is possible to infer new facts directly from the open knowledge graph without any canonicalization or any supervision from curated knowledge. For this purpose, we propose the open link prediction task, i.e., predicting test facts by completing ("subject text", "relation text", ?) questions. An evaluation in such a setup raises the question of whether a correct prediction is actually a new fact that was induced by reasoning over the open knowledge graph or whether it can be trivially explained. For example, facts can appear in different paraphrased textual variants, which can lead to test leakage. To this end, we propose an evaluation protocol and a methodology for creating the open link prediction benchmark OLPBENCH. We performed experiments with a prototypical knowledge graph embedding model for open link prediction. While the task is very challenging, our results suggest that it is possible to predict genuinely new facts that cannot be trivially explained.


Introduction
A knowledge graph (KG) (Hayes-Roth, 1983) is a set of (subject, relation, object)-triples, where the subject and object correspond to vertices, and relations to labeled edges. In curated KGs, each triple is fully disambiguated against a fixed vocabulary of entities and relations.
An application for KGs, for example, is the problem of drug discovery based on bio-medical knowledge (Mohamed et al., 2019). The construction of a curated bio-medical KG, which is required for such an approach, is challenging and constrained by the available amount of human effort and domain expertise. Many tools that could assist humans in KG construction (e.g., an entity linker) need a KG to begin with. Moreover, current methods for KG construction often rely on the rich structure of Wikipedia, such as links and infoboxes, which are not available for every domain. Therefore, we ask if it is possible to make predictions about, for example, new drug applications from raw text without the intermediate step of KG construction. Open information extraction systems (OIE) (Etzioni et al., 2011) automatically extract ("subject text", "relation text", "object text")-triples from unstructured data such as text. We can view OIE data as an open knowledge graph (OKG), in which vertices correspond to mentions of entities and edges to open relations (see Fig. 1). Our overarching interest is whether and how we can reason over an OKG without any canonicalization and without any supervision on its latent factual knowledge. This study focuses on the challenges of benchmarking the inference abilities of models in such a setup.
A common task that requires reasoning over a KG is link prediction (LP). The goal of LP is to predict missing facts in a KG. In general, LP is defined as answering questions such as (NBC, headquarterIn, ?) or (?, headquarterIn, NewYorkCity); see Fig. 2a. In OKGs, we define open link prediction (OLP) as follows: Given an OKG and a question consisting of an entity mention and an open relation, predict mentions as answers. A predicted mention is correct if it is a mention of the correct answer entity. For example, given the question ("NBC-TV", "has office in", ?), correct answers include "NYC" and "New York"; see Fig. 2b.
To evaluate LP performance, the LP model is trained on known facts and evaluated on its ability to predict unknown facts, i.e., facts not seen during training. A simple but problematic way to transfer this approach to OKGs is to sample a set of evaluation triples from the OKG and to use the remaining part of the OKG for training. To see why this approach is problematic, consider the test triple ("NBC-TV", "has office in", "New York") and suppose that the triple ("NBC", "has headquarter in", "NYC") is also part of the OKG. The latter triple essentially leaks the test fact. If we do not remove such facts from the training data, a successful model may only paraphrase known facts without performing reasoning, i.e., without predicting genuinely new facts. Furthermore, we also want to quantify whether there are other trivial explanations for the prediction of an evaluation fact. For example, how much can be predicted with simple popularity statistics, i.e., from only the mention, e.g., ("NBC-TV", ?), or only the relation, e.g., ("has office in", ?)? Such non-relational information also does not require reasoning over the graph.
To experimentally explore whether it is possible to predict new facts, we focus on knowledge graph embedding (KGE) models (Nickel et al., 2016), which have been applied successfully to LP in KGs. Such models can be easily extended to handle the surface forms of mentions and open relations.
Our contributions are as follows: We propose the OLP task, an OLP evaluation protocol, and a method to create an OLP benchmark dataset. Using the latter method, we created a large OLP benchmark called OLPBENCH, which was derived from the state-of-the-art OIE corpus OPIEC (Gashteovski et al., 2019). OLPBENCH contains 30M open triples, 1M distinct open relations, and 2.5M distinct mentions of approximately 800K entities. We investigate the effect of paraphrasing and non-relational information on the performance of a prototypical KGE model for OLP. We also investigate the influence of entity knowledge during model selection with different types of validation data. For training KGE models on such large datasets, we describe an efficient training method.
In our experiments, we found the OLP task and OLPBENCH to be very challenging. Still, the KGE model we considered was able to predict genuinely new facts. We also show that paraphrasing and non-relational information can indeed dilute performance evaluation, but can be remedied by appropriate dataset construction and experimental settings.

Open Knowledge Graphs
OKGs can be constructed in a fully automatic way. They are open in that they do not require a vocabulary of entities and relations. For this reason, they can capture more information than curated KGs. For example, different entity mentions can refer to different versions of an entity at different points of time, e.g., "Senator Barack Obama" and "President Barack Obama". Similarly, relations may be of varying specificity: headquarterIn may be expressed directly by open relations such as "be based in" or "operate from" but may also be implied by "relocated their offices to". In contrast to KGs, OKGs contain rich conceptual knowledge. For example, the triple ("a class action lawsuit", "is brought by", "shareholders") does not directly encode entity knowledge, although it does provide information about entities that link to "a class action lawsuit" or "shareholders".
OKGs tend to be noisier and the factual knowledge is less certain than in a KG, however. They cannot directly replace KGs. OKGs have mostly been used as a weak augmentation to KGs, e.g., to infer new unseen entities or to aid link prediction (see App. A for a comprehensive discussion of related work). Much of the prior work that solely leverages OKGs without a reference KG, and is therein closest to our work, focused on canonicalization and left inference as a follow-up step (Cohen et al., 2000, inter alia). In contrast, we propose to evaluate inference in OKGs with OLP directly.

Figure 3: Mention-ranking protocol: example for computing the filtered rank for a test question (filter other correct answer entities; map answer entities to mentions to identify correct answers).

Open Link Prediction
The open link prediction task is based on the link prediction task for KGs (Nickel et al., 2016), which we describe first. Let E be a set of entities, R be a set of relations, and T ⊆ E × R × E be a knowledge graph. Consider questions of the form q_h = (?, k, j) or q_t = (i, k, ?), where i, j ∈ E are a head and a tail entity, respectively, and k ∈ R is a relation. The link prediction problem is to provide answers that are correct but not yet present in T. In OKGs, only mentions of entities and open relations are observed. We model each entity mention and each open relation as a non-empty sequence of tokens from some vocabulary V (e.g., a set of words). Denote by M = V^+ the set of all such sequences and observe that M is unbounded. An open knowledge graph T ⊂ M × M × M consists of triples of form (i, k, j), where i, j ∈ M are head and tail entity mentions, respectively, and k ∈ M is an open relation. Note that we overload notation for readability: i, j, and k refer to entity mentions and open relations in OKGs, but to disambiguated entities and relations in KGs. The intended meaning will always be clear from the context. We denote by M(E) and M(R) the sets of entity mentions and open relations present in T, respectively. The open link prediction task is to predict new and correct answers to questions (i, k, ?) or (?, k, j). Answers are taken from M(E), whereas questions may refer to arbitrary mentions of entities and open relations from M. For example, for the question ("NBC-TV", "has office in", ?), we expect an answer from the set of mentions {"New York", "NYC", . . . } of the entity NewYorkCity. Informally, an answer (i, k, j) is correct if there is a correct triple (e_1, r, e_2), where e_1 and e_2 are entities and r is a relation, such that i, j, and k are mentions of e_1, e_2, and r, respectively.
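To make this setup concrete, the following minimal Python sketch (with made-up toy data, not the benchmark) represents open triples and mentions as token sequences and applies the correctness criterion just described.

```python
# Minimal sketch of the OLP setup; toy data, not the benchmark itself.
# Mentions and open relations are token sequences; an OKG is a set of triples.
okg = {
    (("nbc-tv",), ("has", "office", "in"), ("new", "york")),
    (("nbc",), ("has", "headquarter", "in"), ("nyc",)),
}

# Tail questions (i, k, ?) derived from the OKG.
tail_questions = {(i, k) for (i, k, j) in okg}

# Ground truth used only for evaluation: entity -> known mentions.
entity_mentions = {
    "NewYorkCity": {("new", "york"), ("nyc",), ("new", "york", "city")},
}

def is_correct(predicted_mention, answer_entity):
    """A predicted mention is correct if it is a known mention of the answer entity."""
    return predicted_mention in entity_mentions[answer_entity]

print(is_correct(("nyc",), "NewYorkCity"))                  # True
print(is_correct(("rockefeller", "plaza"), "NewYorkCity"))  # False
```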

Evaluation protocol
To describe our proposed evaluation protocol, we first revisit the most commonly used methodology to evaluate link prediction methods for KGs, i.e., the entity-ranking protocol (Bordes et al., 2013). Then, we discuss its adaptation to OLP, which we call the mention-ranking protocol (see Fig. 3).
KGs and entity ranking. For each triple z = (i, k, j) in the evaluation data, a link prediction model ranks the answers for two questions, q_t(z) = (i, k, ?) and q_h(z) = (?, k, j). The model is evaluated based on the ranks of the correct entities j and i; this setting is called raw. When true answers for q_t(z) and q_h(z) other than j and i are filtered from the rankings, the setting is called filtered.
OKGs and mention ranking. In OLP, the model predicts a ranked list of mentions, but questions might have multiple equivalent true answers, i.e., answers that refer to the same entity but use different mentions. Our evaluation metrics are based on the highest rank of a correct answer mention in the ranking. For the filtered setting, the mentions of known answer entities other than the evaluated entity are filtered from the ranking. This mention-ranking protocol thus uses knowledge of alternative mentions of the entity in the evaluation triple to obtain a suitable ranking. The mention-ranking protocol therefore requires (i) ground-truth annotations for the entity mentions in the head and tail of the evaluation data, and (ii) a comprehensive set of mentions for these entities.
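A minimal sketch of this filtered mention-ranking computation is shown below; the function and data layout are ours, not the benchmark code.

```python
def filtered_mention_rank(ranked_mentions, gold_mentions, other_true_mentions):
    """Return the 1-based filtered rank of the best-ranked gold mention.

    ranked_mentions: candidate mentions sorted by decreasing model score.
    gold_mentions: all known mentions of the gold answer entity.
    other_true_mentions: mentions of other correct answer entities; in the
        filtered setting they are skipped before computing the rank.
    """
    rank = 0
    for mention in ranked_mentions:
        if mention in other_true_mentions and mention not in gold_mentions:
            continue  # filter other known correct answers
        rank += 1
        if mention in gold_mentions:
            return rank
    return None  # no gold mention was ranked

# Gold entity NewYorkCity with mentions {"NYC", "New York"}:
print(filtered_mention_rank(["London", "NYC", "New York"],
                            {"NYC", "New York"}, set()))  # -> 2
```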

Creating the Open Link Prediction Benchmark OLPBENCH

An OLP benchmark should enable us to evaluate a model's capability to predict genuinely new facts, i.e., facts that cannot be trivially derived. Due to the nature of OKGs, paraphrasing of facts may leak facts from validation and test data into training, making the prediction of such evaluation facts trivial. Nevertheless, the creation of training and validation data should require as little human effort as possible so that the methodology can be readily applied to new domains. Our mention-ranking protocol, however, uses knowledge about entities for disambiguation (of the evaluation data, not the training data), and this entity knowledge requires human effort to create. We investigate experimentally to what extent this entity knowledge is necessary for model selection and, in turn, how much manual effort is required to create a suitable validation dataset. In the following, we describe the source dataset of OLPBENCH and discuss how we addressed the points above to create evaluation and training data.

Source Dataset
OLPBENCH is based on OPIEC (Gashteovski et al., 2019), a recently published dataset of OIE triples that were extracted from the text of English Wikipedia with the state-of-the-art OIE system MinIE (Gashteovski et al., 2017). We used a subset of 30M distinct triples, comprised of 2.5M entity mentions and 1M open relations. In 1.25M of these triples, the subject and the object contained a Wikipedia link. Fig. 4 shows how a Wikipedia link is used to disambiguate a triple's subject and object mentions. Tab. 1 shows an excerpt from the unlinked and linked triples. For the evaluation protocol, we collected a dictionary in which each entity is mapped to all of its possible mentions. See App. B for more details about the dataset creation.

Figure 4: Example of a triple extracted from the Wikipedia sentence "Was the second ship of the United States Navy to be named for William Conway, who distinguished himself during the Civil War.", with the links en.wikipedia.org/wiki/William_Conway_(U.S._Navy) and en.wikipedia.org/wiki/American_Civil_War. With a Wikipedia hyperlink, a mention is disambiguated to its entity. Inversely, this yields a mapping from an entity to all its mentions.
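The entity-to-mentions dictionary can be collected directly from the linked triples; the sketch below assumes a simplified record layout rather than the actual OPIEC schema.

```python
from collections import defaultdict

# Hypothetical record layout for linked triples with entity annotations.
linked_triples = [
    {"subj": "william conway", "subj_entity": "William_Conway_(U.S._Navy)",
     "rel": "distinguished himself during",
     "obj": "the civil war", "obj_entity": "American_Civil_War"},
]

# Entity -> set of all mentions observed for that entity.
entity_to_mentions = defaultdict(set)
for t in linked_triples:
    entity_to_mentions[t["subj_entity"]].add(t["subj"])
    entity_to_mentions[t["obj_entity"]].add(t["obj"])

print(dict(entity_to_mentions))
```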

Evaluation Data
From the source dataset, we created validation and test data with the following requirements:

Data quality. The evaluation data should be challenging, and noise should be limited as much as possible. We chose a pragmatic and easy heuristic: we did not consider short relations with fewer than three tokens as candidates for sampling evaluation data. This decision was based on the following observations: (i) Due to the way OPIEC extracts triples, short relations, e.g., ("kerry s. walters", "is", "professor emeritus"), are often subsumed by longer relations, e.g., ("kerry s. walters", "is professor emeritus of", "philosophy"), which would always lead to leakage from the longer relation to the shorter one. (ii) Longer relations are less likely to be easily captured by simple patterns that are already used successfully by KG construction methods, e.g., ("elizabeth of hungary", "is the last member of", "the house of árpád"). We conjecture that long relations are more interesting for evaluating progress in reasoning with OKG data. (iii) The automatically extracted entity annotations were slightly noisier for short relations; e.g., ("marc anthony", "is", "singer") had the object entity annotation SinFrenos.
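The relation-length heuristic itself is simple to apply; the sketch below assumes whitespace-tokenized triples of the form (subject, relation, object).

```python
def evaluation_candidates(triples, min_relation_tokens=3):
    """Keep only triples whose open relation has at least three tokens."""
    return [t for t in triples if len(t[1].split()) >= min_relation_tokens]

triples = [
    ("kerry s. walters", "is", "professor emeritus"),
    ("kerry s. walters", "is professor emeritus of", "philosophy"),
]
print(evaluation_candidates(triples))
# -> only the second triple qualifies as an evaluation candidate
```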
Human effort for data creation. The mention-ranking protocol uses knowledge about entities for disambiguation. We want to experimentally quantify the influence of this entity knowledge on model selection, i.e., whether entity knowledge is necessary to find a good model. If so, human expertise is necessary to create the validation data. While our goal is to require almost no human domain expertise to learn a good model, the size of the validation data is much smaller than the size of the training data. Therefore, this effort, if helpful, may be feasible. To investigate this, we perform model selection with three different validation datasets that require increasingly more human effort to create: VALID-ALL (no effort), VALID-MENTION (some effort), and VALID-LINKED (most effort).

Table 1: Examples from the unlinked and linked data in OLPBENCH. For the unlinked data, we show the first of 3443 triples containing the token "conway". For the linked data, we show the triples and also the alternative mentions for their entities. The first linked triple is about William Conway (U.S. Navy).
TEST and VALID-LINKED data. Sample 10K triples with relations that have at least three tokens from the 1.25M linked triples. In these triples, the subject and object mentions have an annotation for their entity, which allows the mention-ranking protocol to identify alternative mentions of the respective entities.
VALID-MENTION data. Proceed as in VALID-LINKED but discard the entity links. During validation, no access to alternative mentions is possible so that the mention-ranking protocol cannot be used. Nevertheless, the data has the same distribution as the test data. Such validation data may be generated automatically using a named entity recognizer, if one is available for the target domain.
VALID-ALL data. Sample 10K triples with relations that have at least three tokens from the entire 30M unlinked and linked triples. This yields mostly triples from the unlinked portion. These triples may also include common nouns such as "a nice house" or "the country". Entity links are discarded, i.e., the mention-ranking protocol cannot be used for validation.

Training Data
To evaluate LP models for KGs, evaluation facts are generated by sampling from the KG. Given an evaluation triple (i, k, j), the simplest action to avoid leakage from the training data is to remove only this evaluation triple from training. For KGs, it was observed that this simple approach is not satisfactory in that evaluation answers may still leak and thus can be trivially inferred (Toutanova et al., 2015; Dettmers et al., 2018). For example, an evaluation triple (a, siblingOf, b) can be trivially answered with the training triple (b, siblingOf, a).
In OKGs, paraphrases of relations pose additional sources of leakage. For example, the relations "is in" and "located in" may contain many of the same entity pairs. For evaluation triple (i, k, j), such leakage can be prevented by removing any other relation between i and j from the training data. However, individual tokens in the arguments or relations may also cause leakage. For example, information about test triple ("NBC-TV", "has office in", "NYC") is leaked by triples such as ("NBC Television", "has NYC offices in", "Rockefeller Plaza") even though it has different arguments. Fig. 5 visualizes this example.
We use three levels of leakage removal from training: SIMPLE, BASIC, and THOROUGH. To match an evaluation triple (i, k, j) with training triples, we ignored word order and stopwords.

THOROUGH removal. In addition to BASIC removal, we also remove training triples matched by the following patterns, explained here with the example ("J. Smith", "is defender of", "Liverpool"): (a) (i, *, j) and (j, *, i), which, e.g., matches ("J. Smith", "is player of", "Liverpool").
For OLPBENCH, THOROUGH removed 196,717 more triples from the OKG than BASIC. Note that this yields three different training data sets.
2 Other permutations of this pattern did not occur in our data.
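To illustrate the matching, here is a sketch of pattern (a) of THOROUGH removal, ignoring word order and stopwords; the stopword list and function names are assumptions.

```python
STOPWORDS = {"the", "a", "an", "of", "in", "is", "was", "to"}  # assumed list

def key(text):
    """Represent a mention or relation as a set of non-stopword tokens,
    so that matching ignores word order and stopwords."""
    return frozenset(tok for tok in text.lower().split() if tok not in STOPWORDS)

def remove_pattern_a(train, eval_triple):
    """Remove training triples matching (i, *, j) or (j, *, i)."""
    i, _, j = eval_triple
    args = {key(i), key(j)}
    return [(s, r, o) for (s, r, o) in train if {key(s), key(o)} != args]

train = [("J. Smith", "is player of", "Liverpool"),
         ("Liverpool", "signed", "J. Smith"),
         ("J. Smith", "was born in", "Manchester")]
print(remove_pattern_a(train, ("J. Smith", "is defender of", "Liverpool")))
# -> keeps only ("J. Smith", "was born in", "Manchester")
```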

Open Knowledge Graph Embeddings
KG embedding (KGE) models have been successfully applied for LP in KGs, and they can be easily extended to handle surface forms, i.e., mentions and open relations. We briefly describe KGE models and their extension.
Knowledge Graph Embedding (KGE) model. A KGE model (Nickel et al., 2016) associates an embedding with each entity and each relation. The embeddings are dense vector representations that are learned with an LP objective. They are used to compute a KGE model-specific score s(i, k, j) for a triple (i, k, j); the goal is to predict high scores for true triples and low scores for wrong triples.
KGE model with composition. For our experiments, we considered composition functions to create entity and relation representations from the tokens of the surface form. Such an approach has been used, for example, by Toutanova et al. (2015) to produce open relation embeddings via a CNN. A model that reads the tokens of mentions and open relations can, in principle, handle any mention and open relation as long as the tokens have been observed during training.
We use a general model architecture that combines a relational model and a composition function; see Fig. 6. Formally, let V(E)^+ be the set of non-empty token sequences over the token vocabulary V(E) of entity mentions, and likewise V(R)^+ over the token vocabulary V(R) of open relations. We denote by d, o ∈ N^+ the sizes of the embeddings of entities and relations, respectively. We first embed each entity mention into a continuous vector space via an entity mention embedding function f : V(E)^+ → R^d. Similarly, each open relation is embedded into a continuous vector space via a relation embedding function g : V(R)^+ → R^o. The embeddings are then fed into a relational scoring function RM : R^d × R^o × R^d → R. Given a triple (i, k, j), where i, j ∈ V(E)^+ and k ∈ V(R)^+, our model computes the final score as s(i, k, j) = RM(f(i), g(k), f(j)).

Figure 6: Model architecture for the example triple ("Jamie Carragher", "is defender of", "Liverpool"): mention/relation tokens are mapped to token embeddings, composed into mention/relation embeddings, and scored to yield the score for the triple.
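The following PyTorch sketch illustrates this architecture with an LSTM composition function and a ComplEx relational scoring function, anticipating the COMPLEX-LSTM model used in the experiments; dimensions, layer choices, and names are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class ComplExLSTM(nn.Module):
    """Sketch of the architecture in Fig. 6: an LSTM composes token embeddings
    into mention/relation embeddings, and ComplEx scores the triple.
    Dimensions and details are assumptions for illustration."""

    def __init__(self, num_ent_tokens, num_rel_tokens, dim=256):
        super().__init__()
        assert dim % 2 == 0  # ComplEx splits embeddings into real/imaginary parts
        self.ent_tok = nn.Embedding(num_ent_tokens, dim)
        self.rel_tok = nn.Embedding(num_rel_tokens, dim)
        self.f = nn.LSTM(dim, dim, batch_first=True)  # mention composition
        self.g = nn.LSTM(dim, dim, batch_first=True)  # relation composition

    def compose(self, lstm, tok_emb, token_ids):
        _, (h, _) = lstm(tok_emb(token_ids))  # run LSTM over token embeddings
        return h[-1]                          # final hidden state: (batch, dim)

    def score(self, subj_ids, rel_ids, obj_ids):
        s = self.compose(self.f, self.ent_tok, subj_ids)  # f(i)
        r = self.compose(self.g, self.rel_tok, rel_ids)   # g(k)
        o = self.compose(self.f, self.ent_tok, obj_ids)   # f(j)
        s_re, s_im = s.chunk(2, dim=-1)
        r_re, r_im = r.chunk(2, dim=-1)
        o_re, o_im = o.chunk(2, dim=-1)
        # ComplEx: Re(<s, r, conj(o)>)
        return (s_re * r_re * o_re + s_im * r_re * o_im
                + s_re * r_im * o_im - s_im * r_im * o_re).sum(-1)

model = ComplExLSTM(num_ent_tokens=200_000, num_rel_tokens=50_000, dim=256)
subj = torch.tensor([[1, 2]])    # token ids of the subject mention
rel = torch.tensor([[3, 4, 5]])  # token ids of the open relation
obj = torch.tensor([[6]])        # token ids of the object mention
print(model.score(subj, rel, obj))  # one score per triple in the batch
```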

Experiments
In our experimental study, we investigated whether a simple prototypical OLP model can predict genuinely new facts or if many successful predictions can be trivially explained by leakage or nonrelational information. Our goal was to study the effectiveness and necessity of the mention-ranking protocol and leakage removal, and how much human effort is necessary to create suitable validation data. Finally, we inspected data and model quality.
We first describe the models and their training, then the performance metrics, and finally the evaluation. In our experimental results, model performance dropped by ≈25% with THOROUGH leakage removal, so that leakage due to paraphrasing is indeed a concern. We also implemented two diagnostic models that use non-relational information (only parts of a triple) to predict answers. These models reached ≈20-25% of the prototypical model's performance, which indicates that relational modelling is important. In our quality and error analysis, we found that at least 74% of the prediction errors were not due to noisy data. A majority of incorrectly predicted entity mentions have a type similar to that of the true entity.

Models and Training
Prototypical model. We use COMPLEX (Trouillon et al., 2016) as the relational model, which is an efficient bilinear model and has shown state-of-the-art results. For the composition functions f and g, we used an LSTM (Hochreiter and Schmidhuber, 1997) with one layer and a hidden size equal to the token embedding size. We call this model COMPLEX-LSTM.

Diagnostic models. To expose potential biases in the data, we employ two diagnostic models to discover how many questions can be answered without looking at the whole question, i.e., by exploiting non-relational information. Given question (i, k, ?), the model PREDICT-WITH-REL considers only (k, ?) for scoring. E.g., for question ("Jamie Carragher", "is defender of", ?), we actually ask ("is defender of", ?). This is likely to work reasonably well for relations that are specific about their potential answer entities, e.g., predicting popular football clubs for ("is defender of", ?). The model uses scoring functions s_t : R^o × R^d → R and s_h : R^d × R^o → R for questions (i, k, ?) and (?, k, j), respectively. Likewise, the PREDICT-WITH-ENT model ignores the relation by computing a score for the pair (i, j).
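Continuing the sketch above, PREDICT-WITH-REL can be illustrated as scoring a candidate answer from the relation embedding alone; the dot-product form below is an assumed stand-in, since the concrete forms of s_t and s_h are not spelled out here.

```python
def predict_with_rel_score(model, rel_ids, cand_ids):
    """Diagnostic sketch: ignore the subject mention and score a candidate
    answer mention from the relation embedding alone (scoring form assumed)."""
    r = model.compose(model.g, model.rel_tok, rel_ids)   # g(k)
    c = model.compose(model.f, model.ent_tok, cand_ids)  # f(j)
    return (r * c).sum(-1)

print(predict_with_rel_score(model, rel, obj))  # reuses objects from the sketch above
```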
Training. See App. C for details about the hyperparameters, training, and model selection.

Performance metrics. We report the mean reciprocal rank (MRR) and HITS@k. MRR emphasizes the top ranks, while HITS@k equally rewards correct answers in the top-k ranks. See App. D for a more formal definition of MRR and HITS@k. The ranks are based on mention ranking for VALID-LINKED and TEST, and on entity ranking (treating distinct mentions as distinct entities) for VALID-ALL and VALID-MENTION.

Results
Influence of leakage. In Tab. 2, we observed that BASIC leakage removal lowers the performance of all models considerably compared to SIMPLE leakage removal. With THOROUGH leakage removal, performance drops further; e.g., HITS@50 dropped by ≈25% compared to SIMPLE. This confirms our conjecture that leakage can trivially explain some successful predictions. Most predictions, however, cannot be explained by paraphrasing leakage.
Influence of non-relational information. In Tab. 2, we see that PREDICT-WITH-ENT, which essentially learns popularity statistics between entity mentions, has no success on the evaluation data. However, PREDICT-WITH-REL reaches ≈ 20−25% of HITS@50 performance of COMPLEX-LSTM by simply predicting popular mentions for a relation, even in the THOROUGH setting.
Effectiveness of mention-ranking. Tab. 3 shows validation results for the three types of validation data for COMPLEX-LSTM with THOROUGH removal. The evaluation protocol has access to alternative mentions only in VALID-LINKED, but not in VALID-ALL and VALID-MENTION. Clearly, using VALID-LINKED results in higher metrics when models associate different mentions with an answer entity.
Influence of model selection. The THOROUGH block of Tab. 2 shows the results for model selection based on VALID-ALL, VALID-MENTION, or VALID-LINKED. In VALID-ALL, many triples contain common nouns instead of entity mentions, while in VALID-MENTION and VALID-LINKED triples have entity mentions in both arguments. Model selection based on VALID-ALL clearly picked a weaker model than model selection based on VALID-LINKED, i.e., it led to a drop of ≈35% in HITS@50 performance. However, there is no improvement when we pick a model based on VALID-LINKED versus VALID-MENTION. Thus, computing the MRR using alternative entity mentions did not improve model selection, even though, as Tab. 3 shows, the mention-ranking protocol gives more credit when alternative mentions are ranked higher. Our results suggest that it may suffice to use validation data that contains entity mentions while avoiding costly entity disambiguation.
Overall performance. In Tab. 2, we observed that the performance numbers seem generally low. For comparison, the HITS@10 of COMPLEX on FB15k-237, a standard evaluation dataset for LP in curated KGs, lies between 45% and 55%. We conjecture that this drop may be due to: (i) The level of uncertainty and noise in the training data, i.e., uninformative or even misleading triples in OKGs (Gashteovski et al., 2019). (ii) Our evaluation data is mostly from the more challenging long tail. (iii) OKGs might be fragmented, thus inhibiting information flow. Also, note that the removal of evaluation data from training removes evidence for the evaluated long-tail entities. (iv) Naturally, in LP, we do not know all the true answers to questions. Thus, the filtered rank might still contain many true predictions. In OLP, we expect this effect to be even stronger, i.e., the filtered ranking metrics are lower than in the KG setting. Still, as in KG evaluation, with a large enough test set, the metrics allow for model comparisons.
Model and data errors. We inspected predictions for VALID-LINKED from COMPLEX-LSTM trained on THOROUGH. We sampled 100 prediction errors, i.e., triples for which no correct predicted mention appeared in the filtered top-50 rank. We classified prediction errors by inspecting the top-3 ranks and judging their consistency, and we classified triple quality by judging the whole triple. We counted an error as correct sense / wrong entity when the top-ranked mentions are semantically sensible, e.g., for ("Irving Azoff", "was head of", ?) the correct answer would be "MCA Records", but the model predicted other record companies. We counted an error as wrong sense when, for the same example, the model mostly consistently predicted other companies or music bands, but not other record companies. If the predictions are inconsistent, we counted the error as noise.
An additional quality assessment is the number of wrong triples caused by extraction errors in OPIEC, e.g., ("Finland", "is the western part of", "the balkan peninsula"), ("William Macaskill", "is vice-president of", "giving"), or errors in alternative mentions. We also looked for generic mentions in the evaluation data. Such mentions contain mostly conceptual knowledge like in ("computer science", "had backgrounds in", "mathematics"). Other generic triples, like ("Patrick E.", "joined the team in", "the season"), have conceptual meaning, but miss context to disambiguate "the season".
The results in Tab. 4 suggest that the low performance in the experiments is not due to noisy evaluation data: 74% of the examined prediction errors on VALID-LINKED contained correct, non-generic facts. The shown model errors raise the question of whether there is enough evidence in the data to make better predictions.

Conclusion
We proposed the OLP task and a method to create an OLP benchmark. We created the large OLP benchmark OLPBENCH, which will be made publicly available. We investigated the effect of leakage of evaluation facts, of non-relational information, and of entity knowledge during model selection using a prototypical open link prediction model. Our results indicate that most predicted true facts are genuinely new.


A Related Work
Reading comprehension QA and language modelling. Two recently published reading comprehension question answering datasets, QAngaroo (Welbl et al., 2018) and HotPotQA (Yang et al., 2018), evaluate multi-hop reasoning over facts in a collection of paragraphs. In contrast to these approaches, OLP models reason over the whole graph, and the main goal is to investigate the learning of relational knowledge despite ambiguity and noise. We consider the two directions as complementary. Also, in their task setup, they do not stipulate a concept of relations between entities, i.e., the relations are assumed to be a latent/inherent property of the text in which the entities occur. This is true as well for language models trained on raw text. It has been shown that such language models can answer questions in a zero-shot setting (Radford et al., 2019). The authors of the latter study inspected the training data to estimate the number of near duplicates to their test data and could show that their model seemed to be able to generalize, i.e., to reason about knowledge in the training data.
TAC KBP Slot Filling. The TAC KBP Slot Filling challenge datasets provide a text corpus paired with canonicalized multi-hop questions. There are similarities to our work in terms of building knowledge from scratch and answering questions. The main difference is that our goal is to investigate the learning of knowledge without supervision on canonicalization and that we use link prediction questions to quantify model performance. If models in OLP show convincing progress, they could and should be applied to TAC KBP.

B Dataset creation
The process of deriving the dataset from OPIEC was as follows. Initially, the dataset contained over 340M non-distinct triples, which are enriched with metadata such as the source sentence, linguistic annotations, confidence scores about the correctness of the extractions, and the Wikipedia links in the triple's subject or object. Triples of the following types are not useful for our purpose and were removed: (i) triples with a confidence score < 0.3, (ii) triples with personal or possessive pronouns, wh-determiners, adverbs, or determiners in one of their arguments, (iii) triples whose relation stems from an implicit appositive clause extraction, which we found to be very noisy, and (iv) triples with a mention or a relation longer than 10 tokens. This left 80M non-distinct triples. Next, we lowercased the remaining 60M distinct triples and collected an entity-mentions map from all triples that have an annotated entity. We collected token counts and created a mention token vocabulary with the top 200K most frequent tokens and a relation token vocabulary with the top 50K most frequent tokens. This was done to ensure that each token is seen at least ≈50 times. Finally, we kept only the triples whose tokens were contained in these vocabularies, i.e., the final 30M distinct triples.
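The final vocabulary cutoff and triple filtering step can be sketched as follows; the data layout (triples as tuples of token lists) is an assumption.

```python
from collections import Counter

def build_vocab(token_sequences, size):
    """Keep the `size` most frequent tokens (vocabulary cutoff sketch)."""
    counts = Counter(tok for seq in token_sequences for tok in seq)
    return {tok for tok, _ in counts.most_common(size)}

def filter_triples(triples, mention_vocab, relation_vocab):
    """Keep only triples whose mention and relation tokens are all in-vocabulary."""
    return [(s, r, o) for (s, r, o) in triples
            if set(s) | set(o) <= mention_vocab and set(r) <= relation_vocab]

# Usage sketch: top 200K mention tokens and top 50K relation tokens.
# mention_vocab = build_vocab((s + o for (s, r, o) in triples), 200_000)
# relation_vocab = build_vocab((r for (s, r, o) in triples), 50_000)
# final_triples = filter_triples(triples, mention_vocab, relation_vocab)
```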

C Training details

C.1 Multi-Label Binary Classification Batch-Negative Example Loss
Recent studies (Dettmers et al., 2018) obtained state-of-the-art results using multi-label binary classification over the full entity vocabulary. Let the cardinality of the OKG's mention set be N = |T_h ∪ T_t|. A training instance is either a prefix (i, k) with label vector y_{ik} ∈ {0, 1}^N, whose c-th entry is 1 if (i, k, c) is a training triple and 0 otherwise, or, likewise, a suffix (k, j) with label vector y_{kj} ∈ {0, 1}^N. Computing such a loss over the whole entity mention vocabulary is infeasible because (a) our entity mention vocabulary is very large and (b) we would have to recompute the entity mention embeddings after each parameter update for each batch. To improve memory efficiency and speed, we devise a strategy to create negative examples dubbed batch negative examples. This method simplifies batch construction by using only the entities in the batch as negative examples. Formally, after sampling the prefix and suffix instances for a batch b, we collect all true answers in a set B̄_b, such that the label vectors y_{ik} and y_{kj} in batch b are defined over B̄_b, and the loss for a prefix (i, k) in batch b is the binary cross-entropy L_{ik} = - Σ_{c ∈ B̄_b} [ y_{ik,c} · log σ(s(i, k, c)) + (1 - y_{ik,c}) · log(1 - σ(s(i, k, c))) ], and L_{kj} is computed likewise. With batch-negative examples, the mentions/entities appear as negative examples in expectation proportionally to their frequency in the training data.
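In code, this loss is ordinary multi-label binary cross-entropy restricted to the candidate mentions collected from the batch; the PyTorch sketch below assumes the tensor shapes described in its docstring.

```python
import torch
import torch.nn.functional as F

def batch_negative_loss(scores, labels):
    """Multi-label binary classification loss over batch-negative candidates.

    scores: (num_prefixes_in_batch, |B_b|) raw scores s(i, k, c) for every
        candidate mention c that is a true answer somewhere in batch b.
    labels: same shape; 1.0 where (i, k, c) is a training triple, else 0.0.
    Shapes and the mean reduction are assumptions of this sketch.
    """
    return F.binary_cross_entropy_with_logits(scores, labels)

# Toy example: 2 prefixes, 3 candidate mentions collected from the batch.
scores = torch.tensor([[4.2, -1.3, 0.5],
                       [-2.0, 3.1, 0.0]])
labels = torch.tensor([[1.0, 0.0, 0.0],
                       [0.0, 1.0, 1.0]])
print(batch_negative_loss(scores, labels))
```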

C.2 Training settings
We used Adagrad (Duchi et al., 2011) with mini-batches of size 4096. The token embeddings were initialized with Glorot initialization (Glorot and Bengio, 2010). One epoch takes ≈50 min on a TitanXp/1080Ti. We performed a grid search over the following hyperparameters: entity and relation token embedding sizes [256, 512], dropout after the composition functions f and g [0.0, 0.1], learning rate [0.05, 0.1, 0.2], and weight decay [10^-6, 10^-10]. We trained the models for 10 epochs and selected the hyperparameters that achieved the best MRR with mention ranking on VALID-LINKED. We trained the final models for up to 100 epochs, with early stopping if no improvement occurred within 10 epochs.
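For illustration, the grid search over these hyperparameters can be written as a simple loop; the training call itself is omitted in this sketch.

```python
from itertools import product

# Hyperparameter grid as listed above.
grid = {
    "embedding_size": [256, 512],
    "dropout": [0.0, 0.1],
    "learning_rate": [0.05, 0.1, 0.2],
    "weight_decay": [1e-6, 1e-10],
}

for values in product(*grid.values()):
    config = dict(zip(grid.keys(), values))
    print(config)  # train 10 epochs per config; keep the best MRR on VALID-LINKED
```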

D Performance Metrics
Denote by M(E) all mentions from the dataset and by Q the set of all questions generated from the evaluation data. Given a question q_t ∈ Q, we rank all m ∈ M(E) by the scores s(i, k, m) (or s(m, k, j) for q_h ∈ Q), then filter the raw ranking according to either the entity-ranking protocol or the mention-ranking protocol. Finally, we record the positions of the correct answers in the filtered ranking.
MRR is defined as follows: for each question q ∈ Q, let RR_q be the filtered reciprocal rank of the top-ranked correct answer. MRR is the micro-average over {RR_q | q ∈ Q}. HITS@k is the proportion of questions for which at least one correct mention appears in the top k positions of the filtered ranking.
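Both metrics can be computed directly from the filtered ranks; in the sketch below, a question whose correct answers were not ranked at all is treated as reciprocal rank 0 and as a miss (an assumption of the sketch).

```python
def mrr_and_hits(filtered_ranks, k=10):
    """Compute MRR and HITS@k from per-question filtered ranks.

    filtered_ranks: for each question, the 1-based position of the best-ranked
        correct answer after filtering, or None if no correct answer was ranked.
    """
    rr = [0.0 if r is None else 1.0 / r for r in filtered_ranks]
    hits = [r is not None and r <= k for r in filtered_ranks]
    return sum(rr) / len(rr), sum(hits) / len(hits)

print(mrr_and_hits([1, 3, None, 10], k=10))  # (0.3583..., 0.75)
```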

E Additional Results
Tab. 5 provides results for other models and hyperparameters. The COMPLEX-LSTM results from Sec. 6 are given at the bottom for comparison. COMPLEX-LSTM-XL has a larger embedding size of 768, which did not improve the results. COMPLEX-UNI is the ComplEx model with a uni-gram pooling composition function, i.e., averaging the token embeddings; compared to COMPLEX-LSTM, this shows that the LSTM composition function yields better results. DISTMULT-LSTM is the DistMult relational model (Yang et al., 2015) with an LSTM composition function, which did not improve over COMPLEX-LSTM. In summary, the results support the hyperparameters, model, and composition function chosen for the experiments in Sec. 6. Overall, we observed that model selection based on VALID-ALL seems to have higher variance: the model selected for COMPLEX-LSTM with VALID-ALL is outperformed by other models, whereas COMPLEX-LSTM performed best among models selected with VALID-MENTION and VALID-LINKED.