Sparsity and Noise: Where Knowledge Graph Embeddings Fall Short

Knowledge graph (KG) embedding techniques use structured relationships between entities to learn low-dimensional representations of entities and relations. One prominent goal of these approaches is to improve the quality of knowledge graphs by removing errors and adding missing facts. Surprisingly, most embedding techniques have been evaluated on benchmark datasets consisting of dense and reliable subsets of human-curated KGs, which tend to be fairly complete and have few errors. In this paper, we consider the problem of applying embedding techniques to KGs extracted from text, which are often incomplete and contain errors. We compare the sparsity and unreliability of different KGs and perform empirical experiments demonstrating how embedding approaches degrade as sparsity and unreliability increase.


Introduction
Recently, knowledge graphs (KGs), structured representations of knowledge bases, have become an essential component of systems that perform question-answering (Berant et al., 2013), provide decision support, and enable exploration and discovery (Dong et al., 2014). Initial efforts to create KGs focused on structured information sources or relied extensively on manual curation. However, the diversity of knowledge available on resources like the World Wide Web has spurred many projects that tackle the more difficult task of automatically constructing KGs (Nickel et al., 2016a).
Unfortunately, information extraction approaches for KG construction must overcome complex, unreliable, and incomplete data. Many machine learning methods have been proposed to address the challenge of cleaning and completing KGs. One popular class of methods learns embeddings that translate entities and relationships into a latent subspace, then uses this latent representation to derive additional, unobserved facts and score existing facts (Bordes et al., 2013; Wang et al., 2014; Lin et al., 2015).
Embedding methods have shown state-of-the-art results on several benchmark datasets. However, by construction, these benchmark datasets differ from data in real KGs. First, benchmark datasets have largely been restricted to the most frequently occurring entities in the KG, whereas in most KGs, entities are associated with a sparse set of observations. Second, benchmark datasets consist only of highly reliable facts from curated knowledge bases. In contrast, many KG construction projects extract knowledge from noisy data such as text or images, which introduces unreliable information.
In this paper, we evaluate popular KG embedding approaches on KGs that have sparse entities and unreliable candidate facts. We apply embedding methods to an extracted KG and modify existing benchmarks by varying the sparsity and reliability of training data used to learn embedding models. Using this suite of datasets, we characterize where embedding approaches are successful and the conditions that result in degrading results. Based on our insights, we provide recommendations for improving embedding models and identify promising areas of future exploration.

Background and Related Work
Diverse strategies for knowledge base construction include manually-crafted ontologies for common-sense reasoning (Lenat, 1995), community-driven collaborative efforts (Bollacker et al., 2008), ontology-based extraction from structured and textual sources (Mitchell et al., 2015), and "open" approaches that rely on textual information (Mausam et al., 2012).
In this paper, we contrast the properties of two knowledge graphs that have clean, human-vetted facts with two knowledge graphs that are extracted from textual data. Semantically meaningful embeddings of text have been a longstanding topic of study in NLP research (Turney and Pantel, 2010). More recently, knowledge graphs, which capture structured relationships between entities, have inspired methods such as matrix factorization (Riedel et al., 2013), tensor factorization (Nickel et al., 2011), and deep learning (Socher et al., 2013) that embed entities while preserving this relationship structure. We consider four state-of-the-art embedding methods (Bordes et al., 2013; Wang et al., 2014; Nickel et al., 2016b; Nguyen et al., 2016) and assess their performance on knowledge graphs with different properties.

Comparing Properties of KGs
In Table 1, we introduce three knowledge graphs and a parallel set of benchmark datasets derived from these KGs. Each KG takes the form of triples that specify a relationship between a subject and an object. The first two KGs, Freebase and WordNet, benefit from human curation that results in precisely defined entities and relationships and highly reliable facts. The third KG, NELL, is extracted from a large Web text corpus using an iterative co-training process and a pre-defined set of relations and types. Because of this iterative process, NELL is a dynamic dataset, and the table reports statistics for its 1000th iteration. FB15K and WN18, derived from Freebase and WordNet, respectively, have been used to train and evaluate many embedding strategies. NELL165, based on an earlier iteration of NELL, has been used as a benchmark for probabilistic models. We compare the vital statistics of these six datasets.

Size and Sampling
Despite its reliance on curation, Freebase is the largest KG, with more facts (T), unique entities (E), and relationship types (R) than the others. NELL is a tenth the size of Freebase, with substantially fewer entities and a limited set of relations. WordNet, focused on NLP, is the smallest and expresses only 27 relationships between different words. The derived benchmark datasets are substantially smaller than the source KGs, with the largest, NELL, containing 1M facts. FB15K is generated by sampling a subset of the KG centered around 15K entities. WN18 is generated by restricting WordNet to 18 relations. NELL165 performs no sampling, but is limited by the comprehensiveness of the patterns learned during training.

Diversity
To understand the distribution of entities and relationships in the KG, we introduce an entropy-based measure using the probability that an entity or relation will occur in a randomly selected triple. For triples T of the form (s, p, o), relations R, and entities E, the relation probability of a relation r is the fraction of triples in T whose predicate is r, and the entity probability of an entity e is the fraction of triples in T in which e appears as the subject or the object. Using these probabilities, we compute entity entropy (EE) and relation entropy (RE) for each dataset.
Higher entropy values indicate more uniform distributions of facts across entities and relations, while lower values signal biases in the facts. For example, the low RE values for Freebase and NELL165 are due to an abundance of facts specifying entity types (such as person), relative to other relations between entities. While Freebase has the most facts and entities, these facts are less diverse than those of the other KGs. Through sampling, FB15K rebalances Freebase, increasing the diversity of entities and relations. In contrast, WordNet and WN18 have similar diversity statistics. Compared to NELL1000, NELL165 has a more diverse set of entities and a less diverse set of relations. All KGs have much higher EE than RE, since they use a manually defined set of relations but include many diverse entities.
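The entropy measures described above can be sketched in a few lines. This is a minimal illustration rather than the authors' code: the function name `entity_relation_entropy`, the use of the natural logarithm, and the choice to count a self-loop (s = o) once are all assumptions. Under this reading the entity probabilities need not sum to one, because each triple contains up to two entities.

```python
import math
from collections import Counter


def entity_relation_entropy(triples):
    """Return (EE, RE) for an iterable of (s, p, o) triples.

    Hypothetical helper: the log base (natural log) is an assumption.
    """
    triples = list(triples)
    n = len(triples)

    # Number of triples whose predicate is each relation.
    rel_counts = Counter(p for _, p, _ in triples)

    # Number of triples that contain each entity as subject or object.
    ent_counts = Counter()
    for s, _, o in triples:
        ent_counts.update({s, o})  # a set, so a self-loop counts once

    def entropy(counts):
        # Sum of -P(x) log P(x), with P(x) = count(x) / |T|.
        return -sum((c / n) * math.log(c / n) for c in counts.values())

    return entropy(ent_counts), entropy(rel_counts)


toy = [("a", "r1", "b"), ("a", "r1", "c"), ("b", "r2", "c"), ("a", "r2", "d")]
ee, re_ = entity_relation_entropy(toy)
print(f"EE={ee:.3f} RE={re_:.3f}")
```

On the toy graph the two relations are used equally often, so RE equals log 2, while the four entities give a larger EE, mirroring the observation that EE exceeds RE.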

Sparsity
In addition to diversity, KGs have differing levels of factual information for each entity or relation. One sparsity metric is information density, defined as the average number of triples per entity or relation. We compare the datasets using entity density (ED) and relational density (RD), the average number of triples per entity and per relation, respectively. Most datasets have a similar ED, but the benchmark dataset FB15K has a much higher ED, while the benchmark dataset NELL165 has a much lower ED. NELL1000 has the highest RD, since extractions are focused on a small set of relations, while FB15K has a particularly low RD due to the entity-centric approach to its construction. We note that FB15K has a much higher ED and a much lower RD than its parent Freebase, due to the sampling choices made during its construction.
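A minimal sketch of the density computation follows. The function name `densities` is hypothetical, and the exact normalization of ED is an assumption: here each density is simply the number of triples divided by the number of distinct entities or relations. A variant that credits every triple to both its subject and its object would double ED.

```python
def densities(triples):
    """Return (ED, RD): average triples per entity and per relation.

    Assumption: ED = |T| / |E| and RD = |T| / |R|.
    """
    triples = list(triples)
    entities = {e for s, _, o in triples for e in (s, o)}
    relations = {p for _, p, _ in triples}
    ed = len(triples) / len(entities)
    rd = len(triples) / len(relations)
    return ed, rd


toy = [("a", "r1", "b"), ("a", "r1", "c"), ("b", "r2", "c"), ("a", "r2", "d")]
print(densities(toy))
```

With four triples, four entities, and two relations, the toy graph has ED = 1.0 and RD = 2.0.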

Reliability
Embedding approaches rely on using facts that are reliable. Human-curated KGs generally have high precision due to strong oversight. In contrast, extracted KGs are far noisier and include erroneous relationships between entities. Extracted KGs are often evaluated on small, manually-labeled evaluation sets to estimate precision. In recent evaluations (Mitchell et al., 2015) using 11K annotations, NELL facts had a precision ranging from 0.75 to 0.85 for confident extractions and from 0.35 to 0.45 across the broader set of extractions.

Empirical Evaluation
To better understand embedding performance with sparse and unreliable data, we select four popular embedding approaches and perform four empirical analyses.

Extracted Knowledge Graphs
In Section 3, we noted that the extracted NELL165 dataset is sparse, with fewer (candidate) facts per relation or entity than the FB15K benchmark. Moreover, the precision of these candidates can be far lower than in benchmark datasets. To evaluate whether embeddings can succeed under such challenging conditions, we applied four state-of-the-art embedding techniques (TransE, TransH, HolE, and STransE) to this dataset. We evaluated all methods on 4.5K manually-labeled facts (Jiang et al., 2012), reporting the area under the precision-recall curve (AUPRC) and the F1 score, computed with parameters that maximize performance on the labeled training set. We compare against a baseline that simply applies a threshold to NELL extractor confidences (but cannot score novel facts), the NELL promotion strategy, and a probabilistic approach, PSL-KGI (Pujara et al., 2015), that reasons collectively about KG facts using ontological constraints and supports open-world reasoning. The results, in Table 2, suggest that embedding approaches cannot cope with the sparse and low-quality extractions: they perform more poorly than the baseline approaches and substantially trail the probabilistic model. In the next two experiments, we analyze whether this failure can be attributed to sparsity or to sensitivity to noise.
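As an illustration of the thresholding step, the sketch below picks the confidence threshold that maximizes F1 on a labeled set. It is a simplified stand-in, not the evaluation code behind Table 2: the function name `best_f1_threshold` and the toy scores are hypothetical, and a fact is assumed to be predicted true when its score is at least the threshold.

```python
def best_f1_threshold(scores, labels):
    """Return (threshold, f1) maximizing F1 over the observed scores.

    scores: confidence per candidate fact; labels: 1 if the fact is true.
    """
    best_t, best_f1 = None, 0.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and not y)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y)
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when there are no TPs.
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t, best_f1


scores = [0.9, 0.8, 0.4, 0.3]
labels = [1, 1, 0, 1]
print(best_f1_threshold(scores, labels))
```

The same routine applies to any scorer that assigns a real-valued confidence to candidate facts, whether extractor confidences or embedding scores.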

Sensitivity to Sparsity
One potential explanation for the lackluster performance of embedding approaches on extracted KGs is the sparsity of these datasets. To assess the impact of sparsity on the quality of learned embeddings, we remove triples from FB15K using two different techniques. The first technique, sparse, removes triples uniformly at random, with the constraint that such removal does not eliminate any entity or relation from the dataset. The second technique, stable, removes all triples for a particular relation, leaving other relations intact. stable is calibrated so that the training set size does not vary by more than 2% between techniques. Fig. 1 shows the filtered Hits@10 metric (proportion of correct triples in top ten triples excluding training data) for both sparse and stable using the TransE, TransH, HolE, and STransE embeddings. Performance universally decreases as the training set diminishes. However, in the sparse treatment, performance deteriorates much more rapidly than in stable. Our experiments show that more complex representations such as TransH and HolE suffer more from sparsity, while TransE and the more sophisticated STransE have somewhat better performance. Ultimately, when half the triples have been randomly removed, corresponding to a (relatively high) RD value of 220, stable outperforms sparse by as much as 60%. The contrast between a dense set of facts for each relation (stable) and a sparse set of relational training data (sparse) is a vivid demonstration that embedding quality relies on dense training data.
Figure 1: Triples are removed from FB15K to preserve relational density (stable, solid) or to increase sparsity (sparse, dotted). Sparse training sets have a pronounced impact on the learned embedding, as measured by Hits@10 on the test set.
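The two removal treatments described above can be sketched as follows. This is an illustrative approximation under stated assumptions, not the procedure used to produce Fig. 1: the function names `remove_sparse` and `remove_stable` are hypothetical, sparse is implemented as a greedy pass over a random permutation (skipping any triple whose removal would eliminate an entity or relation), and stable drops whole relations in random order until the target count is reached, without the 2% size calibration.

```python
import random
from collections import Counter


def remove_sparse(triples, n_remove, seed=0):
    """Randomly drop up to n_remove triples, keeping every entity and relation."""
    rng = random.Random(seed)
    ent, rel = Counter(), Counter()
    for s, p, o in triples:
        ent.update({s, o})
        rel[p] += 1
    order = list(triples)
    rng.shuffle(order)
    kept, removed = [], 0
    for s, p, o in order:
        # Safe only if each entity and the relation survive the removal.
        safe = rel[p] > 1 and all(ent[e] > 1 for e in {s, o})
        if removed < n_remove and safe:
            ent.subtract({s, o})
            rel[p] -= 1
            removed += 1
        else:
            kept.append((s, p, o))
    return kept


def remove_stable(triples, n_remove, seed=0):
    """Drop all triples of randomly chosen relations until n_remove is reached."""
    rng = random.Random(seed)
    by_rel = {}
    for t in triples:
        by_rel.setdefault(t[1], []).append(t)
    relations = list(by_rel)
    rng.shuffle(relations)
    dropped, removed = set(), 0
    for r in relations:
        if removed >= n_remove:
            break
        dropped.add(r)
        removed += len(by_rel[r])
    return [t for t in triples if t[1] not in dropped]


toy = [(f"e{i % 10}", f"r{i % 3}", f"e{(i + 1) % 10}") for i in range(30)]
print(len(remove_sparse(toy, 10)), len(remove_stable(toy, 10)))
```

In this sketch, sparse lowers the number of triples per relation while keeping every relation present, whereas stable keeps the surviving relations at their original density.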

Sensitivity to Unreliability
Beyond sparsity, candidate facts generated by knowledge extraction approaches can also be unreliable. To understand the sensitivity of embedding techniques to noise, we modified the FB15K dataset to include unreliable triples. Our approach to introducing noise, corrupt, involved "corrupting" triples by substituting a replacement entity or relation for the true subject, predicate, or object. The embedding approach is then trained with a corrupted version of the benchmark. Fig. 2 shows how the Hits@10 metric suffers as increasing numbers of facts are either corrupted (corrupt) or removed (sparse). We find that across all methods, removing training data is better than providing incorrect training data to the learning algorithm, but surprisingly the deficit between sparse and corrupt remains relatively stable across all embeddings.
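A minimal sketch of the corrupt treatment is shown below. The function name `corrupt_triples` and the uniform choice of which slot to replace are assumptions; the sketch also does not check whether a corrupted triple happens to coincide with another true fact.

```python
import random


def corrupt_triples(triples, n_corrupt, seed=0):
    """Return a copy of triples in which n_corrupt of them have one slot replaced.

    Assumes at least two distinct entities; the predicate slot is only
    eligible when there are at least two distinct relations.
    """
    rng = random.Random(seed)
    triples = list(triples)
    entities = sorted({e for s, _, o in triples for e in (s, o)})
    relations = sorted({p for _, p, _ in triples})
    # Slot indices: 0 = subject, 1 = predicate, 2 = object.
    slots = (0, 1, 2) if len(relations) > 1 else (0, 2)

    corrupted = list(triples)
    for idx in rng.sample(range(len(triples)), n_corrupt):
        triple = list(triples[idx])
        slot = rng.choice(slots)
        pool = relations if slot == 1 else entities
        # Substitute a different value for the true one in the chosen slot.
        triple[slot] = rng.choice([x for x in pool if x != triple[slot]])
        corrupted[idx] = tuple(triple)
    return corrupted


toy = [(f"e{i % 10}", f"r{i % 3}", f"e{(i + 1) % 10}") for i in range(30)]
noisy = corrupt_triples(toy, 5)
print(sum(a != b for a, b in zip(toy, noisy)))
```

Unlike removal, corruption keeps the training set the same size, so any drop in Hits@10 reflects incorrect supervision rather than missing data.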

Trading off Sparsity and Noise
In many real-world scenarios, constructing a KG requires navigating a tradeoff between sparsity and noise. A sparse, high-quality set of extractions may be insufficient to learn meaningful embeddings. However, the benefit of incorporating additional, unreliable facts may also be questionable. We explore this tradeoff by randomly removing 300K triples from FB15K and incrementally adding unreliable triples at differing noise levels, where noise measures the probability that a newly-added training triple is corrupted. We generate training sets for each noise level and size, train TransE, and compute the filtered Hits@10 metric on the test set. Fig. 2 shows that all embeddings initially benefit from new training data, but the noise level dictates the improvement as more data is introduced. For low noise settings, performance climbs steadily, while higher noise results in plateauing or diminishing performance. Surprisingly, even with 90% noise, embeddings demonstrate a small net improvement, suggesting that for embedding methods a large, unreliable corpus may be better than an extremely sparse, high-quality one.
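One way to generate such training sets is sketched below. Several details are assumptions made for illustration: the function name `add_noisy_triples`, drawing the added triples from the pool of previously removed triples, and corrupting a selected triple by replacing only its subject or object entity.

```python
import random


def add_noisy_triples(base, held_out, n_add, noise, seed=0):
    """Extend base with n_add triples from held_out, each corrupted with prob. noise.

    Assumption: a corrupted triple has its subject or object replaced by a
    different entity drawn uniformly from all known entities.
    """
    rng = random.Random(seed)
    base, held_out = list(base), list(held_out)
    entities = sorted({e for s, _, o in base + held_out for e in (s, o)})

    added = []
    for s, p, o in rng.sample(held_out, n_add):
        if rng.random() < noise:
            if rng.random() < 0.5:
                s = rng.choice([e for e in entities if e != s])
            else:
                o = rng.choice([e for e in entities if e != o])
        added.append((s, p, o))
    return base + added


toy = [(f"e{i % 10}", f"r{i % 3}", f"e{(i + 1) % 10}") for i in range(30)]
base, held_out = toy[:15], toy[15:]
for noise in (0.0, 0.5, 0.9):
    train = add_noisy_triples(base, held_out, n_add=10, noise=noise)
    print(noise, len(train))
```

Sweeping `n_add` and `noise` over a grid yields one training set per size and noise level, each of which would then be used to train a separate embedding model.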

Conclusion
In this paper, we analyze several knowledge graphs and discuss key metrics for diversity, sparsity, and unreliability in realistic KGs. Our experimental evaluation shows that KG embeddings are sensitive to sparse and unreliable data and perform poorly on KGs extracted from text. These findings suggest a rich area of future research: determining new strategies to extend embeddings to cope with sparse and unreliable data. Three promising approaches include revising the closed-world assumption frequently used in training embeddings, combining embeddings with collective probabilistic models that perform well on extracted KGs, and devising an optimization approach for embeddings that exploits confidence from knowledge extraction systems.