KGEval: Accuracy Estimation of Automatically Constructed Knowledge Graphs

Automatic construction of large knowledge graphs (KGs) by mining web-scale text corpora has received considerable attention recently. Estimating the accuracy of such automatically constructed KGs is a challenging problem due to their size and diversity. This important problem has largely been ignored in prior research; we fill this gap and propose KGEval. KGEval uses coupling constraints to bind facts together, and crowdsources only those few facts whose labels can infer large parts of the graph. We demonstrate that the objective optimized by KGEval is submodular and NP-hard, allowing guarantees for our approximation algorithm. Through experiments on real-world datasets, we demonstrate that KGEval estimates KG accuracy best among the compared baselines, while requiring significantly fewer human evaluations.


Introduction
Automatic construction of Knowledge Graphs (KGs) from Web documents has received significant interest over the last few years, resulting in the development of several large KGs consisting of hundreds of predicates (e.g., isCity, stadiumLocatedInCity(Stadium, City)) and millions of their instances, called beliefs (e.g., (Joe Louis Arena, stadiumLocatedInCity, Detroit)). Examples of such KGs include NELL and Knowledge-Vault (Dong et al., 2014).
Due to imperfections in the automatic KG construction process, many incorrect beliefs also find their way into these KGs. Knowing the accuracy of each predicate in the KG can provide targeted feedback for improvement and distinguish its strengths from its weaknesses, while the overall accuracy of a KG quantifies the effectiveness of its construction process. Knowing accuracy at predicate-level granularity is immensely helpful for Question-Answering (QA) systems that integrate opinions from multiple KGs. For such systems, being aware that a particular KG is more accurate than others in a certain domain, say sports, helps restrict the search to relevant and accurate subsets of KGs, thereby improving QA precision and response time. In comparison to the large body of recent work focused on the construction of KGs, the important problem of accuracy estimation of such large KGs is unexplored; we address this gap in this paper.
The true accuracy of a predicate may be estimated by aggregating human judgments on the correctness of each and every belief in the predicate. Even though crowdsourcing marketplaces such as Amazon Mechanical Turk (AMT) provide a convenient way to collect human judgments, accumulating such judgments at the scale of large KGs is prohibitively expensive. We shall refer to the task of manually classifying a single belief as true or false as a Belief Evaluation Task (BET). Thus, the crucial problem is: How can we select a subset of beliefs to evaluate which will best estimate the true (but unknown) accuracy of the KG and its predicates?
A naive and popular approach is to evaluate a randomly sampled subset of beliefs from the KG. Since random sampling ignores the relational couplings present among the beliefs, it usually results in oversampling and poor accuracy estimates. Let us motivate this through an example.
Motivating example: We motivate efficient accuracy estimation through the KG fragment shown in Figure 1. Here, each belief is an edge-triple in the graph, for example (RedWings, isA, SportsTeam). There are six correct and two incorrect beliefs (the two incident on Taj Mahal), resulting in an overall accuracy of 75% (= 6/8) which we would like to estimate. Additionally, we would also like to estimate the accuracies of the predicates: homeStadiumOf, homeCity, stadiumLocatedInCity, cityInState and isA.
We now demonstrate how evaluation labels of beliefs are constrained by each other. Type consistency is one such coupling constraint. For instance, we know from the KG ontology that the homeStadiumOf predicate connects an entity from the Stadium category to an entity in the SportsTeam category. Now, if (Joe Louis Arena, homeStadiumOf, Red Wings) is evaluated to be correct, then from these type constraints we can infer that (Joe Louis Arena, isA, Stadium) and (Red Wings, isA, SportsTeam) are also correct. Similarly, by evaluating (Taj Mahal, isA, State) as false, we can infer that (Detroit, cityInState, Taj Mahal) is incorrect.
Additionally, we have Horn-clause coupling constraints (Lao et al., 2011), such as homeStadiumOf(x, y) ∧ homeCity(y, z) → stadiumLocatedInCity(x, z). By evaluating (Red Wings, homeCity, Detroit) and applying this Horn clause to the already evaluated facts mentioned above, we infer that (Joe Louis Arena, stadiumLocatedInCity, Detroit) is also correct. We explore generalized forms of these constraints in Section 3.1.
Thus, by evaluating only three beliefs and exploiting the constraints among them, we exactly recover the overall true accuracy of 75% and also cover all predicates. In contrast, the empirical accuracy obtained by randomly evaluating three beliefs, averaged over 5 trials, is 66.7%.
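The propagation described above can be sketched in a few lines. This is a toy hard-label version of the idea (the paper's actual inference is soft and PSL-based, Section 3.3); the triples follow the running example, and the rule applications are hand-coded for this fragment. Note that in this sketch one belief, (Detroit, cityInState, Michigan), is not reachable from the three seeds.

```python
# Toy propagation over the Figure 1 fragment: three crowdsourced labels
# plus two constraint families label most of the remaining beliefs.
# Hard (boolean) labels only; this is an illustration, not the PSL engine.

beliefs = [
    ("JoeLouisArena", "homeStadiumOf", "RedWings"),
    ("JoeLouisArena", "isA", "Stadium"),
    ("RedWings", "isA", "SportsTeam"),
    ("RedWings", "homeCity", "Detroit"),
    ("JoeLouisArena", "stadiumLocatedInCity", "Detroit"),
    ("Detroit", "cityInState", "Michigan"),
    ("Detroit", "cityInState", "TajMahal"),
    ("TajMahal", "isA", "State"),
]

# Crowdsourced labels for just three beliefs.
labels = {
    ("JoeLouisArena", "homeStadiumOf", "RedWings"): True,
    ("RedWings", "homeCity", "Detroit"): True,
    ("TajMahal", "isA", "State"): False,
}

def propagate(labels):
    """One round of hard-label propagation under two constraint families."""
    inferred = dict(labels)
    # Type consistency: a true homeStadiumOf(x, y) implies the isA facts.
    for (s, p, o), v in labels.items():
        if p == "homeStadiumOf" and v:
            inferred[(s, "isA", "Stadium")] = True
            inferred[(o, "isA", "SportsTeam")] = True
    # A false isA(y, State) falsifies any cityInState(x, y).
    for (s, p, o) in beliefs:
        if p == "cityInState" and inferred.get((o, "isA", "State")) is False:
            inferred[(s, p, o)] = False
    # Horn clause: homeStadiumOf(x,y) ∧ homeCity(y,z) → stadiumLocatedInCity(x,z)
    for (s1, p1, o1), v1 in list(inferred.items()):
        if p1 == "homeStadiumOf" and v1:
            for (s2, p2, o2), v2 in list(inferred.items()):
                if p2 == "homeCity" and s2 == o1 and v2:
                    inferred[(s1, "stadiumLocatedInCity", o2)] = True
    return inferred

inferred = propagate(labels)
labeled = [b for b in beliefs if b in inferred]
accuracy = sum(inferred[b] for b in labeled) / len(labeled)
```

With three seed labels, seven of the eight beliefs get labeled automatically; the accuracy over the inferred portion here is 5/7, since this hard-label sketch omits the constraint that would reach the remaining belief.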
Our contributions in this paper are the following: (1) A systematic study of the important problem of evaluating automatically constructed Knowledge Graphs. (2) KGEval, a novel crowdsourcing-based system that estimates the accuracy of large knowledge graphs (KGs) by exploiting dependencies among beliefs, for more accurate and faster KG accuracy estimation. (3) Extensive experiments on real-world KGs demonstrating KGEval's effectiveness, robustness, and scalability. All the data and code used in the paper are available at http://talukdar.net/mall-lab/KGEval

Overview and Problem Statement

KGEval: Overview
We try to estimate the correctness of as many beliefs as possible while evaluating only a subset of them through crowdsourcing. KGEval achieves this goal using an iterative algorithm which alternates between the following two stages:
• Control Mechanism (Section 3.4): KGEval selects the belief which is to be evaluated next using crowdsourcing.
• Inference Mechanism (Section 3.3): Coupling constraints are applied over evaluated beliefs to automatically estimate correctness of additional beliefs.
This iterative process is repeated until there are no more beliefs to be evaluated. A single iteration of KGEval over the KG fragment from Figure 1 is shown in Figure 2, where the belief (Joe Louis Arena, homeStadiumOf, Red Wings) is selected and evaluated by crowdsourcing. Subsequently, the inference mechanism uses type coupling constraints to infer (Joe Louis Arena, isA, Stadium) and (Red Wings, isA, SportsTeam) as true as well. Next, we formalize the notation used in this paper.

Notations and Problem Statement
We are given a KG with n beliefs. Evaluating a single belief as true or false forms a Belief Evaluation Task (BET); let H denote the set of all BETs. Coupling constraints C are derived by determining relationships among BETs, which we discuss further in Section 3.1. Notation is also summarized in Table 1. The inference algorithm works out the evaluation labels of additional BETs using the constraints C. For a set of already evaluated BETs Q ⊆ H, we define the inferable set I(G, Q) ⊆ H as the set of BETs whose evaluation labels can be deduced by the inference algorithm. For any set of BETs Q ⊆ H with true labels t(h), its average true accuracy is Φ(Q) = (1/|Q|) Σ_{h∈Q} t(h). KGEval aims to sample and crowdsource a BET set Q with the largest inferable set, i.e., subject to the available evaluation budget it solves the optimization problem

Q* = argmax_{Q ⊆ H} |I(G, Q)|.    (1)
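The accuracy functional Φ can be written out directly from its definition; the sketch below uses hypothetical belief identifiers and true labels mirroring the six-true/two-false motivating example.

```python
# Phi(Q) = (1/|Q|) * sum of t(h) over h in Q: average true accuracy of a
# BET set Q, given the (unknown in practice) true-label map t.

def phi(Q, t):
    """Average true accuracy of the BET set Q under label map t."""
    return sum(t[h] for h in Q) / len(Q)

# Eight beliefs as in the motivating example: six true, two false.
t = {f"b{i}": 1 for i in range(6)}
t.update({"b6": 0, "b7": 0})

overall = phi(set(t), t)        # overall true accuracy, 6/8
subset = phi({"b0", "b6"}, t)   # accuracy of a two-BET subset
```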

KGEval: Method Details
In this section, we describe various components of KGEval.

Coupling Constraints
The evaluation labels of beliefs are often dependent on each other due to the rich relational structure of KGs. In this work, we derive the coupling constraints C from the KG ontology and from link-prediction algorithms: PRA (Lao et al., 2011) over NELL and AMIE (Galárraga et al., 2013) over YAGO. These rules are jointly learned over the entire KG with millions of facts and are assumed true. We use conjunction-form first-order-logic rules and refer to them as Horn clauses. Examples of a few such coupling constraints are shown below.
C2: (x, homeStadiumOf, y) → (y, isA, SportsTeam)
C5: (x, homeStadiumOf, y) ∧ (y, homeCity, z) → (x, stadiumLocatedInCity, z)
Each coupling constraint Ci operates over the BETs Hi ⊆ H on the left of its arrow and infers the label of the BET on the right of its arrow. C2 enforces type consistency, and C5 is an instance of a PRA path. Such constraints have also been employed successfully during knowledge extraction and integration (Pujara et al., 2013). Note that the constraints are directional and inference propagates in the forward direction.
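A directional Horn clause can be represented as a (body, head) pair of triple patterns and forward-chained by variable binding. The matcher below is our own minimal sketch (names prefixed with `?` denote variables); only the predicate names come from the paper's examples.

```python
# Forward-chain one conjunction-form Horn clause (body -> head) over a
# set of triples already labeled true. Illustrative sketch only.

def apply_rule(body, head, true_facts):
    """Return the head triples derivable from `true_facts` via the rule."""
    bindings = [{}]  # candidate variable assignments, extended per atom
    for (s, p, o) in body:
        new = []
        for b in bindings:
            for (fs, fp, fo) in true_facts:
                if fp != p:
                    continue
                b2 = dict(b)
                ok = True
                for var, val in ((s, fs), (o, fo)):
                    if var.startswith("?"):
                        if b2.setdefault(var, val) != val:
                            ok = False  # conflicting binding
                    elif var != val:
                        ok = False      # constant mismatch
                if ok:
                    new.append(b2)
        bindings = new
    hs, hp, ho = head
    return {(b.get(hs, hs), hp, b.get(ho, ho)) for b in bindings}

# C5: homeStadiumOf(x, y) ∧ homeCity(y, z) → stadiumLocatedInCity(x, z)
c5_body = [("?x", "homeStadiumOf", "?y"), ("?y", "homeCity", "?z")]
c5_head = ("?x", "stadiumLocatedInCity", "?z")

facts = {("JoeLouisArena", "homeStadiumOf", "RedWings"),
         ("RedWings", "homeCity", "Detroit")}
derived = apply_rule(c5_body, c5_head, facts)
```

Because inference only runs forward from body to head, a false head never retracts a body fact here, matching the directionality noted above.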

Evaluation Coupling Graph (ECG)
To combine all beliefs and constraints in one place, we construct, for all of H and C, a graph with two types of nodes: (1) a node for each BET h ∈ H, and (2) a node for each constraint Ci ∈ C. Each Ci node is connected to all h nodes that participate in it. We call this graph the Evaluation Coupling Graph (ECG), represented as G = (H ∪ C, E). Note that the ECG is a bipartite factor graph (Kschischang et al., 2001) with the h as variable nodes and the Ci as factor nodes. Figure 3 shows the ECG constructed from the motivating example in Figure 1, with |C| = 8 and a separate node for each edge (belief/BET) of the KG. We pose the KG evaluation problem as classification of the BET nodes in the ECG, assigning each a label of 1 or 0 to represent true or false, respectively.
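A minimal in-memory form of the ECG is a bipartite adjacency map from constraint nodes to the BET nodes they couple. The grounding below is a hypothetical fragment of the running example, with C2 and C5 from Section 3.1.

```python
# Sketch of the Evaluation Coupling Graph as constraint -> member-BETs.
# Two BETs are coupled iff they share at least one constraint node.

bets = [
    ("JoeLouisArena", "homeStadiumOf", "RedWings"),
    ("RedWings", "isA", "SportsTeam"),
    ("RedWings", "homeCity", "Detroit"),
    ("JoeLouisArena", "stadiumLocatedInCity", "Detroit"),
]

ecg = {
    "C2": [bets[0], bets[1]],           # type-consistency grounding
    "C5": [bets[0], bets[2], bets[3]],  # Horn-clause (PRA path) grounding
}

def neighbors(bet):
    """BETs reachable from `bet` through a shared constraint node."""
    out = set()
    for members in ecg.values():
        if bet in members:
            out.update(m for m in members if m != bet)
    return out
```

Here the homeStadiumOf BET sits in two constraint groundings, so it couples with all three other BETs; this is exactly the kind of centrally connected node the Max-Degree baseline (Section 4.2) would pick first.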

Inference Mechanism
The inference mechanism propagates the true/false labels of evaluated beliefs to other, non-evaluated beliefs using the available coupling constraints (Bragg et al., 2013). We use Probabilistic Soft Logic (PSL) (Broecheler et al., 2010) as our inference engine. Below we briefly describe its internal workings for the accuracy estimation problem.
A potential function ψj is defined for each constraint Cj using the Lukasiewicz t-norm; it measures the degree to which Cj is satisfied. For example, C5 above is transformed from its first-order logical form into the real-valued potential

ψ5 = max(0, h_(x, homeStadiumOf, y) + h_(y, homeCity, z) − 1 − h_(x, stadiumLocatedInCity, z)),

where h_x denotes the evaluation score ∈ [0, 1] associated with a BET. The probability distribution over label assignments is structured so that labels which satisfy more coupling constraints are more probable. The probability of a label assignment Ω_H ∈ {0, 1}^|H| over the BETs in G is given by

P(Ω_H) = (1/Z) exp(− Σ_j θ_j ψ_j),

where Z is the normalizing constant and ψ_j is the potential function acting over the BETs h ∈ Dom(C_j). The final label assignment Ω_H^PSL ∈ {0, 1}^|H| is obtained by solving the maximum a-posteriori (MAP) optimization problem Ω_H^PSL = argmax_{Ω_H} P(Ω_H). We denote by M_PSL(h, γ) ∈ [0, 1] the PSL-estimated score for label γ ∈ {1, 0} on BET h in this optimization. Inferable Set using PSL: We define the estimated label of each BET h as

l(h) = 1 if M_PSL(h, 1) ≥ τ; 0 if M_PSL(h, 0) ≥ τ; and ∅ otherwise,
where the threshold τ is a system hyper-parameter. The inferable set is composed of the BETs for which the inference algorithm (PSL) confidently propagates labels:
I(G, Q) = {h ∈ H | l(h) ≠ ∅}. Note that two BET nodes of the ECG can interact with varying strengths through different constraint nodes; this multi-relational structure calls for soft probabilistic propagation.
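The Lukasiewicz relaxation that underlies these potentials is easy to show concretely. The sketch below (variable names are ours) computes the distance to satisfaction of constraint C5 for soft scores in [0, 1]: zero when the implication holds, positive when the head score lags the body.

```python
# Lukasiewicz t-norm relaxation of a Horn clause, as used by PSL:
# soft AND(a, b) = max(0, a + b - 1); soft (body -> head) = min(1, 1 - body + head).
# The potential is the distance to satisfaction, 1 - truth(rule).

def luk_and(a, b):
    return max(0.0, a + b - 1.0)

def luk_implies(body, head):
    return min(1.0, 1.0 - body + head)

def distance_to_satisfaction(h_xy, h_yz, h_xz):
    """Potential for homeStadiumOf(x,y) ∧ homeCity(y,z) → stadiumLocatedInCity(x,z)."""
    return 1.0 - luk_implies(luk_and(h_xy, h_yz), h_xz)

# Satisfied: head score (0.9) covers the body score (0.7).
d1 = distance_to_satisfaction(0.9, 0.8, 0.9)
# Violated: head score (0.2) lags the body score (0.7).
d2 = distance_to_satisfaction(0.9, 0.8, 0.2)
```

MAP inference in PSL then minimizes the weighted sum of such distances over all grounded constraints, which is what makes assignments violating fewer constraints more probable.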

Control Mechanism
The control mechanism selects the BET to be crowd-evaluated at each iteration. We first present the following two theorems involving KGEval's optimization in Equation (1); please refer to the Appendix for the proofs of both.
Theorem 1 [Submodularity]. The function optimized by KGEval (Equation (1)) using the PSL-based inference mechanism is submodular (Lovász, 1983). The proof follows from the fact that all pairs of BETs satisfy the regularity condition (Jegelka and Bilmes, 2011; Kolmogorov and Zabih, 2004), combined with a proven conjecture (Mossel and Roch, 2007).
Theorem 2 [NP-Hardness]. Selecting the optimal solution in KGEval's optimization (Equation (1)) is NP-hard. The proof follows by reducing the NP-complete Set Cover Problem (SCP) to the problem of selecting a Q that covers I(G, Q).
Justification for Greedy Strategy: From Theorems 1 and 2, the function optimized by KGEval is submodular and its maximization is NP-hard. Results from (Nemhauser et al., 1978) show that greedy hill-climbing algorithms solve such maximization problems within an approximation factor of (1 − 1/e) ≈ 63% of the optimal solution. Hence, we iteratively select the next BET which gives the greatest increase in the size of the inferable set.
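The greedy step can be sketched as follows. Here `inferable_set` is a toy stand-in for the PSL-based I(G, Q) (in KGEval the marginal gain requires running inference); the coverage map is hypothetical.

```python
# Greedy control mechanism: pick the un-evaluated BET h maximizing
# |I(G, Q ∪ {h})|. The inferable-set oracle below is a toy coverage
# function, not the PSL engine.

def greedy_select(H, Q, inferable_set):
    """Return the BET whose addition yields the largest inferable set."""
    candidates = [h for h in H if h not in Q]
    return max(candidates, key=lambda h: len(inferable_set(Q | {h})))

# Toy oracle: each BET directly "infers" a fixed set of BETs.
covers = {"a": {"a", "b", "c"}, "b": {"b"}, "c": {"c", "d"}, "d": {"d"}}

def inferable_set(Q):
    out = set()
    for h in Q:
        out |= covers[h]
    return out

picked = greedy_select({"a", "b", "c", "d"}, set(), inferable_set)
```

Submodularity of the objective (Theorem 1) is what gives this simple loop its (1 − 1/e) guarantee.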
We acknowledge the importance of crowdsourcing aspects such as label aggregation and worker quality estimation. Appendix A.1 presents a mechanism for handling noisy crowd workers under a limited budget.

Bringing it All Together
Algorithm 1 presents KGEval. In Lines 1-3, we build the Evaluation Coupling Graph G = (H ∪ C, E) and use the labels of the seed set S to initialize it. In Lines 4-16, we repeatedly run the inference mechanism until either the accuracy estimates have converged or all BETs are covered. In each iteration, the BET with the largest inferable set is identified and evaluated using crowdsourcing (Lines 5-6). The new inferable set Qt is estimated, and these automatically annotated nodes are added to Q (Lines 7-10). Convergence: In this paper, we declare convergence when the variance of the sequence of accuracy estimates [Acc_{t−k}, . . ., Acc_{t−1}, Acc_t] is less than α. We set k = 9 and α = 0.002 for our experiments.
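The convergence test just described is a variance check over a sliding window of the last k + 1 accuracy estimates; a direct transcription:

```python
# Stop when the variance of the last k+1 accuracy estimates falls below
# alpha (the paper uses k = 9, alpha = 0.002).

def converged(acc_history, k=9, alpha=0.002):
    """True iff the trailing window of estimates has variance < alpha."""
    if len(acc_history) < k + 1:
        return False  # not enough iterations yet
    window = acc_history[-(k + 1):]
    mean = sum(window) / len(window)
    var = sum((x - mean) ** 2 for x in window) / len(window)
    return var < alpha
```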

Experiments
To assess the effectiveness of KGEval, we ask the following questions: (1) How effective is KGEval in estimating KG accuracy, both at the predicate level and at the overall KG level? (Section 4.3) (2) How important are the coupling constraints to its performance? (Section 4.4) (3) How robust is KGEval in estimating the accuracy of KGs of varying quality? (Section 4.5)
Datasets: For NELL, we use Horn-clause coupling constraints learned by PRA (Lao et al., 2011); the confidence scores returned by PRA are used as the rule weights θi. We use the NELL ontology's predicate signatures to derive the type constraints. Note that PSL can handle weighted constraints and also learn their weights (relative importance), so it is not critical that the provided constraints be perfectly correct. We also select a YAGO2 sample (YAGO) which, unlike NELL-sports, is not domain specific. There, we use AMIE Horn clauses (Galárraga et al., 2013) to construct the multi-relational coupling constraints C_Y; for each Ci, the score returned by AMIE is used as its rule weight θi. Table 2 reports the statistics of the datasets used, their true accuracy, and the number of coupling constraints.
Size of evaluation set: NELL-sports consists of 23,422 beliefs with 13,290 unique entities and 53 unique predicates, whereas the YAGO sample has 31,720 beliefs with 32,103 unique entities and 17 predicates. In order to calculate accuracy, we require gold evaluations of all beliefs in the evaluation set. Since obtaining gold evaluations of the entire (or large subsets of the) NELL and YAGO2 KGs would be prohibitively expensive, we take subsets of these KGs for evaluation. Our released package (KGEval) contains the datasets used, their crowdsourced labels, the coupling constraints, and the code for inference and control.

Model Description
Initialization: Algorithm 1 requires an initial seed set S, which we generate by randomly evaluating |S| = 50 beliefs from H. To maintain fairness, all baselines start from the same S. For asserting a true (or false) label for a belief, we set a high soft-label confidence threshold of τ = 0.8 (see Section 3.3).

Crowdsourcing of BETs
To compare KGEval's predictions against human evaluations, we evaluate all BETs in H_N ∪ H_Y on AMT. For the ease of workers, we translate each entity-relation-entity belief into a human-readable format before posting it to the crowd.
We published BETs on AMT under the 'classification project' category. We hired AMT-recognized master workers for high-quality labels and paid $0.01 per BET. To compare 'master' and 'non-master' workers, we correlated their labels individually with expert labels on a random subset and observed that master workers correlated better (93%) than three non-masters (89%). Consequently, we treat the votes of master workers on H_N ∪ H_Y as gold labels, which we would like our inference algorithm to be able to predict. As the average turnaround time for AMT tasks is a few minutes (Dupuis et al., 2013), KGEval is effectively real-time within that turnaround range.

Performance Evaluation Metrics
The performance of the various methods is evaluated using the following two metrics. To capture accuracy at the predicate level, we define Δ_predicate as the average, over the R predicates in the KG, of the absolute difference between the gold and estimated accuracy of each predicate:

Δ_predicate = (1/R) Σ_{r=1}^{R} |Φ(H_r) − (1/|H_r|) Σ_{h∈H_r} l(h)|

We define Δ_overall as the absolute difference between the gold and estimated accuracy over the entire evaluation set:

Δ_overall = |Φ(H) − (1/|H|) Σ_{h∈H} l(h)|

Above, Φ(H) is the overall gold accuracy, Φ(H_r) is the gold accuracy of predicate r, and l(h) is the label assigned by the method under evaluation. Δ_overall treats the entire KG as a single bag of BETs, whereas Δ_predicate segregates beliefs by their predicate. For both metrics, lower is better.
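The two metrics, written out under the definitions above (assuming absolute differences, and with dictionaries standing in for the label maps):

```python
# Δ_overall: |gold accuracy - estimated accuracy| over all BETs.
# Δ_predicate: average of that difference computed per predicate.

def delta_overall(gold, estimated):
    """Absolute gap between gold and estimated accuracy over H."""
    n = len(gold)
    return abs(sum(gold.values()) / n - sum(estimated.values()) / n)

def delta_predicate(gold, estimated, predicate_of):
    """Average per-predicate absolute gap between gold and estimate."""
    preds = set(predicate_of.values())
    total = 0.0
    for r in preds:
        hs = [h for h in gold if predicate_of[h] == r]
        g = sum(gold[h] for h in hs) / len(hs)
        e = sum(estimated[h] for h in hs) / len(hs)
        total += abs(g - e)
    return total / len(preds)

# Hypothetical four-BET example over two predicates "p" and "q".
gold = {1: 1, 2: 0, 3: 1, 4: 1}
estimated = {1: 1, 2: 1, 3: 1, 4: 1}   # a method that labels everything true
predicate_of = {1: "p", 2: "p", 3: "q", 4: "q"}

d_o = delta_overall(gold, estimated)
d_p = delta_predicate(gold, estimated, predicate_of)
```

Note how the all-true estimator is penalized on predicate "p" only; with more skewed predicates, Δ_predicate can diverge from Δ_overall, which is why both are reported.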

Baseline Methods
Since accuracy estimation of large multi-relational KGs is a relatively unexplored problem, there are no well-established baselines for this task apart from random sampling. We compare KGEval against the following baselines:
• Random: Randomly sample a BET h ∈ H without replacement and crowdsource its correctness. The selection of each subsequent BET is independent of previous selections.
• Max-Degree: Sort the BETs in decreasing order of their degree in the ECG and select them from the top; this method favors the selection of more centrally connected BETs first.
• Independent Cascade: This method is based on the contagion transmission model, in which nodes infect only their immediate neighbors (Kempe et al., 2003). At each iteration t, we choose a BET that has not yet been evaluated, crowdsource its label, and let the label propagate through the ECG.
• KGEval: The method proposed in Algorithm 1.

Effectiveness of KGEval
Experimental results comparing Δ_overall and Δ_predicate of all methods at convergence are presented in Table 3. We observe that KGEval achieves the best estimates across both datasets and metrics. Due to the significant positive bias in H_Y (see Table 2), all methods do fairly well on Δ_overall on this dataset, though KGEval still outperforms the others. KGEval also estimates KG accuracy most closely while using the fewest crowd-evaluated queries. This clearly demonstrates KGEval's effectiveness.
Nodes in the coupling graph with higher degree are those that participate in a large number of constraints. In real KGs, such facts tend to be correct, as they interact with several other facts; hence Max-Degree overestimates accuracy by propagating True labels. In contrast, Random samples True and False labels in an unbiased fashion. Predicate-level Analysis: Here, we consider the top two baselines from Table 3.

Importance of Coupling Constraints
This paper is largely motivated by the thesis that exploiting richer relational couplings among BETs may result in faster and more accurate evaluations. To evaluate this thesis, we successively ablated Horn-clause coupling constraints of body-length 2 and 3 from C_N. We observe that with the full (non-ablated) constraint set C_N, KGEval takes the least number of crowd evaluations of BETs to converge, while providing the best accuracy estimate, whereas with the ablated constraint sets, KGEval takes up to 2.4x more crowd-evaluation queries to converge; this validates our thesis.

Additional Experiments
Other Baselines along with Inference: In order to evaluate how Random and Max-Degree perform in conjunction with the inference mechanism, we replaced KGEval's greedy control mechanism in Line 5 of Algorithm 1 with these two control mechanisms. In our experiments, we observed that both Random+inference and Max-Degree+inference estimate accuracy more accurately than their control-only variants. Secondly, even though the accuracies estimated by Random+inference and Max-Degree+inference were comparable to that of KGEval, they required a larger number of crowd-evaluation queries: 1.2x and 1.35x more, respectively. This demonstrates the effectiveness of the greedy mechanism.
Rate of Coverage: In the case of large KGs and a scarce budget, it is imperative to have a mechanism that covers greater parts of the KG with fewer crowdsourced queries. Figure 5 shows the fraction of total beliefs whose evaluations were automatically inferred by the different methods, as a function of the number of crowd-evaluated beliefs. We observe that KGEval infers evaluations for the largest number of BETs at every supervision level.
Robustness to Noise: In order to test the robustness of the methods in estimating the accuracy of KGs with different gold accuracies, we artificially added noise to H_N by flipping a fixed fraction of edges, otherwise following the same evaluation procedure as in Section 3.5. We analyze Δ_overall (and not Δ_predicate) because flipping edges in the KG distorts predicate-level dependencies; results are presented in Table 5. We evaluated all the methods and observed that while the performance of the other methods degraded significantly with diminishing KG quality (more noise), KGEval remained significantly robust to noise. Scalability comparison with MLN: Markov Logic Networks (Richardson and Domingos, 2006) could serve as a candidate inference mechanism. We compared the runtime of KGEval with PSL and with MLN as inference engines. While PSL took 320 seconds to complete one iteration, the MLN implementation (PyMLN) could not finish grounding the rules even after 7 hours. This justifies our choice of PSL as the inference engine for KGEval.

Related Work
Even though Knowledge Graph (KG) construction is an active area of research, we are not aware of any previous research that systematically studies the important problem of estimating the accuracy of such automatically constructed KGs. Random sampling has traditionally been the preferred method for large-scale KG accuracy estimation.
Traditional crowdsourcing research has typically considered atomic allocation of tasks, where the requester posts them independently. KGEval operates in a rather novel crowdsourcing setting, as it exploits dependencies among its tasks (BETs, or belief evaluations). Our notion of interdependence (coupling constraints) among tasks is more general than, and different from, related ideas explored previously in the crowdsourcing literature (Kolobov et al., 2013; Bragg et al., 2013; Sun et al., 2015). Even though coupling constraints have been used for KG construction (Nakashole et al., 2011; Galárraga et al., 2013), they have so far not been exploited for KG evaluation. We address this gap in this paper. The task of knowledge corroboration (Kasneci et al., 2010) proposes a probabilistic model that utilizes a fixed set of basic first-order logic rules for label propagation and is closely aligned with our motivations. However, unlike KGEval, it does not try to reduce the number of queries to crowdsource or to maximize coverage.

Conclusion
In this paper, we have initiated a systematic study into the important problem of evaluating automatically constructed Knowledge Graphs. To address this challenge, we have proposed KGEval, an instance of a novel crowdsourcing paradigm in which dependencies among the tasks presented to humans (BETs) are exploited. To the best of our knowledge, this is the first method of its kind. We demonstrated that the objective optimized by KGEval is in fact NP-hard and submodular, and hence allows for the application of simple greedy algorithms with guarantees. Through extensive experiments on real datasets, we demonstrated the effectiveness of KGEval. We hope to extend KGEval to incorporate varying evaluation costs, and also to explore more sophisticated evaluation aggregation.

Submodularity:
A real-valued function f over subsets of a finite set H is said to be submodular if for all R ⊆ S ⊂ H and h ∈ H \ S it fulfills

f(R ∪ {h}) − f(R) ≥ f(S ∪ {h}) − f(S).

We call a potential function ψ pairwise regular if for all pairs of BETs {p, q} ⊆ H it satisfies

ψ(1, 1) + ψ(0, 0) ≤ ψ(0, 1) + ψ(1, 0).

Proof. (for Theorem 1) The additional utility, in terms of label inference, obtained by adding a BET to a larger set is no more than that obtained by adding it to any smaller subset. By construction, any two BETs that share a common factor node Cj are encouraged to take similar labels in G.
We consider a BET h to be confidently inferred when the soft score of its label assignment in I(G, Q) exceeds a threshold τ_h ∈ [0, 1]. From the above, we know that P(l(h) | Q) is submodular with respect to a fixed initial set Q. Although the max or min of submodular functions is not submodular in general, (Kempe et al., 2003) conjectured that the global function of Equation (1) is submodular if the local threshold function P(h | Q) ≥ τ_h respects submodularity, which holds in our case by Equation (3). This conjecture was subsequently proved in (Mossel and Roch, 2007), making our global optimization function of Equation (1) submodular.
Proof. (for Theorem 2) We reduce the NP-complete Set Cover Problem (SCP) to the problem of selecting a Q that covers I(G, Q). To remain consistent with the earlier notation, we define the SCP instance by a collection of subsets I1, I2, . . . , Im of the set H = {h1, h2, . . . , hn}, and we want to determine whether there exist k subsets whose union equals H. We define a bipartite graph with m + n nodes corresponding to the Ii and the hj respectively, and construct an edge (Ii, hj) iff hj ∈ Ii. We then need to find a set Q of cardinality k such that |I(G, Q)| ≥ n + k.
Choosing the BET set Q from the SCP solution and then inferring the evaluations of the remaining BETs using PSL solves the problem at hand.

A.1 Noisy Crowd Workers and Budget
Here, we provide a scheme to allocate crowd workers so as to remain within a specified budget and to upper-bound the total error of the accuracy estimate. We have not integrated this mechanism into Algorithm 1 in order to keep the latter simple.
We resort to majority voting in our analysis and assume that crowd workers are not adversarial, so the expectation of the responses r_h(u) for a task h over multiple workers u is close to its true label t(h) (Tran-Thanh et al., 2013), i.e., E_{u∼D(u,h)}[r_h(u)] ≈ t(h), where D is the joint probability distribution over workers u and tasks h.
Our key idea is that we want to be more confident about BETs h with larger inferable sets (as they impact larger parts of the KG) and hence allocate them more budget, posting them to more workers. We determine the number of workers {w_{h_1}, . . . , w_{h_n}} for each task such that tasks h_t with larger inferable sets receive higher w_{h_t}. For a total budget B, we allocate w_{h_t} as an increasing function of i_t, where i_t denotes the cardinality of the inferable set I(G, Q ∪ {h_t}), c the cost of querying a crowd worker, i_max the size of the largest inferable set, and γ ∈ [0, 1] a constant.
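The exact allocation formula is elided in the text above; the sketch below is a hypothetical proportional scheme of our own that captures only the stated idea: tasks with larger inferable sets i_t receive more of the budget B at per-worker cost c.

```python
# Hypothetical worker-allocation sketch (NOT the paper's elided formula):
# split the B/c affordable worker slots across tasks in proportion to
# their inferable-set sizes, with at least one worker per task.

import math

def allocate_workers(inferable_sizes, B, c):
    """Workers per task, proportional to inferable-set size, within budget."""
    total = sum(inferable_sizes)
    raw = [B / c * i / total for i in inferable_sizes]
    return [max(1, math.floor(r)) for r in raw]

workers = allocate_workers([8, 4, 2, 2], B=16, c=1.0)
```

Any monotone allocation in i_t preserves the qualitative property the analysis below relies on: the aggregation error for a task shrinks as its inferable set grows.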
This allocation mechanism integrates easily with Algorithm 1: in Line 8 we determine the size of the inferable set i_t = |Q_t| for task h and allocate w_h crowd workers, and the budget depletion step (Line 9) becomes B_r = B_r − w_h · c. The following theorem bounds the error of this allocation scheme. Proof. Let γ ∈ [0, 1] control the reduction in the size of the inferable set, i.e., i_{t+1} = γ · i_t. Allocating w_{h_t} redundant workers for each task h_t, t ∈ {1, . . . , n}, with inferable-set size i_t, incurs a total cost of Σ_{t=1}^{n} w_{h_t} · c. Note that the above geometric approximation helps in estimating the partial sums Σ_t i_t at any iteration t ≤ n.
Error Bounds: Here we show that the expected error in estimating h_t at any time t decreases exponentially in the size of the inferable set i_t. We use majority voting to aggregate the w_{h_t} worker responses for h_t, denoted r̄_{h_t} ∈ {0, 1}, where r_{h_t}(u_k) is the response of the k-th worker for h_t. The error of the aggregated response is Δ(h_t) = |r̄_{h_t} − t(h_t)|, where t(h_t) is the true label. From Equation (5) and the Hoeffding-Azuma bound over the w_{h_t} i.i.d. responses with error margin ε_t, we have P(Δ(h_t) ≥ ε_t) ≤ 2 exp(−2 w_{h_t} ε_t²). For a fixed budget B and a given error margin ε_t, this yields Δ(h_t) = e^{−O(i_t)}. Summing over all tasks t and applying the union bound, the total expected deviation from the absolute truth is Δ(B) = Σ_{t=1}^{n} Δ(h_t).
Thus, for fixed parameters, the accuracy estimation error decays exponentially with increasing total budget.