Importance sampling for unbiased on-demand evaluation of knowledge base population

Knowledge base population (KBP) systems take in a large document corpus and extract entities and their relations. Thus far, KBP evaluation has relied on judgements on the pooled predictions of existing systems. We show that this evaluation is problematic: when a new system predicts a previously unseen relation, it is penalized even if it is correct. This leads to significant bias against new systems, which counterproductively discourages innovation in the field. Our first contribution is a new importance-sampling-based evaluation that corrects for this bias by annotating a new system's predictions on-demand via crowdsourcing. We show this eliminates bias and reduces variance using data from the 2015 TAC KBP task. Our second contribution is an implementation of our method made publicly available as an online KBP evaluation service. We pilot the service by testing diverse state-of-the-art systems on the TAC KBP 2016 corpus and obtain accurate scores in a cost-effective manner.


Introduction
Harnessing the wealth of information present in unstructured text online has been a long-standing goal for the natural language processing community. In particular, knowledge base population seeks to automatically construct a knowledge base consisting of relations between entities from a document corpus. Knowledge bases have found many applications including question answering (Berant et al., 2013; Fader et al., 2014; Reddy et al., 2014), automated reasoning (Kalyanpur et al., 2012) and dialogue (Han et al., 2015).

* Authors contributed equally.
Evaluating these systems remains a challenge, as it is not economically feasible to exhaustively annotate every possible candidate relation from a sufficiently large corpus. As a result, a pooling-based methodology is used in practice to construct datasets, similar to the methodology used in information retrieval (Jones and Rijsbergen, 1975; Harman, 1993). For instance, at the annual NIST TAC KBP evaluation, all relations predicted by participating systems are pooled together, annotated and released as a dataset for researchers to develop and evaluate their systems on. However, during development, if a new system predicts a previously unseen relation it is considered to be wrong even if it is correct. The discrepancy between a system's true score and its score on the pooled dataset is called pooling bias and is typically assumed to be insignificant in practice (Zobel, 1998).
The key finding of this paper contradicts this assumption and shows that the pooling bias is actually significant, penalizing newly developed systems by 2% F1 on average (Section 3). Novel improvements, which typically increase scores by less than 1% F1 on existing datasets, are therefore likely to be clouded by pooling bias during development. Worse, the bias is larger for a system which predicts qualitatively different relations systematically missing from the pool. Of course, systems participating in the TAC KBP evaluation do not suffer from pooling bias, but this requires researchers to wait a year to get credible feedback on new ideas. This bias is particularly counterproductive for machine learning methods, as they are trained assuming the pool is the complete set of positives: predicting unseen relations and learning novel patterns is penalized. The net effect is that researchers are discouraged from developing innovative approaches, in particular from applying machine learning, thereby slowing progress on the task. Our second contribution, described in Section 4, addresses this bias through a new evaluation methodology, on-demand evaluation, which avoids pooling bias by querying crowdworkers while minimizing cost by leveraging previous systems' predictions when possible. We then compute the new system's score based on the predictions of past systems using importance weighting. As more systems are evaluated, the marginal cost of evaluating a new system decreases. We show how the on-demand evaluation methodology can be applied to knowledge base population in Section 5. Through a simulated experiment on evaluation data released through the TAC KBP 2015 Slot Validation track, we show that we are able to obtain unbiased estimates of a new system's score while significantly reducing variance.
Finally, our third contribution is an implementation of our framework as a publicly available evaluation service at https://kbpo.stanford.edu, where researchers can have their own KBP systems evaluated. The data collected through the evaluation process could also be valuable for relation extraction, entity linking and coreference, and will be made publicly available through the website. We evaluate three systems on the 2016 TAC KBP corpus for about $150 each (a fraction of the cost of the official evaluation). We believe the public availability of this service will speed the pace of progress in developing KBP systems.

[Figure 2 diagram: systems A, B, C and a team of human annotators each propose relation instances i_1, ..., i_6 of the form (s_1, r, o_k, p_k); instance i_6, predicted only by system C, is absent from the pool.]

Figure 2: In pooled evaluation, an evaluation dataset is constructed by labeling relation instances collected from the pooled systems (A and B) and from a team of human annotators (Humans). However, when a new system (C) is evaluated on this dataset, some of its predictions (i_6) are missing and cannot be fairly evaluated. Here, the precision and recall for C should be 3/3 and 3/4 respectively, but its evaluation scores are estimated to be 2/3 and 2/3. The discrepancy between these two scores is called pooling bias.

Background
In knowledge base population, each relation is a triple (SUBJECT, PREDICATE, OBJECT), where SUBJECT and OBJECT are globally unique entity identifiers (e.g. Wikipedia page titles) and PREDICATE belongs to a specified schema.¹ A KBP system returns output in the form of relation instances (SUBJECT, PREDICATE, OBJECT, PROVENANCE), where PROVENANCE is a description of where exactly in the document corpus the relation was found. In the example shown in Figure 1, CARRIE FISHER and DEBBIE REYNOLDS are identified as the subject and object, respectively, of the predicate CHILD OF, and the whole sentence is provided as provenance. The provenance also identifies that CARRIE FISHER is referenced by "Fisher" within the sentence. Note that the same relation can be expressed in multiple sentences across the document corpus; each of these is a different relation instance.
Pooled evaluation. The primary source of evaluation data for KBP comes from the annual TAC KBP competition organized by NIST (Ji et al., 2011). Let E be a held-out set of evaluation entities. There are two steps performed in parallel: First, each participating system is run on the document corpus to produce a set of relation instances; those whose subjects are in E are labeled as either positive or negative by annotators. Second, a team of annotators identifies and labels correct relation instances for the evaluation entities E by manually searching the document corpus within a time budget (Ellis et al., 2012). The labeled relation instances from the two steps are combined and released as the evaluation dataset. In the example in Figure 2, systems A and B were used in constructing the pooled dataset, and there are 3 distinct relations in the dataset, between s_1 and o_1, o_2, o_3.

¹ The TAC KBP guidelines specify a total of 65 predicates (including inverses) such as per:title or org:founded_on. Subject entities can be people, organizations, or geopolitical entities, while object entities also include dates, numbers and arbitrary string values like job titles.
A system is evaluated on the precision of its predicted relation instances for the evaluation entities E and on the recall of the corresponding predicted relations (not instances) for the same entities (see Figure 2 for a worked example). When using the evaluation data during system development, it is common practice to use the more lenient anydoc score that ignores the provenance when checking if a relation instance is true. Under this metric, predicting the relation (CARRIE FISHER, CHILD OF, DEBBIE REYNOLDS) from an ambiguous provenance like "Carrie Fisher and Debbie Reynolds arrived together at the awards show" would be considered correct even though it would be marked wrong under the official metric.
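The difference between the two metrics can be made concrete with a toy sketch (this is not the official TAC scorer; the data structures and names are ours for illustration):

```python
# Instances are (subject, predicate, object, provenance) tuples; `gold`
# holds the labeled-correct instances from the pooled evaluation data.

def official_correct(instance, gold):
    # Official metric: the exact instance, including provenance, must be
    # labeled correct.
    return instance in gold

def anydoc_correct(instance, gold):
    # anydoc metric: ignore provenance; any correct instance with the
    # same (subject, predicate, object) triple suffices.
    s, p, o, _ = instance
    return any((gs, gp, go) == (s, p, o) for gs, gp, go, _ in gold)

gold = {("Carrie Fisher", "child_of", "Debbie Reynolds", "doc1:sent3")}
pred = ("Carrie Fisher", "child_of", "Debbie Reynolds", "doc9:sent1")

print(official_correct(pred, gold))  # False: provenance differs
print(anydoc_correct(pred, gold))    # True: the triple is known correct
```

As the example shows, anydoc credits a prediction whenever the triple is known to be correct somewhere in the corpus, even if the cited provenance does not support it.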

Measuring pooling bias
The example in Figure 2 makes it apparent that pooling-based evaluation can introduce a systematic bias against unpooled systems. However, it has been assumed that the bias is insignificant in practice given the large number of systems pooled in the TAC KBP evaluation. We will now show that this assumption is not valid using data from the TAC KBP 2015 evaluation.²

Figure 3: Median pooling bias (difference between pooled and unpooled scores) on the top 40 systems of the TAC KBP 2015 evaluation using the official and anydoc scores. The bias is much smaller for the lenient anydoc metric, but even so, it is larger than the largest difference between adjacent systems (1.5% F1) and typical system improvements (around 1% F1).

Measuring bias. In total, there are 70 system submissions from 18 teams for 317 evaluation entities (E), and the evaluation set consists of 11,008 labeled relation instances.³ This large evaluation dataset gives us a good measure of the true scores for the participating systems. Similar to Zobel (1998), which studied pooling bias in information retrieval, we simulate the condition of a team not being part of the pooling process by removing any predictions that are unique to its systems from the evaluation dataset. The pooling bias is then the difference between the true and unpooled scores.
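The leave-one-team-out simulation can be sketched as follows (illustrative code, not the authors' implementation; instances are hashable ids and `labels` plays the role of the exhaustive gold annotations):

```python
def precision_against(preds, eval_labels):
    # Pooling assumption: predictions absent from the evaluation set are
    # counted as wrong.
    return sum(eval_labels.get(x, False) for x in preds) / len(preds)

def pooling_bias(team, teams, labels):
    """teams: dict mapping team name -> set of predicted instances;
    labels: instance -> bool (treated as exhaustive gold labels)."""
    preds = teams[team]
    true_score = precision_against(preds, labels)
    # Rebuild the pool without this team's unique predictions.
    pool = set().union(*(p for name, p in teams.items() if name != team))
    unpooled_score = precision_against(preds, {x: labels[x] for x in pool})
    return true_score - unpooled_score

# A toy pool analogous to Figure 2: team C's unique (correct) prediction
# i4 is missing from the pool built from A and B alone.
labels = {"i1": True, "i2": True, "i3": True, "i4": True}
teams = {"A": {"i1", "i2"}, "B": {"i2", "i3"}, "C": {"i1", "i2", "i4"}}
print(round(pooling_bias("C", teams, labels), 3))  # 0.333
```

Team C's true precision is 3/3, but its unpooled precision is 2/3, so the simulation reports a bias of one third.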
Results. Figure 3 shows the results of measuring pooling bias on the TAC KBP 2015 evaluation on the F1 metric using the official and anydoc scores.⁴ ⁵ We observe that even with the lenient anydoc heuristic, the median bias (2.05% F1) is much larger than the largest difference between adjacently ranked systems (1.5% F1). This experiment shows that pooled evaluation is significantly and systematically biased against systems that make novel predictions!

³ …we only consider instances selected in the first part of this process.
⁴ We note that anydoc scores are on average 0.88% F1 larger than the official scores.
⁵ The outlier at rank 36 corresponds to a University of Texas, Austin system that only filtered predictions from other systems and hence has no unique predictions itself.

On-demand evaluation with importance sampling
Pooling bias is fundamentally a sampling bias problem, where relation instances from new systems are underrepresented in the evaluation dataset. We could of course sidestep the problem by exhaustively annotating the entire document corpus, annotating all mentions of entities and checking relations between all pairs of mentions. However, that would be a laborious and prohibitively expensive task: using the interfaces we've developed (Section 6), it costs about $15 to annotate a single document with non-expert crowdworkers, resulting in an estimated cost of at least $1,350,000 for a reasonably large corpus of 90,000 documents (Dang, 2016). The annotation effort would cost significantly more with expert annotators. In contrast, labeling relation instances from system predictions can be an order of magnitude cheaper than finding them in documents: using our interfaces, it costs only about $0.18 to verify each relation instance, compared to $1.60 per instance extracted through exhaustive annotation. We propose a new paradigm called on-demand evaluation, which takes a lazy approach to dataset construction by annotating predictions from systems only when they are underrepresented, thus correcting for pooling bias as it arises. In this section, we formalize the problem solved by on-demand evaluation independent of KBP and describe a cost-effective solution that allows us to accurately estimate evaluation scores without bias using importance sampling. We then instantiate the framework for KBP in Section 5.

Problem statement
Let X be the universe of (relation) instances, Y ⊆ X be the unknown subset of correct instances, and X_1, . . . , X_m ⊆ X be the predictions of m systems. Let f(x) def= I[x ∈ Y] indicate whether an instance is correct and g_i(x) def= I[x ∈ X_i] indicate whether system i predicted it. Then the precision π_i and recall r_i of the set of predictions X_i are

π_i def= E_{x∼p_i}[f(x)],    r_i def= E_{x∼p_0}[g_i(x)],

where p_i is a distribution over X_i and p_0 is a distribution over Y. We assume that p_i is known, e.g. the uniform distribution over X_i, and that we know p_0 up to a normalization constant and can sample from it.
In on-demand evaluation, we can query f(x) (e.g. by labeling an instance) or draw a sample from p_0; typically, querying f(x) is significantly cheaper than sampling from p_0. We obtain prediction sets X_1, . . . , X_m sequentially as the systems are submitted for evaluation. Our goal is to estimate π_i and r_i for each system i = 1, . . . , m.

Simple estimators
We can estimate each π_i and r_i independently with simple Monte Carlo integration. Let X̂_1, . . . , X̂_m be multisets of n_1, . . . , n_m i.i.d. samples from p_1, . . . , p_m respectively, and let Ŷ_0 be a multiset of n_0 samples drawn from p_0. Then the simple estimators for precision and recall are

π̂_i = (1/n_i) Σ_{x∈X̂_i} f(x),    r̂_i = (1/n_0) Σ_{x∈Ŷ_0} g_i(x).
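As a minimal sketch, assuming p_i is uniform over X_i and p_0 is uniform over Y (the function and variable names are ours):

```python
import random

def simple_precision(X_i, Y, n, rng):
    # X̂_i: n i.i.d. samples from p_i (uniform over X_i); f(x) = I[x ∈ Y]
    samples = [rng.choice(X_i) for _ in range(n)]
    return sum(x in Y for x in samples) / n

def simple_recall(Y, X_i, n0, rng):
    # Ŷ_0: n_0 i.i.d. samples from p_0 (uniform over Y); g_i(x) = I[x ∈ X_i]
    samples = [rng.choice(sorted(Y)) for _ in range(n0)]
    return sum(x in X_i for x in samples) / n0

rng = random.Random(0)
Y = set(range(50))          # true instances (unknown in practice)
X_i = list(range(100))      # system i predicts 0..99; half are correct
print(simple_precision(X_i, Y, 2000, rng))   # close to 0.5
print(simple_recall(Y, set(X_i), 500, rng))  # 1.0, since Y ⊆ X_i
```

Both estimators are unbiased, but each new system requires fresh labeled samples, which is what the joint estimators below avoid.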

Joint estimators
The simple estimators are unbiased but have wastefully large variance because evaluating a new system does not leverage labels acquired for previous systems.
On-demand evaluation with the joint estimator works as follows: First, Ŷ_0 is randomly sampled from Y once, when the evaluation framework is launched. For every new set of predictions X_m submitted for evaluation, the minimum number of samples n_m required to accurately evaluate X_m is calculated based on the current evaluation data, Ŷ_0 and X̂_1, . . . , X̂_{m−1}. Then, the set X̂_m is added to the evaluation data by evaluating f(x) on n_m samples drawn from X_m. Finally, the estimates π̂_i and r̂_i are updated for each system i = 1, . . . , m using the joint estimators defined next. In the rest of this section, we answer the following three questions: 1. How can we use all the samples X̂_1, . . . , X̂_m when estimating the precision π_i of system i?
2. How can we use all the samples X̂_1, . . . , X̂_m together with Ŷ_0 when estimating the recall r_i? 3. How many samples n_m should we draw from a new system's predictions X_m?
Estimating precision jointly. Intuitively, if two systems have very similar predictions X_i and X_j, we should be able to use samples from one to estimate the precision of the other. However, it might also be the case that X_i and X_j overlap only on a small region, in which case the samples from X_j do not accurately represent instances in X_i and could lead to a biased estimate. We address this problem by using importance sampling (Owen, 2013), a standard statistical technique for estimating properties of one distribution using samples from another. In importance sampling, if X̂_i is sampled from a proposal distribution q_i, then

π̂_i = (1/n_i) Σ_{x∈X̂_i} (p_i(x)/q_i(x)) f(x)

is an unbiased estimate of π_i. We would like the proposal distribution q_i to both leverage samples from all m systems and be tailored towards system i. To this end, we first define a distribution over systems j, represented by probabilities w_ij, and then define q_i as the mixture obtained by sampling a system j and drawing x ∼ p_j; formally, q_i(x) = Σ_{j=1}^m w_ij p_j(x). We note that q_i(x) not only differs significantly between systems, but also changes as new systems are added to the evaluation pool. Unfortunately, the standard importance sampling procedure requires us to draw and use samples from each distribution q_i(x) independently and thus cannot effectively reuse samples drawn from different distributions. To this end, we introduce a practical refinement to the importance sampling procedure: we independently draw n_j samples according to p_j(x) from each of the m systems and then numerically integrate over these samples, using the weights w_ij to "mix" them appropriately, producing an unbiased estimate of π_i while reducing variance. Formally, we define the joint precision estimator

π̂_i^(joint) = Σ_{j=1}^m (w_ij/n_j) Σ_{x∈X̂_j} (p_i(x)/q_i(x)) f(x),

where each X̂_j consists of n_j i.i.d. samples drawn from p_j. Determining the optimal mixing weights w_ij is a hard problem.
However, we can formally verify that if X_i and X_j are disjoint, then w_ij = 0 minimizes the variance of π̂_i, and if X_i = X_j, then w_ij ∝ n_j is optimal. This motivates the following heuristic choice, which interpolates between these two extremes: w_ij ∝ n_j Σ_{x∈X} p_j(x) p_i(x).
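The joint estimator and the weight heuristic can be sketched as follows (our illustrative code, assuming uniform p_j over each prediction set; not the authors' implementation):

```python
import random

def mixing_weights(i, supports, n):
    # w_ij ∝ n_j * Σ_x p_j(x) p_i(x), with p_j uniform over supports[j]
    raw = []
    for j, S_j in enumerate(supports):
        overlap = len(S_j & supports[i])
        raw.append(n[j] * overlap / (len(S_j) * len(supports[i])))
    z = sum(raw)
    return [r / z for r in raw]

def joint_precision(i, supports, samples, f):
    n = [len(s) for s in samples]
    w = mixing_weights(i, supports, n)
    p = lambda j, x: (1.0 / len(supports[j])) if x in supports[j] else 0.0
    q = lambda x: sum(w[j] * p(j, x) for j in range(len(supports)))
    # π̂_i = Σ_j (w_ij / n_j) Σ_{x ∈ X̂_j} (p_i(x)/q_i(x)) f(x)
    return sum((w[j] / n[j]) * sum(p(i, x) / q(x) * f(x) for x in samples[j])
               for j in range(len(supports)) if w[j] > 0 and n[j] > 0)

rng = random.Random(1)
Y = set(range(50))                           # correct instances
A, B = set(range(100)), set(range(40, 140))  # two overlapping systems
supports = [A, B]
samples = [[rng.choice(sorted(S)) for _ in range(800)] for S in supports]
est = joint_precision(0, supports, samples, lambda x: x in Y)
print(est)  # close to 0.5, system A's true precision
```

Note how system B's samples contribute to system A's estimate only through the overlap region, reweighted by p_i(x)/q_i(x) so the estimate stays unbiased.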
Estimating recall jointly. The recall of system i can be expressed as a product r_i = θν_i, where θ is the recall of the pool, which measures the fraction of all positive instances predicted by the pool (any system), and ν_i is the pooled recall of system i, which measures the fraction of the pool's positive instances predicted by system i. Letting g(x) def= I[x ∈ ∪_{j=1}^m X_j] indicate membership in the pool, we can define these as

θ def= E_{x∼p_0}[g(x)],    ν_i def= E_{x∼p_0}[g_i(x)] / E_{x∼p_0}[g(x)].

We can estimate θ analogously to the simple recall estimator r̂_i, except that we use the pool g instead of a single system g_i: θ̂ = (1/n_0) Σ_{x∈Ŷ_0} g(x). For ν_i, the key is to leverage the work done when estimating precision. We have already evaluated f(x) on X̂_i, so we can compute Ŷ_i def= X̂_i ∩ Y and form the subset Ŷ = ∪_{i=1}^m Ŷ_i. Ŷ is an approximation of Y whose bias we can correct through importance reweighting:

ν̂_i = [Σ_{x∈Ŷ} (p_0(x)/q_i(x)) g_i(x)] / [Σ_{x∈Ŷ} (p_0(x)/q_i(x))],

where q_i and w_ij are the same as before, and the self-normalization allows p_0 to be known only up to a constant. The joint recall estimate is then r̂_i^(joint) = θ̂ ν̂_i.
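Under this reconstruction, the recall pipeline can be sketched as follows (our illustrative code and naming; the self-normalized weights are what let p_0 be unnormalized):

```python
def theta_hat(Y0_samples, pool):
    # θ̂ = (1/n_0) Σ_{x∈Ŷ_0} g(x), where g(x) = I[x in the pool]
    return sum(x in pool for x in Y0_samples) / len(Y0_samples)

def nu_hat(i, Y_hat, supports, w, p0_unnorm):
    # Self-normalized importance reweighting of Ŷ (the labeled-correct
    # samples) with the mixture proposal q_i(x) = Σ_j w_ij p_j(x),
    # with p_j uniform over supports[j].
    q = lambda x: sum(w[j] / len(supports[j])
                      for j in range(len(supports)) if x in supports[j])
    num = sum(p0_unnorm(x) / q(x) for x in Y_hat if x in supports[i])
    den = sum(p0_unnorm(x) / q(x) for x in Y_hat)
    return num / den

# Toy example: two systems and a uniform (unnormalized) p_0.
supports = [set(range(10)), set(range(5, 15))]
w = [0.5, 0.5]
Y_hat = {1, 6, 12}   # correct instances found while labeling samples
nu = nu_hat(0, Y_hat, supports, w, lambda x: 1.0)
print(round(nu, 3))  # 0.6
```

The recall estimate for system i is then the product `theta_hat(...) * nu_hat(...)`, mirroring r_i = θν_i.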
Adaptively choosing the number of samples. Finally, a desired property of on-demand evaluation is to label new instances only when the current evaluation data is insufficient, e.g. when a new set of predictions X_m contains many instances not covered by other systems. We can measure how well the current evaluation set covers the predictions X_m by using a conservative estimate of the variance of π̂_m^(joint). In particular, the variance of π̂_m^(joint) is a monotonically decreasing function of n_m, the number of samples drawn from X_m. We can thus solve for the minimum number of samples required to estimate π̂_m^(joint) within a given confidence interval by using the bisection method (Burden and Faires, 1985).
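As a sketch, with a stand-in 1/n variance bound (the paper's actual conservative variance estimate depends on the mixing weights and existing samples):

```python
def min_samples(variance, target, lo=1, hi=1_000_000):
    # Smallest integer n with variance(n) <= target, assuming variance is
    # monotonically decreasing in n; found by bisection on integers.
    if variance(hi) > target:
        return hi  # cap at the sampling budget
    while lo < hi:
        mid = (lo + hi) // 2
        if variance(mid) <= target:
            hi = mid
        else:
            lo = mid + 1
    return lo

# Worst-case Bernoulli variance bound: Var(π̂) <= 0.25 / n.
print(min_samples(lambda n: 0.25 / n, target=1e-4))  # 2500
```

When earlier systems already cover most of X_m, the conservative variance at n_m = 0 may already meet the target, and no new labels are purchased at all.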

On-demand evaluation for KBP
Applying the on-demand evaluation framework to a task requires us to answer three questions: 1. What is the desired distribution p_i over system predictions? 2. How do we label an instance x, i.e. check whether x ∈ Y? 3. How do we sample from the unknown set of true instances, x ∼ p_0?
In this section, we present practical implementations for knowledge base population.

Sampling from system predictions
Both the official TAC KBP evaluation and the on-demand evaluation we propose use micro-averaged precision and recall as metrics. However, in the official evaluation, these metrics are computed over a fixed set of evaluation entities chosen by LDC annotators, resulting in two problems: (a) defining evaluation entities requires human intervention, and (b) typically a large source of variability in evaluation scores comes from not having enough evaluation entities (see e.g. Webber (2010)). In our methodology, we replace manually chosen evaluation entities by sampling entities from each system's output according to p_i. In effect, p_i makes explicit the decision process of the annotator who chooses evaluation entities. Identifying a reasonable distribution p_i is an important implementation decision that depends on what one wishes to evaluate. Our goal for the on-demand evaluation service we have implemented is to ensure that KBP systems are fairly evaluated on diverse subjects and predicates while, at the same time, ensuring that entities with multiple relations are represented, in order to measure the completeness of knowledge base entries. As a result, we propose a distribution that is inversely proportional to the frequency of the subject and predicate and proportional to the number of unique relations identified for an entity. See Appendix A in the supplementary material for an analysis of this distribution and a study of other potential choices.
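One illustrative way to realize such a distribution (our sketch; the exact weighting used by the service is the subject of Appendix A):

```python
from collections import Counter

def instance_distribution(instances):
    """instances: list of (subject, predicate, object) predictions from
    one system. Weight ∝ (# unique relations for the subject) /
    (subject frequency × predicate frequency), then normalize."""
    subj_freq = Counter(s for s, _, _ in instances)
    pred_freq = Counter(p for _, p, _ in instances)
    uniq = Counter()
    for s, _, _ in set(instances):
        uniq[s] += 1
    weights = {t: uniq[t[0]] / (subj_freq[t[0]] * pred_freq[t[1]])
               for t in set(instances)}
    z = sum(weights.values())
    return {t: v / z for t, v in weights.items()}

preds = [("Fisher", "child_of", "Reynolds"),
         ("Fisher", "child_of", "Lourd"),
         ("Lourd", "employee_of", "CAA")]
dist = instance_distribution(preds)
print(round(dist[("Lourd", "employee_of", "CAA")], 2))  # 0.5
```

The rarer subject and predicate ("Lourd", employee_of) receive more probability mass than the frequent ones, which is the intended fairness-across-diversity behavior.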

Labeling predicted instances
We label predicted relation instances by presenting the instance's provenance to crowdworkers and asking them to identify whether a relation holds between the identified subject and object mentions (Figure 4a). Crowdworkers are also asked to link the subject and object mentions to their canonical mentions within the document and, where possible, to pages on Wikipedia, for entity linking. On average, we find that crowdworkers are able to perform this task in about 20 seconds, corresponding to about $0.05 per instance. We had 5 crowdworkers annotate a small set of 200 relation instances from the 2015 TAC KBP corpus and measured substantial inter-annotator agreement, with a Fleiss' kappa of 0.61 with 3 crowdworkers and 0.62 with 5. Consequently, we take a majority vote over 3 workers in subsequent experiments.

Sampling true instances
Sampling from the set of true instances Y is difficult because we can't even enumerate the elements of Y. As a proxy, we assume that relations are identically distributed across documents and have crowdworkers annotate a random subset of documents for relations using an interface we developed (Figure 4b). Crowdworkers begin by identifying every mention span in a document. For each mention, they are asked to identify its type, canonical mention within the document and associated Wikipedia page if possible. They are then presented with a separate interface to label predicates between pairs of mentions within a sentence that were identified earlier.
We compare crowdsourced annotations against those of expert annotators using data from the TAC KBP 2015 EDL task on 10 randomly chosen documents. We find that 3 crowdworkers together identify 92% of the entity spans identified by expert annotators, while 7 crowdworkers together identify 96%. When using a token-level majority vote to identify entities, 3 crowdworkers identify about 78% of the entity spans; this number does not change significantly with additional crowdworkers. We also measure substantial token-level inter-annotator agreement using Fleiss' kappa for identifying typed mention spans (κ = 0.83), canonical mentions (κ = 0.75) and entity links (κ = 0.75) with just three workers. Based on this analysis, we use a token-level majority vote over 3 workers in subsequent experiments.
The entity annotation interface is far more involved and takes on average about 13 minutes per document, corresponding to about $2.60 per document, while the relation annotation interface costs on average about $2.25 per document. Because documents vary significantly in length and complexity, we set rewards for each document based on the number of tokens (0.75¢ per token) and mention pairs (5¢ per pair) respectively. With 3 workers per document, we paid about $15 per document on average. Each document contained an average of 9.2 relations, resulting in a cost of about $1.61 per relation instance. We note that this is about ten times the cost of verifying a predicted relation instance. We defer details regarding how documents themselves should be weighted to capture diverse entities that span documents to Appendix A.

Evaluation
Let us now see how well on-demand evaluation works in practice. We begin by empirically studying the bias and variance of the joint estimators proposed in Section 4 and find that they are able to correct for pooling bias while significantly reducing variance in comparison with the simple estimators. We then demonstrate that on-demand evaluation can serve as a practical replacement for the TAC KBP evaluations by piloting a new evaluation service we have developed, evaluating three distinct systems on the TAC KBP 2016 document corpus.

Bias and variance of the on-demand evaluation.
Once again, we use the labeled system predictions from the TAC KBP 2015 evaluation and treat them as an exhaustively annotated dataset. To evaluate the pooling methodology, we construct an evaluation dataset using instances found by human annotators and labeled instances pooled from 9 randomly chosen teams (i.e. half the total number of participating teams), and use this dataset to evaluate the remaining 9 teams. On average, the pooled evaluation dataset contains between 5,000 and 6,000 labeled instances and evaluates 34 different systems (since each team may have submitted multiple systems). Next, we evaluated sets of 9 randomly chosen teams with our proposed simple and joint estimators using a total of about 5,000 samples: about 150 of these samples are drawn from Y, i.e. the full TAC KBP 2015 evaluation data, and 150 samples from each of the systems being evaluated. We repeat this simulated experiment 500 times and compare the estimated precision and recall with their true values (Figure 4). The simulations once again highlight that the pooled methodology is biased, while the simple and joint estimators are not. Furthermore, the joint estimators significantly reduce variance relative to the simple estimators: the median 90% confidence intervals shrink from 0.14 to 0.06 for precision and from 0.14 to 0.08 for recall.

Number of samples required by on-demand evaluation
Separately, we evaluate the efficacy of the adaptive sample selection method described in Section 4.3 through another simulated experiment. In each trial of this experiment, we evaluate the top 40 systems in random order. As each subsequent system is evaluated, the number of samples to pick from the system is chosen to meet a target variance and added to the current pool of labeled instances.
To make the experiment more interpretable, we choose the target variance to correspond to the estimated variance of having 500 samples. Figure 4 plots the results of the experiment. The number of samples required to evaluate each new system quickly drops off from the benchmark of 500 as the pool of labeled instances covers more systems. This experiment shows that on-demand evaluation with joint estimation can scale to an order of magnitude more submissions than the simple estimator for the same cost.

A mock evaluation for TAC KBP 2016
We have implemented the on-demand evaluation framework described here as an evaluation service to which researchers can submit their own system predictions. As a pilot of the service, we evaluated three relation extraction systems that also participated in the official 2016 TAC KBP competition. Each system uses Stanford CoreNLP to identify entities, the Illinois Wikifier (Ratinov et al., 2011) to perform entity linking, and a combination of a rule-based system (P), a logistic classifier (L), and a neural network classifier (N) for relation extraction. We used 15,000 newswire documents from the 2016 TAC KBP evaluation as our document corpus. In total, 100 documents were exhaustively annotated for about $2,000, and 500 instances from each system were labeled for about $150 each. Evaluating all three systems took only about 2 hours. Figure 4f reports scores obtained through on-demand evaluation of these systems as well as their corresponding official TAC evaluation scores. While the relative ordering of systems is the same between the two evaluations, we note that precision and recall as measured through on-demand evaluation are respectively higher and lower than the official scores. This is to be expected, because on-demand evaluation measures precision using each system's own output as opposed to an externally defined set of evaluation entities. Likewise, recall is measured using exhaustive annotations of relations within the corpus instead of annotations from pooled output as in the official evaluation.

Related work
The subject of pooling bias has been extensively studied in the information retrieval (IR) community, starting with Zobel (1998), which examined the effects of pooling bias on the TREC AdHoc task but concluded that pooling bias was not a significant problem. However, when the topic was later revisited, Buckley et al. (2007) identified that the reason for the small bias was that the submissions to the task were too similar; upon repeating the experiment with a novel system as part of the TREC Robust track, they identified a 23 percentage point drop in average precision scores! Many solutions to the pooling bias problem have been proposed in the context of information retrieval, e.g. adaptively constructing the pool to collect relevant data more cost-effectively (Zobel, 1998; Cormack et al., 1998; Aslam et al., 2006) or modifying the scoring metrics to be less sensitive to unassessed data (Buckley and Voorhees, 2004; Sakai and Kando, 2008; Aslam et al., 2006). Many of these ideas exploit the ranking of documents in IR, which does not apply to KBP. While both Aslam et al. (2006) and Yilmaz et al. (2008) estimate evaluation metrics using importance sampling estimators, the techniques they propose require knowing the set of all submissions beforehand. In contrast, our on-demand methodology can produce unbiased evaluation scores for new development systems as well.
Several approaches have been taken to crowdsource data pertinent to knowledge base population (Vannella et al., 2014; Angeli et al., 2014; He et al., 2015; Liu et al., 2016). The most extensive annotation effort is probably Pavlick et al. (2016), which crowdsources a knowledge base of gun-violence-related events. In contrast to previous work, our focus is on evaluating systems, not collecting a dataset. Furthermore, our main contribution is not a large dataset but an evaluation service that allows anyone to have the predictions made by their system evaluated through crowdsourcing.

Discussion
Over the last ten years of the TAC KBP task, the gap between human and system performance has barely narrowed despite the community's best efforts: top automated systems score less than 36% F1, while human annotators score more than 60%. In this paper, we've shown that the current evaluation methodology may be a contributing factor because of its bias against novel system improvements. The new on-demand framework proposed in this work addresses this problem by obtaining human assessments of new system output through crowdsourcing. The framework is made economically feasible by carefully sampling the output to be assessed and correcting for sampling bias through importance sampling.
Of course, simply providing better evaluation scores is only part of the solution, and it is clear that better datasets are also necessary. However, the very same difficulties of scale that make evaluating KBP difficult also make it hard to collect a high-quality dataset for the task. As a result, existing datasets (Angeli et al., 2014; Adel et al., 2016) have relied on the output of existing systems, making it likely that they exhibit the same biases against novel systems that we've discussed in this paper. We believe that providing a fair and standardized evaluation platform as a service allows researchers to exploit such datasets while still being able to accurately measure their performance on the knowledge base population task.
There are many other tasks in NLP that are even harder to evaluate than KBP. Existing evaluation metrics for tasks with a generation component, such as summarization or dialogue, leave much to be desired. We believe that adapting the ideas of this paper to those tasks is a fruitful direction, as the progress of a research community is strongly tied to the fidelity of its evaluation.