Active Testing: An Unbiased Evaluation Method for Distantly Supervised Relation Extraction

Distant supervision has been widely used for neural relation extraction because it labels datasets automatically. However, existing works on distantly supervised relation extraction suffer from the low quality of the test set, which leads to considerably biased performance evaluation. These biases not only result in unfair evaluations but also mislead the optimization of neural relation extraction models. To mitigate this problem, we propose a novel evaluation method named active testing, which utilizes both the noisy test set and a few manual annotations. Experiments on a widely used benchmark show that our proposed approach yields approximately unbiased evaluations for distantly supervised relation extractors.


Introduction
Relation extraction aims to identify the relations between a pair of entities in a sentence. It has been thoroughly researched with supervised methods on hand-labeled data. To break the bottleneck of manual labeling, distant supervision (Mintz et al., 2009) automatically labels raw text with knowledge bases. It assumes that if a pair of entities has a known relation in a knowledge base, all sentences containing these two entities may express that relation. Clearly, the automatically labeled datasets in distant supervision contain large numbers of sentences with wrong relation labels. However, previous works focus only on wrongly labeled instances in training sets and neglect those in test sets. Most of them estimate their performance with held-out evaluation on noisy test sets, which yields inaccurate evaluations of existing models and seriously misleads model optimization. As shown in Table 1, we compare the results of held-out evaluation and human evaluation for the same model on the widely used benchmark dataset NYT-10 (Riedel et al., 2010). The biases between human evaluation and existing held-out evaluation are over 10%, and they are mainly caused by wrongly labeled instances in the test set, especially false negative instances. (* Corresponding author: jiawj@bnu.edu.cn.)
Table 1: Comparison of held-out and human evaluation (precision, %).

Evaluation            P@100      P@200         P@300
Held-out Evaluation   83         77            69
Human Evaluation      93 (+10)   92.5 (+15.5)  91 (+22)

A false negative instance is an entity pair labeled as non-relation even though it has at least one relation in reality. This problem is caused by the incompleteness of existing knowledge bases. For example, over 70% of the people included in Freebase have no recorded place of birth (Dong et al., 2014). From a random sample, we estimate that about 8.75% of the entity pairs in the test set of NYT-10 are misclassified as non-relation. Clearly, these mislabeled entity pairs yield biased evaluations and lead to inappropriate optimization of distantly supervised relation extraction.
In this paper, we propose an active testing approach to estimate the performance of distantly supervised relation extraction. Active testing has been proven effective for evaluating vision models on large-scale noisy datasets (Nguyen et al., 2018). Our approach is iterative, with two stages per iteration: a vetting stage and an estimating stage. In the vetting stage, we adopt an active strategy to select batches of the most valuable entity pairs from the noisy test set for annotation. In the estimating stage, a metric estimator is used to obtain a more accurate evaluation.
After a few vetting-estimating iterations, the evaluation results come remarkably close to those of human evaluation, using only limited vetted data together with all the noisy data. Experimental results demonstrate that the proposed evaluation method yields approximately unbiased estimates for distantly supervised relation extraction.

Related Work
Distant supervision (Mintz et al., 2009) was proposed to handle large-scale relation extraction with automatic annotation. A series of studies on distantly supervised relation extraction have been conducted with human-designed features (Riedel et al., 2010; Surdeanu et al., 2012; Takamatsu et al., 2012; Angeli et al., 2014; Han and Sun, 2016). In recent years, neural models have been widely used to extract semantic features accurately without hand-designed features (Zeng et al., 2015; Lin et al., 2017; Zhang et al., 2019). To alleviate the influence of wrongly labeled instances in distant supervision, these neural relation extractors have integrated techniques such as attention mechanisms (Lin et al., 2016; Han et al., 2018; Huang and Du, 2019), generative adversarial nets (Qin et al., 2018a; Li et al., 2019), and reinforcement learning (Feng et al., 2018; Qin et al., 2018b). However, none of the above methods pays attention to the biased and inaccurate test set. Though human evaluation yields accurate results (Zeng et al., 2015; Alt et al., 2019), labeling all instances in the test set is too costly.

Task Definition
In the distant supervision paradigm, all sentences containing the same entity pair constitute a bag. Researchers train a relation extractor on bags of sentences and then use it to predict the relations of entity pairs. Suppose a distantly supervised model returns confidence scores (estimated probabilities for relations) s_i = {s_i1, s_i2, ..., s_ip} for entity pair i ∈ {1, ..., N}, where p is the number of relations, N is the number of entity pairs, and s_ij ∈ (0, 1). Let y_i = {y_i1, y_i2, ..., y_ip} and z_i = {z_i1, z_i2, ..., z_ip} respectively denote the automatic labels and the true labels for entity pair i, where y_ij and z_ij are both in {0, 1} (an entity pair may have more than one relation).

In the widely used held-out evaluation, existing methods report two key metrics: precision at top K (P@K) and the Precision-Recall (PR) curve. To compute both metrics, the confidence scores of all entity pairs are sorted in descending order, written s = {s_1, s_2, ..., s_P} with P = Np. The correspondingly sorted automatic and true labels are denoted y = {y_1, ..., y_P} and z = {z_1, ..., z_P}. P@K and R@K are then given by

P@K = (1/K) Σ_{j=1}^{K} z_j,    R@K = (Σ_{j=1}^{K} z_j) / (Σ_{j=1}^{P} z_j).

Held-out evaluation replaces z with y when computing P@K and R@K, which obviously yields incorrect results.
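Concretely, both metrics can be computed from the scores and labels alone; a minimal sketch (function and variable names are ours, not from the paper):

```python
import numpy as np

def precision_recall_at_k(scores, labels, k):
    """P@K and R@K over flattened (entity pair, relation) entries.

    Passing the automatic labels y gives the held-out estimate;
    passing the true labels z gives the real metric.
    """
    order = np.argsort(-np.asarray(scores, dtype=float))  # descending confidence
    ranked = np.asarray(labels)[order]
    hits = ranked[:k].sum()                # correct predictions in the top K
    precision = hits / k
    recall = hits / max(ranked.sum(), 1)   # share of all positives recovered
    return precision, recall
```

Running the same function with y versus z on a noisy test set exposes exactly the bias that held-out evaluation introduces.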

Methodology
In this section, we present the general framework of our method. A small randomly sampled set is vetted in the initial state. Each iteration then has two steps: 1) select a batch of entity pairs with a customized vetting strategy, label them manually, and add them to the vetted set; 2) use a new metric estimator to evaluate existing models with the noisy set and the vetted set jointly. After a few vetting-estimating iterations, the unbiased performance of relation extraction is appropriately estimated. In summary, our method consists of two key components: a vetting strategy and a metric estimator.

Metric Estimator
Our test set consists of two parts: 1) a noisy set U, in which we only know the automatic label y_i; and 2) a vetted set V, in which we know both the automatic label y_i and the manual label z̃_i. We treat the true label z_i as a latent variable and z̃_i as its observed value. The performance evaluation mainly depends on the estimation of z_i. In our work, we estimate the posterior probability p(z_i | Θ), where Θ represents all available evidence, such as the confidence scores and the noisy labels. We assume that the distribution of the true latent labels is conditioned on Θ.
Given posterior estimates p(z_i | Θ), we can compute the expected performance by replacing the true latent label with its probability. The precision and recall can then be rewritten as

P@K = (1/K) Σ_{j=1}^{K} p_j,    R@K = (Σ_{j=1}^{K} p_j) / (Σ_{j=1}^{P} p_j),

where p_j = p(z_j = 1 | Θ) for unvetted entries and p_j = z̃_j for vetted ones. To predict the true latent label z_i for a specific relation, we use the noisy label y_i and the confidence score s_i. This posterior probability can be derived as (see the appendix for the proof)

p(z_jk = v | y_jk, s_jk) = p(y_jk | z_jk = v) p(z_jk = v | s_jk) / Σ_{v' ∈ {0,1}} p(y_jk | z_jk = v') p(z_jk = v' | s_jk),

where v ∈ {0, 1} and s_jk, y_jk, z_jk are the corresponding elements of s_i, y_i, z_i before sorting by confidence score. Given a few vetted data, we fit p(y_jk | z_jk) by standard maximum likelihood estimation (counting frequencies). p(z_jk | s_jk) is fitted with logistic regression; a separate logistic regression function is fitted for each relation.
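As a concrete illustration, the two components could be fitted and combined roughly as follows. This is a sketch under our own naming: the small gradient-descent loop merely stands in for any off-the-shelf logistic regression solver, and the data layout is assumed, not taken from the paper.

```python
import numpy as np

def fit_estimator(s_vet, y_vet, z_vet, epochs=2000, lr=0.5):
    """Fit both components on the vetted set for one relation.

    p(y | z): a 2x2 frequency table (add-one smoothing) over the vetted
    (true, automatic) label pairs.
    p(z = 1 | s): a 1-D logistic regression on the confidence score,
    fitted here by plain gradient descent as a stand-in for any solver.
    """
    p_y_given_z = np.ones((2, 2))                  # rows: z, columns: y
    for y, z in zip(y_vet, z_vet):
        p_y_given_z[z, y] += 1
    p_y_given_z /= p_y_given_z.sum(axis=1, keepdims=True)

    s, z = np.asarray(s_vet, float), np.asarray(z_vet, float)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        pred = 1.0 / (1.0 + np.exp(-(w * s + b)))  # sigmoid
        w -= lr * np.mean((pred - z) * s)          # logistic-loss gradients
        b -= lr * np.mean(pred - z)
    return p_y_given_z, (w, b)

def posterior_z(s, y, p_y_given_z, logit):
    """p(z = 1 | y, s) via Bayes' rule, assuming y independent of s given z."""
    w, b = logit
    p_z1 = 1.0 / (1.0 + np.exp(-(w * s + b)))
    num = p_y_given_z[1, y] * p_z1
    return num / (num + p_y_given_z[0, y] * (1.0 - p_z1))
```

A high confidence score combined with a positive noisy label pushes the posterior toward 1, while a low score with a negative label pushes it toward 0, which is exactly how the estimator softens the noisy labels.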

Vetting Strategy
In this work, we apply a strategy based on maximum expected model change (MEMC) (Settles, 2009). The vetting strategy selects the sample that yields the largest expected change of the performance estimate. Let E_{p(z|V)} Q be the expected performance based on the distribution p(z|V) estimated from the current vetted set V. After vetting example i and updating the estimator, it becomes E_{p(z|V, z̃_i)} Q. The expected change caused by vetting example i can therefore be written as

Δ_i = E_{z̃_i} [ | E_{p(z|V, z̃_i)} Q − E_{p(z|V)} Q | ].

For precision at top K, this expected change can be written as

Δ_i = 2 p_i (1 − p_i) / K,

where p_i = p(z_i = 1 | Θ): vetting reveals z̃_i = 1 with probability p_i (changing P@K by (1 − p_i)/K) and z̃_i = 0 with probability 1 − p_i (changing it by p_i/K). Since every point of the PR curve depends on P@K for some K, this vetting strategy is also useful for the PR curve. With this vetting strategy, the most valuable data is always selected first; the vetting budget is therefore the only factor controlling the vetting procedure, and we treat it as a hyperparameter. When the budget is used up, vetting stops. The procedure is described in Algorithm 1.
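The selection step for P@K under this strategy can be sketched as follows (assuming the posteriors are already sorted by descending confidence; function and variable names are ours):

```python
import numpy as np

def memc_select(posteriors, vetted_mask, k, batch_size=20):
    """Pick the next vetting batch by maximum expected change of P@K.

    Revealing the true label of an unvetted item in the top K moves the
    P@K estimate by |z - p_i| / K, so the expected change is
    2 * p_i * (1 - p_i) / K, largest for the most uncertain items.
    """
    p = np.asarray(posteriors, dtype=float)
    gain = 2.0 * p * (1.0 - p) / k
    gain[np.asarray(vetted_mask)] = -1.0   # never re-vet
    gain[k:] = -1.0                        # items outside the top K leave P@K unchanged
    return np.argsort(-gain)[:batch_size]
```

Items with posteriors near 0.5 are the most uncertain and hence the most valuable to vet first, which matches the intuition behind MEMC.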

Algorithm 1 Active Testing Algorithm
Require: unvetted set U, vetted set V, vetting budget T, vetting strategy VS, confidence scores S, estimator p(z)
1: while T > 0 do
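Filling in the loop body from the description above, the algorithm might look like this in Python. This is a sketch: `posterior` and `oracle` are placeholders we introduce for the paper's metric estimator and the human annotator, and the MEMC criterion is inlined for brevity.

```python
import numpy as np

def active_testing(posterior, scores, noisy_labels, oracle, budget,
                   batch_size=20, k=300):
    """Sketch of Algorithm 1: alternate vetting and estimating until the
    budget T runs out. `posterior(s_i, y_i, vetted)` stands in for the
    paper's estimator p(z_i = 1 | Theta); `oracle(i)` stands in for the
    human annotator.
    """
    vetted = {}                                    # index -> manual label
    while budget > 0:
        # estimating stage: current belief about each true label
        p = np.array([vetted.get(i, posterior(scores[i], noisy_labels[i], vetted))
                      for i in range(len(scores))], dtype=float)
        # vetting stage: pick the most uncertain items (MEMC criterion for P@K)
        gain = 2.0 * p * (1.0 - p) / k
        for i in vetted:
            gain[i] = -1.0                         # never re-vet
        batch = np.argsort(-gain)[:min(batch_size, budget)]
        for i in batch:
            vetted[int(i)] = oracle(int(i))        # manual annotation
        budget -= len(batch)
    # final estimate: expected P@K mixing vetted labels and posteriors
    p = np.array([vetted.get(i, posterior(scores[i], noisy_labels[i], vetted))
                  for i in range(len(scores))], dtype=float)
    order = np.argsort(-np.asarray(scores, dtype=float))
    return p[order][:k].sum() / k
```

Once the budget is exhausted, the final P@K mixes exact vetted labels with posterior estimates for the remaining items, which is what drives the estimate toward the human-evaluation value.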

Experiment
We conduct extensive experiments to support two claims: 1) the proposed active testing obtains more accurate results by introducing very few manual annotations; 2) held-out evaluation misdirects the optimization of relation extraction, which is further demonstrated by re-evaluating eight up-to-date relation extractors.

Experimental Setting
Dataset. Our experiments are conducted on the widely used benchmark NYT-10 (Riedel et al., 2010) and on an accurately labeled dataset named NYT-19, which contains 500 randomly selected entity pairs from the test set of NYT-10: 106 positive entity pairs and 394 negative entity pairs, of which 35 are false negatives. NYT-19 has been carefully labeled by NLP researchers.
Initialization. We use PCNN+ATT (Lin et al., 2016) as the baseline relation extractor. To be more convincing, we also provide experimental results for BGRU+ATT in the appendix. The initial vetted set includes all positive entity pairs of the NYT-10 test set and 150 vetted negative entity pairs. The batch size for vetting is 20 and the vetting budget is set to 100 entity pairs.

Effect of Active Testing
We evaluate the performance of PCNN+ATT with held-out evaluation, human evaluation, and our method. The results are shown in Table 2 and Figure 1. Due to the high cost of manually labeling the whole test set, we use the PR curve on NYT-19 to simulate that on NYT-10. To measure the distance between two curves, we sample 20 equidistant points on each curve and compute the Euclidean distance between the two resulting vectors. By this measure, the curve of our method lies at distance 0.17 from the curve of human evaluation, while the corresponding distance for held-out evaluation is 0.72. We observe that: 1) the performance biases between manual evaluation and held-out evaluation are too significant to be neglected; 2) the large biases caused by wrongly labeled instances are dramatically alleviated by our method, which obtains precision at least 8.2% closer to manual evaluation than held-out evaluation does.
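The curve-distance measure is slightly underspecified in the text; one plausible reading, sampling the 20 points at equidistant recall values and comparing the interpolated precisions, could be sketched as:

```python
import numpy as np

def curve_distance(curve_a, curve_b, n_points=20):
    """Euclidean distance between two PR curves.

    Each curve is (recall_values, precision_values) with recall ascending;
    precision is interpolated at n equidistant recall points and the two
    resulting precision vectors are compared.
    """
    recalls = np.linspace(0.0, 1.0, n_points)
    pa = np.interp(recalls, *curve_a)      # precision of curve A at each recall
    pb = np.interp(recalls, *curve_b)
    return float(np.linalg.norm(pa - pb))
```

Sampling at fixed recall values makes the measure insensitive to how densely each curve was originally sampled, which seems to be the intent of the comparison.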

Effect of Vetting Strategy
We compare our MEMC strategy with a random vetting strategy, as shown in Figure 2. The distances from the curves of the two vetting strategies to that of human evaluation are 0.176 and 0.284, respectively. From the figure, we conclude that the proposed vetting strategy is much more effective than random vetting: with the same vetting budget, MEMC obtains a more accurate performance estimate over most of the range.

Re-evaluation of Relation Extractors
With the proposed performance estimator, we re-evaluate eight up-to-date distantly supervised relation extractors. From Table 3, we observe that: 1) the relative ranking of the models by precision at top K remains almost the same, except for Qin et al. (2018b) and Qin et al. (2018a); although GANs and reinforcement learning help select valuable training instances, they tend to overfit; 2) most models achieve the improvements reported in their papers in the high-confidence interval; 3) BGRU performs better than the other base models, while the BGRU-based method of Liu et al. (2018) achieves the highest precision. More results and discussions can be found in the Appendix.

Conclusion
In this paper, we propose a novel active testing approach for distantly supervised relation extraction, which evaluates the performance of relation extractors with both noisy data and a few vetted data. Our experiments show that the proposed evaluation method is approximately unbiased and important for the future optimization of distantly supervised relation extraction.

A.1 Logistic Regression
Here we provide the derivation of Equation 6 in the main paper. By Bayes' rule,

p(z_jk | y_jk, s_jk) = p(y_jk | z_jk, s_jk) p(z_jk | s_jk) / p(y_jk | s_jk).

We assume that, given z_jk, the observed label y_jk is conditionally independent of s_jk, i.e., p(y_jk | z_jk, s_jk) = p(y_jk | z_jk). The expression then simplifies to:

p(z_jk | y_jk, s_jk) = p(y_jk | z_jk) p(z_jk | s_jk) / Σ_{v ∈ {0,1}} p(y_jk | z_jk = v) p(z_jk = v | s_jk).

A.2 Vetting Strategy
Here we provide the derivation of Equation 8 in the main paper. Vetting example i replaces the estimate p_i = p(z_i = 1 | Θ) with the revealed label z̃_i, which changes P@K by |z̃_i − p_i| / K when i is in the top K. Since z̃_i = 1 with probability p_i and z̃_i = 0 with probability 1 − p_i, the expected change is

Δ_i = p_i (1 − p_i)/K + (1 − p_i) p_i/K = 2 p_i (1 − p_i) / K.

A.3 Experimental result of BGRU+ATT
We also evaluate the performance of BGRU+ATT with held-out evaluation, human evaluation, and our method. The results are shown in Table 4 and Figure 3. The curve of our method lies at distance 0.15 from the curve of human evaluation, while the corresponding distance for held-out evaluation is 0.55.

A.4 The result of different iterations
We record, in Figure 4, the distance between the curves obtained by our method and by manual evaluation across iterations. We observe that the evaluation results of our method move steadily closer to human evaluation while the number of annotated entity pairs is below 100; beyond 100, the distance no longer drops rapidly but begins to fluctuate.

B Case Study
We present realistic cases from NYT-10 to show the effectiveness of our method. In Figure 6, all cases are selected from the top 300 predictions of PCNN+ATT. These instances are all negative instances with the automatic label NA in NYT-10, so in held-out evaluation the relation predictions for them are judged as wrong. However, some of them are in fact false negative instances that do hold the corresponding relations, which causes considerable biases between manual and held-out evaluation. In our approach, the relation predictions for false negative instances are given a high probability of being correct, while true negative instances are accurately identified and given a low (near-zero) probability.
(Figure 4: The result of different iterations of the active testing algorithm with PCNN+ATT and BGRU+ATT.)

C Re-evaluation Discussion
The detailed descriptions and discussions of the re-evaluation experiments are presented in this section.
C.1 Models

PCNN (Zeng et al., 2015) is the first neural method used in distant supervision without human-designed features. PCNN+ATT (Lin et al., 2016) further integrates a selective attention mechanism to alleviate the influence of wrongly labeled instances: it generates attention weights over multiple instances, which is expected to dynamically reduce the weights of noisy instances. PCNN+ATT+SL (Liu et al., 2017) develops PCNN+ATT further. To correct wrong labels at the entity-pair level during training, the labels of entity pairs are dynamically changed according to the confidence scores of the predicted labels. Clearly, this method depends heavily on the quality of the label generator, which has great potential to overfit. PCNN+ATT+RL (Qin et al., 2018b) adopts reinforcement learning to overcome the wrong-labeling problem in distant supervision: a deep reinforcement learning agent is designed to choose correctly labeled instances based on the performance change of the relation classifier, after which PCNN+ATT is trained on the filtered data for relation classification. PCNN+ATT+DSGAN (Qin et al., 2018a) is an adversarial training framework that learns a sentence-level true-positive generator. The positive samples selected by the generator are labeled as negative to train the discriminator; the optimal generator is obtained when the discriminator cannot differentiate them. The generator is then used to filter the distantly supervised training dataset, and PCNN+ATT performs relation extraction on the new dataset. BGRU is a recurrent neural network that can effectively extract global sequence information.
It is a powerful fundamental model widely used across natural language processing tasks.
BGRU+ATT combines BGRU with selective attention. STPRE (Liu et al., 2018) extracts relation features with BGRU. To reduce inner-sentence noise, the authors use a Sub-Tree Parse (STP) method to remove irrelevant words. Furthermore, the model parameters are initialized with prior knowledge learned from an entity type prediction task via transfer learning.

C.2 Discussion
In this section, we additionally provide PR curves to show the performance of the baselines. From both Table 3 and Figure 5, we observe that: 1) according to the PR curves, the relative ranking is quite different from that under held-out evaluation; 2) selective attention is of limited help to the overall performance, even though it may have positive effects at high confidence; 3) the soft-label method greatly improves the accuracy at high confidence but significantly reduces the overall performance; we deduce that it is severely affected by the unbalanced instance numbers of different relations, which makes the label generator overfit to frequent labels; 4) in terms of the overall performance indicated by the PR curves, BGRU is the most solid relation extractor.
(Figure 6: A case study of the active testing approach for distantly supervised relation extraction. The entities are marked in red. "1.0 (vetted)" and "0.0 (vetted)" mean that the entity pair is vetted in our method.)