Towards Accurate and Consistent Evaluation: A Dataset for Distantly-Supervised Relation Extraction

In recent years, distantly-supervised relation extraction has achieved considerable success with deep neural networks. Distant Supervision (DS) can automatically generate large-scale annotated data by aligning entity pairs from Knowledge Bases (KB) to sentences. However, DS-generated datasets inevitably contain wrong labels that lead to incorrect evaluation scores during testing, which may mislead researchers. To solve this problem, we build a new dataset NYT-H, where we use the DS-generated data as training data and hire annotators to label the test data. Compared with previous datasets, NYT-H has a much larger test set, so we can perform more accurate and consistent evaluation. Finally, we present the experimental results of several widely used systems on NYT-H. The results show that the ranking lists of the comparison systems on the DS-labelled test data and the human-annotated test data are different. This indicates that our human-annotated data is necessary for the evaluation of distantly-supervised relation extraction.


Introduction
There has been significant progress on Relation Extraction (RE) in recent years using models based on machine learning algorithms (Mintz et al., 2009; Hoffmann et al., 2011; Zeng et al., 2015; Zhou et al., 2016; Ji et al., 2017; Su et al., 2018; Qin et al., 2018b). The task of RE is to identify semantic relationships among entities in texts. Traditional supervised methods require a massive amount of annotated data, which are often labelled by human annotators. However, it is hard to annotate data within strict time limits, and hiring annotators is non-scalable and costly. In order to obtain new annotated data quickly, Mintz et al. (2009) use Distant Supervision (DS) to automatically generate labelled data. Given an entity pair and its relation from a Knowledge Base (KB) such as Freebase, they simply tag all sentences containing these two entities with this relation. In this way, the framework has achieved great success and brought state-of-the-art performance in RE (Qin et al., 2018a; Feng et al., 2018; Chen et al., 2019). However, as a trade-off, the auto-generated data may be of lower quality than data from human annotators. For example, the sentence "Steve Jobs left Apple in 1985." does not express the relation founders assigned by the knowledge base triple <Steve Jobs, founders, Apple>.
To address the above problem, previous studies regard distantly-supervised relation extraction as a Multi-Instance Learning (MIL) task to reduce the effect of noise during training (Hoffmann et al., 2011; Surdeanu et al., 2012; Zeng et al., 2015; Lin et al., 2016; Han et al., 2018). MIL aggregates sentences into bags, where instances with the same entity pair are grouped into one bag. However, the wrong labelling problem still remains during system evaluation. There are two types of noise: False Positive (FP), where a sentence is tagged with a relation that it does not express, and False Negative (FN), where a sentence is tagged with the NA relation due to KB incompleteness; the NA relation means that no relation holds for the sentence or the relation is not included in the pre-defined relation candidate set. Figure 1 shows examples of FP and FN. Therefore, the performance evaluated on DS test data might mislead researchers, since the evaluation scores are not exactly correct.
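The MIL bag construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation; the sentences, entity names, and relation labels below are invented for the example.

```python
from collections import defaultdict

# Illustrative DS-labelled instances: each sentence carries the entity
# pair and the relation assigned by aligning the pair to a KB triple.
sentences = [
    {"head": "Steve Jobs", "tail": "Apple",
     "text": "Steve Jobs co-founded Apple in 1976.", "relation": "founders"},
    {"head": "Steve Jobs", "tail": "Apple",
     "text": "Steve Jobs left Apple in 1985.", "relation": "founders"},  # FP noise
    {"head": "Obama", "tail": "Hawaii",
     "text": "Obama was born in Hawaii.", "relation": "place_of_birth"},
]

# MIL groups all instances sharing the same entity pair (and DS relation)
# into one bag; training then operates on bags rather than sentences.
bags = defaultdict(list)
for s in sentences:
    bags[(s["head"], s["tail"], s["relation"])].append(s["text"])
```

Under Riedel's "at-least-one" assumption, a bag is treated as expressing its relation if at least one of its sentences does, which tolerates sentence-level FP noise such as the second instance above.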
In this paper, we present a new RE dataset NYT-H, an enhanced distantly-supervised relation extraction dataset with human annotation based on NYT10 (Riedel et al., 2010). NYT10 is a widely used benchmark for distantly-supervised relation extraction and has been used in many recent studies (Lin et al., 2016; Qin et al., 2018b; Qin et al., 2018a; Chen et al., 2019). However, the NYT10 test set is also distantly supervised and contains many FP and FN noises that may hinder evaluation. Besides, NYT10 is inconsistent for building classifiers, since the relations in the NYT10 train and test sets do not fully overlap: 56 relations exist in the train set while only 32 appear in the test set. Both the inaccuracy and the inconsistency make the evaluation results less convincing and less objective, which limits the development of distantly-supervised RE. In NYT-H, we hire several annotators to label the test data. With this new data, we are able to perform accurate and consistent evaluation that may boost research on distantly-supervised RE. Our contributions can be summarised as below:
• We present a new dataset NYT-H for distantly-supervised relation extraction, in which we use DS-labelled training data and hire several annotators to label the test data. With this newly built data, we are able to perform accurate and consistent evaluation for the task of distantly-supervised relation extraction.
• NYT-H can serve as a benchmark of distantly-supervised relation extraction. We design three different evaluation tracks: Bag2Bag, Bag2Sent, and Sent2Sent, and present the evaluation results of several widely used systems for comparison.
• We analyse and discuss the results on different evaluation metrics. We find the ranking lists of the comparison systems are different on the DS-labelled and manually-annotated test sets. This indicates that the manually-annotated test data is necessary for evaluation.

Related Work
Distantly-Supervised Data Construction: Considering the cost and limitations of fully annotated datasets, Mintz et al. (2009) assume that all sentences containing the same entity pair express the same relation. Based on this assumption, they construct a distantly-supervised RE dataset by aligning Wikipedia articles with Freebase. Restricted by the Wikipedia corpus, the expressions of sentences are very limited for each relation. Riedel et al. (2010) publish NYT10, a distantly-supervised dataset which takes New York Times news as the source texts. It is smaller in scale, but contains a greater variety of expressions. Besides, they relax Mintz's assumption: among all sentences containing the same entity pair, at least one expresses the relation. However, NYT10 does not fully satisfy this assumption. Without human annotation, they cannot ensure that there is a correctly-labelled sentence in each bag. Jat et al. (2018) develop a new dataset called Google Distant Supervision (GDS) based on the Google Relation Extraction corpus 1 , which follows Riedel's assumption and makes automatic evaluation more reliable (Vashishth et al., 2018). However, the noises that come with DS cannot be eliminated from the test set, so the automatic evaluation cannot be exactly accurate.
Evaluation Strategy for Distantly-Supervised Relation Extraction: There are two common measures for evaluating models trained on DS-constructed datasets. The first is the Precision-Recall Curve (PRC) and its Area Under Curve (AUC). DS-constructed datasets are usually unbalanced, and PRC is a good way to measure classification performance in this setting. But PRC does not deal with DS noises: without ground-truth answers, the evaluation results are less convincing and objective. The second is Precision@K (P@K). Researchers manually annotate systems' top-K output results (Riedel et al., 2010; Zeng et al., 2015; Lin et al., 2016) to give relatively objective evaluation results at lower cost. However, annotation biases exist since the criteria may differ between researchers, and researchers have to annotate new data whenever the outputs change. Besides, the predicted results are often biased due to the imbalance of the dataset, which means the top-K instances may not cover all the relations unless the 'K' value is large enough. Noises in DS datasets can simulate real scenarios, but they are harmful to model selection and evaluation if there are no ground truths in the test set. Therefore, we develop NYT-H, a distantly-supervised dataset with a manually annotated test set. We believe this will help researchers obtain more accurate evaluation results easily.

NYT-H Dataset
In this section, we introduce the procedure of constructing the NYT-H dataset and report its statistics. We also compare our dataset with previous datasets in detail.

Data Preprocessing
NYT-H is built on NYT10 2 (Riedel et al., 2010). There are many data files in the original NYT10, organised in protocol buffer format. We use the protobuf 3 tool to extract relations and entities. There are many repetitions between the train and test sets, so a deduplication operation is applied to NYT10 to eliminate duplicates. In addition, to enlarge the test set, we randomly select some non-NA bags from the NYT10 train set 4 . Different from NYT10, NYT-H has an NA set that contains all the NA instances. Finally, we obtain three sets: a train set, a test set, and an NA set.

Human Annotation
We hire annotators to label sentences in the test set. The sentences are annotated manually with a binary strategy, i.e., deciding whether a sentence expresses the relation assigned by DS. It is difficult to pick one relation out of over 50 types, so we adopt the binary strategy. If a sentence actually expresses the DS-annotated relation for the given entity pair, the sentence is labelled "Yes", otherwise "No". Some examples can be found in Table 1, where FP noises (like the first example) are recognised and annotated as "No".
There are three annotators working on this project. Each sentence is assigned to two annotators; if their annotations differ, the third annotator makes the final decision. The Kappa coefficient of the annotation results is 0.753, which shows substantial agreement among the annotators. Finally, we obtain 10,065 sentences labelled by the annotators in the test set. Since all the sentences in the test set are checked by the annotators, the new test set does not have the FP and FN problems that often occur in DS-constructed data.
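Inter-annotator agreement of the kind reported above can be computed with Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a generic implementation for two annotators with binary labels; the toy label lists are invented, not taken from NYT-H.

```python
def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    # Observed agreement: fraction of items where the annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: derived from each annotator's label distribution.
    expected = 0.0
    for label in set(labels_a) | set(labels_b):
        expected += (labels_a.count(label) / n) * (labels_b.count(label) / n)
    return (observed - expected) / (1 - expected)

# Toy example: two annotators over four sentences.
ann_a = ["Yes", "Yes", "No", "No"]
ann_b = ["Yes", "No", "No", "No"]
kappa = cohen_kappa(ann_a, ann_b)  # 0.75 observed, 0.5 by chance -> 0.5
```

A kappa of 0.753, as reported for NYT-H, falls in the range conventionally interpreted as substantial agreement.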

Data Post-Processing
There are some relations that do not occur in both the train and test sets, and we convert the labels of these relations into NA. After that, we continue to process the data in two steps: (1) if the number of instances for a relation is less than 100 in the train set, the relation is converted into NA; (2) if no instances of a relation in the test set are labelled as "Yes" by the annotators, the relation is converted into NA. Finally, there are 550,720 instances (357,196 bags) in the NA set, 9,955 instances (3,548 bags) in the test set, and 21 relations (excluding NA) are kept in NYT-H.

Dataset Statistics
Following the MIL setting, the datasets are composed of bags. If at least one sentence in a bag is manually labelled as "Yes", the DS-assigned relation becomes the label of the bag. Thus, we can predict relations at both bag-level and sentence-level. In the test set, 5,202 out of 9,955 instances are annotated as "Yes", which also indicates the wrong labelling problem of distant supervision. Besides, to make the dataset more consistent, the relations are filtered down to 21 (excluding the NA relation), and all of these relations are covered in both the train and test sets. Figure 2 shows that the distributions of the train set are very close to those of the test set, which also indicates the consistency of NYT-H.
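The bag-labelling rule above can be stated as a one-line function. This is a sketch of the rule as described in the text; treating a bag with no "Yes" sentence as NA is our assumption for the illustration, since the paper only specifies the positive case.

```python
def bag_label(ds_relation, sentence_annotations):
    """A bag keeps its DS-assigned relation if at least one of its
    sentences is manually annotated "Yes"; otherwise we treat the bag
    as NA (an assumption made for this sketch)."""
    return ds_relation if any(a == "Yes" for a in sentence_annotations) else "NA"

# One correct sentence is enough to keep the DS relation at bag level.
label = bag_label("founders", ["No", "Yes", "No"])  # -> "founders"
```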
Although NYT-H is more consistent, it is still very challenging for building a relation extractor. As Table 2 shows, it is an unbalanced dataset. Besides, there are 3,548 bags in the test set, but the sum of the test bag numbers in Table 2 is 3,735, which means some bags have more than one relation. Under this condition, the dataset is also suitable for multi-instance multi-label research (Surdeanu et al., 2012; Angeli et al., 2014; Jiang et al., 2016).

Table 2: Statistics of NYT-H. "#Yes Ins." and "#Yes Bag" denote the numbers of instances and bags that are annotated as "Yes".

Dataset Comparison
Here, we compare our newly built data NYT-H with the previous datasets for relation extraction. The datasets can be divided into fully-annotated and distantly-supervised categories based on the constructing methods.
Fully-Annotated Datasets: SemEval2010 Task8 (Hendrickx et al., 2009), ACE05 (Walker et al., 2006) and TACRED (Zhang et al., 2017b) are fully annotated. As Table 3 shows, the scale of fully human-annotated datasets is often small, both in the train and test sets. Among them, TACRED is the largest, but it has only 3,325 non-NA instances in the test set. NYT-H has about three times as many non-NA test instances as TACRED.
Distantly-Supervised Datasets: Wiki-KBP (Ling and Weld, 2012), GDS (Jat et al., 2018) and NYT10 (Riedel et al., 2010) are all constructed by distant supervision. NYT-Filtered (Zeng et al., 2015) and NYT-Manual (Hoffmann et al., 2011) are generated from NYT10 with some predefined filtering rules. Among them, GDS, Wiki-KBP, and NYT-Manual have human-annotated sentences in the test data, like ours. This indicates that other researchers have also realised the inaccurate evaluation problem in DS relation extraction. However, due to the cost, these human-annotated test sets are very small, so the inaccurate evaluation problem is far from resolved. Therefore, in this paper we annotate more instances in the test set.
Besides, these datasets face an inconsistency problem, where the relations in the train and test sets do not fully overlap. Table 4 shows that the numbers of relations in the train and test sets differ in Wiki-KBP, NYT-Manual and NYT10. We believe that with NYT-H we can perform more accurate and consistent evaluation than before.

Evaluation Strategy
We expect that NYT-H can serve as a benchmark on the task of distantly-supervised relation extraction.
In this section, we design three different tracks for evaluation on NYT-H at bag-level and sentence-level.

Bag2Bag Track
The Bag2Bag track evaluates systems that are trained at bag-level and tested at bag-level. This setting is widely used for evaluating MIL methods on the task of distantly-supervised relation extraction (Hoffmann et al., 2011; Surdeanu et al., 2012; Zeng et al., 2015; Lin et al., 2016). Thus, we also follow this setting of the previous studies. The difference is that NYT-H provides human-annotated labels in the test data. We report system quality by macro-averaged precision, recall, and F1 scores. The precision, recall and F1 scores for relation r_i in each system are calculated by the following equations:

P_i = N_r^i / N_sys^i,  R_i = N_r^i / N_data^i,  F1_i = (2 · P_i · R_i) / (P_i + R_i)  (1)

where N_r^i is the number of bags that the system correctly judges, N_sys^i is the number of bags that the system predicts as relation r_i, and N_data^i is the number of bags that the annotators label as relation r_i. Finally, the macro F1 score of the system can be computed as follows:

Macro-F1 = (1/m) · Σ_{i=1}^{m} F1_i  (2)

where m denotes the number of relations.
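The macro-averaged scoring described above can be sketched as follows. This is a minimal reference implementation of Equations (1) and (2) over per-item labels (bags or sentences), not the authors' evaluation script; the example labels are invented.

```python
from collections import Counter

def macro_f1(gold, pred, na_label="NA"):
    """Macro F1 following Equations (1) and (2): compute per-relation
    precision/recall/F1 over the non-NA relations in the gold labels,
    then average the F1 scores with equal weight per relation."""
    relations = sorted({r for r in gold if r != na_label})
    correct = Counter(g for g, p in zip(gold, pred) if g == p)  # N_r per relation
    n_sys = Counter(pred)   # N_sys: items predicted as each relation
    n_data = Counter(gold)  # N_data: items labelled as each relation
    f1s = []
    for r in relations:
        p = correct[r] / n_sys[r] if n_sys[r] else 0.0
        rec = correct[r] / n_data[r] if n_data[r] else 0.0
        f1s.append(2 * p * rec / (p + rec) if p + rec else 0.0)
    return sum(f1s) / len(f1s)

# Toy example with two non-NA relations A and B.
score = macro_f1(["A", "A", "B", "NA"], ["A", "B", "B", "NA"])  # -> 2/3
```

Macro averaging weights rare and frequent relations equally, which matters on an unbalanced dataset like NYT-H where a micro average would be dominated by the most frequent relations.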

Sent2Sent Track
The Sent2Sent track evaluates systems that are trained at sentence-level and tested at sentence-level. Researchers often adopt this strategy for experiments on fully human-annotated data.
Some studies use micro-averaged scores as the metrics in the Sent2Sent track (Zhang et al., 2017a; Sun et al., 2019), while we report macro-averaged precision, recall, and F1 scores to weight all relations equally. The scores are calculated using Equations (1) and (2) as in the Bag2Bag track. The difference lies in the definitions of the counts: here, N_r^i is the number of sentences that the system correctly judges, N_sys^i is the number of sentences that the system predicts as relation r_i, and N_data^i is the number of sentences that the annotators label as relation r_i.

Bag2Sent Track
The Bag2Sent track evaluates systems that are trained at bag-level and tested at sentence-level. NYT-H offers a compromise of labelled data between distant supervision and manual annotation. Thus, we propose the Bag2Sent track, where we train models on distantly-supervised data at bag-level and evaluate them on manually annotated data at sentence-level. The evaluation scores are calculated in the same way as in the Sent2Sent track.

Experiments
In this section, we present the experimental results of several widely-used RE systems on NYT-H.

Experimental Settings
We randomly select 10% of the instances from the train set as a development set for model selection. We use 5 different random seeds and report the mean scores as the final results. In addition, NA instances are sub-sampled from the NA set at the same scale as the train set. We use 50-dimensional GloVe 5 vectors as the system input; other parameter settings are listed in Table 5.
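The data preparation above (NA sub-sampling at the scale of the train set, then a 10% development split) can be sketched as follows. This is an illustrative reading of the setup, not the authors' code; the function name and the exact split order are assumptions.

```python
import random

def make_splits(train, na_pool, dev_ratio=0.1, seed=0):
    """Sketch of the setup described above: sub-sample NA instances at
    the same scale as the train set, merge, shuffle, and hold out
    dev_ratio of the result as a development set."""
    rng = random.Random(seed)  # one of the 5 seeds used for averaging
    na_sample = rng.sample(na_pool, min(len(train), len(na_pool)))
    data = train + na_sample
    rng.shuffle(data)
    n_dev = int(len(data) * dev_ratio)
    return data[n_dev:], data[:n_dev]  # (train split, dev split)

# Toy example: 100 train instances, a pool of 500 NA instances.
tr, dev = make_splits(list(range(100)), list(range(1000, 1500)))
```

Running the same procedure with 5 different seeds and averaging the resulting scores, as the paper does, reduces the variance introduced by both the sub-sampling and the split.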

Comparison Systems
Here, we list the systems used in our experiments. There are two groups of systems: those trained at sentence-level and those trained at bag-level.
Sentence-Level: Convolutional Neural Network (CNN) (Zeng et al., 2014) is able to capture local features effectively. Piecewise Convolutional Neural Network (PCNN) (Zeng et al., 2015) and Classification by Ranking Convolutional Neural Network (CR-CNN) (dos Santos et al., 2015) are variations of CNN, where PCNN can retain richer information and CR-CNN has a strong ability to reduce the impact of NA. ATT-BLSTM (Zhou et al., 2016) is an RNN-based model which can capture global features. These systems are all trained at sentence-level.
Bag-Level: CNN and PCNN can be further combined with ONE (Zeng et al., 2015) 6 and ATT (Lin et al., 2016) 7 to train at bag-level, where ONE selects the most valuable sentence per bag and ATT generates a weighted sentence representation via attention mechanism for each bag.

Evaluation Results
Two types of evaluation results are presented here. The first uses DS Labels as Ground Truths (DSGT), which assumes that all the DS labels are ground truths. This is a common setting in distantly-supervised relation extraction, but it may result in incorrect evaluation during testing. The second uses Manual Annotation as Ground Truth (MAGT), where the metrics are calculated on the same evaluation set but with the human-annotated labels as ground truths.
From the F1 results in Table 6, we can observe the following facts:
• In the Bag2Sent track, all the bag-level models show a much better ability to deal with the DS noises, since the gap between DSGT and MAGT is much smaller. Although the MIL strategy does well on noise reduction during training, the final MAGT results of the bag-level models are worse than those of the sentence-level models.
• In the Bag2Bag track, the ranking order changes because of the DS noises. The AUC results show that PCNN+ATT obtains the best performance and the best model on DSGT F1 is CNN+ONE, while the order on MAGT is PCNN+ONE > CNN+ONE > PCNN+ATT > CNN+ATT. These facts indicate that AUC and F1 results on the DS test set are not reliable for drawing objective conclusions. The Bag2Bag track is the most common setting in DS relation extraction due to the bag assumption. Precision-Recall curves are a widely-used evaluation metric in the Bag2Bag track, and the curves in Figure 3 also indicate the evaluation problem: there are conspicuous gaps between the DSGT curves and the MAGT curves.
In summary, we find: (1) F1 scores on MAGT are much lower than those on DSGT in most cases, and the gaps can also be seen in Figure 3; (2) the ranking lists of the systems differ in many cases between the two test settings, especially in the Bag2Bag track, which is the most widely used setting in DS relation extraction. Evaluation on DS-generated test data produces incorrect scores that may lead to misjudgement. Thus, we believe that accurate evaluation can boost research on distantly-supervised relation extraction.

Relation Coverage in Precision@K Evaluation
As mentioned in Section 2, P@K is another widely used measure in DS relation extraction, which is usually evaluated by manually checking each system's output. Since we have human-annotated test data, we can obtain P@K scores automatically, even when 'K' is much larger than before. Table 7 shows the P@K results where 'K' ranges from 50 to 2,000. We further check relation coverage when performing P@K evaluation. Figure 4a shows the relation coverage of correct predictions, while Figure 4b shows the coverage of all predictions. From the figures, we find that PCNN+ONE does not cover all 21 non-NA relations in the top 2,000 results, while PCNN+ATT covers all the relations once 'K' reaches 1,405.
As discussed in Section 2, the choice of 'K' is crucial to P@K evaluation. P@K is a good way to measure predictions, especially when the scale of the predicted results is too large to be fully annotated. However, the relations cannot be fully covered if 'K' is too small. Figure 4b shows that only 5 relations are covered by PCNN+ATT when 'K' is 50, while the whole relation candidate set contains 21 relations. As Table 6 shows, none of the F1 scores in the three tracks exceeds 50%, which is far below our expectations. But the P@K results give the illusion that the results have reached the ceiling, which may lead to an evaluation bias.
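P@K and the relation-coverage check described above are straightforward to compute once the test set has gold labels. The sketch below assumes the system's predictions have been ranked by confidence in descending order; the data layout (dicts with "correct" and "relation" keys) is an assumption made for illustration.

```python
def precision_at_k(ranked, k):
    """P@K: fraction of correct predictions among the top-K items,
    where `ranked` is sorted by model confidence, descending."""
    top = ranked[:k]
    return sum(1 for item in top if item["correct"]) / len(top)

def relation_coverage(ranked, k):
    """Number of distinct relations predicted within the top-K items."""
    return len({item["relation"] for item in ranked[:k]})

# Toy ranked output for four predictions over two relations.
ranked = [
    {"correct": True,  "relation": "A"},
    {"correct": False, "relation": "A"},
    {"correct": True,  "relation": "B"},
    {"correct": True,  "relation": "B"},
]
p_at_2 = precision_at_k(ranked, 2)      # -> 0.5
covered = relation_coverage(ranked, 2)  # -> 1 of the 2 relations
```

As the toy example shows, a small 'K' can yield a high P@K while covering only a fraction of the relation set, which is exactly the evaluation bias discussed above.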

Conclusion
In this paper, we present the NYT-H dataset, an enhanced distantly-supervised dataset with human annotation, where we use DS-generated training data and human-annotated test data. The NYT-H dataset can resolve the inaccurate evaluation problem caused by the assumption of distant supervision. To compare the performance of previous systems, we design three evaluation tracks: Sent2Sent, Bag2Sent and Bag2Bag. We conduct experiments on the newly built NYT-H data with widely used baseline systems for comparison. All the evaluation scripts and data resources are available at https://github.com/Spico197/NYT-H.