TACRED Revisited: A Thorough Evaluation of the TACRED Relation Extraction Task

TACRED is one of the largest, most widely used crowdsourced datasets in Relation Extraction (RE). But, even with recent advances in unsupervised pre-training and knowledge enhanced neural RE, models still show a high error rate. In this paper, we investigate the questions: Have we reached a performance ceiling or is there still room for improvement? And how do crowd annotations, dataset, and models contribute to this error rate? To answer these questions, we first validate the most challenging 5K examples in the development and test sets using trained annotators. We find that label errors account for 8% absolute F1 test error, and that more than 50% of the examples need to be relabeled. On the relabeled test set the average F1 score of a large baseline model set improves from 62.1 to 70.1. After validation, we analyze misclassifications on the challenging instances, categorize them into linguistically motivated error groups, and verify the resulting error hypotheses on three state-of-the-art RE models. We show that two groups of ambiguous relations are responsible for most of the remaining errors and that models may adopt shallow heuristics on the dataset when entities are not masked.


Introduction
Relation Extraction (RE) is the task of extracting relationships between concepts and entities from text, where relations correspond to semantic categories such as per:spouse, org:founded by or org:subsidiaries (Figure 1). This makes RE a key part of many information extraction systems, and its performance determines the quality of extracted facts for knowledge base population (Ji and Grishman, 2011), or the quality of answers in question answering systems (Xu et al., 2016). Standard benchmarks such as SemEval 2010 Task 8 (Hendrickx et al., 2010) and the more recent TACRED [...] included Aerolineas's domestic subsidiary, Austral.
TACRED is one of the largest and most widely used RE datasets. It contains more than 106k examples annotated by crowd workers. The best-performing methods on the dataset use some form of pre-training to improve RE performance: fine-tuning pre-trained language representations (Alt et al., 2019; Shi and Lin, 2019), or integrating external knowledge during pre-training, e.g. via joint language modelling and linking on entity-linked text (Zhang et al., 2019; Peters et al., 2019; Baldini Soares et al., 2019), with the last two methods achieving a state-of-the-art performance of 71.5 F1. While this performance is impressive, the error rate of almost 30% is still high. The question we ask in this work is: Is there still room for improvement, and can we identify the underlying factors that contribute to this error rate? We analyse this question from two separate viewpoints: (1) to what extent does the quality of crowd-based annotations contribute to the error rate, and (2) what can be attributed to dataset and models? Answers to these questions can provide insights for improving crowdsourced annotation in RE, and suggest directions for future research.
To answer the first question, we propose the following approach: We first rank examples in the development and test sets according to the misclassifications of 49 RE models and select the top 5k instances for evaluation by our linguists. This procedure limits the manual effort to only the most challenging examples. We find that a large fraction of the examples are mislabeled by the crowd. Our first contribution is therefore an extensively relabeled TACRED development and test set.
To answer the second question, we carry out two analyses: (1) we conduct a manual explorative analysis of model misclassifications on the most challenging test instances and categorize them into several linguistically motivated error categories; (2) we formulate these categories into testable hypotheses, which we can automatically validate on the full test set by adversarial rewriting, i.e. removing the suspected cause of error and observing the change in model prediction (Wu et al., 2019). We find that two groups of ambiguous relations are responsible for most of the remaining errors. The dataset also contains clues that are exploited by models without entity masking, e.g. to correctly classify relations even with limited access to the sentential context.
We limit our analysis to TACRED, but want to point out that our approach is applicable to other RE datasets as well. We make the code of our analyses publicly available. In summary, our main contributions in this paper are:
• We validate the 5k most challenging examples in the TACRED development and test sets, and provide a revised dataset that will improve the accuracy and reliability of future RE method evaluations.
• We evaluate the most challenging, incorrectly predicted examples of the revised test set, and develop a set of 9 categories for common RE errors, which will also aid evaluation on other datasets.
• We verify our error hypotheses on three state-of-the-art RE models and show that two groups of ambiguous relations are responsible for most of the remaining errors and that models exploit cues in the dataset when entities are unmasked.

Table 1 summarizes key statistics of the dataset. All relation labels were obtained by crowdsourcing, using Amazon Mechanical Turk. Crowd workers were shown the example text, with head (subject) and tail (object) mentions highlighted, and asked to select among a set of relation label suggestions, or to assign no relation. Label suggestions were limited to relations compatible with the head and tail types. The data quality is estimated as relatively high by Zhang et al. (2017), based on a manual verification of 300 randomly sampled examples (93.3% validated as correct). The inter-annotator kappa label agreement of crowd workers was moderate at κ = 0.54 for 761 randomly selected mention pairs.

An Analysis of TACRED Label Errors
In order to identify the impact of potentially noisy, crowd-generated labels on the observed model performance, we start with an analysis of TACRED's label quality. We hypothesize that while comparatively untrained crowd workers may on average produce relatively good labels for easy relation mentions, e.g. those with obvious syntactic and/or lexical triggers, or unambiguous entity type signatures such as per:title, they may frequently err on challenging examples, e.g. highly ambiguous ones or relation types whose scope is not clearly defined.
An analysis of the complete dataset using trained annotators would be prohibitively expensive. We therefore utilize a principled approach to selecting examples for manual analysis (Section 3.1). Based on the TAC-KBP annotation guidelines, we then validate these examples (Section 3.2), creating new Dev and Test splits where incorrect annotations made by crowd workers are revised (Section 3.3).

Data Selection
Since we are interested in identifying potentially incorrectly labeled examples, we implement a selection strategy based on ordering examples by the difficulty of predicting them correctly. We use a set of 49 different RE models to obtain predictions on the development and test sets, and rank each example according to the number of models predicting a different relation label than the ground truth. Intuitively, examples with large disagreement, between all models or between models and the ground truth, are either difficult or incorrectly annotated.
We select the following examples for validation: (a) Challenging: all examples that were misclassified by at least half of the models, and (b) Control: a control group of (up to) 20 random examples per relation type, including no relation, from the set of examples classified correctly by at least 39 models. The two groups cover both presumably hard and easy examples, and allow us to contrast validation results based on example difficulty. In total we selected 2,350 (15.2%) Test examples and 3,655 (16.2%) Dev examples for validation. Of these, 1,740 (Test) and 2,534 (Dev) were assigned a positive label by crowd workers.
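The selection procedure described above can be sketched as follows. This is a minimal illustration, not the authors' released code: the function names, the list-of-lists layout for model predictions, and the fixed random seed are all assumptions made for the example.

```python
import random
from collections import defaultdict


def disagreement_counts(gold, predictions):
    """For each example, count the models whose prediction differs from the gold label.

    `predictions` is a list with one list of predicted labels per model (assumed layout).
    """
    return [sum(model[i] != label for model in predictions)
            for i, label in enumerate(gold)]


def select_for_validation(gold, predictions, control_min_correct=39,
                          control_per_relation=20, seed=0):
    """Return (challenging, control) example indices.

    Challenging: misclassified by at least half of the models.
    Control: up to `control_per_relation` random examples per gold relation,
    drawn from examples that at least `control_min_correct` models got right.
    """
    n_models = len(predictions)
    wrong = disagreement_counts(gold, predictions)

    challenging = [i for i, w in enumerate(wrong) if 2 * w >= n_models]

    easy_by_relation = defaultdict(list)
    for i, w in enumerate(wrong):
        if n_models - w >= control_min_correct:
            easy_by_relation[gold[i]].append(i)

    rng = random.Random(seed)  # hypothetical seed, only for reproducibility
    control = []
    for relation in sorted(easy_by_relation):
        pool = easy_by_relation[relation]
        control.extend(rng.sample(pool, min(control_per_relation, len(pool))))
    return challenging, sorted(control)
```

With 49 models and the thresholds from the text, the two groups cannot overlap: a Challenging example has at least 25 wrong predictions, while a Control example has at most 10.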

Human Validation
We validate the selected examples on the basis of the TAC KBP guidelines. We follow the approach of Zhang et al. (2017), and present each example by showing the example's text with highlighted head and tail spans, and a set of relation label suggestions. We differ from their setup by showing more label suggestions to make the label choice less restrictive: (a) the original, crowd-generated ground truth label, (b) the set of labels predicted by the models, (c) any other relation labels matching the head and tail entity types, and (d) no relation. The suggested positive labels are presented in alphabetical order and are followed by no relation, with no indication of a label's origin. Annotators are asked to assign no relation or up to two positive labels from this set. A second label is allowed only if the sentence expresses two relations according to the guidelines, e.g. per:city of birth and per:city of residence. Any disagreements are subsequently resolved by a third annotator, who is also allowed to consider the original ground truth label. All annotators are educated in general linguistics, have extensive prior experience in annotating data for information extraction tasks, and are trained in applying the task guidelines in a trial annotation of 500 sentences selected from the development set. Two labels were assigned to only 3.1% of the Test, and 2.4% of the Dev examples. The multi-labeling mostly occurs with location relations, e.g. the phrase "[Gross] head:per, a 60-year-old native of [Potomac] tail:city" is labeled with per:city of birth and per:city of residence, which is justified by the meaning of the word native.

The Revised TACRED Dev and Test Sets
As expected, the revision rate in the Control groups is much lower, at 8.9% for Test and 8.1% for Dev. We can also see that the fraction of negative examples is approximately one-third in the Challenging group, much lower than the dataset average of 79.5%. This suggests that models have more difficulty predicting positive examples correctly.
Table 3: Inter-Annotator Kappa-agreement for the relation validation task on TACRED Dev and Test splits (H1,H2 = human re-annotators, H = revised labels, C = original TACRED crowd-generated labels).

The validation inter-annotator agreement is shown in Table 3. It is very high at κ_Test = 0.87 and κ_Dev = 0.80, indicating a high annotation quality. For both Test and Dev, it is higher for the easier Control groups than for the Challenging groups. In contrast, the average agreement between our annotators and the crowdsourced labels is much lower at κ_Test = 0.55, κ_Dev = 0.53, and lowest for Challenging examples (e.g., κ_Test = 0.44).
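For readers unfamiliar with the agreement statistic, the sketch below computes a kappa value for two annotators. It assumes Cohen's kappa, which the text does not name explicitly, and the function name is made up for this illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equally long label sequences (assumed variant)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the annotators' marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1.0 - expected)
```

Kappa corrects raw agreement for chance, which matters here because no relation dominates the label distribution and would inflate plain accuracy-style agreement.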
Frequently erroneous crowd labels are per:cities of residence, org:alternate names, and per:other family. Typical errors include mislabeling an example as positive which does not express the relation, e.g. labeling "[Alan Gross] head:per was arrested at the [Havana] tail:loc airport." as per:cities of residence, or not assigning a positive relation label, e.g. per:other family in "[Benjamin Chertoff] head:per is the Editor in Chief of Popular Mechanics magazine, as well as the cousin of the Director of Homeland Security, [Michael Chertoff] tail:per ". Approximately 49% of the time an example's label was changed to no relation during validation, 36% of the time from no relation to a positive label, and the remaining 15% it was changed to or extended with a different relation type.
To measure the impact of dataset quality on the performance of models, we evaluated all 49 models on the revised test split. The average model F1 score rises to 70.1%, an absolute improvement of 8 points over the 62.1% average F1 on the original test split, corresponding to a 21.1% relative error reduction.
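The relation between the absolute F1 gain and the reported error reduction can be checked with a few lines of arithmetic. The sketch assumes that "error" means 100 minus F1; the function name is illustrative.

```python
def relative_error_reduction(f1_before, f1_after):
    """Relative reduction of the error (100 - F1), in percent."""
    error_before = 100.0 - f1_before
    error_after = 100.0 - f1_after
    return 100.0 * (error_before - error_after) / error_before


# Average F1 of the 49 models on the original vs. the revised test split.
print(round(relative_error_reduction(62.1, 70.1), 1))  # 21.1
```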
Discussion The large number of label corrections and the improved average model performance show that the quality of crowdsourced annotations is a major factor contributing to the overall error rate of models on TACRED. Even though our selection strategy was biased towards examples challenging for models, the large proportion of changed labels suggests that these examples were difficult to label for crowd workers as well. To put this number into perspective, Riedel et al. (2010) showed that, for a distantly supervised dataset, about 31% of the sentence-level labels were wrong, which is less than what we observe here for human-supervised data. The low quality of crowd-generated labels in the Challenging group may be due to their complexity, or due to other reasons, such as lack of detailed annotation guidelines, lack of training, etc. It suggests that, at least for Dev and Test splits, crowdsourcing, even with crowd worker quality checks as used by Zhang et al. (2017), may not be sufficient to produce high quality evaluation data. While models may be able to adequately utilize noisily labeled data for training, measuring model performance and comparing progress in the field may require an investment in carefully labeled evaluation datasets. This may mean, for example, that we need to employ well-trained annotators for labeling evaluation splits, or that we need to design better task definitions and task presentation setups as well as develop new quality control methods when using crowdsourced annotations for complex NLP tasks like RE.

[...] linguists to annotate model misclassifications with their potential causes (Section 4.1). We then categorize and analyze the causes and formulate testable hypotheses that can be automatically verified (Section 4.2). For the automatic analysis, we implemented a baseline and three state-of-the-art models (Section 4.3).

Misclassification Annotation
The goal of the annotation is to identify possible linguistic aspects that cause incorrect model predictions. We first conduct a manual exploratory analysis on the revised Control and Challenging test instances that are misclassified by the majority of the 49 models. Starting from single observations, we iteratively develop a system of categories based on the existence, or absence, of contextual and entity-specific features that might mislead the models (e.g. entity type errors or distracting phrases). Following the exploration, we define a final set of categories, develop guidelines for each, and instruct two annotators to assign an error category to each misclassified instance in the revised test subset. In cases where multiple categories are applicable the annotator selected the most relevant one. As in the validation step, any disagreements between the two annotators are resolved by a third expert.

Error Hypotheses Formulation and Adversarial Rewriting
In a next step, we extend the misclassification categories to testable hypotheses, or groups, that are verifiable on the whole dataset split. For example, if we suspect a model to be distracted by an entity in the context that has the same type as one of the relation arguments, we formulate a group has distractor. The group contains all instances, both correct and incorrect, that satisfy a certain condition, e.g. there exists at least one entity in the sentential context of the same type as one of the arguments. The grouping ensures that we do not mistakenly prioritize groups that are actually well-handled on average. We follow the approach proposed by Wu et al. (2019), and extend their Errudite framework (https://github.com/uwdata/errudite) to the relation extraction task. After formulating a hypothesis, we assess the error prevalence over the entire dataset split to validate whether the hypothesis holds, i.e. the group of instances shows an above-average error rate. In a last step, we test the error hypothesis explicitly by adversarial rewriting of a group's examples, e.g. by replacing the distracting entities and observing the models' predictions on the rewritten examples. In our example, if the has distractor hypothesis is correct, removing the entities in context should change the prediction of previously incorrect examples.
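The group-based check can be illustrated with plain Python. The sketch below does not use the Errudite API; the dictionary keys (`head_type`, `tail_type`, `context_types`, `gold`, `pred`) and function names are assumptions chosen for this example.

```python
def has_distractor(example):
    """True if some other entity in the context shares a type with an argument."""
    argument_types = {example["head_type"], example["tail_type"]}
    return any(t in argument_types for t in example["context_types"])


def error_rate(examples):
    """Fraction of examples whose prediction differs from the gold label."""
    if not examples:
        return float("nan")
    return sum(e["gold"] != e["pred"] for e in examples) / len(examples)


def group_report(examples, predicate):
    """Compare a group's error rate with the error rate on the whole split.

    The group holds every instance that satisfies the predicate, whether it
    was predicted correctly or not.
    """
    group = [e for e in examples if predicate(e)]
    return {
        "size": len(group),
        "group_error": error_rate(group),
        "overall_error": error_rate(examples),
    }
```

A hypothesis is worth testing by rewriting only if `group_error` clearly exceeds `overall_error`; otherwise the suspected cause is handled well on average.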

Models
We evaluate our error hypotheses on a baseline and three of the most recent state-of-the-art RE models. None of the models were part of the set of models used for selecting challenging instances (Section 3.1), so as not to bias the automatic evaluation. As the baseline we use a single layer CNN ([...]). [...], which is an extension to BERT that integrates external knowledge. In particular, we use KnowBERT-W+W, which is trained by joint entity linking and language modelling on Wikipedia and WordNet.

Model Error and Dataset Analysis
In this section, we present our analysis results, providing an answer to the question: which of the remaining errors can be attributed to the models, and what are the potential reasons for these errors?
We first discuss the findings of our manual misclassification analysis (Section 5.1), followed by the results of the automatic analysis (Section 5.2).

[...] entity spans or entity types of arguments. We always labeled type annotation errors, but tolerated minor span annotation errors if they did not change the interpretation of the relation or the entity. The category context misinterpretation refers to cases where the sentential context of the arguments is misinterpreted by the model. We identify the following context problems: (1) Inverted arguments: the prediction is inverse to the correct relation, i.e. the model's prediction would be correct if head and tail were swapped. (2) Wrong arguments: the model incorrectly predicts a relation that holds between head or tail and an un-annotated entity mention in the context, therefore misinterpreting one annotated argument. (3) Linguistic distractor: the example contains words or phrases related to the predicted relation, but they do not connect to any of the arguments in a way that justifies the prediction. (4) Factuality: the model ignores negation, speculation, future tense markers, etc. (5) Context ignored: the example does not contain sufficient linguistic evidence for the predicted relation except for the matching entity types. (6) Relation definition: the predicted relation could be inferred from the context using common sense or world knowledge, but the inference is prohibited by the guidelines (e.g. the spokesperson of an organization is not a top member/employee, or a work location is not a pointer to the employee's residence). (7) No Relation: the model incorrectly predicts no relation even though there is sufficient linguistic evidence for the relation in the sentential context.

Discussion
The relation label predicted most frequently across the 49 models disagreed with the ground truth label of the re-annotated Challenging and Control Test groups in 1017 (43.3%) of the cases. The inter-annotator agreement of error categories assigned to these examples is high at κ_Test = 0.83 (κ_Test = 0.67 if the category No Relation is excluded).
Argument errors accounted for only 43 (4.2%) misclassifications, since the entities seem to be mostly correctly assigned in the dataset. In all entity type misclassification cases except one, the errors originate from false annotations in the dataset itself.
Context misinterpretation caused 974 (95.8%) false predictions. No relation is incorrectly assigned in 646 (63.6%) of misclassified instances, even though the correct relation is often explicitly and unambiguously stated. In 134 (13.2%) of the erroneous instances the misclassification resulted from inverted or wrong argument assignment, i.e. the predicted relation is stated, however the arguments are inverted or the predicted relation involves an entity other than the annotated one. In 96 (9.4%) instances the error results from TAC KBP guidelines prohibiting specific inferences, affecting most often the classification of the relations per:cities of residence and org:top member/employee. Furthermore, in 52 (5.1%) of the false predictions models seem to ignore the sentential context of the arguments, i.e. the predictions are inferred mainly from the entity types. Sentences containing linguistic distractors accounted for 35 (3.4%) incorrect predictions. Factuality recognition causes only 11 errors (1.1%). However, we assume that this latter low error rate is due to TACRED data containing an insufficient number of sentences suitable for extensively testing a model's ability to consider the missing factuality of relations.

Automatic Model Error Analysis
For the automatic analysis, we defined the following categories and error groups:
• Surface structure: Groups for argument distance (argdist=1, argdist>10) and sentence length (sentlen>30)
• Arguments: Head and tail mention NER type (same nertag, per:*, org:*, per:loc), and pronominal head/tail (has coref)
• Context: Existence of distracting entities (has distractor)
• Ground Truth: Groups conditioned on the ground truth (positive, negative, same nertag&positive)

What is the error rate for different groups? In Figure 2, we can see that KnowBERT has the lowest error rate on the full test set (7.9%), and the masked CNN model the highest (11.9%). SpanBERT's and TRE's error rates are in between the two. Overall, all models exhibit a similar pattern of error rates across the groups, with KnowBERT performing best across the board, and the CNN model worst. We can see that model error rates e.g. for the groups has distractor, argdist>10, and has coref do not diverge much from the corresponding overall model error rate. The presence of distracting entities in the context therefore does not seem to be detrimental to model performance. Similarly, examples with a large distance between the relation arguments, or examples where co-referential information is required, are generally predicted correctly.
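Several of the groups defined at the start of this section reduce to simple predicates over an example. The sketch below is illustrative only: the example layout, the half-open `(start, end)` token spans, the underscore spelling of the group and label names, and the definition of argument distance as the number of tokens between the two spans are all assumptions, since the text does not spell them out.

```python
def arg_distance(example):
    """Number of tokens between the head and tail spans (0 if they touch or overlap)."""
    (h_start, h_end), (t_start, t_end) = example["head"], example["tail"]
    return max(t_start - h_end, h_start - t_end, 0)


# Group name -> predicate; names follow the groups listed in the text.
GROUPS = {
    "argdist=1": lambda e: arg_distance(e) == 1,
    "argdist>10": lambda e: arg_distance(e) > 10,
    "sentlen>30": lambda e: len(e["tokens"]) > 30,
    "same_nertag": lambda e: e["head_type"] == e["tail_type"],
    "positive": lambda e: e["gold"] != "no_relation",
    "negative": lambda e: e["gold"] == "no_relation",
}


def assign_groups(example):
    """Return the sorted names of all groups the example belongs to."""
    return sorted(name for name, predicate in GROUPS.items() if predicate(example))
```

An example can belong to several groups at once, which is what makes conjunctions such as same nertag&positive possible.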
On the other hand, we can see that all models have above-average error rates for the group positive, its subgroup same nertag&positive, and the group per:loc. The above-average error rate for positive may be explained by the fact that the dataset contains much fewer positive than negative training instances, and is hence biased towards predicting no relation. A detailed analysis shows that the groups per:loc and same nertag&positive are the most ambiguous. per:loc contains relations such as per:cities of residence, per:countries of residence and per:origin, that may be expressed in a similar context but differ only in the fine-grained type of the tail argument (e.g. per:city vs. per:country). In contrast, same nertag contains all person-person relations such as per:parents, per:children and per:other family, as well as e.g. org:parent and org:subsidiaries that involve the same argument types (per:per vs. org:org) and may be only distinguishable from context.

How important is context? KnowBERT and SpanBERT show about the same error rate on the groups per:loc and same nertag&positive. They differ, however, in which examples they predict correctly: For per:loc, 78.6% are predicted by both models, and 21.4% are predicted by only one of the models. For same nertag&positive, 12.8% of the examples are predicted by only one of the models. The two models thus seem to identify complementary information. One difference between the models is that KnowBERT has access to entity information, while SpanBERT masks entity spans. To test how much the two models balance context and argument information, we apply rewriting to alter the instances belonging to a group and observe the impact on performance. We use two strategies: (1) we remove all tokens outside the span between head and tail argument (outside), and (2) we remove all tokens between the two arguments (between). We find that SpanBERT's performance on per:loc drops from 62.1 F1 to 57.7 (outside) and 43.3 (between), whereas KnowBERT's score decreases from 63.7 F1 to 60.9 and 50.1, respectively. On same nertag&positive, we observe a drop from 89.2 F1 to 58.2 (outside) and 47.7 (between) for SpanBERT. KnowBERT achieves a score of 89.4, which drops to 83.8 and 49.0. The larger drop in performance on same nertag&positive suggests that SpanBERT, which uses entity masking, focuses more on the context, whereas KnowBERT focuses on the entity content because the model has access to the arguments. Surprisingly, both models show similar performance on the full test set (Table 5). This suggests that combining both approaches may further improve RE performance.
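The two rewriting strategies can be sketched as token-level operations. This is an assumption-laden illustration: spans are taken to be half-open `(start, end)` token offsets that do not overlap, the function names are invented, and a real pipeline would also need to recompute the argument offsets after rewriting.

```python
def rewrite_outside(tokens, head, tail):
    """Strategy 'outside': drop every token outside the stretch from the
    first argument to the second argument (arguments included)."""
    start = min(head[0], tail[0])
    end = max(head[1], tail[1])
    return tokens[start:end]


def rewrite_between(tokens, head, tail):
    """Strategy 'between': drop every token strictly between the two arguments."""
    first, second = sorted([head, tail])
    return tokens[:first[1]] + tokens[second[0]:]
```

For the sentence "He said Gross , a native of Potomac , was freed" with head "Gross" and tail "Potomac", the first strategy keeps "Gross , a native of Potomac" and the second keeps "He said Gross Potomac , was freed".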
Should instance difficulty be considered? Another question is whether the dataset contains instances that can be solved more easily than others, e.g. those with simple patterns or patterns frequently observed during training. We assume that these examples are also more likely to be correctly classified by our baseline set of 49 RE models.
To test this hypothesis, we change the evaluation setup and assign a weight to each instance based on the number of correct predictions. An example that is correctly classified by all 49 baseline models would receive a weight of zero (and thus effectively be ignored), whereas an instance misclassified by all models receives a weight of one. In Table 5, we can see that SpanBERT has the highest score on the weighted test set (61.9 F1), a 16% decrease compared to the unweighted revised test set. KnowBERT has the second highest score of 58.7, 3% less than SpanBERT. The performance of TRE and CNN is much worse at 48.8 and 34.8 F1, respectively. The result suggests that SpanBERT's span-level pre-training and entity masking are beneficial for RE and allow the model to generalize better to challenging examples. Given this observation, we propose to consider an instance's difficulty during evaluation.
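One way to realize such a difficulty-weighted evaluation is sketched below. The text fixes only the two endpoints of the weighting (zero when all baseline models are correct, one when all are wrong); the linear interpolation in between, the weighted micro-averaged F1 over positive relations, the `no_relation` label string, and the function names are assumptions of this sketch.

```python
def instance_weights(gold, baseline_predictions):
    """Weight of each instance = fraction of baseline models that misclassify it."""
    n_models = len(baseline_predictions)
    return [sum(model[i] != label for model in baseline_predictions) / n_models
            for i, label in enumerate(gold)]


def weighted_micro_f1(gold, pred, weights, negative="no_relation"):
    """Micro-averaged F1 over positive relations, with per-instance weights."""
    correct = guessed = actual = 0.0
    for g, p, w in zip(gold, pred, weights):
        if p != negative:
            guessed += w          # weighted count of positive predictions
            if p == g:
                correct += w      # weighted count of correct positive predictions
        if g != negative:
            actual += w           # weighted count of gold positives
    if correct == 0.0:
        return 0.0
    precision = correct / guessed
    recall = correct / actual
    return 2 * precision * recall / (precision + recall)
```

Setting every weight to one recovers the usual unweighted micro F1, so the same function can produce both columns of a comparison.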

Related Work
Relation Extraction on TACRED Recent RE approaches include PA-LSTM (Zhang et al., 2017) and GCN (Zhang et al., 2018), with the former combining recurrence and attention, and the latter leveraging graph convolutional neural networks.
Many current approaches use unsupervised or semi-supervised pre-training: fine-tuning of language representations pre-trained on token-level (Alt et al., 2019; Shi and Lin, 2019) or span-level, fine-tuning of knowledge enhanced word representations that are pre-trained on entity-linked text (Zhang et al., 2019; Peters et al., 2019), and "matching the blanks" pre-training (Baldini Soares et al., 2019).
Dataset Evaluation Chen et al. (2016) and Barnes et al. (2019) also use model results to assess dataset difficulty for reading comprehension and sentiment analysis. Other work also explores bias in datasets and the adoption of shallow heuristics on biased datasets in natural language inference (Niven and Kao, 2019) and argument reasoning comprehension.
Analyzing trained Models Explanation methods include occlusion or gradient-based methods, measuring the relevance of input features to the output (Zintgraf et al., 2017;Harbecke et al., 2018), and probing tasks (Conneau et al., 2018;Kim et al., 2019) that probe the presence of specific features e.g. in intermediate layers. More similar to our approach is rewriting of instances (Jia and Liang, 2017;Ribeiro et al., 2018) but instead of evaluating model robustness we use rewriting to test explicit error hypotheses, similar to Wu et al. (2019).

Conclusion and Future Work
In this paper, we conducted a thorough evaluation of the TACRED RE task. We validated the 5k most challenging examples in the development and test sets and showed that labeling is a major error source, accounting for 8% absolute F1 error on the test set. This clearly highlights the need for careful evaluation of development and test splits when creating datasets via crowdsourcing. To improve the evaluation accuracy and reliability of future RE methods, we provide a revised, extensively relabeled TACRED. In addition, we categorized model misclassifications into 9 common RE error categories and observed that models are often unable to predict a relation, even if it is expressed explicitly. Models also frequently do not recognize argument roles correctly, or ignore the sentential context. In an automated evaluation we verified our error hypotheses on the whole test split and showed that two groups of ambiguous relations are responsible for most of the remaining errors. We also showed that models adopt heuristics when entities are unmasked and proposed that evaluation metrics should consider an instance's difficulty.

[...] (2017). We employ Adagrad as an optimizer, with an initial learning rate of 0.1 and run training for 50 epochs. Starting from the 15th epoch, we gradually decrease the learning rate by a factor of 0.9. For the CNN we use 500 filters of sizes [2, 3, 4, 5] and apply L2 regularization with a coefficient of 10^-3 to all filter weights. We use tanh as activation and apply dropout on the encoder output with a probability of 0.5. We use the same hyperparameters for variants with ELMo. For variants with BERT, we use an initial learning rate of 0.01 and decrease the learning rate by a factor of 0.9 every time the validation F1 score plateaus. We also use 200 filters of sizes [2, 3, 4, 5].
LSTM/Bi-LSTM For training we use the hyperparameters of Zhang et al. (2017). We employ Adagrad with an initial learning rate of 0.01, train for 30 epochs and gradually decrease the learning rate by a factor of 0.9, starting from the 15th epoch. We use word dropout of 0.04 and recurrent dropout of 0.5. The BiLSTM consists of two layers of hidden dimension 500 for each direction. For training with ELMo and BERT we decrease the learning rate by a factor of 0.9 every time the validation F1 score is plateauing.
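The epoch-based decay used for the LSTM can be written as a small schedule function. This is one possible reading of "gradually decrease the learning rate by a factor of 0.9, starting from the 15th epoch": it assumes 1-indexed epochs and one multiplicative decay step per epoch from epoch 15 onward. The function name and defaults are illustrative.

```python
def decayed_learning_rate(epoch, initial_lr=0.01, decay=0.9, decay_start=15):
    """Learning rate for a 1-indexed epoch under per-epoch multiplicative decay.

    Assumption: the first decay step is applied at `decay_start` itself.
    """
    steps = max(0, epoch - decay_start + 1)
    return initial_lr * decay ** steps


# Learning rate at the end of the 30-epoch schedule described above.
final_lr = decayed_learning_rate(30)
```

The plateau-based variant used with ELMo and BERT depends on the validation F1 history instead of the epoch index and is not covered by this sketch.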
GCN We reuse the hyperparameters of Zhang et al. (2018). We employ SGD as optimizer with an initial learning rate of 0.3, which is reduced by a factor of 0.9 every time the validation F1 score plateaus. We use dropout of 0.5 between all but the last GCN layer, word dropout of 0.04, and embedding and encoder dropout of 0.5. Similar to the authors, we use path-centric pruning with K=1. We use two 200-dimensional GCN layers and, similarly, two 200-dimensional feedforward layers with ReLU activation.
Self-Attention After hyperparameter tuning we found 8 layers of multi-headed self-attention to perform best. Each layer uses 8 attention heads with attention dropout of 0.1; keys and values are projected to 256 dimensions before computing the similarity and aggregated in a feedforward layer with 512 dimensions. For training we use the Adam optimizer with an initial learning rate of 10^-4, which is reduced by a factor of 0.9 every time the validation F1 score plateaus. In addition we use word dropout of 0.04, embedding dropout of 0.5, and encoder dropout of 0.5. Table 6 shows the relation extraction performance of the models on TACRED and our revised version. Models with 'w/synt/sem' use named entity and part-of-speech embeddings in addition to the input word embeddings.