Exposing Shallow Heuristics of Relation Extraction Models with Challenge Data

The process of collecting and annotating training data may introduce distribution artifacts which limit the ability of models to learn correct generalization behavior. We identify failure modes of SOTA relation extraction (RE) models trained on TACRED, which we attribute to limitations in the data annotation process. We collect and annotate a challenge set we call Challenging RE (CRE), based on naturally occurring corpus examples, to benchmark this behavior. Our experiments with four state-of-the-art RE models show that they have indeed adopted shallow heuristics that do not generalize to the challenge-set data. Further, we find that alternative question answering modeling performs significantly better than the SOTA models on the challenge set, despite worse overall TACRED performance. By adding some of the challenge data as training examples, the models' performance improves. Finally, we provide concrete suggestions on how to improve RE data collection to alleviate this behavior.


Introduction
In the relation extraction (RE) task, our goal is, given a set of sentences s ∈ S, to extract tuples (s, e1, e2, r) where a relation r ∈ R holds between e1 and e2 (entities that appear in the sentence s, each represented as a span over s). RE is often cast as relation classification (RC): given a triplet (s, e1, e2), determine which relation r ∈ R holds between e1 and e2 in s, or indicate no-relation (∅). This can be presented as a set of |R| binary decision problems, (s, e1, e2, r) → {0, 1}: return 1 for tuples for which the relation holds, and 0 otherwise. The reduction from RE to RC is straightforward: given a sentence, extract all entity-pair candidates (using a NER system), and run the RC problem on each of them.
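As a rough illustration of this reduction (the `classify` function below is a hypothetical stand-in for a trained RC model; the toy rule inside it exists only to make the sketch runnable):

```python
# Sketch of the RE -> RC reduction: enumerate candidate entity pairs
# (e.g. from a NER system) and run a binary relation classifier on each
# (sentence, e1, e2, relation) tuple.
from itertools import permutations

def classify(sentence, e1, e2, relation):
    # Hypothetical RC model: returns 1 if `relation` holds between the
    # spans e1 and e2 in `sentence`, else 0. Toy rule for the sketch only.
    return int(relation == "per:date_of_birth"
               and "born" in sentence and e2.isdigit())

def extract_relations(sentence, entities, relations):
    """Run the binary RC problem on every ordered entity pair."""
    extracted = []
    for e1, e2 in permutations(entities, 2):
        for r in relations:
            if classify(sentence, e1, e2, r) == 1:
                extracted.append((sentence, e1, e2, r))
    return extracted
```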
Indeed, contemporary methods are all RC methods, and the popular TACRED large-scale relation extraction dataset is annotated for RC: each instance in the dataset is a triplet (s, e1, e2) and is associated with a label r ∈ R ∪ {∅}. Importantly, the annotation is non-exhaustive: not all e1, e2 pairs in the dataset are annotated (only 17.2% of the entity pairs whose types match a TACRED relation are). While this saves a lot of annotation effort, as we show it also leads to sub-optimal behavior of the trained models, and hinders our ability to properly assess their real-world utility.
We show that state-of-the-art models trained on TACRED are often "right for the wrong reasons" (McCoy et al., 2019): instead of learning to perform the intended task, they rely on shallow heuristics which are effective for solving many dataset instances, but which may fail on more challenging examples. In particular, we show two concrete heuristics: classifying based on entity types, and classifying based on the existence of an event without linking the event to its arguments. We show that while these challenging examples are not well attested in the dev and test sets, they do occur in practice. We introduce CRE (Challenging RE), a challenge set for quantifying and demonstrating the problem, and show that four SOTA RC models fail significantly on the challenge set. We release the challenge set to encourage future research on better models. 1 While we demonstrate the problem on TACRED, we stress that the model behaviors we expose are directly linked to the dataset construction procedure, and will likely occur in any dataset that is created in a similar fashion. We propose guidelines to help construct better datasets, and in particular better evaluation sets, in the future.
We also show that different modeling techniques may alleviate this problem: models trained for QA are better at linking events to their arguments. While performing worse on TACRED overall, they perform significantly better on the challenge set.

Figure 1: CRE dataset instances illustrating the various heuristics and error types. "# Members" refers to the number of human members or employees of an organization.

Relation Classification Heuristics
McCoy et al. (2019) discuss the concept of "model heuristics": decision rules that ML models use to score high on a test set, but which are too simplistic to solve the underlying problem. They demonstrate such heuristics used by NLI models. In this work we demonstrate model heuristics used by TACRED-trained RC models. Recall that a relation classification instance is (s, e1, e2, r) → {0, 1}.

Event Heuristic: Classify based on (s, r). This heuristic ignores the entities altogether, acting as a classification model answering the question "does the sentence attest the relation?". This heuristic is of limited applicability, as many sentences attest more than a single related pair of entities.

Type Heuristic: Classify based on (type(e1), type(e2), r), where type(e) is the named-entity type of entity e. In a given dataset, a decision can be made based on the types of the entities alone. For example, of the 41 relations in the TACRED dataset, only the per:religion relation holds between a PERSON and a RELIGION. A model may learn to incorrectly rely on the types when making a decision, ignoring the sentence s altogether. 2 Many type-pairs are compatible with multiple relations in a dataset, weakening the utility of this heuristic. However, for applicable type-pairs it can be very effective. For example, out of 21,284 Wikipedia sentences containing a PERSON name and a RELIGION name, 8,156 (38%) were classified by a RoBERTa-based RC model as per:religion. Manual inspection of a random sample of 100 of these found that 42% are false positives.

Event+Type Heuristic: The event and type heuristics can be combined by requiring both decision rules, (s, r) → {0, 1} and (type(e1), type(e2), r) → {0, 1}, to hold. The resulting heuristic verifies that the sentence mentions the relation and that the entity pair of interest is type-compatible with the relation; it does not verify that the entities are arguments of the relation.
We demonstrate that the event+type heuristic is a particularly strong one for relation-classification datasets, and is widely used by trained state-of-the-art relation classifiers.
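The event+type heuristic can be made concrete with a short sketch. The trigger words and type signatures below are toy assumptions for illustration, not rules learned from data:

```python
# Illustrative sketch of the event+type heuristic. Trigger words and
# relation type signatures are toy assumptions for the sketch.
RELATION_TYPES = {"per:date_of_birth": ("PERSON", "DATE")}
RELATION_TRIGGERS = {"per:date_of_birth": {"born"}}

def event_type_heuristic(sentence, e1_type, e2_type, relation):
    """Fire iff the sentence mentions the event AND the types match.
    Crucially, the entity *spans* are never consulted, so every
    type-compatible pair in the sentence gets the same prediction."""
    types_ok = RELATION_TYPES.get(relation) == (e1_type, e2_type)
    event_ok = any(t in sentence.lower()
                   for t in RELATION_TRIGGERS.get(relation, ()))
    return int(types_ok and event_ok)

s = "Ed was born in 1561, the son of John, a carpenter, and his wife Mary."
# Fires identically for (Ed, 1561) and for the incorrect pair (John, 1561):
assert event_type_heuristic(s, "PERSON", "DATE", "per:date_of_birth") == 1
```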

Challenge Set
Consider the date of birth relation, which holds between a person e1 and a year e2, and the classification instance: [e1 Steve Jobs] was born in California in [e2 1955].
A model making use of the event+type heuristic will correctly classify this relation. 3 The heuristic is challenged by sentences that include multiple entities of an applicable type. For example: [e1 Ed] was born in [e2 1561], the son of John, a carpenter, and his wife Mary.
A model relying solely on the event+type heuristic will correctly classify the above, but will also incorrectly classify the following instances: Ed was born in [e2 1561], the son of [e1 John], a carpenter, and his wife Mary.
Ed was born in York in [e2 1561], the son of John, a carpenter, and his wife [e1 Mary].
While such sentences are frequent in practice, these cases are not represented in the dataset: in only 17.2% of the sentences in TACRED is more than a single pair annotated (and only 3.46% of the sentences have more than one distinct annotated label). Additionally, due to the data collection method (Zhang et al., 2017), if a sentence includes a positive pair for a relation r ∈ R, it is significantly more likely that this pair will be chosen for annotation rather than a type-matching pair with a no-relation label. In other words, the data collection process leads to no-relation labels being assigned, with very high probability, to sentences in which the other pairs also do not express a relation of interest.
As a result, models trained on TACRED are incentivized to learn to identify the existence of the relation in the sentence, irrespective of the arguments. There is no signal in the data to incentivize the model to distinguish cases where the relation holds between the given arguments from cases where the relation holds but between a different pair of arguments. We expect the same to hold for any large-scale RC dataset created in a similar manner. 4

Challenge Set Construction. We seek a benchmark that highlights RC models' susceptibility to the event+type heuristic. The benchmark takes the form of a challenge/contrast set: a collection of related examples that specialize in a specific failure case, meant only for evaluation and not training (Kaushik et al., 2020; Gardner et al., 2020). In contrast to the NLI challenge set of McCoy et al. (2019), our examples are based on naturally occurring corpus sentences.

Methodology. Coming up with a set of real-world sentences that demonstrates a failure mode is not easy. The main challenge is in identifying potential candidate sentences to pass to manual annotation. To identify such cases, we require an effective method for sampling a population that is likely to exhibit the behavior we are interested in.
We propose the following challenge-set creation methodology: (1) use a strong seed-model to perform large-scale noisy annotation; (2) identify suspicious cases in the model's output; (3) manually verify (annotate) suspicious cases.
In stage (2), we identify suspicious cases by looking for sentences in which: (a) there are at least two entity pairs of a NE type which is compatible with a TACRED relation (in most cases, the entity pairs share one of the items, i.e., (e1, e2), (e1, e3)); and (b) these two pairs were assigned by the seed model to the same relation. Note that cases that satisfy condition (b) have, with high probability, at least one incorrect model prediction.
On the other hand, given a strong seed model, there is also a high probability that one of the predictions is correct.
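The stage (2) filter can be sketched as follows. The input format (a list of per-sentence seed-model predictions) is an assumption made for the sketch:

```python
# Sketch of stage (2): flag sentences where at least two entity pairs
# were assigned the *same* relation by the seed model. At least one of
# those predictions is incorrect with high probability.
from collections import defaultdict

def suspicious_cases(sentence_predictions):
    """sentence_predictions: iterable of (sentence, [(e1, e2, rel), ...])."""
    for sentence, preds in sentence_predictions:
        by_relation = defaultdict(list)
        for e1, e2, rel in preds:
            if rel != "no_relation":
                by_relation[rel].append((e1, e2))
        if any(len(pairs) >= 2 for pairs in by_relation.values()):
            yield sentence
```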
For our seed model we use a SOTA relation classification model based on fine-tuned SpanBERT (Joshi et al., 2019), which we run over a large corpus of English Wikipedia sentences. 5 Out of the sentences that passed stage (2), we randomly sampled 100 sentences for each of the 30 relations predicted by the model. All instance entity pairs were manually labeled by two of the authors of this work, closely adhering to the TACRED annotation guidelines. For each instance, the annotators provide a binary decision: does the entity pair in question adhere to relation r (as predicted by the model) or not. 6

The resulting challenge set (CRE) has 3,000 distinct sentences and 10,844 classification instances. The dataset is arranged into 30 groups of 100 sentences, where each sentence is binary labeled for a given relation. Example sentences from the challenge set are given in Figure 1. In 57% of the sentences, there are at least two classification instances with conflicting labels, indicating the use of the event+type heuristic. On average there are 3.7 candidate entity pairs per sentence. In 89.2% of the sentences in the set, the entity pairs share an argument. Further details are available in the supplementary material.

5 Additional details in the supplementary material.
6 We chose to perform binary annotation, as we find it makes the annotation process faster and more accurate. As demonstrated by Alt et al. (2020), multi-class relation labeling by crowd-workers leads to frequent annotation errors. We observed the same phenomenon also with non-crowd workers.

Experiments
We evaluate four SOTA TACRED-trained RC models (Table 1); these models achieve SOTA scores. We evaluate the models' results on the CRE dataset in terms of accuracy.
We also report positive accuracy (Acc+), the accuracy on instances for which the relation holds (models that make use of the heuristic are expected to score high here), and negative accuracy (Acc−), the accuracy on instances in which the relation does not hold (models using the heuristic are expected to score low here). The models are consistently more accurate on the positive set than on the negative set, showing that models struggle on cases where the heuristic makes incorrect predictions. 7

A direct comparison to the state of the art is difficult with these metrics alone. To facilitate such a comparison, we also report precision, recall and F1 scores on TACRED+Positive (Table 3a), in which we add to the TACRED test set all the positive instances from the CRE dataset (easy cases), and on TACRED+Negative (Table 3b), in which we add the negative instances (hard cases). All models benefit significantly from the positive setting, and are hurt significantly in the negative setting, primarily in precision, indicating that they do follow the heuristic on many instances: the TACRED-trained models often classify based on the types of the arguments and the existence of a relation in the text, without verifying that the entities are indeed the arguments of the relation.

7 CRE is binary labeled and relatively balanced between positive and negative examples, making accuracy a valid and natural metric. We chose to report Acc+ and Acc− instead of the popular precision and recall because precision and recall emphasize the positive class and do not tell the full story of the negative class (indeed, precision/recall do not involve the true-negative case), which is of interest to us. Using the Acc+/− metrics allows us to focus on both the positive and negative classes.
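The Acc+/Acc− metrics amount to plain accuracy computed separately over the gold-positive and gold-negative instances, which can be sketched as:

```python
# Minimal sketch of the Acc+/Acc- metrics: accuracy computed separately
# over gold-positive (label 1) and gold-negative (label 0) instances.
def split_accuracies(gold, pred):
    pos = [(g, p) for g, p in zip(gold, pred) if g == 1]
    neg = [(g, p) for g, p in zip(gold, pred) if g == 0]
    acc_pos = sum(g == p for g, p in pos) / len(pos)
    acc_neg = sum(g == p for g, p in neg) / len(neg)
    return acc_pos, acc_neg

# A heuristic-driven model that always predicts "relation holds" gets
# perfect Acc+ but zero Acc-:
gold = [1, 1, 0, 0]
pred = [1, 1, 1, 1]
assert split_accuracies(gold, pred) == (1.0, 0.0)
```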

QA Models Perform Better
The CRE dataset results indicate that RE-trained models systematically fail to link the provided relation arguments to the relation mention. We demonstrate that QA-trained models perform better in this respect. The QA models differ both in their training data (SQuAD 2.0; Rajpurkar et al., 2018) and in their training objective (span prediction rather than classification). Inspired by Levy et al. (2017), we reduce RC instances to QA instances. We follow the reduction of Cohen et al. (2020) from binary relation classification to QA, which works by forming two questions for each relation instance, one for each argument. For example, for the relation instance (Mark, FB, founded) we ask "Who founded FB?" and "What did Mark found?". 8 If the QA model answers either one of the questions with the correct span, we return 1 (relation holds); otherwise we return 0 (relation does not hold). We use three pre-trained SOTA models, fine-tuned on SQuAD 2.0: QA-SpanBERT, QA-BERT, and QA-ALBERT (Lan et al., 2020).
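The RC-to-QA reduction can be sketched as follows. The question templates and the `qa_model` callable here are illustrative assumptions (the paper's actual templates are listed in Table 7):

```python
# Sketch of the RC -> QA reduction: two templated questions per relation
# instance; the relation is predicted to hold if either question is
# answered with the correct argument span.
TEMPLATES = {"org:founded_by": ("Who founded {e2}?", "What did {e1} found?")}

def rc_via_qa(qa_model, sentence, e1, e2, relation):
    q_head, q_tail = TEMPLATES[relation]
    ans_head = qa_model(question=q_head.format(e2=e2), context=sentence)
    ans_tail = qa_model(question=q_tail.format(e1=e1), context=sentence)
    # Return 1 iff either answer matches the expected argument span.
    return int(ans_head == e1 or ans_tail == e2)

# Toy QA "model" standing in for a SQuAD-trained span predictor:
def toy_qa(question, context):
    return "Mark" if question.startswith("Who") else "FB"

assert rc_via_qa(toy_qa, "Mark founded FB.", "Mark", "FB", "org:founded_by") == 1
```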
Results. While the scores of the QA models on the TACRED test set are, unsurprisingly, substantially worse (with F1 scores of 59.1%, 52.0% and 61.4%) than those of the TACRED-trained models, they perform better on the CRE dataset (Table 2). QA-trained models pay more attention to the relation between an event and its arguments than RC-trained models.

Augmenting the Training Data with Challenge Set Examples
We test the extent to which we can "inoculate" (Liu et al., 2019a) the relation extraction models by enhancing their training data with some examples from the challenge set. We re-train each model on the TACRED dataset, which we augment with half of the challenge set (5,504 examples, 8% of the size of the original TACRED training set). The other half of the challenge set is used for evaluation.

Results. We begin by evaluating the inoculated models on the original TACRED evaluation set, establishing that this results in roughly the same scores, with small increases for most models (RC-SpanBERT: 71.0 F1 (original: 70.8), RC-BERT: 69.9 F1 (original: 67.5), RC-KnowBERT: 72.1 F1 (original: 71.5), RC-RoBERTa: 70.8 F1 (original: 71.25)).
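The inoculation setup can be sketched as a simple split-and-merge. The 50/50 split follows the text; the seeded shuffle and function names are generic choices for the sketch:

```python
# Sketch of the inoculation setup: half of the challenge set is folded
# into the training data, the other half is held out for evaluation.
import random

def inoculation_split(tacred_train, challenge_set, seed=0):
    rng = random.Random(seed)
    shuffled = challenge_set[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    augmented_train = tacred_train + shuffled[:half]   # re-train on this
    challenge_eval = shuffled[half:]                   # evaluate on this
    return augmented_train, challenge_eval
```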
When evaluating on the CRE dataset examples, we see a large increase in performance for the inoculated models, as can be seen in Table 4. Compared to TACRED-only scores, while we see a small and expected drop in Acc+, it is accompanied by a very large improvement in Acc−. However, while accuracies improve, there is still a wide gap from perfect accuracy.

Discussion and Conclusion
We created a challenge dataset demonstrating the tendency of TACRED-trained models to classify using an event+type heuristic that fails to connect the relation and its arguments. QA-trained models are less susceptible to this behavior. Continuing Gardner et al. (2020), we conclude that challenge sets are an effective tool for benchmarking against shallow heuristics, not only of models and systems, but also of data collection methodologies. We suggest the following recommendations for future RE data collection: evaluation sets should be exhaustive, and contain all relevant entity pairs. Ideally, the same should apply also to training sets.
If impractical, the data should at least attempt to exhaustively annotate confusion-sets: if a certain entity-pair is annotated in a sentence, all other pairs of the same entity-types in the sentence should also be annotated.
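The confusion-set recommendation can be sketched as follows: given one annotated pair, enumerate every other pair in the sentence with the same type signature, so that those pairs can also be sent for annotation. The entity-record format is an assumption for the sketch:

```python
# Sketch of confusion-set enumeration: given one annotated pair, list
# every other same-typed entity pair in the sentence for annotation.
def confusion_set(entities, annotated_pair):
    """entities: list of (span, ne_type);
    annotated_pair: ((head_span, head_type), (tail_span, tail_type))."""
    (h, h_type), (t, t_type) = annotated_pair
    return [((e1, h_type), (e2, t_type))
            for e1, t1 in entities for e2, t2 in entities
            if t1 == h_type and t2 == t_type
            and (e1, e2) != (h, t) and e1 != e2]
```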

A Challenge Set
We use SpanBERT as the seed model: a recent state-of-the-art pre-trained language model, fine-tuned to the RC task. We run the seed model over a large corpus of English Wikipedia sentences, containing more than 10 million sentences. Table 8

C How we evaluate TACRED + pos/neg
In order to evaluate TACRED+Positive and TACRED+Negative, we present them as binary decision problems, as explained in Section 1. We simulate TACRED+pos/neg as follows: for each relation r ∈ R we create a set that contains all the examples (s, e1, e2, r) → {0, 1} in which the argument types of e1 and e2 match the relation r. We evaluate each relation's set separately, and report the results as micro-averaged F1 scores.

Table 7 contains the templates for the questions. Each relation has two questions, one for the head entity and one for the tail entity; we use these templates to reduce the relation classification task to QA.

per:religion | What is the religion of e1? | What religion does e2 believe in?
per:stateorprovince_of_death | Where did e1 die? | What is the place where e2 died?
org:parents | What organization is the parent organization of e1? | What organization is the parent organization of e2?
org:subsidiaries | What organization is the child organization of e1? | What organization is the child subsidiary of e2?
per:other_family | Who are family of e1? | Who are family of e2?
per:stateorprovinces_of_residence | What is the state of residence of e1? | Where is e2's place of residence?
org:members | Who is a member of the organization e1? | What organization is e2 a member of?
per:cause_of_death | How did e1 die? | What is e2's cause of death?
org:member_of | What is the group the organization e1 is a member of? | What organization is a member of e2?
org:number_of_employees/members | How many members does e1 have? | What is the number of members of e2?
per:country_of_birth | In what country was e1 born? | What is e2's country of birth?
org:shareholders | Who holds shares of e1? | Who are e2's shareholders?
org:stateorprovince_of_headquarters | What is the state or province of the headquarters of e1? | Where is the state or province of the headquarters of e2?
per:city_of_death | In what city did e1 die? | What is e2's city of death?
per:city_of_birth | In what city was e1 born? | What is e2's city of birth?
per:spouse | Who is the spouse of e1? | Who is the spouse of e2?
org:city_of_headquarters | Where are the headquarters of e1? | What is e2's city of headquarters?
per:date_of_death | When did e1 die? | What is e2's date of death?
per:schools_attended | Which schools did e1 attend? | What school did e2 attend?
org:political/religious_affiliation | What is e1's political or religious affiliation? | What religion does the organization e2 belong to?
per:country_of_death | Where did e1 die? | What is e2's country of death?
org:founded | When was e1 founded? | On what date was e2 established?
per:stateorprovince_of_birth | In what state was e1 born? | What is e2's state of birth?
per:city_of_birth | Where was e1 born? | Who was born in e2?
org:dissolved | When was e1 dissolved? | On what date was e2 dissolved?

Table 7: Templates for the questions; for each relation, two questions are defined.

E Detailed Results
What is e 2 's country of birth? per:city of birth Where was e 1 born? Who was born in e 2 ? org:dissolved When was e 1 dissolved? What date did e 2 dissolved? Table 7: Templates for the questions, for each relation two questions are defined.