Evaluation of Coreference Resolution Systems Under Adversarial Attacks

A substantial overlap of coreferent mentions in the CoNLL dataset inflates the recent progress on coreference resolution, because the CoNLL benchmark fails to evaluate the ability of coreference resolvers to link novel mentions unseen at train time. In this work, we create a new dataset based on CoNLL that substantially reduces mention overlap across the entire dataset and exposes the limitations of published resolvers in two respects: lexical inference ability and understanding of low-level orthographic noise. Our findings show that (1) the requirements for the embeddings used in resolvers and for coreference resolution are, by design, in conflict and (2) adversarial approaches are sometimes not a legitimate way to mitigate these obstacles, as they may spuriously introduce mention overlap between adversarial training and test sets, thus giving an inflated impression of the improvements.


Introduction
Resolution of coreferring expressions is a natural step for text understanding, but coreference resolvers appear to have a negligible effect on downstream NLP tasks (Yu and Ji, 2016; Durrett et al., 2016; Voita et al., 2018). For instance, Durrett et al. (2016) rewrite pronouns with their antecedents (e.g., he is replaced by Dominick Dunne), using the Berkeley Entity Resolution System (Durrett and Klein, 2014). However, this fails to improve the cross-sentence coherence of system summaries, although the resolver performs well on the OntoNotes 4.0 dataset (Pradhan et al., 2011).
The CoNLL benchmark (Pradhan et al., 2012) reflects the recent advances of coreference resolution systems. Nevertheless, previous work (Moosavi and Strube, 2017) indicates that the progress on the CoNLL benchmark is inflated, as the training and test sets share a large number of mentions. This may be the reason why coreference resolvers have little effect in downstream tasks.
Test Example: Iraqi leader Saddam has given a speech to mark the tenth anniversary of the Gulf war. The Iraqi leader said the Gulf war was a confrontation...
Train Example: There were other signs today that Iraq's leaders have few regrets over the action that precipitated the Gulf war. The Gulf war began 10 years ago...
Table 1: Replacing "the Gulf war" with "the Gulf warfare" or "the Gulf wärfäre" addresses (1) exact match in the test example; (2) mention overlaps across examples.
As opposed to evaluating on standard benchmarks, recent work (Glockner et al., 2018; Pruthi et al., 2019; Eger et al., 2019; Eger and Benz, 2020) investigates the generalization ability of NLP systems under adversarial attacks. For instance, Glockner et al. (2018) show that natural language inference systems fail blatantly when lexical changes, e.g., replacing a word by its synonym, occur in premises and hypotheses. Pruthi et al. (2019) observe that spelling errors distract text classification systems from the correct prediction. Inspired by these works, we investigate published coreference resolvers in two realistic adversarial setups, which challenge (a) the lexical inference ability needed to resolve coreferent mentions, where one mention is, e.g., synonymous or in a type-of relationship with its antecedent and (b) the denoising ability against typographic (low-level) noise. To do so, we construct a new benchmark dataset by modifying the mention spans from CoNLL (Pradhan et al., 2012). This can mitigate lexical overlaps between the CoNLL training and test sets, as illustrated in Table 1. Our analysis yields several findings: (1) We show that the lexical inference ability of published resolvers, including the state-of-the-art resolver based on BERT, is poor: they fail to properly resolve the coreference of a mention and its hypernymous (or hyponymous) antecedent within the same synset. (2) We identify an important reason for this failure: a mismatch, by design, between the requirements of coreference resolution and those of the embeddings used in resolvers. While a plausible coreference resolver is expected to ignore the semantic difference between a word and its hypernym and link them as coreferent mentions, embeddings capture such nuanced and fine-grained meanings well. (3) Further, we show that coreference resolvers fail to generalize to the CoNLL benchmark dataset with minor low-level (orthographic) noise.
As a remedy, we use a common adversarial approach (Goodfellow et al., 2015) to incorporate lexical changes and low-level noise in coreferent mentions at train time, which appears to largely address the obstacles. However, we reveal that it introduces a large number of mention overlaps between the adversarial training and test sets. This indicates an unrealistic situation where resolvers are only robust to what has been seen during training.
These findings indicate potential directions for future work, which may benefit coreference resolvers in downstream tasks and in real-world applications with naturally occurring noise (e.g., user-generated texts).

Adversarial Data Collection
Our goal is to construct a benchmark dataset on which we evaluate the ability to resolve coreference that requires lexical inference and understanding of low-level noise.

Generating Adversarial Examples
Recent work on adversarial attacks involving lexical changes and orthographic modifications has shown deficiencies of NLP models on many tasks. To adapt previous approaches to coreference resolution, we design the following attack schemes, in which we focus on text changes occurring in mention spans. This setup can also address the lexical overlap issue. To do so, we collect mentions from the training and test sets of the CoNLL benchmark dataset. We attack each word in a mention independently with probability p and apply one of the schemes below. Table 2 shows examples of our modifications.
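The per-word sampling described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `attack_mention`, the `scheme` callback interface and the placeholder `upper_scheme` are assumptions introduced here, and the actual attack schemes are the lexical and orthographic changes described in this section.

```python
import random
from typing import Callable, List


def attack_mention(tokens: List[str],
                   scheme: Callable[[str, random.Random], str],
                   p: float,
                   rng: random.Random) -> List[str]:
    """Attack each token of a mention independently with probability p."""
    attacked = []
    for token in tokens:
        # rng.random() is uniform in [0, 1), so each token is drawn independently.
        if rng.random() < p:
            attacked.append(scheme(token, rng))
        else:
            attacked.append(token)
    return attacked


def upper_scheme(token: str, rng: random.Random) -> str:
    """Placeholder scheme for illustration only (not one of the paper's schemes)."""
    return token.upper()


mention = ["the", "Gulf", "war"]
unchanged = attack_mention(mention, upper_scheme, p=0.0, rng=random.Random(0))
all_attacked = attack_mention(mention, upper_scheme, p=1.0, rng=random.Random(0))
```

With p = 0 the mention is returned unchanged, and with p = 1 every word is attacked; intermediate values of p control how many words in a mention are modified on average.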
Lexical Changes. Modifiers and head words of noun phrases in a chain of mentions sometimes occur repeatedly. For instance, president appears both in the mention the 44th president of the US and its   (2019) remove named entities overlapping in the training and test sets. In contrast, we randomly choose an overlapping word from the mentions and substitute it with its hyponym, hypernym or synonym, as found in WordNet (Miller, 1995). To prevent the meaning of a substituted word from deviating from that of the original word, we make the substitution only when the two words share one word sense (synset), as obtained from the adapted LESK algorithm (Banerjee and Pedersen, 2002).
Orthographic Changes. Character-level ("low-level") text changes, e.g., random swapping of characters (Pruthi et al., 2019), create surface form noise that often does not affect humans. We investigate the impact of different forms of low-level noise, namely (a) swapping a pair of adjacent letters, (b) deleting letters, and (c) visual perturbation, i.e., replacing characters in a word with visually similar ones. To make text changes less perceptible to humans, we impose the following restrictions on (a) and (b): (1) an individual word may be modified only once, (2) the first and the last letter of a word cannot be modified, as human reading is more resilient to internal letter exchanges, as shown by psycholinguistic research (Davis, 2003), and (3) words with fewer than four characters are not modified. As for visual attacks (c), we obtain character 'embeddings' from the descriptions of each character in the Unicode 11.0.0 final names list, and then determine a set of nearest neighbors by choosing those characters whose descriptions refer to the same letter. Such perturbations have been shown to have little effect on human text processing (Eger et al., 2019).
Baselines. We investigate non-neural systems¹, namely the DETERMINISTIC (Lee et al., 2013) and STATISTICAL (Clark and Manning, 2015) systems, together with neural systems, including DEEP-RL (Clark and Manning, 2016), COARSE-TO-FINE (C2F) (Lee et al., 2018), C2F⊕BERT and C2F⊕SPANBERT (Joshi et al., 2019). The results are reported using the CoNLL F1 score, i.e., the average of MUC (Vilain et al., 1995), B3 (Bagga and Baldwin, 1998) and CEAFe (Luo, 2005).
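The restrictions on swapping and deletion described under Orthographic Changes, together with a simplified form of the visual neighbor lookup, can be sketched as follows. This is a hedged illustration under stated assumptions: all function names are hypothetical, deletion removes a single internal letter, and `visual_neighbors` replaces the description-based character 'embeddings' with a plain name-prefix match over Python's `unicodedata` names (whose Unicode version depends on the Python release, not necessarily 11.0.0) in a limited Latin code point range.

```python
import random
import unicodedata
from typing import List


def can_modify(word: str) -> bool:
    """Restriction (3): words with fewer than four characters are left alone."""
    return len(word) >= 4


def swap_adjacent(word: str, rng: random.Random) -> str:
    """Swap one pair of adjacent internal letters (restrictions 1 and 2)."""
    if not can_modify(word):
        return word
    # Choose i so that both i and i + 1 are internal positions (1 .. len - 2).
    i = rng.randrange(1, len(word) - 2)
    chars = list(word)
    chars[i], chars[i + 1] = chars[i + 1], chars[i]
    return "".join(chars)


def delete_letter(word: str, rng: random.Random) -> str:
    """Delete one internal letter; the first and last letters are kept."""
    if not can_modify(word):
        return word
    i = rng.randrange(1, len(word) - 1)  # internal position 1 .. len - 2
    return word[:i] + word[i + 1:]


def visual_neighbors(ch: str) -> List[str]:
    """Characters named '<name of ch> WITH ...', e.g. 'a' -> 'ä', 'á', ...

    Simplification (assumption): a prefix match on Unicode names over the
    Latin-1 Supplement and Latin Extended-A/B blocks only.
    """
    try:
        prefix = unicodedata.name(ch) + " WITH "
    except ValueError:  # character has no Unicode name
        return []
    return [chr(cp) for cp in range(0x00A0, 0x0250)
            if unicodedata.name(chr(cp), "").startswith(prefix)]


rng = random.Random(0)
swapped = swap_adjacent("Gulf", rng)     # only one internal pair can be swapped
deleted = delete_letter("leader", rng)   # one internal letter removed
neighbors_of_a = visual_neighbors("a")   # includes 'ä', as in "wärfäre"
```

Because each function is applied at most once per word, restriction (1) is enforced by the caller; a swap of two identical adjacent letters leaves the word unchanged, which a fuller implementation might want to resample.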
Overall Results. Despite the minor changes in text, Table 3 shows that the drop in performance is consistently large on average (10-12 points CoNLL F-score) across systems. The systems appear to suffer the most from orthographic changes; however, the percentage of examples with low-level noise is twice as large as that of examples with lexical changes. Together, this exposes the limitations of non-neural and neural systems, including the systems based on BERT and SpanBERT, with respect to lexical inference ability and understanding of low-level noise. Also, we note that the drop for the non-neural baselines is smaller, which we believe is because linguistic features are the primary predictors in them and have a positive effect.
¹ For non-neural systems, the linguistic features are extracted from our benchmark dataset using spaCy.

Shielding via Adversarial Training
Shielding Setup. We measure to what extent adversarial training (Goodfellow et al., 2015) can improve the lexical inference ability and the robustness to low-level noise of the baseline systems. We include the adversarial training set at train time, but do not augment the training data, i.e., we only replace 50% of the clean examples using our text manipulations. We split our evaluation into two setups: (1) in-domain evaluation, e.g., both the training and test sets used for training and evaluation are modified by swapping characters and (2) out-of-domain evaluation, e.g., we use adversarial training to train a baseline system from scratch on a training set modified with one type of noise, denoted as AT-NOISE, and evaluate it on the adversarial test sets of the remaining noise types.
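The construction of the adversarial training set, in which half of the clean examples are replaced rather than added, can be sketched as follows. This is an illustrative sketch under assumptions: the function name, the `perturb` callback and the fixed seed are hypothetical, and examples are represented as plain strings for brevity.

```python
import random
from typing import Callable, List


def build_adversarial_training_set(clean: List[str],
                                   perturb: Callable[[str], str],
                                   fraction: float = 0.5,
                                   seed: int = 0) -> List[str]:
    """Replace a fraction of the clean examples in place (no augmentation)."""
    rng = random.Random(seed)
    n_replace = int(len(clean) * fraction)
    # Indices of the examples that are swapped for their perturbed version.
    chosen = set(rng.sample(range(len(clean)), n_replace))
    return [perturb(ex) if i in chosen else ex for i, ex in enumerate(clean)]


clean_examples = ["the gulf war", "iraqi leader", "the speech", "ten years ago"]
mixed = build_adversarial_training_set(clean_examples, str.upper)
```

The resulting set has the same size as the clean one, so any gain cannot be attributed to additional training data; note, however, that the untouched clean half still carries the original mention surface forms.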
Lexical Changes Analysis. Table 4 shows that the performance drops for C2F⊕BERT on the HYPONYM and HYPERNYM test sets are much larger than that on the SYNONYM test set, but AT-SYNONYM helps considerably. To examine this more thoroughly, we randomly extract pairs of 1,000 words and their synonyms, hyponyms and hypernyms from WordNet, as a form of coreferent mentions. We show histograms of the cosine similarity scores of the word pairs, based on the last layer of the BERT embeddings used in C2F⊕BERT. Figure 1 (above) shows that a pair of a mention and its hypernymous/hyponymous antecedent is often assigned a lower cosine similarity score than a pair of a mention and its synonymous antecedent, suggesting that BERT embeddings capture the semantic differences among the three well. However, a plausible coreference resolver needs to ignore such fine-grained differences in meaning and link them all as coreferent mentions. This indicates that the requirements for the embeddings used in resolvers and for coreference resolvers are, by design, in conflict. However, this issue can be mitigated using AT-SYNONYM, as illustrated in Figure 1 (below). This is because a gold label can bridge a mention and its hypernymous/synonymous antecedent (within the same synset), thus overriding the semantic differences between them.
In-domain and Out-of-domain Evaluations. Figure 2 shows that C2F⊕BERT with adversarial training appears to achieve consistent improvements in the in-domain evaluation setup, e.g., the gain achieved by AT-SWAP is 15.3 points on the SWAP test set. However, we observe that about 10% of mentions overlap between the adversarial training and test sets, an overlap introduced by the adversarial training approach. This may give a false and inflated impression of the improvements. Further, the effects in the out-of-domain evaluation are different.
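One simple way to estimate the mention overlap between a training set and a test set is sketched below. The exact overlap measure is not specified here, so this sketch assumes an exact match of lowercased mention strings and reports the share of test mentions that also occur in the training set; the function name and the example mentions are illustrative.

```python
from typing import Iterable, List


def mention_overlap(train_mentions: Iterable[str],
                    test_mentions: List[str]) -> float:
    """Fraction of test mentions whose surface form also occurs in training."""
    train = {m.lower() for m in train_mentions}
    if not test_mentions:
        return 0.0
    hits = sum(1 for m in test_mentions if m.lower() in train)
    return hits / len(test_mentions)


train = ["the Gulf war", "Iraq's leaders"]
test = ["the Gulf war", "Iraqi leader Saddam", "The Gulf war", "the Gulf warfare"]
overlap = mention_overlap(train, test)  # 2 of 4 test mentions are seen in training
```

Applying the same perturbation scheme to both the training and the test set can reintroduce identical perturbed surface forms on both sides, which is the kind of overlap such a check is meant to surface.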
For instance, AT-SWAP obtains a large gain (+6.76 points) on the DELETE and VISUAL test sets, as the domain difference between these two and the SWAP test set is small. However, we note that AT-SWAP has a negative effect on the performance on the adversarial test sets involving lexical changes, since character-level noise and lexical replacement have little in common. In contrast, AT-SYNONYM appears to have a positive effect on the performance in the low-level noise domain. However, Table 5 shows that training C2F⊕BERT on the full SYNONYM training set causes a large performance drop on average across the low-level noise types. This indicates that enriching the system with lexical knowledge fails to improve its robustness to orthographic changes (similar to the negative effect of AT-SWAP on lexical changes). The gain on the test sets with low-level noise only appears when clean training examples are involved at train time, as this substantially increases the number of mention overlaps, leading to a simpler coreference resolution task.

Conclusions
Coreference resolution has the potential to help downstream NLP systems solve problems that require text understanding. However, the performance scores on the CoNLL benchmark are inflated, because mentions largely overlap across the whole dataset, and evaluation in a constrained domain fails to expose the limitations of coreference resolvers in the wild. Our experiments show that published resolvers fail to link coreferent mentions involving minor low-level noise and lexical changes. Beyond that, we show a caveat when mitigating these obstacles via adversarial approaches: lexical overlaps introduced by data augmentation must be removed from the adversarial training and test sets in order to see how the approaches perform realistically.