The More Antecedents, the Merrier: Resolving Multi-Antecedent Anaphors

Anaphor resolution is an important task in NLP with many applications. Despite much research effort, it remains an open problem. The difficulty of the problem varies substantially across different sub-problems. One sub-problem, in particular, has been largely untouched by prior work despite occurring frequently throughout corpora: anaphors that have multiple antecedents, which here we call multi-antecedent anaphors or m-anaphors. Current coreference resolvers restrict anaphors to at most a single antecedent. As we show in this paper, relaxing this constraint poses serious problems in coreference chain-building, where each chain is intended to refer to a single entity. This work provides a formalization of the new task with preliminary insights into multi-antecedent noun-phrase anaphors, and offers a method for resolving such cases that outperforms a number of baseline methods by a significant margin. Our system uses local agglomerative clustering on candidate antecedents and an existing coreference system to score clusters to determine which cluster of mentions is antecedent for a given anaphor. When we augment an existing coreference system with our proposed method, we observe a substantial increase in performance (0.6 absolute CoNLL F1) on an annotated corpus.

To avoid the complexity of the overarching resolution task, many current systems, whether learning-based (Clark and Manning, 2015; Peng et al., 2015; Wiseman et al., 2015; Durrett and Klein, 2013; Björkelund and Farkas, 2012) or rule-based (Lee et al., 2011), focus on a restricted version of the problem, where candidate anaphors are linked to at most one antecedent, from which coreference chains are built by propagating the induced equivalence relation, with each chain corresponding to an entity (Van Deemter and Kibble, 2000).
While this single-antecedent inference task does resolve a very large number of anaphors in any given text, it leaves one quite common subproblem virtually untouched: anaphors that link to multiple antecedents. These have sometimes been called split-antecedent anaphors; here we use the term multi-antecedent anaphors or m-anaphors in order to emphasize the existence of more than one (possibly more than two) antecedents for a given anaphor. Consider the following examples:
(1) [Elizabeth]_1 met [Mary]_2 at the park and [they]_{1,2} began their stroll to the river.
(2) Mrs. Dashwood, having moved to another country, saw her [mother]_1 and [sister-in-law]_2 demoted to occasional visitors. As such, however, her old [kin]_{1,2} were treated by her new family with quiet civility.
Such cases present a challenge to state-of-the-art methods: certain features well-suited for the single-antecedent case do not apply (e.g. gender and plurality) (Recasens and Hovy, 2009; Stoyanov et al., 2009; Bergsma and Lin, 2006), and strong long-distance effects cannot be ignored (Ingria and Stallard, 1989). Moreover, the presence of multiple antecedents for a single anaphor violates the separation between coreference chains.
In this paper, we address the multi-antecedent case of noun-phrase (NP) anaphor resolution in English, the most widely understood and studied form of coreference resolution (Ng, 2010; Ng, 2008). While we frame the general question of multi-antecedent inference, we restrict our analyses to one particular sub-problem: resolving the antecedents of the pronouns they and them. These pronouns best isolate the characteristics of m-anaphors (see Section 2 for more on the motivation of this choice). We propose a system for resolving they and them that models grouping compatibility of mentions through a maximum entropy pairwise model, independently from coreference of groupings, which is handled through an existing coreference resolution system leveraging corpus knowledge.
This paper makes four core contributions. First, it provides a generalization of the anaphor resolution problem to permit linking to multiple antecedents. Second, we characterize core properties of m-anaphors and their linguistic environments in a large, annotated corpus. Third, we provide an entity-centric system for specifically resolving multi-antecedent cases that outperforms a number of baselines. And, finally, we show how to pair our system with an existing coreference system and show a gain of 0.6 points (CoNLL F1) on the complete coreference resolution task (resolving all anaphors, single- and multi-antecedent).
The rest of the paper is organized as follows: We introduce the terminology and problem statement for multi-antecedent resolution in Section 2. A summary of the data is given in Section 3 and the behaviour of m-anaphors is analyzed in Section 4. Our approach to antecedent prediction is presented in Section 5 and the results and analysis are reported in Section 6. Finally, we review related work in Section 7 and conclude and discuss future work in Section 8.

Problem
This section establishes the terminology used throughout the paper and reformulates the anaphor resolution problem to incorporate linking to multiple antecedents.

Terminology
We introduce the term m-anaphor for convenience as a special case of anaphor that has multiple antecedents. For example, they and kin in Examples (1) and (2), respectively, from the Introduction are m-anaphors. By extension, 1-anaphors are anaphors that have only one antecedent.
Similarly, we define an m-antecedent as one of multiple antecedents of an m-anaphor and we refer to m-antecedents with the same m-anaphor as siblings. In Example (1) from the Introduction, Elizabeth and Mary are sibling m-antecedents of they, and in Example (2), mother and sister-in-law are sibling m-antecedents of kin.
[Both]_{1,2} were troubled by the news.
(4) Virginia found herself alone with her [brother]_1, and then the thought of her [sister]_2 came to mind.
The anaphor in Example (3) is a 2-anaphor and the anaphor in Example (4) is a 3-anaphor.

Definition
We define the NP anaphor resolution problem similarly to Wiseman et al. (2015), Durrett and Klein (2013), and Hirschman (1997): Let M denote the set of all identified mentions in a document and let M(x) ⊆ M denote all mentions preceding a mention x ∈ M. The objective of the task is, for each x ∈ M, to find C ⊆ M(x) such that all mentions in C are antecedent to x. If C = ∅, then x is non-anaphoric; if |C| = 1, then x is 1-anaphoric; and if |C| > 1, then x is m-anaphoric. Hence, this formulation generalizes the problem to account for multi-antecedent anaphors.
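As an illustration of this formulation, the following minimal Python sketch labels a mention by the size of its antecedent set C. The function names and the toy annotation of Example (1) are assumptions made for illustration only; they are not part of the released software.

```python
from typing import Dict, FrozenSet, List


def preceding_mentions(mentions: List[str], index: int) -> List[str]:
    """M(x): all mentions that precede the mention at position `index`."""
    return mentions[:index]


def anaphoricity(antecedents: FrozenSet[int]) -> str:
    """Label a mention by the size of its antecedent set C."""
    if len(antecedents) == 0:
        return "non-anaphoric"
    if len(antecedents) == 1:
        return "1-anaphoric"
    return "m-anaphoric"


# Toy annotation of Example (1); indices point into `mentions`.
mentions = ["Elizabeth", "Mary", "they"]
gold: Dict[int, FrozenSet[int]] = {
    0: frozenset(),
    1: frozenset(),
    2: frozenset({0, 1}),  # "they" has two antecedents
}
labels = {mentions[i]: anaphoricity(c) for i, c in gold.items()}
```

Here `labels` maps Elizabeth and Mary to non-anaphoric and they to m-anaphoric, and every antecedent index of a mention is drawn from the positions returned by `preceding_mentions`.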
To constrain the scope of the study, we perform all our analyses on gold mentions, leaving the effect of imperfect mention detection as a problem for future work (this has been studied for the single-antecedent case in Stoyanov et al. (2009)). Moreover, we only consider mentions of they and them that are known to be m-anaphoric, for three reasons. First, non-pronominal m-anaphors, i.e. proper and common nouns, are much more susceptible to long-distance effects and may require external knowledge to resolve. Second, by focusing on this case, we circumvent a host of very involved aspects of the complete m-anaphor resolution problem, i.e. determining whether a mention is m-anaphoric, 1-anaphoric, or not anaphoric at all. For example, you may refer to one person or multiple; who can be used as an interrogative (non-anaphoric) or relative pronoun (anaphoric); pronouns such as anyone and everyone introduce many scoping difficulties; and pleonastic pronouns must be removed from the inference task entirely. Third, they and them are the most prevalent of all pronouns in our dataset (refer to Table 1).

Data
Our dataset comprises the novel Pride and Prejudice (P&P) (121440 words) and 36 short stories from the Scribner Anthology of Contemporary Short Fiction (Martone et al., 1999) (Scribner) (total of 216901 words), representing an eclectic collection of stories from the modern era. For P&P, all mentions of characters have been fully resolved to their antecedents, including mentions referencing multiple characters. For Scribner, all mentions of they and them are resolved (m-anaphoric, 1-anaphoric, and singleton), including those of non-person entities. These stories were annotated by three annotators according to a slightly modified version of the ACE coreference resolution task formulation (Doddington et al., 2004) to allow multiple antecedents. Annotations were conducted through the brat annotation tool (Stenetorp et al., 2012) and the inter-annotator agreement on the shared texts (3 stories from Scribner + 7 chapters from P&P) was 86.5%.
Table 2: Number of m-anaphoric they and them mentions and % of all they and them mentions that are m-anaphors.
Overall, in P&P, 1289 m-anaphors were discovered, of which 34 (2.6%) were proper nouns, 536 (41.6%) were common nouns, and 719 (55.8%) were pronouns. Table 2 shows the number of gold m-anaphoric they and them mentions and the percentage of all they and them mentions that are m-anaphoric.
Literary works were chosen over other textual modalities, e.g. news articles, because they showed a higher density of m-anaphors (a preliminary annotation exercise showed that literary works contained 37% more m-anaphors per word).
The dataset is partitioned according to a roughly 60/20/20 split into training, validation, and testing sets, where the split is applied to the text of P&P (e.g. the first 60% of story text is used for training) and to the collection of Scribner stories (e.g. 60% of the stories were used for training).
Table 3: Average and standard deviations of the word distance, sentence distance, and number of intermediate mentions between the first and second most recent mentions to an m-anaphor.
dataset to gain insight into the distribution of m-anaphors across a number of dimensions. Second, we fit a maximum entropy model over common coreference features for distinguishing m-anaphoric and anaphoric mentions to evaluate the importance of various features in determining m-anaphoricity versus anaphoricity of mentions.

m-anaphor Statistics
The distribution of m-anaphors according to the number of referenced m-antecedents is as follows: 79.3% are 2-anaphors, 13.2% are 3-anaphors, 3.7% are 4-anaphors, and the remaining 3.8% refer to larger numbers of antecedents. Despite the bias towards 2-anaphors, the simple approach to m-anaphor resolution of taking the previous two mentions as m-antecedent siblings will fail according to Table 3. The usual presence of intermediate mentions between m-anaphors and their m-antecedents makes the resolution task non-trivial. Moreover, the large distances between m-anaphors and their antecedents attenuate any signal for coreference, introducing greater noise to the problem.

m-anaphoricity Features
The statistics discussed above shed light on the complexity of this problem. Here, we examine whether certain surface-level features of anaphoric phenomena from prior work exhibit any differences for m-anaphoric mentions over anaphoric ones. We construct a maximum entropy model from the training data over the combination of syntactic and semantic features in Table 4, inspired by Wiseman et al. (2015), Durrett and Klein (2013), and Recasens et al. (2013b). The binary classification decision is between m-anaphoric and 1-anaphoric mentions, coded as '1' and '0', respectively. Therefore, the estimated coefficients that are positive favor m-anaphoricity and those that are negative favor 1-anaphoricity. Except for the feature testing on the last sentence position, none of the results in Table 4 were able to reach statistical significance, suggesting that, at a surface level, m-anaphoricity and 1-anaphoricity behave very similarly and operate in similar linguistic environments. One possibility is that a deeper set of features is required for distinguishing m-anaphors from 1-anaphors. We identify this as an important topic for future work in this area.

m-anaphor Resolution
Our approach to m-anaphor resolution draws inspiration from mention-pair models for coreference that make independent binary classification decisions (Ng, 2010). In our method, we employ a maximum entropy model that makes binary decisions on mention pairs as well, but the decision corresponds to "group compatibility" of mentions, i.e. to what degree a given set of mentions can be the sibling m-antecedents of the same m-anaphor. This model is embedded in an agglomerative clustering process, after which a coreference decision is made between clusters and the given m-anaphor. Thus, our model treats the grouping of candidate mentions into sibling sets independently from antecedent-anaphor linking.

Architecture
Given an m-anaphor g in document D, the steps of our approach are as follows:
1. Mentions preceding g within a k-sentence window are extracted as candidate m-antecedents to g.
2. An agglomerative clustering of the candidate mentions is performed using similarity metric SIM_1 and the average-linkage criterion. Let 𝒞 represent the clustering.
3. Each non-singleton cluster C ∈ 𝒞 is scored according to the probability of coreference of the m-anaphor to the cluster. This is done by appealing to an external corpus comprising sentences containing either they or them. The grouping of sentences in the document containing all of the mentions in C (and the sentences in between) is compared to each they or them sentence in the external corpus (depending on the identity of g) using similarity metric SIM_2. The sentence yielding the maximum similarity is selected. The probability of coreference is then calculated by replacing the sentence grouping with the extracted sentence and applying an existing coreference system COREF between g and its counterpart (they or them) in the extracted sentence.
4. The cluster C_max producing the highest probability of coreference is predicted as the group of m-antecedents for g.
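The clustering and selection steps (2 and 4) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the released implementation: the stopping threshold for merging is an assumption (the text does not specify a stopping rule), `score_cluster` is a hypothetical stand-in for the corpus-based COREF scoring of step 3, and the toy similarity matrix is invented.

```python
from itertools import combinations
from typing import Callable, List, Optional, Tuple

Cluster = Tuple[int, ...]  # indices of candidate mentions
Sim = Callable[[int, int], float]


def average_linkage(a: Cluster, b: Cluster, sim: Sim) -> float:
    """Mean pairwise similarity between the members of two clusters."""
    return sum(sim(i, j) for i in a for j in b) / (len(a) * len(b))


def agglomerate(n: int, sim: Sim, threshold: float) -> List[Cluster]:
    """Merge the most similar pair of clusters until none reaches `threshold`.

    The stopping threshold is an assumption of this sketch.
    """
    clusters: List[Cluster] = [(i,) for i in range(n)]
    while len(clusters) > 1:
        (a, b), best = max(
            (((a, b), average_linkage(a, b, sim)) for a, b in combinations(clusters, 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters = [c for c in clusters if c not in (a, b)] + [tuple(sorted(a + b))]
    return clusters


def predict_antecedents(
    n: int, sim: Sim, score_cluster: Callable[[Cluster], float], threshold: float = 0.5
) -> Optional[Cluster]:
    """Return the non-singleton cluster with the highest score, if any."""
    candidates = [c for c in agglomerate(n, sim, threshold) if len(c) > 1]
    return max(candidates, key=score_cluster, default=None)


# Toy data: mentions 0 and 1 are compatible, as are mentions 2 and 3.
TOY_SIM = [
    [1.0, 0.9, 0.1, 0.1],
    [0.9, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
]
toy_scores = {(0, 1): 0.3, (2, 3): 0.7}  # stand-in for COREF probabilities
clusters = agglomerate(4, lambda i, j: TOY_SIM[i][j], threshold=0.5)
best = predict_antecedents(4, lambda i, j: TOY_SIM[i][j], toy_scores.get, threshold=0.5)
```

On the toy data the clustering yields the groups (0, 1) and (2, 3), and the second group is selected because its stand-in score is higher.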
Again, inspired by mention-pair models for coreference resolution (Clark and Manning, 2015; Björkelund and Farkas, 2012; Ng and Cardie, 2002a), the SIM_1 similarity metric is defined as $\sigma(w^{\top} x)$, where $w$ is a weight vector and $x$ is a feature vector defined for a pair of mentions. The parameter vector $w$ is learned using the standard cross-entropy loss function in a maximum entropy model, where the target variable is a decision on whether the mention pairs are siblings or not. The learning is conducted over the training set with L2 regularization.
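To make the pairwise model concrete, the sketch below implements a sigmoid similarity over a weighted feature vector and fits the weights by gradient descent on an L2-regularized cross-entropy loss. Several details are assumptions for illustration: the way the regularization parameter enters the loss (as a factor of 0.5·λ on the squared norm), the optimizer, the learning rate, and the three toy features (a bias term, a same-sentence indicator, and a coordination indicator).

```python
import math
from typing import List, Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def sim1(w: Sequence[float], x: Sequence[float]) -> float:
    """Pairwise similarity: sigmoid of the dot product of weights and features."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)))


def loss(w: Sequence[float], X: List[List[float]], y: List[int], lam: float) -> float:
    """Mean cross-entropy plus an L2 penalty (penalty form is an assumption)."""
    ce = -sum(
        t * math.log(sim1(w, x)) + (1 - t) * math.log(1.0 - sim1(w, x))
        for x, t in zip(X, y)
    )
    return ce / len(X) + 0.5 * lam * sum(wi * wi for wi in w)


def fit(X: List[List[float]], y: List[int], lam: float = 0.2,
        lr: float = 0.5, epochs: int = 500) -> List[float]:
    """Plain batch gradient descent on the regularized loss."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [lam * wi for wi in w]
        for x, t in zip(X, y):
            err = sim1(w, x) - t
            for k, xk in enumerate(x):
                grad[k] += err * xk / len(X)
        w = [wi - lr * gi for wi, gi in zip(w, grad)]
    return w


# Toy features per mention pair: [bias, same_sentence, coordinated].
X = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [1, 0, 1]]
y = [1, 1, 0, 0]  # 1 = siblings, 0 = not siblings
initial_loss = loss([0.0, 0.0, 0.0], X, y, lam=0.2)
w = fit(X, y, lam=0.2)
final_loss = loss(w, X, y, lam=0.2)
```

In this toy setting the same-sentence feature separates the two classes, so its learned weight is positive and pairs with that feature receive a higher similarity than otherwise identical pairs without it.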
For SIM_2, which is responsible for selecting replacement sentences, we experiment with two different similarity metrics: (1) longest common subsequence normalized by sentence length (LCS) and (2) a subset tree kernel (Collins and Duffy, 2002) with a bag-of-words extension as described in Moschitti (2006), which also describes a simple adaptation to forests (for multiple sentences). The named entity (NE) mentions in sentences are replaced by corresponding NE type placeholders (PERSON, LOCATION, etc. as described in Finkel et al. (2005)) before comparison.
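The LCS variant can be sketched as follows. This is an illustrative implementation under one assumption that the text leaves open: the subsequence length is normalized by the length of the longer of the two token sequences. The example sentences, already rewritten with NE placeholders, are invented.

```python
from typing import List, Sequence


def lcs_length(a: Sequence[str], b: Sequence[str]) -> int:
    """Length of the longest common subsequence, by dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: Sequence[str], b: Sequence[str]) -> float:
    """LCS length divided by the longer sequence length (an assumption)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / max(len(a), len(b))


# Tokenized sentences with named entities replaced by type placeholders.
doc_sentence: List[str] = "PERSON met PERSON at the park and they left".split()
corpus_sentence: List[str] = "PERSON saw PERSON and they smiled".split()
similarity = lcs_similarity(doc_sentence, corpus_sentence)
```

The two toy sentences share the subsequence PERSON, PERSON, and, they, so the similarity is 4 divided by 9, the length of the longer sentence.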
In the experiments to follow, we adopt the classification mention-pair model, a component of the statistical coreference resolution system available in the Stanford CoreNLP suite and described in Clark and Manning (2015), as COREF for scoring coreference. The external corpus was built from texts comparable to our dataset: 651,108 sentences containing one of they or them were mined from a larger corpus of 798 literary texts spanning the nineteenth and twentieth centuries (including novels such as To The Lighthouse, by Virginia Woolf). Lastly, the candidate m-antecedents are extracted from a 5-sentence pre-window of the given m-anaphor (k = 5) and the regularization parameter in learning is set to 0.20.

Clustering Features
Table 5 depicts the features we chose to use in the pairwise similarity metric (SIM_1) for agglomerative clustering of candidate m-antecedents. All are common to many coreference resolver systems (Durrett and Klein, 2013; Recasens et al., 2013b; Stoyanov et al., 2010). We distinguish between mention features (Columns 1 & 2), which are defined for each candidate m-antecedent in a pair, and pairwise features (Columns 3-5), which are defined over a pair of candidate m-antecedents.
Three features, in particular, deserve further discussion. Under morphosyntax (Column 3), [Type Conjunctions] is a placeholder for a number of conjunctive boolean features derived from the noun type (pronoun/proper/common) of each antecedent in a pairing: e.g., pronoun-pronoun, pronoun-proper, proper-pronoun. Similarly, [Dependency Conjunctions] is a placeholder for conjunctive boolean features derived from the grammatical dependency of each antecedent in a pairing: e.g., subject-subject, subject-object, object-subject. The [# Dependency Pairings] is an ordinal version of the Dependency Conjunctions feature set: a count of the number of occurrences rather than an indicator variable.
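One way to picture these conjunctive features is the sketch below. It is an assumption-laden illustration: the feature naming scheme, the function names, and the interpretation of the count feature (the number of times each dependency pairing is observed for a mention pair) are ours and are not confirmed by the text.

```python
from collections import Counter
from itertools import product
from typing import Dict, List, Tuple

NOUN_TYPES = ("pronoun", "proper", "common")


def type_conjunctions(type_a: str, type_b: str) -> Dict[str, bool]:
    """One boolean per ordered pair of noun types; exactly one fires."""
    return {
        f"{a}-{b}": (a, b) == (type_a, type_b)
        for a, b in product(NOUN_TYPES, repeat=2)
    }


def dependency_features(
    pairings: List[Tuple[str, str]]
) -> Tuple[Dict[str, bool], Counter]:
    """Indicator and count versions of the dependency conjunctions.

    `pairings` lists the (dependency, dependency) roles observed for a
    mention pair; how these are collected is an assumption of this sketch.
    """
    counts = Counter(f"{a}-{b}" for a, b in pairings)
    indicators = {name: True for name in counts}
    return indicators, counts


type_feats = type_conjunctions("pronoun", "proper")
dep_indicators, dep_counts = dependency_features(
    [("subject", "object"), ("subject", "object"), ("subject", "subject")]
)
```

The indicator dictionary corresponds to the boolean conjunctions, while the counter plays the role of the count-based variant.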
The 'Governor = except' feature triggers if one of the mentions in the mention pair is governed by except or exclude. It represents a form of negation of group membership (e.g. Everyone except for Mary visited Castlebary).
Features were extracted using the Stanford CoreNLP system (Manning et al., 2014) and animacy information was specifically obtained through the Stanford deterministic coreference resolution module (Lee et al., 2011).

Experiments
In order to assess the performance of our method, we conduct two experiments. In the first, we assess performance of our system on the specific they-them m-anaphor resolution sub-task. Our system, and its variants, are compared against a number of baseline methods based on performance on the test set.
In the second experiment, we consider how our system improves the performance of a coreference resolution system when all anaphors (both 1-anaphors and m-anaphors) are considered.

Evaluation
Accuracy is measured in terms of the number of mention pairs correctly grouped as m-antecedents for a given m-anaphor, similar to previous works in anaphor resolution (Peng et al., 2015). We use the standard classification metrics of precision, recall, and F1-score. If $n_1, n_2, \ldots, n_N$ represent the number of gold m-antecedents for m-anaphors $g_1, g_2, \ldots, g_N$ in a document, and $m_1, m_2, \ldots, m_N$ are predicted, of which $k_1, k_2, \ldots, k_N$ are correct, then precision is defined as $\sum_i k_i / \sum_i m_i$ and recall as $\sum_i k_i / \sum_i n_i$, where $i$ ranges from 1 to $N$. In order to align ourselves with the gold labels, we adjust the predicted mention corresponding to an entity to the closest one preceding the given m-anaphor. Because a given entity may appear multiple times in a candidate mention window, the most recent one, relative to the m-anaphor, is not always the one carrying the strongest signal and hence is not always predicted as an antecedent. For the purposes of evaluation, such cases are considered correct. Automatic handling would involve a separate, single-antecedent coreference resolver, but given that the focus of this work is the multi-antecedent case, this choice is justified.
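The metric definitions above translate directly into code. The following sketch computes the pooled precision, recall, and F1 over a list of m-anaphors; the function name and the toy gold and predicted sets are illustrative assumptions.

```python
from typing import List, Set, Tuple


def micro_prf(gold: List[Set[int]], predicted: List[Set[int]]) -> Tuple[float, float, float]:
    """Precision = sum(k_i) / sum(m_i), recall = sum(k_i) / sum(n_i)."""
    k = sum(len(g & p) for g, p in zip(gold, predicted))  # correct predictions
    m = sum(len(p) for p in predicted)                    # all predictions
    n = sum(len(g) for g in gold)                         # all gold antecedents
    precision = k / m if m else 0.0
    recall = k / n if n else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Two m-anaphors: the first is fully resolved, the second only partially.
gold_sets = [{1, 2}, {3, 4, 5}]
pred_sets = [{1, 2}, {3, 6}]
precision, recall, f1 = micro_prf(gold_sets, pred_sets)
```

For the toy data, three of four predicted antecedents are correct and three of five gold antecedents are recovered, giving a precision of 0.75 and a recall of 0.6.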

System Comparison
We first describe the various baselines and variants of our method we assess and then analyze the performance results.

Systems
• The "most-recent-k" baselines (denoted RECENT-k), which predict the most recent k mentions, relative to the m-anaphor, as the m-antecedents for k = 2, 3, 4.
• The random selection baseline (denoted RANDOM), which randomly predicts mentions in a 5-sentence pre-window as the antecedents according to a binomial with probability 0.5 (imposing the constraint that at least two must be predicted).
• A simple rule-based method (denoted RULE), which proceeds as follows:
  - If the m-anaphor occupies a subject or prepositional position, then predict the most recent mentions in subject positions if they are coordinated, otherwise take them from previous, distinct sentences. If no such mentions can be found, take the most recent mentions in subject and object positions governed by the same verb.
  - If the m-anaphor occupies an object position, take the previous mentions in object or prepositional positions if they are coordinated, otherwise take them from previous, distinct sentences. If no such mentions can be found, take the most recent mentions in subject and object positions governed by the same verb.
  - Otherwise, take the two most recent mentions (this case is usually reached when there is an error in the dependency parsing).
• The system described in Lee et al. (2011) (denoted LEE), which performs some light m-anaphor resolution (solely for conjunctive cases).
• The two variants of the developed method, one using the LCS similarity metric (denoted M-LCS) and the other using the subset tree kernel (M-TREE).
Table 7: Performance results of the M-TREE system on the different classes of m-anaphors.
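The two simplest baselines in the list above, RECENT-k and RANDOM, can be sketched as follows. The function names are hypothetical, and the use of rejection sampling to enforce the at-least-two constraint is an assumption; the text only states that the constraint is imposed.

```python
import random
from typing import List


def recent_k(candidates: List[str], k: int) -> List[str]:
    """RECENT-k: the k mentions closest to the m-anaphor.

    `candidates` is assumed to be in document order, so the most recent
    mentions are at the end of the list.
    """
    return candidates[max(0, len(candidates) - k):]


def random_baseline(candidates: List[str], rng: random.Random, p: float = 0.5) -> List[str]:
    """RANDOM: keep each candidate with probability p, resampling until
    at least two are selected (rejection sampling is an assumption)."""
    if len(candidates) < 2:
        return list(candidates)
    while True:
        chosen = [c for c in candidates if rng.random() < p]
        if len(chosen) >= 2:
            return chosen


window = ["Elizabeth", "Mary", "park", "river"]  # toy candidate window
recent_two = recent_k(window, 2)
random_pick = random_baseline(window, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the random baseline reproducible across runs.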

Results and Discussion
Accuracy results on the test set for each of the systems are given in Table 6. Both proposed systems, M-LCS and M-TREE, outperform all other methods by a substantial margin. The Stanford system (LEE) achieves the highest precision, which is not surprising because it targets conjunctive mentions, which often serve as m-antecedents. Based on the analysis of Section 4, the poor performance of RECENT-2, RECENT-3, and RECENT-4 is expected.
The results for the best-performing system, M-TREE, on the different classes of m-anaphors are given in Table 7. M-TREE outperforms all other systems but exhibits a bias towards 2-anaphors, recent mentions, and mentions coordinated by conjunction. This is not surprising given that such cases are the easiest to resolve.

Full Coreference Resolution
For the complete coreference resolution task, the M-TREE system can be integrated with an existing coreference system. For this experiment, we pair the full coreference resolution system of Clark and Manning (2015) with M-TREE, and we raise the prediction threshold of our model to 0.89, at which point precision on the validation set is 78.9. Moreover, we restrict ourselves to the P&P portion of the test set, given that the Scribner stories only have gold labels for instances of they and them. The Clark and Manning (2015) system is first run over the test set, producing coreference chains which are then filtered for character entities using the approach of Vala et al. (2015). Our adjusted M-TREE system is then applied over all they and them mentions. Each such mention predicted as m-anaphoric is added to the coreference chains of the entities corresponding to the m-antecedent mentions.
Table 8: CoNLL metric scores of the Clark and Manning (2015) system, with (CLARK+M-TREE) and without (CLARK) the pairing with M-TREE.
To evaluate the accuracy against the gold mention clusters, each m-anaphoric they and them is added to each cluster containing a gold m-antecedent. The CoNLL metric scores (Bagga and Baldwin, 1998) of the coreference predictions are shown in Table 8, with the integrated system outperforming the Clark and Manning (2015) system by 0.6 average score (pairing the Clark and Manning (2015) system instead with an oracle m-anaphor resolver yields an average score of 44.8, an increase of 6.7 points).
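The chain update used in this experiment, for both the predicted and the gold clusters, amounts to adding the m-anaphor to every chain that contains one of its m-antecedents. A minimal sketch follows, assuming chains are represented as sets of mention strings; the function name and the toy chains are illustrative only.

```python
from typing import Iterable, List, Set


def add_m_anaphor(chains: List[Set[str]], anaphor: str,
                  antecedents: Iterable[str]) -> List[Set[str]]:
    """Return copies of the chains, with the anaphor added to each chain
    that contains at least one of its m-antecedents."""
    antecedent_set = set(antecedents)
    updated = [set(chain) for chain in chains]
    for chain in updated:
        if chain & antecedent_set:
            chain.add(anaphor)
    return updated


# Toy chains for Example (1): one chain per entity.
chains = [{"Elizabeth", "she"}, {"Mary"}, {"the park"}]
augmented = add_m_anaphor(chains, "they", ["Elizabeth", "Mary"])
```

After the update, they belongs to both the Elizabeth chain and the Mary chain, so the chains no longer partition the mentions. This is the violation of chain separation noted in the Introduction.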

Related Work
The formal problem statement for the noun phrase anaphor resolution we propose is an extension of the standard ACE (Doddington et al., 2004), MUC (Hirschman, 1997), and OntoNotes (Hovy et al., 2006) formulations, as well as the problem settings outlined in Wiseman et al. (2015) and Durrett and Klein (2013), to allow anaphors to link to multiple antecedents. Most previous works impose the constraint that anaphors can be assigned at most one antecedent. Some works cast the coreference resolution problem in an Integer Linear Programming framework, with an explicit constraint for assigning at most one antecedent to an anaphor (Peng et al., 2015; Denis et al., 2007).
The early work of Ingria and Stallard (1989) proposes the resolution of pronouns without the restriction that they be linked to at most one antecedent. The method uses an indexing scheme for parse trees, similar to Hobbs' algorithm (Hobbs, 1978), that eliminates candidate antecedents as more information is acquired. Those pronouns with multiple candidates remaining after tree traversal are predicted as m-anaphors.
The method considers each parse tree in isolation, and hence does not permit inter-sentential linking, a severe limitation in corpora such as the one offered in this work.
The Lee et al. (2011) system is a deterministic system that attempts to resolve the "easy" multi-antecedent cases, namely those in which mentions are joined by some conjunction. Our system goes beyond this and attempts to predict more difficult cases as well.
Many of the individual features we employ in our model appear in a variety of other coreference systems, especially those involving mention-pair models (Durrett and Klein, 2013; Recasens et al., 2013b; Stoyanov et al., 2010). Recasens et al. (2013a) attempt to perform coreference resolution under conditions where many standard features for coreference are not suited. Peng et al. (2015) resort to corpus counts of predicates as features, much in the same way we obtain counts of mention pairings according to simple predicates on dependency structures.
The system of Clark and Manning (2015) also makes use of agglomerative clustering, although it is employed in merging coreference chains rather than candidate antecedent groupings.
Last, resorting to an external corpus for sentence structures is common practice in the Natural Language Generation literature for producing phrases that are coherent and consistent (Krishnamoorthy et al., 2013; Bangalore and Rambow, 2000; Langkilde and Knight, 1998).

Conclusion
We introduced a new class of anaphors to the anaphor resolution problem, m-anaphors, and extended the problem formulation to incorporate them. We offered insights into the linguistic behaviour of m-anaphors, finding that surface-level syntactic and semantic features do not carry enough discriminative power in distinguishing them from 1-anaphors. Furthermore, we developed a system combining a mention-pair model, an existing coreference resolver, and corpus knowledge to resolve m-anaphors that scores higher than a number of baseline methods. Finally, we paired this system with a coreference resolver to solve the general coreference resolution task, showing that m-anaphor prediction can help boost performance.
An important component of the m-anaphor resolution problem that falls outside the scope of this study, but is important for practical application, is the detection of m-anaphoric mentions. Section 4 gives some insight into the problem but a much deeper investigation is necessary to devise a detection method.
Moreover, for simplicity, this study focused solely on m-anaphoric they and them mentions, but as explained earlier, m-anaphoric mentions can take many forms, each introducing its own particular complexities that warrant special attention.
Regarding the system developed for m-anaphor resolution, resorting to an external corpus to obtain well-formed sentences proved to be very computationally expensive. In future work, we look to incorporate methods that incur less cost, possibly tolerating some error in the formation of sentences without significantly degrading performance. Also, negation of group membership is a complex linguistic phenomenon that was handled in a crude manner in our system. We look to devote future work to handling such cases.
To promote further research into m-anaphors, we make all our data and software freely available at http://www.github.com/networkdynamics/manaphor-acl2016.