Learning Global Features for Coreference Resolution

There is compelling evidence that coreference prediction would benefit from modeling global information about entity-clusters. Yet, state-of-the-art performance can be achieved with systems treating each mention prediction independently, which we attribute to the inherent difficulty of crafting informative cluster-level features. We instead propose to use recurrent neural networks (RNNs) to learn latent, global representations of entity clusters directly from their mentions. We show that such representations are especially useful for the prediction of pronominal mentions, and can be incorporated into an end-to-end coreference system that outperforms the state of the art without requiring any additional search.


Introduction
While structured, non-local coreference models would seem to hold promise for avoiding many common coreference errors (as discussed further in Section 3), the results of employing such models in practice are decidedly mixed, and state-of-the-art results can be obtained using a completely local, mention-ranking system.
In this work, we posit that global context is indeed necessary for further improvements in coreference resolution, but argue that informative cluster-level (rather than mention-level) features are very difficult to devise, which limits their effectiveness. Accordingly, we instead propose to learn representations of mention clusters by embedding them sequentially using a recurrent neural network (shown in Section 4). Our model has no manually defined cluster features, but instead learns a global representation from the individual mentions present in each cluster. We incorporate these representations into a mention-ranking style coreference system.
The entire model, including the recurrent neural network and the mention-ranking sub-system, is trained end-to-end on the coreference task. We train the model as a local classifier with fixed context (that is, as a history-based model). As such, unlike several recent approaches, which may require complicated inference during training, we are able to train our model in much the same way as a vanilla mention-ranking model.
Experiments compare the use of learned global features to several strong baseline systems for coreference resolution. We demonstrate that the learned global representations capture important underlying information that can help resolve difficult pronominal mentions, which remain a persistent source of errors for modern coreference systems (Durrett and Klein, 2013; Kummerfeld and Klein, 2013; Wiseman et al., 2015). Our final system improves by over 0.8 points in CoNLL score over the current state of the art, and the improvement is statistically significant on all three CoNLL metrics.

Background and Notation
Coreference resolution is fundamentally a clustering task. Given a sequence $(x_n)_{n=1}^{N}$ of (intra-document) mentions, that is, syntactic units that can refer or be referred to, coreference resolution involves partitioning $(x_n)$ into a sequence of clusters $(X^{(m)})_{m=1}^{M}$ such that all the mentions in any particular cluster $X^{(m)}$ refer to the same underlying entity. Since the mentions within a particular cluster may be ordered linearly by their appearance in the document, we will use the notation $X^{(m)}_j$ to refer to the $j$'th mention in the $m$'th cluster.
A valid clustering places each mention in exactly one cluster, and so we may represent a clustering with a vector $z \in \{1, \ldots, M\}^N$, where $z_n = m$ iff $x_n$ is a member of $X^{(m)}$. Coreference systems attempt to find the best clustering $z^* \in \mathcal{Z}$ under some scoring function, with $\mathcal{Z}$ the set of valid clusterings.
One strategy to avoid the computational intractability associated with predicting an entire clustering $z$ is to instead predict a single antecedent for each mention $x_n$; because $x_n$ may not be anaphoric (and therefore have no antecedents), a "dummy" antecedent $\epsilon$ may also be predicted. This strategy is adopted by "mention-ranking" systems (Denis and Baldridge, 2008; Rahman and Ng, 2009; Durrett and Klein, 2013), which, formally, predict an antecedent $\hat{y} \in \mathcal{Y}(x_n)$ for each mention $x_n$, where $\mathcal{Y}(x_n) = \{1, \ldots, n-1, \epsilon\}$. Through transitivity, these decisions induce a clustering over the document.
Mention-ranking systems make their antecedent predictions with a local scoring function $f(x_n, y)$ defined for any mention $x_n$ and any antecedent $y \in \mathcal{Y}(x_n)$. While such a scoring function clearly ignores much structural information, the mention-ranking approach has been attractive for at least two reasons. First, inference is relatively simple and efficient, requiring only a left-to-right pass through a document's mentions during which a mention's antecedents (as well as $\epsilon$) are scored and the highest scoring antecedent is predicted. Second, from a linguistic modeling perspective, mention-ranking models learn a scoring function that requires a mention $x_n$ to be compatible with only one of its coreferent antecedents. This contrasts with mention-pair models (e.g., Bengtson and Roth (2008)), which score all pairs of mentions in a cluster, as well as with certain cluster-based models (see discussion in Culotta et al. (2007)). Modeling each mention as having a single antecedent is particularly advantageous for pronominal mentions, which we might like to model as linking to a single nominal or proper antecedent, for example, but not necessarily to all other coreferent mentions.
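The left-to-right inference pass and the clustering it induces by transitivity can be sketched as follows. This is a minimal illustration, not the system's implementation: the function names `rank_mentions` and `antecedents_to_clusters`, the 0-based mention indices, the use of `None` for the dummy antecedent, and the toy string-match scorer are all assumptions made for the example.

```python
from typing import Callable, List, Optional

# None plays the role of the dummy antecedent (non-anaphoric).
Antecedent = Optional[int]


def rank_mentions(num_mentions: int,
                  score: Callable[[int, Antecedent], float]) -> List[Antecedent]:
    """Left-to-right pass: pick the highest-scoring antecedent per mention."""
    predictions: List[Antecedent] = []
    for n in range(num_mentions):
        candidates: List[Antecedent] = list(range(n)) + [None]
        predictions.append(max(candidates, key=lambda y: score(n, y)))
    return predictions


def antecedents_to_clusters(predictions: List[Antecedent]) -> List[int]:
    """Turn antecedent links into cluster ids z by transitivity."""
    z: List[int] = []
    next_cluster = 0
    for y in predictions:
        if y is None:      # non-anaphoric: the mention starts a new cluster
            z.append(next_cluster)
            next_cluster += 1
        else:              # anaphoric: the mention joins its antecedent's cluster
            z.append(z[y])
    return z


# Toy local scorer (hypothetical): link identical strings, otherwise prefer
# the dummy antecedent.
mentions = ["Linda", "she", "the media", "she"]


def toy_score(n: int, y: Antecedent) -> float:
    if y is None:
        return 0.0
    return 1.0 if mentions[n] == mentions[y] else -1.0


preds = rank_mentions(len(mentions), toy_score)
clusters = antecedents_to_clusters(preds)
```

Each mention is compared only against individual antecedents, which is exactly the locality that the global term discussed later is meant to relax.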
Accordingly, in this paper we attempt to maintain the inferential simplicity and modeling benefits of mention ranking, while allowing the model to utilize global, structural information relating to $z$ in making its predictions. We therefore investigate objective functions of the form

$$\arg\max_{y_1, \ldots, y_N} \sum_{n=1}^{N} f(x_n, y_n) + g(x_n, y_n, z_{1:n-1}),$$

where $g$ is a global function that, in making predictions for $x_n$, may examine (features of) the clustering $z_{1:n-1}$ induced by the antecedent predictions made through $y_{n-1}$.

The Role of Global Features
Here we motivate the use of global features for coreference resolution by focusing on the issues that may arise when resolving pronominal mentions in a purely local way. See Clark and Manning (2015) and Stoyanov and Eisner (2012) for more general motivation for using global models.

Pronoun Problems
Recent empirical work has shown that the resolution of pronominal mentions accounts for a substantial percentage of the total errors made by modern mention-ranking systems. Wiseman et al. (2015) show that on the CoNLL 2012 English development set, almost 59% of mention-ranking precision errors and almost 24% of recall errors involve pronominal mentions. A similar pattern has been reported in a comparison of mention-ranking, mention-pair, and latent-tree models. To see why pronouns can be so problematic, consider a passage from the "Broadcast Conversation" portion of the CoNLL development set (bc/msnbc/0000/018), in which we enclose mentions in brackets and give the same subscript to co-clustered mentions. (This example is also shown in Figure 2.)

This example is typical of Broadcast Conversation, and it is difficult because local systems learn to myopically link pronouns such as [you]$_5$ to other instances of the same pronoun that are close by, such as [you]$_1$. While this is often a reasonable strategy, in this case predicting [you]$_1$ to be an antecedent of [you]$_5$ would result in the prediction of an incoherent cluster, since [you]$_1$ is coreferent with the singular [I]$_1$, and [you]$_5$, as part of the phrase "all of you," is evidently plural. Thus, while there is enough information in the text to correctly predict [you]$_5$, doing so crucially depends on having access to the history of predictions made so far, and it is precisely this access to history that local models lack.

More empirically, there are non-local statistical regularities involving pronouns we might hope models could exploit. For instance, in the CoNLL training data over 70% of pleonastic "it" instances and over 74% of pleonastic "you" instances follow (respectively) previous pleonastic "it" and "you" instances. Similarly, over 78% of referential "I" instances and over 68% of referential "he" instances corefer with previous "I" and "he" instances, respectively.
Accordingly, we might expect non-local models with access to global features to perform significantly better. However, models incorporating non-local features have a rather mixed track record. For instance, Björkelund and Kuhn (2014) found that cluster-level features improved their results, whereas other work found that they did not. Clark and Manning (2015) found that incorporating cluster-level features beyond those involving the precomputed mention-pair and mention-ranking probabilities that form the basis of their agglomerative clustering coreference system did not improve performance. Furthermore, among recent, state-of-the-art systems, mention-ranking systems (which are completely local) perform at least as well as their more structured counterparts (Durrett and Klein, 2014; Clark and Manning, 2015; Wiseman et al., 2015; Peng et al., 2015).

Issues with Global Features
We believe a major reason for the relative ineffectiveness of global features in coreference problems is that, as noted by Clark and Manning (2015), cluster-level features can be hard to define. Specifically, it is difficult to define discrete, fixed-length features on clusters, which can be of variable size (or shape). As a result, global coreference features tend to be either too coarse or too sparse. Thus, early attempts at defining cluster-level features simply applied the coarse quantifier predicates all, none, most to the mention-level features defined on the mentions (or pairs of mentions) in a cluster (Culotta et al., 2007; Rahman and Ng, 2011). For example, a cluster would have the feature 'most-female=true' if more than half the mentions (or pairs of mentions) in the cluster have a 'female=true' feature.
On the other extreme, Björkelund and Kuhn (2014) define certain cluster-level features by concatenating the mention-level features of a cluster's constituent mentions in order of the mentions' appearance in the document. For example, if a cluster consists, in order, of the mentions (the president, he, he), they would define a cluster-level "type" feature 'C-P-P=true', which indicates that the cluster is composed, in order, of a common noun, a pronoun, and a pronoun. While very expressive, these concatenated features are often quite sparse, since clusters encountered during training can be of any size.
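The two styles of hand-crafted cluster feature described above can be illustrated with a short sketch. The helper names `quantifier_features` and `concatenated_type_feature`, and the dictionary representation of mention-level features, are hypothetical and chosen only for this example.

```python
from typing import Dict, List


def quantifier_features(mentions: List[Dict[str, bool]], name: str) -> Dict[str, bool]:
    """Coarse features: apply all / most / none to one mention-level feature."""
    values = [m.get(name, False) for m in mentions]
    count = sum(values)
    return {
        f"all-{name}": count == len(values),
        f"most-{name}": count > len(values) / 2,
        f"none-{name}": count == 0,
    }


def concatenated_type_feature(mention_types: List[str]) -> str:
    """Sparse feature: concatenate mention types in document order."""
    return "-".join(mention_types)


# A cluster of three mentions, two of which carry female=true.
cluster = [{"female": True}, {"female": True}, {"female": False}]
coarse = quantifier_features(cluster, "female")

# (the president, he, he) -> common noun, pronoun, pronoun
sparse = concatenated_type_feature(["C", "P", "P"])
```

The coarse features discard both order and counts, while the concatenated feature produces a new string for nearly every cluster shape, which is the sparsity problem noted above.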

Learning Global Features
To circumvent the aforementioned issues with defining global features, we propose to learn cluster-level feature representations implicitly, by identifying the state of a (partial) cluster with the hidden state of an RNN that has consumed the sequence of mentions composing the (partial) cluster. Before providing technical details, we provide some preliminary evidence that such learned representations capture important contextual information by displaying in Figure 1 the learned final states of all clusters in the CoNLL development set, projected using t-SNE (van der Maaten and Hinton, 2012). Each point in the visualization represents the learned features for an entity cluster, and the head words of mentions are shown for representative points. Note that the model learns to roughly separate clusters by simple distinctions such as predominant type (nominal, proper, pronominal) and number (it, they, etc.), but also captures more subtle relationships such as grouping geographic terms and long strings of pronouns.

Recurrent Neural Networks
A recurrent neural network is a parameterized non-linear function RNN that recursively maps an input sequence of vectors to a sequence of hidden states. Let $(m_j)_{j=1}^{J}$ be a sequence of $J$ input vectors $m_j \in \mathbb{R}^D$, and let $h_0 = 0$. Applying an RNN to any such sequence yields

$$h_j \leftarrow \mathrm{RNN}(m_j, h_{j-1}; \theta),$$

where $\theta$ is the set of parameters for the model, which are shared over time.
There are several varieties of RNN, but by far the most commonly used in natural-language processing is the Long Short-Term Memory network (LSTM) (Hochreiter and Schmidhuber, 1997), particularly for language modeling (e.g., Zaremba et al. (2014)) and machine translation. We use LSTMs in all experiments.
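To make the recurrence concrete, the sketch below runs a plain tanh (Elman-style) RNN over a short input sequence. It is a deliberately simplified stand-in for the LSTM used in the experiments; the parameter names `W_in`, `W_rec`, and `b`, and the toy values, are assumptions made for illustration.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(W: Matrix, v: Vector) -> Vector:
    """Matrix-vector product on plain Python lists."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def rnn_step(m: Vector, h_prev: Vector,
             W_in: Matrix, W_rec: Matrix, b: Vector) -> Vector:
    """One recurrence step: h_j = tanh(W_in m_j + W_rec h_{j-1} + b)."""
    a = matvec(W_in, m)
    r = matvec(W_rec, h_prev)
    return [math.tanh(ai + ri + bi) for ai, ri, bi in zip(a, r, b)]


def run_rnn(inputs: List[Vector],
            W_in: Matrix, W_rec: Matrix, b: Vector) -> List[Vector]:
    """Map an input sequence to its hidden states, starting from h_0 = 0."""
    h = [0.0] * len(b)
    states = []
    for m in inputs:
        h = rnn_step(m, h, W_in, W_rec, b)  # the same parameters at every step
        states.append(h)
    return states


# Toy parameters: identity input weights, no recurrence, zero bias.
W_in = [[1.0, 0.0], [0.0, 1.0]]
W_rec = [[0.0, 0.0], [0.0, 0.0]]
b = [0.0, 0.0]
states = run_rnn([[0.0, 0.0], [1.0, -1.0]], W_in, W_rec, b)
```

An LSTM replaces `rnn_step` with a gated update that also carries a memory cell, but the interface (input vector and previous state in, new state out) is the same.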

RNNs for Cluster Features
Our main contribution will be to utilize RNNs to produce feature representations of entity clusters, which will provide the basis of the global term $g$. Recall that we view a cluster $X^{(m)}$ as a sequence of mentions $(X^{(m)}_j)_{j=1}^{J}$ (ordered in linear document order). We therefore propose to embed the state(s) of $X^{(m)}$ by running an RNN over the cluster in order.
In order to run an RNN over the mentions we need an embedding function $h_c$ to map a mention to a real vector. First, following Wiseman et al. (2015), define $\phi_a(x_n) : \mathcal{X} \to \{0, 1\}^F$ as a standard set of local indicator features on a mention, such as its head word, its gender, and so on. (We elaborate on features below.) We then use a non-linear feature embedding $h_c$ to map a mention $x_n$ to a vector-space representation. In particular, we define

$$h_c(x_n) = \tanh(W_c \, \phi_a(x_n) + b_c),$$

where $W_c$ and $b_c$ are parameters of the embedding. We will refer to the $j$'th hidden state of the RNN corresponding to $X^{(m)}$ as $h^{(m)}_j$, and we obtain it according to the following formula:

$$h^{(m)}_j \leftarrow \mathrm{RNN}(h_c(X^{(m)}_j), h^{(m)}_{j-1}; \theta).$$

Thus, we will effectively run an RNN over each (sequence of mentions corresponding to a) cluster $X^{(m)}$ in the document, and thereby generate a hidden state $h^{(m)}_j$ corresponding to each step of each cluster in the document. Concretely, this can be implemented by maintaining $M$ RNNs, one for each cluster, that all share the parameters $\theta$. The process is illustrated in the top portion of Figure 2.
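One way to picture "M RNNs that share parameters" is a single step function applied to a table of per-cluster hidden states. The sketch below assumes a hypothetical `ClusterEncoder` class and takes the step function as an argument; the toy step used here is a simple blend, not an LSTM.

```python
from typing import Callable, Dict, List

Vector = List[float]


class ClusterEncoder:
    """Keeps one hidden state per cluster; every cluster uses the same step."""

    def __init__(self, step: Callable[[Vector, Vector], Vector], dim: int):
        self.step = step    # shared recurrence (shared parameters)
        self.dim = dim
        self.states: Dict[int, Vector] = {}

    def state(self, m: int) -> Vector:
        """Current state of cluster m (the zero vector if it has no mentions)."""
        return self.states.get(m, [0.0] * self.dim)

    def add_mention(self, m: int, embedded_mention: Vector) -> Vector:
        """Advance cluster m's state with one more embedded mention."""
        self.states[m] = self.step(embedded_mention, self.state(m))
        return self.states[m]


def toy_step(x: Vector, h: Vector) -> Vector:
    """Toy recurrence (assumption): average the input with the previous state."""
    return [0.5 * (xi + hi) for xi, hi in zip(x, h)]


encoder = ClusterEncoder(toy_step, dim=2)
first = encoder.add_mention(1, [1.0, 0.0])
second = encoder.add_mention(1, [1.0, 0.0])
untouched = encoder.state(2)
```

Only the state table grows with the number of clusters; the parameters inside the step function do not.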

Coreference with Global Features
We now describe how the RNN defined above is used within an end-to-end coreference system.

Full Model and Training
Recall that our inference objective is to maximize the score of both a local mention-ranking term and a global term based on the current clusters:

$$\arg\max_{y_1, \ldots, y_N} \sum_{n=1}^{N} f(x_n, y_n) + g(x_n, y_n, z_{1:n-1}).$$

We begin by defining the local model $f(x_n, y)$ with the two-layer neural network of Wiseman et al. (2015), which has a specialization for the non-anaphoric case, as follows:

$$f(x_n, y) = \begin{cases} u^{\top} \begin{bmatrix} h_a(x_n) \\ h_p(x_n, y) \end{bmatrix} + u_0 & \text{if } y \neq \epsilon \\ v^{\top} h_a(x_n) + v_0 & \text{if } y = \epsilon \end{cases}$$

[Figure 2: There are currently four entity clusters in scope, $X^{(1)}, X^{(2)}, X^{(3)}, X^{(4)}$, based on unseen previous decisions ($y$). Each cluster has a corresponding RNN state, two of which ($h^{(1)}$ and $h^{(4)}$) have processed multiple mentions (with $X^{(1)}$ notably including a singular mention [I]). At the bottom, we show the complete mention-ranking process. Each previous mention is considered as an antecedent, and the global term considers the antecedent clusters' current hidden state. Selecting $\epsilon$ is treated with a special case $\mathrm{NA}(x_n)$.]
Above, $u$ and $v$ are the parameters of the model, and $h_a$ and $h_p$ are learned feature embeddings of the local mention context and the pairwise affinity between a mention and an antecedent, respectively. These feature embeddings are defined similarly to $h_c$, as

$$h_a(x_n) = \tanh(W_a \, \phi_a(x_n) + b_a), \qquad h_p(x_n, y) = \tanh(W_p \, \phi_p(x_n, y) + b_p),$$

where $\phi_a$ (mentioned above) and $\phi_p$ are "raw" (that is, unconjoined) features on the context of $x_n$ and on the pairwise affinity between mention $x_n$ and antecedent $y$, respectively (Wiseman et al., 2015). Note that $h_a$ and $h_c$ use the same raw features; only their weights differ.

We now specify our global scoring function $g$ based on the history of previous decisions. Define $h^{(m)}_{<n}$ as the hidden state of cluster $m$ before a decision is made for $x_n$; that is, $h^{(m)}_{<n}$ is the state of cluster $m$'s RNN after it has consumed all mentions in the cluster preceding $x_n$. We define $g$ as

$$g(x_n, y, z_{1:n-1}) = \begin{cases} h_c(x_n)^{\top} h^{(z_y)}_{<n} & \text{if } y \neq \epsilon \\ \mathrm{NA}(x_n) & \text{if } y = \epsilon \end{cases}$$

where NA gives a score for assigning $\epsilon$ based on a non-linear function of all of the current hidden states:

$$\mathrm{NA}(x_n) = q^{\top} \tanh\left( W_s \begin{bmatrix} \phi_a(x_n) \\ \sum_{m=1}^{M} h^{(m)}_{<n} \end{bmatrix} + b_s \right).$$

See Figure 2 for a diagram. The intuition behind the first case in $g$ is that in considering whether $y$ is a good antecedent for $x_n$, we add a term to the score that examines how well $x_n$ matches with the mentions already in $X^{(z_y)}$; this matching score is expressed via a dot-product. In the second case, when predicting that $x_n$ is non-anaphoric, we add the NA term to the score, which examines the (sum of) the current states $h^{(m)}_{<n}$ of all clusters. This information is useful both because it allows the non-anaphoric score to incorporate information about potential antecedents, and because the occurrence of certain singleton clusters often predicts the occurrence of future singleton clusters, as noted in Section 3.
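The two cases of the global term can be sketched in a few lines. The dot-product case follows the description above directly. The exact parameterization of the NA term here (a single tanh layer over the raw mention features concatenated with the summed cluster states, with hypothetical parameters `W_s`, `b_s`, and `q`) is an assumption made for illustration, as are the function names and the use of `None` for the dummy antecedent.

```python
import math
from typing import Dict, List, Optional

Vector = List[float]
Matrix = List[Vector]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def na_term(phi_a: Vector, cluster_states: Dict[int, Vector], dim: int,
            W_s: Matrix, b_s: Vector, q: Vector) -> float:
    """Non-anaphoric score from the mention features and summed cluster states."""
    summed = [sum(h[i] for h in cluster_states.values()) for i in range(dim)]
    layer_input = phi_a + summed          # concatenation of the two vectors
    hidden = [math.tanh(dot(row, layer_input) + b) for row, b in zip(W_s, b_s)]
    return dot(q, hidden)


def global_score(h_c_xn: Vector, y: Optional[int], z: List[int],
                 cluster_states: Dict[int, Vector], na: float) -> float:
    """g: dot product with the antecedent's cluster state, or the NA score."""
    if y is None:                         # dummy antecedent
        return na
    return dot(h_c_xn, cluster_states[z[y]])


# Toy setup: three earlier mentions in two clusters (0-based indices).
z = [0, 1, 0]
cluster_states = {0: [1.0, 0.0], 1: [0.0, 2.0]}
na = na_term([1.0], cluster_states, dim=2,
             W_s=[[1.0, 0.0, 0.0]], b_s=[0.0], q=[2.0])
link_score = global_score([0.5, 0.5], 1, z, cluster_states, na)
new_score = global_score([0.5, 0.5], None, z, cluster_states, na)
```

Note that every antecedent in the same cluster receives the same global score, so the local term alone decides among them.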
The whole system is trained end-to-end on coreference using backpropagation. For a given training document, let $z^{(o)}$ be the oracle mapping from mention to cluster, which induces an oracle clustering. While at training time we do have oracle clusters, we do not have oracle antecedents $(y_n)_{n=1}^{N}$, so following past work we treat the oracle antecedent as latent (Yu and Joachims, 2009; Fernandes et al., 2012; Chang et al., 2013; Durrett and Klein, 2013). We train with the following slack-rescaled margin objective:

$$\sum_{n=1}^{N} \max_{\hat{y} \in \mathcal{Y}(x_n)} \Delta(x_n, \hat{y}) \left( 1 + f(x_n, \hat{y}) + g(x_n, \hat{y}, z^{(o)}) - f(x_n, y_n^{\ell}) - g(x_n, y_n^{\ell}, z^{(o)}) \right),$$

where the latent antecedent $y_n^{\ell}$ is defined as

$$y_n^{\ell} = \arg\max_{y \in \mathcal{Y}(x_n) \,:\, z^{(o)}_y = z^{(o)}_n} f(x_n, y) + g(x_n, y, z^{(o)})$$

if $x_n$ is anaphoric, and is $\epsilon$ otherwise. The term $\Delta(x_n, \hat{y})$ gives different weight to different error types. We use a $\Delta$ with 3 different weights $(\alpha_1, \alpha_2, \alpha_3)$ for "false link" (FL), "false new" (FN), and "wrong link" (WL) mistakes (Durrett and Klein, 2013), which correspond to predicting an antecedent when non-anaphoric, predicting non-anaphoric when anaphoric, and predicting the wrong antecedent, respectively.

Note that in training we use the oracle clusters $z^{(o)}$. Since these are known a priori, we can precompute all the hidden states $h^{(m)}_j$ in a document, which makes training quite simple and efficient. This approach contrasts in particular with the work of Björkelund and Kuhn (2014), who also incorporate global information in mention-ranking, in that they train against latent trees, which are not annotated and must be searched for during training. On the other hand, training on oracle clusters leads to a mismatch between training and test, which can hurt performance.
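For a single mention, the slack-rescaled margin loss with a latent antecedent can be sketched as below. The sketch assumes that the combined scores f + g for every candidate have already been computed and are passed in as a dictionary; the function names, the 0-based indexing, the use of `None` for the dummy antecedent, and the default cost weights (0.5, 1.2, 1), taken from the experimental settings reported later, are choices made for this illustration.

```python
from typing import Dict, List, Optional, Tuple

Antecedent = Optional[int]


def cost(pred: Antecedent, n: int, gold: List[int],
         alpha: Tuple[float, float, float] = (0.5, 1.2, 1.0)) -> float:
    """Delta(x_n, y_hat): weights for false link, false new, wrong link."""
    anaphoric = any(gold[j] == gold[n] for j in range(n))
    if pred is None:
        return alpha[1] if anaphoric else 0.0      # false new
    if not anaphoric:
        return alpha[0]                            # false link
    return 0.0 if gold[pred] == gold[n] else alpha[2]  # wrong link


def mention_loss(n: int, scores: Dict[Antecedent, float], gold: List[int]) -> float:
    """Slack-rescaled margin loss for mention n given scores[y] = f + g."""
    correct = [y for y in range(n) if gold[y] == gold[n]]
    # Latent antecedent: the highest-scoring correct antecedent, or the dummy.
    latent: Antecedent = max(correct, key=lambda y: scores[y]) if correct else None
    return max(cost(y, n, gold) * (1.0 + scores[y] - scores[latent])
               for y in scores)


gold = [0, 1, 0]  # oracle cluster id of each mention

# Mention 2 is anaphoric; mention 1 (wrong cluster) currently scores highest.
loss_anaphoric = mention_loss(2, {0: 1.0, 1: 2.0, None: 0.5}, gold)

# Mention 1 is non-anaphoric; linking it to mention 0 is a false link.
loss_new = mention_loss(1, {0: 0.3, None: 1.0}, gold)
```

Because the latent antecedent itself has zero cost, the maximum is never negative, and the loss is zero once every incorrect candidate is separated from the latent antecedent by the margin.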

Search
When moving from a strictly local objective to one with global features, the test-time search problem becomes intractable. The local objective requires $O(n^2)$ time, whereas the full clustering problem is NP-hard. Past work with global features has used integer linear programming solvers for exact search (Chang et al., 2013; Peng et al., 2015), or beam search with (delayed) early update training for an approximate solution (Björkelund and Kuhn, 2014). In contrast, we simply use greedy search at test time, which also requires $O(n^2)$ time. The full algorithm is shown in Algorithm 1. The greedy search algorithm is identical to a simple mention-ranking system, with the exception of line 11, which updates the current RNN representation based on the previous decision that was made, and line 4, which then uses this cluster representation as part of scoring.

Algorithm 1 Greedy search with global RNNs
 1: procedure GREEDYCLUSTER($x_1, \ldots, x_N$)
 2:     Initialize clusters $X^{(1)}, \ldots$ as empty lists, hidden states $h^{(0)}, \ldots$ as 0 vectors in $\mathbb{R}^D$, $z$ as map from mention to cluster, and cluster counter $M \leftarrow 0$
 3:     for $n = 2 \ldots N$ do
 4:         $y^* \leftarrow \arg\max_{y \in \mathcal{Y}(x_n)} f(x_n, y) + g(x_n, y, z_{1:n-1})$
 5:         $m \leftarrow z_{y^*}$
 6:         if $y^* = \epsilon$ then
 7:             $M \leftarrow M + 1$
 8:             $m \leftarrow M$
 9:         append $x_n$ to $X^{(m)}$
10:         $z_n \leftarrow m$
11:         $h^{(m)} \leftarrow \mathrm{RNN}(h_c(x_n), h^{(m)})$
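The greedy procedure can be sketched in Python as follows. This is an illustration under several assumptions rather than the released implementation: the scoring, embedding, and recurrence functions are passed in as arguments, mentions are indexed from 0 (so the first mention is processed too and always opens a cluster), `None` stands for the dummy antecedent, and the toy components in the usage example (one-hot embeddings, an additive step, constant scores) are hypothetical.

```python
from typing import Callable, Dict, List, Optional

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def greedy_cluster(
    mentions: List[str],
    local_score: Callable[[int, Optional[int]], float],
    embed: Callable[[str], Vector],
    step: Callable[[Vector, Vector], Vector],
    na_score: Callable[[int, Dict[int, Vector]], float],
    dim: int,
) -> List[int]:
    """Greedy left-to-right clustering with one recurrent state per cluster."""
    z: List[int] = []                  # cluster id of each mention
    states: Dict[int, Vector] = {}     # current hidden state of each cluster
    for n, mention in enumerate(mentions):
        x = embed(mention)
        # Score the dummy antecedent first, then every previous mention.
        best_y: Optional[int] = None
        best = local_score(n, None) + na_score(n, states)
        for y in range(n):
            s = local_score(n, y) + dot(x, states[z[y]])
            if s > best:
                best_y, best = y, s
        # Open a new cluster or join the antecedent's cluster.
        m = len(states) if best_y is None else z[best_y]
        # Update that cluster's state with the newly added mention.
        states[m] = step(x, states.get(m, [0.0] * dim))
        z.append(m)
    return z


# Toy run: one-hot embeddings and an additive "recurrence".
vocab = ["I", "you", "it"]
z = greedy_cluster(
    ["I", "you", "I", "it"],
    local_score=lambda n, y: 0.0,
    embed=lambda w: [1.0 if w == v else 0.0 for v in vocab],
    step=lambda x, h: [xi + hi for xi, hi in zip(x, h)],
    na_score=lambda n, states: 0.5,
    dim=len(vocab),
)
```

Each mention is scored against at most n candidates and triggers one state update, so the pass remains quadratic in the number of mentions.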

Methods
We run experiments on the CoNLL 2012 English shared task (Pradhan et al., 2012). The task uses the OntoNotes corpus (Hovy et al., 2006), consisting of 3,493 documents in various domains and formats. We use the experimental split provided in the shared task. For all experiments, we use the Berkeley Coreference System (Durrett and Klein, 2013) for mention extraction and to compute features $\phi_a$ and $\phi_p$.
Features. We use the raw BASIC+ feature sets described by Wiseman et al. (2015), with the following modifications:
• We remove all features from $\phi_p$ that concatenate a feature of the antecedent with a feature of the current mention, such as bi-head features.
• We add true-cased head features, a current speaker indicator feature, and a 2-character …
• We add features indicating if a mention has a substring overlap with the current speaker ($\phi_p$ and $\phi_a$), and if an antecedent has a substring overlap with a speaker distinct from the current mention's speaker ($\phi_p$).
• We add a single centered, rescaled document position feature to each mention when learning $h_c$. We calculate a mention $x_n$'s rescaled document position as $\frac{2n - N - 1}{N - 1}$.
These modifications result in there being approximately 14K distinct features in $\phi_a$ and approximately 28K distinct features in $\phi_p$, which is far fewer features than has been typical in past work.

We also experimented with training approaches and model variants that expose the model to its own predictions (Daumé III et al., 2009; Ross et al., 2011; Bengio et al., 2015), but found that these yielded a negligible performance improvement.

[Table 1: Compared systems include Clark and Manning (2015), Peng et al. (2015), and Wiseman et al. (2015). $F_1$ gains are significant ($p < 0.05$ under the bootstrap resample test (Koehn, 2004)) compared with Wiseman et al. (2015) for all metrics.]
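Reading the rescaled document position as (2n − N − 1)/(N − 1), with mentions numbered 1 to N, the feature maps the first mention to −1 and the last to +1. A minimal sketch follows; the function name and the handling of single-mention documents are assumptions.

```python
def rescaled_position(n: int, N: int) -> float:
    """Centered, rescaled position of mention n (1-based) among N mentions."""
    if N == 1:
        return 0.0  # assumption: a lone mention sits at the center
    return (2 * n - N - 1) / (N - 1)


positions = [rescaled_position(n, 5) for n in range(1, 6)]
```

For a five-mention document this gives evenly spaced values from −1 to 1, so the feature is comparable across documents of different lengths.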
For training, we use document-size minibatches, which allows for efficient pre-computation of RNN states, and we minimize the loss described in Section 5 with AdaGrad (Duchi et al., 2011) (after clipping LSTM gradients to lie (elementwise) in (−10, 10)). We find that the initial learning rate chosen for AdaGrad has a significant impact on results, and we choose learning rates for each layer out of {0.1, 0.02, 0.01, 0.002, 0.001}.
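The optimizer described above combines elementwise gradient clipping with the AdaGrad update. The sketch below shows the standard AdaGrad rule on plain Python lists; the function names and the stabilizing constant `eps` are assumptions, and the clip here includes the interval endpoints.

```python
import math
from typing import List


def clip(grads: List[float], bound: float = 10.0) -> List[float]:
    """Clip each gradient component to [-bound, bound]."""
    return [max(-bound, min(bound, g)) for g in grads]


def adagrad_update(params: List[float], grads: List[float],
                   accum: List[float], lr: float, eps: float = 1e-8) -> None:
    """In-place AdaGrad step: scale each update by the root of summed squares."""
    for i, g in enumerate(grads):
        accum[i] += g * g
        params[i] -= lr * g / (math.sqrt(accum[i]) + eps)


params = [1.0, -2.0]
accum = [0.0, 0.0]
grads = clip([25.0, -0.5])      # the first component is clipped to 10.0
adagrad_update(params, grads, accum, lr=0.1)
```

On the first step every nonzero component moves by roughly the learning rate regardless of gradient scale, which is one reason the initial learning rate matters so much for this optimizer.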
In experiments, we set $h_a(x_n)$, $h_c(x_n)$, and $h^{(m)}$ to be in $\mathbb{R}^{200}$, and $h_p(x_n, y) \in \mathbb{R}^{700}$. We use a single-layer LSTM (without "peep-hole" connections), as implemented in the element-rnn library (Léonard et al., 2015). For regularization, we apply Dropout (Srivastava et al., 2014) with a rate of 0.4 before applying the linear weights $u$, and we also apply Dropout with a rate of 0.3 to the LSTM states before forming the dot-product scores.

Following Wiseman et al. (2015) we use the cost weights $\alpha = (0.5, 1.2, 1)$ in defining $\Delta$, and we use their pre-training scheme as well. For final results, we train on both training and development portions of the CoNLL data. Scoring uses the official CoNLL 2012 script. Code for our system is available at https://github.com/swiseman/nn_coref. The system makes use of a GPU for training, and trains in about two hours.

Results
In Table 1 we present our main results on the CoNLL English test set, and compare with other recent state-of-the-art systems. We see a statistically significant improvement of over 0.8 CoNLL points over the previous state of the art, and the highest $F_1$ scores to date on all three CoNLL metrics.

We now consider in more detail the impact of global features and RNNs on performance. For these experiments, we report MUC, B$^3$, and CEAF$_e$ $F_1$ scores in Table 2, as well as errors broken down by mention type and by whether the mention is anaphoric or not in Table 3. The error categories in Table 3 (FL, FN, and WL) are defined in Section 5.1. We typically think of FL and WL as representing precision errors, and FN as representing recall errors. Our experiments consider several different settings.
First, we consider an oracle setting ("RNN, OH" in tables), in which the model receives $z^{(o)}_{1:n-1}$, the oracle partial clustering of all mentions preceding $x_n$ in the document, and is therefore not forced to rely on its own past predictions when predicting $x_n$. This provides us with an upper bound on the performance achievable with our model. Next, we consider the performance of the model under a greedy inference strategy (RNN, GH), as in Algorithm 1. Finally, for baselines we consider the mention-ranking system (MR) of Wiseman et al. (2015) using our updated feature set, as well as a non-local baseline with oracle history (Avg, OH), which averages the representations $h_c(x_j)$ for all $x_j \in X^{(m)}$, rather than feeding them through an RNN; errors are still backpropagated through the $h_c$ representations during learning.
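The averaging baseline differs from the RNN in one structural respect that a short sketch makes visible: the average of a cluster's mention embeddings does not depend on the order of the mentions. The function name and toy vectors below are hypothetical.

```python
from typing import List

Vector = List[float]


def average_state(embedded_mentions: List[Vector], dim: int) -> Vector:
    """Cluster representation for the Avg baseline: mean of mention embeddings."""
    if not embedded_mentions:
        return [0.0] * dim
    count = len(embedded_mentions)
    return [sum(v[i] for v in embedded_mentions) / count for i in range(dim)]


cluster = [[1.0, 0.0], [3.0, 2.0]]
forward = average_state(cluster, dim=2)
backward = average_state(list(reversed(cluster)), dim=2)
```

Both orders give the same vector, whereas a recurrent state generally changes when the sequence is reordered.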
In Table 3 we see that the RNN improves performance overall, with the most dramatic improvements on non-anaphoric pronouns, though errors are also decreased significantly for non-anaphoric nominal and proper mentions that follow at least one mention with the same head. While WL errors also decrease for both these mention categories under the RNN model, FN errors increase. Importantly, the RNN performance is significantly better than that of the Avg baseline, which barely improves over mention-ranking, even with oracle history. This suggests that modeling the sequence of mentions in a cluster is advantageous. We also note that while RNN performance degrades in both precision and recall when moving from the oracle history upper bound to a greedy setting, we are still able to recover a significant portion of the possible performance improvement.

Qualitative Analysis
In this section we consider in detail the impact of the g term in the RNN scoring function on the two error categories that improve most under the RNN model (as shown in Table 3), namely, pronominal WL errors and pronominal FL errors. We consider an example from the CoNLL development set in each category on which the baseline MR model makes an error but the greedy RNN model does not.
The example in Figure 3 involves the resolution of the ambiguous pronoun "his," which is bracketed and in bold in the figure. Whereas the baseline MR model incorrectly predicts "his" to corefer with the closest gender-consistent antecedent "Justin", thus making a WL error, the greedy RNN model correctly predicts "his" to corefer with "Mr. Kaye" in the previous sentence. (Note that "the official" also refers to Mr. Kaye.) To get a sense of the greedy RNN model's decision-making on this example, we color the mentions the greedy RNN model has predicted to corefer with "Mr. Kaye" in green, and the mentions it has predicted to corefer with "Justin" in blue. (Note that the model incorrectly predicts the initial "I" mentions to corefer with "Justin.") Letting $X^{(1)}$ refer to the blue cluster, $X^{(2)}$ refer to the green cluster, and $x_n$ refer to the ambiguous mention "his," we further shade each mention $x_j$ in $X^{(1)}$ so that its intensity corresponds to $h_c(x_n)^{\top} h^{(1)}_{<k}$, where $k = j + 1$; mentions in $X^{(2)}$ are shaded analogously. Thus, the shading shows how highly $g$ scores the compatibility between "his" and a cluster $X^{(i)}$ as each of $X^{(i)}$'s mentions is added. We see that when the initial "Justin" mentions are added to $X^{(1)}$ the $g$-score is relatively high. However, after "The company" is correctly predicted to corefer with "Justin," the score of $X^{(1)}$ drops, since companies are generally not coreferent with pronouns like "his."

[Figure 4: Magnitudes of gradients of NA score applied to bold "It's" with respect to final mention in three preceding clusters. See text for full description.]

Figure 4 shows an example (consisting of a telephone conversation between "A" and "B") in which the bracketed pronoun "It's" is being used pleonastically. Whereas the baseline MR model predicts "It's" to corefer with a previous "it", thus making a FL error, the greedy RNN model does not. In Figure 4 the final mention in three preceding clusters is shaded so its intensity corresponds to the magnitude of the gradient of the NA term in $g$ with respect to that mention. This visualization resembles the "saliency" technique of Li et al. (2016), and it attempts to give a sense of the contribution of a (preceding) cluster in the calculation of the NA score.
We see that the potential antecedent "S-Bahn" has a large gradient, but also that the initial, obviously pleonastic use of "it's" has a large gradient, which may suggest that earlier, easier predictions of pleonasm can inform subsequent predictions.

Related Work
In addition to the related work noted throughout, we add supplementary references here. Unstructured approaches to coreference typically divide into mention-pair models, which classify (nearly) every pair of mentions in a document as coreferent or not (Soon et al., 2001; Ng and Cardie, 2002; Bengtson and Roth, 2008), and mention-ranking models, which select a single antecedent for each anaphoric mention (Denis and Baldridge, 2008; Rahman and Ng, 2009; Durrett and Klein, 2013; Chang et al., 2013; Wiseman et al., 2015). Structured approaches typically divide between those that induce a clustering of mentions (McCallum and Wellner, 2003; Culotta et al., 2007; Poon and Domingos, 2008; Haghighi and Klein, 2010; Stoyanov and Eisner, 2012; Cai and Strube, 2010), and, more recently, those that learn a latent tree of mentions (Fernandes et al., 2012; Björkelund and Kuhn, 2014). There have also been structured approaches that merge the mention-ranking and mention-pair ideas in some way. For instance, Rahman and Ng (2011) rank clusters rather than mentions; Clark and Manning (2015) use the output of both mention-ranking and mention-pair systems to learn a clustering.
The application of RNNs to modeling (the trajectory of) the state of a cluster is apparently novel, though it bears some similarity to the recent work of Dyer et al. (2015), who use LSTMs to embed the state of a transition-based parser's stack.

Conclusion
We have presented a simple, state-of-the-art approach to incorporating global information in an end-to-end coreference system, which obviates the need to define global features, and moreover allows for simple (greedy) inference. Future work will examine improving recall, and more sophisticated approaches to global training.