End-to-end Neural Coreference Resolution

We introduce the first end-to-end coreference resolution model and show that it significantly outperforms all previous work without using a syntactic parser or hand-engineered mention detector. The key idea is to directly consider all spans in a document as potential mentions and learn distributions over possible antecedents for each. The model computes span embeddings that combine context-dependent boundary representations with a head-finding attention mechanism. It is trained to maximize the marginal likelihood of gold antecedent spans from coreference clusters and is factored to enable aggressive pruning of potential mentions. Experiments demonstrate state-of-the-art performance, with a gain of 1.5 F1 on the OntoNotes benchmark and by 3.1 F1 using a 5-model ensemble, despite the fact that this is the first approach to be successfully trained with no external resources.


Introduction
We present the first state-of-the-art coreference resolution model that is learned end-to-end given only gold mention clusters.All recent coreference models, including neural approaches that achieved impressive performance gains (Wiseman et al., 2016;Clark and Manning, 2016b,a), rely on syntactic parsers, both for head-word features and as the input to carefully hand-engineered mention proposal algorithms.We demonstrate for the first time that these resources are not required, and in fact performance can be improved significantly without them, by training an end-to-end neural model that jointly learns which spans are entity mentions and how to best cluster them.
Our model reasons over the space of all spans up to a maximum length and directly optimizes the marginal likelihood of antecedent spans from gold coreference clusters.It includes a span-ranking model that decides, for each span, which of the previous spans (if any) is a good antecedent.At the core of our model are vector embeddings representing spans of text in the document, which combine context-dependent boundary representations with a head-finding attention mechanism over the span.The attention component is inspired by parser-derived head-word matching features from previous systems (Durrett and Klein, 2013), but is less susceptible to cascading errors.In our analyses, we show empirically that these learned attention weights correlate strongly with traditional headedness definitions.
Scoring all span pairs in our end-to-end model is impractical, since the complexity would be quartic in the document length.Therefore we factor the model over unary mention scores and pairwise antecedent scores, both of which are simple functions of the learned span embedding.The unary mention scores are used to prune the space of spans and antecedents, to aggressively reduce the number of pairwise computations.
Our final approach outperforms existing models by 1.5 F1 on the OntoNotes benchmark and by 3.1 F1 using a 5-model ensemble.It is not only accurate, but also relatively interpretable.The model factors, for example, directly indicate whether an absent coreference link is due to low mention scores (for either span) or a low score from the mention ranking component.The head-finding attention mechanism also reveals which mentioninternal words contribute most to coreference decisions.We leverage this overall interpretability to do detailed quantitative and qualitative analyses, providing insights into the strengths and weaknesses of the approach.
Machine learning methods have a long history in coreference resolution (see Ng (2010) for a detailed survey).However, the learning problem is challenging and, until very recently, handengineered systems built on top of automatically produced parse trees (Raghunathan et al., 2010) outperformed all learning approaches.Durrett and Klein (2013) showed that highly lexical learning approaches reverse this trend, and more recent neural models (Wiseman et al., 2016;Clark and Manning, 2016b,a) have achieved significant performance gains.However, all of these models still use parsers for head features and include highly engineered mention proposal algorithms.1Such pipelined systems suffer from two major drawbacks: (1) parsing mistakes can introduce cascading errors and (2) many of the handengineered rules do not generalize to new languages or domains.We present the first nonpipelined results, while providing further performance gains.

Task
We formulate the task of end-to-end coreference resolution as a set of decisions for every possible span in the document.The input is a document D containing T words along with metadata such as speaker and genre information.
Let N = T (T +1) 2 be the number of possible text spans in D. Denote the start and end indices of a span i in D respectively by START(i) and END(i), for 1 ≤ i ≤ N .We assume an ordering of the spans based on START(i); spans with the same start index are ordered by END(i).
The task is to assign to each span i an antecedent y i .The set of possible assignments for each y i is Y(i) = {ǫ, 1, . . ., i − 1}, a dummy antecedent ǫ and all preceding spans.True antecedents of span i, i.e. span j such that 1 ≤ j ≤ i − 1, represent coreference links between i and j.The dummy antecedent ǫ represents two possible scenarios: (1) the span is not an entity mention or (2) the span is an entity mention but it is not coreferent with any previous span.These decisions implicitly define a final clustering, which can be recovered by grouping all spans that are connected by a set of antecedent predictions.

Model
We aim to learn a conditional probability distribution P (y 1 , . . ., y N | D) whose most likely configuration produces the correct clustering.We use a product of multinomials for each span: where s(i, j) is a pairwise score for a coreference link between span i and span j in document D. We omit the document D from the notation when the context is unambiguous.There are three factors for this pairwise coreference score: (1) whether span i is a mention, (2) whether span j is a mention, and (3) whether j is an antecedent of i: Here s m (i) is a unary score for span i being a mention, and s a (i, j) is pairwise score for span j being an antecedent of span i.
By fixing the score of the dummy antecedent ǫ to 0, the model predicts the best scoring antecedent if any non-dummy scores are positive, and it abstains if they are all negative.
A challenging aspect of this model is that its size is O(T 4 ) in the document length.As we will see in Section 5, the above factoring enables aggressive pruning of spans that are unlikely to belong to a coreference cluster according the mention score s m (i).Scoring Architecture We propose an end-toend neural architecture that computes the above scores given the document and its metadata.
At the core of the model are vector representations g i for each possible span i, which we describe in detail in the following section.Given these span representations, the scoring functions above are computed via standard feed-forward neural networks: where • denotes the dot product, • denotes element-wise multiplication, and FFNN denotes a feed-forward neural network that computes a nonlinear mapping from input to output vectors.
The antecedent scoring function s a (i, j) includes explicit element-wise similarity of each span g i • g j and a feature vector φ(i, j) encoding speaker and genre information from the metadata and the distance between the two spans.
Span Representations Two types of information are crucial to accurately predicting coreference links: the context surrounding the mention span and the internal structure within the span.
We use a bidirectional LSTM (Hochreiter and Schmidhuber, 1997) to encode the lexical information of both the inside and outside of each span.We also include an attention mechanism over words in each span to model head words.
We assume vector representations of each word {x 1 , . . ., x T }, which are composed of fixed pretrained word embeddings and 1-dimensional convolution neural networks (CNN) over characters (see Section 7.1 for details) To compute vector representations of each span, we first use bidirectional LSTMs to encode every word in its context: indicates the directionality of each LSTM, and x * t is the concatenated output of the bidirectional LSTM.We use independent LSTMs for every sentence, since cross-sentence context was not helpful in our experiments.
Syntactic heads are typically included as features in previous systems (Durrett and Klein, 2013;Clark and Manning, 2016b,a).Instead of relying on syntactic parses, our model learns a taskspecific notion of headedness using an attention mechanism (Bahdanau et al., 2014) over words in each span: where xi is a weighted sum of word vectors in span i.The weights a i,t are automatically learned and correlate strongly with traditional definitions of head words as we will see in Section 9.2.The above span information is concatenated to produce the final representation g i of span i: This generalizes the recurrent span representations recently proposed for questionanswering (Lee et al., 2016), which only include the boundary representations x * START(i) and x * END(i) .We introduce the soft head word vector xi and a feature vector φ(i) encoding the size of span i.

Inference
The size of the full model described above is O(T 4 ) in the document length T .To maintain computation efficiency, we prune the candidate spans greedily during both training and evaluation.
We only consider spans with up to L words and compute their unary mention scores s m (i) (as defined in Section 4).To further reduce the number of spans to consider, we only keep up to λT spans with the highest mention scores and consider only up to K antecedents for each.We also enforce non-crossing bracketing structures with a simple suppression scheme. 2 We accept spans in decreasing order of the mention scores, unless, when considering span i, there exists a previously accepted span j such that START(i) < START(j) ≤ Despite these aggressive pruning strategies, we maintain a high recall of gold mentions in our experiments (over 92% when λ = 0.4).
For the remaining mentions, the joint distribution of antecedents for each document is computed in a forward pass over a single computation graph.The final prediction is the clustering produced by the most likely configuration.

Learning
In the training data, only clustering information is observed.Since the antecedents are latent, we optimize the marginal log-likelihood of all correct antecedents implied by the gold clustering: where GOLD(i) is the set of spans in the gold cluster containing span i.If span i does not belong to a gold cluster or all gold antecedents have been pruned, GOLD(i) = {ǫ}.
By optimizing this objective, the model naturally learns to prune spans accurately.While the initial pruning is completely random, only gold mentions receive positive updates.The model can quickly leverage this learning signal for appropriate credit assignment to the different factors, such as the mention scores s m used for pruning.
Fixing score of the dummy antecedent to zero removes a spurious degree of freedom in the overall model with respect to mention detection.It also prevents the span pruning from introducing noise.For example, consider the case where span i has a single gold antecedent that was pruned, so GOLD(i) = {ǫ}.The learning objective will only correctly push the scores of non-gold antecedents lower, and it cannot incorrectly push the score of the dummy antecedent higher.
This learning objective can be considered a span-level, cost-insensitive analog of the learning objective proposed by Durrett and Klein (2013).We experimented with these cost-sensitive alternatives, including margin-based variants (Wiseman et al., 2015;Clark and Manning, 2016a), but a simple maximum-likelihood objective proved to be most effective.

Hyperparameters
Word representations The word embeddings are a fixed concatenation of 300-dimensional GloVe embeddings (Pennington et al., 2014) and 50-dimensional embeddings from Turian et al. (2010), both normalized to be unit vectors.Outof-vocabulary words are represented by a vector of zeros.In the character CNN, characters are represented as learned 8-dimensional embeddings.
The convolutions have window sizes of 3, 4, and 5 characters, each consisting of 50 filters.

Hidden dimensions
The hidden states in the LSTMs have 200 dimensions.Each feedforward neural network consists of two hidden layers with 150 dimensions and rectified linear units (Nair and Hinton, 2010).
Pruning We prune the spans such that the maximum span width L = 10, the number of spans per word λ = 0.4, and the maximum number of antecedents K = 250.During training, documents are randomly truncated to up to 50 sentences.
Learning We use ADAM (Kingma and Ba, 2014) for learning with a minibatch size of 1.The LSTM weights are initialized with random orthonormal matrices as described in Saxe et al. (2013).We apply 0.5 dropout to the word embeddings and character CNN outputs.We apply 0.2 dropout to all hidden layers and feature embeddings.Dropout masks are shared across timesteps to preserve long-distance information as described in Gal and Ghahramani (2016).The learning rate is decayed by 0.1% every 100 steps.The model is trained for up to 150 epochs, with early stopping based on the development set.
All code is implemented in Tensor-Flow (Abadi et al., 2015) and is publicly available.

Ensembling
We also report ensemble experiments using five models trained with different random initializations.Ensembling is performed for both the span pruning and antecedent decisions.At test time, we first average the mention scores s m (i) over each model before pruning the spans.Given the same pruned spans, each model then computes the antecedent scores s a (i, j) separately, and they are averaged to produce the final scores.

Results
We report the precision, recall, and F1 for the standard MUC, B 3 , and CEAF φ 4 metrics using the official CoNLL-2012 evaluation scripts.The main evaluation is the average F1 of the three metrics.

Coreference Results
Table 1 compares our model to several previous systems that have driven substantial improvements over the past several years on the OntoNotes benchmark.We outperform previous systems in all metrics.In particular, our single model improves the state-of-the-art average F1 by 1.5, and our 5-model ensemble improves it by 3.1.
The most significant gains come from improvements in recall, which is likely due to our end-toend setup.During training, pipelined systems typically discard any mentions that the mention detector misses, which for Clark and Manning (2016a) consists of more than 9% of the labeled mentions in the training data.In contrast, we only discard mentions that exceed our maximum mention width of 10, which accounts for less than 2% of the training mentions.The contribution of joint mention scoring is further discussed in Section 8.3

Ablations
To show the importance of each component in our proposed model, we ablate various parts of the ar-chitecture and report the average F1 on the development set of the data (see Figure 2).
Features The distance between spans and the width of spans are crucial signals for coreference resolution, consistent with previous findings from other coreference models.They contribute 3.8 F1 to the final result.
Word representations Since our word embeddings are fixed, having access to a variety of word embeddings allows for a more expressive model without overfitting.We hypothesis that the different learning objectives of the GloVe and Turian embeddings provide orthogonal information (the former is word-order insensitive while the latter is word-order sensitive).Both embeddings contribute to some improvement in development F1.
The character CNN provides morphological information and a way to backoff for out-ofvocabulary words.Since coreference decisions often involve rare named entities, we see a contribution of 0.9 F1 from character-level modeling.
Metadata Speaker and genre indicators many not be available in downstream applications.We show that performance degrades by 1.4 F1 without them, but is still on par with previous state-of-theart systems that assume access to this metadata.
Head-finding attention Ablations also show a 1.3 F1 degradation in performance without the attention mechanism for finding task-specific heads.As we will see in Section 9.4, the attention mechanism should not be viewed as simply an approximation of syntactic heads.In many cases, it is beneficial to pay attention to multiple words that are useful specifically for coreference but are not traditionally considered to be syntactic heads.

Comparing Span Pruning Strategies
To tease apart the contributions of improved mention scoring and improved coreference decisions, we compare the results of our model with alternate span pruning strategies.In these experiments, we use the alternate spans for both training and evaluation.As shown in Table 3, keeping mention candidates detected by the rule-based system over predicted parse trees (Raghunathan et al., 2010) degrades performance by 1 F1.We also provide oracle experiment results, where we keep exactly the mentions that are present in gold coreference clusters.With oracle mentions, we see an

Analysis
To highlight the strengths and weaknesses of our model, we provide both quantitative and qualitative analyses.In the following discussion, we use predictions from the single model rather than the ensembled model.

Mention Recall
The training data only provides a weak signal for spans that correspond to entity mentions, since singleton clusters are not explicitly labeled.As a by product of optimizing marginal likelihood, our model automatically learns a useful ranking of spans via the unary mention scores from Section 4.
The top spans, according to the mention scores, cover a large portion of the mentions in gold clusters, as shown in Figure 3.Given a similar number of spans kept, our recall is comparable to the rulebased mention detector (Raghunathan et al., 2010) that produces 0.26 spans per word with a recall of 89.2%.As we increase the number of spans per word (λ in Section 5), we observe higher recall but with diminishing returns.In our experiments, keeping 0.4 spans per word results in 92.7% recall in the development data.

Mention Precision
While the training data does not offer a direct measure of mention precision, we can use the gold syntactic structures provided in the data as a proxy.Spans with high mention scores should correspond to syntactic constituents.
In Figure 4, we show the precision of topscoring spans when keeping 0.4 spans per word.For spans with 2-5 words, 75-90% of the predictions are constituents, indicating that the vast majority of the mentions are syntactically plausible.Longer spans, which are all relatively rare, prove more difficult for the model, and precision drops to 46% for spans with 10 words.
(A fire in a Bangladeshi garment factory) has left at least 37 people dead and 100 hospitalized.Most of the deceased were killed in the crush as workers tried to flee (the blaze) in the four-story building.
A fire in (a Bangladeshi garment factory) has left at least 37 people dead and 100 hospitalized.Most of the deceased were killed in the crush as workers tried to flee the blaze in (the four-story building).

2
We are looking for (a region of central Italy bordering the Adriatic Sea).(The area) is mostly mountainous and includes Mt.Corno, the highest peak of the Apennines.(It) also includes a lot of sheep, good clean-living, healthy sheep, and an Italian entrepreneur has an idea about how to make a little money of them.

3
(The flight attendants) have until 6:00 today to ratify labor concessions.(The pilots') union and ground crew did so yesterday. 4

Head Agreement
We also investigate how well the learned head preferences correlate with syntactic heads.For each of the top-scoring spans in the development data that correspond to gold constituents, we compute the word with the highest attention weight.We plot in Figure 4 the proportion of these words that match syntactic heads.Agreement ranges between 68-93%, which is surprisingly high, since no explicit supervision of syntactic heads is provided.The model simply learns from the clustering data that these head words are useful for making coreference decisions.

Qualitative Analysis
Our qualitative analysis in Table 4 highlights the strengths and weaknesses of our model.Each row is a visualization of a single coreference cluster predicted by the model.Bolded spans in parentheses belong to the predicted cluster, and the redness of a word indicates its weight from the headfinding attention mechanism (a i,t in Section 4).

Strengths
The effectiveness of the attention mechanism for making coreference decisions can be seen in Example 1.The model pays attention to fire in the span A fire in a Bangladeshi garment factory, allowing it to successfully predict the coreference link with the blaze.For a subspan of that mention, a Bangladeshi garment factory, the model pays most attention instead to factory, allowing it successfully predict the coreference link with the four-story building.
The task-specific nature of the attention mechanism is also illustrated in Example 4. The model generally pays attention to coordinators more than the content of the coordination, since coordinators, such as and, provide strong cues for plurality.
The model is capable of detecting relatively long and complex noun phrases, such as a region of central Italy bordering the Adriatic Sea in Example 2. It also appropriately pays attention to region, showing that the attention mechanism provides more than content-word classification.The context encoding provided by the bidirectional LSTMs is critical to making informative head word decisions.
Weaknesses A benefit of using neural models for coreference resolution is their ability to use word embeddings to capture similarity between words, a property that many traditional featurebased models lack.While this can dramatically increase recall, as demonstrated in Example 1, it is also prone to predicting false positive links when the model conflates paraphrasing with relatedness or similarity.In Example 3, the model mistakenly predicts a link between The flight attendants and The pilots'.The predicted head words attendants and pilots likely have nearby word embeddings, which is a signal used-and often overused-by the model.The same type of error is made in Example 4, where the model predicts a coreference link between Prince Charles and his new wife Camilla and Charles and Diana, two noncoreferent mentions that are similar in many ways.These mistakes suggest substantial room for improvement with word or span representations that can cleanly distinguish between equivalence, entailment, and alternation.Unsurprisingly, our model does little in the uphill battle of making coreference decisions requiring world knowledge.In Example 5, the model incorrectly decides that them (in the context of let the rescuer locate them) is coreferent with some ships, likely due to plurality cues.However, an ideal model that uses common-sense reasoning would instead correctly infer that a rescuer is more likely to look for the man overboard rather than the ship from which he fell.This type of reasoning would require either (1) models that integrate external sources of knowledge with more complex inference or (2) a vastly larger corpus of training data to overcome the sparsity of these patterns.

Conclusion
We presented a state-of-the-art coreference resolution model that is trained end-to-end for the first time.Our final model ensemble improves performance on the OntoNotes benchmark by over 3 F1 without external preprocessing tools used by previous systems.We showed that our model implicitly learns to generate useful mention candidates from the space of all possible spans.A novel headfinding attention mechanism also learns a taskspecific preference for head words, which we empirically showed correlate strongly with traditional head-word definitions.
While our model substantially pushes the stateof-the-art performance, the improvements are potentially complementary to a large body of work on various strategies to improve coreference resolution, including entity-level inference and incorporating world knowledge, which are important avenues for future work.

Figure 2 :
Figure 2: Second step of our model.Antecedent scores are computed from pairs of span representations.The final coreference score of a pair of spans is computed by summing the mention scores of both spans and their pairwise antecedent score. 3

AvgFigure 3 :
Figure3: Proportion of gold mentions covered in the development data as we increase the number of spans kept per word.Recall is comparable to the mention detector of previous state-ofthe-art systems given the same number of spans.Our model keeps 0.4 spans per word in our experiments, achieving 92.7% recall of gold mentions.

Figure 4 :
Figure4: Indirect measure of mention precision using agreement with gold syntax.Constituency precision: % of unpruned spans matching syntactic constituents.Head word precision: % of unpruned constituents whose syntactic head word matches the most attended word.Frequency: % of gold spans with each width.

Table 1 :
Results on the test set on the English data from the CoNLL-2012 shared task.The final column (Avg.F1) is the main evaluation metric, computed by averaging the F1 of MUC, B 3 , and CEAF φ 4 .We improve state-of-the-art performance by 1.5 F1 for the single model and by 3.1 F1.
Prince Charles and his new wife Camilla) have jumped across the pond and are touring the United States making (their) first stop today in New York.It's Charles' first opportunity to showcase his new wife, but few Americans seem to care.Here's Jeanie Mowth.What a difference two decades make.(Charles and Diana) visited a JC Penney's on the prince's last official US tour.Twenty years later here's the prince with his new wife.5Alsosuch location devices, (some ships) have smoke floats (they) can toss out so the man overboard will be able to use smoke signals as a way of trying to, let the rescuer locate (them).

Table 4 :
Examples predictions from the development data.Each row depicts a single coreference cluster predicted by our model.Bold, parenthesized spans indicate mentions in the predicted cluster.The redness of each word indicates the weight of the head-finding attention mechanism (a i,t in Section 4).