Higher-Order Coreference Resolution with Coarse-to-Fine Inference

We introduce a fully-differentiable approximation to higher-order inference for coreference resolution. Our approach uses the antecedent distribution from a span-ranking architecture as an attention mechanism to iteratively refine span representations. This enables the model to softly consider multiple hops in the predicted clusters. To alleviate the computational cost of this iterative process, we introduce a coarse-to-fine approach that incorporates a less accurate but more efficient bilinear factor, enabling more aggressive pruning without hurting accuracy. Compared to the existing state-of-the-art span-ranking approach, our model significantly improves accuracy on the English OntoNotes benchmark, while being far more computationally efficient.


Introduction
Recent coreference resolution systems have relied heavily on first-order models (Clark and Manning, 2016a; Lee et al., 2017), where only pairs of entity mentions are scored by the model. These models are computationally efficient and scalable to long documents. However, because they make independent decisions about coreference links, they are susceptible to predicting clusters that are locally consistent but globally inconsistent. Figure 1 shows an example from Wiseman et al. (2016).
Figure 1: Speaker 1: Um and [I] think that is what's - Go ahead Linda. Speaker 2: Well and uh thanks goes to [you] and to the media to help us... So our hat is off to [all of you] as well.
We introduce an approximation of higher-order inference that uses the span-ranking architecture from Lee et al. (2017) in an iterative manner. At each iteration, the antecedent distribution is used as an attention mechanism to optionally update existing span representations, enabling later coreference decisions to softly condition on earlier coreference decisions.
To alleviate computational challenges from this higher-order inference, we also propose a coarse-to-fine approach that is learned with a single end-to-end objective. We introduce a less accurate but more efficient coarse factor in the pairwise scoring function. This additional factor enables an extra pruning step during inference that reduces the number of antecedents considered by the more accurate but inefficient fine factor. Intuitively, the model cheaply computes a rough sketch of likely antecedents before applying a more expensive scoring function.
Our experiments show that both of the above contributions improve the performance of coreference resolution on the English OntoNotes benchmark. We observe a significant increase in average F1 with a second-order model, but returns quickly diminish with a third-order model. Additionally, our analysis shows that the coarse-to-fine approach makes the model performance relatively insensitive to more aggressive antecedent pruning, compared to the distance-based heuristic pruning from previous work.

Background
Task definition We formulate the coreference resolution task as a set of antecedent assignments $y_i$ for each span $i$ in the given document, following Lee et al. (2017). The set of possible assignments for each $y_i$ is $\mathcal{Y}(i) = \{\epsilon, 1, \ldots, i - 1\}$, a dummy antecedent $\epsilon$ and all preceding spans. Non-dummy antecedents represent coreference links between $i$ and $y_i$. The dummy antecedent $\epsilon$ represents two possible scenarios: (1) the span is not an entity mention or (2) the span is an entity mention but it is not coreferent with any previous span. These decisions implicitly define a final clustering, which can be recovered by grouping together all spans that are connected by the set of antecedent predictions.
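The grouping step described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `clusters_from_antecedents` and the use of `None` for the dummy antecedent are our own choices, not part of the original system, and spans that are never linked to anything are simply left out of the output.

```python
def clusters_from_antecedents(antecedents):
    """Group spans into clusters given one predicted antecedent per span.

    antecedents[i] is the index of the antecedent chosen for span i
    (always smaller than i), or None for the dummy antecedent.
    """
    cluster_of = {}  # span index -> cluster id
    clusters = []    # cluster id -> list of span indices
    for i, j in enumerate(antecedents):
        if j is None:
            continue  # dummy: not a mention, or first mention of an entity
        if j >= i:
            raise ValueError("an antecedent must precede its span")
        if j not in cluster_of:
            # j had the dummy antecedent, so it starts a new cluster
            cluster_of[j] = len(clusters)
            clusters.append([j])
        cluster_of[i] = cluster_of[j]
        clusters[cluster_of[i]].append(i)
    return clusters


# Spans 1 and 3 link back to span 0; span 4 links to span 2.
example = clusters_from_antecedents([None, 0, None, 1, 2])
```

Because every antecedent precedes its span, a single left-to-right pass is enough: by the time span i is visited, the cluster of its antecedent is already known.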

Baseline
We describe the baseline model (Lee et al., 2017), which we will improve to address the modeling and computational limitations discussed previously. The goal is to learn a distribution $P(y_i)$ over antecedents for each span $i$:
$$P(y_i) = \frac{e^{s(i, y_i)}}{\sum_{y' \in \mathcal{Y}(i)} e^{s(i, y')}}$$
where $s(i, j)$ is a pairwise score for a coreference link between span $i$ and span $j$. The baseline model includes three factors for this pairwise coreference score: (1) $s_m(i)$, whether span $i$ is a mention, (2) $s_m(j)$, whether span $j$ is a mention, and (3) $s_a(i, j)$, whether $j$ is an antecedent of $i$:
$$s(i, j) = s_m(i) + s_m(j) + s_a(i, j)$$
In the special case of the dummy antecedent, the score $s(i, \epsilon)$ is instead fixed to 0.
A common component used throughout the model is the vector representation $g_i$ of each possible span $i$. These are computed via bidirectional LSTMs (Hochreiter and Schmidhuber, 1997) that learn context-dependent boundary and head representations. The scoring functions $s_m$ and $s_a$ take these span representations as input:
$$s_m(i) = w_m^\top \mathrm{FFNN}_m(g_i)$$
$$s_a(i, j) = w_a^\top \mathrm{FFNN}_a([g_i, g_j, g_i \circ g_j, \phi(i, j)])$$
where $\circ$ denotes element-wise multiplication, FFNN denotes a feed-forward neural network, and the antecedent scoring function $s_a(i, j)$ includes the explicit element-wise similarity of each span pair, $g_i \circ g_j$, and a feature vector $\phi(i, j)$ encoding speaker and genre information from the metadata and the distance between the two spans.
The model above is factored to enable a two-stage beam search. A beam of up to $M$ potential mentions is computed (where $M$ is proportional to the document length) based on the spans with the highest mention scores $s_m(i)$. Pairwise coreference scores are only computed between surviving mentions during both training and inference.
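To make the factored score and the resulting antecedent distribution concrete, here is a minimal sketch in plain Python. It assumes the mention scores and antecedent scores have already been produced by the feed-forward networks; the toy numbers, the function name `antecedent_distribution`, and the list layout (dummy first) are illustrative assumptions rather than the authors' implementation.

```python
import math


def antecedent_distribution(i, s_m, s_a):
    """Softmax over the candidates of span i.

    s_m[k] is the mention score of span k; s_a[i][j] is the antecedent
    score for the pair (i, j) with j < i. Index 0 of the result is the
    dummy antecedent; index j + 1 is span j.
    """
    scores = [0.0]  # the dummy score s(i, epsilon) is fixed to 0
    for j in range(i):
        scores.append(s_m[i] + s_m[j] + s_a[i][j])
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy scores for three spans (hypothetical values).
s_m = [1.0, -2.0, 0.5]
s_a = [[], [0.3], [2.0, -1.0]]
probs = antecedent_distribution(2, s_m, s_a)
```

Because the dummy score is pinned to 0, a span is linked to a real antecedent only when some pairwise score is positive, which is what lets the mention scores act as a pruning signal.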
Given supervision of gold coreference clusters, the model is learned by optimizing the marginal log-likelihood of the possibly correct antecedents. This marginalization is required since the best antecedent for each span is a latent variable.
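As a sketch of this objective for a single span, the marginal log-likelihood is the log of the total probability mass assigned to all correct antecedents. The code below assumes, following our reading of Lee et al. (2017), that a span with no gold antecedent has the dummy (index 0) as its only correct choice; the names `log_sum_exp` and `marginal_log_likelihood` are ours.

```python
import math


def log_sum_exp(values):
    """Numerically stable log of a sum of exponentials."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def marginal_log_likelihood(scores, gold):
    """Log of the probability mass on the gold antecedents of one span.

    scores[k] is the score of candidate k (index 0 is the dummy, fixed to 0).
    gold is a non-empty collection of candidate indices that are correct.
    """
    gold_scores = [scores[k] for k in gold]
    return log_sum_exp(gold_scores) - log_sum_exp(scores)


# Three equally scored candidates, two of them correct: log(2/3).
value = marginal_log_likelihood([0.0, 0.0, 0.0], [1, 2])
```

Summing this quantity over all spans gives the document-level objective; the model is free to place its mass on any one of the correct antecedents.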

Higher-order Coreference Resolution
The baseline above is a first-order model, since it only considers pairs of spans. First-order models are susceptible to consistency errors as demonstrated in Figure 1. Unlike in sentence-level semantics, where higher-order decisions can be implicitly modeled by the LSTMs, modeling these decisions at the document-level requires explicit inference due to the potentially very large surface distance between mentions.
We propose an inference procedure that allows the model to condition on higher-order structures, while being fully differentiable. This inference involves $N$ iterations of refining span representations, denoted as $g_i^n$ for the representation of span $i$ at iteration $n$. At iteration $n$, $g_i^n$ is computed with an attention mechanism that averages over previous representations $g_j^{n-1}$ weighted according to how likely each mention $j$ is to be an antecedent for $i$, as defined below.
The baseline model is used to initialize the span representation at $g_i^1$. The refined span representations allow the model to also iteratively refine the antecedent distributions $P_n(y_i)$:
$$P_n(y_i) = \frac{e^{s(g_i^n, g_{y_i}^n)}}{\sum_{y \in \mathcal{Y}(i)} e^{s(g_i^n, g_y^n)}}$$
where $s$ is the coreference scoring function of the baseline architecture. The scoring function uses the same parameters at every iteration, but it is given different span representations. At each iteration, we first compute the expected antecedent representation $a_i^n$ of each span $i$ by using the current antecedent distribution $P_n(y_i)$ as an attention mechanism:
$$a_i^n = \sum_{y_i \in \mathcal{Y}(i)} P_n(y_i) \cdot g_{y_i}^n$$
The current span representation $g_i^n$ is then updated via interpolation with its expected antecedent representation $a_i^n$:
$$f_i^n = \sigma(W_f [g_i^n, a_i^n])$$
$$g_i^{n+1} = f_i^n \circ g_i^n + (1 - f_i^n) \circ a_i^n$$
The learned gate vector $f_i^n$ determines for each dimension whether to keep the current span information or to integrate new information from its expected antecedent. At iteration $n$, $g_i^n$ is an element-wise weighted average of approximately $n$ span representations (assuming $P_n(y_i)$ is peaked), allowing $P_n(y_i)$ to softly condition on up to $n$ other spans in the predicted cluster.
Span-ranking can be viewed as predicting latent antecedent trees (Fernandes et al., 2012; Martschat and Strube, 2015), where the predicted antecedent is the parent of a span and each tree is a predicted cluster. By iteratively refining the span representations and antecedent distributions, another way to interpret this model is that the joint distribution $\prod_i P_N(y_i)$ implicitly models every directed path of up to length $N + 1$ in the latent antecedent tree.
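The refinement loop described above (score, attend over antecedents, gate, interpolate) can be sketched as follows. This is a toy version under explicit assumptions: the scoring function and the gate are passed in as plain callables standing in for the learned networks, the dummy antecedent is represented by the span's own vector (an assumption of this sketch, not something stated in the text), and all names are ours.

```python
import math


def softmax(xs):
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def refine(g, score, gate, num_updates):
    """Apply the gated attention update num_updates times.

    g: list of span vectors (lists of floats) in document order.
    score(gi, gj) -> float, the pairwise score for a real antecedent.
    gate(gi, ai) -> list of values in [0, 1], one per dimension.
    """
    for _ in range(num_updates):
        new_g = []
        for i, gi in enumerate(g):
            # Antecedent distribution: dummy (score 0) first, then spans 0..i-1.
            probs = softmax([0.0] + [score(gi, g[j]) for j in range(i)])
            # ASSUMPTION: the dummy antecedent contributes g_i itself.
            cands = [gi] + g[:i]
            a = [sum(p * c[d] for p, c in zip(probs, cands))
                 for d in range(len(gi))]
            f = gate(gi, a)
            new_g.append([fd * x + (1.0 - fd) * ad
                          for fd, x, ad in zip(f, gi, a)])
        g = new_g  # all spans are updated from the same previous iteration
    return g


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def half_gate(gi, ai):
    return [0.5] * len(gi)  # stand-in for the learned sigmoid gate


refined = refine([[1.0, 0.0], [0.0, 1.0]], dot, half_gate, num_updates=1)
```

Under this reading, a second-order model (N = 2) corresponds to one update followed by a final scoring pass on the refined representations.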

Coarse-to-fine Inference
The model described above scales poorly to long documents. Despite heavy pruning of potential mentions, the space of possible antecedents for every surviving span is still too large to fully consider. The bottleneck is in the antecedent score $s_a(i, j)$, which requires computing a tensor of size $M \times M \times (3|g| + |\phi|)$.
This computational challenge is even more problematic with the iterative inference from Section 3, which requires recomputing this tensor at every iteration.

Heuristic antecedent pruning
To reduce computation, Lee et al. (2017) heuristically consider only the nearest $K$ antecedents of each span, resulting in a smaller input of size $M \times K \times (3|g| + |\phi|)$.
The main drawback to this solution is that it imposes an a priori limit on the maximum distance of a coreference link. The previous work only considers up to $K = 250$ nearest mentions, whereas coreference links can reach much further in natural language discourse.
Figure 2: Comparison of accuracy on the development set for the two antecedent pruning strategies with various beam sizes $K$ (legend: coarse-to-fine and heuristic, each shown for various $K$ and for the chosen $K$). The distance-based heuristic pruning performance drops by almost 5 F1 when reducing $K$ from 250 to 50, while the coarse-to-fine pruning results in an insignificant drop of less than 0.2 F1.
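A small sketch may help make the distance-based heuristic and its effect on the input size concrete. The sizes below are hypothetical values chosen only to make the arithmetic easy to follow, and both function names are ours.

```python
def nearest_antecedents(i, k):
    """Indices of the (at most) k spans immediately preceding span i."""
    return list(range(max(0, i - k), i))


def pair_input_entries(num_spans, num_antecedents, span_dim, feature_dim):
    """Number of entries in a tensor of size M x K x (3|g| + |phi|)."""
    return num_spans * num_antecedents * (3 * span_dim + feature_dim)


# Hypothetical sizes: M spans, K antecedents, |g| and |phi| dimensions.
M, K, G, PHI = 1000, 250, 100, 20
full = pair_input_entries(M, M, G, PHI)    # every pair of surviving spans
pruned = pair_input_entries(M, K, G, PHI)  # nearest-K antecedents only
ratio = full // pruned                     # equals M / K here
```

The saving is exactly the factor M / K, but any span more than K mentions back can never be selected, which is the limitation discussed above.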

Coarse-to-fine antecedent pruning
We instead propose a coarse-to-fine approach that can be learned end-to-end and does not establish an a priori maximum coreference distance. The key component of this coarse-to-fine approach is an alternate bilinear scoring function:
$$s_c(i, j) = g_i^\top W_c \, g_j$$
where $W_c$ is a learned weight matrix. In contrast to the concatenation-based $s_a(i, j)$, the bilinear $s_c(i, j)$ is far less accurate. A direct replacement of $s_a(i, j)$ with $s_c(i, j)$ results in a performance loss of over 3 F1 in our experiments. However, $s_c(i, j)$ is much more efficient to compute. Computing $s_c(i, j)$ only requires manipulating matrices of size $M \times |g|$ and $M \times M$. Therefore, we instead propose to use $s_c(i, j)$ to compute a rough sketch of likely antecedents. This is accomplished by including it as an additional factor in the model:
$$s(i, j) = s_m(i) + s_m(j) + s_c(i, j) + s_a(i, j)$$
Similar to the baseline model, we leverage this additional factor to perform an additional beam pruning step. The final inference procedure involves a three-stage beam search:
First stage: Keep the top $M$ spans based on the mention score $s_m(i)$ of each span.
Second stage: Keep the top $K$ antecedents of each remaining span $i$ based on the first three factors, $s_m(i) + s_m(j) + s_c(i, j)$.
Third stage: The overall coreference score $s(i, j)$ is computed based on the remaining span pairs. The soft higher-order inference from Section 3 is computed in this final stage.
While the maximum-likelihood objective is computed over only the span pairs from this final stage, this coarse-to-fine approach expands the set of coreference links that the model is capable of learning. It achieves better performance while using a much smaller $K$ (see Figure 2).
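The second pruning stage can be sketched as follows, assuming the span list has already been reduced to the surviving mentions. The bilinear score is written out with explicit loops for clarity; a real implementation would use batched matrix products. The function names, the toy vectors, and the identity weight matrix are illustrative assumptions only.

```python
def bilinear(gi, W, gj):
    """Coarse score g_i^T W g_j, written with explicit loops."""
    return sum(gi[a] * W[a][b] * gj[b]
               for a in range(len(gi)) for b in range(len(gj)))


def coarse_prune(g, s_m, W_c, k):
    """For each span, keep its top-k preceding spans under the coarse score.

    g: span vectors of the surviving mentions, in document order.
    s_m: mention scores aligned with g.
    Returns, per span, the sorted indices of the kept antecedents.
    """
    kept = []
    for i in range(len(g)):
        coarse = [(s_m[i] + s_m[j] + bilinear(g[i], W_c, g[j]), j)
                  for j in range(i)]
        coarse.sort(reverse=True)  # highest coarse score first
        kept.append(sorted(j for _, j in coarse[:k]))
    return kept


# Toy example: span 2 resembles span 0 more than its nearer neighbour, span 1.
g = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
kept = coarse_prune(g, [0.0, 0.0, 0.0], identity, k=1)
```

In the toy example, span 2 keeps span 0 rather than the closer span 1, illustrating how the coarse factor selects candidates by content instead of by distance. The expensive score would then be computed only for the kept pairs.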

Experimental Setup
We use the English coreference resolution data from the CoNLL-2012 shared task (Pradhan et al., 2012) in our experiments. The code for replicating these results is publicly available. Our models reuse the hyperparameters from Lee et al. (2017), with a few exceptions mentioned below. In our results, we report two improvements that are orthogonal to our contributions.
• We used embedding representations from a language model (Peters et al., 2018) at the input to the LSTMs (ELMo in the results).
• We changed several hyperparameters: 1. increasing the maximum span width from 10 to 30 words.
The baseline model considers up to 250 antecedents per span. As shown in Figure 2, the coarse-to-fine model is quite insensitive to more aggressive pruning. Therefore, our final model considers only 50 antecedents per span. On the development set, the second-order model ($N = 2$) outperforms the first-order model by 0.8 F1, but the third-order model only provides an additional 0.1 F1 improvement. Therefore, we only compute test results for the second-order model.

Results
We report the precision, recall, and F1 of the MUC, B$^3$, and CEAF$_{\phi_4}$ metrics using the official CoNLL-2012 evaluation scripts. The main evaluation is the average F1 of the three metrics.
Results on the test set are shown in Table 1. We include performance of systems proposed in the past 3 years for reference. The baseline relative to our contributions is the span-ranking model from Lee et al. (2017) augmented with both ELMo and hyperparameter tuning, which achieves 72.3 F1. Our full approach achieves 73.0 F1, setting a new state of the art for coreference resolution.
Compared to the heuristic pruning with up to 250 antecedents, our coarse-to-fine model only computes the expensive scores $s_a(i, j)$ for 50 antecedents. Despite using far less computation, it outperforms the baseline because the coarse scores $s_c(i, j)$ can be computed for all antecedents, enabling the model to potentially predict a coreference link between any two spans in the document. As a result, we observe a much higher recall when adopting the coarse-to-fine approach.
We also observe further improvement by including the second-order inference (Section 3). The improvement is largely driven by the overall increase in precision, which is expected since the higher-order inference mainly serves to rule out inconsistent clusters. It is also consistent with findings from Martschat and Strube (2015) who report mainly improvements in precision when modeling latent trees to achieve a similar goal.

Related Work
In addition to the end-to-end span-ranking model (Lee et al., 2017) that our proposed model builds upon, there is a large body of literature on coreference resolvers that fundamentally rely on scoring span pairs (Ng and Cardie, 2002; Bengtson and Roth, 2008; Denis and Baldridge, 2008; Fernandes et al., 2012; Durrett and Klein, 2013; Wiseman et al., 2015; Clark and Manning, 2016a).
Motivated by the structural consistency issues discussed above, significant effort has also been devoted to cluster-level modeling. Since global features are notoriously difficult to define (Wiseman et al., 2016), they often depend heavily on existing pairwise features or architectures (Björkelund and Kuhn, 2014; Clark and Manning, 2015, 2016b). We similarly use an existing pairwise span-ranking architecture as a building block for modeling more complex structures. In contrast to Wiseman et al. (2016), who use highly expressive recurrent neural networks to model clusters, we show that the addition of a relatively lightweight gating mechanism is sufficient to effectively model higher-order structures.

Conclusion
We presented a state-of-the-art coreference resolution system that models higher-order interactions between spans in predicted clusters. Additionally, our proposed coarse-to-fine approach alleviates the additional computational cost of higher-order inference, while maintaining the end-to-end learnability of the entire model.