Optimizing Differentiable Relaxations of Coreference Evaluation Metrics

Coreference evaluation metrics are hard to optimize directly as they are non-differentiable functions, not easily decomposable into elementary decisions. Consequently, most approaches optimize objectives only indirectly related to the end goal, resulting in suboptimal performance. Instead, we propose a differentiable relaxation that lends itself to gradient-based optimization, thus bypassing the need for reinforcement learning or heuristic modification of cross-entropy. We show that by modifying the training objective of a competitive neural coreference system, we obtain a substantial gain in performance. This suggests that our approach can be regarded as a viable alternative to using reinforcement learning or more computationally expensive imitation learning.


Introduction
Coreference resolution is the task of identifying all mentions that refer to the same entity in a document. It has been shown to be beneficial in many natural language processing (NLP) applications, including question answering (Hermann et al., 2015) and information extraction (Kehler, 1997), and is often regarded as a prerequisite for any text understanding task.
Coreference resolution can be regarded as a clustering problem: each cluster corresponds to a single entity and consists of all of its mentions in a given text. Consequently, it is natural to evaluate predicted clusters by comparing them with the ones annotated by human experts, and this is exactly what the standard metrics (e.g., MUC, B3, CEAF) do. In contrast, most state-of-the-art systems are optimized to make individual coreference decisions, and such losses are only indirectly related to the metrics.
One way to deal with this challenge is to directly optimize the non-differentiable metrics using reinforcement learning (RL), for example, relying on the REINFORCE policy gradient algorithm (Williams, 1992). However, this approach has not been very successful, which, as suggested by Clark and Manning (2016a), is possibly due to the discrepancy between sampling decisions at training time and choosing the highest-ranking ones at test time. A more successful alternative is using a 'roll-out' stage to associate costs with possible decisions, as in Clark and Manning (2016a), but it is computationally expensive. Imitation learning (Ma et al., 2014b; Clark and Manning, 2015), though also exploiting the metrics, requires access to an expert policy, and exact policies are not directly computable for the metrics of interest.
In this work, we aim at combining the best of both worlds by proposing a simple method that can turn popular coreference evaluation metrics into differentiable functions of model parameters. As we show, these functions can be computed recursively using the scores of individual local decisions, resulting in a simple and efficient estimation procedure. The key idea is to replace non-differentiable indicator functions (e.g., the membership function I(m ∈ S)) with the corresponding posterior probabilities (p(m ∈ S)) computed by the model. Consequently, non-differentiable functions used within the metrics (e.g., the set size function |S| = Σ_m I(m ∈ S)) become differentiable (|S̃| = Σ_m p(m ∈ S)). Though we assume that the scores of the underlying statistical model can be used to define a probability model, we show that this is not a serious limitation. Specifically, as a baseline we use a probabilistic version of the neural mention-ranking model of Wiseman et al. (2015b), which on its own outperforms the original one and achieves performance similar to its global version (Wiseman et al., 2016). Importantly, when we use the introduced differentiable relaxations in training, we observe a substantial gain in performance over our probabilistic baseline. Interestingly, the absolute improvement (+0.52) is higher than the one reported in Clark and Manning (2016a) using RL (+0.05) and the one using reward rescaling[1] (+0.37). This suggests that our method provides a viable alternative to using RL and reward rescaling.
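As a toy illustration of this substitution (a sketch with made-up membership probabilities; in the actual model they are the posteriors defined later):

```python
import numpy as np

# Hard cluster membership over four mentions: I(m ∈ S) is 0/1,
# so |S| = Σ_m I(m ∈ S) is piecewise constant and non-differentiable.
hard_membership = np.array([1.0, 0.0, 1.0, 1.0])
hard_size = hard_membership.sum()        # |S| = 3

# Soft membership: posteriors p(m ∈ S) produced by the model,
# so the relaxed size Σ_m p(m ∈ S) is differentiable in the model parameters.
soft_membership = np.array([0.9, 0.1, 0.8, 0.7])
soft_size = soft_membership.sum()        # |S̃| = 2.5
```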
The outline of our paper is as follows: we introduce our neural resolver baseline and the B3 and LEA metrics in Section 2. Our method to turn a mention-ranking resolver into an entity-centric resolver is presented in Section 3, and the proposed differentiable relaxations in Section 4. Section 5 shows our experimental results.

Neural mention ranking
In this section we introduce neural mention ranking, the framework that underpins current state-of-the-art models (Clark and Manning, 2016a). Specifically, we consider a probabilistic version of the method proposed by Wiseman et al. (2015b), which we use as our baseline in the experiments.
Let (m_1, m_2, ..., m_n) be the list of mentions in a document. For each mention m_i, let a_i ∈ {1, ..., i} be the index of the mention that m_i is coreferent with (if a_i = i, m_i is the first mention of some entity appearing in the document). As is standard in the coreference resolution literature, we refer to m_{a_i} as an antecedent of m_i.[2] Then, in mention ranking the goal is to score antecedents of a mention higher than any other mentions, i.e., if s is the scoring function, we require s(a_i = j) > s(a_i = k) for all j, k such that m_i and m_j are coreferent but m_i and m_k are not.
Let φ_a(m_i) ∈ R^{d_a} and φ_p(m_i, m_j) ∈ R^{d_p} be features of m_i and of the pair (m_i, m_j), respectively. The scoring function is defined by:

    s(a_i = j) = u^T h_a(m_i) + u_0              if j = i
    s(a_i = j) = v^T h_p(m_i, m_j) + v_0         if j < i

where h_a(m_i) = tanh(W_a φ_a(m_i) + b_a) and h_p(m_i, m_j) = tanh(W_p φ_p(m_i, m_j) + b_p); u, v, W_a, W_p, b_a, b_p are real vectors and matrices with proper dimensions, and u_0, v_0 are real scalars.

Unlike Wiseman et al. (2015b), where a max-margin loss is used, we define a probabilistic model. The probability[3] that m_i and m_j are coreferent is given by

    p(a_i = j) = exp(s(a_i = j)) / Σ_{j'=1}^{i} exp(s(a_i = j'))

Following Durrett and Klein (2013), we use the following softmax-margin (Gimpel and Smith, 2010) loss function:

    L(Θ) = -Σ_{i=1}^{n} log Σ_{j ∈ C(m_i)} p'(a_i = j)

where Θ are the model parameters, C(m_i) is the set of indices of correct antecedents of m_i, and p'(a_i = j) ∝ p(a_i = j) e^{∆(j, C(m_i))}. ∆ is a cost function used to manipulate the contribution of different error types to the loss function:

    ∆(j, C(m_i)) = α_1   if j ≠ i and C(m_i) = {i}                ("false anaphor")
                   α_2   if j = i and i ∉ C(m_i)                  ("false new")
                   α_3   if j ≠ i, i ∉ C(m_i), and j ∉ C(m_i)     ("wrong link")
                   0     otherwise                                ("no mistake")

In our experiments, we borrow the cost values from Durrett and Klein (2013): (α_1, α_2, α_3) = (0.1, 3, 1).

[1] Reward rescaling is a technique that computes error values for a heuristic loss function based on the reward difference between the best decision according to the current model and the decision leading to the highest metric score.
[2] This slightly deviates from the definition of antecedents in linguistics (Crystal, 1997).
In the subsequent discussion, we refer to the loss as mention-ranking heuristic cross entropy.
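A minimal sketch of this loss (NumPy; mentions are 0-indexed, the scoring network is abstracted into a given score vector, and the helper names are ours):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def delta(j, i, C, a1=0.1, a2=3.0, a3=1.0):
    """Cost Delta(j, C(m_i)) for the four error types."""
    if j != i and C == {i}:    # false anaphor: linking a non-anaphoric mention
        return a1
    if j == i and i not in C:  # false new: marking an anaphoric mention as new
        return a2
    if j != i and j not in C:  # wrong link: wrong antecedent chosen
        return a3
    return 0.0                 # no mistake

def mention_loss(scores, i, C):
    """Softmax-margin loss for mention m_i.
    scores[j] = s(a_i = j) for j = 0..i; C is the set of indices
    of correct antecedents of m_i (C = {i} if m_i is non-anaphoric)."""
    costs = np.array([delta(j, i, C) for j in range(len(scores))])
    p_prime = softmax(scores + costs)     # p'(a_i = j) ∝ p(a_i = j) e^Δ
    return -np.log(sum(p_prime[j] for j in C))
```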
Evaluation metrics

Five metrics (MUC, B3, CEAF_m, CEAF_e, and BLANC) are widely used for evaluation. However, because MUC is the least discriminative metric (Moosavi and Strube, 2016), whereas CEAF is slow to compute, out of the five most popular metrics we incorporate into our loss only B3. In addition, we integrate LEA, as it has been shown to provide a good balance between discriminativity and interpretability.

Let G = {G_1, G_2, ..., G_N} and S = {S_1, S_2, ..., S_M} be the gold-standard entity set and the entity set given by a resolver, respectively. Recall that an entity is a set of mentions. The recall and precision of the B3 metric are computed by:

    R_B3 = [ Σ_{i=1}^{N} Σ_{j=1}^{M} |G_i ∩ S_j|^2 / |G_i| ] / Σ_{i=1}^{N} |G_i|
    P_B3 = [ Σ_{i=1}^{N} Σ_{j=1}^{M} |G_i ∩ S_j|^2 / |S_j| ] / Σ_{j=1}^{M} |S_j|

The LEA metric is computed as:

    R_LEA = [ Σ_{i=1}^{N} |G_i| × ( Σ_{j=1}^{M} link(G_i ∩ S_j) / link(G_i) ) ] / Σ_{i=1}^{N} |G_i|
    P_LEA = [ Σ_{j=1}^{M} |S_j| × ( Σ_{i=1}^{N} link(G_i ∩ S_j) / link(S_j) ) ] / Σ_{j=1}^{M} |S_j|

where link(E) = |E| × (|E| − 1)/2 is the number of coreference links in entity E. F_β, for both metrics, is defined by:

    F_β = (1 + β^2) × P × R / (β^2 × P + R)

Figure 1: For each mention m_u there is a potential entity E_u such that m_u is the first mention in the chain. Computing p(m_i ∈ E_u), u < i, takes into account all directed paths from m_i to E_u (black arrows). Note that there is no directed path from any m_k, k < u, to E_u, because p(m_k ∈ E_u) = 0. (See text for more details.)
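The two metrics can be sketched directly from these definitions (plain Python; the function names are ours, and for simplicity the LEA sketch assumes no singleton clusters, since link(E) = 0 for a singleton):

```python
def link(E):
    # number of coreference links in entity E: |E| * (|E| - 1) / 2
    return len(E) * (len(E) - 1) // 2

def b3_recall_precision(gold, sys):
    """B3 scores. gold, sys: lists of sets of mention ids."""
    recall = sum(len(G & S) ** 2 / len(G) for G in gold for S in sys) \
             / sum(len(G) for G in gold)
    precision = sum(len(G & S) ** 2 / len(S) for G in gold for S in sys) \
                / sum(len(S) for S in sys)
    return recall, precision

def lea_recall_precision(gold, sys):
    """LEA scores; assumes every cluster has at least two mentions."""
    recall = sum(len(G) * sum(link(G & S) for S in sys) / link(G)
                 for G in gold) / sum(len(G) for G in gold)
    precision = sum(len(S) * sum(link(G & S) for G in gold) / link(S)
                    for S in sys) / sum(len(S) for S in sys)
    return recall, precision

def f_beta(r, p, beta=1.0):
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
```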

From mention ranking to entity centricity
Mention-ranking resolvers do not explicitly provide information about entities/clusters, which is required by B3 and LEA. We therefore propose a simple solution that turns a mention-ranking resolver into an entity-centric one.

First note that in a document containing n mentions, there are n potential entities E_1, E_2, ..., E_n, where E_i has m_i as its first mention. Let p(m_i ∈ E_u) be the probability that mention m_i corresponds to entity E_u. We now show that it can be computed recursively based on p(a_i = j) as follows:

    p(m_i ∈ E_u) = Σ_{j=u}^{i-1} p(a_i = j) × p(m_j ∈ E_u)   if u < i
    p(m_i ∈ E_u) = p(a_i = i)                                 if u = i
    p(m_i ∈ E_u) = 0                                          if u > i

In other words, if u < i, we consider all possible mentions m_j with which m_i can be coreferent, and which can themselves correspond to entity E_u. If u = i, the link to be considered is m_i's self-link. And if u > i, the probability is zero, as it is impossible for m_i to be assigned to an entity introduced only later. See Figure 1 for an illustration.
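The recursion can be implemented in a few lines (a sketch with 0-indexed mentions; p_ante[i][j] stands for p(a_i = j), and the function name is ours):

```python
import numpy as np

def entity_probs(p_ante):
    """Given p_ante[i][j] = p(a_i = j) for j <= i (rows sum to 1),
    return P with P[i, u] = p(m_i in E_u) via the recursion above."""
    n = len(p_ante)
    P = np.zeros((n, n))
    for i in range(n):
        P[i, i] = p_ante[i][i]                 # self-link: m_i starts E_i
        for u in range(i):                     # u < i
            P[i, u] = sum(p_ante[i][j] * P[j, u] for j in range(u, i))
    return P                                   # P[i, u] = 0 for u > i
```

On a toy document the rows sum to one and p(m_u ∈ E_u) dominates p(m_i ∈ E_u) for i > u, matching Propositions 1 and 2 below.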
We now turn to two crucial questions about this formula:

• Is p(m_i ∈ ·) a valid probability distribution?

• Is it possible for a mention m_u to be mostly anaphoric (i.e., p(m_u ∈ E_u) is low) but for the corresponding cluster E_u to be highly probable (i.e., p(m_i ∈ E_u) is high for some i)?
The first question is answered in Proposition 1. The second question is important because, intuitively, when a mention m_u is anaphoric, the potential entity E_u does not exist. We show that the answer is "No" by proving in Proposition 2 that the probability that m_u is non-anaphoric (i.e., p(m_u ∈ E_u)) is always at least as high as any probability that m_i, i > u, refers to E_u.
Proposition 1. For any i, p(m_i ∈ ·) is a valid probability distribution, i.e., Σ_{u=1}^{n} p(m_i ∈ E_u) = 1.

Proof. We prove this proposition by induction. For i = 1, p(m_1 ∈ E_1) = p(a_1 = 1) = 1. Assume that Σ_{u=1}^{n} p(m_j ∈ E_u) = 1 for all j < i. Then,

    Σ_{u=1}^{n} p(m_i ∈ E_u)
        = Σ_{u<i} Σ_{j=u}^{i-1} p(a_i = j) p(m_j ∈ E_u) + p(a_i = i)
        = Σ_{j<i} p(a_i = j) Σ_{u=1}^{j} p(m_j ∈ E_u) + p(a_i = i)
        = Σ_{j<i} p(a_i = j) + p(a_i = i) = 1,

where we swap the order of summation and use the fact that p(m_j ∈ E_u) = 0 for u > j.

Proposition 2. For any u and any i > u, p(m_u ∈ E_u) ≥ p(m_i ∈ E_u).

Proof. We prove this proposition by induction. Assume that p(m_u ∈ E_u) ≥ p(m_j ∈ E_u) for all j with u < j < i. Then,

    p(m_i ∈ E_u) = Σ_{j=u}^{i-1} p(a_i = j) p(m_j ∈ E_u)
                 ≤ p(m_u ∈ E_u) Σ_{j=u}^{i-1} p(a_i = j) ≤ p(m_u ∈ E_u).

Entity-centric heuristic cross entropy loss
Having p(m_i ∈ E_u) computed, we can consider coreference resolution as a multiclass prediction problem. An entity-centric heuristic cross entropy loss is thus given by:

    L_ec(Θ) = -Σ_{i=1}^{n} log p'(m_i ∈ E_{e(m_i)}),

where E_{e(m_i)} is the correct entity that m_i belongs to, and p'(m_i ∈ E_u) ∝ p(m_i ∈ E_u) e^{Γ(u, e(m_i))}. Similarly to ∆ in the mention-ranking heuristic loss in Section 2.1, Γ is a cost function used to manipulate the contribution of the four error types ("false anaphor", "false new", "wrong link", and "no mistake"):

    Γ(u, e(m_i)) = α_1   if u ≠ i and e(m_i) = i
                   α_2   if u = i and e(m_i) ≠ i
                   α_3   if u ≠ i, e(m_i) ≠ i, and u ≠ e(m_i)
                   0     otherwise


Differentiable relaxations

Two functions are used in computing B3 and LEA: the set size function |·| and the link function link(·). Because both of them are non-differentiable, the two metrics are non-differentiable, so we need to make these two functions differentiable. Two remarks are in order. First, both functions can be computed using the indicator function I(m_i ∈ S_u):

    |S_u| = Σ_{i=1}^{n} I(m_i ∈ S_u)
    link(S_u) = Σ_{i<j} I(m_i ∈ S_u) × I(m_j ∈ S_u)

Second, given π_{i,u} = log p(m_i ∈ S_u), the indicator function I(m_i ∈ S_{u*}), u* = arg max_u p(m_i ∈ S_u), is the limit of the following softmax as the temperature T → 0 (see Figure 2):

    p(m_i ∈ S_u; T) = exp(π_{i,u} / T) / Σ_{u'=1}^{n} exp(π_{i,u'} / T)

(Kirkpatrick et al., 1983). Therefore, we propose to represent each S_u as a soft cluster by setting

    π_{i,u} = log p(m_i ∈ E_u),

where, as defined in Section 3, E_u is the potential entity that has m_u as its first mention. Replacing the indicator function I(m_i ∈ S_u) by the probability p(m_i ∈ E_u; T), we obtain differentiable versions of the set size function and the link function:

    |S_u|_d = Σ_{i=1}^{n} p(m_i ∈ E_u; T)
    link_d(S_u) = Σ_{i<j} p(m_i ∈ E_u; T) × p(m_j ∈ E_u; T)

|G_v ∩ S_u|_d and link_d(G_v ∩ S_u) are computed similarly, with the constraint that only mentions in G_v are taken into account. Plugging these functions into the precision and recall of B3 and LEA in Section 2.2, we obtain differentiable F̃_{β,B3} and F̃_{β,LEA}, which are then used in two loss functions:

    L_{β,B3}(Θ) = -F̃_{β,B3} + λ||Θ||_1
    L_{β,LEA}(Θ) = -F̃_{β,LEA} + λ||Θ||_1

where λ is the hyper-parameter of the L1 regularization terms.
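As an illustration, the relaxed B3 F-score (β = 1) can be sketched as follows (NumPy; the function names and the numerical guard 1e-12 are our assumptions, not part of the formulation):

```python
import numpy as np

def soft_clusters(P, T=1.0):
    """Tempered soft assignments q[i, u] = p(m_i in S_u; T), computed
    from pi_{i,u} = log p(m_i in E_u) (rows of P are entity posteriors)."""
    logp = np.log(np.maximum(P, 1e-12))            # guard against log(0)
    e = np.exp((logp - logp.max(axis=1, keepdims=True)) / T)
    return e / e.sum(axis=1, keepdims=True)

def soft_b3_f1(gold, q):
    """Differentiable B3 F1: indicators replaced by soft memberships q."""
    n = q.shape[0]
    sizes = q.sum(axis=0)                          # |S_u|_d
    rec_num = prec_num = 0.0
    for G in gold:
        g = np.zeros(n)
        g[list(G)] = 1.0
        inter = (g[:, None] * q).sum(axis=0)       # |G ∩ S_u|_d
        rec_num += (inter ** 2 / len(G)).sum()
        prec_num += (inter ** 2 / np.maximum(sizes, 1e-12)).sum()
    r = rec_num / sum(len(G) for G in gold)
    p = prec_num / sizes.sum()
    return 2 * p * r / (p + r)
```

At T = 1 the soft assignments coincide with the entity posteriors p(m_i ∈ E_u) themselves, since the tempered softmax of their logs reproduces the distribution; with one-hot assignments the sketch reduces to the exact B3 F_1.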
It is worth noting that, as T → 0, F̃_{β,B3} → F_{β,B3} and F̃_{β,LEA} → F_{β,LEA}.[5] Therefore, when training a model with the proposed losses, we could start at a high temperature (e.g., T = 1) and anneal to a small but non-zero temperature. However, in our experiments we fix T = 1 and leave annealing for future work.
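The limiting behaviour can be checked numerically (a toy distribution; tempered_softmax is our name):

```python
import numpy as np

def tempered_softmax(pi, T):
    # softmax of pi / T, computed stably
    e = np.exp((pi - pi.max()) / T)
    return e / e.sum()

pi = np.log(np.array([0.62, 0.08, 0.30]))   # pi_u = log p(m_i in E_u)
q_warm = tempered_softmax(pi, T=1.0)        # equals the original distribution
q_cold = tempered_softmax(pi, T=0.01)       # nearly one-hot at the argmax
```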

Experiments
We now demonstrate how to use the proposed differentiable B3 and LEA to train a coreference resolver. The source code and trained models are available at https://github.com/lephong/diffmetric_coref.

Resolvers
We build the baseline and three resolvers:

• baseline: the resolver presented in Section 2.1. We use the identical configuration as in Wiseman et al. (2016): W_a ∈ R^{200×d_a}, W_p ∈ R^{700×d_p}, λ = 10^{-6} (where d_a, d_p are respectively the numbers of mention features and pairwise features). We also employ their pretraining methodology.

• L_ec: the resolver using the entity-centric heuristic cross entropy loss presented in Section 3.

• L_{β,B3} and L_{β,LEA}: the resolvers using the losses proposed in Section 4; β is tuned on the development set.

To train these resolvers we use AdaGrad (Duchi et al., 2011) to minimize their loss functions, with the learning rate tuned on the development set and with one-document mini-batches. Note that we use the baseline as the initialization point to train the other three resolvers.
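For reference, one AdaGrad update takes the following form (a generic sketch, not the paper's exact implementation; the learning rate is illustrative):

```python
import numpy as np

def adagrad_step(theta, grad, cache, lr=0.1, eps=1e-8):
    """One AdaGrad update: the accumulated squared gradients (cache)
    give each parameter its own decaying step size."""
    cache = cache + grad ** 2
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache
```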

Results
We first compare our resolvers against Wiseman et al. (2015b) and Wiseman et al. (2016); the results are shown in the first half of Table 1. Our baseline surpasses Wiseman et al. (2015b), most likely owing to the use of features from Wiseman et al. (2016). Using the entity-centric heuristic cross entropy loss and the relaxations is clearly beneficial: L_ec is slightly better than our baseline and on par with the global model of Wiseman et al. (2016). L_{β=1,B3} and L_{β=1,LEA} outperform the baseline, the global model of Wiseman et al. (2016), and L_ec. However, the best values of β are √1.4 and √1.8 for L_{β,B3} and L_{β,LEA}, respectively. Among these resolvers, L_{β=√1.8,LEA} achieves the highest F_1 scores across all the metrics except BLANC.
When comparing to Clark and Manning (2016a) (the second half of Table 1), we can see that the absolute improvement over the baselines (i.e., the 'heuristic loss' for them and the heuristic cross entropy loss for us) is higher than that of reward rescaling, but with a much shorter training time: +0.37 (7 days[7]) versus +0.52 (15 hours) on the CoNLL metric for Clark and Manning (2016a) and for us, respectively. It is worth noting that our absolute scores are weaker than those of Clark and Manning (2016a), as they build on top of a similar but stronger mention-ranking baseline, which employs deeper neural networks and requires a much larger number of epochs to train (300 epochs, including pretraining). For the purpose of illustrating the proposed losses, we started with the simpler model of Wiseman et al. (2015b), which requires a much smaller number of epochs and is thus faster to train (20 epochs, including pretraining).

Table 2 shows the breakdown of errors made by the baseline and our resolvers on the development set. The proposed resolvers make fewer "false anaphor" and "wrong link" errors but more "false new" errors than the baseline. This suggests that the loss optimization prevents over-clustering, driving precision up: when antecedents are difficult to detect, the self-link (i.e., a_i = i) is chosen. When β increases, the resolvers make more "false anaphor" and "wrong link" errors but fewer "false new" errors.

Analysis
In Figure 3(a), the baseline, but neither L_{β=1,B3} nor L_{β=√1.4,B3}, mistakenly links 17 [it] with 13 [the virus]. Under-clustering, on the other hand, is a problem for our resolvers with β = 1: in example (b), L_{β=1,B3} misses 165 [We]. This behaviour reduces recall, but not severely, as we still obtain a better F_1 score. We conjecture that it is a consequence of using the F_1 score in the objective and, if undesirable, F_β with β > 1 can be used instead.

Figure 4 shows recall, precision, and F_1 (the average of MUC, B3, and CEAF_e) on the development set when training with L_{β,B3} and L_{β,LEA}. As expected, higher values of β yield lower precision but higher recall. F_1, in contrast, peaks at intermediate values of β.

Table 2: Number of "false anaphor" (FA: a non-anaphoric mention marked as anaphoric), "false new" (FN: an anaphoric mention marked as non-anaphoric), and "wrong link" (WL: an anaphoric mention linked to a wrong antecedent) errors on the development set.

Discussion
Because the resolvers are evaluated with F_1-score metrics, one would expect L_{β,B3} and L_{β,LEA} to perform best with β = 1. Figure 4 and Table 1, however, do not confirm this: β should be set to values slightly larger than 1. We have two hypotheses. First, a statistical difference between the training set and the development set may mean that the optimal β on one set is suboptimal on the other. Second, in our experiments we fix T = 1, so the relaxations might not be sufficiently close to the true evaluation metrics. To confirm or reject the latter, our future work will use annealing, i.e., gradually decreasing T towards (but not reaching) 0.

Table 1 shows that the difference between L_{β,B3} and L_{β,LEA} in terms of accuracy is not substantial (although the latter is slightly better than the former). One might instead expect L_{β,B3} to outperform L_{β,LEA} on the B3 metric and the other way around on the LEA metric. It turns out that B3 and LEA behave quite similarly in non-extreme cases, as can be seen in Figures 2, 4, 5, 6, and 7 of Moosavi and Strube (2016).

Related work
Mention ranking and entity centricity are two main streams in the coreference resolution literature. Mention ranking (Denis and Baldridge, 2007; Durrett and Klein, 2013; Martschat and Strube, 2015; Wiseman et al., 2015a) makes local and independent decisions when choosing a correct antecedent for a mention. This approach is computationally efficient and currently dominant, with state-of-the-art performance (Wiseman et al., 2016; Clark and Manning, 2016a). Wiseman et al. (2015b) propose to use simple neural networks to compute mention-ranking scores and a heuristic loss to train the model. Wiseman et al. (2016) extend this by employing LSTMs to compute mention-chain representations, which are then used to compute ranking scores; they call these representations global features. Clark and Manning (2016a) build a resolver similar to that of Wiseman et al. (2015b) but much stronger, thanks to deeper neural networks and "better mention detection, more effective hyperparameters, and more epochs of training". Furthermore, using reward rescaling, they achieve the best performance in the literature on the English and Chinese portions of the CoNLL 2012 dataset. Our work builds upon mention ranking by turning a mention-ranking model into an entity-centric one. It is worth noting that, although we use the model proposed by Wiseman et al. (2015b), any mention-ranking model can be employed.
Entity centricity (Wellner and McCallum, 2003; Poon and Domingos, 2008; Haghighi and Klein, 2010; Ma et al., 2014a; Clark and Manning, 2016b), on the other hand, incorporates entity-level information to solve the problem. The approach can be top-down, as in Haghighi and Klein (2010), where a generative model is proposed. It can also be bottom-up, merging smaller clusters into bigger ones, as in Clark and Manning (2016b). The method proposed by Ma et al. (2014a) greedily and incrementally adds mentions to previously built clusters using a prune-and-score technique. Importantly, employing imitation learning, these two methods can optimize the resolvers directly for evaluation metrics. Our work is similar to Ma et al. (2014a) in that our resolvers incrementally add mentions to previously built clusters.
However, unlike both Ma et al. (2014a) and Clark and Manning (2016b), our resolvers do not make any discrete decisions (e.g., merge operations). Instead, they seamlessly compute the probability that a mention refers to an entity from mention-ranking probabilities, and are optimized on differentiable relaxations of evaluation metrics.
Using differentiable relaxations of evaluation metrics, as in our work, is related to a line of research in reinforcement learning where a non-differentiable action-value function is replaced by a differentiable critic (Sutton et al., 1999; Silver et al., 2014). The critic is trained to be as close to the true action-value function as possible. This technique has been applied to machine translation (Gu et al., 2017), where evaluation metrics (e.g., BLEU) are non-differentiable. A disadvantage of using critics is that there is no guarantee that the critic converges to the true evaluation metric given finite training data. In contrast, our differentiable relaxations require no training, and convergence is guaranteed as T → 0.

Conclusions
We have proposed:

• a method for turning any mention-ranking resolver into an entity-centric one, using a recursive formula to combine the scores of individual local decisions, and

• differentiable relaxations of two coreference evaluation metrics, B3 and LEA.
Experimental results show that our approach outperforms the resolver of Wiseman et al. (2016) and yields a larger improvement over the baseline than that of Clark and Manning (2016a), with a much shorter training time.