Deep Reinforcement Learning for Mention-Ranking Coreference Models

Coreference resolution systems are typically trained with heuristic loss functions that require careful tuning. In this paper we instead apply reinforcement learning to directly optimize a neural mention-ranking model for coreference evaluation metrics. We experiment with two approaches: the REINFORCE policy gradient algorithm and a reward-rescaled max-margin objective. We find the latter to be more effective, resulting in significant improvements over the current state-of-the-art on the English and Chinese portions of the CoNLL 2012 Shared Task.


Introduction
Coreference resolution systems typically operate by making sequences of local decisions (e.g., adding a coreference link between two mentions).However, most measures of coreference resolution performance do not decompose over local decisions, which means the utility of a particular decision is not known until all other decisions have been made.
Due to this difficulty, coreference systems are usually trained with loss functions that heuristically define the goodness of a particular coreference decision.These losses contain hyperparameters that are carefully selected to ensure the model performs well according to coreference evaluation metrics.This complicates training, especially across different languages and datasets where systems may work best with different settings of the hyperparameters.
To address this, we explore using two variants of reinforcement learning to directly optimize a coreference system for coreference evaluation metrics.In particular, we modify the max-margin coreference objective proposed by Wiseman et al. (2015) by incorporating the reward associated with each coreference decision into the loss's slack rescaling.We also test the REINFORCE policy gradient algorithm (Williams, 1992).
Our model is a neural mention-ranking model.Mention-ranking models score pairs of mentions for their likelihood of coreference rather than comparing partial coreference clusters.Hence they operate in a simple setting where coreference decisions are made independently.Although they are less expressive than entity-centric approaches to coreference (e.g., Haghighi and Klein, 2010), mention-ranking models are fast, scalable, and simple to train, causing them to be the dominant approach to coreference in recent years (Durrett and Klein, 2013;Wiseman et al., 2015).Having independent actions is particularly useful when applying reinforcement learning because it means a particular action's effect on the final reward can be computed efficiently.
We evaluate the models on the English and Chinese portions of the CoNLL 2012 Shared Task.The REINFORCE algorithm is competitive with a heuristic loss function while the reward-rescaled objective significantly outperforms both 1 .We attribute this to reward rescaling being well suited for a ranking task due to its max-margin loss as well as benefiting from directly optimizing for coreference metrics.Error analysis shows that using the reward-rescaling loss results in a similar number of mistakes as the heuristic loss, but the mistakes tend to be less severe.We use the neural mention-ranking model described in Clark and Manning (2016), which we briefly go over in this section.Given a mention m and candidate antecedent c, the mention-ranking model produces a score for the pair s(c, m) indicating their compatibility for coreference with a feedforward neural network.The candidate antecedent may be any mention that occurs before m in the document or NA, indicating that m has no antecedent.
Input Layer.For each mention, the model extracts various words (e.g., the mention's head word) and groups of words (e.g., all words in the mention's sentence) that are fed into the neural network.Each word is represented by a vector w i ∈ R dw .Each group of words is represented by the average of the vectors of each word in the group.In addition to the embeddings, a small number of additional features are used, including distance, string matching, and speaker identification features.See Clark and Manning (2016) for the full set of features and an ablation study.
These features are concatenated to produce an Idimensional vector h 0 , the input to the neural network.If c = NA, features defined over pairs of mentions are not included.For this case, we train a separate network with an identical architecture to the pair network except for the input layer to produce anaphoricity scores.
Hidden Layers.The input gets passed through three hidden layers of rectified linear (ReLU) units (Nair and Hinton, 2010).Each unit in a hidden layer is fully connected to the previous layer: Scoring Layer.The final layer is a fully connected layer of size 1: where W 4 is a 1 × M 3 weight matrix.At test time, the mention-ranking model links each mention with its highest scoring candidate antecedent.

Learning Algorithms
Mention-ranking models are typically trained with heuristic loss functions that are tuned via hyperparameters.These hyperparameters are usually given as costs for different error types, which are used to bias the coreference system towards making more or fewer coreference links.
In this section we first describe a heuristic loss function incorporating this idea from Wiseman et al. (2015).We then propose new training procedures based on reinforcement learning that instead directly optimize for coreference evaluation metrics.

Heuristic Max-Margin Objective
The heuristic loss from Wiseman et al. is governed by the following error types, which were first proposed by Durrett et al. (2013).
Suppose the training set consists of N mentions m 1 , m 2 , ..., m N .Let C(m i ) denote the set of candidate antecedents of a mention m i (i.e., mentions preceding m i and NA) and T (m i ) denote the set of true antecedents of m i (i.e., mentions preceding m i that are coreferent with it or {NA} if m i has no antecedent).Then we define the following costs for linking m i to a candidate antecedent c ∈ C(m i ): for "false new," "false anaphor," "wrong link", and correct coreference decisions.
The heuristic loss is a slack-rescaled max-margin objective parameterized by these error costs.Let ti be the highest scoring true antecedent of m i : Then the heuristic loss is given as Finding Effective Error Penalties.
We fix α WL = 1.0 and search for α FA and α FN out of {0.1, 0.2, ..., 1.5} with a variant of grid search.Each new trial uses the unexplored set of hyperparame-ters that has the closest Manhattan distance to the best setting found so far on the dev set.The search is halted when all immediate neighbors (within 0.1 distance) of the best setting have been explored.We found (α FN , α FA , α WL ) = (0.8, 0.4, 1.0) to be best for English and (α FN , α FA , α WL ) = (0.8, 0.5, 1.0) to be best for Chinese on the CoNLL 2012 data.

Reinforcement Learning
Finding the best hyperparameter settings for the heuristic loss requires training many variants of the model, and at best results in an objective that is correlated with coreference evaluation metrics.To address this, we pose mention ranking in the reinforcement learning framework (Sutton and Barto, 1998) and propose methods that directly optimize the model for coreference metrics.
We can view the mention-ranking model as an agent taking a series of actions a 1:T = a 1 , a 2 , ..., a T , where T is the number of mentions in the current document.Each action a i links the ith mention in the document m i to a candidate antecedent.Formally, we denote the set of actions available for the ith mention as , where an action (c, m) adds a coreference link between mentions c and m.The mentionranking model assigns each action the score s(c, m) and takes the highest-scoring action at each step.
Once the agent has executed a sequence of actions, it observes a reward R(a 1:T ), which can be any function.We use the B 3 coreference metric for this reward (Bagga and Baldwin, 1998).Although our system evaluation also includes the MUC (Vilain et al., 1995) and CEAF φ 4 (Luo, 2005) metrics, we do not incorporate them into the loss because MUC has the flaw of treating all errors equally and CEAF φ 4 is slow to compute.
Reward Rescaling.Crucially, the actions taken by a mention-ranking model are independent.This means it is possible to change any action a i to a different one a i ∈ A i and see what reward the model would have gotten by taking that action instead: R(a 1 , ..., a i−1 , a i , a i+1 , ..., a T ).We use this idea to improve the slack-rescaling parameter ∆ in the maxmargin loss L(θ).Instead of setting its value based on the error type, we compute exactly how much each action hurts the final reward: where a 1:T is the highest scoring sequence of actions according to the model's current parameters.Otherwise the model is trained in the same way as with the heuristic loss.
The REINFORCE Algorithm.We also explore using the REINFORCE policy gradient algorithm (Williams, 1992).We can define a probability distribution over actions using the mention-ranking model's scoring function as follows: ,m)   for any action a = (c, m).The REINFORCE algorithm seeks to maximize the expected reward It does this through gradient ascent.Computing the full gradient is prohibitive because of the expectation over all possible action sequences, which is exponential in the length of the sequence.Instead, it gets an unbiased estimate of the gradient by sampling a sequence of actions a 1:T according to p θ and computing the gradient only over the sample.
We take advantage of the independence of actions by using the following gradient estimate, which has lower variance than the standard REINFORCE gradient estimate: where b i is a baseline used to reduce the variance, which we set to E a i ∈A i ∼p θ R(a 1 , ..., a i , ..., a T ).

Experiments and Results
We run experiments on the English and Chinese portions of the CoNLL 2012 Shared Task data (Pradhan et al., 2012) and evaluate with the MUC, B 3 , and CEAF φ 4 metrics.Our experiments were run using predicted mentions from Stanford's rule-based coreference system (Raghunathan et al., 2010).
We follow the training methodology from Clark and Manning (2016): hidden layers of sizes M 1 = 1000, M 2 = M 3 = 500, the RMSprop optimizer (Hinton and Tieleman, 2012), dropout (Hinton et al., 2012) with a rate of 0.5, and pretraining with the all pairs classification and top pairs classification tasks.However, we improve on the previous system by using using better mention detection, more effective hyperparameters, and more epochs of training.

Results
We compare the heuristic loss, REINFORCE, and reward rescaling approaches on both datasets.We find that REINFORCE does slightly better than the heuristic loss, but reward rescaling performs significantly better than both on both languages.We attribute the modest improvement of REIN-FORCE to it being poorly suited for a ranking task.During training it optimizes the model's performance in expectation, but at test-time it takes the most probable sequence of actions.This mismatch occurs even at the level of an individual decision: the model only links the current mention to a single antecedent, but is trained to assign high probability to all correct antecedents.We believe the benefit of REINFORCE being guided by coreference evaluation metrics is offset by this disadvantage, which does not occur in the max-margin approaches.The reward-rescaled max-margin loss combines the best of both worlds, resulting in superior performance.

The Benefits of Reinforcement Learning
In this section we examine the reward-based cost function ∆ r and perform error analysis to determine how reward rescaling improves the mention-ranking model's accuracy.
Comparison with Heuristic Costs.We compare the reward-based cost function ∆ r with the error types used in the heuristic loss.For English, the average value of ∆ r is 0.79 for FN errors and 0.38 for FA errors when the costs are scaled so the average value of a WL error is 1.0.These are very close to the hyperparameter values (α FN , α FA , α WL ) = (0.8, 0.4, 1.0) found by grid search.However, there is a high variance in costs for each error type, suggesting that using a fixed penalty for each type as in the heuristic loss is insufficient (see Figure 1).
Avoiding Costly Mistakes.Embedding the costs of actions into the loss function causes the rewardrescaling model to prioritize getting the more important coreference decisions (i.e., the ones with the biggest impact on the final score) correct.As a   result, it makes fewer costly mistakes at test time.Costly mistakes often involve large clusters of mentions: incorrectly combining two coreference clusters of size ten is much worse than incorrectly combining two clusters of size one.However, the cost of an action also depends on other factors such as the number of errors already present in the clusters and the utilities of the other available actions.
Table 2 shows the breakdown of errors made by the heuristic and reward-rescaling models on the test set.The reward-rescaling model makes slightly more errors, meaning its improvement in performance must come from its errors being less severe.
Example Improvements.Table 3 shows two classes of mentions where the reward-rescaling loss particularly improves over the heuristic loss.
Proper nouns have a higher average cost for "false new" errors (0.90) than other mentions types (0.77).This is perhaps because proper nouns are important for connecting clusters of mentions far apart in a document, so incorrectly linking a proper noun to NA could result in a large decrease in recall.Because it more heavily weights these high-cost errors during training, the reward-rescaling model makes fewer "false new" errors for proper nouns than the heuristic loss.Although there is an increase in other kinds of errors as a result, most of these are low-cost "false anaphoric" errors.
The pronouns in the "telephone conversation" genre often group into extremely large coreference clusters, which means a "wrong link" error can have a large negative effect on the score.This is reflected in its high average cost of 1.21.After prioritizing these examples during training, the reward-rescaling model creates significantly fewer wrong links than the heuristic loss, which is trained using a fixed cost of 1.0 for all wrong links.

Related Work
Mention-ranking models have been widely used for coreference resolution (Denis and Baldridge, 2007;Rahman and Ng, 2009;Durrett and Klein, 2013).These models are typically trained with heuristic loss functions that assign costs to different error types, as in the heuristic loss we describe in Section 3.1 (Fernandes et al., 2012;Durrett et al., 2013;Björkelund and Kuhn, 2014;Wiseman et al., 2015;Martschat and Strube, 2015;Wiseman et al., 2016).
To the best of our knowledge reinforcement learning has not been applied to coreference resolution before.However, imitation learning algorithms such as SEARN (Daumé III et al., 2009) have been used to train coreference resolvers (Daumé III, 2006;Ma et al., 2014;Clark and Manning, 2015).These algorithms also directly optimize for coreference evaluation metrics, but they require an expert policy to learn from instead of relying on rewards alone.

Conclusion
We propose using reinforcement learning to directly optimize mention-ranking models for coreference evaluation metrics, obviating the need for hyperparameters that must be carefully selected for each particular language, dataset, and evaluation metric.Our reward-rescaling approach also increases the model's accuracy, resulting in significant gains over the current state-of-the-art.

Figure 1 :
Figure 1: Density plot of the costs ∆r associated with different error types on the English CoNLL 2012 test set.

Table 1 :
Comparison of the methods together with other state-of-the-art approaches on the test sets.

Table 3 :
Examples of classes of mention on which the reward-rescaling loss improves upon the heuristic loss due to its rewardbased cost function.Reported numbers are from the English CoNLL 2012 test set.

Table 2 :
Number of "false new," "false anaphoric," and "wrong link" errors produced by the models on the English CoNLL 2012 test set.