End-to-end Deep Reinforcement Learning Based Coreference Resolution

Recent neural network models have significantly advanced the task of coreference resolution. However, current neural coreference models are usually trained with heuristic loss functions that are computed over a sequence of local decisions. In this paper, we introduce an end-to-end reinforcement learning based coreference resolution model to directly optimize coreference evaluation metrics. Specifically, we modify the state-of-the-art higher-order mention ranking approach in Lee et al. (2018) to a reinforced policy gradient model by incorporating the reward associated with a sequence of coreference linking actions. Furthermore, we introduce maximum entropy regularization for adequate exploration to prevent the model from prematurely converging to a bad local optimum. Our proposed model achieves new state-of-the-art performance on the English OntoNotes v5.0 benchmark.


Introduction
Coreference resolution is one of the most fundamental tasks in natural language processing (NLP), which has a significant impact on many downstream applications including information extraction (Dai et al., 2019), question answering (Weston et al., 2015), and entity linking (Hajishirzi et al., 2013). Given an input text, coreference resolution aims to identify and group all the mentions that refer to the same entity.
In recent years, deep neural network models for coreference resolution have been prevalent (Wiseman et al., 2016; Clark and Manning, 2016b). These models, however, either assumed that mentions were given and developed only a coreference linking model (Clark and Manning, 2016b), or built a pipeline system that first detects mentions and then resolves coreferences (Haghighi and Klein, 2010). In either case, they depend on hand-crafted features and syntactic parsers that may not generalize well or may even propagate errors.
To avoid the cascading errors of pipeline systems, NLP researchers have recently developed end-to-end approaches (Lee et al., 2017; Luan et al., 2018), which directly consider all text spans and jointly identify entity mentions and cluster them. The core of these end-to-end models consists of vector embeddings that represent text spans in the document and scoring functions that compute mention scores for text spans and antecedent scores for pairs of spans. Depending on how the span embeddings are computed, end-to-end coreference models can be further divided into first-order methods (Lee et al., 2017; Luan et al., 2018) and higher-order methods (Lee et al., 2018). Although recent end-to-end neural coreference models have advanced the state-of-the-art performance for coreference resolution, they are still trained with heuristic loss functions and make a sequence of local decisions for each pair of mentions. However, as studied in Clark and Manning (2016a) and Yin et al. (2018), most coreference resolution evaluation measures cannot be computed from local decisions; they are known only after all other decisions have been made. Therefore, the next key research question is how to integrate and directly optimize coreference evaluation metrics in an end-to-end manner.
In this paper, we propose a goal-directed end-to-end deep reinforcement learning framework to resolve coreference, as shown in Figure 1. Specifically, we leverage the neural architecture in Lee et al. (2018) as our policy network, which includes learning span representations, scoring potential entity mentions, and generating a probability distribution over all possible coreference linking actions from the current mention to its antecedents. Once a sequence of linking actions is made, our reward function is used to measure how good the generated coreference clusters are, which is directly related to coreference evaluation metrics. In addition, we introduce an entropy regularization term to encourage exploration and prevent the policy from prematurely converging to a bad local optimum. Finally, we update the regularized policy network parameters based on the rewards associated with sequences of sampled actions, which are computed on the whole input document.
Figure 1: The basic framework of our policy gradient model for one trajectory. The policy network is an end-to-end neural module that can generate probability distributions over actions of coreference linking. The reward function computes a reward given a trajectory of actions based on coreference evaluation metrics. Solid lines indicate model exploration and (red) dashed lines indicate the gradient update.
We evaluate our end-to-end reinforced coreference resolution model on the English OntoNotes v5.0 benchmark. Our model achieves a new state-of-the-art F1 score of 73.8%, which outperforms the previous best published result (73.0%) of Lee et al. (2018) with statistical significance.

Related Work
Closely related to our work are the end-to-end coreference models developed by Lee et al. (2017) and Lee et al. (2018). Unlike previous pipeline approaches, Lee et al. used neural networks to learn mention representations and calculate mention and antecedent scores without using syntactic parsers. However, their models optimize a heuristic loss based on local decisions rather than the actual coreference evaluation metrics, while our reinforcement model directly optimizes the evaluation metrics based on the rewards calculated from sequences of actions.
Our work is also inspired by Clark and Manning (2016a) and Yin et al. (2018), which resolve coreferences with reinforcement learning techniques. They view the mention-ranking model as an agent taking a series of actions, where each action links a mention to a candidate antecedent. They also use pretraining for initialization. Nevertheless, their models assume mentions are given, while our work is end-to-end. Furthermore, we add entropy regularization to encourage more exploration (Mnih et al.; Eysenbach et al., 2019) and prevent our model from prematurely converging to a sub-optimal (or bad) local optimum.

Task definition
Given a document, the task of end-to-end coreference resolution aims to identify a set of mention clusters, each of which refers to the same entity. Following Lee et al. (2017), we formulate the task as a sequence of linking decisions for each span $i$ to the set of its possible antecedents, denoted as $\mathcal{Y}(i) = \{\epsilon, 1, \cdots, i-1\}$, that is, a dummy antecedent $\epsilon$ and all preceding spans. In particular, the dummy antecedent handles two possible scenarios for a span: (i) the span is not an entity mention, or (ii) the span is an entity mention but is not coreferent with any previous span. The final coreference clusters can be recovered with a backtracking step on the antecedent predictions.
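The backtracking step mentioned above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: it assumes spans are indexed in document order, that each prediction is either an earlier span index or `None` (a hypothetical encoding of the dummy antecedent), and the function name `recover_clusters` is our own.

```python
from typing import Dict, List, Optional


def recover_clusters(antecedents: List[Optional[int]]) -> List[List[int]]:
    """Group spans into clusters from per-span antecedent predictions.

    antecedents[i] is the predicted antecedent of span i: an index j < i,
    or None for the dummy antecedent (assumed encoding).
    """
    cluster_of: Dict[int, int] = {}   # span index -> cluster id
    clusters: List[List[int]] = []
    for i, ant in enumerate(antecedents):
        if ant is None:
            # Dummy antecedent: not a mention, or no earlier coreferent span.
            continue
        if not 0 <= ant < i:
            raise ValueError(f"antecedent of span {i} must precede it, got {ant}")
        if ant not in cluster_of:
            # The antecedent starts a new cluster the first time it is linked to.
            cluster_of[ant] = len(clusters)
            clusters.append([ant])
        cid = cluster_of[ant]
        cluster_of[i] = cid
        clusters[cid].append(i)
    return clusters
```

Spans that choose the dummy antecedent and are never chosen as an antecedent themselves do not appear in any cluster, which covers both scenarios (i) and (ii) for unlinked spans.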

Our Model
Figure 2 illustrates our iterative coreference resolution model on a document. Given a document, our model first identifies the top-scored mentions and then conducts a sequence of actions $a_{1:T} = \{a_1, a_2, \cdots, a_T\}$ over them, where $T$ is the number of mentions and each action $a_t$ assigns mention $t$ to a candidate antecedent. Once our model has finished all the actions, it observes a reward $R(a_{1:T})$. The calculated gradients are then propagated to update the model parameters. We use the average of three metrics as the reward: MUC (Grishman and Sundheim, 1995), B$^3$ (Recasens and Hovy, 2011) and CEAF$_{\phi_4}$ (Cai and Strube, 2010). Following Clark and Manning (2016a), we assume actions are independent and the next state $S_{t+1}$ is generated based on the natural order of the start position and then the end position of mentions, regardless of action $a_t$.
Figure 2: A demonstration of our reinforced coreference resolution method on a document with 6 mentions. The upper and lower rows correspond to steps 5 and 6 respectively, in which the policy network selects mention (2) as the antecedent of mention (5) and leaves mention (6) as a singleton mention. The red (gray) nodes represent processed (current) mentions and edges between them indicate current predicted coreferential relations. The gray rectangles around circles are span embeddings and the reward is calculated at the trajectory end.
Figure 3: Architecture of the policy network. The components in the dashed square iteratively refine span representations. The last layer is a masked softmax layer that computes a probability distribution only over the candidate antecedents for each mention. We omit the span generation and pruning component for simplicity.
Table 1 (caption fragment): ... (Peters et al., 2018). The F1 improvement is statistically significant under a t-test with p < 0.05.
Policy Network: We adopt the state-of-the-art end-to-end neural coreference scoring architecture from Lee et al. (2018) and add a masked softmax layer to compute the probability distribution over actions, as illustrated in Figure 3. The success of their approach lies in two aspects: (i) coarse-to-fine pruning to reduce the search space, and (ii) an iterative procedure to refine the span representation with a self-attention mechanism that averages over the previous round's representations weighted by the normalized coreference scores. Given the state $S_t$ and current network parameters $\theta$, the probability of action $a_t$ choosing $y_t$ is:
$$p_\theta(a_t = y_t \mid S_t) = \frac{\exp(s(t, y_t))}{\sum_{y' \in \mathcal{Y}(t)} \exp(s(t, y'))}$$
where $s(i, j)$ is the pairwise coreference score between span $i$ and span $j$, defined as follows:
$$s(i, j) = s_m(i) + s_m(j) + s_c(i, j) + s_a(i, j)$$
For the dummy antecedent, the score $s(i, \epsilon)$ is fixed to 0. Here $s_m(\cdot)$ is the mention score function, $s_c(\cdot, \cdot)$ is a bilinear score function used to prune antecedents, and $s_a(\cdot, \cdot)$ is the antecedent score function. Let $g_i$ denote the refined representation for span $i$ after gating. The three functions are $s_m(i) = \theta_m^\top \mathrm{FFNN}_m(g_i)$, $s_c(i, j) = g_i^\top \Theta_c g_j$, and
$$s_a(i, j) = \theta_a^\top \mathrm{FFNN}_a([g_i, g_j, g_i \circ g_j, \phi(i, j)])$$
where FFNN denotes a feed-forward neural network and $\circ$ denotes the element-wise product. $\theta_m$, $\Theta_c$ and $\theta_a$ are network parameters, and $\phi(i, j)$ is the feature vector encoding speaker and genre information from metadata.
The Reinforced Algorithm: We explore using the policy gradient algorithm to maximize the expected reward:
$$J(\theta) = \mathbb{E}_{a_{1:T} \sim p_\theta}\left[R(a_{1:T})\right] \tag{1}$$
Computing the exact gradient of $J(\theta)$ is infeasible due to the expectation over all possible action sequences. Instead, we use Monte-Carlo methods to approximate the actual gradient by randomly sampling $N_s$ trajectories according to $p_\theta$ and computing the gradient only over the sampled trajectories. Meanwhile, following Clark and Manning (2016a), we subtract a baseline value from the reward to reduce the variance of the gradient estimate. The gradient estimate is as follows:
$$\nabla_\theta J(\theta) \approx \frac{1}{N_s} \sum_{i=1}^{N_s} \sum_{t=1}^{T} \nabla_\theta \log p_\theta(a_{it} \mid S_{it}) \left(R(\tau_i) - b\right) \tag{2}$$
where $N_s$ is the number of sampled trajectories, $\tau_i = \{a_{i1}, \cdots, a_{iT}\}$ is the $i$-th sampled trajectory and $b = \sum_{i=1}^{N_s} R(\tau_i) / N_s$ is the baseline reward.
The Entropy Regularization: To prevent our model from getting stuck in highly peaked policies that favor a few actions, an entropy regularization term is added to encourage exploration. The final regularized policy gradient estimate is as follows:
$$\nabla_\theta J(\theta) \approx \frac{1}{N_s} \sum_{i=1}^{N_s} \sum_{t=1}^{T} \left[ \nabla_\theta \log p_\theta(a_{it} \mid S_{it}) \left(R(\tau_i) - b\right) + \lambda_{expr} \nabla_\theta H\left(p_\theta(\cdot \mid S_{it})\right) \right] \tag{3}$$
where $H(p) = -\sum_a p(a) \log p(a)$ is the entropy of the action distribution and $\lambda_{expr} \geq 0$ is the regularization parameter that controls how diversely our model explores. The larger $\lambda_{expr}$ is, the more diverse the exploration. If $\lambda_{expr} \to \infty$, all actions will be sampled uniformly regardless of the current policy.
Conversely, if $\lambda_{expr} = 0$, all actions will be sampled based on the current policy.
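To make the sampled, baseline-corrected and entropy-regularized gradient estimate concrete, here is a minimal sketch. It is not the paper's implementation. It assumes a toy tabular policy in which each mention has its own vector of logits over its candidates (so the "network parameters" are the logits themselves), and `reward_fn` is a hypothetical stand-in for the document-level metric average. The gradients use the standard softmax identities: the derivative of log p(a) with respect to logit k is 1[k = a] - p_k, and the derivative of the entropy H is -p_k (log p_k + H).

```python
import math
import random
from typing import Callable, List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one mention's candidate scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def regularized_gradient(
    logits: Sequence[Sequence[float]],
    reward_fn: Callable[[List[int]], float],
    num_samples: int,
    lam: float,
    rng: random.Random,
) -> List[List[float]]:
    """Monte-Carlo policy gradient w.r.t. the logits, with a mean-reward
    baseline and an entropy bonus weighted by lam (toy tabular policy)."""
    probs = [softmax(row) for row in logits]
    # Sample trajectories: one action (candidate index) per mention.
    trajectories = [
        [rng.choices(range(len(p)), weights=p)[0] for p in probs]
        for _ in range(num_samples)
    ]
    rewards = [reward_fn(tau) for tau in trajectories]
    baseline = sum(rewards) / num_samples  # average reward of the samples

    grad = [[0.0] * len(row) for row in logits]
    for tau, reward in zip(trajectories, rewards):
        advantage = reward - baseline
        for t, (action, p) in enumerate(zip(tau, probs)):
            entropy = -sum(q * math.log(q) for q in p if q > 0.0)
            for k, q in enumerate(p):
                # d log p(action) / d logit_k
                score_grad = (1.0 if k == action else 0.0) - q
                # d H(p) / d logit_k
                entropy_grad = -q * (math.log(q) + entropy) if q > 0.0 else 0.0
                grad[t][k] += (advantage * score_grad + lam * entropy_grad) / num_samples
    return grad
```

A gradient-ascent step would add a small multiple of the returned values to the logits. With `lam = 0` and identical rewards for every sample, the estimate is zero, because every reward equals the baseline. With `lam > 0`, the entropy term pushes probability mass away from the currently dominant candidate.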
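Pretraining: We pretrain the policy network parameterized by $\theta$ using the loss function below:
$$L(\theta) = -\sum_{i=1}^{N} \sum_{j \in \mathcal{Y}_i} I(i, j) \log p_\theta(j \mid i) \tag{4}$$
where $N$ is the number of mentions, $I(i, j) = 1$ if mentions $i$ and $j$ are coreferent and 0 otherwise, $\mathcal{Y}_i$ is the set of candidate antecedents of mention $i$, and $p_\theta(j \mid i)$ is the policy probability of linking mention $i$ to antecedent $j$.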
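As an illustration of a pretraining loss consistent with the definitions above, the sketch below combines a masked softmax over candidate antecedents with a negative log-likelihood summed over coreferent pairs. This is one plausible reading, not the authors' code: the dense score matrix, the boolean candidate mask, and the function names are assumptions made for the example.

```python
import math
from typing import List, Sequence

NEG_INF = float("-inf")


def masked_log_softmax(scores: Sequence[float], mask: Sequence[bool]) -> List[float]:
    """Log-softmax restricted to entries where mask is True.

    Masked-out entries get log-probability -inf (probability 0).
    """
    valid = [s for s, m in zip(scores, mask) if m]
    if not valid:
        raise ValueError("each mention needs at least one candidate antecedent")
    peak = max(valid)
    log_z = peak + math.log(sum(math.exp(s - peak) for s in valid))
    return [s - log_z if m else NEG_INF for s, m in zip(scores, mask)]


def pretraining_loss(
    scores: Sequence[Sequence[float]],
    candidate_mask: Sequence[Sequence[bool]],
    coref_indicator: Sequence[Sequence[int]],
) -> float:
    """Negative log-likelihood over candidate pairs marked as coreferent.

    scores[i][j]          : pairwise score s(i, j) for mention i, candidate j
    candidate_mask[i][j]  : True if j is a candidate antecedent of mention i
    coref_indicator[i][j] : 1 if i and j are coreferent, else 0 (I(i, j))
    """
    loss = 0.0
    for row_scores, row_mask, row_gold in zip(scores, candidate_mask, coref_indicator):
        log_probs = masked_log_softmax(row_scores, row_mask)
        for log_p, is_candidate, is_coref in zip(log_probs, row_mask, row_gold):
            if is_candidate and is_coref:
                loss -= log_p
    return loss
```

For a single mention with two equally scored candidates, one of which is coreferent, the loss is log 2. Masking a candidate removes it from the normalization, which is the role of the masked softmax layer.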

Experiments
We evaluate our model on the English OntoNotes v5.0 benchmark (Pradhan et al., 2011), which contains 2,802 training documents, 343 development documents, and 348 test documents. We reuse the hyperparameters and evaluation metrics from Lee et al. (2018) with a few exceptions. First, we pretrain our model using Eq. (4) for around 200K steps and use the learned parameters for initialization. In addition, we set the number of sampled trajectories to $N_s = 100$, tune the regularization parameter $\lambda_{expr}$ in $\{10^{-5}, 10^{-4}, 0.001, 0.01, 0.1, 1\}$, and set it to $10^{-4}$ based on the development set. We use three standard metrics: MUC (Grishman and Sundheim, 1995), B$^3$ (Recasens and Hovy, 2011) and CEAF$_{\phi_4}$ (Cai and Strube, 2010). For each metric, we report the precision, recall and F1 score. The final evaluation is the average F1 of the above three metrics.
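The final evaluation score is simple arithmetic once per-metric precision and recall are available. The sketch below assumes those values come from an external scorer; the function names and dictionary keys are illustrative, not from the paper or any scorer.

```python
from typing import Dict, Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0 when both are 0)."""
    if precision + recall == 0.0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


def average_f1(metrics: Dict[str, Tuple[float, float]]) -> float:
    """Average the F1 scores of several metrics.

    metrics maps a metric name (e.g. "MUC", "B3", "CEAF_phi4") to a
    (precision, recall) pair produced by an external scorer.
    """
    if not metrics:
        raise ValueError("at least one metric is required")
    return sum(f1_score(p, r) for p, r in metrics.values()) / len(metrics)
```

The same average is what the reward function described earlier computes for a sampled trajectory, applied to the predicted clusters of the whole document.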

Results
In Table 1, we compare our model with the coreference systems that have produced significant improvements over the last 3 years on the OntoNotes benchmark. The reported results are either taken from the original papers or reproduced from the released code. The first section of the table lists the pipeline models, while the second section lists the end-to-end approaches. The third section lists the results of different variants of our model. Note that the method of Luan et al. (2018) covers three tasks: named entity recognition, relation inference and coreference resolution. We disable the relation inference task and train the other two.
Built on top of the model in Lee et al. (2018) but excluding ELMo, our base reinforced model improves the average F1 score by around 2 points (statistically significant under a t-test with p < 0.05) compared with Lee et al. (2017). Moreover, it is even comparable with the end-to-end multi-task coreference model that has ELMo support (Luan et al., 2018), which demonstrates the power of reinforcement learning combined with the state-of-the-art end-to-end model in Lee et al. (2018). Regarding our model, using entropy regularization to encourage exploration improves the result by 1 point. Furthermore, introducing the context-dependent ELMo embedding to our base model further boosts the performance, which is consistent with earlier reported results. We also notice that our full model's improvement comes mainly from higher precision scores with reasonably good recall scores, which indicates that our reinforced model combined with more active exploration produces better coreference scores and thereby reduces false positive coreference links.
Overall, our full model achieves the state-of-the-art performance of 73.8% F1 score when using ELMo and entropy regularization (compared to models marked with * in Table 1), and our approach also obtains the best F1 score of 70.5% when using fixed word embeddings only.

Since mention detection is a subtask of coreference resolution, it is worthwhile to study its performance. Table 2 shows the mention detection results on the test set. Similar to the coreference linking results, our model achieves higher precision and F1 score, which indicates that our model can significantly reduce false positive mentions while still finding a reasonable number of mentions.

Analysis and Discussion
Ablation Study: To understand the effect of different components, we conduct an ablation study on the development set, as shown in Table 3. Removing entropy regularization lowers the average F1 score by 1%. Disabling coarse-to-fine pruning or second-order inference decreases the F1 score by 0.3 and 0.5, respectively. Among all the components, the ELMo embedding contributes the most, improving the result by 3.1%.

Table 3 (note): "Coarse-to-fine pruning" and "second-order inference" are adopted from Lee et al. (2018).
Impact of the parameter $\lambda_{expr}$: Since the parameter $\lambda_{expr}$ directly controls how diversely the model explores during training, it is necessary to study its effect on the model performance. Figure 4 shows the average F1 score on the development set for our full model and a baseline. We observe that $\lambda_{expr}$ does have a strong effect on the performance and the best value is around $10^{-4}$. Moreover, our full model consistently outperforms the baseline over a wide range of $\lambda_{expr}$.
Figure 4 (caption fragment): The baseline is also plotted for comparison, which is a flat line since it does not depend on $\lambda_{expr}$.

Conclusion
We present the first end-to-end reinforcement learning based coreference resolution model. Our model transforms the supervised higher-order coreference model into a policy gradient model that can directly optimize coreference evaluation metrics. Experiments on the English OntoNotes benchmark demonstrate that our full model integrated with entropy regularization significantly outperforms previous coreference systems.
There are several potential improvements to our model as future work, such as incorporating mention detection results as a part of the reward. Another interesting direction would be to introduce intermediate step rewards for each action to better guide the behaviour of the RL agent.