Coreference Resolution with Entity Equalization

A key challenge in coreference resolution is to capture properties of entity clusters, and use those in the resolution process. Here we provide a simple and effective approach for achieving this, via an “Entity Equalization” mechanism. The Equalization approach represents each mention in a cluster via an approximation of the sum of all mentions in the cluster. We show how this can be done in a fully differentiable end-to-end manner, thus enabling high-order inferences in the resolution process. Our approach, which also employs BERT embeddings, results in new state-of-the-art results on the CoNLL-2012 coreference resolution task, improving average F1 by 3.6%.


Introduction
Coreference resolution is the task of grouping mentions into entities. A key challenge in this task is that information about an entity is spread across multiple mentions. Thus, deciding whether to assign a given mention to a candidate entity could require entity-level information that needs to be aggregated from all mentions.
The problem of entity-level representation (also referred to as high-order coreference models) has attracted considerable interest recently, with methods ranging from imitation learning (Clark and Manning, 2015) to iterative refinement . Specifically,  tackled this 1 Our code is publicly available at https://github. com/kkjawz/coref-ee problem by iteratively averaging the antecedents of each mention to create mention representations that are more "global" (i.e., reflect information about the entity to which the mention refers).
Here we propose an approach that provides an entity-level representation in a simple and intuitive manner, and also facilitates end-to-end optimization. Our "Entity Equalization" approach posits that each entity should be represented via the sum of its corresponding mention representations. It is not immediately obvious how to perform this equalization, which relies on the entity-to-mention mapping, but we provide a natural smoothed representation of this mapping, and demonstrate how to use it for equalization.
Now that each mention contains information about all its corresponding entities, we can use a standard pairwise scoring model, and this model will be able to use global entity-level information.
Similar to recent coreference models, our approach uses contextual embeddings as input mention representations. While previous approaches employed the ELMo model , we propose to use BERT embeddings (Devlin et al., 2018), motivated by the impressive empirical performance of BERT on other tasks. It is challenging to apply BERT to the coreference resolution setting because BERT is limited to a fixed sequence length which is shorter than most coreference resolution documents. We show that this can be done by using BERT in a fully convolutional manner. Our work is the first to use BERT for the task of coreference resolution, and we demonstrate that this results in significant improvement over current state-of-the-art.
In summary, our contributions are: a. A simple and intuitive approach for entity-level representation via the notion of Entity-Equalization. b. The first use of BERT embeddings in coreferenceresolution. c. New state-of-the-art performance on the CoNLL-2012 coreference resolution task, improving over previous F1 performance by 3.6%.

Background
Following Lee et al. (2017), we cast the coreference resolution task as finding a set of antecedent assignments y i for each span i in the document. The set of possible values for each y i is Y(i) = { , 1, . . . , i − 1}, a dummy antecedent and all preceding spans. Non-dummy antecedents represent coreference links between i and y i , whereas indicates that the span is either not a mention, or is a first mention in a newly formed cluster. Whenever a new cluster is formed it receives a new index, and every mention with y i = receives the index of its antecedents. Thus the process results in clusters of coreferent entities.

Baseline Model
We briefly describe the baseline model ) which we will later augment with Entity-Equalization and BERT features. Let s(i, j) denote a pairwise score between two spans i and j. Next, for each span i define the distribution P (y i ) over antecedents: The score is a function of the span representations defined as follows. For each span i let g i ∈ R d denote its corresponding representation vector (see  for more details about the model architecture). Lee et al. (2017) computes the antecedent score s(i, j) = f s (g i , g j ) as a pairwise function of the span representations, i.e. not directly incorporating any information about the entities to which they might belong.  improved upon this model by "refining" the span representations as follows. The expected antecedent representation a i of each span i is computed by using the current antecedent distribution P (y i ) as an attention mechanism: The current span representation g i is then updated via interpolation with its expected antecedent representation a i : where f i = f f (g i , a i ) is a learned gate vector. Thus, the refined representation g i is an elementwise weighted average of the current span representation and its direct antecedents. Using this representation the refined antecedent distribution can be calculated as follows:

Entity Equalization
The idea behind the refinement procedure in  was to create features that are closer to cluster-level representations and hence more "global". This was partially achieved by considering not only the current span but also its antecedents. We would like take this idea one step further and create refined span representations that contain information about the entire cluster to which it belongs. One way to achieve this is by simply representing each mention via the sum of the mentions currently in its coreference cluster. Formally, let C(i) be a coreference cluster (as defined by the antecedent distribution P (y i )) such that i ∈ C(i), and replace Equation 1 with: As a result each span will now contain information about all of its current coreference cluster, effectively equalizing the representations of different spans belonging to the same cluster. However, note that it is not clear how to train such a procedure end-to-end because the clustering process is not differentiable. To overcome this problem, we use a differentiable relaxation of the clustering process (Le and Titov, 2017) and use the resulting soft clustering matrix to create a fully differentiable cluster representation. We call this refinement procedure Entity Equalization and provide a detailed description in the next section.
To illustrate the difference between Entity Equalization and antecedent averaging, consider the following example: "[John] went to the park and [he] got tired.
[John] decided to go back home." Now assume that the model outputs the following antecedent distribution P (y i ): John 1 he John 2 John 1 1 0 0 he 1 0 0 John 2 1 0 0 there is only one coreference cluster induced by this antecedent matrix, C = {John 1 , he, John 2 }. A cluster representation for John 2 would be the sum of the representations of all three mentions. However, using antecedent averaging, the representation of John 2 will be a weighted average of the representations of John 2 and John 1 , because only John 1 is an antecedent of John 2 .

Implementing Equalization
In order to achieve differentiable cluster representations, we need a differentiable soft-clustering process. Le and Titov (2017) introduced such a relaxation given an antecedent distribution, based on the following observation: in a document containing m mentions there are m potential entities E 1 , ..., E m where E i has mention i as the first mention. Let Q(i ∈ E j ) be the probability that mention i corresponds to entity E j (that is, to the entity that has j as its first mention). Le and Titov (2017) showed that this probability can be computed recursively based on the antecedent distribution P (y i ) as follows: Note that this is a fully differentiable procedure that calculates the clustering distribution for each entity i. The distribution Q(i ∈ E j ) above leads to a simple differentiable implementation of the equalization operation in (3), as described next. In order to use entity representations for equalizing mention representations, we need a representation for each entity E i at each time step t, so we won't represent a mention using mentions not yet encountered. We denote it by: Finally, an entity representation for each mention i is calculated using the entity distribution of mention i and the global entity representations: It can be seen that the above a i will indeed lead to (3) when the distributions P (y) are deterministic.

Using BERT Embeddings
Our coreference model relies on input representations for each input token.  used the ELMo context-dependent embeddings for this purpose. Here we propose to use the more recent BERT embeddings (Devlin et al., 2018) instead, which have achieved state of the art performance on many natural language processing tasks. BERT is a bidirectional contextual language model based on the Transformer architecture (Vaswani et al., 2017). Using BERT for coreference resolution is not trivial: BERT can only run on sequences of fixed length which is determined in the pretraining process. In the pre-trained model published by Devlin et al. (2018), this limitation is 512 tokens, which is shorter than many of the documents in the CoNLL-2012 coreference resolution task. Even without considering the pre-training limitation, because the attention mechanism grows as the square of the sequence length, and because of the large number of parameters of the BERT model, running it on very large sequences is not feasible on most machines due to memory constraints.
In order to obtain BERT embeddings for sequences of unlimited length, we propose to use BERT in a convolutional mode as follows. Let D be a fixed window length. We obtain a representation for token i by applying BERT to the sequence of tokens from D to the left of i to D to the right of i. We then take the four last layers of the BERT model for token i and apply a learnable weighted averaging to those, similar to the process used in ELMo. The output of the network is taken as the representation of token i, and replaced the ELMo representation in the model of Section 3.1. We use D = 64, since using the maximum size of D = 256 is computationally intensive, and good results are already obtained with 64. 2

Related Work
Several works have addressed the issue of entitylevel representation (Culotta et al., 2007;Wick et al., 2009;Singh et al., 2011). In Wiseman et al. (2016) an RNN is used to model each entity. While this allows complex entity representations, the assignment of a mention to an RNN is a hard decision, and as such cannot be optimized in an end-to-end manner. Clark and Manning (2015) use whole-entity representations as obtained from agglomerative clustering. But again the clustering operation in non-differentiable, requiring the use of imitation learning. In , entity refinement is more restricted, as it is only obtained from the attention vector at each step. Thus, we believe that our model is the first to use entity-level representations that correspond directly to the inferred clusters, and are end-to-end differentiable.
Mention-entity mappings have been used in the context of optimizing coreference performance measures (Le and Titov, 2017;Clark and Manning, 2016). Here we show that these mappings can also be used for the resolution model itself. We note that we did not try to optimize for coreference measures as in Le and Titov (2017), and this is likely to further improve results.

Experiments
Data for all our experiments is taken from the English portion of the CoNLL-2012 coreference resolution tasks (Pradhan et al., 2012). Our experimental setup is very similar to , and our code is built on theirs. We did not change the optimizer or any of the training hyperparameters. The following changes were made to the model: • We used BERT word embeddings instead of ELMo as input to the LSTM (see Section 4).
• We replaced the span representation refinement mechanism with our Entity Equalization approach (see Section 3).

Results
Following Pradhan et al. (2012), we report precision, recall and F1 of the MUC, B 3 and CEAF φ 4 metrics, and average the F1 score of all three metrics to get the main evaluation metric used in the CoNLL-2012 coreference resolution task. We calculated the metrics using the official evaluation scripts of CoNLL-2012.
Results on the test set are shown in Table 1. Our baseline is the span-ranking model from  with ELMo input features and second-order span representations, which achieves 73.0% Avg. F1. Replacing the ELMo features with BERT features achieves 76.25% average F1. Removing the second-order span-representations while using BERT features achieves 76.37% F1, achieving higher recall and lower precision on all evaluation metrics, while somewhat surprisingly being better overall. Replacing secondorder span representations with Entity Equalization achieves 76.64% average F1, while also consistently achieving the highest F1 score on all three evaluation metrics. Our results set a new state of the art for coreference resolution, improving the previous state of the art by 3.6% average F1.

Conclusion
In this work we presented a new state-of-the-art coreference resolution system. Key to our approach is the idea that each mention should contain information about all its coreferring mentions. Here we implemented this idea by summing all mention representations within a cluster. In the future we plan to further enrich these representations by considering information from across the document. Furthermore, we can consider more structured representations of entities that reflect entity attributes and inter-entity relations.