Revisiting Joint Modeling of Cross-document Entity and Event Coreference Resolution

Recognizing coreferring events and entities across multiple texts is crucial for many NLP applications. Despite the task’s importance, research has focused mostly on within-document entity coreference, with rather little attention to the other variants. We propose a neural architecture for cross-document coreference resolution. Inspired by Lee et al. (2012), we jointly model entity and event coreference. We represent an event (entity) mention using its lexical span, surrounding context, and relation to entity (event) mentions via predicate-argument structures. Our model outperforms the previous state-of-the-art event coreference model on ECB+, while providing the first entity coreference results on this corpus. Our analysis confirms that all our representation elements, including the mention span itself, its context, and the relation to other mentions, contribute to the model’s success.


Introduction
Recognizing that various textual spans across multiple texts refer to the same entity or event is an important NLP task. For example, consider the following news headlines:
1. 2018 Nobel prize for physics goes to Donna Strickland
2. Prof. Strickland is awarded the Nobel prize for physics
Both sentences refer to the same entities (Donna Strickland and the Nobel prize for physics) and the same event (awarding the prize), using different words. In coreference resolution, the goal is to cluster expressions that refer to the same entity or event in a text, whether within a single document or across a document collection. Recently, there has been increasing interest in cross-text inferences, for example in question answering (Welbl et al., 2018; Yang et al., 2018; Khashabi et al., 2018; Postma et al., 2018). Such applications would benefit from effective cross-document coreference resolution.
Despite the importance of the task, the focus of most coreference resolution research has been on its within-document variant, and rather little on cross-document coreference (CDCR). The latter is sometimes addressed partially using entity linking, which links mentions of an entity to its knowledge base entry. However, cross-document entity coreference is substantially broader than entity linking, addressing also mentions of common nouns and unfamiliar named entities.
The commonly used dataset for CDCR is ECB+ (Cybulska and Vossen, 2014), which annotates within-document coreference as well. The annotations are denoted separately for entities and events, making it possible to solve one task while ignoring the other. Indeed, to the best of our knowledge, all previously published work on ECB+ addressed only event coreference.
Cross-document entity coreference has been addressed on EECB, a predecessor of the ECB+ dataset. Lee et al. (2012) proposed to model the entity and event coreference tasks jointly, leading to improved performance on both tasks. Their model preferred to cluster event mentions whose arguments are in the same entity coreference cluster, and vice versa. For instance, in the example sentences above, a system focusing solely on event coreference may find it difficult to recognize that goes to and awarded are coreferring, while a joint model would leverage the coreference between their arguments.
Inspired by the success of the joint approach of Lee et al. (2012), we propose a joint neural architecture for CDCR. In our joint model, an event (entity) mention representation is aware of other entities (events) that are related to it by predicate-argument structure. We cluster mentions based on a learned pairwise mention coreference scorer.
A disjoint variant of our model, on its own, improves upon the previous state-of-the-art for event coreference on the ECB+ corpus (Kenyon-Dean et al., 2018) by 9.5 CoNLL F1 points. To the best of our knowledge, we are the first to report performance on the entity coreference task in ECB+.
Our joint model further improves performance upon the disjoint model by 1.2 points for entities and 1 point for events (statistically significant with p < 0.001). Our analysis further shows that each of the mention representation components contributes to the model's performance.

Background and Related Work
Coreference resolution is the task of clustering text spans that refer to the same entity or event. Variants of the task differ on two axes: (1) resolving entities ("Duchess of Sussex", "Meghan Markle", "she") vs. events ("Nobel prize for physics [goes to] Donna Strickland", "Donna Strickland [is awarded] the 2018 Nobel prize for physics"), and (2) whether coreferring mentions occur within a single document (WD: within-document) or across a document collection (CD: cross-document).

Datasets
The largest datasets that include WD and CD coreference annotations for both entities and events are EECB (Lee et al., 2012) and ECB+ (Cybulska and Vossen, 2014). Both are extensions of the Event Coreference Bank (ECB) (Bejan and Harabagiu, 2010), which consists of documents from Google News clustered into topics and annotated for event coreference. Entity coreference annotations were first added in EECB, covering both common nouns and named entities. ECB+ increased the difficulty level by adding a second set of documents for each topic (subtopic), discussing a different event of the same type (Tara Reid enters a rehab center vs. Lindsay Lohan enters a rehab center). The annotation is not exhaustive: only a number of salient events and entities in each topic are annotated.

Models
Entity Coreference. Of all the coreference resolution variants, the most well-studied is WD entity coreference resolution (e.g. Durrett and Klein, 2013; Clark and Manning, 2016). The current best performing model is a neural end-to-end system which considers all spans as potential entity mentions, and learns distributions over possible antecedents for each (Lee et al., 2017). CD entity coreference has received less attention (e.g. Bagga and Baldwin, 1998b; Rao et al., 2010; Dutta and Weikum, 2015), often addressing the narrower task of entity linking, which links mentions of known named entities to their corresponding knowledge base entries (Shen et al., 2015).
Event Coreference. Event coreference is considered a more difficult task, mostly due to the more complex structure of event mentions. While entity mentions are mostly noun phrases, event mentions may consist of a verbal predicate (acquire) or a nominalization (acquisition), where these are attached to arguments, including event participants and spatio-temporal information.
Early models employed lexical features (e.g. head lemma, WordNet synsets, word embedding similarity) as well as structural features (e.g. aligned arguments) to compute distances between event mentions and decide whether they belong to the same coreference cluster (e.g. Bejan and Harabagiu, 2010, 2014; Yang et al., 2015).
More recent work is based on neural networks. Choubey and Huang (2017) alternate between WD and CD clustering, each step relying on previous decisions. The decision to link two event mentions is made by the pairwise WD and CD scorers. Mention representations rely on pre-trained word embeddings, contextual information, and features related to the event's arguments. Kenyon-Dean et al. (2018) similarly encode event mentions using lexical and contextual features. Differently from Choubey and Huang (2017), they do not cluster documents to topics as a pre-processing step. Instead, they encode the document as part of the mention representation.
Most of the recent models were trained and evaluated on the ECB+ corpus, addressing solely the event coreference aspect of the dataset.
Joint Modeling. Some of the prior models leverage the event arguments to improve their coreference decisions (Yang et al., 2015; Choubey and Huang, 2017), but they mostly rely only on lexical similarity between arguments of candidate event mentions. A different approach was proposed by Lee et al. (2012), who jointly predicted event and entity coreference.
At the core of their model lies the assumption that arguments (i.e. entity mentions) play a key role in describing an event; therefore, knowing that two arguments are coreferring is useful for finding coreference relations between events, and vice versa. They incrementally merge entity or event clusters, computing the merge score between two clusters by learning a linear regression model based on discrete features. Lee et al. (2012) evaluated their model on EECB, outperforming disjoint CD coreference models for both entities and events. Nonetheless, as opposed to the more recent models, their representations are sparse. Lexical features are based on lexical resources such as WordNet (Miller, 1995), which are limited in coverage, and context is modeled using semantic role dependencies, which often do not cover the entire sentential context. We revisit the joint modeling approach, trying to overcome prior limitations by using modern neural techniques, which provide better and more generalizable representations.

Model
We propose an iterative algorithm that alternates between interdependent entity and event clustering, incrementally constructing the final clustering configuration. A single iteration for events is as follows (entity clustering is symmetric). We start by computing the mention representations (Section 3.1), which couple the entity and event clustering processes. When predicting event clusters, the event mention representations are updated to consider the current configuration of entity clusters. The mention representations are then fed to an event mention pair scorer that predicts whether the mentions belong to the same cluster (Section 3.2). Finally, we apply agglomerative clustering where the cluster merging score is based on the predicted pairwise mention scores. Sections 3.3 and 3.4 detail the specifics of the inference and training procedures, respectively. Various implementation details are mentioned in Section 3.5.

Mention Representation
Given a mention m (entity or event), we compute a vector representation with the following features.
Span. We combine word-level and character-level features. We compute word-level representations using pre-trained word embeddings. For events, we take the embedding of the head word, while for entities we average over the mention's words. Character-level representations are complementary, and may help with out-of-vocabulary words and spelling variations. We compute them by encoding the span using a character-based LSTM (Hochreiter and Schmidhuber, 1997). The span vector s(m) is a concatenation of the word- and character-level vectors.
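As an illustration of the span vector described above, the following is a minimal sketch rather than the actual implementation. The function name `span_vector`, the dictionary-based embedding lookup, and the assumption that every word is in the vocabulary are hypothetical; the character-level vector is passed in as a precomputed list standing in for the output of the character-based LSTM.

```python
from typing import Dict, List, Sequence


def span_vector(
    words: Sequence[str],
    head_index: int,
    embeddings: Dict[str, List[float]],
    char_vector: Sequence[float],
    is_event: bool,
) -> List[float]:
    """Concatenate a word-level and a character-level vector into s(m).

    Assumption: every word has an entry in `embeddings`, and `char_vector`
    was already produced by some character-level encoder.
    """
    if is_event:
        # Events: use the embedding of the head word only.
        word_vec = list(embeddings[words[head_index]])
    else:
        # Entities: average the embeddings of all words in the mention.
        vecs = [embeddings[w] for w in words]
        word_vec = [sum(col) / len(vecs) for col in zip(*vecs)]
    return word_vec + list(char_vector)


toy_embeddings = {"Donna": [1.0, 0.0], "Strickland": [0.0, 1.0]}
entity_vec = span_vector(["Donna", "Strickland"], 1, toy_embeddings, [9.0], is_event=False)
event_vec = span_vector(["Donna", "Strickland"], 1, toy_embeddings, [9.0], is_event=True)
```

In this toy call the entity vector averages both word embeddings, while the event vector keeps only the head word's embedding; both end with the same character-level component.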
Context. The context surrounding a mention may indicate its compatibility with other candidate mentions (Clark and Manning, 2016; Lee et al., 2017; Kenyon-Dean et al., 2018). To model context, we use ELMo, contextual representations derived from a neural language model (Peters et al., 2018). ELMo has recently improved performance on several challenging NLP tasks, including within-document entity coreference resolution. We set the context vector c(m) to the contextual representation of m's head word, taking the average of the 3 ELMo layers.
Semantic dependency to other mentions. To model dependencies between event and entity clusters, we identify semantic role relationships between their mentions using a semantic role labeling (SRL) system.
For a given event mention $m_{v_i}$, we extract its arguments, focusing on 4 semantic roles of interest: Arg0, Arg1, location, and time. Consider a specific argument slot, e.g. Arg1. If the slot is filled with an entity mention $m_{e_j}$ which in the current configuration is assigned to an entity cluster $c$, we set the corresponding Arg1 vector to the averaged span vector of all the mentions in $c$: $\mathrm{Arg1}(m_{v_i}) = \frac{1}{|c|} \sum_{m \in c} s(m)$. The vector of semantically dependent mentions is the concatenation of the various argument vectors: $[\mathrm{Arg0}(m_{v_i}); \mathrm{Arg1}(m_{v_i}); \mathrm{loc}(m_{v_i}); \mathrm{time}(m_{v_i})]$. Symmetrically, we compute the argument vectors of an entity mention according to the events in which the entity mention plays a role.
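The argument vectors can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the names `dependency_vector` and `cluster_spans` are hypothetical, clusters are identified by string ids, and filling an empty slot with a zero vector is an assumption made here so that the output has a fixed size.

```python
from typing import Dict, List, Optional

ROLES = ("Arg0", "Arg1", "location", "time")


def average(vectors: List[List[float]]) -> List[float]:
    """Element-wise mean of equally sized vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def dependency_vector(
    arguments: Dict[str, Optional[str]],
    cluster_spans: Dict[str, List[List[float]]],
    dim: int,
) -> List[float]:
    """Concatenate one vector per role for a single mention.

    arguments:     role -> id of the cluster holding the slot filler (or None)
    cluster_spans: cluster id -> span vectors s(m) of all mentions in it
    dim:           dimensionality of a span vector
    """
    out: List[float] = []
    for role in ROLES:
        cluster_id = arguments.get(role)
        if cluster_id is None:
            # Assumption (not stated in the text above): empty slot -> zeros.
            out.extend([0.0] * dim)
        else:
            # Filled slot: average span vector over the filler's cluster.
            out.extend(average(cluster_spans[cluster_id]))
    return out


spans = {"e1": [[1.0, 3.0], [3.0, 5.0]]}
vec = dependency_vector({"Arg1": "e1"}, spans, dim=2)
```

Because the vector is recomputed from the current clusters, calling the function again after a merge yields an updated representation for the same mention.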
This representation allows our model to directly compute the similarity between two mentions while considering a rich distributed representation of the current coreference clusters of their related arguments or predicates. Lee et al. (2012), on the other hand, modeled the dependencies between event and entity clusters using only simple discrete features, indicating the number of coreferring arguments across clusters.

Mention-Pair Coreference Scorer
Figure 1 illustrates our pairwise mention scoring function $S(m_i, m_j)$ that returns a score denoting the likelihood that two mentions $m_i$ and $m_j$ are coreferring. We learn a separate function for entities ($S_E$) and for events ($S_V$); both are feed-forward neural networks trained in the same way. For the sake of simplicity, we describe them here as a single function $S(\cdot, \cdot)$. Following Lee et al. (2012), we enrich our mention-pair representation with four pairwise binary features $f(i, j)$, indicating whether the two mentions have coreferring arguments (or predicates) in a given role (Arg0, Arg1, location, and time). We encode each binary feature as a 50-dimensional embedding to increase its signal.
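The four pairwise binary features can be illustrated with a short sketch. The function name and the role-to-cluster-id dictionaries are hypothetical, and the subsequent lookup of each binary value in a 50-dimensional embedding table is omitted.

```python
from typing import Dict, List, Optional

ROLES = ("Arg0", "Arg1", "location", "time")


def pairwise_role_features(
    args_i: Dict[str, Optional[str]],
    args_j: Dict[str, Optional[str]],
) -> List[int]:
    """One binary feature per role.

    args_i / args_j map a role to the id of the cluster that currently holds
    the mention's argument (or predicate) in that role, or None if unfilled.
    A feature is 1 only when both slots are filled by the same cluster.
    """
    features = []
    for role in ROLES:
        ci, cj = args_i.get(role), args_j.get(role)
        features.append(int(ci is not None and ci == cj))
    return features


goes_to = {"Arg0": "prize", "Arg1": "strickland"}
awarded = {"Arg0": "prize", "Arg1": "someone_else", "time": None}
features = pairwise_role_features(goes_to, awarded)
```

Here only the Arg0 feature fires, since both mentions share the same Arg0 cluster while their Arg1 clusters differ and the remaining slots are unfilled.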
To train $S_E$ we take as training examples all pairs of entity mentions that belong to different entity clusters in the current predicted configuration $E_t$. The gold label for a given pair $(m_i, m_j)$ is set to 1 if they belong to the same gold cluster, and to 0 otherwise. We train it using binary cross entropy as the loss function. $S_V$ is trained symmetrically.
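The construction of training examples and the loss can be sketched in plain Python. This is a minimal sketch, assuming mentions are identified by strings and cluster assignments are given as dictionaries; the helper names are hypothetical, and the scorer itself is not shown.

```python
import math
from itertools import combinations
from typing import Dict, Hashable, List, Tuple


def build_training_pairs(
    predicted: Dict[str, Hashable],
    gold: Dict[str, Hashable],
) -> List[Tuple[str, str, int]]:
    """Pairs of mentions from different predicted clusters, with gold labels.

    predicted: mention id -> cluster id in the current predicted configuration
    gold:      mention id -> gold cluster id
    """
    pairs = []
    for mi, mj in combinations(sorted(predicted), 2):
        if predicted[mi] != predicted[mj]:
            pairs.append((mi, mj, int(gold[mi] == gold[mj])))
    return pairs


def binary_cross_entropy(p: float, y: int, eps: float = 1e-12) -> float:
    """Loss for one pair: p is the predicted score in (0, 1), y the gold label."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))


predicted = {"m1": 0, "m2": 0, "m3": 1, "m4": 2}
gold = {"m1": "A", "m2": "A", "m3": "A", "m4": "B"}
pairs = build_training_pairs(predicted, gold)
```

The pair (m1, m2) is skipped because both mentions already share a predicted cluster, so the set of examples shrinks as clusters are merged.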

Inference
Figure 2 describes our model step-by-step: the left part is the training procedure, while the right part is the inference procedure. The differences between the two procedures are highlighted. We first focus on the inference procedure (right), which gets as input the document set $D$, the pairwise mention scorers $S_E$ and $S_V$, and the gold standard mentions. The algorithm operates over each topic separately. To that end, we start by applying document clustering using the K-Means algorithm, yielding a set of topics $T$. For a given topic $t$, the algorithm uses the gold entity and event mentions to build initial clusters. Event clusters $V_t$ are initialized to singletons (line 2). Similarly to Lee et al. (2012), entity clusters $E_t$ are initialized to the output of a within-document entity coreference resolution system (line 3). Our iterative algorithm alternates between entity and event clustering, incrementally constructing the final clustering configuration (lines 4-12).
When the algorithm focuses on entities, it starts with updating the entity representations according to the event clusters in the current configuration, $V_t$ (line 6). This update includes the recreation of argument vectors for each entity mention, as described in Section 3.1. We use agglomerative clustering that greedily merges multiple cluster pairs with the highest cluster-pair scores (line 8) until the scores are below a pre-defined threshold $\delta_2$. The algorithm starts with high-precision merges, leaving less precise decisions to a later stage, when more information becomes available. We define the cluster-pair score as the average mention linkage score: $S_{cp}(c_i, c_j) = \frac{1}{|c_i| \cdot |c_j|} \sum_{m_i \in c_i} \sum_{m_j \in c_j} S(m_i, m_j)$. The same steps are repeated for events (lines 10-12), and the two stages alternate until no merges are available or a predefined number of iterations is reached (line 4).
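The average-linkage merging step can be sketched as follows. This is a simplified illustration rather than the procedure in Figure 2: it merges a single best pair per step instead of multiple pairs, and the scorer is passed in as an arbitrary function. The names `cluster_pair_score` and `agglomerative_merge` are hypothetical.

```python
from itertools import combinations
from typing import Callable, FrozenSet, Iterable, List


def cluster_pair_score(ci, cj, score: Callable[[str, str], float]) -> float:
    """Average pairwise mention score between two clusters."""
    total = sum(score(mi, mj) for mi in ci for mj in cj)
    return total / (len(ci) * len(cj))


def agglomerative_merge(
    clusters: Iterable[Iterable[str]],
    score: Callable[[str, str], float],
    threshold: float,
) -> List[FrozenSet[str]]:
    """Greedily merge the best-scoring pair until it falls below threshold.

    Simplification: one merge per step (the text describes merging multiple
    cluster pairs). Input clusters are assumed disjoint and non-empty.
    """
    current = [frozenset(c) for c in clusters]
    while len(current) > 1:
        best_score, (a, b) = max(
            ((cluster_pair_score(a, b, score), (a, b))
             for a, b in combinations(current, 2)),
            key=lambda item: item[0],
        )
        if best_score < threshold:
            break
        current = [c for c in current if c not in (a, b)] + [a | b]
    return current


def toy_score(mi: str, mj: str) -> float:
    # Hypothetical scorer: mentions sharing a first letter are likely coreferent.
    return 0.9 if mi[0] == mj[0] else 0.1


merged = agglomerative_merge([{"a1"}, {"a2"}, {"b1"}], toy_score, threshold=0.5)
```

With the toy scorer, the two "a" mentions are merged first, and the remaining pair scores 0.1 on average, which is below the threshold, so merging stops.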

Training
The training steps are similarly described in the left part of Figure 2. At each iteration, we train two updated scorer functions $S_E$ (line 7) and $S_V$ (line 11). Since our representation requires a clustering configuration, we use a training procedure that simulates the inference step. The training examples for each scorer change between iterations based on the cluster-pair merges that occurred in previous iterations. This allows our model to be trained on various predicted clustering configurations that are gradually improved during the training.
[Figure 2, right: Algorithm 2 (Inference). Require: D: document set; M_e, M_v: gold entity/event mentions; S_E(·, ·): pairwise entity mention scorer; S_V(·, ·): pairwise event mention scorer.]
The training procedure differs from the inference procedure by using the gold standard topic clusters and by initializing the entity clusters with the gold standard within-document coreference clusters. We do so in order to reduce the noise during training.

Implementation Details
Our model is implemented in PyTorch (Paszke et al., 2017), using the ADAM optimizer (Kingma and Ba, 2014) with a minibatch size of 16. We initialize the word-level representations to the pre-trained 300-dimensional GloVe word embeddings (Pennington et al., 2014), and keep them fixed during training. The character representations are learned using an LSTM with hidden size 50. We initialize them with pre-trained character embeddings. Each scorer consists of a sigmoid output layer and two hidden layers with 4261 neurons each, activated by the ReLU function (Nair and Hinton, 2010).
We set the merging threshold in the training step to $\delta_1 = 0.5$. We tune the threshold for the inference step on the validation set to $\delta_2 = 0.5$. To cluster documents into topics at inference time, we use the K-Means algorithm implemented in Scikit-Learn (Pedregosa et al., 2011). Documents are represented using TF-IDF scores of unigrams, bigrams, and trigrams, excluding stop words. We set K = 20 based on the Silhouette Coefficient method (Rousseeuw, 1987), which successfully reconstructs the number of test sub-topics. During inference, we use Stanford CoreNLP to initialize within-document entity coreference clusters.

Experimental Setup
We use the ECB+ corpus, which is the largest dataset consisting of within- and cross-document coreference annotations for entities and events. We follow the setup of Cybulska and Vossen (2015b), which was also employed by Kenyon-Dean et al. (2018). This setup uses a subset of the annotations which has been validated for correctness by Cybulska and Vossen (2014) and allocates a larger portion of the dataset for training (see Table 1). Since the ECB+ corpus only annotates a part of the mentions, the setup uses the gold-standard event and entity mentions, and therefore does not require specific treatment for unannotated mentions during evaluation.
A different setup was carried out by Yang et al. (2015) and Choubey and Huang (2017). They used the full ECB+ corpus, including parts with known annotation errors. At test time, they rely on the output of a mention extraction tool (Yang et al., 2015). To address the partial annotation of the corpus, they only evaluated their systems on the subset of predicted mentions which were also gold mentions. Finally, their evaluation setup was criticized by Upadhyay et al. (2016) for ignoring singletons (clusters with a single mention), effectively making the task simpler; and for evaluating each sub-topic separately, which entails ignoring incorrect coreference links across sub-topics.

Baselines
We compare our full model to published results on ECB+, available for event coreference only, as well as to a disjoint variant of our model and a deterministic lemma baseline.
CLUSTER+LEMMA. We first cluster the documents to topics (Section 3.3), and then group mentions within the same document cluster which share the same head lemma. This baseline differs from the lemma baseline of Kenyon-Dean et al. (2018) which is applied across topics.
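The CLUSTER+LEMMA baseline is simple enough to sketch in a few lines. The sketch below assumes that each mention already comes with a predicted document-cluster id and a head lemma; the tuple layout and the function name are hypothetical, and lowercasing the lemma is an assumption made here for robustness.

```python
from collections import defaultdict
from typing import Hashable, Iterable, List, Tuple


def cluster_lemma_baseline(
    mentions: Iterable[Tuple[str, Hashable, str]],
) -> List[List[str]]:
    """Group mentions that share a document cluster and a head lemma.

    mentions: (mention id, document-cluster id, head lemma) triples.
    """
    groups = defaultdict(list)
    for mention_id, topic_id, head_lemma in mentions:
        # Assumption: lemmas are compared case-insensitively.
        groups[(topic_id, head_lemma.lower())].append(mention_id)
    return list(groups.values())


toy_mentions = [
    ("m1", 0, "award"),
    ("m2", 0, "award"),
    ("m3", 1, "award"),  # same lemma, different document cluster
    ("m4", 0, "go"),
]
baseline_clusters = cluster_lemma_baseline(toy_mentions)
```

In the toy input, m3 stays separate from m1 and m2 despite the shared lemma, because it belongs to a different document cluster.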
CV (Cybulska and Vossen, 2015a) is a supervised method for event coreference, based on discrete features. They first cluster documents to topics, and then cluster coreferring mentions within each topic cluster. Events are represented using information about participants, time and location, while documents are represented as "bag-of-events". We compare to their best reported results, differing from the CV baseline in Kenyon-Dean et al. (2018) which refers to the partial model that uses the same annotations in terms of subcomponents of the event structure.
KCP (Kenyon-Dean et al., 2018) is a neural network-based model for event coreference. They encode an event mention and its context into a vector and use it to cluster mentions. The model does not cluster documents to topics as a pre-processing step, but instead encodes the document as part of the mention representation, aiming to avoid spurious cross-topic coreference links thanks to distant document representations.
CLUSTER+KCP. To tease apart the contribution of our document clustering component from that of the rest of the model, we add a variant of the KCP model which relies on our document clustering component as a pre-processing step. During inference, we restrict their model to clustering mentions only within the same document cluster. Accordingly, we re-trained their model using the gold document clusters for hyper-parameter tuning to fit this cluster-based setting.

DISJOINT. A variant of our model which uses only the span and context vectors to build mention pair representations, ablating joint features.
We do not compare our work directly to Lee et al. (2012) since it was evaluated on a different corpus and using a different evaluation setup. Instead, we compare to CV and KCP, more recent models which reported their results on the ECB+ dataset.
With respect to entity coreference, to the best of our knowledge, our work is the first to publish entity coreference results on the ECB+ dataset. We therefore only compare our performance to that of the lemma baseline and our disjoint model. Table 2 presents the performance of our method with respect to entity coreference. Our joint model improves upon the strong lemma baseline by 3.8 points in CoNLL F1 score. Table 3 presents the results on event coreference. Our joint model outperforms all the baselines with a gap of 10.5 CoNLL F1 points from the last published results (KCP), while surpassing our strong lemma baseline by 3 points.

Results
The results reconfirm that the lemma baseline, when combined with effective topic clustering, is a strong baseline for CD event coreference resolution on the ECB+ corpus (Upadhyay et al., 2016). In fact, thanks to our near-perfect topic clustering on the ECB+ test set (Homogeneity: 0.985, Completeness: 0.982, V-measure: 0.984, Adjusted Rand-Index: 0.965), the CLUSTER+LEMMA baseline surpasses prior results on ECB+.
The results of CLUSTER+KCP again indicate that pre-clustering of documents to topics is beneficial, improving upon the KCP performance by 4.6 points, though still performing substantially worse than our joint model.
To test the contribution of joint modeling, we compare our joint model to its disjoint variant. We observe that the joint model performs better on both event and entity coreference. The performance gap is modest but significant with bootstrapping and permutation tests (p < 0.001).
We further ablate additional components from the full representation (Table 4). We show that each of our representation components contributes to performance, but the continuous vector components representing semantic dependency to other mentions are stronger than the pairwise binary features originally used by Lee et al. (2012).
For error analysis, we examined event and entity mentions that were clustered incorrectly, i.e. where their predicted cluster contained at least 70% of mentions that are not in their gold cluster. Figure 3 shows a pie chart for each mention type, manually categorized to error types, suggesting future areas for improvement. For both entities and events, mentions were often clustered incorrectly with other mentions that share the same head lemma. Errors in the extraction of the predicate-argument structures accounted for 12% of the errors in events and 4% for entities, e.g. marking dozens as the Arg0 of devastated in "dozens in a region devastated by the quake".
The joint features caused 10% of the event errors and 2% of the entity errors, where two non-coreferring event mentions were clustered to the same event cluster based on their entity arguments that were incorrectly predicted as coreferring, and vice versa. For example, the event shakes in "earthquake shakes Lake County" and "earthquake shakes Northern California" was affected by the wrong coreference clustering of "Lake County" and "Northern California".
We also found mentions that were wrongly clustered together based on contextual similarity (24% for entities, 4% for events) as well as some annotation errors (12% and 4%). The within-document entity coreference system caused an additional 6% of entity errors. Finally, 22% of the event errors were caused by event mentions sharing coreferring arguments. This may happen for instance when similar events occur at different times ("The earthquake struck at about 9:30 a.m. and had a depth of 2.7 miles, according to the USGS." vs. "The earthquake struck at about 7:30 a.m. and had a depth of 1.4 miles, according to the USGS.").
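The 70% criterion used above to select incorrectly clustered mentions can be written as a small check. This is a sketch under one interpretation: the fraction is computed over all mentions of the predicted cluster, including the mention under inspection, which is an assumption. The function name is hypothetical.

```python
from typing import Set


def is_clustered_incorrectly(
    predicted_cluster: Set[str],
    gold_cluster: Set[str],
    min_wrong_fraction: float = 0.7,
) -> bool:
    """True if enough of the predicted cluster lies outside the gold cluster.

    predicted_cluster: mentions in the predicted cluster of the inspected mention
    gold_cluster:      mentions in that mention's gold cluster
    """
    wrong = sum(1 for m in predicted_cluster if m not in gold_cluster)
    return wrong / len(predicted_cluster) >= min_wrong_fraction


gold = {"m1", "m2"}
mostly_wrong = is_clustered_incorrectly({"m1", "x1", "x2", "x3"}, gold)
mostly_right = is_clustered_incorrectly({"m1", "m2", "x1"}, gold)
```

In the first call three of four predicted-cluster members are outside the gold cluster (75%), so the mention qualifies; in the second only one of three is, so it does not.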

Mention Representation Components
To understand the contribution of each component in the mention representation to the clustering, we visualize them. We focus on events, and sample 7 gold clusters from the test set that have at least 5 mentions each. We then compute t-SNE projections (Maaten and Hinton, 2008) of the full mention representation, only the context vector, and only the semantically-dependent mentions vector (top, middle, and bottom parts of Figure 4). In all 3 graphs, each point refers to an event mention and its color represents the mention's gold cluster. The full mention representations (top) yield visibly better clusters, but the context vectors (middle) are also quite accurate, emphasizing the importance of modeling context for resolving coreference. The semantically-dependent mentions vectors (bottom) are less accurate on their own, yet they manage to separate some clusters well even without access to the mention span itself, based only on the predicate-argument structures.

Conclusion
We presented a neural approach for resolving cross-document event and entity coreference. We represent a mention using its text and context and, inspired by the joint model of Lee et al. (2012), we make an event mention representation aware of coreference clusters of entity mentions to which it is related via predicate-argument structures, and vice versa. Our model achieves state-of-the-art results, outperforming previous models by 10.5 CoNLL F1 points on events, and providing the first cross-document entity coreference results on ECB+. Future directions include investigating ways to minimize the pipeline errors from the extraction of predicate-argument structures, and incorporating a mention prediction component, rather than relying on gold mentions.