WEC: Deriving a Large-scale Cross-document Event Coreference dataset from Wikipedia

Cross-document event coreference resolution is a foundational task for NLP applications involving multi-text processing. However, existing corpora for this task are scarce and relatively small, while annotating only modest-size clusters of documents belonging to the same topic. To complement these resources and enhance future research, we present Wikipedia Event Coreference (WEC), an efficient methodology for gathering a large-scale dataset for cross-document event coreference from Wikipedia, where coreference links are not restricted within predefined topics. We apply this methodology to the English Wikipedia and extract our large-scale WEC-Eng dataset. Notably, our dataset creation method is generic and can be applied with relatively little effort to other Wikipedia languages. To set baseline results, we develop an algorithm that adapts components of state-of-the-art models for within-document coreference resolution to the cross-document setting. Our model is suitably efficient and outperforms previously published state-of-the-art results for the task.


Introduction
Cross-Document (CD) Event Coreference resolution is the task of identifying clusters of text mentions, across multiple texts, that refer to the same event. Successful identification of such coreferring mentions is beneficial for a broad range of applications at the multi-text level, which are gaining increasing interest and need to match and integrate information across documents, such as multi-document summarization (Falke et al., 2017; Liao et al., 2018), multi-hop question answering (Dhingra et al., 2018; Wang et al., 2019) and Knowledge Base Population (KBP) (Lin et al., 2020).
Unfortunately, rather few datasets of reasonable scale exist for CD event coreference. Notable datasets include ECB+ (Cybulska and Vossen, 2014), MEANTIME (Minard et al., 2016) and the Gun Violence Corpus (GVC) (Vossen et al., 2018) (described in Section 2), where recent work has been evaluated solely on ECB+. When addressed in a direct manner, manual CD coreference annotation is very hard due to its worst-case quadratic complexity, where each mention may need to be compared to all other mentions in all documents. Indeed, ECB+ contains fewer than 7,000 event mentions in total (train, dev, and test sets). Further, effective corpora for CD event coreference are available mostly for English, limiting research opportunities for other languages. Partly as a result of this data scarcity, rather little effort has been invested in this field in recent years, compared to the dramatic recent progress in modeling within-document coreference.
Furthermore, most existing cross-document coreference datasets are restricted in their scope by two inter-related characteristics. First, these datasets annotate sets of documents, where the documents in each set all describe the same topic, mostly a news event (consider the Malaysia Airlines crash as an example). While such topic-focused document sets guarantee a high density of coreferring event mentions, facilitating annotation, in practical settings the same event might be mentioned across an entire corpus, being referred to in documents of varied topics. Second, we interestingly observed that event mentions may be (softly) classified into two different types. One type, which we term a descriptive mention, pertains to a mention involved in presenting the event or describing new information about it. For example, news about the Malaysian Airline crash will include mostly descriptive mentions of the event and its sub-events, such as shot-down, crashed and investigated. Naturally, news documents about a topic, as in prior event coreference datasets, include mostly descriptive event mentions. The other type, which we term a referential mention, pertains to mentions of the event in sentences that do not focus on presenting new information about the event but rather mention it as a point of reference. For example, mentions referring to the airplane crash, such as the Malaysian plane crash, Flight MH17 or disaster may appear in documents about the war in Donbass or about flight safety. Since referential event mentions are spread across an entire corpus, they are less trivial to identify for coreference annotation, and are mostly missing in current news-based datasets. As we demonstrate later, these two mention types exhibit different lexical distributions and seem to require corresponding training data to be properly modeled.
In this paper, we present the Wikipedia Event Coreference methodology (WEC), an efficient method for automatically gathering a large-scale dataset for the cross-document event coreference task. Our methodology effectively complements current datasets in the above-mentioned respects: data annotation is boosted by leveraging available information in Wikipedia, practically applicable for any Wikipedia language; mentions are gathered across the entire Wikipedia corpus, yielding a dataset that is not partitioned by topics; and finally, our dataset consists mostly of referential event mentions.
In its essence, our methodology leverages the coreference relation that often holds between anchor texts of hyperlinks pointing to the same Wikipedia article (see Figure 1), similar to the basic idea introduced in the Wikilinks dataset (Singh et al., 2012). Focusing on CD event coreference, we identify and target only Wikipedia articles denoting events. Anchor texts pointing to the same event article, along with some surrounding context, become candidate mentions for a corresponding event coreference cluster, undergoing extensive filtering. We apply our method to the English Wikipedia and extract WEC-Eng, our English version of a WEC dataset. The automatically extracted data that we collected provides a training set of a very large scale compared to prior work, while our development and test sets underwent relatively fast manual validation.
Due to the large scale of the WEC-Eng training data, current state-of-the-art CD coreference models cannot be easily trained and evaluated on it, for scalability reasons. We therefore developed a new, more scalable, baseline model for the task, while adapting components of recent competitive within-document coreference models (Lee et al., 2017; Kantor and Globerson, 2019; Joshi et al., 2019). In addition to setting baseline results for WEC-Eng, we assess our model's competitiveness by presenting a new state-of-the-art on the commonly used ECB+ dataset. Finally, we propose that our automatic extraction and manual validation methods may be applied to generate additional annotated datasets, particularly for other languages. Overall, we suggest that future cross-document coreference models should be evaluated also on the WEC-Eng dataset, and address its complementary characteristics, while the WEC methodology may be efficiently applied to create additional datasets. To that end, our dataset and code are released for open access.

Related Datasets
This section describes the main characteristics of notable datasets for CD event coreference (ECB+, MEANTIME, GVC). Table 1 presents statistics for all these datasets, as well as ours. We further refer to the Wikilinks dataset, which also leveraged Wikipedia links for CD coreference detection.

CD Event Corpora
ECB+ This dataset (Cybulska and Vossen, 2014), which is an extended version of the EventCorefBank (ECB) (Bejan and Harabagiu, 2010), is the most commonly used dataset for training and testing models for CD event coreference (Choubey and Huang, 2017; Kenyon-Dean et al., 2018; Barhom et al., 2019). This corpus consists of documents partitioned into 43 clusters, each corresponding to a certain news topic. In order to introduce some ambiguity and to limit the use of lexical features, each topic is composed of documents describing two different events (called sub-topics) of the same event type (e.g. two different celebrities checking into rehab facilities). Nonetheless, as can be seen in Table 1, the ambiguity level obtained is still rather low. ECB+ is relatively small, where on average only 1.9 sentences per document were selected for annotation, yielding only 722 non-singleton coreference clusters in total.

MEANTIME Minard et al. (2016) proposed a dataset that is similar in some respects to ECB+, with documents partitioned into a set of topics. The different topics do not correspond to a specific news event but rather to a broad topic of interest (e.g. Apple, stock market). Consequently, different documents rarely share coreferring event mentions, resulting in only 11 event coreference clusters that include mentions from multiple documents, making this dataset less relevant for training CD coreference models.
Gun Violence Corpus (GVC) This dataset (Vossen et al., 2018) was triggered by the same motivation that drove us, of overcoming the huge complexity of direct manual annotation of CD event coreference from scratch. To create the dataset, the authors leveraged a structured database recording gun violence events, in which the record for an individual event points at documents describing that event. The annotators were then asked to examine the linked documents and mark in them mentions of 5 gun-violence event classes (firing a gun, missing, hitting, injuring, death). Considering the recorded event as a pivot, all mentions found for a particular class were considered as coreferring. Using this process, they report an annotation rate of about 190 mentions per hour. As this corpus assumes a specific event structure scheme related to gun violence, it is more suitable for studying event coreference within a narrow domain rather than for investigating models for broad coverage event coreference.

Wikilinks
Wikilinks (Singh et al., 2012) is an automatically collected large-scale cross-document coreference dataset, focused on entity coreference. It was constructed by crawling a large portion of the web and collecting as mentions hyperlinks pointing to Wikipedia articles. Since their method does not include mention distillation or validation, it was mostly used for training models for the Entity Linking task, particularly in noisy texts (Chisholm and Hachey, 2015; Eshel et al., 2017).

The WEC Methodology and Dataset
We now describe our methodology for gathering a CD event coreference dataset from Wikipedia, and the WEC-Eng dataset created by applying it to the English Wikipedia. We also describe how this methodology can be applied, with some language-specific adjustments, to other Wikipedia languages.

Dataset Structure
Our data is collected by clustering together anchor texts of (internal) Wikipedia links pointing to the same Wikipedia concept. This is generally justified since all these links refer to the same real-world theme described by that article, as illustrated in Figure 1. Accordingly, our dataset consists of a set of mentions, each including the mention span corresponding to the link anchor text, the surrounding context, and the mention cluster ID. Since Wikipedia is not partitioned into predefined topics, mentions can corefer across the entire corpus (unlike most prior datasets). Since mention annotation is not exhaustive, coreference resolution is performed over the gold mentions. Thus, our goal is to support the development of CD event coreference algorithms, rather than of mention extraction algorithms. Our dataset also includes metadata information, such as source and target URLs for the links, but these are not part of the data to be considered by algorithms, as our goal in this work is CD coreference development rather than Event Linking (Nothman et al., 2012).
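To make this structure concrete, a single mention record might be sketched as follows (a minimal illustration; the field names are ours, not the exact WEC-Eng schema):

```python
from dataclasses import dataclass, field

@dataclass
class EventMention:
    """One coreference mention: an anchor text plus its surrounding context."""
    mention_id: int
    tokens: list          # tokens of the surrounding paragraph (context)
    span_start: int       # index of first mention token within `tokens`
    span_end: int         # index of last mention token (inclusive)
    cluster_id: int       # ID of the pivot event article (= coreference cluster)
    metadata: dict = field(default_factory=dict)  # e.g. source/target URLs

    def span_text(self):
        return " ".join(self.tokens[self.span_start:self.span_end + 1])

m = EventMention(
    mention_id=1,
    tokens="The inquiry into the Malaysian plane crash continued".split(),
    span_start=4, span_end=6,
    cluster_id=42,
    metadata={"target": "Malaysia_Airlines_Flight_17"},
)
print(m.span_text())  # → "Malaysian plane crash"
```

Note that the metadata field is kept separate, reflecting that algorithms should not rely on it.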

Data Collection Process
In this paper, we focus on deriving from Wikipedia an event coreference dataset. The choice to focus on event coreference was motivated by two observations: (1) coreference resolution for Wikipedia anchor texts would be more challenging for event mentions than for entity mentions, since the former exhibits much higher degrees of both ambiguity and lexical diversity, and (2) event structures, with their arguments (such as participants, location and time) available in the surrounding context, would facilitate a more natural dataset for the corpus-wide CD coreference task, compared to Wikipedia entity mentions which are comprised mostly of named entities.
Accordingly, we seek to consider only Wikipedia pages denoting events, then collect hyperlinks pointing at these pages. All anchor texts pointing to the same event then become the mentions of a corresponding event coreference cluster, and are extracted along with their surrounding paragraph as context (see Table 2). The following paragraphs describe this process in detail, and how it was applied to generate the WEC-Eng dataset from English Wikipedia.
Event Identification Many Wikipedia articles contain an infobox element. This element can be selected by a Wikipedia author from a pre-defined list of possible infobox types (e.g. "Civilian Attack", "Game", "Scientist", etc.), each capturing typical information fields for that article type. For example, the "Scientist" infobox type consists of fields such as "birth date", "awards", "thesis", etc. We leverage the infobox element and its parameters in order to identify articles describing events (e.g. accident, disaster, conflict, ceremony, etc.) rather than entities (e.g. a person, organization, etc.).
To that end, we start by automatically compiling a list of all Wikipedia infobox types that are associated with at least dozens of Wikipedia articles. Of those, we manually identify all infobox types related to events (WEC-Eng examples include Awards, Meetings, Civilian Attack, Earthquake, Contest, Concert and more). We then (manually) exclude infobox types that are frequently linked from related but non-coreferring mentions, such as sub-events or event characteristics, like location and time (see Appendix A.1 for further details). For WEC-Eng, we ended up with 28 English Wikipedia event infobox types (see Appendix A.2 for the full list).
Gathering Initial Dataset Once the infobox event list is determined, we apply a fully automatic pipeline to obtain an initial crude version of the dataset. This pipeline consists of: (1) collecting all Wikipedia articles (event "pivot" pages) whose infobox type is in our list; (2) collecting all Wikipedia anchor texts ("mentions") pointing to one of the pivot pages, along with their surrounding paragraph; (3) filtering out mentions that lack context or belong to Wikipedia metadata, such as tables, images, lists, etc., as well as mentions whose surrounding context contains obvious Wikipedia boilerplate code (i.e. HTML and JSON tags); and (4) finally, clustering all collected mentions according to the pivot page at which they point.
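The four pipeline steps above can be sketched roughly as follows, over a hypothetical in-memory dump mapping article titles to their infobox type and wikitext. This is only an illustration under simplified assumptions (the link regex and the crude boilerplate filter are not our production code):

```python
import re

# Assumed event-infobox list (a tiny subset, for illustration)
EVENT_INFOBOX_TYPES = {"civilian attack", "earthquake", "award"}

def collect_pivot_pages(articles):
    """Step (1): keep articles whose infobox type is on the event list."""
    return {title for title, (infobox, _) in articles.items()
            if infobox and infobox.lower() in EVENT_INFOBOX_TYPES}

# Matches wikitext links of the form [[Target]] or [[Target|anchor]]
LINK_RE = re.compile(r"\[\[([^|\]]+)\|?([^\]]*)\]\]")

def collect_mentions(articles, pivots):
    """Steps (2)-(4): gather anchors pointing at pivots, cluster by target."""
    clusters = {}
    for _title, (_, text) in articles.items():
        for para in text.split("\n\n"):
            # Step (3): a crude metadata/boilerplate filter
            if para.startswith("{|") or "<table" in para:
                continue
            for m in LINK_RE.finditer(para):
                target = m.group(1).strip()
                anchor = (m.group(2) or target).strip()
                if target in pivots:
                    clusters.setdefault(target, []).append((anchor, para))
    return clusters

articles = {
    "2014 crash": ("Civilian Attack", "A pivot (event) article."),
    "War in X": (None, "During the fighting, [[2014 crash|the disaster]] occurred."),
}
pivots = collect_pivot_pages(articles)
clusters = collect_mentions(articles, pivots)
print(clusters["2014 crash"][0][0])  # → "the disaster"
```

Each cluster entry keeps the anchor text as the mention span and the paragraph as its context, mirroring the dataset structure described above.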
Mention-level Filtering An event coreference dataset mined this way may still require some refinement in order to further clean the dataset at the individual mention level. Indeed, we observed that many Wikipedia editors have a tendency to position event hyperlinks on an event argument, such as a Named Entity (NE) related to the event date or location (as in the case of the disqualified mention for cluster 3 in Table 3). To automatically filter out many of the cases where the hyperlink is placed on an event argument instead of on the event mention itself, we use a Named Entity tagger and filter out mentions identified by one of the following labels: PERSON, GPE, LOC, DATE and NORP (for WEC-Eng we used the SpaCy Named Entity tagger (Honnibal and Montani, 2017)).
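This argument filter can be sketched as follows. For illustration we decouple the filter from any particular tagger; the commented-out lines show how a spaCy pipeline (as used for WEC-Eng) could be plugged in, while the toy tagger below is purely hypothetical:

```python
# Entity labels indicating an event argument rather than the event itself
FILTERED_LABELS = {"PERSON", "GPE", "LOC", "DATE", "NORP"}

def keep_mention(mention_text, ner_label_fn):
    """Drop a candidate mention if the tagger labels its span as one of the
    argument-like entity types (person, place, date, ...)."""
    return ner_label_fn(mention_text) not in FILTERED_LABELS

# With spaCy (assumed setup; requires the `en_core_web_sm` model):
#   import spacy
#   nlp = spacy.load("en_core_web_sm")
#   def spacy_label(text):
#       ents = nlp(text).ents
#       return ents[0].label_ if ents else None

# A toy tagger, for illustration only:
toy_tagger = {"June 2014": "DATE", "Kuala Lumpur": "GPE"}.get
assert not keep_mention("June 2014", toy_tagger)    # date → filtered out
assert keep_mention("the plane crash", toy_tagger)  # event phrase → kept
```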
Controlling Lexical Diversity So far, we addressed the need to avoid having invalid mentions in a cluster, which do not actually refer to the linked pivot event.
Next, we would like to ensure a reasonably balanced lexical distribution of the mentions within each cluster. Ideally, it would be desirable to preserve the "natural" data distribution as much as possible. However, we observed that the anchor texts of Wikipedia hyperlinks to an event are often lexically unbalanced. Indeed, Wikipedia authors seem to have a strong bias toward using the pivot article title when creating hyperlinks pointing at that article, while additional ways by which the event can be referred to are less frequently hyperlinked. Consequently, preserving the original distribution of hyperlink terms would create too low a level of lexical diversity. As a result, a model trained on such a dataset might overfit to identifying only the most common mention phrases, leaving little room for identifying the less frequent ones.
To avoid this, we applied a simple filter that allows a maximum of 4 mentions having identical strings in a given cluster. This hyperparameter was tuned to make the lexical repetition level in our clusters more similar to that of ECB+, in which lexical diversity was not controlled (resulting in an average of 1.9 same-string mentions per cluster in the WEC-Eng train set, compared to 1.3 in ECB+).
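A minimal sketch of this cluster-level filter (the mention strings below are illustrative):

```python
from collections import defaultdict

def cap_identical_mentions(cluster_mentions, max_identical=4):
    """Keep at most `max_identical` mentions with the same (lowercased)
    string in a cluster, preserving the original order."""
    seen = defaultdict(int)
    kept = []
    for mention in cluster_mentions:
        key = mention.lower()
        if seen[key] < max_identical:
            seen[key] += 1
            kept.append(mention)
    return kept

cluster = ["MH17"] * 6 + ["the disaster", "Malaysian plane crash"]
# Only 4 copies of "MH17" survive; the rarer strings are all kept
print(cap_identical_mentions(cluster))
```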
Using this process we automatically generated a large dataset. We designated the majority of the automatically generated data to serve as the WEC-Eng training set. The remaining data was left for the development and test sets, which underwent a manual validation phase, as described next.

Manual Validation
Inevitably, some noise would still exist in the automatically derived dataset just described. While partially noisy training data can be effective, as we show later, and is legitimate to use, the development set, and particularly the test set, should be of high quality to allow for proper evaluation. To that end, we manually validated the mentions in the development and test sets.
For CD coreference evaluation, we expect to include a mention as part of a coreferring cluster only if it is clear, at least from reading the given surrounding context, that this mention indeed refers to the linked pivot event. Otherwise, we cannot expect a system to properly detect coreference for that mention. Such cases occasionally occur in Wikipedia, where the disambiguating context is missing and the text relies on the provided hyperlink instead (see cluster-1 in Table 3, where the tournament year is not mentioned). Such mentions are filtered out by the manual validation. Additionally, misplaced mention boundaries that do not include the correct event trigger (Table 3, clusters 2 and 3), as well as mentions of subevents of the linked event (Table 3, cluster-4), are filtered.
Summing up, to filter out these cases, we used a strict and easy-to-judge manual validation criterion, where a mention is considered valid only if: (1) the mention boundaries contain the event trigger phrase; (2) the mention's surrounding paragraph suffices to verify that this mention refers to the pivot page and thus belongs to its coreference cluster; and (3) the mention does not represent a subevent of the referenced event. Table 3 shows examples of validated vs. disqualified mentions judged for the WEC-Eng development set.
For the WEC-Eng validation, we randomly selected 588 clusters and validated them, yielding 1,250 and 1,893 mentions for the development and test sets, respectively. Table 1 presents further statistics for WEC-Eng. The validation was performed by a competent native English speaker, to whom we explained the guidelines, after making a practice session over 150 mentions. Finally, all training mentions that appeared in the same (source) article with a validated mention were discarded from the training set.
Our manual validation method is much faster and cheaper compared to a full manual coreference annotation process, where annotators would need to identify and compare all mentions across all documents. In practice, the average annotation rate for WEC-Eng yielded 350 valid mentions per hour, with the entire process taking only 9 hours to complete. In addition, since our validation approach is quite simple and does not require linguistic expertise, the eventual data quality is likely to be high. To assess the validation quality, one of the authors validated 50 coreference clusters (311 mentions), randomly selected from the development and test sets, and then carefully consolidated these annotations with the original validation judgements by the annotator. Relative to this reliable consolidated annotation, the original annotations scored at 0.95 Precision and 0.96 Recall, indicating the high quality of our validated dataset (the Cohen's Kappa (Cohen, 1960) between the original and consolidated annotations was 0.75, considered substantial agreement).
In all, 83% of the candidate mentions were positively validated in the development and test sets, indicating a rough estimation of the noise level in the training set. That being said, we note that a majority of these noisy mentions were not totally wrong mentions but rather were filtered out due to the absence of substantial surrounding context or the misplacement of mention boundaries (see examples in Table 3).

Dataset Content
This section describes the WEC-Eng dataset content and some of its characteristics. The final WEC-Eng dataset statistics are presented in Table 1. Notably, the training set includes 40,529 mentions distributed into 7,042 coreference clusters, facilitating the training of deep learning models.
The relatively high level of lexical ambiguity shown in the table is an inherent characteristic caused by many events (coreference clusters) sharing the same event type, and thus sharing the same terms, as illustrated in Table 2. Identifying that identical or lexically similar mentions refer to different events is one of the major challenges for CD coreference resolution, particularly in the corpus-wide setting, where different documents might refer to similar yet different events.
With respect to the distinction between descriptive and referential event mentions, proposed in Section 1, WEC-Eng mentions are predominantly referential. This stems from the fact that its mentions correspond to hyperlinks that point at a different Wikipedia article describing the event, while the mention's own article discusses a different topic. On the other hand, ECB+, being a news dataset, is expected to include predominantly descriptive mentions. Indeed, manually analyzing a sample of 30 mentions from each dataset, in WEC-Eng 26 were referential while in ECB+ 28 were descriptive. This difference also imposes different lexical distributions for mentions in the two datasets, as sampled in Appendix A.3. When describing an event, verbs are more frequently used as event mentions, but nominal mentions are abundant as well. This is apparent for the predominantly descriptive ECB+, where 62% of the mentions in our sample are verbal vs. 38% nominal. On the other hand, when a previously known event is only referenced, it is mostly referred to by a nominal mention. Indeed, in the predominantly referential WEC-Eng, a vast majority of the mentions are nominal (93% in our sample).

Potential Language Adaptation
While our process was applied to the English Wikipedia, it can be adapted with relatively few adjustments and resources to other languages for which a large-scale Wikipedia exists. Here we summarize the steps needed to apply our dataset creation methodology to other Wikipedia languages. To generate a dataset, the first step consists of manually deciding on the list of suitable infobox types corresponding to (non-noisy) event types. Then, the automatic corpus creation process can be applied for this list, which takes only a few hours to run on a single CPU. Once the initial dataset has been created, a language-specific named-entity tagger should be used to filter mentions of certain types, like time and location (see Mention-level Filtering (3.2)). Next, the criterion for ensuring balanced lexical diversity in a cluster (see Controlling Lexical Diversity (3.2)), which was based on a simple same-string test for English, may need to be adjusted for languages requiring a morphological analyzer. Finally, as we perform manual validation of the development and test sets, this process should be performed for any new dataset (see Section 3.3). Supporting this step, our validation guidelines are brief and simple, and are not language-specific. They only require identifying subevents and misplaced mention boundaries, as well as validating the sufficiency of the mention context.

Model
The current state-of-the-art CD event coreference system (Barhom et al., 2019) cannot be effectively trained on WEC-Eng for two main reasons: (1) computational complexity and (2) reliance on verbal SRL features. With respect to computation time, the training phase of this model simulates the clustering operations done at inference time, while recalculating new mention representations and pairwise scores after each cluster merging step. Consequently, training this model on our large-scale training data, which moreover is not segmented into topics, is computationally infeasible. In addition, the model of Barhom et al. (2019) uses an SRL system to encode the context surrounding verbal event mentions, while WEC-Eng is mostly composed of nominal event mentions (Section 3.4).
We therefore develop our own, more scalable, model for CD event coreference resolution, establishing baseline results for WEC-Eng. As common in CD coreference resolution, we train a pairwise scorer s(i, j) indicating the likelihood that two mentions i and j in the dataset are coreferring, and then apply agglomerative clustering over these scores to find the coreference clusters. Following the commonly used average-link method (Choubey and Huang, 2017; Kenyon-Dean et al., 2018; Barhom et al., 2019), the merging score for two clusters is defined as the average mention-pair score s(i, j) over all mention pairs (i, j) across the two candidate clusters to be merged.
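Average-link agglomerative clustering over pairwise scores can be sketched as follows (a naive illustration that recomputes all cluster-pair averages on every merge; an efficient implementation would cache the pairwise sums):

```python
import itertools

def average_link_clusters(n, score, threshold):
    """Greedily merge the two clusters with the highest average pairwise
    mention score s(i, j), as long as that average exceeds `threshold`.
    `score(i, j)` is the (symmetric) pairwise coreference likelihood."""
    clusters = [[i] for i in range(n)]
    while len(clusters) > 1:
        best, best_pair = threshold, None
        for a, b in itertools.combinations(range(len(clusters)), 2):
            total = sum(score(i, j) for i in clusters[a] for j in clusters[b])
            avg = total / (len(clusters[a]) * len(clusters[b]))
            if avg > best:
                best, best_pair = avg, (a, b)
        if best_pair is None:     # no pair exceeds the threshold → stop
            break
        a, b = best_pair
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

# Toy symmetric score matrix with two obvious clusters: {0,1} and {2,3}
S = [[1.0, 0.9, 0.1, 0.1],
     [0.9, 1.0, 0.1, 0.2],
     [0.1, 0.1, 1.0, 0.8],
     [0.1, 0.2, 0.8, 1.0]]
clusters = average_link_clusters(4, lambda i, j: S[i][j], threshold=0.5)
print(sorted(map(sorted, clusters)))  # → [[0, 1], [2, 3]]
```

The merge threshold plays the role of the clustering stopping criterion and would be tuned on the development set.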
For the pairwise model, we replicate the mention representation and pairwise scorer architecture from the end-to-end within-document coreference model of Lee et al. (2017), while including the recent incorporation of transformer-based encoders (Joshi et al., 2019; Kantor and Globerson, 2019). Concretely, we first apply a pre-trained RoBERTa (Liu et al., 2019) language model (without fine-tuning), separately for each mention. Given a mention span i, we include as context T (set to 250) tokens to the left of i and T tokens to the right of i. Applying RoBERTa to this window, we represent each mention by a vector g_i, which is the concatenation of three vectors: the contextualized representations of the mention span boundaries (first and last tokens) and the weighted sum of the mention token vectors according to the head-finding attention mechanism of Lee et al. (2017). The two mention representations g_i and g_j, and the element-wise multiplication of these vectors, are then concatenated and fed into a simple MLP, which outputs a score s(i, j) indicating the likelihood that mentions i and j belong to the same cluster. The head-attention layer and the MLP are trained to optimize the standard binary cross-entropy loss over all pairs of mentions, where the label is 1 if they belong to the same coreference cluster and 0 otherwise.
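The mention representation and pairwise feature construction can be illustrated in a few lines of NumPy (a sketch only: toy dimensions and random vectors stand in for RoBERTa outputs, and the attention parameters are hypothetical untrained weights):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mention_vector(token_vecs, attn_w):
    """g_i = [first boundary token ; last boundary token ; attention-weighted
    sum over the span tokens] (head-finding attention, Lee et al., 2017)."""
    alphas = softmax(token_vecs @ attn_w)   # one attention score per token
    head = alphas @ token_vecs              # weighted sum of span tokens
    return np.concatenate([token_vecs[0], token_vecs[-1], head])

def pair_features(g_i, g_j):
    """Input to the MLP scorer: [g_i ; g_j ; g_i * g_j]."""
    return np.concatenate([g_i, g_j, g_i * g_j])

d = 8                                  # toy dim; RoBERTa-large would give 1024
rng = np.random.default_rng(0)
span_i = rng.normal(size=(3, d))       # 3 contextualized span-token vectors
span_j = rng.normal(size=(5, d))
attn_w = rng.normal(size=d)            # head-attention parameters (trained)
g_i = mention_vector(span_i, attn_w)
g_j = mention_vector(span_j, attn_w)
assert g_i.shape == (3 * d,)
assert pair_features(g_i, g_j).shape == (9 * d,)
```

The resulting pair-feature vector would be fed to the MLP, whose sigmoid output is the pairwise score s(i, j).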

Experiments
We first train and evaluate our model on the commonly used ECB+ dataset, to assess its relevance as an effective baseline model, and then evaluate it on WEC-Eng, setting baseline results for our dataset. We also present the performance of the challenging same-head-lemma baseline, which clusters mentions sharing the same syntactic-head lemma. We report results with the official CoNLL scorer (https://github.com/conll/reference-coreference-scorers). Table 4 presents the results on ECB+. Our model outperforms state-of-the-art results for both the JOINT model and the DISJOINT event model of Barhom et al. (2019), with gains of 1.3 and 2.3 CoNLL F1 points, respectively. The JOINT model jointly clusters event and entity mentions, leveraging information across the two subtasks, while the DISJOINT event model considers only event mentions, taking the same input as our model. These results establish our model as a suitable baseline for WEC-Eng. Table 5 presents the results on WEC-Eng. First, we observe that despite the certain level of noise in the automatically gathered training data, our model outperforms the same-head-lemma baseline by 9.2 CoNLL F1 points. In fact, it achieves an error reduction rate relative to the lemma baseline similar to that obtained over ECB+, where training is performed on clean but smaller training data (18.3% error reduction in ECB+ and 19.6% in WEC). Furthermore, the performance of both the same-head-lemma baseline and our model is substantially lower on WEC-Eng (Table 5) than on ECB+ (Table 4). This indicates the more challenging nature of WEC-Eng, possibly due to its corpus-wide nature and higher degree of ambiguity (Table 1). Further examining the different nature of the two datasets, we applied cross-domain evaluation, applying the ECB+-trained model on the WEC-Eng test data and vice versa. The results suggest that due to their different characteristics, with respect to mention type (descriptive vs. referential) and structure (topic-based vs. corpus-wide), a model trained on one dataset is less effective (by 8-12 points) when applied to the other (further details are presented in Appendix B.1).
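The error-reduction rates quoted above follow the standard formula, computing the fraction of the baseline's CoNLL F1 error (100 - F1) that the model eliminates (the F1 values below are hypothetical, for illustration only):

```python
def error_reduction(f1_baseline, f1_model):
    """Fraction of the baseline's error (100 - F1) eliminated by the model."""
    return (f1_model - f1_baseline) / (100.0 - f1_baseline)

# Hypothetical scores: a baseline at 70.0 F1 and a model at 76.0 F1
print(round(error_reduction(70.0, 76.0), 3))  # → 0.2, i.e. 20% error reduction
```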

Qualitative Analysis
To obtain some qualitative insight about the learned models for both ECB+ and WEC-Eng, we manually examined their most certain predictions, looking at the top 5% instances with highest predicted probability and at the bottom 5%, of lowest predictions. Some typical examples are given in Appendix B.2. Generally, both models tend to assign the highest probabilities to mention pairs that share some lemma, and occasionally to pairs with different lemmas with similar meanings, with the WEC-Eng model making such lexical generalizations somewhat more frequently. Oftentimes in these cases, the models fail to distinguish between (gold) positive and negative cases, despite quite clear distinguishing evidence in the context, such as different times or locations. This suggests that the RoBERTa-based modeling of context may not be sufficient, and that more sophisticated models, injecting argument structure more extensively, may be needed.
In both models, the lowest predictions (correctly) correspond mostly to negative mention pairs, and occasionally to positive pairs for which the semantic correspondence is less obvious (e.g. offered vs. candidacy). In addition, we observe that longer spans common in WEC-Eng challenge the span representation model of Lee et al. (2017). This model emphasizes mention boundaries, but these often vary across lexically-similar coreferring mentions with different word order.

Conclusion and Future Work
In this paper, we presented a generic low-cost methodology and supporting tools for extracting cross-document event coreference datasets from Wikipedia. The methodology was applied to create the larger-scale WEC-Eng corpus, and may be easily applied to additional languages with relatively few adjustments. Most importantly, our dataset complements existing resources for the task by addressing a different appealing realistic setup: the targeted data is collected across a full corpus rather than within topical document clusters, and, accordingly, mentions are mostly referential rather than descriptive. Hence, we suggest that future research should be evaluated also on WEC-Eng, while future datasets, particularly for other languages, can be created using the WEC methodology and tool suite, all made publicly available. Our released model provides a suitable baseline for such future work.

A.1 Infobox Distillation
Excluding infobox types of broad, general events that consist of many sub-events is necessary, since Wikipedia authors often link to a broad event from anchor texts that refer to one of its subevents (which should be regarded as non-coreferring by standard definitions of the event coreference task (Hovy et al., 2013; Araki et al., 2014)). For example, in English Wikipedia many event articles containing the infobox "Election" tend to be pointed at from anchor texts that describe subevents, such as 2016 primaries linking to 2016 United States presidential election. Additionally, for some infobox types, pointing mentions often correspond to related, but not coreferring, named entities; hence, we discard such infobox types to avoid noisy mentions. For example, articles of the infobox type "Race" are often linked from mentions denoting the name of the country in which the race took place. Table 6 presents the infobox types excluded for the above reasons.

A.2 WEC-Eng Infobox Types
As mentioned in the paper (Section 3), we manually explored the various Wikipedia infobox types and selected only those denoting an event. Table 7 shows the number of coreference clusters (event pages) and mentions for each selected infobox type.
In the table, infobox types falling under the same general category are grouped together.

A.3 Lexical Distribution of ECB+ and WEC-Eng Mentions
Event mentions can appear in various lexical forms in a document, such as verbs (e.g. exploded), nominalization (e.g. crash), common nouns (e.g. party, accident) and proper nouns (e.g. Cannes Festival 2016). In order to have a rough estimation of the distribution of these different forms, we manually analyze 100 sampled mentions, from each of WEC-Eng and ECB+, and present the statistics in Table 8.

B.1 Cross-Domain Experiment and Results
To further assess the difference between the ECB+ and WEC-Eng datasets, we evaluate the crossdomain ability of the model trained on one dataset, and tested on the other one. We use the same pairwise model as in Tables 4 and 5