A Probabilistic Annotation Model for Crowdsourcing Coreference

The availability of large-scale annotated corpora for coreference is essential to the development of the field. However, creating resources at the required scale via expert annotation would be too expensive. Crowdsourcing has been proposed as an alternative, but this approach has not been widely used for coreference. This paper addresses one crucial hurdle on the way to making this possible by introducing a new model of annotation for aggregating crowdsourced anaphoric annotations. The model is evaluated along three dimensions: the accuracy of the inferred mention pairs, the quality of the post-hoc constructed silver chains, and the viability of using the silver chains as an alternative to the expert-annotated chains in training a state of the art coreference system. The results suggest that our model can extract from crowdsourced annotations coreference chains of comparable quality to those obtained with expert annotation.


Introduction
The task of identifying and resolving anaphoric reference to discourse entities, known in NLP as coreference resolution, has long been considered a core aspect of language interpretation (Poesio et al., 2016b), also because of its role in applications such as summarization (Baldwin and Morton, 1998; Steinberger et al., 2007), information extraction (Humphreys et al.), or question answering (Morton, 1999; Zheng, 2002).
In the 1990s the field made a paradigmatic turn towards corpus-based approaches, initiated by campaigns such as MUC (Grishman and Sundheim, 1995; Chinchor, 1998), and since then we have seen the development of a range of data-driven approaches, spurred by the development of ever larger and richer datasets. Nowadays, a variety of datasets exist for several languages (Poesio et al., 2016a). These include medium-scale multilingual datasets such as ONTONOTES (Pradhan et al., 2007; Weischedel et al., 2011), which led to the most recent evaluation campaigns, in particular CONLL 2012 (Pradhan et al., 2012), and are used in most current research (Björkelund and Kuhn, 2014; Martschat and Strube, 2015; Clark and Manning, 2016; Lee et al., 2017). However, there are still many languages and domains for which no such resources are available, and even for English much larger corpora than ONTONOTES will eventually be required.
However, annotating data on the scale required to train state of the art systems using traditional expert annotation would be unaffordable. One alternative is to employ crowdsourcing, either via platforms like Amazon Mechanical Turk and Crowdflower, or using Games-With-A-Purpose (Poesio et al., 2017). Studies such as (Snow et al., 2008; Raykar et al., 2010) have shown that when a sufficiently large number of workers is employed, expert-level quality can be achieved at a fraction of the cost required to create such resources using traditional methods. The one effort to create a large-scale coreference corpus entirely through crowdsourcing, the Phrase Detectives project (Poesio et al., 2013; Chamberlain, 2016), employs the Phrase Detectives game with a purpose. The Phrase Detectives corpus consists of 843 documents for a total of 1.2 million tokens and 392,741 markables; at present, 563 documents, for a total of 360,000 tokens, have been annotated. A second coreference corpus created using crowdsourcing (in the context of a trivia game) also exists, the Quiz Bowl dataset (Guha et al., 2015). (Another corpus creation project using crowdsourcing, and also games, for anaphoric annotation is the Groningen Meaning Bank (Bos et al., 2017). However, in the GMB crowdsourcing is not used to generate interpretations: players correct automatically annotated interpretations rather than providing the annotations themselves. Another crucial difference is that interpretations are not aggregated in the sense discussed below; rather, an expert adjudicates between the interpretations produced by players.)

However, such existing corpora are not widely used yet. One of the reasons for this is the lack of suitable aggregation methods for anaphora. Crowdsourced annotations require aggregation methods to select among the different interpretations produced by the crowd. Standard practice for crowdsourced data analysis has seen a shift in recent years from simple majority vote to much more effective aggregation methods (Smyth et al., 1994; Quoc Viet Hung et al., 2013; Sheshadri and Lease, 2013; Carpenter, 2008; Hovy et al., 2013; Passonneau and Carpenter, 2014). Probabilistic models of annotation, in particular, make it possible to characterize the accuracy of the annotators and correct for their bias (Dawid and Skene, 1979; Passonneau and Carpenter, 2014), to account for item-level effects such as difficulty (Whitehill et al., 2009), and to employ different pooling strategies (Carpenter, 2008). However, existing models of annotation cannot be used for anaphora: such methods assume that coders choose between a fixed set of general labels, the same labels across all annotated items. In anaphoric annotation, by contrast, coders relate markables to coreference chains which depend on the markables annotated in the given document (Passonneau, 2004; Artstein and Poesio, 2008).

Contributions. In this paper we propose a mention pair-based approach to aggregating crowdsourced anaphoric annotations. Concretely, we introduce a new model of annotation capable of inferring the most likely mention pairs from crowd-annotated anaphoric relations. We then use these pairs to build the most likely coreference chains. This approach to building chains is evaluated on both crowdsourced and synthetic (via simulation) coreference datasets. The evaluations include assessing the accuracy of the inferred mention pairs, the quality of the chains, and the viability of using these chains derived from mention pairs as an alternative to gold chains when training a state of the art coreference system. We conclude by also demonstrating the quality of the proposed model in a standard annotation task. The implementation is available as supplementary material.

A Mention-Pair Model of Annotation
Traditional models of annotation (Dawid and Skene, 1979; Smyth et al., 1994; Raykar et al., 2010; Hovy et al., 2013) are specified assuming the annotations are chosen among a general set of classes that is consistent across the annotated items. This is the case in a type of annotation closely related to anaphoric annotation, information status annotation (Nissim et al., 2004; Riester et al., 2010). In this type of annotation, an annotator marks a mention either as discourse-old (DO), referring to an existing entity (coreference chain), or as discourse-new (DN), introducing a new coreference chain, but without specifying which coreference chain the mention belongs to, if any. We will refer below to categories such as DN and DO as (general) classes.
Traditional models of annotation can model this type of annotation, but not the task of anaphoric annotation proper. In standard annotation schemes for anaphora/coreference (Poesio et al., 2016a) the annotator may mark a mention as referring to a discourse-new entity as above; but in case the mention is identified as discourse-old, this entity, or coreference chain (the set of coreferring mentions), is also specified. The available coreference chains differ from document to document.
Our proposal for a probabilistic model of this type of annotation is based on one of the most widely used models of coreference resolution: the mention pair model. In the mention pair model, the task of linking a mention to a coreference chain/entity is split into two parts: classifying mention pairs as coreferring or not, and subsequently clustering them (Soon et al., 2001; Hoste, 2016). The model we propose addresses the first part.
More formally, the crowdsourced data to be modeled consists of I mentions (indexed by i) annotated by a total of J coders (indexed by j). Each mention i has N_i annotations (indexed by n), covering a total of M_i distinct labels (indexed by m). Each label m of mention i belongs to a class z_{i,m}. The label of a mention can be the ID of the antecedent, in case the mention is annotated as belonging to the discourse-old (general) class; or it can be discourse-new or another general class (e.g.: property, non-referring). In these latter cases, the labels coincide with the classes they belong to.
An important difficulty we had to address is label sparsity. The solution we propose is to transform the mention-level annotations into a series of binary decisions with respect to each candidate label; in the literature this is often referred to as the binary relevance method (Tsoumakas and Katakis, 2007; Madjarov et al., 2012). We then model these (label-level) decisions as the result of the sensitivity (the true positive rate) and specificity (the true negative rate) of the annotators, which we assume are class-dependent. This latter assumption allows inferring a different level of annotator ability for each class (e.g.: capturing that DO labels are generally harder than DN labels).
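As an illustration, the sketch below shows one way the mention-level annotations could be turned into label-level binary decisions. The data layout and names are our own illustrative assumptions, not those of the released implementation.

```python
def to_binary_decisions(annotations, label_class):
    """Binary relevance transform (illustrative sketch).

    annotations: dict mention_id -> list of (annotator_id, chosen_label)
    label_class: dict (mention_id, label) -> general class, e.g. "DO" or "DN"

    Returns label-level decisions (mention_id, label, class, annotator_id, y),
    where y is 1 if the annotator picked this label for this mention, else 0.
    """
    decisions = []
    for i, anns in annotations.items():
        candidate_labels = {label for _, label in anns}   # the M_i distinct labels
        for m in candidate_labels:
            h = label_class[(i, m)]
            for j, chosen in anns:
                decisions.append((i, m, h, j, int(chosen == m)))
    return decisions

# Hypothetical example: mention "m3" annotated by three players.
anns = {"m3": [("p1", "m1"), ("p2", "m1"), ("p3", "DN")]}
classes = {("m3", "m1"): "DO", ("m3", "DN"): "DN"}
for row in to_binary_decisions(anns, classes):
    print(row)
```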
The graphical model of our Mention Pair Annotations model (MPA) is presented in Figure 1, while the generative process is given below (notation: jj[i,m,n] returns the index of the annotator who made the n-th decision on the m-th label of mention i):

• For every class h ∈ {1, 2, ..., K}:
  - Draw the class-specific true label likelihood π_h ∼ Beta(a, b)
• For every annotator j ∈ {1, 2, ..., J}:
  - For every class h ∈ {1, 2, ..., K}:
    - Draw the class-specific sensitivity α_{j,h} ∼ Beta(d, e)
    - Draw the class-specific specificity β_{j,h} ∼ Beta(t, u)
• For every mention i ∈ {1, 2, ..., I}:
  - For every candidate label m ∈ {1, 2, ..., M_i}:
    - Draw the true label indicator c_{i,m} ∼ Bernoulli(π_{z_{i,m}})
    - For every decision n ∈ {1, 2, ..., N_i}: if c_{i,m} = 1, draw y_{i,m,n} ∼ Bernoulli(α_{jj[i,m,n], z_{i,m}}); otherwise, draw y_{i,m,n} ∼ Bernoulli(1 − β_{jj[i,m,n], z_{i,m}})

The model addresses the first part of the mention pair framework: the posterior of the true label indicators is used to link each mention with the most likely label, obtaining the mention pairs. The coreference chains are then built by following the link structure from the inferred pairs.
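For concreteness, the following is a minimal forward-sampling sketch of this generative process. The hyperparameter values, array shapes, and function names are illustrative assumptions, not the settings used in our experiments.

```python
import numpy as np

rng = np.random.default_rng(0)

K, J = 2, 5                      # classes (e.g. DO, DN) and annotators
a, b = 1.0, 1.0                  # Beta prior on the true label likelihood (illustrative)
d, e = 4.0, 1.0                  # Beta prior on sensitivity (illustrative)
t, u = 4.0, 1.0                  # Beta prior on specificity (illustrative)

pi = rng.beta(a, b, size=K)              # class-level true label likelihood
alpha = rng.beta(d, e, size=(J, K))      # per-annotator, per-class sensitivity
beta = rng.beta(t, u, size=(J, K))       # per-annotator, per-class specificity

def sample_label_decisions(z_im, annotators):
    """Sample the binary decisions on one candidate label of one mention.

    z_im: class index of the label; annotators: coders who judged this mention.
    """
    c = rng.binomial(1, pi[z_im])        # true label indicator
    y = [rng.binomial(1, alpha[j, z_im]) if c == 1
         else rng.binomial(1, 1.0 - beta[j, z_im])
         for j in annotators]
    return c, y

print(sample_label_decisions(z_im=0, annotators=[0, 1, 2]))
```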
Note that for a traditional annotation task with no distinction between general classes and specific labels, the MPA model is equivalent to training K binary Bayesian versions of the Dawid and Skene (1979) model (one for each general class) on data processed using the binary relevance method. Note also that, whereas traditional models of annotation assume one true class per annotated item, an implicit benefit of our approach is that it allows for potentially multiple true classes, which can be useful for detecting ambiguity (Poesio and Artstein, 2005), although we do not exploit that in this work.
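A minimal sketch of the chain-building step is given below: it groups mentions by following the inferred links with a union-find pass. This is one possible implementation of the grouping task, and the label representation is hypothetical.

```python
def build_chains(best_label):
    """Cluster mentions into chains by following the inferred links (sketch).

    best_label: dict mention_id -> most likely label, either the id of the
    antecedent mention or a general class such as "DN" (discourse new).
    Returns a list of coreference chains (lists of mention ids).
    """
    parent = {m: m for m in best_label}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    for mention, label in best_label.items():
        if label in best_label:             # label is an antecedent mention id
            ra, rb = find(mention), find(label)
            if ra != rb:
                parent[ra] = rb
        # otherwise the label is a general class (DN, non-referring, ...)

    chains = {}
    for m in best_label:
        chains.setdefault(find(m), []).append(m)
    return list(chains.values())

print(build_chains({"m1": "DN", "m2": "m1", "m3": "m2", "m4": "DN"}))
# -> [['m1', 'm2', 'm3'], ['m4']]
```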

Parameter Estimation
We infer the parameters of the proposed model using Variational Inference (VI). Unlike Markov Chain Monte Carlo (MCMC) approaches (e.g.: Gibbs Sampling, Hamiltonian Monte Carlo), VI is deterministic, fast, and benefits from a clear convergence criterion (Blei et al., 2017).
Specifically, we approximate the intractable posterior p(θ|D) with a variational distribution q(θ) such that the Kullback-Leibler (KL) divergence between the two distributions is minimized. It can be shown that this minimization is equivalent to maximizing the evidence lower bound (ELBO):

ELBO(q) = E_q[log p(π, α, β, c, y | a, b, d, e, t, u, z)] − E_q[log q(π, α, β, c)]    (1)

We need a variational distribution q that is tractable under expectations. Following common practice (Blei et al., 2003; Hoffman et al., 2013; Blei et al., 2017), we choose q to be in the mean-field variational family, where each hidden variable is independent and governed by its own parameter. Elegant solutions have been derived for models whose complete conditionals are in the exponential family (Blei and Jordan, 2006; Hoffman et al., 2013). Concretely, we used the fact that the natural parameters of the variational distributions are equal to the expected value of the natural parameters of the corresponding complete conditionals. The derivations are standard in the VI literature (see, for example, Hoffman et al., 2013). (To save space, we only provide here the update formulas of the variational parameters; supplementary details are in the Appendix.)

Equations (2) and (3) give the variational update formulas for the class-level true label likelihood, q(π_h | λ_h, η_h) = Beta(λ_h, η_h). Equations (4) and (5) give the variational update formulas for the class-level annotator sensitivity, whose variational distribution is likewise a Beta. Equations (6) and (7) give the variational update formulas for the class-level annotator specificity, again a Beta with first parameter θ_{j,h}. Equations (8) and (9) give the variational update formulas for the true label indicators, q(c_{i,m}) = Bernoulli(φ_{i,m}).

Finally, for the above formulas we used the fact that E_q[I(c_{i,m} = 1)] = φ_{i,m}. The other expectations can be easily calculated by noting that, for a distribution in the exponential family, the first derivative of the log normalizer is equal to the expected value of the sufficient statistics (Blei et al., 2003). For example, E_q[log π_h] = Ψ(λ_h) − Ψ(λ_h + η_h), where Ψ(·) is the digamma function. Similar observations apply to the α- and β-related expectations.
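To make the procedure concrete, the sketch below implements standard mean-field coordinate-ascent updates for a model of this form (Beta priors, Bernoulli decisions, class-conditional sensitivity and specificity), following the usual Dawid-and-Skene-style derivation. It is an illustrative assumption rather than a transcription of Equations (2)-(9), and the default hyperparameters are placeholders.

```python
import numpy as np
from scipy.special import digamma

def elog_beta(p, q):
    """E[log X] and E[log(1 - X)] for X ~ Beta(p, q)."""
    return digamma(p) - digamma(p + q), digamma(q) - digamma(p + q)

def mpa_vi(decisions, K, J, a=1.0, b=1.0, d=4.0, e=1.0, t=4.0, u=1.0, iters=50):
    """Mean-field VI sketch for class-conditional sensitivity/specificity.

    decisions: list of (label_id, class, annotator, y) binary decisions, where
    label_id identifies one candidate label of one mention and class is an
    integer in {0, ..., K-1}. Returns a dict mapping label_id to q(c = 1).
    """
    labels = sorted({(l, h) for l, h, _, _ in decisions})
    idx = {l: k for k, (l, _) in enumerate(labels)}
    cls = np.array([h for _, h in labels])
    phi = np.full(len(labels), 0.5)

    for _ in range(iters):
        # q(pi_h) = Beta(lam_h, eta_h): class-level true label likelihood
        lam = a + np.bincount(cls, weights=phi, minlength=K)
        eta = b + np.bincount(cls, weights=1 - phi, minlength=K)
        # per annotator/class Beta pseudo-counts for sensitivity / specificity
        sp, sq = np.full((J, K), d), np.full((J, K), e)
        tp, tq = np.full((J, K), t), np.full((J, K), u)
        for l, h, j, y in decisions:
            f = phi[idx[l]]
            sp[j, h] += f * y
            sq[j, h] += f * (1 - y)
            tp[j, h] += (1 - f) * (1 - y)
            tq[j, h] += (1 - f) * y
        # q(c = 1) = phi: combine prior and annotator evidence in log space
        e_pi1, e_pi0 = elog_beta(lam, eta)
        e_a1, e_a0 = elog_beta(sp, sq)   # E[log alpha], E[log(1 - alpha)]
        e_b1, e_b0 = elog_beta(tp, tq)   # E[log beta],  E[log(1 - beta)]
        log1, log0 = e_pi1[cls].copy(), e_pi0[cls].copy()
        for l, h, j, y in decisions:
            k = idx[l]
            log1[k] += y * e_a1[j, h] + (1 - y) * e_a0[j, h]
            log0[k] += (1 - y) * e_b1[j, h] + y * e_b0[j, h]
        phi = 1.0 / (1.0 + np.exp(log0 - log1))
    return {l: float(phi[idx[l]]) for l, _ in labels}

# Hypothetical usage: two candidate labels of one mention, three annotators.
decs = [("m3:m1", 0, 0, 1), ("m3:m1", 0, 1, 1), ("m3:m1", 0, 2, 0),
        ("m3:DN", 1, 0, 0), ("m3:DN", 1, 1, 0), ("m3:DN", 1, 2, 1)]
print(mpa_vi(decs, K=2, J=3))
```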

Evaluation
We carried out a series of evaluations of increasing complexity of our MPA model. We first assess the accuracy of the inferred mention pairs. Second, we cluster the pairs into appropriate coreference chains and evaluate the quality of these chains. Third, we assess the viability of using silver chains as an alternative to the gold chains when training a state of the art coreference system. Finally, we conclude the evaluation with a performance check in a standard annotation task.

Datasets
The largest coreference dataset with crowdsourced annotations is the Phrase Detectives corpus. A subset of this corpus is the Phrase Detectives 1.0 dataset, which also includes gold annotations and can therefore be used to evaluate the accuracy of MPA at mention-pair and coreference-chain inference, but is too small to train a state of the art coreference system. To carry out this second type of evaluation we used the approach, common in the crowdsourcing literature (Carpenter, 2008; Raykar et al., 2010; Hovy et al., 2013; Felt et al., 2014), of generating simulated datasets by corrupting the gold standard of an existing corpus. For this purpose, we use the CONLL-2012 dataset (Pradhan et al., 2012), at present the standard dataset for coreference resolution.

Crowdsourced Data
The Phrase Detectives (PD) 1.0 dataset has been annotated using the Phrase Detectives game with a purpose. The annotation scheme for PD is based on that of the ARRAU corpus (Poesio et al., 2018). Players have to label predefined markables with one of the following categories: non-referring (e.g., for expletives), discourse-new, discourse-old (in which case an antecedent is also marked, namely the most recent mention belonging to the antecedent's coreference chain), or property (for appositions and copular structures). The PD 1.0 dataset is the portion of the corpus that contains, in addition to the annotations by the players, a gold label for each markable. The coreference chains are obtained using a simple clustering of the mention pairs. An important limitation of this corpus is its small size (around 6000 markables from 45 documents), making it unfit for the training and evaluation of state of the art supervised systems.

Synthetic Data
The CONLL-2012 dataset specifies gold chains, not mention pairs, so we first need to extract appropriate mention pairs from these chains. To do this, for each mention we select as gold label the closest mention from its gold chain (or discourse new if the mention is the first in its chain). The simulations are then generated by extracting from each gold label a number of 'crowdsourced labels' produced by (simulated) annotators with varying degrees of ability. We considered a range of simulated scenarios, all sharing the following settings:

• 10 distinct annotators per mention and 20 distinct mentions per annotator; the annotators receive random mentions to annotate.
• Each annotator is randomly assigned a profile. The profiles indicate the sensitivity of the annotators with respect to discourse old and discourse new. For example, the (DO 0.8, DN 0.9) profile indicates that, given a mention whose true class is DO, the annotator has a 0.8 probability of getting it right, and a 0.9 probability for DN. We considered both profiles reflecting the actual profiles of players in Phrase Detectives (Chamberlain, 2016) and synthetic profiles.
• 5 choices for the annotators to choose from for each mention: the correct label, the DN label, and incorrect alternatives.

The range of options considered in the simulation is specified by two aspects: the sensitivity from the annotator profiles and the distribution of the errors they make. We use the following two profile types:

• Synthetic profiles: these roughly correspond to two experts and three novices whose class sensitivities are relatively close, with extra mass associated with DN because this class is generally easier compared to DO.
• Phrase Detectives inspired profiles: from the PD annotators who annotated more than 10 DO and 10 DN mentions (thresholds set to ensure a minimum level of confidence) we extracted a total of 89 profiles. These gave us much more interesting sensitivity pairs compared to the synthetic profiles, i.e., contrasting class abilities (see Figure 2).
We also considered a range of ways in which annotators may make mistakes:

• Distribute the errors uniformly at random over the remaining mass (1 − sensitivity).
• Distribute the errors in a sparse manner, i.e., assume that some errors are more likely than others. This can be achieved by drawing randomly from a 4-dimensional (4 = number of errors) uniform Dirichlet for each mention.

The annotator probabilities over the 5 choices then consist of their sensitivity and the error distribution normalized with respect to the remaining mass; a sketch of this sampling procedure is given below.
The settings just discussed lead to 4 simulations summarized in Table 1.
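For illustration, a single simulated annotation could be drawn under these settings roughly as follows. The choice set, profile encoding, and error handling shown here are illustrative assumptions rather than the exact protocol used to build the four simulations.

```python
import numpy as np

rng = np.random.default_rng(1)

def simulate_annotation(true_label, true_class, choices, profile, sparse=False):
    """Draw one simulated annotation for a mention (illustrative sketch).

    true_label: the gold label (an antecedent id or "DN").
    true_class: "DO" or "DN"; profile: per-class sensitivities,
    e.g. {"DO": 0.8, "DN": 0.9}; choices: the 5 candidate labels offered.
    """
    sens = profile[true_class]
    errors = [c for c in choices if c != true_label]
    if sparse:
        # sparse errors: some mistakes more likely than others (Dirichlet mass)
        err_probs = rng.dirichlet(np.ones(len(errors)))
    else:
        err_probs = np.full(len(errors), 1.0 / len(errors))
    probs = np.concatenate(([sens], (1.0 - sens) * err_probs))
    return rng.choice([true_label] + errors, p=probs)

profile = {"DO": 0.8, "DN": 0.9}                      # an expert-like profile
print(simulate_annotation("m7", "DO", ["m7", "DN", "m2", "m4", "m5"], profile))
```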

Evaluation 1: Mention Pair Accuracy
We use MPA to link each mention with the most likely label based on the posterior of the true label indicators. We then assess the accuracy of the inferred mention pairs against the gold standard, i.e., the agreement with the gold mention pairs. In this task the proposed model is compared against a majority vote baseline where each mention is paired with the most voted label (throughout the paper we report the best majority vote result after 10 random rounds of splitting ties). The evaluation is conducted on the crowdsourced annotated PD 1.0 dataset and on simulated data generated from the CONLL-2012 test set. The results, summarized in Table 2, indicate that the mention pairs inferred by our model (MPA) obtain a much better level of agreement with the gold mention pairs than the output of the majority vote (MV) baseline. MV implicitly assumes equal expertise among the annotators, which has repeatedly been shown to be false in annotation practice (Poesio and Artstein, 2005; Passonneau and Carpenter, 2014; Plank et al., 2014).
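For reference, a minimal version of the majority vote baseline with random tie splitting might look as follows; the data layout is hypothetical.

```python
import random
from collections import Counter

def majority_vote(anns, seed=0):
    """Pair each mention with its most voted label, splitting ties at random.

    anns: dict mention_id -> list of labels chosen by the annotators.
    """
    random.seed(seed)
    best = {}
    for i, labels in anns.items():
        counts = Counter(labels)
        top = max(counts.values())
        best[i] = random.choice([l for l, c in counts.items() if c == top])
    return best

print(majority_vote({"m3": ["m1", "m1", "DN"], "m5": ["DN", "m2"]}))
```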

Evaluation 2: Silver Chain Quality
After the mention pairs have been inferred using MPA, producing the coreference chains (we will henceforth refer to the coreference chains thus obtained as silver coreference chains; our use of the term 'silver' should not be confused with the other common use for a standard generated from automatic annotations) is a straightforward clustering task: we simply follow the link structure from the pairs. In this Section we assess the quality of the silver chains using standard coreference metrics, in particular the Extended Scorer introduced in (Poesio et al., 2018), which extends the official CONLL scorer to include in the evaluation system-predicted singletons and non-referring expressions, both of which are annotated in Phrase Detectives; when singletons and non-referring expressions are not considered, the Extended Scorer is identical to the official scorer. As in the previous experiment, the evaluation is conducted on the crowdsourced annotated PD 1.0 dataset and on simulated data generated from the CONLL-2012 test set. We compare silver chains produced using our MPA model, using MV, and using the Stanford deterministic coreference system (Stanford) (Lee et al., 2011). To run the latter on PD 1.0, we used the default annotators of the CoreNLP toolkit (Manning et al., 2014) to supply the information required by the coreference system, and switched off the post-processing so as to output singleton clusters; for the CONLL-2012 data we set dcoref.replicate.conll = true to run exactly the same method as Lee et al. (2011). On both datasets we evaluated on gold mentions.

Table 3 summarizes the results on the crowdsourced annotated PD 1.0 dataset. The silver chains obtained using our MPA model are of far better quality than those of baseline alternatives such as MV and Stanford. Note also that even the simple MV baseline built from crowdsourced annotations yields much better chains compared to a standard coreference system such as the Stanford system. This underlines the advantage of crowdsourced annotations for coreference over automatically produced annotations. In Table 4 we present the scores of MPA and MV on non-referring expressions. In this case as well, the probabilistic model substantially outperforms the MV baseline.
In Table 5 we present the results obtained on simulated data from the CONLL-2012 test set. The results follow a similar trend to those observed on actual annotations: the chains produced using the mention pairs inferred by our MPA model are of much better quality across all simulated scenarios. Furthermore, the MV baseline achieves better chains compared to the Stanford system in 3 out of 4 simulation settings, again showcasing the potential of crowdsourced annotations.

Training on Silver Chains
In this Section we assess the viability of using the (silver) chains extracted from crowdsourced annotations as an alternative to gold chains when training a state of the art coreference system. Concretely, we train the best-performing current system, Lee et al. (2017), on chains produced using our MPA model, the MV baseline, and the Stanford deterministic system (Lee et al., 2011) (used mainly for calibration, i.e., as an alternative baseline that is not based on crowdsourced annotations). We also include the results obtained using actual gold chains. The results are in Table 6; each simulated scenario is randomly generated 10 times, and we report the average result and standard deviation. Across all simulated scenarios, the silver chains produced by our MPA model obtain the performance closest to training on gold chains, with the best result only 1 percentage point below the result with gold chains. Again, the MV chains lead to better performance than those obtained using an automatic system (Stanford).
These results, once again, indicate the utility of crowdsourced annotations for coreference tasks.

Traditional Crowdsourcing Tasks
In this Section we show that MPA is state of the art also on traditional crowdsourcing datasets, where annotations fall into general classes that are consistent across the annotated items. This evaluation was done on the datasets (WSD, RTE and TEMP) introduced by Snow et al. (2008) and widely used as benchmarks in the literature on annotation models (Hovy et al., 2013;Carpenter, 2008).
We compare the results against a majority vote baseline and two well-known state of the art models: a Bayesian version of the Dawid and Skene (1979) (DS) model and MACE (Hovy et al., 2013). We implement DS ourselves using variational inference, while for MACE, we simply report the published results. As in Hovy et al. (2013) the assessment is done in terms of accuracy against the gold standard. The results, presented in Table 7, indicate the proposed model achieves performance on par with the state of the art.

Related Work
To our knowledge, this is the first paper proposing a model of crowdsourced annotations for coreference. We did, however, draw inspiration from existing mention pair models of coreference and from traditional models of annotation.
The so-called mention pair model is one of the early machine learning approaches to coreference resolution, made popular by Soon et al. (2001). The model is based on a two step procedure: a classification step which identifies the coreferent mention pairs, followed by a clustering step which builds the coreference chains from the aforementioned pairs. The diversity of mention pair models comes from the distinct approaches taken for each of the two steps (Hoste, 2016). Although we follow a similar two step procedure, there are also important differences. Our way of identifying the mention pairs is completely unsupervised, and relies entirely on the crowdsourced annotations. Furthermore, we pair every mention with only one label, reducing the second step of clustering mention pairs into appropriate coreference chains to a simple grouping task guided by a unique path which arises from the pairs.
All existing probabilistic models of annotation (Dawid and Skene, 1979; Smyth et al., 1994; Raykar et al., 2010; Hovy et al., 2013; Passonneau and Carpenter, 2014) assume the annotations fall into a general set of classes that is consistent across the annotated items. This is clearly not the case in a coreference resolution task, a limitation we had to address. We first transformed the annotations into a series of (per-label) binary decisions, an approach often referred to in the multi-label classification literature as the binary relevance method (Tsoumakas and Katakis, 2007; Madjarov et al., 2012). The transformation avoids modeling the sparse labels directly. We further exploited the fact that the annotations fall into a general set of classes and assumed the inter-label decisions are the result of the class-dependent ability of the annotators.

Conclusions
Crowdsourced annotations are an increasingly popular alternative to expert annotation. Even so, their viability for coreference annotation had not been explored so far. This paper is a first step towards filling this gap. We introduced a mention pair-based approach to aggregating crowdsourced anaphoric annotations and assessed the quality of the inferred pairs, the quality of the post-hoc constructed coreference chains, and the viability of using the inferred chains as an alternative to gold chains when training a state of the art coreference system. Throughout the experiments, the model we introduced was superior to baseline alternatives such as majority vote and chains obtained automatically using a coreference system, across both genuinely crowdsourced and simulated coreference datasets. Furthermore, even the annotation-based baseline achieved results consistently better than those obtained by automatic coreference resolvers, strengthening the case for using crowdsourced annotations to create coreference datasets.