Active Learning for Coreference Resolution using Discrete Annotation

We improve upon pairwise annotation for active learning in coreference resolution by asking annotators to identify mention antecedents if a presented mention pair is deemed not coreferent. This simple modification, when combined with a novel mention clustering algorithm for selecting which examples to label, is much more efficient in terms of the performance obtained per annotation budget. In experiments with existing benchmark coreference datasets, we show that the signal from this additional question leads to significant performance gains per human-annotation hour. Future work can use our annotation protocol to effectively develop coreference models for new domains. Our code is publicly available.


Introduction
Coreference resolution is the task of resolving anaphoric expressions to their antecedents (see Figure 1). It is often required in downstream applications such as question answering (Dasigi et al., 2019) or machine translation (Stanovsky et al., 2019). Exhaustively annotating coreference is an expensive process as it requires tracking coreference chains across long passages of text. In news stories, for example, important entities may be referenced many paragraphs after their introduction.
Active learning is a technique which aims to reduce costs by annotating samples which will be most beneficial for the learning process, rather than fully labeling a large fixed training set. Active learning consists of two components: (1) a task-specific learning algorithm, and (2) an iterative sample selection algorithm, which examines the performance of the model trained at the previous iteration and selects samples to add to the annotated training set.

* Work done while at the University of Washington.
1 https://github.com/belindal/discrete-active-learning-coref

Figure 1: Discrete annotation. The annotator is shown the document, a span (yellow), and the span's predicted antecedent (blue). In case the answer to the coreference question is negative (i.e., the spans are not coreferring), we present a follow-up question ("what is the first appearance of the entity?"), providing additional cost-effective signal. Our annotation interface can be seen in Figure 5 in the Appendix. [Example shown in the figure: "A volcano in Mexico, known to locals as Po-po, just started spewing molten rock." Are the two mentions coreferent? No. What is the first appearance of the entity that the yellow-highlighted text refers to? "A volcano in Mexico".]

This method has proven successful for various tasks in low-resource domains (Garrette and Baldridge, 2013; Kholghi et al., 2015; Syed et al., 2016, 2017). Sachan et al. (2015) showed that active learning can be employed for the coreference resolution task. They used gold data to simulate pairwise human annotations, where two entity mentions are annotated as either coreferring or not (see the first question in Figure 1).
In this paper, we propose two improvements to active learning for coreference resolution. First, we introduce the notion of discrete annotation (Section 3), which augments pairwise annotation by introducing a simple additional question: if the user deems the two mentions non-coreferring, they are asked to mark the first occurrence of one of the mentions (see second question in Figure 1). We show that this simple addition has several positive implications. The feedback is relatively easy for annotators to give, and provides meaningful signal which dramatically reduces the number of annotations needed to fully label a document.
Second, we introduce mention clustering (Section 4). When selecting the next mention to label, we take into account aggregate model predictions for all antecedents which belong to the same cluster. This avoids repeated labeling that would come with separately verifying every mention pair within the same cluster, as done in previous methods.
We conduct experiments across several sample selection algorithms using existing gold data for user labels and show that both of our contributions significantly improve performance on the CoNLL-2012 dataset (Pradhan et al., 2012). Overall, our active learning method presents a superior alternative to pairwise annotation for coreference resolution, achieving better performing models for a given annotation budget.

Background
Our work relies on two main components: a coreference resolution model and a sample selection algorithm.
Coreference resolution model  We use the span ranking model introduced by Lee et al. (2017), later implemented in the AllenNLP framework (Gardner et al., 2018). This model computes span embeddings for all possible spans i in a document, and uses them to compute a probability distribution P(ant(i) = y) over the set of all candidate antecedents Y(i) = {K previous mentions in the document} ∪ {ε}, where ε is a dummy antecedent signifying that span i has no antecedent. This model does not require additional resources, such as syntactic dependencies or named entity recognition, and is thus well-suited for active learning scenarios in low-resource domains.
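The antecedent distribution described above can be sketched as a softmax over pairwise scores, with the dummy antecedent ε fixed at score 0 as in span-ranking models. This is a simplified illustration, not the paper's actual model; the function name and inputs are hypothetical.

```python
import numpy as np

def antecedent_distribution(pair_scores):
    """Softmax over a span's K candidate antecedents plus the dummy
    antecedent epsilon (index 0), whose score is fixed at 0."""
    scores = np.concatenate([[0.0], pair_scores])
    exp = np.exp(scores - scores.max())  # stabilized softmax
    return exp / exp.sum()

# three candidate antecedents; probs[0] is P(no antecedent)
probs = antecedent_distribution(np.array([2.0, -1.0, 0.5]))
```

Here the highest-scoring candidate (score 2.0) receives the most mass, but some probability always remains on the dummy, reflecting the chance that the span has no antecedent.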
Sample selection algorithm Previous approaches for the annotation of coreference resolution have used mostly pairwise selection, where pairs of mentions are shown to a human annotator who marks whether they are co-referring (Gasperin, 2009;Laws et al., 2012;Zhao and Ng, 2014;Sachan et al., 2015). To incorporate these binary annotations into their clustering coreference model, Sachan et al. (2015) introduced the notion of must-link and cannot-link penalties, which we describe and extend in Section 4.

Discrete Annotation
In discrete annotation, as exemplified in Figure 1, we present the annotator with a document where the least certain span i ("Po-po", in the example) and i's model-predicted antecedent, A(i) ("locals"), are highlighted. Similarly to pairwise annotation, annotators are first asked whether i and A(i) are coreferent. If they answer positively, we move on to the next sample. Otherwise, we deviate from pairwise sampling and ask the annotator to mark the antecedent for i ("A volcano in Mexico") as the follow-up question. The annotator can abstain from answering the follow-up question in case i is not a valid mention or it does not have an antecedent in the document. See Figure 5 in the Appendix for more example annotations.
In Section 5, we show that discrete annotation is superior to the classic pairwise annotation in several aspects. First, it makes better use of human annotation time, as an annotator often needs to resolve the antecedent of the presented mention anyway to answer the first question: for example, identifying that "Po-po" refers to the volcano, and not the locals. Second, we find that discrete annotation is a better fit for mention ranking models (Lee et al., 2017), which assign the most-likely antecedent to each mention, just as an annotator does in discrete annotation.

Mention Clustering
We experiment with three selection techniques by applying popular active learning selectors like entropy or query-by-committee (Settles, 2010) to clusters of spans. Because our model outputs antecedent probabilities and predictions, we would like to aggregate these outputs such that we have only one probability per mention cluster rather than one per antecedent. We motivate this with an example: suppose span i's top two most likely antecedents are y1 and y2. In scenario 1, y1 and y2 are predicted to be clustered together, and in scenario 2, they are predicted to be clustered apart. Span i should have a "higher certainty" in scenario 1 (and thus be less likely to be picked by active learning), because its two most likely antecedents both imply the same clustering, whereas in scenario 2, picking y1 vs. y2 results in a different downstream clustering. Thus, rather than simply using the raw probability that i refers to a particular antecedent, we use the probability that i belongs to a certain cluster. This implies modelling y1 and y2 "jointly" in scenario 1, and separately in scenario 2.
Formally, we compute the probability that a span i belongs in a cluster C by summing P(ant(i) = y) for all y that belong to C, since i having an antecedent in a cluster implies that i is also in that cluster. This allows us to convert the predicted antecedent probabilities to in-cluster probabilities:

P(i ∈ C) = Σ_{y ∈ C ∩ Y(i)} P(ant(i) = y)    (1)

Similarly, for query-by-committee, we aggregate predictions such that we have one vote per cluster rather than one vote per antecedent:

V(i ∈ C) = Σ_{y ∈ C ∩ Y(i)} V(A(i) = y)    (2)

where V(A(i) = y) ∈ {0, 1, ..., M} refers to the number of models that voted for y as the antecedent of i. The cluster information (y ∈ C ∩ Y(i)) we use in Equations 1 and 2 is computed from a combination of model-predicted labels and labels queried through active learning. Antecedents which were not predicted to be in clusters are treated as singleton clusters.
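The aggregation in Equation 1 can be sketched in a few lines; the data layout (dicts mapping antecedents to probabilities and cluster ids) is a hypothetical simplification of the model's actual outputs.

```python
from collections import defaultdict

def cluster_probabilities(ant_probs, cluster_of):
    """Convert antecedent probabilities P(ant(i) = y) into in-cluster
    probabilities P(i in C) by summing over each cluster's members.
    Antecedents absent from any predicted cluster become singletons."""
    p_cluster = defaultdict(float)
    for y, p in ant_probs.items():
        c = cluster_of.get(y, ("singleton", y))
        p_cluster[c] += p
    return dict(p_cluster)

# scenario 1: the two most likely antecedents share a cluster
scenario1 = cluster_probabilities({"y1": 0.4, "y2": 0.35, "y3": 0.25},
                                  {"y1": "c1", "y2": "c1"})
# scenario 2: they are predicted to be in different clusters
scenario2 = cluster_probabilities({"y1": 0.4, "y2": 0.35, "y3": 0.25},
                                  {"y1": "c1", "y2": "c2"})
```

In scenario 1 the span's top cluster gets probability 0.75, so the span looks "certain" and is less likely to be queried; in scenario 2 the mass stays split (0.4 vs. 0.35), exactly the intuition motivating cluster-level aggregation.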
Additionally, to respect user annotations during the selection process, we must keep track of all prior annotations. To do this, we use the concept of must-link (ML; if two mentions are judged coreferent) and cannot-link (CL; if two mentions are judged non-coreferent) relations between mentions introduced by Sachan et al. (2015), and adapt it for our purposes. Specifically, in our discrete setting, we build the links as follows: if the user deems the pair coreferent, it is added to ML. Otherwise, it is added to CL, while the user-corrected pair (from the second question) is always added to ML.
In addition, we use these links to guide how we select the next mention to query. For example, if a CL relation exists between spans m1 and m2, we will be less likely to query for m1, since we are slightly more certain about what m1's antecedent should be (not m2). Formally, we revise the probabilities P(i ∈ C) and votes V(i ∈ C) in accordance with our link relations, which affects the selector uncertainty scores (see Section A.2 in the Appendix for more details). Finally, following Sachan et al. (2015), we impose transitivity constraints, which allow us to model links beyond what has been explicitly pointed out during annotation:

ML(a, b) ∧ ML(b, c) ⟹ ML(a, c)    (3)
ML(a, b) ∧ CL(b, c) ⟹ CL(a, c)    (4)

However, recomputing these closures after each active learning iteration can be extremely inefficient. Instead, we build up the closure incrementally by adding only the minimum number of necessary links to maintain the closure every time a new link is added.
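The incremental maintenance of ML/CL closure can be sketched as follows. This is an illustrative data structure in the spirit of Algorithm 1 in the Appendix, not the paper's exact implementation: must-link clusters are merged in place, and every cannot-link touching a cluster is extended to all of its members.

```python
from collections import defaultdict

class LinkStore:
    """Keep must-link (ML) and cannot-link (CL) relations closed under
    the transitivity constraints, updating incrementally per new link."""

    def __init__(self):
        self.clusters = {}           # mention -> shared set of ML'd mentions
        self.cl = defaultdict(set)   # mention -> mentions it cannot corefer with

    def _cluster(self, m):
        return self.clusters.setdefault(m, {m})

    def add_cl(self, a, b):
        # a CL between two mentions holds between their whole clusters
        for x in self._cluster(a):
            for y in self._cluster(b):
                self.cl[x].add(y)
                self.cl[y].add(x)

    def add_ml(self, a, b):
        A, B = self._cluster(a), self._cluster(b)
        if A is B:          # already must-linked, nothing to do
            return
        merged = A | B
        blocked = set()     # everything either cluster cannot link to
        for m in merged:
            blocked |= self.cl[m]
        for m in merged:
            self.clusters[m] = merged
            self.cl[m] |= blocked
        for x in blocked:
            self.cl[x] |= merged
```

Each `add_ml`/`add_cl` call touches only the affected clusters, which is the source of the speedup over recomputing the full closure after every annotation.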
We experiment with the following clustered selection techniques:

Clustered entropy  We compute entropy over cluster probabilities and select the mention with the highest clustered entropy:

H(i) = − Σ_C P(i ∈ C) log P(i ∈ C)    (5)

Clustered query-by-committee  We train M models (with different random seeds) and select the mention with the highest cluster vote entropy:

VE(i) = − Σ_C (V(i ∈ C)/M) log (V(i ∈ C)/M)    (6)

using votes counted over clusters, as defined in Equation 2.
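Both selectors reduce to standard entropy computations over the cluster-level quantities; a minimal sketch (with hypothetical function names, operating on the per-cluster dicts described above):

```python
import math

def clustered_entropy(p_cluster):
    """Entropy over one mention's in-cluster probabilities; higher
    entropy means the selector is less certain about the mention."""
    return -sum(p * math.log(p) for p in p_cluster.values() if p > 0)

def cluster_vote_entropy(v_cluster, num_models):
    """Query-by-committee vote entropy, with votes aggregated per
    cluster and num_models (= M) committee members."""
    return -sum((v / num_models) * math.log(v / num_models)
                for v in v_cluster.values() if v > 0)

# a mention whose mass is split across clusters scores higher,
# and is therefore queried first
uncertain = clustered_entropy({"c1": 0.5, "c2": 0.5})
confident = clustered_entropy({"c1": 0.95, "c2": 0.05})
```

Selection then simply picks the mention maximizing the chosen score across all unlabelled mentions.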
Least coreferent clustered mentions / most coreferent unclustered mentions (LCC/MCU)  We aim to select a subset of spans for which the model was least confident in its prediction. For each span i which was assigned a cluster C_i, we compute a score s_C(i) = P(i ∈ C_i), and choose the n spans with the smallest s_C(i). For each singleton j, we compute an "unclustered" score s_U(j) = max_{C ∈ all clusters} P(j ∈ C) and choose the m spans with the largest s_U(j). Both P(i ∈ C_i) and P(j ∈ C) are computed with Equation 1.
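The LCC/MCU selection amounts to two sorts over the precomputed scores; a sketch under the assumption that the s_C and s_U scores have already been computed per span:

```python
def lcc_mcu_select(clustered, singletons, n, m):
    """LCC/MCU sketch. `clustered` maps each clustered span i to
    s_C(i) = P(i in C_i); `singletons` maps each singleton j to
    s_U(j) = max_C P(j in C). Returns the n least-confident clustered
    spans and the m most-coreferent unclustered spans."""
    lcc = sorted(clustered, key=clustered.get)[:n]                 # smallest s_C
    mcu = sorted(singletons, key=singletons.get, reverse=True)[:m] # largest s_U
    return lcc, mcu

lcc, mcu = lcc_mcu_select({"i1": 0.9, "i2": 0.3, "i3": 0.6},
                          {"j1": 0.2, "j2": 0.7}, n=1, m=1)
```

Intuitively, LCC surfaces clustered spans the model may have clustered wrongly, while MCU surfaces singletons that probably should have been clustered.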

Evaluation
We compare discrete versus pairwise annotation using the English CoNLL-2012 coreference dataset (Pradhan et al., 2012). Following Sachan et al. (2015), we conduct experiments where user judgments are simulated from gold labels.

Annotation time estimation  To compare annotation times between pairwise and discrete questions, we collected eight 30-minute sessions from 7 in-house annotators with a background in NLP. Annotators were asked to answer as many instances as they could during those 30 minutes. We additionally asked 1 annotator to annotate only discrete questions for 30 minutes. To be as representative as possible, the active learning queries for these experiments were sampled from various stages of active learning (see Table 1). On average, an annotator completed about 67 questions in a single session, half of which were answered negatively, requiring the additional discrete question. Overall, these estimates rely on 826 annotated answers. Our annotation interface is publicly available; see examples in Figure 5 in the Appendix.
Our timing results show that answering the discrete question after the initial pairwise question takes about the same time as answering the first question itself (about 16s). Furthermore, answering only discrete questions took 28.01s per question, which confirms that an initial pairwise question indeed saves annotator time when it is answered positively.
In the following experiments, we use these measurements to calibrate pairwise and discrete follow-up questions when computing total annotation times.
Baselines  We implement a baseline for pairwise annotation with an entropy selector. We also implement two discrete annotation baselines with random selection. The partially-labelled baseline follows the standard active learning training loop, but selects the next mention to label at random. The fully-labelled baseline creates a subset of the training data by taking as input an annotation time t and selecting at random a set of documents that the user can fully label in t hours using only discrete annotation. By comparing the fully-labelled baseline against our active learning results, we can determine whether active learning is more effective than labelling documents exhaustively.
Hyperparameters  We use the model hyperparameters from the AllenNLP implementation of Lee et al. (2017). We train up to 20 epochs with a patience of 2 before adding labels. After all documents have been added, we retrain from scratch. We use a query-by-committee of M = 3 models, due to memory constraints. For LCC/MCU, given L annotations per document, we split the annotations equally between clusters and singletons.
Results  Figure 2 plots the performance of discrete annotation with the various selectors from Section 4, against the performance of pairwise annotation, calibrated according to our timing experiments. In all figures, we report MUC, B³, and CEAF_e as an averaged F1 score.
The three non-random active learning frameworks outperform the fully-labelled baseline, showing that active learning is more effective for coreference resolution when the annotation budget is limited.
Most notably, Figure 2 shows that every non-random discrete selection protocol outperforms pairwise annotation. Where the gap in performance is largest (> 15 minutes per document), we consistently improve by ~4% absolute F1 over pairwise selection.

Analysis
A major reason discrete annotation outperforms the pairwise baseline is that the number of pairwise annotations needed to fully label a document is much larger than the number of discrete annotations. In an average development document with 201 candidate mentions, the number of pairwise queries needed to fully label the document is 15,050, while the maximum number of discrete queries is only 201 (i.e., asking for the antecedent of every mention). Thus, the average document can be fully annotated via discrete annotation in only 2.6% of the time it takes to fully label it with pairwise annotation, suggesting that our framework is also a viable exhaustive annotation scheme.
Further analysis shows that the improvement from discrete selection stems in part from better use of annotation time for mention detection accuracy (Figure 3) and pronoun resolution (Figure 4), where we measure performance only on clusters with pronouns, as identified automatically by the spaCy tagger (Honnibal and Montani, 2017).
Finally, Table 3 shows ablations on our discrete annotation framework, showing the contribution of each component of our paradigm.

Discussion and Conclusion
We presented discrete annotation, an attractive alternative to pairwise annotation in active learning of coreference resolution in low-resource domains. By adding a simple question to the annotation interface, we obtained significantly better models per human-annotation hour. In addition, we introduced a clustering technique which further optimizes sample selection during the annotation process. More broadly, our work suggests that improvements in annotation interfaces can elicit responses which are more efficient in terms of the obtained performance versus the invested annotation time.

A Appendix
A.1 Timing Experiment Details and Computations.
In order to properly calibrate the results from discrete and pairwise querying, we conducted experiments (eight 30-minute sessions) to time how long annotators take to answer discrete and pairwise questions. See Figure 5 for the interface we designed for our experiments. The questions we ask in the experiment are all sampled from real queries from full runs of our active learning simulations. To obtain representative times, we sampled a diverse selection of active learning questions: at various stages of active learning (first iteration before retraining vs. after retraining n times) and with various numbers of annotations per document (20 vs. 200). For each document, we randomly selected between 1-5 questions (of the total 20 or 200) to ask the annotator. Full details on how we sampled our queries can be found in Table 1. Note that we divided our samples into two datasets: we ran four 30-minute sessions with Dataset A before Dataset B and four 30-minute sessions with Dataset B before Dataset A, for a total of eight 30-minute sessions across 7 annotators (1 annotator completed a 1-hour session).
Since pairwise annotation is the same as answering only the initial question under the discrete setting, we run a single discrete experiment for each annotation session and use the time taken to answer an initial question as a proxy for pairwise annotation time. Our results show that answering the initial question took an average of 15.96s, whereas answering the follow-up question took 15.57s. Thus, we derive the following formulas to compute the time it takes for pairwise and discrete annotation:

t_pairwise = 15.96 p    (7)
t_discrete = 15.96 (d_c + d_nc) + 15.57 d_nc    (8)

where p = # of pairwise instances, and d_c, d_nc = # of discrete instances for which the initial pair was "coreferent" (d_c) and "not coreferent" (d_nc), respectively. We also compute the number of pairwise examples p we can query in the same time it takes to query d_c + d_nc discrete examples:

15.96 p = 15.96 d_c + (15.96 + 15.57) d_nc    (9)

Moreover, we additionally conduct a single 30-minute experiment to determine how long it takes to answer only discrete questions (without the initial pairwise step). We find that it takes 28.01s per question under the only-discrete setting. This is longer than the time it takes to answer a pairwise question, thus confirming that having an initial pairwise question indeed saves time if the pair is coreferent. Moreover, this also shows that answering the initial pairwise question significantly helps with answering the follow-up discrete question.
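Equations 7 and 8 are simple arithmetic over the two measured times; a sketch that reproduces the worst-case figures reported in Section A.3 (201 discrete questions, all answered "not coreferent", vs. 15,050 pairwise questions):

```python
T_INIT, T_FOLLOWUP = 15.96, 15.57  # seconds, from the timing sessions

def pairwise_time(p):
    """Equation 7: every pairwise instance costs one initial question."""
    return T_INIT * p

def discrete_time(d_c, d_nc):
    """Equation 8: every discrete instance starts with the initial
    question; the follow-up is only asked for the d_nc non-coreferent
    pairs."""
    return T_INIT * (d_c + d_nc) + T_FOLLOWUP * d_nc

# worst case for an average document: all 201 pairs non-coreferent
worst_discrete = discrete_time(0, 201)   # 6337.53 s
worst_pairwise = pairwise_time(15050)    # 240198.0 s
ratio = worst_discrete / worst_pairwise  # about 2.64%
```

These are the same numbers used in the fully-labelling analysis of Section A.3.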

A.2 Additional Model Adaptations
Adapting Link Relations for our Model  We use must-link and cannot-link relations between mentions to guide our active learning selector. We revise probabilities and model outputs (from which the model computes uncertainty scores for entropy, QBC, and LCC/MCU) according to the following rules:

1. Clustered entropy. For every CL(a, b) relationship, we set P(ant(a) = b) = 0 and re-normalize the probabilities of all other candidate antecedents. This decreases the probability that the active learning selector chooses a. Moreover, for every ML(a, b) relationship, we set P(ant(a) = b) = 1 and P(ant(a) = c) = 0 for all c ≠ b. If there are multiple ML relationships involving a, we choose only one of a's antecedents to set to 1 (to maintain the integrity of the probability distribution). This guarantees that the active learning selector will never select a, as any ML link out of a means we have already queried for a.

To determine how much time our incremental closure algorithm saves over recomputing closures from scratch, we simulated annotations on a single document with 1600 mentions, and recorded how long it took to re-compute the closure after each annotation. Our experiments show that recomputing from scratch takes progressively longer as more labels are added: at 1600 labels, our incremental algorithm is 556 times faster than recomputing from scratch (1630ms vs. 2.93ms). Figure 6 plots the run-time of our incremental closure algorithm ("incremental closure") against the run-time of recomputing closures from scratch ("closure") using Equations 3 and 4. In the latter case, we keep track of the set of user-added edges, which we update after each annotation, and re-compute the closures from that set.

A.3 Additional Analysis
Computing the time to fully label a document under discrete and pairwise annotation.  First, we compute the maximum number of pairwise questions we can ask. We consider the setup of Lee et al. (2017)'s model. This model considers only the spans with the highest mention scores (the "top spans"), and considers at most K antecedents per top span. Thus, for a document with m top spans, we can ask up to

K(K − 1)/2 + (m − K)K    (10)

pairwise questions. The first term, K(K − 1)/2, comes from considering the first K spans in the document: for each of these spans i = 1, ..., K, we can ask about the first i − 1 spans. The second term, (m − K)K, comes from considering the spans after the K-th span: for each of these m − K spans, we can only consider up to K antecedents. Using statistics for the average document (m = 201) and the standard hyperparameter setting (K = 100), Equation 10 yields 15,050 pairwise questions needed to fully label a document in the worst case. Meanwhile, the maximum number of discrete questions we can ask is only 201 (i.e., asking for the antecedent of every mention). Using timing Equations 7 and 8, we compute that it takes at most 6337.53s to answer 201 discrete questions in the worst-case scenario, and 240,198s to answer 15,050 pairwise questions. Thus, in the worst case for both discrete and pairwise selection, discrete selection takes only 2.64% of the time it takes pairwise selection to fully label a document.
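The question-count argument above reduces to one formula; a sketch of Equation 10 and the per-document bound for discrete annotation:

```python
def max_pairwise_questions(m, K):
    """Equation 10: the first K spans contribute K(K-1)/2 pairs; each
    of the remaining m-K spans has at most K candidate antecedents."""
    return K * (K - 1) // 2 + (m - K) * K

def max_discrete_questions(m):
    """At most one antecedent query per mention."""
    return m

n_pairwise = max_pairwise_questions(m=201, K=100)  # 15050
n_discrete = max_discrete_questions(m=201)         # 201
```

The two counts differ by nearly two orders of magnitude, which is the source of the 2.64% worst-case time ratio.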
Quantifying "Information Gain" from Discrete and Pairwise Annotation.  Let D_U be the set of training documents we are annotating in a given round of active learning. To better quantify how much information discrete and pairwise annotation can supply in the same amount of time, we define ∆F1 as the change in the F1 score on D_U before and after model predictions are supplemented with user annotation. Figure 7 shows the average ∆F1 as annotation time increases for discrete and pairwise annotation. Across the 10 annotation times we recorded, discrete annotation results in an average ∆F1 more than twice that of pairwise, in the same annotation time.

A.4 Hyperparameters
Model. We preserve the hyperparameters from the AllenNLP implementation of Lee et al. (2017)'s model. The AllenNLP implementation mostly maintains the original hyperparameters, except it sets the maximum number of antecedents considered to K = 100, and excludes speaker features and variational dropout, due to machine memory limitations.
Training. We use a 700/2102 fully-labelled/unlabelled initial split of the training data, and actively label 280 documents at a time. We train to convergence each round. Before all documents have been added, we train up to 20 epochs with a patience of 2 before adding more training documents. After all documents have been added, we retrain from scratch and use the original training hyperparameters from Lee et al. (2017).
Selectors. For query-by-committee, we use a committee of M = 3 models. We were not able to experiment with more due to memory constraints.
For LCC/MCU, given L annotations per document, we allocate n annotations to least-coreferent clustered mentions and the remaining m to most-coreferent unclustered mentions. We use n = min(L/2, number of clustered spans), and m = min(L − n, number of unclustered spans).

A.5 Active Learning Training Setup Full Details
In our active learning setup, we begin by training our model on a 700-document subset of the full training set. We discard the labels of the remaining 2102 documents. In each round of active learning, we choose 280 unlabelled documents, and query up to Q annotations per document. We then add these documents to the labelled set and continue training our model on this set (now with new documents). After all documents have been labelled, we retrain our model on the full document set from scratch, resetting all model and trainer parameters.
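The loop described above can be sketched as a skeleton with hypothetical callbacks standing in for the coreference model (`train_fn`), the clustered selector (`select_fn`), and the human annotator (`annotate_fn`); this mirrors Algorithm 2 only in broad strokes.

```python
def active_learning(train_fn, select_fn, annotate_fn,
                    seed_docs, pool, round_size=280, queries=20):
    """Skeleton of the active learning training loop: train on the
    seed set, then repeatedly select a batch of unlabelled documents,
    query up to `queries` annotations each, and continue training;
    finally retrain from scratch on the fully augmented set."""
    labelled = list(seed_docs)
    model = train_fn(labelled, from_scratch=True)
    pool = list(pool)
    while pool:
        batch, pool = pool[:round_size], pool[round_size:]
        for doc in batch:
            for _ in range(queries):
                query = select_fn(model, doc)   # least-certain mention
                doc = annotate_fn(doc, query)   # discrete annotation
            labelled.append(doc)
        model = train_fn(labelled, from_scratch=False)
    return train_fn(labelled, from_scratch=True)
```

The final `from_scratch=True` call corresponds to the full retraining step, resetting all model and trainer parameters once every document has been labelled.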
In Algorithm 2, we show our main training loop for active learning using discrete selection. This is the training loop we use for our clustered entropy and LCC/MCU selectors, and our partially-labelled random baseline. In Algorithm 3, we modify that loop for the clustered query-by-committee selector.
In Algorithm 1, we show our incremental closures algorithm, which builds up the transitive closure incrementally by adding only the minimum number of necessary links to maintain the closure each time a new link is added.

Algorithm 1: Incremental Link Closures Algorithm
Let (a, b) = the link pair being added, A = a's old cluster before the pair is added, B = b's old cluster before the pair is added, Ā = the set of elements a has a CL relationship to before the pair is added, and B̄ = the set of elements b has a CL relationship to before the pair is added.
1. If pair (a, b) was added to must-link, both must-link and cannot-link need to be updated.
First, resolve the MLs by adding an ML relationship between every element in A and every element in B: