CorefQA: Coreference Resolution as Query-based Span Prediction

In this paper, we present CorefQA, an accurate and extensible approach for the coreference resolution task. We formulate the problem as a span prediction task, like in question answering: A query is generated for each candidate mention using its surrounding context, and a span prediction module is employed to extract the text spans of the coreferences within the document using the generated query. This formulation comes with the following key advantages: (1) The span prediction strategy provides the flexibility of retrieving mentions left out at the mention proposal stage; (2) In the question answering framework, encoding the mention and its context explicitly in a query makes it possible to have a deep and thorough examination of cues embedded in the context of coreferent mentions; and (3) A plethora of existing question answering datasets can be used for data augmentation to improve the model’s generalization capability. Experiments demonstrate significant performance boost over previous models, with 83.1 (+3.5) F1 score on the CoNLL-2012 benchmark and 87.5 (+2.5) F1 score on the GAP benchmark.


Introduction
Recent coreference resolution systems (Lee et al., 2017, 2018; Zhang et al., 2018a; Kantor and Globerson, 2019) consider all text spans in a document as potential mentions and learn to find an antecedent for each possible mention. There are two key issues with this paradigm, in terms of task formalization and the algorithm.
At the task formalization level, mentions left out at the mention proposal stage can never be recovered, since the mention-ranking model only operates on the proposed mentions. This would not be a big issue if a perfect or nearly perfect mention proposal model existed. However, mention proposal is intrinsically hard because singleton mentions are not explicitly labeled. The coreference dataset can only provide a weak signal for spans that correspond to entity mentions, as verified in Zhang et al. (2018a). Due to the inferiority of the mention proposal model, it would be favorable if a coreference framework had a mechanism to retrieve left-out mentions.
At the algorithm level, existing end-to-end solutions (Lee et al., 2017, 2018; Zhang et al., 2018a) score each pair of mentions based on mention representations from the output layer of a contextualization model. This means that (1) the model lacks explicit emphasis on the mentions and their contexts and (2) semantic matching operations between two mentions (and their contexts) are performed only at the output layer and are relatively superficial. It is therefore hard for these models to capture all the lexical, semantic and syntactic cues in the context.
In view of these issues, we propose a new approach that formulates the coreference resolution problem as a span prediction task, akin to the machine reading comprehension (MRC) setting. A query is generated for each candidate mention using its surrounding context, and a span prediction module is employed to extract the text spans of the coreferences within the document using the generated query. Concrete examples are shown in Figure 1.
This formulation provides benefits at both the task formulation level and the algorithm level. At the task formulation level, since left-out mentions can still be retrieved at the span prediction stage, the negative effect of undetected mentions is significantly alleviated. At the algorithm level, by generating a query for each candidate mention using its surrounding context, the model explicitly emphasizes the surrounding context of the mentions of interest, the influence of which will later be propagated to each input word using the self-attention mechanism. Additionally, unlike existing end-to-end solutions (Lee et al., 2017, 2018; Zhang et al., 2018a), where the interactions between two mentions are only superficially modeled at the output layer of contextualization, span prediction requires a more thorough and deeper examination of the lexical, semantic and syntactic cues within the context, which will potentially lead to better performance.

Original Passage: In addition, many people were poisoned when toxic gas was released. They were poisoned and did not know how to protect themselves against the poison.
Converted Questions:
Q1: Who were poisoned when toxic gas was released? A1: [They, themselves]
Q2: What was released when many people were poisoned? A2: [the poison]
Q3: Who were poisoned and did not know how to protect themselves against the poison? A3: [many people, themselves]
Q4: Whom did they not know how to protect against the poison? A4: [many people, They]
Q5: They were poisoned and did not know how to protect themselves against what? A5: [toxic gas]
Figure 1: An illustration of the paradigm shift from coreference resolution to query-based span prediction. Spans with the same color represent coreferent mentions. Note that we use a more direct strategy to generate the questions based on the mentions.
Another key advantage of the proposed MRC formulation is that it allows us to take advantage of existing question answering datasets (Rajpurkar et al., 2016a, 2018; Dasigi et al., 2019a); this matters because coreference annotation is expensive, cumbersome and often requires linguistic expertise from annotators. Under the proposed formulation, coreference resolution has the same format as these question answering datasets, which can thus readily be used for data augmentation. We show that pre-training on existing question answering datasets improves the model's generalization and transferability, leading to an additional performance boost.
Experiments show that the proposed framework outperforms previous models by a large margin. Specifically, we achieve new state-of-the-art scores of 87.5 (+2.5) on the GAP benchmark and 83.1 (+3.5) on the CoNLL-2012 benchmark.

Coreference Resolution
Coreference resolution is a fundamental problem in natural language processing and is considered a good test of machine intelligence (Morgenstern et al., 2016). Neural network models have shown promising results over the years. Earlier neural models (Wiseman et al., 2016; Clark and Manning, 2015, 2016) rely on parsers and hand-engineered mention proposal algorithms. Recent work (Lee et al., 2017, 2018; Kantor and Globerson, 2019) solves the problem in an end-to-end fashion by jointly detecting mentions and predicting coreferences. Based on how entity-level information is incorporated, these models can be further categorized as (1) entity-level models (Björkelund and Kuhn, 2014; Clark and Manning, 2015, 2016; Wiseman et al., 2016), which directly model the representations of real-world entities, and (2) mention-ranking models (Durrett and Klein, 2013; Wiseman et al., 2015; Lee et al., 2017), which learn to select the antecedent of each anaphoric mention.

Formalizing NLP Tasks as MRC
Machine reading comprehension is a general and extensible task form. Many tasks in natural language processing can be framed as reading comprehension while abstracting away the task-specific modeling constraints.
McCann et al. (2018) introduced the decaNLP challenge, which converts a set of 10 core NLP tasks to reading comprehension. He et al. (2015) showed that semantic role labeling annotations could be solicited by using question-answer pairs to represent the predicate-argument structure. Levy et al. (2017) reduced relation extraction to answering simple reading comprehension questions, yielding models that generalize better in the zero-shot setting. Li et al. (2019a,b) cast the tasks of named entity extraction and relation extraction as reading comprehension problems.
Figure 3 (token conversion) illustrates how queries are built: the passage "I was hired to do some Christmas music, and it was just 'Jingle Bells' and I brought my cat with me to the studio, and I was working on the song and the cat jumped up into the record booth and started meowing along, meowing to me." is paired with queries in which a candidate mention is marked, e.g., <mention> I <\mention>, <mention> my cat <\mention> or <mention> the song <\mention>, yielding the mention clusters [I, I, my, me, I, me], [my cat, the cat] and ["Jingle Bells", the song].

In parallel to our work, Aralikatte et al. (2019) converted coreference and ellipsis resolution into a question answering format, and showed the benefits of training joint models for these tasks. Their models are built under the assumption that gold mentions are provided at inference time, whereas our model does not need that assumption: it jointly trains the mention proposal and coreference resolution models in an end-to-end manner.

Data Augmentation
Data augmentation is a strategy that enables practitioners to significantly increase the diversity of data available for training models. Data augmentation techniques have been explored in various fields such as question answering (Talmor and Berant, 2019), text classification (Kobayashi, 2018) and dialogue language understanding (Hou et al., 2018). In coreference resolution, Zhao et al. (2018), Emami et al. (2019) and Zhao et al. (2019) focused on debiasing the gender bias problem; Aralikatte et al. (2019) explored the effectiveness of joint modeling of ellipsis and coreference resolution. To the best of our knowledge, we are the first to use existing question answering datasets as data augmentation for coreference resolution.

Model
In this section, we describe the proposed model (denoted as CorefQA) in detail.The overall architecture is illustrated in Figure 2.

Notations
Given a sequence of input tokens X = {x_1, x_2, ..., x_n} in a document, where n denotes the length of the document, N = n(n + 1)/2 denotes the number of all possible text spans in X. Let e_i denote the i-th span representation (1 ≤ i ≤ N), with start index FIRST(i) and end index LAST(i).
The task of coreference resolution is to determine the antecedents for all possible spans.If a candidate span e i does not represent an entity mention or is not coreferent with any other mentions, a dummy token is assigned as its antecedent.The linking between all possible spans e defines the final clustering.
To fit long documents into SpanBERT, we use a sliding-window approach that creates a T-sized (set to 512) segment every T/2 tokens. Segments are then passed to the SpanBERT encoder independently. The final representation of each token is taken from the segment in which that token has maximum context.
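The sliding-window scheme described above can be sketched as follows. This is an illustrative sketch, not the authors' implementation: the function names are our own, and "context" is taken to be a token's distance to the nearer segment edge, which is one common reading of the max-context rule.

```python
def make_segments(tokens, T=512):
    """Split a long document into overlapping T-sized segments with stride T/2."""
    stride = T // 2
    segments = []
    for start in range(0, max(len(tokens), 1), stride):
        segments.append((start, tokens[start:start + T]))
        if start + T >= len(tokens):
            break
    return segments

def context_size(token_idx, seg_start, seg_len):
    """A token's context within a segment: distance to the nearer segment edge."""
    pos = token_idx - seg_start
    return min(pos, seg_len - 1 - pos)

def pick_segments(tokens, T=512):
    """For each token, choose the segment that gives it maximum context;
    the final token representation would be taken from that segment."""
    segments = make_segments(tokens, T)
    best = [None] * len(tokens)
    best_ctx = [-1] * len(tokens)
    for seg_id, (start, seg) in enumerate(segments):
        for i in range(start, start + len(seg)):
            ctx = context_size(i, start, len(seg))
            if ctx > best_ctx[i]:
                best_ctx[i] = ctx
                best[i] = seg_id
    return best
```

For a 1,000-token document with T = 512, segments start at positions 0, 256 and 512; a token near position 500 sits close to the edge of the first segment but near the middle of the second, so its representation is taken from the second segment.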

Mention Proposal
Similar to Lee et al. (2017), our model considers all spans up to a maximum length L (set to 10) as potential mentions. To maintain computational efficiency, we further prune the candidate spans greedily during both training and evaluation. To do so, the mention score s_m(i) of each candidate span is computed by feeding the first and the last of its constituent token representations into a feed-forward layer:

s_m(i) = FFNN_m([x_FIRST(i); x_LAST(i)])    (1)

where x_FIRST(i) and x_LAST(i) represent the first and the last token representation of the i-th candidate span, and FFNN_m(·) denotes the feed-forward neural network that computes a nonlinear mapping from the input vector to the mention score. We only keep up to λT (where T is the document length and λ is empirically set to 0.2) spans with the highest mention scores.
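A minimal sketch of this proposal-and-pruning step is below. The learned FFNN_m is stood in for by a placeholder dot product over boundary token vectors, so the scores are illustrative only; the enumeration of spans up to length L and the top-λT pruning follow the text.

```python
import heapq

def propose_mentions(token_reprs, L=10, lam=0.2, score_fn=None):
    """Enumerate all spans up to length L, score each from its boundary token
    representations, and keep the top lambda*T spans (T = document length)."""
    T = len(token_reprs)
    if score_fn is None:
        # Placeholder for FFNN_m([x_FIRST(i); x_LAST(i)]): a simple dot product.
        score_fn = lambda first, last: sum(a * b for a, b in zip(first, last))
    spans = []
    for start in range(T):
        for end in range(start, min(start + L, T)):
            spans.append((score_fn(token_reprs[start], token_reprs[end]), start, end))
    k = max(1, int(lam * T))
    return heapq.nlargest(k, spans)  # highest-scoring candidate spans
```

With five one-dimensional token "representations" [1], ..., [5], L = 2 and λ = 0.4, the two surviving spans are the ones whose boundary products are largest.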

Mention Linking as MRC Span Prediction
Given a mention e_i proposed by the mention proposal network, the role of the mention linking network is to give a score s_a(i, j) for any text span e_j, indicating whether e_i and e_j are coreferent. We propose to use the MRC framework as the backbone to compute s_a(i, j). It operates on the triplet {context (X), query (q), answers (a)}. The context X is the input document. The query q(e_i) is constructed as follows: given e_i, we use the sentence that e_i resides in as the query, with the minor modification that we encapsulate e_i with the special tokens <ref> and </ref>. The answers a are the coreferent mentions of e_i.
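The query construction step is simple enough to sketch directly; this is an illustrative helper (the function name and tokenization are our own), wrapping the mention span in the special tokens described above.

```python
def build_query(sentence_tokens, mention_start, mention_end):
    """Build the query q(e_i): the mention's sentence with the mention span
    (inclusive, sentence-local indices) wrapped in <ref> ... </ref> tokens."""
    return (sentence_tokens[:mention_start]
            + ["<ref>"]
            + sentence_tokens[mention_start:mention_end + 1]
            + ["</ref>"]
            + sentence_tokens[mention_end + 1:])
```

For instance, marking "my cat" in "I brought my cat" yields the token sequence I brought <ref> my cat </ref>.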
Following Devlin et al. (2019), we represent the input question and the context as a single packed sequence. Since a mention can have multiple coreferent mentions, we follow Li et al. (2019a,b) and generate a BIO tag for each token. BIO tags respectively mark the beginning (B), inside (I) and outside (O) of a coreferent mention. It is worth noting that there exist unanswerable questions, where the labels for all tokens in X are O. A question is considered unanswerable in the following scenarios: (1) the candidate span e_i does not represent an entity mention, or (2) the candidate span e_i represents an entity mention but is not coreferent with any other mentions in X.
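The BIO label construction can be sketched as follows (an illustrative helper with assumed names): every context token gets O, except tokens inside a coreferent mention of the query mention, which get B (first token) or I (rest). An all-O sequence encodes an unanswerable question.

```python
def bio_labels(context_len, answer_spans):
    """answer_spans: list of (start, end) inclusive token indices of the
    coreferent mentions. Returns one B/I/O label per context token."""
    labels = ["O"] * context_len
    for start, end in answer_spans:
        labels[start] = "B"              # beginning of a coreferent mention
        for i in range(start + 1, end + 1):
            labels[i] = "I"              # inside the mention
    return labels
```

An empty answer list produces the all-O labeling used for unanswerable queries.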
The probability p(tag | x_t) of assigning a tag ∈ {B, I, O} to token x_t is computed as:

p(· | x_t) = softmax(FFNN_tag(x_t))    (2)

where FFNN_tag(·) represents the feed-forward neural network that computes a nonlinear mapping from the token representation to the tag logits.
We further extend the token-level score in Eq. 2 to the span level. The anaphora score s_a(j|i), the compatibility score of span j being an answer for span i, is calculated as the log probability of its beginning word taking the B tag and the rest taking the I tag:

s_a(j|i) = log p(B | x_FIRST(j)) + Σ_{FIRST(j) < t ≤ LAST(j)} log p(I | x_t)    (3)

A closer look at Eq. 3 reveals that it only models the uni-directional coreference relation from e_i to e_j, i.e., e_j is the answer for query q(e_i). This is suboptimal: if e_i is a coreferent mention of e_j, then e_j should also be a coreferent mention of e_i. We thus need to model the bi-directional relation between e_i and e_j.⁴ The final score s_a(i, j) is thus given as follows:

s_a(i, j) = s_a(j|i) + s_a(i|j)    (4)

where s_a(i|j) is computed in the same way as s_a(j|i), with q(e_j) used as the query. For a pair of text spans e_i and e_j, the premises for them being coreferent mentions are (1) they are mentions and (2) they are coreferent. The overall score s(i, j) for e_i and e_j thus combines Eq. 1 and Eq. 4:

s(i, j) = s_m(i) + s_m(j) + s_a(i, j)    (5)
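The span-level scoring can be sketched numerically as below. This is a hedged sketch: the per-token tag log-probabilities stand in for the FFNN_tag softmax outputs, and the overall score is shown as a plain sum of the bidirectional anaphora scores and the two mention scores, which is one natural reading of the combination described in the text.

```python
import math

def anaphora_score(log_probs, start, end):
    """s_a(j|i): log p(B) at the span's first token plus log p(I) for the rest.
    log_probs[t] maps each tag ('B', 'I', 'O') to its log probability at token t."""
    score = log_probs[start]["B"]
    for t in range(start + 1, end + 1):
        score += log_probs[t]["I"]
    return score

def pair_score(sa_j_given_i, sa_i_given_j, sm_i, sm_j):
    """s(i, j): bidirectional anaphora scores plus both mention scores."""
    return sa_j_given_i + sa_i_given_j + sm_i + sm_j
```

Because the score is a sum of log probabilities, a span is rewarded only when every one of its tokens is confidently tagged as part of a coreferent mention.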

Antecedent Pruning
Given a document X with length n and O(n^2) spans, the computation of Eq. 5 for all mention pairs is intractable, with a complexity of O(n^4). Even for an extracted mention e_i, computing Eq. 5 for (e_i, e_j) over all e_j is still extremely intensive, since the backward span prediction score s_a(i|j) requires running MRC modeling for every query q(e_j).
A further pruning procedure is thus needed: For each query q(e i ), we collect C (empirically set to 50) span candidates only based on the s a (j|i) scores.

Training
For each mention e_i proposed by the mention proposal network, associated with C potential spans proposed by the mention linking network based on s_a(j|i), we aim to optimize the marginal log-likelihood of all correct antecedents implied by the gold clustering. Following Lee et al. (2017), we append a dummy token to the C candidates; the model outputs it if none of the C span candidates is coreferent with e_i. For each mention e_i, the model learns a distribution P(·) over all possible antecedent spans e_j based on the global score s(i, j) from Eq. 5:

P(e_j) = exp(s(i, j)) / Σ_{j' ∈ C} exp(s(i, j'))    (6)

The mention proposal module and the mention linking module are jointly trained in an end-to-end fashion using training signals from Eq. 6, with the SpanBERT parameters shared. The SpanBERT parameters are updated by the Adam optimizer (Kingma and Ba, 2015) with initial learning rate 1e-5, and the task parameters are updated by the Range optimizer with initial learning rate 2e-4.

⁴ This bidirectional relationship is referred to as mutual dependency and has been shown to benefit a wide range of NLP tasks such as machine translation (Hassan et al., 2018) and dialogue generation (Li et al., 2015).
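The softmax over candidates and the marginal log-likelihood objective can be sketched as follows; the dummy antecedent is simply one more entry in the score list, and the helper names are our own.

```python
import math

def antecedent_distribution(scores):
    """Softmax over candidate antecedent scores s(i, j), with the dummy's
    score included as one of the entries (Eq. 6)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]  # subtract max for numerical stability
    z = sum(exps)
    return [e / z for e in exps]

def marginal_log_likelihood(scores, gold_indices):
    """Log of the summed probability mass placed on all correct antecedents
    implied by the gold clustering; the training loss is its negative."""
    probs = antecedent_distribution(scores)
    return math.log(sum(probs[g] for g in gold_indices))
```

Marginalizing over all gold antecedents (rather than picking one) lets the model get full credit for linking to any correct member of the cluster.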

Inference
Given an input document, we can obtain an undirected graph using the overall score, where each node represents a candidate mention from either the mention proposal module or the mention linking module. We prune the graph by keeping, for each node, the edge whose weight is the largest based on Eq. 6. Nodes whose closest neighbor is the dummy token are abandoned. The mention clusters can then be decoded from the graph.
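The decoding step above amounts to computing connected components over the kept edges. A minimal union-find sketch (illustrative; the data layout is our own, with None standing for the dummy token):

```python
def decode_clusters(best_antecedent):
    """best_antecedent maps each mention to its highest-scoring antecedent,
    with None standing for the dummy token. Mentions linked to the dummy are
    abandoned; connected components of the remaining edges form the clusters."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for mention, ante in best_antecedent.items():
        if ante is not None:               # dummy-linked mentions are dropped
            parent[find(mention)] = find(ante)

    clusters = {}
    for mention in list(parent):
        clusters.setdefault(find(mention), set()).add(mention)
    return [c for c in clusters.values() if len(c) > 1]
```

For the Figure 1 example, linking "They" to "many people" and "themselves" to "They" yields a single three-mention cluster, while a mention whose best edge is the dummy is discarded.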

Data Augmentation using MRC Datasets
We hypothesize that the kinds of reasoning (such as synonymy, world knowledge, syntactic variation, and multi-sentence reasoning) required to answer MRC questions are also indispensable for coreference resolution. Annotated MRC datasets are usually significantly larger than coreference datasets due to the high linguistic expertise required for the latter. Under the proposed MRC formulation, coreference resolution has the same format as the existing question answering datasets (Rajpurkar et al., 2016a, 2018; Dasigi et al., 2019a), so they can readily be used for data augmentation. We thus propose to pretrain the mention linking network on the Quoref dataset (Dasigi et al., 2019b) and the SQuAD dataset (Rajpurkar et al., 2016b).

Summary and Discussion
Compared with existing models (Lee et al., 2017, 2018; Joshi et al., 2019b), the proposed MRC formalization has the flexibility of retrieving mentions left out at the mention proposal stage. However, since we still rely on a mention proposal model, we need to know in which situations missed mentions can be retrieved and in which they cannot. We use the example in Figure 1 as an illustration, in which {many people, They, themselves} are coreferent mentions. If some mentions are missed by the mention proposal model, e.g., many people and They, they can still be retrieved at the MRC mention linking stage when a non-missed mention (i.e., themselves) is used as the query. But if all the mentions within a cluster are missed, none of them can be used for query construction, which means they will all be irreversibly left out. Given that the mention proposal network proposes a significant number of mentions, the chance that all mentions within a cluster are missed is relatively low (and decreases exponentially as the number of mentions in the cluster increases). This explains the superiority (though far from perfect) of the proposed model. However, how to completely remove the mention proposal network remains an open problem in coreference resolution.

Baselines
We compare the proposed model with previous neural models that are trained end-to-end:
• e2e-coref (Lee et al., 2017) is the first end-to-end coreference system that jointly learns which spans are entity mentions and how to best cluster them. Its token representations are built upon GloVe (Pennington et al., 2014) and Turian (Turian et al., 2010) embeddings.
• EE + BERT-large (Kantor and Globerson, 2019) represents each mention in a cluster via an approximation of the sum of all mentions in the cluster.

Results on GAP
The GAP dataset (Webster et al., 2018) is a gender-balanced dataset that targets the challenges of resolving naturally occurring ambiguous pronouns. It comprises 8,908 coreference-labeled pairs of (ambiguous pronoun, antecedent name) sampled from Wikipedia. We follow the protocols in Webster et al. (2018) and Joshi et al. (2019b) and use the off-the-shelf resolver trained on the CoNLL-2012 dataset to obtain performance on the GAP dataset. Table 2 presents the results: the proposed model achieves state-of-the-art performance on all metrics on the GAP dataset.
We compare the proposed model with several baseline models in Table 1. Our system achieves a substantial performance boost over existing systems: with SpanBERT-base, it achieves an F1 score of 79.9, which already outperforms the previous SOTA model using SpanBERT-large by 0.3; with SpanBERT-large, it achieves an F1 score of 83.1, a 3.5-point improvement over the previous SOTA system.

Effects of Different Modules in the Proposed Framework
Effect of SpanBERT Replacing SpanBERT with vanilla BERT leads to a 3.8 F1 degradation. This verifies the importance of span-level pre-training for coreference resolution and is consistent with previous findings (Joshi et al., 2019a).
Effect of Pre-training the Mention Proposal Network Skipping the pre-training of the mention proposal network on gold mentions results in a 7.5 F1 degradation, which is in line with our expectation. A randomly initialized mention proposal model implies that mentions are randomly selected, and randomly selected mentions will mostly be transformed into unanswerable questions. This makes it hard for the MRC model to learn at the initial training stage, leading to inferior performance.
Effect of MRC Pre-training on Augmented Datasets One of the most valuable strengths of converting anaphora resolution to question answering is that existing MRC datasets can readily be used for data augmentation. We see a contribution of 0.7 F1 from pre-training on the Quoref dataset (Dasigi et al., 2019a) and a contribution of 0.3 F1 from pre-training on the SQuAD dataset (Rajpurkar et al., 2016a).
Effect of MRC We aim to study the pure performance gain of the paradigm shift from mention-pair scoring to query-based span prediction. For this purpose, we replace the MRC module with the mention-pair scoring module described in Lee et al. (2018), while keeping everything else unchanged. We observe an 8.4 F1 degradation in performance, demonstrating the significant superiority of the proposed MRC framework over the mention-pair scoring framework.

Analyses on speaker modeling strategies
We compare our speaker modeling strategy (denoted by speaker as input), which directly concatenates the speaker's name with the corresponding utterance, with the strategy that treats speaker information as a binary mention-pair feature indicating whether two utterances are from the same speaker, as in Wiseman et al. (2016); as Figure 4 shows, the gain of our strategy is most pronounced for documents with a larger number of speakers. Compared with the coarse modeling of whether two utterances are from the same speaker, a speaker's name can be thought of as a speaker ID, as in persona-based dialogue learning (Li et al., 2016; Zhang et al., 2018b; Mazaré et al., 2018). Representations learned for names have the potential to better capture the global information about the speakers in a multi-party dialogue, leading to better context modeling and thus better results.

Analysis on the Overall Mention Recall
Since the proposed framework has the potential to retrieve mentions missed at the mention proposal stage, we expect it to have a higher overall mention recall rate than Lee et al. (2017, 2018), Zhang et al. (2018a) and Kantor and Globerson (2019).
We examine the proportion of gold mentions covered in the development set as we increase the hyperparameter λ (the number of spans kept per word) in Figure 5. Our model consistently outperforms the baseline model for various values of λ. Notably, our model is less sensitive to smaller values of λ, because missed mentions can still be retrieved at the mention linking stage.

Qualitative Analysis
We provide qualitative analyses to highlight the strengths of our model in Table 4.
As shown in Example 1, by explicitly formulating the anaphora identification of [the company] as a question, our model uses more information from the local context and successfully identifies [Freddie Mac] as the answer over a longer distance.
The model can also efficiently harness speaker information in a conversational setting. In Example 3, it would be difficult to identify [Thelma Gutierrez] as the correct antecedent of the mention [I] without knowing that Thelma Gutierrez is the speaker of the second utterance. Our model identifies it successfully by feeding the speaker's name directly at the input level.
Conclusion

We presented a state-of-the-art coreference resolution model that casts anaphora identification as query-based span prediction in MRC. We showed that the proposed formalization can successfully retrieve mentions left out at the mention proposal stage, and makes data augmentation with a plethora of existing question answering datasets possible. Furthermore, a new speaker modeling strategy can also boost performance in dialogue settings.

Figure 2: The overall architecture of our model.

Figure 3: An illustration of the token conversion.

Figure 4: Performance on the development set of the CoNLL-2012 dataset with varying numbers of speakers. F1 (Speaker as feature): F1 score for the strategy that treats speaker information as a mention-pair feature. F1 (Speaker as input): F1 score for our strategy that treats speaker names as token input. Frequency: percentage of documents with a specific number of speakers.

Table 1: Evaluation results on the English CoNLL-2012 shared task. The average F1 of MUC, B^3, and CEAF_φ4 is the main evaluation metric. Ensemble models are not included in the table for a fair comparison.

Table 2: CorefQA achieves state-of-the-art performance on all metrics, including F1 scores on Masculine and Feminine examples, the Bias factor (F/M) and the Overall F1 score.
1. [Freddie Mac] is giving golden parachutes to two of its ousted executives. . .. Yesterday Federal Prosecutions announced a criminal probe into [the company].
2. [A traveling reporter] now on leave and joins us to tell [her] story. Thank [you] for coming in to share this with us.
3. Paula Zahn: [Thelma Gutierrez] went inside the forensic laboratory where scientists are trying to solve this mystery. Thelma Gutierrez: In this laboratory alone [I]'m surrounded by the remains of at least twenty different service members who are in the process of being identified so that they too can go home.

Table 4: Example mention clusters that were correctly predicted by our model but wrongly predicted by c2f-coref + SpanBERT-large. Bold spans in brackets represent coreferent mentions. Italic spans represent the speaker's name of the utterance.