Exploring Span Representations in Neural Coreference Resolution

In coreference resolution, span representations play a key role in predicting coreference links accurately. We present a thorough examination of the span representation derived by applying BERT to coreference resolution (Joshi et al., 2019) using a probing model. Our results show that the span representation is able to encode a significant amount of coreference information. In addition, we find that the head-finding attention mechanism involved in creating the spans is crucial for encoding coreference knowledge. Finally, our analysis shows that the span representation cannot capture non-local coreference as efficiently as local coreference.


Introduction
Coreference resolution, the task of grouping all referring expressions that point to the same entity into a cluster, plays a key role in various higher-level NLP tasks that involve natural language understanding, such as information extraction, question answering, machine translation, text summarisation, and textual entailment. Referring expressions, or mentions, can be common nouns, proper nouns, or pronouns, and refer to a real-world entity known as the referent.
With the breakthrough of end-to-end neural systems (Lee et al., 2017), current coreference resolution systems are for the most part neural based. Contrary to previous architectures, which identified mentions and then made coreference decisions in two separate steps, these systems learn the two jointly. A typical system requires different levels of semantic representation of the input sentences, usually obtained by computing span-level representations from the word embeddings.
In a separate line of work, a wave of recent research has tried to inspect neural NLP models by associating neural network components with distinct linguistic phenomena by means of probing tasks (Shi et al., 2016; Liu et al., 2019a; Tenney et al., 2019).
Targeting the coreference task, in this paper we build a probing model (Tenney et al., 2019; Liu et al., 2019a) to determine to what degree coreference information is encoded in the span representations first proposed by Lee et al. (2017). Specifically, we generate mention-span representations with BERT embeddings fine-tuned on the OntoNotes dataset (Pradhan et al., 2012) and train a probing model to predict coreference arcs between two mentions from the mention-span representations alone. Moreover, we explore how fine-tuning BERT (Devlin et al., 2019) on coreference resolution affects the linguistic knowledge learned by the span representations. Given the well-documented difficulty of modelling long-distance coreference relations, we also measure the robustness of the span representations at different distance ranges between mentions.
Our probing models consistently achieve > 90% accuracy and F1, suggesting that span representations encode a significant amount of coreference information. They also show that fine-tuning a BERT model greatly helps with encoding coreference relations. By ablating components of the span representation, we find that the head-finding attention mechanism plays a crucial part in encoding important coreference information. Finally, we show that despite using a fine-tuned BERT, the span representations cannot capture non-local coreference relations efficiently. Our implementation is publicly available.

Span-Ranking Architecture
In this paper we focus on the span representation used in span-ranking models (Lee et al., 2017, 2018; Joshi et al., 2019) and examine its capability to encode the information necessary to make coreference decisions. Lee et al. (2017) proposed an end-to-end coreference resolution model that learns to jointly model mention detection and coreference prediction using span ranking. However, the model only computes scores between pairs of entity mentions. To address this weakness, Lee et al. (2018) proposed a model that captures higher-order interactions between mention spans in predicted coreference clusters. The model iteratively refines existing span representations, using the antecedent distribution as an attention mechanism. We further refer to this model as c2f-coref. Joshi et al. (2019) proposed to replace the bidirectional LSTM encoder in c2f-coref with BERT transformers and fine-tune it for coreference resolution. Although BERT significantly improves the state of the art in other NLP tasks (Devlin et al., 2019), coreference resolution still proves to be a challenging task, as the BERT encoder offers only a marginal performance increase. Furthermore, the model still struggles with modelling pronouns and resolving cases where mention paraphrasing is required. We further refer to this model as BERT-coref.

Probing Tasks
The most common method to explore linguistic properties in neural network components is to use the hidden state activations to predict the property of interest, also known as "probing tasks" (Conneau et al., 2018) or "auxiliary prediction tasks" (Adi et al., 2016). Shi et al. (2016) use the internal representations of an LSTM encoder as input to train a logistic regression classifier that predicts various syntactic properties. Conneau et al. (2018) study the linguistic properties of fixed-length sentence encoders with a bidirectional LSTM and gated convolutional networks. Liu et al. (2019a) explore representations produced by pre-trained contextualisers and demonstrate that frozen contextual representations fed into linear models can reach levels of performance similar to state-of-the-art task-specific models on many NLP tasks. They also used the coreference arc prediction task, whereby linear models are used to predict whether two mentions corefer. Coreference arc prediction was already used by Soon et al. (2001) as a part of the mention-pair model, where it is combined with heuristic procedures to merge coreference chains. Tenney et al. (2019), for their part, introduced the edge probing framework, which focuses on linguistic analysis at the sub-sentence level. Their approach relies on an FFNN with a projection layer and an attention mechanism on top of frozen contextual vectors to predict linguistic properties. Clark et al. (2019) further extended the probing-based approach by proposing attention-based probing classifiers, showing that the attention heads in BERT correspond to linguistic notions of syntax and coreference.
Our approach is most similar to Liu et al. (2019a) and Tenney et al. (2019), but we use the span representation learned by Lee et al.'s (2017) coreference resolution model and focus on examining coreference phenomena. Note that we use the coreference arc prediction task as a tool to better understand the span representation; we do not perform coreference resolution. Compared to Liu et al. (2019a), who consider single-token mentions only, we use mention-spans to predict coreference arcs. We also compare the span representation against a baseline span representation obtained from pre-trained contextual word embeddings (Tenney et al., 2019).

Span Representations
Span representations are key in span-ranking models, since they are used to compute a distribution over candidate antecedent spans. In order to predict coreference relations accurately, a span representation should capture information about both the span's internal structure and its surrounding context. For our experiments, we construct span representations as proposed by Lee et al. (2017), but following Joshi et al. (2019) we use BERT embeddings (Devlin et al., 2019) instead of an LSTM-based encoder to encode the lexical information of a span and its context. A span representation is a vector embedding consisting of context-dependent boundary representations and an attentional representation of the head words over the span. The boundary representations are composed of the first and last wordpieces of the span itself.
The head words are learned automatically using additive attention (Bahdanau et al., 2015) over each wordpiece in a span:

$$\alpha_t = \boldsymbol{w}_\alpha \cdot \mathrm{FFNN}_\alpha(\boldsymbol{x}^*_t), \qquad a_{i,t} = \frac{\exp(\alpha_t)}{\sum_{k=\mathrm{START}(i)}^{\mathrm{END}(i)} \exp(\alpha_k)}, \qquad \hat{\boldsymbol{x}}_i = \sum_{t=\mathrm{START}(i)}^{\mathrm{END}(i)} a_{i,t} \, \boldsymbol{x}_t,$$

where $\hat{\boldsymbol{x}}_i$ is a weighted vector representation of the wordpieces of span $i$. This representation is augmented with a feature vector $\phi_i \in \mathbb{R}^d$ (with $d = 20$) encoding the width of span $i$. The final representation $\boldsymbol{g}_i$ for span $i$ is

$$\boldsymbol{g}_i = [\boldsymbol{x}^*_{\mathrm{START}(i)}; \, \boldsymbol{x}^*_{\mathrm{END}(i)}; \, \hat{\boldsymbol{x}}_i; \, \phi_i],$$

where $\boldsymbol{x}^*_{\mathrm{START}(i)}$ and $\boldsymbol{x}^*_{\mathrm{END}(i)}$ are the first and last wordpieces of the span, and $\phi_i$ is the span-width embedding.
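As an illustration, the following minimal numpy sketch assembles $\boldsymbol{g}_i$ for a single span; for readability it replaces $\mathrm{FFNN}_\alpha$ with a single scoring vector, and all names are ours rather than the original implementation's:

```python
import numpy as np

def span_representation(x, start, end, width_emb, w_alpha):
    """Assemble g_i = [x*_start; x*_end; x_hat; phi] for one span.

    x         -- (seq_len, hidden) contextual wordpiece embeddings
    start,end -- inclusive span boundaries
    width_emb -- (max_width, 20) learned span-width embedding table
    w_alpha   -- (hidden,) scoring vector standing in for FFNN_alpha
    """
    tokens = x[start:end + 1]            # wordpieces inside the span
    alpha = tokens @ w_alpha             # unnormalised head-finding scores
    a = np.exp(alpha - alpha.max())
    a /= a.sum()                         # softmax over the span only
    x_hat = a @ tokens                   # attention-weighted head vector
    phi = width_emb[end - start]         # span-width feature, d = 20
    return np.concatenate([x[start], x[end], x_hat, phi])
```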

Coreference Arc Prediction
We focus on the coreference arc prediction task, which is part of the probing task suite for contextual word embeddings. In this task, a probing model is trained to determine whether two mentions refer to the same entity. We produce negative samples following the approach of Liu et al. (2019a). For every pair of gold mentions $(w_i, w_j)$ belonging to the same gold coreference cluster, where $w_i$ is an antecedent of $w_j$, we generate a negative example $(w_{random}, w_j)$, where $w_{random}$ is randomly sampled from a different coreference cluster.
This method ensures a balanced ratio between positive and negative examples. The negative examples do not contain any singleton mentions, as only coreferential mentions are annotated in OntoNotes. We also follow the approach of Tenney et al. (2019) by using spans of wordpieces for mentions, as Liu et al.'s approach is limited to single-token mentions and therefore unable to fully exploit the available information in a mention-span.
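A minimal sketch of this sampling scheme, assuming gold clusters are given as lists of mention spans in document order (the data structure is illustrative, not the paper's actual format):

```python
import random

def make_arc_examples(clusters):
    """Build balanced positive/negative mention pairs from gold clusters."""
    examples = []
    for c_idx, cluster in enumerate(clusters):
        # candidate negatives: mentions drawn from every *other* cluster
        others = [m for k, c in enumerate(clusters) if k != c_idx for m in c]
        for j in range(1, len(cluster)):
            w_i, w_j = cluster[j - 1], cluster[j]      # w_i antecedes w_j
            examples.append((w_i, w_j, 1))             # positive pair
            if others:                                 # matched negative pair
                examples.append((random.choice(others), w_j, 0))
    return examples
```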

The Probing Model
Our probing model is a simple feed-forward neural network (FFNN), designed with a limited capacity so as to focus on the information that can be extracted from the span representations. As input to the model, we take the span representations $\boldsymbol{g}_1$ and $\boldsymbol{g}_2$ of a pair of mention-spans, concatenate them, and pass them to the FFNN. The FFNN consists of a single hidden layer followed by a sigmoid output layer. The model is trained to minimise binary cross-entropy with respect to the gold label $Y \in \{0, 1\}$. The probing architecture is depicted in Figure 1.
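A sketch of this probe in Keras follows; the hidden size and initialisation match the implementation details given below, while the choice of optimiser is our assumption, not a detail from the paper:

```python
import tensorflow as tf

def build_probe(span_dim, hidden=1024):
    """Probing FFNN: two concatenated span representations in,
    one coreference probability out."""
    g1 = tf.keras.Input(shape=(span_dim,), name="g1")
    g2 = tf.keras.Input(shape=(span_dim,), name="g2")
    pair = tf.keras.layers.Concatenate()([g1, g2])
    h = tf.keras.layers.Dense(hidden, activation="relu",
                              kernel_initializer="he_normal")(pair)  # Kaiming init
    y_hat = tf.keras.layers.Dense(1, activation="sigmoid")(h)
    model = tf.keras.Model(inputs=[g1, g2], outputs=y_hat)
    model.compile(optimizer="adam",   # optimiser choice is an assumption
                  loss="binary_crossentropy", metrics=["accuracy"])
    return model
```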
We obtain the mention-span representations from BERT, a language representation model based on the Transformer architecture (Vaswani et al., 2017) and trained jointly with masked language modelling and next sentence prediction objectives. It enables significant improvements in many downstream tasks with relatively minimal task-specific fine-tuning. To study the quality of mention-span representations, we extract mention-span embeddings from the BERT-base (12-layer Transformer, 768 hidden units) and BERT-large (24-layer Transformer, 1024 hidden units) pre-trained models. Furthermore, we compare these original BERT models with fine-tuned variants in order to assess the effect of fine-tuning on the quality of the span representations.

Dataset
We use the coreference resolution annotation from the CoNLL-2012 shared task, based on the OntoNotes dataset (Pradhan et al., 2012).

Implementation Details and Hyperparameters
We extend the original TensorFlow implementation of BERT-coref in order to build our probing model with a Keras frontend (Chollet et al., 2015). The weights of the probing model are initialised with Kaiming initialisation (He et al., 2015), and the hidden layer has size d = 1024 with rectified linear units (Nair and Hinton, 2010). As mentioned previously, we use both a pre-trained BERT (original) model without fine-tuning the encoder weights and a BERT model that has been fine-tuned on the coreference resolution task (i.e., on the OntoNotes annotations). For the fine-tuned BERT model, we take the models that yield the best performance for Joshi et al. (2019), which were trained using 128 wordpieces for BERT-base and 384 wordpieces for BERT-large. The fine-tuned model is trained on split OntoNotes documents, where the segments do not overlap and each segment is fed as a separate instance. This is done because BERT can only accept sequences of at most 512 wordpieces, and OntoNotes documents typically require multiple segments to be read entirely. In all of our experiments, we use the cased English BERT models. We further refer to the base and large variants as BERT-base c2f and BERT-large c2f, respectively.

Figure 1: The probing architecture for span representations. The feed-forward neural network is trained to extract information from span representations $\boldsymbol{g}_1$ and $\boldsymbol{g}_2$, while all the parameters inside the dashed line are frozen. The example depicts a mention-pair where $\boldsymbol{g}_1$ corresponds to the span representation of "President Obama" and $\boldsymbol{g}_2$ corresponds to "he"; we predict $\hat{Y}$ as positive for this example.

Baseline
As our baseline, we use the span representation introduced in the edge probing framework (Tenney et al., 2019). First, we take the contextual embeddings for a pair of mention-spans, $e^{(1)} = [x_1, x_2, x_3, \dots, x_n]$ and $e^{(2)} = [x_1, x_2, x_3, \dots, x_n]$, as inputs. We then project the contextual embeddings $e^{(1)}$ and $e^{(2)}$ to improve performance, following Tenney et al. (2019):

$$\tilde{e}^{(i)} = A e^{(i)} + b, \qquad i \in \{1, 2\},$$

where $A$ and $b$ are the weights of the projection layer. Afterwards, we apply the self-attentional pooling operator from the Span Representations section over the projected representations to yield fixed-length span representations. This helps to model the head words of each mention-span. These mention-span representations are then concatenated and passed to the probing model to predict whether the mentions corefer. We use shared weights for both the projection and the self-attentional layer, so that the model can learn the similarity between representations of mention-spans. It is important to note that the self-attentional pooling is computed only over tokens within the boundary of the span. As a result, the model can only access information about the context surrounding the mention-span through the contextual embeddings. We take the contextual embeddings from the activations of the final layer of the original pre-trained BERT, while freezing the encoder.
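A sketch of this baseline in Keras, as we read the setup; dimensions and names are illustrative:

```python
import tensorflow as tf

class ProjectAndPool(tf.keras.layers.Layer):
    """Shared projection plus self-attentive pooling over one mention-span."""
    def __init__(self, proj_dim):
        super().__init__()
        self.project = tf.keras.layers.Dense(proj_dim)  # e~ = A e + b
        self.score = tf.keras.layers.Dense(1)           # within-span attention scores

    def call(self, span_tokens):
        # span_tokens: (batch, span_len, hidden) frozen BERT activations
        h = self.project(span_tokens)
        a = tf.nn.softmax(self.score(h), axis=1)        # attend inside the span only
        return tf.reduce_sum(a * h, axis=1)             # fixed-length span vector

# Applying the *same* layer instance to both mentions shares the weights:
# pool = ProjectAndPool(256)
# g1, g2 = pool(span1_tokens), pool(span2_tokens)
```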
We compare the span representation used in the span-ranking model against this baseline, as it measures the performance that the probing model can achieve with representations constructed from lexical priors alone, without any access to the local context within the mention-spans. The resulting baseline span representations have a dimension of d = 768 for BERT-base and d = 1024 for BERT-large.

Long-range Coreference
In order to investigate whether the span representation is able to capture long-range coreference relations, we extend our baseline by introducing a convolutional layer to incorporate surrounding context and improve the baseline span representation, following Tenney et al. (2019).
We replace the projection layer in our probing architecture with a fully-connected 1D CNN layer with kernel widths of 3 and 5, a stride of 1, and same padding, so as to properly include contextual embeddings at the beginning and at the end of each mention-span. This is equivalent to seeing ±1 and ±2 tokens around the centre word, respectively. We also initialise the weights of the CNN layer with Kaiming initialisation (He et al., 2015). Using this extended probing architecture with a CNN layer as another baseline, which we refer to as CNN-baseline, enables us to examine the contribution of local and non-local context to the performance of the probing model.
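For concreteness, a width-3 variant of such a layer might look as follows (the number of filters is our choice for illustration):

```python
import tensorflow as tf

# 1D convolution replacing the projection layer; padding="same" keeps the
# span length so boundary wordpieces are not dropped. With kernel_size=3
# each position sees +/-1 neighbouring tokens (+/-2 for kernel_size=5).
cnn = tf.keras.layers.Conv1D(
    filters=256, kernel_size=3, strides=1, padding="same",
    kernel_initializer="he_normal")   # Kaiming initialisation

# Applied to (batch, span_len, hidden) frozen contextual embeddings:
# projected = cnn(span_tokens)
```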
We then test our probing model with various distances between mention-spans. We separate pairs of mention-spans that appear in the OntoNotes test set into several buckets, based on the distance between the last token of mention-span $w_i$ and the first token of mention-span $w_j$, where $w_j$ occurs after $w_i$. Each bucket contains at least 50 examples of pairs of mention-spans.
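A simple bucketing helper of the kind one might use here; the bucket edges below are illustrative, not the exact ones from the paper:

```python
def bucket_by_distance(pairs, edges=(0, 25, 50, 100, 250, 500)):
    """Group mention pairs by the token distance between the spans.

    pairs: iterable of ((i_start, i_end), (j_start, j_end), label),
    with span j occurring after span i.
    """
    buckets = {lo: [] for lo in edges}
    for (i_start, i_end), (j_start, j_end), label in pairs:
        d = j_start - i_end                       # end of w_i to start of w_j
        lo = max(e for e in edges if e <= d)      # left edge of the bucket
        buckets[lo].append(((i_start, i_end), (j_start, j_end), label))
    return buckets
```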

Control Tasks
To ensure that our probing model is robust, we compare its performance with a control task (Hewitt and Liang, 2019). For every pair of mention-spans $(\boldsymbol{g}_1, \boldsymbol{g}_2)$, we replace one of the two span representations with another one randomly sampled from the dataset. Note that in this control task, some information about the original mention-pairs is still preserved, as the other span representation in the pair is not replaced.

Table 1 compares the performance of the probing model using span representations fine-tuned on the OntoNotes dataset against the baseline span representations and a CNN-baseline that utilise the original pre-trained BERT encoder. The results of the control task are reported in the bottom two lines.
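A sketch of this control construction under our reading of the setup (the pair structure is illustrative; gold labels are kept while one input is corrupted):

```python
import random

def control_pairs(pairs, all_spans):
    """For each pair, swap one span representation for a random one from
    the dataset while keeping the original label, so a selective probe
    should perform markedly worse on the result."""
    controlled = []
    for g1, g2, label in pairs:
        if random.random() < 0.5:
            g1 = random.choice(all_spans)    # corrupt the first mention
        else:
            g2 = random.choice(all_spans)    # corrupt the second mention
        controlled.append((g1, g2, label))   # label unchanged
    return controlled
```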

Comparison of Probing Models
The probing model suggests that span representations in BERT-coref encode a significant amount of coreference information, as we are able to train the model to predict whether a pair of mention-spans corefer based on their span representations alone. Both BERT-base c2f and BERT-large c2f consistently score above 90% (accuracy and F1 score) on the OntoNotes test set.
We observe that both BERT-base c2f and BERT-large c2f perform better at predicting coreference arcs between pairs of mention-spans than their respective baselines (by 2.37 accuracy points and 2.18 F1 points on average). We find that, although training the contextual probing model to learn contextual features for coreference arc prediction helps to encode the necessary coreference information into the baseline span representations, it still cannot outperform the probing model that utilises the span representations of BERT-coref. This is likely caused by better coreference-related features being learned by the BERT encoder when it is fine-tuned on OntoNotes.
We also see that fine-tuning the span representations on the coreference resolution task helps encode local and long-range context inside the mention-spans efficiently. This can be observed from the performance of the CNN-baseline, where the probing model is trained using a 1D CNN layer with kernel widths of 3 and 5 to expose the contribution of local and long-range dependencies, but ultimately still underperforms compared to BERT-coref.
Surprisingly, our baseline span representations, which were constructed from lexical priors only, perform better than the CNN-baseline span representations on both metrics. We attribute this to our decision to use contextual embeddings from the final layer of pre-trained BERT: the most transferable representations from contextual encoders trained with a language modelling objective tend to occur in the intermediate layers, and the topmost layers may be overly specialised for next-word prediction (Liu et al., 2019a; Peters et al., 2018a,b; Blevins et al., 2018; Devlin et al., 2019). This might cause the CNN layer to learn suboptimal representations of the mention-spans. The probing model that we choose is also highly selective, with a selectivity of 28.1 for BERT-base c2f and 26.1 for BERT-large c2f. This means that to achieve high accuracy, the probes must rely on coreference information encoded in the span representation.

Ablations
To examine the importance of each component in the BERT-coref span representation, we conduct an ablation study on each part of the representation and report the accuracy and F1 score of the probing model on the test data (Table 2; results for replication experiments after acceptance are reported in Appendix A). The head-finding attention mechanism is crucial for coreference arc prediction, as it contributes the most to the final result, with 0.98 accuracy points and 0.95 F1 points on average. This is consistent with previous findings by Lee et al. (2017), who show that the attention mechanism is able to learn representations important for coreference.
We also observe that span-width embeddings play an important role in determining a coreference relation: without them, performance degrades on average by 0.4 accuracy points and 0.37 F1 points. Contrary to the head-finding attention and span-width embeddings, boundary representations do not contribute much to the model's performance. We hypothesise that although boundary representations may encode a large amount of information for coreference resolution, they are not significant for coreference arc prediction, as the model does not have to predict a distribution over possible spans.

Encoding Long-range Coreference
We compare how our probing model performs at various separation distances between mention-spans. Figure 2 depicts F1 scores as a function of the distance between pairs of mention-spans. Although performance with the BERT models degrades at larger distances, the span representations of BERT-coref generally hold up better than the baseline or CNN-baseline. The BERT-base variant experiences only a minor degradation in performance of up to 5 points at d = 125 tokens, while for BERT-large the F1 score drops by only 7 points between d = 0 tokens and d = 250 tokens, which suggests that the depth of the Transformer layers helps to encode long-range coreference.
However, we lack sufficient evidence to conclude that the span representations are able to encode long-range coreference relations efficiently: although the encoder has been fine-tuned on OntoNotes, the model still cannot perform consistently across distant spans, with the lowest F1 scores of 67% for BERT-base and 75% for BERT-large at d = 451 to 475 tokens.

Error Analysis
We provide a qualitative error analysis of the predicted coreference between mention-pairs. We look at the output of BERT-base c2f (cased, fine-tuned) and BERT-large c2f (cased, fine-tuned). The predictions of both models on the same subset of 1,250 predictions from the test set are analysed. Overall, we found 93 errors for the model with BERT-base embeddings and 84 for the model with BERT-large embeddings. The errors are grouped into: Similar Word Forms, Anaphora, Gender, Mention Paraphrasing, and Temporal and Spatial Agreement.
Although Gender can be considered a subcategory of Anaphora, we decided to separate it in order to check whether gender bias is present in the models. Table 3 gives an overview of the errors made by both models in each category. We note that mentions separated by a distance of more than 25 tokens have a higher error rate than mentions separated by smaller distances, suggesting that BERT-base c2f and BERT-large c2f perform better when resolving coreference locally.
In the gender category, we found only one problematic example. The proper name Scooter Libby is consistently predicted to corefer with she and her, although the real-world referent is male.
Consistent with Joshi et al. (2019), the most difficult case for both models is anaphora, even at very short distances between mentions, as in the following example with a distance of only 5 tokens: "we should say it was very prompt with traffic management and emergency repair, ah, because it involved various [. . . ]". Cases of coreference between two pronouns are also difficult for both models. The similar word forms category concerns errors where mentions with morphologically related word forms are identified as coreferent, for instance "[. . . ] this is the Dick Cheney aide she agreed to refer [. . . ]" and "I think the agreement was strange [. . . ]". In contrast, together with anaphora, errors involving paraphrasing and temporal and spatial agreement have an extra level of complexity in that they involve real-world knowledge. For instance, for humans it is trivial that 1996 and 1997 are years and that they are different ones. The systems, on the other hand, consistently label them as coreferent, as if they were morphologically related forms.

Conclusion and Future Work
In this paper, we quantify the coreference information in span representations by how well a probing model performs on the coreference arc prediction task. We demonstrate that, using mention-span representations as inputs, a simple probing model can predict coreference for pairs of mention-spans with accuracy and F1 scores over 90%. This suggests that a significant amount of coreference information is encoded in mention-span representations obtained from BERT embeddings fine-tuned on the OntoNotes dataset. Consistent with non-neural architectures, our analysis also shows that non-local coreference is challenging for span representations. Furthermore, we show that the head-finding attention mechanism encodes essential coreference-related features in span representations, even when using original pre-trained BERT embeddings.
The findings we report are solely based on an English corpus. Other pieces of research (Azerkovich, 2020; Hint et al., 2020) suggest that such positive results might be more challenging to achieve for morphologically or syntactically complex languages.
Although we work with the OntoNotes dataset, there are other challenging coreference resolution datasets focusing on ambiguous pronouns (GAP by Webster et al. (2018)) or commonsense reasoning (WinoGrande by Sakaguchi et al. (2019)) that could be used to better understand the coreference information in span representations. Moreover, we would like to probe span representations derived from other pre-trained language models such as RoBERTa (Liu et al., 2019b) and SpanBERT (Joshi et al., 2020). Alternative Transformer-based architectures that are better at handling long sequences, such as Longformer (Beltagy et al., 2020), also seem promising to explore, as they might improve the span representations' capability to model long-range coreference. Lastly, instead of building span representations from the final layer of a pre-trained BERT model, one could use activations from the intermediate layers as well as ELMo-style scalar mixing (Tenney et al., 2019; Peters et al., 2018a). We leave this to future work.

Table 5: Averaged F1 score for the ablation on the OntoNotes test set. We take the average F1 score of 10 runs.