Incremental Neural Coreference Resolution in Constant Memory

We investigate modeling coreference resolution under a fixed memory constraint by extending an incremental clustering algorithm to utilize contextualized encoders and neural components. Given a new sentence, our end-to-end algorithm proposes and scores each mention span against explicit entity representations created from the earlier document context (if any). These spans are then used to update the entity representations before being forgotten; we only retain a fixed set of salient entities throughout the document. In this work, we successfully convert a high-performing model (Joshi et al., 2020), asymptotically reducing its memory usage to constant space with only a 0.3% relative loss in F1 on OntoNotes 5.0.


Introduction
Coreference resolution is a core task in NLP for both model analysis and information extraction. At the sentence level, ambiguities in pronoun coreference can be used to probe a model for common sense (Levesque et al., 2012; Sakaguchi et al., 2020) or gender biases (Rudinger et al., 2018; Zhao et al., 2018). At the document level, coreference resolution is commonly used in information extraction pipelines, but can be applied to reading comprehension (Dasigi et al., 2019) or literature analysis (Bamman et al., 2014).
Models for this task typically encode the entire text before scoring and subsequently clustering candidate mention spans, either found by a parser (Clark and Manning, 2016b) or learned jointly (Lee et al., 2017). Prior work has primarily focused on improving pairwise span scoring functions (Raghunathan et al., 2010; Clark and Manning, 2016a; Wu et al., 2020) and methods for decoding into globally consistent clusters (Wiseman et al., 2016; Lee et al., 2018; Kantor and Globerson, 2019; Xu and Choi, 2020). Recent models have also benefited from pretrained encoders used to create high-dimensional input text (and span) representations, and improvements in contextualized encoders appear to translate directly to coreference resolution (Lee et al., 2018; Joshi et al., 2019, 2020). These models typically rely on simultaneous access to all spans (Θ(n) for a document of length n) for scoring and to all scores (up to Θ(n^2)) for decoding. As the dimensionality of contextualized encoders, and therefore the size of span representations, increases, this becomes computationally intractable for long documents or under limited memory. Given these constraints, expensive scoring functions are increasingly difficult to explore. Further, prior models depart from how humans incrementally read and reason about coreferent mentions; Webster and Curran (2014) argue in favor of a limited memory constraint as a more psycholinguistically plausible approach to reading, and they model coreference resolution via shift-reduce parsing.
Motivated by scalability and armed with advances in neural architectures, we revisit that intuition. Following prior work, our model begins with a SpanBERT encoding of a text segment to form a list of proposed mention spans (Joshi et al., 2019, 2020). Clustering is performed online: each span either attaches to an existing cluster or begins a new one. We substantially reduce memory usage during inference by storing only the embeddings of active entities in the document and a small set of candidate mention spans. Our two contributions, online clustering and storing a constant-size set of active entities, result in an end-to-end trainable model that uses O(1) space with respect to document length while sacrificing little in performance (see Figure 1).

Model
Our algorithm revisits the approach taken by Webster and Curran (2014) for incrementally making coreference resolution decisions (online clustering). The major differences lie in explicit entity representations, neural components, and learning.
Baseline First, we summarize the coreference resolution model described by Joshi et al. (2019), which itself extends from earlier work (Lee et al., 2017, 2018). For each document, this model enumerates and scores all spans up to a chosen width. The span representations are formed using BERT (Devlin et al., 2019) encodings of the input text by concatenating the first, last, and an attention-weighted average of the token representations within the span. These spans are ranked and pruned to the top Θ(n) mentions. Both the maximum span width and the fraction of remaining spans are hyperparameters. For each remaining span, the model learns a distribution over its possible antecedents (via a pairwise scorer), and the training objective maximizes the probability of its gold labeled antecedents. The entire model (including finetuning the encoder) is trained end-to-end over OntoNotes 5.0.
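As a concrete illustration, here is a minimal sketch of this span representation, assuming the contextualized token embeddings for one segment are already computed. The names (`span_attn`, `span_representation`) are ours for illustration, not the baseline's exact implementation, and the span-width feature embedding used in prior work is omitted for brevity.

```python
import torch
import torch.nn as nn

hidden = 1024                      # SpanBERT-large hidden size
span_attn = nn.Linear(hidden, 1)   # learned head-finding attention (illustrative)

def span_representation(token_embs: torch.Tensor, start: int, end: int) -> torch.Tensor:
    """Concatenate the first token, last token, and an attention-weighted
    average of the tokens inside the span [start, end] (inclusive)."""
    inside = token_embs[start : end + 1]               # (width, hidden)
    weights = torch.softmax(span_attn(inside), dim=0)  # (width, 1)
    head = (weights * inside).sum(dim=0)               # (hidden,)
    return torch.cat([token_embs[start], token_embs[end], head])  # (3 * hidden,)
```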
This model is further improved by Joshi et al. (2020), who introduce SpanBERT and use it as the underlying encoder instead. The SpanBERT-large version of the Joshi et al. (2019) model is the baseline used in this paper.
Inference Our method (Algorithm 1) stores a permanent list of entities (clusters), each with its own representation. For each sentence or segment, the model proposes a candidate set of spans. A scorer then compares each span representation against all cluster representations to determine which (if any) of the pre-existing clusters the span should join. When a span is added to a cluster, the cluster's representation is updated via a learned function. Periodically, the model evicts less salient entities, writing them to disk. Under this algorithm, each clustering decision is permanent.

Concretely, our model uses a contextualized encoder, SpanBERT (Joshi et al., 2020), to encode an entire segment. Given a segment, SPANS returns candidate spans by enumerating all spans up to a fixed width, encoding each span as a combination of the embeddings within it, and pruning with a learned scorer, following prior work (Lee et al., 2017; Joshi et al., 2019).
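The following is a high-level sketch of this loop under the component interface just described. Here `spans`, `pair_score`, `update`, and `evict` stand in for SPANS, PAIRSCORE, UPDATE, and EVICT; scoring the new-cluster action as zero is an assumption for illustration (following the dummy-antecedent convention of prior work), and bookkeeping for output clusters is omitted.

```python
import torch

def resolve_document(segments, spans, pair_score, update, evict):
    entities = []                             # representations of active clusters
    for seg_embs in segments:                 # one encoded segment at a time
        for span_emb in spans(seg_embs):      # pruned candidate mentions
            if entities:
                scores = torch.stack([pair_score(e, span_emb) for e in entities])
                best = int(scores.argmax())
            else:
                best = None
            if best is not None and scores[best] > 0:
                # attach to the best-scoring existing cluster
                entities[best] = update(entities[best], span_emb)
            else:
                # epsilon action: start a new cluster
                entities.append(span_emb)
        entities = evict(entities)            # retain only salient entities
    return entities
```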
PAIRSCORE is a feedforward scorer which takes as input the concatenation of a mention span representation and an entity representation, along with additional embeddings for distance and genre. UPDATE updates the entity representation (e_ent) with the newly linked span representation (e_m). In this work, we use a learned weight α = σ(FF([e_ent, e_m])) and update e_ent ← α · e_ent + (1 − α) · e_m. Here, FF is a feedforward network and σ is the sigmoid function.
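A sketch of this update rule, with the gate implemented as the two-layer MLP described in Appendix A (hidden size 300, ReLU, sigmoid output); the class name is illustrative.

```python
import torch
import torch.nn as nn

class EntityUpdate(nn.Module):
    """Gated interpolation between the old entity embedding and a new span."""
    def __init__(self, emb_dim: int):
        super().__init__()
        self.ff = nn.Sequential(
            nn.Linear(2 * emb_dim, 300),
            nn.ReLU(),
            nn.Linear(300, 1),
        )

    def forward(self, e_ent: torch.Tensor, e_m: torch.Tensor) -> torch.Tensor:
        alpha = torch.sigmoid(self.ff(torch.cat([e_ent, e_m], dim=-1)))
        return alpha * e_ent + (1 - alpha) * e_m  # gated running average
```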
To ensure constant space, EVICT moves some entities from E to CPU. These entities are never revisited; the offsets are stored on CPU solely for evaluation purposes. We evict based on cluster size and distance from the end of the segment.
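A sketch of one such eviction policy, using the thresholds from our experiments (singletons beyond 600 tokens, all clusters beyond 1200 tokens; see Appendix A). The `Cluster` container and function signature are illustrative, not the actual implementation.

```python
from dataclasses import dataclass
import torch

@dataclass
class Cluster:
    emb: torch.Tensor   # entity representation (on GPU while active)
    size: int           # number of linked mentions
    last_offset: float  # midpoint of the most recent span added to the cluster

def evict(clusters, current_offset, evicted_log, single_max=600, all_max=1200):
    """Move stale clusters off the GPU; they are never revisited."""
    kept = []
    for c in clusters:
        dist = current_offset - c.last_offset
        if (c.size == 1 and dist > single_max) or dist > all_max:
            c.emb = c.emb.cpu()        # offsets retained solely for evaluation
            evicted_log.append(c)
        else:
            kept.append(c)
    return kept
```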
The algorithm is independent of these components, so long as they satisfy the correct interface. In particular, our algorithm is compatible with the recent model by Wu et al. (2020): their query-based pairwise scorer could be adopted in place of the feedforward pairwise scorer. Our use of abstract components also allows for comparison of different encoders or update rules.
Training Similar to prior work (Lee et al., 2017), our training objective is to maximize the probability of the correct antecedent (cluster) for each mention span. However, rather than considering all correct antecedents, we are only interested in the cluster of the most recent one. For each mention m, the scores are treated as an unnormalized probability distribution P(e | m) for e ∈ E, where E is the entity list extended with an ε target label representing the action of starting a new cluster. The exact objective is to maximize P(e = e_gold | m), where e_gold is the gold cluster of m (i.e., the cluster to which its most recent antecedent was assigned). However, the entirely sequential algorithm also introduces sample inefficiency, as most mentions have the same label (ε) and barely accrue loss. We speed up training by accumulating gradients periodically, trading computation time for space. This tradeoff is similar to that of batching by documents, which is impractical for our model from a memory perspective. Like prior work, we update parameters once per document (and not once per mention).
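The per-mention objective can be sketched as a cross-entropy loss over the entity list plus the ε action. Scoring ε as zero is an assumption for illustration, and the names are ours.

```python
import torch
import torch.nn.functional as F

def mention_loss(entity_scores: torch.Tensor, gold_index: int) -> torch.Tensor:
    """entity_scores: (num_entities,) pairwise scores for one mention.
    gold_index: index of the gold cluster, or num_entities for epsilon
    (i.e., the mention starts a new cluster)."""
    epsilon = entity_scores.new_zeros(1)          # fixed score for "new cluster"
    logits = torch.cat([entity_scores, epsilon])  # (num_entities + 1,)
    target = torch.tensor([gold_index], device=logits.device)
    return F.cross_entropy(logits.unsqueeze(0), target)
```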
We lean on pretrained components: we reuse not only encoder weights already finetuned on this dataset, but also the mention and pairwise scorers from Joshi et al. (2020) as initialization for our encoder, SPANS, and PAIRSCORE. (Their implementation was the most amenable to extension and experimentation, and therefore serves as our illustrative example.)

Experiments
Since we reuse weights from Joshi et al. (2020) (our baseline), our primary experiment is to compare their model to our constant space adaptation in both task performance and memory usage. Additionally, we analyze document and segment length, conversational genre, and explicit clusters.
Data We use OntoNotes 5.0 (Weischedel et al., 2013), which consists of 2,802, 343, and 348 documents in the training, development, and test splits, respectively. These documents span several genres, including those with multiple speakers (broadcast and telephone conversations) and those without (broadcast news, newswire, magazines, weblogs, and the Bible).
Implementation We use the model dimensions and training hyperparameters from the baseline, a publicly available coreference resolution model by Joshi et al. (2019, 2020). We also reuse their (trained) parameters for the encoder, span scorer, and span pair scorer as initialization. However, our model does not make use of speaker features, since it is not meaningful to assign a speaker to a cluster representation. At the end of each segment, we evict singleton (size 1) clusters more than 600 tokens away from the end of the segment. Additionally, we evict all clusters whose most recent member is more than 1200 tokens away. In this work, we also freeze the encoder; further finetuning it provided little, if any, benefit, likely because the encoder has already been finetuned on this dataset and task. Additional details, including our choice of eviction function, are described in Appendix A. All experiments are performed on either a single NVIDIA 1080 TI (11GB) or a GTX Titan X (12GB).

Results Table 1 presents the OntoNotes 5.0 test set scores for the standard metrics: MUC (Vilain et al., 1995), B³ (Bagga and Baldwin, 1998), and CEAF_φ4 (Luo, 2005), computed with the official CoNLL-2012 scorer. We reevaluated the baseline, and we report the scores for CorefQA directly from Wu et al. (2020). We observe a small drop in performance compared to the baseline and apparently no drop from eviction.

Document Length
Our goal is a constant-memory model that is comparable to the baseline. We showed above that our model is competitive with and without eviction, the key to constant memory. In Table 2, we report the average F1 broken down by document length (in subtokens) and number of speakers. Our model is competitive at most document sizes and in the single-speaker setting. On longer documents, eviction has a minor effect. Because our model does not make use of speaker embeddings, we perform worse on documents with multiple speakers. This drop due to the absence of speaker features matches previous findings (Lee et al., 2017). One way to include speakers while retaining speaker-independent entity embeddings is to treat speakers as part of the input text (Wu et al., 2020).

Inference Memory
We now turn to memory usage. In Table 3, we report the space needed to perform inference over the entire development set. Compared to the baseline and its smaller base version, our model uses substantially less memory. We also find that eviction has little effect on memory and F1 on this dataset. Usage in practice is subject to the memory allocator, and our implementation (PyTorch) differs in framework from the baseline (TensorFlow). To compare the two models fairly, we compute the maximum space used by allocated tensors for each document during inference. Figure 1 compares this peak theoretical memory usage of several models across the dataset: the baseline is dominated by a term that grows linearly with document length, whereas our model's usage remains constant.
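On the PyTorch side, this kind of per-document peak tensor measurement can be sketched as follows, assuming a single CUDA device; `run_inference` is a placeholder for the model's forward pass over one document.

```python
import torch

def peak_memory_mb(run_inference, document) -> float:
    """Peak bytes held by allocated tensors while processing one document."""
    torch.cuda.reset_peak_memory_stats()
    with torch.no_grad():
        run_inference(document)
    return torch.cuda.max_memory_allocated() / 2**20
```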
Our model reduces the asymptotic memory usage to O(1). Note, however, that these plots do not fully reveal asymptotic behavior: the baseline and other derivative models have a quadratic component for scoring span pairs (with a small coefficient), and the encoder, SpanBERT, adds a significant constant term (with respect to document length) to all models. While there is some work on sparsifying Transformers (Child et al., 2019; Kitaev et al., 2020), there does not yet exist a sparse SpanBERT.
These plots show that all models have relatively modest memory usage during inference. However, usage grows during training, due to gradients and optimizer parameters. This additional memory renders training and finetuning the underlying encoder infeasible for the baseline, but possible with our model on 12GB GPUs.

Segment Length
The memory usage at each step (and therefore of the algorithm) also depends on the segment length, due to the encoder. Table 4 explores the effect of the length of each segment (split at sentence boundaries), which gives further insight into the tradeoff between performance and memory reduction. We compare models without eviction to ensure fairness. Our observations follow those of Joshi et al. (2019): larger context windows compatible with the encoder input size improve performance. We also observe that models trained on shorter sequences can be scaled, at inference time, to longer sequences and obtain gains in performance. There is an unsurprising substantial drop when using single sentences, owing to coreference being a cross-sentence phenomenon.

Span Representations
[Figure 2: Visualization of proposed span representations. Each color/shape is a predicted cluster, while light gray circles indicate predicted singletons. For each span, the gold cluster label (−1, if not annotated) and its contribution to the entity embedding are noted in parentheses.]

Figure 2 visualizes the proposed span representations for a single document in the development set. The colors/shapes represent our predictions, and each point is annotated with the text, the gold cluster label, and the (normalized) α for each span (recall that α is used in the UPDATE function to determine a span's contribution to its entity embedding). Given these embeddings, the figure supports the viability of clustering approaches: gold coreference clusters tend to be "close" in embedding space. Regarding α, some spans are weighted equally ("Clinton") while others are not ("North Korea"). This could be a result of online updates biasing more recent spans toward higher weights. Alternatively, it may suggest that some spans (like names) are more informative than others (like pronouns).

Conclusion
We present an online algorithm for space-efficient coreference resolution that incorporates contributions from recent neural end-to-end models. We show that it is possible to transform a model which performs document-level inference into an incremental algorithm. In so doing, we greatly reduce the model's memory usage during inference at virtually no cost in performance, thereby providing an option for researchers and practitioners interested in modern coreference resolution for memory-constrained tasks, such as modeling book-length texts.

A Hyperparameters
In this section, we describe several implementation details and other experiments that we tried. To limit memory usage, we use gradient accumulation. Ultimately, all training was performed on the NVIDIA 1080 TI (11GB), on which we accumulate gradients whenever memory usage exceeds 7.5GB. In initial trials, we explored sampling losses for negative examples (spans that do not have an antecedent). While we found that sampling at a rate of 0.2 (for example) would speed up training and inference, it ultimately contributed up to a one-point deficit in F1.
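A sketch of this memory-triggered accumulation, assuming a single CUDA device; the names and the exact trigger are illustrative. Gradients from successive backward calls add up in the parameters' .grad buffers, and the optimizer steps once per document.

```python
import torch

THRESHOLD_BYTES = int(7.5 * 2**30)  # 7.5GB trigger used in our training setup

def maybe_accumulate(loss_terms):
    """Backpropagate buffered per-mention losses once memory use is high,
    freeing the computation graph while keeping gradients accumulated."""
    if loss_terms and torch.cuda.memory_allocated() > THRESHOLD_BYTES:
        torch.stack(loss_terms).sum().backward()
        loss_terms.clear()  # graph for these losses can now be freed
```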
We also explored teacher forcing, in which spans are added to the gold cluster during training instead of the predicted one. This would "correct" the training objective to match prior work; however, it did not have a noticeable effect on performance. Likewise, we were able to train a competitive model in which only the SpanBERT encoder from Joshi et al. (2019) was retained, with the span scorer and pairwise scorer randomly initialized. However, we opted not to use it for the full experiments because training took substantially more time. Further, learning span detection is not guaranteed under this objective, leading to high variance across runs (most notably in the number of epochs), so the effect of other hyperparameters would not have been immediately apparent.
Additionally, we attempted to further finetune the encoder with a separate learning rate (1e-5 or 5e-6), but were unsuccessful in improving performance. On our GPUs, training (without finetuning) takes roughly 70 min/epoch with a negative sampling rate of 0.2, 100 min/epoch without sampling, and 160 min/epoch when finetuning. All runs are stopped after 5 to 15 epochs due to early stopping (patience = 5).
For eviction, a policy that evicts singletons at distance > 600 and all clusters at distance > 1200 achieves a recall of 99.57% over the training set. This threshold pair is the result of sweeping over [200, 300, 400, 500, 600, 900] for singletons and [400, 600, 800, 1000, 1200, 1800] for all clusters. We also tried using a single fixed distance, as well as other non-constant schemes (e.g., size × distance as thresholds). Here, distance is measured between the current point in the document and the average of the start and end indices of the most recent span added to the cluster. We selected this policy over the alternatives based on the recall it achieved.
Our model dimensions otherwise match up exactly with Joshi et al. (2019). Rather than omitting the speaker embedding and segment length embedding entirely (which would affect pairwise scorer dimensionality), we replace those embeddings with the zero vector.
For alpha weighting, we used a two-layer MLP: the first layer has size 300 with a ReLU nonlinearity, and the final layer projects to a scalar with a sigmoid activation. After fixing those values, we explored the learning rate ([5e-5, 1e-4, 2e-4, 5e-4]), eviction policy during training ([no eviction, eviction]), and gradient clipping value ([1, 5, 10]). Here, we found that 2e-4, no eviction, and gradient clipping at 10 performed slightly better, although there was little difference between these settings once the models were allowed to converge.
Given the final set of hyperparameters, we performed five training runs, resulting in development set F1 scores of 79.4, 79.5, 79.5, 79.5, and 79.7. We selected the best-performing model for the results in the paper. For Table 4, we trained each model only once.
For these experiments, our model contains 377M parameters, of which 340M belong to SpanBERT-large (Joshi et al., 2020).