Learning to Ignore: Long Document Coreference with Bounded Memory Neural Networks

Long document coreference resolution remains a challenging task due to the large memory and runtime requirements of current models. Recent work on incremental coreference resolution using only global entity representations shows practical benefits but requires keeping all entities in memory, which can be impractical for long documents. We argue that keeping all entities in memory is unnecessary, and we propose a memory-augmented neural network that tracks only a small bounded number of entities at a time, thus guaranteeing linear runtime in document length. We show that (a) the model remains competitive on OntoNotes and LitBank with models that have high memory and computational requirements, and (b) the model learns an efficient memory management strategy, easily outperforming a rule-based strategy.


Introduction
Long document coreference resolution poses runtime and memory challenges. Current best models for coreference resolution have large memory requirements and quadratic runtime in the document length (Joshi et al., 2019;Wu et al., 2020), making them impractical for long documents.
Recent work revisiting the entity-mention paradigm (Luo et al., 2004; Webster and Curran, 2014), which maintains explicit representations only of entities rather than of all their constituent mentions, has shown practical benefits for memory while remaining competitive with state-of-the-art models (Xia et al., 2020). In particular, unlike other approaches to coreference resolution which maintain representations of both mentions and their corresponding entity clusters (Rahman and Ng, 2011; Stoyanov and Eisner, 2012; Clark and Manning, 2015; Wiseman et al., 2016; Lee et al., 2017), the entity-mention paradigm stores representations only of the entity clusters, which are updated incrementally as coreference predictions are made. While such an approach requires less memory than those that additionally store mention representations, the number of entities can still become impractically large when processing long documents, making it problematic to store all entity representations.
Is it necessary to maintain an unbounded number of mentions or entities? Psycholinguistic evidence suggests it is not, as human language processing is incremental (Tanenhaus et al., 1995;Keller, 2010) and has limited working memory (Baddeley, 1986). In practice, we find that most entities have a small spread (number of tokens from first to last mention of an entity), and thus do not need to be kept persistently in memory. This observation suggests that tracking a limited, small number of entities at any time can resolve the computational issues, albeit at a potential accuracy tradeoff.
Previous work on finite memory models for coreference resolution has shown potential, but has been tested only on short documents (Liu et al., 2019; Toshniwal et al., 2020). Moreover, this previous work makes token-level predictions, while standard coreference datasets have span-level annotations. We propose a finite memory model that performs quasi-online coreference resolution, and test it on LitBank (Bamman et al., 2020) and OntoNotes (Pradhan et al., 2012). The model is trained to manage its limited memory by predicting whether to "forget" an entity already being tracked in exchange for a new (currently untracked) entity. Our empirical results show that: (a) the model is competitive with an unbounded memory version, and (b) the model's learned memory management outperforms a strong rule-based baseline.

Entity Spread and Active Entities

Given input document D, let (x_n)_{n=1}^N represent the N mention spans corresponding to M underlying entities (e_m)_{m=1}^M. Let START(x_i) and END(x_i) denote the start and end token indices of the mention span x_i in document D, and let ENT(x_i) denote the entity of which x_i is a mention. Given this notation, we next define the following concepts.
Entity Spread Entity spread denotes the interval of token indices from the first mention to the last mention of an entity. The entity spread ES(e) of entity e is given by:

ES(e) = [min_{x : ENT(x) = e} START(x), max_{x : ENT(x) = e} END(x)]

Active Entity Count Active entity count AE(t) at token index t denotes the number of unique entities whose spread covers the token t, i.e., AE(t) = |{e | t ∈ ES(e)}|.
Maximum Active Entity Count Maximum active entity count MAE(D) for a document D denotes the maximum number of active entities at any token index in D, i.e., MAE(D) = max_{t ∈ [|D|]} AE(t). This measure extends simply to a corpus C as MAE(C) = max_{D ∈ C} MAE(D). Table 1 shows the MAE and the maximum total entity count in a single document for LitBank and OntoNotes. For both datasets the maximum active entity count is much smaller than the maximum total entity count. Thus, rather than keeping all entities in memory at all times, models can in principle focus on the far fewer active entities at any given time.
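These quantities can be computed directly from gold mention annotations. A minimal sketch (the event-sweep implementation and the `(start, end, entity_id)` input format are our own choices, not from the paper):

```python
from collections import defaultdict

def entity_spread(mentions):
    """Compute ES(e) = (first START, last END) per entity.

    `mentions` is a list of (start, end, entity_id) token-index triples,
    a hypothetical format standing in for gold annotations."""
    spread = {}
    for start, end, ent in mentions:
        lo, hi = spread.get(ent, (start, end))
        spread[ent] = (min(lo, start), max(hi, end))
    return spread

def max_active_entities(mentions):
    """MAE(D): the max number of entities whose spread covers one token."""
    spreads = entity_spread(mentions).values()
    # Sweep over interval boundary events instead of every token index.
    events = defaultdict(int)
    for lo, hi in spreads:
        events[lo] += 1          # spread opens at its first token
        events[hi + 1] -= 1      # spread closes just after its last token
    active, best = 0, 0
    for t in sorted(events):
        active += events[t]
        best = max(best, active)
    return best
```

For example, an entity mentioned only at tokens 2-5 never needs a memory slot while the model processes tokens past 5, which is what makes MAE much smaller than the total entity count.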

Model
Based on the preceding finding, we will next describe models that require tracking only a small, bounded number of entities at any time.
To make coreference predictions for a document, we first encode the document and propose candidate mentions. The proposed mentions are then processed sequentially and are either: (a) added to an existing entity cluster, (b) added to a new cluster, (c) ignored due to limited memory capacity (for bounded memory models), or (d) ignored as an invalid mention.

Document Encoding is done using the SpanBERT_LARGE model finetuned for OntoNotes and released as part of the coreference model of Joshi et al. (2020). We do not further finetune the SpanBERT model. To encode long documents, we segment the document using the independent and overlap strategies described in Joshi et al. (2019). In overlap segmentation, for a token present in overlapping BERT windows, the token's representation is taken from the window in which the token has the most neighboring context. For both datasets we find that overlap slightly outperforms independent.
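As a rough sketch of the overlap strategy, assuming fixed window and stride sizes (the function names, the concrete sizes, and the "most central window" reading of the neighbor criterion are our interpretation, not the paper's exact implementation):

```python
def overlap_windows(num_tokens, window, stride):
    """Split token indices into overlapping windows of a fixed size."""
    starts = list(range(0, max(num_tokens - window, 0) + 1, stride))
    if starts[-1] + window < num_tokens:
        starts.append(num_tokens - window)   # final window flush with the end
    return [(s, min(s + window, num_tokens)) for s in starts]

def pick_window(tok, windows):
    """Choose the window giving token `tok` the most neighboring context,
    i.e., the window in which `tok` is most central."""
    def centrality(w):
        lo, hi = w
        # context is bounded by the scarcer side of the token
        return min(tok - lo, hi - 1 - tok)
    return max((w for w in windows if w[0] <= tok < w[1]), key=centrality)
```

A token near a window boundary then takes its representation from the neighboring window where it sits closer to the middle, so every token is encoded with reasonable bidirectional context.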
Mention Proposal Given the encoded document, we next predict the top-scoring mentions which are to be clustered. The goal of this step is to have high recall, and we follow previous work to threshold the number of spans chosen (Lee et al., 2017). Given a document D, we choose 0.3 × |D| top spans for LitBank, and 0.4 × |D| for OntoNotes.
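The thresholding step can be sketched as follows (`propose_mentions` and its input format are hypothetical, and span-pruning details such as removing crossing spans are omitted):

```python
def propose_mentions(span_scores, doc_len, ratio=0.3):
    """Keep the top ratio * |D| candidate spans by mention score.

    `span_scores` maps (start, end) spans to mention scores s_m;
    `ratio` is 0.3 for LitBank and 0.4 for OntoNotes per the text."""
    k = int(ratio * doc_len)
    ranked = sorted(span_scores.items(), key=lambda kv: kv[1], reverse=True)
    # Re-sort the kept spans by document position for sequential processing.
    return sorted(span for span, _ in ranked[:k])
```

The high proposal ratio keeps recall high, leaving it to the clustering stage to reject invalid spans.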
Note that we pretrain the mention proposal model before training the mention proposal and mention clustering pipeline end-to-end, as done by Wu et al. (2020). The reason is that without pretraining, most of the mentions proposed by the mention proposal model would be invalid mentions, i.e., spans that are not mentions, which would not provide any training signal to the mention clustering stage. For both datasets, we sample invalid spans with 0.2 probability during training, so as to roughly equalize the number of invalid spans and actual mentions, as suggested by Xia et al. (2020).
Mention Clustering Let (x_i)_{i=1}^K represent the top-K candidate mention spans from the mention proposal step, and let s_m(x_i) represent the mention score for span x_i, which indicates how likely it is that a span constitutes a mention. Assume that the mentions are already ordered based on their position in the document and are processed sequentially in that order. Let E = (e_m)_{m=1}^M represent the M entities currently being tracked by the model (initially M = 0). For ease of discussion, we overload the terms x_i and e_j to also denote their respective representations.
In the first step, the model decides whether the span x_i refers to any of the entities in E as follows:

s_c(x_i, e_j) = f_c([x_i; e_j; x_i ⊙ e_j; g(x_i, e_j)]) for j = 1, ..., M
e_top = argmax_{e_j ∈ E} s_c(x_i, e_j),  s_c^top = s_c(x_i, e_top)

where ⊙ represents the element-wise product, and f_c(·) corresponds to a learned feedforward neural network. The term g(x_i, e_j) corresponds to a concatenation of feature embeddings that includes embeddings for (a) the number of mentions in e_j, (b) the number of mentions between x_i and the last mention of e_j, (c) the last mention decision, and (d) the document genre (only for OntoNotes). Now if s_c^top > 0 then x_i is considered to refer to e_top, and e_top is updated accordingly via a weighted average in which e_top is weighted by the number of its previously seen mentions. Otherwise, x_i does not refer to any entity in E, and a second step is executed, which depends on the choice of memory architecture. We test three memory architectures, described below.
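The first clustering step can be sketched as below, with the learned networks f_c and g stood in by arbitrary callables over plain lists (a simplification; the real model uses trained feedforward nets over dense span representations):

```python
def coref_scores(x, entities, f_c, g):
    """Score span x against each tracked entity.

    The input to the scorer f_c concatenates [x; e; x (*) e; g(x, e)],
    where (*) is the element-wise product. `f_c` and `g` are stand-ins
    for the learned feedforward net and feature embeddings."""
    scores = []
    for e in entities:
        elementwise = [xi * ei for xi, ei in zip(x, e)]
        scores.append(f_c(x + e + elementwise + g(x, e)))
    return scores

def cluster_step(x, entities, f_c, g):
    """Return the index of the top-scoring entity if its score is
    positive; otherwise None, deferring x to the second step."""
    scores = coref_scores(x, entities, f_c, g)
    if scores and max(scores) > 0:
        return max(range(len(scores)), key=scores.__getitem__)
    return None
```

The sign test on the top score is what lets the model declare a span non-coreferent with every tracked entity rather than forcing a link.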

Unbounded Memory (U-MEM):
If s_m(x_i) > 0 then we create a new entity e_{M+1} = x_i and append it to E. Otherwise the mention is ignored as invalid, i.e., it does not correspond to an entity. This differs from Xia et al. (2020), who append all non-coreferent mentions. The reason for the change is that appending all mentions can hurt performance on LitBank, where singletons are explicitly marked and used for evaluation.
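A minimal sketch of the U-MEM second step and the weighted-average entity update (list-based vectors for readability; the real model operates on dense tensors):

```python
def umem_step(entities, counts, x, sm_x):
    """U-MEM second step: create a new entity if x looks like a mention.

    `entities` holds entity vectors; `counts[j]` tracks the mentions
    seen so far for entity j, used by the weighted-average update."""
    if sm_x > 0:
        entities.append(list(x))
        counts.append(1)
    # else: x is ignored as an invalid span

def update_entity(entities, counts, j, x):
    """Weighted-average update when x is linked to entity j: the old
    representation is weighted by the number of mentions seen so far."""
    n = counts[j]
    entities[j] = [(n * e + xi) / (n + 1) for e, xi in zip(entities[j], x)]
    counts[j] = n + 1
```

The running average keeps a single fixed-size vector per entity, which is what makes the memory footprint depend on the entity count rather than the mention count.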
Bounded Memory: Suppose the model has the capacity to track C entities at a time. If C > M, i.e., the memory capacity has not been fully utilized, then the model behaves like U-MEM. Otherwise, the bounded memory models must decide between: (a) evicting an entity already being tracked, (b) ignoring x_i due to limited capacity, and (c) ignoring the mention as invalid. We test two bounded memory variants, described below.

Learned Bounded Memory (LB-MEM): The proposed LB-MEM architecture predicts a score f_r(·) corresponding to the anticipated number of remaining mentions for any entity or mention, and compares it against the mention score s_m(x_i): the tracked entity with the lowest predicted score is evicted in favor of x_i when its score falls below that predicted for x_i; otherwise x_i is ignored, either for capacity reasons or as an invalid span.

Rule-based Bounded Memory (RB-MEM): The Least Recently Used (LRU) principle is a popular choice among memory models (Rae et al., 2016; Santoro et al., 2016). While LB-MEM considers all tracked entities for eviction, RB-MEM restricts this choice to just the LRU entity, i.e., the entity whose mention was least recently seen. The rest of the steps are the same as for LB-MEM.
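One plausible reading of the full-capacity decision, sketched under the assumption that the f_r scores of tracked entities and of the incoming mention are directly comparable (the paper's exact scoring equation is not reproduced here, so this is an illustration, not the authors' implementation):

```python
def bounded_decide(fr_entities, fr_mention, sm_mention,
                   lru_index=None, rule_based=False):
    """Second-step decision at full capacity: evict, ignore, or invalid.

    `fr_entities[j]` scores the anticipated remaining mentions of tracked
    entity j, `fr_mention` does the same for the incoming mention's
    entity, and `sm_mention` is its mention score s_m. A non-positive
    s_m marks the span invalid."""
    if sm_mention <= 0:
        return ("invalid", None)
    if rule_based:
        # RB-MEM: only the least-recently-used entity may be evicted.
        candidates = [lru_index]
    else:
        # LB-MEM: any tracked entity is a candidate for eviction.
        candidates = range(len(fr_entities))
    victim = min(candidates, key=lambda j: fr_entities[j])
    if fr_entities[victim] < fr_mention:
        return ("evict", victim)
    return ("ignore", None)
```

The only difference between the two variants in this sketch is the candidate set, which mirrors how RB-MEM is described as LB-MEM restricted to the LRU entity.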
Training All the models are trained using teacher forcing. The ground truth decisions for bounded memory models are chosen to maximize the number of mentions tracked by the model (details in Appendix A.3). The training loss is the sum of the cross-entropy losses for the two steps of mention clustering.

Datasets
LitBank is a recent coreference dataset of literary texts (Bamman et al., 2020). The dataset consists of prefixes of 100 novels with an average length of 2100 words. Singletons are marked and used for evaluation. Evaluation is done via 10-fold cross-validation over 80/10/10 splits.

Results

Tables 2 and 3 show results of all the proposed models for LitBank and OntoNotes, respectively. As expected, the bounded memory models improve with increased memory. For both datasets, the LB-MEM model with 20 memory cells is competitive with the U-MEM model. The RB-MEM model with 20 memory cells is competitive on OntoNotes but significantly worse than the other two on LitBank.

Among the bounded memory models, LB-MEM is significantly better than RB-MEM for smaller numbers of memory cells; we analyze the reasons for this in the next section. Between the two datasets, the increase in memory yields a larger improvement for LitBank. We also establish a new state of the art for LitBank with the U-MEM model. For OntoNotes, our models are competitive with comparable models such as Xia et al. (2020). The performance difference between the two U-MEM models might be because we try to predict invalid mentions, which, while beneficial for LitBank, can lead to lower mention recall for OntoNotes. We expect gains from further finetuning the SpanBERT model and learning a parameterized global entity representation, but we leave these for future work.

Analysis
In this section we analyze the behavior of the three memory models on LitBank and OntoNotes.

Table 4 compares the memory and inference time statistics of the different memory models on LitBank cross-validation split zero (peak memory usage estimated via torch.cuda.max_memory_allocated()). For training, the bounded memory models are significantly less memory intensive than the U-MEM model. The table also shows that the bounded memory models are faster than the U-MEM model during inference. This is because the number of entities tracked by the U-MEM model grows well beyond the maximum of 20 memory slots reserved for the bounded models, as shown in Table 5.

Surprisingly, for inference the bounded models have a slightly larger memory footprint than the U-MEM model. This is because the document encoder, SpanBERT, dominates memory usage during inference (as also observed by Xia et al., 2020). Peak memory usage during inference is thus determined by the mention proposal stage rather than the mention clustering stage, and during the mention proposal stage, the additional parameters of the bounded memory models, which are loaded as part of the whole model, cause the slight uptick in peak inference memory. Note that using a cheaper encoder, or running on a sufficiently long document such as a book, could change these results.

Table 5 compares the maximum number of entities kept in memory by the different memory models for the LitBank cross-validation dev sets and the OntoNotes dev set. As expected, the U-MEM model keeps more entities in memory than the bounded memory models on average for both datasets. For LitBank the difference is especially stark, with the U-MEM model tracking about 5/10 times more entities in memory in the average/worst case, respectively. Also, while some OntoNotes documents do not use even the full 5 memory cell capacity, all LitBank documents fully utilize even the 20 memory cell capacity.
This is because LitBank documents are more than four times as long as OntoNotes documents, and LitBank has singletons marked. These results also support our initial motivation: with long documents, the memory requirement grows even if we keep only the entity representations.

Table 6 compares the number of mentions ignored by LB-MEM and RB-MEM. The LB-MEM model ignores far fewer mentions than RB-MEM. This is because while RB-MEM can only evict the LRU entity, which might not be optimal, LB-MEM can choose any entity for eviction. These statistics, combined with the fact that LB-MEM typically outperforms RB-MEM, suggest that the LB-MEM model is able to anticipate which entities are important and which are not.

LB-MEM vs. RB-MEM
Error Analysis Table 7 presents the results of automated error analysis done using the Berkeley Coreference Analyzer (Kummerfeld and Klein, 2013) for the OntoNotes dev set. As the memory capacity of the models increases, the errors shift from the missing mention, missing entity, and divided entity categories to the conflated entities, extra mention, and extra entity categories. For the 5-cell configuration, the LB-MEM model outperforms RB-MEM by tracking more entities.

Conclusion and Future Work
We propose a memory model which tracks a small, bounded number of entities. The proposed model guarantees linear runtime in document length, and in practice significantly reduces peak memory usage during training. Empirical results on LitBank and OntoNotes show that the model is competitive with an unbounded memory version and outperforms a strong rule-based baseline. In particular, we report state-of-the-art results on LitBank. In future work we plan to apply our model to longer, book-length documents, and to add more structure to the memory.

A Appendix
A.1 Maximum Active Entities

Figure 1 visualizes histograms of entity spread (ES) length, defined in Section 2, as a fraction of document length for documents in LitBank and OntoNotes. For LitBank we only visualize the entity spread of non-singleton clusters, because otherwise the histogram is too heavily skewed towards one. Figure 2 visualizes histograms of the maximum active entity count (MAE), defined in Section 2, for documents in LitBank and OntoNotes.

A.2 Model Details
Other hyperparameters We stick with the hyperparameters for feedforward neural network (FFNN) size, depth, and dropout from Joshi et al. (2020). One hyperparameter that we find to be important is the weight of the non-coreferent term in the cross-entropy loss for the first step of mention clustering. We find that placing a higher weight of 2.0 on that term leads to consistent performance gains. This might be because of that term's significance, as its value decides whether the next step of mention clustering is triggered.
Expected Validation Performance Since LitBank has 10 cross-validation splits, the grid-search-based tuning process was limited to a few cross-validation splits. For LitBank, in our initial experiments with gold mention clustering, we find that overlap segmentation gave a gain of about 0.5% F1, and we stuck with that choice from then on. For the non-coreferent entity weight, we see an improvement of 0.5-1% F1 on going from 1.0 to 2.0, but the performance with a weight of 5.0 drops below that with 1.0.
For OntoNotes, we find that switching from overlap to independent results in a drop of about 1% absolute F1 for the LB-MEM model with 5 and 10 memory cells, while the other two models are almost unaffected. Overlap is crucial for LB-MEM because, on average, tokens get more future context, which helps the model "anticipate" which entities are important and need to be kept in memory.

A.3 Ground Truth Generation
In this section we explain how the ground truth action sequence is generated corresponding to the predicted mention sequence. The ground truth for the U-MEM model is fairly straightforward. For the bounded memory models, we keep growing the number of entities until we hit the memory ceiling. For all the entities in memory, we maintain the number of mentions remaining in the ground truth cluster. For example, a cluster with a total of five mentions, two of which have already been processed by the model, has three remaining mentions.
Suppose now a mention corresponding to a currently untracked entity comes in and the memory is already at full capacity. For the LB-MEM model, we compare the number of mentions of this new entity (including the current mention) against the number of mentions remaining for each of the entities currently being tracked. If there are entities in memory whose number of remaining mentions is less than or equal to the number of mentions of the currently untracked entity, then the untracked entity replaces the entity with the fewest remaining mentions; ties among such entities are broken in favor of the least recently seen entity. If there is no such entity in memory, then the mention is ignored. For the RB-MEM model, the comparison is done similarly but is limited to the LRU entity.
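The LB-MEM oracle described above can be sketched as follows (function name and input format are ours):

```python
def oracle_action(remaining, new_total, last_seen):
    """Teacher-forcing target when memory is full (Appendix A.3 sketch).

    `remaining[j]` is the number of ground-truth mentions left for
    tracked entity j, `new_total` the mention count of the untracked
    entity (including the current mention), and `last_seen[j]` the token
    index of entity j's most recent mention (used only for ties)."""
    # Entities with no more remaining mentions than the new entity's
    # total are eligible for replacement.
    eligible = [j for j, r in enumerate(remaining) if r <= new_total]
    if not eligible:
        return ("ignore", None)
    fewest = min(remaining[j] for j in eligible)
    tied = [j for j in eligible if remaining[j] == fewest]
    # Break ties in favor of the least recently seen entity.
    victim = min(tied, key=lambda j: last_seen[j])
    return ("replace", victim)
```

The RB-MEM oracle would apply the same comparison with `eligible` restricted to the single LRU entity.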

A.4 Miscellaneous
Computing Infrastructure & Runtime All the models for a single cross-validation split of LitBank can be trained within 4 hours. The U-MEM models require 24GB GPUs and are trained on a TitanRTX. The LB-MEM and RB-MEM models can be trained on 12GB GPUs.

As with LitBank, the U-MEM model for OntoNotes requires a 24GB GPU, while the LB-MEM and RB-MEM models can be trained on 12GB GPUs. Training on OntoNotes finishes within 12 hours.

Number of model parameters. Table 9 shows the number of trainable parameters for all the model and dataset combinations. LB-MEM and RB-MEM have additional parameters in comparison to U-MEM for predicting a score corresponding to the number of remaining mentions for an entity. Comparing across datasets, the OntoNotes models have a few more parameters than their LitBank counterparts for modeling the document genre.
We also use the Python implementation by Kenton Lee available at https://github.com/kentonl/ e2e-coref/blob/master/metrics.py. The two scripts can have some rounding differences.
Effect of Document Length and Number of Entities. Table 10 presents the Spearman correlation between document F1 score and both document length and the number of entities in the document. The correlations are negative because the problem becomes more challenging as document length and the number of entities increase. Increasing the memory of the bounded models results in a less negative correlation, suggesting improved performance on challenging documents. The slightly less negative correlation for LB-MEM than for RB-MEM at 20 memory cells (when their dev performance is similar) implies that LB-MEM performs better on longer OntoNotes documents.