Attending to Long-Distance Document Context for Sequence Labeling

We present in this work a method for incorporating global context in long documents when making local decisions in sequence labeling problems like NER. Inspired by work in featurized log-linear models (Chieu and Ng, 2002; Sutton and McCallum, 2004), our model learns to attend to multiple mentions of the same word type in generating a representation for each token in context, extending that work to learning representations that can be incorporated into modern neural models. Attending to broader context at test time provides complementary information to pretraining (Gururangan et al., 2020), yields strong gains over equivalently parameterized models lacking such context, and performs best at recognizing entities with high TF-IDF scores (i.e., those that are important within a document).


Introduction
Many of the main datasets used in NLP are comprised of relatively short documents: English OntoNotes (Weischedel et al., 2012), for example, contains an average of 223 tokens per document, the WSJ portion of the Penn Treebank (Marcus et al., 1993) averages 501 tokens, the IMDb dataset (Maas et al., 2011) averages 272 tokens, and SQuAD 2.0 (Rajpurkar et al., 2018) contains an average of 134 tokens per passage. This focus has, in turn, led to the development of models specifically optimized for the characteristics of short documents, including a pervasive focus on the sentence as the atomic unit of analysis for tasks such as NER and parsing, and a limit of 512 tokens on the maximum context length of contextual language models like BERT (Devlin et al., 2019).
At the same time, however, longer documents are increasingly the objects of empirical study in areas as diverse as computational social science and the digital humanities, including novels (Piper, 2018; Underwood, 2019), scientific articles (Jurgens et al., 2018), and political manifestos (Menini et al., 2017; Denny and Spirling, 2018). These long documents present not only challenges for NLP (such as any task, like coreference resolution, whose computational complexity is superlinear in the size of the document) but opportunities as well, since the longer document context offers greater opportunity for learning better representations.
Recent work in NLP has begun exploring this link between longer documents and representation learning. First, while contextualized models (e.g. Peters et al., 2018; Devlin et al., 2019) generally consider the context of a few sentences, several recent advancements have enabled significantly longer input sequences (e.g. Dai et al., 2019; Beltagy et al., 2020; Kitaev et al., 2020; Rae et al., 2019); most, however, are either incapable of processing book-level documents or prohibitively resource-intensive for standard use.
Second, domain- and task-adaptive pretraining has proven especially effective for adapting the weights of general-purpose language models to the distribution of a particular domain or task (Gururangan et al., 2020; Han and Eisenstein, 2019; Beltagy et al., 2019). While longer documents are able to provide more context for these models to adapt to, pretraining operates at the broad level of a domain, and is unable to exploit new context at evaluation time in unseen test documents. To highlight the value of considering document context at test time, consider the following sentence from E.M. Forster's A Room with a View (1908):

"Mr. Beebe!" said the maid, and the new rector of Summer Street was shown in; he had at once started on friendly relations, owing to Lucy's praise of him in her letters from Florence.
From the context of this sentence alone, it is unclear if Florence refers to "a city in Italy" or "a person named Florence"; this local contextual ambiguity might lead an NER system to classify Florence as either a PERSON or LOCATION.
However, examining the broader document context clarifies this entity type: other mentions of Florence within the text more clearly indicate that it refers to the city:

• "I saw him in Florence," said Lucy...
• As her time at Florence drew to its close...
• ...two carriages stopped, half into Florence...
We might hypothesize, in fact, that a model that can attend to multiple mentions of a term like Florence in a document will perform better at recognizing important entities: those that are frequently mentioned within it and that may be infrequently seen outside of it. This fundamental idea, that multiple mentions of a term can provide shared information to help disambiguate each one, originates in featurized log-linear models that incorporate global information in making local predictions (Chieu and Ng, 2002; Sutton and McCallum, 2004; Liu et al., 2010); we extend that work here to the context of learning representations that can be incorporated into state-of-the-art neural models, explicitly learning to attend over relevant context sequences that are available only at test time and providing a complementary source of information to domain- and task-adaptive pretraining.
This work makes the following contributions:

1. We present Doc-ARC (Document-Attentive Representation of Context), an attention-based method for incorporating document context in sequence labeling tasks, and demonstrate improvements over equivalently parameterized models without document attention.

2. We evaluate Doc-ARC on three datasets containing long documents from different domains (literature, biomedical texts, and news), and present a new dataset of the full text of biomedical articles paired with labeled annotations of their abstracts in the GENIA/JNLPBA dataset (Collier and Kim, 2004).

3. We demonstrate that Doc-ARC outperforms alternative methods at recognizing important document entities (defined as those with a high TF-IDF score), identifying tangible scenarios where it would be advantageous to use.

Doc-ARC
The core idea behind Doc-ARC is to leverage nearby representations of the same word when generating a representation for a given token. Rather than representing Florence above through a contextual representation scoped only over one sentence, we represent it through a weighted combination of that token itself and other instances of Florence in the document. By attending over multiple instances of the same word, we are able to preserve the importance of the specific local context of a token, while also reasoning about its broader use in the rest of the document. While this model has application to a wide range of NLP tasks, we focus on the sequence labeling problem of NER.

Model Overview
Figure 1 illustrates this model for a sample text from the JNLPBA corpus. Consider a sequence x = {x_1, ..., x_n} with corresponding labels y = {y_1, ..., y_n}, drawn from a document D. Other sequences in D may or may not have labels, and the labeled set may or may not be contiguous. Let e(x) be an encoding of x under some language model (e.g. BERT). When predicting a label, we consider both e(x), the original encoding of the target sequence, and c(x), an attention-weighted sum over the encodings of each x_i ∈ x as they appear in the context of D.
[Figure 1: An illustration of the model for a sample from the JNLPBA corpus ("GATA-3 dependent enhancer activity in IL-4 gene regulation"): the target occurrence of GATA-3 attends over encodings of GATA-3 as it appears in nearby context sentences, e.g. "… certain genomic regions significantly enhanced GATA-3 promoter transactivation …" and "… retroviral transduction of GATA-3 into developing T cells induced IL-5 …".]

Formally, let us define V(x_i) to be the word type (drawn from vocabulary V) for token x_i, and let S_K(x_i) be the K closest sequences to x in D which also contain a token of type V(x_i), where s_k is the k-th closest context sequence to x and i_k denotes the index of V(x_i) in s_k. For each x_i ∈ x and each k ≤ K, our model fetches e_c(x_i)^(k), an encoding of V(x_i) as it appears in the context of s_k, with d(s_k, x) denoting a bucketed embedding of the distance between s_k and x. We adapt our distance buckets from Lee et al. (2017). Finally, we compute c(x_i) by attending over each of the e_c(x_i)^(k).
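To make the distance term concrete, the following sketch implements one common bucketing scheme in PyTorch. The exact bucket boundaries and the embedding size are assumptions, since the paper specifies only that its buckets are adapted from Lee et al. (2017).

```python
import torch
import torch.nn as nn

def distance_bucket(d: int) -> int:
    """Map a sentence distance to a coarse bucket id.

    Bucket boundaries here follow the style of Lee et al. (2017):
    [0, 1, 2, 3, 4, 5-7, 8-15, 16-31, 32-63, 64+]. Treat these
    boundaries as illustrative, not as the paper's exact choice.
    """
    if d < 5:
        return d
    elif d < 8:
        return 5
    elif d < 16:
        return 6
    elif d < 32:
        return 7
    elif d < 64:
        return 8
    return 9

# A small trainable embedding table, one vector per bucket.
dist_embedding = nn.Embedding(num_embeddings=10, embedding_dim=20)

# d(s_k, x): the bucketed distance embedding for a context
# sentence three sentences away from the target sentence.
d_vec = dist_embedding(torch.tensor(distance_bucket(3)))
```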
If a given word type has K′ < K occurrences in D, we only attend over these K′ relevant instances. We allow sequences to attend over the target occurrence itself; that is, (x, i) ∈ S_K(x_i).
Our model generates a prediction by passing this composite representation through a sequence encoder f_s (such as a bidirectional LSTM, GRU, or Transformer layer), and generating a distribution over labels through a softmax function:

p(y | x, D) = softmax(z)    (4)
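The following PyTorch sketch illustrates the attention step for a single token under stated assumptions: the DocARCAttention module, its additive (MLP) scoring over the target encoding, context encodings, and distance embeddings, and all dimension names are ours, since the paper's exact scoring function is not reproduced here.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DocARCAttention(nn.Module):
    """Minimal sketch of document-level attention over context
    occurrences of the same word type. Only the overall shape of
    the computation follows the text; the scoring function is an
    assumed additive (MLP) score."""

    def __init__(self, hidden_dim: int, dist_dim: int):
        super().__init__()
        self.score = nn.Linear(2 * hidden_dim + dist_dim, 1)

    def forward(self, e_target, e_context, d_context):
        # e_target:  (H,)    encoding of token x_i in the target sentence
        # e_context: (K, H)  encodings e_c(x_i)^(k) in the K context sentences
        # d_context: (K, D)  bucketed distance embeddings d(s_k, x)
        K = e_context.size(0)
        query = e_target.unsqueeze(0).expand(K, -1)              # (K, H)
        scores = self.score(
            torch.cat([query, e_context, d_context], dim=-1)
        ).squeeze(-1)                                            # (K,)
        alpha = F.softmax(scores, dim=-1)                        # attention weights
        c = (alpha.unsqueeze(-1) * e_context).sum(dim=0)         # c(x_i): (H,)
        return c, alpha

# Downstream (sketch): concatenate e(x_i) and c(x_i) for every token,
# run the sequence encoder f_s (e.g. a bi-LSTM), then apply the softmax
# over labels as in Eq. 4.
```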

Static and Dynamic Doc-ARC
When processing a single target sequence of N words, our model must process O(NK) context sequences. If the context representation e_c(x) is allowed to be trainable, O(NK) copies of model activations are stored for each target sentence, which becomes prohibitively expensive for large encoders.
Though optimizations can be made using GPU/TPU parallelism (e.g. Raffel et al., 2019) and/or memory-efficient encoders (e.g. Kitaev et al., 2020; Lan et al., 2019), our work adopts a different focus. Instead, we consider two simple cases which encapsulate the trade-offs inherent to this method, regardless of encoder architecture:

Static. Our static variant of Doc-ARC assumes that e(·) is fixed throughout training. This variant is applicable when the encoder is a memory-intensive language model such as BERT. To offset the effects of freezing BERT, we pass the context representations through a trainable one-layer context encoder f_c, which we found crucial to good performance in our experiments.
To compute c(x), we first gather all of the unique sequences that x will attend over, compute the representations of the attended sequences with a frozen base model, and cache these representations in CPU memory.
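A minimal sketch of this caching pattern is shown below, with a stand-in frozen encoder (in the paper, the base model is BERT) and an assumed keying scheme for the cache:

```python
import torch
import torch.nn as nn

# Stand-in for the frozen base encoder e(.); in the paper this is
# a frozen BERT model, used here only to show the caching pattern.
frozen_encoder = nn.Embedding(30522, 768)

@torch.no_grad()
def build_context_cache(unique_context_seqs):
    """Encode each unique context sequence once and keep the
    activations in CPU memory; training then back-propagates only
    through the lightweight encoders f_c and f_s."""
    cache = {}
    for seq_id, token_ids in unique_context_seqs.items():
        hidden = frozen_encoder(torch.tensor(token_ids))  # (T, 768)
        cache[seq_id] = hidden.cpu()  # frozen representation, no grads
    return cache

# Hypothetical sequence id and token ids, for illustration only.
cache = build_context_cache({"doc3_sent7": [101, 2054, 2003, 102]})
```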
Dynamic. Our dynamic variant assumes that e(·) is trainable, which necessitates a memory-efficient encoder (see §4.2). Here, each of the O(NK) context sequences is processed by the encoder in a single batch, including duplicate sentences. Activations for all the sequences are held in GPU memory. We process single target sequence batches with gradient accumulation to achieve larger effective batch sizes. We do not include the context encoder f_c.
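A minimal sketch of this training loop follows. The model, data, and accumulation factor are runnable stand-ins, not the paper's actual configuration; only the gradient-accumulation pattern is the point.

```python
import torch
import torch.nn as nn

# Stand-ins so the loop runs; in practice `model` is the dynamic
# Doc-ARC model, and each example carries a target sequence, its
# O(NK) context sequences, and its labels.
model = nn.Linear(16, 5)
optimizer = torch.optim.Adam(model.parameters(), lr=5e-4)
loss_fn = nn.CrossEntropyLoss()
train_examples = [(torch.randn(8, 16), torch.randint(0, 5, (8,)))
                  for _ in range(32)]

ACCUM_STEPS = 8  # assumed effective batch size; not reported per-dataset

optimizer.zero_grad()
for step, (target_seq, labels) in enumerate(train_examples):
    # In the real model, encoder activations for all context
    # sequences are held in GPU memory alongside the target sequence.
    loss = loss_fn(model(target_seq), labels)
    (loss / ACCUM_STEPS).backward()  # scale so accumulated grads average
    if (step + 1) % ACCUM_STEPS == 0:
        optimizer.step()
        optimizer.zero_grad()
```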
Datasets

LitBank. The LitBank dataset (Bamman et al., 2019) is comprised of relatively long documents drawn from 100 English novels, with each document containing annotations for roughly 2,000 words. This dataset contains annotations for nested entities using six of the ACE 2005 (Walker et al., 2006) categories (PER, LOC, FAC, GPE, ORG, VEH). We convert that hierarchy into a flat structure suitable for NER by preserving only the outermost layer of any nested structure (the same process used by JNLPBA for GENIA, described below); all annotations nested within another are removed. We use the same training, development, and test splits reported in Bamman et al. (2019). While the labeled documents in LitBank are already quite long, they represent less than 2% of the novels they are drawn from: the average full-text novel in this collection is approximately 133,000 words. We draw on this broader context by treating the remainder of the novel as unlabeled document context that we can exploit.
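A minimal sketch of this flattening step, assuming spans are (start, end, label) tuples with exclusive ends; the handling of exact ties is our assumption, since the paper defers to the JNLPBA procedure:

```python
def flatten_nested(spans):
    """Keep only outermost entity spans: drop any span strictly
    contained within another, as in the flattening described above."""
    keep = []
    for s in spans:
        nested = any(
            o[0] <= s[0] and s[1] <= o[1] and (o[0] < s[0] or s[1] < o[1])
            for o in spans
        )
        if not nested:
            keep.append(s)
    return keep

# Illustrative example: a hypothetical FAC span nested inside a PER
# span is removed, keeping only the outer PER annotation.
print(flatten_nested([(0, 6, "PER"), (4, 6, "FAC")]))  # -> [(0, 6, 'PER')]
```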
JNLPBA. To test our performance in the biomedical domain, we use data from the JNLPBA 2004 shared task on entity recognition (Collier and Kim, 2004); this data consists of flat annotations of MEDLINE abstracts extracted from the nested entity annotations in the GENIA corpus (Kim et al., 2003), with five labels (PROTEIN, CELL LINE, CELL TYPE, DNA and RNA).
While the median document length in JNLPBA is only 245 words, these abstracts have a potentially much larger unlabeled context: the full text of the articles themselves. One contribution we make in this work is constructing a new dataset by pairing the abstracts in GENIA with their full scientific articles. We do so by converting the MEDLINE identifiers encoded in the JNLPBA dataset to PubMed identifiers using mappings from the National Library of Medicine, querying PubMed to retrieve the article metadata, manually downloading the full-text article PDF, and OCR'ing each PDF using ABBYY FineReader. We are able to pair a total of 882 abstracts in the JNLPBA training set with their full-text articles (44.1%) and 168 abstracts in the test set (41.6%). To enable hyperparameter tuning, we divide the training set into 714 documents for training and 168 documents for development, holding out the 168 original test documents for evaluation. The average length of the unlabeled document context in this dataset is 9,337 words.
OntoNotes_1000. The OntoNotes 5.0 dataset (Weischedel et al., 2012) provides named entity annotations for a subset of documents, with 18 entity classes, including PERSON, LOCATION, MONEY, and WORK OF ART. While the median length of documents in this collection is quite short at 277 words, we simulate a scenario of longer document context by focusing only on documents in OntoNotes that are over 1,000 words in length. We use the same training, development, and test splits of this data used in Pradhan et al. (2013), using the BIO labels in the OntoNotes-5.0-NER-BIO repository. Subsetting the data to only those documents within these partitions with over 1,000 words yields a total of 434 training documents, 70 development documents, and 34 test documents.
Preprocessing. For Doc-ARC (both static and dynamic), all labeled sequences are kept at their original length; none were longer than BERT's maximum input length (512). All unlabeled (context) sequences longer than 256 tokens are partitioned into chunks of length ≤ 256 tokens, since this limits the complexity of computing c(x) (see §2.2). For baselines, unlabeled sequences are disregarded.
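A minimal sketch of this chunking step, assuming simple contiguous splits (the paper does not specify any overlap between chunks):

```python
def chunk_context(token_ids, max_len=256):
    """Partition an unlabeled context sequence into chunks of at
    most max_len tokens, as in the preprocessing described above."""
    return [token_ids[i:i + max_len]
            for i in range(0, len(token_ids), max_len)]

chunks = chunk_context(list(range(600)))
print([len(c) for c in chunks])  # -> [256, 256, 88]
```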

Experiments
We evaluate our static and dynamic Doc-ARC models on LitBank, JNLPBA, and OntoNotes_1000. To enable a fair comparison of the specific contribution of document-level attention, each Doc-ARC model is compared to a baseline which lacks contextual inputs and has a comparable number of trainable parameters.

Static Doc-ARC
We compute e(x) from a frozen BERT_BASE model, using the last four layers of BERT as a token's representation. To offset the effects of freezing BERT's weights, we let f_s and f_c be trainable bi-LSTMs. We perform hyperparameter tuning on the development set over K for each model.
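A sketch of extracting such frozen representations with the HuggingFace transformers library follows; the specific checkpoint and the use of concatenation (rather than summation) across the last four layers are assumptions, since the paper says only that the last four layers are used.

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
bert = AutoModel.from_pretrained(
    "bert-base-cased", output_hidden_states=True).eval()

@torch.no_grad()
def encode_tokens(sentence: str) -> torch.Tensor:
    """Frozen token representations from the last four BERT layers,
    concatenated along the feature dimension."""
    enc = tokenizer(sentence, return_tensors="pt")
    hidden_states = bert(**enc).hidden_states          # 13 x (1, T, 768)
    last_four = torch.cat(hidden_states[-4:], dim=-1)  # (1, T, 3072)
    return last_four.squeeze(0)
```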
Task-adaptive pretraining (TAPT). The availability of unlabeled data drawn from the same documents as a labeled dataset is exactly the scenario for which task-adaptive pretraining (Gururangan et al., 2020) has demonstrated sizeable effects. To investigate this in the context of this NER task, we pretrain BERT_BASE on the training documents' full text (both labeled and unlabeled) for 100 epochs, yielding a BERT_TAPT model for each dataset.
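A hedged sketch of an equivalent masked-language-model pretraining setup using HuggingFace tools follows; the paper used Google's original BERT pretraining code on a TPU, so the checkpoint, file name, and hyperparameters here are placeholders.

```python
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-cased")

# Hypothetical file holding the training documents' full text.
dataset = load_dataset("text", data_files={"train": "fulltext_train.txt"})
tokenized = dataset["train"].map(
    lambda ex: tokenizer(ex["text"], truncation=True, max_length=128),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="bert_tapt", num_train_epochs=100),
    train_dataset=tokenized,
    # Standard 15% token masking for the MLM objective.
    data_collator=DataCollatorForLanguageModeling(
        tokenizer, mlm_probability=0.15),
)
trainer.train()
```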
Baselines. We compare each static Doc-ARC model to a baseline with a comparable number of trainable and non-trainable parameters (frozen BERT representations input into two stacked bi-LSTMs), but lacking attention over neighboring sequences; using the notation from §2, the only input to the baseline model is e(x), and not c(x). We train this baseline model on the labeled set only.
Results. Table 2 lists results for Doc-ARC on all three datasets with the encoder fixed to both BERT_BASE and BERT_TAPT. We find that Doc-ARC performs above the baselines for all trials, a difference that can reasonably be attributed to Doc-ARC's document-level contextual attention mechanism.
We find that task-adaptive pretraining is least beneficial for LitBank and OntoNotes_1000 (perhaps due to the similarity of their domains to BERT's training data of BookCorpus and Wikipedia), and most helpful for JNLPBA, whose linguistic domain is most distinct from BERT's training data.

Dynamic Doc-ARC
We compute e(x) from the last layer of a Transformer_TINY model (Turc et al., 2019), a compact, two-layer Transformer distilled from BERT_BASE, which we will refer to as BERT_TINY. We do not process the context representations through f_c, but maintain that f_s is a trainable bi-LSTM (see Eq. 4). For all datasets, we attend over the K = 10 closest sequences, which was the largest configuration that could be trained on a single GPU for all three datasets.
Baselines. We compare each dynamic Doc-ARC to a trainable BERT_TINY, as well as BERT_TINY with one bi-LSTM attached. Analogous to the static case, the dynamic baseline has a comparable number of parameters to dynamic Doc-ARC, but lacks attention over neighboring context sequences.

[Table 3: Dynamic Doc-ARC results, all evaluated at K = 10. The BERT+LSTM baseline has a comparable number of trainable parameters, but lacks attention over context occurrences. We report mean (SD) test F1 scores across 5 runs.]
Results. We find that dynamic Doc-ARC significantly outperforms the baselines. Relative to the BERT_TINY+LSTM baselines, we find that dynamic Doc-ARC gains are greater than their static counterparts for LitBank and JNLPBA. Though the dynamic models cannot match the performance of their static analogues, it is worth noting that the static variants have roughly twice as many trainable parameters. Moreover, BERT_BASE has roughly 25 times as many parameters as BERT_TINY.

Task Fine-Tuning
To contextualize the performance of our dynamic models, we can consider results for fully task-finetuned BERT_BASE and BERT_TAPT models; as Table 4 illustrates, when given the ability to fine-tune all of its parameters to the task, performance is significantly higher than that of the small dynamic models, and comparable to the larger (but static) Doc-ARC models.
While a direct comparison is ill-suited given the disparity in trainable parameters in a task-tuned BERT_BASE (11 times the number of trainable parameters of a static Doc-ARC and 25 times that of a dynamic Doc-ARC), it illustrates one direction for future work: incorporating a task-tuned contextual language model into Doc-ARC. However, even with a static model with an order of magnitude fewer parameters, we find that Doc-ARC can outperform even a trainable BERT baseline for certain classes of important entities, as illustrated in the following section.

Analysis
Doc-ARC was designed to (1) improve the performance of NER systems for rare, but important entities by (2) leveraging rich contextual information in long documents. In this section, we characterize the extent to which these goals were met using both quantitative and qualitative analysis.

Characterizing Important Entities
We hypothesize that Doc-ARC is most beneficial for rare entities that occur primarily within the context of a single document (such as the names of major characters in a novel). Such entities have a unique relevance only within the context of their document and are often the entities of highest importance for downstream analyses. However, these entities are particularly difficult for NER systems to classify correctly due to their rarity, unusual surface forms, and/or ambiguous meaning across documents. Given that these entities occur multiple times throughout a document and in diverse contexts, Doc-ARC should have the capacity to leverage this additional context for greater accuracy among important entities.

One means to identify important terms in a document is TF-IDF: words with high TF-IDF scores must appear frequently throughout a given document or appear characteristically within that document by appearing infrequently in other documents; terms with the highest scores satisfy both criteria. As Figure 2 illustrates, TF-IDF scores have a strong relationship with the presence of entity labels; words with high TF-IDF scores are more likely to be named entities across all three datasets. Table 5 lists the three entities with the highest TF-IDF scores for each of the datasets, which appear exclusively as named entities and capture important characters (LitBank), proteins (JNLPBA), and political entities (OntoNotes).
Given that TF-IDF is a reasonable indicator for important entities, we analyze Doc-ARC's performance for high TF-IDF words in comparison to alternative models. First, we compute TF-IDF scores for all words across all documents for each dataset, using the logarithm of the term frequency to control for variation in document length (see the sketch following this analysis). We then restrict our vocabulary to words in the labeled test set that appear with a named entity label at least once, thereby excluding spurious high TF-IDF words (e.g. document-characteristic adjectives and adverbs). We split this vocabulary of high TF-IDF entities at the 90th, 95th, and 99th percentiles and compute word-level F1 scores within each percentile. (Note that the word-level F1 scores reported in this section differ from the entity span-level F1 scores reported in §4, since span-level F1 measures do not allow for word-level analysis.)

[Figure 2: Among words in the labeled test set, we compute the proportion of words that appear with NER labels for each TF-IDF quantile. Across all datasets, words with a higher TF-IDF score are more likely to appear as named entities.]

Results. In Figure 3, we compare word-level F1 scores between our best static Doc-ARC models with a fixed BERT_BASE input (Table 2) and a task-finetuned BERT_BASE model (Table 4). We plot the difference in word-level F1 scores across the entire test set and the top 10%, top 5%, and top 1% of TF-IDF entities. While the static Doc-ARC underperforms a finetuned BERT_BASE across all words (mirroring the results from Table 4), we find that the static Doc-ARC outperforms finetuned BERT for high TF-IDF entities. Moreover, these performance gains increase with the TF-IDF threshold, indicating that Doc-ARC's performance is more sensitive to high-importance entities than a standard finetuned BERT model. These results are particularly pronounced for OntoNotes_1000, where Doc-ARC outperforms a finetuned BERT model by over 17 points in the top 1% of TF-IDF entities.
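A minimal sketch of the TF-IDF variant described above, with a log term frequency; the exact smoothing is an assumption, since the text specifies only the logarithm of the TF.

```python
import math
from collections import Counter

def tfidf(docs):
    """Per-document TF-IDF with a log term frequency, used to
    control for variation in document length."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each word.
    df = Counter(w for doc in docs for w in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append({
            w: (1 + math.log(tf[w])) * math.log(n_docs / df[w])
            for w in tf
        })
    return scores

docs = [["florence", "said", "lucy"], ["the", "maid", "said"]]
print(tfidf(docs)[0]["florence"])  # characteristic of document 0
```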

Characterizing Context Attention
We now turn to analyzing our model's use of attention over context occurrences. We parameterize this analysis via the attention width (K) and the attention weight (α_k).
Attention Width. The attention width (K) determines the number of context occurrences a target word can attend over. In order to better understand the impact of the attention width on our model's performance, we plot mean dev F1 scores across three runs for several values of K in Figure 5. We find that the optimal value of K is dataset-specific and that performance does not monotonically increase with K, indicating that too much context can be detrimental. The maximum dev F1 scores were used to determine the final hyperparameters in Table 2 (further hyperparameter details can be found in Appendix A.2).
Attention Weight. In Figure 4, we plot the distribution of attention weights as a function of the distance to the target word. Unsurprisingly, the model assigns the highest weight to the target sentence itself (distance = 0), which contains the target occurrence itself as well as any other mentions of the target word within the target sentence. Though the attention weight distributions for distances greater than zero tend to have small medians, they have very long tails; for certain rare context sequences, Doc-ARC assigns a weight higher than that of the target token itself.

Related Work
Our work draws from several strands of related research. First, our motivation for this work is rooted in early research exploring the global scope of information across an entire document in making token-level decisions. Chieu and Ng (2002) present one of the earliest examples of this for NER, employing features scoped over both the local token context and the broader type context in a log-linear classifier. Our use of attention in building a representation of a token that is informed by other instances of the same type is likewise influenced by work on Skip-Chain CRFs (Sutton and McCallum, 2004), which explicitly model the label dependencies between words of the same type, including for the task of NER (Liu et al., 2010).
Second, automatically retrieving relevant context has been shown to improve accuracy across a variety of NLP tasks. Searching for the k most similar context sequences to a given target has been explored for language model pretraining (Gururangan et al., 2020), training (Kaiser et al., 2017; Lample et al., 2019), and inference (Khandelwal et al., 2020); incorporating shared span representations linked through coreference has also been shown to help in multi-task learning (Luan et al., 2018). Recently, Guu et al. (2020) introduced a neural knowledge retriever for open-domain question answering, trained to retrieve the k most relevant documents during all of pretraining, finetuning, and inference. Though named-entity masking had previously been shown not to improve standard BERT pretraining (Joshi et al., 2020), Guu et al. (2020) find that it significantly improves retrieval-augmented pretraining. Most prior work has computed similarity in embedding space, using either model-internal representations (Khandelwal et al., 2020; Guu et al., 2020) or lightweight sentence encoders. Instead, we adopt word-type identity match as a simpler, yet effective heuristic.
Finally, self-supervised pretraining within relevant domain and/or task data has been widely shown to be beneficial for downstream task accuracy (Gururangan et al., 2020; Han and Eisenstein, 2019; Beltagy et al., 2019), with applications generally focused on transfer representation learning. Gururangan et al. (2020) additionally investigate human-curated task-adaptive pretraining, comparable to our long-document setting, in which labeled annotations are drawn from a larger pool of unlabeled texts.

Conclusion
We present in this work a new method for reasoning over the context of long documents by attending over representations of identical word types when generating a representation for a token in sequence labeling tasks like NER. We show that when comparing equivalently parameterized models, incorporating attention over the entire document context leads to performance gains over models that lack that contextual mechanism; further, the gains are asymmetric, with a substantial increase in accuracy for important entities within a document (defined as those with high TF-IDF scores). In the context of long documents, our approach presents a novel alternative to established methods such as long sequence modeling and task-adaptive pretraining.
Our work's main contribution is a computationally tractable method for attention in long documents, employing exact word match as a complexity-reducing heuristic. Though our attention mechanism is simple, Doc-ARC's strong performance in comparison to non-contextual baselines demonstrates both the value of the exact-match heuristic and the general utility of our framework.
This work leaves open several natural directions for future research, including incorporating document attention within a fully trainable task-tuned BERT model, and broadening the focus of attention beyond identical word types to words that bear other forms of similarity (such as similarity in subword morphology and meaning). Code and data to support this work can be found at https://github.com/mjoerke/Doc-ARC.

A.1 Computing Infrastructure
Each Doc-ARC model and baseline comparison was trained on a single NVIDIA Tesla K80 GPU with 12GB of GPU memory. Task-adaptive pretraining was performed on a Google Cloud v2-8 TPU.

A.2 Hyperparameters
Hyperparameters for our dynamic Doc-ARC models (Table 3) are listed in Table 8. Hyperparameter tuning over K was limited to K ≤ 10, since this was the largest configuration that could be trained on a single GPU. We perform hyperparameter tuning over K only, choosing the optimal K via mean dev F1 across 3 trials; all models had the best results for K = 10. Each BERT+LSTM baseline was trained with identical hyperparameters (except for K, which does not apply).

Task-Adaptive Pretraining. We perform task-adaptive pretraining on full texts within the training set for 100 epochs. Pretraining was performed using Google's BERT pretraining code. Hyperparameters for pretraining are listed in Table 9.

A.3 Training Times

For each reported result, we list average training times in Table 11. Note that task-adaptive pretraining was only run once for each dataset.

A.4 Development Set Results.
We reproduce each of the tables in the main paper with development set results. Table 12 lists dev results for Table 2, Table 13 lists dev results for Table 3, and Table 14 lists dev results for Table 4.

[Table 14: BERT finetuning results on the development set. We report mean (SD) F1 scores across 5 runs.]