ReadTwice: Reading Very Large Documents with Memories

Knowledge-intensive tasks such as question answering often require assimilating information from different sections of large inputs such as books or article collections. We propose ReadTwice, a simple and effective technique that combines several strengths of prior approaches to model long-range dependencies with Transformers. The main idea is to read text in small segments, in parallel, summarizing each segment into a memory table to be used in a second read of the text. We show that the method outperforms models of comparable size on several question answering (QA) datasets and sets a new state of the art on the challenging NarrativeQA task, with questions about entire books. Source code and pre-trained checkpoints for ReadTwice can be found at https://goo.gle/research-readtwice.


Introduction
Transformer-based models such as BERT are very effective in capturing long-range dependencies in text passages through the attention mechanism (Vaswani et al., 2017). However, the amount of computation in attention grows quadratically with the number of tokens in the input passage. As such, the standard BERT implementation limits input size to a fixed number of tokens (often 512).
In reality, dependencies over significantly longer ranges are common, and modeling them is crucial. For instance, in a sentence like Inside the Sammath Naur, the Ring-bearer struggled to throw the Ring into the volcano, the narrative interweaves several prior storylines from a book. Comprehending this sentence therefore requires looking up previous mentions of Ring-bearer and Sammath Naur, located many tokens away.
Several methods have been proposed to address this challenge; see Tay et al. (2020) for a survey and §3 for a detailed discussion. One popular strategy is to reduce the number of tokens attended to. Longer inputs can indeed be processed in this way, but only up to a limit of around 5,000 tokens, as used in (Zaheer et al., 2020; Beltagy et al., 2020), which is far below the context sizes required to model long documents such as books.
Another strategy, exemplified by HIBERT (Zhang et al., 2019), splits inputs into smaller segments which are processed individually and then assembled into a hierarchical representation. As a downside, inter-segment context is unavailable during encoding.
We propose READTWICE, a simple approach that combines the strengths of both strategies. As its name suggests, the main idea is to process the input twice: a long text input (such as a document, or even a book) is treated as a collection of shorter text segments which are read independently and in parallel. Then, the encoder reads each segment again, now augmented with compressed information from the other segments.
The crucial component of READTWICE, as illustrated in Figure 1, is a memory module that holds compressed information from all segments. That compressed information is used only once, in the second pass. Thus, READTWICE is much more computationally efficient than models like ETC (Ainslie et al., 2020) that rely on memory for all segments in every layer. And while READTWICE requires two passes, it differs from hierarchical models such as HIBERT in that it conditions each segment's encoding on the other segments.
§3 contrasts these approaches in more detail. We validate the efficacy of READTWICE on extractive question answering (QA) tasks, showing strong performance on HotpotQA (Yang et al., 2018), TriviaQA (Joshi et al., 2017) and NarrativeQA (Kociský et al., 2018). In particular, READTWICE significantly improves the state of the art on QA over entire books in NarrativeQA, with absolute gains of 4.5 ROUGE-L points and 3 BLEU-1 points (relative improvements of 23% and 17%, respectively).

Figure 1: READTWICE model architecture. The input is processed twice, with a memory table for inter-segment information sharing.

Method
We first describe the READTWICE model, followed by its pre-training procedure.

READTWICE
The model reads a large text document split into N segments x_1, ..., x_N; each x_i is limited to 512 tokens, as in a typical BERT model.
The model architecture is depicted in Figure 1. In the first read, each segment is encoded independently with a standard BERT model. Then, memories are extracted from each segment (a process we describe in detail below) and gathered into a global memory pool. For the second read, a MemoryAttention layer (with a residual connection and a LayerNorm on top) first merges the intra-segment contextual token embeddings from the first read with the global memory. The merged result is then read by another small BERT model with only two Transformer layers to produce the final output. The rationale is that the first read already generates rich contextualized embeddings, and the second read only needs to incorporate information from the memory. More formally, writing BERT_1 and BERT_2 for the two readers:

H_i = BERT_1(x_i) for each segment i, in parallel
M = Gather(ExtractMemories(H_1), ..., ExtractMemories(H_N))
H'_i = BERT_2(LayerNorm(H_i + MemoryAttention(H_i, M)))

Next, we describe the newly introduced layers.
ExtractMemories and Gather Our aim is to compress the information in each segment and disseminate it to other segments to be used in the second read. We consider three types of memories:
• READTWICE (CLS). One obvious choice is to use the CLS token representation associated with segment x_i as a summary of the segment.
• READTWICE (STS). To obtain more fine-grained memories, we extract a memory vector for each consecutive span of 32 tokens. The contextual embeddings of each span's first and last tokens are concatenated and linearly projected to a single point in the token vector space as the span representation. The projection matrix is learned end to end.
• READTWICE (E). In another variant of span-based memory, we memorize representations of entity mention spans. To obtain these spans, we first annotate each segment with an external Named Entity Recognition system. Then, each entity mention span is encoded in the same way as in READTWICE (STS). This design is motivated by the intuition that long-range dependencies primarily occur between entities.
Empirically, we find that READTWICE (E) leads to the best performance (see the ablation in Section 4.4), and it is the memory type used in our headline results. We collect all memories from all segments into a flat memory table. The table size is given by the number of segments (CLS), the number of 32-token spans (STS), or the number of entity mentions (E).
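The span-based memory types above can be sketched as follows. This is a minimal illustration, assuming NumPy and a randomly initialized projection; `extract_span_memories` and all shapes are illustrative names, not the released implementation:

```python
import numpy as np

def extract_span_memories(token_embs, spans, proj):
    """Compress each (start, end) span into one memory vector.

    token_embs: (seq_len, d) contextual embeddings from the first read.
    spans: list of (start, end) token indices, end inclusive.
    proj: (2*d, d) learned projection matrix (random here).
    Concatenates the span's first and last token embeddings and projects
    the result back into the token vector space, as described above.
    """
    mems = []
    for start, end in spans:
        pair = np.concatenate([token_embs[start], token_embs[end]])
        mems.append(pair @ proj)
    return np.stack(mems)

d = 8
rng = np.random.default_rng(0)
token_embs = rng.normal(size=(512, d))
proj = rng.normal(size=(2 * d, d))

# STS-style memories: one per consecutive 32-token span.
sts_spans = [(s, min(s + 31, 511)) for s in range(0, 512, 32)]
sts_mem = extract_span_memories(token_embs, sts_spans, proj)
print(sts_mem.shape)  # → (16, 8)
```

For the (E) variant, the same function would be called with entity mention spans produced by an external NER system instead of fixed 32-token windows.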
MemoryAttention In this layer, we let contextual token embeddings from individual segments interact with other segments' memories via dot-product attention over the memory table.
Let h_ij be the contextual embedding of token j in segment i after the first read, and let m be a memory table entry whose source segment is m_s. We then define its attention weight as

α_ij,m = exp(h_ij · m + r_i,m_s) / (exp(h_ij · M_0) + Σ_{m'∈M} exp(h_ij · m' + r_i,m'_s)),

where M_0 is a learnable no-op memory not associated with any specific text, and r_i,m_s is a learned position score which captures the relative distance between segment i and the memory m, akin to Shaw et al. (2018):

r_i,j = ω_clip(i−j), with clip(d) = min(max(d, −B), B),

where ω is a set of weights indexed by the clipped distance, and the cutoff threshold B clips the effect of distance to [−B, B]. We set B to 10 in this work.
Finally, the MemoryAttention layer output for a given token is the attention-weighted sum of the memory entries,

o_ij = Σ_{m∈M} α_ij,m · m,

which is added to h_ij through the residual connection and normalized by the LayerNorm.
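The layer can be sketched for a single token as follows; `memory_attention`, the zero-initialized ω, and representing the no-op memory by a fixed score of zero are simplifying assumptions for illustration:

```python
import numpy as np

def memory_attention(h, mems, mem_seg, seg_id, omega, B=10):
    """Dot-product attention from one token over the global memory table.

    h: (d,) token embedding; mems: (M, d) memory table;
    mem_seg: (M,) source segment of each memory; seg_id: token's segment;
    omega: (2*B + 1,) learned relative-distance scores.
    Slot 0 plays the role of the no-op memory M_0 (score fixed to 0 here).
    """
    dist = np.clip(seg_id - mem_seg, -B, B)       # clip distance to [-B, B]
    scores = mems @ h + omega[dist + B]           # content + position score
    scores = np.concatenate([[0.0], scores])      # prepend no-op slot
    weights = np.exp(scores - scores.max())       # stable softmax
    weights /= weights.sum()
    # Weighted sum over real memories; the no-op slot contributes nothing,
    # letting the token ignore the memory table entirely if it chooses.
    return weights[1:] @ mems

rng = np.random.default_rng(1)
d, M = 8, 5
h = rng.normal(size=d)
mems = rng.normal(size=(M, d))
out = memory_attention(h, mems, np.array([0, 1, 2, 3, 4]), 2, np.zeros(21))
print(out.shape)  # → (8,)
```

In the full model, this output is added to h_ij and passed through LayerNorm before the second read.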

Pre-training
We pretrain READTWICE similarly to BERT (Devlin et al., 2019), using the Wikipedia and BooksCorpus datasets. When entity mentions are used in the memory table, the texts are processed with the Entity Linking (EL) and Named Entity Recognition (NER) tools from the Google Cloud NLP API. Moreover, we use existing hyperlinks in Wikipedia as additional entity annotations. The first and second BERT readers are trained end to end. Our pre-training objective is the standard Masked Language Model (MLM) task, with the MLM prediction loss computed on the output of the second reader.
To encourage the model to rely on the memory, we increase the difficulty of the MLM task. Following the entity masking procedure of Guu et al. (2020) and Sun et al. (2019), we mask entity mention tokens more aggressively, at a 25% rate, and jointly mask all tokens within a mention. By contrast, for non-entity tokens, we mask contiguous sequences of random length at a 15% rate.
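A rough illustration of this masking scheme on toy tokens; the run-length distribution for non-entity spans is an assumption, since the text only specifies the masking rates:

```python
import random

def mask_tokens(tokens, mentions, seed=0):
    """MLM masking sketch: whole entity mentions are masked jointly at a
    25% rate; other tokens are masked in short random-length runs at a
    15% rate. `mentions` is a list of (start, end) spans, end exclusive.
    """
    rng = random.Random(seed)
    out = list(tokens)
    in_mention = set()
    for start, end in mentions:
        in_mention.update(range(start, end))
        if rng.random() < 0.25:               # mask the whole mention jointly
            for i in range(start, end):
                out[i] = "[MASK]"
    i = 0
    while i < len(out):
        if i not in in_mention and rng.random() < 0.15:
            run = rng.randint(1, 3)           # short contiguous run (assumed)
            for j in range(i, min(i + run, len(out))):
                if j not in in_mention:
                    out[j] = "[MASK]"
            i += run
        else:
            i += 1
    return out

tokens = "the ring bearer walked into the volcano".split()
masked = mask_tokens(tokens, mentions=[(1, 3)])
print(masked)
```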

Related Work
One way to extend the limit on input size is to reduce the number of tokens attended to. ETC (Ainslie et al., 2020) and LONGFORMER (Beltagy et al., 2020) allow standard attention only between tokens within a fixed distance. To allow information flow over longer distances, they use auxiliary global "memory" tokens which attend to all regular tokens and vice versa. BIGBIRD (Zaheer et al., 2020) additionally has each token attend to a random subset of other tokens. While these methods reduce the asymptotic complexity from quadratic to linear (in input size), the global tokens are added at each attention layer, incurring a high computational cost.
Another approach is to split the input into multiple segments and then aggregate information across segments, as in hierarchical models such as HIBERT (Zhang et al., 2019). While this reduces the attention size to the number of segments, each individual segment has no information about its siblings during token-level encoding. Alternatively, recurrent models (Dai et al., 2019; Rae et al., 2019) read a large input from left to right, dynamically compressing faraway context and thus allowing unidirectional (left-to-right) information aggregation. One disadvantage is that the input must be processed sequentially, which is time-consuming when producing contextualized representations of a large input.
Our method brings these lines of work together. Processing segments independently and in parallel, then memorizing their compressed representations and sharing the memory across segments, enables contextual embeddings to be updated based on faraway information. Enabling memory sharing only once, during the second read, allows it to be done cheaply.
Note that the memory module here is internally generated from the input, as opposed to external memory models which are orthogonal to our approach (Peters et al., 2019;Févry et al., 2020).

Pre-training setup
All READTWICE models are initialized with the public ROBERTA (base) checkpoint adapted to TensorFlow by Rothe et al. (2020). The models are then pre-trained for 1M steps on 64 TPU cores using the LAMB optimizer (You et al., 2020).
Each batch contains 512 segments, with at most 128 segments per document. The segments are consecutive spans of 512 tokens, so the model can process documents of up to 65K (≈ 128 × 512) tokens. Each batch contains the maximum number of documents such that the total number of segments is at most 512. Approximately half of Wikipedia articles fit in one segment (thus not needing memory), with a long tail of longer documents.
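The batch-packing rule above can be sketched as a greedy loop; `pack_batch` is a hypothetical helper illustrating the two constraints (at most 512 segments per batch, at most 128 per document):

```python
def pack_batch(doc_segment_counts, max_segments=512, per_doc_cap=128):
    """Greedy batch packing sketch: add whole documents (each capped at
    128 segments) until the batch would exceed 512 segments in total.
    Returns the document ids in the batch and the segment count used.
    """
    batch, used = [], 0
    for doc_id, n in enumerate(doc_segment_counts):
        n = min(n, per_doc_cap)          # per-document segment cap
        if used + n > max_segments:      # batch is full
            break
        batch.append(doc_id)
        used += n
    return batch, used

batch, used = pack_batch([1, 200, 400, 5])
print(batch, used)  # → [0, 1, 2, 3] 262
```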
In terms of compute and memory overhead, READTWICE is about 30% slower than the ROBERTA-base model and uses 15M (or 12%) more parameters: 14M owing to the second-read model BERT_2, and 1M to the ExtractMemories and MemoryAttention layers.
Fine-tuning setup

In HotpotQA (HQA), questions are based on relatively short text passages (2 evidence paragraphs), with eight additional distractor passages. In TriviaQA (TQA), the evidence text is medium-sized. NarrativeQA (NQA) asks questions about entire books, requiring a successful QA system to model very long-range dependencies. The NQA dataset has an average of 62,000 words per document, with a maximum of 400,000. Only 40% of NQA's answers are span-based; we use a ROUGE-L oracle to derive training labels for the other questions.
READTWICE is fine-tuned on each task. QA-specific heads are used to generate span-based predictions; they consist of fully-connected layers that take contextual embeddings from the second reader as inputs and output a score for whether each token is the beginning or the end of an answer span. For a similar setup, see multi-segment QA models (Clark and Gardner, 2018; Cheng et al., 2020).
During fine-tuning, batches contain 128 segments for all tasks (again with up to 128 segments per document). Every segment contains 512 tokens, but as neighboring segments overlap by 128 tokens, the model can process documents of up to 49K tokens (≈ 128 × (512 − 128)). For TQA and HQA, documents span approximately 10 segments. For NQA, we split each document into sub-documents of 49K tokens and apply memory only within these sub-documents. We perform hyperparameter search only over the learning rate λ ∈ {5e−6, 1e−5, 3e−5} and train for 6 epochs with a 10% warmup proportion. Moreover, we use early stopping based on performance on the development set.
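The overlapping segmentation above can be sketched as follows; `split_into_segments` is an illustrative helper, assuming a flat list of token ids:

```python
def split_into_segments(token_ids, seg_len=512, overlap=128, max_segments=128):
    """Split a long document into fixed-size segments whose neighbors
    overlap by `overlap` tokens. With up to 128 segments of 512 tokens
    and stride 512 - 128 = 384, this covers roughly 49K tokens.
    """
    stride = seg_len - overlap
    segments = []
    for start in range(0, max(len(token_ids) - overlap, 1), stride):
        segments.append(token_ids[start:start + seg_len])
        if len(segments) == max_segments:
            break
    return segments

doc = list(range(1000))
segs = split_into_segments(doc)
print(len(segs), len(segs[0]))  # → 3 512
```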

Main Results
Results for HQA and TQA are reported in Table 1. We compare to prior art (using reported results where available, or our own implementations otherwise, denoted as "us"): Longformer (LF) (Beltagy et al., 2020), ETC (Ainslie et al., 2020), BigBird (Zaheer et al., 2020), and ROBERTA (Liu et al., 2019). By default, we compare against the "base" configuration of those models, where the number of parameters is comparable to BERT-Base, as is the case for READTWICE. Table 1 shows that for small to medium sized text passages, the proposed READTWICE outperforms all models of comparable size. Table 2 contrasts READTWICE with other methods on extremely large contexts: BiDAF (Kociský et al., 2018), R^3 (Wang et al., 2018), BM25 + BERT Reader/Ranker (Mou et al., 2020), and our own implementations of ROBERTA and ETC. READTWICE significantly outperforms all previous work and establishes new state-of-the-art results, demonstrating the effectiveness of performing a second read conditioned on global memory for processing extremely long texts.

Ablation Analysis & Discussion
To isolate individual components' contributions, Table 3 contrasts several variants of READTWICE.
Inter-segment memory matters We introduce a variant READTWICE-E(SS) (where SS stands for "Single Segment") to isolate the gains from the memory layer. READTWICE-E(SS) prevents segments from attending to memories of other segments, thus disabling long-range dependency modeling. We observe that READTWICE-E improves over READTWICE-E(SS) on all tasks, modestly but non-negligibly for TQA, and significantly for HQA and especially NQA. This matches our knowledge of those datasets: TQA questions are based on a relatively short context and can typically be answered using a single passage in the context document. HQA questions have a similarly sized context, but are explicitly constructed to require information from multiple paragraphs to answer, and READTWICE shows accordingly larger gains. Finally, NQA has much larger contexts, and its questions generally require information from different parts of the document, increasing the importance of long-range dependency modeling and accordingly, the performance boost from READTWICE.
Entities matter Entity mentions appear to be the most effective memory type in most experiments, leading to noticeably improved performance on both HQA and NQA. The difference is most pronounced on NQA, whose particularly long and challenging contexts make it a perfect testbed.
Source of non-memory gains The non-memory gains over a baseline ROBERTA model originate from the two extra layers and the entity-based MLM objective. To disentangle these sources, we train the READTWICE-E(SS) model using a 10-layer Transformer for BERT_1 (denoted as E(SS, 10L) in Table 3), so that the total number of layers matches ROBERTA; the remaining gains are then attributable to the pre-training procedure (E(SS, 10L) vs ROBERTA).

Conclusion & Future Work
READTWICE performs well on several QA tasks, particularly NarrativeQA where long-range dependencies among entities appear to be very important. The proposed method is conceptually simple, easy to implement and is capable of reading entire books. For future work, we plan to explore new memory types, hierarchies and aggregation functions. We also aim to apply the model to other tasks, particularly long text summarization, likely to benefit from a memory-forming mechanism.

A Method
Sparse MemoryAttention In the standard setting, READTWICE (CLS) and READTWICE (STS) apply the MemoryAttention layer to all tokens h_ij. Our preliminary experiments showed that READTWICE (E) benefits from a sparse pattern in which the layer is applied only to tokens that belong to an entity mention; it acts as an identity function for all other tokens. This follows the intuition that long-range dependencies in text mainly occur between entity mentions. Similar sparse patterns for READTWICE (CLS) and READTWICE (STS) (e.g., CLS tokens attending only to CLS-based memories) affected their performance negatively.

B Pre-training details
Entity-mention-specific pre-training We use coreference resolution as an auxiliary pre-training task specifically for READTWICE (E). If two entries in the memory table point to the same entity (but in different segments), the coreference task encourages their representations to be close. This is achieved through a binary classification task on whether entries m and m' in M point to the same entity, with classification probability

p(m and m' corefer) = σ(m · m' + b_0),

where σ is the logistic function and b_0 is a bias term. A logistic loss is formed (using the ground truth as positive examples and all other entities as negative examples) and added to the MLM learning objective for every entry in the memory table. Memories that correspond to mentions without an entity ID (meaning Entity Linking failed to link them) are ignored in this loss. Ablation results (cf. Table 5) show that this auxiliary loss is not critical for model performance, although it does give a modest boost to the ROUGE-L score on the NarrativeQA task.
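A sketch of this auxiliary loss under the definitions above; the exhaustive pair enumeration and averaging are illustrative, and `same_entity` stands in for linked entity IDs:

```python
import numpy as np

def coref_loss(mem, same_entity, b0=0.0):
    """Auxiliary coreference loss sketch: for every ordered pair of
    memory-table entries, a logistic loss on whether they mention the
    same entity, with p(same) = sigmoid(m . m' + b0) as defined above.
    Entries with a missing entity ID would be skipped in the full model.
    """
    total, count = 0.0, 0
    n = len(mem)
    for a in range(n):
        for b in range(n):
            if a == b:
                continue
            logit = mem[a] @ mem[b] + b0
            p = 1.0 / (1.0 + np.exp(-logit))          # sigmoid
            y = 1.0 if same_entity[a] == same_entity[b] else 0.0
            total += -(y * np.log(p) + (1 - y) * np.log(1 - p))
            count += 1
    return total / count

rng = np.random.default_rng(2)
mem = rng.normal(size=(4, 8))
loss = coref_loss(mem, same_entity=[0, 1, 0, 2])
print(loss)
```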

MLM Analysis
We consider whether the model learns to use memories in its predictions, evaluating MLM accuracy on a held-out set of Wikipedia articles and books. We compare READTWICE with a restricted version of itself that cannot access memories from other segments, denoted as the Single Segment (SS) setting. The single-segment version of READTWICE is essentially a standard ROBERTA model with an additional attention mechanism over entity mentions within the segment, but different segments are processed completely independently.

Table 5: Ablation studies on variants of READTWICE. We report F1 (answer only) score for HQA, ROUGE-L and BLEU-1 for NQA (-R and -B correspondingly) and F1 for TQA.
The results are reported in Table 4. READTWICE achieves a +1.5% accuracy gain over READTWICE (SS), rising to +7.4% for entity tokens, confirming the model learns to utilize mention memory.
C Question Answering

C.1 Extractive QA layers
The model is fine-tuned and evaluated on several extractive QA tasks. We introduce additional QA-specific layers to generate span-based predictions. Let H_i be the model's output for segment i. The model generates a score (logit) for whether token j is the beginning of an answer span, Z^(b)_{i,j}, and another score for the end of the span, Z^(e)_{i,j}:

Z^(b)_{i,j} = W_b · FFN(H_{i,j}), Z^(e)_{i,j} = W_e · FFN(H_{i,j}),

where W_b, W_e are learnable weights and FFN(·) is a shared fully-connected layer.
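The QA head can be sketched as follows, assuming a ReLU nonlinearity in the shared FFN (the text does not specify one) and illustrative weight shapes:

```python
import numpy as np

def span_logits(H, W_b, W_e, ffn_w):
    """QA head sketch: a shared fully-connected layer FFN over the second
    reader's outputs, then separate linear scores for the begin and end
    positions of the answer span.
    """
    F = np.maximum(H @ ffn_w, 0.0)   # shared FFN with (assumed) ReLU
    Z_b = F @ W_b                    # (seq_len,) begin logits
    Z_e = F @ W_e                    # (seq_len,) end logits
    return Z_b, Z_e

rng = np.random.default_rng(3)
H = rng.normal(size=(512, 16))       # one segment's output embeddings
Z_b, Z_e = span_logits(H, rng.normal(size=16), rng.normal(size=16),
                       rng.normal(size=(16, 16)))
print(Z_b.shape, Z_e.shape)  # → (512,) (512,)
```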
Model                            #Params
BERT (Devlin et al., 2019)       110M
ROBERTA (Liu et al., 2019)       125M
LF (Beltagy et al., 2020)        149M
ETC (Ainslie et al., 2020)       166M
BIGBIRD (Zaheer et al., 2020)    166M
READTWICE (ENTITY)               145M

For the loss function we largely follow Clark and Gardner (2018) and Cheng et al. (2020), who describe an efficient way to train an extractive QA system in a multi-segment setting where there are multiple correct answer spans in the evidence. Let B = {(i, j) | an answer span starts at position j in segment i}. Then the loss for the begin position is the OR-model with global normalization:

L^(b)_span = −log ( Σ_{(i,j)∈B} exp(Z^(b)_{i,j}) / Σ_{(i,j)} exp(Z^(b)_{i,j}) ).

The loss for the end position of the answer span, L^(e)_span, is computed in a similar way. During inference, the model picks the most confident prediction among all segments.
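A numerically stable sketch of this globally normalized OR loss for begin positions; the segment/position bookkeeping is illustrative:

```python
import numpy as np

def or_span_loss(begin_logits, gold_starts):
    """Multi-segment extractive-QA loss with global normalization:
    -log of the summed probability of all correct begin positions,
    normalized over every token in every segment (the 'OR' model).
    begin_logits: list of per-segment logit vectors;
    gold_starts: set of (segment, position) pairs with correct starts.
    """
    flat, gold = [], []
    for i, z in enumerate(begin_logits):
        for j, v in enumerate(z):
            flat.append(v)
            gold.append((i, j) in gold_starts)
    flat = np.array(flat)
    # log of the global partition function over all tokens
    log_z = np.log(np.exp(flat - flat.max()).sum()) + flat.max()
    gold_scores = flat[np.array(gold)]
    # log of the summed (unnormalized) probability of correct starts
    log_num = np.log(np.exp(gold_scores - gold_scores.max()).sum()) + gold_scores.max()
    return log_z - log_num   # = -log p(any correct begin position)

z = [np.array([0.0, 2.0, 0.0]), np.array([1.0, 0.0])]
loss = or_span_loss(z, {(0, 1), (1, 0)})
print(loss)
```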

C.2 HotpotQA
Data download link: https://hotpotqa.github.io
Unlike the other datasets, HotpotQA contains questions with "yes"/"no" answers. To handle them appropriately, we introduce an additional classification layer on top of READTWICE's output CLS representation. The layer produces scores for the three possible options: the answer is "yes", the answer is "no", or the answer is a span in the document. During training we normalize these scores globally across all passages. The loss function for the option classifier is the negative log-likelihood of the correct option, applied only to the two supporting paragraphs (not the distractors). During inference we select the option with the highest score across all paragraphs. If the selected option is "yes" or "no", we use it as the model's prediction; if the classifier predicts that the answer is a span, we use the standard extractive QA layer to extract it.
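Inference for this yes/no/span classifier can be sketched as follows; the score values and the helper name are illustrative:

```python
import numpy as np

def answer_option(cls_logits, span_pred):
    """Inference sketch for yes/no handling: each paragraph's CLS
    representation yields three option scores (yes / no / span); we pick
    the highest-scoring option across all paragraphs and fall back to
    the extractive span prediction when 'span' wins.
    cls_logits: (num_paragraphs, 3) option scores.
    """
    opts = ["yes", "no", "span"]
    _, best_opt = np.unravel_index(np.argmax(cls_logits), cls_logits.shape)
    choice = opts[best_opt]
    return choice if choice != "span" else span_pred

logits = np.array([[0.1, 0.2, 0.3],    # paragraph 0
                   [0.0, 1.5, 0.4]])   # paragraph 1: 'no' wins globally
print(answer_option(logits, "Sammath Naur"))  # → no
```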
Model selection The model was selected based on the highest F1 (answer only) score on the development set. The best model was trained with a learning rate of 3 × 10^−5 for 6 epochs.

C.3 TriviaQA
Data download link: https://nlp.cs.washington.edu/triviaqa
The complete set of TriviaQA evaluation results is shown in Table 7.
Model selection The model was selected based on the highest F1 score on the development set. The best model was trained with a learning rate of 1 × 10^−5 for 5.3 epochs.

C.4 NarrativeQA
Data download link: https://github.com/deepmind/narrativeqa
Here we provide more information on the NarrativeQA dataset. In particular, we would like to point out that the manner in which NarrativeQA questions were generated encourages questions that require information from distant parts of the documents.
Every question was written given a short Wikipedia summary of a movie or book as context. Accordingly, models can achieve high accuracy on NarrativeQA when given the summary as evidence, rather than the whole book (Mou et al. (2020) and Cheng et al. (2020) report ROUGE-L scores of 57.19 and 60.5 on the dev set, respectively). However, each sentence in the Wikipedia summary might correspond to a whole chapter of a book, so any question that draws on information from multiple summary sentences is likely to require information from different sections of the book. Indeed, retrieve-and-read methods perform poorly in this setting (Mou et al., 2020). READTWICE, on the other hand, performs significantly better, demonstrating its ability to capture long-range dependencies.
Preprocessing In contrast to HotpotQA and TriviaQA, answers to NarrativeQA questions do not necessarily correspond to spans in the document. An exact answer can be found in only ≈ 40% of cases; for the rest of the questions we use a ROUGE-L oracle to derive labels.
Fine-tuning Similarly to Févry et al. (2020), we found it helpful to enforce sparsity in the MemoryAttention layer by computing the attention in Equations 1 and 4 only over the 100 memories m with the largest dot product against the hidden state h_ij.
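The top-100 sparsification can be sketched per token as follows; for brevity this illustrative helper drops the position scores and no-op memory of the full layer:

```python
import numpy as np

def topk_memory_attention(h, mems, k=100):
    """Sparsified MemoryAttention sketch: attend only over the k memories
    with the largest dot product against the token's hidden state (k = 100
    in the NarrativeQA fine-tuning described above).
    """
    scores = mems @ h
    top = np.argsort(scores)[-min(k, len(scores)):]  # indices of top-k scores
    w = np.exp(scores[top] - scores[top].max())      # stable softmax over top-k
    w /= w.sum()
    return w @ mems[top]

rng = np.random.default_rng(4)
h = rng.normal(size=8)
mems = rng.normal(size=(500, 8))
out = topk_memory_attention(h, mems, k=100)
print(out.shape)  # → (8,)
```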
Evaluation In line with previous work (Kociský et al., 2018; Mou et al., 2020), we convert both hypothesis and reference to lowercase and remove trailing periods before running the evaluation script. Following Mou et al. (2020), we use an open-source library to perform evaluation, which includes ROUGE-L, BLEU-1, BLEU-4 and METEOR scores.
Model selection The model was selected based on the highest ROUGE-L score on the development set. The best model was trained with a learning rate of 5 × 10^−6 for 2.2 epochs.