Beyond Paragraphs: NLP for Long Sequences

In this tutorial, we aim to bring interested NLP researchers up to speed on recent and ongoing techniques for document-level representation learning. Additionally, our goal is to reveal new research opportunities to the audience, which will hopefully bring us closer to addressing existing challenges in this domain.


Introduction
A significant subset of natural language data includes documents that span thousands of tokens. The ability to process such long sequences is critical for many NLP tasks, including document classification, summarization, multi-hop and open-domain question answering, and document-level or multi-document relation extraction and coreference resolution. These tasks have important practical applications in domains such as scientific document understanding and the digital humanities (Ammar et al., 2018; Cohan et al., 2018; Kociský et al., 2018; Lo et al., 2020; Wang et al., 2020a). Yet, scaling state-of-the-art models to long sequences is challenging, as many models are designed and tested for shorter sequences. One notable example is transformer models (Vaswani et al., 2017), which have O(N^2) computational cost in the sequence length N, making them prohibitively expensive to run for many long-sequence tasks. This is reflected in many widely-used models such as RoBERTa and BERT, where the sequence length is limited to only 512 tokens.
We will first provide an overview of established long-sequence NLP techniques, including hierarchical, graph-based, and retrieval-based methods. We will then focus on the recent long-sequence transformer methods, how they compare to each other, and how they can be applied to NLP tasks (see Tay et al. (2020) for a recent survey). We will also discuss various memory-saving methods that are key to processing long sequences. Throughout the tutorial, we will use classification, question answering, and information extraction as motivating tasks. In the end, we will have a hands-on coding exercise focused on summarization.

Description
Tutorial Content This tutorial covers methods for long-sequence processing and their application to NLP tasks. We will start by explaining why processing long sequences is difficult. Many popular models scale poorly with the sequence length, either in computational or memory requirements, making them too expensive or impossible to run on current hardware. Another reason is that we want models that can capture long-distance information while ignoring large amounts of irrelevant text. The introduction also covers the tasks that we will use throughout the tutorial, namely information extraction (relation extraction (Jia et al., 2019) and coreference resolution (Pradhan et al., 2012; Bamman et al., 2020)), question answering (especially the multi-hop setting, as in HotpotQA (Yang et al., 2018) and WikiHop (Welbl et al., 2018)), and document classification and summarization.
The next section will review well-established methods for dealing with long sequences, namely chunking and graph-based methods. Chunking refers to splitting the sequence into smaller chunks, processing each one independently, then aggregating them in a task-specific way (Joshi et al., 2019). Hierarchical models are a special case of chunking in which the chunks are linguistic constructs (usually sentences) that are aggregated following the document hierarchy (Yang et al., 2016). Finally, retrieval-based methods use a simple, recall-optimized model to retrieve short text snippets relevant to the task, then follow up with a stronger, more expensive model. Retrieval methods have been discussed in detail in the Open-Domain QA tutorial (Chen and Yih, 2020), so we will cover them only briefly here. Graph-based methods will also be discussed, with a focus on question answering. These methods usually use local context to identify potentially relevant information across the document, heuristically connect the identified snippets in a graph, then apply a graph neural network (Kipf and Welling, 2017) to propagate information between them. This is particularly effective in the multi-hop reasoning setting (Fang et al., 2019).
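The chunk-then-aggregate pattern described above can be sketched in a few lines. In the sketch below, `encode_chunk` is a hypothetical stand-in for any short-sequence model, and mean pooling is just one possible task-specific aggregation; none of the names come from a specific library.

```python
# Minimal sketch of chunking for long-document classification.
# `encode_chunk` is a hypothetical per-chunk scorer standing in for a
# short-sequence model; mean pooling is one task-specific aggregation.

def chunk_tokens(tokens, chunk_size=512, overlap=64):
    """Split a token sequence into overlapping fixed-size chunks."""
    step = chunk_size - overlap
    return [tokens[start:start + chunk_size]
            for start in range(0, max(len(tokens) - overlap, 1), step)]

def classify_long_document(tokens, encode_chunk, chunk_size=512, overlap=64):
    """Score each chunk independently, then aggregate by averaging."""
    chunks = chunk_tokens(tokens, chunk_size, overlap)
    scores = [encode_chunk(chunk) for chunk in chunks]
    return sum(scores) / len(scores)

# toy usage: a 1200-token document yields three chunks of at most 512 tokens
doc = list(range(1200))
print(len(chunk_tokens(doc)))  # → 3
```

The overlap between neighboring chunks is a common mitigation for information that would otherwise be split across a chunk boundary.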
Next, we will focus on the recent transformer-based methods for efficient processing of long sequences. The key question these models address is how to perform the expensive O(N^2) self-attention computation efficiently. They all make this computation faster by approximating the full self-attention, leading to different models with different behaviors and applications. We will survey a few of the key papers summarized in Tay et al. (2020). In particular, we will talk about Transformer-XL (Dai et al., 2019), Longformer (Beltagy et al., 2020), Reformer (Kitaev et al., 2020), and Linformer (Wang et al., 2020b). We will also discuss how they apply to NLP tasks: Transformer-XL is mainly suitable for autoregressive tasks, while the other three are equally suitable for autoregressive and bidirectional tasks. We will compare the performance of the latter three models on various NLP tasks.
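As a concrete illustration of one family of approximations, the sliding-window ("local") attention pattern used by models such as Longformer restricts each token to a fixed window of neighbors, so the number of attended pairs grows linearly in the sequence length rather than quadratically. The sketch below only builds the attention mask; it illustrates the idea and is not any model's actual implementation.

```python
# Sketch of a sliding-window (local) self-attention pattern: token i may
# attend only to tokens j with |i - j| <= window. Illustration only,
# not an actual model implementation.

def sliding_window_mask(n, window):
    """Boolean n x n mask; True where attention is allowed."""
    return [[abs(i - j) <= window for j in range(n)] for i in range(n)]

def attended_pairs(n, window):
    """Number of (query, key) pairs the sparse pattern keeps."""
    return sum(sum(row) for row in sliding_window_mask(n, window))

# full self-attention computes n^2 pairs; the windowed pattern keeps
# O(n * window) of them
print(attended_pairs(8, 2), "of", 8 * 8)  # → 34 of 64
```

For a realistic setting such as n = 4096 and window = 256, the savings are roughly an order of magnitude, and they grow with the sequence length.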
The next section discusses pretraining and finetuning of the transformer models. For pretraining, we will discuss different approaches to warm-start the model weights from existing pretrained models for short sequences (Gupta and Berant, 2020; Beltagy et al., 2020). These approaches are versatile and make it possible to adapt most existing short-sequence pretrained transformer models into models that can process long sequences at a small additional pretraining cost. We will also demonstrate how to finetune such models for tasks such as question answering and classification.
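One such warm-start trick, reported for Longformer (Beltagy et al., 2020), initializes the long model's position embeddings by copying the short model's 512-position table repeatedly. A toy version of the bookkeeping, using plain lists instead of real embedding matrices:

```python
# Toy sketch of warm-starting long-sequence position embeddings by tiling
# a pretrained short-sequence table (as done for Longformer); real code
# would operate on embedding matrices, not Python lists.

def extend_position_embeddings(short_table, new_max_len):
    """Repeat a short position-embedding table to cover new_max_len positions."""
    short_len = len(short_table)
    return [short_table[i % short_len] for i in range(new_max_len)]

# a 2-row stand-in for a 512-row table, tiled out to 5 positions
short = [[0.0], [1.0]]
print(extend_position_embeddings(short, 5))  # → [[0.0], [1.0], [0.0], [1.0], [0.0]]
```

The intuition is that nearby positions keep the same local relative structure the short model learned, which is a much better initialization than random.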
The following section is a practical use case on summarization. We will show how to start from the BART (Lewis et al., 2020) checkpoint, convert it into a model that can work with inputs that are tens of thousands of tokens long, then finetune it on a long-input summarization task. We will also discuss practical techniques necessary to run the model on current hardware, including memory-optimization techniques such as gradient checkpointing (Chen et al., 2016) and gradient accumulation. These are generic memory-saving methods applicable to all neural models, and especially useful in the long-sequence setting.
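Gradient accumulation, for instance, sums gradients over several micro-batches and applies a single averaged optimizer update, simulating a larger batch within fixed memory. A framework-free sketch of that bookkeeping, with gradients represented as plain lists of numbers:

```python
# Framework-free sketch of gradient accumulation: gradients from
# `accum_steps` micro-batches are averaged before each optimizer update,
# simulating a larger effective batch without extra activation memory.

def accumulate_gradients(microbatch_grads, accum_steps):
    """Return the averaged gradient produced at each optimizer step."""
    updates, buffer = [], None
    for step, grad in enumerate(microbatch_grads, start=1):
        # add this micro-batch's gradient into the running buffer
        buffer = grad if buffer is None else [a + b for a, b in zip(buffer, grad)]
        if step % accum_steps == 0:
            # one optimizer step: apply the mean of the accumulated gradients
            updates.append([g / accum_steps for g in buffer])
            buffer = None
    return updates

# four micro-batch gradients, accumulated two at a time -> two updates
print(accumulate_gradients([[1.0], [3.0], [2.0], [6.0]], 2))  # → [[2.0], [4.0]]
```

Gradient checkpointing is complementary: it trades compute for memory by discarding intermediate activations during the forward pass and recomputing them during backpropagation.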
Finally, the future work section will discuss open questions and future research directions, such as pretraining objectives better suited for long documents, encoder-decoder models with long output sequences, the balance between two-stage retrieval methods and single-stage methods with long inputs, and how to think about long-sequence scaling for large models, where the self-attention compute overhead shrinks relative to the feed-forward layers.
Relevance to ACL The models we cover are generic machine learning tools, but we discuss them from the NLP perspective and study their application to core NLP tasks like IE, QA, and text generation. These methods have the potential to improve tasks that are currently challenging, like multi-document summarization, story generation, and long dialogues. They can also enable new applications that have not yet been considered.

Type of the tutorial
This is a cutting-edge tutorial. The methods we discuss, especially the transformer-based and the graph-based methods, are active areas of research.

Outline
This tutorial will be 3 hours long.

1. Introduction (15 minutes): This section will introduce the theme of the tutorial: why processing long sequences is important and why it is difficult. It will also introduce the NLP end-tasks that we will use throughout the tutorial.
2. Chunking, hierarchical, and graph-based methods (35 minutes): This section discusses graph-based methods and their application to information extraction and question answering, especially in the multi-hop reasoning setting. It also covers chunking and hierarchical methods as applied to coreference resolution, classification, and question answering.
3. Transformer-based methods (45 minutes): This section reviews recently introduced long-sequence transformer models, compares the pros and cons of their designs, and discusses their applicability to NLP applications.
4. Pretraining and finetuning (25 minutes): This section discusses how the long-sequence transformer methods are pretrained and how they are finetuned for downstream tasks, including classification and question answering.
5. Use Case: Summarization (40 minutes): This section is a practical exercise in which we demonstrate in code how to build and train a long-document summarization model. It also covers the technical details of multiple memory-saving methods that are key for training models on long sequences, including gradient accumulation and gradient checkpointing.
6. Open problems and directions (20 minutes): In this final section, we will provide an outlook into the future, highlighting open problems and pointing to future research directions.

Breadth
We estimate 75% of the work covered will not be by the tutorial presenters.

Prerequisites
• Machine Learning: Basic knowledge of common recent neural network architectures such as RNNs and Transformers.
• Computational linguistics: Familiarity with standard NLP tasks such as text classification, natural language generation, and question answering.

Reading List
Reading the following papers is recommended but not required for attendance.

Estimated Attendance
Due to its broad appeal, we expect the tutorial to be well attended, with around 150 people. This is especially the case for the long-sequence transformer methods, because they open up pretrained models to applications that haven't been considered before. They are also easy to use, something that appeals to researchers and practitioners alike. This tutorial has not been previously offered, but some of the methods have been covered before. In particular, retrieval-based methods were covered in the Open-Domain QA tutorial at ACL 2020 (Chen and Yih, 2020), so we won't cover that topic in depth and will refer attendees to the previous tutorial.

Venue
The tutorial will be held at NAACL-HLT 2021.

Open Access
All the slides, video recordings, and software used for the tutorial will be publicly available.