AdaptSum: Towards Low-Resource Domain Adaptation for Abstractive Summarization

State-of-the-art abstractive summarization models generally rely on extensive labeled data, which lowers their generalization ability on domains where such data are not available. In this paper, we present a study of domain adaptation for the abstractive summarization task across six diverse target domains in a low-resource setting. Specifically, we investigate the second phase of pre-training on large-scale generative models under three different settings: 1) source domain pre-training; 2) domain-adaptive pre-training; and 3) task-adaptive pre-training. Experiments show that the effectiveness of pre-training is correlated with the similarity between the pre-training data and the target domain task. Moreover, we find that continuing pre-training could lead to the pre-trained model’s catastrophic forgetting, and a learning method with less forgetting can alleviate this issue. Furthermore, results illustrate that a huge gap still exists between the low-resource and high-resource settings, which highlights the need for more advanced domain adaptation methods for the abstractive summarization task.


Introduction
Abstractive summarization models aim to extract essential information from long documents and to generate short, concise and readable text. Recently, neural abstractive summarization models have achieved remarkable performance (Gehrmann et al., 2018;Paulus et al., 2018), and large-scale generative pre-training (Lewis et al., 2019;Raffel et al., 2019) has shown itself to be surprisingly effective at generation tasks, including abstractive summarization. However, these models generally require large numbers of human-annotated summaries to achieve state-of-the-art performance, which makes them not scalable to low-resource domains where only a few labeled data are available.
Domain adaptation methods have naturally arisen to tackle the low-resource issue and enable models to quickly adapt to target domain tasks. Yet, despite their practicality, very few studies have used domain adaptation methods on the lowresource scenario for the abstractive summarization task. To address this research gap, we present AdaptSum, the first benchmark to simulate the low-resource domain Adaptation setting for abstractive Summarization systems with a combination of existing datasets across six diverse domains (dialog (Gliwa et al., 2019), email (Zhang and Tetreault, 2019), movie review (Wang and Ling, 2016), debate (Wang and Ling, 2016), social media (Kim et al., 2019), and science (Yasunaga et al., 2019)), and for each domain, we reduce the number of training samples to a small quantity so as to create a low-resource scenario.
Recently, conducting a second pre-training step on large-scale language models (e.g., BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019a)) has proven to be effective for domain adaptation tasks (Lee et al., 2020;Gururangan et al., 2020). However, the current methods incorporating such a step are mainly focused on classification or classification-based (e.g., named entity recognition) tasks, leaving a research gap in exploring their use for generation tasks. In this paper, we systematically investigate adding a second phase of pre-training on large-scale generative models under three settings: 1) source domain pre-training (SDPT) based on a labeled source domain summarization dataset; 2) domain-adaptive pre-training (DAPT) based on an unlabeled substantial domain-related corpus; and 3) task-adaptive pre-training (TAPT) based on an unlabeled smallscale task-related corpus. The second phase of pre-training could cause the catastrophic forgetting in the pre-trained model. Thus, we propose to apply RecAdam  into the pre-training process to alleviate this issue and further improve the adaptation performance.
Experimental results show that SDPT and TAPT can generally improve on the performance of the fine-tuning method, while the effectiveness of DAPT is correlated to the similarity between the pre-training data and the target domain task data. Different from previous insights into adaptive pretraining on classification tasks (Gururangan et al., 2020), we find that in the summarization task, DAPT could make the adaptation performance worse, even though the pre-training corpus is collected from domain-related sources. Furthermore, we show that RecAdam can further boost the performance of the second pre-training step by effectively maintaining the pre-trained model's knowledge gained in the first phase of pre-training.
Our contributions are summarized as follows: • We introduce a low-resource domain adaptation scenario for the abstractive summarization task to move towards the fast adaptation of summarization systems.
• To the best of our knowledge, we are the first to systematically study the domain-and taskadaptative pre-training for a low-resource generation task.
• Our work highlights the research questions and challenges in the low-resource abstractive summarization task, which we hope will catalyze research in this area.
2 Related Work

Abstractive Summarization
Abstractive summarization aims to generate short, concise and readable text that captures the core meaning of the input documents. Neural networks have achieved remarkable results for the abstractive summarization due to the emergence of Seq2Seq models (Sutskever et al., 2014) and attention mechanisms (Bahdanau et al., 2014). See et al. (2017), Paulus et al. (2017) and Gehrmann et al. (2018) applied a pointer network to solve the outof-vocabulary issue. Further, See et al. (2017) used a coverage mechanism (Tu et al., 2016) to keep track of the already summarized content, which discourages repetition, while Paulus et al. (2017) and Chen and Bansal (2018)  pre-trained language models (Peters et al., 2018;Radford et al., 2018;Devlin et al., 2019;Dong et al., 2019;Lewis et al., 2019) have achieved impressive gains in a wide variety of natural language tasks. Many studies on the use of pre-trained language models in the abstractive summarization task (Liu and Lapata, 2019;Yu et al., 2020) have been undertaken and have achieved the state-of-the-art performance.

Domain Adaptation
Domain adaption for natural language processing and computer vision tasks is widely studied (Blitzer et al., 2007;Mansour et al., 2008;Daumé III, 2009;Sandu et al., 2010;Foster et al., 2010;Wang and Cardie, 2013;Sun et al., 2016;Liu et al., 2019bLiu et al., , 2020bGururangan et al., 2020;Winata et al., 2020;Jadon, 2020;Yin, 2020;Liu et al., 2020a,d). However, little has been done to investigate domain adaption for the abstractive summarization task. Hua and Wang (2017) first studied the adaptation of neural summarization models and showed that the models were able to select salient information from the source domain data. Wang et al. (2019) investigated the domain shift problem for the extractive summarization task. Recently, Magooda and Litman (2020) studied cross-domain transfer between two entirely different domains and introduced data synthesis methods. To the best of our knowledge, we are the first to systematically study the domain-and task-adaptative pre-training based on the pre-trained generative model in the low-resource abstractive summarization task across multiple diverse domains.

AdaptSum
The goal of AdaptSum is to provide an accessible benchmark for the evaluation of low-resource domain adaptation for abstractive summarization on a diverse set of domains. The vocabulary overlaps between domains are shown in Figure 1. AdaptSum consists of six diverse target domains and the corresponding unlabeled domain-related corpora for DAPT. We provide the data statistics of all domains in Table 1, and the details are as follows.
Email Zhang and Tetreault (2019) introduced an abstractive business and personal email summarization dataset which consists of email and subject pairs. We collect the unlabeled email corpus from the Enron Email Dataset. 3 Movie Review Wang and Ling (2016) introduced a human-annotated abstractive movie review summarization dataset. We collect the unlabeled corpus for this domain from IDMB Movie Review (Maas et al., 2011).
Debate Wang and Ling (2016) introduced an abstractive debate summarization dataset which consists of arguments and the debate topic pairs. The unlabeled corpus is from Ajjour et al. (2019). Kim et al. (2019) introduced an abstractive summarization dataset of Reddit TIFU posts, where the summary for each post come from its title. We collect the unlabeled corpus directly from Reddit TIFU. 4

Social Media
Science Yasunaga et al. (2019) introduced a human-annotated abstractive summarization dataset on computational linguistics. We collect the unlabeled domain corpus from the ACL anthology (Bird et al., 2008).

Methodology
In this section, we will first introduce the three different settings that we investigate for a second pre-training step. Then, we will discuss how we cope with the catastrophic forgetting issue in the second phase of pre-training.

A Second Phase of Pre-Training
We conduct a second pre-training phase based on a pre-trained generative model, BART (Lewis et al., 2019), on three different settings. Then, we finetune it to the summarization task in the target domains. The three settings are described as follows.
Source Domain Pre-Training (SDPT) Inspired by the cross-domain setting (Jia et al., 2019;Liu et al., 2020c,d), we leverage substantial training samples from a source (News) domain (XSum (Narayan et al., 2018)), to aid in the fast adaptation in target domains. We choose the News domain as the source domain because it is a richresource domain in the summarization task, and from Figure 1, the similarity between this domain and target domains is generally low which increases the challenge of the domain adaptation. Our method to conduct SDPT is straightforward. We continue pre-training BART using the source domain summarization data. The objective function for this pre-training is not the sentence reconstruction, as in the original pre-training of BART. Instead, we utilize the supervisions from the source domain summarization data to train BART on the summarization task. The purpose of this pre-training is to inject the task knowledge into the pre-trained language model so that the model can quickly adapt to the same task in target domains.

Domain-Adaptive Pre-Training (DAPT)
We leverage an unlabeled domain-related corpus to continue pre-training BART using its original pretraining objective function (corrupting documents and then optimizing a reconstruction loss-the cross-entropy between the decoder's output and the original document). The intuition behind this method is to introduce the domain knowledge into the pre-trained language model so as to enable its fast adaptation to the target domains.
Task-Adaptive Pre-Training (TAPT) The size of the domain-related corpus for DAPT is usually enormous, which results in two potential drawbacks. First, such a large corpus might not be always available, especially for the low-resource domains. Second, pre-training on such a large corpus is time-consuming and requires excessive computational resources. Therefore, investigating pretraining on a smaller unlabeled corpus is a practical and beneficial research direction. TAPT refers to pre-training on a set of the unlabeled documents in the target domain's summarization task. Compared to DAPT, TAPT uses a much smaller but far more task-relevant pre-training corpus since it directly uses the input documents from summarization task. This setting makes TAPT much less expensive to run and independent of the collection of the large domain-related corpus.

Recall and Learn
Although the second pre-training step allows the pre-trained model to learn the task or domain knowledge, it might lead to the catastrophic forgetting issue and cause the pre-trained model to partly lose the language understanding ability that it gains in the first pre-training step. To alleviate this issue, we expect the pre-trained model to recall the previously learned knowledge during the process of learning new knowledge. A straightforward way to achieve this goal is to borrow the idea of continual learning methods (Kirkpatrick et al., 2017;Lopez-Paz and Ranzato, 2017;. In this paper, we adopt RecAdam from  for the second phase of pretraining to weaken the catastrophic forgetting issue. The reason for choosing RecAdam is twofold: 1) it does not require the first step pre-training data from the pre-trained model, which is usually not available; 2) it is the most recent approach that is being successfully applied to natural language processing tasks. The RecAdam is introduced as follows.
Based on the Adam optimizer (Kingma and Ba, 2015), RecAdam reconstructs the objective function to allow it to gradually shift to the target task: where k and t 0 are the hyper-parameters controlling the annealing rate and time steps, Loss T represents the target task objective function, and Loss S is used to simulate the first pre-training step of the pre-trained model. Loss S can be simplified as: where 1 2 γ is the coefficient of the quadratic penalty, θ is the parameters of the model, and θ * (fixed) is the original parameters of the pre-trained model.
Although RecAdam has shown its effectiveness in fine-tuning BERT-like models (e.g., BERT (Devlin et al., 2019) and ALBERT (Lan et al., 2020)) to the GLUE benchmark (Wang et al., 2018), exploring the effectiveness of RecAdam in the second phase of pre-training for generative pre-trained models is not trivial. First, the second pre-training step of a language model is a completely different task compared to fine-tuning to downstream tasks. Second, a generative model (e.g., BART) is structurally different from BERT-like models. Third, the corpus sizes for SDPT and DAPT are generally much larger than the sizes of GLUE tasks, which could affect the learning process.

Experimental Setup
Training Details We evaluate all of our models on AdaptSum. For the dialog and email domains, we use the standard splits of (Gliwa et al., 2019;Zhang and Tetreault, 2019), while for movie review, debate, social media and science domains, we split the whole dataset into training, validation and test sets by ourselves since the original works do not specify how to split these datasets or the published datasets do not contain the split training, validation and test sets. Since the dataset sizes are limited for science, movie review and dialog domains, the maximum training samples for these domains are   100, 300, and 300, respectively, while for dialog, email, and social media domains, the maximum training samples for them are 14732, 14436, and 60354, respectively, and we select 300 samples for each domain to construct a low-resource setting. We truncate the input documents into 1024 tokens due to the limitation of the maximum input length for BART. For all the experiments, we use the BART-base version to implement our models. We use a mini-batch size of 4 with a gradient accumulation for 10 iterations. We use Adam optimizer with momentum β 1 = 0.9, β 2 = 0.998 and noam decay with warm up steps of 1000. In the decoding stage, we use beam search with a beam size of 4. The decoding process will not stop until an end-ofsequence (EOS) token is emitted or the length of the generated summary reaches to 256 tokens. As for the hyperparameters of RecAdam, we select the best t0 and k in {500, 600, 700, 800, 900, 1, 000} and {1e − 2, 1e − 3, 1e − 4, 1e − 5, 1e − 6}, respectively, for the annealing coefficient λ(t) (Eq. 2).
Baseline As our baseline, we use an off-the-shelf BART model (Lewis et al., 2019) and perform supervised fine-tuning of its parameters for the summarization task in each domain. BART serves as a good baseline since it provides the state-of-the-art performance in the summarization task. And, as a single generative language model, it can be easily adapted to different target domains.

Evaluation Metrics
We use ROUGE (Lin and Hovy, 2003) to measure the quality of the summary produced in our experiments. Following the previous work (Nema et al., 2017), we report ROUGE F1 (ROUGE-1) on the AdaptSum dataset. 5 6 Results & Analysis

Main Results
From Table 2, we can see that SDPT is able to generally improve the summarization performance of the fine-tuning method for all domains. This is because SDPT teaches the model how to do the task using large numbers of annotated examples, which enables the model to adapt to target domains faster than the fine-tuning method, and SDPT is able to outperform both DAPT and TAPT in terms of the averaged ROUGE-1 score. The enormous unlabeled corpus makes DAPT quite effective in certain domains, such as email, debate and social media, with close to or more than 2 ROUGE-1 scores improvements over the fine-tuning baseline. As we can see from Table 3, although TAPT uses a far smaller pre-training corpus than DAPT, the performance of TAPT is on par with that of DAPT, which accords with the results in Gururangan et al.  Table 4: Vocabulary overlaps (%) between the pretraining corpus (for DAPT or TAPT) and the validation set of the summarization task for each domain. The numbers in the brackets denote the vocabulary overlap differences between the two pre-training corpora, and the bold numbers denote the large discrepancies.
the corpus used for DAPT is comparably small for movie review and science domains, and the number of data samples for XSum (204k) is also much smaller than those of DAPT corpora in many domains (e.g., email), which have more than 1M sentences. According to Eq. 1, extensive training data could result in a comparatively large Loss s (the model's parameters tend to be greatly modified) which lead to an unstable loss and a negative effect to the pre-training process. In addition, we find that RecAdam is originally shown to be effective at fine-tuning to the downstream GLUE tasks , the sizes of which are much smaller than the datasets used for SDPT and DAPT.

How Pre-training Data Affects DAPT
According to prior experiments on domain adaptation for classification or classification-based tasks (Beltagy et al., 2019;Lee et al., 2020;Gururangan et al., 2020), DAPT improves the performance for all domains on the fine-tuning baseline. However, as we can see from Table 2, DAPT causes the performance to drop for the movie review and science domains in the summarization task, while TAPT boosts the performance for all the domains. To further investigate the reasons, we aim to analyze the similarity (e.g., vocabulary overlap) between the pre-training corpus for DAPT and the summarization task in the target domain, which we represent with the target domain validation set of the summarization task to represent. We notice that it is difficult to justify how much overlap is large enough for DAPT to be considered as effective. Hence, we add the TAPT corpus, which is directly related to the target domain's summarization task, as an upper bound for the comparison.  Table 4 illustrates the vocabulary overlaps for DAPT and TAPT for each domain. 6 We find large discrepancies between the DAPT corpus and TAPT corpus on the movie review and science domains, which indicates that the domain-related corpora in these two domains are not quite related to the task domains, and pre-training on a domain-unrelated or less related corpus can lead to a performance drop compared to the fine-tuning method. Given that the corpus construction is done by looking for the domain-related sources (as mentioned in Section 3), the experimental results point out that collecting a domain-related corpus for DAPT in the summarization task is not straightforward. Thus, we leave exploring how to construct an effective corpus for DAPT for future work.

Catastrophic Forgetting Issue
We speculate that the second phase of pre-training will result in the catastrophic forgetting for the pre-trained model, which could hurt the adaptation performance. Figure 2 illustrates that the performance of TAPT without RecAdam keeps dropping as the pre-training continues, and it starts to perform worse than the fine-tuning method after three epochs' pre-training, while the performance of TAPT with RecAdam remains stable at around a 25.5 ROUGE-1 score. We conjecture that excessive pre-training makes the pre-trained model overfit to the pre-training data and partially lose its language   understanding and generation ability. However, the model is required to possess both language ability and domain knowledge for better performance in the domain adaptation task. RecAdam helps the pre-trained model preserve its original language ability while continuing pre-training on a new corpus, which boosts the effectiveness of pre-training. However, as we can see from Table 2, RecAdam fails to improve the performance on DAPT using large corpora. We speculate that the catastrophic forgetting issue does not do much harm to the performance of DAPT because pre-training on the large corpus enables the pre-trained model to possess a good language understanding ability in the target domain even though it could lead to partial forgetting in previous domains, and RecAdam makes DAPT stay somewhere in the middle (not forgetting much the previous learned knowledge, but not learning well in the target domain, either). It indicates that more advanced learning methods are needed for coping with the second pre-training phase on a large corpus.

Incorporating SDPT and DAPT
Intuitively, incorporating both the summarization task and target domain knowledge into the pretrained model could further boost the domain adaptation performance in the summarization task. Therefore, we propose to combine SDPT and DAPT in the second pre-training step. Since SDPT and DAPT use different objective functions, jointly learning these two tasks will make BART confused about what to generate (summarization or sentence  reconstruction) given the input sequences. To cope with this issue, we use two BART models (one for SDPT and one for DAPT) and share their encoders in this joint pre-training process to learn the knowledge from both the task and domain. Then, we use the BART model for SDPT to fine-tune to the summarization task in the target domain. As shown in Table 6, the experimental results are contradictory to the intuition. We find that SDPT+DAPT can not further improve upon the performance of SDPT and DAPT. For the dialog and social media domains, the performances of SDPT+DAPT stay between those of SDPT and DAPT, while for the science, movie review and email domains, the performances of SDPT+DAPT are even lower than that of the BART fine-tuning. We conjecture that SDPT and DAPT are two completely different tasks, and jointly pre-training based on them could confuse the model about the knowledge that it learns. However, integrating the task and domain knowledge is still a promising direction for domain adaptation. We leave how to incorporate SDPT and DAPT for future work.

Different Source Domain Data for SDPT
To explore how different source domain data can affect the performance of SDPT, we use another News domain dataset, CNN/Daily Mail (DM) dataset (Hermann et al., 2015;Nallapati et al., 2016), as the labeled summarization data for SDPT. As we can see from  Figure 3: ROUGE-1 results of BART fine-tuning, DAPT and SDPT over different numbers of training data for email (left) and dialog (right) domains. We consider both low-resource settings (50, 100, 200 and 300 (∼2%) samples), medium-resource settings (25% and 50% samples), and high-resource settings (75% and 100% samples). the averaged score, and for all the domains, it generally performs worse or similar compared to SDPT based on XSum. Since both of them are from the News domain but the number of training samples in CNN/DM (287k) is higher than that in XSum (204k), pre-training on CNN/DM should have achieved better performance than pre-training on XSum. To further analyze the reason, we calculate the averaged length of input documents and output summaries for the source and target domains. From Table 5, we find that the averaged length of XSum is much shorter than that of CNN/DM in terms of both document and summary, and surprisingly, SDPT based on XSum can outperform SDPT based on CNN/DM in domains with short length document and summary (e.g., debate and email) as well as the domains with long length document or summary (e.g., movie review and science). Hence, we conjecture that pre-training with relatively short document and summary is more effective for SDPT. Another reason can be attributed to the fact that the summaries of the CNN/DM tend to copy the content in the input documents, while XSum has larger amounts of novel tokens in the summaries. Therefore, we conjecture that XSum enables model learn a more powerful summarization ability, which helps it to better adapt to low-resource target domains. We leave investigating the effectiveness of different source domain datasets in SDPT for future work.

Performance vs. Training Sample Size
We investigate how well models perform in an extremely low-resource scenario (e.g., 50 training samples) and the performance discrepancies among different levels of resources. The performance over different numbers of training samples is illustrated in Figure 3. We find that BART fine-tuning with the 25% data samples significantly outperforms that with ∼2% data samples in the dialog domain, but such improvements are not remarkable in the email domain. We conjecture that the input and output lengths for the email domain are relatively short compared to the dialog domain (according to Table 5), making the domain adaptation easier.
Interestingly, DAPT outperforms other models in the medium-resource and high-resource settings in the email domain but not in the dialog domain. We speculate the reasons are twofold. First, based on the vocabulary overlaps from Table 4, the email corpus is more effective for DAPT than the dialog domain. Second, email corpus is much larger than the dialog corpus from Table 3. However, the performance of DAPT using a high-quality corpus will be still limited by the low-resource scenario, and it needs large enough training samples to achieve remarkable improvements. Moreover, the performance of TAPT is better than BART fine-tuning in the low-resource setting, while it becomes worse in the medium-resource and high-resource settings. We conjecture that training with more data will aggravate the catastrophic forgetting caused by TAPT, which leads to the worse performance.
Surprisingly, the performance of DAPT with medium-resource is close to that with highresource, which can be attributed to the combination of the powerful adaptation ability of the large pre-trained generative model and the effectiveness of the second phase of pre-training. However, there is still a large performance gap for all the models between the low-resource and high-resource settings and all the models perform badly when there is only 50 training samples, which highlights the needs for more advanced domain adaptation models for the summarization task.

Conclusion and Future Work
In this paper, we present AdaptSum, the first benchmark to simulate the low-resource setting for the abstractive summarization task with a combination of existing datasets across six diverse domains. We systematically study three different methods for a second phase of pre-training (i.e., SDPT, DAPT and TAPT), and propose to leverage RecAdam to alleviate the catastrophic forgetting issue caused by the continuing pre-training. Experiments show that SDPT and TAPT can generally improve on the performance of the fine-tuning method, while the effectiveness of DAPT depends on the similarity between the pre-training data and the target domain task data, which is different from the insights into DAPT for classification tasks. Further analysis illustrates that RecAdam successfully alleviates the catastrophic forgetting issue for TAPT and further boost its performance.
Finally, our work highlights several research challenges in low-resource domain adaptation for the abstractive summarization task: (1) How to construct an effective corpus for DAPT; (2) How to better cope with the catastrophic forgetting issue for the second pre-training phase on a large corpus; (3) How to effectively integrate the task and domain knowledge (i.e., incorporate SDPT and DAPT); (4) How to choose better source domain datasets for conducting SDPT; (5) How to build a more powerful domain adaptation models for the extremely low-resource summarization task. We hope that the proposed dataset and the highlighted research directions will accelerate the studies in this area.