Towards Zero Shot Conditional Summarization with Adaptive Multi-task Fine-Tuning

Automatic summarization research has traditionally focused on providing high quality general-purpose summaries of documents. However, there are many applications which require more specific summaries, such as supporting question answering or topic-based literature discovery. In this paper we study the problem of conditional summarization in which content selection and surface realization are explicitly conditioned on an ad-hoc natural language question or topic description. Because of the difficulty in obtaining sufficient reference summaries to support arbitrary conditional summarization, we explore the use of multi-task fine-tuning (MTFT) on twenty-one natural language tasks to enable zero-shot conditional summarization on five tasks. We present four new summarization datasets, two novel “online” or adaptive task-mixing strategies, and report zero-shot performance using T5 and BART, demonstrating that MTFT can improve zero-shot summarization quality.


Introduction
Transfer learning, in which a model is first pretrained on one or more data-rich tasks before being fine-tuned on a downstream task of interest, has repeatedly been shown to obtain remarkable performance on many natural language processing tasks (Dong et al., 2019; Liu et al., 2019b). The most successful models resulting from this paradigm rely on self-supervised pretraining with prohibitively large[1] datasets to facilitate adaptation to new tasks (i.e., fine-tuning) with less abundant data (Devlin et al., 2019; Lewis et al., 2020; Keskar et al., 2019; Raffel et al., 2019). Unfortunately, the benefits of pretraining are reduced for tasks in which there is little direct knowledge transfer, such as language generation for tasks and domains involving previously unseen lexical and semantic properties (as we demonstrate in this paper).

[1] As estimated by Strubell et al. (2019), the cost of training the 11-billion-parameter variant of T5 (Raffel et al., 2019) can exceed $1.3 million USD for a single run.

Figure 1: Example conditional summaries for two tasks.
(a) Health Question (Savery et al., 2020)
Document: Asthma is a condition in which your airways narrow and swell and produce extra mucus. This can make breathing difficult and trigger coughing, wheezing and shortness of breath. [...]
Question: What is the consensus of medical doctors as to whether asthma can be cured?
Summary: Asthma can't be cured, but its symptoms can be controlled. Because asthma often changes over time, it's important that you work with your doctor to track your signs and symptoms and adjust treatment as needed [...]
(b) TAC 2010 (Owczarzak and Dang, 2010)
Document: The United Nations Thursday set aside $1 million to assess environmental damage caused by this week's devastating tsunami, as reports of destroyed coral reefs and uprooted mangrove forests began trickling in [...]
Topic: Coral reefs
Summary: The waves of the tsunami in southeast Asia wreaked tremendous damage on coral reefs, but much more damage occurred when the waves receded, carrying [...]
Topic: Mangrove Forests
Summary: The recent 26 December 2004 tsunami in the Indian Ocean with destruction of mangrove forests has highlighted their environmental importance [...]
Transfer learning generalization failures are particularly problematic for a family of tasks we refer to as conditional summarization. Unlike traditional summarization, in which the goal is to produce an objective summary of the most salient information in a passage, in conditional summarization the selection of the most salient points (i.e., content selection), as well as how those points are expressed (i.e., surface realization), are explicitly conditioned on an ad-hoc context, such as a question or topic of interest, as illustrated in Figure 1. In this setting, the same passage may have very different ideal summaries depending on the summarization context, as shown in Figure 1b. Consequently, obtaining sufficient human-authored reference summaries for conditional summarization can be even more time- or cost-prohibitive than for traditional summarization, particularly when dealing with specialized domains such as healthcare.
In this paper, we explore the use of multi-task fine-tuning to enable zero-shot conditional summarization on previously unseen passages for previously unseen tasks. We report the impact of different tasks on zero-shot summary quality and the impact of different task mixing strategies for fine-tuning when applied to T5 (Raffel et al., 2019) and BART (Lewis et al., 2020). The primary contributions of this work are:

1. An analysis of the role of 21 question answering, single- and multi-document summarization, causal reasoning, and argumentation tasks on zero-shot domain-specific and general-domain conditional summarization tasks;

2. Four new summarization datasets that can be used by the community; and

3. Two novel methods for "online" or adaptive task mixing.

Background
From its inception, automatic summarization aimed to condense documents either in a generic way, conveying the main points of the document to any user, or by focusing on points tailored to specific users and applications, such as topic- or query-driven summarization (Mani, 2009). Our aims are even more specific than topic-driven summarization: we are interested in summarizing documents in response to ad-hoc natural language health-related questions asked by the general public. Summarizing information to generate answers to such questions can only rarely be reduced to topic-driven summarization, e.g., if a person is looking for general information about a given health condition or treatment; in over 90% of cases, health questions are more specific and focus only on particular aspects of the topic. For example, people may be looking for medications for a specific condition or asking how to store a drug. The summary, therefore, has to be tailored not only to the topic of the question and task but must also be restricted to the aspects of the topic that directly address the question.

In the open domain, previous community efforts to focus on topic-driven summarization include the Document Understanding Conference (DUC) and its successor, the Text Analysis Conference (TAC), both of which organized topic-based summarization tasks. In various iterations of these tasks, human assessors developed topic statements and document clusters for those topics, and then manually authored summaries based on the topic statements. The tasks' participants were asked to develop automatic summarization approaches for generating single- or multi-document summaries that contained information relevant to the topic statement. Other community efforts involving summarization include the BioASQ, CL-SciSumm, and Scholarly Document Processing challenges, which involve summarization of scientific articles. However, despite the attention that summarization has received in the natural language processing community and the recent development of more sophisticated summarization algorithms, the task of automatically generating human-quality summaries still poses many challenges.
A study of content selection across multiple domains, including medical articles, indicates that new forms of sentence representations and external knowledge sources are needed to identify the most suitable approaches to summarization (Kedzie et al., 2018). Recent work has shown that models with transformer-based architectures, coupled with unsupervised pretraining approaches, achieve state-of-the-art results in many text generation tasks. Building on this, researchers have recently shown that these models can be conditioned on a prompt included in the input text. For example, this prompt can guide the content of the generated text towards a desired topic (Keskar et al., 2019) or instruct the model to produce output for a specific task (Lewis et al., 2020; Raffel et al., 2019). Similar work on conditional generation includes an approach in which the authors condition an extractive transformer on control codes specifying position, importance, and diversity of the sentences in the source text.
There have been relatively few publications focused on zero-shot learning specifically for summarization. Duan et al. (2019) experiment with zero-shot learning for cross-lingual sentence summarization, while Liu et al. (2019a) explored zero-shot abstractive summaries of five-sentence stories.
Prior work indicates that topic- and question-driven summarization can be formulated as a text-to-text, conditional generation problem in which content selection and surface realization are explicitly conditioned on a user-specified prompt. This formulation of summarization intuitively dovetails with the goal described above: question-driven summarization of answers to users' health-related questions. In this study, we extend previous work done with BART and T5 by applying multi-task fine-tuning over a large body of tasks and exploring multiple mixing strategies to advance topic- and question-driven summarization in the open and medical domains.

Models
Several transformer-based models have been shown to generate high-quality natural language (Peters et al., 2018; Radford et al., 2018; Wang and Cho, 2019). The majority of these models cast summarization as language modeling, wherein the input to the model is the sequence of words in the source document followed by a mask token for each word in the desired summary (Keskar et al., 2019; Radford et al., 2019). This substantially limits the length of summaries that can be generated, due to the input sequence limits imposed during pretraining. Fortunately, more recent approaches use separate transformers for encoding and decoding, allowing the generation of sequences of potentially arbitrary length. In this work, we explored the two most notable of these approaches: BART and T5.

BART (Bidirectional and Auto-Regressive Transformers) is pre-trained with sentence ordering and token in-filling tasks (Lewis et al., 2020). BART uses a separate bidirectional encoder and autoregressive decoder similar to BERT, except that (1) BART's decoder incorporates cross-attention over the final encoder layer and (2) BART does not use an additional feed-forward network before word prediction. In our experiments, we used BART-Large, which includes 12 transformer layers in both the encoder and the decoder.
T5 (Text-to-Text Transfer Transformer) uses several pretraining objectives, including an unsupervised fill-in-the-blank task as well as supervised translation, summarization, classification, and reading comprehension tasks, where each task is represented as a language generation task (Raffel et al., 2019). T5 closely follows the originally-proposed Transformer architecture (Vaswani et al., 2017), except that it uses relative positional embeddings rather than sinusoidal encodings. In this work, we used T5-Base, which includes 12 transformer layers in both the encoder and the decoder.

Adaptive Multi-task Fine-Tuning
We adapt the text-to-text setting used to pre-train T5 (Raffel et al., 2019) to enable fine-tuning on a large body of tasks, with the intent of injecting knowledge from related natural language processing tasks to enable improved zero-shot conditional summarization. In this section, we describe (1) the fine-tuning tasks used in our experiments, (2) how these tasks are encoded as text generation, and (3) approaches for task mixing.

Fine-Tuning Tasks
We considered a total of 21 tasks related to summarization, question answering, commonsense reasoning, and argumentation; new summarization datasets or new extensions of previous datasets are denoted with an '*'.
BioASQ is a challenge for medical semantic indexing and question answering (QA) (Tsatsaronis et al., 2015). The QA challenges provide participants with questions, PubMed articles, snippets extracted from those articles, and human-generated answers to the questions. For single-document summarization, we used each extracted snippet as a summary of the corresponding article. For multi-document summarization, we used each human-generated answer as a summary of the corresponding set of articles. The single-document summarization dataset contains 27.1 K examples, and the multi-document summarization dataset contains 3.2 K examples.
CNN/DailyMail includes 287.1 K news articles, as well as highlights of the articles which are used as summaries (See et al., 2017;Hermann et al., 2015).
CoPA, the Choice of Plausible Alternatives dataset (Roemmele et al., 2011), provides 400 training examples of questions involving choosing the most plausible cause or effect entailed by a given premise; questions were drawn from (1) personal blog stories (Gordon and Swanson, 2009) and (2) subject terms from the Library of Congress Thesaurus for Graphic Materials.
Cochrane* contains 5.0 K reviews and plain-language summaries from the Cochrane Database of Systematic Reviews; we use only the main body of the review as the source document for single-document summarization.
Cosmos QA includes 287.1 K multiple-choice reading comprehension questions requiring commonsense causal reasoning; it focuses on cause and effect in everyday narratives (Huang et al., 2019).
CQaD-S* is based on a collection of consumer questions about drugs and answers to those questions manually extracted from reliable web pages (Ben Abacha et al., 2019); we adapted the 272 manually selected sections as question-driven summaries of their source documents.
EBM is a collection of Evidence-Based Medicine summaries, including questions, answers, justifications of those answers, and the references for those justifications (Molla and Santiago-Martinez, 2011). We adapted it for two multi-document summarization tasks: EBM Answers*, using the answers as the summaries and the abstracts from the reference articles as the sets of source documents, and EBM Justifications*, using the reference articles and the answer as the source text and the justification for the answer as the summary. This produced 1.2 K and 2.8 K examples, respectively.

IBM Evidence contains 4.3 K examples of questions with pairs of evidence, annotated for which evidence in the pair is the most convincing for answering the question; the training set includes 48 topics (Shnarch et al., 2018).
MC-TACO is a set of 13 K question-answer pairs requiring temporal commonsense comprehension; questions pertain to various temporal aspects of events, such as duration, frequency, and temporal order (Zhou et al., 2019).
MedlinePlus Summaries* contains summaries of health topics obtained from MedlinePlus, a service of the U.S. National Library of Medicine providing human-curated, reliable, and easy-to-understand articles about over 1 K health topics. Each article contains a summary of the topic and links to relevant web pages; we used the summary and the content of the linked pages to generate a multi-document summarization collection consisting of 969 examples.

PubMed PubSum* contains publisher-submitted summaries of PubMed articles written in consumer-friendly language; we collected 240 articles with accompanying summaries as a single-document summarization task.

Scientific Papers contains two sets of long documents and their abstracts, including 203.0 K articles from arXiv.org and 119.9 K articles from the Open Access Subset of PubMed Central® (Cohan et al., 2018).

SQuAD, the Stanford Question Answering Dataset, is a reading comprehension dataset consisting of 87.6 K questions over Wikipedia articles, where a question is considered unanswerable if the answer cannot be extracted from the corresponding passage (Rajpurkar et al., 2016).

Conditional Generation
As in Raffel et al. (2019), we used a text-to-text setting to train BART and T5 such that the model inputs and targets are both encoded as sequences of tokens. For summarization tasks, the input was provided as <task-name> [question: <question>] summarize: <document> and the target was the reference summary, where the conditional summarization context (if applicable) is provided in the question portion of the input. For question answering and reading comprehension tasks, the input was provided as <task-name> question: <question> [choice: <choice>...] context: <document> and the target was either (a) True or False for binary choice questions, or (b) the text of the correct choice for n-ary choice questions.
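For illustration, this encoding can be implemented with simple string templates. The following is a minimal sketch; the function names and example values are our assumptions, not the paper's implementation.

```python
# Minimal sketch of the text-to-text encoding described above.
# Function names and example values are illustrative assumptions.

def encode_summarization(task_name: str, document: str, question: str = None) -> str:
    """Build the input string for a (conditional) summarization example."""
    parts = [task_name]
    if question is not None:
        parts.append(f"question: {question}")
    parts.append(f"summarize: {document}")
    return " ".join(parts)

def encode_qa(task_name: str, question: str, document: str, choices=None) -> str:
    """Build the input string for a QA / reading comprehension example."""
    parts = [task_name, f"question: {question}"]
    for choice in choices or []:
        parts.append(f"choice: {choice}")
    parts.append(f"context: {document}")
    return " ".join(parts)

# The target sequence is the reference summary for summarization tasks,
# and "True"/"False" or the text of the correct choice for QA tasks.
src = encode_summarization("bioasq", "Asthma is a condition ...",
                           question="Can asthma be cured?")
```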

Task Mixing
Neural models are notorious for overfitting data, particularly in the case of natural language text, for which transformer-based models have been shown to memorize spurious cues (Niven and Kao, 2019). A major factor in overfitting is the size of the data used for training, and, as documented in Section 4.1, the available training data for each of our fine-tuning tasks vary by orders of magnitude. In order to avoid overfitting small datasets (and to avoid overcorrecting and underfitting), for each fine-tuning step we sample a batch of data from a single task, assuming a Multinomial distribution over fine-tuning tasks. We refer to this distribution over tasks as the mixing rate $r_m$, such that $r_m$ indicates the probability that a batch will be drawn from fine-tuning task $m \in \{1, \cdots, M\}$. We explored four approaches to defining the mixing rate: proportional and temperature-scaled task mixing as in Raffel et al. (2019), and two novel "online" approaches, i.e., adaptive and self-adaptive mixing.
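To make the sampling procedure concrete, the sketch below draws one task per fine-tuning step from the mixing distribution; task_loaders, a list of per-task batch iterators, is a hypothetical helper.

```python
import numpy as np

def sample_task_batch(mixing_rates, task_loaders, rng):
    """Draw one fine-tuning batch: pick task m ~ Multinomial(mixing_rates),
    then take the next batch from that task's data loader."""
    m = rng.choice(len(mixing_rates), p=mixing_rates)
    return m, next(task_loaders[m])

# Usage sketch:
#   rng = np.random.default_rng(0)
#   task_id, batch = sample_task_batch(rates, task_loaders, rng)
```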

Proportional Mixing The most intuitive way to avoid overfitting is to define the mixing rate based on the proportion of data in each task compared to the total amount of data over all tasks. Formally, let $N_m$ be the size of the training set for task $m$. In proportional mixing, we define:

$$r_m = \frac{\min(N_m, K)}{\sum_{m'} \min(N_{m'}, K)}$$

where $K$ is a maximum data size constant used to prevent large datasets from dominating $r_m$. In our experiments we used $K = 2^{14}$.
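A minimal sketch of this computation follows; the helper name and example sizes are ours.

```python
import numpy as np

def proportional_rates(dataset_sizes, K=2**14):
    """r_m = min(N_m, K) / sum_m' min(N_m', K); K caps large datasets."""
    capped = np.minimum(np.asarray(dataset_sizes, dtype=float), K)
    return capped / capped.sum()

# e.g., BioASQ-single, CNN/DailyMail, CoPA, Cochrane sizes from Section 4.1
rates = proportional_rates([27_100, 287_100, 400, 5_000])
```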
Temperature-scaled Mixing Another way to handle the disparity between the data available for each task is to use temperature scaling. Formally, for temperature $T$, we take the $T$-th root of the mixing rate for each task and then renormalize, i.e.:

$$\tilde{r}_m = \frac{r_m^{1/T}}{\sum_{m'} r_{m'}^{1/T}}$$

When $T = 1$, temperature scaling reduces to proportional mixing, and as $T$ is increased, the mixing rates approach a uniform distribution. We considered temperature scaling as a means to reduce the ability of tasks with large datasets to eclipse tasks with significantly fewer examples.
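A sketch of the corresponding computation, reusing the proportional rates above (the function name is ours):

```python
import numpy as np

def temperature_scaled_rates(rates, T=2.0):
    """Take the T-th root of each proportional rate and renormalize.
    T = 1 recovers proportional mixing; larger T flattens toward uniform."""
    scaled = np.asarray(rates, dtype=float) ** (1.0 / T)
    return scaled / scaled.sum()
```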
Adaptive Mixing In addition to data size, a task's difficulty can have a strong impact on whether the model underfits or overfits a dataset. Even with temperature scaling, we observed that the model spent the majority of training steps on data-rich tasks and that the performance of the model on a task was not always proportional to the amount of data available for that task; some tasks were inherently harder for the model to adapt to. Consequently, we wanted to develop a mixing strategy that would decrease the time the model spent training on tasks it had already learned and increase the time spent on tasks it was still struggling with. Thus, to capture and account for task difficulty, we implemented an adaptive mixing strategy: after a certain number of warm-up epochs, the mixing rate is updated after each epoch proportionally to the average validation cross-entropy loss for each task and re-normalized. Formally:

$$r_m = \frac{\ell_m^{\gamma}}{\sum_{m'} \ell_{m'}^{\gamma}}$$

where $\ell_m$ is the average validation cross-entropy loss for task $m$ and $\gamma$ is a scaling constant akin to the focus parameter reported in Lin et al. (2020).
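Under the reconstruction above, the per-epoch update might look as follows; this is a sketch of our reading of the text, not the authors' exact implementation.

```python
import numpy as np

def adaptive_rates(val_losses, gamma=2.0):
    """After warm-up, set mixing rates proportional to each task's average
    validation cross-entropy loss raised to the scaling constant gamma."""
    powered = np.asarray(val_losses, dtype=float) ** gamma
    return powered / powered.sum()
```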

Self-adaptive Mixing While adaptive mixing can account for the difficulty of a task in terms of generalizability, it does not consider the degree to which the model has fit the training dataset, i.e., it does not account for bias in the fine-tuning data. Moreover, adaptive mixing requires the availability of validation data for each task used in fine-tuning, which may not always be available. For these reasons, we explored a second form of adaptive mixing in which the mixing rate is determined based on the training loss for each task. Unlike the validation loss setting above, using training loss is sensitive to epoch size: if the model has not explored a sufficient percentage of the training data for a task, the loss for that task may not accurately reflect the model's mastery of the task. Consequently, we needed to balance the exploration ratio $x_m$ of task $m$, i.e., the percentage of all training data for a task that has been seen by the model during fine-tuning, with the training loss on that task. Formally:

$$r_m \propto \big( x_m \ell_m + (1 - x_m)\,\bar{\ell}\, \big)^{\gamma}$$

where $\ell_m$ is the training loss for task $m$ and $\bar{\ell}$ is the macro-average cross-entropy training loss over all tasks. In this way, the model begins with a close-to-uniform mixing strategy and begins to favor tasks proportionally to the task's loss and exploration rate. As with adaptive mixing, we wait a certain number of warm-up epochs before computing the exploration rate or updating $r_m$.
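A sketch consistent with the reconstruction above; the interpolation between task loss and macro-average loss is our assumption, with only its ingredients (training loss, exploration ratio, macro-average loss) stated in the text.

```python
import numpy as np

def self_adaptive_rates(train_losses, exploration, gamma=2.0):
    """Blend each task's training loss with the macro-average loss according
    to its exploration ratio x_m, so unexplored tasks stay near uniform."""
    losses = np.asarray(train_losses, dtype=float)
    x = np.asarray(exploration, dtype=float)  # fraction of task data seen
    blended = (x * losses + (1.0 - x) * losses.mean()) ** gamma
    return blended / blended.sum()
```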

Experiments
In our experiments, we trained on the datasets described in Section 4.1 and evaluated on five tasks originating from three datasets previously unseen by the model. All models were trained with a batch size of 8, a maximum sequence length of 512 tokens, 3 warm-up epochs followed by 10 training epochs, and 1,000 batches per epoch, using single V100X GPUs (32 GB VRAM) on a shared cluster. Training took between four and six hours, depending on cluster load. Additional implementation details are provided in Appendix A. To reduce variance between runs, we report results with greedy decoding (i.e., no beam search). We measured the impact on zero-shot summary quality of (1) multi-task fine-tuning (MTFT), (2) different task mixing strategies, and (3) excluding various tasks from fine-tuning. We report traditional summarization and generation metrics such as BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), and METEOR (Banerjee and Lavie, 2005). Because the reference summaries for many tasks are highly abstractive, we also adopt the embedding-based metrics proposed in Sharma et al. (2017), i.e., GloVe (Pennington et al., 2014) cosine similarity using Embedding Averages (EACS), Vector Extrema (VECS; Forgues et al., 2014), and greedy matching (GMS; Rus and Lintean, 2012).
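For reference, the Embedding Average cosine similarity (EACS) can be sketched as below; glove is an assumed token-to-vector dictionary and the whitespace tokenization is illustrative.

```python
import numpy as np

def eacs(candidate: str, reference: str, glove: dict) -> float:
    """Cosine similarity between the average GloVe vectors of two texts."""
    def avg_embedding(text):
        vectors = [glove[tok] for tok in text.lower().split() if tok in glove]
        return np.mean(vectors, axis=0)
    c, r = avg_embedding(candidate), avg_embedding(reference)
    return float(np.dot(c, r) / (np.linalg.norm(c) * np.linalg.norm(r)))
```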

MEDIQA-AnS The MEDIQA-AnS collection contains consumer health questions, articles from reliable websites, passages extracted from those web pages, and single- and multi-document summaries of the passages intended to provide consumer-friendly answers to the questions (Savery et al., 2020). We used the 552 extractive single-document question-driven summaries.
DUC The Document Understanding Conference (DUC) was hosted by NIST from 2001 to 2007 to promote summarization research. In 2004, there were 50 questions, each associated with very short single-document summaries (limited to 75 bytes), while in 2007 there were 45 questions, each associated with long 10-document summaries (between 230 and 250 words). Documents were drawn from the AQUAINT English news corpus (Graff, 2002).
TAC The Text Analysis Conference (TAC) is the successor to DUC, with ongoing public challenges on summarization. In this work, we considered the 2009 and 2010 tracks. Both tracks explored summarizing sets of 10 newswire articles into 100-word reference summaries. In 2009, the track had 44 topics, each associated with a natural language topic description and four reference summaries (Dang and Owczarzak, 2009).
In 2010, the track explored 46 topics, each associated with a natural language topic description, four reference summaries, and, unlike 2009, one of five pre-defined categories. TAC 2010 summaries were expected to cover all aspects associated with that category (e.g., for Health and Safety, summaries should cover (a) what happened, (b) who was affected, (c) how they were affected, (d) why the health or safety issue occurred, and (e) any countermeasures or prevention efforts) (Owczarzak and Dang, 2010).

Table 1: Impact of multi-task fine-tuning (MTFT) on zero-shot summarization quality; "Human" refers to cross-evaluation of human-authored reference summaries.

Results
Table 1 provides summarization results using T5 and BART with and without multi-task fine-tuning (MTFT) for zero-shot summarization. Clearly, MTFT had a strong impact on MEDIQA and TAC summary quality. DUC results, however, were more varied. Interestingly, we can observe that MTFT had a greater impact on BART than on T5 summarization quality, despite structuring fine-tuning tasks with the same prompts and configuration as those used to train T5.

Table 2 illustrates the zero-shot ROUGE-L achieved on each testing task when using the various mixing strategies described in Section 4.3. Self-adaptive mixing ($\gamma = 4$) obtains the highest performance, at the cost of implementation complexity; temperature-scaled mixing ($T = 2$) obtains reasonable performance as well.

Table 4 shows the impact of removing each task during fine-tuning on zero-shot summary quality. The most impactful tasks for MEDIQA are BioASQ (single- and multi-document), MedlinePlus, and IBM Evidence; BioASQ (multi-document only), MedlinePlus, ArXiv, and Cosmos QA were the most consistent for DUC; while PubMed, CNN/DailyMail, and Movie Rationales had the highest impact on TAC. Finally, Table 3 reports the standard deviation of T5 and BART for all evaluation tasks; as in Raffel et al. (2019), we assume the standard deviation can be applied to all reported experiments.

Table 1 indicates that multi-task fine-tuning (MTFT) provides improved zero-shot summarization quality on domains with clear knowledge transfer (e.g., news documents) as well as on new domains with less direct knowledge transfer, such as consumer health (i.e., MEDIQA). We note that for highly abstractive summarization, e.g., DUC and TAC, surface-level metrics such as BLEU and ROUGE are poor indicators of summarization quality. Embedding-based measures that are capable of capturing semantic similarity show a strong improvement when MTFT is used. DUC results are more perplexing, likely due to the extreme disparity between the MTFT summarization tasks and the DUC evaluation: in 2004, DUC summaries were between 4 and 20 tokens long and highly abstractive (as indicated by human performance), making automatic measures less effective. For DUC 2007, all summaries were between 140 and 250 words long, much longer than most summaries seen during MTFT.

Discussion
When analyzing the impact of different tasks on downstream performance, as indicated by Table 4, it is clear that each final summarization task benefits from different fine-tuning task combinations. While it may appear that CQaD-S had a strong impact on all tasks, additional experiments suggest that fine-tuning on any single summarization task provides similar zero-shot improvements compared to using T5-Base or BART-Large, and that CQaD-S and BioASQ had similar impacts on MEDIQA. Our results suggest that picking the optimal combination of fine-tuning tasks is non-trivial, that more work is needed to improve the robustness of training and task-mixing strategies, and that in-depth analysis or principled guidelines for task selection would benefit the community. In a zero-shot setting, it is difficult to determine the optimal combination of fine-tuning tasks. However, in future work, we plan to explore feature selection techniques such as additive or recursive feature elimination to determine an efficient way to select optimal tasks in a few-shot learning environment.

Table 2 suggests that for the case of zero-shot learning, self-adaptive training was most effective at exploiting fine-tuning tasks. However, taken with Table 4, it is clear that adaptive mixing can be further improved to be more resilient against sub-optimal fine-tuning task combinations. We note that temperature scaling with $T = 2$ offers a strong competitor to self-adaptive task mixing, with the additional advantage of a simpler implementation.
While an in-depth manual assessment of all tasks is beyond the scope of this work, a shallow manual review suggests that conditional summarization would benefit from new metrics that emphasize the role of the conditional context (i.e., question or topic description) in the summary to ensure that summaries are not too generic.

Conclusions
In this paper, we explored the impact of multi-task fine-tuning (MTFT) on zero-shot conditional summarization for consumer health questions (MEDIQA; Savery et al., 2020) as well as topic-driven news article summarization (i.e., the TAC and DUC summarization challenges). We introduced four new summarization datasets and proposed two online or adaptive methods for task mixing during fine-tuning. Our experimental results indicate that MTFT enables BART to produce higher quality summaries than T5, and that MTFT improved summary quality on unseen tasks in terms of ROUGE-L by 35.50% (relative; 11.20% absolute) for consumer health and by 35%-241% (relative; 3.80%-11.46% absolute) for TAC. DUC results were inconclusive, with MTFT improving T5 results but hindering BART. Ablation analysis indicates that not all tasks are created equal, and that careful consideration must be taken to ensure each task has characteristics that transfer (even subtle semantic properties such as argumentation) to the downstream zero-shot application. Our proposed self-adaptive task mixing strategy was able to lessen the impact of irrelevant tasks on zero-shot performance by 8.25% (relative; 2.75% absolute) BLEU-4 and 7.57% (relative; 3.04% absolute) ROUGE-L. In future work, we plan to explore automatic approaches for determining the optimal set of fine-tuning tasks, improving the robustness of task mixing strategies to accommodate sub-optimal task combinations, and new evaluation metrics that better reflect the role of the summarization context (i.e., question or topic description).