Exploring Versatile Generative Language Model Via Parameter-Efficient Transfer Learning

Fine-tuning pre-trained generative language models to down-stream language generation tasks has shown promising results. However, this comes with the cost of having a single, large model for each task, which is not ideal in low-memory/power scenarios (e.g., mobile). In this paper, we propose an effective way to fine-tune multiple down-stream generation tasks simultaneously using a single, large pretrained model. The experiments on five diverse language generation tasks show that by just using an additional 2-3% parameters for each task, our model can maintain or even improve the performance of fine-tuning the whole model.


Introduction
Large-scale language models (Radford et al., 2019; Dai et al., 2019) have been shown to be effective in learning highly transferable embeddings, which can be used in several down-stream tasks. For instance, bidirectional models (Peters et al., 2018; Devlin et al., 2019) are fine-tuned to improve classification tasks (Wang et al., 2019), while unidirectional language models (Radford et al., 2019) are more effective in language generation tasks. In this work, we focus on the latter, and show that it is possible to dynamically steer the output of a language model (e.g., GPT-2) towards a specific task (e.g., summarization) without modifying the original model parameters.
Feature-based transfer (Howard and Ruder, 2018; Fan et al., 2020a,b) and fine-tuning (Devlin et al., 2019) are the most commonly used methods for transfer learning in language. The former freezes the pre-trained model and uses it as a feature extractor for training a new classifier, and the latter uses the pre-trained weights as an initialization for the model to be trained on downstream tasks. The feature-based transfer strategy has not shown promising results (Devlin et al., 2019), while fine-tuning, on the other hand, can achieve state-of-the-art performance in multiple tasks (Dong et al., 2019). However, the downside of the latter is the need for a separate model for each of the fine-tuned tasks. This is especially relevant for on-device applications, where a limited amount of computation/memory is available.
* Equal contributions.
1 Code available at https://github.com/zlinao/VGLM
Therefore, we study how to effectively use a single pre-trained model as the backbone for multiple language generation tasks, such as conversational question answering, summarization, machine translation, multi-turn chit-chat dialogue, and task-oriented natural language generation. This is a particular parameter-sharing schema, where we constrain the shared parameters to be the ones in the pre-trained model, and we learn task-specific parameters for each of the considered datasets.
In this paper, we propose to use residual adapter layers (Houlsby et al., 2019) and task embeddings for modelling the aforementioned task-specific parameters, and we explore different training strategies such as distillation (Hinton et al., 2015;Kim and Rush, 2016). We also analyse the trade-off between freezing or not freezing the language model parameters by leveraging two learning settings, multi-task (MT) (Caruana, 1997) and continual learning (CL) (Thrun and Pratt, 2012). With our experiments, we empirically demonstrate that by adding less than 3% task-specific parameters, our model can maintain or even achieve better performance than fine-tuning the whole model.

Related work
Pre-trained generative language models (Radford et al., 2018, 2019; Dai et al., 2019; Yang et al., 2019; Peters et al., 2018) have been shown to be very effective in language generation, whereas bidirectional pre-trained models (Devlin et al., 2019; Liu et al., 2019; Sanh et al., 2019) significantly improve the performance of several down-stream classification tasks. Fine-tuning large pre-trained models has shown positive results in dialogue tasks (Wolf et al., 2019b; Budzianowski and Vulić, 2019) and other language generation tasks (Dong et al., 2019). However, all of the previous works only consider fine-tuning on each generation task individually, which requires a separate model for each task. In this work, we use only a single model for multiple generation tasks.
Residual adapters, derived from residual networks (He et al., 2016), were first introduced by Rebuffi et al. (2017) for multiple visual domain learning. Houlsby et al. (2019) proposed low-rank residual adapters to improve the scalability of the adapter module, and effectively transfer BERT (Devlin et al., 2019) to multiple text classification tasks simultaneously, while Bapna and Firat (2019) applied an adapter layer to language/domain adaptation for neural machine translation. On the other hand, Dathathri et al. (2019) proposed a plug and play method to control the language model generation without finetuning the model. Differently, in this paper, we extend the idea of adapters to a large variety of language generation tasks, which has not been considered before, and we compare the idea of a fixed pre-trained back-bone for continual learning with multi-task training (Stickland and Murray, 2019).

Methodology
The Versatile Language Model (VLM) is composed of three components: a pre-trained language model back-bone (e.g., GPT-2) and two kinds of task-specific parameters for each generation task, namely low-rank residual adapters and task embeddings. Figure 1 shows the VLM architecture with the specialized parameters in different colours.
Residual Adapters. These are trainable modules which steer the pre-trained model to different downstream tasks. We adapt the design of the feedforward Transformer sub-layer following Bapna and Firat (2019). To elaborate, the adapter block consists of 1) a layer normalization (Ba et al., 2016) for an efficient adaptation and 2) a following autoencoder (Hinton and Zemel, 1994), with a residual connection. Formally, given the hidden representation $H_i \in \mathbb{R}^{t \times d}$ from the language model layer $i$, where $d$ is the hidden dimension and $t$ is the current generation step, the residual adapter computes the following:
$$\mathrm{Adapter}(H_i) = \mathrm{ReLU}\big(\mathrm{LN}(H_i)\, W_i^{E}\big)\, W_i^{D} + H_i,$$
where $W_i^{E}$ and $W_i^{D}$ are parameters of dimension $d \times m$ and $m \times d$ respectively, and $\mathrm{LN}(\cdot)$ denotes layer normalization. The bottleneck dimension $m$ is tunable, which allows the capacity of the adapter to be adjusted according to the complexity of the target task.
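As an illustration of the adapter block described above (layer normalization, a bottleneck autoencoder, and a residual connection), the following is a minimal NumPy sketch rather than the released implementation. The ReLU non-linearity in the bottleneck, the omission of learned layer-norm gain/bias and of bias terms, and the zero initialization of the up-projection are assumptions made for this example; the names `layer_norm` and `residual_adapter` are hypothetical.

```python
import numpy as np


def layer_norm(h, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned gain/bias)."""
    mean = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mean) / np.sqrt(var + eps)


def residual_adapter(h, w_e, w_d):
    """Compute ReLU(LN(h) @ w_e) @ w_d + h for h of shape (t, d)."""
    z = np.maximum(layer_norm(h) @ w_e, 0.0)  # down-project to m dims, then ReLU (assumed)
    return z @ w_d + h                        # up-project back to d dims, add the residual


rng = np.random.default_rng(0)
t, d, m = 4, 768, 100                       # d = 768 matches GPT-2-small; m is the bottleneck
h = rng.normal(size=(t, d))                 # stand-in for the hidden states of one layer
w_e = rng.normal(scale=0.02, size=(d, m))   # encoder (down-projection), d x m
w_d = np.zeros((m, d))                      # decoder (up-projection), m x d; zero init is an assumption
out = residual_adapter(h, w_e, w_d)
```

With a zero up-projection the adapter is the identity, so the frozen back-bone's behaviour is unchanged before any task-specific training; this initialization is a common choice, not something stated in the text.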
Task Embedding. To adapt unconditional generative language models to different conditional language generation tasks (e.g., CoQA, Summarization), we construct a set of task-specific segment embeddings. For example, in multi-turn dialogue, we alternate between System and User embeddings to help the model to capture the hierarchical structure of dialogues. Figure 1 shows the task embedding for each task, and more details are available in Appendix A2.
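To make the alternating System/User segments concrete, here is a small sketch of how per-token segment labels could be produced for a multi-turn dialogue. The function name `dialogue_segment_ids`, the whitespace tokenization (standing in for the BPE tokenizer), and the choice of which speaker starts are all assumptions for illustration; in a model, each label would index a learned segment embedding.

```python
def dialogue_segment_ids(turns, first_speaker="user"):
    """Return (tokens, segments) with one alternating speaker label per token."""
    if first_speaker == "user":
        speakers = ("user", "system")
    else:
        speakers = ("system", "user")
    tokens, segments = [], []
    for i, turn in enumerate(turns):
        words = turn.split()  # whitespace split stands in for the real tokenizer
        tokens.extend(words)
        segments.extend([speakers[i % 2]] * len(words))  # same label for the whole turn
    return tokens, segments


tokens, segments = dialogue_segment_ids(["hi there", "hello", "how are you"])
```

Every token in a turn shares that turn's label, so the segment sequence marks exactly where one speaker stops and the other begins.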
Knowledge Distillation. In tasks with a large distributional shift from the original pre-trained language model (e.g., Machine Translation), we expect a larger performance gap between VLM and full fine-tuning. To cope with this issue, we propose to use sentence-level knowledge distillation (Kim and Rush, 2016) to help the task-specific parameters better adapt to the task. Specifically, we first fully fine-tune a GPT-2 model on the training set of a task (e.g., Machine Translation). Then we replace the gold target (e.g., gold translation) in the training set with the greedy decoded output from the fully fine-tuned model. Finally, the newly constructed training set is used to fine-tune the student VLM.
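The data side of this distillation procedure can be sketched as follows. This is a schematic under stated assumptions: `build_distilled_dataset` is a hypothetical helper, and `toy_teacher` is a placeholder for greedy decoding with the fully fine-tuned GPT-2 teacher, which is not reproduced here.

```python
from typing import Callable, List, Tuple


def build_distilled_dataset(
    train_pairs: List[Tuple[str, str]],
    teacher_decode: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Replace each gold target with the teacher's decoded output for the same source."""
    return [(source, teacher_decode(source)) for source, _gold in train_pairs]


def toy_teacher(source: str) -> str:
    """Placeholder teacher; a real one would greedily decode from the fine-tuned model."""
    return source.upper()


gold_pairs = [("ein haus", "a house"), ("ein hund", "a dog")]
distilled_pairs = build_distilled_dataset(gold_pairs, toy_teacher)
```

The student is then trained on `distilled_pairs` in exactly the same way it would be trained on the gold pairs; only the targets change.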

Datasets & Evaluation Metrics
We conduct our experiments on five diverse datasets covering multiple generation tasks. We use a large variety of evaluation metrics, such as perplexity, F1 score, BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), NIST (Lin and Och, 2004), METEOR (Denkowski and Lavie, 2014) and CIDEr (Vedantam et al., 2015). Each task uses the appropriate measure, as reported in Table 1, where in NLG we report the normalized average score of multiple metrics, as in Dušek et al. (2019). More information about the task descriptions and the metrics used in each task is reported in Appendix A2.

Implementation and model comparison
We implement VLM based on GPT-2-small (124M) (Wolf et al., 2019a), and experiment with varying adapter bottleneck dimensions in {10, 50, 100, 300}, picking the best one for each task to trade off performance against parameter efficiency. Specifically, we choose bottleneck sizes 100, 300, 100, 300 and 10 for DLG, NMT, SUM, QA, and NLG, respectively, which results in 13% additional parameters in total. We ablate the adapter training with and without knowledge distillation and task embeddings. We also test the performance of a frozen back-bone (VLM) to show the ability to continuously learn tasks, and of multi-task fine-tuning (VLM MT) with a trainable back-bone to show possible positive transfer among tasks, as in Stickland and Murray (2019). More training details and the dataset pre-processing are reported in Appendix A2.
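A back-of-the-envelope check of the parameter overhead can be done from the bottleneck sizes above. The sketch below assumes one adapter per Transformer layer, the standard GPT-2-small shape (12 layers, hidden size 768), and counts only the two projection matrices of each adapter; layer-norm parameters, any bias terms, and the task embeddings are ignored, so it is a lower-bound estimate rather than the authors' exact accounting.

```python
D_MODEL = 768                  # GPT-2-small hidden size
N_LAYERS = 12                  # GPT-2-small Transformer layers (one adapter each, assumed)
BACKBONE_PARAMS = 124_000_000  # back-bone size as stated in the text

# Bottleneck size chosen for each task in the text.
BOTTLENECKS = {"DLG": 100, "NMT": 300, "SUM": 100, "QA": 300, "NLG": 10}


def adapter_params(m, d=D_MODEL, n_layers=N_LAYERS):
    """Count only the d x m and m x d projections of every adapter."""
    return n_layers * 2 * d * m


per_task = {task: adapter_params(m) for task, m in BOTTLENECKS.items()}
total = sum(per_task.values())
total_fraction = total / BACKBONE_PARAMS

for task, count in per_task.items():
    print(f"{task}: {count:,} parameters ({100 * count / BACKBONE_PARAMS:.2f}% of back-bone)")
print(f"total: {total:,} parameters ({100 * total_fraction:.2f}% of back-bone)")
```

Under these assumptions the projections alone account for roughly 12% of the back-bone size, which is in the same range as the 13% reported; the terms left out of the count would plausibly make up the difference.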
To show the effectiveness of the proposed methodology of learning a versatile generative model, we compare (i) fine-tuning the whole GPT-2 model for each task separately (GPT-2 Finetune), (ii) fine-tuning the language model head of GPT-2 for each task (LM-Head), (iii) existing baseline models reported (Reference), and (iv) the state-of-the-art models for all the tasks (SOTA). Table 1 shows the experimental results of the aforementioned models. Appendix A3 and A4 report detailed results and generated samples for all the datasets. Our findings can be summarized as follows:

Results and Analysis
Fine-tuning GPT-2 vs Baseline & SOTA. Fine-tuning the whole GPT-2-small on each task can generally improve on the performance of competitive baselines such as Pointer-Generator (See et al., 2017) in summarization (SUM) and CoQA. In both the Persona-Chat and the NLG tasks, GPT-2 fine-tuning slightly outperforms the current SOTA, whereas we observe a performance gap between GPT-2 and SOTA in NMT and SUM. Notably, the advantage of GPT-2 pre-training is limited in NMT for two reasons: 1) little or no German text is present in the pre-training corpus; 2) the GPT-2 BPE (Sennrich et al., 2016) tokenizer is optimized for English text, and not for multiple languages. Finally, in SUM and CoQA, the SOTA models use 100× bigger models (Raffel et al., 2019) and bidirectional attention (Dong et al., 2019), whereas GPT-2 uses unidirectional attention.

Adapter vs Fine-tuning GPT-2 & LM Head.
Fine-tuning only the adapter layers introduces 13% additional parameters to the model with a minimal loss in performance (0.4%) compared to fine-tuning a separate GPT-2 model. Moreover, the adapter layers are more effective, both in terms of performance and number of additional parameters, compared to fine-tuning LM-Head.
Knowledge Distillation (KD). Using KD in the training procedure is especially useful in tasks such as NMT and SUM, where the gap between fine-tuning the whole model and the adapter is large. This is because KD reduces the complexity of the training targets (by replacing the gold training target with a target generated by the teacher model), which helps the low-capacity adapter (with 4% additional parameters) by providing an easier translation/summarization task (Zhou et al., 2019). Figure 2 shows the effect of using distillation training when the gap with full fine-tuning is more substantial. On the other hand, when the adapter performance is very close to that of the fine-tuning baseline, or better (i.e., NLG), distillation has no impact on the final performance.
Task Embedding. The specialized segment embedding (a.k.a. task embedding) is very important for achieving competitive performance, independently of the adapter. In Table 1, we can observe a substantial drop in performance when the task embedding is not deployed. Indeed, without a task embedding the model struggles to learn when the input sequence ends, and how to distinguish the different parts of the input sequence (e.g., attributes in NLG, document and question in CoQA, etc.).
Frozen Backbone vs Trainable Backbone. As previously mentioned, VLM can be trained either by freezing the weights of the GPT-2 model, i.e., independently and continually learning one task at a time, or by multi-task training on all the tasks and thus fine-tuning both the GPT-2 model and the adapters. The latter model has the advantage of being able to transfer knowledge among tasks, as we can observe in Table 1 for the CoQA task, where VLM Multi-Task improves the F1 score by 3%. On the other hand, the frozen back-bone model has the big advantage of learning tasks sequentially, since the original GPT-2 weights remain untouched.

Conclusion
In this paper, we have presented a Versatile Language Model which learns five diverse natural language generation tasks in a single model. We found that a residual adapter is more effective than finetuning other parts of the model (e.g., LM-Head), and that distillation helps in reducing the gap in performance in hard to fine-tune tasks, such as summarization and translation. Finally, we show the trade-off between a frozen and trainable back-bone, showing that the former has a competitive performance, with the advantage of being extendable to future tasks without full re-training.

A Appendices
A.1 Model details
Figure 3 illustrates a detailed version of VLM. VLM shares a GPT-2 back-bone and for each task, the model looks up a set of task embeddings for modeling different input structures and chooses the corresponding adapter.

A.2 Experiment details
In this section, we will describe the dataset, evaluation metrics, dataset preprocessing and training details for each task.

Conversational Question Answering (CQA)
CoQA (Reddy et al., 2019) is a free-form conversational question answering dataset. The task is to answer the questions in a conversation. Each turn in the conversation contains a question, and we need to answer the questions based on conversation histories and documents. We use document, question, and answer segment embeddings to help the model distinguish the document and the alternating questions and answers in the input sequence. We fine-tune the full GPT2-small or VLM (trainable adapter with a fixed GPT2-small) for five epochs with the Adam optimizer. For distillation we only fine-tune VLM for three epochs. We set the batch size to 16, limit the maximum length of the document to 400 tokens, and only retain the last two turns of questions and answers in the dialogue history. Following Reddy et al. (2019), we use the F1 score as the evaluation metric.
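The input construction described for CoQA (document capped at 400 tokens, only the last two question-answer turns kept, one segment label per token) could look roughly like the sketch below. The helper name `build_coqa_input`, the choice to keep the first 400 document tokens, and the ordering document, history, current question are assumptions for illustration and may differ from the actual pre-processing.

```python
def build_coqa_input(doc_tokens, history, question, max_doc_len=400, max_turns=2):
    """Return (tokens, segments) for one CoQA example.

    history is a list of (question_tokens, answer_tokens) pairs, oldest first.
    """
    doc = doc_tokens[:max_doc_len]   # keep the first max_doc_len tokens (assumed)
    recent = history[-max_turns:]    # retain only the most recent turns
    tokens = list(doc)
    segments = ["document"] * len(doc)
    for q_tokens, a_tokens in recent:
        tokens += q_tokens
        segments += ["question"] * len(q_tokens)
        tokens += a_tokens
        segments += ["answer"] * len(a_tokens)
    tokens += question               # the question to be answered comes last
    segments += ["question"] * len(question)
    return tokens, segments


document = ["w"] * 500
past_turns = [(["q1"], ["a1"]), (["q2"], ["a2"]), (["q3"], ["a3"])]
coqa_tokens, coqa_segments = build_coqa_input(document, past_turns, ["q4"])
```

In this toy call the oldest turn (`q1`, `a1`) is dropped and the document is cut from 500 to 400 tokens before the remaining turns and the current question are appended.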
Summarization (SUM) CNN/Daily-Mail is a benchmark (Hermann et al., 2015; Nallapati et al., 2016) for text summarization. We use article and summary segment embeddings to separate the article from the summary. We fine-tune the full GPT2-small and VLM for 10 epochs with the Adam optimizer. For distillation, we only fine-tune VLM for five epochs. We set the batch size to 32 and limit the maximum length of the article to 400 tokens and that of the summary to 130 tokens. We use the ROUGE-1, ROUGE-2, and ROUGE-L scores (Lin, 2004) as evaluation metrics.
Neural Machine Translation (NMT) We use the spoken German-English translation dataset IWSLT (Cettolo et al., 2016) as our NMT benchmark. We use source and target segment embeddings to separate the source language from the target language. We fine-tune the full GPT2-small, VLM and distilled VLM for 8 epochs with the Adam optimizer. We set the batch size to 32 and limit the maximum length of the source and target sequences to 100 tokens. We use BLEU (Papineni et al., 2002) as the evaluation metric.
Persona Dialogue (DLG) The Persona-Chat dataset (Zhang et al., 2018) is a persona-grounded multi-turn conversation dataset. We use persona, system, and user segment embeddings to help the model distinguish the persona and the alternating system and user utterances in an input sequence. We fine-tune the full GPT2-small or VLM for three epochs with the Adam optimizer. We set the batch size to 16 and only retain the last five utterances in the dialogue history. We use perplexity, BLEU, ...
Natural Language Generation (NLG) ... the output should be "The wrestlers offers competitive prices, but is not highly rated by customers." We use a set of attribute segment embeddings to segment the input attributes. We fine-tune the full GPT2-small and VLM for 10 epochs with the Adam optimizer. We set the batch size to 32 and use BLEU (Papineni et al., 2002), ROUGE (Lin, 2004), NIST (Lin and Och, 2004), METEOR (Denkowski and Lavie, 2014) and CIDEr (Vedantam et al., 2015) as evaluation metrics.
Computational Cost. Fine-tuning VLM requires around 80%-90% of the GPU memory needed to fully fine-tune the whole GPT-2 model, as it only updates a small fraction of the parameters. Both models have a similar training cost; we report the training speed with a single GTX 1080 Ti.

A.3 Detailed Results
In this section, we report the detailed results for each task in Tables 2-6. We use a greedy decoding strategy for all the tasks.

Figure 3: A detailed version of VLM. VLM shares a GPT-2 back-bone and for each task, the model looks up a set of task embeddings and chooses the corresponding adapter.

Here we can see that knowledge distillation does not improve the performance of the NLG task because of the small gap between VLM and the fully fine-tuned GPT-2. Instead, for the dialogue and QA tasks, the gold target is always better than the distilled target.

(Ranzato et al., 2015) 21.83
AC+LL (Bahdanau et al., 2016) 28.53
NPMT (Huang et al., 2017) 28.96
Dual Transfer Learning (Wang et al., 2018) 32.35
LYC Transformer (He et al., 2018) 35.07

GPT-2 Finetune
If you work with somebody in the '20s, you love them because you lost a loved one in the '20s, I want to see you --great. People in the '20s are really important.

VLM
If you work with somebody in the '20s, because of a love lost in the '20s, I want to see you -OK. Great. People in the '20s are really important.

LM-Head
If you work with someone in the 20ern, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love, you love,

Target
If you work with twentysomethings, you love a twentysomething, you're losing sleep over twentysomethings, I want to see -Okay. Awesome, twentysomethings really matter.

GPT-2 Finetune
Yes, people will be more domestic in the future than they used to be, but that didn't make Alex' 20s for failure.

VLM
Yes, people would come up later than they used to, but that didn't make Alex' 20s a disaster.

LM-Head
Yes, people are later going to come back as former former, but that doesn't make Alex' 20s anymore.

Target
Yes, people settle down later than they used to, but that didn't make Alex's 20s a developmental downtime.

Source
Leute in den 20ern wie Alex und ich hatten nichts als Zeit.

GPT-2 Finetune
People in the '20s like Alex and I didn't have time for time.

VLM
People in the '20s like Alex and I had nothing but time.

LM-Head
People like Alex and I had nothing as a time.

Target
Twentysomethings like Alex and I had nothing but time.

Summarization CNN/Daily-Mail
Source los angeles -lrb- cnn -rrb- it's more than just one state's internal problem. the historic california drought hurts the rest of the union, too. that's because california is a breadbasket to the nation, growing more than a third of its vegetables and nearly two-thirds of its fruits and nuts.
here's why we should heed the ongoing drought in the most populous state, a slowly expanding natural disaster now in its fourth year that this week prompted gov. jerry brown to announce a mandatory 25 % ....

GPT-2 Finetune
the drought hurts the rest of the union, too, because california is a breadbasket to the nation, the drought is, causing the nation's price to rise, thanks to a strong dollar.

VLM
california drought hurts rest of the union, too, says jerry brown. brown: prices rose last year for these items on your kitchen table. he says prices expected to rise this year because of inflation.

LM-Head
new : the drought has hurt the rest of the union, the drought has hurt the rest of the union, too, the drought has hurt the rest of the union, too, the drought has hurt the rest of the union, too.

Target
americans paid more for some fruits and vegetables last year because of the drought, tourists will now have to ask for a glass of water at a california restaurant, perhaps the only good thing is another " great " wine grape harvest last year.

Source
-lrb- cnn -rrb- governments around the world are using the threat of terrorism - real or perceived - to advance executions, amnesty international alleges in its annual report on the death penalty. " the dark trend of governments using the death penalty in a futile attempt to tackle real or imaginary threats to state security and public safety was stark last year, " said salil shetty, amnesty's secretary general in a release. " it is shameful that so many states around the world are essentially playing with people's lives - putting people to death for ' terrorism ' or to quell internal instability on the ill-conceived premise of deterrence.

GPT-2 Finetune
report: governments are using the threat of terrorism to advance executions, china is mentioned as having used, the death penalty as a tool in its " strike hard " campaign., the number of executions worldwide has gone down by almost 22 % on the previous year.

VLM
amnesty international releases its annual report on the death penalty, the report cites the example of pakistan, lifting a six-year moratorium on the execution of civilians.

LM-Head
amnesty international : death penalty is a tool in state-sanctioned killing campaigns, china is also mentioned, as having used the death penalty as a tool in its " strike hard " campaign against terrorism.,china is also mentioned, as having used the death penalty as a tool in its " strike hard " campaign against terrorism

Target
amnesty's annual death penalty report catalogs encouraging signs, but setbacks in numbers of those sentenced to death.,organization claims that governments around the world are using the threat of terrorism to advance executions., the number of executions worldwide has gone down by almost 22 % compared with 2013, but death sentences up by 28 %.

Conversational Question Answering (CQA)
Source
When her mommy and sisters found her they started laughing. "What are you doing, Cotton?!" "I only wanted to be more like you". Cotton's mommy rubbed her face on Cotton's and said "Oh Cotton, but your fur is so pretty and special, like you. We would never want you to be any other way". And with that, Cotton's mommy picked her up and dropped her into a big bucket of water. When Cotton came out she was herself again. Her sisters licked her face until Cotton's fur was all all dry. "Don't ever do that again, Cotton!" they all cried. "Next time you might mess up that pretty white fur of yours and we wouldn't want that!" Then Cotton thought, "I change my mind. I like being special".
Q1: Whose paint was it?
A1: the farmer
Q2: What did Cotton's mother and siblings do when they saw her painted orange?

GPT-2 Finetune
laughed

VLM
laughed

LM-Head
They put her in a big bucket of water

Target
laughed

Persona Dialogue (DLG)
VLM
wow that's a lot of dyeing. i love to dye.

LM-Head
i love to make time stop

Target
do you live in the sky? i do. in the clouds.

Natural Language Generation NLG
Source name = Blue Spice eatType = coffee shop near = Crowne Plaza Hotel customer rating = 5 out of 5

GPT-2 Finetune
Blue Spice is a coffee shop near Crowne Plaza Hotel with a 5 out of 5 customer rating.

VLM
Blue Spice is a coffee shop near Crowne Plaza Hotel. It has a customer rating of 5 out of 5.

LM-Head
Blue Spice is a coffee shop near Crowne Plaza Hotel with a customer rating of 5 out of 5.

Target
-The coffee shop Blue Spice is based near Crowne Plaza Hotel and has a high customer rating of 5 out of 5.
-The Blue Spice coffee shop, near Crowne Plaza Hotel, has a customer rating of 5 out of 5.
-If you want a coffee shop rated 5 out of 5 pick Blue Spice. It is located near Crowne Plaza Hotel.

GPT-2 Finetune
Giraffe is a family friendly pub located near Rainbow Vegetarian Café. It serves English food and is located in the riverside area

VLM
Giraffe is a family friendly pub located near Rainbow Vegetarian Café in the riverside area.

LM-Head
Giraffe is a pub near Rainbow Vegetarian Café that serves English food. It is family friendly and located in the riverside area.

Target
-The Giraffe, a child-friendly, English pub, is located in riverside area, near Rainbow Vegetarian Café.
-Giraffe is a pub that is children friendly and provides English food. It is located on the riverside near Rainbow Vegetarian Café.
-There is an English based pub called Giraffe. It is located in the riverside area near the Rainbow Vegetarian Café and, yes, it is kid friendly.