PALM: Pre-training an Autoencoding&Autoregressive Language Model for Context-conditioned Generation

Self-supervised pre-training has emerged as a powerful technique for natural language understanding and generation, such as BERT, MASS and BART. The existing pre-training techniques employ autoencoding and/or autoregressive objectives to train Transformer-based models by recovering original word tokens from corrupted text with some masked tokens. In this work, we present PALM which pre-trains an autoencoding and autoregressive language model on a large unlabeled corpus especially for downstream generation conditioned on context, such as generative question answering and conversational response generation. PALM minimizes the mismatch introduced by the existing denoising scheme between pre-training and fine-tuning where generation is more than reconstructing original text. With a novel pre-training scheme, PALM achieves new state-of-the-art results on a variety of language generation benchmarks covering generative question answering (Rank 1 on the official MARCO leaderboard), abstractive summarization on Gigaword and conversational response generation on Cornell Movie Dialogues.


Introduction
Self-supervised pre-training has achieved great success in natural language understanding (NLU) and a wide range of NLP tasks (Dai and Le, 2015; Howard and Ruder, 2018; Radford, 2018; Peters et al., 2018; Devlin et al., 2018). A variety of training objectives and auxiliary tasks have been introduced into pre-training on massive unlabeled text data, and the pre-trained models can be further fine-tuned for downstream NLU tasks. Among existing pre-training methods, BERT-like approaches, such as BERT (Devlin et al., 2018), RoBERTa (Liu et al., 2019) and ALBERT (Lan et al., 2019), are the most prominent, pre-training the bidirectional Transformer (Vaswani et al., 2017) encoder on a large text corpus through masked language modeling and next sentence prediction. BERT-like pre-training is designed for language understanding applications that aim to extract knowledge by comprehending given contextual text.
Different from language understanding, language generation aims at generating natural language sentences, including tasks like neural machine translation (Bahdanau et al., 2015; Vaswani et al., 2017), abstractive summarization (Rush et al., 2015; See et al., 2017a; Gehrmann et al., 2018), generative question answering (QA) (Tan et al., 2017; Bi et al., 2019) and conversational response generation (Vinyals and Le, 2015). Many language generation tasks require the models to read and comprehend a given document, based on which output text is generated. In this paper, we present PALM, a novel approach to Pre-training an Autoencoding & Autoregressive Language Model for text generation based on reading comprehension of textual context.
Recently, several pre-training methods have been proposed for language generation. GPT (Radford, 2018) and GPT-2 (Radford et al., 2019) use a left-to-right Transformer decoder to generate a text sequence token-by-token, which lacks an encoder to condition generation on context. In contrast, MASS (Song et al., 2019) and BART (Lewis et al., 2019) both employ a Transformer-based encoder-decoder framework, with a bidirectional encoder over corrupted (masked) text and a left-to-right decoder reconstructing the original text. While such denoising pre-training objectives work well for downstream generation tasks where the generated text comes from the input but is manipulated, they are less related to comprehension-based generation tasks, which instead ask for generating continuations, responses or answers by comprehending input context. PALM is specifically designed to pre-train a backbone model on a large unlabeled corpus for fine-tuning on downstream comprehension-based generation tasks, one example of which is generative QA. In generative question answering, QA models are asked to generate an abstractive answer in natural language to a given question by reading and comprehending a contextual passage. Abstractive answer generation is more than manipulating tokens in the passage. An abstractive answer reflects the understanding of the passage and the question, and can include content beyond the passage so as to be self-contained and well-formed. To address comprehension-based generation like generative QA, PALM uses pre-training objectives that are closely related to the downstream tasks. Specifically, it differs from existing generative pre-training methods in that PALM goes beyond purely autoencoding or autoregressive methods and combines the merits of autoencoding and autoregression in a single framework. Moreover, it possesses a mechanism, built in during pre-training, for generating coherent text from given context.
With the new design, PALM can surpass existing language generation methods at much less computational cost than that of prior pre-training approaches: it was trained on 16 NVIDIA V100 GPUs for 3 days in our experiments, and is expected to perform even better if trained for longer. PALM gives surprisingly good empirical results on a variety of context-aware generation tasks, including pushing the state-of-the-art Rouge-L on the MARCO Q&A + Natural Language Generation benchmark to 0.498 (Rank 1 on the leaderboard at http://www.msmarco.org/leaders.aspx) and on Gigaword summarization to 0.360, as well as establishing the state-of-the-art perplexity of 21.98 on generating responses to Cornell Movie Dialogues.
We make the following major contributions in this paper: • We propose PALM, a novel approach to pre-training a language model on a large unlabeled text corpus, which is able to comprehend contextual text. The pre-trained model is particularly effective when fine-tuned for language generation conditioned on context.
• With less training cost than that of existing pre-training methods, PALM significantly advances the state-of-the-art results on a variety of language generation applications, including generative QA, abstractive summarization and conversational response generation. This clearly demonstrates PALM's effectiveness and generalizability in language generation.

Language Modeling
PALM is built upon an extension of an encoder-decoder framework. In this section, we introduce the encoder-decoder framework for language modeling, followed by the base architecture used for PALM.

Encoder-Decoder
We denote (x, y) ∈ (X, Y) as a pair of text pieces, where x = (x_1, x_2, ..., x_m) is the source text with m tokens, and y = (y_1, y_2, ..., y_n) is the target text with n tokens. X and Y denote the sets of source text and target text, respectively. An encoder-decoder model learns the parameter set θ to estimate the conditional probability P(y|x; θ), with the log-likelihood as the objective function:

L(θ; (X, Y)) = Σ_{(x,y) ∈ (X,Y)} log P(y|x; θ).

The conditional probability P(y|x; θ) can be further factorized according to the chain rule:

P(y|x; θ) = Π_{t=1}^{n} P(y_t | y_{<t}, x; θ),

where y_{<t} denotes the token sequence preceding position t.
In the encoder-decoder framework, the encoder reads the source text and generates a set of representations. With the source representations and its preceding token sequence, the decoder estimates the conditional probability of each target token. An attention mechanism (Bahdanau et al., 2015) is further introduced between the encoder and the decoder to identify a subset of source representations to attend for predicting each target token.
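As a toy illustration of the chain-rule factorization above, the log-likelihood of a target sequence decomposes into a sum of per-token conditional log-probabilities. This is a minimal sketch; the probabilities below are hypothetical decoder outputs, not from any real model:

```python
import math

def sequence_log_likelihood(token_probs):
    """Chain-rule factorization: log P(y|x) = sum_t log P(y_t | y_<t, x).

    `token_probs` holds the decoder's conditional probability
    P(y_t | y_<t, x) assigned to each ground-truth target token.
    """
    return sum(math.log(p) for p in token_probs)

# Toy example: a 3-token target with hypothetical conditional probabilities.
probs = [0.5, 0.8, 0.9]
ll = sequence_log_likelihood(probs)
```

Maximizing this sum over all (x, y) pairs is exactly the objective L(θ; (X, Y)) above.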

Transformer Base
PALM uses the standard Transformer encoder-decoder from (Vaswani et al., 2017) as the base architecture. First, an input sequence of tokens is mapped to a sequence of embeddings, which is then passed into the encoder. The encoder consists of a stack of blocks, each of which comprises two subcomponents: a self-attention layer followed by a small feed-forward network. Layer normalization (Ba et al., 2016) is applied to the input of each subcomponent, and a residual skip connection (He et al., 2016) adds each subcomponent's input to its output. Dropout (Srivastava et al., 2014) is applied within the feed-forward network, on the skip connection, on the attention weights, and at the input and output of the entire stack.
The decoder is similar in structure to the encoder except that it includes a standard attention mechanism after each self-attention layer that attends to the output of the encoder. The self-attention mechanism in the decoder also uses a form of autoregressive or causal self-attention, which only allows the model to attend to past outputs. The output of the final decoder block is fed into a dense layer with a softmax output, whose weights are shared with the input embedding matrix. All attention mechanisms in the Transformer are split up into independent "heads" whose outputs are concatenated before being further processed.
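The causal self-attention constraint described above can be sketched as a lower-triangular mask over decoder positions, so that position t attends only to positions up to and including t. This is an illustrative sketch, not the actual implementation:

```python
def causal_mask(n):
    """Lower-triangular attention mask for n decoder positions:
    entry [i][j] is 1 if position i may attend to position j (j <= i),
    and 0 if j is a future position that must be hidden."""
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

mask = causal_mask(4)
```

In practice the mask is applied additively (with -inf for disallowed positions) to the attention logits before the softmax, but the allowed/blocked pattern is the same.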

PALM for Context-conditioned Generation
This section presents the new mechanism and pre-training objectives of PALM for generation conditioned on context. The differences between PALM and prior pre-training approaches are discussed as well.

Joint Modeling of Autoencoding and Autoregression
Existing Transformer-based pre-training methods employ either autoencoding or autoregressive objectives for self-supervision. Autoencoding-based pre-training aims to reconstruct the original text from corrupted input. Notable examples are BERT and its variants RoBERTa and ALBERT, where a certain portion of input tokens are replaced by a special symbol [MASK]. The models are trained to recover the original tokens from the corrupted version by utilizing bidirectional context. However, these autoencoding methods are not applicable to text generation where bidirectional contexts are not available.
On the other hand, an autoregressive model, such as GPT (Radford, 2018; Radford et al., 2019), is only trained to encode unidirectional context (either forward or backward). Specifically, at each output timestep, a token is sampled from the model's predicted distribution and the sample is fed back into the model to produce a prediction for the next output timestep, and so on. While applicable to text generation, autoregressive methods are not effective at modeling deep bidirectional context. However, downstream generation tasks often ask a model to condition generation on given textual context. This results in a gap between autoregressive modeling and effective pre-training.
To close the gap, PALM is carefully designed to autoregressively generate a text sequence by comprehending the given context in a bidirectional autoencoding manner. In particular, PALM delegates autoencoding-based comprehension to the encoder in Transformer, and autoregressive generation to the Transformer decoder. The encoder and decoder are jointly pre-trained in two stages: 1. The encoder is first trained as a bidirectional autoencoder to reconstruct the original text from corrupted context in which random tokens are sampled and replaced with [MASK] symbols, following BERT's practice (Devlin et al., 2018). The training optimizes the cross-entropy reconstruction loss between the encoder's output and the original context, as in Masked Language Modeling (MLM) in BERT. By predicting the actual tokens in the context that are masked, PALM forces the encoder to comprehend the meaning of the unmasked tokens and the full context.
2. The encoder and decoder are then jointly trained to autoregressively generate the text output from the context representations produced by the encoder. The training maximizes the log-likelihood of the ground-truth text under the decoder's output distribution:

L(θ; (X, Y)) = Σ_{(x,y) ∈ (X,Y)} Σ_{t=1}^{n} log P(y_t | y_{<t}, x; θ),   (1)

where X represents the set of context and Y represents the set of text to be generated. By conditioning the generation on context representations, PALM forces the decoder to rely deeply on the context, instead of only on the preceding generated tokens, in next-token prediction, which facilitates context-sensitive generation.
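The two pre-training stages amount to two cross-entropy objectives: an MLM loss on the encoder's predictions for masked context tokens, and a negative log-likelihood on the decoder's continuation. The sketch below uses hypothetical model probabilities; the function names are illustrative, not from the paper's code:

```python
import math

def mlm_loss(masked_token_probs):
    """Stage 1: average cross-entropy over the encoder's predictions
    for the [MASK]ed context positions (Masked Language Modeling)."""
    return -sum(math.log(p) for p in masked_token_probs) / len(masked_token_probs)

def generation_loss(target_token_probs):
    """Stage 2: average negative log-likelihood of the ground-truth
    continuation under the decoder, conditioned on the encoder's
    context representations (objective (1), per token)."""
    return -sum(math.log(p) for p in target_token_probs) / len(target_token_probs)

# Hypothetical probabilities the model assigns to the correct tokens.
stage1 = mlm_loss([0.7, 0.6])               # two masked context tokens
stage2 = generation_loss([0.5, 0.8, 0.9])   # three continuation tokens
```

Both losses go to zero as the model's probability on each correct token approaches 1.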

Input&Output Representations
In the phase of model pre-training, input and output representations are tailored to minimize the discrepancy between self-supervised pre-training and supervised fine-tuning. In a typical downstream generation task (e.g., abstractive summarization and generative QA), context is given as a rather long passage, and a model is asked to generate a shorter piece of text based on the comprehension of the context. Given a contiguous text fragment of length L (composed of a few sentences) from an unlabeled corpus, PALM uses the consecutive span of length 80% · L from the beginning of the fragment as context input to the encoder, and uses the remaining text span of length 20% · L as the text output to be generated by the decoder. This representation design mimics the input and output of downstream tasks, with the hypothesis that human-written text is coherent and thus the subsequent text span of length 20% · L captures the comprehension of the preceding context span. In this way, PALM learns to infer the subsequent text content from the preceding content.

Figure 1: Input & output representations of PALM compared with existing pre-training generation methods. (a) GPT: Tokens are predicted autoregressively, meaning that GPT can be used for generation. However, it lacks an encoder to condition generation on context. (b) MASS: It is based on the encoder-decoder architecture, but the decoder predicts only the tokens that are masked in the text input to the encoder. (c) BART: Rather than masked tokens, the decoder reconstructs the original full sentence from the corrupted input to the encoder. However, it mismatches with most downstream generation, which is more than reconstructing the original input. (d) PALM: The encoder predicts masked tokens by encoding context bidirectionally, and the decoder predicts the text segment subsequent to the text input to the encoder, which enables the model to generate continuations downstream.
The collection of text fragments is constructed from a corpus by following the practice of BERT. In our experiments, we set the maximum length of a fragment to be 500, i.e., L ≤ 500. Therefore, the context input consists of at most 400 tokens, and the text output consists of at most 100 tokens. Figure 1 shows a schematic comparison of input & output representations between PALM and the existing pre-training generation methods, GPT, MASS and BART. GPT uses a decoder to predict tokens autoregressively, without an encoder to condition generation on context. MASS and BART are both trained to recover the original tokens that are masked from corrupted text, where the inputs to the encoder and the decoder come from the same text segment (e.g., the sequence (x_1, x_2, x_3, x_4, x_5) in Figures 1b and 1c). They are also expected to output the tokens from the same text sequence. By contrast, in PALM the encoder and the decoder take two different inputs. The input to the decoder comes from the continuation of the text input to the encoder (e.g., (y_6, y_7, y_8) is subsequent to (x_1, x_2, x_3, x_4, x_5) in the contiguous text segment (x_1, x_2, x_3, x_4, x_5, y_6, y_7, y_8) in Figure 1d). In addition to the continuation predicted by the decoder, PALM produces an extra output from the encoder, which contains the predicted tokens masked in the input (e.g., x_2 and x_4 in Figure 1d). The output predictions from the encoder and the decoder are used for training in the two stages, respectively.
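The 80%/20% context/continuation split can be sketched as follows. `split_fragment` is a hypothetical helper; the truncation details are assumptions:

```python
def split_fragment(tokens, context_ratio=0.8, max_len=500):
    """Split a contiguous text fragment into encoder context (the first
    80% of tokens) and decoder target (the remaining 20%), mimicking the
    input/output of downstream generation tasks."""
    tokens = tokens[:max_len]          # enforce L <= 500
    cut = int(len(tokens) * context_ratio)
    return tokens[:cut], tokens[cut:]

fragment = [f"tok{i}" for i in range(10)]
context, target = split_fragment(fragment)
```

With L = 500, this yields at most 400 context tokens for the encoder and 100 target tokens for the decoder, matching the limits stated above.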

Copying Tokens from Context
In a human-written document, subsequent text often refers back to entities and tokens present earlier in the preceding text. Therefore, incorporating the copy mechanism into pre-training on an unlabeled corpus increases the coherence of text generated downstream. It allows the model to learn, during pre-training, when and how to copy tokens while generating text, and this knowledge is transferred to downstream fine-tuning.
PALM incorporates the copy mechanism by plugging the pointer-generator network (See et al., 2017b; Nishida et al., 2019) on top of the decoder in Transformer. Figure 2 illustrates the pointer-generator network, which allows every token to be either generated from a vocabulary or copied from the context in generating text.

Figure 2: The pointer-generator network on top of the decoder in Transformer. For each decoding step t, the mixture weight λ for the probability of generating tokens from the vocabulary versus copying tokens from the context is calculated. The two distributions are weighted and summed to obtain the final distribution.
Extended vocabulary distribution. Let the extended vocabulary, V, be the union of the words in the vocabulary and all tokens present in the context. P_v(y_t) then denotes the probability distribution of the t-th token, y_t, over the extended vocabulary, defined as:

P_v(y_t) = softmax(W_e (W_v s_t + b_v)),

where s_t denotes the output representation of the t-th token from the decoder. The output embedding W_e is tied with the corresponding part of the input embedding (Inan et al., 2017), and W_v and b_v are learnable parameters.
Copy distribution. PALM uses an additional attention layer for the copy distribution on top of the decoder. In the course of generation, the layer takes s_t as the query, and outputs α_t as the attention weights and c_t as the context vector:

e_{tl} = w_c^T tanh(W_m h_l^c + W_s s_t + b_c),
α_{tl}^c = exp(e_{tl}) / Σ_{l'} exp(e_{tl'}),
c_t = Σ_l α_{tl}^c h_l^c,

where h_l^c is the representation of the l-th token in the context from the encoder, and w_c, b_c, W_m and W_s are learnable parameters. As a result, P_c(y_t) is the copy distribution over the extended vocabulary, defined as:

P_c(y_t) = Σ_{l: x_l = y_t} α_{tl}^c.
Final distribution. The final probability of generating y_t is defined as a mixture of the extended vocabulary distribution and the copy distribution:

λ_t = σ(w_z^T c_t + w_s^T s_t + b_m),
P(y_t) = λ_t P_v(y_t) + (1 − λ_t) P_c(y_t),

where w_z, w_s and b_m are learnable parameters. The parameters of the pointer-generator learned in pre-training are all kept and passed downstream for fine-tuning on labeled data.
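The mixture of the vocabulary and copy distributions can be illustrated with a small numerical sketch. The token probabilities and attention weights below are made up, and `final_distribution` is a hypothetical helper, not the paper's implementation:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def final_distribution(p_vocab, attention, context_tokens, mix_logit):
    """Mix the vocabulary distribution with the copy distribution.

    p_vocab:        dict token -> P_v(y_t) over the extended vocabulary
    attention:      copy attention weights alpha_t over context positions
    context_tokens: the token at each context position l
    mix_logit:      pre-sigmoid score for the mixture weight lambda
    """
    lam = sigmoid(mix_logit)
    # Copy distribution: P_c(y_t) sums attention over positions where x_l == y_t.
    p_copy = {}
    for a, tok in zip(attention, context_tokens):
        p_copy[tok] = p_copy.get(tok, 0.0) + a
    # Final mixture: lambda * P_v + (1 - lambda) * P_c.
    final = {tok: lam * p for tok, p in p_vocab.items()}
    for tok, p in p_copy.items():
        final[tok] = final.get(tok, 0.0) + (1.0 - lam) * p
    return final

p = final_distribution(
    p_vocab={"the": 0.6, "car": 0.4},
    attention=[0.9, 0.1],
    context_tokens=["aston", "car"],
    mix_logit=0.0,  # lambda = 0.5: equal weight to generating and copying
)
```

Note that "aston" receives probability mass even though it is outside the generator's vocabulary, which is exactly what lets the model copy rare entities from the context.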

Experiments
In this section, we present the experimental setup and results of PALM pre-training on a large unlabeled corpus and fine-tuning on a variety of language generation tasks, including generative QA, abstractive summarization and conversational response generation.

Pre-training Configuration
Experimental Setup. PALM is based on the Transformer, which consists of a 12-layer encoder and a 12-layer decoder with 768 embedding/hidden size, 3072 feed-forward filter size and 12 attention heads. The parameters of the encoder are initialized with the pre-trained RoBERTa-Base model, which was trained with the masked LM objective, removing next sentence prediction from BERT.
PALM is trained with a dropout rate of 0.1 on all layers and attention weights, and a GELU activation function (Hendrycks and Gimpel, 2016) as used in GPT. The learning rate is set to 1e-5, with linear warmup over the first 10k steps and linear decay. The pre-training procedure runs on 16 NVIDIA V100 GPU cards for 800K steps, with each minibatch containing 64 sequences of a maximum length of 500 tokens.
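The learning-rate schedule (linear warmup over the first 10k steps, then linear decay) can be sketched as follows. The decay endpoint (reaching zero at the final step) is an assumption, since the text does not state it:

```python
def learning_rate(step, peak=1e-5, warmup=10_000, total=800_000):
    """Linear warmup from 0 to `peak` over the first `warmup` steps,
    then linear decay to zero at `total` steps (assumed endpoint)."""
    if step < warmup:
        return peak * step / warmup
    return peak * max(0.0, (total - step) / (total - warmup))

lr_mid_warmup = learning_rate(5_000)   # halfway through warmup
lr_peak = learning_rate(10_000)        # peak at the end of warmup
lr_end = learning_rate(800_000)        # fully decayed
```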
Pre-training Dataset. We use documents from English Wikipedia and BookCorpus (Zhu et al., 2015) as our pre-training corpus, and perform WordPiece tokenization as in BERT (Devlin et al., 2018). The documents are split into sentences. Different from BERT, we use multiple consecutive sentences of up to 400 tokens as the source text input to the encoder, and use the subsequent consecutive sentences of up to 100 tokens as the target text for the decoder. The pre-training dataset (X, Y) is constructed from the documents by a sliding window with a stride of one sentence, resulting in 50M (x, y) pre-training pairs.

Table 1: Example continuations generated by the pre-trained PALM and MASS models.

Example 1
Input: A classic Aston Martin once owned by Spartacus star Peter Ustinov is set to fetch more than £1 million at auction - twice what it fetched four years ago. The actor bought the Aston Martin DB4 Cabriolet in 1962, shortly after winning a Best Supporting Actor Oscar for his role as Batiatus in Spartacus. It was one of the most luxurious cars of its day, costing £4,000 and was delivered to him at a Swiss hotel at a time when the average house price in Britain was just £2,500.
PALM: The Aston Martin DB4 Cabriolet was bought by Peter Ustinov for £4000 and was expected to fetch for £2.5 million for auction. The car was sold for £1.2 million finally.
MASS: peter ustinov's UNK auctioned for more than $1 million.

Example 2
Input: Cape Verde's 2-0 win over Portugal was the most eye-catching international result of the week. So, who are Cape Verde and why has this tiny island off the west coast of Africa suddenly become an international football force? Where are the Cape Verde Islands? Cape Verde is a group of islands 400 miles from Senegal off the west coast of Africa. Its population is around 500,000 and boasts both beautiful beaches and striking volcanic landscapes, making it a haven for tourism.
PALM: Cape Verde is a small island off the west coast of Africa with a population of around 500,000 and boasts both beautiful beaches and striking volcanic landscapes, making it a haven for tourism. Cape Verde is home to the Cape Verde Islands National Park with a number of islands.
MASS: tiny african island nation cape verde has beautiful beaches.
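The sliding-window construction of pre-training pairs can be sketched as below. The exact sentence-packing heuristic is an assumption, and `build_pretraining_pairs` is a hypothetical helper:

```python
def build_pretraining_pairs(sentences, max_src=400, max_tgt=100):
    """Slide over a document one sentence at a time: pack consecutive
    sentences of up to `max_src` tokens as the encoder input x, then the
    following sentences of up to `max_tgt` tokens as the decoder target y."""
    pairs = []
    for start in range(len(sentences)):
        src, tgt, n = [], [], 0
        i = start
        while i < len(sentences) and n + len(sentences[i]) <= max_src:
            src.extend(sentences[i]); n += len(sentences[i]); i += 1
        n = 0
        while i < len(sentences) and n + len(sentences[i]) <= max_tgt:
            tgt.extend(sentences[i]); n += len(sentences[i]); i += 1
        if src and tgt:
            pairs.append((src, tgt))
    return pairs

# A toy document of four "sentences" with the given token counts.
doc = [["a"] * 200, ["b"] * 150, ["c"] * 80, ["d"] * 90]
pairs = build_pretraining_pairs(doc)
```

Applied with a one-sentence stride over the whole corpus, this kind of procedure yields the (x, y) pairs described above.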

Unsupervised Pre-training
To understand the performance of PALM pre-training, we compare the generation quality of the pre-trained models of PALM and MASS (the released checkpoint at https://modelrelease.blob.core.windows.net/mass/mass_summarization_1024.pth). Specifically, we feed a few sentences from a news article to both pre-trained models, and the models generate a continuation of the input sentences by beam search with a beam size of 5. News articles from CNN (https://drive.google.com/uc?export=download&id=0BwmD_VLjROrfTHk4NFg2SndKcjQ) are used as input text to eliminate the possibility of the text being present in the models' pre-training corpus, i.e., Wikipedia and BookCorpus.
The overall perplexity of PALM is 17.22, which is much better than MASS's perplexity of 170.32, indicating PALM's better language modeling. Table 1 illustrates a couple of example continuations generated by PALM and MASS. In both examples, PALM generates fluent and grammatical English, while MASS outputs a short sentence that is much less relevant to the input text, since the MASS model was trained on individual sentences. In the first example, it is interesting to observe that, in addition to summarizing the input content, PALM is able to make a non-trivial inference of the expected auction price and the final selling price of the car (though it might not be factually accurate). An inference is also made by PALM in the second example in addition to summarization, although the Cape Verde Islands National Park does not really exist.
These examples demonstrate that PALM pre-training has learned to infer and reason from the input text. Although in the pre-training phase the generated content may not be factually accurate in the absence of rich context, the capability of inference can be transferred downstream by fine-tuning on specific generation tasks.

Fine-tuning on Generative QA
We also experiment with fine-tuning PALM on several downstream generation tasks. The MARCO benchmark (Nguyen et al., 2016) released by Microsoft is the best fit for evaluating generative QA models. In the MARCO dataset, the questions are  user queries issued to the Bing search engine and the contextual passages are from real web documents. The data has been split into a training set (153,725 QA pairs), a dev set (12,467 QA pairs) and a test set (101,092 questions with unpublished answers). To evaluate the generative capability, we focus on the Q&A + Natural Language Generation task, the goal of which is to provide the best answer available in natural language that could be used by a smart device / digital assistant. The answers are human-generated and not necessarily sub-spans of the contextual passages, so we use the ROUGE-L (Lin, 2004) metric for our evaluation to measure the quality of generated answers against the ground truth.
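ROUGE-L scores a candidate against a reference via their longest common subsequence (LCS). A minimal sketch follows; the β weighting is an assumption, and the official MARCO evaluation script may differ in detail:

```python
def lcs_len(a, b):
    """Length of the longest common subsequence of token lists a and b,
    computed with standard dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            dp[i + 1][j + 1] = dp[i][j] + 1 if x == y else max(dp[i][j + 1], dp[i + 1][j])
    return dp[-1][-1]

def rouge_l(candidate, reference, beta=1.2):
    """ROUGE-L F-measure: LCS-based precision and recall combined with
    a recall-favoring weight beta (the value 1.2 is an assumed default)."""
    lcs = lcs_len(candidate, reference)
    if lcs == 0:
        return 0.0
    p = lcs / len(candidate)
    r = lcs / len(reference)
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)

score = rouge_l("the car sold for a million".split(),
                "the car was sold for one million".split())
```

Because the LCS requires only in-order (not contiguous) matches, ROUGE-L rewards answers that preserve the reference's content and word order without demanding exact n-gram overlap.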
We fine-tune the pre-trained PALM on the MARCO training set for 10 epochs. We set the batch size to 64, the learning rate to 1e-5, and the maximum input length to 512. The other hyperparameters are kept the same as in pre-training. In fine-tuning PALM, the encoder takes as input x a contextual passage concatenated with a question at the end, and the decoder takes an answer as input y. During decoding, we use beam search with a beam size of 5. Table 2 presents the answer generation results on the test set obtained from the official MARCO leaderboard. PALM achieves 1st place on the leaderboard, outperforming all competing methods in generation quality. Note that PALM pre-trains a single model, while some of the top-performing methods on the leaderboard are ensemble models, such as Masque. Crucially, the superiority of PALM-single over Masque-ensemble with pre-trained ELMo (Peters et al., 2018) and over BERT-based methods clearly demonstrates the effectiveness and generalizability of PALM over the other pre-training approaches in language modeling.

Fine-tuning on Summarization
Text summarization produces a concise and fluent summary conveying the key information in the input (e.g., a news article). We focus on abstractive summarization, a generation task where the summary is not constrained to reusing the phrases or sentences in the input text. Following MASS, we use the Gigaword dataset (Graff and Cieri, 2003) for model fine-tuning and evaluation, which consists of a total of 3.8M article-title pairs in English. We take the articles as the input to the encoder and the titles as the target for the decoder. We adopt the same optimization hyperparameters from generative QA fine-tuning for the summarization task. The F1 scores of Rouge-1, Rouge-2 and Rouge-L are reported on the Gigaword test set for evaluation. As shown in Table 3, PALM achieves better performance than all existing abstractive summarization models. It is worth noting that UniLM, MASS, BERT+LM and DAE are pre-trained on an unlabeled corpus before supervised fine-tuning on the summarization data. By consistently outperforming these pre-training methods, PALM confirms its effectiveness in leveraging unsupervised signals for language generation.

Fine-tuning on Response Generation
Conversational response generation aims to produce a flexible response to a conversation (Vinyals and Le, 2015). Following MASS, we conduct experiments on the Cornell Movie Dialog corpus (Danescu-Niculescu-Mizil and Lee, 2011), which contains 140K conversation pairs, and use the training/test splits provided by the dataset. The same training hyperparameters from generative QA fine-tuning are adopted for the response generation task. We report the results in perplexity, following (Vinyals and Le, 2015) (lower is better). We compare PALM with competing methods including the baseline trained on the available data pairs and the pre-trained BERT+LM and MASS. Following MASS, we train every model on 10K randomly sampled pairs and on all 110K training pairs. As shown in Table 4, PALM performs significantly better than all the competitors by a large margin on both the 10K and 110K data, demonstrating its capability in generating responses to context thanks to its new pre-training objectives.

Related Work
ELMo (Peters et al., 2018) is an early prominent pre-training method based on bidirectional LSTMs. It concatenates left-only and right-only representations, but does not pre-train interactions between these features. GPT (Radford, 2018) and GPT-2 (Radford et al., 2019) base language modeling on the Transformer architecture, using only the Transformer decoder for pre-training. Edunov et al. (2019) examine different strategies (e.g., ELMo) for adding contextualized embeddings to sequence-to-sequence models, and observe the most improvement from adding the learned embeddings to the encoder.
BERT (Devlin et al., 2018) introduces Masked Language Modelling, which allows pre-training to learn interactions between left and right context words. Recent work has shown that very strong performance can be achieved by training for longer (Liu et al., 2019), by tying parameters across layers (Lan et al., 2019), and by masking spans instead of words (Joshi et al., 2019). Predictions are not made autoregressively, reducing the effectiveness of BERT for generation tasks. UniLM (Dong et al., 2019) fine-tunes BERT with an ensemble of masks, some of which use only leftward context, allowing UniLM to be used for generation tasks. A difference between UniLM and PALM is that UniLM's predictions are conditionally independent, whereas PALM's are autoregressive. PALM reduces the mismatch between pre-training and context-conditioned generation tasks by forcing the decoder to predict the continuation of text input to the encoder on an unlabeled corpus.
MASS (Song et al., 2019) and BART (Lewis et al., 2019) are the two pre-training approaches most similar to PALM. In MASS, an input sequence in which a contiguous span of tokens is masked is mapped to a sequence consisting of the missing tokens, whereas BART is trained to reconstruct the original text from corrupted input with some masked tokens. The difference in input & output representations between PALM and MASS & BART is detailed in Section 3.2.

Conclusions
In this work, we propose PALM, a novel approach to pre-training an autoencoding and autoregressive language model on a large unlabeled corpus, designed to be fine-tuned on downstream generation conditioned on context. It is built upon an extension of the Transformer encoder-decoder, and jointly pre-trains the encoder and the decoder in an autoencoding denoising stage followed by an autoregressive generation stage.
With less training cost than that of existing pre-training approaches, PALM significantly advances the state-of-the-art results on a variety of context-conditioned generation applications, including generative QA (Rank 1 on the MARCO leaderboard), abstractive summarization and conversational response generation. It has been shown in prior work that training for more steps over a larger corpus can potentially improve the performance of pre-training. Our future work will explore the potential of training PALM for longer on much more unlabeled text data.