Denoising based Sequence-to-Sequence Pre-training for Text Generation

This paper presents PoDA (Pre-training of Denoising Autoencoders), a new sequence-to-sequence (seq2seq) pre-training method that learns representations suitable for text generation tasks. Unlike encoder-only (e.g., BERT) or decoder-only (e.g., OpenAI GPT) pre-training approaches, PoDA jointly pre-trains both the encoder and decoder by denoising noise-corrupted text, and it has the additional advantage of keeping the network architecture unchanged in the subsequent fine-tuning stage. We also design a hybrid of Transformer and pointer-generator networks as the backbone architecture for PoDA. We conduct experiments on two text generation tasks: abstractive summarization and grammatical error correction. Results on four datasets show that PoDA improves model performance over strong baselines without using any task-specific techniques and significantly speeds up convergence.


Introduction
Methods based on unsupervised pre-training and supervised fine-tuning for NLP have achieved phenomenal successes in the last two years. Most of the proposed methods in the literature choose language modeling or a variant of it as the pre-training task. After the pre-training stage, ELMo (Peters et al., 2018) and CoVe (McCann et al., 2017) directly use the learned representations as additional features for downstream tasks, while BERT (Devlin et al., 2018), ULMFiT (Howard and Ruder, 2018), XLM (Lample and Conneau, 2019), and OpenAI GPT (Radford et al., 2018, 2019) require fine-tuning both pre-trained parameters and task-specific parameters on labeled data. State-of-the-art performance has been significantly advanced for classification and sequence labeling tasks, such as natural language inference (Bowman et al., 2015), named-entity recognition, SQuAD question answering (Rajpurkar et al., 2016), etc.
However, little attention has been paid to pre-training for seq2seq text generation (Sutskever et al., 2014). A typical seq2seq network consists of a bidirectional encoder, a unidirectional decoder, and attention between the encoder and decoder. Previous work mainly focuses on encoder-only or decoder-only pre-training. For example, BERT pre-trains a bidirectional encoder, and OpenAI GPT pre-trains a language model, which is essentially a unidirectional decoder. Ramachandran et al. (2016) propose to train two independent language models for the encoder and decoder respectively. All of the aforementioned methods are only able to partially pre-train seq2seq networks, and therefore cannot unleash the full potential of transfer learning for text generation.
In this paper, we present PoDA, a denoising based pre-training method that is able to jointly pre-train all components of seq2seq networks. Like denoising autoencoders, PoDA works by denoising noise-corrupted text sequences. Any noising function that fits into the seq2seq framework can be used; we experiment with three types of noise: randomly shuffling, deleting, or replacing the words in a given sequence. PoDA is simple, easy to implement, and applicable to virtually all seq2seq architectures, including ConvS2S (Gehring et al., 2017) and Transformer (Vaswani et al., 2017). Here, we adopt a hybrid architecture of Transformer and pointer-generator networks (See et al., 2017). Transformer is effective at modeling long-distance dependencies, is highly parallelizable, and demonstrates good empirical performance. The pointer-generator network incorporates a copying mechanism (Gu et al., 2016; Gulcehre et al., 2016), which is helpful for most text generation tasks.

[Figure 1: PoDA model architecture, consisting of a Transformer encoder, a Transformer decoder, and a pointer-generator layer with copy attention. The masked loss is calculated only for selected target words. "<bos>" is a special begin-of-sequence padding symbol. The example input-output pair is explained in Section 2.2.]

The text corpora used for pre-training are the Billion Word Benchmark (Chelba et al., 2013) and English Wikipedia, both of which are publicly available and together consist of nearly 2.99 billion words. We conduct experiments on two abstractive summarization datasets (CNN/Daily Mail (See et al., 2017) and Gigaword (Rush et al., 2015)) and two grammatical error correction datasets (CoNLL-2014 (Ng et al., 2014) and JFLEG (Napoles et al., 2017)). With simple maximum likelihood training and no task-specific techniques, PoDA achieves superior or comparable performance against state-of-the-art systems and speeds up convergence on all four datasets.

Model Architecture
First, we design a seq2seq model as the backbone architecture of our proposed pre-training method, which is a combination of Transformer and pointer-generator networks, as shown in Figure 1.
The input representations are the sum of word embeddings and sinusoidal positional encodings. Both the Transformer encoder and the decoder consist of 6 transformer blocks, and each block is a multi-head self-attention layer followed by one layer of positionwise feedforward network.
For the output layer, we use a pointer-generator layer to allow both copying from the input sequence and generation from a fixed vocabulary. The implementation is detailed in the appendix.
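Though the exact implementation is left to the appendix, the mixture a pointer-generator layer computes at each decoding step can be sketched as follows (a minimal, illustrative version; function and variable names are our own):

```python
def pointer_generator_step(p_vocab, copy_attn, src_ids, p_gen):
    """Mix the generation and copy distributions for one decoding step.

    p_vocab:   softmax over the fixed vocabulary (list of floats summing to 1)
    copy_attn: copy attention weights over source positions (sums to 1)
    src_ids:   vocabulary ids of the source tokens
    p_gen:     scalar in [0, 1], probability of generating vs. copying
    """
    # Scale the generation distribution by p_gen ...
    p_final = [p_gen * p for p in p_vocab]
    # ... and scatter-add the scaled copy probabilities onto the ids of
    # the source tokens.
    for pos, tok in enumerate(src_ids):
        p_final[tok] += (1.0 - p_gen) * copy_attn[pos]
    return p_final
```

Because both input distributions sum to 1, the mixture is again a valid probability distribution; handling OOV source words additionally requires an extended vocabulary, which this sketch omits.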
As a side note, we want to point out that the seq2seq architecture is not limited to the one we propose; other networks such as ConvS2S or RNN-based seq2seq models are also applicable.
Pointer-generator networks are also not the only solution for handling out-of-vocabulary (OOV) words; subword-based methods such as SentencePiece (Kudo and Richardson, 2018) can be used, at the cost of making the input and output sequences longer.

Noising and Denoising
Similar to denoising autoencoders, PoDA involves two parts: noising and denoising. The noising part corrupts a given word sequence x = {x_i}_{i=1}^{n} and produces a noisy word sequence x' = {x'_j}_{j=1}^{n'}. The denoising part tries to recover x from x' using a seq2seq model.
We use three noising functions: randomly shuffle, delete, or replace the words in x. The details are shown in Algorithm 1, where N(0, σ) is a Gaussian distribution with mean 0 and variance σ, B(p) is a Bernoulli distribution, and Beta(α, β) is a beta distribution serving as the prior for B(p). Take the function DELETE (lines 10 to 15 in Algorithm 1) as an example: it first samples a Bernoulli distribution with expectation p from Beta(α, β), and then deletes each word with probability p. The Beta(α, β) prior makes the model robust to different degrees of noise.
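A minimal sketch of the three noising functions follows. The Beta parameterization mirrors Algorithm 1 as described above, but details such as how the shuffle uses N(0, σ) are our assumptions, modeled on common practice:

```python
import random

def shuffle_words(words, sigma=0.5):
    # Jitter each position with Gaussian noise and re-sort: a local shuffle.
    keys = [i + random.gauss(0.0, sigma) for i in range(len(words))]
    return [w for _, w in sorted(zip(keys, words), key=lambda t: t[0])]

def delete_words(words, alpha, beta):
    # Sample the deletion rate p ~ Beta(alpha, beta),
    # then drop each word with probability p.
    p = random.betavariate(alpha, beta)
    return [w for w in words if random.random() >= p]

def replace_words(words, alpha, beta, vocab):
    # Sample the replacement rate p, then swap words for samples from the
    # vocabulary (the paper samples from the unigram distribution; a uniform
    # choice is used here for brevity).
    p = random.betavariate(alpha, beta)
    return [random.choice(vocab) if random.random() < p else w for w in words]
```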
We exemplify these operations in Figure 1. The original word sequence is x = "The fox jumps over the lazy dog .". After three noising operations (delete "The", replace "jumps" with "fly", and swap "lazy" and "dog"), we get the noisy word sequence x' = "fox fly over the dog lazy .".
The denoising part maximizes the conditional probability p(x|x'), which can be factorized as p(x|x') = ∏_{i=1}^{n} p(x_i | x', x_{<i}). When predicting x_i, the model is conditioned on the noise-corrupted full context x' and the clean left context x_{<i}. This shows that our seq2seq formulation is capable of unifying both encoder-only and decoder-only pre-training methods: a bidirectional language model as used by BERT can be seen as simulating p(x_i | x'), while a traditional unidirectional language model as used by OpenAI GPT resembles p(x_i | x_{<i}).
Like BERT, we add a mask to the target sequence when computing the loss function. To force the model to learn meaningful representations instead of copying from the input most of the time, only the positions whose corresponding words are corrupted in the input are kept. We also keep a small percentage (3%) of positions where the words are not corrupted, so that the model can learn to copy from the input when appropriate. The training loss with mask M (the set of kept positions) is then L(Θ) = −∑_{i∈M} log p(x_i | x', x_{<i}; Θ), where Θ denotes the model parameters. Empirically, we set σ = 0.5 for the Gaussian distribution, and α and β are chosen so that the Beta distribution has mean 0.15 and standard deviation 0.03. For pre-training, we use two text corpora: the full dump of English Wikipedia and the Billion Word Benchmark, as shown in Table 1. For English Wikipedia, we remove paragraphs with fewer than 3 words or more than 30% OOV words, and each paragraph is split into text segments of no more than 128 words each. The Billion Word Benchmark is a sentence-level corpus; sentences with more than 500 words are ignored during training.
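The masked loss above can be sketched as follows (a toy version with hypothetical inputs; in the real model the per-position log-probabilities come from the Transformer decoder):

```python
import random

def masked_nll(log_probs, targets, corrupted, keep_clean_rate=0.03, rng=None):
    """Average negative log-likelihood over masked positions only.

    log_probs: per-position dicts mapping token -> log p(token | x', x_<i)
    targets:   the clean target tokens x_i
    corrupted: per-position flags, True where x_i was noised in the input
    """
    rng = rng or random.Random(0)
    loss, count = 0.0, 0
    for lp, tok, noisy in zip(log_probs, targets, corrupted):
        # Corrupted positions always contribute; clean ones only rarely (3%).
        if noisy or rng.random() < keep_clean_rate:
            loss -= lp[tok]
            count += 1
    return loss / max(count, 1)
```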

Pre-training Procedure
The pre-training is performed on 4 GPUs using synchronous data parallelism; gradients are averaged across the GPUs. Each batch on a single GPU consists of at most 3000 tokens. We pre-train the network for 5 million iterations, which is roughly 14 epochs over the entire dataset. The final perplexity on the validation set is about 6.8. Each epoch takes approximately 2 days. Details on the network hyperparameters and optimizers are given in Section 3.1.

Fine-tuning Procedure
With our pre-training method, we do not need to change the network architecture during the fine-tuning stage, since both the pre-training task and text generation tasks take a source sequence as input and return a target sequence as output. The network is initialized with pre-trained parameter values. For fine-tuning, the preprocessing is dataset-specific, but the learning rate scheduling, dropout, early stopping, and gradient clipping are exactly the same as in pre-training.
The objective function for fine-tuning is the word-level negative log-likelihood. Here we do not use reinforcement learning to tune towards automatic evaluation metrics such as ROUGE (Lin, 2004) or BLEU (Papineni et al., 2002), because it may overfit the evaluation metrics and barely shows improvements in human evaluations (Wu et al., 2016).

Setup
The network architecture used in our experiments has 97 million parameters. It consists of 6 encoder blocks, 6 decoder blocks, and 1 pointer-generator layer. The hidden size of each positionwise feedforward layer is 4096. We use 8 heads for all multi-head attention layers. The vocabulary consists of the top 50k most frequent words (case-sensitive), and the dimension of the word embeddings is 512. We tie the parameters of the encoder word embeddings, the decoder word embeddings, and the output softmax layer. The NAG (Nesterov Accelerated Gradient) optimizer is used with an initial learning rate of 2 × 10^-3. Dropout of 0.2 is applied to all self-attention layers, positionwise feedforward layers, and input embedding layers. The gradient norm is clipped to a maximum value of 2. We follow the Transformer implementation from fairseq.
For task-specific fine-tuning, unless explicitly specified, we reuse the hyperparameters from the pre-training stage. After each training epoch, we compute the validation loss and halve the learning rate whenever the validation loss stops decreasing. The training procedure terminates when the learning rate drops below 10^-4. Exponential moving averaging (EMA) with decay rate 0.9995 is used to stabilize training. At inference time, we use standard beam search decoding based on the length-normalized log-likelihood. For ensemble models, we fine-tune with different random seeds and pre-trained checkpoints, and ensemble decoding averages the output probabilities of the different models at every decoding step.
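The learning rate schedule and parameter averaging described above can be sketched as follows (illustrative only; a real implementation would hook into the fairseq training loop):

```python
class HalveOnPlateau:
    """Halve the learning rate whenever validation loss stops decreasing."""

    def __init__(self, lr=2e-3, min_lr=1e-4):
        self.lr, self.min_lr, self.best = lr, min_lr, float("inf")

    def step(self, val_loss):
        """Call once per epoch; returns False once lr falls below min_lr."""
        if val_loss >= self.best:
            self.lr *= 0.5          # no improvement: halve the rate
        else:
            self.best = val_loss
        return self.lr >= self.min_lr

def ema_update(shadow, params, decay=0.9995):
    # One exponential-moving-average step over a dict of named parameters.
    return {k: decay * shadow[k] + (1 - decay) * params[k] for k in shadow}
```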
When reporting experimental results, "PoDA w/o pre-training" refers to the proposed architecture in Section 2.1 trained only on the supervised data, and "PoDA w/o fine-tuning" only pre-trains on unlabeled data. PoDA first pre-trains a denoising autoencoder and then fine-tunes on the supervised data.

Abstractive Summarization
Datasets We use two summarization datasets: CNN/Daily Mail (See et al., 2017) and Gigaword (Rush et al., 2015). The official split for training, validation, and test is shown in Table 2. The CNN/Daily Mail dataset contains approximately 300k news articles with an average of 781 words per article, and each article is paired with summaries of 56 words on average. We use the preprocessing script provided by See et al. (2017). The articles are truncated to 800 words for both training and testing, and the summaries are truncated to 130 words for training.
Gigaword is a headline-generation dataset consisting of nearly 4 million examples. Headline generation can be seen as a sentence summarization task. Each example in Gigaword consists of one sentence with an average length of 31.3 words, much shorter than CNN/Daily Mail articles, and one short headline with an average length of 8.3 words. The Gigaword dataset provided by Rush et al. (2015) is already tokenized and lower-cased; since our vocabulary is case-sensitive, this inconsistency is expected to hurt our system's performance. Evaluation We report evaluation results in terms of ROUGE-1, ROUGE-2, and ROUGE-L (Lin, 2004) scores, as shown in Table 3. "rnn-ext+RL" combines both extractive and abstractive methods and achieves performance improvements.
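To make the metric concrete, here is a rough sketch of sentence-level ROUGE-1; the official ROUGE toolkit additionally performs stemming and bootstrap resampling, which are omitted here:

```python
from collections import Counter

def rouge1_f(candidate, reference):
    # Clipped unigram overlap between candidate and reference token lists.
    overlap = sum((Counter(candidate) & Counter(reference)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate)
    recall = overlap / len(reference)
    return 2 * precision * recall / (precision + recall)
```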

Results for Gigaword
The Gigaword dataset is much larger than CNN/Daily Mail, which enables "PoDA w/o pre-training" to reach competitive performance even without pre-training. Results are shown in Table 4.

Grammatical Error Correction (GEC)
Datasets GEC can also be seen as a text generation task, where the input sequence is a sentence possibly containing some grammatical errors, and the output is a clean, grammatical sentence. We evaluate PoDA on two GEC datasets: CoNLL-2014 (Ng et al., 2014) and JFLEG (Napoles et al., 2017). We use three public datasets for training: Lang-8 NAIST (Mizumoto et al., 2011), NUCLE (Dahlmeier et al., 2013), and CLC FCE (Felice et al., 2014). The test set of the CoNLL-2013 shared task is used as the validation set for the CoNLL-2014 task; JFLEG has its own validation set. For preprocessing, we use NLTK to tokenize sentences, and we remove all sentence pairs without any edits from Lang-8 NAIST.
Simple spelling errors are corrected based on edit distance. The dataset statistics are shown in Table 5. PoDA improves the GLEU score from 56.52 to 59.02 (+2.50) for JFLEG. By ensembling 4 models initialized with different pre-trained checkpoints and trained with different random seeds, the performance can be further boosted on both datasets (+0.94 F0.5 for CoNLL-2014 and +0.46 GLEU for JFLEG), outperforming other ensemble models such as "Hybrid SMT-NMT". We also report the performance of "PoDA w/o fine-tuning", which does not conduct fine-tuning. The F0.5 score only reaches 20.86 on the CoNLL-2014 dataset, and the GLEU score is 36.83 on JFLEG. These results are even worse than the weakest baselines in Table 6 and Table 7. Denoising based pre-training and the GEC task share some similarities, in the sense that both attempt to convert noisy text into clean and grammatical text. However, the poor results of "PoDA w/o fine-tuning" show that PoDA cannot be seen as a simple data augmentation method for GEC. Instead, PoDA learns generic text representations and requires task-specific fine-tuning.
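The F0.5 measure reported for CoNLL-2014 weights precision twice as heavily as recall, reflecting the view that proposing a wrong correction is worse than missing one. A minimal sketch computed from edit counts:

```python
def f_beta(tp, fp, fn, beta=0.5):
    # tp: correct edits proposed, fp: wrong edits proposed,
    # fn: gold edits the system missed.
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```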
Techniques from previous work on GEC, such as language model based reranking (Chollampatt and Ng, 2018a), data augmentation (Ge et al., 2018a), and domain adaptation, can easily be incorporated. A parallel work (Zhao et al., 2019) observes similar gains by combining a simpler pre-training strategy with various GEC-specific techniques.

Analysis
In the following analysis, we choose only one task (summarization or GEC) to analyze each aspect, due to space limitations. Similar conclusions also hold for the other task.

Linguistic Quality Analysis
In Table 8, we show some summaries generated by PoDA on the Gigaword dataset. In the first example, PoDA successfully deletes the relatively unimportant modifier "ivory coast striker" and keeps the picture complete by including both "bremen" and "saint-etienne" in the summary. In the second example, "PoDA w/o pre-training" misses an important date ("first quarter") of the event "economic crisis". Both examples show that our model is able to identify the important text snippets in the source sequence and organize them into a short and fluent summary.
More example outputs by PoDA are listed in the appendix.

Table 8 examples:

Source: ivory coast striker boubacar sanogo is set to leave werder bremen for french first division side saint-etienne.
Target: sanogo set to sign for saint-etienne
PoDA w/o pre-training: ivory coast striker sanogo set to join saint-etienne
PoDA: sanogo set to leave bremen for saint-etienne

Source: thailand's battered economy should start to bottom out in the first quarter of #### provided the government's efforts are n't neutralized by outside developments, a news report said monday.
Target: economic crisis to bottom out early next year minister says
PoDA w/o pre-training: thai economy expected to start to bottom out in UNK
PoDA: thai economy to start to bottom out in first quarter

In Figure 2, we show the F0.5 score on the CoNLL-2014 dataset when the model is initialized with different pre-trained checkpoints. Though the F0.5 score has some fluctuations due to random factors in training neural networks and the limited size of the test set, the overall trend is very clear: the F0.5 score first improves greatly and then keeps improving slowly after about 1 million iterations, from 54 at the very beginning to 59 after convergence.

Effects of Pre-trained Encoder and Decoder
To show the effectiveness of the pre-trained encoder and decoder, we train the model using only the encoder-side pre-trained parameters ("w/o pre-trained decoder") or only the decoder-side pre-trained parameters ("w/o pre-trained encoder"). We do not compare with a pre-trained encoder from BERT or a pre-trained decoder from OpenAI GPT, mainly because the corresponding model capacities, tokenization, and text corpora used for pre-training are very different. Table 9 shows that performance degrades by a large margin if the network is only partially pre-trained. The pre-trained encoder (−3.42 drop in F0.5) is more important than the pre-trained decoder (−2.42 drop in F0.5).

Effects of Dataset Size
We also conduct ablations in few-shot learning settings to see how performance changes when the model has access to only a small fraction of the labeled data. We randomly sample 10^3 to 10^5 training examples from the Gigaword dataset and train "PoDA w/o pre-training" and PoDA (with pre-training) using exactly the same hyperparameters. Figure 3 shows the ROUGE-1 score on the Gigaword test set. With only 10^3 training examples, PoDA reaches a reasonably good performance, comparable to ABS+ (an attention-based system trained on nearly 4 million examples). With more labeled data available, the performance gap between "PoDA w/o pre-training" and PoDA slowly decreases from 15 to 2 in Figure 3. However, pre-training still helps even when the models are trained on the full dataset (shown in Table 4).

Convergence Analysis
Pre-training not only achieves better final performance but also helps the model converge faster.
In Figure 4, we show the validation perplexity after each training epoch for both "PoDA w/o pre-training" and PoDA.
We can clearly see that the validation perplexity of PoDA is consistently lower than that of "PoDA w/o pre-training", especially in the first few epochs. After 5 epochs, PoDA reaches a validation perplexity that "PoDA w/o pre-training" typically needs 30 or more epochs to reach, on both the CNN/Daily Mail and Gigaword datasets. This property is particularly helpful when computational resources are limited. Other pre-training methods such as BERT demonstrate similar behavior.

Related Work
Network Pre-training The idea of pre-training neural networks dates back to the early days of deep learning. Bengio et al. (2007) proposed layer-wise pre-training for deep belief networks (DBN), based on a reconstruction objective, to tackle the difficulty of training deep neural networks. Erhan et al. (2010) and Dahl et al. (2012) showed the effectiveness of pre-training for tasks such as speech recognition. In computer vision, using ImageNet pre-trained models has become standard practice. In the NLP community, using pre-trained word embeddings is the most popular way to transfer knowledge from unlabeled corpora. There is also work on semi-supervised sequence learning (Dai and Le, 2015; Peters et al., 2017) that incorporates language modeling as an auxiliary task. Recently, several pre-training methods based on language models have been presented, such as ELMo (Peters et al., 2018), OpenAI GPT (Radford et al., 2018), BERT (Devlin et al., 2018), and XLM (Lample and Conneau, 2019). The combination of more compute, larger model capacity, and large-scale text corpora leads to significant improvements on NLP benchmarks.
Autoencoders have long been used for representation learning of images and text (Li et al., 2015). However, precisely reconstructing the clean input is probably too easy for high-capacity models. Sparse autoencoders (Deng et al., 2013), contractive autoencoders (Rifai et al., 2011), and denoising autoencoders are several popular variants. Denoising autoencoders (DAs) have been shown to learn better representations for downstream tasks (Vincent et al., 2008; Hill et al., 2016). Freitag and Roy (2018) use seq2seq DAs for unsupervised natural language generation in dialogue, and Kim et al. (2018) propose to improve the quality of machine translation with DAs.
Text Generation covers a wide spectrum of NLP tasks, including machine translation (Wu et al., 2016), summarization (See et al., 2017), response generation (Vinyals and Le, 2015), paraphrase generation, grammatical error correction, etc. Early studies on text generation mainly adopt template-based (Reiter and Dale, 2000) or example-based (Watanabe and Takeda, 1998) methods. With the emergence of deep learning for NLP, seq2seq models (Sutskever et al., 2014) became a popular choice for text generation tasks, showing better performance in terms of both automatic evaluation metrics and human evaluations (Wu et al., 2016). There are also studies focusing on text generation from structured data, such as SQL-to-text. Previous pre-training for text generation is usually done by independently pre-training encoder-side or decoder-side language models (Ramachandran et al., 2016). Concurrent to our work, Edunov et al. augment encoder representations with ELMo-style models, MASS (Song et al., 2019) masks continuous text fragments for pre-training, and UNILM (Dong et al., 2019) proposes to pre-train for both language understanding and generation tasks.

Conclusion
This paper presents PoDA, a new transfer learning approach for seq2seq text generation. It involves two steps: first, pre-train a customized seq2seq denoising autoencoder on large-scale unlabeled text corpora; then, fine-tune on in-domain labeled data. The pre-training step is independent of downstream tasks and jointly learns both encoder and decoder representations. PoDA is simple, intuitive, and does not require changing the network architecture during the fine-tuning stage. Experiments on several abstractive summarization and grammatical error correction datasets demonstrate that PoDA leads to better performance and faster convergence.
For future work, we would like to validate our model on other tasks such as response generation, explore more effective unsupervised sequence-to-sequence pre-training methods, and handle cross-lingual tasks such as machine translation.

Appendix: example GEC outputs

Source: I do think there is difference in it and I believe many of us will agree.
Target: I do think there is a difference and I believe many of us will agree.
PoDA w/o pre-training: I do think there is difference in it and I believe many of us will agree.
PoDA: I do think there is a difference in it and I believe many of us will agree.

Source: Almost all students and young adults possess the Facebook or Twitter account.
Target: Almost all students and young adults possess a Facebook or Twitter account.
PoDA w/o pre-training: Almost all students and young adults possess Facebook or Twitter accounts.
PoDA: Almost all students and young adults possess a Facebook or Twitter account.

Source: But, on the contrary, he argues that fluoride also some disadvantage.
Target: But, on the contrary, he argues that fluoride also has some disadvantages.
PoDA w/o pre-training: But, on the contrary, he argues that there are also some disadvantages.
PoDA: But, on the contrary, he argues that fluoride also has some disadvantages.

Source: Such people impressed other people through their strong well and divoution to duty.
Target: Such people impressed others through their strong will and devotion to duty.
PoDA w/o pre-training: Such people impressed other people through their strong and divoution to duty.
PoDA: Such people impressed other people through their strong will and devotion to duty.
Table 12: Input-output samples from our pre-trained PoDA models. We feed in incomplete sentences and generate output sentences by sampling from the output distribution of the pre-trained model at each timestep.

Input: the best university in the world.
Output samples:
Harvard is considered the 10th best university in the world.
I believe that's the best university in the world.
Nevada offers the best continuing education in the whole world.
Greek is becoming the best university in the world.

Input: The meaning of life is.
Output samples:
The meaning of daily life is unclear.
"The real meaning of daily life is peaceful."
The underlying meaning of your life is lost forever.
The immediate meaning of our political life is undeniable.

We can see that PoDA successfully transforms the inputs into coherent and grammatical sentences, though the statements entailed by these output sentences are not always correct.