Progressive Generation of Long Text with Pretrained Language Models

Large-scale language models (LMs) pretrained on massive corpora of text, such as GPT-2, are powerful open-domain text generators. However, as our systematic examination reveals, it is still challenging for such models to generate coherent long passages of text (e.g., 1000 tokens), especially when the models are fine-tuned to the target domain on a small corpus. Previous planning-then-generation methods also fall short of producing such long text in various domains. To overcome the limitations, we propose a simple but effective method of generating text in a progressive manner, inspired by generating images from low to high resolution. Our method first produces domain-specific content keywords and then progressively refines them into complete passages in multiple stages. The simple design allows our approach to take advantage of pretrained LMs at each stage and effectively adapt to any target domain given only a small set of examples. We conduct a comprehensive empirical study with a broad set of evaluation metrics, and show that our approach significantly improves upon the fine-tuned large LMs and various planning-then-generation methods in terms of quality and sample efficiency. Human evaluation also validates that our model generations are more coherent.


Introduction
Generating coherent long text (e.g., 1000s of tokens) is useful in myriad applications that involve creating reports, essays, and other long-form content. Yet the problem is particularly challenging, as it requires models to capture global context, plan content, and produce local words in a consistent manner. Prior studies on "long" text generation have typically been limited to outputs of 50-200 tokens (Shen et al., 2019; Bosselut et al., 2018; Zhao et al., 2020).
Footnote 1: Code available at https://github.com/tanyuqian/progressive-generation
Figure 1: Results of large-scale LMs (GPT-2 and BART) fine-tuned on 10K stories. Coherence of text is evaluated by the BERT next sentence prediction (NSP) score, where the x-axis is the position of the evaluated sentences in the passage. There is a significant gap in coherence between human text and text generated by large-scale LMs. Our proposed ProGen instead generates more coherent samples that are close to human text.
Recent large-scale pretrained language models (LMs), such as GPT-2 (Radford et al., 2019) and BART (Lewis et al., 2020), have emerged as impressive open-ended text generators capable of producing surprisingly fluent text. These massive LMs are typically pretrained once on large corpora of generic text, and then fine-tuned on small domain-specific data. The latest work has mostly focused on the regime of relatively short text in the low hundreds of tokens. For example, Holtzman et al. (2020), See et al. (2019), and Hua and Wang (2020) studied GPT-2 and BART generations with a maximum length ranging from 150 to 350 tokens. In this work, we study the problem of generating coherent, much longer passages of text (e.g., 1000 tokens). GPT-3 (Brown et al., 2020) was reported to produce long essays, yet the results seem to need extensive human curation (e.g., MarketMuse; Gardian), and the system is not publicly available to adapt to arbitrary desired domains.
In this work, we examine fine-tuning of large-scale LMs for domain-specific generation of extra-long text. We find that samples produced by GPT-2 fine-tuned on small domain-specific corpora exhibit various imperfections, including excessive repetitiveness and incoherence between sentences that are far apart. Figure 1 measures the coherence of text generated by the fine-tuned GPT-2 using the BERT next sentence prediction score (Devlin et al., 2019). As the figure shows, GPT-2 models (regardless of model size) exhibit a significant gap in the score compared with human text, and hence fall short of generating coherent text.
We hypothesize that the problem is mainly caused by the sequential generation order of the LMs, which makes global content planning of the passage difficult, especially when the generated text is long and contains thousands of words. One could potentially adopt the recent planning-then-generation or non-monotonic methods (Sec 2), yet those methods either require specialized neural architectures that need costly retraining for each domain (Gu et al., 2019; Fan et al., 2019), or rely on dedicated intermediate content plans (e.g., summaries, SRL labels) (Fan et al., 2019; Yao et al., 2019) that offer limited flexibility and produce sub-optimal results, as shown in our experiments.
To overcome the limitations, we introduce a new method for Progressive Generation of Text (ProGen). We observe that generation of some words (e.g., stop words) does not require much context, while other words are decisive and have long-term impact on the whole content of the passage. Motivated by this observation, our approach first produces a sequence of the most informative words, then progressively refines the sequence by adding finer-grained details in multiple stages, until completing a full passage. The generation at each stage is conditioned on the output of the preceding stage, which provides anchors and steers the current generation (Figure 2). The intermediate words produced at each stage are defined based on a simple TF-IDF informativeness metric.
The approach enjoys several core advantages: (1) Although the progressive approach implements a conceptually non-monotonic generation process, generation at each stage can still be performed in a left-to-right manner and thus is directly compatible with the powerful pretrained monotonic LMs. The LMs at different stages are easily fine-tuned to accommodate a target domain using only small, independently constructed data. Intuitively, each LM addresses a sub-task of mapping a sequence to a finer-resolution one, which is much simpler than the overall task of mapping from conditions to full passages of text. In this work, we use BART (Lewis et al., 2020) for generation at each stage, though one can also plug in other off-the-shelf LMs. As seen from Figure 1, ProGen can generate much more coherent text compared with GPT-2 and nearly matches human text in terms of the BERT-NSP score; (2) In contrast to the typical 2-stage planning-then-generation in prior work, the simple progressive strategy offers added flexibility for an arbitrary number of intermediate stages, yielding improved results; (3) The training data for each stage is extracted from the domain corpus using the simple TF-IDF metric, without the need for additional resources (e.g., pretrained summarization models) as in prior work, making the method broadly applicable to various domains and languages.
We conduct extensive empirical studies on the CNN News (Hermann et al., 2015) and WritingPrompts (Fan et al., 2018) corpora, evaluating various systems by a wide range of automatic metrics as well as human judgement. Results show that ProGen achieves strongly improved performance by decomposing the generation into more progressive stages. Our method produces diverse text passages of higher quality and coherence than a broad set of models, including fine-tuned GPT-2, BART, and various other planning-then-generation strategies.

Related Work
Content planning in generation. The idea of separate content planning and surface realization has been studied in early text generation systems (Reiter and Dale, 1997). Recent neural approaches have also adopted similar planning-then-generation strategies for data-to-text (Moryossef et al., 2019; Puduppully et al., 2019), storytelling (Fan et al., 2019; Yao et al., 2019), machine translation (Ford et al., 2018), and others (Hua and Wang, 2019; Yao et al., 2017). These models often involve customized architectures incompatible with the existing large LMs. Scaling those models for long text generation thus can require expensive training, which restricts systematic studies. On the other hand, it is possible to adopt some of the content planning strategies (e.g., summaries or SRL sequences as the plans (Fan et al., 2019)), and repurpose pretrained LMs for generation in each stage. However, these strategies, with dedicated intermediate plans and a pre-fixed number (typically 2) of stages, can have limited flexibility, leading to sub-optimal results as shown in our empirical study. Besides, creating training data for planning requires additional resources (e.g., pretrained summarization models or SRL models) which are not always available (e.g., in certain domains or for low-resource languages). In contrast, we propose a simple way of designing the intermediate stages based on word informativeness, which can flexibly increase the number of stages for improved results, and easily create training data for all stages without additional models.
Non-monotonic generation and refinement. Another relevant line of research is non-monotonic generation (Gu et al., 2019), infilling (Zhu et al., 2019; Shen et al., 2020; Qin et al., 2020), or refinement (Lee et al., 2018; Novak et al., 2016; Mansimov et al., 2019; Kasai et al., 2020), which differs from the restricted left-to-right generation in conventional LMs. Again, those approaches largely depend on specialized architectures and inference procedures, making them difficult to integrate with the powerful pretrained LMs. The prior studies have focused on generating short text. Our proposed coarse-to-fine progressive generation conceptually presents a non-monotonic process built upon the pretrained monotonic LMs, which permits fast adaptation to any target domain and generation of much longer text.
Long text generation. Previous work has made attempts to generate text of up to two or three hundred tokens. Those methods often adopt the similar idea of planning-then-generation as above (Shen et al., 2019; Zhao et al., 2020; Bosselut et al., 2018; See et al., 2019; Hua and Wang, 2020; Rashkin et al., 2020). Another line of work instead focuses on extending the transformer architecture (Vaswani et al., 2017) to model longer text sequences (e.g., Dai et al., 2019; Choromanski et al., 2021). For example, Liu et al. (2018) used a hybrid retrieval-generation architecture for producing long summaries; Dai et al. (2019) showed long text samples qualitatively. Our work systematically examines the pretrained LMs in generating long domain-specific text, and proposes a new approach that empowers pretrained LMs to produce samples of significantly higher quality.

Progressive Generation of Text
One of the main challenges in generating long coherent passages is modeling long-range dependencies across the entire sequences (e.g., 1000 tokens). We propose a progressive generation approach that is conceptually simple yet effective. Intuitively, progressive generation divides the complex problem of generating the full passage into a series of much easier steps of generating coarser-grained intermediate sequences. Contrary to generating everything from left to right from scratch, our progressive generation allows the model to first plan globally and then shift attention to increasingly finer details, which results in more coherent text. Figure 2 illustrates the generation process.

Generation Process
Let $y := [y_1, y_2, \ldots, y_T]$ be the output text, where each $y_i$ is a token of language (a word or a subword). The output sequences are generated either conditionally on any other information $x$ (e.g., generation of a story given a prompt), or unconditionally (in which case we assume $x \equiv \emptyset$ while keeping the same notation).
Instead of generating the full passage $y$ directly, we propose to add multiple intermediate stages: $x \to c_1 \to c_2 \to \cdots \to c_K \to y$, where for each stage $k \in \{1, \ldots, K\}$, $c_k$ is an intermediate sequence containing information of the passage at a certain granularity. For instance, at the first stage, $c_1$ can be seen as the highest-level content plan, consisting of the most informative tokens such as key entities. Then, based on the plan, we gradually refine it into subsequent $c_k$, each of which contains finer-grained information than that of the preceding stage. At the final stage, we refine $c_K$ into the full passage by adding the least informative words (e.g., stop words). The generation process corresponds to a decomposition of the conditional probability as:
$$P(y, \{c_k\} \mid x) = P(c_1 \mid x) \prod_{k=2}^{K} P(c_k \mid c_{k-1}, x) \; P(y \mid c_K, x).$$
Following the above intuition, the $c_k$ at early stages, as the high-level content plans, should contain informative or important words that serve as skeletons for subsequent enrichment. We next concretely define the order of generation, namely, which words each stage should generate. Specifically, we propose a simple method that constructs a vocabulary $\mathcal{V}_k$ for each stage $k$, based on the importance of words in the target domain. Each particular stage $k$ only produces tokens belonging to its vocabulary $\mathcal{V}_k$.
[Figure 2: illustration of the progressive generation process, shown on an example story passage.]
By the progressive nature of the generation process, we have $\mathcal{V}_1 \subset \cdots \subset \mathcal{V}_K \subset \mathcal{V}$. That is, $\mathcal{V}_1$ contains the smallest core set of words in the domain, and the vocabularies gradually expand at later stages until arriving at the full vocabulary $\mathcal{V}$. Note that vocabularies in later stages are supersets of those in earlier stages. This allows the later stages to remedy and polish potential mistakes made in earlier stages when necessary. We discuss the construction of the vocabularies below.
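The stage-by-stage process described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `progressive_generate` and `lookup_model` are hypothetical names, and the lookup tables stand in for the fine-tuned LMs, which in practice would be real conditional generators.

```python
def progressive_generate(x, stage_models):
    """Run x -> c_1 -> ... -> c_K -> y, returning the sequence produced at every stage.

    Each stage model receives the previous stage's sequence and the condition x.
    """
    sequence = list(x)
    trace = []
    for model in stage_models:
        sequence = model(sequence, x)
        trace.append(sequence)
    return trace


def lookup_model(table):
    """Hypothetical stand-in for a fine-tuned LM: maps an input sequence to a fixed output."""
    return lambda previous, x: list(table[tuple(previous)])


# Toy example with K = 2 intermediate stages followed by the final passage.
stages = [
    lookup_model({(): ["dog", "jeep"]}),
    lookup_model({("dog", "jeep"): ["dog", "circled", "jeep"]}),
    lookup_model({("dog", "circled", "jeep"): "the dog circled the jeep".split()}),
]
trace = progressive_generate([], stages)  # unconditional: x is empty
```

Here `trace` holds one sequence per stage, so the coarse plan and each refinement can be inspected separately.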
Stage-wise vocabularies based on word importance. Given a text corpus $D$ of the target domain with the full vocabulary $\mathcal{V}$, we define the importance scores of words in $\mathcal{V}$ based on the TF-IDF metric. We then rank all the words and assign the top $V_k$ words to the intermediate vocabulary $\mathcal{V}_k$. Here $V_k$ is a hyper-parameter controlling the size of $\mathcal{V}_k$. More concretely, for each word $w \in \mathcal{V}$, we first compute its standard TF-IDF score (Salton and McGill, 1986) in each document $d \in D$, which essentially measures how important $w$ is to $d$. The importance of the word $w$ in the domain is then defined as the average TF-IDF score across all documents containing $w$:
$$\mathrm{importance}(w, D) = \frac{\sum_{d \in D} \mathrm{TF\_IDF}(w, d)}{\mathrm{DF}(w, D)},$$
where $\mathrm{TF\_IDF}(w, d)$ is the TF-IDF score of word $w$ in document $d$, and $\mathrm{DF}(w, D)$ is the document frequency, i.e., the number of documents in the corpus that contain the word $w$.
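As a rough illustration of the vocabulary construction, the following Python sketch ranks words by their average TF-IDF score and takes the top-ranked words for each stage. The function names are hypothetical, and the particular TF-IDF variant (relative term frequency times log(N / DF)) is an assumption, since the exact variant is not specified here.

```python
import math
from collections import Counter


def importance_scores(docs):
    """Average TF-IDF of each word over the documents that contain it."""
    n_docs = len(docs)
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    totals = Counter()
    for doc in docs:
        for word, count in Counter(doc).items():
            # Assumed TF-IDF variant: relative term frequency times log(N / DF).
            totals[word] += (count / len(doc)) * math.log(n_docs / df[word])
    return {word: totals[word] / df[word] for word in df}


def stage_vocabularies(docs, sizes):
    """Return nested vocabularies holding the top-`size` words for each stage."""
    scores = importance_scores(docs)
    # Ties are broken alphabetically so the ranking is deterministic.
    ranked = sorted(scores, key=lambda word: (-scores[word], word))
    return [set(ranked[:size]) for size in sizes]


docs = [
    "the striker signed for the club".split(),
    "the club confirmed the transfer".split(),
    "the weather was mild".split(),
]
vocabs = stage_vocabularies(docs, sizes=[2, 4])
```

Because every stage takes a prefix of the same ranking, increasing sizes automatically yield nested vocabularies. A word that occurs in every document (such as "the" in the toy corpus) receives a score of zero under this variant and is ranked last.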
Pretrained language models as building blocks. Compared to many of the previous planning-then-generation and non-monotonic generation methods, one of the key advantages of our progressive generation design is its direct compatibility with the powerful pretrained LMs that perform left-to-right generation. Specifically, although our approach implements a non-monotonic generation process that produces important words first, we can still generate the intermediate sequence $c_k$ at each stage in a left-to-right manner. Thus, we can plug a pretrained LM, such as GPT-2 or BART, into each stage to carry out the generation. As described further in Section 3.2, for each stage $k$, we can conveniently construct stage-specific training data from the domain corpus $D$ using the stage-wise vocabulary $\mathcal{V}_k$, and fine-tune the stage-$k$ LM to generate intermediate sequences at that stage that pertain to the target domain.
One can add masks on the pretrained LM's token distributions to ensure that the stage-$k$ LM only produces tokens belonging to $\mathcal{V}_k$. In practice, we found this is not necessary, as the pretrained LM can usually learn the pattern quickly through fine-tuning and generate appropriate tokens during inference. In our experiments we use BART for all stages, since BART is an encoder-decoder model that can conveniently take as input the resulting sequence from the preceding stage and generate a new one. (For the first stage in an unconditional generation task, we simply set $x = \emptyset$.) We note that GPT-2 and other relevant pretrained LMs can also be used as conditional generators (Radford et al., 2019; Liu et al., 2018) and thus be plugged into any of the stages.
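For readers who do want the optional masking, a minimal sketch is given below. It is not part of the experiments reported here: the function name and the dictionary-based representation of logits are assumptions for illustration, and a real implementation would apply the mask to the LM's logit tensor at each decoding step.

```python
import math


def masked_distribution(logits, allowed):
    """Softmax over logits after sending tokens outside `allowed` to -inf.

    Assumes at least one allowed token is present in `logits`.
    """
    masked = {
        token: (score if token in allowed else float("-inf"))
        for token, score in logits.items()
    }
    top = max(masked.values())
    exps = {token: math.exp(score - top) for token, score in masked.items()}
    total = sum(exps.values())
    return {token: value / total for token, value in exps.items()}


# Toy next-token logits and a toy stage vocabulary.
dist = masked_distribution(
    {"the": 2.0, "dog": 1.0, "jeep": 0.5},
    allowed={"dog", "jeep"},
)
```

Tokens outside the stage vocabulary receive probability zero, and the remaining mass is renormalized over the allowed tokens.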

Training
Our approach permits straightforward training/fine-tuning of the (pretrained) LMs at different stages given the domain corpus $D$. In particular, we can easily construct independent training data for each stage, and train all LMs in parallel. Note that no additional resources such as pretrained summarization or semantic role labeling models are required as in previous work, making our approach directly applicable to a potentially broader set of domains and languages. We plan to explore the use of our method in multi-lingual settings in the future. More concretely, for each stage $k$, we use the stage vocabularies $\mathcal{V}_{k-1}$ and $\mathcal{V}_k$ to filter all relevant tokens in the documents as training data. That is, given a document, we extract the sub-sequence $c^*_{k-1}$ of all tokens in the document that belong to $\mathcal{V}_{k-1}$, and similarly extract the sub-sequence $c^*_k$ of tokens belonging to $\mathcal{V}_k$. The sequences $c^*_{k-1}$ and $c^*_k$ are then used as the input and the ground-truth output, respectively, for training the LM at stage $k$ with maximum likelihood learning. Therefore, given the stage-wise vocabularies $\{\mathcal{V}_k\}$, we can automatically extract training data from the domain corpus $D$ for different stages, and train the LMs separately.
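The extraction of stage-wise training pairs can be sketched as follows. The function names and the toy vocabularies are hypothetical; the sketch assumes the document is already tokenized and that the first stage takes the condition (empty for unconditional generation) as its input.

```python
def extract_subsequence(tokens, vocab):
    """Keep, in order, only the tokens that belong to the given stage vocabulary."""
    return [token for token in tokens if token in vocab]


def stage_training_pairs(tokens, vocabs, condition=()):
    """Build one (input, target) pair per stage from a single document.

    `vocabs` is the nested list of stage vocabularies; the last target is the full document.
    """
    targets = [extract_subsequence(tokens, vocab) for vocab in vocabs] + [list(tokens)]
    inputs = [list(condition)] + targets[:-1]
    return list(zip(inputs, targets))


document = "the dog circled the jeep and barked".split()
vocabs = [{"dog", "jeep"}, {"dog", "jeep", "circled", "barked"}]
pairs = stage_training_pairs(document, vocabs)
```

Each pair depends only on the document and the vocabularies, which is why the data for all stages can be built independently and the LMs trained in parallel.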
In the multi-stage generation, the intermediate sequences are not natural language. Yet we found that fine-tuning pretrained LMs (such as BART and GPT-2) to generate the intermediate sequences is indeed very efficient in terms of data and computation. We tried training other models such as small sequence-to-sequence models and n-gram models from scratch, which we found is much harder, requiring more data, or yielding inferior performance. This again highlights the importance of using pretrained LMs, as enabled by our simple method design.
Stage-level exposure bias and data noising. In the above training process, the outputs of each LM are conditioned on the ground-truth input sequences extracted from the real corpus. In contrast, at generation time, the LM takes as input the imperfect sequences produced at the previous stage, which can result in new mistakes in the outputs since the LM has never been exposed to noisy inputs during training. Thus, the discrepancy between training and generation can lead to mistakes accumulating through the stages. The phenomenon resembles the exposure bias issue (Ranzato et al., 2016) of sequential generation models at the token level, where the model is trained to predict the next token given the previous ground-truth tokens, while at generation time tokens generated by the model itself are instead used to make the next prediction.
To alleviate the issue and increase the robustness of each intermediate LM, we draw on the rich literature of addressing token-level exposure bias (Xie et al., 2017;Tan et al., 2019). Specifically, during training, we inject noise into the ground-truth inputs at each stage by randomly picking an n-gram (n ∈ {1, 2, 3, 4}) and replacing it with another randomly sampled n-gram. The data noising encourages the LMs to learn to recover from the mistakes in inputs, leading to a more robust system during generation.
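A minimal sketch of such n-gram noising is shown below. The details are assumptions rather than the exact procedure used here: each position independently triggers a replacement with a per-n probability, and the replacement n-gram is drawn from a hypothetical `pool` of tokens (for example, another document in the corpus).

```python
import random


def noise_sequence(tokens, pool, probs, rng):
    """Randomly replace n-grams in `tokens` with n-grams sampled from `pool`.

    `probs` maps n to the (assumed per-position) probability of replacing
    the n-gram starting at the current position. Length is preserved.
    """
    out = list(tokens)
    i = 0
    while i < len(out):
        replaced = False
        for n, p in probs.items():
            if i + n <= len(out) and len(pool) >= n and rng.random() < p:
                start = rng.randrange(len(pool) - n + 1)
                out[i:i + n] = pool[start:start + n]
                i += n
                replaced = True
                break
        if not replaced:
            i += 1
    return out


tokens = "dog circled jeep barked officer shoulder".split()
pool = "club transfer striker signed weather".split()
probs = {1: 0.1, 2: 0.05, 3: 0.025, 4: 0.0125}
noised = noise_sequence(tokens, pool, probs, random.Random(0))
```

Only the stage inputs are corrupted in this way; the ground-truth targets stay clean, so the LM is trained to produce correct output from imperfect input.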

Setup
Domains. We evaluate on two text generation domains: (1) CNN News (Hermann et al., 2015) for unconditional generation; (2) WritingPrompts (Fan et al., 2018) for conditional story generation, where the task is to generate a story given a prompt. The two datasets are chosen since they both contain long documents, with CNN's average and maximum lengths being 512 and 926, and WritingPrompts's being 437 and 942, respectively. To demonstrate the data efficiency of our approach in adapting to target domains, we sample 1,000 documents from each dataset for training.
Model configs. We use BART for all stages of generation. Due to computation limitations, we experiment with models with 2, 3, and 4 stages of generation. In our 2-stage model, the first stage covers about 25% of all content; in the 3-stage model, the first and second stages cover 15% and 25% of all content, respectively; and in the 4-stage model, the first three stages cover 15%, 20%, and 25% of all content. For model training, we follow the same protocol as See et al. (2019) and fine-tune all pretrained models until convergence. To combat exposure bias, we add noise to the training data as described in Sec 3.2, with probabilities of 0.1, 0.05, 0.025, and 0.0125 for replacing 1-, 2-, 3-, and 4-grams, respectively. In the generation phase, we use top-p decoding (Holtzman et al., 2020) with p = 0.95 to generate at most 1024 tokens. Experiments were conducted with RTX6000 GPUs. Model fine-tuning and generation took around 4 hours on a single GPU.
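For reference, the core step of top-p (nucleus) decoding can be sketched as below. This is a generic illustration with a hypothetical function name and a toy next-token distribution, not the decoding code used in the experiments: it keeps the smallest set of highest-probability tokens whose cumulative mass reaches p and renormalizes before sampling.

```python
import random


def top_p_filter(probs, p):
    """Keep the most probable tokens until their cumulative mass reaches p, then renormalize."""
    ranked = sorted(probs.items(), key=lambda item: -item[1])
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        mass += prob
        if mass >= p:
            break
    return {token: prob / mass for token, prob in kept}


# Toy next-token distribution; p = 0.9 is used only for this illustration.
next_token_probs = {"a": 0.5, "b": 0.3, "c": 0.15, "d": 0.05}
filtered = top_p_filter(next_token_probs, p=0.9)

rng = random.Random(0)
sampled = rng.choices(list(filtered), weights=list(filtered.values()), k=1)[0]
```

In the toy example the low-probability token "d" is cut off, so it can never be sampled, while the remaining tokens keep their relative proportions.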
Comparison methods. We compare with a wide range of baselines, categorized into two groups: (1) The large pretrained LMs, including BART (Lewis et al., 2020) and GPT-2 in both small and large sizes (Radford et al., 2019). These LMs generate text in a standard left-to-right manner; (2) Progressive generation with various strategies adopted in prior planning-then-generation work. As in our proposed method, each stage adapts a pretrained BART for generation. Specifically, Summary first generates a short summary text as the content plan and, conditioning on the summary, produces the full passage of text (Fan et al., 2019). For training, summaries are obtained using the state-of-the-art pretrained CNN news summarization model based on BART; Keyword first generates a series of keywords, based on which the full text is generated in the next stage. Following Yao et al. (2019), the keywords are extracted with the RAKE algorithm (Rose et al., 2010) for training; SRL follows the recent work of Fan et al. (2019) by first generating a sequence of predicates and arguments and then producing the full text conditionally. The same semantic role labeling tool as in the prior work is used here to create training data. SRL+NER and SRL+Coref further augment the SRL method with an additional stage that generates entity-anonymized text conditioned on the predicate sequence prior to the final stage (Fan et al., 2019). SRL+NER uses an NER model to mask all entities, while SRL+Coref applies coreference resolution to mask all clusters of mentions. We use the same NER and coreference tools as in Fan et al. (2019). Finally, as a reference, we also present the results of Human-written text (i.e., the text in the dev set).

Evaluation Metrics
To evaluate the generation quality for the domain-specific open-ended generation studied here, we primarily measure the "closeness" between two sets of text, one generated by the model and the other the real text from the target domain. We evaluate with a broad array of automatic metrics, including lexical-based quality metrics and semantic-based quality metrics. We also evaluate the generation diversity.
MS-Jaccard (MSJ) is a lexical-based metric (Montahaei et al., 2019), where MSJ-n measures the similarity of n-gram frequencies between two sets of text with the Jaccard index.
TF-IDF Distance (TID) is defined as the distance between the average TF-IDF features of two text sets. We use it as an additional lexical-based quality measure.
Fréchet BERT Distance (FBD) is a semantic-based metric (Montahaei et al., 2019) that measures the Fréchet distance in the BERT feature space between the generated and real text. By using the BERT features from shallow (S), medium (M), and deep (D) layers, we can compute FBD-S/M/D, respectively.
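For intuition, the Fréchet distance between two feature sets is commonly computed in closed form by fitting a Gaussian to each set, as in the Fréchet Inception Distance for images. Assuming FBD follows this standard form (the symbols below are introduced here for illustration), with $(\mu_g, \Sigma_g)$ and $(\mu_r, \Sigma_r)$ the mean and covariance of the BERT features of generated and real text:

```latex
\mathrm{FBD} = \lVert \mu_g - \mu_r \rVert_2^2
  + \mathrm{Tr}\!\left( \Sigma_g + \Sigma_r - 2 \left( \Sigma_g \Sigma_r \right)^{1/2} \right)
```

The first term compares the feature means and the second compares the covariances, so a lower value indicates that generated text is distributed more like real text in the feature space.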
Backward BLEU (B-BLEU) is a diversity metric (Shi et al., 2018) measuring how well the generated text covers the n-grams occurring in the test set.
Harmonic BLEU (HA-BLEU) (Shi et al., 2018) is an aggregated quality and diversity metric that incorporates both the standard BLEU (i.e., precision) and the Backward BLEU (i.e., recall).

Results
Figures 3 and 4 show the results of the various systems on the news and story domains, respectively, measured with different metrics against the test set. We give more complete results in the appendix. We can see that our progressive generation approach consistently outperforms the standard, single-stage LMs (GPT2-Small, GPT2-Large and BART) by a large margin on almost all metrics in both domains. Further, by increasing the number of progression stages, our method steadily achieves even stronger performance. This highlights the benefits of the flexible progressive generation strategy.
The various models using pretrained LMs with previous planning-then-generation strategies show mixed results across the different metrics. For example, Summary achieves strong performance in terms of the semantic-based quality metric FBD-D (partially because the summaries are closer to the real text in the BERT feature space), but falls significantly behind other models in terms of diversity (B-BLEU4) and other quality metrics like MSJ and HA-BLEU. Similarly, the SRL-based methods give only mediocre results in terms of the semantic-based FBD-D. In contrast, our approach maintains a relatively consistent performance level. In particular, our 4-stage model, ProGen-4, is steadily among the best across all metrics, further validating the advantage of the proposed simple yet flexible multi-stage generation. These results also indicate the necessity of using a large, diverse set of automatic metrics for a comprehensive evaluation, and motivate human studies for further assessment.

Human Evaluation
In our human study, we asked three university students who are proficient English speakers to evaluate the coherence and fluency of the generated text. To better assess the coherence of the long passages of text, we evaluate at both the passage level and the finer-grained sentence level. More concretely, for passage-level coherence, human raters assign a coherence score to each full-length text sample on a 5-point Likert scale. For a more detailed assessment, we further evaluate sentence-level coherence, where human raters label each sentence in the text passage with 0 or 1, indicating whether the particular sentence is coherent with the preceding context in the passage. We then calculate the average percentage of coherent sentences in the text generated by each model. Human raters also evaluate the language quality, assigning a fluency score on a 5-point Likert scale. We compare our method with the systems that show the highest generation quality in automatic evaluation, including BART, GPT2-Small, and Summary. We evaluated 50 examples for each comparison model on the CNN domain. The Pearson correlation coefficient of human scores is 0.52, showing moderate inter-rater agreement. Table 1 shows the results. All systems receive close fluency scores. Our approach obtains significantly higher coherence scores at both the passage and sentence levels. In particular, over 86% of the sentences in our model's generations are considered coherent with the context, improving over other models by at least 10 absolute percentage points.

Ablation Study and Analysis
Sample efficiency. We study how the progressive generation could improve the sample efficiency of large LMs fine-tuned to target domains. The intuition is that by focusing on the subsets of informative words, the early stages can more efficiently capture the domain-specific characteristics and then steer the subsequent refinement stages. Figure 5 shows the results where we report the FBD score averaged over FBD-S/M/D. We can see our approach can make more efficient use of the training data in learning to generate high quality samples. For example, with only 1K training examples, our method achieves comparable results with large LMs trained on 30K examples.
Generation with gold plans. To investigate the importance of dividing the generation process into stages and what the stages learn separately, we add another set of text into our comparison. It is produced by a 2-stage model whose first-stage output is replaced with the ground truth (gold plan), while the second stage is kept the same (a BART model); it is shown as GoldPlan in Table 3. Note that with the gold plan, our model greatly narrows the gap with human text in terms of lexical (TID) and semantic (FBD-D) quality metrics. The results highlight the importance of plans in text generation. The intermediate plans act as an information bottleneck, and high-quality plans can lead to high-quality text generation.
Effect of data noising. We study the ablation of data noising, to check whether the noising operation really helps reduce stage-wise exposure bias (Sec 3.2) as we expected. Table 2 shows the comparison between models with and without noise in training. The added noise generally brings performance improvement in terms of various metrics.
Example generations. Table 4 shows an example of text generated via three stages. We can see our model first generates the key subject beckham and the team name liverpool in the very first stage, then adds more fine-grained details like acquisition, transfer in the second stage and finally expands the keywords into a full document describing Beckham's joining a new team.

Conclusion
We have proposed a new approach for domain-specific generation of long text passages in a progressive manner. Our method is simple and efficient, as it fine-tunes large-scale off-the-shelf language models. We conduct extensive experiments using a variety of metrics and human studies. We demonstrate that our method outperforms a wide range of large pretrained LMs with single-stage generation or prior planning-then-generation strategies, in terms of quality and coherence of the produced samples. The multi-stage generation also opens up new opportunities to enhance controllability of text generation, which we would love to explore in the future.