Plan ahead: Self-Supervised Text Planning for Paragraph Completion Task

Despite the recent success of contextualized language models on various NLP tasks, a language model alone cannot capture the textual coherence of a long, multi-sentence document (e.g., a paragraph). Humans often make structural decisions on what and how to say before making utterances. Guiding surface realization with such high-level decisions and structuring text coherently is essentially a planning process. Where can the model learn such high-level coherence? A paragraph itself contains various forms of inductive coherence signals, called self-supervision in this work, such as sentence order, topical keywords, and rhetorical structures. Motivated by this, we propose a new paragraph completion task, PARCOM: predicting masked sentences in a paragraph. The task, however, requires predicting and selecting appropriate topical content with respect to the given context. To address this, we propose a self-supervised text planner, SSPlanner, that first predicts what to say (content prediction) and then guides the pretrained language model (surface realization) using the predicted content. SSPlanner outperforms baseline generation models on the paragraph completion task in both automatic and human evaluation. We also find that a combination of noun and verb keywords is the most effective for content selection, and that overall generation quality increases as more content keywords are provided.


Introduction
One may think textual coherence can be achieved by a gigantic language model trained on massive data. This might be true in simple cases, such as generating short replies (Kannan et al., 2016), but not in long, multi-sentence generation.* This is mainly because per-word predictions from autoregressive models cannot capture the long-term flow of text, while humans often make structural decisions on what and how to say before they speak (Byrne, 1979; McKeown, 1985; Hovy, 1990; Swan, 2002; Kang, 2020). Guiding surface-level realization with such high-level decisions and coherently structuring the output text is called a planning process.

* This work was done while DK was at CMU.
Where can the model learn such high-level decisions related to long-term coherence? A written paragraph itself can be a pot of golden resources, containing various forms of inductive coherence signals. Different types of coherence signals in a paragraph have been studied and used in many different ways: a sequence of words or sentences (Devlin et al., 2019; Radford et al., 2019), the discourse structure of a text (Appelt, 1982; Hovy, 1991; Kang et al., 2017, 2019), the order of sentences (Chambers and Jurafsky, 2008; Barzilay and Lapata, 2008), topic introduction, co-reference, a sequence of events (Tomkins, 1978; Schank and Abelson, 2013), and more. In this work, we primarily focus on the effect of topical content in text planning.
Despite the recent advances of contextualized language models (Devlin et al., 2019; Radford et al., 2019), the lack of appropriate tasks makes it difficult to evaluate generation models' long-term coherence. Prior tasks fall into classification or ranking problems, such as the narrative cloze task (Chambers and Jurafsky, 2008; Mostafazadeh et al., 2016), sentence ordering (Barzilay and Lapata, 2008), and next sentence prediction (Devlin et al., 2019). Some recent works focused on designing generation tasks: story generation (Fan et al., 2019), text infilling (Huang et al., 2019; Fedus et al., 2018; Hua and Wang, 2019), and paragraph bridging (Kang et al., 2019). However, most of them suffer from predicting appropriate topical content given limited context, due to the limited usage of self-supervision signals from the paragraph.
This work proposes a new open-ended paragraph completion task, PARCOM: predicting the masked sentences in a paragraph. Unlike prior work, our task uses two effective forms of self-supervision learned from a written paragraph itself: (1) we augment more training instances via permutation masking, and (2) we resolve the context sparsity problem by providing a set of ground-truth content keywords at training time and predicting them directly from context at test time.
For the task, we propose a self-supervised text planner (SSPlanner) that explicitly predicts content keywords from context (content prediction) and guides the pretrained language model (surface realization) using the predicted content. The distribution of predicted keywords is then combined with the word distribution of the language model using a copy mechanism (See et al., 2017). The predicted content keywords approximate the generator's topical intents, providing a hint that guides the surface realizer to bridge the coherency gap between the given context and the text to generate. Overall, SSPlanner combines two advantages: micro-level language fluency from the pre-trained language model (bottom-up) and macro-level content choice controlled by planning (top-down). Our experiments show that SSPlanner achieves significant improvements over the baselines in both automatic and human evaluation.

Related Work
We first categorize a wide range of long-term coherent generation tasks (Table 1) based on the inclusion relationship (C-T) between the given context (C) and the target to predict (T).

C = T: In data-to-text generation, the content to generate is explicitly provided as a data form, so the planner mostly orders and structures content rather than predicting it.
C ⊃ T: In abstractive summarization, all context information is entirely given in the source document, as a superset of target summaries to predict. Thus, generation only pays attention to abstracting the context into a shorter form instead of content prediction or ordering.
C ≈ T: Paraphrasing is transforming surface patterns of text while preserving its semantics. Fu et al. (2019) used variational autoencoders for surface realization with a latent bag of words model for differentiable content planning, where content to generate itself is given in context, not requiring any content planning.
C ⊥ T: Story generation (Fan et al., 2019), text infilling (Fedus et al., 2018; Huang et al., 2019), paragraph bridging (Kang et al., 2019), and our proposed PARCOM are very challenging tasks where context and target have no overlap (open-ended) but should be coherently connected. Hua and Wang (2019) used pre-extracted topics to guide a generator to produce stylized argumentation text; however, the topical content is given as input (content guidance), while our SSPlanner directly predicts plan words from context (content prediction). Fedus et al. (2018) and Huang et al. (2019) developed various methods for the text infilling task. Most similar to our work, Kang et al. (2019) developed language models informed by discourse relations for the bridging task: given the first and last sentences, predict the intermediate sentences (bidirectional flow). However, they neither explicitly predict content words given context nor use them as a self-supervision signal in training. Unlike the random masking of Keskar et al. (2019) and Huang et al. (2019), we propose a better data-augmentation training method via permutation masking.

PARCOM: Paragraph Completion Task from Self-Supervision Signals

Our task is motivated by the recently proposed paragraph bridging task (Kang et al., 2019): predicting the intermediate sentences of a paragraph, given the first and last sentences. To prevent generation from diverging too far from the context, as happens in story or prompt generation (Fan et al., 2019), the bridging task restricts generation to end with the given last sentence, which serves as an ending goal. However, in the bridging task the objective is to generate text that coherently links the two extreme sentences, which makes the task too challenging even for humans¹. For instance, the first and last sentences are too sparse a context for generating multiple (from 2 to 5) target sentences, increasing the divergence of generation exponentially. Also, the data usage in Kang et al. (2019) is very inefficient: a single training instance per paragraph.
To address these issues, we propose a new paragraph completion task, PARCOM, that maximizes the self-supervision present in a paragraph itself (Figure 1). We use two types of self-supervision: (1) masking a fixed-length span of consecutive sentences at any position in a paragraph, to maximize the usage of each paragraph, and (2) extracting partial keywords from the masked text as plan keywords, to resolve the content sparsity problem. In training, we learn the patterns between the context and the plan keywords; at test time, we predict the plan keywords and use them to guide the surface generator (§4).

Data Augmentation via Permutation Masking
Our work is motivated by word masking in training contextualized language models (Devlin et al., 2019), extended to the sentence level for learning longer-range coherence. Let t be the number of target (masked) sentences to predict and c the number of unmasked context sentences given, where l = t + c is the total length of a paragraph.

[Figure 1: (a) Permutation masking over a paragraph S1..S5 with t=1. (b) Plan extraction on target sentences, where the maximum number of keywords per sentence (nkps=2) is given.]

For instance, Figure 1 shows a paragraph of length l=5. We restrict the number of context sentences to be larger than the number of target sentences (c > t) to avoid the context becoming too sparse. For l=5, this yields a total of 5+4=9 training instances (5 with t=1 and 4 with t=2), making data usage more efficient.

¹ The METEOR score of human generation on the bridging task is only about 4.5 (Kang et al., 2019).
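The permutation-masking augmentation described above can be sketched as a short routine. Note that `permutation_mask` and `max_t` are hypothetical names for illustration, not identifiers from the paper's released code.

```python
def permutation_mask(sentences, max_t=3):
    """Enumerate (context, target) training instances from one paragraph.

    Masks every contiguous span of t sentences, for t = 1, 2, ...,
    subject to the constraint that the number of context sentences
    stays larger than the number of targets (c > t).
    """
    l = len(sentences)
    instances = []
    for t in range(1, max_t + 1):
        c = l - t
        if c <= t:  # keep more context than target (c > t)
            break
        for start in range(l - t + 1):
            target = sentences[start:start + t]
            context = sentences[:start] + sentences[start + t:]
            instances.append((context, target))
    return instances

# A paragraph of l=5 sentences yields 5 (t=1) + 4 (t=2) = 9 instances.
paragraph = ["S1", "S2", "S3", "S4", "S5"]
print(len(permutation_mask(paragraph)))  # 9
```

This matches the 5+4=9 instances in the l=5 example above, and naturally stops once the context would no longer outnumber the targets.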

Denser Context by Plan Extraction
We provide extra partial information as a set of keywords to guide the surface generator. This is motivated by data-to-text tasks, but our plans are topical content instead of structured data.
We then ask what types of plan keywords are most effective for completing the paragraph. We extract keywords using various keyword extraction systems:

• Off-the-shelf systems: we extract keywords for each sentence using three off-the-shelf systems: YAKE (Campos et al., 2020), using statistical features (e.g., TF, IDF); RAKE (Rose et al., 2010), using graph-based features (e.g., word degree); and PositionRank (Florescu and Caragea, 2017), using position-based PageRank. We then choose duplicate keywords by majority voting.

• Syntactic features (e.g., part-of-speech tags, named entities (Fan et al., 2019), events (Tomkins, 1978)) are often regarded as the most salient topical content in generation. Using an off-the-shelf part-of-speech (PoS) tagger², we extract three types of syntactic features: nouns, verbs, and nouns+verbs.

• Attention weights are used to capture context-aware keywords. We use pre-trained BERT (Devlin et al., 2019) to encode the context and target text, then average the attention weights of context words with respect to each target word. We use only the first head's attentions, averaged over all 12 layers³, and finally choose the words with the maximum weight, excluding special tokens (e.g., [CLS]) and punctuation marks.

We set the maximum number of keywords per sentence (nkps) to 5. Some extractors output an empty keyword list, so the number of keywords differs across systems. Our keywords are always uni-grams: if an extracted keyword is not a uni-gram, we split it by whitespace and use the individual uni-grams as unique keywords. If the target text has multiple sentences, we combine all keywords from the sentences and randomly shuffle them. The extracted plan keywords are provided only while training our plan predictor, not at test time; at test time, we explicitly predict the keywords given the context.
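The uni-gram normalization and per-sentence cap described above can be sketched as follows; `to_plan_keywords` is a hypothetical helper name, and the input list stands in for the raw output of any of the extractors.

```python
def to_plan_keywords(raw_keywords, nkps=5):
    """Normalize extractor output into at most `nkps` unique uni-grams.

    Multi-word keywords are split on whitespace, duplicates are dropped
    (plan keywords are always uni-grams), and the list is capped at
    nkps keywords per sentence.
    """
    seen, plan = set(), []
    for kw in raw_keywords:
        for unigram in kw.split():
            if unigram not in seen:
                seen.add(unigram)
                plan.append(unigram)
            if len(plan) == nkps:
                return plan
    return plan

print(to_plan_keywords(["church gate", "gate police", "altar"], nkps=3))
# ['church', 'gate', 'police']
```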

Self-supervised Text Planning (SSPlanner)
SSPlanner has several self-supervision modules that learn coherence signals from a paragraph itself (see Figure 2):

• a surface realizer (language model), learning from a sequence of words;
• a next sentence predictor, learning from pairs of consecutive sentences;
• sentence position embeddings, learning from the order of context sentences;
• a plan predictor, learning the relationship between the given context and the important keywords used in generating the target text; and
• content guidance, learning whether the predicted plan keywords are used in the target.

Our planner is motivated by the two-stage generation framework (Moryossef et al., 2019; Miculicich et al., 2019; Fu et al., 2019; Hua and Wang, 2019). While in prior work the content is explicitly given by the dataset or task itself, our plan predictor in SSPlanner predicts the plan keywords from the given context alone, by learning the topical relationship between context and target content from training data.
Given a paragraph of length l, s_1..s_l, where each sentence s consists of n words w_1..w_n, PARCOM splits it into context sentences x = s_1..s_{j-1}, s_{j+t}..s_l and t target sentences to predict, y = s_j..s_{j+t-1}. For each target sentence s_j, p plan keywords k_{j,1}..k_{j,p} are given only at training time. The plan keywords are chosen from the entire vocabulary V and are later combined with the word distribution from the language model. We describe each self-supervision module in SSPlanner as follows.

Surface realization with pre-trained language models. We use two different types of transformer-based language models: BERT (Devlin et al., 2019) and GPT2 (Radford et al., 2019). While GPT2 is trained with autoregressive (left-to-right) language modeling, BERT is trained with masked language modeling; for BERT, we use the sequential sampling method (Wang and Cho, 2019). We encode the context x and output the hidden representation h_{j,i} = f(h_{j-1,i}, x_{<(j,i)}) for the j-th word of the i-th sentence, where f ∈ {BERT, GPT2} is the transformer language model. We then obtain the sentence vector h_i by averaging all word vectors in the sentence.
Sentence position embedding. We concatenate each encoded sentence representation with its sentence position embedding. By adding the sentence position embeddings into the context encoding, the model is aware of where each context sentence comes from (e.g., first or last). Compared to the simple concatenation of context sentences (Kang et al., 2019), our sentence position embedding better captures bi-directional coherence. The final context representation is h_c = (1/n) Σ_i [h_i ; pos^c_i], where n is the number of sentences in the context, [;] denotes concatenation, and pos^c_i is the position embedding of the i-th context sentence.
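As a minimal illustration of this context encoding, the concatenate-then-average step can be sketched in plain Python; the tiny vectors and the function name `context_vector` are illustrative, not the paper's actual dimensions or code.

```python
def context_vector(sent_vecs, pos_embs):
    """Average the concatenation [h_i ; pos_i] over all context sentences."""
    concat = [h + p for h, p in zip(sent_vecs, pos_embs)]  # list '+' concatenates
    n, dim = len(concat), len(concat[0])
    return [sum(vec[d] for vec in concat) / n for d in range(dim)]

# Two sentence vectors (dim 2), each with a scalar position embedding.
h_c = context_vector([[1.0, 2.0], [3.0, 4.0]], [[0.0], [1.0]])
print(h_c)  # [2.0, 3.0, 0.5]
```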
Plan prediction. This work assumes that high-level plan words form a bag of words (Fu et al., 2019), so the model directly predicts the plan keywords from the vocabulary used in surface realization. We calculate plan probabilities over the entire vocabulary V given the context vector h_c and choose the p keywords with the maximum probability estimates:

p̂_{k∈V} = softmax(W_cv h_c)

where V is the vocabulary from the training data and W_cv is a trainable model parameter. We do not apply any explicit cut-off to p̂_{k∈V}, in order to keep the distribution differentiable. The objective is then:

L_plan = CE(p*, p̂)

where the loss CE is cross-entropy, p̂ is the estimated probability distribution over the vocabulary, and p* is the true multi-hot distribution over the plan keywords produced by the extraction algorithms (i.e., [0,1,..,0,1] over V).

[Figure 2: SSPlanner first predicts high-level plan keywords (Plan Predictor), then guides surface generation (Transformer LM) using the predicted plan keywords. The ground-truth plan keywords and target sentences are given only at training time, not at test time. Predicted and ground-truth targets can be seen in Table 7.]

Next sentence prediction. Motivated by Devlin et al. (2019), we add an auxiliary task of predicting whether the target sentence is related to the context. For negative samples, PARCOM assigns random target sentences 50% of the time. We optimize p̂_next = softmax(W_c h_c), where W_c is the trainable parameter for the binary classification. The objective is then:

L_next = BCE(p*_next, p̂_next)

where the loss BCE is binary cross-entropy, p*_next is the true next-sentence label, and p̂_next is the predicted one.
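A toy version of the plan-prediction step, a softmax over the vocabulary followed by a top-p keyword cut at inference time, might look like the following; the vocabulary, logits, and function names are made up for illustration and stand in for softmax(W_cv h_c).

```python
import math

def softmax(logits):
    m = max(logits)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def top_p_keywords(probs, vocab, p):
    """At test time, pick the p vocabulary words with highest plan probability."""
    ranked = sorted(zip(probs, vocab), reverse=True)
    return [word for _, word in ranked[:p]]

vocab = ["gate", "police", "altar", "the"]
probs = softmax([2.0, 1.0, 0.5, -1.0])    # stand-in for softmax(W_cv h_c)
print(top_p_keywords(probs, vocab, p=2))  # ['gate', 'police']
```

Note that during training no cut-off is applied: the full distribution is kept so the cross-entropy loss stays differentiable, and the top-p selection is used only at test time.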
Content guidance. We combine the plan-prediction and language-modeling distributions through a copy mechanism, following the pointer-generator (See et al., 2017). For the j-th sentence, we learn the probability of choosing the plan keyword versus the word from language modeling, based on the context vector, the plan keyword distribution, and the sentence position embedding of the target sentence:

v = σ(W_ck [h_c ; p̂_{k∈V} ; pos^t_j])

where σ is the sigmoid function, W_ck is a trainable parameter, and v ∈ [0, 1] is the probability of choosing the plan keyword.
We then decode each target sentence using the same language model as decoder: s_j = g(s_{j-1}, ŷ_{j-1}), where g ∈ {BERT, GPT2} is the language model decoder and s_j is its output hidden state. We obtain the attention over the plan keywords k:

α_j = softmax(s_j W_k k)

where W_k is a trainable parameter. Lastly, we combine the distribution of plan probabilities P_plan with the word probabilities in decoding, P_lm:

P(y_j) = v · P_plan(y_j) + (1 − v) · P_lm(y_j)
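The copy-style mixing of the two distributions is a convex combination, sketched below over toy dictionaries; the distributions and the name `mix_distributions` are illustrative only.

```python
def mix_distributions(p_plan, p_lm, v):
    """Pointer-generator-style mix: v * P_plan(w) + (1 - v) * P_lm(w)."""
    words = set(p_plan) | set(p_lm)
    return {w: v * p_plan.get(w, 0.0) + (1 - v) * p_lm.get(w, 0.0) for w in words}

p_plan = {"gate": 1.0}             # attention over predicted plan keywords
p_lm = {"gate": 0.2, "the": 0.8}   # language-model word distribution
mixed = mix_distributions(p_plan, p_lm, v=0.3)
print(round(mixed["gate"], 2), round(mixed["the"], 2))  # 0.44 0.56
```

Because v and both inputs are proper distributions, the mixed output also sums to 1, so it can be plugged directly into the decoding loss.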
The objective of the pointer-generator is then:

L_gen = −Σ_j log P(y*_j)

where y*_j is the ground-truth target word.

Final objective. The final training objective minimizes the three objectives, plan prediction, next sentence prediction, and pointer generation, together:

L_SPP = λ_plan L_plan + λ_next L_next + L_gen (6)

where the weighting terms λ_plan and λ_next are obtained through cross-validation.

Experiment
We answer three questions in our experiments: Q1. Does SSPlanner help produce more coherent generation in PARCOM? If so, which self-supervision modules are the most helpful? Q2. What types of plan keywords (e.g., noun, verb, attention) are most effective for generation quality, and how many keywords are most helpful? Q3. Is PARCOM a valid generation task for measuring text coherence?

Paragraph datasets. Table 3 shows the paragraph datasets collected for our experiment. We collect paragraphs from various domains: the two most frequent sub-genres extracted from the BookCorpus (Zhu et al., 2015) dataset, Fantasy and Romance; Wikipedia text from WikiText-103 (Merity et al., 2016); and news articles from the CNN/DailyMail (CNNDM) dataset (See et al., 2017). CNNDM and WikiText contain factual knowledge about events or things, whereas Fantasy and Romance are more narrative.
For a fair comparison, we restrict the number of sentences in a paragraph to between 4 and 7, the same setup as Kang et al. (2019). Since CNNDM has no explicit line breaks within a document, each document is regarded as a single paragraph (39.3 sentences long on average). Each dataset is randomly split 0.9/0.05/0.05 into train, validation, and test sets, respectively.
Models. As baselines, we compare non-pretrained sequence-to-sequence models: a BiLSTM (Hochreiter and Schmidhuber, 1997) and the hierarchical seq2seq model HRED (Serban et al., 2017; Sordoni et al., 2015), which encode the concatenated context sentences and then decode the target sentences. We also compare two strong paragraph generation models: FlowNet_disc, using discourse relations, and FlowNet_latent, using latent delta relations (Kang et al., 2019), following the same setup (e.g., discourse parser, hyper-parameters) as the original paper.
We also use pre-trained language model baselines fine-tuned on our paragraph datasets: the fine-tuned bert-base-uncased (BERT_finetune) and gpt2-base (GPT2_finetune) models (Wolf et al., 2019). For BERT, we use the sequential sampling method (Wang and Cho, 2019) with nucleus sampling for producing more diverse text (Holtzman et al., 2019).
Our proposed method SSPlanner is trained with either bert-base-uncased or gpt2-base. As an upper bound of our method, we also predict the masked target text using the ground-truth plan keywords.
Setup. We find the best hyper-parameters on the validation set using a grid search on the learning rate, the number of training epochs, sampling parameters, and so on. We follow the default parameters used in the HuggingFace's transformer models (Wolf et al., 2019). For a pointer-generator, we follow the default parameters in (See et al., 2017). The maximum number of plan keywords per sentence is 3.
The final parameters for the BiLSTM and HierLSTM baselines are: batch size 32; maximum paragraph length 128; word embedding size 300, initialized with GloVe (Pennington et al., 2014); 1 LSTM layer (Hochreiter and Schmidhuber, 1997) of size 512; gradient clipping at 0.25; learning rate 0.2 with decay rate 0.5 using the Adagrad (Duchi et al., 2011) optimizer; and a vocabulary size of 50,000. For the FlowNet variants, we follow the setup of the original paper (Kang et al., 2019). For the BERT and GPT2 models, we use batch size 32 and a 2e-4 learning rate, with maximum gradient norm 1.0 and weight decay 0.02, using the Adam (Kingma and Ba, 2014) optimizer.
Due to computing limits, we restrict the maximum number of target sentences to 3, even though it could be up to half the paragraph size under full permutation.
For human evaluation, we measure fluency, coherence with respect to context, and overall quality on a 1-5 Likert scale. We randomly select 100 samples from the test set of each of Romance, WikiText, and CNNDM (300 paragraphs in total). Each sample is annotated by three crowd-workers and the scores are averaged. We also measure how humans perform on the task by asking workers to predict the masked text in these same 300 paragraphs.

Automatic and Human Evaluation

Tables 4 and 5 show the automatic and human evaluation results on the PARCOM task. The fine-tuned models ({BERT,GPT2}_finetune)⁴ and the FlowNet models show significant improvements over the seq2seq baselines (BiLSTM and HRED) by large margins (~1.5 METEOR), showing the importance of fine-tuning on target text and of modeling inter-sentential relations, respectively.

In all datasets, SSPlanner shows significant improvements in both hard and soft metrics. This indicates that explicitly predicting content words before surface realization helps generate more coherent text in the target-oriented generation of PARCOM. SSPlanner with GPT2 outperforms SSPlanner with BERT, because autoregressive models like GPT2 are more appropriate for our generation task than masked models like BERT. Finally, SSPlanner with the ground-truth keywords achieves a further dramatic gain, which can be seen as an upper bound of our planning framework. Among domains, Fantasy and Romance are better predicted than WikiText and CNNDM, which require additional factual knowledge as well as narrative coherence.

Using the best model, SSPlanner (GPT2), we conduct a human evaluation of various system outputs and human-generated text (Table 5). The fine-tuned GPT2 model shows high fluency by itself but very low coherence with context, because PARCOM requires not only fluent and natural text but also context-aware text. SSPlanner achieves much higher coherence and overall quality than the baselines, but still falls far behind the upper-bound model (SSPlanner with ground-truth keywords) and human generation.

⁴ In our experiment, non-fine-tuned models (i.e., the original pre-trained models without fine-tuning) show very poor performance on our task.

[Table 7: Example paragraph with the plan keywords extracted by different algorithms, and output predictions by SSPlanner and a human writer. Human evaluation scores: SSPlanner F: 4.3, C: 3.9, Q: 3.8; human writer F: 4.8, C: 4.9, Q: 4.8, where F is fluency, C is coherence with context, and Q is overall quality. Predicted plan keywords by sentence: 2 (vigor, mark, caught), 3 (gate, catholics, police), 4 (altar, mark, bishop).]

We also evaluate the next sentence prediction (NSP) and plan prediction (PP) modules on the test samples (Table 6). SSPlanner achieves very high accuracy in NSP. In PP, SSPlanner correctly predicts almost half of the keywords out of the total vocabulary, indicating that the plan prediction module can capture a certain level of coherence between the given context and the target text to predict, although it is not perfect.

Comparing keyword types, the combination of nouns and verbs and the keywords extracted from the off-the-shelf algorithms outperform the other types. We conjecture that, since a sentence consists of both entities (i.e., nouns) and events (i.e., verbs) according to script theory (Schank and Abelson, 2013), their combination provides the largest amount of information for completing the sentence. Attention-based keywords are not as helpful, because the averaged attention weights themselves may not be a good indicator of topical coherence.

Comparison of Keyword Types and Ratios in Testing
In Figure 3, at test time, the keywords predicted by SSPlanner (red) show dramatic improvements in both METEOR and VE over random keywords (blue), but remain far behind the ground-truth keywords (yellow). As more predicted keywords are used at test time, generation quality increases.

Table 7 shows an example paragraph with ground-truth keywords extracted by different algorithms in PARCOM, and target sentences predicted with plan keywords by SSPlanner and by a human writer. In SSPlanner's prediction, half of the predicted keywords are used in the generation, making the story more coherent with the first two sentences and the last, ending sentence. Over the entire test set, we observe that about 43% of predicted keywords are actually used in generation.

Conclusion
A written paragraph itself contains various inductive coherence signals that can be learned through self-supervision. Motivated by this, we propose a paragraph completion task for measuring the textual coherence of a long document using different types of self-supervision signals. To solve the task, we propose a text planner, SSPlanner, that explicitly predicts topical content keywords and then guides the surface generator using the predicted plan keywords. SSPlanner consists of several kinds of self-supervision modules: sentence positions, sequences of words and sentences, and the topical relationship between context and target. Our self-supervised planning, in addition to other types of planning (e.g., discourse, goals, coreference, tenses), can be an important step toward modeling long-term coherence in text generation.
Our results suggest several promising directions. Although our ablation tests show the effect of each self-supervision module, of the plan keyword types, and of the number of keywords on generation quality, there is more space to explore in self-supervised text planning. For example, one can study generation quality with respect to the position of the target sentences (beginning, middle, end), compare plan keywords predicted by humans and by the system, examine the effect of data augmentation by mask position (e.g., masking only the middle), or study generation quality with respect to the ratio of masked to unmasked sentences.
Second, we can extend the set of plan keywords to be more structured like a discourse tree. For instance, one can write a simple structure like "(CAUSALITY (ELABORATE (Buy, Coffee)) (Pay, Tip, 12 dollars))" then the system can generate a long, coherent text reflected by the structure. Predicting such structural plans from context and imposing them into the generator would be a potential direction for future work.
Last, text planning is a cognitive function commonly used in human language generation. To generate more human-like utterances, different planning stages should be combined simultaneously (Kang, 2020), such as abstractive planning, strategic planning, coherence planning, and diversity planning. Combining these heterogeneous planning systems will be a crucial step toward developing human-like language generation.