Enabling Language Models to Fill in the Blanks

We present a simple approach for text infilling, the task of predicting missing spans of text at any position in a document. While infilling could enable rich functionality, especially for writing assistance tools, more attention has been devoted to language modeling, a special case of infilling where text is predicted at the end of a document. In this paper, we aim to extend the capabilities of language models (LMs) to the more general task of infilling. To this end, we train (or fine-tune) off-the-shelf LMs on sequences containing the concatenation of artificially-masked text and the text which was masked. We show that this approach, which we call infilling by language modeling, can enable LMs to infill entire sentences effectively on three different domains: short stories, scientific abstracts, and lyrics. Furthermore, we show that humans have difficulty identifying sentences infilled by our approach as machine-generated in the domain of short stories.


Introduction
Text infilling is the task of predicting missing spans of text which are consistent with the preceding and subsequent text. Systems capable of infilling have the potential to enable rich applications such as assisting humans in editing or revising text (Shih et al., 2019), connecting fragmented ideas (AI21, 2019), and restoring ancient documents (Assael et al., 2019). Rather than targeting a particular application, our goal here is to provide a general, flexible, and simple infilling framework which can convincingly infill in a variety of domains.
A special case of infilling is language modeling: predicting text given preceding but not subsequent text.

Figure 1: We consider the task of infilling, which takes incomplete text as input and outputs completed text. To tackle this task, our framework constructs training examples by masking random spans to generate pairs of inputs (text with blanks) and targets (answers for each blank). We then train unidirectional language models on the concatenation of each pair. Once trained, a model takes text input with blanks, predicts the answers, and then combines them to produce the output.

Language models are (1) capable of generating remarkably coherent text (Zellers et al., 2019; See et al., 2019), (2) efficient at generating text, and (3) conceptually simple, but cannot infill effectively as they can only leverage context in a single direction (usually the past). On the other hand, strategies such as BERT (Devlin et al., 2019) and SpanBERT (Joshi et al., 2019) are able to infill using both preceding and subsequent text. However, their use of bidirectional attention limits their infilling capabilities to fixed-length spans. This is problematic as, for many applications, we may not know the length of a missing span a priori. Zhu et al. (2019) propose a method capable of infilling variable-length spans, but it uses a specialized architecture and hence cannot easily leverage large-scale pre-trained models.
In this work, we present infilling by language modeling (ILM), a simple framework which enables LMs to infill variable-length spans while preserving their aforementioned benefits: generation quality, efficient sampling, and conceptual simplicity. Our framework involves a straightforward formulation of the infilling task which, as we demonstrate, can be learned effectively by existing LM architectures. As shown in Figure 1, our approach concatenates artificially-masked text with the text which was masked, and adopts a standard LM training (or fine-tuning) procedure on such examples. Once trained, infilling can be performed for a document with blanks by using the LM to generate text and then replacing the blanks with this text.
In addition to its conceptual simplicity, our experiments show that ILM enables off-the-shelf LMs to infill effectively. Furthermore, we find that infilling performance improves when starting from a large-scale pre-trained LM (as opposed to training from scratch), suggesting an additional benefit of using our model-agnostic framework compared to approaches which require specialized architectures.
We provide an interactive web demo of models trained under our framework. This demo can infill multiple variable-length spans with different granularities (e.g. words, n-grams, and sentences) on the domains of short stories, scientific abstracts, and song lyrics: https://chrisdonahue.com/ilm. All code, data, and trained models are available at https://github.com/chrisdonahue/ilm and also on the CodaLab platform at https://worksheets.codalab.org/worksheets/0x9987b5d9cce74cf4b2a5f84b54ee447b.

Problem Statement
The task of infilling is to take incomplete text x̃, containing one or more missing spans, and return completed text x. Let [blank] be a placeholder for a contiguous sequence (span) of one or more missing tokens. Then, incomplete text x̃ is a sequence of tokens, some of which are [blank]. In order to map x̃ to x, an infilling strategy must specify both how many and which tokens to generate for each [blank]. Note that there may be many reasonable x for a given x̃. Hence, we are interested in learning a distribution p(x | x̃).

Infilling by Language Modeling
In this section, we describe our ILM framework. We first outline a simple reparametrization of the infilling task. Then, we define a procedure for automatically generating suitable training examples which can be fed to an off-the-shelf LM. Fedus et al. (2018) explore an infilling framework where LMs are trained on concatenations of x̃ and x, i.e., they use LMs to directly predict x given x̃. While their approach is effective at infilling individual words, it is somewhat redundant as the model must "predict" the unmasked text in x̃. Additionally, a model is not guaranteed to exactly reproduce the unmasked text.

Formulation
Instead, we make the trivial observation that it suffices to predict only the missing spans y which will replace the [blank] tokens in x̃. We can then construct x by simply replacing [blank] tokens in x̃ with predicted spans y in a deterministic fashion. In order to handle multiple variable-length spans, we pose y as the concatenation of all missing spans separated by special [answer] tokens (one [answer] per [blank]) (Figure 1). We can thus cast infilling as learning p(y | x̃) without loss of generality.
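As a concrete illustration, the deterministic reconstruction of x from x̃ and y can be sketched in a few lines. Token handling here is simplified to whitespace-separated strings; the actual implementation operates on GPT-2 subword tokens.

```python
def reconstruct(x_tilde, y):
    """Deterministically replace each [blank] in x_tilde with the
    corresponding [answer]-terminated span from y."""
    # y is the concatenation of the missing spans, each followed by [answer]
    answers = [a.strip() for a in y.split("[answer]") if a.strip()]
    out = []
    for token in x_tilde.split():
        out.append(answers.pop(0) if token == "[blank]" else token)
    return " ".join(out)
```

For example, reconstruct("She ate [blank] for dinner", "leftover pasta [answer]") yields "She ate leftover pasta for dinner".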

Training
Given a corpus consisting of complete text examples, our framework first manufactures infilling examples and then trains an LM on these examples. To produce an infilling example for a given x, we first sample an x̃ from a stochastic function Mask(x) which randomly replaces some number of spans in x with [blank] tokens. Then, we concatenate together the spans which were replaced, separated by [answer] tokens, to form a training target y. Finally, we construct the complete infilling example by concatenating x̃, [sep], and y (see Figure 2 for a complete example).
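The example-manufacturing step can be sketched as follows. This is a minimal version in which span lengths and the masking rate are illustrative; the hierarchical mask function we actually use is described in Appendix B.

```python
import random

def make_infilling_example(tokens, mask_prob=0.15, rng=random):
    """Randomly mask contiguous spans of `tokens`, then serialize the
    (input, target) pair as:  x_tilde [sep] y
    where y concatenates the masked spans, each followed by [answer]."""
    x_tilde, y, i = [], [], 0
    while i < len(tokens):
        if rng.random() < mask_prob:
            # Mask a short span starting at position i (length illustrative).
            span_len = rng.randint(1, min(3, len(tokens) - i))
            y.extend(tokens[i:i + span_len] + ["[answer]"])
            x_tilde.append("[blank]")
            i += span_len
        else:
            x_tilde.append(tokens[i])
            i += 1
    return " ".join(x_tilde + ["[sep]"] + y)
```

By construction, every [blank] in the input half is matched by exactly one [answer]-terminated span in the target half.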
We train (or fine-tune) LMs on these infilling examples using standard LM training methodology, yielding models of the form pθ(y | x̃). Specifically, we train GPT-2 (Radford et al., 2019) off the shelf, but any LM can potentially be used. This framework has several advantages. First, it incurs almost no computational overhead compared to language modeling. Specifically, if there are k missing spans in x̃, the concatenation of x̃ and y contains only 2k + 1 more tokens than x (one [blank] and one [answer] per missing span, plus one [sep]). As k is usually small (averaging around 2 per example in our experiments), sequence lengths remain similar to those encountered for the same x during language modeling. In contrast, using LMs to directly predict x from x̃ as in Fedus et al. (2018) effectively doubles the sequence length of x. This is particularly problematic when considering models like GPT-2 whose memory usage grows quadratically with sequence length. Second, our framework requires minimal change (three additional tokens) to an existing LM's vocabulary. Finally, because the entirety of x̃ is in the "past" when predicting y, the ILM framework combines the ability to incorporate context on both sides of a blank with the simplicity of decoding from LMs.
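The sequence-length bookkeeping above is easy to verify: every token of x appears exactly once in the ILM example (either in x̃ or in y), so the only additions are the special tokens.

```python
def ilm_example_length(n, k):
    """Tokens in the full ILM example (x_tilde + [sep] + y) for a document
    of n tokens with k masked spans: each span contributes one [blank] and
    one [answer], plus a single [sep]."""
    return n + 2 * k + 1

def lm_all_example_length(n, k, m):
    """Tokens in an LM-All example, which concatenates x_tilde
    (n - m unmasked tokens plus k blanks), a separator, and all n tokens
    of x; the unmasked tokens therefore appear twice."""
    return (n - m + k) + 1 + n
```

For a 100-token document with two masked spans totalling 20 tokens, the ILM example has 105 tokens while the LM-All example has 183.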

Experimental Setup
We design our experiments to determine if training an off-the-shelf LM architecture with our ILM framework can produce effective infilling models for a variety of datasets. Specifically, we train on three datasets of different sizes and semantics (details in Appendix A): short STORIES (Mostafazadeh et al., 2016), CS paper ABSTRACTS, and song LYRICS.

Mask Function
A benefit of the ILM framework is that it can be trained to infill spans corrupted by arbitrary mask functions. Here, we explore a mask function which simultaneously trains models to infill different granularities of text; specifically, words, n-grams, sentences, paragraphs, and documents. By using a unique special token per granularity (e.g. [blank word]), this mask function offers users coarse but intuitive control over the length of the spans to be infilled.
We configure our mask function to mask each token in a given document with around 15% probability, echoing the configuration of Devlin et al. (2019). However, instead of masking individual tokens uniformly at random, we perform a preorder traversal of the granularity hierarchy tree, randomly masking entire subtrees with 3% probability. For the datasets we consider, this results in a marginal token mask rate of about 15% (details in Appendix B).
While we train to infill several different granularities, we primarily evaluate and discuss the ability of our models to infill sentences for brevity. Quantitative results of our models on other granularities can be found in Appendix D, and granularity functionality can also be explored in our web demo.

Task and Model Configurations
For all experiments, we train the same architecture (GPT-2 "small") using the same hyperparameters (Appendix C) while varying the infilling strategy and dataset. In addition to our proposed ILM strategy for infilling, we consider three baseline strategies: (1) language modeling (LM; "infilling" based only on past context), (2) reverse language modeling (LM-Rev; "infilling" based only on future context), and (3) language modeling based on all available context (LM-All). LM-All simply concatenates x̃ and x together as in Fedus et al. (2018). LM-All represents arguably the simplest way one could conceive of infilling with LMs, but results in long sequence lengths. Training examples for all strategies are depicted in Figure 2.
For each strategy, we also vary whether training is initialized from the pre-trained GPT-2 model or from scratch. Despite discrepancies between the pre-training and our fine-tuning for most infilling strategies, all of the infilling experiments initialized from the pre-trained checkpoint performed better than their from-scratch counterparts. This indicates that ILM can effectively leverage large-scale language modeling pre-training to improve infilling performance. Henceforth, we will only discuss the models initialized from the pre-trained checkpoint, though we report quantitative performance for all models in Appendix D.
We trained models on STORIES and ABSTRACTS to convergence using early stopping based on the validation set perplexity (PPL) of each model, computed only on the masked tokens. These models took about a day to reach their early stopping criteria on a single GPU. For the larger LYRICS dataset, we trained models for 2 epochs (about two days on a single GPU).

Quantitative Evaluation
We evaluate the quantitative performance of our models on the sentence infilling task by measuring PPL on test data. In this setting, a sentence is selected at random and masked out, and we measure the likelihood assigned by a model to the masked sentence in the context of the rest of the document. Regardless of differences in the ordering and number of tokens that each strategy uses to represent a test example, PPL is always computed only for the span of tokens comprising the original sentence (e.g. green tokens in Figure 2). Table 1 shows that across all datasets, ILM outperforms models which see only past or future context (LM and LM-Rev respectively), implying that our proposed framework is able to take advantage of bidirectional context despite using unidirectional models. Additionally, while one might expect LM-All to outperform ILM because its training examples more closely "resemble" those of standard LMs, ILM achieves similar performance to LM-All. This indicates that GPT-2 is able to effectively learn the "syntax" of ILM examples and achieve reasonable infilling performance with shorter sequences (and hence with much less memory usage).
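Concretely, restricting PPL to the masked span amounts to averaging the model's log-probabilities over only those token positions. The following is a sketch; in practice, token_logprobs would come from a forward pass of the trained model over the serialized example.

```python
import math

def ppl_on_span(token_logprobs, start, end):
    """Perplexity over only the tokens in [start, end), given per-token
    log-probabilities assigned by a model to a full serialized example."""
    span = token_logprobs[start:end]
    # Perplexity is the exponentiated negative mean log-probability.
    return math.exp(-sum(span) / len(span))
```

Because only the span's positions enter the average, models whose serializations differ elsewhere (LM, LM-Rev, LM-All, ILM) remain comparable on the same set of tokens.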
(Overlap-based metrics such as BLEU score (Papineni et al., 2002) are not appropriate for evaluating infilling, as there are many realistic infills that have no word-level overlap with the original, e.g., "a sandwich" instead of "leftover pasta.")

We also observe that models trained via ILM perform similarly on the special case of language modeling compared to the models which were trained only on language modeling (Appendix D.1). This suggests that ILM does not just repurpose LMs to infill, but rather extends their capabilities while maintaining their original functionality.

Human Evaluation
In addition to our quantitative evaluation, we seek to evaluate the qualitative performance of ILM. To this end, we sample a story from the STORIES test set and randomly replace one of its five human-written sentences with a model output. Then, we task human annotators on Amazon Mechanical Turk with identifying which of the sentences in a story was machine-generated (details in Appendix E).
We compare our ILM model to three baseline infilling strategies: an LM (context beyond the replaced sentence was discarded), the best model (self-attention; SA) from Zhu et al. (2019), and the pre-trained BERT (base) model (Devlin et al., 2019). All approaches except for BERT were first fine-tuned on the STORIES dataset. To infill using BERT, we replace the tokens representing the original sentence with mask tokens, and then generate text by replacing mask tokens one at a time (conditioning on previously-generated tokens). While vocabulary differences make it less useful to compare PPL for the SA and BERT baselines to our GPT-2-based strategies, we can still meaningfully compare them in this human evaluation setting.
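The one-at-a-time BERT decoding procedure can be sketched as follows, with predict standing in for an arbitrary masked-LM forward pass (the real implementation scores candidates with BERT; this skeleton only shows the left-to-right control flow).

```python
def iterative_fill(tokens, is_masked, predict):
    """Fill masked positions left to right, one at a time.
    `predict(tokens, i)` returns a token for position i given the
    current (partially filled) sequence, so each fill can condition
    on fills made at earlier positions."""
    tokens = list(tokens)
    for i, masked in enumerate(is_masked):
        if masked:
            tokens[i] = predict(tokens, i)
    return tokens
```

Note that, unlike ILM, this procedure commits to a fixed number of masked positions up front, which is one reason finding convincing infills of a precise token length is difficult.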
For each approach we compute a score, which we define as the percentage of examples where the annotator did not correctly identify the machine-generated sentence. Therefore, a higher score implies a better (more natural, human-like) model. We collect 100 responses for each model and report the scores in Table 2, with qualitative examples in Figure 3 and Appendix E.
Of the four strategies, ILM achieves the highest score, implying that sentences infilled by ILM are harder for humans to recognize as fake than those produced by other strategies. Somewhat surprisingly, we observed that the LM model performed better than BERT and SA despite only observing past context. BERT may have performed poorly due to the intrinsic difficulty of finding convincing infills with a precise length in tokens. SA may have performed poorly because, unlike LM and ILM, it was not initialized from a large-scale pre-trained LM.

Table 2: We use BERT, SA, and our LM and ILM models to replace random sentences in five-sentence stories from the STORIES test set. Then, we task humans with identifying which sentence of the five was generated by a machine. We report the score of each model: the percentage of infilled stories where the human failed to identify the machine-generated sentence. Our ILM model achieves a higher score than all of the other models. Note that the max score is effectively 80%, as a perfect model would cause annotators to randomly choose one of the five sentences.
BERT: favoritea ", Mary brightly said.
SA: She wasn't sure she had to go to the store.
LM: She went to check the tv.
ILM: Patty knew her friends wanted pizza.
Human: She also had the place looking spotless.

Example Story with Masked Sentence
Patty was excited about having her friends over. She had been working hard preparing the food.
[blank] All of her friends arrived and were seated at the table. Patty had a great time with her friends.

Figure 3: Example of a short story in our STORIES dataset with its third sentence masked, and sentences infilled by different models. The sentences generated by the BERT and SA models are off-topic, and the sentence generated by the LM model is irrelevant to the future context, while the ones generated by ILM and Human successfully account for both previous and future context.

Related Work

Zhu et al. (2019) and Shen et al. (2020) infill multiple variable-length sequences, but these approaches require the masked context to be iteratively updated and reprocessed to fill in blanks one at a time. In contrast, our approach appends infilled text to the context and does not require reprocessing the entire input sequence for each blank. AI21 (2019) train an LM which can fill in the middle of a paragraph given the first and last sentences; our work generalizes to such capabilities.

Task. The cloze task (Taylor, 1953) evaluates language proficiency by asking systems to fill in randomly-deleted words by examining context. Cloze has been extended in the forms of discourse cloze (Deyes, 1984) and narrative cloze (Chambers and Jurafsky, 2008), which remove phrases and narrative events respectively. Recently, cloze has been used not only for evaluation, but also to improve text generation quality (Fedus et al., 2018) and transfer learning (Devlin et al., 2019) (under the name "masked language modeling"). Text infilling can be thought of as generalizing the cloze task from single words to spans of unknown length. Raffel et al. (2019) explore infilling as a pre-training objective to improve downstream performance on inference tasks; our work focuses on generation.
Story generation. Recent work seeks to generate stories given a title and storyline (Yao et al., 2019), entities (Clark et al., 2018), premise (Fan et al., 2018), or surrounding context and rare words (Ippolito et al., 2019). Our work differs in that we aim to build systems capable of making predictions based only on text context, rather than aspects specific to stories (e.g. storyline).

Conclusion
We presented a simple strategy for the task of infilling which leverages language models. Our approach is capable of infilling sentences which humans have difficulty recognizing as machine-generated. Furthermore, we demonstrated that our infilling framework is effective when starting from large-scale pre-trained LMs, which may be useful in limited-data settings. In future work, we plan to incorporate these features into co-creation systems which assist humans in the writing process. We hope that our work encourages more investigation of infilling, which may be a key missing element of current writing assistance tools.

A Datasets

We experimented on multiple datasets to demonstrate that our framework was not custom-tailored to a single domain. On the STORIES and ABSTRACTS datasets, we include metadata (story title, paper subject matter, etc.) as the first "paragraph" of the document. By providing these paragraphs (Appendix B), our infilling model implicitly learns to summarize (e.g. infill a title given a story) and to do conditional generation (e.g. infill a story given a title). On the LYRICS dataset, infilling models may be especially helpful to humans; external aid in the form of rhyming dictionaries is already commonly employed in this domain.
To ensure that all experiments were trained on the same data, we removed infilling examples which would have exceeded our training sequence length of 256 tokens for the model with the longest sequence length (LM-All). This removed no examples from STORIES, a small fraction of examples from LYRICS, and a substantial number of examples from ABSTRACTS.

B Masking function
We design a mask function which takes the entire document and selectively masks several span granularities: words, n-grams, sentences, paragraphs, and entire documents. Accordingly, models trained via ILM on this masking function offer users the ability to specify the granularity of text to infill at a particular location. This allows users to have coarse but intuitive control over infilling length, so that multiple paragraphs are not generated when the user was expecting a single word.
Our masking function first constructs a tree of the training example (using the natural hierarchy of documents, paragraphs, sentences, and words). Then, using a pre-order tree traversal, each subtree is masked with 3% probability (or ignored if any of its ancestors are already masked). If the entire document (root node of the tree) is masked, then the infilling model's job is equivalent to that of a language model. If a word (leaf) is selected to be masked, 50% of the time we mask that individual word, otherwise we mask an n-gram of random length between 1 and min(8, # words left in the sentence) words (inclusive). Note that a word may comprise multiple tokens, as GPT-2 uses sub-word tokenization (Sennrich et al., 2015). We chose the value of 3% as, for the datasets we considered, it resulted in a marginal token mask rate of around 15%, echoing the configuration of Devlin et al. (2019).
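A minimal sketch of this traversal follows. The (granularity, children-or-text) node representation is illustrative, and the leaf-level choice between masking a single word and masking an n-gram is omitted.

```python
def mask_tree(node, rng, p=0.03, masked=None, ancestor_masked=False):
    """Pre-order traversal over a (granularity, children_or_text) tree,
    masking each subtree with probability p and skipping subtrees whose
    ancestor has already been masked."""
    if masked is None:
        masked = []
    if not ancestor_masked and rng.random() < p:
        masked.append(node)
        ancestor_masked = True
    _, children = node
    if isinstance(children, list):
        for child in children:
            mask_tree(child, rng, p, masked, ancestor_masked)
    return masked
```

Because descendants of a masked node are skipped, masking the root (the whole document) subsumes all other spans, which is the case where infilling reduces to language modeling.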
We add special tokens for each granularity to our model's vocabulary (e.g. [blank word]) so that the user may specify which granularity they would like the infilling model to produce. This functionality can be explored in our demo: https://chrisdonahue.com/ilm.

While we focus on this specific mask function in this paper, we structured the ILM codebase to allow users to train infilling models for completely different use cases. Users need only define a new mask function which takes complete documents and outputs lists of character-level spans representing the desired spans to be masked.

C Hyperparameters
We use early stopping based on the PPL of the model on infilling the masked tokens of the validation set. We train all models using the default fine-tuning parameters specified in the transformers library (Wolf et al., 2019), except that we use a batch size of 24 and a sequence length of 256.
Note that the most straightforward way of training an LM on ILM examples (Section 3.2) is to maximize the likelihood of the entire concatenated example: x̃, [sep], and y. This trains the model to predict tokens in x̃ even though such behavior is not necessary at inference time, as x̃ will always be fully specified. Nevertheless, we found that this additional supervision improved performance when evaluating model PPL of y. Conveniently, this is also the default behavior when adapting existing LM training code for use with ILM.
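The difference between the two objectives boils down to a per-token loss mask over the serialized example; a sketch:

```python
def loss_mask(token_ids, sep_id, supervise_context=True):
    """Per-token loss weights for an ILM example serialized as
    x_tilde + [sep] + y. With supervise_context=True (the default
    behavior described above) the LM loss covers every position;
    otherwise it covers only the target tokens y after [sep]."""
    if supervise_context:
        return [1.0] * len(token_ids)
    sep_pos = token_ids.index(sep_id)
    return [0.0] * (sep_pos + 1) + [1.0] * (len(token_ids) - sep_pos - 1)
```

Keeping the mask all-ones is exactly what off-the-shelf LM training code does, which is why no loss-masking changes are needed to adapt it for ILM.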

D Evaluation on language modeling and infilling other granularities
Our quantitative evaluation (Section 5) examined the sentence infilling performance of GPT-2 initialized from the large-scale pre-trained checkpoint after fine-tuning on different datasets and infilling strategies. Here, we report PPL for GPT-2 both initialized from scratch and from the pre-trained checkpoint for several other configurations: language modeling, a mixture of granularities, and specific granularities.

D.1 Language modeling
In Table 3, we report results for language modeling.

D.2 Mixture of granularities
In Table 4, we report results for a mixture of granularities. Specifically, we run the same mask function we use for training (Appendix B) on our test data and evaluate PPL on the masked spans. This reflects general infilling ability across a wide variety of granularities (and hence lengths). Unlike our other quantitative evaluations, there may be multiple variable-length spans missing from each example in this evaluation. Results are similar to those of sentence infilling: ILM outperforms LM and LM-Rev and is similar to LM-All despite using much less memory.

D.3 Individual granularities
In Tables 5 to 8 we report PPL values for infilling performance on paragraphs, sentences, n-grams, and words, respectively, across the three datasets.
For each granularity, we create one infilling example per document from the test set with exactly one masked span (randomly chosen from all spans of that granularity for that document). Then, we compute PPL only on the tokens which comprise the masked span, i.e., PPL is computed for all models on exactly the same set of tokens. Across all granularities, we observe that ILM outperforms LM and LM-Rev and either outperforms or is comparable with LM-All while using less memory.

E Details on human evaluation
For human evaluation, we sampled 100 stories from the test set of the STORIES dataset. From each story, we masked out one sentence at a time, thereby resulting in 500 stories with masked sentences. Then we used these stories as context and tasked each model with infilling the masked sentence. We compared 8 models in total. In addition to the four models reported in Section 6 (BERT, SA, LM, and ILM), we included the models initialized from scratch (as opposed to initialized from the large-scale pre-trained checkpoint) for exhaustive comparison. Furthermore, to filter out spam, we used a control model which always generates "This sentence was generated by a computer." Lastly, we included the original sentence from the dataset as a reference model (Human) to sanity check that the max score is around 80%.
Each annotator was shown 8 stories, one from each model, and was asked to identify one of the five sentences generated by a machine (see Figure 4 for an example). Among the 100 collected responses, we filtered out 5 responses whose annotation for the control model was wrong. The quantitative and qualitative results can be found in Table 9 and Figure 5, respectively. All model outputs and responses of the human evaluation are available on the CodaLab worksheet linked in the introduction.

Table 9: Human evaluation results.
Figure 4: Example annotation task. The annotator is asked to identify one of the five sentences generated by a machine:

○ Patty was excited about having her friends over.
○ She had been working hard preparing the food.
○ Patty knew her friends wanted pizza.
○ All of her friends arrived and were seated at the table.
○ Patty had a great time with her friends.

Example Story with Masked Sentence
Lily always loved to read. She wondered sometimes, what it would be like to write a book?
[blank] Lily did well in the course, and during it, wrote a short book.

BERT: I held her hand and helped her sit.
SA: Of her, but she didn't know her.
LM: She practiced reading a lot every week.
ILM: Finally, in middle school, her teacher introduced her to writing that.
Human: She decided to take a course on fiction writing.

BERT: Or rather, what the next job would be now.
SA: I was going out I was going to the beach.
LM: I put on about thirty sugar cubes.
ILM: The issues are getting so many people crazy.
Human: I could never catch up and each week got worse.

Example Story with Masked Sentence
Yesterday was Kelly's first concert.
She was nervous to get on stage.
[blank] Kelly was then happy. She couldn't wait to do it again.

BERT: Today was the first concert that she had to see every where.
SA: She was going to go to the play.
LM: When she went on stage she smoothly walked right past the audience.
ILM: When she got on stage the band was amazing.
Human: As soon as she got on the audience applauded.
