INSET: Sentence Infilling with INter-SEntential Transformer

Missing sentence generation (or sentence infilling) enables a wide range of applications in natural language generation, such as document auto-completion and meeting note expansion. The task asks the model to generate an intermediate missing sentence that syntactically and semantically bridges the surrounding context. Solving the sentence infilling task requires techniques in natural language processing ranging from understanding to discourse-level planning to generation. In this paper, we propose a framework that decouples the challenge into these three aspects and addresses each in turn, leveraging the power of existing large-scale pre-trained models such as BERT and GPT-2. We empirically demonstrate the effectiveness of our model in learning sentence representations suitable for generation and in generating missing sentences that fit the context.


Introduction
Generating a span of missing tokens in a text chunk, known as "text infilling," has attracted much attention recently (Zhu et al., 2019; Song et al., 2019; Ippolito et al., 2019; Joshi et al., 2020). Here we study the related but somewhat different task of "sentence infilling." Specifically, as illustrated in Figure 1, intermediate sentences (or chunks of text) are removed from long-form text (e.g., paragraphs, documents), and the task is to generate the missing pieces so that they smoothly blend into and fit the context both syntactically and semantically. The generation can be based either on the context alone, or on the context together with side information such as constraint keywords. Compared with text infilling, sentence infilling requires the model to handle inter-sentential correlation and to reason about missing semantic information.

* These authors contributed equally to this work.

[Figure 1: Sentence infilling: generating an intermediate sentence that provides a smooth semantic transition from the preceding to the following context. This example is generated by our model on the TripAdvisor dataset. The colors mark the correspondence between the generated sentence and the context. Preceding context: "Beautiful beachside boutique hotel with great views and modern decoration. My favorite part about this hotel is definitely the restaurant, UVA. I recently visited UVA to attend a friend's birthday party. ..." Generated sentence: "She was extremely happy with our hotel and we had a complimentary buffet." Following context: "... The food was just phenomenal! I can't recall what everything was called, but we rolled out of there stuffed and happy. My husband had the rib eye dumpling as an appetizer and he said it was the best dumpling he has ever had."]

Developing models for sentence infilling can potentially facilitate many text generation applications. Possible scenarios include, but are not limited to: document auto-completion, by detecting and suggesting missing bridging sentences in the surrounding context; collaborative document writing, by modifying and unifying the writing styles of multiple authors; and meeting note expansion, by extending a set of keywords (lexical constraints) to a full sentence that leverages the surrounding context.
There are many challenges associated with this long-form sentence infilling task, which is typically a one-to-many problem in that the possible outputs can be diverse. As the generated sentence should connect separate text pieces in a syntactically and semantically smooth and coherent manner, the task requires a wide range of understanding, planning, and generation techniques. Large-scale pre-trained language models such as BERT (Devlin et al., 2019) and GPT-2 (Radford et al., 2019) have dramatically enhanced the understanding and generation modules. However, how to integrate them in a holistic manner, and how to analyze and establish the long-range dependence structure by high-level semantic planning, remains challenging and largely unexplored: semantic appropriateness is usually subtler than syntactic appropriateness, which can be well characterized by autoregressive language models.
Several prior works have addressed related problems. MASS (Song et al., 2019) obtains sentence representations by predicting a span of missing tokens. It can be used to generate missing text, but the missing span length must be pre-specified. Other related works (Joshi et al., 2020) also require knowledge of the span length as an input to their models, and thus differ from ours. Text infilling (Zhu et al., 2019) sequentially generates tokens for the missing part of a sentence until an end-of-blank token is generated, so its generation can be of arbitrary length. By design, all these previous approaches operate at the token level, and thus arguably focus more on lexical appropriateness than on global semantics.
In this paper, we propose INter-SEntential Transformer (INSET), a novel approach to sentence infilling. Our model first produces sentence-level semantic features that encapsulate the missing high-level information. Then, grounded on the predicted semantic features, the model generates the syntactic and lexical features that realize the predicted sentence. Specifically, understanding, planning, and generation are handled by three modules in a synergistic manner:
• a BERT-based encoder that maps each sentence to the latent semantic space;
• a sentence-level semantic planner that infers the missing information bridging the semantics of the preceding and following context;
• a GPT-2-based generator (decoder) that maps semantic features back to the text domain.
The main contributions and advantages of this work are summarized as follows:
• We study the task of sentence infilling, which requires the model to handle inter-sentential correlation and to predict missing semantic information. This goes beyond text infilling (Zhu et al., 2019), which asks the model to fill in the missing part of a single sentence.
• Our approach decouples understanding, planning, and generation, and leverages existing large-scale pre-trained understanding and generation models (BERT, GPT-2). The components of our model can be separately examined and improved with additional data.
• Our model predicts a feature vector in the latent semantic space for the missing sentence and maps this vector to text, thereby taking care of semantic smoothness and appropriateness.
• Our model allows the generation to be of arbitrary length, as in (Zhu et al., 2019).
• Compared with directly processing text, our approach significantly reduces computation time and memory usage during training: after pre-computing sentence features, the sequence length is the number of sentences rather than the number of tokens.

Related Work
Pre-Trained Language Model. Language models pre-trained on a large corpus improve natural language understanding and generation through transferable contextualized word representations (Devlin et al., 2019; Lample et al., 2019) and models (Howard and Ruder, 2018). Large transformer models (Vaswani et al., 2017) like GPT-2 (Radford et al., 2019), Megatron (https://github.com/NVIDIA/Megatron-LM), and T5 (Raffel et al., 2019) can achieve state-of-the-art results without training on any particular language modeling benchmark. (Keskar et al., 2019) proposes a conditional generation model trained to condition on control codes that govern style, content, and other task-specific properties. Unlike these models, ours builds sentence representations via autoencoding with a pair of BERT and GPT-2.
Context-Aware Text Generation. There are several related works on context-aware text generation (Mikolov and Zweig, 2012; Tang et al., 2016; Mangrulkar et al., 2018). Most previous works on language modeling with contextual information (Wang and Cho, 2016; Wang et al., 2018; Sordoni et al., 2015b; Wen et al., 2015; Vinyals and Le, 2015) treat the preceding sentences as context. Compared with these sequential generation tasks, our task is constrained by bidirectional context and is thus more challenging. Text infilling (Zhu et al., 2019) aims at filling in the missing part of a sentence given the rest; an iterative inference algorithm based on gradient search has also been proposed for this task. For story infilling, (Ippolito et al., 2019) first predicts rare words in the missing span, and then generates text conditioned on these words. SpanBERT (Joshi et al., 2020) masks random contiguous spans and (pre-)trains a language model to predict the tokens in the span. XL-Editor (Shih et al., 2019) adapts XLNet to text infilling and other editing tasks. (Kang and Hovy, 2019) models logical connections between sentences and generates intermediate sentences grounded on inter-sentential "flow." (Bhagavatula et al., 2020) formulates abductive commonsense reasoning as a natural language inference task: given the background described by one sentence, decide the appropriate reason that could explain the observation in another sentence. A further line of work proposes a text style transfer task that translates a sentence, in the context of a paragraph, into a desired style. These works study generation tasks that address inter-sentential relationships, and are thus conceptually related to our motivation.
Compared with (Zhu et al., 2019; Ippolito et al., 2019; Joshi et al., 2020; Shih et al., 2019; Kang and Hovy, 2019; Bhagavatula et al., 2020), our approach is clearly different. We fully exploit the existing large-scale pre-trained models BERT and GPT-2 to learn smooth sentence embeddings in the latent semantic space, and then process sentence-level information in this space.

Hierarchical Text Generation. Hierarchical text generation with high-level semantic planning has been studied in many previous works. (Sordoni et al., 2015a) presents a hierarchical recurrent encoder-decoder architecture for context-aware query suggestion. Another work proposes a framework to infer semantic features for response generation using self-supervised learning. Previous works have used multi-level LSTM encoders (Hu et al., 2020) and hierarchical autoencoders (Li et al., 2015) to learn hierarchical representations for long text. (Shen et al., 2019) uses a variational autoencoder to encode an entire paragraph into a single latent variable, from which the paragraph can be generated hierarchically. In comparison, our task is to generate intermediate sentences within surrounding context.

Task Definition
The task of sentence infilling is formally defined as follows. Consider a dataset of N paragraphs, where the k'th paragraph consists of M^(k) consecutive sentences s_1^(k), s_2^(k), . . . , s_{M^(k)}^(k), among which the m_k'th sentence s_{m_k}^(k) is missing. The task is to generate a sentence ŝ_{m_k}^(k) for the missing position such that it fits the context. For simplicity and without any confusion, we drop the index k from now on (note that M and m may depend on k).
The criteria for successful generation are:
• The sentence ŝ_m is fluent and meaningful.
• Inserting the generated sentence into the context yields a semantically coherent paragraph (s_1, s_2, . . . , s_{m−1}, ŝ_m, s_{m+1}, . . . , s_M).
• ŝ_m is written in the same style as the contextual sentences {s_j}_{j≠m}.
Since multiple semantically different sentences may fit the same context well, ŝ_m need not be close to the ground truth s_m. Rather, ŝ_m is considered successful as long as it satisfies the criteria above.

INSET: Inter-Sentential Transformer
Model Overview. At a high level, our model consists of two components: a (denoising) autoencoder and a sentence-level transformer. The former maps each sentence to a fixed-length feature vector in the latent semantic space, and reconstructs the sentence from the representation. The latter predicts the semantic features of the missing sentence from those of contextual sentences. We call our model INter-SEntential Transformer (INSET).
Formally, let (E, D) be an autoencoder, where E (D) is the encoder (decoder) such that E • D and D • E are supposed to be identity maps. Let T be a sentence-level transformer with positional encoding P. The transformer T takes the contextual information as input and outputs a hypothetical representation of the missing sentence. Specifically,

f̂_m = T(f_1, f_2, . . . , f_{m−1}, 0, f_{m+1}, . . . , f_M)[m],   (1)

where f_j = E s_j is the encoding of the sentence s_j, 0 is the zero vector representing the missing sentence, and T(· · · )[m] is the output of the transformer T at the missing position m.
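To make Eq. (1) concrete, the sketch below assembles the transformer input from precomputed sentence features, filling the missing slot with a zero vector. The function name `build_transformer_input` is our own illustrative choice, not part of the released implementation.

```python
import numpy as np

def build_transformer_input(context_features, missing_pos):
    """Assemble the input sequence for the sentence-level transformer T.

    context_features: list whose j-th entry is the feature vector
    f_j = E(s_j), with None at the missing position. The missing slot
    is filled with a zero vector; T's output at index `missing_pos`
    is then taken as the predicted feature of the missing sentence.
    """
    assert context_features[missing_pos] is None
    dim = len(next(f for f in context_features if f is not None))
    rows = [np.zeros(dim) if f is None else np.asarray(f, dtype=float)
            for f in context_features]
    return np.stack(rows)

# toy example: 3 sentences with 4-dimensional features, middle one missing
feats = [np.ones(4), None, 2 * np.ones(4)]
x = build_transformer_input(feats, missing_pos=1)
```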
The autoencoder and the sentence-level transformer can be trained separately. We first train the former on individual sentences. Then, we precompute and save the feature vectors of all sentences. While training the latter, it is not necessary to load the former. This makes training more efficient.

[Figure 2: Model overview. Left panel: Denoising autoencoder. The encoder E takes a corrupted sentence (with each token w_i for i = 1, 2, . . . , l masked randomly) as input and outputs a representation of the sentence. The decoder D should reconstruct the original uncorrupted sentence. The training parameters of E and D are initialized with those of BERT and GPT-2, respectively. Right panel: Sentence-level transformer. Using the encoder E, we obtain the representation of every contextual sentence. These sentence representations are fed into a sentence-level transformer T, which outputs a representation of the missing sentence.]
Sentence Representation Learning via Denoising Autoencoding. Large-scale pre-training approaches (e.g., BERT) lead to superior performance in many language understanding tasks related to sentence representation learning (Reimers and Gurevych, 2019). However, the features learned by BERT (or fine-tuned on downstream tasks) cannot be directly used for generation tasks, as the masked language model objective of BERT does not enforce reconstruction of the original sentence from the extracted features. Instead of directly using BERT features, we learn sentence representations via autoencoding. This naturally integrates BERT and GPT-2, and combines sentence representation learning and generation. As shown in the left panel of Figure 2, we prepend the [CLS] token to each sentence s_j. We initialize the encoder E with BERT, and extract the output f_j corresponding to the [CLS] token as the embedding of s_j. We initialize the decoder D with GPT-2, and feed f_j as the embedding of the zeroth token. Then, we let D generate a sequence of tokens, in the hope that the sequence matches s_j (padded with the special tokens [SOS] at the beginning and [EOS] at the end). To train the autoencoder, we use teacher forcing and minimize the negative log-likelihood loss by (fine-)tuning the parameters of E and D jointly.
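As a concrete illustration of the reconstruction objective, the toy helper below (our own `teacher_forcing_nll`, not the paper's code) computes the teacher-forcing negative log-likelihood from a matrix of decoder logits; in real training this loss is backpropagated jointly through the BERT-initialized E and the GPT-2-initialized D.

```python
import numpy as np

def teacher_forcing_nll(logits, target_ids):
    """Average negative log-likelihood under teacher forcing.

    At step t the decoder is fed the gold token w_{t-1} (with the
    sentence feature f_j as the zeroth embedding) and scored on
    predicting w_t. logits: (T, V) unnormalized scores per step;
    target_ids: length-T gold token ids.
    """
    logits = np.asarray(logits, dtype=float)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

# confident logits on the gold tokens -> near-zero loss
loss_good = teacher_forcing_nll([[10.0, 0.0], [0.0, 10.0]], [0, 1])
# uniform logits over a 2-token vocabulary -> loss = ln 2
loss_uniform = teacher_forcing_nll([[0.0, 0.0], [0.0, 0.0]], [0, 1])
```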
An autoencoder embeds sentences into vectors in the latent space. We hope that the embedding is smooth in the sense that semantically similar sentences are mapped to vectors that are close to each other. In particular, interpolation between two points in the latent space should correspond to a smooth semantic transition in the text domain. To this end, we use the following two tricks.
First, we employ a denoising autoencoder, which is known to yield a smoother embedding (Vincent et al., 2008). To add noise, we randomly mask each token in s j with probability 15% by replacing the masked tokens with a special token [MASK]. During training, we use the "noisy" s j with masks as input to the encoder, and use the "clean" s j without masks to compute the loss function. Of course, one could try more sophisticated noise-adding strategies (Lewis et al., 2019).
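The masking step can be sketched as follows; `corrupt` is our own minimal helper name, and the 15% rate mirrors BERT's masking probability.

```python
import random

MASK = "[MASK]"

def corrupt(tokens, mask_prob=0.15, rng=None):
    """Denoising-autoencoder noise: independently replace each token
    with [MASK] with probability mask_prob. The corrupted sequence is
    fed to the encoder E; the clean sequence is the decoder's target."""
    rng = rng or random.Random(0)
    return [MASK if rng.random() < mask_prob else t for t in tokens]

sent = "the hotel staff were friendly and helpful".split()
noisy = corrupt(sent, rng=random.Random(0))
```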
Second, we use early stopping. In our experiments, we observe that as training proceeds, the validation loss of the autoencoder keeps decreasing. In the absence of masks, presumably it would eventually decay to zero so that the autoencoder perfectly reconstructs every sentence. However, this does not necessarily imply that the embedding is smooth. On the contrary, an overtrained autoencoder often tries to remember every individual token and thus fails to achieve smoothness in the latent semantic space. Moreover, it can catastrophically forget some of the information in the initial pre-trained model (GPT-2) and partially lose the power of generating fluent sentences. In practice, we select a checkpoint by monitoring its validation performance on sentence interpolation. Some examples of sentence interpolation are shown in Table 1.
Sentence Feature Prediction. After encoding sentences into feature vectors, we use a sentence-level transformer T to predict the feature vector of the missing sentence from those of contextual sentences. This is analogous to the task of predicting masked tokens for (pre-)training BERT (Devlin et al., 2019), but now it is at the sentence level. Indeed, sentence feature vectors in T correspond to token embeddings in BERT, and sentence position ID in T corresponds to position ID in BERT.
We train the transformer T by maximizing the cosine similarity between the predicted and ground-truth features, i.e., minimizing the loss

L_T = − cos(f_m, T(f_1, . . . , f_{m−1}, 0, f_{m+1}, . . . , f_M)[m]),   (2)

where cos(· · · ) is the cosine similarity between the ground-truth sentence feature vector f_m and the prediction T(· · · )[m] in Eq. (1). Note that cos(· · · ) is a good similarity measure only when its arguments are (approximately) unit vectors. This is guaranteed by the technical trick of fixing the parameters of the last LayerNorm of the transformers E and T, i.e., not computing the gradients of these parameters in backpropagation.
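Numerically, the objective can be sketched as below; `cosine_loss` is a hypothetical helper name, and in practice the two arguments would be the (approximately unit-norm) outputs of E and of T at the missing position.

```python
import numpy as np

def cosine_loss(f_true, f_pred, eps=1e-8):
    """Negative cosine similarity between the ground-truth sentence
    feature f_m and the prediction T(...)[m]; minimizing this loss
    maximizes the cosine similarity."""
    f_true = np.asarray(f_true, dtype=float)
    f_pred = np.asarray(f_pred, dtype=float)
    cos = f_true @ f_pred / (np.linalg.norm(f_true) * np.linalg.norm(f_pred) + eps)
    return -cos
```

Identical vectors give a loss of (approximately) −1, orthogonal vectors a loss of 0.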
Generating Sentences from Features. At test time, we use the decoder D to generate the missing sentence by mapping the predicted feature vector to the text domain. Standard generation schemes such as top-k sampling, beam search, and nucleus sampling can be used without additional modeling effort.
Computational Efficiency. Compared with vanilla GPT-2, our model can process and analyze a document containing many sentences at the discourse level with dramatically lower time and space complexity. For a quantitative estimate, suppose that a document contains N_s sentences, each of which has N_t tokens. Then, the time complexity of a token-level model attending over the whole document is O(N_s^2 N_t^2), whereas that of our model is O(N_s N_t^2 + N_s^2): encoding each sentence costs O(N_t^2), and the sentence-level transformer costs O(N_s^2). Moreover, sentence features can be precomputed once and then reused for every epoch or even in other tasks on the same dataset. If sentence features have been precomputed and are already directly available, the time complexity is further reduced to O(N_s^2).
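As a back-of-the-envelope check, assuming quadratic self-attention cost and ignoring constants and layer counts (our own simplification), the toy functions below compare a token-level model attending over all N_s · N_t tokens with per-sentence encoding plus a sentence-level transformer.

```python
def token_level_cost(n_s, n_t):
    """Full self-attention over a document of n_s * n_t tokens:
    proportional to (n_s * n_t) ** 2."""
    return (n_s * n_t) ** 2

def inset_cost(n_s, n_t):
    """Per-sentence encoding, n_s * n_t ** 2, plus the
    sentence-level transformer, n_s ** 2."""
    return n_s * n_t ** 2 + n_s ** 2

# e.g., 20 sentences of 32 tokens each
baseline_cost = token_level_cost(20, 32)  # (20 * 32) ** 2 = 409600
ours_cost = inset_cost(20, 32)            # 20 * 1024 + 400 = 20880
```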

Sentence Infilling with Lexical Constraints
We further introduce a related task, sentence infilling with lexical constraints, which is the same as sentence infilling except that some keywords of the missing sentence are given as an additional input to guide the generation. The keywords are treated as soft constraints (a.k.a. priming): the generated sentence is not directly enforced to contain the exact keywords; it may contain a synonym or merely share some semantics with them. We expect the presence of keyword constraints to make the task more difficult rather than easier, even though incorporating keywords significantly improves the BLEU score of the generation with respect to the ground truth. Intuitively, keywords force the model to speculate about the semantics of the ground truth sentence and significantly reduce the number of admissible solutions, whereas in the absence of keywords the model is free to complete the task in its own way.
To handle keyword constraints, we introduce a new component called the constraint feature encoder to our architecture. It is a transformer encoder K that maps a set S of keywords to a feature vector that lives in the same latent space of sentence embeddings. We train K with knowledge distillation (Kim and Rush, 2016). The teacher model is the sentence encoder E, which maps a sentence containing the keywords in S to a feature vector. We use the cosine similarity loss between these two feature vectors to teach the student model K.
For implementation details, suppose we have two keywords w 1 and w 2 . Then, the input to K is three tokens ([CLS], w 1 , w 2 ). We replace the zero vector in Eq. (1), which represents the missing sentence, with the output of K above the [CLS] token. We do not use positional encoding in K because keywords do not have order.

Experimental Setup
We evaluate our model on two datasets (TripAdvisor and Recipe). We have released the source code to facilitate future research (https://github.com/dreasysnail/INSET).

Dataset and Preprocessing.
We conduct experiments on the TripAdvisor and Recipe datasets. For the TripAdvisor dataset of hotel reviews (Wang et al., 2010), we partially follow the preprocessing in (Cho et al., 2019). Our preprocessing includes, but is not limited to: (i) discarding reviews containing non-English tokens; (ii) removing duplicate reviews so that only one copy is retained. We set the maximum number of tokens in a sentence to 32 and the minimum number of sentences in a review to 7 (so that the context is not too short). Any review with a longer sentence or fewer sentences is discarded.
We use the following strategy to mask sentences. For a paragraph consisting of M ≥ 7 consecutive sentences, we split it into M −6 data points, each of which has exactly 7 sentences. The j'th data point spans from the j'th to the (j + 6)'th sentence (inclusive) of the paragraph, for j = 1, 2, . . . , M − 6. We mask the middle (i.e., 4th) sentence for each data point so that the masking rate is 1/7 ≈ 14.3%, which is close to that (15%) of BERT. After preprocessing, the size of the dataset (training, validation, test) is (1108134, 62543, 533) data points.
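The sliding-window masking strategy can be sketched as follows (`make_datapoints` is our own illustrative name):

```python
def make_datapoints(paragraph):
    """Split a paragraph of M >= 7 sentences into M - 6 overlapping
    windows of exactly 7 sentences, masking the middle (4th) sentence
    of each window. Returns (context, target) pairs, where context is
    the 6 surrounding sentences and target is the masked one."""
    M = len(paragraph)
    assert M >= 7
    datapoints = []
    for j in range(M - 6):
        window = paragraph[j:j + 7]
        context = window[:3] + window[4:]
        target = window[3]
        datapoints.append((context, target))
    return datapoints

# a 9-sentence paragraph yields 3 data points
paragraph = [f"s{i}" for i in range(1, 10)]
datapoints = make_datapoints(paragraph)
```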
Our strategy of always masking the middle sentence out of 7 sentences is not only the simplest but also without loss of generality. Our model is directly applicable to the situation where we randomly mask, e.g., 3 out of 20 sentences. However, the quality of human evaluation may be affected because the patience and attention of human evaluators may decrease as the context length increases. For the effectiveness of human evaluation, we use the simplest strategy to mask sentences.
The Recipe dataset is obtained from Common Crawl (https://commoncrawl.org), where the metadata is formatted according to Schema.org (https://schema.org/Recipe). We use the same preprocessing as for the TripAdvisor dataset, except that instructions with fewer than 5 sentences are discarded. After preprocessing, the size of the dataset (training, validation, test) is (1073886, 56055, 500) data points. Recipe instructions usually describe a time-ordered procedure, and are thus ideal for testing the reasoning capability of the model.

Evaluation Metrics. Following common practice in text generation, we perform automatic evaluation using standard machine translation metrics, including BLEU (Papineni et al., 2002), NIST (Doddington, 2002), and METEOR (Lavie and Agarwal, 2007). As a variant of BLEU, NIST weights n-gram matches by their information gain, and thus penalizes uninformative n-grams. We also use Entropy (Zhang et al., 2018) and Dist-n (Li et al., 2016) to evaluate lexical diversity. We refer readers to the cited papers for metric details.
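For illustration, Dist-n is commonly computed as the number of distinct n-grams divided by the total number of generated n-grams; the minimal helper below (our own, not the evaluation script used in the paper) follows that formulation.

```python
def distinct_n(sentences, n):
    """Dist-n lexical diversity: distinct n-grams / total n-grams
    across all generated sentences (whitespace tokenization)."""
    ngrams = []
    for s in sentences:
        toks = s.split()
        ngrams.extend(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)
```

For example, distinct_n(["a b a b"], 1) is 0.5 (two distinct unigrams out of four), while distinct_n(["a b c"], 2) is 1.0.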
BLEU, NIST, and METEOR measure the similarity between the generated sentence and the ground truth. They are not ideal scores for our task because a sentence that is semantically very different from the ground truth could possibly fit the context perfectly. However, it may still be helpful to compute these commonly used scores. It is an important and challenging open problem to design an automatic score that faithfully measures the overall quality of the generation in our task.
Baseline. Our baseline is the self-attention model for text infilling (Zhu et al., 2019), a transformer language model with a novel positional encoding. The traditional approach of encoding the absolute position of each token is not directly applicable to our task, because we do not know in advance the absolute positions of contextual tokens after the missing sentence. To resolve this issue, (Zhu et al., 2019) divides the text into segments. In the case of a single masked sentence, the first (third) segment consists of contextual tokens before (after) the mask, and the second corresponds to the mask. Each token is then indexed by its segment ID and its position ID within the segment. The missing tokens are generated sequentially from these IDs and the current surrounding context. When training the baseline model on our dataset, we use the same hyperparameters as in the original reference, except that the batch size is set to 250 (rather than 400 as in (Zhu et al., 2019)) to avoid out-of-memory errors. Note that we handle much longer sequences (usually > 100 tokens) than (Zhu et al., 2019), in which the maximum number of tokens in a sequence is only 16.
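Our reading of the baseline's segment-based indexing can be sketched as follows (illustrative only; details such as index origins may differ in the actual implementation of (Zhu et al., 2019)):

```python
def segment_position_ids(num_before, num_blank, num_after):
    """Index each token by (segment_id, within-segment position):
    segment 1 = context before the blank, segment 2 = the blank,
    segment 3 = context after. This avoids needing the absolute
    positions of tokens that come after the missing sentence."""
    ids = []
    for seg, length in ((1, num_before), (2, num_blank), (3, num_after)):
        ids.extend((seg, pos) for pos in range(length))
    return ids
```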
The baseline model is trained for a sufficient number (30) of epochs until the validation (negative log-likelihood) loss and perplexity clearly saturate. We report the results of the checkpoint with the smallest validation loss and perplexity. Note that we observe that other checkpoints in the saturation regime behave very similarly on the test set.
Keyword Extraction. In the task of sentence infilling with lexical constraints, we need to extract keywords from the masked sentence. Keyword extraction is a classical problem in information retrieval. Standard methods include, but are not limited to, tf-idf (term frequency-inverse document frequency) (Ramos, 2003). We have tried tf-idf, but it does not work well for the TripAdvisor dataset of hotel reviews. One reason is that this dataset has quite a few typos, and unfortunately tf-idf favors them because typos occur less frequently than normal words. This issue can be resolved by manually filtering out all typos. After the fix, however, we observe that the quality of extracted keywords remains unsatisfactory.
We use the following strategy to extract keywords. We first define a list of stop words, starting from the NLTK stop word list (Bird et al., 2009) and manually adding a number of words (e.g., "hotel") that are not very informative for this particular dataset of hotel reviews. For each sentence, we select the non-stop words that appear most frequently in the entire dataset. We usually select two keywords per sentence, but occasionally select one or even zero if few words remain after filtering out stop words and typos. We observe that the keywords extracted with this strategy capture the gist of most sentences well.
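A minimal sketch of this frequency-based extractor (helper names are our own; restricting candidates to words with a known corpus frequency also filters out rare typos):

```python
def extract_keywords(sentence, stop_words, corpus_freq, k=2):
    """Select up to k non-stop-word tokens that occur most frequently
    in the entire corpus; frequent corpus-wide words are unlikely to
    be typos, unlike the rare words favored by tf-idf."""
    candidates = [w for w in sentence.lower().split()
                  if w not in stop_words and w in corpus_freq]
    candidates.sort(key=lambda w: corpus_freq[w], reverse=True)
    keywords = []
    for w in candidates:
        if w not in keywords:
            keywords.append(w)
    return keywords[:k]

stop_words = {"the", "was", "and"}
corpus_freq = {"pool": 50, "clean": 80, "bbq": 5}
kw = extract_keywords("the pool was clean and bbq", stop_words, corpus_freq)
```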
Model Size and Hyperparameters. Our architecture has several components. The encoder E and the sentence-level transformer T have the same size as BERT BASE . The decoder D has the same size as GPT-2 (117M). In the presence of lexical constraints, the constraint feature encoder K has the same size as BERT BASE . During decoding, we use beam search with beam size 5.

Experimental Results
Sentence Representation Learning. We first qualitatively evaluate the smoothness of the latent-space sentence embeddings learned via denoising autoencoding. Table 1 shows two examples of sentence interpolation on the TripAdvisor dataset. In each example, the first and last sentences are input by hand, and the 3 intermediate ones are interpolations generated by our (denoising) autoencoder. We observe that the interpolations not only combine words from the input sentences, but are readable, meaningful, and show a smooth semantic transition from the first to the last sentence. We speculate that the power of generating fluent and semantically coherent sentence interpolations is derived from BERT and GPT-2: inherited from these large-scale pre-trained models, the latent-space sentence embedding is reasonably smooth, as our interpolation results show.
[Table 1 (sentence interpolation example): endpoints A = "The service was attentive and we had the best food in town." and B = "The room was very spacious with 2 queen beds."; interpolations: "The service was attentive and we had a great room with plenty of food." / "The room was spacious with good service and we had a queen bed." / "The room was very spacious with queen beds."]

Automatic Evaluation. As shown in Table 2, our results are closer to the ground truth and are more diverse than the baseline. In terms of the average generation length, our results are much closer to the ground truth than the baseline is. Table 2 also presents two ablation studies. The first shows the performance decrease with less context. Recall that each data point in the TripAdvisor dataset has 6 contextual sentences (full context). We train INSET on the same set of data points but truncate the context to 4 sentences (less context). The second ablation study shows the effect of context in the presence of keywords. We compare two models. The first (INSET w/ context) is the model described in Subsection 3.3; its generation is based on both keywords and context. The second (INSET w/o context) is D • K, which directly decodes the output of the constraint feature encoder K using the decoder D; its generation is based only on keywords, not context. We observe that the scores of the first model are higher than those of the second. Both ablation studies show that our model can make full use of context to improve the generation.
Human Evaluation. We performed human evaluation of our method on the TripAdvisor dataset. We used a crowd evaluation platform to compare two systems and assess their fluency, informativeness, and relevance to the surrounding context (coherence) on 500 random samples from the test set. Following recommended best practices, each sample was evaluated by five judges. We performed simple spam detection by excluding judges who were too fast or performed too poorly on a gold set. To avoid bias, we randomized the position of each system while asking judges to compare our systems (with and without keywords) with the ground truth and the text infilling baseline (Zhu et al., 2019).

[Table 2: Automatic evaluation. "w/ context" indicates that the generation is based on both keywords and context; "w/o context" indicates that the generation is based only on keywords, not context. "Ent." and "Len." stand for Entropy and the average generation length, respectively.]

[Table 3: Human evaluation. "w/ (w/o) keywords" and "w/ (w/o) context" indicate whether the generation is based on keywords and context, respectively. All numbers are percentages.]

Table 3 shows the human evaluation results. The judges strongly prefer our results (without keywords) to the baseline in all aspects: coherence, fluency, and informativeness. They also strongly prefer the ground truth to our results. Moreover, our results with keywords and context are compared with three other systems: (i) the ground truth; (ii) our results with keywords but not context; (iii) our results with context but not keywords. The second comparison shows that in the presence of keywords, our model can use context to improve all aspects of the generation. The third comparison shows that the presence of keywords reduces the performance of our model, probably because keywords are constraints that the model must accommodate.

Generated Examples.
To qualitatively demonstrate the effectiveness of our model, Table 4 shows some examples from the TripAdvisor and Recipe datasets. We observe that the baseline (Zhu et al., 2019) tends to generate generic sentences, while our results (either with or without keywords) are more informative and can fit the surrounding context reasonably well. Table 5 shows examples generated by our model in the same context but with different keywords. Our model can extend keywords to a full sentence, adapting to the context. More examples generated by our model on both datasets are given in Appendix A.

Conclusions and Outlook
We study the task of sentence infilling, which is analogous to the masked language modeling task for (pre-)training BERT, but now at the sentence level.

[Table 4: Examples generated by our model and the baseline.]

Example 1 (TripAdvisor)
preceding context: It was such a pleasure to see somthing new every night. It was not very crowded so we were able to get great seats at either the pool or the beach. The VIP sevice was great for dinner reservations and pillow service.
following context: Enjoyed the shrimp coctail and seafood salad delivered to us while enjoying the pool. All of us would not want to stay at another resort and are planning to go back again. Enjoy and Hola! Karen and Friends, Milford, CT
ground truth: We did bring a lot of $1 for tipping and of course the service stepped up a notch more.
baseline: The staff was friendly and helpful.
INSET: The buffet dinner was amazing and we had the best food in the resort.
INSET + keywords ($, service): Service fee for the buffet dinner was $5.00 and we paid $5.00 extra for food service.

Example 2 (TripAdvisor)
preceding context: The walls are very thin. Since this is a family vacation type of hotel, people are up at the pool/bbq area/hallways during all hours of the night. Do not stay here if you need a quite night of sleep.
following context: You have to take multiple elevators to go all the way to the 5th floor. My other complaint is that the hotel staff seemed a bit unprofessional. Not what I'm used to when I stay at Marriot properties.
ground truth: Also, the elevator situation is weird.
baseline: The rooms are very clean and well kept.
INSET: There is only one elevator block in the hotel.
INSET + keywords (elevator, situation): The elevator situation is extremely frustrating.

Example 3 (Recipe)
preceding context: After another 15 minutes or so the mixture should thicken up. The mixture will continue to thicken as it cools.
following context: Sterilize your jars and lids and while still hot fill with the jam leaving about a 1/2 inch headspace. Place lids onto the jars and boil in a water bath with jars covered by 3 inches of water for 10 minutes.
ground truth: Remove from the heat and stir in your amaretto.
baseline: Add the flour mixture to the dry ingredients and mix well.
INSET: Carefully remove the jars from hot water and keep going until a thick sauce is formed.
INSET + keywords: -
preceding context My room was a very good size. Tiled floors and woodchip painted walls. The tv did not work -so what.
following context Great places to eat close by and very reasonable. No air con -so summer could be sticky. My concern is the left luggage room not supervised.
human oracle The location is terrific beside Sevilla metro stn so only 2 to get by metro all the way to airport. + (walk, shopping) Walking distance to shopping mall and Circular Quay.
Internet cost $20.00 per day. Table 5: Examples generated by our model in the same context but with different keywords. "+ (· · · )" is keywords.
tence level. Sentence infilling requires the model to handle long-range inter-sentential correlation and to process high-level semantic information. It is complementary to (token-level) masked language modeling, which focuses more on syntactic appropriateness and short-range correlation. We propose a framework called INSET to decouple three aspects of the task (understanding, planning, and generation) and address them in a unified manner. We demonstrate the effectiveness of our approach using automatic and human evaluation. Our approach can be modified or extended in some ways. (i) We use a denoising autoencoder to obtain sentence embeddings. One can try to use a variational autoencoder (Kingma and Welling, 2014) instead. A large-scale pre-trained variational autoencoder  could possibly improve the smoothness of sentence embeddings. (ii) Our model predicts a feature vector for the missing sentence. This vector can be fed into and serve as a guide to token-level models including the baseline (Zhu et al., 2019).
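To make the denoising-autoencoder idea in (i) concrete, here is a minimal toy sketch in PyTorch. It uses GRU encoder/decoder networks over integer token ids rather than the BERT/GPT-2 pair used in the paper, and the corruption scheme (word dropout to an assumed mask id 0) and all hyperparameters are illustrative assumptions:

```python
import torch
import torch.nn as nn

class DenoisingSentenceAE(nn.Module):
    """Toy denoising autoencoder: corrupt a token sequence, encode it to a
    single fixed-size vector, and train the decoder to reconstruct the clean
    sequence. The encoder state serves as the sentence embedding."""

    def __init__(self, vocab_size=1000, emb=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb)
        self.encoder = nn.GRU(emb, hidden, batch_first=True)
        self.decoder = nn.GRU(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def corrupt(self, tokens, drop_prob=0.15):
        # Word dropout: replace tokens with id 0 (an assumed mask/pad id).
        mask = torch.rand(tokens.shape) < drop_prob
        return tokens.masked_fill(mask, 0)

    def encode(self, tokens):
        _, h = self.encoder(self.embed(tokens))
        return h.squeeze(0)           # (batch, hidden) sentence embedding

    def forward(self, tokens):
        h = self.encode(self.corrupt(tokens)).unsqueeze(0)
        dec_out, _ = self.decoder(self.embed(tokens), h)
        return self.out(dec_out)      # logits for reconstruction loss

model = DenoisingSentenceAE()
tokens = torch.randint(1, 1000, (4, 12))   # batch of 4 toy "sentences"
logits = model(tokens)
loss = nn.functional.cross_entropy(logits.reshape(-1, 1000), tokens.reshape(-1))
emb = model.encode(tokens)                 # sentence embeddings for planning
```

Swapping in a variational autoencoder would amount to predicting a mean and variance from the encoder state, sampling the latent vector, and adding a KL term to the reconstruction loss.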
Since sentence infilling is analogous to masked language modeling, we expect that it can also be used as a pre-training task. For example, in machine translation of long texts, sentences are often translated independently of each other. This sometimes leads to incoherence or even inconsistency between the translated sentences. A post-editor to fix such issues (Voita et al., 2019) should be able to understand inter-sentential relationships and to generate fluent sentences in the surrounding context, both of which can be learned from sentence infilling.
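As an illustration of how such a pre-training task could be set up (this is an assumed data-preparation sketch, not the authors' code; the blank-token convention is hypothetical), each sentence of a paragraph can in turn be blanked out to form (context, target) training pairs:

```python
def make_infilling_examples(paragraph_sentences, blank_token="[BLANK]"):
    """For each sentence position, produce a (context-with-blank, target)
    pair. A model trained to generate the target from the surrounding
    context mirrors masked language modeling at the sentence level."""
    examples = []
    for i, target in enumerate(paragraph_sentences):
        context = (paragraph_sentences[:i] + [blank_token]
                   + paragraph_sentences[i + 1:])
        examples.append((" ".join(context), target))
    return examples

pairs = make_infilling_examples(
    ["The room was spacious.", "The view was great.", "We would stay again."])
# pairs[1] blanks out the middle sentence and keeps both of its neighbors.
```

A post-editing model for translation could be trained on exactly such pairs, learning to regenerate each sentence so that it coheres with its neighbors.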
Note. After this paper was posted on arXiv, some related works appeared. Shen et al. (2020) propose the Blank Language Model for text infilling and other tasks. Donahue et al. (2020) train (fine-tune) a language model (GPT-2) for text and sentence infilling. Another work pre-trains a large-scale variational autoencoder with a pair of BERT and GPT-2. Ippolito et al. (2020) use a sentence-level language model, which operates on sentence embeddings obtained from BERT, to predict story endings.

Table 6: Generated examples by our model on the TripAdvisor dataset.

example 1
preceding context: Unfortunetly the view is going to be partly blocked with yet another "Glass tower" going in.
ground truth: It's very colorful and unique.
INSET: There is also a beach resort with lots of loungers.

example 2
preceding context: The room was spacious and clean. No tub in our room.
ground truth: We had a terrific view from the 16th floor.
INSET: We had a room on the upper floor which overlooks the lobby.

example 3
preceding context: My family stayed here for 5 nights in August 2011. The resort is beautiful and the grounds are immaculately manicured. The kitchen is great for the family.
following context: We would just pack a cooler and head out in our rental car and explore the island. The pools at the resort were fabulous and the staff was attentive. We used the grills (kept very clean) several nights.
ground truth: We were able to keep essentials in the room which made those early morning excursions more enjoyable.
INSET: We have plenty of kitchen utensils and the beach was a nice place to stay.

example 4
preceding context: We stayed in 2 interconnecting rooms as we are a family of 5. We started off with a bad start, as the check in was not aware that we were with 3 kids. I booked directly with them and got a confirmation via email for 2 rooms for 2 adults.
following context: Obviously this was not reflected in the paper work check-in had. We could only add an extra bed for an extra charge, but I refused to pay for this as I had phoned them before. The check-in lady would not bend, and we had to go for 2 rooms with 2 seperate beds.
ground truth: Before we arrived I called reservations to change this into 2 adults and 3 children.
INSET: When we checked in we were told that we had to request another room on the 2nd floor due to the extra charges.

example 5
preceding context: It was such a pleasure to see somthing new every night. It was not very crowded so we were able to get great seats at either the pool or the beach. The VIP sevice was great for dinner reservations and pillow service.
following context: Enjoyed the shrimp coctail and seafood salad delivered to us while enjoying the pool. All of us would not want to stay at another resort and are planning to go back again. Enjoy and Hola! Karen and Friends, Milford, CT
ground truth: We did bring a lot of $1 for tipping and of course the service stepped up a notch more.
INSET: The buffet dinner was amazing and we had the best food in the resort.

example 6
preceding context: My intentions were to expect the worst which made my stay there that much better than everyone elses. If everyone thought they were staying at the Hyatt, no wonder they thought so negatively about the place. I am in my late twenties and wanted a place where I could walk to local bars, restaurants, etc.
following context: This was the perfect place for me. As far as the accomodations, the beds were small (but so was everywhere else in Europe) and the showers were unusual. Otherwise it was worth the money for a prime time location in the heart of the night life area.
ground truth: without struggling to find my way home at night.
INSET: So I had no reason to stay in the HOTEL itself.

Table 7: Generated examples by our model on the Recipe dataset.

example 1
preceding context: Roll up rectangles width-wise and pinch ends to seal. Bake for 12 minutes or until the tops begin to brown.
following context: Best when served warm. For added flavor, serve with strawberry jelly.
ground truth: Let cool on baking sheet.
INSET: Cool on wire rack and remove.

example 2
preceding context: Drizzle each potato cup with 1 teaspoon browned butter. Cover muffin tin tightly with aluminium foil and place in oven.
following context: Remove from oven and turn broiler on high. Sprinkle potato rounds evenly with remaining parmesan cheese.
ground truth: Bake for 25 minutes.
INSET: Bake for 20 minutes or until potatoes are tender.

example 3
preceding context: Preheat oven to 425 degrees Fahrenheit. Line a baking sheet with a SILPAT mat.
following context: With a pastry cutter, cut in the coconut oil and the butter. Make a well and add in the milk 1/2 cup at a time, stirring gently with a wooden spoon.
ground truth: In a bowl, mix the flour, baking powder, baking soda and sea salt.
INSET: In a medium bowl, mix together the flour, baking powder, sugar, salt and cinnamon.

example 4
preceding context: Heat the oil in a pan at medium. Add the mushrooms and saute until tender, about 7-10 minutes.
following context: Add the reserved water and simmer at medium-high until reduced by half, about 10 minutes. Meanwhile cook the pasta as directed on the package.
ground truth: Add shallots, garlic, thyme, salt and pepper and saute for 2 minutes.
INSET: Add the garlic and sautee until fragrant, about 2 minutes.

example 5
preceding context: After another 15 minutes or so the mixture should thicken up. The mixture will continue to thicken as it cools.
following context: Sterilize your jars and lids and while still hot fill with the jam leaving about a 1/2 inch headspace. Place lids onto the jars and boil in a water bath with jars covered by 3 inches of water for 10 minutes.
ground truth: Remove from the heat and stir in your amaretto.
INSET: Carefully remove the jars from hot water and keep going until a thick sauce is formed.

example 6
preceding context: Bake the graham cracker crust for 10 minutes. Remove from oven and allow to cool to room temperature.
following context: Stir in the lime zest and lime juice. Stir until mixture is smooth and begins to slightly thicken.
ground truth: Meanwhile, combine the egg yolks and condensed milk in a medium bowl.
INSET: In a medium bowl, combine the cream cheese and powdered sugar, stirring until smooth.

Table 8: Examples generated by our model in the same context but with different keywords. "+ (· · · )" is keywords.

preceding context: Also has a safe. The hotel is in a good location, beside the City Centre and there are a nice selection of shops within the Monte Carlo. Service was very good but avoid the concierge in the morning when people are booking tours, the queues are long.
following context: No wi-fi in the room which is a bit annoying but they have it in the foodcourt by Starbucks and McDs. Also we were disappointed to see the $15/night resort fee was charged to our credit card after our stay. I don't recall them mentioning this at check-in.
human oracle: CVs is next door and it's 24/7 so you can buy snacks and anything else you fancy.
+ (breakfast, cereal): Breakfast is included with cereal, muffins and breads.
+ (food, expensive): Prices are expensive but food in the hotel is very cheap.