POINTER: Constrained Progressive Text Generation via Insertion-based Generative Pre-training

Large-scale pre-trained language models, such as BERT and GPT-2, have achieved excellent performance in language representation learning and free-form text generation. However, these models cannot be directly employed to generate text under specified lexical constraints. To address this challenge, we present POINTER, a simple yet novel insertion-based approach for hard-constrained text generation. The proposed method operates by progressively inserting new tokens between existing tokens in a parallel manner. This procedure is recursively applied until a sequence is completed. The resulting coarse-to-fine hierarchy makes the generation process intuitive and interpretable. We pre-train our model with the proposed progressive insertion-based objective on a 12GB Wikipedia dataset, and fine-tune it on downstream hard-constrained generation tasks. Non-autoregressive decoding yields an empirically logarithmic time complexity during inference. Experimental results on both the News and Yelp datasets demonstrate that POINTER achieves state-of-the-art performance on constrained text generation. We release the pre-trained models and the source code to facilitate future research.


Introduction
Real-world editorial assistant applications must often generate text under specified lexical constraints, for example, converting a meeting note with key phrases into a concrete meeting summary, recasting a user-input search query as a fluent sentence, generating a conversational response using grounding facts (Mou et al., 2016), or creating a story from a pre-specified set of keywords (Fan et al., 2018; Yao et al., 2019; Donahue et al., 2020).
Generating text under specific lexical constraints is challenging. Constrained text generation broadly falls into two categories, depending on whether inclusion of the specified keywords in the output is mandatory. In soft-constrained generation (Qin et al., 2019; Tang et al., 2019), keyword-text pairs are typically first constructed (sometimes along with other conditioning information), and a conditional text generation model is trained to capture their co-occurrence, so that the model learns to incorporate the constrained keywords into the generated text. While soft-constrained models are easy to design, and can be further aided by soft enforcement mechanisms such as attention and copy mechanisms (Bahdanau et al., 2015; Gu et al., 2016; Chen et al., 2019a), keywords are still apt to be lost during generation, especially when multiple weakly correlated keywords must be included.
Hard-constrained generation (Hokamp and Liu, 2017; Post and Vilar, 2018; Hu et al., 2019; Miao et al., 2019; Welleck et al., 2019), on the other hand, requires that all the lexical constraints be present in the output sentence. This approach typically involves sophisticated design of network architectures. Hokamp and Liu (2017) construct a lexically-constrained grid beam search decoding algorithm to incorporate constraints. However, Hu et al. (2019) observe that a naive implementation of this algorithm has a high running time complexity. Miao et al. (2019) introduce a sampling-based conditional generation method, where the constraints are first placed in a template, and words at a random position are then inserted, deleted or updated under a Metropolis-Hastings-like scheme. However, individually sampling each token results in slow convergence, as the joint distribution of all the tokens in a sentence is highly correlated. Welleck et al. (2019) propose a tree-based text generation scheme, where a token is first generated in an arbitrary position, and then the model recursively

Stage | Generated text sequence
0 (X0) | sources sees structure perfectly
1 (X1) | sources company sees change structure perfectly legal
2 (X2) | sources suggested company sees reason change tax structure which perfectly legal .
3 (X3) | my sources have suggested the company sees no reason to change its tax structure , which are perfectly legal .
4 (X4) | my sources have suggested the company sees no reason to change its tax structure , which are perfectly legal .
Table 1: Example of the progressive generation process with multiple stages from the POINTER model. Words in blue indicate newly generated words at the current stage. X_i denotes the generated partial sentence at Stage i. X_4 and X_3 being the same indicates the end of the generation process. Interestingly, our method allows informative words (e.g., company, change) to be generated before the non-informative words (e.g., the, to), which are generated at the end.
generates words to its left and right, yielding a binary tree. However, the constructed tree may not reflect the progressive hierarchy/granularity from high-level concepts to low-level details. Further, the time complexity of generating a sentence is O(n), as in standard autoregressive methods.
Motivated by the above, we propose a novel non-autoregressive model for hard-constrained text generation, called POINTER (PrOgressive INsertion-based TransformER). As illustrated in Table 1, generation of words in POINTER is progressive and iterative. Given lexical constraints, POINTER first generates high-level words (e.g., nouns, verbs and adjectives) that bridge the keyword constraints; these words are then used as pivoting points at which to insert details of finer granularity. This process iterates until the sentence is completed by adding the least informative words (typically pronouns and prepositions).
Due to the resemblance to the masked language modeling (MLM) objective, BERT (Devlin et al., 2019) can be naturally utilized for initialization. Further, we perform large-scale pre-training on a large Wikipedia corpus to obtain a pre-trained POINTER model that can be readily fine-tuned on specific downstream tasks.
The main contributions of this paper are summarized as follows. (i) We present POINTER, a novel insertion-based Transformer model for hard-constrained text generation. Compared with previous work, POINTER allows long-term control over generation due to its top-down progressive structure, and enjoys a significant reduction in empirical time complexity from O(n) to O(log n) in the best case. (ii) We propose large-scale pre-training to further boost performance. (iii) We develop a novel beam search algorithm customized to our approach, further improving generation quality. (iv) Experiments on several datasets across different domains (including News and Yelp) demonstrate the superiority of POINTER over strong baselines. Our approach is simple to understand and implement, yet powerful, and can be leveraged as a building block for future research.

Related Work
Language Model Pre-training Large-scale pre-trained language models, such as BERT (Devlin et al., 2019), RoBERTa (Liu et al., 2019), XLNet (Yang et al., 2019), the Text-to-text Transformer (Raffel et al., 2019) and ELECTRA (Clark et al., 2020), have achieved great success on natural language understanding benchmarks. GPT-2 (Radford et al., 2018) first demonstrated the great potential of leveraging Transformer models for generating realistic text. MASS (Song et al., 2019) and BART (Lewis et al., 2019) propose methods for sequence-to-sequence pre-training. UniLM (Dong et al., 2019) unifies the generation and understanding tasks within a single pre-training scheme. DialoGPT (Zhang et al., 2020) and MEENA (Adiwardana et al., 2020) focus on open-domain conversations. CTRL (Keskar et al., 2019) and Grover (Zellers et al., 2019) guide text generation with pre-defined control codes. In addition, recent work has also investigated how to leverage BERT for conditional text generation (Chen et al., 2019b; Mansimov et al., 2019; Li et al., 2020). To the best of our knowledge, ours is the first large-scale pre-training work for hard-constrained text generation.

Non-autoregressive Generation Many attempts have been made to use non-autoregressive models for text generation tasks. For neural machine translation, the promise of such methods mostly lies in their decoding efficiency. For example, Gu et al. (2018) employ a non-autoregressive decoder that generates all the tokens simultaneously. Generation can be further refined with a post-processing step to remedy the conditional independence of the parallel decoding process (Lee et al., 2018; Ghazvininejad et al., 2019; Ma et al., 2019; Sun et al., 2019; Kasai et al., 2020). Deconvolutional decoders (Zhang et al., 2017; Wu et al., 2019) have also been studied for title generation and machine translation. The Insertion Transformer (Stern et al., 2019; Gu et al., 2019; Chan et al., 2019) is a partially autoregressive model that predicts both insertion positions and tokens, and is trained to maximize the entropy over all valid insertions, providing fast inference while maintaining good performance. Our POINTER model hybridizes the BERT and Insertion Transformer models, inheriting the advantages of both, and generates text in a progressive coarse-to-fine manner.

Model Overview
Let X = {x_0, x_1, ..., x_T} denote a sequence of discrete tokens, where each token x_t ∈ V and V is a finite vocabulary set. For the hard-constrained text generation task, the goal is to generate a complete text sequence X, given a set of keywords as constraints, where the keywords must be included exactly in the final generated sequence, in the given order.
Let us denote the lexical constraints as X^0. The generation procedure of our method can be formulated as a (progressive) sequence of K stages X^0, X^1, ..., X^K, where each stage can be perceived as a finer-resolution text sequence compared to the preceding stage. X^K is the final generation, obtained once the iterative procedure has converged (i.e., X^{K-1} = X^K). Table 1 shows an example of our progressive text generation process. Starting from the lexical constraints (X^0), at each stage the algorithm inserts tokens progressively to formulate the target sequence; at most one new token can be generated between two existing tokens. Formally, we propose to factorize the distribution according to the importance (defined later) of each token, p(X) = ∏_{k=1}^{K} p(X^k | X^{k-1}) (1): the more important tokens that form the skeleton of the sentence, such as nouns and verbs, appear in earlier stages, and the auxiliary tokens, such as articles and prepositions, are generated at the later stages. In contrast, the autoregressive model factorizes the joint distribution of X in a standard left-to-right manner, i.e., p(X) = p(x_0) ∏_{t=1}^{T} p(x_t | x_{<t}), ignoring word importance. Though the Insertion Transformer (Stern et al., 2019) attempts to implement the progressive generation agenda in (1), it does not directly address how to train the model to generate important tokens first.

Data Preparation
Designing a loss function that (i) rewards generating important tokens earlier and (ii) rewards generating more tokens at each stage would be complicated. Instead, we prepare the data in a form that eases model training.
The construction of data-instance pairs reverses the generation process. We construct pairs of text sequences at adjacent stages, i.e., (X^{k-1}, X^k), as the model input. Each training instance X is thus broken into a consecutive series of pairs (X^0, X^1), (X^1, X^2), ..., (X^{K-1}, X^K), where K is the number of such pairs. At each iteration, the algorithm masks out a proportion of the existing tokens in X^k to yield a sub-sequence X^{k-1}, creating a training instance pair (X^{k-1}, X^k). This procedure is iterated until fewer than c tokens remain (c is small).
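As a rough illustration (not the authors' released code), the sketch below shows how a tokenized sentence could be broken into stage-wise (X^{k-1}, X^k) pairs by repeated masking; `select_mask` stands in for the DP-based mask-selection routine discussed below, and the token/score representation is a simplifying assumption.

```python
def build_training_pairs(tokens, importance, min_len=3):
    """Break one sentence into (X^{k-1}, X^k) training pairs by repeatedly
    masking out tokens.  `select_mask` is the DP-based mask-selection
    routine sketched later (hypothetical name)."""
    pairs = []
    current = list(tokens)
    scores = list(importance)
    while len(current) >= min_len:
        mask = select_mask(scores)            # mask[t] == 1 -> discard token t
        if not any(mask):                     # nothing left to remove
            break
        reduced = [tok for tok, m in zip(current, mask) if m == 0]
        reduced_scores = [s for s, m in zip(scores, mask) if m == 0]
        pairs.append((reduced, current))      # one (X^{k-1}, X^k) pair
        current, scores = reduced, reduced_scores
    return list(reversed(pairs))              # stage 1 ... stage K order
```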
Two properties are desired when constructing data instances: (i) important tokens should appear in an earlier stage, so that the generation follows a progressive manner; (ii) the number of stages K should be small, so that generation is fast at inference time.
Token Importance Scoring We consider three different schemes to assess the importance score of a token: term frequency-inverse document frequency (TF-IDF), part-of-speech (POS) tagging, and Yet-Another-Keyword-Extractor (YAKE) (Campos et al., 2018, 2020). The TF-IDF score provides a corpus-level evaluation of the uniqueness and local enrichment of a token. POS tagging indicates the role of a token at the sequence level; we explicitly assign noun or verb tokens a higher POS tagging score than tokens from other categories. YAKE is a commonly used unsupervised automatic keyword extraction method that relies on statistical features extracted from single documents to select the most important keywords (Campos et al., 2020). YAKE is good at extracting common keywords, but relatively weak at extracting special nouns (e.g., names), and does not provide any importance level for non-keyword tokens. Therefore, we combine the above three metrics for token importance scoring: the overall score α_t of a token x_t is the sum of its TF-IDF, POS and YAKE scores. Additionally, stop words are manually assigned a low importance score. If a token appears several times in a sequence, the later occurrences are assigned a decayed importance score to prevent the model from generating the same token multiple times in one step at inference time. We note that our choice of components for the importance score is heuristic. It would be better to obtain an unbiased/oracle assessment of importance, which we leave for future work.
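The exact weighting of the three signals is truncated in the text above, so the sketch below simply sums them and applies the stop-word and repetition adjustments described; the `tfidf` and `yake_scores` dictionaries are assumed to be precomputed by external tools, and NLTK is used only for POS tagging.

```python
import nltk  # assumes the averaged_perceptron_tagger resource is available

STOP_WORDS = {"the", "a", "an", "to", "of", "and", "in", "on", "is", "it", "for"}
NOUN_VERB_TAGS = ("NN", "VB")  # treat nouns/verbs as high-importance POS

def token_importance(tokens, tfidf, yake_scores,
                     stop_penalty=0.01, repeat_decay=0.5):
    """Combine TF-IDF, POS and YAKE signals into one importance score per
    token (illustrative; the paper's exact combination may differ)."""
    pos_tags = nltk.pos_tag(tokens)
    seen = {}
    scores = []
    for tok, (_, tag) in zip(tokens, pos_tags):
        pos_score = 1.0 if tag.startswith(NOUN_VERB_TAGS) else 0.5
        score = tfidf.get(tok, 0.0) + pos_score + yake_scores.get(tok, 0.0)
        if tok.lower() in STOP_WORDS:
            score *= stop_penalty                  # stop words: low importance
        score *= repeat_decay ** seen.get(tok, 0)  # decay repeated occurrences
        seen[tok] = seen.get(tok, 0) + 1
        scores.append(score)
    return scores
```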
DP-based Data Pair Construction Since we leverage the Insertion-based Transformer, which allows at most one new token to be generated between each two existing tokens, the sentence length can at most double at each iteration. Consequently, the optimal number of iterations K is log(T), where T is the length of the sequence. Therefore, generation efficiency can be optimized by encouraging more tokens to be discarded during each masking step when preparing the data. However, masking positionally interleaved tokens ignores token importance, and thus loses the property of progressive planning from high-level concepts to low-level details at inference time. In practice, sequences generated by such an approach can be less semantically consistent, as less important tokens occasionally steer generation towards random content.
We design an approach to mask the sequence by considering both token importance and efficiency, using dynamic programming (DP). To accommodate the nature of insertion-based generation, the masking procedure is under the constraint that no two consecutive tokens can be masked at the same stage. Under such a condition, we score each token and select a subset of tokens that add up to the highest score (all scores are positive). This allows the algorithm to adaptively choose as many highly scored tokens as possible to mask.
Formally, as an integer linear programming problem (Richards and How, 2002), the objective is to find an optimal masking pattern Φ = {φ_1, ..., φ_T}, where φ_t ∈ {0, 1}, φ_t = 1 represents discarding the corresponding token x_t, and φ_t = 0 indicates that x_t remains. For a sequence X, the objective is to maximize the total score of the masked tokens, where the score for masking token x_t decreases with its importance α_t (e.g., α_max − α_t with α_max = max_t{α_t}), subject to the constraint that no two adjacent tokens are both masked. Though solving this integer program directly is computationally expensive, one can resort to an analogous problem for a solution, the so-called House Robbery Problem, a variant of the Maximum Subarray Problem (Bentley, 1984), where a professional burglar plans to rob houses along a street and tries to maximize the outcome, but cannot break into two adjacent houses without triggering an alarm. This can be solved with dynamic programming (Bellman, 1954) (also known as Kadane's algorithm (Gries, 1982)), as shown in Algorithm 1.

Algorithm 1: DP-based Data Pair Construction.
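A minimal sketch of the mask-selection DP, assuming the reward for masking token x_t is α_max − α_t plus a small ε (so all rewards are positive); this is one reasonable reading of the truncated objective above, not the authors' exact implementation.

```python
def select_mask(importance, eps=1e-6):
    """Pick tokens to mask so that no two adjacent tokens are masked and the
    total masking reward is maximal (house-robbery style DP).  The reward
    for masking token t is (alpha_max - alpha_t + eps), so less important
    tokens are preferred for masking."""
    T = len(importance)
    if T == 0:
        return []
    a_max = max(importance)
    reward = [a_max - a + eps for a in importance]

    # best[t] = max total reward over positions 0..t; take[t] = mask token t?
    best = [0.0] * T
    take = [False] * T
    best[0], take[0] = reward[0], True
    for t in range(1, T):
        skip = best[t - 1]
        grab = reward[t] + (best[t - 2] if t >= 2 else 0.0)
        if grab > skip:
            best[t], take[t] = grab, True
        else:
            best[t], take[t] = skip, False

    # Backtrack to recover the masking pattern Phi (1 = discard token).
    mask = [0] * T
    t = T - 1
    while t >= 0:
        if take[t]:
            mask[t] = 1
            t -= 2
        else:
            t -= 1
    return mask
```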

Model Training
Stage-wise Insertion Prediction With all the data-instance pairs (X^{k-1}, X^k) created as described above as the model input, we optimize the following objective:

∑_{k=1}^{K} log p(X̃^k, Φ^{k-1} | X^{k-1}),   (3)

where X̃^k = X^k − X^{k-1} denotes the tokens newly inserted at stage k, and Φ^{k-1} denotes an indicator vector in the k-th stage, representing whether an insertion operation is applied in each slot.
As illustrated in Figure 1, while the MLM objective in BERT only predicts the token at a masked placeholder, our objective comprises both (i) the likelihood of an insertion indicator for each slot (between two existing tokens), and (ii) the likelihood of each new token conditioned on the activated slot. To handle both in a single model, we expand the vocabulary with a special no-insertion token [NOI]. During inference, the model can predict either a token from the vocabulary to insert, or an [NOI] token indicating that no new token will be inserted at that slot at the current stage. By utilizing this special token, the two objectives are merged. Note that the same Insertion Transformer module is re-used at different stages. We empirically observed that the model can learn to insert different words at different stages; it presumably learns from the completion level (how discontinuous the current context sequence is) to roughly estimate the progress up to that point.
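To make the merged objective concrete, the sketch below shows one way the per-slot training targets could be derived from a pair (X^{k-1}, X^k); it assumes X^{k-1} is a subsequence of X^k with at most one insertion per slot, which holds by construction, and is an illustration rather than the released preprocessing code.

```python
NOI = "[NOI]"

def slot_targets(prev_tokens, next_tokens):
    """Label the slot after each token of X^{k-1} (except the last) with the
    token inserted there in X^k, or [NOI] when nothing is inserted.
    Example: ["[SOS]", "sources", "sees", "[EOS]"] vs
             ["[SOS]", "sources", "company", "sees", "[EOS]"]
    yields ["[NOI]", "company", "[NOI]"]."""
    targets = []
    j = 0  # pointer into next_tokens
    for i in range(len(prev_tokens) - 1):
        assert next_tokens[j] == prev_tokens[i], "X^{k-1} must be a subsequence of X^k"
        j += 1
        if next_tokens[j] == prev_tokens[i + 1]:
            targets.append(NOI)              # no insertion in this slot
        else:
            targets.append(next_tokens[j])   # the newly inserted token
            j += 1
    return targets
```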
During inference, once all the slots in a stage X^k predict [NOI] for the next stage, the generation procedure has converged and X^k is the final output sequence. Note that to account for this final stage, during data preparation we incorporate an (X, N) pair for each sentence in the training data, where N denotes a sequence of [NOI] tokens of the same length as X. To enable the model to insert at the beginning and end of the sequence, an [SOS] token and an [EOS] token are added at the beginning and the end of each sentence, respectively.
In light of the similarity with the MLM objective, we use the BERT model to initialize the Insertion Transformer module.
Large-scale Pre-training In order to provide a general large-scale pre-trained model that can benefit various downstream tasks via fine-tuning, we train a model on the massive publicly available English Wikipedia dataset, which covers a wide range of topics. The Wiki dataset is first pre-processed according to Sec. 3.2. We then initialize the model with BERT, and train it on the processed data using our training objective (3). After pre-training, the model can be used to generate an appropriate sentence with open-domain keyword constraints, in a tone that represents the Wiki style. In order to adapt the pre-trained model to a new domain (e.g., News and Yelp reviews), the pre-trained model is further fine-tuned on new datasets, which empirically yields better performance than training a model on the target domain alone.

Inference
During inference, starting from the given lexical constraint X^0, the proposed model generates text stage-by-stage using greedy search or top-K sampling (Fan et al., 2018), by applying the Insertion Transformer module repeatedly until no additional token is generated. If an [NOI] token is generated for a slot, it is removed in the next round.
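A minimal sketch of this stage-wise greedy decoding loop; `predict_slot_tokens` is a hypothetical stand-in for the model forward pass, not part of the released API.

```python
NOI = "[NOI]"

def generate(constraints, predict_slot_tokens, max_stages=8):
    """Progressive greedy decoding sketch.  `predict_slot_tokens(seq)` is a
    placeholder for the model forward pass; it returns one predicted token
    (possibly [NOI]) for the slot after each token of `seq` except the last."""
    seq = ["[SOS]"] + list(constraints) + ["[EOS]"]
    for _ in range(max_stages):
        slot_preds = predict_slot_tokens(seq)        # len(seq) - 1 predictions
        if all(p == NOI for p in slot_preds):
            break                                    # converged: all slots say [NOI]
        new_seq = []
        for tok, pred in zip(seq[:-1], slot_preds):
            new_seq.append(tok)
            if pred != NOI:
                new_seq.append(pred)                 # insert the new token
        new_seq.append(seq[-1])
        seq = new_seq
    return [t for t in seq if t not in ("[SOS]", "[EOS]")]
```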
Inner-Layer Beam Search According to (3), all new tokens are generated simultaneously, conditioned on the existing tokens from the previous stage. Despite being fully parallel, this approach, like BERT (Yang et al., 2019) and NAT models (Ghazvininejad et al., 2019; Kasai et al., 2020), suffers from a conditional independence problem: the predicted tokens are generated conditionally independently and are agnostic of each other. This can result in repeating or inconsistent new tokens at each generation round. To address this weak-dependency issue, we perform a modified beam search algorithm for decoding. Specifically, at stage k, suppose the existing tokens from the last stage are X^{k-1}; for predicting the next stage X^k, there are T_{k-1} available slots. A naive approach to beam search would be to maintain a priority queue of the top B candidate token series while moving from the leftmost slot to the rightmost slot. At the t-th move, the priority queue contains the top B series of tokens predicted for the preceding slots, (s_1, ..., s_{t-1}). The model then evaluates the likelihood of each item in the vocabulary (including [NOI]) for slot s_t, by computing the likelihood of (s_1, ..., s_{t-1}, s_t). This is followed by a ranking step to select the top B most likely series among the V·B candidate series to grow. However, such a naive approach is expensive, as it requires O(T·B·V) evaluations.
Instead, we approximate the search by constraining it to a narrow band. We design a customized beam search algorithm for our model, called inner-layer beam search (ILBS). This method applies an approximate local beam search at each iteration to find the optimal stage-wise decoding. At the t-th slot, ILBS first generates the top B token candidates by applying one evaluation step based on the existing generation. Prediction is limited to these top B token candidates, and the beam search procedure described above is applied on this narrow band of B candidates instead of the full vocabulary V. This reduces the computation to O(T·B^2).
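A simplified sketch of ILBS for a single stage; `score_slot_candidates` is a hypothetical stand-in for one model evaluation that returns the top-B (token, log-probability) candidates for a slot given the insertions already chosen for earlier slots.

```python
import heapq

NOI = "[NOI]"

def inner_layer_beam_search(seq, score_slot_candidates, beam_size=4):
    """ILBS sketch for one stage.  `score_slot_candidates(seq, chosen, slot, k)`
    is a placeholder for one model evaluation: given the existing sequence and
    the tokens already chosen for earlier slots, it returns the top-k
    (token, logprob) candidates (including [NOI]) for slot `slot`."""
    beams = [(0.0, [])]  # (cumulative logprob, tokens chosen so far)
    for slot in range(len(seq) - 1):
        expanded = []
        for logp, chosen in beams:
            # restrict this slot to a narrow band of top-B candidates
            for tok, tok_logp in score_slot_candidates(seq, chosen, slot, beam_size):
                expanded.append((logp + tok_logp, chosen + [tok]))
        beams = heapq.nlargest(beam_size, expanded, key=lambda b: b[0])
    _, best_choice = max(beams, key=lambda b: b[0])

    # splice the chosen insertions (skipping [NOI]) back into the sequence
    out = []
    for tok, ins in zip(seq[:-1], best_choice):
        out.append(tok)
        if ins != NOI:
            out.append(ins)
    out.append(seq[-1])
    return out
```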

Experiments
We evaluate the POINTER model on constrained text generation over the News and Yelp datasets. Details of the datasets and experimental results are provided in the following sub-sections. The pre-trained models and the source code are available on GitHub.

Experimental Setup
Datasets and Pre-processing We evaluate our model on two datasets. The EMNLP2017 WMT News dataset contains 268,586 sentences; we randomly pick 10k sentences as the validation set, and 1k sentences as the test set. The Yelp English review dataset is from Cho et al. (2018), and contains 160k training examples, 10k validation examples and 1k test examples. These two datasets vary in sentence length and domain, enabling the assessment of our model in different scenarios. The English Wikipedia dataset used for pre-training is first pre-processed into a set of natural sentences with a maximum sequence length of 64 tokens, which results in 1.99 million sentences for model training in total (12.6 GB raw text). On average, each sentence contains 27.4 tokens.
For inference, we extract the testing lexical constraints for all the compared methods using the third-party extraction tool YAKE. The maximum number of lexical constraints used for News and Yelp is set to 4 and 7, respectively, to account for the average sentence lengths of News (27.9 ≈ 4 × 2^3) and Yelp (50.3 ≈ 7 × 2^3): since the sequence length can at most double at each stage, this allows generation to finish within about 4 stages.
Baselines We compare our model with two state-of-the-art methods for hard-constrained text generation: (i) Non-Monotonic Sequential Text Generation (NMSTG) (Welleck et al., 2019), and (ii) Constrained Sentence Generation by Metropolis-Hastings Sampling (CGMH) (Miao et al., 2019). We also compare with an autoregressive soft-constraint baseline (Gao et al., 2020). Note that the Insertion Transformer (Stern et al., 2019) focuses on machine translation rather than the hard-constrained generation task, and is therefore not considered for comparison. Other methods based on grid beam search typically have long inference times and only operate at the inference stage; these are also excluded from comparison. For all compared systems, we use the default settings suggested by the authors, and the models are trained until the evaluation loss no longer decreases. More details are provided in the Appendix.

Experiment Setups
We employ the tokenizer and model architectures from the BERT-base and BERT-large models for all tasks, and use the BERT models as our model initialization. Each model is trained until the validation loss no longer decreases. We use a learning rate of 3e-5 without any warm-up schedule for all training procedures. The optimization algorithm is Adam (Kingma and Ba, 2015). We pre-train our model on the Wiki dataset for 2-4 epochs, and fine-tune on the News and Yelp datasets for around 10 epochs.
Evaluation Metrics Following Zhang et al. (2020), we perform automatic evaluation using commonly adopted text generation metrics, including BLEU (Papineni et al., 2002), METEOR (Lavie and Agarwal, 2007), and NIST (Doddington, 2002). Following Kann et al. (2018), to assess the coherence of generated sentences, we also report the perplexity over the test set using a pre-trained GPT-2 model (https://github.com/openai/gpt-2). We use Entropy (Zhang et al., 2018) and Dist-n (Li et al., 2016) to evaluate lexical diversity.
Table 3: Generated examples from the News dataset.

POINTER is able to take full advantage of BERT initialization and Wiki pre-training to improve relevance scores (NIST, BLEU and METEOR). Leveraging the ILBS or using a larger model further improves most of the automatic metrics we evaluated. For diversity scores, as CGMH is by nature a sampling-based method, it achieves the highest Dist-n scores (even surpassing the human score). We observe that the length of generated sentences, the diversity scores, and the GPT-2 perplexity from POINTER are close to the human oracle.

Yelp Generation We further evaluate our method on the Yelp dataset, where the goal is to generate long-form text from more constraints. Generating a longer piece of text with more lexical constraints is generally more challenging, since the model needs to capture the long-term dependency structure of the text and effectively come up with a plan to realize the generation. Results of the automatic evaluation are provided in Table 2 (lower). Generated examples are shown in Table 4 and Appendix C.
Generally, the generation from our model effectively considers all the lexical constraints, and is semantically more coherent and grammatically more fluent than that of the baseline methods. The automatic evaluation results are generally consistent with the observations from the News dataset, with the exception that the Dist-n scores are much lower than the human Dist-n scores. Compared with the greedy approach, at a cost of efficiency, ILBS is typically more concise and contains less repeated information, a defect from which the greedy approach occasionally suffers (e.g., Table 4, "delicious and delicious").
For both datasets, most generations converge within 4 stages. We perform additional experiments on zero-shot generation from the pre-trained model on both datasets, to test the versatility of pre-training. The generated sentences, albeit Wiki-like, are relatively fluent and coherent (see examples in Appendices B and C), and yield relatively high relevance scores (see Appendix E for details). Interestingly, even uninformative constraints can be expanded into coherent sentences: given the constraint words is, to and from, our model generates "it is oriented to its east, but from the west".
The autoregressive soft-constraint baseline (Gao et al., 2020) provides no guarantee that all keywords will be covered in the given order, so we omit it from Table 2. For this baseline, the percentage of keywords that appear in the outputs is 57% and 43% for the News and Yelp datasets, respectively. With a similar model size (117M), this baseline performs worse than our approach on the automatic metrics for the News dataset (BLEU4: 2.99 → 1.74; NIST4: 3.22 → 1.10; METEOR: 16% → 9%; DIST2: 61% → 58%; PPL: 66 → 84). The performance gap on the Yelp dataset is even larger due to the greater number of lexical constraints.
Human Evaluation Using a public crowd-sourcing platform (UHRS), we conducted a human evaluation of 400 randomly sampled outputs (out of the 1k test set) of CGMH, NMSTG and our base and large models with greedy decoding. Systems were paired, and each pair of system outputs was presented in random order to 5 crowd-sourced judges, who ranked the outputs pairwise for coherence, informativeness and fluency (e.g., Semantics: A and B, which is more semantically meaningful and consistent?).

We note that there is no theoretical guarantee of O(log N) time complexity for our method. However, our approach encourages filling as many slots as possible at each stage, which enables the model to achieve empirical O(log N) behavior. In our experiments, 98% of generations finish within 4 stages.
Note that our method in Table 6 uses greedy decoding. ILBS is around 20 times slower than greedy decoding, and the large model is around 3 times slower than the base model.

Conclusion
We have presented POINTER, a simple yet powerful approach to generating text from a given set of lexical constraints in a non-autoregressive manner. The proposed method leverages large-scale pre-training (BERT initialization and our insertion-based pre-training on Wikipedia) to generate text in a progressive manner using an insertion-based Transformer. Both automatic and human evaluations demonstrate the effectiveness of POINTER. In future work, we hope to leverage sentence structure, such as constituency parsing, to further enhance the design of the progressive hierarchy. Our model can also be extended to allow inflected/variant forms and arbitrary ordering of the given lexical constraints.

B Additional Generated Examples for News Dataset

but it would also be required for estate agents , who must pay a larger amount of cash but stay with the same policy for all other assets .
Wiki zeroshot however , his real estate agent agreed to pay him for the stay under the same policy .
Keywords managers cut costs million
ORACLE he was the third of four managers sent in to cut costs and deal with the city ' s $ 13 million deficit .
CGMH the managers , who tried to cut off their costs , added 20 million euros
NMSTG business managers cut demand for more expensive costs in 2017 - by October - is around 5 million 8 per cent , and has fallen by 0 . 3 per cent in January and 2017 .

POINTER (Greedy, base)
under one of its general managers , the firm had already cut its annual operating costs from $ 13 .5 million to six million euros .

POINTER (ILBS, base)
and last month , the managers announced that it had cut its operating costs by $ 30 million .

POINTER (Greedy, Large)
the biggest expense is for the managers , where it plans to cut their annual management costs from $ 18 .5 million to $ 12 million .
Wiki zeroshot but then he and all of his managers agreed to cut off all of the operating costs by about 1 million .
Keywords looked report realized wife
ORACLE i looked at the report and saw her name , and that's when I realized it was my ex-wife .
CGMH he looked at the report and said he realized that if his wife Jane
NMSTG i looked at my report about before I realized I return to travel holidays but - it doesn ' t haven ' t made anything like my wife .

POINTER (Greedy, base)
when i turned and looked at a file report from the airport and realized it was not my wife and daughter .

POINTER (ILBS, base)
when i turned around and looked down at the pictures from the report , i realized that it was my wife .

POINTER (Greedy, Large)
however , when they looked at the details of the report about this murder , they quickly realized that the suspect was not with his wife or his partner .
Wiki zeroshot but when he looked up at the report , he realized that it was not his wife .

this is not for the first time that the scottish government was able to claim tax cuts of thousands of pounds a year to pay .
Wiki zeroshot but at the time , the claim was that the same sales tax that was from the previous fiscal year .

Keywords model years big drama
ORACLE the former model said : " I haven ' t seen him in so many years , I can ' t make a big drama out of it ."
CGMH the " model " continues , like many years of sexual and big drama going
NMSTG after model two years and did it like , could we already get bigger than others in a big drama ?
POINTER (Greedy, base) but i am a good role model , who has been around for 10 years now , and that is a big example of what i can do in drama on screen .

POINTER (ILBS, base)
but the young actress and model , for 15 years , made a very big impact on the drama .

POINTER (Greedy, Large)
i have seen the different model she recommends of over years , but it ' s no big change in the drama after all .
Wiki zeroshot she was a model actress for many years and was a big star in the drama .
Keywords made year resolution managed
ORACLE i once made this my new year ' s resolution , and it is the only one that I ' ve actually ever managed to keep .
CGMH indeed , as he made up the previous year , the GOP resolution was managed
NMSTG while additional sanctions had been issued last week made a year from the latest resolution , Russia ' s Russian ministers have but have managed .

POINTER (Greedy, base)
no progress has been made in syria since the security council started a year ago , when a resolution expressed confidence that moscow managed to save aleppo .

POINTER (ILBS, base)
and the enormous progress we have made over the last year is to bring about a resolution that has not been managed .

POINTER (Greedy, Large)
the obama administration , which made a similar call earlier this year and has also voted against a resolution to crack down on the funding , managed to recover it .
Wiki zeroshot but despite all the same changes made both in both the previous fiscal year , and by the un resolution itself , only the federal government managed ...

there has been a lot of great work here in the past few years within more than a decade , done for the city , he says .
Wiki zeroshot there was a great success in the past during the last decade for the city .

C Additional Generated Examples for Yelp Dataset
We provide two examples from the Yelp dataset showing how the model progressively generates sentences in Table 8. All the generations are from the POINTER large model using greedy decoding.
We also provide some additional examples from the Yelp test set. The results include keywords, the human oracle, CGMH, NMSTG and our models. For our models, we include the POINTER base and large models with greedy decoding and the base model with ILBS. The large model with ILBS is time-consuming, so we omit it from the comparison.
Stage | Generated text sequence
0 (X0) | delicious love mole rice back
1 (X1) | restaurant delicious authentic love dish mole beans rice definitely back !
2 (X2) | new restaurant so delicious fresh authentic . love mexican dish called mole with beans and rice we definitely coming back more !
3 (X3) | this new restaurant is so delicious , fresh and authentic tasting . i love the mexican style dish , called the mole , with black beans , and white rice . we will definitely be coming back for more !
Stage | Generated text sequence
0 (X0) | joint great food great drinks greater staff
1 (X1) | new joint around great location food variety great craft drinks unless greater friendly staff !
2 (X2) | is new breakfast joint be around area great , location excellent food nice variety selections great of craft , drinks quite unless ask greater . friendly and staff love !
3 (X3) | this is the new modern breakfast joint to be found around the area . great atmosphere , central location and excellent food . nice variety of selections . great selection of local craft beers , good drinks . quite cheap unless you ask for greater price . very friendly patio and fun staff . love it !

D Additional Human Evaluation Information and Results
There were 145 judges in all: 5 judges evaluated each pair of outputs, to be reasonably robust against spamming. P-values are all p < 0.00001 (line 721), computed using 10000 bootstrap replications. Judges were lightly screened by our organization for multiple screening tasks. We present additional human evaluation results comparing the POINTER large model with the base model in Table 11. In general, for the News dataset the results are mixed; for the Yelp dataset, the large model wins by a large margin. All results are still far from the human oracle in all three aspects.

E Additional Automatic Evaluation Results
We provide the full evaluation results, including the Wikipedia zero-shot learning results, in Table 9 and Table 10. Note that zero-shot generations from the Wikipedia pre-trained model yield the lowest perplexity, presumably because the Wikipedia dataset is large enough that a model trained on it can learn language variability, thus delivering fluent generated results.

F Inference Details
During inference, we use a decaying schedule to discourage the model from generating uninteresting tokens, including [NOI] and some other special tokens, punctuation and stop words. To do this, we apply a decay multiplier η to the logits of these tokens before computing the softmax. We set η = min(0.5 + λ·s, 1), where s is the current stage and λ is an annealing hyper-parameter.
In most of the experiments, λ is set to 0.5.
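A small sketch of how this decay could be applied; the cap of η at 1 and the exact set of down-weighted token ids are assumptions, since the min(...) expression above is truncated in the text (the sketch also assumes the affected logits are positive, so scaling them down lowers their probability).

```python
import numpy as np

def apply_decay(logits, boring_ids, stage, lam=0.5):
    """Discourage [NOI], punctuation and stop words in early stages by scaling
    their logits before the softmax.  `boring_ids` lists the vocabulary ids of
    the tokens to down-weight; `stage` is the current generation stage s."""
    eta = min(0.5 + lam * stage, 1.0)          # decay multiplier, capped at 1
    scaled = logits.copy()
    scaled[..., boring_ids] *= eta             # shrink logits of boring tokens
    probs = np.exp(scaled - scaled.max(axis=-1, keepdims=True))
    return probs / probs.sum(axis=-1, keepdims=True)
```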

G Human Evaluation Template
See Figure 2 for the human evaluation template.

Figure 1 :
Figure 1: Illustration of the generation process (X^0 → X) of the proposed POINTER model. At each stage, the Insertion Transformer module generates either a regular token or a special [NOI] token for each gap between two existing tokens. The generation stops when all the gaps predict [NOI]. The data preparation process reverses the above generative process.

Table 11 :
Human Evaluation on two datasets for semantic consistency, fluency and informativeness, showing preferences (%) for our POINTER (large) model vis-à-vis the POINTER (base) model and real human responses. Numbers in bold indicate the most preferred systems. Significant differences (p ≤ 0.001) are indicated as ***.

Table 2 :
Automatic evaluation results on the News (upper) and Yelp (lower) datasets. ILBS denotes beam search. "+Wiki" denotes fine-tuning from the Wiki-pretrained model. "base/large" represents greedy generation from the base (110M)/large (340M) model. "Human" represents the held-out human reference.
Keywords estate pay stay policy
CGMH an economic estate developer that could pay for it is that a stay policy .
NMSTG as estate owners , they cannot pay for households for hundreds of middle-income property , buyers stay in retail policy .

Table 4 :
Generated examples from the Yelp dataset.
Keywords joint great food great drinks greater staff
CGMH very cool joint with great food , great drinks and even greater staff .! .
NMSTG awesome joint . great service . great food great drinks . good to greater and great staff !
POINTER (Greedy, base) my favorite local joint around old town . great atmosphere , amazing food , delicious and delicious coffee , great wine selection and delicious cold drinks , oh and maybe even a greater patio space and energetic front desk staff .

Table 5 :
Human Evaluation on two datasets for semantic consistency, fluency and informativeness, showing preferences (%) for our POINTER(base) model vis-à-vis baselines and real human responses.Numbers in bold indicate the most preferred systems.Differences in mean preferences are statistically significant at p ≤ 0.00001.
Table 5 summarizes the human evaluation results on the News and Yelp datasets. Despite the noise, the judgments show a strong across-the-board preference for POINTER (base) over the two baseline systems in all categories. A clear preference for the human ground truth over our method is also observed. The base and large models show comparable human judge preferences on the News dataset, while human judges clearly prefer the large model on the Yelp data (see Appendix D for more details).

Table 8 :
Example of the progressive generation process with multiple stages from the POINTER model. New additions at each stage are marked in blue.

CGMH good potential but bad service . not maintained . it replaced a dirty box . disgusting .
NMSTG do a good price . not like the and potential bad maintained has disgusting . replaced been , dirty and disgusting .
very good . it really has more potential maybe , but it smells really bad . its not very well maintained either . trash cans were replaced only when they were dirty . the floors were utterly disgusting .
not so good . overall it has a lot of potential for being better but it is too bad that it is not clean and un maintained and towels are in desperate need to be replaced regularly . the floors are very dirty and the higher floors have become filthy disgusting when i visited here .
is good it has no potential , and the bad taste can be maintained until they are replaced by a dirty , and disgusting one .
Keywords love animal style long line expected quick
ORACLE who doesn t love in and out . animal style is a must . long line but expected , it goes quick anyways so don t let that discourage you .
CGMH love this place . animal style food . long line than expected for quick .
NMSTG love animal chicken . it was style long a bit so good . the line is it was even on on a time and we expected to go but quick .
love having the double with animal style fries and protein style etc . have a super long wait line , but its just as expected and it always moves pretty quick too .
just gotta love about this place is the double animal style and protein style . it was a long line , but i expected it to be quick .
good price . i love that they have non chain locations . i like the animal style fries too . have to wait long as there is always traffic but the line can be much shorter than i had expected and they are always send out pretty quick . very impressed !
Wiki zeroshot he also has love with the animal and his style , and was long as the finish line , and was expected to be quick .
Keywords great great service happy found close home
ORACLE great sushi and great service . i m really happy to have found a good sushi place so close to home !
CGMH great price and great customer service . very happy that i found this place close to my home .
NMSTG great food and great service . a happy and found a year in close for them . keep them home here .
great quality food . great prices and friendly service staff . so happy and surprised to have finally found such a wonderful nail salon so close to my work and home .
great food . great food and wonderful service . very happy to have finally found a chinese restaurant close to my home .
have been here twice . great times here . food always has been great and the customer service was wonderful . i am very happy that we finally found our regular pad thai restaurant that is close to where we work now and our home . pleasantly surprised !
great teacher and a great love of the service he was very happy , and he found himself in the close to his home .
i just did not even hesitate to admit , i should give credit cards to my customers here . the beijing chicken and fried rice were spot on , a decent side on my favorite list .