TIGS: An Inference Algorithm for Text Infilling with Gradient Search

Text infilling aims at filling in the missing part of a sentence or paragraph, and has been applied to a variety of real-world natural language generation scenarios. Given a well-trained sequential generative model, it is challenging for its unidirectional decoder to generate missing symbols conditioned on the past and future information around the missing part. In this paper, we propose an iterative inference algorithm based on gradient search, which could be the first inference algorithm that can be broadly applied to any neural sequence generative model for text infilling tasks. Extensive experimental comparisons with five state-of-the-art methods show the effectiveness and efficiency of the proposed method on three different text infilling tasks with various mask ratios and different mask strategies.


Introduction
Text infilling aims at filling in the missing part of a sentence or paragraph by making use of the past and future information around the missing part, which can be used in many real-world natural language generation scenarios, for example, fill-in-the-blank image captioning (Sun et al., 2017), lexically constrained sentence generation (Liu et al., 2018b), missing value reconstruction (e.g. for damaged or historical documents) (Berglund et al., 2015), acrostic poetry generation (Liu et al., 2018a), and text representation learning (Devlin et al., 2018).
Text infilling is an under-explored and challenging task in the field of text generation. Recently, sequence generative models like sequence-to-sequence (seq2seq) models (Sutskever et al., 2014; Bahdanau et al., 2014; Gehring et al., 2017; Vaswani et al., 2017) are widely used in text generation tasks, including neural machine translation (Wu et al., 2016; Vaswani et al., 2017), image captioning (Anderson et al., 2017), abstractive summarization (See et al., 2017), and dialogue generation (Mei et al., 2017). Unfortunately, given a well-trained neural seq2seq model or unconditional neural language model (Mikolov et al., 2010), it is a daunting task to directly apply it to the text infilling task. As shown in Figure 1, we observe that the infilled words should be conditioned on past and future information around the missing part, which is contrary to the popular learning paradigm, namely, each output symbol is conditioned on all previous outputs during inference by using unidirectional Beam Search (BS) (Och and Ney, 2004).

Figure 1: Our key observation on text infilling for the Dialogue task. The inability of unidirectional BS to consider both the future and past contexts leads models to fill the blank with words that clash abruptly with the context around the blanks (see red circles).
Input: Hey, how about going for a few beers after dinner ?
Ground Truth: You know that is tempting but is really not good for our fitness .
Seq2seq + Left-to-Right Beam Search: You know that I like it very much , let's for our fitness .
Seq2seq (backward) + Right-to-Left Beam Search: You know that not going , it is really bad for our fitness .

* Correspondence to Jiancheng Lv.
1 Our code and data are available at https://github.com/dayihengliu/Text-Infilling-Gradient-Search
To solve the issues above, one family of methods for text infilling is "trained to fill in blanks" (Berglund et al., 2015; Fedus et al., 2018; Zhu et al., 2019), which requires large amounts of data in fill-in-the-blank format to train a new model that takes the output template as a conditional input. Such methods are only used for unconditional text infilling tasks, whereas many text infilling tasks are conditional, e.g., conversation reply with templates. Another promising kind of approach (Berglund et al., 2015; Wang et al., 2016; Sun et al., 2017) is an inference algorithm that can be directly applied to other generative models. These inference algorithms are applied to Bidirectional RNNs (BiRNNs) (Schuster and Paliwal, 1997; Baldi et al., 1999), which can model both forward and backward dependencies. The latest work is Bidirectional Beam Search (BiBS) (Sun et al., 2017), which proposes an approximate inference algorithm in BiRNN for image caption infilling. However, this method is based on some unrealistic assumptions, such as that given a token, its future sequence of words is independent of its past sequence of words. We experimentally find that these assumptions often lead to non-smooth or unrealistic complete sentences. Moreover, these inference algorithms can only be applied to decoders with bidirectional structures, whereas almost all sequence generative models use a unidirectional decoder. As a result, an inference algorithm that can be applied to the unidirectional decoder is highly desirable.
In this paper, we study the general inference algorithm for text infilling to answer the question:
• Given a well-trained neural sequence generative model, is there any inference algorithm that can effectively fill in the blanks in the output sequence?
To investigate such a possibility, we propose a dramatically different inference approach called Text Infilling with Gradient Search (TIGS), in which we search for infilled words based on gradient information to fill in the blanks. To the best of our knowledge, this could be the first inference algorithm that does not require any modification or training of the model and can be broadly used in any sequence generative model to solve fill-in-the-blank tasks, as verified in our experiments.
To be specific, we treat the blanks to be filled as parameterized vectors during inference. More concretely, we first randomly or heuristically project each blank to a valid token and initialize its parameterized vector with the word embedding of the valid token. The goal is seeking the words to be infilled by minimizing the negative log-likelihood (NLL) of the complete sequence.
Then the algorithm alternately performs optimization step (O-step) and projection step (P-step) until convergence. In O-step, we fix all other parameters of the model and only optimize the blank parameterized vectors by gradients. In P-step, heuristics like local search and projected gradient are used to project the blank parameterized vectors to valid tokens (i.e., discretization).
The contribution and novelty of this work can be summarized as below:
• We propose an iterative inference algorithm based on gradient search, which could be the first inference algorithm that can be broadly applied to any neural sequence generative model for text infilling tasks.
• Extensive experimental comparisons show the effectiveness and efficiency of the proposed method on three different text infilling tasks, compared with five state-of-the-art methods.

Related Works
There are some effective solutions to the text infilling task: a) training a model specifically for text infilling tasks (Berglund et al., 2015;Fedus et al., 2018;Zhu et al., 2019); b) using standard sequence generative model with modified inference algorithm (Berglund et al., 2015;Wang et al., 2016;Sun et al., 2017). As one typical work of the first category, NADE (Berglund et al., 2015) is proposed to train a specific BiRNNs for filling in blanks, which concatenates an auxiliary vector to input vectors for indicating a missing input during training and inference. Fedus et al. (2018) propose MaskGAN which 1) uses some specific "missing" tokens to indicate the blanks and takes the whole sequence with blanks (called template) as the input of encoder, and 2) uses an RNN as a decoder to generate the whole sentence after filling in the blanks. Similarly, Zhu et al. (2019) use self-attention model (Vaswani et al., 2017), which takes the template as the input for unconditional text infilling task. One major limitation of these works is that they require large amounts of data in fill-in-the-blank format and need to train a specific model. Besides, they are only used for unconditional text infilling tasks. Different from them, our new inference algorithm does not require any modification or training of the model, which can be broadly applied to any neural seq2seq models for both conditional and unconditional text infilling tasks.
As for the second category, some inference algorithms based on BiRNNs have been proposed for fill-in-the-blank tasks thanks to their ability to model both forward and backward dependencies. For example, Berglund et al. (2015) propose Generative Stochastic Networks (GSN) to reconstruct the blanks of sequential data. The idea is to first randomly initialize the symbols in the blanks and then resample an output $y_t$ from $P_{\mathrm{BiRNN}}(y_t \mid \{y_d\}_{d \neq t}, x)$ one at a time until convergence. More recently, Sun et al. (2017) propose the Bidirectional Beam Search (BiBS) inference algorithm of BiRNNs for the fill-in-the-blank image captioning task. However, this method is based on some strong assumptions, which may be violated in practice. In our experiments, we provide an empirical analysis of cases where this approach fails. Moreover, GSN and BiBS can only be applied to decoders with bidirectional structures, while almost all sequence generative models use a unidirectional decoder. In contrast, our proposed inference method does not rely on these assumptions and can be applied to the unidirectional decoder.

Preliminary
Since our method utilizes gradient information, it could smoothly cooperate with other architectures, such as the models proposed in (Vaswani et al., 2017; Gehring et al., 2017). Considering the popularity of RNNs and the fact that the related work is based on RNN models, we use RNN-based models as a showcase in this paper.

RNN-based Seq2Seq Model
We first introduce the notation and briefly describe the standard RNN-based seq2seq model. Let $x = \{x_1, x_2, ..., x_n\}$ denote the one-hot vector representations of the conditional input sequence, $y = \{y_1, y_2, ..., y_m\}$ denote the scalar indices of the corresponding target sequence, and $\mathcal{V}$ denote the vocabulary. $n$ and $m$ represent the lengths of the input sequence and the output sequence, respectively.
The seq2seq model is composed of an encoder and a decoder. For the encoder part, each $x_t$ is first mapped into its corresponding word embedding $x^{emb}_t$. Then $\{x^{emb}_t\}$ are input to a bidirectional or unidirectional long short-term memory (LSTM) (Hochreiter and Schmidhuber, 1997) RNN to get a sequence of hidden states $\{h^{enc}_t\}$. For the decoder, at time $t$, similarly $y_t$ is first mapped to $y^{emb}_t$. Then a context vector $c_t$ is calculated with an attention mechanism (Bahdanau et al., 2014; Luong et al., 2015), $c_t = \sum_{i=1}^{n} a_{ti} h^{enc}_i$, which contains useful latent information of the input sequence. Here, $a_t$ is an attention distribution vector that decides which part to focus on. The context vector $c_t$ and the embedding $y^{emb}_t$ are fed as input to a unidirectional RNN language model (LM), which outputs a probability distribution of the next word $P(y_{t+1} \mid y_{1:t}, x)$, where $y_{1:t}$ refers to $\{y_1, ..., y_t\}$.
During training, the negative log-likelihood (NLL) of the target sequence is minimized using standard maximum-likelihood (MLE) training with stochastic gradient descent (SGD), where the NLL is calculated as follows:

$$L_{NLL}(x, y) = -\sum_{t=1}^{m} \log P(y_t \mid y_{1:t-1}, x)$$

During inference, the decoder needs to find the most likely output sequence $y^*$ given the input $x$:

$$y^* = \arg\min_{y} L_{NLL}(x, y)$$

Since the number of possible sequences grows as $|\mathcal{V}|^m$ ($|\mathcal{V}|$ is the size of the vocabulary), exact inference is NP-hard and approximate inference algorithms like left-to-right greedy decoding or beam search (BS) (Och and Ney, 2004) are commonly used.
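To make the NLL and greedy decoding concrete, the following is a minimal Python sketch. It assumes a hypothetical bigram table `BIGRAM` with a `<s>` start symbol in place of a real seq2seq decoder, so each step conditions only on the previous token rather than on the full prefix and the input x; the names `sequence_nll` and `greedy_decode` are illustrative only.

```python
import math

# Hypothetical toy "model": P(next | previous) stored as a bigram table.
BIGRAM = {
    "<s>": {"i": 0.6, "you": 0.4},
    "i": {"like": 0.7, "listen": 0.3},
    "you": {"like": 0.5, "listen": 0.5},
    "like": {"music": 1.0},
    "listen": {"music": 1.0},
}

def sequence_nll(tokens):
    """Sum of -log P(y_t | y_{t-1}) over the sequence, starting from <s>."""
    nll, prev = 0.0, "<s>"
    for tok in tokens:
        nll -= math.log(BIGRAM[prev][tok])
        prev = tok
    return nll

def greedy_decode(length):
    """Left-to-right greedy decoding: take the most likely next token each step."""
    tokens, prev = [], "<s>"
    for _ in range(length):
        if prev not in BIGRAM:  # no continuation known for this token
            break
        prev = max(BIGRAM[prev], key=BIGRAM[prev].get)
        tokens.append(prev)
    return tokens

greedy = greedy_decode(3)
greedy_nll = sequence_nll(greedy)
```

Greedy decoding only ever looks at the prefix generated so far, which is why it cannot take a fixed right-hand context into account.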

Problem Definition
In this paper, instead of setting restrictions such as limiting the number of blanks or restricting the position of the blanks as in previous work (Sun et al., 2017), we consider a more general case of the text infilling task where the number and location of blanks are arbitrary.
Let __ be a placeholder for a blank, $\mathbb{B}$ be the set that records the position indices of all the blanks, and $y^{\mathbb{B}}$ be a target sequence where portions of the text body are missing as indicated by $\mathbb{B}$. For instance, if a target sequence has two blanks at positions $i$ and $j$, then $\mathbb{B} = \{i, j\}$ and $y^{\mathbb{B}} = \{y_1, ..., y_{i-1}, \_\_, y_{i+1}, ..., y_{j-1}, \_\_, y_{j+1}, ..., y_m\}$.
Given an input sequence $x$ and a target sequence $y^{\mathbb{B}}$ containing blanks indicated by $\mathbb{B}$, we aim at filling in the blanks of $y^{\mathbb{B}}$. This procedure needs to consider the global structure of sentences.
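The notation above can be sketched in Python as follows. The helper names `blank_positions` and `fill_blanks` are hypothetical, and positions are 0-indexed here, whereas the text counts from 1.

```python
BLANK = "__"  # placeholder token for a blank

def blank_positions(template):
    """Return the set of blank positions (0-indexed) as a sorted list."""
    return [i for i, tok in enumerate(template) if tok == BLANK]

def fill_blanks(template, words):
    """Fill the blanks of the template, left to right, with the given words."""
    positions = blank_positions(template)
    if len(words) != len(positions):
        raise ValueError("need exactly one word per blank")
    filled = list(template)
    for pos, word in zip(positions, words):
        filled[pos] = word
    return filled

template = "__ , __ listen __ __ music .".split()
positions = blank_positions(template)
complete = fill_blanks(template, ["no", "I", "to", "background"])
```

The template and the infilled words are taken from one of the Dialog samples shown later in the paper.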

Methodology
In this section, we present our inference method in detail. The overall framework is shown in Figure 2. Given a well-trained seq2seq model and a pair of text infilling data $(x, y^{\mathbb{B}})$, the method aims at finding an infilled word set $\hat{y} = \{\hat{y}_1, ..., \hat{y}_{|\mathbb{B}|}\}$ to minimize the NLL of the complete sentence $y^*$ via:

$$\hat{y} = \arg\min_{\hat{y}_j \in \mathcal{V}} L_{NLL}(x, y^*)$$

where $y^*$ denotes the complete sentence after filling the blanks of $y^{\mathbb{B}}$ with $\hat{y}$, and $|\mathbb{B}|$ denotes the number of blanks. Since the number of possible infilled word sets is $|\mathcal{V}|^{|\mathbb{B}|}$, naïvely searching in this space is NP-hard.
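A small sketch shows why exhaustive search is only feasible for toy problems. It enumerates every assignment of words to blanks under a hypothetical scorer `toy_nll` (a hand-made bigram preference, not a trained model) over a five-word vocabulary.

```python
import itertools
import math

VOCAB = ["i", "you", "like", "music", "tea"]
GOOD_BIGRAMS = {("i", "like"), ("you", "like"), ("like", "music")}

def toy_nll(tokens):
    """Hypothetical scorer: likely bigrams get probability 0.5, others 0.05."""
    return sum(
        -math.log(0.5 if pair in GOOD_BIGRAMS else 0.05)
        for pair in zip(tokens, tokens[1:])
    )

def brute_force_infill(template, blank="__"):
    """Try every assignment of vocabulary words to the blanks."""
    positions = [i for i, tok in enumerate(template) if tok == blank]
    best, best_nll, tried = None, float("inf"), 0
    for words in itertools.product(VOCAB, repeat=len(positions)):
        candidate = list(template)
        for pos, word in zip(positions, words):
            candidate[pos] = word
        tried += 1
        nll = toy_nll(candidate)
        if nll < best_nll:
            best, best_nll = candidate, nll
    return best, tried

best, tried = brute_force_infill(["i", "__", "__"])
```

With five words and two blanks the loop scores 5^2 = 25 candidates; with a realistic vocabulary and several blanks the count grows far beyond what can be enumerated.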
Our key idea is to utilize the gradient information to narrow the range of search during inference. This idea is similar to the "white-box" adversarial attacks (Goodfellow et al., 2014b;Szegedy et al., 2013;He and Glass, 2018). However, the adversarial attack aims to slightly modify the inputs in order to mislead the model to make wrong predictions, while our goal is to search for the reasonable words that should be filled into the blanks.
Unlike the continuous input space (e.g., images) in other tasks, applying gradient search directly to the input would make it invalid (i.e., no longer a one-hot vector) for text infilling tasks. To address this, we first treat the blanks to be filled as parameterized word embedding vectors $\hat{y}^{emb} = \{\hat{y}^{emb}_1, ..., \hat{y}^{emb}_{|\mathbb{B}|}\}$. Then, we fix the parameters of the well-trained model and only optimize these parameterized vectors in the continuous space, where the gradient information can be used to minimize the NLL loss $L_{NLL}(x, y^*)$. Finally, $\hat{y}^{emb}$ is discretized into valid words $\hat{y}$ by measuring the distance between $\hat{y}^{emb}$ and the word embeddings in $W^{emb}$. Here $W^{emb}$ denotes the word embedding matrix in the decoder of the well-trained seq2seq model, and each column of $W^{emb}$ represents the word embedding of one word in the vocabulary.
As the words in the set $\hat{y}$ depend on each other, discretizing all parameterized word embeddings in $\hat{y}^{emb}$ into valid words at the same time usually makes the complete sentence $y^*$ non-smooth. As a concrete example, when infilling the two blanks in "Amy likes eating __ __ , so she goes to snack bars very often.", $\hat{y}^{emb}$ may be close to the word embeddings of {"ice", "cream"} and {"fried", "chips"}. However, if one discretizes the two blanks simultaneously, one might get answers like {"ice", "chips"} or {"fried", "cream"}. Therefore, we adopt an iterative algorithm which is similar to Gibbs sampling. At each inference step, we focus on one single infilled word $\hat{y}_j$ for the $j$-th blank and update it while keeping the other words in the infilled word set $\hat{y}$ fixed. For tasks with unknown blank lengths (each blank may contain an arbitrary unknown number of tokens), we can apply TIGS as a black-box inference algorithm over a range of blank lengths and then rank these solutions.
At the beginning, we initialize the infilled word set $\hat{y}$ with some valid words randomly or heuristically (from a left-to-right beam search). Then we perform the optimization step (O-step) and the projection step (P-step) alternately to update each infilled word in the infilled word set $\hat{y}$ until convergence or until the maximum number of rounds $T$ is reached.
In the O-step, we aim to optimize $\hat{y}^{emb}_j$ in continuous space using gradient information with respect to $L_{NLL}(x, y^*)$. First, we get the complete sentence $y^*$ by filling $\hat{y}$ in the blanks of $y^{\mathbb{B}}$ and obtain the $L_{NLL}(x, y^*)$ of $y^*$ after putting $x$ into the encoder and $y^*$ into the decoder of the well-trained seq2seq model. Then we treat the vector $\hat{y}^{emb}_j$ as a parameterized vector, fix all other parameters of the seq2seq model, and only optimize $\hat{y}^{emb}_j$ with gradient information to minimize $L_{NLL}(x, y^*)$.
However, directly optimizing $L_{NLL}(x, y^*)$ may yield a final $\hat{y}^{emb}_j$ that does not resemble a feasible word embedding in $W^{emb}$, so that its nearest neighbor word embedding in $W^{emb}$ could be far away from it. So we add an L2 penalty on $\hat{y}^{emb}_j$, weighted by a hyperparameter $\lambda$, to $L_{NLL}(x, y^*)$ to obtain the loss $L(x, y^*)$, which keeps $\hat{y}^{emb}_j$ close to $W^{emb}$. We also tried to add an additional regularization term that directly narrows the distance between $\hat{y}^{emb}_j$ and its nearest word embedding in $W^{emb}$, which is used in Cheng et al. (2018) for seq2seq adversarial attacks, but no obvious improvement was found.
Given the loss $L(x, y^*)$, $\hat{y}^{emb}_j$ is updated with $\nabla_{\hat{y}^{emb}_j} L(x, y^*)$ by one-step gradient descent:

$$\hat{y}^{emb}_j \leftarrow \hat{y}^{emb}_j - \eta \, \nabla_{\hat{y}^{emb}_j} L(x, y^*)$$

where $\eta$ is the step size. Instead of updating $\hat{y}^{emb}_j$ with the naïve SGD algorithm, we experimentally find that the Nesterov (Sutskever et al., 2013) optimizer performs better than other optimizers for updating $\hat{y}^{emb}_j$. As discussed in Dong et al. (2017b), this momentum-based optimizer can stabilize update directions and escape from poor local maxima during the iterations for adversarial attack.
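The O-step can be illustrated with a minimal, self-contained sketch. Everything model-specific here is an assumption for illustration: the decoder is replaced by a hypothetical output matrix `U` that predicts one observed next token from the blank embedding, the penalty is taken to be a squared L2 norm weighted by `LAMBDA`, and the Nesterov update follows the common look-ahead formulation with hypothetical step size and momentum values.

```python
import math

# Hypothetical output weights, one row per vocabulary word (3 words, 2-d embeddings).
U = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
TARGET = 1     # index of the observed token that follows the blank
LAMBDA = 0.01  # weight of the (assumed squared) L2 penalty

def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]

def loss_and_grad(e):
    """Toy loss: NLL of the next token plus an L2 penalty, with its gradient w.r.t. e."""
    logits = [sum(u_k * e_k for u_k, e_k in zip(row, e)) for row in U]
    probs = softmax(logits)
    loss = -math.log(probs[TARGET]) + LAMBDA * sum(v * v for v in e)
    grad = [2 * LAMBDA * v for v in e]
    for w, row in enumerate(U):
        coeff = probs[w] - (1.0 if w == TARGET else 0.0)
        for k in range(len(e)):
            grad[k] += coeff * row[k]
    return loss, grad

def nesterov_o_step(e, velocity, lr=0.1, momentum=0.9):
    """One Nesterov update: evaluate the gradient at the look-ahead point."""
    lookahead = [e_k + momentum * v_k for e_k, v_k in zip(e, velocity)]
    _, grad = loss_and_grad(lookahead)
    velocity = [momentum * v_k - lr * g_k for v_k, g_k in zip(velocity, grad)]
    e = [e_k + v_k for e_k, v_k in zip(e, velocity)]
    return e, velocity

e, v = [1.0, 0.0], [0.0, 0.0]  # start from the embedding of an initial guess
initial_loss, _ = loss_and_grad(e)
for _ in range(50):
    e, v = nesterov_o_step(e, v)
final_loss, _ = loss_and_grad(e)
```

Only the blank embedding `e` changes; the toy "model" weights in `U` stay fixed, mirroring the idea of freezing all model parameters during inference.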
In the P-step, we aim to project $\hat{y}^{emb}_j$ into a valid infilled word $\hat{y}_j$. A naïve way is to find the word whose word embedding in $W^{emb}$ is nearest to $\hat{y}^{emb}_j$ based on the distance metric function dist(·). However, because the embedding space is high-dimensional, the word obtained this way may be far from satisfactory. Instead, similar to the idea of beam search, we first obtain a set $S$ containing the $K$ candidate words whose word embeddings in $W^{emb}$ are the $K$ nearest to $\hat{y}^{emb}_j$ under dist(·), and then we select the word with the lowest NLL from these $K$ words in $S$ as $\hat{y}_j$. Our experiments suggest that just setting $K$ to 1% of the vocabulary size works well. The whole algorithm is further summarized in Algorithm 1. Since our method is designed for the unidirectional decoder, the time complexity is expected to be slightly higher than that of the inference algorithms designed for the bidirectional decoder. In brief, our approach requires $mKT|\mathbb{B}|$ RNN steps, while GSN (Berglund et al., 2015) requires $mT|\mathbb{B}|$ BiRNN steps, and BiBS (Sun et al., 2017) requires $2mKT$ RNN steps. Fortunately, our inference algorithm can be easily optimized with GPUs.
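The P-step can be sketched as a K-nearest-neighbor lookup followed by reranking. The embeddings in `W_EMB`, the Euclidean choice for `dist`, and the per-word NLL values in `toy_nll` are all hypothetical; in the real setting the NLL would come from scoring the complete sentence with the trained model.

```python
import math

# Hypothetical 2-d word embeddings standing in for the columns of W^emb.
W_EMB = {
    "ice": [1.0, 0.0],
    "cream": [0.9, 0.2],
    "fried": [0.0, 1.0],
    "chips": [0.1, 0.9],
}

def dist(a, b):
    """Euclidean distance (an assumed choice of metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def k_nearest_words(y_emb, k):
    """Candidate set S: the k words whose embeddings are closest to y_emb."""
    return sorted(W_EMB, key=lambda w: dist(W_EMB[w], y_emb))[:k]

def p_step(y_emb, k, nll_of_word):
    """Project y_emb to the candidate in S with the lowest sentence NLL."""
    return min(k_nearest_words(y_emb, k), key=nll_of_word)

# Hypothetical NLL of the full sentence when each word fills the blank.
toy_nll = {"ice": 0.5, "cream": 2.0, "fried": 0.1, "chips": 0.3}
candidates = k_nearest_words([0.8, 0.3], k=2)
chosen = p_step([0.8, 0.3], k=2, nll_of_word=toy_nll.get)
```

In this toy setup the nearest embedding is "cream", but reranking by NLL selects "ice"; "fried" has the lowest toy NLL overall yet is never scored because it is not among the K nearest candidates, which is how K bounds the cost of each P-step.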

Algorithm 1 TIGS algorithm
Input: a trained seq2seq model, a pair of text infilling data $(x, y^{\mathbb{B}})$, output length $m$.
Output: a complete output sentence $y^*$.
Initialize the infilled word set $\hat{y}$ and initialize $y^*$ by infilling $y^{\mathbb{B}}$ with $\hat{y}$.
Initialize $\hat{y}^{emb}$ by looking up the word embedding matrix $W^{emb}$.
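The alternating O-step and P-step loop described in this section can be sketched end to end on a toy model. This is an illustration under stated assumptions, not the authors' implementation: the "decoder" is a hypothetical bigram model with hand-picked input embeddings `E` and output weights `U`, the O-step is a single plain gradient step (no L2 penalty or Nesterov momentum), and a candidate is accepted only if it lowers the sentence NLL.

```python
import math

VOCAB = ["i", "like", "listen", "to", "music", "."]
# Hypothetical 2-d input embeddings (E) and output weights (U) of a toy bigram decoder.
E = {"i": [1.0, 0.0], "like": [0.5, 0.5], "listen": [0.0, 1.0],
     "to": [-1.0, 0.5], "music": [-0.5, -1.0], ".": [0.5, -1.0]}
U = {"i": [-1.0, -1.0], "like": [2.0, 0.0], "listen": [2.0, 0.5],
     "to": [0.0, 2.0], "music": [-2.0, 1.0], ".": [-1.0, -2.0]}

def next_probs(prev_emb):
    """P(next word | embedding of the previous word) via a softmax over U."""
    logits = [sum(u * e for u, e in zip(U[w], prev_emb)) for w in VOCAB]
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [v / total for v in exps]

def sentence_nll(tokens):
    nll = 0.0
    for prev, nxt in zip(tokens, tokens[1:]):
        nll -= math.log(next_probs(E[prev])[VOCAB.index(nxt)])
    return nll

def grad_wrt_blank(tokens, j, emb):
    """Gradient of the NLL w.r.t. the input embedding at position j.
    In this toy model only the next-token term depends on that embedding."""
    grad = [0.0] * len(emb)
    if j + 1 < len(tokens):
        probs = next_probs(emb)
        target = VOCAB.index(tokens[j + 1])
        for i, w in enumerate(VOCAB):
            coeff = probs[i] - (1.0 if i == target else 0.0)
            grad = [g + coeff * u for g, u in zip(grad, U[w])]
    return grad

def tigs(template, blanks, init_words, rounds=5, k=3, lr=0.5):
    tokens = list(template)
    for pos, word in zip(blanks, init_words):
        tokens[pos] = word
    for _ in range(rounds):
        changed = False
        for j in blanks:
            # O-step: one gradient step on the blank's embedding only.
            emb = E[tokens[j]]
            grad = grad_wrt_blank(tokens, j, emb)
            emb = [e - lr * g for e, g in zip(emb, grad)]
            # P-step: K nearest words, then the lowest-NLL candidate.
            nearest = sorted(VOCAB, key=lambda w: math.dist(E[w], emb))[:k]
            best = min(nearest,
                       key=lambda w: sentence_nll(tokens[:j] + [w] + tokens[j + 1:]))
            if sentence_nll(tokens[:j] + [best] + tokens[j + 1:]) < sentence_nll(tokens):
                tokens[j] = best
                changed = True
        if not changed:  # converged: no blank changed in this round
            break
    return tokens

template = ["i", "__", "to", "__", "."]
initial = ["i", "music", "to", "i", "."]
result = tigs(template, blanks=[1, 3], init_words=["music", "i"])
```

Because a word is replaced only when the NLL strictly decreases, the NLL of the returned sentence is never higher than that of the initialization, and the known template tokens are never touched.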

Datasets
In the experiments, we evaluate the proposed method on three text infilling tasks with three widely used publicly available corpora.
The first task is conversation reply with a template (denoted as Dialog), which is conducted on the DailyDialog (Li et al., 2017) dataset. We use its single-turn data, which contains 82,792 conversation pairs. The query sentence is taken as encoder input $x$, and the reply sentence is taken as $y$.

Figure 3: Some testing samples of the conversation reply with templates task with different mask strategies and ratios.
Input: What is the weather like today ?
Mask strategy: Middle, mask ratio: 25%: It stops snowing , __ __ a bit wind .
The second task is Chinese acrostic poetry generation (denoted as Poetry). Here we use a publicly available Chinese poetry dataset which contains 232,670 Chinese four-line poems. For each poem, the first two lines are used as encoder input $x$, and the last two lines are $y$.
The third task is infilling product reviews (denoted as APRC). The Amazon Product Reviews Corpus (APRC) (Dong et al., 2017a), which is built upon Amazon product data (McAuley et al., 2015) and contains 347,061 reviews, is used in this task. Unlike the first two tasks, this task is an unconditional text infilling task (without conditional input x). We use each product review in Dong et al. (2017a) as y.
For each task, we take 5,000 samples from the test set to construct the data with blanks ($y^{\mathbb{B}}$) for testing. We create a variety of test samples by masking out text $y$ with varying missing ratios and two mask strategies. More specifically, the first mask strategy is called middle, which follows the setting in Sun et al. (2017), namely, removing r = 25%, 50%, or 75% of the words from the middle of $y$ for each sample. The second mask strategy is called random, namely, randomly removing r = 25%, 50%, or 75% of the words in $y$ for each sample. To sum up, we have three test tasks, and each task has six types of test sets (two mask strategies and three mask ratios). Each test set contains 5,000 test samples. We show some data examples in Figure 3.
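The two mask strategies can be sketched as follows. The rounding of the number of masked words, the exact centring of the middle span, and the seeded sampler are assumptions made for illustration; the text does not specify these details.

```python
import random

BLANK = "__"

def num_masked(n_tokens, ratio):
    """Number of words to remove (rounding rule is an assumption)."""
    return max(1, round(n_tokens * ratio))

def mask_middle(tokens, ratio):
    """Replace a contiguous span in the middle of the sentence with blanks."""
    k = num_masked(len(tokens), ratio)
    start = (len(tokens) - k + 1) // 2
    return [BLANK if start <= i < start + k else tok
            for i, tok in enumerate(tokens)]

def mask_random(tokens, ratio, seed=0):
    """Replace randomly chosen positions with blanks."""
    k = num_masked(len(tokens), ratio)
    rng = random.Random(seed)
    masked = set(rng.sample(range(len(tokens)), k))
    return [BLANK if i in masked else tok for i, tok in enumerate(tokens)]

tokens = list("abcdefghi")  # stand-in for a 9-token sentence
middle = mask_middle(tokens, 0.25)
scattered = mask_random(tokens, 0.5, seed=1)
```

For a 9-token sentence and r = 25%, this places two adjacent blanks in the centre, the same shape as the Figure 3 sample.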

Baselines
We compare our approach TIGS with several strong baseline approaches:
Seq2Seq-f: it runs beam search (BS) with beam width K on a well-trained seq2seq model (forward) to fill the blanks from left to right.
Seq2Seq-b: it runs BS with beam width K on a well-trained seq2seq model (backward) to fill the blanks from right to left.
Seq2Seq-f+b: it fills the blanks by both Seq2Seq-f and Seq2Seq-b, and then selects the output with the maximum probability assigned by the seq2seq models. This method is used in Wang et al. (2016).
BiRNN-GSN: it runs GSN (Berglund et al., 2015) on a well-trained seq2seq model with BiRNN as the decoder to fill the blanks.
BiRNN-BiBS: it runs bidirectional beam search (BiBS) (Sun et al., 2017) on a well-trained seq2seq model with BiRNN as the decoder to fill the blanks. The method has achieved state-of-the-art results on the fill-in-the-blank image captioning task in Sun et al. (2017).
Except for BiRNN-GSN and BiRNN-BiBS, all the above baselines and our method perform inference on the same well-trained seq2seq model. BiRNN-GSN and BiRNN-BiBS perform inference on a well-trained seq2seq model in which the decoder is a BiRNN. These models are trained on the complete sentence dataset with standard maximum-likelihood. Moreover, the sentences with blanks are only used in the inference stage. For fair comparison, BiRNN-BiBS, BiRNN-GSN, and the proposed method use the same initialization strategy (left-to-right greedy). The maximum number of iterations T is set to 50 to ensure that all the algorithms can achieve their best performance.
In addition to the above inference-based approaches, we also compare with two model-based approaches: Mask-Seq2Seq and Mask-Self-attn (Fedus et al., 2018; Zhu et al., 2019). These baselines take the output template as an additional input and are trained on data in fill-in-the-blank format. We use LSTM RNNs for Mask-Seq2Seq, and use the self-attention model (Vaswani et al., 2017) for Mask-Self-attn (Zhu et al., 2019), which is shown to have better performance than GAN-based models (Goodfellow et al., 2014a).

Metrics
Following Sun et al. (2017), we compare methods on the standard sentence-level metric BLEU (4-gram) (Papineni et al., 2002), which considers the correspondence between the ground truth and the complete sentences. However, such a metric also has some deficiencies in text infilling tasks. For example, given two complete sentences that differ in only one word, their sentence-level statistics may be quite similar, whereas a human can clearly tell which one is more natural. Moreover, given a template, there may be several reasonable ways to fill in the blanks. For example, given the template "i __ this book, highly recommend it", it is reasonable to fill the word "love" or "like" in the blank. However, since there is only one ground truth, the BLEU scores of these two complete sentences are quite different. We find that this issue is more severe for the unconditional text infilling task, which has fewer restrictions, leading to more ways of filling in the blanks.
Therefore, for the unconditional text infilling task (APRC), instead of calculating the BLEU score with only the ground truth as the reference, we also follow Yu et al. (2017) and use 10,000 sentences which are randomly sampled from the test set as references to calculate BLEU scores to evaluate the fluency of the complete sentences.
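The sensitivity of n-gram overlap to a single reasonable word choice can be seen in a small sketch. It computes only the clipped (modified) n-gram precision that BLEU is built on, not the full BLEU score (no brevity penalty or geometric mean), and the "love"/"like" sentences are the illustrative example from the discussion above, tokenized by whitespace.

```python
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, reference, n):
    """Clipped n-gram precision of the candidate against a single reference."""
    cand = Counter(ngrams(candidate, n))
    ref = Counter(ngrams(reference, n))
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / max(1, sum(cand.values()))

reference = "i love this book , highly recommend it".split()
with_love = "i love this book , highly recommend it".split()
with_like = "i like this book , highly recommend it".split()

p1_like = modified_precision(with_like, reference, 1)
p4_like = modified_precision(with_like, reference, 4)
p4_love = modified_precision(with_love, reference, 4)
```

A single substituted word removes one unigram match but two of the five 4-gram matches, so higher-order precision penalizes an equally reasonable infill much more heavily when only one reference is available.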
Besides BLEU scores, we conduct a model-based evaluation. We train a conditional LM for each task (an unconditional LM for the APRC task) and use its NLL to evaluate the quality of the complete sentence $y^*$ given the input $x$.

Results
The BLEU (the higher the better) and NLL (the lower the better) results are shown in Table 1. Generally, we find that bidirectional methods (BiRNN-BiBS, BiRNN-GSN, and Seq2Seq-f+b) outperform unidirectional ones (Seq2Seq-f and Seq2Seq-b) in most cases. The model-based methods (Mask-Seq2Seq and Mask-Self-attn) perform well on unconditional text infilling task (APRC), but slightly poorly on conditional text infilling tasks (Dialog and Poetry). In line with the evaluation results in Zhu et al. (2019), the Mask-Self-attn performs consistently better than Mask-Seq2Seq. It has also achieved the highest BLEU score in some cases of unconditional text infilling tasks. However, in most cases of conditional text infilling tasks, the proposed method performs better than Mask-Self-attn.
Since the goal of the proposed method TIGS is to find the complete sentence with minimal NLL by utilizing gradient information, it achieves, as expected, the lowest NLL in all cases of all tasks. Also, the BLEU scores of TIGS are the highest in most cases of conditional text infilling tasks, while BiRNN-GSN and BiRNN-BiBS provide comparable performance. Although TIGS is used with an RNN-based seq2seq model, it still achieves very competitive BLEU results on the unconditional text infilling task compared with Mask-Self-attn.

Human Evaluation
We also conduct a human evaluation to further compare TIGS, BiRNN-BiBS, BiRNN-GSN, and Mask-Self-attn. Following the setting in Zhu et al. (2019), we collect generations of each of the four methods on 50 randomly-selected test instances. Then we launch a crowd-sourcing online study, asking 10 evaluators to rank the generations. The method with the best generation receives a score of 4, and the other three methods receive scores of 3, 2, and 1 according to the rank, respectively. The results are shown in Table 2. We can see that TIGS consistently outperforms all baselines.

Samples and Analysis
Template: really __ this __ __ __ believable plot .
Ground Truth: really enjoyed this futuristic book . believable plot .
Seq2seq-f: really enjoyed this book . the believable plot .
Seq2seq-b: really with this development and a believable plot .
Mask-Seq2Seq: really enjoyed this fast paced and believable plot
Mask-Self-attn: really enjoyed this book . very believable plot .
BiRNN-BiBS: really enjoyed this story and a believable plot .
BiRNN-GSN: really enjoyed this book and a believable plot .
TIGS: really enjoyed this book . very believable plot .

Input (Query): can you study with the radio on ?
Template: __ , __ listen __ __ music .
Ground Truth: no , I listen to background music .
Seq2seq-f: i'd , I'm listen to the music .
Seq2seq-b: music , can listen to the music .
Mask-Seq2Seq: yes , they listen to the music .
Mask-Self-attn: yes , it's a lot of music .
BiRNN-BiBS: I , to listen to the music .
BiRNN-GSN: yes , I'll listen to the music .
TIGS: yes , I listen to classical music .

Ground Truth: so much better than the last one . really good . and now i need <num> more words before i can submit .
Seq2seq-f: so far better than the first one . ca n't . wait now i have <num> more books to read . submit .
Seq2seq-b: so getting better for the next one . it down . . now i write <num> more words so i can submit .
Mask-Seq2Seq: so much better than the first one . loved it . and now i have <num> more words to submit this submit .
Mask-Self-attn: so much better than the first one . loved it . now now i need <num> more words to describe and submit .
BiRNN-BiBS: so much better than the first one . loved it . and now i have <num> more to read . i submit .
BiRNN-GSN: so much better than the first one . i cried . so now i have <num> more words to go to submit .
TIGS: so much better than the first one . highly recommend . but now i need <num> more words to go to submit .

Template: __ __ then __ and keep __ touch .
Ground Truth: good bye then , and keep in touch .
Seq2seq-f: nice to then . and keep your touch .
Seq2seq-b: minutes , then go and keep in touch .
Mask-Seq2Seq: ok , then go then keep in touch .
Mask-Self-attn: then , then keep and keep in touch .
BiRNN-GSN: ok , then go and keep in touch .
TIGS: alright , then . and keep in touch .
BiBS assumes that, given a token, its future sequence of words is independent of its past sequence of words. This assumption may cause some sentences generated by BiRNN-BiBS to be non-smooth or unrealistic. For example, in the Dialog instance with the query "can you study with the radio on ?", BiRNN-BiBS generates the non-smooth sentence "i , to listen to the music". At the third time-step, both $P_{\overrightarrow{\mathrm{URNN}}}(y_3 = \text{"to"} \mid y_{1:2} = \text{"i ,"}, x)$ and $P_{\overleftarrow{\mathrm{URNN}}}(y_3 = \text{"to"} \mid y_{4:m} = \text{"listen to the music"}, x)$ are relatively large, so BiRNN-BiBS fills this blank with the inappropriate word "to". However, $P(y_3 = \text{"to"} \mid y_{1:2} = \text{"i ,"}, y_{4:m} = \text{"listen to the music"}, x)$ should be lower. In addition, we find that BiRNN-BiBS tends to use the unknown token "<unk>" to fill the blanks more often than other methods. A possible reason is that sometimes both $P_{\overrightarrow{\mathrm{URNN}}}(y_t = \text{"<unk>"} \mid y_{1:t-1}, x)$ and $P_{\overleftarrow{\mathrm{URNN}}}(y_t = \text{"<unk>"} \mid y_{t+1:m}, x)$ are relatively large.
As for Mask-Seq2Seq and Mask-Self-attn, although they directly take the template $y^{\mathbb{B}}$ as an additional input and are trained with data in fill-in-the-blank format, we experimentally found that the generalization ability of these models is still limited, especially for conditional text infilling tasks. In the Dialog task, 21% and 16% of the samples generated by Mask-Self-attn and Mask-Seq2Seq with beam search could not even reconstruct the template (see Figure 5). BiRNN-GSN fills the blank from the probability $P_{\mathrm{BiRNN}}(y_t \mid y_{1:t-1}, y_{t+1:m}, x)$, and the proposed method fills the blank directly with the gradient $\nabla_{\hat{y}^{emb}_t} L(x, y^*)$, so both of them can reason about the past and future simultaneously without any unrealistic assumptions. We can see that the complete sentences generated by them are better than those of all other algorithms. However, BiRNN-GSN uses a bidirectional structure as the decoder, which makes it challenging to apply to most sequence generative models, whereas the proposed method is gradient-based and can be broadly used in any sequence generative model.

Conclusions
In this paper, we propose a general inference algorithm for text infilling. To the best of our knowledge, the method is the first inference algorithm that does not require any modification or training of the model and can be broadly used in any sequence generative model to solve fill-in-the-blank tasks. We compare the proposed method and several strong baselines on three text infilling tasks with various mask ratios and different mask strategies. The results show that the proposed method is an effective and efficient approach for fill-in-the-blank tasks, consistently outperforming all baselines.