Syntax-driven Iterative Expansion Language Models for Controllable Text Generation

The dominant language modeling paradigm handles text as a sequence of discrete tokens. While that approach can capture the latent structure of the text, it inherently constrains generation to sequential dynamics. We propose a new paradigm for introducing a syntactic inductive bias into neural text generation, where the dependency parse tree is used to drive a Transformer model to generate sentences iteratively. Our experiments show that this paradigm is effective for text generation, with quality between that of LSTMs and Transformers and comparable diversity, while requiring less than half of their decoding steps, and that its generation process allows direct control over the syntactic constructions of the generated text, enabling the induction of stylistic variations.


Introduction
The currently dominant text generation paradigm is based on generating a sequence of discrete tokens in a left-to-right autoregressive way. Most neural language models (LMs) fall into this autoregressive generation category. Some neural architectures are sequential in nature, such as those based on recurrent neural networks (RNNs), lending themselves naturally to the autoregressive approach when used together with teacher forcing (Williams and Zipser, 1989). Other architectures, such as Transformer (Vaswani et al., 2017), while not intrinsically sequential, have also been targeted for sequential generation. On the other hand, some recent lines of research have focused on nonsequential generation. In this work, we propose a new paradigm for text generation and language modeling called Iterative Expansion Language Model, which generates the final sequence following a token ordering defined by the sentence dependency parse by iteratively expanding each level of the tree.

Dependency LMs
The use of dependency parse trees to drive a language model was first proposed by Chelba et al. (1997), with a similar structure to an n-gram LM, but where the context of a word is its preceding bigram plus a list of preceding words whose parent does not precede it. Shen et al. (2008) make use of the dependency tree in a probabilistic LM, computing the probability of each word conditioned on its parent and the sibling words between both. Mirowski and Vlachos (2015) propose a dependency LM based on RNNs, where the dependency tree is decomposed into a collection of unrolls, that is, paths from the root to one of the leaves, and where the probability of a word can be predicted from these unrolls. Buys and Blunsom (2018) propose a shift-reduce transition-based LSTM (Hochreiter and Schmidhuber, 1997) dependency LM that can be used for language modeling and generation by means of dynamic programming.

Syntax-driven Generation
Recurrent neural network grammars (Dyer et al., 2016) are recursive models that operate with a stack of symbols that can be populated with terminals or nonterminals, or "reduced" to generate a syntactic constituent, obtaining as a result a sentence and its associated constituency parse tree. Shen et al. (2018) use skip-connections to integrate constituent relations with RNNs, learning the underlying dependency structures by leveraging a syntactic distance together with structured attention. Akoury et al. (2019) use a simplified constituency tree as latent variables, modeling it autoregressively to later use it as input for a nonautoregressive transformer that generates the output sentence.
Ordered neurons (Shen et al., 2019) are modified LSTMs where the latent sentence tree structure is used to control the dependencies between recurrent units with special "master" input and forget gates. Another line of work proposes conditional generative models that iteratively generate tokens together with the positions at which they should be inserted in the sequence. Emelianenko et al. (2019) further propose to optimize the generation order by sampling from the ordering permutations, while other approaches optimize a lower bound of the probability marginalized over every possible ordering. Gu et al. (2019a) handle the generation order as a latent variable that is captured as relative positions through self-attention, optimizing the ELBO to train the model.

Insertion-based Generation
Levenshtein Transformer (Gu et al., 2019b) is a non-autoregressive approach trained with reinforcement learning (RL) to generate token insertion and deletion actions. While it benefits from the same generation speed-ups over autoregressive models as our model, it has the added difficulty of learning an insertion/deletion policy with RL without any linguistically or empirically motivated priors, which can make convergence slow or difficult to obtain in practice. By comparison, our approach makes use of a linguistically motivated prior for word insertion in a fully supervised way, avoiding the optimization difficulties of RL. Welleck et al. (2019) use cost-minimization imitation learning to learn a policy that generates a binary tree, which is then used to drive the token generation. Lee et al. (2018) propose a latent-variable non-autoregressive machine translation model where the target length is first predicted by the model and the decoder is then iteratively applied to its own output to refine it.

Iterative Refinement
Mask-predict (Ghazvininejad et al., 2019) also predicts the target sentence length and then nonautoregressively predicts the sentence itself, iteratively refining it a fixed number of times, masking out and regenerating the tokens it is least confident about. Lawrence et al. (2019) follow a similar approach and start with a sequence of placeholder tokens (all the same) of a specified length, and they iteratively replace them with normal tokens via masked LM-style inference. As the masking strategy for the training data, the authors propose different stochastic processes to randomly select which placeholders are to be uncovered.

Iterative Expansion LMs
Our proposal is to train a new kind of language model where the token generation order is driven by the dependency parse tree of the sentence and where the generation process is iterative. The input vocabulary contains terminal tokens as well as non-terminal special tokens called dependency placeholders, each of which is associated with one of the possible dependency relations to the head. For the dependency tree in Figure 1, the input to the first iteration is a sequence containing only the [ROOT] element. At each iteration, the model receives as input a sequence I_tok with tokens from the input vocabulary and non-autoregressively generates two new sequences, each with the same length as the input.
The first output sequence, O_tok, contains tokens from a vocabulary with all possible textual tokens (terminal tokens). The second output, O_exp, is a sequence of tokens called expansion placeholders, which are taken from a separate vocabulary. Each expansion placeholder is associated with a pattern describing the left and right dependencies of the token at that position in the O_tok sequence. An example of dependency expansion could be [nsubj-advmod-HEAD-xcomp] for the word "likes" in the dependency parse tree from Figure 1.
After each iteration, the output of the model is expanded. This consists of creating a new sequence by combining the tokens from I_tok, O_tok and O_exp. This process is illustrated in Figure 2, making use of the dependency tree from Figure 1.
When there is a padding token [PAD] in the output (either O_tok or O_exp), the output at that position is ignored when computing the loss function. This occurs when the terminal token has already been generated in a previous iteration and has therefore been received as part of I_tok, so the model does not need to compute it again.
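As an illustration of the expansion step, the following is a minimal Python sketch; the function name and the plain-string token representation are our own, not the authors' implementation:

```python
def expand_level(i_tok, o_tok, o_exp, pad="[PAD]", head="HEAD"):
    """Build the next-level input from the current input I_tok and the
    model outputs O_tok (terminals) and O_exp (expansion placeholders)."""
    next_input = []
    for inp, tok, exp in zip(i_tok, o_tok, o_exp):
        if tok == pad:
            # Position already generated in a previous iteration: keep it.
            next_input.append(inp)
            continue
        # exp encodes the left/right dependents of the new terminal,
        # e.g. "[nsubj-advmod-HEAD-xcomp]".
        for rel in exp.strip("[]").split("-"):
            next_input.append(tok if rel == head else f"[{rel}]")
    return next_input

# First iteration: the [ROOT] placeholder is replaced by the terminal token
# plus dependency placeholders for its left and right dependents.
expand_level(["[ROOT]"], ["likes"], ["[nsubj-advmod-HEAD-xcomp]"])
# -> ['[nsubj]', '[advmod]', 'likes', '[xcomp]']
```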
Note also that an empty-dependencies token [HEAD] marks the end of a branch and that there is no need for an end-of-sequence token <eos>. As shown in the example from Figure 1, the generation of different branches occurs in parallel, needing only 3 iterations to generate a 6-token sentence.

The strategy for composing tree expansion tokens (e.g., [nsubj-advmod-HEAD-xcomp]) may not scale well when single words have many direct dependencies. To alleviate this, we introduce a preprocessing step that modifies the dependency tree so that every word has at most one dependency to the left and one to the right. For each word with more than one dependency on any of its sides, we rearrange the tree to force left-to-right dependencies. Although this tree binarization reduces the degree of parallelism, it reduces data sparsity and allows handling constructions whose number of dependencies may otherwise be too large for the model to properly capture, such as enumerations (e.g., "I bought a pair of shoes, an umbrella, a beautiful jacket and a bracelet").
Iterative expansion LMs can be naturally extended to subword vocabularies, like byte-pair encoding (BPE; Sennrich et al., 2016): for each word, we decompose its node in the tree into as many nodes as subwords in the word, rearranging the tree so that the head of the old word is now the head of the first subword, and each subsequent subword depends on the previous one, while every dependency of the old word node now depends on the last subword.
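A minimal sketch of this rearrangement, assuming the tree is stored as dictionaries mapping integer node ids to head ids and to surface strings (a hypothetical representation used only for illustration):

```python
def split_into_subwords(heads, tokens, node, subwords):
    """Decompose a word node into a chain of subword nodes.

    The first subword keeps the original head, each subsequent subword
    depends on the previous one, and all former dependents of the word
    are re-attached to the last subword.
    """
    tokens[node] = subwords[0]          # the original node becomes the first subword
    chain = [node]
    for sw in subwords[1:]:
        new_id = max(tokens) + 1        # fresh integer node id
        tokens[new_id] = sw
        heads[new_id] = chain[-1]       # each subword depends on the previous one
        chain.append(new_id)
    last = chain[-1]
    for child, h in list(heads.items()):
        if h == node and child not in chain:
            heads[child] = last         # re-attach former dependents
    return heads, tokens
```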

Neural Architecture
The proposed neural architecture is based on a Transformer decoder (Vaswani et al., 2017). To generate the dual output (terminal tokens and expansion placeholders), we condition the generation of terminals on the expansions: the probability distribution over the expansion token space is generated first by projecting from the hidden states of one of the intermediate layers. We sample from it and use the resulting expansion IDs as indices into a trainable expansion embedding layer; the embedded vectors are added to the hidden states that generated them and serve as input to the subsequent layers.
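A schematic PyTorch sketch of this dual output; the module name, layer placement and sampling details are our own simplifications rather than the exact released architecture:

```python
import torch
import torch.nn as nn

class DualOutputHead(nn.Module):
    """Predict expansion placeholders from an intermediate hidden state and
    condition terminal-token prediction on them (simplified sketch)."""
    def __init__(self, d_model, n_expansions, n_tokens):
        super().__init__()
        self.expansion_proj = nn.Linear(d_model, n_expansions)
        self.expansion_emb = nn.Embedding(n_expansions, d_model)
        self.token_proj = nn.Linear(d_model, n_tokens)

    def forward(self, hidden):                      # hidden: (batch, seq, d_model)
        exp_logits = self.expansion_proj(hidden)    # expansion distribution
        exp_ids = torch.distributions.Categorical(logits=exp_logits).sample()
        # Embed the sampled expansions and add them back to the hidden state;
        # in the full model this conditioned state goes through the remaining
        # Transformer layers before the terminal-token projection.
        conditioned = hidden + self.expansion_emb(exp_ids)
        tok_logits = self.token_proj(conditioned)
        return tok_logits, exp_logits, exp_ids
```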
As described in Section 3, the input and output token vocabularies are different: the latter only contains terminal tokens (plus some special tokens such as [PAD]); the former also contains dependency placeholders. However, for practical purposes, at the model level, we define both vocabularies to be the same, both with terminal tokens and dependency placeholders, and we mask the entries of dependency placeholders in the final softmax.
To inject the syntactic dependency information as input into the model, we add a layer of learned positional embeddings containing the position of the head of each token, and we refer to this embedding layer as head position embedding.
The self-attention mask used in Transformer to force causality is not used in our proposal. The input is therefore not masked at all, and the token predictions have access to the full input sequence.

Training
For training iterative expansion LMs, the main input of the model is the tokens at one of the levels of the dependency parse tree (I_tok), while the output is the tokens of the following level (O_tok) and the corresponding expansion placeholders (O_exp). A secondary input to the model is the dependency indexes, which are used in the head position embedding.
The model is trained with categorical cross-entropy for both tokens and expansion placeholders, adding both sub-losses with equal weights to obtain the final loss. Tokens generated in previous iterations appear as [PAD] tokens in the expected output and are ignored when computing the loss.
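Under these conventions, the loss can be sketched as follows ([PAD] positions are excluded via ignore_index; the function and argument names are ours):

```python
import torch.nn.functional as F

def itexp_loss(tok_logits, exp_logits, tok_targets, exp_targets, pad_id):
    """Sum of the two cross-entropies, ignoring positions whose expected
    output is [PAD] (already generated in earlier iterations).

    Logits have shape (batch, seq, vocab); targets have shape (batch, seq).
    """
    tok_loss = F.cross_entropy(tok_logits.transpose(1, 2), tok_targets,
                               ignore_index=pad_id)
    exp_loss = F.cross_entropy(exp_logits.transpose(1, 2), exp_targets,
                               ignore_index=pad_id)
    return tok_loss + exp_loss
```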
Training takes place in batches; as the trainable unit is a level transition, a training batch is composed of level transitions from different sentences.

Inference and Text Generation
In iterative expansion LMs, inference takes place iteratively. The initial state is a batch of [ROOT] tokens, together with the head positions initialized to the special value representing the root node and, in constrained attention variants, a mask with the self-dependency of the single node in each sentence in the batch. At each iteration, the model generates the probability distributions for terminal tokens and expansion tokens, and we use nucleus sampling (Holtzman et al., 2020) to sample from them. The terminal token sequences are expanded according to the expansion tokens (see §3), and the result becomes the input to the following iteration if there are still unfinished branches. Before sampling from the token and expansion probability distributions, we mask the <unk> token and the dependency placeholders to avoid generating them.
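The sampling step, including the masking described above, can be sketched as follows (NumPy; the helper name and interface are illustrative and may differ from the released code):

```python
import numpy as np

def nucleus_sample(probs, p=0.9, forbidden=()):
    """Top-p (nucleus) sampling from a probability vector, after zeroing out
    forbidden ids (e.g. <unk> and the dependency placeholders)."""
    probs = probs.copy()
    probs[list(forbidden)] = 0.0
    probs /= probs.sum()
    order = np.argsort(-probs)                  # most probable ids first
    cutoff = np.searchsorted(np.cumsum(probs[order]), p) + 1
    keep = order[:cutoff]                       # smallest head with mass >= p
    nucleus = probs[keep] / probs[keep].sum()   # renormalize the kept head
    return np.random.choice(keep, p=nucleus)
```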
Although iterative expansion LMs could be combined with beam search across iterations, we have not explored this possibility in this work.

Unconditional Text Generation
We conducted experiments on unconditional text generation following the methodology of Caccia et al. (2020). The goal is to assess both the quality and the diversity of the text generated by the model and the baselines. For the quality evaluation, we use the BLEU score (Papineni et al., 2002) over the test set, where each generated sentence is evaluated against the whole test set as a reference. For diversity, we use the self-BLEU score (Zhu et al., 2018), computed using the rest of the generated sentences as references. For each model, the temperature of the final softmax τ is tuned to generate text in the quality/diversity regime closest to the training data.
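These metrics can be computed with NLTK as in the following sketch, assuming pre-tokenized sentences (helper names are ours; the original evaluation setup may differ in details):

```python
from nltk.translate.bleu_score import sentence_bleu

BLEU5_WEIGHTS = (0.2, 0.2, 0.2, 0.2, 0.2)

def quality_bleu5(test_set, generated):
    """Average BLEU-5 of each generated sentence against the whole test set."""
    return sum(sentence_bleu(test_set, g, BLEU5_WEIGHTS)
               for g in generated) / len(generated)

def self_bleu5(generated):
    """Average BLEU-5 of each generated sentence against the remaining ones."""
    return sum(sentence_bleu(generated[:i] + generated[i + 1:], g, BLEU5_WEIGHTS)
               for i, g in enumerate(generated)) / len(generated)
```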
Iterative expansion LMs are compared against standard LM baselines, namely AWD-LSTM (Merity et al., 2018) and a Transformer LM (Vaswani et al., 2017), both with word (w) and BPE subword (sw) vocabularies. The models were trained on the EMNLP2017 News dataset, which contains news in English, enriched with dependency annotations obtained with CoreNLP, an automatic annotation tool that provides pre-trained models. Syntax-driven generation baselines were not included because the only model with an available implementation able to do unsupervised text generation is the RNNG, and it proved not to scale even to medium-sized datasets like EMNLP2017 News. When sampling from the models, we use nucleus sampling (Holtzman et al., 2020), a form of ancestral sampling that constrains the candidate pool by discarding the tail of the distribution. Samples from the training and validation data are included for reference. Full hyperparameters and data processing details are described in Appendices D and B.

Style Variation
Iterative expansion LMs drive the generation of text with the dependency parse tree. It is possible to influence the generated trees by artificially altering the probability of the different expansion tokens. To demonstrate this, we modified the decoding process of iterative expansion LMs to make adjectival constructions more probable than normal, aiming at a more descriptive style: during decoding, we multiply the probabilities of the expansion placeholders that express adjectival dependencies (i.e., those containing adjectival modifier "amod" relations) by a constant factor, and renormalize the probabilities by dividing by their sum.
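A minimal sketch of this reweighting (the boosting factor and the names are illustrative):

```python
import numpy as np

def boost_adjectives(exp_probs, exp_vocab, factor=3.0):
    """Multiply the probability of expansion placeholders that contain an
    'amod' relation by `factor` and renormalize the distribution."""
    boosted = exp_probs.copy()
    for i, name in enumerate(exp_vocab):
        if "amod" in name:
            boosted[i] *= factor
    return boosted / boosted.sum()
```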
We conducted this experiment with the word-level models trained on EMNLP2017 News data. We compute the ratio of adjectives per sentence to verify the increased presence of adjectives, while monitoring the quality and diversity measures over the generated text for potential degradation.

Results and Analysis
We assess the ability of iterative expansion LMs to unconditionally generate text in terms of quality (BLEU-5 against the validation data) versus diversity (self-BLEU-5), comparing against the sequential baselines, each with its softmax temperature τ tuned separately.
Each model was used to generate 20 samples of 400 sentences, and self-BLEU-5 and validation BLEU-5 were computed over each of them, taking the average and the standard deviation; Figure 3 and Table 1 show these results. We also compare against text generated by other language models: an AWD-LSTM word-level LM and a Transformer word-level LM, both trained on EMNLP2017 News, plus OpenAI GPT-2 (1.5B parameters) (Radford et al., 2019). The results are shown in Table 2.
These results show how the generated text improves over AWD-LSTM in terms of quality by all measures, with a comparable level of diversity. In comparison to the Transformer, while the quality measured with BLEU-5 is better for ITEXP, the rest of the quality measures indicate that the text generated by the Transformer is of better quality.

Table 3: ITEXP (w, τ = 1.0) with increased adjectives.
The results of the styled text generation experiments, shown in Table 3, confirm that the style of the resulting text can be successfully modulated to the desired degree and that the quality and diversity are only slightly degraded at moderate increases of the probability of adjectival clause generation.

Human Evaluation
In order to better assess the quality of the generated text, we also include a human evaluation. For this, we took a sample of 60 sentences from each model under study, together with a sample of the same size from the validation data to serve as a reference. The sentences were evaluated by a pool of annotators, who were asked to rate each sentence on an integer scale from 1 to 5, taking into account its fluency and correctness.
The pack of sentences rated by each annotator contained 10 sentences from each of the models under evaluation. Each sentence under evaluation was part of the packs of 3 evaluators; this redundancy was used to measure the discrepancies in the rating of each sentence among annotators, quantified by means of the average per-sentence standard deviation.

Table 4 shows the statistics of the obtained ratings, where we can see the average rating of the sentences generated by each model, together with the average per-sentence standard deviation, which indicates how much the ratings for each sentence differed among evaluators. We can see that the highest human ratings were obtained by the Transformer, both with word- and subword-level vocabularies, followed by ITEXP and then AWD-LSTM. Table 5 shows the human evaluation for the models from the style variation experiments presented in Table 3. As we can see, there is a small degradation in quality as we force high levels of adjectival presence.

Table 5: Human evaluation for ITEXP (w) models with increased adjectival construction probability.

Further Comparison with Real Text
Given that the generation process in iterative expansion LMs is not sequential, we studied the distribution of the sentence lengths it generates. This is shown in Figure 4 for the text generated by a word-level iterative expansion LM trained on EMNLP2017 News, along with the lengths of a sample from the training data.

Iterative expansion LMs generate the dependency parse tree as they generate text. We studied the depths of the dependency trees of generated text in relation to those parsed from the training data, as shown in Figure 5.

We also measured the degree to which the generated trees adhere to the trees obtained by parsing their lexicalized representation. Specifically, we computed the labeled and unlabeled attachment scores between both for the text generated at different softmax temperatures τ. Attachment scores are the standard performance measure in dependency parsing and are computed as the percentage of words that have been assigned the same head as the reference tree, over a test set. The attachment score is "labeled" if the dependency label is taken into account or "unlabeled" otherwise. As shown in Table 6, the obtained labeled attachment scores (LAS) and unlabeled attachment scores (UAS) are very high across the different values of the generation temperature τ.

Table 6: Attachment scores of the generated trees.
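A sketch of this computation, assuming each tree is represented as a list of (head index, relation) pairs, one per word (a representation of our own, for illustration only):

```python
def attachment_scores(generated_trees, reparsed_trees):
    """UAS and LAS between trees produced during generation and trees obtained
    by re-parsing the generated text."""
    total = uas = las = 0
    for gen_tree, ref_tree in zip(generated_trees, reparsed_trees):
        for (g_head, g_rel), (r_head, r_rel) in zip(gen_tree, ref_tree):
            total += 1
            if g_head == r_head:
                uas += 1                  # unlabeled: head matches
                if g_rel == r_rel:
                    las += 1              # labeled: head and relation match
    return uas / total, las / total
```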

Quantification of the Generation Speedup
Text generation with autoregressive models like LSTMs or Transformers has linear computational complexity with respect to the length of the generated sequence. In comparison, the dependency tree-driven decoding used by iterative expansion LMs generates text in parallel for each branch of the tree. If the tree were a perfectly balanced binary tree, the computational complexity would be logarithmic. However, dependency trees are in general not balanced and, given the tree binarization preprocessing that we introduce, the parallelism is slightly reduced.

Figure 6: Histogram of the ratio of decoding steps needed to generate a sentence with tree-based decoding with respect to sequential decoding, for an ideal binary tree, the non-binarized tree, and the binarized tree.

Figure 6 shows the reduction in decoding steps of tree-based decoding with respect to autoregressive decoding, obtained by taking a sample of the training data and computing the steps needed to decode each sentence under an idealized binary dependency parse tree, its actual parse tree, and its binarized parse tree. On average, the binarized parse tree, which is the decoding used by iterative expansion LMs, needs only 45% of the decoding steps needed by autoregressive decoding.
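The number of decoding iterations equals the depth of the dependency tree used for decoding, so the ratio above can be computed as in the following sketch (the tree representation is our own):

```python
def tree_decoding_steps(heads):
    """Iterations needed by tree-based decoding: the depth of the tree, where
    heads[i] is the index of the head of token i (None for the root)."""
    def depth(i):
        return 1 if heads[i] is None else 1 + depth(heads[i])
    return max(depth(i) for i in range(len(heads)))

def step_ratio(heads):
    # Autoregressive decoding needs one step per token.
    return tree_decoding_steps(heads) / len(heads)
```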
Table 7: Text samples generated by the iterative expansion LM with a word-level vocabulary.
American students were 62 percent more likely to die in a heart attack during the first week of 2004, according to the study.
For 150 days, Hillary Clinton will do more to improve access to affordable quality care, support and education funding for millions of Americans, she says.
For those on this list, it's likely that I would rather be able to train them up, she said.
He made it clear the SNP repeated on Friday as a response, saying they discussed a contract getting the extra cost here.
He'll pay $25, 000 for rent and more buses and bring his collection to The Academy on Channel 31.
Six years later, at least eight people died as a result of the shooting.
The health prime minister told CNN Thursday that he was willing to back up against the US and remove all of the relevant items at the end of the transition.
Then, another man told police that was a friend's friend, and as a child, he made the decision to call his mother.
They are 40 -60 among the top 50, 000 women in the last year in that group since 2014 -15.
They've worked hard on Twitter and they think they've tried to focus on our sport, she said.
We like to think that if you try to get this game done, we can get a lower success rate out of 15.
I feel that they're going to Syria because we had this explanation, that they have an indication of their advance.
The girl's mother told the group of three she needed treatment and the family said her daughter would still be alive with another child.
But she added: "The data is important to the EU that the UK can attract more businesses.
Though he also spoke to Mr Wilson on Saturday morning at the Netherlands Police trial, Johnson referred it to the No. 1 commission.
It's a collective belief and it's a statement to us, he said.
It's just the first thing we're feeling now and I don't like it.
So if you want to be sitting in a garden, you have to wait for something to make sure that this does not end.
So, for example, we need to argue about what the president did, but I'm just interested in having any talk.
The British defence ministry confirmed action had been taken at the hospital but could not confirm the details until now.
We'll ask for a fair share of Russia to stop border security, particularly for people of color, he added.

Table 7 shows a selection of text samples generated by iterative expansion LMs with a word-level vocabulary, while Table 8 shows samples generated with a subword-level vocabulary. We can see that, despite being generated non-sequentially, with each branch of the dependency parse tree generated in parallel, the resulting sentences maintain coherence and syntactic agreement, confirming that conditioning on the token dependencies in the parse tree provides enough information to generate fluent text while speeding up the decoding process.

Conclusion
In this work, we presented iterative expansion LMs, which are iterative non-autoregressive text generation models that rely on syntactic dependency trees to generate sentence tokens in parallel. As opposed to other syntax-driven generation mechanisms, the training of iterative expansion LMs can be naturally computed in batches and they are amenable to subword-level vocabularies.
We showed that our proposed method generates text with quality between that of LSTMs and Transformers, with comparable diversity, according to both automatic measurements and human judgement, while generating text in less than half of the decoding steps needed by sequential LMs, and that it allows direct control over the generation process at the syntactic level, enabling the induction of stylistic variations in the generated text.
Our code is available as open source at https://github.com/noe/iterative_expansion_lms.