FUDGE: Controlled Text Generation With Future Discriminators

We propose Future Discriminators for Generation (FUDGE), a flexible and modular method for controlled text generation. Given a pre-existing model G for generating text from a distribution of interest, FUDGE enables conditioning on a desired attribute a (for example, formality) while requiring access only to G’s output logits. FUDGE learns an attribute predictor operating on a partial sequence, and uses this predictor’s outputs to adjust G’s original probabilities. We show that FUDGE models terms corresponding to a Bayesian decomposition of the conditional distribution of G given attribute a. Moreover, FUDGE can easily compose predictors for multiple desired attributes. We evaluate FUDGE on three tasks — couplet completion in poetry, topic control in language generation, and formality change in machine translation — and observe gains in all three tasks.


Introduction
Recent advances in large pretrained language models allow us to generate increasingly realistic text by modeling a distribution P (X) over natural language sequences X. The distribution P (X) may be truly unconditional, as is common in language modeling, or it may model P (X|I) conditioned on some input I, as in machine translation or summarization.
We are frequently interested in controlled text generation, the task of generating text conditioned on an additional desirable attribute a which is not already built into P (X). That is, we would like to model P (X|a) (or possibly P (X|I, a); henceforth we will drop I from the notation for simplicity). For example, P (X) may be a pretrained translation model for Spanish inputs I to English outputs X, but we may wish to additionally constrain the outputs to possess a new attribute a, e.g., formality, which we did not optimize for during training.
Unfortunately, once we have already obtained an unconditioned P (X) defined as the output distribution of some large generative model G, it is nontrivial to add conditioning on a new attribute a without either training a new model from scratch or fine-tuning with additional data. Although in principle we can trivially sample from P (X|a) via rejection sampling from P (X), rejection sampling may be highly inefficient in practice. On the other hand, while generating according to attribute a, P (X) should be left otherwise intact: in the previous translation formality example, it is pointless to generate formal English outputs if they do not preserve the original Spanish meaning.
In light of these concerns, we propose Future Discriminators for Generation (FUDGE), a flexible and modular method for modeling P (X|a) which accesses only the output probabilities of the generative model G which defines P (X). FUDGE learns a binary predictor for whether attribute a will become true in the complete future, based on an incomplete sequence prefix (Sec. 3). Multiplying the output probabilities of this predictor with G's original probabilities and then renormalizing yields a model for the desired P (X|a) via Bayes' Rule.
We run experiments on three controlled text generation tasks (couplet completion in poetry, topic control in language generation, and formality change in machine translation), showing our method's broad applicability. Additionally, we demonstrate the modularity of FUDGE by composing multiple attribute constraints in both the couplet and topic control tasks. In our experiments, we find that FUDGE is highly effective at attribute control, outperforming both a baseline which directly fine-tunes G and also a strong gradient-based method (PPLM; Dathathri et al., 2019). Our code is available at https://github.com/yangkevin2/naacl-2021-fudge-controlled-generation.

Related Work
Ideally, a controlled text generation method should efficiently control for a while preserving P (X) as much as possible. Recent work on controlled text generation has greatly advanced our ability to control for a required attribute a flexibly and cheaply, with varying degrees of modification to the original model G which defines P (X). The key distinguishing feature of FUDGE is that it models whether attribute a will be true in the future, rather than in the present. We find that FUDGE substantially outperforms previous weighted decoding (WD) approaches in our experiments (Sec. 4.2).

Future Discriminators for Generation
We now explain the details of our proposed method, Future Discriminators for Generation (FUDGE), and show that it corresponds to modeling the desired conditional distribution P (X|a).
For a given language generation task, assume we have an autoregressive model G (e.g., a large pretrained language model) which models P (x i |x 1:i−1 ) for tokens x 1 . . . x i . Letting X = x 1:n denote a completed sequence, G can sample from P (X) = P (x 1:n ) one token at a time by factoring P (X):
$$P(X) = \prod_{i=1}^{n} P(x_i \mid x_{1:i-1})$$
To condition on attribute a, we instead model P (X|a). This requires a model for P (x i |x 1:i−1 , a), modifying the previous factorization:
$$P(X \mid a) = \prod_{i=1}^{n} P(x_i \mid x_{1:i-1}, a)$$
If we model P (x i |x 1:i−1 , a) directly, we obtain a class-conditional language model (CCLM). We can learn the CCLM by e.g., fine-tuning G depending on the available data, possibly with some structural modification to G to accommodate conditioning.
However, FUDGE instead relies on the following Bayesian factorization, exchanging x i and a conditioned on x 1:i−1 :
$$P(x_i \mid x_{1:i-1}, a) \propto P(a \mid x_{1:i}) \, P(x_i \mid x_{1:i-1})$$
The second term is exactly the quantity modeled by the base G. It then suffices to model the first term, P (a|x 1:i ), with a binary classifier B for the attribute a given a prefix x 1:i . Intuitively, one can view B as rescoring or reranking G's original hypotheses.
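For completeness, the proportionality used in this factorization can be spelled out with Bayes' Rule. The following is our own step-by-step expansion in the notation of this section, not an additional result; it only makes explicit that the dropped denominator does not depend on the candidate token.

```latex
\begin{align*}
P(x_i \mid x_{1:i-1}, a)
  &= \frac{P(a \mid x_{1:i-1}, x_i)\, P(x_i \mid x_{1:i-1})}{P(a \mid x_{1:i-1})}
   = \frac{P(a \mid x_{1:i})\, P(x_i \mid x_{1:i-1})}{P(a \mid x_{1:i-1})} \\
  &\propto P(a \mid x_{1:i})\, P(x_i \mid x_{1:i-1}),
\end{align*}
% The denominator is constant in x_i and is recovered by normalization:
\[
P(a \mid x_{1:i-1}) = \sum_{x_i} P(a \mid x_{1:i})\, P(x_i \mid x_{1:i-1}).
\]
```

Renormalizing the product over the vocabulary therefore recovers the conditional distribution exactly, provided the classifier's estimate of P (a|x 1:i ) is accurate.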
We emphasize that although B takes a prefix x 1:i as input, it predicts whether attribute a will in the future be satisfied for the completed generation x 1:n . For instance, suppose we are given a dataset of examples {(x 1:n , a′)} with a′ being the values of binary indicators for the desired a (i.e., if a is formality, then a′ is 0 or 1 when x 1:n is informal or formal respectively). For each training example (x 1:n , a′), we train our classifier B using all pairs (x 1:i , a′); that is, we construct a separate example from each prefix x 1:i of x 1:n . Our approach contrasts with previous methods such as Dathathri et al. (2019), which greedily optimize for a on the immediate extension x 1:i+1 . One particular benefit is that FUDGE naturally plans for the future: in the example for generating text on the "space" topic in Table 6, FUDGE writes about a "mysterious ship" despite "ship" itself not being in the given "space"-topic bag of words, because "mysterious ship" easily leads into a mention of one of the targeted "space" words ("Earth"). Similarly, in the first couplet completion example in Table 3, FUDGE needs to rhyme with "fear" after exactly ten syllables. After seven syllables, it could reasonably generate the word "clear," but it first generates the adverb "pretty" in order to set up the generation of "clear" as the tenth syllable.
Figure 1: Illustration of one decoding step in FUDGE, for an example where the desired attribute a is formality. A large pretrained model G (dark blue) outputs unconditioned probabilities. Our binary predictor (red) predicts whether the eventual completed sequence will be formal for each possible continuation (computed for each candidate x 3 , e.g., "want"; holding a fixed). The probabilities for each x 3 are multiplied (purple) and then renormalized to obtain P (x 3 |x 1:2 , a), from which we sample the next token x 3 = "prefer."
FUDGE's implementation is shown schematically in Figure 1, and is quite simple in practice.
FUDGE just needs to learn a B (red in Figure 1) sharing tokenization with G (dark blue). It then converts B's output into probabilities (red table in Figure 1), and multiplies with the original output probabilities from G (dark blue table), to obtain unnormalized probabilities P (x i , a|x 1:i−1 ) (purple table). Finally, renormalizing over the output vocabulary yields the desired distribution P (x i |x 1:i−1 , a). In practice, we operate in the log-probability space for numerical stability.
To improve computational efficiency, we typically choose B to be lightweight relative to G. We also consider only the top 200 possibilities for x i according to G at each step, as a cheap approximation to the full distribution, and find that this works well in practice. In each task in Sec. 4, running FUDGE on the test set takes no more than 15 minutes on a single Quadro RTX 6000 GPU.
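The decoding step described above (top-k truncation, addition in log-probability space, renormalization) can be sketched in a few lines. This is a minimal illustration in plain Python rather than the released implementation: the function names `fudge_step` and `log_softmax`, the callable `attr_log_prob` standing in for B, and the toy numbers are all assumptions made for the example.

```python
import math

def log_softmax(scores):
    """Normalize unnormalized log-scores into log-probabilities."""
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_z for s in scores]

def fudge_step(base_logits, attr_log_prob, prefix, top_k=200):
    """One decoding step: combine G's next-token scores with B's predictions.

    base_logits: dict mapping token -> logit from G for the next position.
    attr_log_prob: callable(prefix_tokens) -> log P(a | prefix); stands in for B.
    Returns a dict token -> log P(x_i | x_{1:i-1}, a) over the top_k candidates.
    """
    # Keep only G's top-k candidates as a cheap approximation to the full vocabulary.
    candidates = sorted(base_logits, key=base_logits.get, reverse=True)[:top_k]
    base_lp = log_softmax([base_logits[t] for t in candidates])
    # Adding log-probabilities corresponds to multiplying probabilities.
    combined = [lp + attr_log_prob(prefix + [t]) for t, lp in zip(candidates, base_lp)]
    # Renormalize over the retained candidates.
    return dict(zip(candidates, log_softmax(combined)))

# Toy example with made-up numbers (hypothetical stand-ins for G and B).
toy_logits = {"want": 2.0, "prefer": 1.0, "wanna": 1.5}

def toy_formality(prefix):
    return math.log({"want": 0.5, "prefer": 0.9, "wanna": 0.1}[prefix[-1]])

dist = fudge_step(toy_logits, toy_formality, ["Do", "you"])
```

In this toy case the informal candidate "wanna" starts with a higher base score than "prefer" but ends up with lower probability after rescoring, which is the reranking behavior described in the text.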
Finally, as with other controlled generation approaches such as Dathathri et al. (2019), it is likely that augmenting FUDGE with reranking approaches such as rejection sampling could improve output quality at the cost of compute time, although we do not comprehensively evaluate such extensions in this work.
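The prefix-level training scheme described earlier in this section, in which every prefix of a labeled sequence becomes its own training example, can be illustrated as follows. This is a hedged sketch: the helper name `prefix_examples` and the toy sentence are hypothetical, and a real pipeline would operate on token ids from G's tokenizer.

```python
def prefix_examples(tokens, label):
    """Expand one labeled sequence (x_{1:n}, a') into one example per prefix x_{1:i}.

    Every prefix inherits the label of the completed sequence, so the
    classifier learns to predict the attribute of the eventual full output.
    """
    return [(tokens[:i], label) for i in range(1, len(tokens) + 1)]

# Toy formal sentence with label 1 (illustrative only).
examples = prefix_examples(["I", "would", "prefer", "tea"], 1)
```

Each of the four prefixes here carries the label of the full sentence, which is what lets the predictor score partial sequences by their likely future.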

Advantages and Limitations
We highlight several additional potential advantages of FUDGE compared to directly modeling P (x i |x 1:i−1 , a) via e.g., a fine-tuned CCLM:
1. FUDGE requires access only to P (X) (i.e., G's output logits) rather than G itself.
2. G can be freely swapped out for any other model that shares the same tokenization when larger models become available.
3. Given multiple conditionally independent attributes with predictors for each, FUDGE can easily condition on the combination of these attributes in a modular fashion by summing their output log-probabilities (Sec. 4.1, 4.2).
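The third point, composing several attribute predictors by summing log-probabilities, can be sketched as below. This is an illustrative example under the conditional-independence assumption stated in the text; `rescore_with_attributes` and the two toy predictors are hypothetical names, not part of any released code.

```python
import math

def rescore_with_attributes(base_log_probs, predictors, prefix):
    """Add the log-probabilities of several attribute predictors to G's scores.

    base_log_probs: dict token -> log P(x_i | x_{1:i-1}) from G.
    predictors: list of callables, each mapping a prefix to log P(a_k | prefix).
    Returns a renormalized dict token -> log-probability.
    """
    scores = {}
    for token, base_lp in base_log_probs.items():
        extended = prefix + [token]
        scores[token] = base_lp + sum(p(extended) for p in predictors)
    m = max(scores.values())
    log_z = m + math.log(sum(math.exp(s - m) for s in scores.values()))
    return {t: s - log_z for t, s in scores.items()}

# Toy example: two candidate tokens, two made-up attribute predictors.
base = {"a": math.log(0.5), "b": math.log(0.5)}
attr_one = lambda prefix: math.log(0.8 if prefix[-1] == "a" else 0.2)
attr_two = lambda prefix: math.log(0.5)  # uninformative about the next token
combined = rescore_with_attributes(base, [attr_one, attr_two], [])
```

Because the second predictor assigns the same probability to both candidates, it cancels in the renormalization, and the result is driven entirely by the first predictor.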
Unfortunately, like previous methods, FUDGE cannot fully guarantee that all outputs possess the desired attribute a. In FUDGE's case, this is due to the approximation inherent in modeling P (a|x 1:i ), as well as only considering the top 200 possible x i for computational efficiency.

Experiments
We run experiments on a range of controlled text generation tasks to evaluate the effectiveness of our proposed method: poetry couplet completion (Sec. 4.1), topic-controlled language generation (Sec. 4.2), and machine translation formality change (Sec. 4.3). For each task we discuss the evaluation setup, the specific details of our method and baselines, and finally experimental results.

Poetry Couplet Completion
We begin with English poetry generation, a task that emphasizes well-formedness, and which has been studied in different forms by many previous works (Zhang and Lapata, 2014; Wang et al., 2016; Ghazvininejad et al., 2016, 2017). Our task here is couplet completion. Given the first line of an iambic pentameter couplet (e.g., Table 1), the model must generate a second line which (1) satisfies iambic pentameter, (2) rhymes with the first line, and (3) ends a sentence. The desired attribute a is defined as possessing all three properties, as evaluated by a rule-based checker F (Appendix A). Our test set is a collection of prefix lines of couplets, collected from the ending couplet of each of Shakespeare's 154 sonnets.
Table 1 (example couplet):
So long as men can breathe or eyes can see,
So long lives this and this gives life to thee.
Metrics. We consider four metrics: success (the main metric), grammaticality, perplexity, and distinctness (Table 2). At test time, we decode until the model generates ten syllables followed by an end-of-sentence punctuation mark, or until it generates an eleventh syllable (an automatic failure, since iambic pentameter requires exactly ten syllables).
Overall, because we define a using a rule-based F which is accessible during training, our formulation of couplet completion is a relatively clean task for evaluating the effectiveness of FUDGE.
Method and Baselines
FUDGE Instantiation. The obvious approach is to learn a predictor for F directly. However, the three components of a (meter, rhyme, and sentence-ending) should be roughly independent. Thus we assume conditional independence, and demonstrate the modularity of FUDGE by constructing three separate predictors to be combined at test time:
1. B 1 (x 1:i ) takes a text prefix x 1:i , and predicts whether the completion x 1:n of prefix x 1:i will be in iambic meter. The model is an LSTM followed by a linear output layer.
2. B 2 (x 1:i , t, r) takes prefix x 1:i , the number of syllables t between x i and x n for n ≥ i, and a rhyme sound r. It predicts whether the completion x 1:n has the rhyme sound r at the end of token x n . The model is an LSTM with attention dependent on t and r, followed by a shallow feedforward network, and is trained via noise-contrastive estimation (Gutmann and Hyvärinen, 2010).
3. B 3 (x 1:i , t) takes prefix x 1:i and the number of syllables t between x i and x n for n ≥ i, and predicts whether x n ends a sentence. The model is an LSTM followed by a shallow feedforward network.
The predictors vary in architecture because B 2 and B 3 require inputs other than x 1:i ; in truth, they are families of related predictors. We find that performance is not overly sensitive to the particulars of the predictor architectures (Appendix D).
Table 2: Couplet completion results. Success (main metric), grammaticality, perplexity, and distinctness of different methods, tested on 154 prefix lines from Shakespeare sonnets. FUDGE substantially outperforms automated baselines on success and maintains high diversity, although quality unsurprisingly suffers compared to the base G due to the difficult constraint F. Note Shakespeare's work is often "incorrect" due to the narrowness of our metric F (see footnote 6); he also scores poorly on text quality because our evaluation models are intended for more modern English.
To train the discriminators, we sample a dataset of 10 million generations of varied length from GPT2-Medium. From these generations, we sample random subsequences x 1:n of roughly 10 to 30 syllables and truncate t ≤ 10 ending syllables. These truncations become inputs x 1:i to the predictors. For simplicity, we did not balance the class labels for e.g., the iambic predictor during training, although it is likely that doing so would improve performance.
At test time, we extract r from the given first line of the couplet, and initialize t = 10, updating at each step. We then modify the output logits of G by simply adding the log-probabilities from B 1 , B 2 , and B 3 , demonstrating the ease of composing constraints in FUDGE.
2. FINETUNE, a CCLM which finetunes G on similar inputs to those used for B 2 in FUDGE.
Since it is not obvious how to compose multiple CCLMs for different attributes, we train a single CCLM for all desired properties together. We condition by prefixing the input with (1) whether the last 10 syllables of the original untruncated x 1:n are iambic, (2) the rhyme sound at the end of x n , and (3) whether a sentence ends with x n . A special token is inserted 10 syllables from the end of x 1:n .
Footnote 5: A system like Hafez (Ghazvininejad et al., 2016, 2017), which enforces meter and rhyme at each decoding step using a hard constraint, could achieve perfect success rate. However, this approach relies on the meter and rhyme attributes being "prefix-checkable" at the word level: one can guarantee success by simply never selecting a word which immediately violates the constraint. This is often the case for simple rule-based constraints, but not for many other interesting attributes, such as the topic and formality attributes in our subsequent experiments. To preserve generality, FUDGE does not rely on this "prefix-checkable" property, and neither do our baselines.
3. PPLM (Dathathri et al., 2019), which uses shallow predictors learned from G's top-level hidden layer to modify G's states toward increasing probability of the desired attribute via gradient ascent. We decompose the predictors into the same iambic, rhyme sound, and end-of-sentence predictors as for FUDGE, inserting an additional hidden layer in the shallow predictor when needed to incorporate additional input (the desired rhyme sound and/or number of syllables until end-of-sentence).
All non-Shakespeare methods use top-k sampling with k = 10.

Results
Even though our GPT2-Medium-generated training dataset is completely different from the test domain, and contains essentially zero examples of correct couplets, FUDGE is able to learn the desired attribute. As shown in Table 2, FUDGE greatly outperforms all automated baselines in success rate.
Surprisingly, the PPLM baseline achieves zero success. We find that its iambic and rhyme predictors are very poor, so we hypothesize that the relevant information is not easily extractable from the last hidden layer of G. In contrast, FUDGE's predictors operate directly on the raw text.
Funnily enough, FUDGE even matches Shakespeare according to F, although this is largely due to the narrowness of F and should not be taken seriously (see footnote 6). Similarly, the grammaticality and perplexity metrics are designed for our automated baselines, and thus assign poor scores to Shakespeare's antiquated and flowery style.
FUDGE also maintains relatively fluent generation despite lower grammaticality and perplexity compared to G. See Table 3 for two successful examples. Interestingly, FUDGE also increases diversity compared to G, perhaps due to the difficult constraint F forcing FUDGE to use lower-probability regions of the base distribution P (X).
Table 3: Two successful couplet completions by FUDGE. In each pair, the first line is the given prefix and the second line is generated.
And even thence thou wilt be stol'n, I fear,
for this shall be the end. That's pretty clear.
Or, if they sleep, thy picture in my sight
I will be glad to look upon the night.
Finally, it is possible (and trivial) to adjust the conditioning strength in FUDGE by multiplying the binary predictors' output logits by a constant. However, this deviates from our Bayesian factorization of P (X|a), and we do not do so.

Topic-Controlled Language Generation
Next, we explore topic control in English language generation. The desired attribute a is to be on-topic for a given topic, such as science or politics. To facilitate comparison with prior work, we largely follow the setup of PPLM (Dathathri et al., 2019): the model is provided an approximation to the topic at test time, in the form of a bag of on-topic words W. The goal is to sample text according to the topic approximated by W, starting from a generic prefix. There are 7 topics (space, politics, military, legal, science, religion, and computers) and 20 prefixes, and the model generates three 80-token samples (see footnote 7) from each topic-prefix pair, for a total of 420 generations.
Footnote 6: We define F using somewhat narrow criteria (Appendix A), which capture only a subset of what Shakespeare considered to be well-written couplets. The purpose of this task is to evaluate FUDGE's ability to satisfy a difficult well-formedness constraint compared to automated baselines, rather than to perfectly capture the human notion of an iambic pentameter couplet. Thus Shakespeare is marked wrong when he (1) uses archaic pronunciations, (2) uses loose rhymes, (3) elides syllables to fit meter, or (4) uses words missing from the CMU Pronouncing Dictionary. See Appendix A.1 for details. Of course, Shakespeare is only included as a whimsical point of reference; our generations obviously do not hold a candle to Shakespeare's originals.
Footnote 7: All models and baselines use GPT2 tokenization.
Metrics. Unfortunately, we cannot easily construct a rule-based F for being "on-topic." Additionally, use rate of words in W is a poor metric, because a model can score highly by e.g., simply returning the words in W, without generalizing to the full topic that W approximates. Instead, we adopt a notion of success which requires the model to generalize the bag W to the full topic. The remaining metrics are measures of quality and diversity.
2. Grammaticality, identical to the couplet task.
4. Distinctness, defined as in the couplet task. However, it is calculated separately within the 60 generations for each topic, and then averaged over the 7 topics.
Additionally, following the evaluation procedure of prior work such as Dathathri et al. (2019), we run human evaluations via Amazon Mechanical Turk for FUDGE against each baseline, comparing topic control and fluency. For each pairwise comparison, we ask 3 workers to evaluate each of 420 paired outputs. Workers were asked to mark which generation is more on topic (first, second, both, or neither), and to rate each generation's fluency on a Likert scale from 1 to 5. We report the average fraction of outputs marked as on-topic as well as the average fluency rating for each method.

Method and Baselines
FUDGE Instantiation. Since we model topics as bags of words, FUDGE uses a binary predictor B(x 1:i , w) which takes a prefix x 1:i and word w, and classifies whether w appears in the future x i:n for n ≥ i. (Since it is desirable to stay on topic even after successfully getting on topic, we use x i:n rather than x 1:n .) Training examples (x 1:i , w) are sampled from the same dataset of 10 million GPT2-Medium generations used for the couplet task, and B is trained using noise-contrastive estimation. B is a lightweight LSTM-based classifier similar to B 2 from the couplet task.
Table 4: Topic control results (column groups: On-Topic, Text Quality, Diversity). Success (main metric), grammaticality, perplexity, and distinctness for different methods. FINETUNE and WDEC often degenerate into repeating the given bag of words W; this is ill-captured by perplexity, but results in poor grammaticality and distinctness. FUDGE substantially outperforms all baselines on success, including the strong gradient-based PPLM baseline, while preserving high quality and diversity.
At test time, we can compose individual-word constraints if we assume conditional independence between words (although this may be imperfect). Given a bag of N words {w 1 . . . w N } and prefix x 1:i , we could condition on all words in the bag appearing in the future by adding all log-probabilities log P (w 1 |x 1:i ) . . . log P (w N |x 1:i ) to G's logits. However, topic control does not require every word to appear; perhaps some number λ of on-topic words is enough to be "on-topic." Therefore, we model the topic constraint as selecting a random subset of λ words from the original bag, and requiring that only those λ words all appear. Since each of the N words is selected with probability λ/N, the quantity we add to the base G logits is
$$\frac{\lambda}{N} \sum_{j=1}^{N} \log P(w_j \mid x_{1:i})$$
in expectation. In our experiments we use λ = 4, based on a fantasy-topic bag of words used for validation (Appendix C).
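The expectation argument for the topic constraint can be checked numerically on a small bag. The sketch below is our own illustration with made-up probabilities: `topic_bonus` computes the closed-form quantity (λ/N times the sum of per-word log-probabilities), and `mean_subset_sum` averages the summed log-probabilities over every size-λ subset by brute force. Both function names are hypothetical.

```python
import itertools
import math

def topic_bonus(word_log_probs, lam):
    """Closed form: (lam / N) * sum_j log P(w_j | x_{1:i})."""
    n = len(word_log_probs)
    return (lam / n) * sum(word_log_probs)

def mean_subset_sum(word_log_probs, lam):
    """Average, over all size-lam subsets of the bag, of the summed log-probabilities."""
    subsets = list(itertools.combinations(word_log_probs, lam))
    return sum(sum(s) for s in subsets) / len(subsets)

# Made-up per-word probabilities for a bag of N = 5 words, with lam = 2.
log_probs = [math.log(p) for p in (0.9, 0.5, 0.2, 0.1, 0.05)]
closed_form = topic_bonus(log_probs, 2)
brute_force = mean_subset_sum(log_probs, 2)
```

The two values agree up to floating-point error, since each word appears in a fraction λ/N of the uniformly chosen subsets.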
Baselines. We compare to four baselines.
2. FINETUNE, which finetunes G on the same inputs used for FUDGE. The future word is given as a prefix for conditioning. At test time, we compute logits for each prefix in the given W and use the average as the true logits, as an ad hoc way to condition on the full W.
3. WDEC, a simple weighted decoding implementation which greedily considers only the immediate next token when optimizing for a. Instead of using B, WDEC just adds a fixed λ WDEC to the logit for each word in W. Note WDEC requires a to be well-defined at the token level, so it is not easily transferable to certain tasks (e.g., couplet completion).
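For contrast with FUDGE's learned predictor, the WDEC baseline's logit adjustment can be sketched as below. This is a minimal illustration assuming that each word in W corresponds to a single token, which is a simplification made here; `wdec_logits` and the toy values are hypothetical.

```python
def wdec_logits(base_logits, bag_tokens, lam_wdec):
    """Add a fixed bonus lam_wdec to the logit of every token that is in the bag W."""
    return {
        token: logit + (lam_wdec if token in bag_tokens else 0.0)
        for token, logit in base_logits.items()
    }

# Toy example: only the in-bag token "planet" receives the bonus.
adjusted = wdec_logits({"planet": 0.5, "the": 2.0}, {"planet", "orbit"}, 3.0)
```

The bonus depends only on the immediate next token and not on the prefix, which is consistent with the greedy behavior the text attributes to this baseline.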

Table 5: Topic control human evaluations, pairwise comparisons (columns: Method, Topic, Fluency). FUDGE achieves a substantially higher fraction of on-topic outputs compared to each baseline, in addition to higher average fluency (rated 1 to 5).
FUDGE achieves the highest success by a substantial margin (Table 4), and outperforms all baselines on human evaluations in both topic relevance and fluency (Table 5). FUDGE simultaneously preserves high quality and diversity according to automated metrics. Table 6 shows two examples.
Unsurprisingly, G performs poorly on success. WDEC and FINETUNE also perform poorly, in success and especially in distinctness. WDEC frequently degenerates into repeating the given words in the bag W, despite tuning λ WDEC (Appendix C).
Space: The issue focused on the original plot, which was about a mysterious ship that would land on Earth, and would lead to humanity's first interstellar expedition. The original plan called for humanity to use the spacecraft to colonize outer space and build the first city on Mars. But this idea fell by the wayside in the final drafts.\n\n"It was just not a very popular idea and it wasn'
Politics: The issue focused on whether the two institutions were operating within the bounds set by the constitution and the law.\n\nThe Constitutional Court said that both governments "have a duty to ensure the integrity of the electoral process and its effective administration, especially in light of the current political climate that is threatening the functioning of elections"
Table 6: The first output from FUDGE when using the prefix "The issue focused on" for two topics. We use red to highlight words in the given bag of words W along with obvious forms (e.g., plurals), and cyan for other on-topic words, including related words not in the heldout bag W. More examples in Appendix J.
FINETUNE also suffers from repetition, which appears to be the result of distribution shift from finetuning. Our fine-tuning dataset was built by sampling directly from the original P (X) modeled by G to mitigate distribution shift, but it is well-known that language model generations are more repetitive than natural language (Holtzman et al., 2018, 2019). We hypothesize that FINETUNE, being finetuned on language model generations rather than natural language, amplifies this repetitiveness. This repetition is reflected in the poor grammaticality for both FINETUNE and especially WDEC. In contrast, FUDGE does not touch the original P (X), largely avoiding FINETUNE's distribution shift problem on this task.
Finally, FUDGE outperforms the strong gradient-based PPLM method, despite requiring access only to G's output logits. Non-reliance on gradients means FUDGE is also many times faster than PPLM, which takes a few hours compared to FUDGE's 15 minutes for the full set of 420 generations on our hardware. Sometimes we do not even have gradients: for example, gradients are unavailable in the API for GPT3 at time of writing.

Machine Translation Formality Change
Finally, we turn to a somewhat more challenging task, changing formality in machine translation, specifically from informal to formal. Given a source sentence written in an informal and conversational style, the goal is to output a translation which is also more formal. We test on the Fisher and CALLHOME Spanish-English Speech [...] Table 7 for an example. Our task is to translate the original informal Spanish into more formal English. However, we assume that Salesky et al. (2019)'s fluent references are unavailable during training.
Table 7:
entonces de verdad sí sí pero entonces tu estudiando para es es digo es más porque es exactamente
Then, if it's business, but then you are a student for a PHD, the Master's is that exactly.
If it's business, then you are a student for a PhD. The masters is exactly that.
Metrics. The desired attribute a is formality, but we cannot sacrifice the source sentence's meaning. The latter requirement makes generation more constrained than in the couplet and topic tasks, so perplexity and distinctness are less relevant. Instead, we use the following: [...]
As shown in Table 8, FUDGE increases the formality of outputs compared to G, even though the test-time formality predictor is trained on a different domain (Family/Relationships, rather than Entertainment/Music). Note that formality unsurprisingly decreases after fine-tuning G, simply due to the informality of the fine-tuning dataset. As in the couplet task, one could adjust the strength of the formality control in FUDGE, although this is unprincipled from the view of modeling P (X|a).
Moreover, while FUDGE and G achieve similar BLEU after fine-tuning G, FUDGE achieves higher BLEU compared to G when G is not fine-tuned on the Fisher training set. In the latter case, controlling for formality somewhat remedies the struggles of G when not fine-tuned on such disfluent text.
In contrast, the G + ST baseline achieves near-perfect formality but less than half the BLEU of G, due to the style transfer model overfitting to the GYAFC Entertainment/Music dataset. This is similar to the distribution shift issue that we observed in topic control for FINETUNE, an issue which FUDGE largely avoids. Nevertheless, there remains substantial room for improvement on this difficult task.

Discussion
FUDGE is a principled approach to controlled text generation which models P (X|a) by closely following a Bayesian factorization, thus preserving the base P (X) as much as possible. FUDGE achieves strong performance on a wide range of different tasks: poetry couplet completion, topic control, and informal-to-formal machine translation. Additionally, FUDGE can easily compose different attributes in a modular fashion: the meter, rhyme, and end-of-sentence constraints for couplet completion, and the individual words within each topic bag for topic control. In principle, FUDGE is applicable to any controlled generation task where we can train discriminators for the desired attribute or attributes.

Ethics of Controlled Text Generation
We recognize that strong controlled generation methods have the potential to produce harmful outputs and/or misinformation when used adversarially (Wallace et al., 2019,2020

A Details of F for Couplet Completion
We provide the full details of the function F we use to check iambic pentameter, rhyme, and sentence-ending in our couplet completion task. Note that iambic pentameter consists of two components: iambic meter as well as containing exactly ten syllables.
1. Iambic meter: Given a phrase, we obtain the sequence of stresses (0 for unstressed, 1 for stressed, 2 for secondary stress) for each word, according to the CMU Pronouncing Dictionary (Weide, 1998). If any word does not exist in the dictionary (almost never for non-Shakespeare methods) we return False. We treat 2 as ambiguous stress, and additionally change 1 to 2 for any monosyllabic words, i.e., we allow monosyllabic stressed words to be unstressed but not vice versa. Finally, we check that all syllables at even indices (0-indexed) are unstressed or ambiguous, and all syllables at odd indices are stressed or ambiguous.
2. Number of syllables: We count the number of syllables in each word based on the number of stresses according to the CMU Pronouncing Dictionary. If a word does not exist in the dictionary, we estimate the number of syllables by rounding the number of letters divided by 3 to the nearest integer.
3. Rhyme: Two words rhyme if and only if they both exist in the CMU Pronouncing Dictionary and are a perfect rhyme according to the dictionary.
4. Sentence-ending: We check if the output ends with a period, question mark, or exclamation mark.
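The meter, syllable-count, and sentence-ending checks listed above can be sketched as follows. This is a hedged illustration rather than the checker used in our experiments: it takes per-word stress lists as input instead of looking them up in the CMU Pronouncing Dictionary, omits the rhyme check (which needs the dictionary's pronunciations), and ignores trailing whitespace when checking punctuation. All function names and the hand-written stress patterns are assumptions made for the example.

```python
def effective_stresses(word_stresses):
    """Flatten per-word stress lists (0 unstressed, 1 stressed, 2 ambiguous).

    Monosyllabic stressed words are relabeled as ambiguous (1 -> 2).
    """
    flat = []
    for stresses in word_stresses:
        if len(stresses) == 1 and stresses[0] == 1:
            stresses = [2]
        flat.extend(stresses)
    return flat

def is_iambic(word_stresses):
    """Even (0-indexed) syllables must be 0 or 2; odd syllables must be 1 or 2."""
    flat = effective_stresses(word_stresses)
    return all(s in ((0, 2) if i % 2 == 0 else (1, 2)) for i, s in enumerate(flat))

def has_ten_syllables(word_stresses):
    """Pentameter component: exactly ten syllables in total."""
    return sum(len(s) for s in word_stresses) == 10

def estimate_syllables(word):
    """Fallback for out-of-dictionary words: letter count / 3, rounded to nearest."""
    letters = sum(ch.isalpha() for ch in word)
    return round(letters / 3)

def ends_sentence(text):
    """True if the text ends with a period, question mark, or exclamation mark."""
    return text.rstrip().endswith((".", "?", "!"))

# Hand-written stress patterns for "upon the night" (illustrative input only).
iambic_phrase = [[0, 1], [0], [1]]
# A phrase opening on a stressed syllable of a two-syllable word fails the check.
trochaic_phrase = [[1, 0], [0, 1]]
```

For example, `is_iambic(iambic_phrase)` holds while `is_iambic(trochaic_phrase)` does not, because a stressed syllable of a polysyllabic word at index 0 cannot be treated as ambiguous.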
Of course, both FUDGE and FINETUNE will fit to whatever output is given by F. The purpose of the couplet task is to check FUDGE's ability to fit a difficult well-formedness constraint. We simply design an F that corresponds to true iambic pentameter rhymes in most cases.

A.1 Shakespeare Evaluation
Shakespeare himself performs somewhat poorly according to F, which is designed with the automated baselines in mind, not for Shakespeare. (The same is true for our grammaticality and perplexity metrics.) One source of error is words which are out-of-vocabulary for the CMU Pronouncing Dictionary. Such words are almost never generated by either FUDGE or our automated baselines, but appear in a fifth of Shakespeare's lines, resulting in failures on the iambic meter and syllable checks.
Nevertheless, most of Shakespeare's "errors" are the result of real, though slight, deviations from our very strict definitions of meter and rhyme. In particular, he frequently (1) elides syllables to fit meter, and (2) uses loose rhymes; both "error" types are likely exacerbated by differences between archaic and modern pronunciations. The example in Table 10 illustrates both types of "errors." Although such deviations are often acceptable to a human, they are difficult to capture in an automatic metric, and we do not allow such deviations in F. Again, Shakespeare is only included as a whimsical point of reference, and not as a serious baseline to be compared to.
But here's the joy; my friend and I are one;
Sweet flattery! then she loves but me alone.
Table 10: An example couplet by William Shakespeare, illustrating two common deviations from the narrow definition of correctness we use in F. For this example to follow iambic meter, one must read "flattery" in only two syllables. Moreover, "one/alone" is a loose (non-perfect) rhyme, at least in modern English.

B PPLM Baseline in Machine Translation
As discussed in the main text, it is difficult to apply PPLM in our machine translation setup, in which P (a|X) is learned from an English formality dataset without parallel Spanish. Since P (X) is a Spanish-English translation model, we must obtain hidden states for training PPLM's P (a|X) by first "backtranslating" English into Spanish, accessing a second pretrained translator. For this purpose we use a second pretrained Marian transformer from HuggingFace (https://huggingface.co/Helsinki-NLP/opus-mt-en-es). Additionally, we needed to tune their suggested hyperparameters.
During evaluation, we observe that PPLM makes some reasonable modifications for formality compared to the base P (X), like changing "hard" to "difficult," but such improvements are also accompanied by occasional disfluencies and/or repetitions (although such problems plague all methods to some degree). Overall, while PPLM achieves similar BLEU to FUDGE, it is substantially less formal (Table 11).

C Hyperparameter Choices
FUDGE has essentially one hyperparameter in our topic control task, λ, which controls the strength of conditioning and corresponds to the number of words in the bag which should appear in the future.
To choose λ in topic control, we used a separate validation bag of words (on the topic of fantasy; Appendix K.4) to select a reasonable λ for our main paper experiments (λ = 4). Unlike in the main paper where we use heldout bags W to measure success, during validation we simply use the original bag. We use a set of 60 generations, considering values ranging from 1 to 6 (Table 12), although the result may be somewhat noisy. Of course, different choices of λ result in different tradeoffs (Appendix G).
We also optimized the conditioning strength λ WDEC for the WDEC baseline on the same fantasy bag of words, considering values ranging from 1 to 32. We selected the only value (4) which achieved reasonable success without a total collapse in diversity (Table 13), but diversity still collapsed when tested on our seven main test bags of words.
We do not optimize any model hyperparameters in the couplet completion and informal-to-formal translation tasks. LSTMs and feedforward networks are 3 layers (including the output layer of dimension 1) and 300-dimensional unless otherwise specified. They are bidirectional (150-dimensional in each direction) for the couplet rhyme predictor and the topic control future words predictor, and otherwise unidirectional. Attention mechanisms use key-query-value attention. For the rhyme and future words predictors the output hidden state is multiplied element-wise by the embedding of the rhyme sound or future word, then concatenated to the embeddings, before the final feedforward network. Since a selling point of our method is the lightweight process of constructing and training predictors, noise-contrastive estimation is a natural choice for the rhyme and future word predictors: we avoid softmaxing over the output dimension during training. (This is primarily relevant for the future word predictor, as the number of distinct rhyme sounds is not too large, but we use noise-contrastive estimation for both for consistency's sake.) For the PPLM baseline, we used step size 0.01 for both couplet completion and MT after tuning, and kept their other hyperparameters fixed. For topic control we simply evaluated their provided generations instead of rerunning their model.

D Ablations on Predictor Architectures
Some variation in predictor architectures is necessary due to the diversity of our tasks (as evidenced by the difficulties in adapting PPLM). Specifically, while our core predictor architecture is word embeddings followed by LSTM and output layer, task-specific architectures vary because some "predictors" are actually families of related predictors. We model such families as a single predictor taking additional input (e.g., rhyme sound in poetry); this is needed in our poetry and topic tasks.
On these two tasks, we provide ablations with more homogenized predictors: additional inputs are simply embedded and concatenated to each input word embedding. The difference is relatively small in both cases (Tables 14 and 15). FUDGE-MOD indicates the ablated version of FUDGE.

E Alternative Perplexity Measurements
On the couplet completion task, we additionally measure perplexity using Transformer-XL (Dai et al., 2019b) and using a GPT model fine-tuned on Shakespearean language as generated by (Lau et al., 2018). We measure using Transformer-XL on the topic control task as well. Relative perplexities between most models remain largely similar when switching between GPT and Transformer-XL, with a few exceptions. Compared to the base GPT, Shakespeare's perplexity naturally decreases while other models' perplexities increase when measured with Shakespeare-finetuned GPT. The highly repetitive and disfluent WDEC baseline is rightly punished for this behavior when measured by Transformer-XL. PPLM also obtains slightly lower perplexity than FUDGE on topic control when

F Statistical Significance
In couplet completion, FUDGE outperforms the strongest automated baseline (FINETUNE) on success rate with p < 0.0001 on a McNemar test, pairing the generations for each Shakespeare prefix.
In topic control, FUDGE outperforms the strongest automated baseline PPLM with p = 0.04 using a Wilcoxon matched pairs test, pairing the generations for topic-prefix combinations.
In translation formality, FUDGE's generations are more formal than those of the base G with p < 0.0001 according to a paired t-test.
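As a concrete illustration of the paired comparison used for couplet completion, the sketch below computes an exact two-sided McNemar p-value from per-prefix success indicators. It is a minimal sketch under stated assumptions: the function name `mcnemar_exact` is our own, and we assume the exact binomial variant of the test, since the text does not say whether the exact or the chi-square version was used.

```python
from math import comb


def mcnemar_exact(successes_a, successes_b):
    """Exact two-sided McNemar test on paired boolean outcomes.

    successes_a[i] and successes_b[i] are the outcomes of two methods on the
    same item (here: the same Shakespeare prefix).
    """
    if len(successes_a) != len(successes_b):
        raise ValueError("paired samples must have the same length")
    # Only discordant pairs (one method succeeds, the other fails) matter.
    a_only = sum(1 for a, b in zip(successes_a, successes_b) if a and not b)
    b_only = sum(1 for a, b in zip(successes_a, successes_b) if b and not a)
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(a_only, b_only)
    # Under the null hypothesis each discordant pair favors either method
    # with probability 0.5, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)
```

With eight discordant pairs that all favor the same method, the function returns 2 / 2^8 = 0.0078125.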
Space: The issue focused on the new, higher level of control that NASA had in the space shuttle program.\n\n"The question of how far the U.S. government can extend its jurisdiction in space was raised," Mr. Smith said.\n\nNASA's role has become increasingly important in the 21st century in part because of the growth in space activities. The space shuttle program began in 1977 with
Politics: The issue focused on how much power each company was willing to use in response to the request.\n\nAccording to the complaint, Comcast has not been forthcoming with any data, such as how often it uses the technology, and what it has paid for it, in order to meet the FCC's mandate to make its own data more accessible.\n\nAnd, according to the suit, the company also
Military: The issue focused on the use of force by the armed forces and police, as well as the use of lethal force by civilians.\n\nThe bill would require that a shooting occur "with reasonable care," meaning a shot was "justified" under the circumstances of the case and not in retaliation for an act of violence, and that a shooting was "necessary for the safety of the officer or the

Table 18: The first generation by FUDGE using λ = 2 on the space, politics, and military topics given the prefix "The issue focused on." Words in the given bag are highlighted in red, and other related words in cyan.

G Effect of Varying Topic Control Strength
Although we use λ = 4 for FUDGE in our main paper experiments for topic control, we experiment here with varying the conditioning strength. Specifically, we experiment with λ = 2 and λ = 8.
The conditioning is unsurprisingly stronger as λ increases, as shown quantitatively in Table 19, although the perplexity increases as well. We also provide some example generations for λ = 2 and λ = 8 in Tables 18 and 21, for the same prompts and topics as in Table 6 for λ = 4 in the main text. The λ = 8 generations remain mostly fluent and interesting, despite their worse grammaticality and perplexity.

H Effect of Varying Candidate Pruning
For computational efficiency, we only feed the top 200 candidates returned by G into FUDGE's predictor when predicting each next token. Here, we ablate on this number in our topic control setting, testing 100 and 400 (Table 20).
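The pruning step can be sketched as follows: keep only the top-k tokens under G, add the predictor's log-probability of the attribute for each candidate continuation, and renormalize. This is a minimal sketch, not the authors' code. `attr_log_prob` is a hypothetical callable standing in for the predictor, and the `strength` multiplier is an illustrative assumption about how a conditioning strength such as λ could enter; its exact role in the paper may differ.

```python
import math


def fudge_next_token_probs(logits, attr_log_prob, top_k=200, strength=1.0):
    """Reweight G's top-k next-token candidates by an attribute predictor.

    logits: list of G's output logits over the vocabulary.
    attr_log_prob: hypothetical callable mapping a candidate token id to
        log P(a | prefix + token) under the predictor.
    Returns a dict {token_id: probability} over the kept candidates.
    """
    # Keep only the top_k candidates under G, so the predictor is evaluated
    # on at most top_k continuations per step.
    candidates = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    # Logits differ from log-probabilities by a constant, which cancels on
    # renormalization, so adding the predictor's log-probability to each
    # logit gives a score proportional to P(token | prefix) * P(a | prefix + token).
    scores = {i: logits[i] + strength * attr_log_prob(i) for i in candidates}
    max_score = max(scores.values())
    normalizer = sum(math.exp(s - max_score) for s in scores.values())
    return {i: math.exp(s - max_score) / normalizer for i, s in scores.items()}
```

A smaller `top_k` reduces predictor calls per step at the cost of discarding tokens that G ranks low but the predictor would favor, which is the tradeoff the ablation examines.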

I Additional Couplet Completion Examples
We provide some additional examples of FUDGE and baselines on our couplet completion task in Table 22. We also show some unsuccessful examples for FUDGE in Table 23. Overall, we find that most errors are due to the rhyme and ten-syllable end of sentence constraints, or due to Shakespeare's prefix ending in a word not in the CMU Pronouncing Dictionary (e.g., "prognosticate" in the table). FUDGE also sometimes overgenerates punctuation at the end of a sentence.

J Additional Topic Control Examples
In Tables 24, 25, and 26 we show additional example generations by our method using the same

Space: The issue focused on the size of NASA's satellite telescope that is being used to observe the universe.\n\nThe telescope will be the world's largest when it is completed in 2022.\n\nThe US space agency wants to put the telescope into a new orbit around the planet.\n\nThe Hubble Space Telescope orbits in an elliptical orbit, which puts the telescope into a "cross-path"
Politics: The issue focused on the power, independence and independence of the federal judiciary. In its ruling, the three-judge "progressive" panel of the 10th Circuit of the U.S. Court of Appeals for the 10th Circuit found that the "Supreme Court lacks the requisite power to make decisions on the constitutionality of any particular federal regulation, including the power to make the final determination
Military: The issue focused on the military wing of the U.S. Navy that manages ships to the surface of a seabed, the USS Ponce, which is carrying a guided-missile destroyer. A small accident with the weapon's guidance system forced the carrier's "supergun" system to be deactivated, resulting in a loss of a "kill ship" role from the fleet. The

Table 21: The first generation by FUDGE using λ = 8 on the space, politics, and military topics given the prefix "The issue focused on." Words in the given bag are highlighted in red, and other related words in cyan.
Prefix: Pity the world, or else this glutton be,
G: (1) to see that you are not to be ashamed.
CCLM: or else it is a human being. He.
PPLM: ..." He continued, "I have seen, and heard of the
FUDGE: and so I do; and so I do, as be.
Shakespeare: To eat the world's due, by the grave and thee.

Prefix: But flowers distill'd though they with winter meet,
G: and with their sweet, sweet love; The snow-white rose
CCLM: the flowers are not quite so sweet or sweet.
PPLM: 'Twas a great feast, and merry company all
FUDGE: (I will be here to see her) let my sweet.
Shakespeare: Leese but their show; their substance still lives sweet.

Prefix: And nothing 'gainst Time's scythe can make defence
G: A better foe to be slain, than Death, and all
CCLM: a better weapon than it is. That defence
PPLM: I am so glad that I have found a new home
FUDGE: worth more. That's how much time is on the fence.
Shakespeare: Save breed, to brave him when he takes thee hence.

Space: More importantly, the E4E-R-E-S-T report finds, the greatest threat to Earth's existence comes from the human race's inability to adapt: "It is clear from E4E's analysis that a lack of knowledge about how to manage an expanding world and to adapt to changing climatic conditions poses a serious challenge to our ability to sustain life on planet Earth.
Politics: More importantly, in an effort to preserve the historical integrity of the state's judicial system, the state also needs to ensure its integrity within the larger American political system through fair, transparent, and competitive elections. In other words, a system based upon meritocracy and equality for all candidates, voters, candidates and parties.\n\nThe American people have a right to know whether the current system for electing state
Military: More importantly, the military has a great deal of leverage with its own soldiers and their superiors, and its willingness to use that leverage to force an immediate end to this practice of indefinite detention and indefinite imprisonment was demonstrated by the recent court order issued by the US District Court for the District of Columbia, which requires the release of an Iraqi-American held in an indefinite military detentions facility and a detainee
Legal: More importantly, in an effort to make the case that the law was needed because of its effects on the state's economy, the law's supporters claimed that the "death penalty was necessary to protect public safety." (The argument that the punishment was needed because it was needed to prevent certain crimes was rejected by the Court.) The state's argument was that the death penalty violated constitutional guarantees of due process,
Science: More importantly, it is the role of the C.S.A. to develop the technology to use such signals for its national defense, which the agency is doing through the fusion centers. It has been a longstanding goal of the C.S.A. to develop a fusion center that would be capable of processing such signals as well as to develop technologies to use them in other contexts. In recent
Religion: More importantly, the nature of the act of creation is seen differently in different traditions. In Islam, a Muslim woman's choice to cover up her face in front of other Muslims is seen as idolatry. This is because Islam prohibits the practice of covering up the faces of other women. In Christianity, the practice of covering up a woman's face is seen as idolatry. This is because
Computers: More importantly, it shows the complete inability of an entire system to provide a fair shot and fair share of the pie for a large and diverse pool of users who are not only using the platform in a diverse range of contexts: it is a system that refuses to consider the many different ways in which a user may use the platform, including the many ways a user might engage with the site.\n\n

Table 24: Generations starting with "More importantly," by FUDGE. The first generation is selected for each prefix. The space example is somewhat tangential, while the other six are on topic. Words in the given bag are highlighted in red, and other related words in cyan.

Space: It has been shown in a pilot study in the United States and in an earlier pilot study in Europe that a combination of an advanced technology, including a laser and high-frequency pulsed light, was able to induce spontaneous cell death, which could be detected using an electroencephalogram (EEG).\n\n"Our findings indicate the potential use of a small-scale laser to generate a
Politics: It has been shown that the "no" movement in France is growing, as evidenced by the increase in the vote in the national assembly on May 7th, 2012. As of now, it is a minority, and its support is shrinking with each passing day. The "no" movement has the potential to take over the government of the French Republic.\n\nIn the past years, France's
Military: It has been shown in several other laboratories that, while anaerobic digester systems, such as those deployed in the United States by Cummins, use a different and potentially safer process to extract and recycle the waste, their operation is also far more dangerous.\n\n"We had a blast at Cummins and they are a very good company. They were very, very quick to come up with
Legal: It has been shown to be the case that a person with a criminal record is more likely to be a victim of domestic abuse and to experience more violence than the general population.\n\n"Domestic abuse, whether a family member, a current or former partner or a stranger, can have devastating effects, not only on the person, but on their partner and others in their home.\n\n"
Science: It has been shown to increase the efficiency of the central nerve fibers by as much as 50% in a single operation [11]. The results of the present experiments show that it is possible to activate the central nervous system by using nanomaterials in a novel fashion and to produce a therapeutic effect on various neurological diseases by the action of a single compound.\n\nIn this study, the novel chemical
Religion: It has been shown that, once you become a devout Muslim, there will be an increase in your own religiosity. It can be seen from the following quote: "Islam was the religion that brought the first Muslims to Europe, and it has been the religion that will bring the first Muslims to the Americas."\n\nI have heard a number of people tell me that their religion is based on a
Computers: It has been shown using a simple and reliable approach that when the right and left sides of the network are connected by a simple method, the network will become stronger.\n\nIn the network, a network of nodes is connected with each of them receiving the information from a node that is a neighbor of the node.\n\nThe neighbor node of the node receiving the information from the neighbor node is an

Table 25: Generations starting with "It has been shown" by FUDGE. The first generation is selected for each prefix. The space example seems unrelated and the military example is somewhat tangential, while the other five are on topic. Words in the given bag are highlighted in red, and other related words in cyan.
Space: To review, the plot is that a new Earth was discovered, and a group of scientists, led by the late Dr. Robert Zubrin, begin work. Their plan involves the creation of a giant space station called Orion, to be built in orbit to study the new Earth. The plan, however, has the unexpected side-effect of creating an artificial gravity well, which is then used to create
Politics: To review, the central issue in the case of the "Babylonian" text is the legitimacy of the text's existence, since it is based on an earlier, more primitive, text that was already in existence at the time of the Babylonians. It would therefore be wrong to conclude that the "Babylonian" text is an "authentic" document, since it shares certain
Military: To review, an army officer is an officer who has a direct, practical and active role in the development, execution and execution of war plans, and, in particular, in carrying out operations of combat importance."\n\n"The military has a right to the exercise of its authority to carry out a range of operations, including the use of lethal force, against a hostile civilian population. The right of
Legal: To review, no one in their right mind should have accused them of lying about this.\n\n"No one has a legal right to lie, but it is possible for people to lie if the facts do not support the allegation."\n\nBut a spokesman for the Attorney General, Dominic Grieve, said the case was "extremely difficult" and that a judge must consider "the full range of
Science: To review, the following are relevant:\n\nA) It was reported in Science that the study "is the first to show that an early age at conception can alter the brain structure of the brain-damaged."\n\nB) The study "found brain abnormalities in the hippocampus-which is a key part of memory and learning-when an individual was exposed to a high-risk pregnancy or
Religion: To review, "God is a God of peace" is a simple concept to understand without understanding the meaning of "God is love." The phrase was popularized by a popular television show called, "God is Love." However, "God is a God of Peace" is not a "God of Peace" phrase. "God is a God of Peace" refers to Jesus Christ as the "Son
Computers: To review, if you are an active user, you can now download and install the latest version without using Google Play Store, you can now install a custom app on your phone without needing to open the Play Store, you can now run any of these custom apps without being restricted by Android's app permissions.\n\n\nFor the most part, we believe that the new features are an awesome addition to Android

Table 26: Generations starting with "To review," by FUDGE. The first generation is selected for each prefix. The politics example is somewhat tangential but the other six are on topic. Words in the given bag are highlighted in red, and other related words in cyan.
Space: In summary, the following are the major findings of the study: Women who use condoms at least one time per year are more likely to use contraception than those who do not. Women who use condoms at least one time per year are more likely to use contraception than those who do not. Condoms may decrease a woman's risk of getting sexually transmitted infection (STI). This is
Politics: In summary, I believe it's the same with the current state of the industry and it's not a question of whether there are enough people, but how much of what we are producing is actually worth our own time. We must not forget that the majority of our readership consists of people who are not interested in learning more about the world of the internet or in the internet in general. They want
Military: In summary, this study demonstrated that the use of marijuana was associated with significantly lower risk of depression among college students. In addition, we identified that the association was independent of the use of other drugs, and thus may be due to the absence of confounding by other drug use and its impact on depression. We also identified significant associations between marijuana use and depression that are independent of other drug use and its impact
Legal: In summary, the new data suggests that women are increasingly being judged by their appearance and that, overall, they are less attractive than men.<|endoftext|>In order to keep our readers from feeling overwhelmed or discouraged by the latest news, the Daily Dot takes some time to explain the basics of what you can expect to see in the world of eSports. 1. The eSports scene is growing The
Science: In summary, the most significant change was in the way we think about what is "true." The word "truth" has lost all meaning. In the past, people were told they were "telling the truth." Today, the only thing we think "truth" means are lies, and we have no idea whether we are telling a truth or not. The term "truth" has lost its meaning
Religion: In summary, the authors concluded that the findings of the current study indicate that, "The effects of the dietary supplement have been underestimated because they do not include all dietary components, including dietary fiber." The authors conclude: "Based on these results, we propose that, in order to fully understand the impact of dietary fiber on the metabolic syndrome, it is necessary to examine all dietary components, including
Computers: In summary, there is a lot of information available about the relationship between brain function during sleep and cognition, but it is still unclear whether these results are due more to differences in the brain's structure versus the way it works under the influence of sleep, or whether there may be some other underlying factor that is involved. We hope this review will contribute to this debate and to future research to shed more light

Table 27: Generations starting with "In summary," by G. Note G does not actually use the conditioning information. Words in the given bag are highlighted in red, and other related words in cyan.
Space: In summary, the two are similar and they are both capable of performing similar operations. This means that the main advantage of both of them is their speed: they both use the same power. In addition to their speed, the two are equipped with various special abilities, such as the Power of the Sun's Light. In addition, both of them use various different abilities to their advantage. In order to
Politics: In summary, the U.K.'s economic policy is largely about economic growth, rather than a political one. Indeed it is much more important for governments to have access to wealth that can be easily earned and managed. The British Empire and Britain and other countries have been able to do that by creating a free market economy for workers and businesses. The fact that the British Empire and Britain have been able
Military: In summary, the current trend for the United States is a clear example of an economic crisis that has created many major economic and social problems. This is particularly true in countries like China and South Korea that have experienced a period of extreme unemployment and low incomes and are facing an uncertain and volatile climate. We will also note that the recent slump of prices of natural gas has been accompanied by a sharp increase
Legal: In summary, the government has proposed to the courts that the government will not be able to make decisions on this matter until the Court decides to grant or reject it. This is what the government is going to do. It is going to take an approach that is very different from the government's and that is not the law. They will try to take a different approach from the government's. If
Science: In summary, the results suggest that a high prevalence of breast cancer was found in the general population. The results do not indicate the extent to which breast cancer prevalence can differ between individuals.The authors also note that breast cancer prevalence may be greater among women who have been diagnosed with breast cancer than among those who have never been diagnosed with breast cancer. However, the evidence on the effects of breast
Religion: In summary, if the first person you see is an older person, or is the youngest person who is, then you will see the first person you will hear from. If the second person you would like to see is someone who is about to enter into a marriage with someone, or is the youngest person you would like to see, then you will hear the second person you would like to hear from!
Computers: In summary, you need to do some basic math before you even get a "good" answer.The first thing we have to consider is whether the "good" answer is really that simple. A good answer is the one you want to get right. The "bad" answer is the one you don't like. So for now the good answer is: If you have a question, you

Table 28: Generations starting with "In summary," by FINETUNE. The text is often repetitive, while often being off topic. Words in the given bag are highlighted in red, and other related words in cyan.
Space: In summary, the study found that women who were more likely to be obese and/or obese-ish in childhood were also more likely to have been overweight in adulthood. That's because they were more likely to experience negative childhood experiences that could cause them to gain body fat, which would later lead to later obesity. It's important to note that these researchers didn't actually examine the effects of Politics: In summary, the government must state clearly that the tax authority has authority to tax imports of imports from imports of imports of imports. It must state in writing that imports of imports of imports of imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports imports Military: In summary, I've been using the company for a long time and have never been dissatisfied with my purchase experience. I have a company account with mine and have not had any major issues, which is good since mine was a little expensive. I've also had the service company staff service my order and leave me peace of mind. I'm very pleased with the service I received and will be buying another Legal: In summary, there will be law law enforcement law enforcement law enforcement law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law law Science: In summary: -the data for data_id is not available in data_list data_id data_list data_list data_id data_list data_list data_id data_list data_list data_list data_id I've used data_id data. 
It's a variable name, so it doesn't matter how big data_id actually is Religion: In summary, yin yang yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin yin y Computers: In summary, the key development this process required was to identify key data sources that could be utilized to document key data security data security data security data data security data security data security data security data security data security data The software platform platform platform platform platform platform platform security platform security platform platform platform platform platform platform platform platform platform platform platform platform platform platform platform platform platform platform platform platform platform platform platform platform Table 29: Generations starting with "In summary," by WDEC. The text frequently degenerates into repeating words in the given bag, despite previously tuning on a validation bag of words on the fantasy topic. Words in the given bag are highlighted in red, and other related words in cyan.

K Topic Control Bags of Words and Prefixes
We use the exact same bags of words and prefixes as Dathathri et al. (2019) for their topic control task, with non-proper nouns lower-cased (in practice, this only changes the religion wordlist). Note our success metric in the paper matches without casing. We additionally provide the heldout bags of words computed from the original bags (before lower-casing), which we use for the success metric. Although a few words deviate somewhat ("actress" as a synonym for "star" in the space category), overall the heldout bags do represent the desired topic.
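To make "matches without casing" concrete, the sketch below counts how many tokens of a generation appear in a bag of words after lower-casing both sides. It illustrates only the case-insensitive matching step. The function name, the regular-expression tokenizer, and the example bag are hypothetical, and the full success metric is the one defined in the main text.

```python
import re


def count_bag_matches(text, bag):
    """Count tokens in `text` that appear in `bag`, ignoring case."""
    bag_lower = {word.lower() for word in bag}
    # Simple illustrative tokenizer: runs of letters and apostrophes.
    tokens = re.findall(r"[a-z']+", text.lower())
    return sum(1 for token in tokens if token in bag_lower)


# Hypothetical bag and generation, for illustration only.
example_bag = ["Satellite", "orbit"]
example_text = "NASA launched a satellite. The Satellite orbits."
print(count_bag_matches(example_text, example_bag))
```

Note that this exact-token matching does not count inflected forms such as "orbits" for the bag word "orbit".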
Finally, we provide the fantasy bag of words used for selecting the λ and λ WDEC conditioning strengths for FUDGE and WDEC respectively. It is also taken directly from Dathathri et al. (2019).