Posterior Control of Blackbox Generation

Text generation often requires high-precision output that obeys task-specific rules. This fine-grained control is difficult to enforce with off-the-shelf deep learning models. In this work, we consider augmenting neural generation models with discrete control states learned through a structured latent-variable approach. Under this formulation, task-specific knowledge can be encoded through a range of rich, posterior constraints that are effectively trained into the model. This approach allows users to ground internal model decisions based on prior knowledge, without sacrificing the representational power of neural generative models. Experiments consider applications of this approach for text generation. We find that this method improves over standard benchmarks, while also providing fine-grained control.


Introduction
A core challenge in using deep learning for NLP is developing methods that allow for controlled output while maintaining the broad coverage of data-driven methods. While this issue is less problematic in classification tasks, it has hampered the deployment of systems for conditional natural language generation (NLG), where users often need to control output through task-specific knowledge or plans. While there have been significant improvements in generation quality from automatic systems (Mei et al., 2016; Dusek and Jurcicek, 2016; Lebret et al., 2016b), these methods are still far from being able to produce controlled output (Wiseman et al., 2017). Recent state-of-the-art systems have even begun to utilize manual control through rule-based planning modules (Moryossef et al., 2019; Puduppully et al., 2019).
Consider the case of encoder-decoder models for generation, built with RNNs or transformers.
These models generate fluent output and provide flexible representations of their conditioning. Unfortunately, auto-regressive decoders are also globally dependent, which makes it challenging to incorporate domain constraints.
Research into controllable deep models aims to circumvent the all-or-nothing dependency tradeoff of encoder-decoder systems and expose explicit higher-level decisions. One line of research has looked at global control states that represent sentence-level properties for the full decoder. For example, Hu et al. (2017) use generative adversarial networks where the attributes of the text (e.g., sentiment, tense) are exposed. Another line of research exposes fine-level properties, such as phrase type, but requires factoring the decoder to expose local decisions, e.g., Wiseman et al. (2018).
This work proposes a method for augmenting any neural decoder architecture to incorporate fine-grained control states. The approach first modifies training to incorporate structured latent control variables. Then, training constraints are added to anchor the state values to problem-specific knowledge. At test time, the control states can be ignored or utilized as grounding for test-time constraints. Technically, the approach builds on recent advances in structured amortized variational inference to enforce additional constraints on the learned distribution. These constraints are enforced through efficient structured posterior calculations and do not hamper modeling power.
We demonstrate that the method can improve accuracy and control while utilizing a range of different posterior constraints. In particular, on two large-scale data-to-text generation datasets, E2E (Novikova et al., 2017) and WikiBio (Lebret et al., 2016a), our method increases the performance of benchmark systems while also producing outputs that respect the grounded control states. Our code is available at https://github.com/XiangLi1999/.

Several recent works have shown methods for effectively fitting neural models with structured variational inference (Johnson et al., 2016; Krishnan et al., 2017; Kim et al., 2019). We therefore use these techniques as a backbone for enforcing problem-specific control. See §4 for a full description of the variational family used.

Posterior Regularization of Control States
Posterior regularization (PR) is an approach for enforcing soft constraints on the posterior distribution of generative models (Ganchev et al., 2010).
Our goal is to utilize these soft constraints to enforce problem-specific weak supervision. Traditionally, PR uses linear constraints, which in the special case of expectation maximization for exponential families leads to convenient closed-form training updates. As this method does not apply to neural generative models, we resort to gradient-based methods. In this section, we develop a form of posterior regularization that accommodates the neural variational setting.
Starting with the log-likelihood objective, L(θ), PR aims to add distributional constraints on the posterior. These soft constraints are expressed as a distributional penalty, R_p(x, y) ≥ 0. For example, if we have partial information that a specific control state takes on label c, we can add a constraint R_p(x, y) = 1 − p(z_t = c | x, y). We might also consider other distributional properties, for instance penalizing the entropy of a specific posterior marginal, R_p(x, y) = H(z_t | x, y). See §5 for more constraint examples.
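As a concrete illustration, both example penalties can be computed directly from a posterior marginal over a single control state. This is our own sketch with made-up marginal values, not the paper's code:

```python
import math

def label_penalty(marginal, c):
    """R(x, y) = 1 - p(z_t = c | x, y): penalize mass placed off the known label c."""
    return 1.0 - marginal[c]

def entropy_penalty(marginal):
    """R(x, y) = H(z_t | x, y): penalize uncertainty in the posterior marginal."""
    return -sum(p * math.log(p) for p in marginal.values() if p > 0)

# Hypothetical marginal for one control state z_t over three candidate labels.
q_t = {"name": 0.7, "food": 0.2, "other": 0.1}
print(label_penalty(q_t, "name"))  # small: most mass is already on "name"
print(entropy_penalty(q_t))        # moderate: the marginal is not yet peaked
```

Driving either quantity toward zero pushes the posterior toward the supervised label or toward a confident (peaked) marginal, respectively.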
PR uses these soft constraints to regularize the model. Ideally we would penalize the posterior directly, but as noted above, computing this term in a blackbox model is intractable. We therefore follow Ganchev et al. (2010) and use a relaxed version with a surrogate posterior q_φ(z | x, y). We can write this in terms of a variational lower bound (PRLBO) on the relaxed PR objective.
This allows us to relate the q in the PRLBO to the variational posterior in the ELBO. Expanding the KL and rearranging terms gives

PRLBO(θ, φ) = E_{q_φ(z|x,y)}[log p_θ(y, z | x)] + H[q_φ(z | x, y)] − λ R_{q_φ}(x, y).

To train, we jointly maximize the PRLBO over both the model parameters θ and the variational parameters φ (which tightens the bound). Following standard practice, we use an amortized inference network, i.e., a variational autoencoder (Kingma and Welling, 2014; Mnih and Gregor, 2014; Rezende et al., 2014), to define φ.
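Numerically, the bound is just the ELBO with the weighted penalty subtracted. The scalars below are made up for illustration; they are not model outputs:

```python
def prlbo(expected_log_joint, entropy_q, penalty, lam):
    """PRLBO(theta, phi) = E_q[log p(y, z | x)] + H[q] - lambda * R_q(x, y)."""
    return expected_log_joint + entropy_q - lam * penalty

# Hypothetical reconstruction term, posterior entropy, and constraint penalty.
score = prlbo(expected_log_joint=-42.0, entropy_q=3.5, penalty=0.8, lam=2.0)
print(score)  # the lambda-weighted penalty subtracts 1.6 from the ELBO of -38.5
```

Setting λ = 0 recovers the standard ELBO, which is exactly the PC_0 ablation evaluated later in the paper.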

Structured Variational Family for Segmental Generation
We now discuss how to efficiently compute the PRLBO under a structured variational family.
We need a q_φ(z | x, y) for which we can efficiently (1) take samples, (2) compute entropy, and (3) compute the distributional penalties R_{q_φ}(x, y). This motivates the use of a factored conditional random field (CRF), defined by a potential function φ(x, y, z).
At training time, x, y are observed and z is the latent variable that denotes the control states. We then specify a variational posterior distribution: q_φ(z | x, y) = φ(x, y, z) / Σ_{z'} φ(x, y, z'). In this work, we focus on the semi-Markov CRF (Gales and Young, 1993; Sarawagi and Cohen, 2005), a common CRF family used in generation (Wiseman et al., 2018). It divides tokens into segmental spans, which are useful for generating entity mentions and commonly used phrases. The model divides the potential function into three parts: the emission potential for a span of tokens given a state, denoted φ^(e); the transition potential between states, φ^(t); and the length potential of a span length given a state, φ^(l). Suppose our control states define a span from i (inclusive) to j (exclusive) labeled by c; we denote it as z_{i:j} = c. The potential of a labeled sequence is then the product of the emission, transition, and length potentials over its segments. For computational efficiency, we restrict all segment lengths to be ≤ L.

With this model, we can use the forward-backward algorithm for all required inferences: exact sampling, computing the partition function, entropy, and the posterior marginals q_φ(z_{i:j} = c | x, y), which are useful for term (3). In Algorithm 1 we give a generic semi-Markov algorithm (Sarawagi and Cohen, 2005): given φ and a generic semiring (⊕, ⊗, 0, 1), we store two tables β and β', both of size T × |C|. β_t(c) denotes the event that there is a transition at time t from state c; β'_t(c) denotes the event that there is an emission starting from time t at state c. We then have a recursion for β'_t(c) that "sums" over different span lengths, and a recursion for β_t(c) that sums over all different state transitions.

Table 1: Posterior penalties utilized in the One-to-One and One-to-Many settings. These constraints softly enforce an alignment between control states and text spans by penalizing posterior violations. The objective sums over the three R_q in both cases.
The algorithm is generic in the sense that different (⊕, ⊗) operators allow us to compute the different needed terms. For example, computing the partition function Z = Σ_{z'} φ(x, y, z') requires the (+, ×) semiring (Goodman, 1999; Li and Eisner, 2009); the other distributional terms can be computed by the same algorithm with alternative semirings and backpropagation.³
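The recursion and the semiring swap can be sketched as follows. This is our own illustrative forward-pass variant under simplifying assumptions (dense, made-up potentials; start weight of one for the first segment), not the authors' released implementation:

```python
import math

def semimarkov_forward(emit, trans, length, T, C, L, plus, times, zero, one):
    """Generic semi-Markov forward pass over a semiring (plus, times, zero, one).
    alpha[t][c] accumulates all labeled segmentations of positions [0, t) whose
    last segment ends at t in state c; emit[i][j][c], trans[c1][c2], and
    length[l][c] are the three potentials from the text."""
    alpha = [[zero] * C for _ in range(T + 1)]
    for t in range(1, T + 1):
        for c in range(C):
            total = zero
            for l in range(1, min(L, t) + 1):
                s = t - l  # candidate segment covers [s, t)
                seg = times(emit[s][t][c], length[l][c])
                if s == 0:
                    total = plus(total, times(one, seg))  # initial segment
                else:
                    for c2 in range(C):
                        total = plus(total, times(times(alpha[s][c2], trans[c2][c]), seg))
            alpha[t][c] = total
    Z = zero
    for c in range(C):
        Z = plus(Z, alpha[T][c])
    return Z

# Toy potentials (made up): T=3 tokens, C=2 states, max segment length L=2.
T, C, L = 3, 2, 2
emit = [[[0.5 + 0.1 * (i + j + c) for c in range(C)] for j in range(T + 1)]
        for i in range(T + 1)]
trans = [[0.3, 0.7], [0.6, 0.4]]
length = [[1.0, 1.0], [0.9, 0.8], [0.5, 0.6]]  # length[l][c]; row l=0 unused

# (+, x) semiring: the partition function Z.
Z = semimarkov_forward(emit, trans, length, T, C, L,
                       plus=lambda a, b: a + b, times=lambda a, b: a * b,
                       zero=0.0, one=1.0)

# Log semiring (logsumexp, +) on log potentials: log Z.
lg = math.log
logZ = semimarkov_forward(
    [[[lg(emit[i][j][c]) for c in range(C)] for j in range(T + 1)] for i in range(T + 1)],
    [[lg(v) for v in row] for row in trans],
    [[lg(v) for v in row] for row in length],
    T, C, L,
    plus=lambda a, b: max(a, b) + math.log1p(math.exp(-abs(a - b)))
                      if min(a, b) > float("-inf") else max(a, b),
    times=lambda a, b: a + b, zero=float("-inf"), one=0.0)

# Brute-force check: enumerate every segmentation and sum its potential.
def brute_force_Z():
    total = 0.0
    def rec(pos, prev, w):
        nonlocal total
        if pos == T:
            total += w
            return
        for l in range(1, min(L, T - pos) + 1):
            for c in range(C):
                t = trans[prev][c] if prev is not None else 1.0
                rec(pos + l, c, w * t * emit[pos][pos + l][c] * length[l][c])
    rec(0, None, 1.0)
    return total

print(abs(math.exp(logZ) - Z) < 1e-9, abs(brute_force_Z() - Z) < 1e-9)  # True True
```

Running the same recursion with a (max, +) semiring would instead yield the Viterbi score, which is the point of the semiring abstraction.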

Posterior Constraints from Data Alignment
To make the PR model concrete, we consider the problem of incorporating weak supervision from heuristic alignment in a data-to-text generation task. Assume that we are tasked with describing a table x consisting of global field names F, each with a text value v, e.g., x_f = v. Not all global fields may be used in a given x; we use f ∈ x to indicate an active field.

³ We need four terms. (a) The log-partition term log Σ_{z'} φ(x, y, z') requires the log semiring (logsumexp, +); the posterior marginals q(z | x, y) require backpropagating from the log-partition term. (b) The max score max_z φ(x, y, z) requires the (max, +) semiring, and the argmax arg max_z φ(x, y, z) is obtained by (sub)gradient backpropagation. (c) Entropy is computed through an expectation semiring with ⟨p₁, r₁⟩ ⊗ ⟨p₂, r₂⟩ = ⟨p₁p₂, p₁r₂ + p₂r₁⟩ and ⟨p₁, r₁⟩ ⊕ ⟨p₂, r₂⟩ = ⟨p₁ + p₂, r₁ + r₂⟩, with 1 = ⟨1, 0⟩; to initialize, all the emission, transition, and length scores take the form ⟨φ, −φ log φ⟩. The algorithm returns ⟨Z, R⟩, and the true entropy is R/Z + log Z. (d) Exact sampling uses one backward pass and one forward-filtering backward-sampling pass, where the forward pass uses the log-partition semiring and backpropagation is by categorical sampling.
We would like the control states to indicate when each field is used in generation. Our alignment heuristic is that these fields will often be expressed using text identical to that in the table. While this heuristic obviously does not account for all cases, it is very common in natural language generation tasks, as evidenced by the wide use of copy-attention-based approaches (Gu et al., 2016; Gulcehre et al., 2016). To utilize these alignments, we use the notation (i, j, f) ∈ A(x, y) to indicate that a span i:j in the training text y overlaps directly with a field f ∈ x. Table 2 gives an example of the notation.
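A minimal version of this overlap heuristic might look like the following (our own simplified sketch; the paper's actual matching and tokenization rules may differ):

```python
def align(table, tokens):
    """Return {(i, j, f)}: spans tokens[i:j] that exactly match field f's value."""
    alignments = set()
    for f, value in table.items():
        v = value.split()
        for i in range(len(tokens) - len(v) + 1):
            if tokens[i:i + len(v)] == v:
                alignments.add((i, i + len(v), f))
    return alignments

x = {"name": "Clowns", "eatType": "coffee shop", "near": "Clare Hall"}
y = "Clowns is a coffee shop near Clare Hall .".split()
print(sorted(align(x, y)))
# [(0, 1, 'name'), (3, 5, 'eatType'), (6, 8, 'near')]
```

Each returned triple (i, j, f) becomes one unit of weak supervision for the posterior penalties described next.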

One-to-One Constraints
We first consider one-to-one constraints, where we assume that we have a static mapping from fields to states, σ : F → C. Given this mapping, we need to add penalties to encourage the semi-Markov model to overlap with the given weak supervision.
To enforce soft alignments, we define three posterior constraint types and their computation, as shown in Table 1 (Left). The three constraints are: i) Inclusion: if a span in y aligns with a field value f, then that span should be labeled σ(f), the state allocated to that field; ii) Exclusion: a span should only have the state σ(f) if it aligns with a field value of type f; iii) Coverage: the usage count of state σ(f) should be 1 if f ∈ x.
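To make the bookkeeping concrete, here is an illustrative computation of the three penalties from span marginals. The marginal values are made up and the penalty forms are our simplified readings of Table 1, not the paper's exact definitions:

```python
def one_to_one_penalties(q, A, sigma, active_fields):
    """q: {(i, j, c): posterior span marginal q(z_{i:j} = c | x, y)},
    A: alignments {(i, j, f)}, sigma: field -> state, active_fields: f in x."""
    # Inclusion: aligned spans should carry the mapped state's posterior mass.
    inclusion = sum(1.0 - q.get((i, j, sigma[f]), 0.0) for (i, j, f) in A)
    # Exclusion: mapped states should carry no mass outside aligned spans.
    aligned = {(i, j, sigma[f]) for (i, j, f) in A}
    exclusion = sum(p for (i, j, c), p in q.items()
                    if c in sigma.values() and (i, j, c) not in aligned)
    # Coverage: each active field's state should be used about once in expectation.
    coverage = sum(abs(sum(p for (_, _, c), p in q.items() if c == sigma[f]) - 1.0)
                   for f in active_fields)
    return inclusion, exclusion, coverage

sigma = {"name": 0, "near": 1}
A = {(0, 1, "name"), (6, 8, "near")}
q = {(0, 1, 0): 0.9, (6, 8, 1): 0.8, (3, 5, 1): 0.1}  # made-up marginals
inc, exc, cov = one_to_one_penalties(q, A, sigma, active_fields=["name", "near"])
print(round(inc, 6), round(exc, 6), round(cov, 6))  # 0.3 0.1 0.2
```

All three penalties vanish exactly when the posterior puts unit mass on the aligned spans with their mapped states and nowhere else.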
One-to-Many Constraints We also consider the case when it is infeasible to specify a hard mapping σ between the fields and the states.For example, F could be unbounded or large, whereas we hope to keep the cardinality of states small for computational efficiency.
We propose a method of inducing a dynamic soft mapping σ(c | f) as we train the model, and impose constraints on the mapping from table fields to state names. First, we would like the distribution of states given a table field to be consistent, so that one table field is mapped to roughly one state. Second, we want to make use of the state space as much as possible by requiring a diverse usage of states.
In order to enforce these properties, we introduce the dynamic mapping as a second amortized variational distribution σ(c | f; M) = softmax(M_f), which gives the probability that a table field f takes on state c. As shown in Table 1 (Right), we define three constraints that regularize the local q with respect to the global σ: i) Sparsity: each vocabulary entry in σ should have low entropy; ii) Fit: the global σ should represent the state-name distribution of each table field, by minimizing the cross entropy between types σ(c | f) and tokens q(z_{i:j} | x, y) for all (i, j, f) ∈ A(x, y); iii) Diversity: the aggregate state-label distribution over all the tokens in a sentence should have high entropy.
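In the same illustrative spirit (made-up logits and marginals, simplified forms of the Table 1 penalties), the three regularizers reduce to entropy computations over σ and q:

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    s = sum(exps)
    return [e / s for e in exps]

# sigma(c | f; M): hypothetical mapping logits for two fields over three states.
M = {"name": [4.0, 0.0, 0.0], "birth_date": [0.0, 4.0, 0.1]}
sigma = {f: softmax(m) for f, m in M.items()}

# i) Sparsity: each field's state distribution should have low entropy.
sparsity = sum(entropy(p) for p in sigma.values())

# ii) Fit: cross entropy between sigma(. | f) and the span posterior
#     q(z_{i:j} | x, y) for an aligned span (i, j, f); q here is made up.
q_span = [0.85, 0.10, 0.05]
fit = -sum(qc * math.log(sc) for qc, sc in zip(q_span, sigma["name"]))

# iii) Diversity: the aggregate state distribution should have high entropy,
#      so its negative entropy is the penalty.
agg = [sum(p[c] for p in sigma.values()) / len(sigma) for c in range(3)]
diversity = -entropy(agg)

print(sparsity < 2 * math.log(3), fit > 0.0, diversity < 0.0)  # True True True
```

Minimizing sparsity and fit while minimizing the (negative-entropy) diversity term pulls each field toward one dedicated state while spreading fields across the state space.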

Related Work
In addition to previously mentioned work, other researchers have noted the lack of control in deep neural networks and proposed methods at the sentence, word, and phrase levels. For example, Peng et al. (2018) and Luo et al. (2019) control the sentiment in longer-form story generation. Others aim for sentence-level properties such as sentiment, style, tense, and specificity in generative neural models (Hu et al., 2017; Oraby et al., 2018; Zhang et al., 2018; Shen et al., 2017). Closest to this work is that of Wiseman et al. (2018), who control phrase-level content by using a neuralized hidden semi-Markov model for generation itself. Our work differs in that it makes no independence assumption on the decoder model, uses a faster training algorithm, and proposes a specific method for adding constraints. Finally, there is a line of work that manipulates the syntactic structure of generated text using a labeled syntactic attribute (e.g., parses) or an exemplar (Deriu and Cieliebak, 2018; Colin and Gardent, 2018; Iyyer et al., 2018; Chen et al., 2019). While our work uses control states, there is no inherent assumption of compositional syntax or grammar.
Posterior regularization (PR) is mostly used in standard EM settings to impose constraints on the posterior distribution that would otherwise be intractable (or computationally hard) to impose on the prior. Ganchev et al. (2010) apply posterior regularization to word alignment, dependency parsing, and part-of-speech tagging. Combining powerful deep neural networks with structured knowledge has been a popular area of study: Xu et al. (2019) apply PR to multi-object generation to limit object overlap; Bilen et al. (2014) focus on object detection and use PR features to exploit mutual exclusion. In natural language processing, Hu et al. (2016a,b) propose an iterative distillation procedure that transfers logic rules into the weights of neural networks, as a regularization to improve accuracy and interpretability.
Finally, at the core of this work is the use of amortized inference/variational autoencoders to approximate the variational posterior (Kingma and Welling, 2014; Mnih and Gregor, 2014; Rezende et al., 2014). We rely heavily on a structured distribution, either a linear chain or a semi-Markov model, as introduced in structured VAEs (Johnson et al., 2016; Krishnan et al., 2017; Ammar et al., 2014). Our setting and optimization are based on Kim et al. (2019), who introduce a latent tree variable in a variational autoencoding model with a CRF as the inference network, and on Yin et al. (2018), who use an encoder-decoder model as the inference network.

Experimental Setup
Data and Metrics We consider two standard neural generation benchmarks: the E2E (Novikova et al., 2017) and WikiBio (Lebret et al., 2016a) datasets, with examples shown in Figure 2. The E2E dataset contains approximately 50K examples with 8 distinct fields and 945 distinct word types; it contains multiple test references for each source table. We evaluate in terms of BLEU (Papineni et al., 2002), NIST (Belz and Reiter, 2006), ROUGE-L (Lin, 2004), CIDEr (Vedantam et al., 2015), and METEOR (Lavie and Agarwal, 2007), using the official scoring scripts. The WikiBio dataset contains approximately 700K examples, 6K distinct table field types, and approximately 400K word types; it contains one reference per source table. We follow the metrics from Lebret et al. (2016a) and report BLEU, NIST, and ROUGE-4 scores.
Architecture and Hyperparameters For all tasks, we use an encoder-decoder LSTM for the generative model. We follow recent state-of-the-art work in parametrizing our encoder, and we use copy attention and dual attention (Gu et al., 2016; Gulcehre et al., 2016; Liu et al., 2018); full model architectures are given in the supplement.
The inference network scores are computed using a BiLSTM. We compute the emission scores φ^(e) using span embeddings (Wang and Chang, 2016; Kitaev and Klein, 2018; Stern et al., 2017) and the transition scores φ^(t) by a dot product between embedding vectors for the class labels; the length scores φ^(l) are kept uniform, as in Wiseman et al. (2018). Additional details are in the supplement.
At training time, we alleviate posterior collapse in the ELBO by warming up the objective: we linearly anneal the coefficient on the terms Σ_{t=1}^T log p_θ(z_t | z_{<t}, y_{<t}) and H[q_φ(z | x, y)] from 0 to 1, as implemented in Kim et al. (2019). We use the REINFORCE algorithm for Monte Carlo estimation of the stochastic gradient, and we choose the control variate to be the mean of the samples (Mnih and Rezende, 2016).
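The control variate amounts to centering each sample's reward by the batch mean before weighting its score-function term. A toy sketch with made-up rewards, not actual PRLBO values:

```python
import random

def centered_weights(rewards):
    """Mean-of-samples control variate: the REINFORCE surrogate multiplies each
    grad log q(z_k) by (r_k - b), where b is the mean reward over the K samples."""
    b = sum(rewards) / len(rewards)
    return [r - b for r in rewards]

random.seed(0)
rewards = [random.uniform(-1.0, 1.0) for _ in range(5)]  # hypothetical rewards
w = centered_weights(rewards)
print(abs(sum(w)) < 1e-12)  # True: the weights are exactly mean-centered
```

Because the baseline does not depend on any single sample's identity in a way that biases the estimator, the gradient estimate stays (approximately) unbiased while its variance drops.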
At decoding time, we only use the generative model.We use beam search with length normalization to jointly generate both the control states and the sentences.To obtain controlled generation, we observe the control states, and apply constrained beam search to p(y | x, z).
Baselines For generation on E2E, we compare externally against 4 systems: E2E-BENCHMARK (Dušek and Jurčíček, 2016), an encoder-decoder network followed by a reranker, used as the shared-task benchmark; NTEMP, a controllable neuralized hidden semi-Markov model; NTEMP+AR, a product of experts of both an NTemp model and an autoregressive LSTM network (Wiseman et al., 2018); and SHEN19 (Shen et al., 2019), a pragmatically informed model that is the current state-of-the-art system on the E2E dataset. We also compare internally with ablations of our system: ENCDEC is a conditional model p(y | x) trained without control states. PC_0 is the posterior control model with no constraints; it uses the structured encoder with the PR coefficient set to 0. PC_∞ is our model with hard constraints, which assumes fully observed control states; these states are obtained by mapping tokens with lexical overlap to their designated state and all other tokens to a generic state, and we train a seq2seq model p(y, z | x) with full supervision of both control states and target text. Our main model is PC_λ, which applies PR with a coefficient given by the hyperparameter λ.
For WikiBio, we compare externally against 5 systems: NTEMP and NTEMP+AR as above; LEBRET16 (Lebret et al., 2016a), which uses copy attention and an NNLM; LIU18 (ENCDEC), which is our base encoder-decoder LSTM model; and LIU18 (FIELD GATING), which uses a field-gating table encoder and a decoder with dual attention (Liu et al., 2018). For internal comparison on WikiBio, we use the ablations described above.

Table 3 shows the main results for E2E and WikiBio, comparing to both standard neural models and controllable systems. On E2E (left), our posterior control model outperforms the neural benchmark system on all validation metrics and most of the test metrics. It also achieves results comparable to or better than a specialized encoder-decoder system, and it performs significantly better than the controllable NTemp and NTemp+AR on all metrics on both validation and test. This demonstrates that the PC model provides interpretable and controllable states without sacrificing representational power or generation performance.
For internal comparison, having soft constraints on the posterior outperforms both PC_∞ (forced hard constraints) and PC_0 (no constraints). Anecdotally, we find that if two fields have the same value, the hard-coded system is often forced into the wrong decision. Similarly, removing posterior regularization altogether leads to slightly weaker performance than our controlled model.
On the larger WikiBio dataset (right), our model also significantly outperforms both the controllable NTemp and NTemp+AR baselines on all three metrics, and it improves over Liu et al. (2018)'s strong encoder-decoder model. The promising results on WikiBio suggest that the method scales to larger datasets and that the PR style works well for large field spaces. In addition, we find that dynamic constraints are feasible compared with static constraints (we believe this is because the modeling burden on one-to-many PC_λ is heavier, since it also needs to figure out the clustering). Overall, the dynamic framework opens up the possibility of generalizing to a wider set of constraints.

Analysis
Qualitative Analysis Table 4 shows how control states (shown by different colors) are used in generated sentences. We use examples generated by the PC_λ system on the WikiBio dataset, obtaining outputs by beam search over control states and words. The first block contains examples with relatively complete coverage by the semantically grounded control states, including name, birth date, death date, occupation, and nationality. We note that when a control state is selected, the textual span covered by the control state tends to respect truthfulness by copying from the table. The second block shows a longer example that uses less of the source but still remains truthful with respect to the table.

Limitations Given the promise of PR as a technique for inducing control states, it is worth noting some of the current limitations of our specific application of the method. Currently, we use simple rules which do not generalize well to paraphrase. Our weak supervision relies on direct overlap to align states and fails on phrases like less than 10 dollars that are expressed as cheap. Additionally, while at test time our method is comparable to a standard decoder model, it does take slightly longer to train, due to both the dynamic program and the need to compute multiple samples.

Conclusion
This work introduces a method for controlling the output of a blackbox neural decoder model to follow weak supervision. The methodology utilizes posterior regularization within a structured variational framework. We show that this approach can induce a fully autoregressive neural model that is as expressive as standard neural decoders but also utilizes meaningful discrete control states, and that this decoder is effective for text generation while inducing meaningful discrete representations.

Induction of grounded control states opens up many possible future directions for this work. These states can be used to provide integration with external rule-based systems, such as hard constraints at inference time. They can also be used to provide tools for human-assisted generation. Another direction is to improve the sources of weak supervision, such as interactive new constraints provided by users. One could also explore alternative posterior constraints based on pre-trained models for summarization or paraphrase tasks to induce semantically grounded latent variables. Finally, it would be interesting to explore alternative training methods for these models, such as reducing reliance on hard sampling through better relaxations of structured models.

Table 5 (Left), examples (1)-(4):
(1) Clowns is a 5 star coffee shop located near Clare Hall .
(2) Clowns is a coffee shop that serves English food and is near Clare Hall . It is in riverside and has a 5 out of 5 customer rating .
(3) Near Clare Hall in Riverside is coffee shop , Clowns . It serves English food , and has received a customer rating of 5 out of 5 .
(4) Near the riverside , Clare Hall is a coffee shop called Clowns that serves English food and has a customer rating of 5 -stars .

Figure 1 :
Figure 1: Model training. Assume we are given conditioning x (not shown) and output sentence y. (Middle) An inference network φ is used to parameterize a structured segmental conditional random field q_φ(z | x, y) over control states z. (Right) A sample from q_φ (colored circles) is used to provide control-state labels for a blackbox generation model p_θ(y, z | x). (Left) To ground the control states to represent problem-specific meaning, posterior regularization is used to enforce distributional constraints through penalties R_q(x, y). The whole system is optimized end-to-end to learn latent properties of the final output tokens.
Table (x): name[Clowns] eatType[coffee shop] food[Chinese] customer-rating[1 out of 5] area[riverside] near[Clare Hall]
Ref. 1: Clowns is a coffee shop in the riverside area near Clare Hall that has a rating 1 out of 5 . They serve Chinese food .
Ref. 2: The Chinese coffee shop by the riverside near Clare Hall that only has a customer rating of 1 out of 5 is called Clowns .
Ref. 3: There is a Chinese coffee shop near Clare Hall in the riverside area called Clowns its not got a good rating though .

Figure 2 :
Figure 2: Generation benchmarks. The model is given a table x consisting of semantic fields and is tasked with generating a description y_{1:T} of this data. Two example datasets are shown. Left: E2E; Right: WikiBio.

Table 2 :
Example of the data alignment notation. Here x is a table of data and f ranges over its fields. For a given output y we enforce a soft alignment A.

Table 4 :
Qualitative examples on the WikiBio dataset. (Top) Generated sentences with control states highlighted. (Bottom) Full example of content selection with data table and reference. (Best viewed in color.)

Control-State Metrics Let {c} be the field states used by z, and define the field word overlap between x and y over those states. P = 1 means that the control states are a strong signal to copy from the table, and C = 1 means that the control states learn to cover all table fields. On WikiBio, the model has a precision of 0.83, meaning that, on average, when we generate a good control state, 83% of the generated tokens will match the table content. Since only a fraction of the source table in WikiBio is used, recall and coverage are less applicable.

Distributional Metrics Table 5 (right) shows distributional metrics related to the optimization of the generative model and the inference network. The reconstruction perplexity, Rec., is much lower than the full perplexity, PPL, and the KL divergence between the variational posterior and the conditional prior is highly non-zero. These observations indicate that the latent variables are being used in a non-trivial way by the generative model, and suggest that the variational model is not experiencing posterior collapse.

Table 5 :
(5) Near Clare Hall , Clowns coffee shop has a five star rating and English food .
(6) Clare Hall is a 5 star coffee shop near to Clowns that serves British food .
(7) Clowns coffee shop is near Clare Hall in Riverside . It serves English food and has an excellent customer rating .
(8) 5 star rated restaurant , Clowns coffee shop is located near Clare Hall .
(Left) Example of controlled generation p_θ(y | x, z) on the source entity "Clowns" from the E2E dataset. The color represents the class label of the token z. (Right) Metrics related to the generative model/inference network, measured on both E2E and WikiBio. Rec. is the reconstruction perplexity based on E_{q(z|x,y)}[log p_θ(y | x, z)]. PPL is the perplexity per token estimated by importance sampling.