Learning Neural Templates for Text Generation

While neural, encoder-decoder models have had significant empirical success in text generation, there remain several unaddressed problems with this style of generation. Encoder-decoder models are largely (a) uninterpretable, and (b) difficult to control in terms of their phrasing or content. This work proposes a neural generation system using a hidden semi-markov model (HSMM) decoder, which learns latent, discrete templates jointly with learning to generate. We show that this model learns useful templates, and that these templates make generation both more interpretable and controllable. Furthermore, we show that this approach scales to real data sets and achieves strong performance nearing that of encoder-decoder text generation models.


Introduction
With the continued success of encoder-decoder models for machine translation and related tasks, there has been great interest in extending these methods to build general-purpose, data-driven natural language generation (NLG) systems (Mei et al., 2016;Dušek and Jurcıcek, 2016;Lebret et al., 2016;Chisholm et al., 2017;Wiseman et al., 2017).These encoder-decoder models (Sutskever et al., 2014;Cho et al., 2014;Bahdanau et al., 2015) use a neural encoder model to represent a source knowledge base, and a decoder model to emit a textual description word-by-word, conditioned on the source encoding.This style of generation contrasts with the more traditional division of labor in NLG, which famously emphasizes addressing the two questions of "what to say" and "how to say it" separately, and which leads to systems with explicit content selection, macro-and micro-planning, and surface realization components (Reiter and Dale, 1997;Jurafsky and Martin, 2014).

| | . |
Its customer rating is Their customer rating is Customers have rated it ...

| | .
Figure 1: An example template-like generation from the E2E Generation dataset (Novikova et al., 2017).Knowledge base x (top) contains 6 records, and ŷ (middle) is a system generation; records are shown as type [value].An induced neural template (bottom) is learned by the system and employed in generating ŷ.Each cell represents a segment in the learned segmentation, and "blanks" show where slots are filled through copy attention during generation.
Encoder-decoder generation systems appear to have increased the fluency of NLG outputs, while reducing the manual effort required.However, due to the black-box nature of generic encoderdecoder models, these systems have also largely sacrificed two important desiderata that are often found in more traditional systems, namely (a) interpretable outputs that (b) can be easily controlled in terms of form and content.
This work considers building interpretable and controllable neural generation systems, and proposes a specific first step: a new data-driven generation model for learning discrete, template-like structures for conditional text generation.The core system uses a novel, neural hidden semimarkov model (HSMM) decoder, which provides a principled approach to template-like text generation.We further describe efficient methods for training this model in an entirely data-driven way by backpropagation through inference.Generating with the template-like structures induced by the neural HSMM allows for the explicit representation of what the system intends to say (in the form of a learned template) and how it is attempting to say it (in the form of an instantiated template).
We show that we can achieve performance competitive with other neural NLG approaches, while making progress satisfying the above two desiderata.Concretely, our experiments indicate that we can induce explicit templates (as shown in Figure 1) while achieving competitive automatic scores, and that we can control and interpret our generations by manipulating these templates.Finally, while our experiments focus on the data-to-text regime, we believe the proposed methodology represents a compelling approach to learning discrete, latent-variable representations of conditional text.

Related Work
A core task of NLG is to generate textual descriptions of knowledge base records.A common approach is to use hand-engineered templates (Kukich, 1983;McKeown, 1992;McRoy et al., 2000), but there has also been interest in creating templates in an automated manner.For instance, many authors induce templates by clustering sentences and then abstracting templated fields with hand-engineered rules (Angeli et al., 2010;Kondadadi et al., 2013;Howald et al., 2013), or with a pipeline of other automatic approaches (Wang and Cardie, 2013).
There has also been work in incorporating probabilistic notions of templates into generation models (Liang et al., 2009;Konstas and Lapata, 2013), which is similar to our approach.However, these approaches have always been conjoined with discriminative classifiers or rerankers in order to actually accomplish the generation (Angeli et al., 2010;Konstas and Lapata, 2013).In addition, these models explicitly model knowledge base field selection, whereas the model we present is fundamentally an end-to-end model over generation segments.
Recently, a new paradigm has emerged around neural text generation systems based on machine translation (Sutskever et al., 2014;Cho et al., 2014;Bahdanau et al., 2015).Most of this work has used unconstrained black-box encoderdecoder approaches.There has been some work on discrete variables in this context, including extracting representations (Shen et al., 2018), incorporating discrete latent variables in text modeling (Yang et al., 2018), and using non-HSMM segmental models for machine translation or summarization (Yu et al., 2016;Wang et al., 2017;Huang et al., 2018).Dai et al. (2017) develop an approximate inference scheme for a neural HSMM using RNNs for continuous emissions; in contrast we maximize the exact log-marginal, and use RNNs to parameterize a discrete emission distribution.Finally, there has also been much recent interest in segmental RNN models for non-generative tasks in NLP (Tang et al., 2016;Kong et al., 2016;Lu et al., 2016).
The neural text generation community has also recently been interested in "controllable" text generation (Hu et al., 2017), where various aspects of the text (often sentiment) are manipulated or transferred (Shen et al., 2017;Zhao et al., 2018;Li et al., 2018).In contrast, here we focus on controlling either the content of a generation or the way it is expressed by manipulating the (latent) template used in realizing the generation.
3 Overview: Data-Driven NLG Our focus is on generating a textual description of a knowledge base or meaning representation.Following standard notation (Liang et al., 2009;Wiseman et al., 2017), let x = {r 1 . . .r J } be a collection of records.A record is made up of a type (r.t), an entity (r.e), and a value (r.m).For example, a knowledge base of restaurants might have a record with r.t = Cuisine, r.e = Denny's, and r.m = American.The aim is to generate an adequate and fluent text description ŷ1:T = ŷ1 , . . ., ŷT of x.Concretely, we consider the E2E Dataset (Novikova et al., 2017) and the WikiBio Dataset (Lebret et al., 2016).We show an example E2E knowledge base x in the top of Figure 1.The top of Figure 2 shows an example knowledge base x from the WikiBio dataset, where it is paired with a reference text y = y 1:T at the bottom.
The dominant approach in neural NLG has been  3 Language Modeling for Constrained Sentence generation Conditional language models are a popular choice to generate sentences.We introduce a tableconditioned language model for constraining text generation to include elements from fact tables.

Language model
Given a sentence s = w 1 , . . ., w T with T words from vocabulary W, a language model estimates: (1) Let c t = w t−(n−1) , . . ., w t−1 be the sequence of n − 1 context words preceding w t .An n-gram language model makes an order n Markov assumption, (2)

Language model conditioned on tables
A table is a set of field/value pairs, where values are sequences of words.We therefore propose language models that are conditioned on these pairs.
Local conditioning refers to the information from the table that is applied to the description of the words which have already generated, i.e. the previous words that constitute the context of the language  to use an encoder network over x and then a conditional decoder network to generate y, training the whole system in an end-to-end manner.To generate a description for a given example, a black-box network (such as an RNN) is used to produce a distribution over the next word, from which a choice is made and fed back into the system.The entire distribution is driven by the internal states of the neural network.
While effective, relying on a neural decoder makes it difficult to understand what aspects of x are correlated with a particular system output.This leads to problems both in controlling finegrained aspects of the generation process and in interpreting model mistakes.
As an example of why controllability is important, consider the records in Figure 1.Given these inputs an end-user might want to generate an output meeting specific constraints, such as not mentioning any information relating to customer rating.Under a standard encoder-decoder style model, one could filter out this information either from the encoder or decoder, but in practice this would lead to unexpected changes in output that might propagate through the whole system.
As an example of the difficulty of interpreting mistakes, consider the following actual generation from an encoder-decoder style system for the records in Figure 2: "frederick parker-rhodes (21 november 1914 -2 march 1987) was an english mycology and plant pathology, mathematics at the university of uk."In addition to not being fluent, it is unclear what the end of this sentence is even attempting to convey: it may be attempting to convey a fact not actually in the knowledge base (e.g., where Parker-Rhodes studied), or perhaps it is simply failing to fluently realize information that is in the knowledge base (e.g., Parker-Rhodes's country of residence).
Traditional NLG systems (Kukich, 1983;McKeown, 1992;Belz, 2008;Gatt and Reiter, 2009), in contrast, largely avoid these problems.Since they typically employ an explicit planning component, which decides which knowledge base records to focus on, and a surface realization component, which realizes the chosen records, the intent of the system is always explicit, and it may be modified to meet constraints.
The goal of this work is to propose an approach to neural NLG that addresses these issues in a principled way.We target this goal by proposing a new model that generates with template-like objects induced by a neural HSMM (see Figure 1).Templates are useful here because they represent a fixed plan for the generation's content, and because they make it clear what part of the generation is associated with which record in the knowledge base.

Background: Semi-Markov Models
What does it mean to learn a template?It is natural to think of a template as a sequence of typed text-segments, perhaps with some segments acting as the template's "backbone" (Wang and Cardie, 2013), and the remaining segments filled in from the knowledge base.
A natural probabilistic model conforming with this intuition is the hidden semi-markov model (HSMM) (Gales and Young, 1993;Ostendorf et al., 1996), which models latent segmentations in an output sequence.Informally, an HSMM is much like an HMM, except emissions may last multiple time-steps, and multi-step emissions need not be independent of each other conditioned on the state.
We briefly review HSMMs following Murphy (2002).Assume we have a sequence of observed tokens y 1 . . .y T and a discrete, latent state z t ∈ {1, . . ., K} for each timestep.We addition-ally use two per-timestep variables to model multistep segments: a length variable l t ∈ {1, . . ., L} specifying the length of the current segment, and a deterministic binary variable f t indicating whether a segment finishes at time t.We will consider in particular conditional HSMMs, which condition on a source x, essentially giving us an HSMM decoder.
An HSMM specifies a joint distribution on the observations and latent segmentations.Letting θ denote all the parameters of the model, and using the variables introduced above, we can write the corresponding joint-likelihood as follows where we take z 0 to be a distinguished startstate, and the deterministic f t variables are used for excluding non-segment log probabilities.We further assume p(z t+1 , l t+1 | z t , l t , x) factors as p(z t+1 | z t , x) × p(l t+1 | z t+1 ).Thus, the likelihood is given by the product of the probabilities of each discrete state transition made, the probability of the length of each segment given its discrete state, and the probability of the observations in each segment, given its state and length.

A Neural HSMM Decoder
We use a novel, neural parameterization of an HSMM to specify the probabilities in the likelihood above.This full model, sketched out in Figure 3, allows us to incorporate the modeling components, such as LSTMs and attention, that make neural text generation effective, while maintaining the HSMM structure.

Parameterization
Since our model must condition on x, let r j ∈ R d represent a real embedding of record r j ∈ x, and let x a ∈ R d represent a real embedding of the entire knowledge base x, obtained by max-pooling coordinate-wise over all the r j .It is also useful to have a representation of just the unique types of records that appear in x, and so we also define x u ∈ R d to be the sum of the embeddings of the unique types appearing in x, plus a bias vector and followed by a ReLU nonlinearity.Here we assume z1 is in the "red" state (out of K possibilities), and transitions to the "blue" state after emitting three words.The transition model, shown as T , is a function of the two states and the neural encoded source x.The emission model is a function of a "red" RNN model (with copy attention over x) that generates words 1, 2 and 3.After transitioning, the next word y4 is generated by the "blue" RNN, but independently of the previous words.
Transition Distribution The transition distribution p(z t+1 | z t , x) may be viewed as a K × K matrix of probabilities, where each row sums to 1.We define this matrix to be are state embeddings, and where C : R d → R K×m 2 and D : R d → R m 2 ×K are parameterized non-linear functions of x u .We apply a row-wise softmax to the resulting matrix to obtain the desired probabilities.

Length Distribution
We simply fix all length probabilities p(l t+1 | z t+1 ) to be uniform up to a maximum length L.1 Emission Distribution The emission model models the generation of a text segment conditioned on a latent state and source information, and so requires a richer parameterization.Inspired by the models used for neural NLG, we base this model on an RNN decoder, and write a segment's probability as a product over token-level probabilities, where </seg> is an end of segment token.The RNN decoder uses attention and copy-attention over the embedded records r j , and is conditioned on z t = k by concatenating an embedding corresponding to the k'th latent state to the RNN's input; the RNN is also conditioned on the entire x by initializing its hidden state with x a .
More concretely, let h k i−1 ∈ R d be the state of an RNN conditioned on x and z t = k (as above) run over the sequence y t−lt+1:t−lt+i−1 .We let the model attend over records r j using h k i−1 (in the style of Luong et al. (2015)), producing a context vector c k i−1 .We may then obtain scores v i−1 for each word in the output vocabulary, with parameters g k 1 ∈ R 2d and W ∈ R V ×2d .Note that there is a g k 1 vector for each of K discrete states.To additionally implement a kind of slot filling, we allow emissions to be directly copied from the value portion of the records r j using copy attention (Gülc ¸ehre et al., 2016;Gu et al., 2016;Yang et al., 2016).Define copy scores, where g k 2 ∈ R d .We then normalize the outputvocabulary and copy scores together, to arrive at and thus An Autoregressive Variant The model as specified assumes segments are independent conditioned on the associated latent state and x.While this assumption still allows for reasonable performance, we can tractably allow interdependence between tokens (but not segments) by having each next-token distribution depend on all the previously generated tokens, giving us an autoregressive HSMM.For this model, we will in fact use p(y t−lt+i = w | y 1:t−lt+i−1 , z t = k, x) in defining our emission model, which is easily implemented by using an additional RNN run over all the preceding tokens.We will report scores for both non-autoregressive and autoregressive HSMM decoders below.

Learning
The model requires fitting a large set of neural network parameters.Since we assume z, l, and f are unobserved, we marginalize over these variables to maximize the log marginal-likelihood of the observed tokens y given x.The HSMM marginal-likelihood calculation can be carried out efficiently with a dynamic program analogous to either the forward-or backward-algorithm familiar from HMMs (Rabiner, 1989).
It is actually more convenient to use the backward-algorithm formulation when using RNNs to parameterize the emission distributions, and we briefly review the backward recurrences here, again following Murphy (2002).We have: with base case β T (j) = 1.
We can now obtain the marginal probability of y as where we have used the fact that f 0 must be 1, and we therefore train to maximize the log-marginal likelihood of the observed y: Since the quantities in (1) are obtained from a dynamic program, which is itself differentiable, we may simply maximize with respect to the parameters θ by back-propagating through the dynamic program; this is easily accomplished with automatic differentiation packages, and we use pytorch (Paszke et al., 2017) in all experiments.

Extracting Templates and Generating
After training, we could simply condition on a new database and generate with beam search, as is standard with encoder-decoder models.However, the structured approach we have developed allows us to generate in a more template-like way, giving us more interpretable and controllable generations.First, note that given a database x and reference generation y we can obtain the MAP assignment to the variables z, l, and f with a dynamic program similar to the Viterbi algorithm familiar from HMMs.These assignments will give us a typed segmentation of y, and we show an example Viterbi segmentation of some training text in Fig- ure 4. Computing MAP segmentations allows us to associate text-segments (i.e., phrases) with the discrete labels z t that frequently generate them.These MAP segmentations can be used in an exploratory way, as a sort of dimensionality reduction of the generations in the corpus.More importantly for us, however, they can also be used to guide generation.
In particular, since each MAP segmentation implies a sequence of hidden states z, we may run a template extraction step, where we collect the most common "templates" (i.e., sequences of hidden states) seen in the training data.Each "template" z (i) consists of a sequence of latent states, with z S representing the S distinct segments in the i'th extracted template (recall that we will technically have a z t for each time-step, and so z (i) is obtained by collapsing adjacent z t 's with the same value); see Figure 4 for an example template (with S = 17) that can be extracted from the E2E corpus.The bottom of Figure 1 shows a visualization of this extracted template, where discrete states are replaced by the phrases they frequently generate in the training data.
With our templates z (i) in hand, we can then restrict the model to using (one of) them during generation.In particular, given a new input x, we may generate by computing which gives us a generation ŷ(i) for each extracted template z (i) .For example, the generation in Figure 1 is obtained by maximizing (2) with x set to the database in Figure 1 and z (i) set to the template extracted in Figure 4.In practice, the arg max in (2) will be intractable to calculate exactly due to the use of RNNs in defining the emission distribution, and so we approximate it with a constrained beam search.This beam search looks very similar to that typically used with RNN decoders, except the search occurs only over a segment, for a particular latent state k.

Discussion
Returning to the discussion of controllability and interpretability, we note that with the proposed model (a) it is possible to explicitly force the generation to use a chosen template z (i) , which is itself automatically learned from training data, and (b) that every segment in the generated ŷ(i) is typed by its corresponding latent variable.We explore these issues empirically in Section 7.1.We also note that these properties may be useful for other text applications, and that they offer an additional perspective on how to approach latent variable modeling for text.Whereas there has been much recent interest in learning continuous latent variable representations for text (see Section 2), it has been somewhat unclear what the latent variables to be learned are intended to capture.On the other hand, the latent, template-like structures we induce here represent a plausible, probabilistic latent variable story, and allow for a more controllable method of generation.
Finally, we highlight one significant possible issue with this model -the assumption that segments are independent of each other given the corresponding latent variable and x.Here we note that the fact that we are allowed to condition on x is quite powerful.Indeed, a clever encoder could capture much of the necessary interdependence between the segments to be generated (e.g., the correct determiner for an upcoming noun phrase) in its encoding, allowing the segments themselves to be decoded more or less independently, given x.

Data and Methods
Our experiments apply the approach outlined above to two recent, data-driven NLG tasks.

Datasets
Experiments use the E2E (Novikova et al., 2017) and WikiBio (Lebret et al., 2016) datasets, examples of which are shown in Figures 1 and 2, respectively.The former dataset, used for the 2018 E2E-Gen Shared Task, contains approximately 50K total examples, and uses 945 distinct word types, and the latter dataset contains approximately 500K examples and uses approximately 400K word types.Because our emission model uses a word-level copy mechanism, any record with a phrase consisting of n words as its value is replaced with n positional records having a single word value, following the preprocessing of Lebret et al. (2016).For example, "type[coffee shop]" in Figure 1 becomes "type-1[coffee]" and "type-2 [shop]." For both datasets we compare with published encoder-decoder models, as well as with direct template-style baselines.The E2E task is evaluated in terms of BLEU (Papineni et al., 2002), NIST (Belz and Reiter, 2006), ROUGE (Lin, 2004), CIDEr (Vedantam et al., 2015), and ME-TEOR (Banerjee and Lavie, 2005). 2 The benchmark system for the task is an encoder-decoder style system followed by a reranker, proposed by Dušek and Jurcıcek (2016).We compare to this baseline, as well as to a simple but competitive non-parametric template-like baseline ("SUB" in tables), which selects a training sentence with records that maximally overlap (without including extraneous records) the unseen set of records we wish to generate from; ties are broken at random.Then, word-spans in the chosen training sentence are aligned with records by string-match, and replaced with the corresponding fields of the new set of records. 3he WikiBio dataset is evaluated in terms of BLEU, NIST, and ROUGE, and we compare with the systems and baselines implemented by Lebret et al. (2016), which include two neural, encoderdecoder style models, as well as a Kneser-Ney, templated baseline.

Model and Training Details
We first emphasize two additional methodological details important for obtaining good performance.
Constraining Learning We were able to learn more plausible segmentations of y by constraining the model to respect word spans y t+1:t+l that appear in some record r j ∈ x.We accomplish this by giving zero probability (within the backward re-currences in Section 5) to any segmentation that splits up a sequence y t+1:t+l that appears in some r j , or that includes y t+1:t+l as a subsequence of another sequence.Thus, we maximize (1) subject to these hard constraints.

Increasing the Number of Hidden States
While a larger K allows for a more expressive latent model, computing K emission distributions over the vocabulary can be prohibitively expensive.We therefore tie the emission distribution between multiple states, while allowing them to have a different transition distributions.
We give additional architectural details of our model in the Supplemental Material; here we note that we use an MLP to embed r j ∈ R d , and a 1layer LSTM (Hochreiter and Schmidhuber, 1997) in defining our emission distributions.In order to reduce the amount of memory used, we restrict our output vocabulary (and thus the height of the matrix W in Section 5) to only contain words in y that are not present in x; any word in y present in x is assumed to be copied.In the case where a word y t appears in a record r j (and could therefore have been copied), the input to the LSTM at time t+1 is computed using information from r j ; if there are multiple r j from which y t could have been copied, the computed representations are simply averaged.
For all experiments, we set d = 300 and L = 4.At generation time, we select the 100 most common templates z (i) , perform beam search with a beam of size 5, and select the generation with the highest overall joint probability.
For our E2E experiments, our best nonautoregressive model has 55 "base" states, duplicated 5 times, for a total of K = 275 states, and our best autoregressive model uses K = 60 states, without any duplication.For our WikiBio experiments, both our best non-autoregressive and autoregressive models uses 45 base states duplicated 3 times, for a total of K = 135 states.In all cases, K was chosen based on BLEU performance on held-out validation data.Code implementing our models is available at https://github.com/harvardnlp/neural-template-gen.

Results
Our results on automatic metrics are shown in Tables 1 and 2. In general, we find that the templated baselines underperform neural models, whereas our proposed model is fairly competitive with neural models, and sometimes even out-  Lebret et al. (2016), their templated baseline, and our HSMM models (denoted "NTemp" and "NTemp+AR" for the nonautoregressive and autoregressive versions, resp.) on the test portion of the WikiBio dataset.Models marked with a † are from Lebret et al. (2016), and following their methodology we use ROUGE-4.Bottom: state-of-the-art seq2seq-style results from Liu et al. (2018).
performs them.On the E2E data, for example, we see in Table 1 that the SUB baseline, despite having fairly impressive performance for a nonparametric model, fares the worst.The neural HSMM models are largely competitive with the encoder-decoder system on the validation data, despite offering the benefits of interpretability and controllability; however, the gap increases on test.Table 2 evaluates our system's performance on the test portion of the WikiBio dataset, comparing with the systems and baselines implemented by Lebret et al. (2016).Again for this dataset we see that their templated Kneser-Ney model underperforms on the automatic metrics, and that neural models improve on these results.Here the HSMMs are competitive with the best model of Lebret et al. (2016), and even outperform it on ROUGE.We emphasize, however, that recent, sophisticated approaches to encoder-decoder style Table 3: Impact of varying the template z (i) for a single x from the E2E validation data; generations are annotated with the segmentations of the chosen z (i) .Results were obtained using the NTemp+AR model from Table 1.
database-to-text generation have since surpassed the results of Lebret et al. (2016) and our own, and we show the recent seq2seq style results of Liu et al. (2018), who use a somewhat larger model, at the bottom of Table 2.

Qualitative Evaluation
We now qualitatively demonstrate that our generations are controllable and interpretable.
Controllable Diversity One of the powerful aspects of the proposed approach to generation is that we can manipulate the template z (i) while leaving the database x constant, which allows for easily controlling aspects of the generation.In Table 3 we show the generations produced by our model for five different neural template sequences z (i) , while fixing x.There, the segments in each generation are annotated with the latent states determined by the corresponding z (i) .We see that these templates can be used to affect the wordordering, as well as which fields are mentioned in the generated text.Moreover, because the discrete states align with particular fields (see below), it is generally simple to automatically infer to which fields particular latent states correspond, allowing users to choose which template best meets their requirements.We emphasize that this level of controllability is much harder to obtain for encoderdecoder models, since, at best, a large amount of sampling would be required to avoid generating around a particular mode in the conditional distribution, and even then it would be difficult to control the sort of generations obtained.i) for a single x from the WikiBio validation data; generations are annotated with the segmentations of the chosen z (i) .Results were obtained using the NTemp model from Table 2.
Interpretable States Discrete states also provide a method for interpreting the generations produced by the system, since each segment is explicitly typed by the current hidden state of the model.Table 4 shows the impact of varying the template z (i) for a single x from the WikiBio dataset.While there is in general surprisingly little stylistic variation in the WikiBio data itself, there is variation in the information discussed, and the templates capture this.Moreover, we see that particular discrete states correspond in a consistent way to particular pieces of information, allowing us to align states with particular field types.For instance, birth names have the same hidden state (132), as do names (117), nationalities (82), birth dates (101), and occupations (20).
To demonstrate empirically that the learned states indeed align with field types, we calculate the average purity of the discrete states learned for both datasets in Table 5.In particular, for each discrete state for which the majority of its generated words appear in some r j , the purity of a state's record type alignment is calculated as the percentage of the state's words that come from the most frequent record type the state represents.This calculation was carried out over training examples that belonged to one of the top 100 most frequent templates.Table 5 indicates that discrete states learned on the E2E data are quite pure.Discrete states learned on the WikiBio data are less pure, though still rather impressive given that there are approximately 1700 record types represented in the WikiBio data, and we limit the number of states to 135.Unsurprisingly, adding autoregressiveness to the model decreases purity on both datasets, since the model may rely on the autoregressive RNN for typing, in addition to the state's identity.

Conclusion and Future Work
We have developed a neural, template-like generation model based on an HSMM decoder, which can be learned tractably by backpropagating through a dynamic program.The method allows us to extract template-like latent objects in a principled way in the form of state sequences, and then generate with them.This approach scales to large-scale text datasets and is nearly competitive with encoder-decoder models.More importantly, this approach allows for controlling the diversity of generation and for producing interpretable states during generation.We view this work both as the first step towards learning discrete latent variable template models for more difficult generation tasks, as well as a different perspective on learning latent variable text models in general.Future work will examine encouraging the model to learn maximally different (or minimal) templates, which our objective does not explicitly encourage, templates of larger textual phenomena, such as paragraphs and documents, and hierarchical templates.

Figure 2 :
Figure 2: An example from the WikiBio dataset (Lebret et al., 2016), with a database x (top) for Frederick Parker-Rhodes and corresponding reference generation y (bottom).

Figure 3 :
Figure3: HSMM factor graph (under a known segmentation) to illustrate parameters.Here we assume z1 is in the "red" state (out of K possibilities), and transitions to the "blue" state after emitting three words.The transition model, shown as T , is a function of the two states and the neural encoded source x.The emission model is a function of a "red" RNN model (with copy attention over x) that generates words 1, 2 and 3.After transitioning, the next word y4 is generated by the "blue" RNN, but independently of the previous words.

Figure 4 :
Figure 4: A sample Viterbi segmentation of a training text; subscripted numbers indicate the corresponding latent state.From this we can extract a template with S = 17 segments; compare with the template used at the bottom of Figure 1.
Cotto is a coffee shop serving English food in the moderate price range.It is located near The Portland Arms.Its customer rating is 3 out of 5.

Table 4 :
Impact of varying the template z (

Table 5 :
Empirical analysis of the average purity of discrete states learned on the E2E and WikiBio datasets, for the NTemp and NTemp+AR models.Average purities are given as percents, and standard deviations follow in parentheses.See the text for full description of this calculation.