Autoregressive Text Generation Beyond Feedback Loops

Autoregressive state transitions, where predictions are conditioned on past predictions, are the predominant choice for both deterministic and stochastic sequential models. However, autoregressive feedback exposes the evolution of the hidden state trajectory to potential biases from well-known train-test discrepancies. In this paper, we combine a latent state space model with a CRF observation model. We argue that such autoregressive observation models form an interesting middle ground that expresses local correlations on the word level but keeps the state evolution non-autoregressive. On unconditional sentence generation we show performance improvements compared to RNN and GAN baselines while avoiding some prototypical failure modes of autoregressive models.


Introduction
Sequential autoregressive models express predictions of observations based on past predictions. They are the predominant architecture for text generation in a maximum likelihood setup (Graves, 2013; Sutskever et al., 2014) and are used in machine translation (Bahdanau et al., 2015; Vaswani et al., 2017), summarization (Rush et al., 2015), and dialogue systems (Serban et al., 2016).
An immediate consequence of combining autoregressive modeling and maximum likelihood training is that past observations enter the loss functions as ground-truth, not predicted observations (Goodfellow et al., 2016). This discrepancy is often summarized as teacher-forcing and the bias it implies is referred to as exposure-bias (Ranzato et al., 2016; Goyal et al., 2016).
The standard methodology to turn a sequential model into an autoregressive one is to introduce a feedback loop, where one provides the last predicted token as a feature to the computation of the next state (Graves, 2013). The ground-truth observations effectively become input features for the evolution of the hidden state trajectory at training time. Several attempts have been made to introduce robustness with respect to the model's predictions by leaving the maximum likelihood framework, either implicitly (Bowman et al., 2016) or explicitly (Goyal et al., 2016; Leblond et al., 2018). Nevertheless, the same feedback mechanisms have been adopted in latent sequential models, where they obfuscate the true stochasticity of transitions during training. Non-autoregressive sequence models have recently regained attention for unconditional (Schmidt and Hofmann, 2018; Ziegler and Rush, 2019) and conditional (Lee et al., 2018) generation.
We argue that there is an interesting intermediate regime between feedback-driven autoregressive models and completely non-autoregressive models, namely modeling temporal correlations as part of the observation model. We propose a neural CRF observation model that leverages word embeddings to explain local word correlations in a global sequence score. We show how training and generation can be performed efficiently. The result is an autoregressive model that keeps the hidden state evolution less affected by observation noise while generating coherent word sequences.

Related Work
Conditional Random Fields (CRFs) were originally introduced by Sha and Pereira (2003) to overcome label bias, a shortcoming of locally normalized observation models. They have been applied and integrated into neural-network architectures (Ma and Hovy, 2016; Huang et al., 2017) in various sequence labeling tasks (Goldman and Goldberger, 2017) where the observation space exhibits small cardinality (typically tens to hundreds).
The importance of global normalization for sequence generation has only lately been emphasized, most notably by Wiseman and Rush (2016) for conditional generation in a learning-as-search-optimization framework and by Andor et al. (2016) for parsing.
Word embeddings have been reported as excellent dense representations of sparse co-occurrence statistics within several learning frameworks (Mikolov et al., 2013; Pennington et al., 2014). Using embeddings in pairwise potentials has been proposed by Goldman and Goldberger (2017), but they do not compute the true log-likelihood during training as we do. Similar techniques have been applied for various message passing schemata (Kim et al., 2017; Domke, 2013).
Local correlations such as our pairwise potentials have been used by Noraset et al. (2018), yet as an auxiliary loss and not for model design.
Other approaches to tackle teacher-forcing have been proposed in an adversarial setting (Goyal et al., 2016), in search-based optimization (Leblond et al., 2018), and in a reinforcement learning setting (Rennie et al., 2016).

Model
Latent sequential models for text generation typically consist of two parts: a mechanism for generating a latent hidden state trajectory $h = h_{1:T}$, and an observation model. The latter predicts the data $w = w_{1:T}$ given the latent states. The simplest dependency structure for such a model is that of a hidden Markov model, which breaks into transitions $p(h_t \mid h_{t-1})$ and observations $p(w_t \mid h_t)$.
In contrast, models with autoregressive transitions factorize as

$$p(w, h) = \prod_{t=1}^{T} p(w_t \mid h_t)\, p(h_t \mid h_{t-1}, w_{t-1}). \quad (1)$$

The result is a next-state distribution with dependencies identical to deterministic RNN transitions $h_t = F(h_{t-1}, w_{t-1})$, and indeed similar neural networks can be used to parametrize a simple, e.g., Gaussian distribution (Fraccaro et al., 2016).
As a negative consequence, we inherit teacher-forcing. This comes with the aforementioned biases and also conflicts with our notion of uncertainty in $p(h_t \mid h_{t-1}, w_{t-1})$, which during training depends solely on the continuous parameters (i.e. a mean and a variance), but is greatly affected by the discrete sampling noise in $w_{t-1}$ at test time.
Autoregressive observation model We consider an alternative to autoregressive feedback mechanisms such as (1), where predictions are directly injected into states. We write

$$p(w, h) = p(w \mid h) \prod_{t=1}^{T} p(h_t \mid h_{t-1}), \quad (2)$$

assuming only Markovian transitions, and focus on finding a powerful observation model instead.
Crucially, since the state space model is not affected by previous outputs, word coherence may be lost when simply factorizing as in

$$p(w \mid h) = \prod_{t=1}^{T} p(w_t \mid h_t). \quad (3)$$

However, a natural extension can be found by reformulating local normalization as a form of global normalization without correlations across time,

$$p(w \mid h) = \frac{\exp S(w, h)}{Z}, \quad (4)$$

where $S = \sum_{t=1}^{T} \psi(w_t, h_t)$ contains no dependencies between $w_t$ and $w_{t'}$ for $t \neq t'$. As soon as we add word correlations to $S$, we obtain a truly global observation model that cannot be expressed in the form of (3).
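To make the equivalence between local normalization and correlation-free global normalization explicit, the normalizer can be written out. This is a standard identity, sketched here under the notational assumption that $w'$ ranges over all word sequences of length $T$:

```latex
\begin{aligned}
Z &= \sum_{w'} \exp \sum_{t=1}^{T} \psi(w'_t, h_t)
   = \sum_{w'_1} \cdots \sum_{w'_T} \prod_{t=1}^{T} \exp \psi(w'_t, h_t)
   = \prod_{t=1}^{T} \sum_{w'_t} \exp \psi(w'_t, h_t), \\
\frac{\exp S(w, h)}{Z}
  &= \prod_{t=1}^{T} \frac{\exp \psi(w_t, h_t)}{\sum_{w'_t} \exp \psi(w'_t, h_t)}.
\end{aligned}
```

Each factor on the right is an ordinary softmax over the vocabulary at step $t$. Once $S$ contains a term coupling two time steps, the sum over sequences no longer splits into a product of per-step sums, which is why such a model cannot be locally normalized in this way.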

CRF Observation Model
Equation (4) describes a conditional random field (CRF) with an energy function $S$ (Sha and Pereira, 2003). We consider up to pairwise interactions between consecutive words,

$$S(w, h) = \sum_{t=1}^{T} \psi(w_t, h_t) + \psi(w_{t-1}, w_t). \quad (5)$$

The potentials $\psi$ reflect the independence assumptions among $w$ and determine the complexity of the normalizer $Z = \sum_{w'} \exp S(w', h)$. Fortunately, for chain-like interactions such as (5), efficient dynamic programming routines are available. Two properties set our model apart from feedback-driven autoregressive models. First, although $\psi$ captures only pairwise interactions, a state $h_t$ will not only affect future observations but also all past observations through the global coupling. Second, our model implicitly considers all possible sequences $w'$ also at training time due to the global normalizer $Z$.
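The dynamic programming routine for a chain-structured normalizer can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes toy potentials stored as nested lists (`unary[t][w]` for the unary scores at step t, `pairwise[a][b]` for the score of word b following word a), a pairwise table shared across positions, and no start symbol. All function names are hypothetical.

```python
import itertools
import math


def log_sum_exp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def score(words, unary, pairwise):
    """Sequence score: unary terms plus pairwise terms between consecutive words."""
    total = sum(unary[t][w] for t, w in enumerate(words))
    total += sum(pairwise[a][b] for a, b in zip(words, words[1:]))
    return total


def log_partition(unary, pairwise):
    """log Z via a backward recursion, O(T * V^2) instead of O(V^T)."""
    T, V = len(unary), len(unary[0])
    # log_beta[w]: log of the total weight of all continuations after word w
    # at the current position; nothing follows the last position, so start at 0.
    log_beta = [0.0] * V
    for t in range(T - 1, 0, -1):
        log_beta = [
            log_sum_exp([pairwise[prev][w] + unary[t][w] + log_beta[w] for w in range(V)])
            for prev in range(V)
        ]
    return log_sum_exp([unary[0][w] + log_beta[w] for w in range(V)])


def log_partition_brute_force(unary, pairwise):
    """Reference value: enumerate every sequence (only feasible for tiny V and T)."""
    T, V = len(unary), len(unary[0])
    return log_sum_exp(
        [score(ws, unary, pairwise) for ws in itertools.product(range(V), repeat=T)]
    )
```

On small inputs the two functions agree up to floating-point error, which is a convenient sanity check before moving to a realistic vocabulary where enumeration is impossible.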

Sampling
Given a trained model, we can perform ancestral sampling via $h \sim p(h)$ and $w \sim p(w \mid h)$. However, CRFs are undirected graphical models not designed with generation in mind, and therefore we first need to derive ancestral sampling for $p(w \mid h)$. We can always write $p(w \mid h) = \prod_t p(w_t \mid w_{1:t-1}, h)$ and find the factors

$$p(w_t \mid w_{1:t-1}, h) = \exp\big(\psi(w_t, h_t) + \psi(w_{t-1}, w_t)\big)\, \frac{\beta_{t+1}(w_t)}{\beta_t(w_{t-1})}, \quad (6)$$

where

$$\beta_t(w_{t-1}) = \sum_{w_t} \exp\big(\psi(w_t, h_t) + \psi(w_{t-1}, w_t)\big)\, \beta_{t+1}(w_t), \quad (7)$$

with special cases $\beta_{T+1}(w_T) = 1$ and $\beta_1(w_0) = Z$, are the backward probabilities we need to compute for (4) anyway. Not surprisingly, multiplying (6) for $t = 1, \dots, T$ lets all $\beta$ terms cancel except for $1/Z$ and we recover (4). However, this form is more amenable to sampling. (In fact, one can train on (6) instead of (4); however, in our experiments we found the latter global normalization to be much more stable numerically.) It also reveals an interesting property of globally normalized models: while the chain rule always allows us to write such models autoregressively, we must expect a factor, here $\beta_{t+1}(w_t)$, that implicitly marginalizes out future observations to assess compatibility with a specific next word $w_t$. Tractability of this factor is key to obtaining a tractable model and is traded for expressiveness. While locally normalized models are on one end of the spectrum, a globally normalized model with fully-connected potentials $\psi(h(w))$ is on the other end. Such models employ an RNN in each potential to obtain an un-normalized score $\psi$ from states $h$ and have been investigated in conditional generation, where argmax-decoding rather than sampling is required (Wiseman and Rush, 2016). Figure 1 shows the dependencies of the two extremes with our model in the middle.
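Ancestral sampling from a chain CRF with backward quantities can be sketched as below. This is an illustrative toy under stated assumptions, not the paper's code: potentials are nested lists (`unary[t][w]`, `pairwise[a][b]`), the pairwise table is shared across positions, the first word has no predecessor, and everything is computed in the probability domain, which is only safe for short sequences. All names are hypothetical.

```python
import math
import random


def backward_messages(unary, pairwise):
    """betas[t][w]: total weight of all continuations after word w at position t."""
    T, V = len(unary), len(unary[0])
    betas = [[1.0] * V for _ in range(T)]  # nothing follows the last position
    for t in range(T - 2, -1, -1):
        for prev in range(V):
            betas[t][prev] = sum(
                math.exp(pairwise[prev][w] + unary[t + 1][w]) * betas[t + 1][w]
                for w in range(V)
            )
    return betas


def conditional(t, prev, unary, pairwise, betas):
    """Distribution over the word at position t given the previous word (None at t = 0)."""
    V = len(unary[0])
    weights = [
        math.exp(unary[t][w] + (pairwise[prev][w] if prev is not None else 0.0)) * betas[t][w]
        for w in range(V)
    ]
    norm = sum(weights)  # equals the backward message of the previous step, or Z at t = 0
    return [x / norm for x in weights]


def sequence_probability(words, unary, pairwise):
    """Probability of a full sequence as the product of its conditionals."""
    betas = backward_messages(unary, pairwise)
    prob, prev = 1.0, None
    for t, w in enumerate(words):
        prob *= conditional(t, prev, unary, pairwise, betas)[w]
        prev = w
    return prob


def sample_sequence(unary, pairwise, rng):
    """Draw one word sequence left to right from the globally normalized model."""
    betas = backward_messages(unary, pairwise)
    words, prev = [], None
    for t in range(len(unary)):
        probs = conditional(t, prev, unary, pairwise, betas)
        prev = rng.choices(range(len(probs)), weights=probs)[0]
        words.append(prev)
    return words
```

Although each word is drawn conditioned only on its predecessor, the backward message in `conditional` carries the weight of every possible continuation, so the sequence probabilities sum to one and their ratios match the differences in global score.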

Embedding-based Local Correlations
Often pairwise potentials can be parametrized directly, i.e. as $\psi(w_i, w_j) = A_{ij}$ for some parameter matrix $A \in \mathbb{R}^{|V| \times |V|}$. However, in our setting this is problematic for two reasons. First, $|V|^2$ parameters are impractical in terms of model size for most vocabularies. Second, computations involving $A$ are central to the complexity of computing the log-likelihood during training. Namely, to compute the normalizer $Z$, we need to compute all $\beta$ quantities in (7). Identifying $\beta_t(w_{t-1})$ as a $|V|$-dimensional vector $\beta_t$, we can write the summation in (7) as a matrix-vector product

$$\beta_t = T\,(o_t \odot \beta_{t+1}),$$

where $\odot$ is an element-wise product, $o_t$ are the exponentiated unary potentials $\exp \psi(w_t, h_t)$ written as a vector, and $T = \exp A$ element-wise. We observe that computing $Z$ naively requires $O(|V|^2 T)$ operations, where $T$ in the complexity denotes the sequence length.
To overcome the above shortcomings, we propose to factorize $T$ as

$$T = X^\top S\, Y \quad (8)$$

into context-independent $d$-dimensional embeddings $X, Y \in \mathbb{R}^{d \times |V|}$ and a context-dependent $d \times d$ interaction matrix computed by a neural network $S : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}^{d \times d}$. This reduces the memory requirement to $O(d|V|)$ and compute time to $O(d|V|T)$, which is comparable to computing standard softmax logits. As an additional benefit, we can initialize $X$ and $Y$ with pre-trained word embeddings, a technique often reported to improve convergence. Since $T$ does not have more structure than being strictly positive element-wise, it is sufficient to use strictly positive activation functions around the layers in (8) to obtain a valid factorization.
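The saving from a low-rank transition matrix comes from evaluating the matrix-vector product right to left, so the full vocabulary-by-vocabulary matrix is never formed. The sketch below assumes the factorization has the form X^T S Y with X and Y stored as d rows of length |V| and S as a d-by-d list of rows. The helper names are hypothetical, and the dense version exists only to check the result.

```python
def matvec(M, v):
    """Dense matrix-vector product for a matrix stored as a list of rows."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def factored_backward_step(X, S, Y, o, beta_next):
    """Compute X^T S Y (o * beta_next) right to left in O(d|V| + d^2)."""
    v = [a * b for a, b in zip(o, beta_next)]  # element-wise product, length |V|
    z = matvec(Y, v)                           # (d x |V|) times |V| gives d
    z = matvec(S, z)                           # (d x d) times d gives d
    X_t = [list(col) for col in zip(*X)]       # X^T as |V| rows of length d
    return matvec(X_t, z)                      # (|V| x d) times d gives |V|


def dense_backward_step(X, S, Y, o, beta_next):
    """Same result via the explicit |V| x |V| matrix, O(|V|^2) time and memory."""
    d, V = len(X), len(X[0])
    T = [
        [sum(X[k][i] * S[k][l] * Y[l][j] for k in range(d) for l in range(d)) for j in range(V)]
        for i in range(V)
    ]
    v = [a * b for a, b in zip(o, beta_next)]
    return matvec(T, v)
```

With strictly positive entries in X, S, and Y, every entry of the implied matrix is positive, so the backward quantities stay positive as a valid normalizer requires.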

Training
As is standard for latent sequential models, we use variational inference for training (Blei et al., 2017; Zhang et al., 2018). We introduce a parametrized approximate inference model $q(h \mid w)$ to maximize the evidence lower bound (ELBO) for a sampled trajectory instead of maximizing the marginal across all trajectories:

$$\log p(w) \geq \mathbb{E}_{q(h \mid w)}\big[\log p(w \mid h)\big] - \mathrm{KL}\big(q(h \mid w) \,\|\, p(h)\big). \quad (11)$$

The first term of (11) measures reconstruction while the second measures the discrepancy between the trajectories implied by the inference model $q$ and the generative model $p$. The exact form of $p(h)$ depends on its factorization and on whether it is autoregressive; for us it is simply $p(h) = \prod_t p(h_t \mid h_{t-1})$.
Inference model Like Fraccaro et al. (2016), we choose $q$ to factorize as the true posterior,

$$q(h \mid w) = \prod_{t} q(h_t \mid h_{t-1}, w_{t:T}),$$

where $w_{1:T}$ is encoded using an RNN running backwards in time to parameterize the mean and variance of a Gaussian for $q(h_t \mid h_{t-1}, w_{t:T})$. For optimization we follow existing work (Fraccaro et al., 2016; Goyal et al., 2017) and use the re-parametrization trick (Rezende et al., 2014; Kingma et al., 2016) to perform a stochastic gradient step on (11) with Adam (Kingma and Ba, 2014) using a single trajectory.
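The ingredients of a single-trajectory ELBO estimate can be illustrated with a small sketch. It assumes diagonal Gaussians for both the inference and the generative transition at each step and takes the reconstruction log-likelihood as a given number. The function names are hypothetical and no gradient computation is shown.

```python
import math
import random


def gaussian_kl(mu_q, var_q, mu_p, var_p):
    """KL(q || p) between two diagonal Gaussians given as lists of means and variances."""
    return 0.5 * sum(
        math.log(vp / vq) + (vq + (mq - mp) ** 2) / vp - 1.0
        for mq, vq, mp, vp in zip(mu_q, var_q, mu_p, var_p)
    )


def reparameterize(mu, var, rng):
    """Sample h = mu + sqrt(var) * eps with eps ~ N(0, 1), keeping h a function of (mu, var)."""
    return [m + math.sqrt(v) * rng.gauss(0.0, 1.0) for m, v in zip(mu, var)]


def single_sample_elbo(log_likelihood, kl_per_step):
    """Reconstruction term for one sampled trajectory minus the per-step KL terms."""
    return log_likelihood - sum(kl_per_step)
```

Writing the sample as a deterministic function of the mean and variance plus independent noise is what lets a gradient step on the bound reach the parameters of the inference model.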

Experiments
Exposure-bias can be summarized as overconfident conditioning on "pseudo" predictions during training. The strength of the bias depends on the informativeness of such predictions, which in turn depends on the remaining context provided. We test our proposed method on unconditional generation, which does not provide context such as a source sentence to narrow down possible outputs a priori. Hence, potential biases are more pronounced and generation is isolated from effects induced by, e.g., a translation or summarization task.
Setup Unconditional generation is still considered a challenging task for both GANs and latent stochastic models, and standard RNNs form a very competitive baseline (Semeniuta et al., 2018). To obtain a homogeneous text dataset of low complexity, we extract the plain text (text and hypothesis) from the Stanford Natural Language Inference (SNLI) dataset (Bowman et al., 2015); for details and samples see Appendix A.
Baselines We compare against a standard GRU-based RNN of matching state size, denoted DRNN (an LSTM performed on par). We also include SeqGAN, a popular GAN architecture for unconditional generation. Further, we restrict our model to unary potentials to obtain a non-autoregressive state space model similar to that of Schmidt and Hofmann (2018), denoted SSM. Finally, 2-GRAM is a bigram language model and ORACLE is held-out data, which represents the gold standard for unconditional generation.
Parameterization We use 16-dimensional latent states, pre-train 100-dimensional GloVe embeddings and use word and context vectors for $Y$ and $X$. For $S$ we found a diagonal matrix to perform best. In this case, the symmetry of $T$ is broken by larger unary potentials. While we find larger word embedding dimensionality to improve performance, the model does not benefit from more latent dimensions as an RNN does from hidden dimensions, a known issue of deep latent variable models (Schmidt and Hofmann, 2018; Ziegler and Rush, 2019). Table 1 shows selected output generated by our model (see Appendix B for more output). While many of our sentences are grammatical and mimic those of the dataset, we note that the corpus is not large enough to learn common sense, and all models including the baselines sometimes generate output such as "two men are burning snow".

Table 1: Selected output generated by our model.
a dog runs .
the children are alone .
the man is being beaten .
the man is inside working onstage .
the dog is outside with his girlfriend .
two dogs going swimming in an open-air festival .
a young lady wearing a pink shirt is studying .

Quantitative Results
Perplexity under external language models is the standard metric to evaluate unconditional output, and we use Kneser-Ney smoothed n-gram models up to $n = 3$ estimated on the training data using SRILM (Stolcke, 2002).
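For reference, perplexity is the exponentiated average negative log-probability per token. The sketch below shows only this final step and assumes the per-token natural-log probabilities have already been produced by some external language model. The smoothed n-gram estimation itself is not reproduced, and the function name is hypothetical.

```python
import math


def perplexity(log_probs):
    """Perplexity from natural-log token probabilities: exp of the mean negative log-probability."""
    return math.exp(-sum(log_probs) / len(log_probs))
```

A model that assigns every token a probability of one in four yields a perplexity of four, which matches the intuition of an effective branching factor.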
In addition, we propose to estimate some important aggregate statistics that are easily verifiable against the real data. We choose length $l$ and percentage of unique sentences $\rho_{\mathrm{UNI}}$ to assess diversity, and percentage of token repetitions $\rho_{\mathrm{REP}}$ to address a failure mode often found in generative models (Tu et al., 2016).
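These aggregate statistics are simple to compute from a list of generated sentences. The sketch below makes two assumptions that the text does not spell out: sentences are whitespace-tokenized, and a token counts as a repetition when it equals the token immediately before it. The function and key names are hypothetical.

```python
def corpus_statistics(sentences):
    """Mean length, percentage of unique sentences, and percentage of repeated tokens."""
    tokenized = [s.split() for s in sentences]
    n_tokens = sum(len(toks) for toks in tokenized)
    # Assumed definition: a repetition is a token identical to its direct predecessor.
    repeats = sum(a == b for toks in tokenized for a, b in zip(toks, toks[1:]))
    return {
        "length": n_tokens / len(tokenized),
        "unique_pct": 100.0 * len(set(sentences)) / len(sentences),
        "repetition_pct": 100.0 * repeats / n_tokens,
    }
```

Running the same function on held-out data gives the reference values against which generated samples can be compared.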

Discussion and Future Work
In terms of perplexity, our model clearly improves over SSM, outperforms DRNN as measured by bigram statistics, and is on par with it in terms of trigram statistics. Of course, 2-GRAM excels in terms of bigram statistics, yet falls behind on longer statistics. This confirms that our model can learn beyond pairwise interactions through the latent chain. In addition, through our explicit model of pairwise interaction we obtain repetitions $\rho_{\mathrm{REP}}$ significantly closer to the real data distribution. For SEQGAN we report results after 20 epochs (as used by the authors) and 200 epochs. We observe in general shorter output with more repetition (e.g. of the words are, is and up) and note that, depending on training time, the stellar fluency is traded for a significant bias on length $l$ and very poor diversity $\rho_{\mathrm{UNI}}$, a tendency also observed by Xu et al. (2018) and possibly related to the choice of temperature parameter (Caccia et al., 2018). While it is not our goal to provide a deeper analysis of GANs here, the example shows how unconditional generation can reveal tradeoffs not present in a conditional setting.

Future Work
We have shown that autoregressive predictions expressed in the observation model instead of hidden states deliver better results on a simple corpus. In particular, mistakes at the bigram level, such as repetitions, are avoided, and we suspect that more densely connected CRFs would extend these promising results to more complex patterns found in more complex corpora. In future work we plan to investigate whether CRF variants such as those of Belanger et al. (2017) or Krähenbühl and Koltun (2012) can be adapted to allow efficient sampling and to scale to word vocabulary sizes.

Conclusion
We have shown an alternative methodology to autoregressive modeling that avoids exposure-bias in hidden states by design through a globally normalized observation model. We derived a sampling method and an efficient embedding-based parameterization of CRFs to trade expressiveness for tractability. On an unconditional generation task, we obtain better results than a deterministic RNN in a low-dimensional setting and more consistent results than a GAN baseline. Finally, we have pointed out directions for capturing more complex correlations.