On the Discrepancy between Density Estimation and Sequence Generation

Many sequence-to-sequence generation tasks, including machine translation and text-to-speech, can be posed as estimating the density of the output y given the input x: p(y|x). Given this interpretation, it is natural to evaluate sequence-to-sequence models using conditional log-likelihood on a test set. However, the goal of sequence-to-sequence generation (or structured prediction) is to find the best output ŷ given an input x, and each task has its own downstream metric R that scores a model output by comparing it against a set of references y*: R(ŷ, y* | x). While we hope that a model that excels at density estimation also performs well on the downstream metric, the exact correlation has not been studied for sequence generation tasks. In this paper, by comparing several density estimators on five machine translation tasks, we find that the correlation between rankings of models based on log-likelihood and BLEU varies significantly depending on the range of the model families being compared. First, log-likelihood is highly correlated with BLEU when we consider models within the same family (e.g. autoregressive models, or latent variable models with the same parameterization of the prior). However, we observe no correlation between rankings of models across different families: (1) among non-autoregressive latent variable models, a flexible prior distribution is better at density estimation but gives worse generation quality than a simple prior, and (2) autoregressive models offer the best translation performance overall, while latent variable models with a normalizing flow prior give the highest held-out log-likelihood across all datasets. Therefore, we recommend using a simple prior for latent variable non-autoregressive models when fast generation speed is desired.


Introduction
Sequence-to-sequence generation tasks can be cast as conditional density estimation p(y|x), where x and y are input and output sequences. In this framework, density estimators are trained to maximize the conditional log-likelihood, and also evaluated using log-likelihood on a test set. However, many sequence generation tasks require finding the best output ŷ given an input x at test time, and the output is evaluated against a set of references y* on a task-specific metric: R(ŷ, y* | x). For example, machine translation systems are evaluated using BLEU scores (Papineni et al., 2002), image captioning systems use METEOR (Banerjee and Lavie, 2005), and text-to-speech systems use MOS (mean opinion score). As density estimators are optimized for log-likelihood, we want models with higher held-out log-likelihood to give better generation quality, but this correlation has not been well studied for sequence generation tasks. In this work, we investigate the correlation between rankings of density estimators based on (1) test log-likelihood and (2) the downstream metric for machine translation.
We present two key observations. First, among models within the same family, we find that log-likelihood is strongly correlated with BLEU. The correlation is almost perfect for autoregressive models and high for latent variable models with the same prior. Between models of different families, however, log-likelihood and BLEU are not correlated. Latent variable models with a flow prior are in fact the best density estimators (even better than autoregressive models), but they give the worst generation quality. Gaussian prior models offer comparable or better BLEU scores, while autoregressive models give the best BLEU scores overall. From these findings, we conclude that the correlation between log-likelihood and BLEU scores varies significantly depending on the range of model families considered.
Second, we find that knowledge distillation drastically hurts density estimation performance across different models and datasets, but consistently improves the translation quality of non-autoregressive models. For autoregressive models, distillation slightly hurts translation quality. Among latent variable models, iterative inference with a delta posterior (Shu et al., 2019) significantly improves the translation quality of latent variable models with a Gaussian prior, whereas the improvement is relatively small for the flow prior. Overall, for fast generation, we recommend a latent variable non-autoregressive model with a simple prior (rather than a flexible one), knowledge distillation, and iterative inference. This is 5-7x faster than the autoregressive model at the expense of about 2 BLEU points on average, and it improves upon latent variable models with a flexible prior across generation speed, BLEU, and parameter count.

Background
Sequence-to-sequence generation is a supervised learning problem of generating an output sequence given an input sequence. For many such tasks, conditional density estimators have been very successful (Sutskever et al., 2014; Bahdanau et al., 2015; Vinyals and Le, 2015).
To learn the distribution of an output sequence, it is crucial to give the model enough capacity to capture the dependencies among the output variables. We explore two ways to achieve this: (1) directly modeling the dependencies with an autoregressive factorization of the variables, and (2) letting latent variables capture the dependencies, so that, given the latent variables, the output distribution factorizes and the sequence can be generated more quickly. We discuss both classes of density estimators in depth below. We denote the training set as a set of tuples $\{(x_n, y_n)\}_{n=1}^{N}$, and each input and output example as sequences of random variables $x = (x_1, \ldots, x_T)$ and $y = (y_1, \ldots, y_T)$ (where we drop the subscript $n$ for notational simplicity). We use $\theta$ to denote the model parameters.

Autoregressive Models
Learning Autoregressive models factorize the joint distribution of the sequence of output variables $y = (y_1, \ldots, y_T)$ as a product of conditional distributions:

$$p_{\text{AR}}(y \mid x) = \prod_{t=1}^{T} p_\theta(y_t \mid y_{<t}, x).$$

They are trained to maximize the log-likelihood of the training data: $\mathcal{L}_{\text{AR}}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log p_{\text{AR}}(y_n \mid x_n)$.
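The chain-rule factorization and its training objective can be sketched numerically: under the factorization, the log-likelihood of a sequence is just the sum of per-token conditional log-probabilities. The helper names and the toy probabilities below are illustrative, not from the paper.

```python
import numpy as np

def sequence_log_likelihood(token_log_probs):
    """Log-likelihood of one output sequence under an autoregressive model.

    token_log_probs[t] is log p(y_t | y_<t, x), e.g. read off the model's
    softmax at step t; by the chain rule the joint log-probability is a sum.
    """
    return float(np.sum(token_log_probs))

def corpus_objective(all_token_log_probs):
    """L_AR(theta): average sequence log-likelihood over N training pairs."""
    return float(np.mean([sequence_log_likelihood(s) for s in all_token_log_probs]))

# Toy example: two sequences with hypothetical per-token probabilities.
seqs = [np.log([0.5, 0.25]), np.log([0.5, 0.5, 0.5])]
obj = corpus_objective(seqs)
```

In a real system the per-token log-probabilities come from the decoder's softmax outputs; the averaging over the corpus is unchanged.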
Parameterization Recurrent neural networks and their gated variants are natural parameterizations of autoregressive models (Elman, 1990; Hochreiter and Schmidhuber, 1997; Chung et al., 2014). By ensuring that no future information $y_{\geq t}$ is used in predicting the current timestep $y_t$, non-recurrent architectures can also parameterize autoregressive models, such as convolutions (van den Oord et al., 2016; Gehring et al., 2017) and Transformers (Vaswani et al., 2017), which are feedforward networks with self-attention.
Inference Finding the most likely output sequence given an input sequence under an autoregressive model amounts to solving a search problem: $\hat{y} = \operatorname{argmax}_{y_{1:T}} \sum_{t=1}^{T} \log p_\theta(y_t \mid y_{<t}, x)$. As the size of the search space grows exponentially with the length of the output sequence $T$, solving this exactly is intractable. Therefore, approximate search algorithms such as greedy search or beam search are often used.
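Greedy search, the simplest of these approximations, picks the single most probable token at each step. The sketch below uses a stand-in `step_fn` in place of a trained decoder step; the toy vocabulary and probabilities are hypothetical.

```python
import numpy as np

def greedy_decode(step_fn, max_len, eos_id):
    """Greedy approximation to argmax_y sum_t log p(y_t | y_<t, x).

    step_fn(prefix) returns log-probabilities over the vocabulary for the
    next token; it stands in for a trained model's decoder step.
    """
    prefix, score = [], 0.0
    for _ in range(max_len):
        log_probs = step_fn(prefix)
        tok = int(np.argmax(log_probs))   # commit to the locally best token
        score += float(log_probs[tok])
        prefix.append(tok)
        if tok == eos_id:
            break
    return prefix, score

# Toy model over a 3-token vocabulary {0, 1, EOS=2}: prefers token 1 first,
# then EOS once the prefix is non-empty.
def toy_step(prefix):
    if not prefix:
        return np.log([0.1, 0.8, 0.1])
    return np.log([0.1, 0.2, 0.7])

out, score = greedy_decode(toy_step, max_len=10, eos_id=2)
```

Beam search generalizes this by keeping the top-k prefixes at each step instead of one.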

Latent Variable Models
Learning Latent variable models posit a joint distribution of observed variables ($y$) and unobserved variables ($z$). They are trained to maximize the marginal log-likelihood of the training data:

$$\mathcal{L}_{\text{LV}}(\theta) = \frac{1}{N} \sum_{n=1}^{N} \log p_\theta(y_n \mid x_n) = \frac{1}{N} \sum_{n=1}^{N} \log \int p_\theta(y_n \mid z, x_n)\, p_\theta(z \mid x_n)\, dz. \quad (1)$$

As the marginalization over $z$ makes computing the marginal log-likelihood and posterior inference intractable, variational inference uses a parameterized family of distributions $q_\phi(z \mid y, x)$ to approximate the true posterior $p(z \mid y, x)$. This yields the evidence lowerbound (ELBO) (Wainwright and Jordan, 2008; Kingma and Welling, 2014):

$$\log p_\theta(y \mid x) \geq \mathbb{E}_{q_\phi(z \mid y, x)}\big[\log p_\theta(y \mid z, x)\big] - \mathrm{KL}\big(q_\phi(z \mid y, x) \,\|\, p_\theta(z \mid x)\big) = \mathrm{ELBO}(y, x; \theta, \phi), \quad (2)$$

where $p_\theta(y \mid z, x)$ is the decoder, $q_\phi(z \mid y, x)$ is the variational posterior and $p_\theta(z \mid x)$ is the prior. Both the model and variational parameters $\theta, \phi$ are estimated to maximize the ELBO over the training set.

Parameterization As latent variables can capture the dependencies between the output variables, the decoding distribution can be factorized: $p_\theta(y \mid z, x) = \prod_{t=1}^{T} p_\theta(y_t \mid z, x)$. The approximate posterior distribution is also often factorized, and can be parameterized by any neural network that outputs a mean and standard deviation for each output position: $q_\phi(z_{1:T} \mid y, x) = \prod_{t=1}^{T} \mathcal{N}\big(z_t;\, \mu_{\phi,t}(y, x),\, \sigma_{\phi,t}(y, x)\big)$. We discuss prior distributions in §2.3.
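With both the posterior and the prior factorized as diagonal Gaussians, the KL term of the ELBO has a standard closed form, and only the reconstruction term needs Monte Carlo samples. The numpy sketch below illustrates this; the sample values are hypothetical.

```python
import numpy as np

def gaussian_kl(mu_q, sigma_q, mu_p, sigma_p):
    """Closed-form KL(q || p) between factorized (diagonal) Gaussians,
    summed over all latent dimensions."""
    return float(np.sum(
        np.log(sigma_p / sigma_q)
        + (sigma_q**2 + (mu_q - mu_p)**2) / (2.0 * sigma_p**2)
        - 0.5
    ))

def elbo(recon_log_lik_samples, kl):
    """ELBO = E_q[log p(y|z,x)] - KL(q(z|y,x) || p(z|x)); the expectation
    is estimated from Monte Carlo samples of the reconstruction term."""
    return float(np.mean(recon_log_lik_samples)) - kl

# KL of q = N(0, I) against p = N(0, I) is zero in every dimension.
kl = gaussian_kl(np.zeros(4), np.ones(4), np.zeros(4), np.ones(4))
```

In practice the means and standard deviations come from the posterior and prior networks, and the gradient flows through the reconstruction samples via the reparameterization trick.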
Inference Generating the most likely output given an input with a latent variable model requires optimizing the ELBO with respect to the output: $\operatorname{argmax}_y \mathrm{ELBO}(y, x; \theta, \phi)$. As computing the expectation in Eq. 2 is intractable, we instead optimize a proxy lowerbound using a delta posterior (Shu et al., 2019): $q_\delta(z) = \delta(z - \mu)$, a point mass at $\mu$. Then, the ELBO reduces to:

$$\log p_\theta(y \mid \mu, x) + \log p_\theta(\mu \mid x). \quad (3)$$

We maximize Eq. 3 with iterative refinement: an EM-style algorithm alternates between (1) matching the proxy to the original lowerbound by setting $\mu = \mathbb{E}_{q_\phi}[z]$, and (2) maximizing the proxy lowerbound with respect to the output by $\hat{y} = \operatorname{argmax}_y \log p_\theta(y \mid \mu, x)$. The delta posterior is initialized from the prior (e.g. $\mu = \mathbb{E}_{z \sim p_\theta(z \mid x)}[z]$ in the case of a Gaussian prior) so that the inference algorithm is fully deterministic, a desirable property for sequence generation tasks. We study the effect of iterative refinement on BLEU score in detail.
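The EM-style alternation can be sketched with stand-in functions for the decoder argmax and the approximate posterior mean (both hypothetical here; in the paper they are a trained Transformer decoder and posterior network).

```python
def iterative_refinement(decode_fn, posterior_mean_fn, prior_mean, steps):
    """Deterministic inference with a delta posterior q(z) = delta(z - mu).

    decode_fn(mu) returns argmax_y log p(y | mu, x) for fixed latents;
    posterior_mean_fn(y) returns E_{q(z|y,x)}[z].
    """
    mu = prior_mean                      # initialize from the prior mean
    y = decode_fn(mu)
    for _ in range(steps):
        mu = posterior_mean_fn(y)        # E-step: re-center the delta posterior
        y_new = decode_fn(mu)            # M-step: re-decode with new latents
        if y_new == y:                   # reached a fixed point
            break
        y = y_new
    return y

# Toy 1-D setup: the "decoder" rounds mu to the nearest integer and the
# "posterior mean" pulls mu halfway toward 2.
decode = lambda mu: int(round(mu))
post_mean = lambda y: 0.5 * (y + 2.0)
y_hat = iterative_refinement(decode, post_mean, prior_mean=0.4, steps=10)
```

Because both stand-ins are deterministic, repeated runs give identical outputs, mirroring the determinism of the inference procedure described above.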

Prior for Latent Variable Models
Several works have found that the prior distribution plays a critical role in balancing the variational posterior and the decoder, and that a standard normal distribution may be too rigid for the aggregate posterior to match (Hoffman and Johnson, 2016; Rosca et al., 2018). Indeed, follow-up work found that more flexible prior distributions outperform simple priors on several density estimation tasks (Tomczak and Welling, 2018; Bauer and Mnih, 2019). Therefore, we explore two choices for the prior distribution: a factorized Gaussian and a normalizing flow.
Diagonal Gaussian A simple model of the conditional prior is a factorized Gaussian distribution:

$$p_\theta(z \mid x) = \prod_{t=1}^{T} \mathcal{N}\big(z_t;\, \mu_t(x),\, \sigma_t(x)\big),$$

where each latent variable $z_t$ is modeled as a diagonal Gaussian with mean and standard deviation computed from a learned function.
Normalizing Flow Normalizing flows (Tabak and Turner, 2013; Rezende and Mohamed, 2015; Papamakarios et al., 2019) offer a general method to construct complex probability distributions over continuous random variables. A flow consists of (1) a base distribution $p_b(\epsilon)$ (often chosen as a standard Gaussian) and (2) an invertible transformation $f$ with inverse $f^{-1}$, such that $f(z) = \epsilon$ and $f^{-1}(\epsilon) = z$. As our prior is conditioned on $x$, so are the transformations: $f(z; x) = \epsilon$, $f^{-1}(\epsilon; x) = z$. Then, by the change-of-variables formula, we can evaluate the exact density of the latent variable $z$ under the flow prior:

$$p_\theta(z \mid x) = p_b\big(f(z; x)\big)\, \left| \det \frac{\partial f(z; x)}{\partial z} \right|.$$

Affine coupling flows (Dinh et al., 2017) enable efficient generation and computation of the Jacobian determinant by constructing each transformation such that only a subset of the random variables undergoes an affine transformation, using parameters computed from the remaining variables:

$$z_a, z_b = \mathrm{split}(z), \qquad \epsilon_a = z_a, \qquad \epsilon_b = z_b \odot s + t, \quad \text{where } (s, t) = g_{\text{param}}(z_a), \quad (4)$$

where $g_{\text{param}}$ can be arbitrarily complex as it need not be invertible. As invertibility is closed under function composition and the Jacobian determinant is multiplicative, increasingly flexible coupling flows can be constructed by stacking multiple flow layers and reordering the variables so that all of them are transformed.
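A single affine coupling layer can be written in a few lines: one half of the variables passes through unchanged and parameterizes the affine transform of the other half, which makes the inverse and the (triangular) Jacobian determinant trivial. The `g_param` below is a hypothetical stand-in for the Transformer layer used in the paper.

```python
import numpy as np

def coupling_forward(z, g_param):
    """One affine coupling layer, z -> eps. g_param(z_a) returns (log_s, t)
    and may be any function of z_a; it need not be invertible."""
    d = len(z) // 2
    z_a, z_b = z[:d], z[d:]
    log_s, t = g_param(z_a)
    eps = np.concatenate([z_a, z_b * np.exp(log_s) + t])
    log_det = float(np.sum(log_s))   # triangular Jacobian: det is the product of scales
    return eps, log_det

def coupling_inverse(eps, g_param):
    """Exact inverse: recompute (log_s, t) from the untouched half."""
    d = len(eps) // 2
    e_a, e_b = eps[:d], eps[d:]
    log_s, t = g_param(e_a)
    return np.concatenate([e_a, (e_b - t) * np.exp(-log_s)])

# Hypothetical g_param: any (non-invertible) function of z_a works.
g = lambda z_a: (np.tanh(z_a), z_a**2)

z = np.array([0.5, -1.0, 2.0, 0.3])
eps, log_det = coupling_forward(z, g)
z_rec = coupling_inverse(eps, g)
```

Stacking such layers with permutations between them (so every variable is eventually transformed) gives the multi-layer flows used for the prior.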

Knowledge Distillation
While most density estimators for sequence generation tasks are trained to maximize the log-likelihood of the training data, recent work has shown that it is possible to significantly improve the performance of non-autoregressive models by training them on the predictions of a pre-trained autoregressive model (Gu et al., 2018; van den Oord et al., 2018). While recent work found that distillation reduces the complexity of the training data, its effect on density estimation performance has not been studied.

Problem Definition
On a sequence generation task, a conditional density estimator $F \in \mathcal{H}$ (where $\mathcal{H}$ is a hypothesis set of the density estimators in §2) is trained to maximize the log-likelihood (or its approximation) of the training set $\{(x_n, y_n)\}_{n=1}^{N}$:

$$\mathcal{L}(F) = \frac{1}{N} \sum_{n=1}^{N} \log p_F(y_n \mid x_n).$$

Once training converges, the model $F$ is evaluated on the test set $\{(x_m, y_m)\}_{m=1}^{M}$ using a downstream metric $R$:

$$\mathcal{R}(F) = \frac{1}{M} \sum_{m=1}^{M} R(\hat{y}_m, y_m \mid x_m),$$

where $\hat{y}_m$ is the (approximate) most likely output under $F$ given $x_m$. To perform model selection, we can rank a set of density estimators $\{F_1, \ldots, F_K\}$ based on either the held-out log-likelihood or the downstream metric. We measure the correlation between the rankings given by the log-likelihood $\mathcal{L}(F)$ and the downstream metric $\mathcal{R}(F)$.
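Agreement between the two rankings can be quantified with a rank correlation such as Kendall's tau: +1 when every pair of models is ordered the same way by both criteria, -1 when the orderings are fully reversed. The score lists below are hypothetical.

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall rank correlation between two score lists over the same models
    (no ties assumed): fraction of concordant minus discordant pairs."""
    pairs = list(combinations(range(len(a)), 2))
    concordant = sum(
        1 if (a[i] - a[j]) * (b[i] - b[j]) > 0 else -1 for i, j in pairs
    )
    return concordant / len(pairs)

# Hypothetical held-out log-likelihoods and BLEU scores for three models:
# identical rankings give correlation 1.0.
ll = [-2.1, -1.8, -1.5]
bleu = [25.0, 27.3, 28.4]
tau = kendall_tau(ll, bleu)
```

Libraries such as scipy provide tie-aware versions (`scipy.stats.kendalltau`), which would be preferable on real score tables.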

Experimental Setup
On machine translation, we train several autoregressive models and latent variable models and analyze the correlation between their rankings based on log-likelihood and BLEU.
We use the preprocessing scripts with default hyperparameters from the tensor2tensor framework. Namely, we use wordpiece tokenization (Schuster and Nakajima, 2012) with 32K wordpieces on all datasets. For WMT'16 En↔Ro, we follow Sennrich et al. (2016) and normalize Romanian and remove diacritics before applying wordpiece tokenization. For training, we discard sentence pairs if either the source or the target length exceeds 64 tokens. As splitting along the time dimension (Ma et al., 2019) in the coupling flow layer requires that the length of the output sequence be a multiple of 2 at each level, <EOS> tokens are appended to the target sentence until its length is a multiple of 4.
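The <EOS>-padding step amounts to a small preprocessing function; the sketch below (with a hypothetical token list) shows the behavior.

```python
def pad_to_multiple(tokens, multiple=4, eos="<EOS>"):
    """Append <EOS> tokens until the sequence length is a multiple of
    `multiple`, as required by time-dimension splitting in the coupling
    flow layers (two levels of splitting by 2 -> multiple of 4)."""
    while len(tokens) % multiple != 0:
        tokens = tokens + [eos]
    return tokens

# Hypothetical 5-token target sentence, padded up to length 8.
padded = pad_to_multiple(["ein", "kleines", "haus", ".", "<EOS>"])
```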

Autoregressive Models
We use three Transformer (Vaswani et al., 2017) models of different sizes: Transformer-big (Tr-L), Transformer-base (Tr-B) and Transformer-small (Tr-S). The first two models have the same hyperparameters as in Vaswani et al. (2017). Transformer-small has 2 attention heads, 5 encoder and decoder layers, $d_{\text{model}} = 256$ and $d_{\text{filter}} = 1024$.

Latent Variable Models
The latent variable models in our experiments are composed of the source sentence encoder, length predictor, prior, decoder and posterior. The source sentence encoder is implemented with a standard Transformer encoder. Given the hidden states of the source sentence, the length predictor (a 2-layer MLP) predicts the length difference between the source and target sentences as a categorical distribution in [−30, 30]. We implement the decoder p θ (y|z, x) with a standard Transformer decoder that outputs the logits of all target tokens in parallel. The approximate posterior q φ (z|y, x) is implemented as a Transformer decoder with a final Linear layer with weight normalization (Salimans and Kingma, 2016) to output the mean and standard deviation (having dimensionality d latent ). Both the decoder and the approximate posterior attend to the source hidden states.

Diagonal Gaussian Prior
The diagonal Gaussian prior is implemented with a Transformer decoder which receives a sequence of positional encodings of length $T$ as input, and outputs the mean and standard deviation of each target token (of dimensionality $d_{\text{latent}}$). We train two models of different sizes: Gauss-base (Ga-B) and Gauss-large (Ga-L). Gauss-base has 4 attention heads, 3 posterior layers, 3 decoder layers and 6 encoder layers, whereas Gauss-large has 8 attention heads, 4 posterior layers, 6 decoder layers and 6 encoder layers.

Normalizing Flow Prior

The flow prior is implemented with Glow (Kingma and Dhariwal, 2018). We use a single Transformer decoder layer with a final Linear layer with weight normalization to parameterize $g_{\text{param}}$ in Eq. 4. This produces the shift and scale parameters for the affine transformation. Our flow prior has the multi-scale architecture with three levels (Dinh et al., 2017): at the end of each level, half of the latent variables are modeled with a standard Gaussian distribution. We use three split patterns and the multi-headed 1x1 convolution from Ma et al. (2019). We experiment with the following hyperparameter settings: Flow-small (Fl-S) with 12/12/8 flow layers in each level and Flow-base (Fl-B) with 12/24/16 flow layers in each level. The first level corresponds to the latent distribution and the last level corresponds to the base distribution. $(d_{\text{model}}, d_{\text{latent}}, d_{\text{filter}})$ is (320, 320, 640) for all experiments. For the Transformer decoder in $g_{\text{param}}$, we use 4 attention heads for Flow-small and 8 attention heads for Flow-base.

Training and Optimization
We use the Adam optimizer (Kingma and Ba, 2015) with the learning rate schedule used by Vaswani et al. (2017). The norm of the gradients is clipped at 1.0. We perform early stopping and choose the learning rate warmup steps and dropout rate based on the BLEU score on the development set. To train non-autoregressive models, the loss from the length predictor is minimized jointly with negative ELBO loss.
Knowledge Distillation

Following previous work (Kim and Rush, 2016; Gu et al., 2018; Lee et al., 2018), we construct a distilled dataset by decoding the training set using Transformer-base with beam width 4. For IWSLT'16 De→En, we use Transformer-small.

Latent Variable Models
To ease optimization of latent variable models (Bowman et al., 2016; Higgins et al., 2017), we set the weight of the KL term to 0 for the first 5,000 SGD steps and linearly increase it to 1 over the next 20,000 steps. Similarly to Mansimov et al. (2019), we find it helpful to add a small regularization term to the training objective that matches the approximate posterior with a standard Gaussian distribution, $\alpha \cdot \mathrm{KL}\big(q_\phi(z \mid y, x) \,\|\, \mathcal{N}(0, I)\big)$, as the original KL term $\mathrm{KL}\big(q_\phi(z \mid y, x) \,\|\, p_\theta(z \mid x)\big)$ does not have a local point minimum but a valley of minima. We find $\alpha = 10^{-4}$ to work best.
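The KL annealing schedule described above is piecewise linear in the step count and can be written as a small helper (the function name is ours):

```python
def kl_weight(step, zero_until=5000, one_at=25000):
    """Piecewise-linear KL annealing: weight 0 for the first 5,000 steps,
    then linearly increased to 1 over the next 20,000 steps."""
    if step <= zero_until:
        return 0.0
    if step >= one_at:
        return 1.0
    return (step - zero_until) / (one_at - zero_until)
```

The training loss at a given step would then be the reconstruction term plus `kl_weight(step)` times the KL term, plus the small $\alpha$-weighted regularizer.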

Flow Prior Models
We perform data-dependent initialization of actnorm parameters for the flow prior (Kingma and Dhariwal, 2018) at the 5,000-th step, which is at the beginning of KL scheduling.

Evaluation Metrics
Log-likelihood is the main metric for measuring density estimation (data modeling) performance. We compute exact log-likelihood for autoregressive models. For latent variable models, we estimate the marginal log-likelihood by importance sampling with 1K samples from the approximate posterior and using the ground truth target length.
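The importance-sampling estimate of the marginal log-likelihood reweights posterior samples by $p_\theta(y, z \mid x) / q_\phi(z \mid y, x)$ and is computed stably in log space. The sketch below takes precomputed per-sample log-densities (hypothetical values here) rather than a full model.

```python
import numpy as np

def is_log_marginal(log_p_joint, log_q):
    """Importance-sampling estimate of log p(y|x) from K posterior samples:
    log (1/K) sum_k exp(log p(y, z_k | x) - log q(z_k | y, x)),
    computed stably via the log-sum-exp trick."""
    w = np.asarray(log_p_joint) - np.asarray(log_q)
    m = np.max(w)
    return float(m + np.log(np.mean(np.exp(w - m))))

# If the proposal q matches the posterior exactly, every log-weight equals
# the true log-marginal and the estimate is exact for any K.
est = is_log_marginal([-3.2, -3.2, -3.2], [-1.0, -1.0, -1.0])
```

With an imperfect proposal, the estimate is a lower bound in expectation and tightens as the number of samples (1K in our setup) grows.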
BLEU measures the similarity (in terms of ngram overlap) between a generated output and a set of references, regardless of the model. It is a standard metric for generation quality of machine translation systems.
Generation Speed In addition to the quality-driven metrics, we measure the generation speed of each model as the number of sentences generated per second on a single V100 GPU.

Results

Table 1 presents the comparison of three model families (Transformer, Gauss, Flow) on five language pairs in terms of generation quality (BLEU) and log-likelihood (LL). We present two sets of results: one from models trained on raw data (Raw), and another from models trained on distilled data (Dist.) (which we mostly discuss in §5.2); results from Ma et al. (2019) are denoted with (*). We boldface the best log-likelihood overall and the best BLEU score among the latent variable models, and underscore the best BLEU score among the autoregressive models. We use the original test set in computing the log-likelihood and BLEU scores of the distilled models, so the results are comparable with the undistilled models. We make two main observations:

Correlation between rankings of models
1. Log-likelihood is highly correlated with BLEU when considering models within the same family.

(a) Among autoregressive models (Tr-S, Tr-B and Tr-L), there is a perfect correlation between log-likelihood and BLEU. On all five language pairs (undistilled), the rankings of autoregressive models based on log-likelihood and BLEU are identical.
(b) Among non-autoregressive latent variable models with the same prior distribution, there is a strong but not perfect correlation. Between Gauss-large and Gauss-base, the model with higher held-out log-likelihood also gives higher BLEU on four out of five datasets. Similarly, Flow-base gives higher log-likelihood and BLEU score than Flow-small on all datasets except WMT'14 De→En.
2. Log-likelihood is not correlated with BLEU when comparing models from different families.

(a) Between latent variable models with different prior distributions, we observe no correlation between log-likelihood and BLEU. On four out of five language pairs (undistilled), Flow-base gives much higher log-likelihood but a similar or worse BLEU score than Gauss-base. With distillation, Gauss-large considerably outperforms Flow-base in BLEU on all datasets, while Flow-base gives better log-likelihood.
(b) Overall, autoregressive models offer the best translation quality but not the best modeling performance. In fact, Flow-base model with a non-autoregressive decoder gives the highest held-out log-likelihood on all datasets.
Correlation between log-likelihood and BLEU across checkpoints Table 2 presents the correlation between log-likelihood and BLEU across the training checkpoints of several models. The findings are similar to Table 1: for Transformer-base, there is almost perfect correlation (0.926) across the checkpoints. For Gauss-base and Flow-base, we observe strong but not perfect correlation (0.831 and 0.678). Overall, these findings suggest that there is a high correlation between log-likelihood and BLEU when comparing models within the same family. We discuss the correlation for models trained with distillation below in §5.2.

Knowledge Distillation
In Table 2, we observe a strong negative correlation between log-likelihood and BLEU across the training checkpoints of several density estimators trained with distillation. Indeed, distillation severely hurts density estimation performance on all datasets (see Table 1). In terms of generation quality, it consistently improves non-autoregressive models, yet the amount of improvement varies across models and datasets. On WMT'14 En→De and WMT'14 De→En, distillation gives a significant 7-9 BLEU increase for diagonal Gaussian prior models, but the improvement is relatively smaller on other datasets. Flow prior models benefit less from distillation: only 3-4 BLEU points on WMT'14 En↔De and less on other datasets. For autoregressive models, distillation results in a slight decrease in generation performance.

Iterative inference on Gaussian vs. flow prior
We analyze the effect of iterative inference on the Gaussian and the flow prior models. Table 3 shows that iterative refinement improves BLEU and ELBO for both Gaussian prior and flow prior models, but the gain is relatively smaller for the flow prior model.
Visualization of latent space In Figure 1, we visualize the latent space of the approximate posterior, the prior and the delta posterior of the latent variable models using t-SNE (van der Maaten, 2014). It is clear from the figures that the delta posterior of Gauss-base has high overlap with the approximate posterior, while the overlap is relatively low for Flow-small. We conjecture that while the loss surface of the ELBO contains many local optima that we can reach via iterative refinement, not all of them share the support of the approximate posterior density (and hence correspond to data). This is particularly pronounced for the flow prior model.

Generation speed and model size
We compare performance, generation speed and size of various models in Table 4. While autoregressive models offer the best translation quality, inference is inherently sequential and slow. Decoding from non-autoregressive latent variable models is much more efficient, and requires constant time with respect to sequence length given parallel computation. Compared to Transformer-base, Gauss-large with 1 step of iterative inference improves generation speed by 6x, at the cost of 2.6 BLEU. On WMT'14 De→En, the performance degradation is 1.9 BLEU. Flow prior models perform much worse than the Gaussian prior models despite having more parameters and slower generation speed.

Related Work
For sequence generation, the gap between log-likelihood and the downstream metric has long been recognized. To address this discrepancy between density estimation and approximate inference (generation), there have largely been two lines of prior work: (1) structured perceptron training for conditional random fields (Lafferty et al., 2001; Collins, 2002; Liang et al., 2006) and (2) empirical risk minimization with approximate inference (Valtchev et al., 1997; Povey and Woodland, 2002; Och, 2003; Fu and Juang, 2007; Stoyanov et al., 2011; Hopkins and May, 2011; Shen et al., 2016). More recent work proposed to train neural sequence models directly on task-specific losses using reinforcement learning (Ranzato et al., 2016; Bahdanau et al., 2017; Jaques et al., 2017) or adversarial training (Goyal et al., 2016). Despite such a plethora of work on bridging the gap between log-likelihood and the downstream task, the exact correlation between the two has not been well established. Our work investigates this correlation for neural sequence models (autoregressive models and latent variable models) in machine translation. Among autoregressive models for open-domain dialogue, concurrent work (Adiwardana et al., 2020) found a strong correlation between perplexity and a human evaluation metric that rewards sensibleness and specificity. This confirms part of our finding that log-likelihood is highly correlated with the downstream metric when we consider models within the same family.
Our work is inspired by recent work on latent variable models for non-autoregressive neural machine translation (Gu et al., 2018;Lee et al., 2018;Kaiser et al., 2018). Specifically, we compare continuous latent variable models with a diagonal Gaussian prior (Shu et al., 2019) and a normalizing flow prior (Ma et al., 2019). We find that while having an expressive prior is beneficial for density estimation, a simple prior delivers better generation quality while being smaller and faster.

Conclusion
In this work, we investigate the correlation between log-likelihood and the downstream evaluation metric for machine translation. We train several autoregressive models and latent variable models on five language pairs from three machine translation datasets (WMT'14 En↔De, WMT'16 En↔Ro and IWSLT'16 De→En), and find that the correlation between log-likelihood and BLEU changes drastically depending on the range of model families being compared: Among the models within the same family, log-likelihood is highly correlated with BLEU. Between models of different families, however, we observe no correlation: the flow prior model gives higher held-out log-likelihood but similar or worse BLEU score than the Gaussian prior model. Furthermore, autoregressive models give the highest BLEU scores overall but the latent variable model with a flow prior gives the highest test log-likelihoods on all datasets.
In the future, we will investigate the factors behind this discrepancy. One possibility is the inherent difficulty of inference for latent variable models, which might be resolved by designing better inference algorithms. We will also explore if the discrepancy is mainly caused by the difference in the decoding distribution (autoregressive vs. factorized) or the training objective (maximum likelihood vs. ELBO).