Addressing Posterior Collapse with Mutual Information for Improved Variational Neural Machine Translation

This paper proposes a simple and effective approach to address the problem of posterior collapse in conditional variational autoencoders (CVAEs). It thus improves performance of machine translation models that use noisy or monolingual data, as well as in conventional settings. Extending Transformer and conditional VAEs, our proposed latent variable model measurably prevents posterior collapse by (1) using a modified evidence lower bound (ELBO) objective which promotes mutual information between the latent variable and the target, and (2) guiding the latent variable with an auxiliary bag-of-words prediction task. As a result, the proposed model yields improved translation quality compared to existing variational NMT models on WMT Ro↔En and De↔En. With latent variables being effectively utilized, our model demonstrates improved robustness over non-latent Transformer in handling uncertainty: exploiting noisy source-side monolingual data (up to +3.2 BLEU), and training with weakly aligned web-mined parallel data (up to +4.7 BLEU).


Introduction
The conditional variational autoencoder (CVAE; Sohn et al., 2015) is a conditional generative model for structured prediction tasks like machine translation. This model, learned by variational Bayesian methods (Kingma and Welling, 2014), can capture global signal about the target in its latent variables. Unfortunately, variational inference for text generation often yields models that ignore their latent variables (Bowman et al., 2016), a phenomenon called posterior collapse.
In this paper, we introduce a new loss function for CVAEs that counteracts posterior collapse, motivated by our analysis of CVAE's evidence lower bound objective (ELBO). Our analysis (§2) reveals that optimizing ELBO's second term not only brings the variational posterior approximation closer to the prior, but also decreases mutual information between latent variables and observed data. Based on this insight, we modify CVAE's ELBO in two ways (§3): (1) we explicitly add a principled mutual information term back into the training objective, and (2) we use a factorized decoder, which also predicts the target bag-of-words as an auxiliary decoding distribution to regularize our latent variables. Our objective is effective even without Kullback-Leibler (KL) annealing (Bowman et al., 2016), a strategy for iteratively altering ELBO over the course of training to avoid posterior collapse.
In applying our method to neural machine translation (NMT; Kalchbrenner and Blunsom, 2013; Sutskever et al., 2014), we find that we have measurably mitigated posterior collapse. The latent variables are not ignored, even in the presence of a powerful Transformer decoder. By addressing this problem, the resulting NMT model has improved robustness and performance in low-resource scenarios. Noisy data like those scraped from the Internet (Smith et al., 2013; Michel and Neubig, 2018) present a challenge for NMT (Khayrallah and Koehn, 2018; Ott et al., 2018a); we are measurably more able to model this extrinsic uncertainty than the (non-latent) Transformer or existing variational NMT with the CVAE architecture. Finally, we extend the model to semi-supervised learning (Cheng et al., 2016) to more effectively learn from monolingual data.
In summary, our conditional text generation model overcomes posterior collapse by promoting mutual information. It can easily and successfully integrate noisy and monolingual data, and it does so without sacrificing BLEU relative to non-latent NMT in typical settings.

Formalism and Mathematical Analysis
Here we review the standard framework for neural MT. Next, we connect this to the conditional variational autoencoder, a model with latent random variables whose distributions are learned by black-box variational Bayesian inference. Finally, we analyze the CVAE's objective to explain why these models will ignore their latent variables ("posterior collapse").

Neural Machine Translation
Problem instances in machine translation are pairs of sequences (x = [x_1, ..., x_m], y = [y_1, ..., y_n]), where x and y represent the source and target sentences, respectively. Conventionally, a neural machine translation model is a parameterized conditional distribution whose likelihood factors in an autoregressive fashion:

\[ p_\theta(y \mid x) = \prod_{t=1}^{n} p_\theta(y_t \mid x, y_{<t}). \tag{1} \]

The dominant translation paradigm first represents the source sentence as a sequence of contextualized vectors (using the encoder), then decodes this representation into a target hypothesis according to Equation 1. The parameters θ are learned by optimizing the log-likelihood of training pairs with stochastic gradient methods (Bottou and Cun, 2004; Kingma and Ba, 2015). Decoding is deterministic, using an efficient approximate search like beam search (Tillmann and Ney, 2003). The Transformer architecture with multi-head attention has become the state of the art for NMT.
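For concreteness, Equation 1's autoregressive factorization just sums per-step log-probabilities. A minimal sketch with a toy vocabulary (the probabilities here are made up, standing in for the decoder's softmax outputs):

```python
import numpy as np

def sequence_log_likelihood(step_distributions, target_ids):
    """Log-likelihood of a target under Equation 1's autoregressive
    factorization: sum_t log p(y_t | x, y_<t). Each entry of
    step_distributions is the decoder's softmax output at step t,
    already conditioned on the source x and the prefix y_<t."""
    return sum(np.log(p[t]) for p, t in zip(step_distributions, target_ids))

# Toy 3-step target over a 4-word vocabulary (made-up probabilities).
steps = [np.array([0.7, 0.1, 0.1, 0.1]),
         np.array([0.2, 0.6, 0.1, 0.1]),
         np.array([0.1, 0.1, 0.1, 0.7])]
ll = sequence_log_likelihood(steps, [0, 1, 3])
```

Training maximizes this quantity (summed over the corpus) with stochastic gradients.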

The Conditional Variational Autoencoder
Our NMT approach extends the conditional variational autoencoder (Sohn et al., 2015), which we identify as a generalization of Variational NMT. It introduces a latent random variable z into the standard NMT conditional distribution from Equation 1:¹,²

\[ p_\theta(y \mid x) = \int_z p_\theta(y \mid z, x)\, p_\theta(z \mid x)\, dz. \tag{2} \]

For a given source sentence x, first a latent variable z is sampled from the encoder, then the target sentence y is generated by the decoder: z ∼ p_θ(z | x), y ∼ p_θ(y | z, x).³ It is intractable to marginalize Equation 2 over z. Instead, the CVAE training objective is a variational lower bound (the ELBO) of the conditional log-likelihood. It relies on a parametric approximation of the model posterior: q_φ(z | x, y). The variational family we choose for q is a neural network whose parameters φ are shared (i.e., amortized) across the dataset.

¹ By contrast, the hidden states of a standard sequence-to-sequence model are deterministic latent variables.
² In Equation 2 we assume a continuous latent variable. For the discrete case, replace integration with summation.
The ELBO lower-bounds the log-likelihood, as can be proven with Jensen's inequality. Its form is:

\[ \mathcal{L}_{\mathrm{CVAE}} = \mathbb{E}_{q_\phi(z \mid x, y)}\!\left[\log p_\theta(y \mid x, z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\right), \tag{3} \]

where D_KL represents the Kullback-Leibler divergence between two distributions.
We use amortized variational inference to simultaneously perform learning and approximate posterior inference, updating both θ and φ with stochastic gradient methods. Improving θ raises the lower bound, and improving φ keeps the bound tight with respect to the model conditional log-likelihood. The same argument pertains to the joint maximization interpretation of the expectation-maximization (EM) algorithm (Neal and Hinton, 1998). (Our optimization is a variational generalization of EM.)
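As a toy illustration of the bound (all distributions here are hypothetical one-dimensional Gaussians; in the paper, q and p are neural networks over sentences), the KL term has a closed form and the reconstruction term is a Monte Carlo average over reparameterized samples:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy one-dimensional Gaussian "CVAE" (all pieces hypothetical):
# prior p(z | x) = N(0, 1); approximate posterior q(z | x, y) = N(mu, sigma^2).
mu, sigma = 0.5, 0.8
y_obs = 1.0

def log_lik(y, z):
    # Decoder p(y | x, z) = N(y; z, 1), a stand-in for the NMT decoder.
    return -0.5 * np.log(2 * np.pi) - 0.5 * (y - z) ** 2

# KL(N(mu, sigma^2) || N(0, 1)) in closed form.
kl = 0.5 * (mu**2 + sigma**2 - 1.0 - 2.0 * np.log(sigma))

# Reconstruction term: Monte Carlo over reparameterized samples z = mu + sigma * eps.
z = mu + sigma * rng.standard_normal(100_000)
elbo = log_lik(y_obs, z).mean() - kl
```

Because q is deliberately mismatched from the true posterior here, the estimate sits strictly below the true conditional log-likelihood, illustrating that the ELBO is a lower bound that tightens as q improves.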

Posterior Collapse
Despite their success when applied to computer vision tasks, variational autoencoders in natural language generation suffer from posterior collapse, where the learnt latent code is ignored by a strong autoregressive decoder. This presents a challenge to conditional language generation tasks in NLP like machine translation.
The phenomenon can be explained mathematically by an analysis of the ELBO objective, as well as from the perspective of a powerful decoder that can model the true distribution without needing the latent code. We consider both in this subsection.
ELBO surgery Recall that the computed objective approximates the objective on the true data distribution p_D, using a finite number of samples (see, e.g., Brown et al., 1992):

\[ \mathcal{L}_{\mathrm{CVAE}} \approx \mathbb{E}_{p_D(x,y)}\!\left[ \mathbb{E}_{q_\phi(z \mid x, y)}[\log p_\theta(y \mid x, z)] - D_{\mathrm{KL}}(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)) \right]. \tag{4} \]

We can factor the KL term of Equation 3 (omitting parameter subscripts) as:

\[ \mathbb{E}_{p_D(x,y)}\!\left[ D_{\mathrm{KL}}(q(z \mid x, y) \,\|\, p(z)) \right] = I_q(z; x, y) + D_{\mathrm{KL}}(q(z) \,\|\, p(z)), \tag{5} \]

which we prove in Appendix A, following Hoffman and Johnson (2016).
As both the resulting mutual information and KL terms are non-negative (Cover and Thomas, 2006), the global minimum of Equation 5 is I_{q_φ}(z; x, y) = D_KL(q_φ(z) ‖ p(z)) = 0. Unfortunately, at this point, the consequence of the optimization is that the latent variable z is conditionally independent of the data (x, y).
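The decomposition in Equation 5 can be checked numerically on a small discrete example: with a uniform prior and a handful of per-example posteriors, the average per-example KL equals the mutual information plus the KL of the aggregated posterior (the sizes and random posteriors below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

L, Z = 5, 4                            # 5 training examples, 4 latent categories
p_z = np.full(Z, 1.0 / Z)              # prior p(z), uniform for simplicity
q = rng.dirichlet(np.ones(Z), size=L)  # q(z | example), one row per example

def kl(a, b):
    return float(np.sum(a * np.log(a / b)))

# Left-hand side of Equation 5: expected per-example KL to the prior.
lhs = np.mean([kl(q[i], p_z) for i in range(L)])

# Right-hand side: mutual information plus KL of the aggregated posterior.
q_z = q.mean(axis=0)                             # aggregated posterior q(z)
mi = np.mean([kl(q[i], q_z) for i in range(L)])  # I_q(z; index) = I_q(z; x, y)
rhs = mi + kl(q_z, p_z)
```

Both sides agree to floating-point precision, and the identity makes the collapse incentive visible: driving the left-hand side to zero forces the (non-negative) mutual information term to zero as well.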
A powerful decoder Revisiting Equation 3, we see that the decoder is conditioned on both the stochastic latent variable z and the source text x. A sufficiently high-capacity autoregressive decoder can model the conditional density directly, ignoring the latent variable and reducing inference to Equation 1. The KL term can then be reduced to its minimum (0) by equating the posterior to the prior. To prevent this, some work weakens the decoder in various ways. This is a challenge, because NMT requires a powerful decoder such as Transformer with direct attention to the encoder.

An Information-Infused Objective
We modify our training objective to explicitly retain mutual information between the latent variable z and the observation (x, y). Further, we use an auxiliary decoder that only uses the latent variable, not the encoder states. We combine it with the existing decoder as a mixture of softmaxes (Yang et al., 2018a). The model is trained with amortized variational inference. When source-language monolingual text is available, we augment our modified CVAE objective with a similarly modified (non-conditional) VAE objective. The training and inference strategy is summarized in Figure 1.
Adding I_{q_φ}(z; x, y) to ELBO

To combat the optimization dilemma from Equation 5 (namely, that the objective discourages mutual information between the latent variable and the data), we explicitly add the mutual information term to the CVAE's ELBO and obtain a new training objective:

\[ \mathcal{L}_{\mathrm{MICVAE}} = \mathcal{L}_{\mathrm{CVAE}} + I_{q_\phi}(z; x, y) = \mathbb{E}_{q_\phi(z \mid x, y)}[\log p_\theta(y \mid x, z)] - D_{\mathrm{KL}}\!\left(q_\phi(z) \,\|\, p_\theta(z)\right). \tag{6} \]

The new training objective L_MICVAE aims to match the aggregated approximate posterior distribution of the latent variable q_φ(z) (Hoffman and Johnson, 2016) to the aggregated-posterior prior distribution p_θ(z).⁴

Guiding z to Encode Global Information
Several existing approaches weaken the decoder, limiting its capacity to encourage latent variables to be utilized (Bowman et al., 2016; Gulrajani et al., 2017). Here we propose a different approach: explicitly guiding the information encoded in z without reducing the decoder's capacity. The decision to weaken the decoder can be understood in the context of Bits-Back Coding theory, which suggests that at optimality the decoder will model whatever it can locally, and only the residual will be encoded in the latent variable z. A consequence is that explicit information placement can give more powerful latent representations.
Inspired by this Bits-Back perspective, we add a global auxiliary loss for z to encode information which cannot be modelled locally by the autoregressive decoder ∏_t p_θ(y_t | x, y_{<t}, z). We use bag-of-words (BoW) prediction as the auxiliary loss. It encodes global information while having a non-autoregressive factorization: ∏_t p_ψ(y_t | z).
(We choose not to condition it on the source sentence x.) Further, it requires no additional annotated data. The auxiliary decoder complements the autoregressive decoder (which is locally factorized), interpolating predictions at the softmax layer, i.e., p(y_t | x, y_{<t}, z) is a mixture of softmaxes (Yang et al., 2018b):

\[ p(y_t \mid x, y_{<t}, z) = (1 - \lambda)\, p_\theta(y_t \mid x, y_{<t}, z) + \lambda\, p_\psi(y_t \mid z), \tag{7} \]

with mixing parameter λ. (We use λ = 0.1 in this paper.) Thus, the bag-of-words objective regularizes the log-likelihood bound.
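A minimal sketch of this interpolation, with hypothetical logits standing in for the two decoders' outputs:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mixture_prediction(decoder_logits, bow_logits, lam=0.1):
    """Interpolate the autoregressive decoder and the BoW decoder at the
    probability level (a mixture of softmaxes). Mixing probabilities,
    not logits, keeps the result a proper distribution."""
    return (1.0 - lam) * softmax(decoder_logits) + lam * softmax(bow_logits)

# Hypothetical logits over a 3-word vocabulary.
p = mixture_prediction(np.array([2.0, 0.5, -1.0]), np.array([0.0, 1.0, 0.0]))
```

Because the BoW component receives gradient through every target position, the latent variable z cannot be ignored without paying a cost in the mixture's likelihood.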
Implementing Latent Variable NMT

Architecture

Our model uses discrete latent variables. These are used to select a latent embedding, which is concatenated to the decoder state.
Inference Network We use discrete latent variables with reparameterization via Gumbel-Softmax (Jang et al., 2017; Maddison et al., 2017) to allow backpropagation through discrete sampling. Unlike the multivariate Gaussian distribution commonly used in VAE and CVAE, our parameterization can explicitly account for multiple modes in the data. (See Rezende and Mohamed (2015) for a perspective on the value of multimodal distributions over latent variables.) To make our model more general, we introduce a set of discrete latent variables z = {z_1, ..., z_K} which are independently sampled from their own inference networks Φ_k. Specifically, each Φ_k computes scaled dot-product attention with encoder outputs h ∈ R^d using latent code embedding e_k:

\[ \pi_k = \mathrm{softmax}\!\left( \frac{e_k h^\top}{\sqrt{d}} \right). \tag{8} \]

We can now sample z_k by the Gumbel-Softmax reparameterization trick (Maddison et al., 2017; Jang et al., 2017):

\[ z_k = \mathrm{softmax}\!\left( \frac{\log \pi_k + g}{\tau} \right), \tag{9} \]

where g = −log(−log(u)), u ∼ Uniform(0, 1) is the Gumbel noise and τ is a fixed temperature. (We use τ = 1 in this paper.) At inference time, we use a discrete version by directly sampling from the latent variable distribution.

⁴ This addresses the mismatch between the (joint) data distribution p_D(x, y) and the (conditional) likelihood objective p_θ(y | x).
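A minimal numpy sketch of the relaxed draw; the logits below are placeholders for log π_k from the inference network:

```python
import numpy as np

rng = np.random.default_rng(0)

def gumbel_softmax(log_pi, tau=1.0):
    """Relaxed categorical sample: softmax((log pi + g) / tau), where
    g = -log(-log(u)) with u ~ Uniform(0, 1) is Gumbel noise. Smaller
    tau pushes samples closer to one-hot vectors."""
    u = rng.uniform(low=1e-12, high=1.0, size=log_pi.shape)
    g = -np.log(-np.log(u))
    y = (log_pi + g) / tau
    e = np.exp(y - y.max())
    return e / e.sum()

log_pi = np.log(np.array([0.7, 0.2, 0.1]))  # stand-in for the inference network's pi_k
soft_z = gumbel_softmax(log_pi, tau=1.0)    # training: differentiable relaxation
hard_z = int(np.argmax(log_pi - np.log(-np.log(rng.uniform(size=3)))))  # discrete Gumbel-max draw
```

The soft sample is what gradients flow through during training; the hard (argmax) variant corresponds to the discrete draw used at inference time.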
BoW Auxiliary Decoder Given an inferred sample z ∼ Φ_k(h), the BoW decoder predicts all tokens at once without considering their order. We compute the cross-entropy loss for the predicted tokens over the output vocabulary space V:

\[ \mathcal{L}_{\mathrm{BoW}} = -\sum_{i=1}^{|V|} p_i \log \hat{p}_\psi(y_i \mid z). \tag{11} \]

We take the (unnormalized) empirical distribution p̃_i to be a token's frequency within a sentence normalized by its total frequency within a minibatch, mitigating the effect of frequent (stop) words. This is then normalized over the sentence to sum to 1, giving values p_i. The model distribution p̂_ψ is computed by conditioning on the latent code only, without direct attention to encoder outputs. We use scaled dot-product attention between the latent embeddings and the target embeddings (each of dimensionality d, represented as a matrix E_V):

\[ \hat{p}_\psi(y \mid z) = \mathrm{softmax}\!\left( \frac{z E_V^\top}{\sqrt{d}} \right). \tag{12} \]

Algorithm 1 Training Strategy
1: while not converged do
2:   Sample (x, y) from D_bitext
3:   Compute L_MICVAE with Equation 6
4:   Compute L_BoW with Equation 11
5:   if self-training then
6:     Sample x from D_mono
7:     Compute L_Mono with Equation 13
8:   end if
9:   Update θ, φ, ψ with gradients of the summed losses
10: end while
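The frequency-discounted BoW target described above can be sketched as follows (toy whitespace tokens; the real model works over BPE vocabulary indices):

```python
import math
from collections import Counter

def bow_targets(sentence, batch):
    """Frequency-discounted bag-of-words targets (a sketch of the scheme
    above): each token's in-sentence count is divided by its total count
    in the minibatch, then the weights are renormalized to sum to 1."""
    batch_counts = Counter(tok for sent in batch for tok in sent)
    weights = {tok: c / batch_counts[tok] for tok, c in Counter(sentence).items()}
    total = sum(weights.values())
    return {tok: w / total for tok, w in weights.items()}

def bow_loss(targets, log_probs):
    """Cross-entropy against the BoW decoder: -sum_i p_i log p_hat(y_i | z)."""
    return -sum(p * log_probs[tok] for tok, p in targets.items())

batch = [["the", "cat", "sat"], ["the", "dog"]]
targets = bow_targets(batch[0], batch)  # "the" is down-weighted: it occurs twice in the batch
```

Down-weighting batch-frequent tokens keeps stop words from dominating the auxiliary loss, so z is pushed to encode content words.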

Training
For training with parallel data, we optimize L MICVAE . We draw samples z from the approximate posterior q φ (z | x, y) parameterized by the inference network, then feed the samples to both the autoregressive and auxiliary (BoW) decoders to get a Monte Carlo estimate of the gradient.
Semi-supervised learning We apply the same modification to VAE's ELBO, following Zhao et al. (2019). For jointly training with source-side monolingual data, we add I_{q_φ}(z; x) to the ELBO, and for target-side monolingual data, we add I_{q_φ}(z; y).⁵ For source-side monolingual data, the modified VAE objective is:

\[ \mathcal{L}_{\mathrm{Mono}} = \frac{1}{L} \sum_{\ell=1}^{L} \mathbb{E}_{q_\phi(z \mid x^{(\ell)})}\!\left[\log p_\theta(x^{(\ell)} \mid z)\right] - D_{\mathrm{KL}}\!\left(q_\phi(z) \,\|\, p_\theta(z)\right), \tag{13} \]

where L is the number of monolingual examples. The joint objective sums the modified CVAE and VAE objectives:

\[ \mathcal{L}_{\mathrm{Joint}} = \mathcal{L}_{\mathrm{MICVAE}} + \mathcal{L}_{\mathrm{Mono}}. \tag{14} \]

Algorithm 1 describes the overall training strategy.

⁵ Learning to copy the target text has proven useful for low-resource NMT (Currey et al., 2017).
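Algorithm 1's control flow can be sketched as follows, with constant stubs standing in for the real loss computations of Equations 6, 11, and 13 (all function names and values here are hypothetical):

```python
import random

# Stub losses standing in for the real computations (hypothetical values):
def loss_micvae(x, y): return 1.0  # Equation 6 on a bitext pair
def loss_bow(x, y):    return 0.5  # Equation 11, BoW regularizer
def loss_mono(x):      return 0.3  # Equation 13 on monolingual text

def training_step(bitext, mono, semi_supervised=True):
    """One iteration of the training strategy: CVAE + BoW losses on a
    bitext sample, plus the monolingual VAE loss when semi-supervised
    training is on (their sum is the joint objective of Equation 14)."""
    x, y = random.choice(bitext)
    total = loss_micvae(x, y) + loss_bow(x, y)
    if semi_supervised and mono:
        total += loss_mono(random.choice(mono))
    return total

step = training_step([("src sentence", "tgt sentence")], ["mono source sentence"])
```

In the real system each stub returns a differentiable scalar, and the summed loss is backpropagated through θ, φ, and ψ each step.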

Experiments and Results
Here we present empirical results on the Transformer architecture. We evaluate our model on four standard datasets and compare against three baselines. We use four measures to quantify posterior collapse, then examine translation quality (BLEU score) in standard fully supervised settings, a semi-supervised setting, and a fully supervised setting with noisy source text. Hyperparameters, regularization choices, and subword vocabulary information can be found in §5.3.
The results show that we have effectively addressed posterior collapse: latent variables are no longer ignored despite the presence of a powerful decoder. As a result, we outperform both the standard Transformer and the Transformer-based variational NMT approach, when using noisy data or source-language monolingual data.

Datasets
First, we evaluate our models on standard high-resource and low-resource benchmark datasets from WMT. Second, we focus on situations where noisy or monolingual data is available. We note that low-resource scenarios and noisy data are two representative challenges in MT (Lopez and Post, 2013).
WMT14 German-English We use data from the WMT14 news translation shared task, which has 3.9M sentence pairs for training with the same BPE tokenization as in Gu et al. (2018).
WMT16 Romanian-English We use data from the WMT16 news translation shared task. We use the same BPE-preprocessed (Sennrich et al., 2016b) train, dev and test splits as in Gu et al. (2018) with 608k sentence pairs for training.
FLORES Sinhala-English For this low-resource benchmark, we use the same preprocessed data as in Guzmán et al. (2019). There are 646k sentence pairs.

MT for Noisy Text (MTNT) French-English
This dataset pairs web-scraped text from Reddit with professional translations. We use 30k subword units built jointly from source and target sentences and only keep sentences with less than 100 tokens. For training, there are 34,380 sentence pairs for English-French and 17,616 sentence pairs for French-English (Michel and Neubig, 2018). We also used 18,676 monolingual sentences per language from the same data source (Reddit).

Baselines
We compare our model to three baselines:

Non-latent A standard Transformer model without latent variables.

VNMT A CVAE model with Gaussian latent variables (discussed further in the related work).

DCVAE A CVAE model with the same discrete latent variable parameterization as ours but without the new objective (i.e., the mutual information term and bag-of-words regularizer).

Implementation details
All of our models build on Transformer. For WMT14 De-En and WMT16 Ro-En, we use the base configuration: 6 blocks, with 512-dimensional embeddings, 2048-dimensional feed-forward networks, and 8 attention heads. For FLoRes (low-resource) and MTNT (low-resource and noisy), we use a smaller Transformer: 4 layers, 256-dimensional embeddings, 1024-dimensional inner layers, and 4 attention heads. Input and output embeddings are shared between the inference network and decoder. We use K = 4 categorical latent variables of dimension 16 (found by grid search on the dev set). Auxiliary bag-of-words predictions are combined with the decoder prediction with λ = 0.1. We optimize using Adam (Kingma and Ba, 2015) with β_1 = 0.9, β_2 = 0.98, ε = 1e-8, weight decay of 0.001, and the warmup and learning rate schedule of Ott et al. (2018b). All models are trained on 8 NVIDIA V100 GPUs with 32K tokens per mini-batch. We train WMT14 De-En with 200k updates and the others with 100k updates. We do not use early stopping. We employ joint BPE vocabularies: 32k for En-De and En-Ro; 30k for Fr-En; and 3k for Si-En. We also use a word dropout rate of 0.4 during training of all models, which is complementary to our approach.
We found the default initialization in the FAIRSEQ NMT toolkit was effective; we did not need to explore several initializations to avoid degenerate models. (Caption fragment: the reported KL is computed against p_θ(z | x) for DCVAE, and as D_KL(q_φ(z | y) ‖ p_θ(z | x)) for our model.)

Preventing Posterior Collapse
We compare our model to a standard DCVAE lacking the new objective. We report four metrics of posterior collapse on the validation set of WMT Ro-En:

1. Kullback-Leibler divergence (KL).
2. Mutual information between the latent variable and the source: I_{q_φ}(z; x).
3. Mutual information between the latent variable and the target: I_{q_φ}(z; y).
4. Negative conditional log-likelihood (NLL) per token.

Table 1 shows that when using the standard DCVAE ELBO, even with the common practice of KL annealing (KLA), both the KL loss and mutual information settle to almost 0, which is consistent with the analysis in Equation 5. We also plot the progression of D_KL, I_{q_φ}(z; x), and I_{q_φ}(z; y) during training in Figure 2. The posterior collapse of the baseline model is apparent: both mutual information terms and D_KL drop to 0 at the beginning of training as a result of ELBO's design. On the other hand, our model, without using any annealing schedule, effectively increases mutual information and prevents the KL loss from settling to a degenerate solution early on.

Translation Quality
We report corpus-level BLEU (Papineni et al., 2002)⁶ on the test sets, where the translations are generated by sampling each z_k with soft assignment (vs. argmax).
Supervised Learning on Parallel Data First, we evaluate our model's performance when trained with parallel data on standard WMT datasets. Table 2 shows that our model consistently outperforms both the VNMT and DCVAE models.

Semi-supervised with Source-side Monolingual Data Leveraging monolingual data is a common practice to improve low-resource NMT. One popular approach uses target-side monolingual data through "backtranslation" as a data augmentation, but how to effectively leverage source-side monolingual data is an open challenge (Sennrich et al., 2016a; Zhang and Zong, 2016; Wu et al., 2019). We use the joint training objective described in Equation 14. For a fair comparison, we also extend VNMT and DCVAE with the same joint training algorithm, i.e., the newly added monolingual data is used to train their corresponding sequence encoder and inference network with the standard VAE ELBO. That is, the only difference is that our model was trained to promote the mutual information I_{q_φ}(z; x) and I_{q_φ}(z; y). As shown in Table 3, the proposed model brings larger gains during semi-supervised learning with source-side monolingual data.

Robustness to Noisy Data While high-quality parallel data is scarce for low-resource language pairs, weakly aligned sentence pairs can be mined from massive unpaired data such as Paracrawl.⁷ We evaluate our model's performance when augmenting the training set with increasingly noisy parallel data filtered by Zipporah (Xu and Koehn, 2017). Because VNMT and DCVAE underperform our proposal in previous experiments, we omit them from this experiment. Figure 3 shows the results in the Sinhala-English direction. Our model always outperforms the standard Transformer, which struggles as more (and noisier) data is added. The gap grows from +1.2 to +4.7 BLEU.

Analysis
Ablation Study How do the different ingredients of our proposed approach contribute to preventing posterior collapse and improving translation quality? We explore two variants of the proposed model: 1) modified ELBO only: adding only the mutual information term to the training objective, without gradients from L_BoW; 2) BoW only: equivalent to DCVAE combined with the BoW decoder. First, we perform the same collapse-metric evaluation as in Table 1. Figure 2(B) suggests that by explicitly adding the mutual information term back to the training objective, both I_{q_φ}(z; x) and I_{q_φ}(z; y) are effectively raised, while the remaining aggregated KL term is still optimized to zero. Such behavior is consistent with the analysis in Equation 5. On the other hand, regularizing z with the BoW decoder only, shown in Figure 2(C), is very effective in preventing KL vanishing as well as increasing mutual information. When the two approaches are combined, as shown in Figure 2(A), the model retains higher mutual information for both I_{q_φ}(z; x) and I_{q_φ}(z; y). Next, we examine whether the difference in mutual information yields different translation quality. We compare two models, BoW only (Figure 2(C)) and both (Figure 2(A)), on the WMT14 De-En and WMT16 Ro-En test sets. Table 4 shows the difference matters more in a low-data regime.

⁷ https://paracrawl.eu/
Analysis of Outputs Delving into model predictions helps us understand how our model outperforms the others. We examined erroneous 1-best predictions on the Ro-En data. We provide salient examples of phenomena we identified in Table 5. (Naturally, as the Ro-En score differences are not dramatic, the predictions are largely similar.) Several examples support the fact that our model has more fluent and accurate translations than the baseline or VNMT. VNMT often struggles by introducing disfluent words, and both VNMT and Transformer select justifiable but incorrect words. For instance, in our second example, the gender and animacy of the possessor are not specified in Romanian. Our model selects a more plausible pronoun for this context.
Analysis of Latent Variables Finally, we probe whether different latent variables encode different information. We randomly sample 100 sentences from two test sets of distinct domains, MTNT (Reddit comments) and WMT (news), with 50 sentences from each. We plot the t-SNE projection of their corresponding samples z_k inferred from Φ_k, k = 1, 2, 3, 4, respectively. Figure 4 suggests that different latent variables learn to organize the data in different manners, but there was no clear signal that any of them exclusively specializes in encoding a domain label. We leave a thorough analysis of their information specialization to future work.

Table 5: Translation examples from the baseline Transformer, VNMT, and our model. Disfluent words or absences are in red, and slightly incorrect lexical choice is in blue. Romanian diacritics have been stripped.

Source: ma intristeaza foarte tare .
Reference: that really saddens me .
Base: i am very saddened .
VNMT: i am saddened very loudly . (wrong sense of tare)
Ours: i am very saddened .

Source: cred ca executia sa este gresita .
Reference: i believe his execution is wrong .
Base: i believe that its execution is wrong .
VNMT: i believe that its execution is wrong .
Ours: i believe that his execution is wrong .

Source: da , chinatown
Reference: yes , chinatown
Base: yes , chinatown
VNMT: yes , thin .
Ours: yes , chinatown

Source: nu stiu cine va fi propus pentru aceasta functie .
Reference: i do not know who will be proposed for this position .
Base: i do not know who will be proposed for this function .
VNMT: i do not know who will be proposed for this function .
Ours: i do not know who will be proposed for this position .

Source: recrutarea , o prioritate tot mai mare pentru companii
Reference: recruitment , a growing priority for companies
Base: recruitment , an increasing priority for companies
VNMT: recruitment , [article missing] increasing priority for companies
Ours: recruitment , a growing priority for companies
Related Work

Unlike Ma et al. (2018), who also employ bag-of-words as an NMT objective, our BoW decoder only sees the latent variable z, not the encoder states. Conversely, unlike Weng et al. (2017), our generative decoder has access to both the latent variable and the encoder states; bag-of-words prediction is handled by separate parameters.
VNMT applies CVAE with Gaussian priors to conditional text generation. VRNMT (Su et al., 2018) extends VNMT, modeling the translation process in greater granularity. Both needed manually designed annealing schedules to increase the KL loss and avoid posterior collapse. Discrete latent variables have been applied to NMT (Gu et al., 2018; Shen et al., 2019), without variational inference or addressing posterior collapse. Approaches to stop posterior collapse include aggressively trained inference networks (He et al., 2019), skip connections (Dieng et al., 2019), and expressive priors (Tomczak and Welling, 2018; Razavi et al., 2019).
Unlike our conditional approach, Shah and Barber (2018) jointly model the source and target text in a generative fashion. Their EM-based inference is more computationally expensive than our amortized variational inference. Eikema and Aziz (2019) also present a generative (joint) model relying on autoencoding; they condition the source text x on the latent variable z. Finally, Schulz et al. (2018), like us, value mutual information between the data and the latent variable. While they motivate KL annealing using mutual information, we show that the annealing is unnecessary.

Conclusion
We have presented a conditional generative model with latent variables whose distribution is learned with variational inference, then evaluated it in machine translation. Our approach does not require an annealing schedule or a hamstrung decoder to avoid posterior collapse. Instead, by providing a new analysis of the conditional VAE objective to improve it in a principled way and incorporating an auxiliary decoding objective, we measurably prevented posterior collapse.
As a result, our model has outperformed previous variational NMT models in terms of translation quality, and is comparable to non-latent Transformer on standard WMT Ro↔En and De↔En datasets. Furthermore, the proposed method has improved robustness in dealing with uncertainty in data, including exploiting source-side monolingual data as well as training with noisy parallel data.

A Derivation of Equation 5
To prove the decomposition of the conditional VAE's regularization term into a mutual information term and a KL divergence term, we introduce a random variable ℓ representing an index into the training data; it uniquely identifies the pair (x^{(ℓ)}, y^{(ℓ)}). This alteration is "entirely algebraic" (Hoffman and Johnson, 2016) while making our process both more compact and more interpretable.

With L training pairs, the index is uniform: q(ℓ) = p(ℓ) = 1/L, and we write q(z | ℓ) = q(z | x^{(ℓ)}, y^{(ℓ)}). We define the marginals p(z) and q(z) as the aggregated posterior (Tomczak and Welling, 2018) and aggregated approximate posterior (Hoffman and Johnson, 2016):

\[ q(z) = \sum_{\ell} q(\ell)\, q(z \mid \ell), \qquad p(z) = \sum_{\ell} p(\ell)\, p(z \mid \ell). \]

(This allows the independence assumption p(ℓ, z) = p(ℓ)\, p(z).) Moving forward will require just a bit of information theory: the definitions of entropy and mutual information. For these, we direct the reader to the text of Cover and Thomas (2006). Given these definitions, the regularization term of the ELBO objective, in expectation over the data, may be expressed as

\[ \mathbb{E}_{p_D(x,y)}\!\left[ D_{\mathrm{KL}}(q(z \mid x, y) \,\|\, p(z)) \right] = \frac{1}{L} \sum_{\ell} \sum_{z} q(z \mid \ell) \log \frac{q(z \mid \ell)}{p(z)}. \]

We may now multiply the numerator and denominator by 1/L and use its equivalence to p(ℓ) and q(ℓ):

\[ = \sum_{\ell, z} q(\ell, z) \log \frac{q(\ell, z)}{p(\ell, z)}. \]

Factoring q(ℓ, z) = q(z)\, q(ℓ \mid z) and p(ℓ, z) = p(z)\, p(ℓ) then gives us two log terms:

\[ = D_{\mathrm{KL}}(q(z) \,\|\, p(z)) + \mathbb{E}_{q(z)}\!\left[ D_{\mathrm{KL}}(q(\ell \mid z) \,\|\, p(\ell)) \right]. \]

Because of how we defined p(ℓ), we expand the second term and factor out the constant H(p(ℓ)) = log L:

\[ \mathbb{E}_{q(z)}\!\left[ D_{\mathrm{KL}}(q(\ell \mid z) \,\|\, p(\ell)) \right] = \log L - H_q(\ell \mid z) = H_q(\ell) - H_q(\ell \mid z) = I_q(\ell; z). \]

Since ℓ uniquely identifies the pair (x^{(ℓ)}, y^{(ℓ)}), we have I_q(ℓ; z) = I_q(z; x, y), which completes the proof of Equation 5.