Auto-Encoding Variational Neural Machine Translation

We present a deep generative model of bilingual sentence pairs for machine translation. The model generates source and target sentences jointly from a shared latent representation and is parameterised by neural networks. We perform efficient training using amortised variational inference and reparameterised gradients. Additionally, we discuss the statistical implications of joint modelling and propose an efficient approximation to maximum a posteriori decoding for fast test-time predictions. We demonstrate the effectiveness of our model in three machine translation scenarios: in-domain training, mixed-domain training, and learning from a mix of gold-standard and synthetic data. Our experiments show consistently that our joint formulation outperforms conditional modelling (i.e. standard neural machine translation) in all such scenarios.


Introduction
Neural machine translation (NMT) systems (Kalchbrenner and Blunsom, 2013; Cho et al., 2014b) require vast amounts of labelled data, i.e. bilingual sentence pairs, to be trained effectively. Oftentimes, the data we use to train these systems are a byproduct of mixing different sources of data. For example, labelled data are sometimes obtained by putting together corpora from different domains (Sennrich et al., 2017). Even for a single domain, parallel data often result from the combination of documents independently translated from different languages by different people or agencies, possibly following different guidelines. When resources are scarce, it is not uncommon to mix in some synthetic data, e.g. bilingual data artificially obtained by having a model translate target monolingual data to the source language (Sennrich et al., 2016a). Translation direction, original language, and quality of translation are some of the many factors that we typically choose not to control for (due to lack of information or simply for convenience). All of those arguably contribute to making our labelled data a mixture of samples from various data distributions.
Regular NMT systems do not explicitly account for latent factors of variation; instead, given a source sentence, NMT models a single conditional distribution over target sentences in a fully supervised fashion. In this work, we introduce a deep generative model that generates source and target sentences jointly from a shared latent representation. The model has the potential to use the latent representation to capture global aspects of the observations, such as some of the latent factors of variation just discussed. The result is a model that accommodates members of a more complex class of marginal distributions. Due to the presence of latent variables, this model requires posterior inference; in particular, we employ the framework of amortised variational inference (Kingma and Welling, 2014). Additionally, we propose an efficient approximation to maximum a posteriori (MAP) decoding for fast test-time predictions.
Contributions We introduce a deep generative model for NMT ( §3) and discuss theoretical advantages of joint modelling over conditional modelling ( §3.1). We also derive an efficient approximation to MAP decoding that requires only a single forward pass through the network for prediction ( §3.3). Finally, we show in §4 that our proposed model improves translation performance in at least three practical scenarios: i) in-domain training on little data, where test data are expected to follow the training data distribution closely; ii) mixed-domain training, where we train a single model but test independently on each domain; and iii) learning from large noisy synthetic data.

Neural Machine Translation
In machine translation our observations are pairs of random sequences, a source sentence x = x_1, . . . , x_m and a target sentence y = y_1, . . . , y_n, whose lengths m and n we denote by |x| and |y|, respectively. In NMT, the likelihood of the target given the source factorises without Markov assumptions (Bahdanau et al., 2015; Cho et al., 2014a):

P(y|x, θ) = ∏_{j=1}^{|y|} Cat(y_j | f_θ(x, y_{<j})) .  (1)

Here f_θ is a fixed parameterised function, i.e. a neural network architecture, that computes categorical parameters for varying inputs, namely, the source sentence and the target prefix (denoted y_{<j}). Given a dataset D of i.i.d. observations, the parameters θ of the model are point-estimated to attain a local maximum of the log-likelihood function L(θ|D) = ∑_{(x,y)∈D} log P(y|x, θ) via stochastic gradient-based optimisation (Robbins and Monro, 1951; Bottou and Cun, 2004).
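As a concrete illustration, the autoregressive factorisation above can be sketched as follows, where `f_theta` is a hypothetical stand-in for the neural network mapping a source sentence and a target prefix to a probability vector over the target vocabulary:

```python
import numpy as np

def log_likelihood(x_ids, y_ids, f_theta):
    """Compute log P(y|x) = sum_j log Cat(y_j | f_theta(x, y_<j)).

    f_theta is a hypothetical stand-in for the network: it maps
    (source sentence, target prefix) to a probability vector over
    the target vocabulary.
    """
    total = 0.0
    for j in range(len(y_ids)):
        probs = f_theta(x_ids, y_ids[:j])  # categorical parameters for step j
        total += np.log(probs[y_ids[j]])
    return total
```

Training then amounts to following stochastic gradients of this quantity summed over minibatches.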
Predictions For a trained model, predictions are performed by searching for the target sentence y that maximises the conditional P(y|x, θ), or equivalently its logarithm, with a greedy algorithm

arg max_y log P(y|x, θ) ≈ greedy_y log P(y|x, θ) ,  (2)

such as beam search, possibly aided by a manually tuned length penalty. This decision rule is often referred to as MAP decoding (Smith, 2011).
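A minimal sketch of the greedy special case (beam size 1), again with `f_theta` as a hypothetical stand-in for the trained network:

```python
import numpy as np

def greedy_decode(x_ids, f_theta, eos_id, max_len=50):
    """Greedy approximation to MAP decoding: at each step emit the
    locally most probable word under f_theta(x, y_<j)."""
    y = []
    for _ in range(max_len):
        probs = f_theta(x_ids, y)
        word = int(np.argmax(probs))
        y.append(word)
        if word == eos_id:  # stop once end-of-sentence is produced
            break
    return y
```

Beam search generalises this by keeping the k best prefixes at every step instead of a single one.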

Auto-Encoding Variational NMT
To account for a latent space where global features of observations can be captured, we introduce a random sentence embedding z ∈ R^d and model the joint distribution over observations as a marginal of p(z, x, y|θ).² That is, (x, y) ∈ D is assumed to be sampled from the distribution

P(x, y|θ) = ∫ p(z) P(x, y|z, θ) dz ,  (3)

² We use uppercase P(·) for probability mass functions and lowercase p(·) for probability density functions.
where we impose a standard Gaussian prior on the latent variable, i.e. Z ∼ N(0, I), and assume X ⊥ Y | Z. That is, given a sentence embedding z, we first generate the source conditioned on z, then generate the target conditioned on x and z:

P(x|z, θ) = ∏_{i=1}^{|x|} Cat(x_i | g_θ(z, x_{<i})) ,  (4)
P(y|x, z, θ) = ∏_{j=1}^{|y|} Cat(y_j | f_θ(z, x, y_{<j})) .  (5)

Note that the source sentence is generated without Markov assumptions by drawing one word at a time from a categorical distribution parameterised by a recurrent neural network g_θ. The target sentence is generated similarly by drawing target words in context from a categorical distribution parameterised by a sequence-to-sequence architecture f_θ. This essentially combines a neural language model (Mikolov et al., 2010) and a neural translation model (§2), each extended to condition on an additional stochastic input, namely, z.
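Under the assumption X ⊥ Y | Z, the joint likelihood of a sentence pair given z decomposes into a source language model term plus a translation term. A sketch, with `g_theta` and `f_theta` as hypothetical stand-ins for the two networks:

```python
import numpy as np

def joint_log_likelihood(x_ids, y_ids, z, g_theta, f_theta):
    """log P(x, y|z) = log P(x|z) + log P(y|x, z).

    g_theta(z, x_prefix) and f_theta(z, x, y_prefix) are hypothetical
    stand-ins returning probability vectors over the vocabulary."""
    # Source language model: one categorical draw per source word.
    ll = sum(np.log(g_theta(z, x_ids[:i])[x_ids[i]]) for i in range(len(x_ids)))
    # Translation model: conditions on z in addition to x and the prefix.
    ll += sum(np.log(f_theta(z, x_ids, y_ids[:j])[y_ids[j]]) for j in range(len(y_ids)))
    return ll
```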

Statistical considerations
Modelling the conditional directly, as in standard NMT, corresponds to the statistical assumption that the distribution over source sentences can provide no information about the distribution over target sentences given a source. That is, conditional NMT assumes independence of the parameters β determining P(y|x, β) and the parameters α determining P(x|α). Scenarios where this assumption is unlikely to hold are common: where x is noisy (e.g. synthetic or crowdsourced), poor-quality x should be assigned low probability P(x|α), which in turn should inform the conditional. Implications of this assumption extend to parameter estimation: updates to the conditional are not sensitive to how exotic x is. Let us be more explicit about how we parameterise our model by identifying three sets of parameters θ = {θ_emb-x, θ_LM, θ_TM}, where θ_emb-x parameterises an embedding layer for the source language. The embedding layer is shared between the two model components, and it is then clear by inspection that α ∩ β = {z, θ_emb-x}. In words, we break the independence assumption in two ways, namely, by having the two distributions share parameters and by having them depend on a shared latent sentence representation z. Note that while the embedding layer is deterministic and global to all sentence pairs in the training data, the latent representation is stochastic and local to each sentence pair. Now let us turn to considerations about latent variable modelling.
Consider a model P (x|θ emb-x , θ LM )P (y|x, θ emb-x , θ TM ) of the joint distribution over observations that does not employ latent variables. This alternative, which we discuss further in experiments, models each component directly, whereas our proposed model (3) requires marginalisation of latent embeddings z. Marginalisation turns our directed graphical model into an undirected one inducing further structure in the marginal. See Appendix B, and Figure 2 in particular, for an extended discussion.

Parameter estimation
The marginal in Equation (3) is clearly intractable, thus precluding maximum likelihood estimation. Instead, we resort to variational inference (Jordan et al., 1999) and introduce a variational approximation q(z|x, y, λ) to the intractable posterior p(z|x, y, θ). We let the approximate posterior be a diagonal Gaussian

Z|λ, x, y ∼ N(u, diag(s ⊙ s)) ,  with u = μ_λ(x, y) and s = σ_λ(x, y) ,

and predict its parameters (i.e. u ∈ R^d, s ∈ R^d_{>0}) with neural networks whose parameters we denote by λ. This makes the model an instance of a variational auto-encoder (Kingma and Welling, 2014). See Figure 1 in Appendix B for a graphical depiction of the generative and inference models.
We can then jointly estimate the parameters of both models (generative θ and inference λ) by maximising the ELBO (Jordan et al., 1999), a lowerbound on the marginal log-likelihood,

E_{ε∼N(0,I)}[ log P(x, y | z = u + ε ⊙ s, θ) ] − KL( q(z|x, y, λ) || p(z) ) ,  (8)

where we have expressed the expectation with respect to a fixed distribution, a reparameterisation available to location-scale families such as the Gaussian (Kingma and Welling, 2014; Rezende et al., 2014). Due to this reparameterisation, we can compute a Monte Carlo estimate of the gradient of the first term via back-propagation (Rumelhart et al., 1986; Schulman et al., 2015). The KL term, on the other hand, is available in closed form (Kingma and Welling, 2014, Appendix B).
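A single-sample Monte Carlo estimate of this objective can be sketched as follows; `joint_ll` is a hypothetical stand-in for log P(x, y|z, θ), and the KL term is the closed-form divergence between N(u, diag(s ⊙ s)) and the standard Gaussian prior:

```python
import numpy as np

def elbo_estimate(x, y, u, s, joint_ll, rng):
    """E_q[log P(x,y|z)] - KL(q || p), with q = N(u, diag(s*s)) and p = N(0, I).

    joint_ll(x, y, z) is a hypothetical stand-in for the generative
    model's log-likelihood log P(x, y|z, theta)."""
    eps = rng.standard_normal(u.shape)  # noise from a fixed distribution
    z = u + s * eps                     # reparameterised sample z ~ q
    # KL(N(u, diag(s^2)) || N(0, I)) in closed form.
    kl = 0.5 * np.sum(s**2 + u**2 - 1.0 - np.log(s**2))
    return joint_ll(x, y, z) - kl
```

Because z is a deterministic function of (u, s) and the fixed noise ε, gradients with respect to λ flow through the sample.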

Predictions
In a latent variable model, MAP decoding requires searching for the y that maximises the marginal P(y|x, θ) ∝ P(x, y|θ), or equivalently its logarithm:

y* = arg max_y log P(y|x, θ) .  (9a)

In addition to approximating exact search with a greedy algorithm, other approximations are necessary in order to achieve fast prediction. First, rather than searching through the true marginal, we search through the evidence lowerbound. Second, we replace the approximate posterior q(z|x, y) by an auxiliary distribution r(z|x). As we are searching through the space of target sentences, not conditioning on y circumvents a combinatorial explosion and allows us to drop terms that depend on x alone:

y* ≈ arg max_y E_{r(z|x)}[ log P(y|z, x, θ) ] .  (9b)

Finally, instead of approximating the expectation via MC sampling, we condition on the expected latent representation and search greedily:

y* ≈ greedy_y log P(y | E_{r(z|x)}[Z], x, θ) .  (9c)

Together, these approximations enable prediction with a single call to an arg max solver, in our case a standard greedy search algorithm, which leads to prediction times very close to those of the conditional model. This strategy, and (9b) in particular, suggests that a good auxiliary distribution r(z|x) should approximate q(z|x, y) closely.
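Putting these approximations together, prediction amounts to one call to the network for the mean of r(z|x), followed by a single greedy search. A sketch, with `r_mean` and `f_theta` as hypothetical stand-ins:

```python
import numpy as np

def fast_predict(x_ids, r_mean, f_theta, eos_id, max_len=50):
    """Deterministic approximate MAP decoding: condition on E_r[Z]."""
    z = r_mean(x_ids)  # expected latent representation under r(z|x)
    y = []
    for _ in range(max_len):
        probs = f_theta(z, x_ids, y)  # one decoder step given z and the prefix
        word = int(np.argmax(probs))
        y.append(word)
        if word == eos_id:
            break
    return y
```

Note that z is computed once per source sentence, so the per-step cost matches that of a conditional NMT decoder.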
We parameterise this prediction model using a neural network and investigate different options to estimate its parameters. As a first option, we restrict the approximate posterior to conditioning on x alone, i.e. we approach posterior inference with q_λ(z|x) rather than q_λ(z|x, y), and thus we can use r(z|x) = q_λ(z|x) for prediction. As a second option, we make r_φ(z|x) a diagonal Gaussian and estimate parameters φ so as to make r_φ(z|x) close to the approximate posterior q_λ(z|x, y) as measured by a divergence D(r_φ, q_λ). For as long as D(r_φ, q_λ) ∈ R_{≥0} for every choice of φ and λ, we can estimate φ jointly with θ and λ by maximising a modified ELBO which is loosened by the gap between r_φ and q_λ. In experiments we investigate a few options for D(r_φ, q_λ), all available in closed form for Gaussians, such as KL(r_φ||q_λ), KL(q_λ||r_φ), as well as the Jensen-Shannon (JS) divergence.
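For diagonal Gaussians, the KL divergence between r_φ and q_λ has a simple closed form; a sketch of the computation used as such a distance D:

```python
import numpy as np

def kl_diag_gaussians(u_r, s_r, u_q, s_q):
    """KL(N(u_r, diag(s_r^2)) || N(u_q, diag(s_q^2))) in closed form,
    summed over dimensions. Always non-negative, zero iff the two
    Gaussians coincide."""
    var_r, var_q = s_r**2, s_q**2
    return 0.5 * np.sum(var_r / var_q + (u_q - u_r)**2 / var_q
                        - 1.0 + np.log(var_q / var_r))
```

Swapping the argument order gives KL(q_λ||r_φ), the other divergence considered in the experiments.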
Note that r_φ is used only for prediction, as a decoding heuristic, and as such need not be stochastic. We can, for example, design r_φ(x) to be a point estimate of the posterior mean and optimise a correspondingly modified objective, which remains a lowerbound on the log-likelihood.

Experiments
We investigate two translation tasks, namely, WMT's translation of news (Bojar et al., 2016) and IWSLT's translation of transcripts of TED talks (Cettolo et al., 2014), and concentrate on translations for German (DE) and English (EN) in either direction. In this section we aim to investigate scenarios where we expect observations to be representative of various data distributions. As a sanity check, we start where training conditions can be considered in-domain with respect to test conditions. Though note that this does not preclude the potential for appreciable variability in observations as various other latent factors still likely play a role (see §1). We then mix datasets from these two remarkably different translation tasks and investigate whether performance can be improved across tasks with a single model. Finally, we investigate the case where we learn from synthetic data in addition to gold-standard data. For this investigation we derive synthetic data from observations that are close to the domain of the test set in an attempt to avoid further confounders.
Data For bilingual data we use News Commentary (NC) v12 (Bojar et al., 2017) and IWSLT 2014 (Cettolo et al., 2014), where we assume NC to be representative of the test domain of the WMT News task. The datasets consist of 255,591 and 153,326 training sentences, respectively. In experiments with synthetic data, we subsample 10⁶ sentences from the News Crawl 2016 articles (Bojar et al., 2017) for either German or English, depending on the target language. For the WMT task, we concatenate newstest2014 and newstest2015 for validation/development (5,172 sentence pairs) and report test results on newstest2016 (2,999 sentence pairs). For IWSLT, we use the split proposed by Ranzato et al. (2016), who separated 6,969 training instances for validation/development and reported test results on a concatenation of dev2010, dev2012 and tst2010-2012 (6,750 sentence pairs).
Pre-processing We tokenized and truecased all data using standard scripts from the Moses toolkit (Koehn et al., 2007), and removed sentences longer than 50 tokens. For computational efficiency and to avoid problems with closed vocabularies, we segment the data using BPE (Sennrich et al., 2016b) with 32,000 merge operations, independently for each language. For training the truecaser and the BPEs we used a concatenation of all the available bilingual and monolingual data for German, and all bilingual data for English.
Systems We develop all of our models on top of Tensorflow NMT (Luong et al., 2017). Our baseline system is a standard implementation of conditional NMT (COND) (Bahdanau et al., 2015). To illustrate the importance of latent variable modelling, we also include in the comparison a simpler attempt at joint modelling (JOINT) where we do not induce a shared latent space. Instead, the model is trained in a fully-supervised manner to maximise what is essentially a combination of two nearly independent objectives, namely, a language model and a conditional translation model. Note that the two components of this model share very little, i.e. an embedding layer for the source language. Finally, we aim at investigating the effectiveness of our auto-encoding variational NMT (AEVNMT). Appendix A contains a detailed description of the architectures that parameterise our systems.

Hyperparameters Our recurrent cells are 256-dimensional GRU units (Cho et al., 2014b). We train on batches of 64 sentence pairs with Adam (Kingma and Ba, 2015), learning rate 3 × 10⁻⁴, for at least T updates. We then perform convergence checks every 500 batches and stop after 20 checks without any improvement measured by BLEU (Papineni et al., 2002). For in-domain training we set T = 140,000, and for mixed-domain training, as well as training with synthetic data, we set T = 280,000. For decoding we use a beam width of 10 and a length penalty of 1.0.
We investigate the use of dropout (Srivastava et al., 2014) for the conditional baseline with rates from 10% to 60% in increments of 10%. Best validation performance on WMT required a rate of 40% for EN-DE and 50% for DE-EN, while on IWSLT it required 50% for either translation direction. To spare resources, we also use these rates for training the simple JOINT model.
Avoiding collapsing to prior Many have noticed that VAEs whose observation models are parameterised by strong generators, such as recurrent neural networks, learn to ignore the latent representation (Bowman et al., 2016; Higgins et al., 2017; Sønderby et al., 2016; Alemi et al., 2018). In such cases, the approximate posterior "collapses" to the prior, and where one has a fixed prior, such as our standard Gaussian, this means that the posterior becomes independent of the data, which is obviously not desirable. Bowman et al. (2016) proposed two techniques to counter this effect, namely, KL annealing and target word dropout. KL annealing consists in incorporating the KL term of Equation (8) into the objective gradually, thus allowing the posterior to move away from the prior more freely at early stages of training. After a number of annealing steps, the KL term is incorporated in full and training continues with the actual ELBO. In our search we considered annealing for 20,000 to 80,000 training steps. Word dropout consists in randomly masking words in observed target prefixes at a given rate. The idea is to harm the potential of the decoder to capitalise on correlations internal to the structure of the observation, in the hope that it will rely more on the latent representation instead. We considered rates from 20% to 40% in increments of 10%. Table 1 shows the configurations that achieve best validation results on EN-DE. To spare resources, we reuse these hyperparameters for DE-EN experiments. With these settings, we attain a non-negligible validation KL (see the last row of Table 1), which indicates that the approximate posterior is different from the prior at the end of training. (… and possibly a prediction network. However, this does not add much sequential computation: the inference network can run in parallel with the source encoder, and the source language model runs in parallel with the target decoder.)
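Both techniques are simple to implement; a sketch, assuming a linear annealing schedule (the schedule shape is not specified in the text):

```python
import numpy as np

def kl_weight(step, annealing_steps):
    """Linear KL annealing: the weight on the KL term grows from 0 to 1
    over `annealing_steps` updates, then stays at 1 (assumed linear)."""
    return min(1.0, step / annealing_steps)

def word_dropout(prefix_ids, unk_id, rate, rng):
    """Randomly mask words in an observed target prefix so the decoder
    can rely less on its own autoregressive context."""
    return [unk_id if rng.random() < rate else w for w in prefix_ids]
```

During training, the ELBO's KL term is multiplied by `kl_weight(step, ...)` and the decoder is fed the masked prefix.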

ELBO variants
We investigate the effect of conditioning on target observations for posterior inference during training against a simpler variant that conditions on the source alone. Table 2 suggests that conditioning on x is sufficient and thus we opt to continue with this simpler version. Do note that when we use both observations for posterior inference, i.e. q λ (z|x, y), and thus train an approximation r φ for prediction, we have additional parameters to estimate (e.g. due to the need to encode y for q λ and x for r φ ), thus it may be the case that for these variants to show their potential we need larger data and/or prolonged training.

Results
Table 3: Test results for in-domain training on IWSLT (top) and NC (bottom): we report average (1std) across 5 independent runs for COND and AEVNMT, but a single run of JOINT.

In this section we report test results in terms of BLEU (Papineni et al., 2002) and BEER (Stanojević and Sima'an, 2014), but in Appendix E
we additionally report METEOR (Denkowski and Lavie, 2011) and TER (Snover et al., 2006). We de-truecase and de-tokenize our systems' predictions and compute BLEU scores using SacreBLEU (Post, 2018). For BEER, METEOR and TER, we tokenize the results and test sets using the same tokenizer as used by SacreBLEU. We make use of BEER 2.0, and for METEOR and TER we use MULTEVAL (Clark et al., 2011). In Appendix D we report validation results, in this case in terms of BLEU alone as that is what we used for model selection. Finally, to give an indication of the degree to which results are sensitive to initial conditions (e.g. random initialisation of parameters), and to avoid possibly misleading significance testing, we report the average and standard deviation of 5 independently trained models. To spare resources we do not report multiple runs for JOINT, but our experience is that its performance varies similarly to that of the conditional baseline.
We start with the case where we can reasonably assume training data to be in-domain with respect to test data. Table 3 shows in-domain training performance. First, we remark that our conditional baseline for the IWSLT14 task (IWSLT training) is very close to an external baseline trained on the same data (Bahdanau et al., 2017). 7 The results on IWSLT show benefits from joint modelling and in particular from learning a shared latent space. For the WMT16 task (NC training), BLEU shows a similar trend, namely, joint modelling with a shared latent space (AEVNMT) outperforms both conditional modelling and the simple joint model.
We now consider the scenario where we know for a fact that observations come from two different data distributions, which we realise by training our models on a concatenation of IWSLT and NC. In this case, we perform model selection once on the concatenation of both development sets and evaluate the same model on each domain separately. We can see in Table 4 that conditional modelling is never preferred, JOINT performs reasonably well, especially for DE-EN, and that in every comparison our AEVNMT outperforms the conditional baseline both in terms of BLEU and BEER.
Another common scenario where two very distinct data distributions are mixed is when we capitalise on the abundance of monolingual data and train on a concatenation of gold-standard bilingual data (we use NC) and synthetic bilingual data derived from target monolingual corpora via back-translation (Sennrich et al., 2016a) (we use News Crawl). In such a scenario the latent variable might be able to inform the translation model of the amount of noise present in the source sentence. Table 5 shows results for both baselines and AEVNMT. First, note that synthetic data greatly improves the conditional baseline, in particular translating into English. Once again AEVNMT consistently outperforms conditional modelling and joint modelling without latent variables.
By mixing different sources of data we are trying to diagnose whether the generative model we propose is robust to unknown and diverse sources of variation mixed together in one training set (e.g. NC + IWSLT or gold-standard + synthetic data). Note, however, that we are certainly not claiming that the model has been designed to perform domain adaptation. Nonetheless, in Appendix C we report how our models perform on domains they have never seen. On a dataset covering various unseen genres, we observe that both COND and AEVNMT perform considerably worse, showing that without taking domain adaptation seriously both models are inadequate. In terms of BLEU, differences range from −0.3 to 0.8 (EN-DE) and 0.3 to 0.7 (DE-EN) and are mostly in favour of AEVNMT (17/20 comparisons).
Remarks It is intuitive to expect latent variable modelling to be most useful in settings with high variability in the data, i.e. the mixed-domain and synthetic data settings, yet in our experiments AEVNMT shows larger improvements in the in-domain setting. We see two possible reasons for this: i) it is conceivable that the variation in the mixed-domain and synthetic data settings is too large to be well accounted for by a diagonal Gaussian; and ii) the benefits of latent variable modelling may diminish as the amount of available data grows.

Probing latent space
To investigate what information the latent space encodes we explore the idea of training simple linear probes or diagnostic classifiers (Alain and Bengio, 2017; Hupkes et al., 2018). With simple Bayesian logistic regression we have managed to predict from Z ∼ q(z|x) domain indicators (i.e. newswire vs transcripts) and gold-standard vs synthetic data with above 90% accuracy on the development set. However, a similar performance is achieved from the deterministic average state of the bidirectional encoder of the conditional baseline. We have also been able to predict from Z ∼ q(z|x) the level of noise in back-translated data, measured on the development set at the sentence level by an automatic metric, i.e. METEOR, with performance above what can be done with random features. Though again, the performance is not much better than what can be done with the conditional baseline. Still, it is worth highlighting that these aspects are rather coarse, and it is possible that the performance gains we report in §4.1 are due to far more nuanced variations in the data. At this point, however, we do not have a good qualitative assessment of this conjecture.
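A minimal version of such a probe (plain logistic regression fit by gradient descent, not the Bayesian variant used in the text) can be sketched as:

```python
import numpy as np

def train_probe(Z, labels, steps=500, lr=0.5):
    """Fit a linear diagnostic classifier on latent samples Z (n x d)
    to predict a binary indicator, e.g. newswire vs transcripts."""
    w = np.zeros(Z.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(Z @ w + b)))  # sigmoid predictions
        grad = p - labels                        # gradient of the log-loss
        w -= lr * (Z.T @ grad) / len(labels)
        b -= lr * grad.mean()
    return w, b
```

The probe's held-out accuracy then indicates how linearly decodable the factor is from the latent space.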

Related Work
Joint modelling In similar work, Shah and Barber (2018) propose a joint generative model whose probabilistic formulation is essentially identical to ours. Besides some small differences in architecture, our work differs in two regards: motivation and strategy for predictions. Their goal is to jointly learn from multiple language pairs by sharing a single polyglot architecture (Johnson et al., 2017). Their strategy for prediction is based on a form of stochastic hill-climbing, where they sample an initial z from the standard Gaussian prior and decode via beam search in order to obtain a draft translation ỹ = greedy_y P(y|z, x). This translation is then iteratively refined by encoding the pair ⟨x, ỹ⟩, re-sampling z, though this time from q(z|x, ỹ), and re-decoding with beam search. Unlike our approach, this requires multiple calls to the inference network and to beam search. Moreover, the inference model, which is trained on gold-standard observations, is used on noisy target sentences. Cotterell and Kreutzer (2018) interpret back-translation as a single iteration of a wake-sleep algorithm (Hinton et al., 1995) for a joint model of bitext P(x, y|θ) = P(y|x, θ)P(x). They sample directly from the data distribution P(x) and learn two NMT models, a generative P(y|x, θ) and an auxiliary model Q(x|y, φ), each trained on a separate objective. Zhang et al. (2018) propose a joint model of bitext trained to incorporate the back-translation heuristic as a trainable component in a formulation similar to that of Cotterell and Kreutzer (2018). In both cases, joint modelling is done without a shared latent space and without a source language model.

Table 5: Test results for training on NC plus synthetic data (back-translated News Crawl): we report average (1std) across 5 independent runs for COND and AEVNMT, but a single run of JOINT.
Multi-task learning An alternative to joint learning is to turn to multi-task learning and explore parameter sharing across models trained on different, though related, data with different objectives. For example, Cheng et al. (2016) incorporate both source and target monolingual data by multi-tasking with a non-differentiable autoencoding objective. They jointly train a source-to-target and a target-to-source system that act as encoder and decoder, respectively. Zhang and Zong (2016) combine a source language model objective with a source-to-target conditional NMT objective and share the source encoder in a multi-task learning fashion.
Variational LMs and NMT Bowman et al. (2016) first proposed to augment a neural language model with a prior over a latent space. Our source component is an instance of their model. More recently, Xu and Durrett (2018) proposed to use a hyperspherical uniform prior rather than a Gaussian and showed that the former leads to better representations. The first VAE proposed for NMT augments the conditional with a Gaussian sentence embedding and models observations as draws from the marginal P(y|x, θ) = ∫ p(z|x, θ) P(y|x, z, θ) dz. That formulation is a conditional deep generative model (Sohn et al., 2015) that does not model the source side of the data, and, rather than being a fixed standard Gaussian, the latent model is itself parameterised and depends on the data. Schulz et al. (2018) extend this model with a Markov chain of latent variables, one per timestep, allowing the model to capture greater variability.
Latent domains In the context of statistical MT, Cuong and Sima'an (2015) estimate a joint distribution over sentence pairs while marginalising discrete latent domain indicators. Their model factorises over word alignments and is not used directly for translation, but rather to improve word and phrase alignments, or to perform data selection (Hoang and Sima'an, 2014), prior to training. There is a vast literature on domain adaptation for statistical machine translation (Cuong and Sima'an, 2017), as well as for NMT (Chu and Wang, 2018), but a full characterisation of this exciting field is beyond the scope of this paper.

Discussion and Future Work
We have presented a joint generative model of translation data that generates both observations conditioned on a shared latent representation. Our formulation leads to questions such as why joint learning? and why latent variable modelling? to which we give an answer based on statistical facts about conditional modelling and marginalisation as well as empirical evidence of improved performance. Our model shows moderate but consistent improvements across various settings and over multiple independent runs.
In future work, we shall investigate datasets annotated with demographics and personal traits in an attempt to assess how far we can go in capturing fine-grained variation. Note, though, that if such factors of variation vary widely in distribution, it may be naïve to expect we can model them well with a simple Gaussian prior. If that turns out to be the case, we will investigate mixing Gaussian components (Miao et al., 2016; Srivastava and Sutton, 2017) and/or employing a hierarchical prior.

A Architectures
Here we describe parameterisation of the different models presented in §3. Rather than completely specifying standard blocks, we use the notation block(inputs; parameters), where we give an indication of the relevant parameter set. This makes it easier to visually track which model a component belongs to.

A.1 Source Language Model
The source language model consists of a sequence of categorical draws for i = 1, . . . , |x|, parameterised by a single-layer recurrent neural network with GRU units: each previous word is embedded (14a) and fed to the GRU, whose output state is projected to the source vocabulary and mapped to the simplex with a softmax to yield the categorical parameters g_θ(z, x_{<i}). We initialise the GRU cell with a transformation (14b) of the stochastic encoding z. For the simple joint model baseline we initialise the GRU with a vector of zeros, as there is no stochastic encoding we can condition on in that case.

A.2 Translation Model
The translation model consists of a sequence of categorical draws for j = 1, . . . , |y|, parameterised by an architecture that roughly follows Bahdanau et al. (2015). The encoder is a bidirectional GRU (16b) that shares source embeddings with the language model (14a) and is initialised with its own projection of the latent representation put through a tanh activation. The decoder, also initialised with its own projection of the latent representation (16d), is a single-layer recurrent neural network with GRU units (16f). At any timestep the decoder is a function of the previous state, the previous output word embedding, and a context vector. This context vector (16e) is a weighted average of the bidirectional source encodings, whose weights are computed by a Bahdanau-style attention mechanism. The output of the GRU decoder is projected to the target vocabulary size and mapped to the simplex using a softmax activation (17) to obtain the categorical parameters:

f_θ(z, x, y_{<j}) = softmax(affine([t_j, e_{j−1}, c_j]; θ_out-y)) ,  (17)

where t_j denotes the decoder state, e_{j−1} the embedding of the previous output word, and c_j the context vector. In baseline models, recurrent cells are initialised with a vector of zeros, as there is no stochastic encoding we can condition on.

A.3 Inference Network
The inference model q(z|x, y, λ) is a diagonal Gaussian

Z|x, y ∼ N(u, diag(s ⊙ s)) ,  (18)

whose parameters are computed by an inference network. We use two bidirectional GRU encoders to encode the source and target sentences separately. To spare memory, we reuse embeddings from the generative model (19a-19b), but we prevent updates to those parameters based on gradients of the inference network, which we indicate with the function detach. To obtain fixed-size representations of the sentences, the GRU encodings are averaged (19c-19d).
We use the concatenation h_xy of the average source and target encodings (19e) as input to compute the parameters of the Gaussian approximate posterior, namely d-dimensional location and scale vectors. Both transformations use ReLU hidden activations (Nair and Hinton, 2010), but locations live in R^d and therefore call for linear output activations (19h), whereas scales live in R^d_{>0} and call for strictly positive outputs (19i), for which we use softplus. The complete set of parameters used for inference is thus λ = {λ_gru-x, λ_gru-y, λ_u-hid, λ_u-out, λ_s-hid, λ_s-out}.
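A minimal sketch of the two output heads and the reparameterised sample, assuming hypothetical sizes and omitting bias terms for brevity:

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def softplus(x):
    return np.log1p(np.exp(x))

def posterior_params(h_xy, W_u_hid, W_u_out, W_s_hid, W_s_out):
    """Location and scale heads of the inference network: each head has a
    ReLU hidden layer; the location output is linear (values in R^d), the
    scale output is softplus (values in R^d_{>0})."""
    u = W_u_out @ relu(W_u_hid @ h_xy)
    s = softplus(W_s_out @ relu(W_s_hid @ h_xy))
    return u, s

def sample_z(u, s, rng):
    """Reparameterised draw from N(u, diag(s * s))."""
    return u + s * rng.normal(size=u.shape)

rng = np.random.default_rng(2)
d, d_hid, d_in = 16, 32, 24  # hypothetical sizes
h_xy = rng.normal(size=d_in)
u, s = posterior_params(
    h_xy,
    rng.normal(size=(d_hid, d_in)), rng.normal(size=(d, d_hid)),
    rng.normal(size=(d_hid, d_in)), rng.normal(size=(d, d_hid)),
)
z = sample_z(u, s, rng)
```

Writing the sample as u + s ⊙ ε with ε ∼ N(0, I) is what makes the gradient flow through u and s, which is the reparameterisation used for training.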

A.4 Prediction Network
The prediction network parameterises our prediction model r(z|x, φ), a variant of the inference model that conditions on the source sentence alone. In §4 we explore several variants of the ELBO using different parameterisations of r_φ. In the simplest case we do not condition on the target sentence during training, so we can use the same network for both training and prediction. The network is similar to the one described in A.3, except that there is a single bidirectional GRU and we use the average source encoding (19c) as input to the predictors for u and s (20c-20d).
In all other cases we use q(z|x, y, λ), parameterised as discussed in A.3, for training, and design a separate network to parameterise r_φ for prediction. Much like the inference model, the prediction model is a diagonal Gaussian, also parameterised by d-dimensional location and scale vectors; however, in predicting û and ŝ (22d-22e) it can only access an encoding of the source (22a).
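The source-only prediction step can be sketched as below. This collapses the heads to single linear/softplus maps for brevity (the actual network has hidden layers, as in A.3), and all names and sizes are illustrative:

```python
import numpy as np

def softplus(x):
    return np.log1p(np.exp(x))

def predict_latent(h_x, W_u, W_s, deterministic=True, rng=None):
    """Prediction model r(z|x): the same diagonal-Gaussian form as the
    inference model, but its location and scale heads see only the source
    encoding h_x. In the deterministic variant we return the location
    u_hat as a point estimate of the posterior mean of Z."""
    u_hat = W_u @ h_x
    if deterministic:
        return u_hat
    s_hat = softplus(W_s @ h_x)
    return u_hat + s_hat * rng.normal(size=u_hat.shape)

rng = np.random.default_rng(3)
h_x = rng.normal(size=8)                      # average source encoding
W_u, W_s = rng.normal(size=(4, 8)), rng.normal(size=(4, 8))
z_det = predict_latent(h_x, W_u, W_s)         # deterministic variant
```

At test time the deterministic variant avoids sampling entirely, which is what enables the fast approximation to maximum a posteriori decoding.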
The complete set of parameters is then φ = {φ_gru-x, φ_u-hid, φ_u-out, φ_s-hid, φ_s-out}. For the deterministic variant, we use û (22d) alone to approximate u (19h), i.e. the posterior mean of Z.

Figure 1 is a graphical depiction of our AEVNMT model. Circled nodes denote random variables, while uncircled nodes denote deterministic quantities. Shaded random variables correspond to observations and unshaded random variables are latent. The plate denotes a dataset of |D| observations. In Figure 2a, we illustrate the precise statistical assumptions of AEVNMT. Here plates iterate over words in either the source or the target sentence. Note that the arrow from x_i to y_j states that the jth target word depends on the complete source sentence, not on the ith source word alone; this is the case because x_i is within the source plate. In Figure 2b, we illustrate the statistical dependencies induced in the marginal distribution upon marginalisation of the latent variable. Recall that the marginal is the distribution which, by assumption, produced the observed data. Now compare that to the distribution modelled by the simple JOINT model (Figure 2c). Marginalisation induces undirected dependencies amongst random variables, creating more structure in the marginal distribution. In the graphical models literature this is known as moralisation (Koller and Friedman, 2009).

Figure 2: We modify Figure 1a to show the statistical dependencies between observed variables. In the joint distribution (top), we have the directed dependency of a source word on all previous source words and, similarly, of a target word on all previous target words in addition to the complete source sentence; moreover, all observations depend directly on the latent variable Z. Marginalisation of Z (middle) ties all variables together through undirected connections. At the bottom we show the distribution we get if we model the data distribution directly, without latent variables.

C Robustness to out-of-domain data
We use our stronger models, those trained on gold-standard NC bilingual data and synthetic News data, to translate test sets in various unseen genres. These data sets are collected and distributed by TAUS,8 and have been used in scenarios of adaptation to all domains at once (Cuong et al., 2016). Table 6 shows the performance of AEVNMT and the conditional baseline. The first thing to note is the remarkable drop in performance, showing that without taking domain adaptation seriously both models are inadequate. In terms of BLEU, differences range from −0.

Table 9: Validation results reported in BLEU for training on NC plus synthetic data: we report the average (1 std) across 5 independent runs for COND and AEVNMT, but a single run of JOINT.