Latent Variable Model for Multi-modal Translation

In this work, we propose to model the interaction between visual and textual features for multi-modal neural machine translation (MMT) through a latent variable model. This latent variable can be seen as a multi-modal stochastic embedding of an image and its description in a foreign language. It is used in a target-language decoder and also to predict image features. Importantly, our model formulation utilises visual and textual inputs during training but does not require that images be available at test time. We show that our latent variable MMT formulation improves considerably over strong baselines, including a multi-task learning approach (Elliott and Kádár, 2017) and a conditional variational auto-encoder approach (Toyama et al., 2016). Finally, we show improvements due to (i) predicting image features in addition to only conditioning on them, (ii) imposing a constraint on the KL term to promote models with non-negligible mutual information between inputs and latent variable, and (iii) training on additional target-language image descriptions (i.e. synthetic data).


Introduction
Multi-modal machine translation (MMT) is an exciting novel take on machine translation (MT) where we are interested in learning to translate sentences in the presence of visual input (mostly images). In the last three years there have been shared tasks (Barrault et al., 2018) where many research groups proposed different techniques to integrate images into MT, e.g. Caglayan et al. (2017); Libovický and Helcl (2017).
Most MMT models expand neural machine translation (NMT) architectures (Sutskever et al., 2014; Bahdanau et al., 2015) to additionally condition on an image in order to compute the likelihood of a translation in context. This gives the model a chance to exploit correlations in visual and language data, but also means that images must be available at test time. An exception to this rule is the work of Toyama et al. (2016) who exploit the framework of conditional variational auto-encoders (CVAEs) (Sohn et al., 2015) to decouple the encoder used for posterior inference at training time from the encoder used for generation at test time. Rather than conditioning on image features, the model of Elliott and Kádár (2017) learns to rank image features using language data in a multi-task learning (MTL) framework, therefore sharing parameters between a translation (generative) and a sentence-image ranking model (discriminative). This similarly exploits correlations between the two modalities and has the advantage that images are also not necessary at test time.
In this work, we also aim at translating without images at test time, yet learning a visually grounded translation model. To that end, we resort to probabilistic modelling instead of multi-task learning and estimate a joint distribution over translations and images. In a nutshell, we propose to model the interaction between visual and textual features through a latent variable. This latent variable can be seen as a stochastic embedding which is used in the target-language decoder, as well as to predict image features. Our experiments show that this joint formulation improves over an MTL approach (Elliott and Kádár, 2017), which does model both modalities but not jointly, and over the CVAE of Toyama et al. (2016), which uses image features to condition an inference network but crucially does not model the images.
The main contributions of this paper are as follows (code and pre-trained models are available at https://github.com/iacercalixto/variational_mmt):
• we propose a novel multi-modal NMT model that incorporates image features through latent variables in a deep generative model.
• our latent variable MMT formulation improves considerably over strong baselines, and compares favourably to the state-of-the-art.
• we exploit correlations between both modalities at training time through a joint generative approach and do not require images at prediction time.
The remainder of this paper is organised as follows. In §2, we describe our variational MMT models. In §3, we introduce the data sets we used and report experiments and assess how our models compare to prior work. In §4, we position our approach with respect to the literature. Finally, in §5 we draw conclusions and provide avenues for future work.

Variational Multi-modal NMT
Similarly to standard NMT, in MMT we wish to translate a source sequence $x_1^m = \langle x_1, \cdots, x_m \rangle$ into a target sequence $y_1^n = \langle y_1, \cdots, y_n \rangle$. The main difference is the presence of an image $v$ which illustrates the sentence pair $\langle x_1^m, y_1^n \rangle$. We do not model images directly, but instead a 2048-dimensional vector of pre-activations of a ResNet-50's pool5 layer (He et al., 2015).
In our variational MMT models, image features are assumed to be generated by transforming a stochastic latent embedding z, which is also used to inform the RNN decoder in translating source sentences into a target language.
Generative model. We propose a generative model of translation and image generation where both the image $v$ and the target sentence $y_1^n$ are independently generated given a common stochastic embedding $z$. The generative story is as follows. We observe a source sentence $x_1^m$ and draw an embedding $z$ from a latent Gaussian model,

$$Z \mid x_1^m \sim \mathcal{N}(\mu, \operatorname{diag}(\sigma^2)), \quad \mu = f_\mu(x_1^m; \theta), \quad \sigma = f_\sigma(x_1^m; \theta), \tag{1}$$

where $f_\mu(\cdot)$ and $f_\sigma(\cdot)$ map from a source sentence to a vector of locations $\mu \in \mathbb{R}^c$ and a vector of scales $\sigma \in \mathbb{R}^c_{>0}$, respectively. We then proceed to draw the image features from a Gaussian observation model,

$$V \mid z \sim \mathcal{N}(\nu, \varsigma^2 I), \quad \nu = f_\nu(z; \theta), \tag{2}$$

where $f_\nu(\cdot)$ maps from $z$ to a vector of locations $\nu \in \mathbb{R}^o$, and $\varsigma \in \mathbb{R}_{>0}$ is a hyperparameter of the model (we use 1). Conditioned on $z$ and on the source sentence $x_1^m$, and independently of $v$, we generate a translation by drawing each target word in context from a Categorical observation model,

$$Y_j \mid z, x_1^m, y_{<j} \sim \operatorname{Cat}(\pi_j), \quad \pi_j = f_\pi(z, x_1^m, y_{<j}; \theta), \tag{3}$$

where $f_\pi(\cdot)$ maps $z$, $x_1^m$, and a prefix translation $y_{<j}$ to the parameters $\pi_j$ of a categorical distribution over the target vocabulary. Functions $f_\mu(\cdot)$, $f_\sigma(\cdot)$, $f_\nu(\cdot)$, and $f_\pi(\cdot)$ are implemented as neural networks whose parameters are collectively denoted by $\theta$. In particular, implementing $f_\pi(\cdot)$ is as simple as augmenting a standard NMT architecture (Bahdanau et al., 2015; Luong et al., 2015), i.e. encoder-decoder with attention, with an additional input $z$ available at every time-step. All other functions are single-layer MLPs that transform the average encoder hidden state to the dimensionality of the corresponding Gaussian variable followed by an appropriate activation. Note that in effect we model a joint distribution

$$p_\theta(y_1^n, v, z \mid x_1^m) = p_\theta(z \mid x_1^m)\, p_\theta(v \mid z) \prod_{j=1}^{n} P_\theta(y_j \mid x_1^m, z, y_{<j}) \tag{4}$$

consisting of three components which we parameterise directly. As there are no observations for $z$, we cannot estimate these components directly. We must instead marginalise $z$ out, which yields the marginal

$$P_\theta(y_1^n, v \mid x_1^m) = \int p_\theta(z \mid x_1^m)\, p_\theta(v \mid z) \prod_{j=1}^{n} P_\theta(y_j \mid x_1^m, z, y_{<j})\, \mathrm{d}z. \tag{5}$$

An important statistical consideration about this model is that even though $y_1^n$ and $v$ are conditionally independent given $z$, they are marginally dependent.
This means that we have designed a data generating process where our observations $y_1^n, v \mid x_1^m$ are not assumed to have been independently produced.³ This is in direct contrast with multi-task learning or joint modelling without latent variables; for an extended discussion see Eikema and Aziz (2019, §3.1).

[Figure, panel (a)] VMMT_C: given the source text $x_1^m$, we model the joint likelihood of the translation $y_1^n$, the image (features) $v$, and a stochastic embedding $z$ sampled from a conditional latent Gaussian model. Note that the stochastic embedding is solely responsible for assigning a probability to the observation $v$, and it helps assign a probability to the translation.
Finally, Figure 1 (left) is a graphical depiction of the generative model: shaded circles denote observed random variables, unshaded circles indicate latent random variables, deterministic quantities are not circled; the internal plate indicates iteration over time-steps, the external plate indicates iteration over the training data. Note that deterministic parameters $\theta$ are global to all training instances, while stochastic embeddings $z$ are local to each tuple $\langle x_1^m, y_1^n, v \rangle$.
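The generative story above can be sketched as ancestral sampling. This is a minimal illustration rather than the released implementation: the function name `sample_generative_story`, the `toy_*` maps, and the three-word vocabulary are hypothetical stand-ins for the neural networks, chosen only to make the order of the draws and the conditional independence of $v$ and the translation given $z$ explicit.

```python
import math
import random


def sample_generative_story(x, f_mu, f_sigma, f_nu, f_pi, eos, max_len, rng,
                            image_scale=1.0):
    """Ancestral sampling of (z, v, y) given a source representation x."""
    # Latent Gaussian model: location and scale depend on the source only.
    mu, sigma = f_mu(x), f_sigma(x)
    z = [rng.gauss(m, s) for m, s in zip(mu, sigma)]
    # Gaussian observation model: image features depend on z only.
    v = [rng.gauss(n, image_scale) for n in f_nu(z)]
    # Categorical observation model: each word depends on z, x and the
    # prefix generated so far, but never on v.
    y = []
    while len(y) < max_len:
        pi = f_pi(z, x, y)
        word = rng.choices(range(len(pi)), weights=pi, k=1)[0]
        y.append(word)
        if word == eos:
            break
    return z, v, y


def softmax(logits):
    top = max(logits)
    exps = [math.exp(a - top) for a in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical toy maps standing in for the neural networks (assumptions).
def toy_f_mu(x):
    return [sum(x) / len(x), 0.0]


def toy_f_sigma(x):
    return [1.0, 0.5]


def toy_f_nu(z):
    return [z[0], z[1], z[0] + z[1]]


def toy_f_pi(z, x, prefix):
    # Word id 2 plays the role of end-of-sentence and becomes more likely
    # as the prefix grows.
    return softmax([z[0], z[1], float(len(prefix))])
```

Calling `sample_generative_story([0.2, -0.4, 0.6], toy_f_mu, toy_f_sigma, toy_f_nu, toy_f_pi, eos=2, max_len=8, rng=random.Random(0))` returns a two-dimensional embedding, a three-dimensional feature vector, and a list of word ids that ends at the end-of-sentence id or at the length limit.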
Inference. Parameter estimation for our model is challenging due to the intractability of the marginal likelihood function (5). We can however employ variational inference (VI) (Jordan et al., 1999), in particular amortised VI (Kingma and Welling, 2014; Rezende et al., 2014), and estimate parameters to maximise a lower bound on the log-likelihood function,

$$\log P_\theta(y_1^n, v \mid x_1^m) \ge \mathbb{E}_{q_\lambda(z \mid x_1^m, y_1^n, v)}\left[\log p_\theta(v \mid z) + \sum_{j=1}^{n} \log P_\theta(y_j \mid x_1^m, z, y_{<j})\right] - \mathrm{KL}\left(q_\lambda(z \mid x_1^m, y_1^n, v) \,\|\, p_\theta(z \mid x_1^m)\right). \tag{6}$$

This evidence lower bound (ELBO) is expressed in terms of an inference model $q_\lambda(z \mid x_1^m, y_1^n, v)$ which we design having tractability in mind. In particular, our approximate posterior is a Gaussian distribution parametrised by an inference network, that is, an independently parameterised neural network (whose parameters we denote collectively by $\lambda$) which maps from observations, in our case a sentence pair and an image, to a variational location $u \in \mathbb{R}^c$ and a variational scale $s \in \mathbb{R}^c_{>0}$. Figure 1 (right) is a graphical depiction of the inference model.

3 This is an aspect of the model we aim to explore more explicitly in the near future.
Location-scale variables (e.g. Gaussians) can be reparametrised, i.e. we can obtain a latent sample via a deterministic transformation of the variational parameters and a sample from the standard Gaussian distribution:

$$z = u + \epsilon \odot s, \quad \epsilon \sim \mathcal{N}(0, I).$$

This reparametrisation enables backpropagation through stochastic units (Kingma and Welling, 2014; Titsias and Lázaro-Gredilla, 2014). In addition, for two Gaussians the KL term in the ELBO (6) can be computed in closed form (Kingma and Welling, 2014, Appendix B). Altogether, we can obtain a reparameterised gradient estimate of the ELBO: we use a single-sample estimate of the first term, and count on stochastic gradient descent to attain a local optimum of (6).
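As a rough illustration of the two ingredients just described, the sketch below draws a reparameterised sample and evaluates the closed-form KL divergence between two diagonal Gaussians. The function names and the `log_likelihood` callable are assumptions made for this example; gradients are not computed here, only the quantities that a single-sample ELBO estimate combines.

```python
import math
import random


def reparameterise(u, s, rng):
    """Return (z, eps) with z = u + eps * s and eps ~ N(0, I)."""
    eps = [rng.gauss(0.0, 1.0) for _ in u]
    z = [u_i + e_i * s_i for u_i, e_i, s_i in zip(u, eps, s)]
    return z, eps


def kl_diag_gaussians(u, s, mu, sigma):
    """KL( N(u, diag(s^2)) || N(mu, diag(sigma^2)) ), summed over dimensions."""
    total = 0.0
    for u_i, s_i, m_i, sg_i in zip(u, s, mu, sigma):
        total += (math.log(sg_i / s_i)
                  + (s_i ** 2 + (u_i - m_i) ** 2) / (2.0 * sg_i ** 2)
                  - 0.5)
    return total


def single_sample_elbo(log_likelihood, u, s, mu, sigma, rng):
    """One-sample estimate: log-likelihood at a sampled z minus the exact KL.

    `log_likelihood` is a hypothetical callable returning
    log p(v | z) + log P(y | x, z) for the observed v and y.
    """
    z, _ = reparameterise(u, s, rng)
    return log_likelihood(z) - kl_diag_gaussians(u, s, mu, sigma)
```

For a fixed standard Gaussian prior, the same KL function applies with `mu` set to zeros and `sigma` set to ones.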
Architecture. All of our parametric functions are neural network architectures. In particular, $f_\pi$ is a standard sequence-to-sequence architecture with attention and a softmax output. We build upon OpenNMT (Klein et al., 2017), which we modify slightly by providing $z$ as additional input to the target-language decoder at each time step. Location layers $f_\mu$, $f_\nu$ and $g_u$, and scale layers $f_\sigma$ and $g_s$, are feed-forward networks with a single ReLU hidden layer. Furthermore, location layers have a linear output while scale layers have a softplus output. For the generative model, $f_\mu$ and $f_\sigma$ transform the average source-language encoder hidden state.
We let the inference model condition on source-language encodings without updating them, and we use a target-language bidirectional LSTM encoder in order to also condition on the complete target sentence. Then $g_u$ and $g_s$ transform a concatenation of the average source-language encoder hidden state, the average target-language bidirectional encoder hidden state, and the image features.
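The difference between location and scale layers can be illustrated with a small feed-forward sketch: one ReLU hidden layer, followed by either a linear output (locations) or a softplus output (scales, which must be strictly positive). The helper names and the weight layout below are assumptions for illustration and do not mirror the OpenNMT code.

```python
import math


def relu(a):
    return a if a > 0.0 else 0.0


def softplus(a):
    # Numerically stable log(1 + exp(a)); the result is always positive.
    return max(a, 0.0) + math.log1p(math.exp(-abs(a)))


def dense(weights, bias, inputs):
    # One output unit per row of `weights`.
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, bias)]


def gaussian_layer(inputs, hidden_layer, output_layer, positive):
    """Single-hidden-layer network; `positive=True` yields a scale layer."""
    (w1, b1), (w2, b2) = hidden_layer, output_layer
    hidden = [relu(a) for a in dense(w1, b1, inputs)]
    out = dense(w2, b2, hidden)
    return [softplus(a) for a in out] if positive else out
```

With identical weights, the location variant may return negative values, whereas the scale variant maps the same pre-activations to positive numbers suitable as Gaussian standard deviations.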
Fixed Gaussian prior. We have just presented our variational MMT model in its full generality; we refer to that model as VMMT_C. However, keeping in mind that MMT datasets are rather small, it is desirable to simplify some of our model's components. In particular, the estimated latent Gaussian model (1) can be replaced by a fixed standard Gaussian prior, i.e., $Z \sim \mathcal{N}(0, I)$; we refer to this model as VMMT_F. Along with this change it is convenient to modify the inference model to condition on $x_1^m$ alone, which allows us to use the inference model for both training and prediction. Importantly, this also sidesteps the need for a target-language bidirectional LSTM encoder, which leaves us a smaller set of inference parameters $\lambda$ to estimate. Interestingly, this model does not rely on features from $v$, instead only using it as a learning signal through the objective in (6), which is in direct contrast with the model of Toyama et al. (2016).

Experiments
Our encoder is a 2-layer 500D bidirectional RNN with GRU, the source and target word embeddings are 500D, and all are trained jointly with the model. We use OpenNMT to implement all our models (Klein et al., 2017). All model parameters are initialised sampling from a uniform distribution U(−0.1, +0.1) and bias vectors are initialised to 0. Visual features are obtained by feeding images to the pre-trained ResNet-50 and using the activations of the pool5 layer (He et al., 2015). We apply dropout with a probability of 0.5 in the encoder bidirectional RNN, the image features, the decoder RNN, and before emitting a target word.
All models are trained using the Adam optimiser (Kingma and Ba, 2014) with an initial learning rate of 0.002 and minibatches of size 40, where each training instance consists of one English sentence, one German sentence and one image (MMT). Models are trained for up to 40 epochs and we perform model selection based on BLEU4, and use the best performing model on the validation set to translate test data. Moreover, we halt training if the model does not improve BLEU4 scores on the validation set for 10 epochs or more. We report mean and standard deviation over 4 independent runs for all models we trained ourselves (NMT, VMMT_F, VMMT_C), and other baseline results are the ones reported in the authors' publications (Toyama et al., 2016; Elliott and Kádár, 2017).
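The model selection and stopping rule above can be sketched as follows. This is one plausible reading of the description, with a hypothetical function name; it assumes one validation BLEU4 score per epoch and halts once 10 epochs have passed without a new best score.

```python
def select_checkpoint(validation_bleu, max_epochs=40, patience=10):
    """Return (best_epoch, best_bleu, epochs_run), with 1-indexed epochs.

    `validation_bleu` holds one validation BLEU4 score per epoch.
    """
    best_epoch, best_bleu = 0, float("-inf")
    epochs_run = 0
    for epoch, bleu in enumerate(validation_bleu[:max_epochs], start=1):
        epochs_run = epoch
        if bleu > best_bleu:
            # New best checkpoint: this is the one used to translate test data.
            best_epoch, best_bleu = epoch, bleu
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs: halt training.
            break
    return best_epoch, best_bleu, epochs_run
```

For example, if the best score is reached at epoch 3 and never improved upon, training under this sketch stops at epoch 13 and the epoch-3 checkpoint is selected.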
We preprocess our data by tokenizing, lowercasing, and converting words to subword tokens using a bilingual BPE model with 10k merge operations (Sennrich et al., 2016b). We quantitatively evaluate translation quality using case-insensitive and tokenized outputs in terms of BLEU4 (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014), chrF3 (Popović, 2015), and BEER (Stanojević and Sima'an, 2014). By using these, we hope to include word-level metrics which are traditionally used by the MT community (i.e. BLEU and METEOR), as well as more recent metrics which operate at the character level and that better correlate with human judgements of translation quality (i.e. chrF3 and BEER) (Bojar et al., 2017).

Datasets
The Flickr30k dataset (Young et al., 2014) consists of images from Flickr and their English descriptions. We use the translated Multi30k (M30k_T) dataset, i.e. an extension of Flickr30k where for each image one of its English descriptions was translated into German by a professional translator. Training, validation and test sets contain 29k, 1014 and 1k images respectively, each accompanied by the original English sentence and its translation into German. In addition to the test set released for the first run of the multi-modal translation shared task, henceforth test2016, we also use test2017, released for the next run of this shared task.
Since this dataset is very small, we also investigate the effect of including more in-domain data to train our models. For that purpose, we use an additional 145K monolingual German descriptions released as part of the Multi30k dataset for the task of image description generation. We refer to this dataset as comparable Multi30k (M30k_C). Descriptions in the comparable Multi30k were collected independently of existing English descriptions and describe the same 29K images as in the M30k_T dataset. In order to obtain features for images, we use ResNet-50 (He et al., 2015) pre-trained on ImageNet (Russakovsky et al., 2015). We report experiments using pool5 features as our image features, i.e. 2048-dimensional pre-activations of the last layer of the network.
In order to investigate how well our models generalise, we also evaluate our models on the ambiguous MSCOCO test set, which was designed with example sentences that are hard to translate without resorting to visual context available in the accompanying image.

Table caption: For each model, we report the mean and standard deviation over 4 independent runs where models were selected using validation BLEU4 scores. Best mean baseline scores per metric are underlined and best overall results (i.e. means) are in bold. We highlight in green/red the improvement brought by our models compared to the best baseline mean score.
Finally, we use a 50D latent embedding z in our experiments with the translated Multi30k data, whereas in our ablative experiments and experiments with the comparable Multi30k data, we use a 500D stochastic embedding z.

Baselines
We compare our work against three different baselines. The first one is a standard text-only sequence-to-sequence NMT model with attention (Luong et al., 2015), trained from scratch using hyperparameters described above. The second baseline is the variational multi-modal MT model Model G proposed by Toyama et al. (2016), where global image features are used as additional input to condition an inference network. Finally, a third baseline is the Imagination model of Elliott and Kádár (2017), a multi-task MMT model which uses a shared source-language encoder RNN and is trained on two tasks: translation from English into German and image-sentence ranking (English↔image).

Translated Multi30k
We now report on experiments conducted with models trained to translate from English into German using the translated Multi30k data set (M30k_T).
In Table 1, we compare our variational MMT models (VMMT_C for the general case with a conditional Gaussian latent model, and VMMT_F for the simpler case of a fixed Gaussian prior) to the three baselines described above. The general trend is that both formulations of our VMMT improve with respect to all three baselines. We note an improvement in BLEU and METEOR mean scores compared to the Imagination model (Elliott and Kádár, 2017), as well as reduced variance (though note this is based on only 4 independent runs in our case, and 3 independent runs of Imagination). Both VMMT_F and VMMT_C outperform Model G according to BLEU and perform comparably according to METEOR, bearing in mind that the results reported by Toyama et al. (2016) are based on a single run. Moreover, we also note that both our models outperform the text-only NMT baseline according to all four metrics, and by 1%-1.4% according to chrF3 and BEER, both being metrics well-suited to measuring the quality of translations into German generated with subword units.
In Table 2, we report results when translating the Multi30k test2017 and the ambiguous MSCOCO test sets. Note that standard deviations for the conditional model VMMT_C are considerably higher than those obtained for model VMMT_F. We investigated the issue further and found out that one of the runs of VMMT_C performed considerably worse than the others; this caused the mean scores to be much lower and also increased the variance significantly. Finally, one interesting finding is that all four metrics indicate that the fixed-prior model VMMT_F either performs slightly (Table 1) or considerably better (Table 2) than the conditional model VMMT_C. We speculate this is partly due to VMMT_F's simpler parameterisation: after all, we have just about 29k training instances to estimate two sets of parameters ($\theta$ and $\lambda$) and the more complex VMMT_C requires an additional bidirectional LSTM encoder for the target text.

Table caption: For each model, we report the mean and standard deviation over 4 independent runs where models were selected using validation BLEU4 scores. Best overall results (i.e. means) are in bold. Note that standard deviations for the conditional model VMMT_C are considerably higher than those obtained for model VMMT_F. This is partly due to the fact that one of the runs of VMMT_C underperformed compared to the other three.

Back-translated Comparable Multi30k
Since the translated Multi30k dataset is very small, we also investigate the effect of including more in-domain data to train our models. For that purpose, we use an additional 145K monolingual German descriptions released as part of the comparable Multi30k dataset (M30k_C). We train a text-only NMT model to translate from German into English using the original 29K parallel sentences in the translated Multi30k (without images), and apply this model to back-translate the 145K German descriptions into English (Sennrich et al., 2016a).
In this set of experiments, we explore how pre-training models NMT, VMMT_F and VMMT_C using both the translated and back-translated comparable Multi30k affects results. Models are pre-trained on mini-batches with a one-to-one ratio of translated and back-translated data. All three models (NMT, VMMT_F and VMMT_C) are further fine-tuned on the translated Multi30k until convergence, and model selection using BLEU is only applied during fine-tuning and not at the pre-training stage.
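One way to realise a one-to-one ratio of translated and back-translated data is sketched below. The text does not say whether the ratio holds within each mini-batch or across alternating mini-batches, so this example assumes the former; the function name is hypothetical, and the smaller translated set is cycled so that every back-translated example is visited once per pass.

```python
import itertools
import random


def mixed_batches(translated, back_translated, batch_size, rng):
    """Yield mini-batches with equal numbers of gold and synthetic examples.

    Assumes both inputs are non-empty and `batch_size` is even.
    """
    half = batch_size // 2
    gold = list(translated)
    synthetic = list(back_translated)
    rng.shuffle(gold)
    rng.shuffle(synthetic)
    # The gold data is the smaller set, so it is repeated as often as needed.
    gold_cycle = itertools.cycle(gold)
    for start in range(0, len(synthetic) - half + 1, half):
        batch = [next(gold_cycle) for _ in range(half)]
        batch += synthetic[start:start + half]
        rng.shuffle(batch)
        yield batch
```

With the mini-batch size of 40 used in the experiments, each batch under this sketch would hold 20 translated and 20 back-translated instances.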
In Figure 2, we inspect for how many epochs a model should be pre-trained using the additional noisy back-translated descriptions, and note that both VMMT_F and VMMT_C reach best BLEU scores on the validation set when pre-trained for about 3 epochs. As shown in Figure 2, when using additional noisy data VMMT_C, which uses a conditional prior, performs considerably better than its counterpart VMMT_F, which has a fixed prior. These results indicate that VMMT_C makes better use of additional synthetic data than VMMT_F. Some of the reasons that explain these results are (i) the conditional prior $p(z|x)$ can learn to be sensitive to whether $x$ is gold-standard or synthetic, whereas $p(z)$ cannot; (ii) in the conditional case the posterior approximation $q(z|x, y, v)$ can directly exploit different patterns arising from a gold-standard versus a synthetic $\langle x, y \rangle$ pair; and finally (iii) our synthetic data is made of target-language gold-standard image descriptions, which help train the inference network's target-language BiLSTM encoder.

Table 3: Results for models pre-trained using the translated and comparable Multi30k to translate the Multi30k test set. We report the mean and standard deviation over 4 independent runs. Our best overall results are highlighted in bold, and we highlight in green/red the improvement/decrease brought by our models compared to the baseline mean score. We additionally show results for the Imagination model trained on 4× more data (as reported in the authors' paper).

In Table 3, we show results when applying VMMT_F and VMMT_C to translate the Multi30k test set. Both models and the NMT baseline are pre-trained on the translated and the back-translated comparable Multi30k data sets, and are selected according to validation set BLEU scores. For comparison, we also include results for Imagination (Elliott and Kádár, 2017) when trained on the translated Multi30k, the WMT News Commentary English-German dataset (240K parallel sentence pairs) and the MSCOCO image description dataset (414K German descriptions of 83K images, i.e. 5 descriptions for each image). In contrast, our models observe 29K images (i.e. the same as the models evaluated in Section 3.3) plus 145K German descriptions only.⁵

Ablative experiments
In our ablation we are interested in finding out to what extent the model makes use of the latent space, i.e. how important the latent variable is.
KL free bits. A common issue when training latent variable models with a strong decoder is having the true posterior collapse to the prior and the KL term in the ELBO vanish to zero. In practice, that would mean the model has virtually not used the latent variable $z$ to predict image features $v$, but mostly as a source of stochasticity in the decoder. This can happen because the model has access to informative features from the source bi-LSTM encoder and need not learn a difficult mapping from observations to latent representations predictive of image features.

5 There are no additional images because the comparable Multi30k consists of additional German descriptions for the same 29K images already in the translated Multi30k.

For that reason, we wish to measure how well we can train latent variable MMT models while ensuring that the KL term in the loss (Equation (6)) does not vanish to zero. We use the free bits heuristic (Kingma et al., 2016) to impose a constraint on the KL, which in turn promotes models with non-negligible mutual information between inputs and latent variables (Alemi et al., 2018).
In Table 4, we see the results of different models trained using different numbers of free bits in the KL component. We note that including free bits improves translations slightly, but finding the optimal number of free bits requires hyper-parameter search.
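The free bits constraint discussed above can be sketched as a clamp on the KL term of the loss. The granularity is an assumption here: this example applies the threshold to the total KL of one training instance, whereas the heuristic can also be applied per group of latent dimensions. Function names are hypothetical.

```python
def free_bits_kl(kl, free_bits):
    """KL term with free bits: values below the threshold are not penalised further."""
    return max(kl, free_bits)


def loss_with_free_bits(reconstruction_log_lik, kl, free_bits):
    """Negative ELBO in which the KL term is clamped from below.

    `reconstruction_log_lik` stands for the single-sample estimate of
    log p(v | z) + log P(y | x, z).
    """
    return -reconstruction_log_lik + free_bits_kl(kl, free_bits)
```

While the KL is below the threshold, the clamped term is constant, so the optimiser gains nothing from pushing the approximate posterior further towards the prior.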

Discussion
In Table 5 we show how our different models translate two examples of the M30k test set. In the first example (id#801), training on additional back-translated data improves variational models but not the NMT baseline, whereas in the second example (id#873) differences between baseline and variational models still persist even when training on additional back-translated data.

Table caption: In the first example, neither the NMT baseline (with or without back-translated data) nor model VMMT_C (trained on limited data) could translate archway correctly; the NMT baseline translates it as "scheibe" (disk) and "bogen" (bow), and VMMT_C also incorrectly translates it as "bogen" (bow). However, VMMT_C translates without errors when trained on additional back-translated data, i.e. "torbogen" (archway). In the second example, the NMT baseline translates bay as "luft" (air) or "meer" (sea), whereas VMMT_F translates it as "bucht" (bay) or "wellen" (waves) and VMMT_C always as "bucht" (bay).

Related work
Even though there has been growing interest in variational approaches to machine translation (Zhang et al., 2016; Schulz et al., 2018; Shah and Barber, 2018; Eikema and Aziz, 2019) and to tasks that integrate vision and language, e.g. image description generation (Pu et al., 2016; Wang et al., 2017), relatively little attention has been dedicated to variational models for multi-modal translation. This is partly due to the fact that multi-modal machine translation was only recently addressed by the MT community by means of a shared task (Barrault et al., 2018). Nevertheless, we now discuss relevant variational and deterministic multi-modal MT models in the literature.
Fully supervised MMT models. All submissions to the three runs of the multi-modal MT shared tasks (Barrault et al., 2018) model conditional probabilities directly without latent variables.
Perhaps the first MMT model proposed prior to these shared tasks is that of Hitschler et al. (2016), who used image features to re-rank translations of image descriptions generated by a phrase-based statistical MT model (PBSMT) and reported significant improvements. Shah et al. (2016) propose a similar model where image logits are used to re-rank the output of PBSMT. Global image features, i.e. features computed over an entire image (such as pool5 ResNet-50 features used in this work), have been directly used as "tokens" in the source sentence, to initialise encoder RNN hidden states, or as additional information used to initialise the decoder RNN states (Huang et al., 2016; Libovický et al., 2016). On the other hand, spatial visual features, i.e. local features that encode different parts of the image separately in different vectors, have been used in doubly-attentive models where there is one attention mechanism over the source RNN hidden states and another one over the image features (Caglayan et al., 2016). Finally, Caglayan et al. (2017) proposed to interact image features with target word embeddings, more specifically to perform an element-wise multiplication of the (projected) global image features and the target word embeddings before feeding the target word embeddings into their decoder GRU. They reported significant improvements by using image features to gate target word embeddings and won the 2017 Multi-modal MT shared task.
Multi-task MMT models. Multi-task learning MMT models are easily applicable to translate sentences without images (at test time), which is an advantage over the above-mentioned models. Luong et al. (2016) proposed a multi-task approach where a model is trained using two tasks and a shared decoder: the main task is to translate from German into English and the secondary task is to generate English descriptions given an image. They show improvements in the main translation task when also training for the secondary image description task. Their model is large, i.e. a 4-layer encoder LSTM and a 4-layer decoder LSTM, and their best set up uses a ratio of 0.05 image description generation training data samples in comparison to translation training data samples. Elliott and Kádár (2017) propose an MTL model trained to do translation (English→German) and sentence-image ranking (English↔image), using a standard word cross-entropy and margin-based losses as its task objectives, respectively. Their model uses the pre-trained GoogleNet v3 CNN (Szegedy et al., 2016) to extract pool5 features, and has a 1-layer source-language bidirectional GRU encoder and a 1-layer GRU decoder.
Variational MMT models. Toyama et al. (2016) proposed a variational MMT model that is likely the most similar model to the one we put forward in this work. They build on the variational neural MT (VNMT) model of Zhang et al. (2016), which is a conditional latent model where a Gaussian-distributed prior of $z$ is parameterised as a function of the source sentence $x_1^m$, i.e. $p(z|x_1^m)$, and both $x_1^m$ and $z$ are used at each time step in an attentive decoder RNN, $P(y_j|x_1^m, z, y_{<j})$. In Toyama et al. (2016), image features are used as input to the inference model $q_\lambda(z|x_1^m, y_1^n, v)$ that approximates the posterior over the latent variable, but otherwise are not modelled and not used in the generative network. Differently from their work, we use image features in all our generative models, and propose modelling them as random observed outcomes while still being able to use our model to translate without images at test time. In the conditional case, we further use image features for posterior inference. Additionally, we also investigate both conditional and fixed priors, i.e. $p(z|x_1^m)$ and $p(z)$, whereas their model is always conditional. Interestingly, we found in our experiments that fixed-prior models perform slightly better than conditional ones under limited training data. Toyama et al. (2016) use the pre-trained VGG19 CNN (Simonyan and Zisserman, 2015) to extract FC7 features, and additionally experiment with using additional features from object detections obtained with the Fast RCNN network (Girshick, 2015). One more difference between their work and ours is that we only use the ResNet-50 network to extract pool5 features, and no additional pre-trained CNN nor object detections.

Conclusions and Future work
We have proposed a latent variable model for multi-modal neural machine translation and have shown benefits from both modelling images and promoting use of latent space. We also show that in the absence of enough data to train a more complex inference network a simple fixed prior suffices, whereas when more training data is available (even noisy data) a conditional prior is preferable. Importantly, our models compare favourably to the state-of-the-art.
In future work we will explore other generative models for multi-modal MT, as well as different ways to directly incorporate images into these models. We are also interested in modelling different views of the image, such as global vs. local image features, and also in using larger image collections and modelling images directly, i.e. pixel intensities.

A Model Architecture
Once again, we wish to translate a source sequence $x_1^m = \langle x_1, \cdots, x_m \rangle$ into a target sequence $y_1^n = \langle y_1, \cdots, y_n \rangle$, and also predict image features $v$. In Figure 3, we illustrate generative and inference networks for models VMMT_C and VMMT_F.

A.1 Generative model
Source-language encoder. The source-language encoder is deterministic and implemented using a 2-layer bidirectional Long Short-Term Memory (LSTM) network (Hochreiter and Schmidhuber, 1997):

$$h_1^m = \operatorname{BiLSTM}\left(\operatorname{emb}(x_1), \cdots, \operatorname{emb}(x_m)\right),$$

where emb is the source look-up matrix, trained jointly with the model, and $h_1^m$ are the final source hidden states.
Image decoder. We do not model images directly, but instead as a 2048-dimensional feature vector $v$ of pre-activations of a ResNet-50's pool5 layer. We simply draw image features from a Gaussian observation model:

$$V \mid z \sim \mathcal{N}(\nu, \varsigma^2 I), \quad \nu = \operatorname{MLP}(z; \theta),$$

where a multi-layer perceptron (MLP) maps from $z$ to a vector of locations $\nu \in \mathbb{R}^o$, and $\varsigma \in \mathbb{R}_{>0}$ is a hyper-parameter of the model (we use 1).
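Under the Gaussian observation model above, the contribution of the image features to the training objective is a Gaussian log-density. The sketch below evaluates it for an isotropic Gaussian; the function name is an assumption. With the scale fixed to 1, the term reduces to a negative half squared error plus a constant.

```python
import math


def image_log_likelihood(v, nu, scale=1.0):
    """log N(v | nu, scale^2 I) for feature vectors of equal length."""
    if len(v) != len(nu):
        raise ValueError("v and nu must have the same dimensionality")
    dim = len(v)
    squared_error = sum((a - b) ** 2 for a, b in zip(v, nu))
    return (-0.5 * squared_error / scale ** 2
            - dim * math.log(scale)
            - 0.5 * dim * math.log(2.0 * math.pi))
```

Because the constant does not depend on the parameters, maximising this term with `scale=1.0` is equivalent to minimising half the squared distance between predicted locations and observed features.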
Fixed prior VMMT_F. In the MMT model VMMT_F, we simply have a draw from a standard Normal prior: $Z \sim \mathcal{N}(0, I)$.

A.2 Inference model
The inference network shares the source-language encoder with the generative model and differs depending on the model (VMMT_C or VMMT_F).