Generative Imagination Elevates Machine Translation

Text and images share common semantics. Given a sentence in a source language, can depicting its visual scene help translate it into a target language? Existing multimodal neural machine translation (MNMT) methods require triplets of a bilingual sentence pair and an image for training, and tuples of a source sentence and an image for inference. In this paper, we propose ImagiT, a novel machine translation method via visual imagination. ImagiT first learns to generate a visual representation from the source sentence, and then utilizes both the source sentence and the "imagined representation" to produce a target translation. Unlike previous methods, it needs only the source sentence at inference time. Experiments demonstrate that ImagiT benefits from visual imagination and significantly outperforms text-only neural machine translation baselines. Further analysis reveals that the imagination process in ImagiT helps fill in missing information under a degradation strategy.


Introduction
Visual grounding has been introduced in a novel multimodal neural machine translation (MNMT) task (Barrault et al., 2018), which uses bilingual (or multilingual) parallel corpora annotated with images describing the sentences' contents (see Figure 1(a)). The superiority of MNMT lies in its ability to use visual information to improve translation quality, but its effectiveness largely depends on the availability of datasets, especially the quantity and quality of annotated images. In addition, because the cost of manual image annotation is relatively high, MNMT is at this stage mostly applied to a small and specific dataset, Multi30K, and is not suitable for large-scale text-only neural machine translation (NMT) (Bahdanau et al., 2015; Vaswani et al., 2017). Such limitations hinder the applicability of visual information in NMT.
To address the bottlenecks mentioned above, prior work proposes to build a lookup table from an image dataset and then use a search-based method to retrieve pictures that match the source-language keywords. However, the lookup table is built from Multi30K, which leads to relatively limited coverage of pictures and potentially introduces much irrelevant noise: the exact image corresponding to the text is not always found, and it may not even exist in the database. Elliott and Kádár (2017) present a multitask learning framework to ground visual representations in a shared space. Their architecture, called "imagination", shares an encoder between a primary NMT task and an auxiliary task of ranking visual features for image retrieval. However, the image is neither explicitly generated nor is the visual feature directly leveraged by the translation decoder; the model simply learns a visually grounded shared encoder. Based on these earlier explorations, we hypothesize that the potential of vision in conventional text-only NMT has not been fully discovered. Unlike the implicit approach of Elliott and Kádár (2017), we understand "imagination" to be more like "picturing", similar to how humans visually depict figures in the mind from an utterance. Our approach aims to explicitly imagine a "vague figure" (see Figure 1(b)) to guide the translation, since a picture is worth a thousand words, and picturing a sentence is the instinctive reaction of a human being learning a second language.
In this paper, we propose a novel end-to-end machine translation model embedded with visual semantics via generative imagination (ImagiT) (see Figure 1(b)). Given a source-language sentence, ImagiT first encodes it and transforms the word representations into visual features through an attentive generator, which can effectively capture semantics at both the global and local levels; the generated visual representations can be considered semantically equivalent reconstructions of the sentences. A simple yet effective integration module is designed to aggregate the textual and visual modalities. In the final stage, the model learns to generate the target-language sentence based on the joint features. To train the model in an end-to-end fashion, we apply a visual realism adversarial loss and a text-image pair-aware adversarial loss, as well as a text-semantic reconstruction loss and a cross-entropy-based target-language translation loss.
In contrast with most prior MNMT work, our proposed ImagiT model does not require images as input at inference time but can still leverage visual information through imagination, making it an appealing method in low-resource scenarios. Moreover, ImagiT is flexible, accepting external parallel text data or non-parallel image captioning data. We evaluate our ImagiT model on the Multi30K dataset. The experimental results show that our proposed method significantly outperforms the text-only NMT baseline. The analysis demonstrates that imagination helps the model complete missing information in the sentence when we perform degradation masking, and we also see improvements in translation quality when pre-training the model with an external non-parallel image captioning dataset.
To summarize, the paper has the following contributions: 1. We propose generative imagination, a new setup for machine translation assisted by synthesized visual representation, without annotated images as input; 2. We propose the ImagiT method, which shows advantages over the conventional MNMT model and gains significant improvements over the text-only NMT baseline; 3. We conduct experiments to verify and analyze how imagination helps the translation.

Related work
MNMT As a "language" shared by people worldwide, the visual modality may help machines form a more comprehensive perception of the real world. Multimodal neural machine translation (MNMT) is a machine translation task proposed by the machine translation community that aims to design multimodal translation frameworks using context from the additional visual modality. The shared task released the Multi30K dataset, an extended German version of Flickr30K (Young et al., 2014), later expanded to French and Czech (Barrault et al., 2018). Across the three editions of the task, scholars have proposed many multimodal machine translation models and methods. Huang et al. (2016) encode word sequences with regional visual objects, while other work studies the effect of incorporating global visual features to initialize the encoder/decoder hidden states of an RNN. Caglayan et al. (2017) model the image-text interaction by leveraging element-wise multiplication. Elliott and Kádár (2017) propose a multitask learning framework to ground visual representations in a shared space and learn with an auxiliary triplet alignment task. The common practice is to use convolutional neural networks to extract visual information and then use attention mechanisms to extract visual contexts (Caglayan et al., 2016; Calixto et al., 2016; Libovický and Helcl, 2017). Ive et al. (2019) propose a translate-and-refine approach using a two-stage decoder. Calixto et al. (2019) put forward a latent variable model to capture the multimodal interactions between visual and textual features. Caglayan et al. (2019) show that visual content is more critical when the textual content is limited or uncertain in MNMT. Recently, Yao and Wan (2020) propose multimodal self-attention in the Transformer to avoid encoding irrelevant information from images, and Yin et al. (2020) propose a graph-based multimodal fusion encoder to capture various relationships.
Text-to-image synthesis Traditional text-to-image (T2I) synthesis mainly uses keywords to search for small image regions and finally optimizes the entire layout (Zhu et al., 2007). After generative adversarial networks (GANs) (Goodfellow et al., 2014) were proposed, scholars presented a variety of GAN-based T2I models. Reed et al. (2016) propose DC-GAN, designing a direct and straightforward network and training strategy for T2I generation. Zhang et al. (2017) propose StackGAN, which contains multiple cascaded generators and discriminators, with higher stages generating higher-quality pictures. These works considered only global sentence-level semantics. Xu et al. (2018) propose AttnGAN, applying an attention mechanism to capture fine-grained word-level information. MirrorGAN (Qiao et al., 2019) employs a mirror structure, which learns from the inverse task of T2I, also known as image captioning, to further validate whether generated images are consistent with the input texts.

ImagiT model
As shown in Figure 2, ImagiT embodies the encoder-decoder structure for end-to-end machine translation. Between the encoder and the decoder, there is an imagination step that generates a semantically equivalent visual representation. Technically, our model is composed of the following modules: a source text encoder, a generative imagination network, an image captioning module, a multimodal aggregation module, and a decoder for translation. We elaborate on each of them in the rest of this section.

Source text encoder

Vaswani et al. (2017) propose the state-of-the-art Transformer-based machine translation framework, whose encoder can be written as follows:

\tilde{h}^l = \mathrm{LN}(\mathrm{Att}^l(h^{l-1}) + h^{l-1}), \quad h^l = \mathrm{LN}(\mathrm{FFN}^l(\tilde{h}^l) + \tilde{h}^l),

where \mathrm{Att}^l, \mathrm{LN}, and \mathrm{FFN}^l are the self-attention module, layer normalization, and the feed-forward network of the l-th identical layer, respectively. The core of the Transformer is multi-head self-attention; in each attention head, we have:

\mathrm{Att}(h) = \mathrm{softmax}\left(\frac{(hW^Q)(hW^K)^\top}{\sqrt{d_k}}\right) hW^V,

where W^V, W^Q, W^K are layer-specific trainable parameter matrices. For the output of the final stacked layer, we use w = \{w_0, w_1, ..., w_{L-1}\}, w \in \mathbb{R}^{d \times L}, to represent the source word embeddings, where L is the length of the source sentence. Besides, we add a special token to each source sentence to obtain the sentence representation s \in \mathbb{R}^d.
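As a minimal sketch of the encoder's core step, the following NumPy code runs one single-head self-attention pass and reads off the word matrix w and the sentence vector s from the special-token position. Dimensions are toy values, not the paper's settings, and a real encoder would stack L layers with learned weights:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(h, W_Q, W_K, W_V):
    # Single-head scaled dot-product self-attention:
    # Att(h) = softmax((h W_Q)(h W_K)^T / sqrt(d_k)) (h W_V)
    Q, K, V = h @ W_Q, h @ W_K, h @ W_V
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

rng = np.random.default_rng(0)
L, d = 4, 8                        # toy sentence length and model size
h = rng.normal(size=(L + 1, d))    # L word positions plus one special token
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

out = self_attention(h, W_Q, W_K, W_V)
s = out[0]       # sentence representation s in R^d (special-token position)
w = out[1:].T    # word embeddings w in R^{d x L}
```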

Generative imagination network
Generative adversarial networks (Goodfellow et al., 2014) have been applied to synthesize images similar to the ground truth (Zhang et al., 2017; Xu et al., 2018; Qiao et al., 2019). We follow the common practice of using conditioning augmentation (Zhang et al., 2017) to enhance robustness to small perturbations along the conditioning text manifold and to improve the diversity of generated samples. F_{ca} represents the conditioning augmentation function, and s_{ca} = F_{ca}(s) represents the enhanced sentence representation.
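The following is a small sketch of conditioning augmentation in the StackGAN style: a Gaussian is predicted from the sentence embedding and sampled via the reparameterization trick. The projection weights and dimensions here are hypothetical stand-ins for learned parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ca = 16, 8                          # toy sentence-embedding and s_ca sizes

# Hypothetical projections; in practice these are learned jointly.
W_mu = 0.1 * rng.normal(size=(d, d_ca))
W_logsigma = 0.1 * rng.normal(size=(d, d_ca))

def F_ca(s):
    # Predict a Gaussian around the sentence embedding and sample from it
    # (reparameterization trick): robustness to small perturbations along
    # the text manifold, plus diversity of generated samples.
    mu = s @ W_mu
    log_sigma = s @ W_logsigma
    eps = rng.normal(size=d_ca)
    return mu + np.exp(log_sigma) * eps

s = rng.normal(size=d)
s_ca = F_ca(s)
```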
{F_0, F_1} are two visual feature converters sharing a similar architecture. F_0 contains a fully connected layer and four deconvolution layers (Noh et al., 2015) to obtain image-sized feature vectors. Furthermore, we define {f_0, f_1} as the visual features after the two transformations, with different resolutions. For the detailed layer structure and block design, please refer to Xu et al. (2018).

Figure 2: Overview of the framework of the proposed ImagiT. F_0 and F_1 are text-to-image converters, sharing similar structures and comprising perceptron, residual, and upsampling blocks. L× represents L identical layers. Note that we only need the generated visual feature to guide the translation; for the whole pipeline, upsampling this feature to an image is redundant.
f_0 = F_0(z, s_{ca}),

where f_0 \in \mathbb{R}^{M_0 \times N_0}, and z is a noise vector sampled from the standard normal distribution and concatenated with s_{ca}. Each column of f_i is a feature vector of a sub-region of the image, which can also be treated as a pseudo-token. To generate fine-grained details at different sub-regions of the image by paying attention to the relevant words in the source language, we use the image vector of each sub-region to query the word vectors via an attention strategy. F^{attn} is an attentive function that obtains the word-context feature; then we have:

f_1 = F_1(f_0, F^{attn}(f_0, U_0 w)),    (7)

Each word feature w_l is first converted into the common semantic space of the visual features by U_0, a perceptron layer, and then multiplied with f_0 to acquire the attention scores. f_1 is the output of the imagination network, capturing multiple levels (word level and sentence level) of semantic meaning. f_1 is denoted as the blue block "generated visual feature" in Figure 2. It is utilized directly for target-language generation and is also passed to the discriminator for adversarial training. Note that for the whole pipeline, upsampling f_1 to an image is redundant.
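The sub-region-to-word attention can be sketched as follows; shapes and random features are toy stand-ins, and the projection U_0 is a single linear map rather than a learned perceptron:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
d, L = 8, 5        # word-embedding size and sentence length
M, N = 8, 16       # visual feature size and number of image sub-regions

w = rng.normal(size=(d, L))    # word embeddings from the encoder
f0 = rng.normal(size=(M, N))   # one column per sub-region (pseudo-token)
U0 = rng.normal(size=(M, d))   # perceptron into the visual semantic space

# Each sub-region queries the words; its word-context vector is the
# attention-weighted sum of the projected word vectors.
w_vis = U0 @ w                            # (M, L): words in visual space
alpha = softmax(f0.T @ w_vis, axis=-1)    # (N, L): region-to-word attention
f_attn = w_vis @ alpha.T                  # (M, N): word-context features
```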
Compared to T2I synthesis works that use cascaded generators and disjoint discriminators (Zhang et al., 2017; Xu et al., 2018; Qiao et al., 2019), we use only one stage to reduce the model size and to make our generated visual feature f_1 focus more on text-image consistency rather than on realism and authenticity.

Image captioning
Image captioning (I2T) can be regarded as the inverse problem of text-to-image generation: generating the description of a given image. If an imagined image is semantically equivalent to the source sentence, then its description should be almost identical to the given text. We therefore leverage image captioning to translate the imagined visual representation back to the source language (Qiao et al., 2019). This symmetric structure makes the imagined visual feature act like a mirror, effectively enhancing its semantic consistency so that it precisely reflects the underlying semantics. Following Qiao et al. (2019), we utilize the widely used encoder-decoder image captioning framework (Vinyals et al., 2015) and fix the parameters of the pre-trained captioning framework while training the other modules of ImagiT end-to-end.
p_t = \mathrm{Decoder}(h_{t-1}), \quad t = 0, 1, ..., L-1,    (9)

L_{I2T} = -\sum_{t=0}^{L-1} \log p_t(T_t),    (10)

where p_t is the predicted probability distribution over the words at the t-th decoding step, T_t is the t-th word of the source sentence, and p_t(T_t) is the T_t-th entry of the probability vector.
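Equation 10 is an ordinary cross-entropy over decoding steps; a toy numeric sketch (random logits standing in for the captioning decoder's outputs):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
V, L = 10, 4                        # toy vocabulary size and sentence length
logits = rng.normal(size=(L, V))    # captioner outputs, one row per step t
T = rng.integers(0, V, size=L)      # ground-truth source token ids T_t

p = softmax(logits, axis=-1)              # p_t for t = 0..L-1
L_I2T = -np.log(p[np.arange(L), T]).sum() # L_I2T = -sum_t log p_t(T_t)
```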

Multimodal aggregation
After obtaining the imagined visual representation, we aggregate the two modalities for the translation decoder. Although vision carries richer information, it also contains irrelevant noise. Compared to encoding and integrating the visual feature directly, a more elegant method is to induce the hidden representation under the guidance of image-aware attention, following the graph perspective of the Transformer (Yao and Wan, 2020): each local spatial region of the image can be considered a pseudo-token and added to the source fully-connected graph. In the multimodal self-attention layer, we add the spatial features of the generated feature map to the source sentence; that is, the attention query vector is the combination of text and visual embeddings, giving \tilde{x} \in \mathbb{R}^{(L+M) \times d}. We then perform image-aware attention, where the key and value vectors come only from the text embeddings:

\mathrm{Att}(\tilde{x}W^Q, xW^K, xW^V) = \mathrm{softmax}\left(\frac{(\tilde{x}W^Q)(xW^K)^\top}{\sqrt{d}}\right) xW^V.
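A minimal sketch of this image-aware attention, assuming toy dimensions and random features in place of the real encoder outputs and generated feature map:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
L, M, d = 5, 16, 8                  # text tokens, visual pseudo-tokens, size
x = rng.normal(size=(L, d))         # text embeddings
v = rng.normal(size=(M, d))         # spatial features of the generated map
W_Q, W_K, W_V = (rng.normal(size=(d, d)) for _ in range(3))

# Queries come from the concatenated text + visual pseudo-tokens,
# while keys and values come from the text embeddings only.
x_tilde = np.vstack([x, v])                           # (L + M, d)
scores = (x_tilde @ W_Q) @ (x @ W_K).T / np.sqrt(d)   # (L + M, L)
out = softmax(scores, axis=-1) @ (x @ W_V)            # (L + M, d)
```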

Objective function
During the translation phase, similar to equation 10, we have:

L_{trans} = -\sum_{t} \log p_t(y_t),

where y_t is the t-th word of the target sentence. To train the whole network end-to-end, we leverage adversarial training to alternately train the generator and the discriminator. Specifically, as shown in Figure 3, the discriminator takes the imagined visual representation, the source language sentence, and the real image as input, and we employ two adversarial losses: a visual realism adversarial loss and a text-image pair-aware adversarial loss.

Figure 3: Training objective. The discriminator takes source language sentences, generated images, and real images as input, then computes two adversarial losses: a realism loss and a text-image paired loss.

L_{I2T} is designed to guarantee semantic consistency, and L_{trans} is the core loss function that translates the integrated embedding into the target language.
L_{G_0} = -\frac{1}{2}\mathbb{E}_{f_1 \sim p_G}[\log D(f_1)] - \frac{1}{2}\mathbb{E}_{f_1 \sim p_G}[\log D(f_1, s)],

where f_1 is the generated visual feature computed by equation 7 from the model distribution p_G, and s is the global sentence vector. The first term distinguishes real from fake, ensuring that the generator generates visually realistic images. The second term guarantees semantic consistency between the input text and the generated image. L_{G_0} jointly approximates the unconditional and conditional distributions. The final objective function of the generator is defined as:

L_G = L_{G_0} + \lambda_1 L_{I2T} + \lambda_2 L_{trans}.    (15)

Accordingly, the discriminator D is trained by minimizing the following loss:

L_D = -\frac{1}{2}\mathbb{E}_{I \sim p_{data}}[\log D(I)] - \frac{1}{2}\mathbb{E}_{f_1 \sim p_G}[\log(1 - D(f_1))] - \frac{1}{2}\mathbb{E}_{I \sim p_{data}}[\log D(I, s)] - \frac{1}{2}\mathbb{E}_{f_1 \sim p_G}[\log(1 - D(f_1, s))],

where I is from the true image distribution p_{data}. The first two terms are the unconditional loss, and the latter two are the conditional loss.
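A trivial sketch of how the generator's terms combine, using the balance weights reported in the experimental settings (the individual loss values here are placeholders):

```python
# Balance weights from the experimental settings.
LAMBDA_1, LAMBDA_2 = 20.0, 40.0

def generator_objective(L_G0, L_I2T, L_trans):
    # Final generator objective: adversarial term plus the weighted
    # captioning-reconstruction and translation losses.
    return L_G0 + LAMBDA_1 * L_I2T + LAMBDA_2 * L_trans
```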

Datasets
We evaluate our proposed ImagiT model on two datasets, Multi30K and Ambiguous COCO. To show its ability to train with external out-of-domain datasets, we adopt MS COCO (Lin et al., 2014) in the analysis section. Multi30K is the largest existing human-labeled collection for MNMT, containing 31K images and consisting of two multilingual expansions of the original Flickr30K (Young et al., 2014) dataset. The first expansion has five English and five German descriptions per image, independent of each other. The second expansion has one of its English descriptions manually translated into German by a professional translator, later expanded to French and Czech in the subsequent shared tasks (Barrault et al., 2018). We only use the second expansion in our experiments, which has 29,000 instances for training, 1,014 for development, and 1,000 for evaluation. We present our results on the English-German (En-De) and English-French (En-Fr) Test2016 and Test2017 sets.
Ambiguous COCO is a small evaluation dataset collected for the WMT2017 multimodal machine translation challenge, containing image descriptions that potentially include ambiguous verbs. It comprises 461 images from MS COCO (Lin et al., 2014), covering 56 ambiguous verbs in total.
MS COCO is the widely used non-parallel text-image paired dataset in T2I and I2T generation. It contains 82,783 training images and 40,504 validation images with 91 object types, and each image has five English descriptions.

Settings
Our baseline is the conventional text-only Transformer (Vaswani et al., 2017). Specifically, each encoder and decoder is a 6-layer stacked Transformer network with eight heads, 512 hidden units, and an inner feed-forward filter size of 2048. Dropout is set to p = 0.1, and we use the Adam optimizer (Kingma and Ba, 2015) to tune the parameters. The learning rate increases linearly during a warmup of 8,000 steps and then decreases with the inverse square root of the step number. We train the model for up to 10,000 steps with an early-stop strategy, using the same setting as Vaswani et al. (2017). We use BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014) to evaluate translation quality.
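The warmup-then-inverse-square-root schedule described above is the standard Transformer schedule, which can be written as:

```python
def transformer_lr(step, d_model=512, warmup=8000):
    # Learning rate rises linearly for `warmup` steps, then decays with
    # the inverse square root of the step number (Vaswani et al., 2017):
    # lr = d_model^-0.5 * min(step^-0.5, step * warmup^-1.5)
    step = max(step, 1)
    return d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```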
For the imagination network, the noise vector's dimension is 100, and the generated visual feature is 128 × 128. The upsampling and residual blocks in the visual feature transformers consist of 3 × 3 stride-1 convolution, batch normalization, and ReLU activation. Training is early-stopped if the dev-set BLEU score does not improve for 10 epochs, since the translation is the core task. The batch size is 64, and the learning rate is initialized to 2e-4 and decayed to half of its previous value every 100 epochs. A similar learning schedule is adopted in Zhang et al. (2017). The margin size γ is set to 0.1, and the balance weights are λ_1 = 20 and λ_2 = 40.

Results

Table 1 illustrates the results for the En-De Test2016, En-De Test2017, En-Fr Test2016, and En-Fr Test2017 tasks. Our text-only Transformer baseline (Vaswani et al., 2017) achieves results similar to most prior MNMT works, which is consistent with previous findings (Caglayan et al., 2019) that the textual modality alone is sufficient to translate the Multi30K dataset. This finding helps explain why it is already difficult for an MNMT model to ground the visual modality even with annotated images present. Nevertheless, our ImagiT gains improvements over the text-only Transformer baseline on all four evaluation sets, demonstrating that our model can effectively embed visual semantics during training and guide the translation through imagination when annotated images are absent at inference time. We attribute much of the performance improvement to ImagiT's ability to capture the interaction between text and image, generate semantically consistent visual representations, and incorporate information from the visual modality properly.
We also observe that our approach surpasses most MNMT systems by a noticeable margin in BLEU and METEOR on all four evaluation sets. ImagiT is also competitive with ImagiT + ground truth, a variant whose translation decoder takes ground-truth visual representations instead of imagined ones and can be regarded as the upper bound of ImagiT; this attests to ImagiT's imaginative ability. Table 2 shows results for En-De and En-Fr on Ambiguous COCO. Ambiguous COCO, which was purposely curated so that its verbs have ambiguous meanings, demands a larger visual contribution for guiding the translation and selecting correct words. ImagiT benefits from visual imagination, substantially outperforms previous work on Ambiguous COCO, and even matches the performance of ImagiT + ground truth (45.3 BLEU).

Ablation studies
The hyper-parameter λ_1 in equation 15 is important. When λ_1 = 0, there is no image captioning component, and the BLEU score drops from 38.5 to 37.9, though this variant still outperforms the Transformer baseline. This indicates the effectiveness of the image captioning module, which prevents visual-textual mismatches and thus helps the generator achieve better performance. When λ_1 increases from 5 to 20, BLEU and METEOR increase accordingly. When λ_1 is set equal to λ_2, however, the BLEU score falls to 38.3. This is reasonable because λ_2 L_trans corresponds to the main task of the whole model.
Evaluation metric   BLEU   METEOR
ImagiT, λ_1 = 0     37.9   55.3
ImagiT, λ_1 = 5     38.2   55.5
ImagiT, λ_1 = 10    38.4   55.7
ImagiT, λ_1 = 20    38.5   55.7
ImagiT, λ_1 = 40    38.3   55.6

Since the proposed model does not require images as input, one may ask how it uses visual information and where that information comes from. We claim that ImagiT has already been embedded with visual semantics during the training phase, and in this section we validate that ImagiT is able to generate visually grounded representations by performing an image retrieval task. For each source sentence, we generate the intermediate visual representation. We then query the ground-truth image features with each generated representation to find the closest image vectors based on cosine similarity, and measure the R@K score, which computes the recall rate of the matched image among the top K nearest neighbors. Some previous studies on visual-semantic embedding perform sentence-to-image and image-to-sentence retrieval, but their results cannot be directly compared with ours, since we are in practice performing image-to-image retrieval. However, from Table 4, especially for R@10, the results demonstrate that our generated representations capture the shared semantics well and are grounded with visual semantic consistency.
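The R@K evaluation described above can be sketched as follows; the "generated" features here are synthetic perturbations of random ground-truth vectors, purely to illustrate the metric:

```python
import numpy as np

def recall_at_k(generated, real, k):
    # Normalize, take cosine similarities, and count how often the
    # matching ground-truth image (same index) is in the top-k neighbors.
    g = generated / np.linalg.norm(generated, axis=1, keepdims=True)
    r = real / np.linalg.norm(real, axis=1, keepdims=True)
    sims = g @ r.T                             # (n, n) cosine similarities
    topk = np.argsort(-sims, axis=1)[:, :k]    # indices of k nearest images
    hits = sum(i in topk[i] for i in range(len(g)))
    return hits / len(g)

rng = np.random.default_rng(0)
real = rng.normal(size=(50, 32))                     # ground-truth features
generated = real + 0.1 * rng.normal(size=(50, 32))   # toy "imagined" features
```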

How does the imagination help the translation?
We have validated the effectiveness of ImagiT on three widely used MNMT evaluation sets; a natural question is how, and to what extent, the imagination guides the translation. When confronted with complicated sentences and obscure words, humans often resort to mind-picturing and mental visualization to auto-complete the full picture. We therefore hypothesize that imagination can help recover missing and implicit textual information. Inspired by Ive et al. (2019) and Caglayan et al. (2019), we apply degradation strategies to the source-language input and feed it to the trained Transformer baseline, the MNMT baseline, and ImagiT, respectively, to test whether our proposed approach can recover the missing information and obtain better performance. We conduct these analysis experiments on the En-De Test2016 evaluation set.
Color deprivation masks the source tokens that refer to colors, replacing them with a special token [M]. Under this circumstance, a text-only NMT model has to rely on source-side contextual information and biases, while an MNMT model can directly utilize the paired images, which are rich in color-related information. ImagiT, in contrast, must turn to imagination and visualization.
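Color deprivation amounts to a simple token substitution; a sketch with a hypothetical (partial) color list, since the paper does not publish its exact token set:

```python
# Hypothetical color vocabulary; the actual experiment masks all color tokens.
COLORS = {"red", "blue", "green", "yellow", "black", "white",
          "brown", "orange", "pink", "purple", "gray", "grey"}

def deprive_colors(sentence, mask="[M]"):
    # Replace every token that refers to a color with the mask token.
    return " ".join(mask if tok.lower() in COLORS else tok
                    for tok in sentence.split())
```

For example, `deprive_colors("a man in a red shirt walks a white dog")` masks both color words while leaving the rest of the sentence intact.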

Model                   full source   color-deprived
text-only Transformer   37.6          36.3
MNMT                    38.2          37.7
ImagiT                  38.4          37.9

Table 5: BLEU scores with full and color-deprived source sentences.

Table 5 demonstrates the results of color deprivation. We implement a simple Transformer-based MNMT baseline using the multimodal self-attention approach (Yao and Wan, 2020), so the three models in Table 5 can be compared directly. We observe that the BLEU score of text-only NMT decreases by 1.3, whereas the MNMT and ImagiT systems decrease by only 0.5. This result corroborates that ImagiT has an ability to recover color comparable to MNMT's, but achieves the same effect through its own efforts, i.e., imagination. One possible explanation is that ImagiT learns the correlation and co-occurrence of colors and specific entities during the training phase, and can thus infer a color from the context and recover it by visualization.
Visually depictable entity masking. Plummer et al. (2015) extend Flickr30K with coreference chains to tag mentions of visually depictable entities. Similar to color deprivation, we randomly replace 0%, 15%, 30%, 45%, or 60% of the visually depictable entities with the special token [M]. Figure 4 shows the results. We observe a large BLEU drop for the text-only Transformer baseline as the masking proportion increases, while the drops for MNMT and ImagiT are relatively small. This demonstrates that ImagiT can infer and imagine missing entities far more effectively than the text-only Transformer, with capability comparable to the MNMT model.
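Proportional entity masking can be sketched as below; the entity positions are assumed to come from the Flickr30K Entities coreference chains, and the function name is illustrative:

```python
import random

def mask_entities(tokens, entity_positions, proportion, seed=0):
    # Randomly replace the given proportion of visually depictable
    # entity mentions with the special token [M].
    rng = random.Random(seed)
    k = round(len(entity_positions) * proportion)
    masked = list(tokens)
    for i in rng.sample(entity_positions, k):
        masked[i] = "[M]"
    return masked
```

With `proportion=0.0` the sentence is returned unchanged, matching the 0% control condition.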

Will better imagination with external data render better translation?
Our ImagiT model also accepts external parallel text data or non-parallel image captioning data, and we can easily modify the objective function to train with out-of-domain non-triplet data. To train with text-image paired captioning data, we pre-train our imagination module by ignoring the L_trans term; in other words, the T2I synthesis module can be trained solely on the MS COCO dataset. We randomly split MS COCO in half and use COCO_half and COCO_full to pre-train ImagiT, processing MS COCO with the same pipeline as Zhang et al. (2017). As shown in Table 6, our ImagiT model pre-trained with half of MS COCO gains a 0.6 METEOR increase, and the improvement becomes more apparent when training with the whole of MS COCO. We expect that large-scale external data may further improve the performance of ImagiT. We have not yet utilized parallel text data (e.g., WMT); even image-only and monolingual text data could also be adopted to enhance the model's capability, and we leave this for future work.

Conclusion
This work presents a generative imagination-based machine translation model (ImagiT), which effectively captures source semantics and generates semantically consistent visual representations for imagination-guided translation. Without annotated images as input, our model gains significant improvements over text-only NMT baselines and is comparable with state-of-the-art MNMT models. We analyze how imagination elevates machine translation and show further improvement from external image captioning data. Future work may center on introducing more parallel and non-parallel text and image data under different training schemes.

Broader Impact
This work brings together text-to-image synthesis, image captioning, and neural machine translation (NMT) in an adversarial learning setup, advancing traditional NMT to utilize visual information. Multimodal neural machine translation (MNMT), which relies on annotated images, can achieve better performance, but manual image annotation is costly, so MNMT has only been applied to small, specific datasets. This work tries to extend the applicability of MNMT techniques and of visual information in NMT by imagining a semantically equivalent picture and making it appropriately utilized by a visually guided decoder. Compared to previous multimodal machine translation approaches, this technique takes only sentences in the source language, as in the usual machine translation task, making it an appealing method in low-resource scenarios. However, the goal is still far from being achieved, and more efforts from the community are needed to get there. One pitfall of our proposed model is that the trained ImagiT is not applicable to larger-scale text-only NMT tasks such as WMT'14, which mainly concern economics and politics, since those texts are not easy to visualize, containing fewer objects and visually depictable entities. We advise practitioners who apply visual information in large-scale text-to-text translation to be aware of this issue. In addition, the effectiveness of an MNMT model largely depends on the quantity and quality of annotated images; likewise, our model's performance depends on the quality of the generated visual representations. We will need to carefully study how the model balances the contributions of different modalities and responds to ambiguity and bias, in order to avoid undesired behaviors of the learned models.