Multimodal Transformer for Multimodal Machine Translation

Multimodal Machine Translation (MMT) aims to introduce information from other modalities, generally static images, to improve translation quality. Previous works propose various incorporation methods, but most of them do not consider the relative importance of multiple modalities. Treating all modalities equally may encode too much useless information from less important modalities. In this paper, we introduce multimodal self-attention in the Transformer to solve the issues above in MMT. The proposed method learns the representation of images based on the text, which avoids encoding irrelevant information in images. Experiments and visualization analysis demonstrate that our model benefits from visual information and substantially outperforms previous works and competitive baselines in terms of various metrics.


Introduction
Multimodal machine translation (MMT) is a novel machine translation (MT) task which aims at designing better translation systems using context from an additional modality, usually images (see Figure 1). It was initially organized as a shared task within the First Conference on Machine Translation (Barrault et al., 2018). Current works focus on the Multi30k dataset, a multilingual extension of the Flickr30k dataset with translations of the English image descriptions into different languages.
Previous works propose various incorporation methods. Some works utilize global image features to initialize the encoder/decoder hidden states of an RNN. Elliott and Kádár (2017) model the source sentence and reconstruct the image representation jointly via multi-task learning. Recently, Ive et al. (2019) propose a translate-and-refine approach using a two-stage decoder based on the Transformer (Vaswani et al., 2017). Calixto et al. (2019) put forward a latent variable model to learn the interaction between visual and textual features. However, in multimodal tasks the different modalities are usually not equally important. For example, in MMT the text is obviously more important than the images. Although an image carries richer information, it also contains more irrelevant content; directly encoding the image features may therefore introduce a lot of noise.
To address the issues above, we propose the multimodal Transformer. The proposed model does not directly encode image features. Instead, the hidden representations of images are induced from the text under the guidance of image-aware attention. Meanwhile, we introduce a better way to incorporate information from another modality based on a graph perspective of the Transformer. Experimental results and visualization show that our model can make good use of visual information and substantially outperforms the current state of the art.

Methodology
Our model is adapted from the Transformer and is also an encoder-decoder architecture, consisting of stacked encoder and decoder layers. The focus of our work is to build a powerful encoder that incorporates the information from another modality. Thus, we first begin with an introduction to the incorporation method, and then detail the multimodal self-attention. The final representations of text and images are fed to the sequence decoder to generate the target text.

Incorporating Method
The method of incorporating information from another modality is based on a graph perspective of the Transformer. The core of the Transformer is self-attention, which employs the multi-head mechanism. Each attention head operates on an input sequence x = (x_1, ..., x_n) of n elements, where x_i ∈ R^d, and computes a new sequence z = (z_1, ..., z_n) of the same length, where z_i ∈ R^d:

    z_i = \sum_{j=1}^{n} \alpha_{ij} (x_j W^V)

where α_ij is a weight coefficient computed by a softmax function:

    \alpha_{ij} = \frac{\exp e_{ij}}{\sum_{k=1}^{n} \exp e_{ik}}, \qquad e_{ij} = \frac{(x_i W^Q)(x_j W^K)^T}{\sqrt{d}}

Thus we can see that each word representation is induced from all the other words. If we consider every word to be a node, then the Transformer can be regarded as a variant of GNN which treats each sentence as a fully-connected graph with words as nodes (Battaglia et al., 2018; Yao et al., 2020). In traditional MT tasks, the source sentence graph only contains nodes with text information. If we want to incorporate information from another modality, we should add nodes carrying that modality's information into the source graph. Therefore, as the words are local semantic representations of the sentence, we extract the spatial features, which are the semantic representations of local spatial regions of the image. We add the spatial features of the image as pseudo-words in the source sentence and feed the result into the multimodal self-attention layer.
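To make the graph view concrete, the standard self-attention head described above can be sketched in a few lines of NumPy. All dimensions, random inputs, and weight matrices below are illustrative placeholders, not values from our experiments:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention_head(x, Wq, Wk, Wv):
    """One standard self-attention head: each output z_i mixes all inputs x_j."""
    d = Wq.shape[1]
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    alpha = softmax(q @ k.T / np.sqrt(d))   # (n, n) weight coefficients alpha_ij
    return alpha @ v                        # z_i = sum_j alpha_ij (x_j W^V)

rng = np.random.default_rng(0)
n, d = 5, 8                                 # toy sentence: 5 words, model dim 8
x_text = rng.normal(size=(n, d))
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
z = self_attention_head(x_text, Wq, Wk, Wv)
print(z.shape)  # (5, 8)
```

Because every position attends to every other, the sentence behaves as a fully-connected graph; adding image pseudo-words is then just a matter of appending extra rows to the input sequence.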

Multimodal Self-attention
As stated before, in MMT the text and images are not equally important. Directly encoding images, which contain a lot of irrelevant content, may introduce noise. Therefore, we propose the multimodal self-attention to encode multimodal information. In multimodal self-attention, the hidden representations of the image are induced from the text under the guidance of image-aware attention, as illustrated in Figure 2.
Formally, we consider two modalities, text and img, with entries denoted by x^text ∈ R^{n×d} and x^img W^img ∈ R^{p×d}, respectively, where W^img is a learned projection of the image features. The projected image features are appended to the words as queries:

    \hat{x} = [x^{text}; x^{img} W^{img}] \in R^{(n+p) \times d}

The output of multimodal self-attention is computed as follows:

    c_i = \sum_{j=1}^{n} \tilde{\alpha}_{ij} (x_j^{text} W^V), \qquad i = 1, ..., n+p

where \tilde{\alpha}_{ij} is a weight coefficient computed by a softmax function:

    \tilde{\alpha}_{ij} = \frac{\exp \tilde{e}_{ij}}{\sum_{k=1}^{n} \exp \tilde{e}_{ik}}, \qquad \tilde{e}_{ij} = \frac{(\hat{x}_i W^Q)(x_j^{text} W^K)^T}{\sqrt{d}}

where c ∈ R^{(n+p)×d} is the hidden representation of the words and the image. At the last layer, c is fed into the sequence decoder to generate the target sequence. We can see that the hidden representations of the image are induced only from the words, but under the guidance of image-aware attention. The extracted spatial features of the image are not directly encoded in the model; instead, they adjust the attention of each word when computing the hidden representations of the image. In each encoder layer we also employ residual connections as well as layer normalization. The decoder follows the standard implementation of the Transformer.
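The computation above can be sketched in NumPy as follows. This is a minimal single-head sketch under the assumptions just stated (queries come from both words and projected image pseudo-words, while keys and values come from the text only); all names and dimensions are illustrative:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multimodal_self_attention(x_text, x_img, Wimg, Wq, Wk, Wv):
    """Queries: text words plus projected image pseudo-words.
    Keys/values: text words only, so every output row, including each
    image row, is a mixture of word representations."""
    d = Wq.shape[1]
    x_hat = np.concatenate([x_text, x_img @ Wimg], axis=0)  # (n+p, d)
    q = x_hat @ Wq                                          # (n+p, d)
    k, v = x_text @ Wk, x_text @ Wv                         # (n, d)
    alpha = softmax(q @ k.T / np.sqrt(d))                   # (n+p, n)
    return alpha @ v                                        # c in R^{(n+p) x d}

rng = np.random.default_rng(1)
n, p, d, d_img = 5, 4, 8, 16          # 5 words, 4 image regions (toy sizes)
x_text = rng.normal(size=(n, d))
x_img = rng.normal(size=(p, d_img))
Wimg = rng.normal(size=(d_img, d))    # learned image projection W^img
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
c = multimodal_self_attention(x_text, x_img, Wimg, Wq, Wk, Wv)
print(c.shape)  # (9, 8): n+p hidden states, image rows induced from text
```

Note that the image features never enter the value path; they only steer the attention weights, which is exactly why irrelevant image content cannot be copied into the hidden states.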

Baselines and Metrics
We compare the performance of our model with two kinds of previous models: (1) sequence-to-sequence models trained only on text data (LSTM, Transformer); (2) previous works trained on both text and image data. We evaluate the translation quality of our model in terms of BLEU (Papineni et al., 2002) and METEOR (Denkowski and Lavie, 2014), which have been used in most previous works.

Datasets
We build and test our model on the Multi30k dataset, which consists of two multilingual expansions of the original Flickr30k dataset, referred to as M30k_T and M30k_C respectively. Multi30k contains 30k images; for each image, M30k_T has one of its English descriptions manually translated into German by a professional translator, while M30k_C has five English descriptions and five German descriptions, with the German descriptions crowdsourced independently of their English versions. The training, validation, and test sets of Multi30k contain 29k, 1,014, and 1k instances respectively. We use M30k_T as the original training data and M30k_C for building additional back-translated training data, following Calixto et al. (2019). We present our experimental results on the English-German (En-De) Test2016 set. We use an LSTM trained on the textual part of the M30k_T dataset (De-En, the original 29k sentences) without images to build a back-translation model (Sennrich et al., 2016), and then apply this model to translate the 145k monolingual German descriptions in M30k_C into English as additional training data. We refer to this part of the data as back-translated data.

Settings
We preprocess the data by tokenizing and lowercasing. Word embeddings are initialized with pretrained 300-dimensional GloVe vectors. We extract spatial image features from the last convolutional layer of ResNet-50. The spatial features are 7 × 7 × 2048-dimensional vectors, which are representations of local spatial regions of the image.
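To illustrate how the 7 × 7 × 2048 feature map becomes a sequence of image pseudo-words, here is a hedged NumPy sketch; the feature values, projection matrix, and target dimension are placeholders standing in for the real ResNet-50 output and the learned W^img:

```python
import numpy as np

rng = np.random.default_rng(2)
feat = rng.normal(size=(7, 7, 2048))    # stand-in for ResNet-50's last conv map
d = 300                                 # illustrative model dimension

x_img = feat.reshape(-1, 2048)          # 49 local regions, each 2048-d
Wimg = rng.normal(size=(2048, d))       # stand-in for the learned projection W^img
pseudo_words = x_img @ Wimg             # (49, d): ready to append to the sentence
print(pseudo_words.shape)               # (49, 300)
```

Each of the 49 rows then plays the role of one pseudo-word node in the source graph described in the methodology.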

Results
The results of all methods are shown in Table 1. Our Transformer baseline achieves results comparable to most previous works. When trained on the original data, our model substantially outperforms the state of the art in terms of BLEU and achieves a competitive METEOR score. Moreover, our model surpasses the text-only baseline by more than 1 BLEU point, which demonstrates that our model benefits considerably from the visual modality.
To further investigate our model's performance with more data, we also train the models with additional back-translated data; the comparison results are shown in the lower part of Table 1. Almost all models improve with the additional training data, but our model obtains the largest improvement, achieving new state-of-the-art results on all metrics. This suggests that our model performs better as the dataset grows.

Visualization Analysis
Figure 3 depicts translations for two cases in the test set; colors highlight improvements. Furthermore, we visualize the contributions of different local regions of the image in different attention heads, which shows that our model can focus on the appropriate regions of the image. For example, our model pays more attention to the building and the person in the first case, and thus understands that the person is working on the building rather than just standing there. In the second case, most attention heads attend to the balance beam and the jean dress of the girl, avoiding errors in the translation.

Ablation Study
To further study the influence of the individual components in our model, we conduct ablation experiments to better understand their relative importance. The results are presented in Table 2. Firstly, we investigate the effect of multimodal self-attention, as shown in the second column (replace with self-attention) of Table 2. If we simply concatenate the word vectors with the image features and then perform standard self-attention, we lose 0.6 BLEU and 0.4 METEOR. Inspired by Elliott (2018), we further examine the utility of the image via adversarial evaluation. When we replace all input images with a blank picture, the performance of the model drops considerably. When we replace all input images with a random image (whose content does not match the description in the sentence pair), the model performs even worse than the text-only model. The image here acts as noise that distracts the translation.

Conclusion
In this paper, we propose the multimodal self-attention to consider the relative importance of different modalities in the MMT task. The hidden representations of the less important modality (image) are induced from the more important modality (text) under the guidance of image-aware attention.
The experiments and visualization show that our model can make good use of multimodal information and get better performance than previous works.
There are various multimodal tasks where multiple modalities have different relative importance. In future work, we would like to investigate the effectiveness of our model in these tasks.