LIUM-CVC Submissions for WMT18 Multimodal Translation Task

This paper describes the multimodal Neural Machine Translation systems developed by LIUM and CVC for the WMT18 Shared Task on Multimodal Translation. This year we propose several modifications to our previous multimodal attention architecture in order to better integrate convolutional features and refine them using encoder-side information. According to the automatic evaluation metric METEOR, our final constrained submissions ranked first for English-French and second for English-German among the constrained submissions.


Introduction
In this paper, we present the neural machine translation (NMT) and multimodal NMT (MMT) systems developed by LIUM and CVC for the third edition of the shared task. Several lines of work have been pursued since the introduction of the shared task on MMT in 2016. The majority of last year's submissions, including ours (Caglayan et al., 2017a), were based on the integration of global visual features into various parts of the NMT architecture. Apart from these, hierarchical multimodal attention (Helcl and Libovický, 2017) and multi-task learning (Elliott and Kádár, 2017) were also explored by the participants.
This year we decided to revisit multimodal attention (Caglayan et al., 2016), since our previous qualitative analysis of the visual attention was not satisfying. In order to improve multimodal attention both qualitatively and quantitatively, we experiment with several refinements: first, we use different input image sizes prior to feature extraction, and second, we normalize the final convolutional feature maps, in both cases to assess the impact on the final MMT performance. In terms of architecture, we propose to refine the visual features by learning an encoder-guided early spatial attention. Overall, we find that normalizing the feature maps is crucial for multimodal attention to reach a performance comparable to monomodal NMT, while the impact of the input image size remains unclear. Finally, with the help of the refined attention, we obtain modest improvements in terms of BLEU (Papineni et al., 2002) and METEOR (Lavie and Agarwal, 2007).
The paper is organized as follows: data preprocessing is described in section 2, and model details and training hyperparameters in section 3. The results based on automatic evaluation metrics are reported in section 4. Finally, the paper ends with a conclusion in section 5.

Data
We use the Multi30k dataset provided by the organizers, which contains 29000, 1014, 1000 and 1000 English→{German,French} sentence pairs for the train, dev, test2016 and test2017 splits respectively. A new training split of 30014 pairs is formed by concatenating the train and dev splits. Early stopping is performed based on METEOR computed over the test2016 set, and the final model selection is done over test2017.
Punctuation normalization, lowercasing and aggressive hyphen splitting were applied to all sentences prior to training. A Byte Pair Encoding (BPE) model (Sennrich et al., 2016) with 10K merge operations is jointly learned on English-German and English-French resulting in vocabularies of 5189-7090 and 5830-6608 subwords respectively.
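As an illustration of two of the preprocessing steps above, the following minimal sketch lowercases a sentence and splits intra-word hyphens. It assumes the Moses-style convention of marking split hyphens as `@-@`, which the text does not state explicitly. The function name `preprocess` is hypothetical, and punctuation normalization and BPE are left out.

```python
import re


def preprocess(sentence: str) -> str:
    """Lowercase a sentence and split intra-word hyphens (assumed '@-@' marker)."""
    lowered = sentence.lower()
    # The lookarounds do not consume the neighbouring characters, so chains
    # such as "state-of-the-art" are split at every hyphen.
    return re.sub(r"(?<=\w)-(?=\w)", " @-@ ", lowered)


print(preprocess("A Well-Known state-of-the-art system"))
```

Splitting hyphenated compounds before BPE lets the subword model see the individual parts as separate tokens.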

Visual Features
Figure 1: Filtered attention (FA): the convolutional feature maps are dynamically masked using an attention conditioned on the source sentence representation.
Since Multi30k images involve much more complex region-level relationships and scene compositions than the ImageNet (Russakovsky et al., 2015) object classification task, we explore different input image sizes to quantify their impact in the context of MMT, since rescaling the input image has a direct effect on the size of the receptive fields of the CNN. After normalizing the images using the ImageNet mean and standard deviation, we resize and crop the images to 224x224 and 448x448. Features are then extracted from the final convolutional layer (res5c_relu) of a pretrained ResNet50 CNN. This yields feature maps $V \in \mathbb{R}^{2048 \times w \times w}$ where the spatial dimensionality $w$ is 7 or 14.
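The relation between input size and spatial dimensionality can be checked with a small sketch. It relies on the fact that the standard ResNet-50 downsamples its input by a total factor of 32 before the final convolutional block. The helper name `feature_map_width` is hypothetical and only covers square inputs whose side is divisible by 32.

```python
def feature_map_width(input_size: int, total_stride: int = 32) -> int:
    """Spatial width w of the final ResNet-50 feature map for a square input."""
    if input_size % total_stride != 0:
        raise ValueError("this sketch only covers sizes divisible by the stride")
    return input_size // total_stride


for size in (224, 448):
    w = feature_map_width(size)
    # Shape of V for one image: (channels, w, w)
    print(size, (2048, w, w))
```

Doubling the input side therefore quadruples the number of spatial positions (49 versus 196) that the attention mechanism has to cover.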

Feature Normalization
We conjecture that transferring ReLU features from a CNN into a model that only makes use of bounded non-linearities like sigmoid and tanh can saturate the non-linear neurons in the very early stages of training if their weights are not carefully initialized. Instead of tuning the initialization, we experiment with $L_2$ normalization over the channel dimension so that each feature vector ($\in \mathbb{R}^{2048}$) has an $L_2$ norm of 1.
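A minimal NumPy sketch of this channel-wise normalization follows. The function name `l2_normalize_channels`, the `eps` guard against all-zero vectors and the random stand-in features are assumptions made for illustration, not details taken from our implementation.

```python
import numpy as np


def l2_normalize_channels(V: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Scale the C-dimensional vector at each spatial position to unit L2 norm.

    V has shape (C, w, w); the norm is taken over the channel axis (axis 0).
    """
    norms = np.sqrt((V ** 2).sum(axis=0, keepdims=True))  # shape (1, w, w)
    return V / np.maximum(norms, eps)


rng = np.random.default_rng(0)
# Random non-negative values standing in for ReLU feature maps.
V = np.maximum(rng.normal(size=(2048, 7, 7)), 0.0)
V_norm = l2_normalize_channels(V)
print(np.linalg.norm(V_norm, axis=0).round(6))
```

After normalization every spatial position carries a vector of the same magnitude, so the scale of the inputs reaching the bounded non-linearities no longer depends on the raw ReLU activations.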

Models
In this section we describe our baseline NMT and multimodal NMT systems. All models use 128-dimensional embeddings and GRU layers with 256 hidden units. Dropout (Srivastava et al., 2014) is applied over source embeddings $x_s$, encoder states $H^{enc}$ and pre-softmax activations $o_t$. We also apply $L_2$ regularization with a factor of 1e−5 on all parameters except biases. The parameters are initialized using the method proposed by He et al. (2015) and optimized with Adam (Kingma and Ba, 2014). The total gradient norm is clipped to 1 (Pascanu et al., 2013). We use batches of size 64 and an initial learning rate of 4e−4. All systems are implemented using the PyTorch version of nmtpy (Caglayan et al., 2017b); torchvision is used for feature extraction.
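Clipping the total gradient norm can be sketched as follows. This is a framework-independent NumPy illustration of the general technique, comparable to what utilities such as `torch.nn.utils.clip_grad_norm_` provide. The function name `clip_total_norm` and the `eps` term are assumptions.

```python
import numpy as np


def clip_total_norm(grads, max_norm=1.0, eps=1e-6):
    """Rescale all gradients jointly so that their global L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    # Gradients are only ever shrunk, never enlarged.
    scale = min(1.0, max_norm / (total + eps))
    return [g * scale for g in grads], total


grads = [np.array([3.0, 0.0]), np.array([[4.0]])]
clipped, norm_before = clip_total_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float((g ** 2).sum()) for g in clipped))
print(norm_before, norm_after)
```

Because a single scale factor is applied to every parameter's gradient, the direction of the overall update is preserved and only its magnitude is bounded.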

Baseline NMT
Let us denote the lengths of the source sentence $\{x_1, \dots, x_S\}$ and the target sentence $\{y_1, \dots, y_T\}$ by $S$ and $T$ respectively. The source sentence is first encoded with a 2-layer bidirectional GRU to obtain the set of hidden states $H^{enc}$.
The decoder is a 2-layer conditional GRU (CGRU) (Sennrich et al., 2017) with tied embeddings (Press and Wolf, 2016). CGRU is a stacked 2-layer recurrence block with the attention mechanism in the middle. We use feed-forward attention, which encapsulates a learnable layer. The first decoder (which is initialized with a zero vector) receives the previous target embeddings as inputs (equation 1). At each timestep of the decoding stage, the attention mechanism produces a context vector $c^{txt}_t$ (equation 2) that becomes the input to the second GRU (equation 3). Finally, the probability over the target vocabulary is conditioned on a transformation of the final hidden state $h^{dec_2}_t$ (equations 4 and 5).
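The textual attention step can be illustrated with a minimal NumPy sketch of generic feed-forward (MLP) attention: each encoder state is scored against the current decoder state through a small learnable layer, the scores are normalized with a softmax, and the context vector is the weighted sum of encoder states. All names (`feed_forward_attention`, `W_h`, `W_q`, `v`) and dimensions are assumptions for illustration. The encoder states are taken to be 512-dimensional on the assumption that forward and backward 256-unit GRU states are concatenated.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def feed_forward_attention(H, query, W_h, W_q, v):
    """Generic feed-forward attention.

    H: (S, d_enc) encoder states, query: (d_dec,) decoder state,
    W_h: (d_enc, d_att), W_q: (d_dec, d_att), v: (d_att,).
    Returns the context vector (d_enc,) and the attention weights (S,).
    """
    scores = np.tanh(H @ W_h + query @ W_q) @ v  # one score per source position
    alpha = softmax(scores)
    context = alpha @ H  # weighted sum of encoder states
    return context, alpha


rng = np.random.default_rng(0)
S, d_enc, d_dec, d_att = 5, 512, 256, 256  # illustrative sizes only
H = rng.normal(size=(S, d_enc))
query = rng.normal(size=d_dec)
W_h = rng.normal(size=(d_enc, d_att)) * 0.01
W_q = rng.normal(size=(d_dec, d_att)) * 0.01
v = rng.normal(size=d_att)
context, alpha = feed_forward_attention(H, query, W_h, W_q, v)
print(context.shape, alpha.sum())
```

In the CGRU arrangement described above, `query` would correspond to the output of the first decoder GRU and `context` to the input of the second one.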

Multimodal Attention (MA)
Our baseline multimodal attention (MA) system (Caglayan et al., 2016) applies a spatial attention mechanism (Xu et al., 2015) over the visual features. At each timestep $t$ of the decoding stage, a multimodal context vector $c_t$ is computed and given as input to the second decoder (equation 3).
Previous analysis showed that the attention over the visual features is inconsistent and weak. We argue that this is because the relevant visual information is diluted and has to compete with the far more relevant source text information.

Filtered Attention (FA)
In order to enhance the visual attention, we propose an extension to the multimodal attention where the objective is to filter the convolutional feature maps using the last hidden state of the source language encoder (Figure 1). We conjecture that a learnable masking operation over the convolutional feature maps can help the decoder-side visual attention mechanism by filtering out regions irrelevant to translation and focusing on the most important part of the visual input. The filtered convolutional feature map is computed by a ConvAtt block, which is inspired by previous work in visual question answering (VQA) (Yang et al., 2016; Kazemi and Elqursh, 2017). It computes a spatial attention distribution $\beta^{pre}$ which we then use to mask the actual convolutional features $V$. In contrast to VQA models, the filtered feature map is not pooled into a single visual embedding; instead, it replaces $V$ in equation 7.
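The filtering idea can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the exact ConvAtt block: the 1x1 convolutions are written as per-position matrix products, the sentence vector is added to every spatial position, and the names (`filtered_attention`, `W_v`, `W_h`, `w_out`) and all dimensions are hypothetical.

```python
import numpy as np


def filtered_attention(V, h_last, W_v, W_h, w_out):
    """Mask feature maps with a spatial distribution conditioned on the source sentence.

    V: (C, w, w) feature maps, h_last: (d,) last encoder hidden state,
    W_v: (C, k) and W_h: (d, k) projections, w_out: (k,) scoring vector.
    Returns the filtered maps (C, w, w) and the spatial distribution (w, w).
    """
    C, w, _ = V.shape
    regions = V.reshape(C, w * w).T                         # (w*w, C)
    hidden = np.maximum(regions @ W_v + h_last @ W_h, 0.0)  # ReLU, (w*w, k)
    scores = hidden @ w_out                                 # (w*w,)
    e = np.exp(scores - scores.max())
    beta = e / e.sum()                                      # softmax over positions
    V_filtered = (regions * beta[:, None]).T.reshape(C, w, w)
    return V_filtered, beta.reshape(w, w)


rng = np.random.default_rng(0)
C, w, d, k = 2048, 14, 512, 64  # illustrative sizes only
V = np.maximum(rng.normal(size=(C, w, w)), 0.0)
h_last = rng.normal(size=d)
W_v = rng.normal(size=(C, k)) * 0.01
W_h = rng.normal(size=(d, k)) * 0.01
w_out = rng.normal(size=k)
V_filtered, beta = filtered_attention(V, h_last, W_v, W_h, w_out)
print(V_filtered.shape, beta.sum())
```

The output keeps the full (C, w, w) layout, so the decoder-side spatial attention can still operate over individual regions after filtering.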

Results
We train each model 4 times using different seeds and report the mean and standard deviation of the final results using multeval (Clark et al., 2011).
Feature Normalization. We can see from Table 1 that without $L_2$ normalization, multimodal attention is not able to reach the performance of baseline NMT. Applying the normalization consistently improves the results for all input sizes by around 2 points in BLEU and METEOR. From now on, we only present systems trained with normalized features.
Table 2: Impact of input image width on the performance of multimodal attention variants.
Image Size. Although the impact of doubling the image width and height at the input seems marginal (Table 2), we switch to 448x448 images to benefit from the slight gains, which are consistent across both attention variants.
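The reporting protocol used throughout this section (mean and standard deviation over 4 seeds) can be sketched with the Python standard library. The scores below are made-up placeholders, not results from our experiments, and the use of the sample standard deviation is an assumption that may differ from what multeval computes.

```python
import statistics


def summarize(scores):
    """Return the mean and sample standard deviation of per-seed scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical scores from 4 runs with different seeds (placeholders only).
hypothetical_scores = [30.0, 31.0, 32.0, 33.0]
mean, std = summarize(hypothetical_scores)
print(f"{mean:.1f} ± {std:.1f}")
```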

Monomodal vs Multimodal Comparison
We first present the mean and standard deviation of BLEU and METEOR over 4 runs on the internal test set test2017 (Table 3). With the help of $L_2$ normalization, the MA system almost reaches the monomodal system but fails to improve over it. In contrast, the filtered attention (FA) mechanism improves over the baseline and produces hypotheses that are statistically different from those of the baseline with p ≤ 0.02. The improvements obtained for the EN→DE language pair are not reflected in the EN→FR performance. One should note that the hyperparameters from the EN→DE task were transferred to EN→FR without any further tuning.
The automatic evaluation of our final submissions (which are ensembles of 4 runs) on the official test set test2018 is presented in Table 5. In addition to our submissions, we also provide the best constrained and unconstrained systems in terms of METEOR. However, it should be noted that the submitted systems will be primarily evaluated using human direct assessment.
On EN→DE, our constrained FA system is comparable to the constrained UMONS submission. On EN→FR, our submission obtained the highest automatic evaluation scores among the constrained submissions and is slightly worse than the unconstrained CUNI system.

Conclusion
The MMT task consists of translating a source sentence into a target language with the help of an image representing the source sentence. The differing levels of relevance of the two input modalities make it a difficult task in which the image should be used sparingly. With the aim of improving the attention over the visual input, we introduced a filtering technique that allows the network to ignore irrelevant parts of the image that should not be considered during decoding. This is done by using an attention-like mechanism between the source sentence and the convolutional feature maps. Results show that this mechanism significantly improves the results for English→German on one of the test sets. In the future, we plan to qualitatively analyze the spatial attention and try to improve it further.