Ensemble Sequence Level Training for Multimodal MT: OSU-Baidu WMT18 Multimodal Machine Translation System Report

This paper describes the multimodal machine translation systems developed jointly by Oregon State University and Baidu Research for the WMT 2018 Shared Task on multimodal translation. We introduce a simple approach to incorporating image information by feeding image features to the decoder side. We also explore sequence-level training methods, including scheduled sampling and reinforcement learning, which lead to substantial improvements. Our systems ensemble several models with different architectures and training methods and achieve the best performance on three subtasks: En-De and En-Cs in task 1, and (En+De+Fr)-Cs in task 1B.


Introduction
In recent years, neural text generation has attracted much attention due to its impressive generation accuracy and wide applicability. In addition to demonstrating compelling results for machine translation (Sutskever et al., 2014; Bahdanau et al., 2014), by simple adaptation, similar models have also proven successful for summarization (Rush et al., 2015; Nallapati et al., 2016), image or video captioning (Venugopalan et al., 2015; Xu et al., 2015), and multimodal machine translation (Elliott et al., 2017; Caglayan et al., 2017; Calixto and Liu, 2017), which aims to translate a caption from one language to another with the help of the corresponding image. However, conventional neural text generation models suffer from two major drawbacks. First, they are typically trained by predicting the next word given the previous ground-truth word, but at test time the models recurrently feed their own predictions back in. This "exposure bias" (Ranzato et al., 2015) leads to error accumulation during generation at test time. Second, the models are optimized by maximizing the probability of the next ground-truth word, which differs from the desired, non-differentiable evaluation metrics, e.g. BLEU.

* Equal contribution. † Contributions made while at Baidu Research.
Several approaches have been proposed to tackle these problems. Bengio et al. (2015) propose scheduled sampling to alleviate exposure bias by feeding back the model's own predictions with a slowly increasing probability during training. Furthermore, reinforcement learning (Sutton et al., 1998) has proven helpful for directly optimizing the evaluation metrics when training neural text generation models. Ranzato et al. (2015) successfully use the REINFORCE algorithm to directly optimize the evaluation metric on multiple text generation tasks. Rennie et al. (2017) achieve state-of-the-art results on image captioning using REINFORCE with a baseline to reduce training variance.
Moreover, many existing works show that neural text generation models can benefit from model ensembling by simply averaging the outputs of different models (Elliott et al., 2017;Rennie et al., 2017). Garmash and Monz (2016) claim that it is essential to introduce diverse models into the ensemble. To this end, we ensemble models with various architectures and training methods.
This paper describes our participation in the WMT 2018 multimodal tasks. Our submitted systems include a series of models that only consider text information, as well as multimodal models that also use image information to initialize the decoders. We train these models using scheduled sampling and reinforcement learning, and decode the final outputs by ensembling those models. To the best of our knowledge, this is the first multimodal machine translation system to achieve state-of-the-art results using sequence-level learning methods.

Methods
Our model is based on the sequence-to-sequence RNN architecture with attention (Bahdanau et al., 2014). We incorporate image features to initialize the decoder's hidden state, as shown in Figure 1. Originally, this hidden state is initialized with the concatenation of the encoder's last forward and backward hidden states, →h_e and ←h_e respectively. We propose instead to use the sum of the encoder's output and the image features h_img to initialize the decoder. Formally, the initial decoder state h_d is:

h_d = tanh(W_e [→h_e; ←h_e] + W_img h_img + b)     (1)

where W_e and W_img project the encoder states and the image feature vector into the decoder hidden state dimensionality, and b is a bias parameter. This approach has been previously explored by Calixto and Liu (2017). As discussed above, translation systems are traditionally trained using the cross-entropy loss. To overcome the discrepancy between the training and inference distributions, we train our models using scheduled sampling, which mixes the ground truth with model predictions, and further adopt the REINFORCE algorithm with a baseline to directly optimize translation metrics.
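As a concrete sketch, this initialization can be written as a small PyTorch module. The dimensions (500-dimensional encoder states, 2048-dimensional image features) and the tanh nonlinearity are illustrative assumptions, not values fixed by the description above.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions: 500 per encoder direction, 2048-dim image
# features (e.g. a ResNet pooled vector), 500-dim decoder state.
ENC_DIM, IMG_DIM, DEC_DIM = 500, 2048, 500

class ImageInitDecoderState(nn.Module):
    """Builds the initial decoder state from the encoder's final
    forward/backward states plus a projected image feature vector."""

    def __init__(self):
        super().__init__()
        self.W_e = nn.Linear(2 * ENC_DIM, DEC_DIM, bias=False)
        # The Linear bias plays the role of the bias parameter b.
        self.W_img = nn.Linear(IMG_DIM, DEC_DIM, bias=True)

    def forward(self, h_fwd, h_bwd, h_img):
        # Concatenate the last forward and backward encoder states,
        # project both terms into the decoder dimension, and sum.
        h_enc = torch.cat([h_fwd, h_bwd], dim=-1)
        return torch.tanh(self.W_e(h_enc) + self.W_img(h_img))
```

A purely text-based model is recovered by zeroing out the image term, which makes it easy to share code between the NMT and MNMT variants.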

Scheduled Sampling
When predicting a token ŷ_t, scheduled sampling uses the previous model prediction ŷ_{t−1} with probability ε, or the previous ground-truth token y_{t−1} with probability 1 − ε. The model prediction ŷ_{t−1} is obtained by sampling a token according to the probability distribution P(ŷ_{t−1} | h_{t−1}). At the beginning of training, the sampled token can be very random, so ε is set very low initially and increased over time.
One major limitation of scheduled sampling is that at each time step the target sequence can be incorrect, since inputs are randomly selected from the ground-truth data or model predictions regardless of how earlier inputs were chosen (Ranzato et al., 2015). Thus, we use reinforcement learning techniques to further optimize models on translation metrics directly.
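A minimal sketch of one scheduled-sampling step, assuming batched sentence decoding and a softmax output layer (the function and argument names are hypothetical):

```python
import torch

def scheduled_input(ground_truth_prev, logits_prev, epsilon):
    """Choose the decoder input for step t: the model's own sampled
    prediction with probability epsilon, otherwise the ground truth.

    ground_truth_prev: (batch,) LongTensor of gold tokens y_{t-1}
    logits_prev:       (batch, vocab) unnormalized scores at step t-1
    epsilon:           probability of using the model's own prediction
    """
    # Sample a token per example from P(y_{t-1} | h_{t-1}).
    probs = torch.softmax(logits_prev, dim=-1)
    sampled = torch.multinomial(probs, 1).squeeze(1)
    # Per-example coin flip: model prediction vs. ground truth.
    use_model = torch.rand_like(ground_truth_prev, dtype=torch.float) < epsilon
    return torch.where(use_model, sampled, ground_truth_prev)
```

With ε = 0 this reduces to ordinary teacher forcing, which is why the schedule can start at 0 and ramp up smoothly.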

Reinforcement Learning
Following Ranzato et al. (2015) and Rennie et al. (2017), we use REINFORCE with baseline to directly optimize the evaluation metric.
According to the reinforcement learning literature (Sutton et al., 1998), the neural network with parameters θ defines a policy p_θ, whose "action" at each step is the prediction of the next word. After generating the end-of-sequence token (EOS), the model receives a reward r, which can be an evaluation metric, e.g. the BLEU score between the reference and the generated sequence. The goal of training is to minimize the negative expected reward:
L(θ) = −E_{w^s ∼ p_θ}[r(w^s)]     (2)

where w^s = (w^s_1, ..., w^s_T) is a sampled sentence. To compute the gradient ∇_θ L(θ), we use the REINFORCE algorithm, which is based on the observation that the expected gradient of a non-differentiable reward function can be computed as:

∇_θ L(θ) = −E_{w^s ∼ p_θ}[r(w^s) ∇_θ log p_θ(w^s)]     (3)

The policy gradient can be generalized to compute the reward associated with an action relative to a reference reward or baseline b:

∇_θ L(θ) = −E_{w^s ∼ p_θ}[(r(w^s) − b) ∇_θ log p_θ(w^s)]     (4)

The baseline does not change the expected gradient, but importantly, it can reduce the variance of the gradient estimate. We use the baseline introduced in Rennie et al. (2017), obtained by decoding greedily with the current model, as at test time:

b = r(ŵ^s)

where ŵ^s is the sentence generated by greedy decoding. For each training case, we approximate the expected gradient with a single sample w^s ∼ p_θ:

∇_θ L(θ) ≈ −(r(w^s) − b) ∇_θ log p_θ(w^s)     (5)
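The single-sample gradient with the greedy baseline corresponds to the following surrogate loss, sketched here in PyTorch. The function and argument names are hypothetical, and the rewards are assumed to be precomputed sentence-level BLEU scores:

```python
import torch

def self_critical_loss(log_probs_sample, reward_sample, reward_greedy):
    """REINFORCE with a greedy-decoding baseline (Rennie et al., 2017):
    loss ≈ -(r(w^s) - r(ŵ^s)) * log p_θ(w^s), averaged over the batch.

    log_probs_sample: (batch,) sum of token log-probs of each sample w^s
    reward_sample:    (batch,) reward of the sampled sentences
    reward_greedy:    (batch,) reward of the greedy baseline sentences
    """
    advantage = reward_sample - reward_greedy  # baseline b = r(greedy)
    # Detach the advantage: the gradient flows only through log p_θ.
    return -(advantage.detach() * log_probs_sample).mean()
```

Samples that beat the greedy baseline get their probability pushed up, and samples that fall short get pushed down, which is what keeps the variance of the estimate low.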

Ensembling
In our experiments with a relatively small training dataset, the translation quality of models with different initializations can vary notably. To stabilize performance and improve translation quality, we ensemble different models during decoding.
To ensemble, we take the average of all model output distributions at each decoding step:

ŷ_t = (1/N) Σ_{i=1}^{N} ŷ_t^i

where ŷ_t^i denotes the output distribution of the i-th of N models at position t. Similar to Zhou et al. (2017), we can ensemble models trained with different architectures and training algorithms.
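The per-step averaging above can be sketched in a few lines (the function name is hypothetical):

```python
import torch

def ensemble_step(prob_dists):
    """Average the per-step output distributions of N models.

    prob_dists: list of (batch, vocab) probability tensors ŷ_t^i,
                one per model, all over the same shared vocabulary.
    Returns the ensembled distribution ŷ_t used for decoding.
    """
    return torch.stack(prob_dists, dim=0).mean(dim=0)
```

Because the models only interact through their output distributions, this works for any mix of architectures and training methods, as long as they share a vocabulary.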

Datasets
We perform experiments on the Flickr30K dataset (Elliott et al., 2016) provided by the WMT organizers. Task 1 (Multimodal Machine Translation) consists of translating an image with an English caption into German, French or Czech. Task 1B (Multisource Multimodal Machine Translation) involves translating parallel English, German and French sentences, with an accompanying image, into Czech.
As shown in Table 1, both tasks have 2900 training and 1014 validation examples. For preprocessing, we convert all of the sentences to lower case, normalize the punctuation, and tokenize. We employ byte-pair encoding (BPE) (Sennrich et al., 2015) on the whole training data including the four languages and reduce the source and target language vocabulary sizes to 20k in total.

Training details
The image features are extracted using ResNet-101 (He et al., 2016) pre-trained on the ImageNet dataset. Our implementation is adapted from the PyTorch-based OpenNMT (Klein et al., 2017). We use a two-layer bi-LSTM (Sutskever et al., 2014) as the encoder and share the vocabulary between the encoder and the decoder. We adopt a length reward on the En-Cs task to find the optimal sentence length. We use a batch size of 50, SGD optimization, a dropout rate of 0.1, and a learning rate of 1.0. Our word embeddings are randomly initialized with dimension 500. To train the model with scheduled sampling, we first set the probability ε to 0 and then increase it by 0.05 every 5 epochs until it reaches 0.25. The reinforcement learning models are trained starting from the models pre-trained with scheduled sampling.

Table 2: BLEU scores of different approaches on the validation set. Details of the ensemble models are described in Table 9.
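The sampling-probability schedule described above (start at 0, increase by 0.05 every 5 epochs, capped at 0.25) can be written as a small helper; the function name is hypothetical:

```python
def sampling_probability(epoch):
    """Scheduled-sampling probability ε for a given training epoch:
    0 for the first 5 epochs, then +0.05 every 5 epochs, capped at 0.25."""
    return min(0.25, 0.05 * (epoch // 5))
```

Capping ε keeps most decoder inputs anchored to the ground truth, which matters on a training set of only 29,000 sentences.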

Results for task 1
To study the performance of the different approaches, we conduct an ablation study. Table 2 shows the BLEU scores on the validation set for different models and training methods. Generally, models with scheduled sampling perform better than the baseline models, and reinforcement learning further improves performance. Ensemble models lead to substantial improvements over the best single model, by about +2 to +3 BLEU. However, with image information included, MNMT performs better than NMT only on the En-Fr task with scheduled sampling. Tables 4, 5 and 6 show the test set performance of our models on the En-De, En-Fr and En-Cs subtasks alongside other top-performing models, ranked by BLEU. Our submitted systems rank first in BLEU and TER on the En-De and En-Cs subtasks.

Table 9: Rank of our models. † represents the total number of models. ‡ represents the total number of teams.

Results for task 1B
Following the multi-reference training approach discussed in Zheng et al. (2018), we randomly shuffle the source data across all languages and train a conventional attention-based neural machine translation model in every epoch. Since we apply BPE to the whole training data, we can share the vocabulary of the different languages during training. The results show that models trained only on English-to-Czech data perform much better than the rest. Table 8 shows the results on the test set. The submitted systems are the same as those used for the En-Cs subtask of task 1. Although we only consider the English source during training, our systems still rank first among all submissions.

Conclusions
We describe our systems submitted to the WMT 2018 shared multimodal translation tasks. We use sequence-level training methods that lead to substantial improvements over strong baselines. Our ensembled models achieve the best BLEU scores on three subtasks: En-De and En-Cs of task 1, and (En+De+Fr)-Cs of task 1B.