OSU Multimodal Machine Translation System Report

This paper describes Oregon State University's submissions to the WMT'17 shared task "multimodal translation task I". In this task, all sentence pairs are image captions in different languages. The key difference between this task and conventional machine translation is that a corresponding image is available as additional information for each sentence pair. We introduce a simple but effective system that takes the image shared between the languages and feeds it into both the encoder and the decoder. We report our system's performance for English-French and English-German on the Flickr30K (in-domain) and MSCOCO (out-of-domain) datasets. Our system achieves the best TER for English-German on the MSCOCO dataset.


Introduction
Natural language generation (NLG) is one of the most important tasks in natural language processing (NLP). It has many interesting applications, such as machine translation, image captioning, and question answering. In recent years, approaches based on Recurrent Neural Networks (RNNs) have shown promising performance in generating more fluent and meaningful sentences than conventional models such as rule-based models (Mirkovic et al., 2011), corpus-based n-gram models (Wen et al., 2015), and trainable generators (Stent et al., 2004).
More recently, attention-based encoder-decoder models (Bahdanau et al., 2014) have been proposed to give the decoder more accurate alignments, so that it generates more relevant words. Attention mechanisms quickly advanced the state of the art on a variety of NLG tasks, such as machine translation (Luong et al., 2015), image captioning (Xu et al., 2015; Yang et al., 2016), and text summarization (Rush et al., 2015; Nallapati et al., 2016).
However, for multimodal translation (Elliott et al., 2015), where we translate a caption from one language into another given a corresponding image, we need to design a new model since the decoder needs to consider both language and images at the same time.
This paper describes our participation in the WMT 2017 multimodal task 1. Our model feeds the image information to both the encoder and the decoder, grounding their hidden representations in the context of the same image during training. As a result, at test time the decoder is expected to generate more relevant words given the context of both the source sentence and the image.

Model Description
In a neural machine translation model, the encoder maps the sequence of word embeddings from the source side into another representation of the entire sequence using recurrent networks. Then, in the second stage, the decoder generates one word at a time, considering both global information (the sentence representation) and local information (the weighted context) from the source side. For simplicity, our proposed model is based on the attention-based encoder-decoder framework of Luong et al. (2015), referred to as "global attention".
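To make the "weighted context" concrete, the following is a minimal sketch of global attention with a dot-product score, one of the scoring variants commonly used with this framework. It is an illustration under stated assumptions rather than the submitted system's code: the function names and the toy vectors are hypothetical, and the text does not say which score function was used.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def global_attention(decoder_state, encoder_states):
    """Attend over ALL source positions (hence "global").

    Assumption: dot-product scoring between the current decoder state
    and each encoder state; the paper does not state the score function.
    """
    scores = [
        sum(d * e for d, e in zip(decoder_state, h_s))
        for h_s in encoder_states
    ]
    weights = softmax(scores)
    dim = len(decoder_state)
    # The context vector is the attention-weighted sum of encoder states.
    context = [
        sum(w * h_s[i] for w, h_s in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# Toy example: three source positions with 2-dimensional hidden states.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
decoder_state = [2.0, 0.0]
weights, context = global_attention(decoder_state, encoder_states)
```

In this toy example the first and third source positions receive equal weight, and more weight than the second, because their dot products with the decoder state are larger.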
On the other hand, in early neural caption generation models (Vinyals et al., 2015), a convolutional neural network (CNN) generates the image features, which are fed directly into the decoder to generate the description.
The first stage of both tasks maps its input, temporal information in translation and spatial information in captioning, into a fixed-dimensional vector, which makes it feasible to use both kinds of information at the same time. Fig. 1 shows the basic idea of our proposed model (OSU1). The red character I represents the image feature generated by the CNN. In our case, we directly use the image features provided by WMT, which are generated by residual networks (He et al., 2016).
The encoder (blue boxes in Fig. 1) takes the image feature as the initialization for generating each hidden representation. This process is very similar to neural caption generation (Vinyals et al., 2015), which grounds each word's hidden representation in the context given by the image. On the decoder side (green boxes in Fig. 1), we not only let each decoded word align to the source words through global attention, but also feed the image feature to the decoder as its initialization.
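The initialization described above can be sketched as follows. This is only an illustration under several assumptions that do not come from the paper: a plain tanh RNN stands in for the LSTM layers, the image feature is mapped to the hidden size by a separate linear projection for each side, and all names, dimensions, and random weights are hypothetical.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_state_from_image(w_img, image_feature):
    """Assumed: project the image feature to the hidden size, then squash."""
    return [math.tanh(v) for v in matvec(w_img, image_feature)]


def rnn_step(w_x, w_h, x, h):
    """One step of a simple tanh RNN (a stand-in for an LSTM cell)."""
    return [math.tanh(a + b) for a, b in zip(matvec(w_x, x), matvec(w_h, h))]


def encode(w_x, w_h, h0, embeddings):
    """Run the RNN over the source embeddings, starting from state h0."""
    states, h = [], h0
    for x in embeddings:
        h = rnn_step(w_x, w_h, x, h)
        states.append(h)
    return states


rng = random.Random(0)
IMG_DIM, EMB_DIM, HID_DIM = 8, 4, 5  # toy sizes, not the paper's
w_img_enc = rand_matrix(HID_DIM, IMG_DIM, rng)
w_img_dec = rand_matrix(HID_DIM, IMG_DIM, rng)
w_x = rand_matrix(HID_DIM, EMB_DIM, rng)
w_h = rand_matrix(HID_DIM, HID_DIM, rng)

image_feature = [rng.uniform(0.0, 1.0) for _ in range(IMG_DIM)]
source = [[rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)] for _ in range(3)]

# Encoder: the image sets the initial state, so every later hidden
# state depends on it through the recurrence.
enc_h0 = init_state_from_image(w_img_enc, image_feature)
enc_states = encode(w_x, w_h, enc_h0, source)

# Decoder: the same image feature also initializes the decoder state.
dec_h0 = init_state_from_image(w_img_dec, image_feature)
```

Because the recurrence carries the initial state forward, changing the image changes every encoder state, which is one way to read "grounding" each word's representation in the image.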

Datasets
In our experiments, we use two datasets, Flickr30K (Elliott et al., 2016) and MSCOCO (Lin et al., 2014), both provided by the WMT organization. Both datasets consist of triples that contain an English source sentence, its German and French human translations, and the corresponding image. The system is trained only on Flickr30K but is tested on both Flickr30K and MSCOCO. MSCOCO is considered out-of-domain (OOD) testing, while Flickr30K is considered in-domain testing. The dataset statistics are shown in Table 1.

Table 1: Dataset statistics.

Datasets     Train     Dev      Test     OOD?
Flickr30K    29,000    1,014    1,000    No
MSCOCO       -         -        461      Yes

Training details
For preprocessing, we convert all sentences to lower case, normalize the punctuation, and tokenize them. For simplicity, our vocabulary keeps all words that appear in the training set. For the image representation, we use the ResNet (He et al., 2016) image features provided by the WMT organization. In our experiments, we only use the average-pooled features.
Our implementation is adapted from the PyTorch-based OpenNMT (Klein et al., 2017). We use a two-layer bidirectional LSTM (Sutskever et al., 2014) on the source side as the encoder. Our batch size is 64, and we use SGD optimization with a learning rate of 1. The dropout rate is 0.6 for English to German and 0.4 for English to French. These two dropout rates are selected by observing performance on the development set. Our word embeddings are randomly initialized with 500 dimensions. The source-side vocabulary contains 10,214 words, and the target-side vocabulary contains 18,726 words for German and 11,222 for French.

Beam search with length reward
At test time, beam search is widely used to improve output quality by giving the decoder more options for the next word. However, unlike traditional beam search in phrase-based MT, where all hypotheses finish in a known number of steps, neural generation provides no information about the ideal number of decoding steps. This leads to another problem: beam search in neural MT prefers shorter sequences, because candidates are compared by probability-based scores and each additional word can only lower a hypothesis's probability. In this paper, we use Optimal Beam Search (OBS; Huang et al., 2017) at decoding time. OBS uses a bounded length reward mechanism, which allows a modified version of the beam search algorithm to remain optimal. Figure 2 and Figure 3 show the BLEU score and length ratio with different rewards for different beam sizes. We choose a beam size of 5 and a reward of 0.1 during decoding.
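The effect of a bounded length reward on hypothesis ranking can be illustrated with a small rescoring sketch. It assumes one common form of the reward, a fixed bonus per generated word that stops accumulating at a length bound; the bound `max_rewarded_len`, the function names, and the toy log-probabilities are hypothetical, and the sketch does not reproduce the stopping criterion that makes OBS optimal.

```python
def rewarded_score(log_prob, length, reward, max_rewarded_len):
    """Model log-probability plus a per-word reward, capped at a bound.

    Assumed form: score = log_prob + reward * min(length, max_rewarded_len).
    """
    return log_prob + reward * min(length, max_rewarded_len)


def pick_best(hypotheses, reward, max_rewarded_len):
    """Return the (tokens, log_prob) pair with the highest rewarded score."""
    return max(
        hypotheses,
        key=lambda h: rewarded_score(h[1], len(h[0]), reward, max_rewarded_len),
    )


# Two finished hypotheses with made-up log-probabilities: the longer one
# is slightly less probable, as longer outputs usually are.
hypotheses = [
    (["eine", "katze", "."], -2.0),
    (["eine", "schlafende", "katze", "."], -2.05),
]

no_reward = pick_best(hypotheses, reward=0.0, max_rewarded_len=10)
with_reward = pick_best(hypotheses, reward=0.1, max_rewarded_len=10)
tight_bound = pick_best(hypotheses, reward=0.1, max_rewarded_len=3)
```

Without a reward the shorter hypothesis wins on raw log-probability. With a reward of 0.1 the longer hypothesis wins, unless the bound is so tight that extra words beyond it earn nothing, which is what keeps the reward from favoring arbitrarily long outputs.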

Results
The WMT organization provides three evaluation metrics: BLEU (Papineni et al., 2002), METEOR (Lavie and Denkowski, 2009), and TER (Snover et al., 2006). Tables 2 to 5 summarize our performance together with the corresponding rank among all systems. We only show a few top-performing systems in the tables for comparison. OSU1 is our proposed model, and OSU2 is our baseline neural machine translation system without any image information. For English to German on the MSCOCO dataset (Table 3), which is the hardest of the tasks because it combines the English-German direction with an OOD dataset, we achieve the best TER score among all systems.

The results tables show that image information hurts performance in some cases. For a more detailed analysis, we show some test examples for English to German translation on the MSCOCO dataset. Fig. 4 shows two examples in which the NMT baseline performs better than the OSU1 model. In the first example, OSU1 generates several objects that do not appear in the given image, such as a knife; the image feature might not represent the image accurately. In the second example, the OSU1 model ignores the object "box" in the image. Fig. 5 shows two examples in which the image feature helps OSU1 generate better results. In the first example, the image feature successfully detects the object "drink", while the baseline completely neglects it. In the second example, the image feature even helps the model figure out that the action of the cat is "sleeping".

Conclusion
We describe our system submission to the WMT'17 shared task "multimodal translation task I". The results for English-German and English-French on the Flickr30K and MSCOCO datasets are reported in this paper. Our proposed model is simple but effective, and we achieve the best performance in TER for English-German on the MSCOCO dataset.

Figure 5 (recovered example text; the first example is incomplete):

[...] zwei erwachsene und ein erwachsener befinden sich auf dem rechteckigen tisch .
Reference: auf dem transparenten tisch stehen zwei speisen und ein getränk .

Input: a camera set up in front of a sleeping cat .
OSU1: eine kameracrew vor einer schlafenden katze .
OSU2: eine kamera vor einer blonden katze .
Reference: eine kamera , die vor einer schlafenden katze aufgebaut ist