Idiap NMT System for WAT 2019 Multimodal Translation Task

This paper describes the Idiap submission to WAT 2019 for the English-Hindi Multi-Modal Translation Task. We used the state-of-the-art Transformer model and utilized the IITB English-Hindi parallel corpus as an additional data source. Among the different tracks of the multi-modal task, we participated in the “Text-Only” track for the evaluation and challenge test sets. Our submission ranks first in its track among the competitors in terms of both automatic and manual evaluation. Based on automatic scores, our text-only submission also outperforms systems that consider visual information in the “multi-modal translation” task.


Introduction
In recent years, significant research has been done to address problems that require joint modelling of language and vision (Specia et al., 2016). Popular applications involving Natural Language Processing (NLP) and Computer Vision (CV) include image description generation (Bernardi et al., 2016), video captioning (Li et al., 2019), and visual question answering (Antol et al., 2015).
In the past few decades, multi-modality has received critical attention in translation studies, although the benefit of the visual modality in machine translation is still under debate (Caglayan et al., 2019). The main motivation of multi-modal research in machine translation is the intuition that information from other modalities could help to find the correct sense of ambiguous words in the source sentence, which could potentially lead to more accurate translations (Lala and Specia, 2018). Despite the lack of multi-modal datasets, there is a visible interest in using image features even for machine translation of low-resource languages. For instance, Chowdhury et al. (2018) train a multi-modal neural MT system for Hindi→English using synthetic parallel data only.
In this system description paper, we explain how we used additional resources in the text-only track of the WAT 2019 Multi-Modal Translation Task. Section 2 describes the datasets used in our experiment. Section 3 presents the model and experimental setups used in our approach. Section 4 provides the official evaluation results of WAT 2019, Section 5 discusses our observations, and Section 6 concludes.

Dataset
The official training set was provided by the task organizers: Hindi Visual Genome (HVG for short, Parida et al., 2019a,b). The training part consists of 29k English and Hindi short captions of rectangular areas in photos of various scenes and it is complemented by three test sets: development (D-Test), evaluation (E-Test) and challenge test set (C-Test). We did not make any use of the images. Our WAT submissions were for E-Test (denoted "EV" in WAT official tables) and C-Test (denoted "CH" in WAT tables).  Additionally, we used the IITB Corpus (Kunchukuttan et al., 2017) which is supposedly the largest publicly available English-Hindi parallel corpus. This corpus contains 1.49 million parallel segments and it was found very effective for English-Hindi translation (Parida and Bojar, 2018).
The statistics of the datasets are shown in Table 1.

Experiments
We focussed only on the text translation task.

Tokenization and Vocabulary
Subword units were constructed using the word pieces algorithm (Johnson et al., 2017). Tokenization is handled automatically as part of the pre-processing pipeline of word pieces.
We generated the vocabulary of 32k subword types jointly for both the source and target languages. The vocabulary is shared between the encoder and decoder.
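For readers unfamiliar with subword segmentation, the following toy sketch illustrates the closely related byte-pair encoding (BPE) procedure; word pieces select merges by likelihood rather than raw frequency, but the mechanics of iteratively merging frequent symbol pairs are similar. The corpus and the number of merges are hypothetical.

```python
from collections import Counter

def learn_bpe(word_freqs, num_merges):
    """Toy BPE merge learner. `word_freqs` maps word -> frequency;
    each word starts as a sequence of characters plus an end marker."""
    vocab = {tuple(w) + ("</w>",): f for w, f in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        # Count all adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent pair
        merges.append(best)
        # Apply the merge everywhere in the vocabulary.
        new_vocab = {}
        for symbols, freq in vocab.items():
            out, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    out.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    out.append(symbols[i])
                    i += 1
            new_vocab[tuple(out)] = freq
        vocab = new_vocab
    return merges, vocab

# Two merges on a tiny corpus produce "lo" and then "low".
merges, vocab = learn_bpe({"low": 5, "lower": 2, "lowest": 2}, 2)
```

In a real pipeline the merge table is learned on the concatenated source and target sides, which is what makes the resulting vocabulary shareable between encoder and decoder.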

Training
To train the model, we used a single GPU and followed the standard "Noam" learning rate decay (cf. http://opennmt.net/OpenNMT-py/quickstart.html); see Vaswani et al. (2017) or Popel and Bojar (2018) for more details. Our starting learning rate was 0.2 and we used 8000 warm-up steps.
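For concreteness, the "Noam" schedule can be sketched as follows. The model dimension and the way the scale factor enters are illustrative assumptions; the point is the schedule's shape: a linear warm-up followed by an inverse-square-root decay, peaking at the warm-up step.

```python
def noam_lr(step, d_model=512, warmup=8000, scale=0.2):
    """Noam learning-rate schedule (Vaswani et al., 2017).

    Linear warm-up for `warmup` steps, then decay proportional to
    1/sqrt(step). `d_model` and the role of `scale` are illustrative
    assumptions, not the exact configuration of our system.
    """
    step = max(step, 1)
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)
```

The two arguments of `min` are equal exactly at `step == warmup`, which is where the learning rate reaches its maximum.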
We ran only one training run. We concatenated the HVG and IITB training data and shuffled the result at the level of sentences. We let the model train for up to 200K steps, interrupted a few times due to GPU queueing limitations of our cluster. Following the recommendation of Popel and Bojar (2018), we present the full learning curves on D-Test, E-Test and C-Test in Figure 1.
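The data mixing step is simple; a minimal sketch of the sentence-level shuffling (the function name and seed are hypothetical):

```python
import random

def mix_corpora(hvg_pairs, iitb_pairs, seed=1):
    """Concatenate two corpora of (source, target) sentence pairs and
    shuffle them jointly at the sentence level, so that every batch can
    mix short HVG captions with longer IITB sentences."""
    mixed = list(hvg_pairs) + list(iitb_pairs)
    random.Random(seed).shuffle(mixed)
    return mixed
```

Shuffling source and target as pairs (rather than independently) is essential to keep the parallel alignment intact.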
We observed a huge difference between BLEU (Papineni et al., 2002) scores as implemented in the Moses toolkit (Koehn et al., 2007) and the newer implementation in sacreBLEU (Post, 2018). The discrepancy is very likely caused by different tokenization, but the best choice in terms of linguistic plausibility still has to be made. In Figure 1, we show both implementations and see that the Moses implementation gives scores higher by 10 (!) points absolute. More importantly, it is a little less peaked, which we see as evidence of better robustness and thus, hopefully, better linguistic adequacy.
All of the test sets (D-, E- and C-Test) are independent of the training data and the training itself is not affected by them in any way. In other words, they are all interchangeable; only the choice of which particular training step to use must be made on one of them and then evaluated on a different one. At the submission deadline for E-Test, our training had only started, so we submitted the latest result available, namely E-Test translated with the model at 35K training steps. When submitting the translations of C-Test for the WAT official evaluation, we already knew the full training run and selected step 165K, where E-Test reached its maximum score. In other words, the choice of the model for the C-Test was based on E-Test serving as a validation set.
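The checkpoint selection amounts to taking the argmax over validation-set scores; a minimal sketch (the step-to-BLEU mapping below is hypothetical):

```python
def select_checkpoint(valid_bleu):
    """Return the training step with the highest validation BLEU.

    `valid_bleu` maps training step -> BLEU on the held-out set
    (E-Test in our case, when choosing the model for C-Test)."""
    return max(valid_bleu, key=valid_bleu.get)

# Hypothetical scores: the middle checkpoint wins.
best_step = select_checkpoint({10000: 1.0, 20000: 3.5, 30000: 2.0})
```

Crucially, the set used for this argmax must be different from the set on which the final result is reported, otherwise the score is optimistically biased.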

Official Results
We report the official automatic as well as manual evaluation results of our models for the evaluation and challenge test sets in Table 2. All the scores are available on the WAT 2019 website and in the WAT overview paper (Nakazawa et al., 2019).
According to both automatic and manual scores, our submissions were the best in the text-only task (MM**TEXT), see the tables in Nakazawa et al. (2019).
Since the text-only and multi-modal tracks differ only in whether the image is available and the underlying set of sentences is identical, we can also compare our result with the scores of systems participating in the multi-modal track (MM**MM). We show only the best system of the multi-modal track. On both the E-Test and C-Test, our (text-only) candidates scored better in BLEU than the best competitor in the multi-modal track (41.32 vs. 40.55 on E-Test and 30.94 vs. 20.37 on C-Test). Manual judgments also indicate that our translations are better than those of the best multi-modal system, but here the comparison has to be taken with a grain of salt. The root of the trouble is that the manual evaluation for the text-only and multi-modal tracks ran separately. While the underlying method (Direct Assessment, DA, Graham et al., 2013) in principle scores sentences in absolute terms, it has been observed that DA scores from independent runs are not reliably comparable. We indicate this by the additional horizontal lines in Table 2. Figure 2 illustrates our translation output.

Discussion
We did not explore the space of possible configurations much; we just ran the training and observed the development of the learning curve. Our final results are nevertheless good, indicating that reasonably clean data and baseline settings of the Transformer architecture deliver good translations. The specifics of the task have to be taken into account. The "sentences" in Hindi Visual Genome are quite short, only 4.7 Hindi and 4.9 English tokens per sentence on average. This is substantially less than the IITB corpus, where the average number of tokens is 15.8 (Hindi) and 14.7 (English). With IITB mixed into the training data, the model gets a significant advantage, not only because of the better coverage of words and phrases but also due to the length. As observed, e.g., by Popel and Bojar (2018), NMT models struggle to produce outputs longer than the sentences they were trained on. Our situation is the reverse, so our model "operates within its comfort zone".

Comparing the scores of D- and E-Test on the one hand and C-Test on the other, we see that D- and E-Test are much easier for the system. This can be attributed to the distributional properties of D-Test and E-Test being identical to what the model observed for HVG in the training data. According to Parida et al. (2019a), C-Test also comes from the Visual Genome but the sampling is different: each sentence illustrates one of 19 particularly ambiguous words (focus words in the following).
As shown in Figure 2, our system generally has no trouble figuring out the correct sense of the focus words, thanks to the surrounding words in the context. The BLEU scores on C-Test are nevertheless much lower than on E-Test or D-Test. We attribute this primarily to the slight mismatch between the HVG training data and C-Test. As can be confirmed in Table 1, the average sentence length in C-Test is 6.2 (Hindi) and 5.8 (English) tokens, i.e. 0.9–1.5 tokens longer than the training data. Indeed, the model produces shorter outputs than expected, and the BLEU brevity penalty affects C-Test more (BP=0.907) than E-Test (BP=0.974).
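The brevity penalty follows the standard BLEU definition (Papineni et al., 2002): BP = 1 when the candidate corpus is at least as long as the reference, and exp(1 - r/c) otherwise, so a BP of 0.907 corresponds to a candidate roughly 9% shorter than the reference. A sketch with hypothetical lengths:

```python
import math

def brevity_penalty(cand_len, ref_len):
    """BLEU brevity penalty (Papineni et al., 2002): 1.0 when the
    candidate is at least as long as the reference, exp(1 - r/c)
    otherwise. Lengths are corpus-level token counts."""
    if cand_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / cand_len)

# A candidate 10% shorter than the reference is penalized by exp(-0.1).
bp = brevity_penalty(100, 110)
```

Note that BP only ever penalizes; there is no reward for over-long candidates, which are instead punished through lower n-gram precisions.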
In a quick visual inspection of the outputs, we noticed that some rare words were not translated at all, for example, "dugout", "skiing", or "celtic". Most of the non-translated words are not the focus words of the challenge test set but simply random words in the sentences. The focus words that were not translated include "springs", "cross" and some instances of the word "stand". We did not have the human capacity to review the translations of all the focus words, but our general impression is that they were mostly correct. One example, the mistranslation of the (tennis) court, is given at the bottom of Figure 2.
Finally, we would like to return to the issue of BLEU implementation pointed out in Section 3.2. The main message to take from this observation is that many common tools are not really polished and well tested for use on less-researched languages and languages not using the Latin script. No conclusions can thus be drawn by comparing numbers reported across papers. A solid comparison can only be made with the evaluation tool fixed, as is the practice of the WAT shared task.

Conclusion and Future Plans
In this system description paper, we presented our English→Hindi NMT system. We have highlighted the benefits of using additional text-only training data. Our system performed best among the competitors in the submitted track ("text-only") and, according to automatic evaluation, also performs better than systems that did consider the image in the "multi-modal" track. We conclude that for general performance, more parallel data is more important than the visual features available in the image. However, a targeted manual evaluation would be necessary to see whether the translation of the particularly ambiguous words is better when MT systems consider the image.
As the next step, we plan to utilize image features and carry out a comparison study with the current setup. Also, we plan to experiment with the image captioning variant of the task.