Findings of the Second Shared Task on Multimodal Machine Translation and Multilingual Image Description

We present the results from the second shared task on multimodal machine translation and multilingual image description. Nine teams submitted 19 systems to two tasks. The multimodal translation task, in which the source sentence is supplemented by an image, was extended with a new language (French) and two new test sets. The multilingual image description task was changed such that at test time, only the image is given. Compared to last year, multimodal systems improved, but text-only systems remain competitive.


Introduction
The Shared Task on Multimodal Translation and Multilingual Image Description tackles the problem of generating descriptions of images for languages other than English. The vast majority of image description research has focused on English-language description due to the abundance of crowdsourced resources (Bernardi et al., 2016). However, there has been a significant amount of recent work on creating multilingual image description datasets in German (Hitschler et al., 2016; Rajendran et al., 2016), Turkish (Unal et al., 2016), Chinese (Li et al., 2016), Japanese (Miyazaki and Shimizu, 2016; Yoshikawa et al., 2017), and Dutch (van Miltenburg et al., 2017). Progress on this problem will be useful for native-language image search, multilingual e-commerce, and audio-described video for visually impaired viewers.
The first empirical results for multimodal translation showed the potential for visual context to improve translation quality (Elliott et al., 2015; Hitschler et al., 2016). This was quickly followed by a wider range of work in the first shared task at WMT 2016.
The current shared task consists of two subtasks:
• Task 1: Multimodal translation takes an image with a source language description that is then translated into a target language. The training data consists of parallel sentences with images.
• Task 2: Multilingual image description takes an image and generates a description in the target language without additional source language information at test time. The training data, however, consists of images with independent descriptions in both source and target languages.
The translation task has been extended to include a new language, French. This extension means the Multi30K dataset is now triple aligned, with English descriptions translated into both German and French.
The description generation task has substantially changed since last year. The main difference is that source language descriptions are no longer observed for test images. This mirrors the real-world scenario in which a target-language speaker wants a description of an image that does not already have source language descriptions associated with it. The two subtasks are now more distinct because multilingual image description requires the use of the image (no text-only system is possible because the input contains no text).
Another change for this year is the introduction of two new evaluation datasets: an extension of the existing Multi30K dataset, and a "teaser" evaluation dataset with images carefully chosen to contain ambiguities in the source language.
This year we encouraged participants to submit systems using unconstrained data for both tasks. Training on additional out-of-domain data is underexplored for these tasks. We believe this setting will be critical for future real-world improvements, given that the current training datasets are small and expensive to construct. 1

Tasks
The Multimodal Translation task (Task 1) follows the format of the 2016 Shared Task. The Multilingual Image Description Task (Task 2) is new this year but it is related to the Crosslingual Image Description task from 2016. The main difference between the Crosslingual Image Description task and the Multilingual Image Description task is the presence of source language descriptions. In last year's Crosslingual Image Description task, the aim was to produce a single target language description, given five source language descriptions and the image. In this year's Multilingual Image Description task, participants received only an unseen image at test time, without source language descriptions.

Datasets
The Multi30K dataset is the primary dataset for the shared task. It contains 31K images originally described in English (Young et al., 2014) with two types of multilingual data: a collection of professionally translated German sentences, and a collection of independently crowdsourced German descriptions.
This year the Multi30K dataset has been extended with new evaluation data for the Translation and Image Description tasks, and an additional language for the Translation task. In addition, we released a new evaluation dataset featuring ambiguities that we expected would benefit from visual context. Table 1 presents an overview of the new evaluation datasets. Figure 1 shows an example of an image with an aligned English-German-French description.
In addition to releasing the parallel text, we also distributed two types of ResNet-50 visual features (He et al., 2016) for all of the images, namely the 'res4 relu' convolutional features (which preserve the spatial location of a feature in the original image) and averaged pooled features.
1 All of the data and results are available at http://www.statmt.org/wmt17/multimodal-task.html
Figure 1: Example of an image with a source description in English, together with German and French translations. En: A group of people are eating noddles. De: Eine Gruppe von Leuten isst Nudeln. Fr: Un groupe de gens mangent des nouilles.

Multi30K French Translations
We extended the translation data in the Multi30K dataset with crowdsourced French translations. The crowdsourced translations were collected from 12 workers using an internal platform. We estimate the translation work had a monetary value of €9,700. The translators had access to the source segment, the image and an automatic translation created with a standard phrase-based system (Koehn et al., 2007) trained on WMT'15 parallel text. The automatic translations were presented to the crowdworkers to further simplify the crowdsourcing task. We note that this did not end up being a post-editing task, that is, the translators did not simply copy and paste the suggested translations. To demonstrate this, we calculated text-similarity metric scores between the phrase-based system outputs and the human translations on the training corpus, resulting in an edit distance of 0.41 (measured using the TER metric), meaning that more than 40% of the words in these two versions do not match.
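To make the similarity figure more concrete, the following is a minimal sketch of a word-level edit rate: the number of word insertions, deletions and substitutions needed to turn a system output into the human translation, divided by the length of the human translation. It is a simplification and not the evaluation code used for the task: the full TER metric (Snover et al., 2006) additionally allows block shifts, which are omitted here. The function name `word_edit_rate` and the example sentences are illustrative assumptions.

```python
def word_edit_rate(hypothesis: str, reference: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Simplified stand-in for TER: real TER also allows block shifts.
    """
    hyp = hypothesis.split()
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # previous[j] = edits needed to turn the hyp words seen so far into ref[:j]
    previous = list(range(len(ref) + 1))
    for i, hyp_word in enumerate(hyp, start=1):
        current = [i]
        for j, ref_word in enumerate(ref, start=1):
            substitution = previous[j - 1] + (hyp_word != ref_word)
            deletion = previous[j] + 1      # drop hyp_word
            insertion = current[j - 1] + 1  # add ref_word
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


if __name__ == "__main__":
    # Hypothetical sentences, not taken from the Multi30K data.
    mt_output = "un groupe de personnes mange des nouilles"
    human = "un groupe de gens mangent des nouilles"
    print(round(word_edit_rate(mt_output, human), 2))
```

Under this simplified measure, a score of 0.41 would correspond to roughly four edits for every ten reference words.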

Multi30K 2017 test data
We collected new evaluation data for the Multi30K dataset. We sampled new images from five of the six Flickr groups used to create the original Flickr30K dataset using MMFeat (Kiela, 2016). We sampled additional images from two thematically related groups (Everything Outdoor and Flickr Social Club) because Outdoor Activities only returned 10 new CC-licensed images and Flickr-Social no longer exists. Table 2 shows the distribution of images across the groups and tasks. We initially downloaded 2,000 images per Flickr group, which were then manually filtered by three of the authors. The filtering was done to remove (near) duplicate images, clearly watermarked images, and images with dubious content. This process resulted in a total of 2,071 images. We crowdsourced five English descriptions of each image from Crowdflower using the same process as . One of the authors selected 1,000 images from the collection to form the dataset for the Multimodal Translation task based on a manual inspection of the English descriptions. Professional German translations were collected for those 1,000 English-described images. The remaining 1,071 images were used for the Multilingual Image Description task. We collected five additional independent German descriptions of those images from Crowdflower.

Ambiguous COCO
As a secondary evaluation dataset for the Multimodal Translation task, we collected and translated a set of image descriptions that potentially contain ambiguous verbs. We based our selection on the VerSe dataset (Gella et al., 2016), which annotates a subset of the COCO (Lin et al., 2014) and TUHOI (Le et al., 2014) images with OntoNotes senses for 90 verbs which are ambiguous, e.g. play.
Their goals were to test the feasibility of annotating images with the word sense of a given verb (rather than verbs themselves) and to provide a gold-labelled dataset for evaluating automatic visual sense disambiguation methods.
Altogether, the VerSe dataset contains 3,518 images, but we limited ourselves to its COCO section, since for our purposes we also need the image descriptions, which are not available in TUHOI. The COCO portion covers 82 verbs; we further discarded verbs that are unambiguous in the dataset, i.e. although some verbs have multiple senses in OntoNotes, they all occur with one sense in VerSe (e.g. gather is used in all instances to describe the 'people gathering' sense), resulting in 57 ambiguous verbs (2,699 images). The actual descriptions of the images were not distributed with the VerSe dataset. However, given that the ambiguous verbs were selected based on the image descriptions, we assumed that in all cases at least one of the original COCO descriptions (out of the five per image) should contain the ambiguous verb. In cases where more than one description contained the verb, we randomly selected one such description to be part of the dataset of descriptions containing ambiguous verbs. This resulted in 2,699 descriptions.
As a consequence of the original goals of the VerSe dataset, each sense of each ambiguous verb was used multiple times in the dataset, which resulted in many descriptions with the same sense. For example, 85 images (and descriptions) were available for the verb show, but they referred to a small set of senses of the verb.
The number of images (and therefore descriptions) per ambiguous verb varied from 6 (stir) to 100 (pull, serve). Since our intention was to have a small but varied dataset, we selected a subset of descriptions per ambiguous verb, aiming at keeping 1-3 instances per sense per verb. This resulted in 461 descriptions for 56 verbs in total, ranging from 3 (e.g. shake, carry) to 26 (reach) (the verb lay/lie was excluded as it had only one sense). We note that the descriptions include the use of the verbs in phrasal verbs. Two examples of the English verb "to pass" are shown in Figure 2. In the German translations, the source language verb did not require disambiguation (both German translations use the verb "fährt"), whereas in the French translations, the verb was disambiguated into "dépasse" and "traverse", respectively.
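The subsampling step can be illustrated with a short sketch that keeps at most three descriptions for every (verb, sense) pair. The text does not state how the instances were chosen, so the random selection below is an assumption, as are the function name `sample_per_sense`, the record layout, and the sense labels; none of them reflect the VerSe file format.

```python
import random
from collections import defaultdict


def sample_per_sense(items, max_per_sense=3, seed=0):
    """Keep at most `max_per_sense` descriptions for each (verb, sense) pair.

    `items` is an iterable of (verb, sense, description) tuples; this record
    layout is hypothetical and is not the VerSe file format.
    """
    rng = random.Random(seed)
    by_sense = defaultdict(list)
    for verb, sense, description in items:
        by_sense[(verb, sense)].append(description)

    selected = []
    # Sorting the keys keeps the output order reproducible.
    for (verb, sense), descriptions in sorted(by_sense.items()):
        k = min(max_per_sense, len(descriptions))
        for description in rng.sample(descriptions, k):
            selected.append((verb, sense, description))
    return selected


if __name__ == "__main__":
    # Toy records with made-up sense labels.
    toy = [("pass", "go_by", f"description {i}") for i in range(10)]
    toy.append(("pass", "hand_over", "a man passes a plate"))
    for row in sample_per_sense(toy):
        print(row)
```

Capping the count per sense, rather than per verb, is what keeps frequent senses from dominating the resulting subset.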

Participants
This year we attracted submissions from nine different groups. Table 3 presents an overview of the groups and their submission identifiers.
AFRL-OHIOSTATE (Task 1) The AFRL-OHIOSTATE system submission is an atypical Machine Translation (MT) system in that the image is the catalyst for the MT results, and not the textual content. This system architecture assumes an image caption engine can be trained in a target language to give meaningful output in the form of a set of the most probable n target language candidate captions. A learned mapping function of the encoded source language caption to the corresponding encoded target language captions is then employed. Finally, a distance function is applied to retrieve the "nearest" candidate caption to be the translation of the source caption.
CMU (Task 2) The CMU submission uses a multi-task learning technique, extending the baseline so that it generates both a German caption and an English caption. First, a German caption is generated using the baseline method. After the LSTM for the baseline model finishes producing a German caption, it has some final hidden state. Decoding is simply resumed starting from that final state with an independent decoder, separate vocabulary, and this time without any direct access to the image. The goal is to encourage the model to keep information about the image in the hidden state throughout the decoding process, hopefully improving the model output. Although the model is trained to produce both German and English captions [...]
CUNI (Tasks 1 and 2) For Task 1, the submissions employ the standard neural MT (NMT) scheme enriched with another attentive encoder for the input image. It uses a hierarchical attention combination in the decoder. The best system was trained with additional data obtained from selecting similar sentences from parallel corpora and by back-translation of similar sentences found in the SDEWAC corpus (Faaß and Eckart, 2013). The submission to Task 2 is a combination of two neural models. The first model generates an English caption from the image. The second model is a text-only NMT model that translates the English caption to German.
DCU-ADAPT (Task 1) This submission evaluates ensembles of up to four different multimodal NMT models. All models use global image features obtained with the pre-trained CNN VGG19, which are incorporated either in the encoder or in the decoder. These models are described in detail in Calixto et al. (2017b). They are model IMG_W, in which image features are used as words in the source-language encoder; model IMG_E, where image features are used to initialise the hidden states of the forward and backward encoder RNNs; and model IMG_D, where the image features are used as additional signals to initialise the decoder hidden state. Each image has one corresponding feature vector, obtained from the activations of the FC7 layer of the VGG19 network, which consists of a 4096D real-valued vector that encodes information about the entire image.
LIUMCVC (Task 1) LIUMCVC experiments with two approaches: a multimodal attentive NMT with separate attention (Caglayan et al., 2016) over source text and convolutional image features, and an NMT where global visual features (2048-dimensional pool5 features from ResNet-50) are multiplicatively interacted with word embeddings. More specifically, each target word embedding is multiplied with global visual features in an element-wise fashion in order to visually contextualize word representations. With 128-dimensional embeddings and 256-dimensional recurrent layers, the resulting models have around 5M parameters.
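The multiplicative interaction can be sketched in a few lines. The example below first maps the global image feature to the embedding size with an affine transformation (the Discussion describes this variant as using an affine transformation of the global image feature vector) and then multiplies it element-wise with a word embedding. This is a toy illustration under stated assumptions, not the LIUMCVC implementation: the helper names `affine` and `contextualize`, the random parameters, and the small dimensions are all hypothetical.

```python
import random


def affine(weights, bias, vector):
    """Return W v + b for a weight matrix given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def contextualize(embedding, projected_visual):
    """Element-wise product of a word embedding and the projected image vector."""
    if len(embedding) != len(projected_visual):
        raise ValueError("dimensions must match")
    return [e * v for e, v in zip(embedding, projected_visual)]


if __name__ == "__main__":
    rng = random.Random(0)
    visual_dim, embed_dim = 8, 4  # toy sizes; the text mentions 2048 and 128
    image_feature = [rng.random() for _ in range(visual_dim)]
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(visual_dim)]
               for _ in range(embed_dim)]
    bias = [0.0] * embed_dim
    word_embedding = [rng.uniform(-1, 1) for _ in range(embed_dim)]

    visual = affine(weights, bias, image_feature)
    print(contextualize(word_embedding, visual))
```

Because the same projected image vector scales every word embedding of a sentence, the image acts as a sentence-level gate on the embedding dimensions.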
NICT (Task 1) These are constrained submissions for both language pairs. First, a hierarchical phrase-based (HPB) translation system is built using Moses (Koehn et al., 2007) with standard features. Then, an attentional encoder-decoder network (Bahdanau et al., 2015) is trained and used as an additional feature to rerank the n-best output of the HPB system. A unimodal NMT model is also trained to integrate visual information. Instead of integrating visual features into the NMT model directly, image retrieval methods are employed to obtain target language descriptions of images that are similar to the image described by the source sentence, and this target description information is integrated into the NMT model. A multimodal NMT model is also used to rerank the HPB output. All feature weights (including the standard features, the NMT feature and the multimodal NMT feature) were tuned by MERT (Och, 2003). On the development set, the NMT feature improved the HPB system significantly. However, the multimodal NMT feature did not further improve the HPB system that had integrated the NMT feature.
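Reranking an n-best list with an additional model score amounts to choosing the hypothesis with the highest weighted sum of feature values. The sketch below illustrates that idea only; it is not the NICT system. The feature names `hpb` and `nmt`, the function `rerank`, and the hand-set weights are assumptions (in the submission the weights were tuned by MERT).

```python
def rerank(nbest, weights):
    """Return the hypothesis with the highest weighted feature sum.

    `nbest` is a list of (hypothesis, features) pairs, where `features`
    maps feature names to scores (higher is better, e.g. log-probabilities).
    Features without a weight are ignored.
    """
    def score(features):
        return sum(weights.get(name, 0.0) * value
                   for name, value in features.items())

    return max(nbest, key=lambda pair: score(pair[1]))[0]


if __name__ == "__main__":
    # Hypothetical 2-best list with made-up log-probability scores.
    nbest = [
        ("eine Gruppe von Leuten isst Nudeln", {"hpb": -2.0, "nmt": -1.0}),
        ("eine Gruppe von Leuten essen Nudeln", {"hpb": -1.0, "nmt": -5.0}),
    ]
    print(rerank(nbest, {"hpb": 1.0}))              # phrase-based score only
    print(rerank(nbest, {"hpb": 1.0, "nmt": 1.0}))  # with the neural feature
```

With only the phrase-based score the second hypothesis wins; adding the neural feature changes the choice to the first, which is the effect reranking is meant to achieve.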
OREGONSTATE (Task 1) The OREGONSTATE system uses a very simple but effective model which feeds the image information to both encoder and decoder. On the encoder side, the image representation is used as initialization information to generate the source words' representations. This step strengthens the relatedness between the image's and the source words' representations. Additionally, the decoder uses alignment to source words by a global attention mechanism. In this way, the decoder benefits from both image and source language information and generates a more accurate target-side sentence.
UvA-TiCC (Task 1) The submitted systems are Imagination models (Elliott and Kádár, 2017), which are trained to perform two tasks in a multitask learning framework: a) produce the target sentence, and b) predict the visual feature vector of the corresponding image. The constrained models are trained over only the 29,000 training examples in the Multi30K dataset with a source-side vocabulary of 10,214 types and a target-side vocabulary of 16,022 types. The unconstrained models are trained over a concatenation of the Multi30K, News Commentary (Tiedemann, 2012) parallel texts, and MS COCO (Chen et al., 2015) dataset with a joint source-target vocabulary of 17,597 word pieces (Schuster and Nakajima, 2012). In both constrained and unconstrained submissions, the models were trained to predict the 2048D GoogleLeNetV3 feature vector (Szegedy et al., 2015) of an image associated with a source language sentence. The output of an ensemble of the three best randomly initialized models, as measured by BLEU on the Multi30K development set, was used for both the constrained and unconstrained submissions.
SHEF (Task 1) The SHEF systems utilize the predicted posterior probability distribution over the image object classes as image features. To do so, they make use of the pre-trained ResNet-152 (He et al., 2016), a deep CNN-based image network that is trained over the 1,000 object categories on the ImageNet dataset (Deng et al., 2009) to obtain the posterior distribution. The model follows a standard encoder-decoder NMT approach using softdot attention as described in Luong et al. (2015). It explores image information in three ways: a) to initialize the encoder; b) to initialize the decoder; c) to condition each source word with the image class posteriors. In all three cases, non-linear affine transformations over the posteriors are used as image features.
Baseline - Task 1 The baseline system for the multimodal translation task is a text-only neural machine translation system built with the Nematus toolkit (Sennrich et al., 2017). Most settings and hyperparameters were kept as default, with a few exceptions: batch size of 40 (instead of 80 due to memory constraints) and ADAM as optimizer. In order to handle rare and OOV words, we used the Byte Pair Encoding Compression Algorithm to segment words (Sennrich et al., 2016b). The merge operations for word segmentation were learned using training data in both source and target languages. These were then applied to all training, validation and test sets in both source and target languages. In post-processing, the original words were restored by concatenating the subwords.
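The post-processing step can be sketched as follows. The example assumes that non-final subword units are marked with a trailing "@@" separator, which is a common convention for BPE-segmented text; the text above does not state which marker the baseline used, so the separator and the function name `restore_words` are assumptions.

```python
import re


def restore_words(segmented: str, separator: str = "@@") -> str:
    """Undo subword segmentation by joining pieces marked with `separator`."""
    marker = re.escape(separator)
    # Remove "<sep> " between pieces, and a dangling "<sep>" at the end.
    return re.sub(rf"{marker} |{marker} ?$", "", segmented)


if __name__ == "__main__":
    # Hypothetical segmentation of a German sentence.
    print(restore_words("eine Gru@@ ppe von Leu@@ ten isst Nu@@ deln"))
```

Since the marker is removed together with the following space, adjacent subwords are concatenated while ordinary word boundaries are left untouched.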
Baseline - Task 2 The baseline for the multilingual image description task is an attention-based image description system trained over only the German image descriptions. The visual representations are extracted from the so-called res4f relu layer of a ResNet-50 (He et al., 2016) convolutional neural network trained on the ImageNet dataset (Russakovsky et al., 2015). Those feature maps provide spatial information on which the model focuses through the attention mechanism.
The submissions were evaluated using Meteor 1.5 (Denkowski and Lavie, 2014). We also report the results using BLEU (Papineni et al., 2002) and TER (Snover et al., 2006).
Table 4 shows the results on the Multi30K 2017 test data with a German target language. It is interesting to note that the metrics do not fully agree on the ranking of systems, although the four best (statistically indistinguishable) systems win by all metrics.
All but one submission outperformed the text-only NMT baseline. This year, the best performing systems include both multimodal (LIUMCVC MNMT C and UvA-TiCC IMAGINATION U) and text-only (NICT NMTrerank C and LIUMCVC NMT C) submissions.
(Strictly speaking, the UvA-TiCC IMAGINATION U system is incomparable because it is an unconstrained system, but all unconstrained systems perform in the same range as the constrained systems.) Table 5 shows the results for the out-of-domain Ambiguous COCO dataset with a German target language. Once again the evaluation metrics do not fully agree on the ranking of the submissions.

Ambiguous COCO
It is interesting to note that the metric scores are lower for the out-of-domain Ambiguous COCO data compared to the in-domain Multi30K 2017 test data. However, we cannot make definitive claims about the difficulty of the dataset because the Ambiguous COCO dataset contains fewer sentences than the Multi30K 2017 test data (461 compared to 1,000).
The systems are mostly in the same order as on the Multi30K 2017 test data, with the same four systems performing best. However, two systems (DCU-ADAPT MultiMT C and OREGONSTATE 1NeuralTranslation C) are ranked higher on this test set than on the in-domain Flickr dataset, indicating that they are relatively more robust and possibly better at resolving the ambiguities found in the Ambiguous COCO dataset.
Table 6 shows the results for the Multi30K 2017 test data with French as target language. A reduced number of submissions were received for this new language pair, with no unconstrained systems. In contrast to the English→German results, the evaluation metrics are in better agreement about the ranking of the submissions.
Translating from English→French is an easier task than translating from English→German, as reflected in the higher metric scores. This also holds for the baseline systems, where English→French results in 63.1 Meteor compared to 41.9 for English→German.
Eight out of the ten submissions outperformed the English→French baseline system. Two of the best submissions for English→German remain the best for English→French (LIUMCVC MNMT C and NICT NMTrerank C), the text-only system (LIUMCVC NMT C) decreased in performance, and no UvA-TiCC IMAGINATION U system was submitted for French.
An interesting observation is the difference in Meteor scores between the text-only NMT system (LIUMCVC NMT C) and the Moses hierarchical phrase-based system with reranking (NICT NMTrerank C). While the two systems are very close for the English→German direction, the hierarchical system is better than the text-only NMT system in the English→French direction. This pattern holds for both the Multi30K 2017 test data and the Ambiguous COCO test data. Table 7 shows the results for the out-of-domain Ambiguous COCO dataset with the French target language. Once again, in contrast to the English→German results, the evaluation metrics are in better agreement about the ranking of the submissions. The performance of all the models is once again mostly in agreement with the Multi30K 2017 test data, albeit lower. Both DCU-ADAPT MultiMT C and OREGONSTATE 2NeuralTranslation C again perform relatively better on this dataset.

Task 2: English → German
The description generation task, in which systems must generate target-language (German) captions for a test image, has substantially changed since last year. The main difference is that source language descriptions are no longer observed for images at test time. The training data remains the same and contains images with both source and target language descriptions. The aim is thus to leverage multilingual training data to improve a monolingual task. Table 8 shows the results for the Multilingual image description task. This task attracted fewer submissions than last year, which may be because it was no longer possible to re-use a model designed for Multimodal Translation. The evaluation metrics do not agree on the ranking of the submissions, with major differences in the ranking using either BLEU or TER instead of Meteor.
The main result is that none of the submissions outperform the monolingual German baseline according to Meteor.
All of the submissions are statistically significantly different compared to the baseline.
However, the CMU NeuralEncoderDecoder C submission marginally outperformed the baseline according to TER and equalled its BLEU score.

Human Judgement Results
This year, we conducted a human evaluation in addition to the text-similarity metrics to assess the translation quality of the submissions. This evaluation was undertaken for the Task 1 German and French outputs for the Multi30K 2017 test data.
This section describes how we collected the human assessments and computed the results. We would like to gratefully thank all assessors.

Methodology
The system outputs were manually evaluated by bilingual Direct Assessment (DA) (Graham et al., 2015) using the Appraise platform (Federmann, 2012). The annotators (mostly researchers) were asked to evaluate the semantic relatedness between the source sentence in English and the target sentence in German or French. The image was shown along with the source sentence and the candidate translation, and evaluators were told to rely on the image when necessary to obtain a better understanding of the source sentence (e.g. in cases where the text was ambiguous). Note that the reference sentence is not displayed during the evaluation, in order to avoid influencing the assessor. Figure 3 shows an example of the direct assessment interface used in the evaluation. The score of each translation candidate ranges from 0 (the meaning of the source is not preserved in the target language sentence) to 100 (the meaning of the source is "perfectly" preserved). The human assessment scores are standardized according to each individual assessor's overall mean and standard deviation score. The overall score of a given system (z) corresponds to the mean standardized score of its translations.
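The standardization described above can be sketched as follows: each raw score is converted to a z-score using the mean and standard deviation of the assessor who produced it, and a system's score is the mean of its z-scores. This is an illustrative sketch rather than the evaluation scripts used for the task. The function name `system_z_scores`, the tuple layout, the use of the population standard deviation, and the decision to skip assessors whose scores never vary are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def system_z_scores(assessments):
    """Compute each system's mean standardized DA score.

    `assessments` is a list of (assessor, system, raw_score) tuples with
    raw scores in [0, 100].
    """
    by_assessor = defaultdict(list)
    for assessor, _, score in assessments:
        by_assessor[assessor].append(score)

    # Per-assessor mean and (population) standard deviation.
    stats = {a: (mean(s), pstdev(s)) for a, s in by_assessor.items()}

    by_system = defaultdict(list)
    for assessor, system, score in assessments:
        mu, sigma = stats[assessor]
        if sigma == 0:
            continue  # constant scores cannot be standardized
        by_system[system].append((score - mu) / sigma)

    return {system: mean(z) for system, z in by_system.items()}


if __name__ == "__main__":
    # Toy data: two assessors with different scoring habits.
    toy = [
        ("assessor1", "systemA", 80), ("assessor1", "systemB", 40),
        ("assessor2", "systemA", 60), ("assessor2", "systemB", 50),
    ]
    print(system_z_scores(toy))
```

In the toy data both assessors prefer the same system, but one uses a much wider range of scores; after standardization their judgements contribute equally.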

Results
The French outputs were evaluated by seven assessors, who conducted a total of 2,521 DAs, resulting in a minimum of 319 and a maximum of 368 direct assessments per system submission, respectively. The German outputs were evaluated by 25 assessors, who conducted a total of 3,485 DAs, resulting in a minimum of 291 and a maximum of 357 direct assessments per system submission, respectively. This is somewhat less than the recommended number of 500, so the results should be considered preliminary.
Tables 9 and 10 show the results of the human evaluation for the English to German and the English to French Multimodal Translation task (Multi30K 2017 test data). The systems are ordered by standardized mean DA scores and clustered according to the Wilcoxon signed-rank test at p-level p ≤ 0.05. Systems within a cluster are considered tied. The Wilcoxon signed-rank scores can be found in Tables 11 and 12 in Appendix A.
When comparing automatic and human evaluations, we can observe that they globally agree with each other, as shown in Figures 4 and 5, with German showing better agreement than French. We point out two interesting disagreements: First, in the English→French language pair, CUNI NeuralMonkeyMultimodalMT C and DCU-ADAPT MultiMT C are significantly better than LIUMCVC MNMT C, despite the fact that the latter system achieves much higher metric scores. Secondly, across both languages, the text-only LIUMCVC NMT C system performs well on metrics but does relatively poorly on human judgements, especially as compared to the multimodal version of the same system.

Discussion
Visual Features: do they help? Three teams provided text-only counterparts to their multimodal systems for Task 1 (CUNI, LIUMCVC, and OREGONSTATE), which enables us to evaluate the contribution of visual features. For many systems, visual features did not seem to help reliably, at least as measured by metric evaluations: in German, the CUNI and OREGONSTATE text-only systems outperformed their multimodal counterparts, while in French, there were small improvements for the CUNI multimodal system. However, the LIUMCVC multimodal system outperformed their text-only system across both languages.
The human evaluation results are perhaps more promising: nearly all the highest ranked systems (with the exception of NICT) are multimodal. An intriguing result was the text-only LIUMCVC NMT C, which ranked highly on metrics but poorly in the human evaluation. The LIUMCVC systems were indistinguishable from each other in terms of Meteor scores but the standardized mean direct assessment score showed a significant difference in performance (see Tables 11 and 12): further analysis of the reasons for humans disliking the text-only translations will be necessary.
The multimodal Task 1 submissions can be broadly categorised into three groups based on how they use the images: approaches using double-attention mechanisms, initialising the hidden state of the encoder and/or decoder networks with the global image feature vector, and alternative uses of image features. The double-attention models calculate context vectors over the source language hidden states and location-preserving feature vectors over the image; these vectors are used as inputs to the translation decoder (CUNI NeuralMonkeyMultimodalMT). Encoder and/or decoder initialisation involves initialising the recurrent neural network with an affine transformation of a global image feature vector (DCU-ADAPT MultiMT, OREGONSTATE 1NeuralTranslation) or initialising the encoder and decoder with the 1000-dimension softmax probability vector over the object classes in the ImageNet object recognition challenge (SHEF ShefClassInitDec). The alternative uses of the image features include element-wise multiplication of the target language embeddings with an affine transformation of a global image feature vector (LIUMCVC MNMT), summing the source language word embeddings with an affine-transformed 1000-dimension softmax probability vector (SHEF ShefClassProj), using the visual features in a retrieval framework (AFRL-OHIOSTATE MULTIMODAL), and learning visually-grounded encoder representations by learning to predict the global image feature vector from the source language hidden states (UvA-TiCC IMAGINATION).
[Caption fragment from a results figure or table: systems using unconstrained data are identified with a gray background.]
Overall, the metric and human judgement results in Sections 4 and 5 indicate that there is still a wide scope for exploration of the best way to integrate visual and textual information. In particular, the alternative approaches proposed in the LIUMCVC MNMT and UvA-TiCC IMAGINATION submissions led to strong performance in both the metric and human judgement results, surpassing the more common approaches using initialisation and double attention.
Finally, the text-only NICT system ranks highly across both languages. This system uses hierarchical phrase-based MT with a reranking step based on a neural text-only system, since their multimodal system never outperformed the text-only variant in development (Zhang et al., 2017). This is in line with last year's results and the strong Moses baseline, and suggests a continuing role for phrase-based MT for small homogeneous datasets.
Unconstrained systems The Multi30K dataset is relatively small, so unconstrained systems use more data to complement the image description translations. Three groups submitted systems using external resources: UvA-TiCC, CUNI, and AFRL-OHIOSTATE. The unconstrained UvA-TiCC and CUNI submissions always outperformed their respective constrained variants by 2-3 Meteor points and achieved higher standardized mean DA scores. These results suggest that external parallel text corpora (UvA-TiCC and CUNI) and external monolingual image description datasets (UvA-TiCC) can usefully improve the quality of multimodal translation models.
However, tuning to the target domain remains important, even for relatively simple image captions. We ran the best-performing English→German WMT'16 news translation system (Sennrich et al., 2016a) [...]
Ambiguous COCO dataset We introduced a new evaluation dataset this year with the aim of testing systems' ability to use visual features to identify word senses. However, it is unclear whether visual features improve performance on this test set. The text-only NICT NMTrerank system performs competitively, ranking in the top three submissions for both languages. We find mixed results for submissions with text-only and multimodal counterparts (CUNI, LIUMCVC, OREGONSTATE): LIUMCVC's multimodal system improves over the text-only system for French but not German, while the visual features help for German but not French in the CUNI and OREGONSTATE systems.
We plan to perform a further analysis on the extent of translation ambiguity in this dataset. We will also continue to work on other methods for constructing datasets in which textual ambiguity can be disambiguated by visual information.
Multilingual Image Description It proved difficult for Task 2 systems to use the English data to improve over the monolingual German baseline.
In future iterations of the task, we will consider a lopsided data setting, in which there is much more English data than target language data. This setting is more realistic and will push the use of multilingual data. We also hope to conduct human evaluation to better assess performance because automatic metrics are problematic for this task (Elliott and Keller, 2014;Kilickaya et al., 2017).

Conclusions
We presented the results of the second shared task on multimodal translation and multilingual image description. The shared task attracted submissions from nine groups, who submitted a total of 19 systems across the tasks. The Multimodal Translation task attracted the majority of the submissions. Human judgements for the translation task were collected for the first time this year and ranked systems broadly in line with the automatic metrics.
The main findings of the shared task are:
(i) There is still scope for novel approaches to integrating visual and linguistic features in multilingual multimodal models, as demonstrated by the winning systems.
(ii) External resources have an important role to play in improving the performance of multimodal translation models beyond what can be learned from limited training data.
(iii) The differences between text-only and multimodal systems are being obfuscated by the well-known shortcomings of text-similarity metrics. Multimodal systems often seem to be preferred by humans but not rewarded by metrics. Future research on this topic, encompassing both multimodal translation and multilingual image description, should be evaluated using human judgements.
In future editions of the task, we will encourage participants to submit the output of single decoder systems to better understand the empirical differences between approaches. We are also considering a Multilingual Multimodal Translation challenge, where the systems can observe two language inputs alongside the image to encourage the development of multi-source multimodal models.
Tables 11 and 12 show the Wilcoxon signed-rank test used to create the clustering of the systems.
Table 12: English → French Wilcoxon signed-rank test at p-level p ≤ 0.05. '-' means that the value is higher than 0.05.