Adversarial Evaluation of Multimodal Machine Translation

The promise of combining language and vision in multimodal machine translation is that systems will produce better translations by leveraging the image data. However, the evidence surrounding whether the images are useful is unconvincing due to inconsistencies between text-similarity metrics and human judgements. We present an adversarial evaluation to directly examine the utility of the image data in this task. Our evaluation tests whether systems perform better when paired with congruent images or incongruent images. This evaluation shows that only one out of three publicly available systems is sensitive to this perturbation of the data. We recommend that multimodal translation systems should be able to pass this sanity check in the future.


Introduction
Multimodal machine translation is the task of translating sentences situated in a visual context, such as captioned images on social media. The core argument of this area of research is that we can produce better translations by exploiting both the source language sentence and the visual context (Elliott et al., 2015; Hitschler et al., 2016). There is some evidence to support this argument for human translation: Frank et al. (2018) found that 13% of the German evaluation data in the Multi30K dataset needed at least one post-edit to reflect the joint meaning of the visual and linguistic context. However, the evidence that visual context helps computational models is less clear. Consider the three teams that submitted contrastive multimodal and text-only variants of their systems to the 2017 Multimodal Translation Shared Task (Elliott et al., 2017): the University of Le Mans' multimodal system outperformed their text-only variant (Caglayan et al., 2017); the Oregon State University text-only system outperformed their multimodal variant (Ma et al., 2017); and the performance of the Charles University systems depended on the language pair (Libovický and Helcl, 2017). In light of these results, we need a better understanding of the role of visual context in multimodal translation systems.
[Figure 1: Illustration of the evaluation. Source sentence: "Two dogs play with an orange toy in tall grass." Model. Target sentence: "Zwei Hunde spielen im hohen Gras mit einem orangen Spielzeug."]
* Work carried out at the University of Edinburgh.
We propose an adversarial evaluation method to determine whether multimodal translation systems are aware of the visual context. We introduce a measure of image awareness to quantify the difference in performance in two settings: (i) when a system is presented with congruent visual data; (ii) when it is presented with incongruent visual data. In both settings, a system is presented with the correct source language sentence. See Figure 1 for an illustration of our evaluation. We hypothesise that if a system is aware of the visual context, i.e. it is actually using the image for translation, then the system will perform better when it is presented with the congruent visual data than with incongruent visual data. Our evaluation is related to the foiled image captions evaluation, in which the performance of an image captioning system is measured when a single word is replaced with an incorrect but similar word (Shekhar et al., 2017); the main difference is that we replace the visual data instead of manipulating the text. Our work is also related to a study of question-answering systems, in which additional text was appended to the end of a document (Jia and Liang, 2017). They found that these additional text segments distracted QA systems from producing the correct answer. In contrast, our evaluation does not manipulate the textual data; instead, we replace the original visual input with a random distractor.
We evaluate three publicly available multimodal translation systems with our adversarial evaluation. The main finding of this paper is that one publicly available multimodal translation system is not aware of the congruent image data. This finding raises doubts about whether state-of-the-art multimodal translation systems actually use the visual context to produce better translations. We conclude this paper by discussing whether this is likely to be due to problems with the data or with the model architectures.

Image Awareness
We propose an adversarial evaluation method for multimodal machine translation. This method measures how a system performs when it is presented with the correct text data and either the congruent image or an incongruent image. In this section we define two image awareness functions to measure whether a multimodal translation system is aware of the congruent visual data.
Let x be a source language sentence, y be a target language sentence, v be the congruent image, and \bar{v} be an incongruent image. Image awareness is calculated using an evaluable performance measure E. The overall image awareness of a model M on an evaluation dataset D is:
\Delta\text{-Awareness} = \frac{1}{|D|} \sum_{i=1}^{|D|} a_M(x_i, y_i, v_i, \bar{v}_i)    (1)
The image awareness of a model M for a single instance is:
a_M(x_i, y_i, v_i, \bar{v}_i) = E(x_i, y_i, v_i) - E(x_i, y_i, \bar{v}_i)    (2)
Under this definition, the output of the evaluable performance measure should be higher in the presence of the congruent data than the incongruent data, i.e. E(x_i, y_i, v_i) > E(x_i, y_i, \bar{v}_i). If this is the case, on average, then the overall image awareness of a model, ∆-Awareness, is positive. This can only happen when model outputs are evaluated more favourably in the presence of the congruent image data than the incongruent image data.
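The two definitions above can be sketched in a few lines of Python. This is a minimal illustration, assuming the per-instance scores E for the congruent and incongruent conditions have already been computed; the function names are hypothetical and are not taken from the released evaluation code.

```python
from statistics import mean


def instance_awareness(score_congruent, score_incongruent):
    """Per-instance awareness: E with the congruent image minus E with the incongruent one."""
    return score_congruent - score_incongruent


def overall_awareness(congruent_scores, incongruent_scores):
    """Overall awareness: the mean per-instance awareness over the evaluation set."""
    if len(congruent_scores) != len(incongruent_scores):
        raise ValueError("both conditions must score the same instances")
    return mean(
        instance_awareness(c, i)
        for c, i in zip(congruent_scores, incongruent_scores)
    )


# Hypothetical scores for three instances (illustrative values only).
congruent = [0.5, 0.25, 0.75]
incongruent = [0.25, 0.25, 0.5]
delta = overall_awareness(congruent, incongruent)
```

A positive value of `delta` corresponds to a model whose outputs are, on average, scored more favourably with the congruent image.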

Model-internal awareness ∆ I
A model-internal measure of image awareness is the difference in the probability assigned to the target language sentence y in the congruent and incongruent conditions. This is model-internal because it has the same form as the maximum-likelihood objective used to train the translation model. In this case, E = p(y|x, ·), and the difference in performance for a single instance is:
a_M(x_i, y_i, v_i, \bar{v}_i) = p(y_i | x_i, v_i) - p(y_i | x_i, \bar{v}_i)
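As a rough sketch of the model-internal measure, suppose a system exposes the per-token log-probabilities of the reference sentence y under each image condition. That interface is an assumption made for illustration, not something described in this paper, and the function names are hypothetical.

```python
import math


def sentence_probability(token_log_probs):
    """p(y | x, image) as the product of per-token probabilities (sum in log space)."""
    return math.exp(sum(token_log_probs))


def probability_awareness(log_probs_congruent, log_probs_incongruent):
    """Difference in the probability of the reference y between the two image conditions."""
    return (sentence_probability(log_probs_congruent)
            - sentence_probability(log_probs_incongruent))


# Hypothetical two-token reference scored under both conditions.
with_congruent = [math.log(0.5), math.log(0.5)]     # p(y | x, v) = 0.25
with_incongruent = [math.log(0.5), math.log(0.25)]  # p(y | x, v-bar) = 0.125
a = probability_awareness(with_congruent, with_incongruent)
```

For long sentences the raw probabilities become very small, so an implementation may prefer to inspect the log-probabilities as well; that choice is outside what the definition above specifies.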

Model-external awareness ∆ E
A model-external awareness measure could be a text-similarity evaluation or human judgement. In this paper, we use the Meteor text-similarity score (Denkowski and Lavie, 2014) because it naturally decomposes to the sentence level, and it is already the de-facto evaluation metric for multimodal machine translation. Let E be any text-similarity scoring function T that decomposes to the sentence level. The difference in performance for a single instance is defined as:
a_M(x_i, y_i, v_i, \bar{v}_i) = T(x_i, y_i, v_i) - T(x_i, y_i, \bar{v}_i)
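The sketch below illustrates the model-external measure with a deliberately simple stand-in for T. The `unigram_f1` scorer is a toy substitute chosen so the example is self-contained; it is not Meteor, and all names here are hypothetical.

```python
from collections import Counter


def unigram_f1(hypothesis, reference):
    """Toy sentence-level similarity: F1 over whitespace tokens (a stand-in, not Meteor)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def external_awareness(reference, hyp_congruent, hyp_incongruent, scorer=unigram_f1):
    """Score of the congruent-image translation minus that of the incongruent-image one."""
    return scorer(hyp_congruent, reference) - scorer(hyp_incongruent, reference)


# Hypothetical system outputs for one sentence under the two image conditions.
a = external_awareness(
    reference="zwei hunde spielen",
    hyp_congruent="zwei hunde spielen",
    hyp_incongruent="zwei katzen spielen",
)
```

Any scorer that returns one value per sentence can be passed as `scorer`, which mirrors the requirement that T decomposes to the sentence level.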

Systems Evaluation
We evaluate the image awareness of three pretrained multimodal translation systems that we received by direct correspondence:
decinit: The initial state of the decoder network is set with a learned transformation of the visual data (Caglayan et al., 2017).
trgmul: The target language word embeddings are modulated by an element-wise multiplication with a learned transformation of the visual data (Caglayan et al., 2017).
hierattn: The decoder network learns to selectively attend to a combination of the source language and the visual data (Libovický and Helcl, 2017).
Each system was trained on the 29,000 English-German-image tuples in the Multi30K dataset. We evaluate the image awareness of these systems using the 1,014 tuples in the validation data, which is typically used for model selection. We select the incongruent images \bar{v} by randomly shuffling the order in which the images v are associated with the source language text x. In our evaluation, we report the mean and standard deviation of randomly shuffling the image data five times. The code to evaluate your own system is publicly available.
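The shuffling procedure described above might be sketched as follows. This is an illustrative outline rather than the released evaluation code: `score_fn` is a hypothetical callable that returns E(x, y, image) for one instance, and the seed handling is an assumption.

```python
import random
from statistics import mean, stdev


def shuffled_awareness(score_fn, sources, references, images, runs=5, seed=0):
    """Mean and standard deviation of overall awareness over several random shuffles."""
    rng = random.Random(seed)
    # Scores in the congruent condition do not change between shuffles.
    congruent = [score_fn(x, y, v) for x, y, v in zip(sources, references, images)]
    results = []
    for _ in range(runs):
        shuffled = list(images)
        # A plain shuffle, as described in the text; it can occasionally leave
        # an image paired with its own sentence.
        rng.shuffle(shuffled)
        incongruent = [
            score_fn(x, y, v) for x, y, v in zip(sources, references, shuffled)
        ]
        results.append(mean(c - i for c, i in zip(congruent, incongruent)))
    return mean(results), stdev(results)
```

A system that ignores the image entirely would give identical scores in both conditions, so every run would yield an awareness of zero.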

Statistical test
To determine if a model passes the proposed evaluation, we conduct a non-parametric Wilcoxon signed-rank test of the following hypothesis:
H1: Congruent images improve the quality of multimodal translation compared to incongruent images.
H0: Congruent images make no difference to the quality of multimodal translation compared to incongruent images.
We conduct this statistical test using the pairs of values that are calculated in the process of computing the image awareness scores (Eq. 2), i.e. E(x_i, y_i, v_i) and E(x_i, y_i, \bar{v}_i).
We combine the k=5 separate p values from each test using Fisher's method and reject the null hypothesis H0 if the result of the χ² test with 2k degrees of freedom is p ≤ 0.005.
Table 1 shows the corpus-level results of a Meteor-based evaluation and the Meteor-awareness evaluation. We find that images improve the quality of the hierattn system (χ² = 136.74, p < 0.0001), and images also improve the quality of the decinit system (χ² = 32.79, p = 0.0003). Images make no difference to the quality of the translations generated by the trgmul system (χ² = 8.98, p = 0.533).
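The combination step with Fisher's method can be illustrated with the standard library alone. The sketch below assumes the k per-shuffle p values have already been obtained from the signed-rank tests (for instance with `scipy.stats.wilcoxon`, which is not used here); the function names are hypothetical. It relies on the closed form of the χ² survival function for an even number of degrees of freedom.

```python
import math


def chi2_sf_even_df(x, k):
    """P(X >= x) for a chi-square variable with 2k degrees of freedom.

    For even degrees of freedom this equals exp(-x/2) * sum_{j<k} (x/2)^j / j!.
    """
    half = x / 2.0
    term, total = 1.0, 1.0
    for j in range(1, k):
        term *= half / j
        total += term
    return math.exp(-half) * total


def fisher_combine(p_values):
    """Fisher's method: statistic -2 * sum(ln p_i), compared to chi-square with 2k df."""
    k = len(p_values)
    # Each p value must be strictly positive, otherwise the logarithm is undefined.
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return statistic, chi2_sf_even_df(statistic, k)


# Reproducing a combined p value from a reported statistic with k = 5.
p_trgmul = chi2_sf_even_df(8.98, 5)
reject_h0 = p_trgmul <= 0.005
```

With the statistic reported for trgmul under Meteor awareness (χ² = 8.98), this yields a combined p value of roughly 0.53, so the null hypothesis is not rejected at the 0.005 threshold.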

Results
To complement these tests, Figure 2 shows violin plots of the Meteor-awareness scores. These show that the translations generated by the trgmul and decinit systems are most likely to result in no difference in Meteor score between the congruent and incongruent conditions. We now turn our attention to the results of the probability-awareness evaluation. Images improve the quality of the trgmul system (χ² = 52.55, p < 0.0001), and images also improve the quality of the hierattn system (χ² = 622.03, p < 0.0001). Images make no difference to the quality of the decinit system (χ² = 6.49, p = 0.772).
Figure 3 shows examples of translations produced by the hierattn system for sentences paired with congruent / incongruent images. Figure 3 (a) shows an example with a high positive difference in Meteor score. The incongruent image causes the translation system to refer to an unseen Hawaiian shirt. In neither setting does the system translate the phrase "Mardi Gras". Figure 3 (b) shows an example with a negative difference in Meteor score. The congruent image results in a long translation with poor coverage of the reference, which Meteor punishes more severely than the shorter translation arising from the incongruent image. In neither setting does the model translate the prepositional phrase "on a very breezy California day".
Figure 3: Examples of the difference in Meteor awareness for the hierattn system. In each example, the source sentence is shown at the top and the reference sentence is shown at the bottom, both in Typewriter font. The congruent image is on the left, and the incongruent image is on the right.

Data problems
We posit that the current Multi30K training data does not necessarily require systems to use the visual context to solve the translation task. It has been noted that the German translation data was produced without showing the translators the images, and Frank et al. (2018) found that 13% of the Multi30K test data needed to be post-edited to reflect the joint semantics of both modalities. We recommend that the entirety of the German Multi30K training data should be post-edited so that future systems are more likely to require a joint understanding of the visual and linguistic context. We note that a similar issue was found in a visual question answering dataset, resulting in the creation of a new "balanced" dataset (Goyal et al., 2017).

The role of model architectures
The key difference between the systems evaluated in this paper is how they use the visual context. The hierattn system learns a timestep-dependent context vector over a location-preserving 3D volume of image features, whereas the trgmul and decinit systems use an average-pool of the 3D location-preserving features. In our evaluation, the only system that is aware of the congruent image data for both types of image-awareness is the hierattn system that learns a spatial context over the image. Learning to attend to specific regions of the image may prove to be crucial to improving translations with visual context.

Conclusion
We proposed an adversarial evaluation method to determine whether multimodal translation systems are aware of the visual context. This evaluation method measures the difference in the performance of a system given the congruent or an incongruent image as additional context. We found that two out of three publicly available multimodal translation systems were improved by the congruent visual context, when compared to the incongruent visual context. We encourage researchers to use this method to evaluate their own systems. Future work includes augmenting existing multimodal translation models with an additional adversarial objective that forces the model to perform better in the presence of the congruent image than a random incongruent image. We will also apply this evaluation method to other tasks that use additional context, e.g. images in visual question answering, or part-of-speech tags in neural machine translation.