Distilling Translations with Visual Awareness

Previous work on multimodal machine translation has shown that visual information is only needed in very specific cases, for example in the presence of ambiguous words where the textual context is not sufficient. As a consequence, models tend to learn to ignore this information. We propose a translate-and-refine approach to this problem where images are only used by a second-stage decoder. This approach is trained jointly to generate a good first draft translation and to improve over this draft by (i) making better use of the target language textual context (both left and right-side contexts) and (ii) making use of visual context. This approach leads to state-of-the-art results. Additionally, we show that it has the ability to recover from erroneous or missing words in the source language.


Introduction
Multimodal machine translation (MMT) is an area of research that addresses the task of translating texts using context from an additional modality, generally static images. The assumption is that the visual context can help ground the meaning of the text and, as a consequence, generate more adequate translations. Current work has focused on datasets of images paired with their descriptions, which are crowdsourced in English and then translated into different languages, namely the Multi30K dataset.
Results from the most recent evaluation campaigns in the area (Barrault et al., 2018) have shown that visual information can be helpful, as humans generally prefer translations generated by multimodal models over those generated by their text-only counterparts. However, previous work has also shown that images are only needed in very specific cases. This is also the case for humans. A previous study (see Figure 1) concluded that visual information is needed by humans in the presence of the following: incorrect source words, ambiguous source words, and gender-neutral words that need to be marked for gender in the target language. In an experiment where human translators were asked to first translate descriptions based on their textual context only and then revise their translation based on a corresponding image, the authors report that these three cases accounted for 62-77% of the revisions in the translations in two subsets of Multi30K.
Ambiguities are very frequent in Multi30K, as in most language corpora. Barrault et al. (2018) show that in its latest test set, 358 (German) and 438 (French) instances (out of 1,000) contain at least one word that has more than one translation in the training set. However, these do not always represent a challenge for translation models: often the textual context can easily disambiguate words (see baseline translation in Figure 4(a)); additionally, the models are naturally biased to generate the most frequent translation of the word, which by definition is the correct one in most cases.
The need to gender-mark words in a target language when translating from English can be thought of as a disambiguation problem, except that the textual context is often less telling and the frequency bias ends up playing a bigger role (see baseline translation in Figure 4(c)). This has been shown to be a common problem in neural machine translation (Vanmassenhove et al., 2018; Font and Costa-Jussà, 2019), as well as in areas such as image captioning (Hendricks et al., 2018) and co-reference resolution (Zhao et al., 2018).
Incorrect source words are common in Multi30K, as in many other crowdsourced or user-generated datasets. In this case the context may not be enough (see DE translation in Figure 1(c)). We posit that models should be robust to this type of noise and note that similar treatment would be required for out-of-vocabulary (OOV) words, i.e. correct words that are unknown to the model.
We propose an approach that takes into account the strengths of a text-only baseline model and only refines its translations when needed. Our approach is based on deliberation networks (Xia et al., 2017) to jointly learn to generate draft translations and refine them based on left and right-side target context as well as structured visual information. This approach outperforms previous work.
In order to further probe how well our models can address the three problems mentioned above, we perform a controlled experiment where we minimise the interference of the frequency bias by masking ambiguous and gender-related words, as well as randomly selected words (to simulate noise and OOV). This experiment shows that our multimodal refinement approach outperforms the text-only one in more complex linguistic setups.
Our main contributions are: (i) a novel approach to MMT based on deliberation networks and structured visual information which gives state-of-the-art results (Sections 3.2 and 5.1); (ii) a frequency bias-free investigation on the need for visual context in MMT (Sections 4.2 and 5.2); and (iii) a thorough investigation on different visual representations for transformer-based architectures (Section 3.3).

Related work
MMT: Approaches to MMT vary with regards to how they represent images and how they incorporate this information in the models. Initial approaches use RNN-based sequence to sequence models (Bahdanau et al., 2015) enhanced with a single, global image vector, extracted as one of the layers of a CNN trained for object classification (He et al., 2016), often the penultimate or final layer.
The image representation is integrated into the MT models by initialising the encoder or decoder (Elliott et al., 2015; Caglayan et al., 2017; Madhyastha et al., 2017); element-wise multiplication with the source word annotations (Caglayan et al., 2017); or projecting the image representation and encoder context to a common space to initialise the decoder. Elliott and Kádár (2017) and Helcl et al. (2018) instead model the source sentence and reconstruct the image representation jointly via multi-task learning.
An alternative way of exploring image representations is to have an attention mechanism (Bahdanau et al., 2015) on the output of the last convolutional layer of a CNN (Xu et al., 2015). The layer represents the activation of K different convolutional filters on evenly quantised N × N spatial regions of the image. Caglayan et al. (2017) learn the attention weights for both source text and visual encoders, while Delbrouck and Dupont (2017) combine both attentions independently via a gating scalar, and Libovický and Helcl (2017) and Helcl et al. (2018) apply a hierarchical attention distribution over two projected vectors where the attention for each is learnt independently.
Helcl et al. (2018) is the closest to our work: we also use a doubly-attentive transformer architecture and explore spatial visual information. However, we differ in two main aspects (Section 3): (i) our approach explores additional textual context through a second-pass decoding process and uses visual information only at this stage, and (ii) in addition to convolutional filters we use object-level visual information. The latter has only been explored to generate a single global representation (Grönroos et al., 2018) and used for example to initialise the encoder (Huang et al., 2016). We note that translation refinement is different from re-ranking the translations of a text-only model based on image representations (Shah et al., 2016; Hitschler et al., 2016), since the latter assumes that the correct translation can already be produced by a text-only model.
Caglayan et al. (2019) investigate the importance and the contribution of multimodality for MMT. They perform careful experiments using input degradation and observe that, especially under limited textual context, multimodal models exploit the visual input to generate better translations. Caglayan et al. (2019) also show that MMT systems exploit visual cues and obtain correct translations even with typographical errors in the source sentences.
In this paper, we build upon this idea and investigate the potential of visual cues for refining translation.
Translation refinement: The idea of treating machine translation as a two-step approach dates back to statistical models, e.g. in order to improve a draft sentence-level translation by exploring document-wide context through hill-climbing for local refinements (Hardmeier et al., 2012). Iterative refinement approaches have also been proposed that start with a draft translation and then predict discrete substitutions based on an attention mechanism (Novak et al., 2016), or that use non-autoregressive methods with a focus on speeding up decoding (Lee et al., 2018). Translation refinement can also be done through learning a separate model for automatic post-editing (Niehues et al., 2016; Junczys-Dowmunt and Grundkiewicz, 2017; Chatterjee et al., 2018), but this requires additional training data with draft translations and their correct versions.
An interesting approach is that of deliberation networks, which jointly train an encoder and first and second-stage decoders (Xia et al., 2017). The second-stage decoder has access to both left and right-side context and this has been shown to improve translation (Xia et al., 2017). We follow this approach as it offers a very flexible framework to incorporate additional information in the second-stage decoder.

Model
We base our model on the transformer architecture (Vaswani et al., 2017) for neural machine translation. Our implementation is a multilayer encoder-decoder architecture that uses the tensor2tensor (Vaswani et al., 2018) library. The encoder and decoder blocks are as follows:
Encoder Block (E): The encoder block comprises 6 layers, each containing two sublayers: a multi-head self-attention mechanism followed by a fully connected feed-forward neural network. We follow the standard implementation and employ residual connections between each layer, as well as layer normalisation. The output of the encoder forms the encoder memory, which consists of contextualised representations for each of the source tokens ($M_E$).
Decoder Block (D): The decoder block also comprises 6 layers. It contains an additional sublayer which performs multi-head attention over the outputs of the encoder block. Specifically, decoding layer $d_{l_i}$ is the result of (a) multi-head attention over the outputs of the encoder, which in turn is a function of the encoder memory and the outputs from the previous layer, $A_{D \to E} = f(M_E, d_{l_{i-1}})$, where the keys and values are the encoder outputs and the queries correspond to the decoder input, and (b) multi-head self-attention, which is a function of the generated outputs from the previous layer, $A_D = f(d_{l_{i-1}})$.

Deliberation networks
Deliberation networks (Xia et al., 2017) build on the standard sequence-to-sequence architecture by adding an additional decoder block (in our case, with 3 layers; see Figure 2). The additional decoder (also referred to as the second-pass decoder) is conditioned on the source and on sampled outputs from the standard transformer decoder (the first-pass decoder). More concretely, the second-pass decoder ($D'$) at layer $d_l$ consists of $A_{D'}$, $A_{D' \to E}$ and $A_{D' \to D}$, where $A_{D'}$ and $A_{D' \to E}$ are, as in the standard deliberation architecture, the self-attention and the multi-head attention over the encoder memory respectively, while $A_{D' \to D}$ is the multi-head attention over the outputs $O_d$ from the first-pass decoder ($D$). In our experiments, we obtain samples as a set of translations from the first-pass decoder using beam search. Given a translation candidate, $O_d$ consists of the first-pass decoder's hidden layer before the softmax, concatenated with the embeddings of the resulting words.
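The three attention sublayers of the second-pass decoder can be illustrated with a minimal, dependency-free sketch. It is not the tensor2tensor implementation: it uses a single attention head, omits causal masking, layer normalisation, feed-forward sublayers and the learned query/key/value projections, and the order of the sublayers is an assumption. The helper names (`attend`, `first_pass_memory`, `second_pass_layer`) and the projection `weight` that maps the concatenated $O_d$ back to the model dimension are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, memory):
    """Single-head scaled dot-product attention; `memory` serves as keys and values."""
    dim = len(memory[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in memory]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, memory)) for j in range(dim)])
    return outputs


def residual(x, y):
    """Element-wise sum of two equally shaped lists of vectors."""
    return [[a + b for a, b in zip(r, s)] for r, s in zip(x, y)]


def first_pass_memory(hidden, embeddings, weight):
    """Build O_d: concatenate each hidden state with its word embedding,
    then project to the model dimension with a (hypothetical) matrix."""
    concatenated = [h + e for h, e in zip(hidden, embeddings)]
    return [
        [sum(row[i] * weight[i][j] for i in range(len(row))) for j in range(len(weight[0]))]
        for row in concatenated
    ]


def second_pass_layer(prev, encoder_memory, o_d):
    """One simplified second-pass layer with its three attention sublayers."""
    h = residual(prev, attend(prev, prev))      # A_{D'}: self-attention (unmasked here)
    h = residual(h, attend(h, encoder_memory))  # A_{D'->E}: attention over encoder memory
    h = residual(h, attend(h, o_d))             # A_{D'->D}: attention over first-pass outputs
    return h
```

Because the second-pass layer attends over the complete first-pass output, every position can draw on both the left and the right target-side context of the draft.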

Multimodal transformer & deliberation
Our multimodal transformer models follow one of the two formulations below for conditioning translations on image information:
Additive image conditioning (AIC):
A projected image vector is added to each of the outputs of the encoder. The projection matrices are parameters that are jointly learned with the model.
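As a rough illustration of additive image conditioning, the sketch below projects an image vector to the model dimension and adds it to every encoder output. The function name and the toy dimensions are hypothetical; in the model itself the projection is learned jointly with the other parameters, whereas here it is simply passed in.

```python
def additive_image_conditioning(encoder_outputs, image_vector, projection):
    """Add a projected image vector to every encoder output position.

    encoder_outputs: list of d_model-dimensional vectors (one per source token)
    image_vector:    list of length d_image
    projection:      d_image x d_model matrix (learned in the real model)
    """
    d_model = len(projection[0])
    projected = [
        sum(image_vector[i] * projection[i][j] for i in range(len(image_vector)))
        for j in range(d_model)
    ]
    return [[h + p for h, p in zip(row, projected)] for row in encoder_outputs]
```

The same projected vector is broadcast over all source positions, so the image acts as a global bias on the encoder memory rather than as something the decoder attends to selectively.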

Attention over image features (AIF):
The model attends over image features, as in Helcl et al. (2018), where the decoder block now contains an additional cross-attention sublayer $A_{D \to V}$ which attends to the visual information ($V$). The keys and values correspond to the visual information.
Within the deliberation network framework, based on the previously discussed observation (Section 1) that images are only needed in a small number of cases, we propose to add visual cross-attention only to the second-pass decoder block (see Figure 2).

Image features
Motivated by previous work that indicates the importance of structured information from images (Caglayan et al., 2017), we focus on structural forms of image representations, including the spatially aware feature maps from CNNs and information extracted from automatic object detectors.
Spatial image features: We use spatial feature maps from the last convolutional layer of a pre-trained ResNet-50 (He et al., 2016) CNN-based image classifier for every image. These feature maps contain output activations for various filters while preserving spatial information. They have been used in various vision-to-language tasks including image captioning (Xu et al., 2015) and multimodal machine translation (Section 2). Our formulation for the integration of these features into the deliberation network is shown in Figure 2, setup (b). We use the AIF setup and refer to models that use this representation as att.
Object-based image features: We use a bag-of-objects representation where the objects are obtained using an off-the-shelf object detector (Kuznetsova et al., 2018) based on the Open Images dataset. This representation is a sparse 545-dimensional vector with the frequency of each of the 545 object categories in an image. This is inspired by previous research that investigates the potential of object-based information for vision-to-language tasks (Mitchell et al., 2012). We use the AIC setup and refer to models that use this representation as sum.
Object-based embedding features: The bag-of-objects representation makes it hard to exploit object-to-object similarity, since visual representations of different objects can be very different. To mitigate this, we propose a simple extension using bag-of-object embeddings. We represent each object using the pre-trained GloVe-based (Pennington et al., 2014) 50-dimensional word vectors for their categories (e.g. woman). We use the AIF-based setup and refer to models that use this representation as obj (Figure 2, setup (a)).
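The two object-based representations can be sketched as follows. This is only an illustration under stated assumptions: the four-category vocabulary and the two-dimensional embedding table are toy stand-ins for the 545 detector categories and the 50-dimensional GloVe vectors, and the function names are hypothetical. Whether repeated detections contribute repeated embeddings is also an assumption of this sketch.

```python
from collections import Counter


def bag_of_objects(detected_labels, vocabulary):
    """Frequency vector over a fixed object vocabulary (545 categories in the paper)."""
    counts = Counter(detected_labels)
    return [counts.get(label, 0) for label in vocabulary]


def object_embeddings(detected_labels, embedding_table):
    """One word vector per detected object; labels without a vector are skipped."""
    return [embedding_table[label] for label in detected_labels if label in embedding_table]


# Toy data: hypothetical vocabulary and embeddings for illustration only.
vocabulary = ["bike", "helmet", "person", "tree"]
embedding_table = {"person": [0.1, 0.3], "tree": [0.7, -0.2], "helmet": [0.0, 0.5]}
detected = ["person", "person", "tree", "helmet"]

sparse_vector = bag_of_objects(detected, vocabulary)        # input for the AIC (sum) setup
visual_memory = object_embeddings(detected, embedding_table)  # keys/values for the AIF (obj) setup
```

The count vector is a single global descriptor, which suits additive conditioning, while the list of embeddings gives the decoder a set of items to attend over.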

Experimental settings

Data
We build and test our MMT models on the Multi30K dataset. Each image in Multi30K is paired with one English (EN) description taken from Flickr30K (Young et al., 2014) and human translations into German (DE), French (FR) and Czech (Barrault et al., 2018). The dataset contains 29,000 instances for training, 1,014 for development, and 1,000 for test. We only experiment with German and French, which are languages for which we have in-house expertise for the type of analysis we present. In addition to the official Multi30K test set (test 2016), we also use the test set from the latest WMT evaluation competition, test 2018 (Barrault et al., 2018).

Degradation of source
In addition to using the Multi30K dataset as is (standard setup), we probe the ability of our models to address the three linguistic phenomena where additional context has been shown to be important (Section 1): ambiguities, gender-neutral words and noisy input. In a controlled experiment where we aim to remove the influence of frequency biases, we degrade the source sentences by replacing words with a placeholder according to three masking strategies: random source words, ambiguous source words and gender-unmarked source words. The procedure is applied to the train, validation and test sets. For the resulting dataset generated for each setting, we compare models having access to text-only context versus additional text and multimodal contexts. We seek to get insights into the contribution of each type of context to address each type of degradation.
Random content words: In this setting (RND) we simulate erroneous source words by randomly dropping source content words. We first tag the entire source sentences using the spaCy toolkit (Honnibal and Montani, 2017) and then drop nouns, verbs, adjectives and adverbs and replace these with a default BLANK token. By focusing on content words, we differ from previous work that suggests that neural machine translation is robust to non-content word noise in the source (Klubička et al., 2017).
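A minimal sketch of the RND degradation is given below. It assumes the sentence has already been POS-tagged with Universal POS tags (the coarse tags spaCy produces), so that no tagger model is needed to run it. The drop probability is a hypothetical parameter: the text does not state how many content words are dropped per sentence.

```python
import random

# Universal POS tags for the four content-word classes named in the text.
CONTENT_TAGS = {"NOUN", "VERB", "ADJ", "ADV"}


def mask_random_content_words(tagged_tokens, drop_prob, rng, blank="BLANK"):
    """Replace content words with `blank`, each with probability `drop_prob`.

    tagged_tokens: list of (token, pos_tag) pairs
    rng:           a random.Random instance, passed in for reproducibility
    """
    return [
        blank if tag in CONTENT_TAGS and rng.random() < drop_prob else token
        for token, tag in tagged_tokens
    ]


sentence = [("Two", "NUM"), ("men", "NOUN"), ("work", "VERB"),
            ("under", "ADP"), ("the", "DET"), ("hood", "NOUN")]
degraded = mask_random_content_words(sentence, drop_prob=0.5, rng=random.Random(0))
```

Function words are never touched, which keeps the degradation focused on the words that carry most of the visual content.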
Ambiguous words: In this setting (AMB), we rely on the MLT dataset, which provides a list of source words with multiple translations in the Multi30K training set. We replace ambiguous words with the BLANK token in the source language, which results in two language-specific datasets.
Person words: In this setting (PERS), we use the Flickr Entities dataset (Plummer et al., 2017) to identify all the words that were annotated by humans as corresponding to the category person. We then replace such source words with the BLANK token.
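Both the AMB and the PERS settings amount to masking every source word that appears in a given list. A minimal sketch, assuming the list has already been extracted from the corresponding resource: the word sets below are made-up stand-ins, not entries from MLT or Flickr Entities, and case-insensitive matching is an assumption.

```python
def mask_listed_words(tokens, targets, blank="BLANK"):
    """Replace every token that appears in `targets` (case-insensitively) with `blank`."""
    lowered = {word.lower() for word in targets}
    return [blank if token.lower() in lowered else token for token in tokens]


# Hypothetical word lists for illustration only.
person_words = {"boy", "biker", "farmers"}
ambiguous_words_de = {"field", "hat"}

tokens = "The boy is outside enjoying a summer day .".split()
pers_degraded = mask_listed_words(tokens, person_words)
```

For AMB a separate list would be used per target language, which is why that setting yields two language-specific datasets, whereas a single PERS list applies to both.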
The statistics of the resulting datasets for the three degradation strategies are shown in Table 1. We note that RND and PERS are the same for both language pairs, as the degradation only depends on the source side, while for AMB the words replaced depend on the target language.

Models
Based on the models described in Section 3 we experiment with eight variants: (a) baseline transformer model (base); (b) base with AIC (base+sum); (c) base with AIF using spatial (base+att) or object-based (base+obj) image features; (d) standard deliberation model (del); (e) deliberation models enriched with image information: del+sum, del+att and del+obj.

Training
In all cases, we optimise our models with cross-entropy loss. For deliberation network models, we first train the standard transformer model until convergence, and use it to initialise the encoder and first-pass decoder. For each of the training samples, we follow Xia et al. (2017) and obtain a set of 10-best samples from the first-pass decoder, with a beam search of size 10. We use these as the first-pass decoder samples. We use Adam as optimiser (Kingma and Ba, 2014) (Helcl et al., 2018). We built on the tensor2tensor implementation of deliberation nets in https://github.com/ustctf/delibnet using the transformer big parameters with a learning rate of 0.05 with 8K warmup steps for both the first and the second-pass decoders, and early stopping with a patience of 10 epochs based on the validation BLEU score.
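The early-stopping criterion mentioned above can be sketched as a small helper. This is an illustrative reading rather than the code used in the experiments: the function name is hypothetical, and treating a tie with the best score as "no improvement" is an assumption.

```python
def early_stopping_epoch(validation_bleu, patience=10):
    """Return (stop_epoch, best_epoch), both 0-based.

    Training stops at the first epoch that is `patience` epochs past the
    best validation BLEU seen so far; otherwise it runs to the last epoch.
    """
    best_score = float("-inf")
    best_epoch = -1
    for epoch, score in enumerate(validation_bleu):
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            return epoch, best_epoch
    return len(validation_bleu) - 1, best_epoch
```

The checkpoint kept for evaluation would be the one from `best_epoch`, not the one from the epoch at which training stops.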

Results
In this section we present results of our experiments, first in the original dataset without any source degradation (Section 5.1) and then in the setup with various source degradation strategies (Section 5.2). Table 2 shows the results of our main experiments on the 2016 and 2018 test sets for French and German. We use Meteor (Denkowski and Lavie, 2014) as the main metric, as in the WMT tasks (Barrault et al., 2018). We compare our transformer baseline to transformer models enriched with image information, as well as to the deliberation models, with or without image information.

Standard setup
We first note that our multimodal models achieve state-of-the-art performance for transformer networks (constrained models) on the English-German dataset, compared to Helcl et al. (2018). Second, our deliberation models lead to significant improvements over this baseline across test sets (average ∆METEOR = 1, ∆BLEU = 1).
Transformer-based models enriched with image information (base+sum, base+att and base+obj), on the other hand, show no major improvements with respect to the base performance. This is also the case for deliberation models with image information (del+sum, del+att, del+obj), which do not show significant improvement over the vanilla deliberation performance (del).
However, as has been shown in the WMT shared tasks on MMT (Barrault et al., 2018), automatic metrics often fail to capture nuances in translation quality, such as the ones we expect the visual modality to help with, which, according to human perception, lead to better translations. To test this assumption in our settings, we performed human evaluation involving professional translators and native speakers of both French and German (three annotators).
Figure 3:
(a) EN: Two men work under the hood of a white race car. base+att translates race car with Rennen (race), del with Auto (car) and del+obj with Rennwagen (race car). Objects: land, vehicle, car, wheel
(b) EN: A young child holding an oar paddling a blue kayak in a body of water.
base+att: Un jeune enfant tenant une rame dans un kayak bleu.
del: Un jeune enfant tenant une rame dans un kayak bleu sur un plan d'eau.
del+obj: Un jeune enfant tenant une rame dans un kayak bleu pagayant sur un plan d'eau.

The annotators were asked to rank randomly selected test samples according to how well they convey the meaning of the source, given the image (50 samples per language pair per annotator). For each source segment, the annotator was shown the outputs of three systems: base+att, the current MMT state-of-the-art (Helcl et al., 2018), del and del+obj. A rank could be assigned from 1 to 3, allowing ties (Bojar et al., 2017). Annotators could assign zero rank to all translations if they were judged incomprehensible. Following the common practice in WMT (Bojar et al., 2017), each system was then assigned a score which reflects the proportion of times it was judged to be better than or equal to the other systems.
Table 3 shows the human evaluation results. They are consistent with the automatic evaluation results when it comes to the preference of humans towards the deliberation-based setups, but show a more positive outlook regarding the addition of visual information (del+obj over del) for French.
Manual inspection of translations suggests that deliberation setups tend to improve both the grammaticality and adequacy of the first-pass outputs. For German, the most common modifications performed by the second-pass decoder are substitutions of adjectives and verbs (for test 2016, 15% and 12% respectively, of all the edit distance operations). Changes to adjectives are mainly grammatical, while changes to verbs are contextual (e.g., changing laufen to rennen: both verbs mean run, but the second refers to running very fast). For French, 15% of all the changes are substitutions of nouns (for test 2016). These are again very contextual. For example, the French word travailleur (worker) is replaced by ouvrier (manual worker) in contexts where tools, machinery or buildings are mentioned. For our analysis we again used spaCy.
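The ranking-based score used in the human evaluation (the proportion of times a system is judged better than or equal to the other systems) can be sketched as follows. This is one plausible reading, not the evaluation script itself: it assumes that rank 1 is the best rank, counts every pairwise comparison within a sample, and skips samples where all outputs received the zero rank. The function name and the toy rankings are hypothetical.

```python
from itertools import permutations


def better_or_equal_scores(rankings):
    """Proportion of pairwise comparisons in which each system is ranked
    better than or equal to the other system (lower rank = better)."""
    wins = {}
    totals = {}
    for ranks in rankings:
        if all(rank == 0 for rank in ranks.values()):
            continue  # assumption: incomprehensible samples are left out
        for system_a, system_b in permutations(ranks, 2):
            totals[system_a] = totals.get(system_a, 0) + 1
            if ranks[system_a] <= ranks[system_b]:
                wins[system_a] = wins.get(system_a, 0) + 1
    return {system: wins.get(system, 0) / totals[system] for system in totals}


# Made-up annotations for three samples; not data from the paper.
toy_rankings = [
    {"base+att": 3, "del": 2, "del+obj": 1},
    {"base+att": 2, "del": 1, "del+obj": 1},
    {"base+att": 0, "del": 0, "del+obj": 0},
]
scores = better_or_equal_scores(toy_rankings)
```

Because ties count in favour of both systems, the scores of the systems do not need to sum to one.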
The information on detected objects is particularly helpful for specific adequacy issues. Figure 3 demonstrates some such cases. In the first case, the base+att model misses the translation of race car: the German word Rennen translates only the word race. del introduces the word car (Auto) into the translation. Finally, del+obj correctly translates the expression race car (Rennwagen) by exploiting the object information. For French, del translates the source part in a body of water, missing from the base+att translation. del+obj additionally translated the word paddling according to the detected object Paddle.

Source degradation setup
Results of our source degradation experiments are shown in Table 4. A first observation is that, as with the standard setup, the performance of our deliberation models is overall better than that of the base models. The results of the multimodal models differ for German and French. For German, del+obj is the most successful configuration and shows statistically significant improvements over base for all setups. Moreover, for RND and AMB, it shows statistically significant improvements over del. However, especially for RND and AMB, del and del+sum are either the same or slightly worse than base. For French, all the deliberation models show statistically significant improvements over base (average ∆METEOR = 1, ∆BLEU = 1.1), but the image information added to del only improves scores significantly for test 2018 RND.
This difference in performance between French and German is potentially related to the need for more significant restructuring when translating from English into German (see footnote 7). This is where the more complex del+obj architecture is more helpful. This is especially true for the RND and AMB setups, where blanked words could also be verbs, the part of speech most influenced by word order differences between English and German (see the decreasing complexity of translations for del and del+obj for example (c) in Figure 4).
To get an insight into the contribution of different contexts to the resolution of blanks, we performed manual analysis of examples coming from the English-German base, del and del+obj setups (50 random examples per setup), where we count correctly translated blanks per system.
The results are shown in Table 5. As expected, they show that the RND and AMB blanks are more difficult to resolve (at most 40% resolved as compared to 61% for PERS). Translations of the majority of those blanks tend to be guessed from the textual context alone (especially for verbs). Image information is more helpful for PERS: we observe an increase of 10% in resolved blanks for del+obj as compared to del. However, for PERS the textual context is still enough in the majority of cases: models tend to associate men with sports or women with cooking and are usually right (see Figure 4 example (c)). The cases where the image helps seem to be those with rather generic contexts: see Figure 4 (b), where enjoying a summer day is not associated with any particular gender, which makes other models choose homme (man) or femme (woman), while only del+obj chooses enfant (child) (the option closest to the reference).
Footnote 7: English and French are both languages with subject-verb-object (SVO) sentence structure. German, on the other hand, can have subject-object-verb (SOV) constructions. For example, the German sentence Gestern bin ich in London gewesen (Yesterday have I to London been) would need to be restructured to Yesterday I have been to London in English.
In some cases detected objects are inaccurate or not precise enough to be helpful (e.g., when an object Person is detected) and can even harm correct translations.

Conclusions
We have proposed a novel approach to multimodal machine translation which makes better use of context, both textual and visual. Our results show that further exploring textual context through deliberation networks already leads to better results than the previous state of the art. Adding visual information, and in particular structural representations of this information, proved beneficial when input text contains noise and the language pair requires substantial restructuring from source to target. Our findings suggest that the combination of a deliberation approach and information from additional modalities is a promising direction for machine translation that is robust to noisy input. Our code and pre-processing scripts are available at https://github.com/ImperialNLP/MMT-Delib.

Figure 4:
(a) EN: Three farmers harvest rice out in a rice field. Example of a blank resolved by the textual context for AMB: field translated as Reisfeld (rice field) by base. del+obj incorrectly translated the blank into Reishut (rice hat) due to detected objects. Objects: person, clothing, mammal
(b) EN: The boy is outside enjoying a summer day. Example of a blank resolved by the multimodal context for PERS. The textual context is too generic and del+obj uses the detected objects to correctly translate boy into l'enfant (child). Objects: clothing, face, tree, boy, jeans
(c) EN: Dirt biker makes a sloping turn in a forest during the fall. Example of a blank resolved by the textual context for PERS. biker correctly translated into the Masc. form Geländemotorradfahrer (dirt biker) by base. Objects: person, tree, bike, helmet