Simultaneous Machine Translation with Visual Context

Simultaneous machine translation (SiMT) aims to translate a continuous input text stream into another language with the lowest latency and highest quality possible. The translation thus has to start with an incomplete source text, which is read progressively, creating the need for anticipation. In this paper, we seek to understand whether the addition of visual information can compensate for the missing source context. To this end, we analyse the impact of different multimodal approaches and visual features on state-of-the-art SiMT frameworks. Our results show that visual context is helpful and that visually-grounded models based on explicit object region information are much better than commonly used global features, reaching up to 3 BLEU points improvement under low latency scenarios. Our qualitative analysis illustrates cases where only the multimodal systems are able to translate correctly from English into gender-marked languages, as well as deal with differences in word order, such as adjective-noun placement between English and French.


Introduction
Simultaneous machine translation (SiMT) aims to reproduce human interpretation, where an interpreter translates spoken utterances as they are produced. The interpreter has to dynamically find the balance between how much context is needed to generate the translation reliably, and how long the listener has to wait for the translation. In contrast to consecutive machine translation where source sentences are available in their entirety before translation, the challenge in SiMT is thus the design of a strategy to find a good trade-off between the quality of the translation and the latency incurred in producing it. Previous work has considered rule-based strategies that rely on waiting until some constraint is satisfied, which includes approaches based on syntactic constraints (Bub et al., 1997; Ryu et al., 2006), segment/chunk/alignment information (Bangalore et al., 2012), heuristic-based conditions during decoding (Cho and Esipova, 2016) or deterministic policies with pre-determined latency constraints (Ma et al., 2019). An alternative line of research focuses on learning the decision policy: Gu et al. (2017) and Alinejad et al. (2018) frame SiMT as learning to generate READ/WRITE actions and employ reinforcement learning (RL) to formulate the problem as a policy agent interacting with its environment (i.e. a pre-trained MT model). Recent work has also explored supervised learning of the policy, by using oracle action sequences predicted by a pre-trained MT using confidence-based heuristics (Zheng et al., 2019) or external word aligners (Arthur et al., 2020) (details in §2).
Figure 1: An illustration of a 1-word latency system that makes use of visual grounding to resolve the gender of the article 'un' and to predict the noun 'chat' after reading its qualifier 'white'. The two symbols in the figure denote READ and WRITE, respectively.
Thus far, all prior research has focused on unimodal interpretation. In this paper, we explore SiMT for multimodal machine translation (MMT) (Specia et al., 2016), where in addition to the source sentence, we have access to visual information in the form of an image. We believe that having access to a complementary context should help the models anticipate the missing context (Figure 1) by grounding their decisions about 'when' and 'what' to translate. To test our hypothesis, we explore heuristic-based decoding and the fixed-latency wait-k policy, and investigate the effectiveness of different visual representations (§3).
Our contributions highlight that: (i) visual context offers up to 3 BLEU points improvement for low-latency wait-k policies, and consistently lowers the latency for wait-if-diff (Cho and Esipova, 2016) decoding, (ii) explicit object region features are more expressive than commonly used global visual features, (iii) training wait-k MMTs offers remarkably better grounding capabilities than decoding-only wait-k for linguistic phenomena such as gender resolution and adjective-noun ordering, and (iv) with twice the runtime speed of decoder-based visual attention, the encoder-based grounding is promising for application scenarios.

Multimodal Machine Translation (MMT)
MMT aims to improve the quality of automatic translation using auxiliary sources of information (Sulubacak et al., 2020). The most typical framework explored in previous work makes use of the images when translating their descriptions between languages, with the hypothesis that visual grounding could provide contextual cues to resolve linguistic phenomena such as word-sense disambiguation or gender marking.
Existing work often relies on visual features extracted from state-of-the-art CNN models pre-trained on large-scale visual tasks. The methods can be grouped into two branches depending on the feature type used: (i) multimodal attention (Calixto et al., 2016; Caglayan et al., 2016; Libovický and Helcl, 2017; Delbrouck and Dupont, 2017), which implements a soft attention (Bahdanau et al., 2014) over spatial feature maps, and (ii) multimodal interaction between a pooled visual feature vector and linguistic representations (Calixto and Liu, 2017; Caglayan et al., 2017a; Elliott and Kádár, 2017; Grönroos et al., 2018).

Simultaneous Neural MT
Simultaneous NMT was first explored by Cho and Esipova (2016) in a greedy decoding framework where heuristic waiting criteria are used to decide whether the model should read more source words or emit a target word. Gu et al. (2017) instead utilised a pre-trained NMT model in conjunction with a reinforcement learning agent whose goal is to learn a READ/WRITE policy by maximising quality and minimising latency. Alinejad et al. (2018) further extended the latter approach by adding a PREDICT action whose purpose is to anticipate the next source word.
A common property of the above approaches is their reliance on consecutive NMT models pre-trained on full sentences. Dalvi et al. (2018) pointed out a potential mismatch between the training and decoding regimens of such approaches and proposed fine-tuning the models using chunked data or prefix pairs. Ma et al. (2019) proposed an end-to-end, fixed-latency framework called 'wait-k' which allows prefix-to-prefix training using a deterministic policy: the agent starts by reading a specified number of source tokens (k), followed by alternating WRITE and READ actions. Arivazhagan et al. (2019) extended the wait-k framework using an advanced attention mechanism and optimising a differentiable latency metric (DAL). Recently, Arivazhagan et al. (2020) explored a radically different approach which enriches full-sentence training with prefix pairs (Niehues et al., 2018) and allows re-translation of previously committed target tokens to increase the translation quality.
Another line of research focuses on learning adaptive policies in a supervised way by using oracle READ/WRITE actions generated with heuristic or alignment-based approaches. Zheng et al. (2019) extracted action sequences from a pre-trained NMT model with a confidence-based heuristic and used them to train a separate policy network, while Arthur et al. (2020) explored jointly training the translation model and the policy with oracle sequences obtained from a word alignment model.
In this section, we first describe the underlying NMT architectures and baseline simultaneous MT approaches, and then introduce the proposed multimodal extensions to SiMT.

Baseline NMT
Our consecutive baseline consists of a 2-layer GRU (Cho et al., 2014) encoder and a 2-layer Conditional GRU decoder (Sennrich et al., 2017) with attention (Bahdanau et al., 2014). The encoder is unidirectional as source sentences would be progressively read. Given a source sequence of embeddings X = {x_1, ..., x_S} and a target sequence of embeddings Y = {y_1, ..., y_T}, the encoder first computes the sequence of hidden states H = {h_1, ..., h_S}. At a given timestep t of decoding, the output layer estimates the probability of the next target word y_t (equation 1). For a single training sample, we then maximise the joint likelihood of source and target sentences (equation 2). Following the formulation of Ma et al. (2019), g(t) in equation 2 is a function which returns the number of source tokens encoded so far when predicting the target token y_t. In the case of consecutive NMT, since all source tokens are observed before the decoder runs, g(t) is equal to the length of the source sentence, i.e. g(t) = |X|.
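For reference, the role of g(t) in the training objective can be written out as follows. This is a sketch that assumes the standard prefix-to-prefix factorisation of Ma et al. (2019); the symbols \mathcal{L} and \theta are introduced here for illustration only and may not match the notation of equation 2.

```latex
\mathcal{L}(\theta) \;=\; \sum_{t=1}^{T} \log p\!\left(y_t \,\middle|\, y_{<t},\; X_{\leq g(t)};\; \theta\right),
\qquad g(t) = |X| \quad \text{(consecutive NMT)}
```

Under this reading, a simultaneous policy only changes g(t): the decoder conditions on the source prefix X_{\leq g(t)} rather than on the full sentence.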

Incorporating the visual modality
We consider a setting where the visual context is static and is available in its entirety at encoding time. This is a realistic setting in many applications, for example, the simultaneous translation of news, where images (or video frames) are shown before the whole source stream is available. We consider the following ways of integrating visual information.
Object classification (OC) features are global image information extracted from convolutional feature maps, which are believed to capture spatial cues. These spatial features are extracted from the final convolution layer of a ResNet-50 CNN (He et al., 2016) trained on ImageNet (Deng et al., 2009) for object classification. An image is represented by a feature tensor V ∈ R^{8×8×2048}.
Object detection (OD) features are explicit object information where local regions in an image detected as objects are encoded by pooled feature vectors. These features are generated by the "bottom-up-top-down (BUTD)" (Anderson et al., 2018) extractor, which is a Faster R-CNN/ResNet-101 object detector (with 1600 object labels) trained on the Visual Genome dataset (Krishna et al., 2017). For a given image, the detector provides 36 object region proposals and extracts a pooled feature vector from each. An image is thus represented by a feature tensor V ∈ R^{36×2048}. We hypothesise that explicit object information can result in better referential grounding by using conceptually meaningful units rather than global features.

Multimodal architectures
Decoder attention (DEC-OC/OD). A standard way of integrating the visual modality into NMT is to apply a secondary attention at each decoding timestep (Calixto et al., 2016; Caglayan et al., 2016). We follow this approach to construct an MMT baseline. Specifically, equation 1 is extended so that the decoder attends to both the source hidden states H (eq. 3) and the visual features V (eq. 4), and they are added together to form the multimodal context vector c_t (eq. 5).
Multimodal encoder (ENC-OD). Instead of integrating the visual modality into the decoder, we propose to ground the source sentence representation within the encoder, similar to Delbrouck and Dupont (2017). We hypothesise that early visual integration could be more appropriate for SiMT to fill in the missing context. Our approach differs from Delbrouck and Dupont (2017) in the use of scaled-dot attention (Vaswani et al., 2017) and object detection (OD) features. The attention layer receives unidirectional hidden states H (for source states that were encoded/read so far) as the query and the visual features V as keys and values, i.e. it computes a mixture M of region features based on the cross-modal relevance. The final representation that will be used as input to equation 1 is defined as LAYERNORM(M + H) (Ba et al., 2016).
Regardless of the multimodal approach taken, all visual features are first linearly projected into the dimension of textual representations H. To make modality representations compatible in terms of magnitude statistics, we apply layer normalisation (Ba et al., 2016) on textual representations H and the previously projected visual representations V. A dropout (Srivastava et al., 2014) of p = 0.5 follows the layer normalisation.
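The encoder-side grounding described above can be illustrated with a small, dependency-free sketch. It is not the authors' implementation: the function names, the toy dimensions and the random inputs are assumptions made for illustration, and the learned gain/bias of layer normalisation, any separate query/key/value projections, and dropout are deliberately left out.

```python
import math
import random

def layer_norm(vec, eps=1e-6):
    """Normalise one vector to zero mean and unit variance (no learned gain/bias)."""
    mean = sum(vec) / len(vec)
    var = sum((x - mean) ** 2 for x in vec) / len(vec)
    return [(x - mean) / math.sqrt(var + eps) for x in vec]

def project(vec, W):
    """Multiply a vector of size d_in by a d_in x d_out matrix given as a list of rows."""
    return [sum(v * row[j] for v, row in zip(vec, W)) for j in range(len(W[0]))]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def ground_encoder_states(H, V, W_v):
    """H: source states read so far; V: region features; W_v: visual projection (assumed name)."""
    H = [layer_norm(h) for h in H]                # normalise textual states
    V = [layer_norm(project(v, W_v)) for v in V]  # project regions to the text size, then normalise
    d = len(H[0])
    grounded, weights = [], []
    for h in H:  # one query per source position read so far
        alpha = softmax([sum(a * b for a, b in zip(h, v)) / math.sqrt(d) for v in V])
        m = [sum(a * v[j] for a, v in zip(alpha, V)) for j in range(d)]  # mixture M of regions
        grounded.append(layer_norm([mj + hj for mj, hj in zip(m, h)]))  # LayerNorm(M + H)
        weights.append(alpha)
    return grounded, weights

random.seed(0)
S, R, d_v, d = 3, 4, 6, 5  # toy sizes; the text uses 36 regions of size 2048
H = [[random.gauss(0, 1) for _ in range(d)] for _ in range(S)]
V = [[random.gauss(0, 1) for _ in range(d_v)] for _ in range(R)]
W_v = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(d_v)]
grounded, weights = ground_encoder_states(H, V, W_v)
```

Because every source position queries the same fixed set of region features, the grounding can be computed for all positions read so far in one batched operation, which is one plausible reason for the runtime advantage reported for the encoder variant.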

Simultaneous MT approaches
This section summarises the SiMT approaches explored in this work: (i) the heuristic-based decoding approach wait-if-diff (Cho and Esipova, 2016), (ii) the wait-k policy (Ma et al., 2019), and (iii) the reinforcement learning (RL) policy (Gu et al., 2017). The first approach offers a heuristically guided latency while the second one fixes it to an arbitrary value. The third one learns a stochastic policy to find the desired quality-latency trade-off. Before going into the details of these methods, we first introduce the common metrics used to measure the latency of a given SiMT model.

Latency metrics
Average proportion (AP) is the very first metric used for latency measurement in the literature (Cho and Esipova, 2016). AP computes a normalised score between 0 and 1, which is the average number of source tokens required to commit a translation. AP produces different scores for two samples when the underlying latency is actually the same but the source and target sentence lengths differ. To remedy this, Ma et al. (2019) propose Average Lagging (AL), which estimates the number of tokens the "writer" is lagging behind the "reader", as a function of the number of input tokens read. In AL, τ denotes the timestep where the entire source sentence has been read, as the authors state that the subsequent timesteps do not incur further delay. Finally, Consecutive Wait (CW) (Gu et al., 2017) measures how many source tokens are consecutively read between committing two translations.
Decoding-only mode. A wait-k model is denoted as "trained" if it is both trained and decoded using the wait-k algorithm. It is also possible to take a pre-trained consecutive NMT or MMT model and apply the wait-k algorithm at decoding time, i.e. during greedy search.
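The three latency metrics can be sketched in a few lines, taking as input the list g of source tokens read when each target token is written. The AP and AL formulas below follow the definitions in Cho and Esipova (2016) and Ma et al. (2019) as commonly stated; the CW function averages only over steps with at least one READ, which is an assumption about how the averaging is done. Function names are illustrative.

```python
def average_proportion(g, src_len):
    """AP: sum of g(t) over target steps, normalised by |X| * |Y|."""
    return sum(g) / (src_len * len(g))

def average_lagging(g, src_len):
    """AL: mean of g(t) - (t - 1) / gamma up to tau, the first step with the full source read."""
    gamma = len(g) / src_len  # target-to-source length ratio
    # Fall back to the last target step if the source is never fully read.
    tau = next((t for t, read in enumerate(g, start=1) if read == src_len), len(g))
    return sum(g[t - 1] - (t - 1) / gamma for t in range(1, tau + 1)) / tau

def consecutive_wait(g):
    """CW (assumed averaging): mean length of the READ runs that precede a WRITE."""
    runs = [cur - prev for prev, cur in zip([0] + g[:-1], g)]
    waits = [r for r in runs if r > 0]
    return sum(waits) / len(waits)

wait1 = [1, 2, 3, 4]        # wait-1 schedule on a 4-token source and 4-token target
consecutive = [4, 4, 4, 4]  # full sentence read before the first WRITE

ap_wait1 = average_proportion(wait1, 4)
al_wait1 = average_lagging(wait1, 4)
al_consecutive = average_lagging(consecutive, 4)
cw_consecutive = consecutive_wait(consecutive)
```

With equal source and target lengths, the wait-1 schedule gives an AL of exactly 1 token, while reading the full 4-token sentence first gives an AL of 4.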

Wait-if decoding
Cho and Esipova (2016) propose two decoding algorithms which can be directly applied on a pre-trained consecutive NMT model, similar to the consecutive wait-k decoding. These algorithms have two hyper-parameters, namely the number of initial source tokens to read (k) before starting the translation and the number of further tokens to read (δ) if the algorithm decides to wait for more context. We specifically use the wait-if-diff (WID) variant, which reads more tokens if the current most likely target word changes when doing so. We intentionally left out the wait-if-worse (WIW) approach as it exhibits very high latency.
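The wait-if-diff criterion can be sketched as a short decoding loop. The `predict` callable is a hypothetical interface standing in for greedy next-word prediction with a pre-trained NMT model, and `toy_predict` is a made-up lookup used only to exercise the loop; neither reflects the authors' code.

```python
def wait_if_diff(source, predict, k=1, delta=1, max_len=50):
    """Greedy wait-if-diff decoding.

    predict(src_prefix, tgt_prefix) -> most likely next target word (hypothetical interface).
    """
    read = min(k, len(source))
    target, actions = [], ["READ"] * read
    while len(target) < max_len:
        current = predict(source[:read], target)
        if read < len(source):
            new_read = min(read + delta, len(source))
            lookahead = predict(source[:new_read], target)
            if lookahead != current:
                # The best word changes with more context: wait and re-check.
                actions += ["READ"] * (new_read - read)
                read = new_read
                continue
        target.append(current)
        actions.append("WRITE")
        if current == "</s>":
            break
    return target, actions

def toy_predict(src_prefix, tgt_prefix):
    """Made-up model: it cannot name the noun until it has read 'cat'."""
    t = len(tgt_prefix)
    if t == 1 and "cat" not in src_prefix:
        return ["homme", "chien"][len(src_prefix) - 1]
    return ["un", "chat", "blanc", "</s>"][t]

target, actions = wait_if_diff(["a", "white", "cat"], toy_predict, k=1, delta=1)
```

In this toy run the decoder writes 'un' after one token, then reads twice because the predicted noun keeps changing, and only then commits 'chat'.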

Reinforcement learning
where C_t denotes the CW metric, introduced here to avoid long consecutive waits, and D_t refers to AP (see §3.4.1 for metrics). D* and C* are hyperparameters that determine the expected/target values for AP and CW, respectively. The optimal quality-latency trade-off is achieved by balancing the two reward terms.

Experimental Setup

Dataset
We use the Multi30k dataset (Elliott et al., 2016), which has been the primary corpus for MMT research across the three shared tasks of the "Conference on Machine Translation (WMT)" (Specia et al., 2016; Elliott et al., 2017; Barrault et al., 2018). Multi30k extends the Flickr30k image captioning dataset (Young et al., 2014) to provide caption translations in German, French and Czech.
In this work, we focus on the English→German and English→French (Elliott et al., 2017) language directions (Table 1). We use flickr2016 (2016), flickr2017 (2017) and coco2017 (COCO) for model evaluation. The latter test set is explicitly designed (Elliott et al., 2017) to contain at least one ambiguous word per sentence, which makes it appealing for MMT experiments.
Preprocessing. We use Moses scripts (Koehn et al., 2007) to lowercase, punctuation-normalise and tokenise the sentences with hyphen splitting. We then create word vocabularies on the training subset of the dataset. We did not use subword segmentation, in order to avoid its potential side effects on SiMT and to be able to analyse the grounding capability of the models better. The resulting English, French and German vocabularies contain 9.8K, 11K and 18K tokens, respectively.

Reproducibility
Hyperparameters. The dimensions of embeddings and GRU hidden states are set to 200 and 320, respectively. The decoder's input and output embeddings are shared (Press and Wolf, 2017). We use ADAM (Kingma and Ba, 2014) as the optimiser and set the learning rate and mini-batch size to 0.0004 and 64, respectively. A weight decay of 1e−5 is applied for regularisation. We clip the gradients if the norm of the full parameter vector exceeds 1 (Pascanu et al., 2013). For the RL baseline, we closely follow Gu et al. (2017). The agent is implemented by a 320-dimensional GRU followed by a softmax layer, and the baseline network, used for variance reduction of the policy gradient, is similar to the agent except with a scalar output layer. We use ADAM as the optimiser and set the learning rate and mini-batch size to 0.0004 and 6, respectively. For each sentence pair in a batch, ten trajectories are sampled. For inference, greedy sampling is used to pick action sequences. We set the hyperparameters C*=2, D*=0.3, α=0.025 and β=−1. To encourage exploration, the negative entropy policy term is weighted empirically with 0.1 and 0.3 for the En→Fr and En→De directions, respectively.
Training. We use nmtpytorch (Caglayan et al., 2017b) with PyTorch (Paszke et al., 2019) v1.4 for our experiments. We train each model for a maximum of 50 epochs and early stop the training if validation BLEU (Papineni et al., 2002) does not improve for 10 epochs. We also halve the learning rate if no improvement is obtained for two epochs. On a single NVIDIA RTX2080-Ti GPU, it takes around 35 minutes for the unimodal and multimodal encoder variants to complete training, whereas the decoder attention variant requires around twice that time. The number of learnable parameters is between 6.9M and 9.4M depending on the language pair and the type of multimodality.
For the RL baseline, we choose the model that maximises the quality-to-latency ratio (BLEU/AL) on the validation set with patience set to ten epochs. The number of learnable parameters is around 6M.

Evaluation
To mitigate variance in results due to different initialisations, we repeat each experiment three times with different random seeds. Following previous work, we decode translations with greedy search, using the checkpoint that achieved the lowest perplexity. We report average BLEU scores across three runs using sacreBLEU (Post, 2018), which is also used for computing sentence-level scores for the oracle experiments.

Consecutive baselines
We first present the impact of the visual integration approaches on consecutive NMT performance (Table 2). We observe that the decoder attention using object detection features (DEC-OD) performs better than the other variants. We also see that the improvements on the flickr2017 (⇑ 0.5) and coco2017 (⇑ 1.03) test sets are higher than on flickr2016 (⇑ 0.1) on average. A possible explanation is that flickr2017 and coco2017 are more distant from the training set distribution (higher OOV count, see Table 1) and thus there is more room for improvement with the visual cues. In summary, unlike previous conclusions in MMT where improvements were not found to be substantial (Grönroos et al., 2018; Caglayan et al., 2019), we observe that the benefit of the visual modality is more pronounced here. We believe that this is due to (i) the encoder now being unidirectional, unlike in state-of-the-art NMT, (ii) the modality representations being passed through layer normalisation (Ba et al., 2016), and (iii) the representational power of OD features.

Unimodal SiMT baselines
We now compare unimodal SiMT approaches to get an initial understanding of how they perform on Multi30k. Figure 2 contrasts AL and BLEU for three trained wait-k systems, wait-if-diff (WID) decoding with k ∈ {1, 2} and δ=1, reinforcement learning (RL) and the consecutive NMT. These configurations are chosen particularly to satisfy a low-latency regimen. The results suggest that wait-k models offer good translation quality for fixed latency. The RL-based policy, however, is not able to surpass wait-k models. Finally, WID decoding exhibits the worst performance according to BLEU. Given the difficulty in finding stable hyper-parameters for the RL models, we leave the integration of RL to MMT for future work and explore the wait-k and WID approaches in what follows.

Wait-k training for MMT
We present results with trained wait-k MMTs with k ∈ {1, 2, 3, 5, 7}. Figure 3 plots a summary of the gains obtained by three MMT variants with respect to the unimodal wait-k. We observe that as k increases, the gains due to the visual modality decrease globally, in line with the findings of Caglayan et al. (2019). This phenomenon is more visible for German, which exhibits a 1 BLEU point drop consistently across all models and two test sets at k=7. We hypothesise that this instability is probably due to the interplay of several factors for German, including the high OOV rates, the rich morphology, and source sentences being slightly longer than target sentences on average. The latter is a major issue for trained wait-k systems since the source sentences may not have been fully observed during training, preventing the decoder from learning about source <EOS> markers. For French, the results are much more encouraging as the improvements are larger and still observed with larger k values. From a multimodal perspective, as with the consecutive models, the DEC-OD system has the best performance: it is beneficial for all values of k in French, and it shows the largest gains in German for k ∈ {1, 2}. From a runtime perspective, encoder-based attention benefits heavily from batched matrix operations and runs at almost the same speed as a unimodal NMT, thus encouraging us to focus more on that in the future.
Qualitative examples. Table 3 shows some examples regarding the impact of the visual modality for the wait-1 policy. In the first example, the image assists in predicting the correct article eine (feminine 'a') instead of ein (masculine 'a') in German. Upon inspecting the attention over object regions, we observe that the region that obtained the highest probability (p=0.2) when predicting eine is labelled with 'woman' by the object detector. In the second example, we observe a biased anticipation case where the NMT system had to emit a wrong translation chien ('dog') before seeing the noun 'bird'. However, the multimodal model successfully leveraged the visual context for anticipation and correctly handled the adjective-noun placement phenomenon. Once again, the attention distribution confirms that when generating the first two words un and oiseau ('bird'), the model correctly attends to the object regions corresponding to 'bird' (with p=0.22 and p=0.14, respectively).

Trained vs. decoding-only SiMT
We are now interested in how trained wait-k MMTs compare to decoding-only wait-k and the wait-if-diff (WID) heuristic under low latency. Figure 4 summarises the latency vs. quality trade-off across all languages and test sets. First of all, the latency of the heuristic WID approach is consistently lower with visual information, while its translation quality is retained across the board. Second, the translation quality of both trained and decoding-only wait-k policies improves with multimodality.
Interestingly, although Ma et al. (2019) show that trained wait-k models are substantially better than decoding-only ones for a news translation task, here we observe quite the opposite: in almost all cases there exists a shift between these approaches which favours the decoding-only approach for small k values. Zheng et al. (2019) observed a similar phenomenon for their wait-1 and wait-2 textual SiMT models. To investigate further, we compute an adaptive low-latency wait-k oracle for k ∈ {1, 2, 3}. Specifically, for a given model, we first select a representative hypothesis across the three runs using median sentence-level BLEU. We then pick the hypothesis with best BLEU (and …
Table 5 shows that the unimodal NMT has no way of anticipating the context to resolve this kind of gender ambiguity, and therefore always picks the masculine version of the article. This is a clear example of models reflecting biases in the training data. In fact, 69.4% of all training instances starting with an indefinite article in French have the masculine realisation of the article (un) instead of its feminine counterpart (une). The results also make it clear that decoding-only wait-k systems are not as successful as their trained counterparts when it comes to incorporating the visual modality, and that the explicit object information is more expressive than global object features. At k = 2, however, all systems eventually reach 100% accuracy.

Table 5: Gender resolution accuracy of decoding-only and trained wait-1 systems when translating sentences starting with 'a woman ...' into French.

Conclusion
We present the first thorough investigation of the utility of visual context for the task of simultaneous machine translation. Our experiments reveal that integrating visual context lowers the latency for heuristic policies while retaining the quality of the translations. Under low-latency wait-k policies, the visual cues are highly impactful, with quality improvements of almost 3 BLEU points compared to unimodal baselines. From a multimodal perspective, we introduce effective ways of integrating visual features and show that explicit object region information consistently outperforms commonly used global features. Our qualitative analysis illustrates that the models are capable of resolving linguistic particularities, including gender marking and word order handling, by exploiting the associated visual cues. We hope that future research continues this line of work, especially by finding novel ways to devise adaptive policies, such as reinforcement learning models with the visual modality. We believe that our work can also benefit research in multimodal speech translation (Niehues et al., 2019), where the audio stream is accompanied by a video stream.

Figure 3: BLEU comparison of trained wait-k MMT systems: the vertical axes of each subplot represent the improvement with respect to the unimodal wait-k.

Table 3: Examples showing the effectiveness of DEC-OD MMT (wait-1) for gender marking (top) and adjective-noun placement (bottom). Underlined and bold represent wrong and correct word choices, respectively.

Figure 4: Comparison of trained vs. decoding-only SiMT approaches: light and dark colors denote unimodal and multimodal (DEC-OD) systems, respectively.
Ma et al. (2019) propose a simple deterministic policy which relies on training and decoding an NMT model in a prefix-to-prefix fashion. Specifically, a wait-k model starts by reading k source tokens and writes the first target token. The model then reads and writes one token at a time to complete the translation process. This implies that the attention layer will now attend to a partial textual representation H_{≤g(t)} instead of H, with g(t) redefined as min(k + t − 1, |X|) (eq. 1 and 2).
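The wait-k schedule can be made concrete with a short sketch that computes g(t) = min(k + t − 1, |X|) and expands it into a READ/WRITE action sequence. The function names are illustrative assumptions, and the sketch only models the policy, not the translation model itself.

```python
def wait_k_g(t, k, src_len):
    """Number of source tokens read when predicting target token t (1-indexed)."""
    return min(k + t - 1, src_len)

def wait_k_actions(k, src_len, tgt_len):
    """Expand the wait-k schedule into an explicit READ/WRITE sequence."""
    actions, read = [], 0
    for t in range(1, tgt_len + 1):
        need = wait_k_g(t, k, src_len)
        actions += ["READ"] * (need - read)  # read until g(t) tokens are available
        read = need
        actions.append("WRITE")              # then commit target token t
    return actions

wait1 = wait_k_actions(k=1, src_len=3, tgt_len=3)
wait2 = wait_k_actions(k=2, src_len=3, tgt_len=4)
```

Once the source is exhausted, g(t) stays at |X| and the remaining steps are pure WRITE actions, as the tail of the wait-2 example shows.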

Table 1: Multi30k statistics: T/S and OOV are the average target-to-source sentence length ratio and the % of sentence pairs with at least 1 unknown token.

Gu et al. (2017) frame SiMT as a sequence of READ or WRITE actions and aim to learn a reinforcement learning (RL) strategy with a reward function taking into account both quality and latency. Following standard RL, the framework is composed of an environment and an agent. The environment is a pre-trained NMT system which is not updated during RL training. The agent is a GRU that parameterises a stochastic policy which decides on the action a_t by receiving as input the observation o_t. The observation o_t is defined as [c_t; d_t; y_t], i.e. …

Table 2: Multimodal gains in BLEU for consecutive baselines: the DEC-OD system exhibits the best average improvements.

Table 4: BLEU (AL) oracles for low-latency decoding-only (first line) and trained wait-k (second line).

Gender resolution accuracy. Motivated by the qualitative examples in §5.3, we further investigate how accurate English→French MMT variants are at choosing the correct indefinite article une when translating sentences beginning with 'a woman'.
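One simple way to measure this kind of accuracy is to check the first token of each hypothesis whose source starts with the phrase of interest. The sketch below assumes lowercased, tokenised text; the function name and the three sentence pairs are made up for illustration and are not drawn from Multi30k or from the authors' evaluation code.

```python
def article_accuracy(pairs, source_prefix="a woman", expected="une"):
    """Percentage of relevant hypotheses whose first token is the expected article.

    pairs: iterable of (source, hypothesis) strings, lowercased and tokenised.
    Returns None when no source sentence starts with source_prefix.
    """
    relevant = [hyp for src, hyp in pairs if src.startswith(source_prefix + " ")]
    if not relevant:
        return None
    correct = sum(hyp.split()[:1] == [expected] for hyp in relevant)
    return 100.0 * correct / len(relevant)

# Made-up examples, for illustration only.
pairs = [
    ("a woman is reading a book .", "une femme lit un livre ."),
    ("a woman walks her dog .", "un femme promène son chien ."),
    ("a man is running .", "un homme court ."),
]
accuracy = article_accuracy(pairs)
```

Only the first two pairs are counted here, and one of them starts with the correct article, so the sketch reports 50% on this toy input.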