Feature-level Incongruence Reduction for Multimodal Translation

Caption translation aims to translate image annotations (captions for short). Recently, Multimodal Neural Machine Translation (MNMT) has been explored as an essential solution. Besides the linguistic features of captions, MNMT allows visual (image) features to be used. The integration of multimodal features reinforces the semantic representation and considerably improves translation performance. However, MNMT suffers from incongruence between visual and linguistic features. To overcome this problem, we propose to extend the MNMT architecture with a harmonization network, which harmonizes multimodal features (linguistic and visual) by unidirectional modal space conversion, enabling multimodal translation to be carried out in a seemingly monomodal translation pipeline. We experiment on the gold-standard Multi30k-16 and Multi30k-17 datasets. Experimental results show that, compared to the baseline, the proposed method yields improvements of up to 2.2% BLEU for translating English captions into German (En→De), 7.6% for English-to-French translation (En→Fr) and 1.5% for English-to-Czech (En→Cz). The harmonization network leads to performance competitive with the state of the art.


Introduction
Caption translation is required to translate a source-language caption into the target language, where a caption refers to the sentence-level text annotation of an image. As defined in the shared multimodal translation task in WMT (http://www.statmt.org/wmt16/), caption translation can be conducted over both visual features in images and linguistic features of the accompanying captions. The question of how to opportunely utilize images for caption translation motivates the study of multimodality, including not only the extraction of visual features but also the cooperation between visual and linguistic features. In this paper, we follow previous work and boil caption translation down to a problem of multimodal machine translation.
So far, a large majority of previous studies tend to develop neural network based multimodal machine translation models (viz., MNMT), which consist of three basic components:
• Image encoder, which characterizes a captioned image as a vector of global or multi-regional visual features using a convolutional neural network (CNN) (Huang et al., 2016).
• Neural translation network (Caglayan et al., 2016; Sutskever et al., 2014; Bahdanau et al., 2014), which serves both to encode a source-language caption and to generate the target-language caption by decoding, where the latent information that flows through the network is referred to as the linguistic feature.
• Multimodal learning network, which uses visual features to enhance the encoding of linguistic semantics (Ngiam et al., 2011). Besides the concatenation and combination of linguistic and visual features, vision-to-language attention mechanisms serve as the essential operations for cross-modality learning. Nowadays, they are implemented with single-layer attentive (Caglayan et al., 2017a; Calixto et al., 2017b), doubly-attentive (Calixto et al., 2017a), interpolated (Hitschler et al., 2016) and multi-task (Zhou et al., 2018) neural networks, respectively.
Multimodal learning networks have been successfully grounded with different parts of various neural translation networks and are proven effective in enhancing translation performance. Nevertheless, the networks suffer from incongruence between visual and linguistic features because:
• Visual and linguistic features are projected into incompatible semantic spaces and therefore fail to correspond to each other.

[Figure 1: An example in which image captioning contributes to the reduction of incongruence. Ground-truth captions: "Two brown horses pulling a sleigh through snow." and "Sled dogs running and pulling a sled." Counterfeit (image captioning output): "Two brown horses running and pulling a sled."]
• Linguistic features are sequence-dependent, which is attributable to pragmatics, syntax or even rhetoric. On the contrary, visual features are sequence-independent but position-sensitive, which is attributable to the spatial relationships of visual elements. Thus, only a limited number of visual features can be directly used to improve the understanding of linguistic features and translation.
Considering Figure 1 (where "Counterfeit" denotes the image captioning output), the visual features enable an image processing model to recognize "two horses" as well as their position relative to a "sleigh". However, such features are obscure for a translation model, though useful for translating a verb such as "pulling" in the caption. In this case, the incongruence of heterogeneous features results from unawareness of the correspondence between the spatial relationship ("running horses" ahead of a "sleigh") and the linguistic semantics ("pulling").
To ease the incongruence, we propose to equip the current MNMT with a harmonization network, in which visual features are not directly introduced into the encoding of linguistic semantics. Instead, they are transformed into linguistic features before being absorbed into semantic representations. In other words, we make a detour during cross-modality understanding so as to bypass the modality barrier (Figure 2). In our experiments, we employ a captioning model to conduct harmonization: the hidden states it produces for decoding caption words are intercepted and incorporated into the representation learning process of MNMT.
The rest of the paper is organized as follows: Section 2 presents the motivation and methodological framework. Section 3 gives the NMT model we use. In Section 4, we introduce the captioning model that is trainable for cross-modality feature space transformation. Section 5 presents the captioning-based harmonization networks as well as the resultant MNMT models. We discuss test results in Section 6 and overview the related work in Section 7. We conclude the paper in Section 8.

Fundamentals and Methodological Framework
We utilize Anderson et al. (2018)'s image captioning model (CAP for short) to guide the cross-modality feature transformation, converting visual features into linguistic ones. CAP is one of the generation models that are specially trained to generate language conditioned on the visual features of images. Ideally, during training, it learns to perceive the correspondence between visual and linguistic features, such as that between the spatial relationship of "running dogs ahead of a sled" in Figure 1 and the meaning of the verb "pulling". This allows CAP to produce appropriate linguistic features during testing for similar visual features, such as when predicting the verb "pulling" for the scenario of "running horses ahead of a sleigh". Methodologically speaking, we adopt the linguistic features produced by the encoder of CAP instead of the captions generated by the decoder of CAP. On this basis, we integrate both the linguistic features of the original source-language caption and those produced by CAP into Calixto et al. (2017b)'s attention-based cross-modality learning model (see Figure 3). Experimental results show that the learning model substantially improves Bahdanau et al. (2014)'s encoder-decoder NMT system, which is equipped with a BiGRU encoder and a Conditional GRU (CGRU) decoder (Caglayan et al., 2017a). An attention mechanism is used between the BiGRU and the CGRU. The diagram at the right side of Figure 3 shows the baseline framework.
For a source-language caption, we represent it with a sequence of randomly-initialized (Kalchbrenner and Blunsom, 2013) word embeddings X = (x_1, ..., x_N), where each x_t is uniformly specified as a k-dimensional word embedding. Conditioned on the embeddings, Chung et al. (2014)'s BiGRU is used to compute the bidirectional hidden states S = (s_1, ..., s_N), where each s_t is obtained by concatenating the t-th hidden state of the forward GRU and that of the backward GRU: s_t = [→s_t; ←s_t]. Padding and dynamic stabilization (Ba et al., 2016) are used.
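The encoder described above can be sketched in plain numpy. This is a minimal illustration, not the paper's implementation: the GRU gate equations follow the standard formulation, all dimensions (k = 128, d = 256) mirror the hyperparameters reported later, and parameter names are ours.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_cell(x, h_prev, p):
    """One GRU step: update gate z, reset gate r, candidate state."""
    z = sigmoid(x @ p["Wz"] + h_prev @ p["Uz"])
    r = sigmoid(x @ p["Wr"] + h_prev @ p["Ur"])
    h_cand = np.tanh(x @ p["Wh"] + (r * h_prev) @ p["Uh"])
    return (1.0 - z) * h_prev + z * h_cand

def make_params(k, d, rng):
    # W* matrices act on the input (k rows), U* on the hidden state (d rows).
    return {name: rng.normal(scale=0.1, size=(k if name[0] == "W" else d, d))
            for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}

def bigru_encode(X, fwd, bwd, d):
    """S = (s_1, ..., s_N); each s_t concatenates forward and backward states."""
    N = X.shape[0]
    h_f = np.zeros(d)
    forward = []
    for t in range(N):
        h_f = gru_cell(X[t], h_f, fwd)
        forward.append(h_f)
    h_b = np.zeros(d)
    backward = [None] * N
    for t in reversed(range(N)):
        h_b = gru_cell(X[t], h_b, bwd)
        backward[t] = h_b
    return np.stack([np.concatenate([f, b]) for f, b in zip(forward, backward)])

rng = np.random.default_rng(0)
k, d, N = 128, 256, 6                   # embedding dim, hidden dim, caption length
X = rng.normal(size=(N, k))             # randomly initialized word embeddings
S = bigru_encode(X, make_params(k, d, rng), make_params(k, d, rng), d)
print(S.shape)                          # N states, each of dimension 2d
```

Each row of S is one bidirectional hidden state s_t, which the decoder attends over.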
Firat and Cho (2016)'s CGRU is utilized for decoding. It comprises two forward GRU units, GRU^{d1} and GRU^{d2}. The inattentive hidden state h^{d1}_t is computed by GRU^{d1} based on the output state h^{d1}_{t-1} and the prediction y_{t-1} at the previous time step: h^{d1}_t = GRU^{d1}(h^{d1}_{t-1}, y_{t-1}), where y_t denotes the k-dimensional embedding of the word predicted at the t-th decoding step. By contrast, GRU^{d2} serves to produce the attentive decoder state h^{d2}_t, which is computed conditioned on the previous attentive state h^{d2}_{t-1}, the current inattentive state h^{d1}_t, as well as the current attention-aware context c_t: h^{d2}_t = GRU^{d2}(h^{d2}_{t-1}, h^{d1}_t, c_t). The context c_t is obtained by the attention mechanism over the global encoder hidden states S: c_t = α_t S, where α_t denotes the attention weights at the t-th time step. Eventually, the prediction of each target-language word is carried out as follows (where W_h, W_c, W_y, b_o and b_y are trainable parameters): p(y_t) = softmax(tanh(W_h h^{d2}_t + W_c c_t + W_y y_{t-1} + b_o) + b_y).

Preliminary 2: Image-dependent Linguistic Feature Acquisition by CAP

For an image, captioning models serve to generate a natural-language sequence (a caption) that describes the image. Such models are capable of transforming visual features into linguistic features through encoder-decoder networks. We utilize Anderson et al. (2018)'s CAP to obtain the transformed linguistic features.
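The attention and prediction steps can be sketched as follows. This is a numpy illustration under stated assumptions: the additive (Bahdanau-style) scoring function and all dimensions are ours, and the output layer is a simplified stand-in for the CGRU readout rather than the exact parameterization above.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def attention_context(h, S, Wa, Ua, va):
    """Additive attention: align decoder state h with every encoder state s_t,
    then form the context c_t = alpha_t S."""
    scores = np.tanh(S @ Ua + h @ Wa) @ va     # one score per source position
    alpha = softmax(scores)                    # attention weights, sum to 1
    return alpha @ S, alpha

def predict_word(h, c, y_prev, P):
    """Distribution over the vocabulary from the attentive state, the context,
    and the previous target embedding (simplified readout)."""
    o = np.tanh(h @ P["Wh"] + c @ P["Wc"] + y_prev @ P["Wy"] + P["bo"])
    return softmax(o @ P["Wo"] + P["by"])

rng = np.random.default_rng(1)
de, dd, k, a, m, vocab = 512, 256, 128, 100, 200, 500
S = rng.normal(size=(6, de))                   # encoder states (N=6)
h = rng.normal(size=dd)                        # attentive decoder state
y_prev = rng.normal(size=k)                    # previous word embedding
c, alpha = attention_context(
    h, S,
    rng.normal(scale=0.1, size=(dd, a)),
    rng.normal(scale=0.1, size=(de, a)),
    rng.normal(scale=0.1, size=a))
P = {"Wh": rng.normal(scale=0.1, size=(dd, m)),
     "Wc": rng.normal(scale=0.1, size=(de, m)),
     "Wy": rng.normal(scale=0.1, size=(k, m)),
     "bo": np.zeros(m),
     "Wo": rng.normal(scale=0.1, size=(m, vocab)),
     "by": np.zeros(vocab)}
p = predict_word(h, c, y_prev, P)
print(round(alpha.sum(), 6), round(p.sum(), 6))  # both 1.0 (valid distributions)
```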

CNN based Image Encoder
What we feed into CAP is a full-size image, which needs to be convolutionally encoded beforehand. He et al. (2016a)'s CNNs (known as ResNet), with the deep residual learning mechanism (He et al., 2016b), are capable of encoding images.
In our experiments, we employ the recent version of ResNet, i.e., ResNet-101 , which is constructed with 101 convolutional layers. It is pretrained on ImageNet (Russakovsky et al., 2015) in the scenario of 1000-class image classification.
Using ResNet-101, we characterize an image as a convolutional feature matrix V = (v_1, ..., v_M), where each v_i is a real-valued vector and corresponds to an image region of 14 × 14 pixels.
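The step from a convolutional feature map to the region matrix V is just a reshape. A minimal sketch, where the feature-map shape (2048 channels over a 14 × 14 grid) is assumed for illustration rather than taken from the paper:

```python
import numpy as np

# Hypothetical output of the last convolutional block of ResNet-101:
# 2048 channels over a 14 x 14 spatial grid (shape assumed for illustration).
feat = np.random.default_rng(2).normal(size=(2048, 14, 14))

# V stacks one 2048-d real-valued vector per spatial position, so each row
# of V corresponds to one image region.
V = feat.reshape(2048, -1).T
print(V.shape)   # (196, 2048): M = 196 region vectors
```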

Top-down Attention-based CAP
CAP learns to generate a caption over V. It is constructed with a two-layer RNN with LSTM units (Anderson et al., 2018), LSTM1 and LSTM2 respectively. LSTM1 (in layer 1) computes the current first-layer hidden state ȟ^{d1}_t conditioned on the current first-layer input x̌^{d1}_t and the previous hidden state ȟ^{d1}_{t-1}, where x̌^{d1}_t is obtained by concatenating the previous second-layer hidden state ȟ^{d2}_{t-1}, the previous prediction y̌_{t-1}, and the condensed global visual feature v̄: x̌^{d1}_t = [ȟ^{d2}_{t-1}; y̌_{t-1}; v̄]. We specify the first-layer hidden state as the initial image-dependent linguistic feature.
An attention mechanism (Sennrich et al., 2015) is used for highlighting the attention-worthy image context, so as to produce the attention-aware vector of image context v̂_t: v̂_t = α̌_t V. The attention weights α̌_t are obtained by aligning the current image-dependent hidden state ȟ^{d1}_t with the visual features in V. LSTM2 (in layer 2) then computes the second-layer hidden state ȟ^{d2}_t conditioned on ȟ^{d1}_t and v̂_t, which we use as the image-dependent attention-aware linguistic feature.
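The visual attention step above can be sketched in numpy. The additive alignment function and all dimensions below are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def visual_attention(h1, V, Wh, Wv, w):
    """Align the layer-1 hidden state with every region vector in V and
    return the attention-aware image context v_t = alpha_t V."""
    scores = np.tanh(V @ Wv + h1 @ Wh) @ w   # one score per image region
    alpha = softmax(scores)                   # attention weights over regions
    return alpha @ V, alpha

rng = np.random.default_rng(3)
M, dv, dh, a = 196, 2048, 512, 128           # regions, feature dim, LSTM dim, attn dim
V = rng.normal(size=(M, dv))                 # region vectors from ResNet-101
h1 = rng.normal(size=dh)                     # current layer-1 hidden state
v_ctx, alpha = visual_attention(
    h1, V,
    rng.normal(scale=0.05, size=(dh, a)),
    rng.normal(scale=0.05, size=(dv, a)),
    rng.normal(scale=0.05, size=a))
print(v_ctx.shape, round(alpha.sum(), 6))    # context vector; weights sum to 1
```

The resulting v_ctx is what LSTM2 consumes, alongside h1, to produce the second-layer state.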

Harmonization for MNMT
In previous work on multimodal NMT, the visual features in V are directly used for cross-modality learning. By contrast, we transform the visual features into image-dependent attention-aware linguistic features (i.e., the second-layer hidden states ȟ^{d2}_t emitted by CAP) before use. We provide four variants of cross-modality learning to improve NMT. They absorb the image-dependent attention-aware linguistic features in different ways, including a variant that comprises attentive feature fusion (CAP-ATT) and three variants (CAP-ENC, CAP-DEC and CAP-TKN) which carry out reinitialization and target-language embedding modulation. Figure 3 shows the positions in the baseline NMT where the variants come into play.
CAP-ATT intends to improve NMT by conducting joint representation learning across the features of the source-language caption and those of the accompanying image. On one side, CAP-ATT adopts the encoder hidden state s_t (emitted by the BiGRU encoder of the baseline NMT) and uses it as the language-dependent linguistic feature. On the other side, it takes the image-dependent attention-aware linguistic feature ȟ^{d2}_t (produced by CAP). We suppose that the two kinds of features (i.e., ȟ^{d2}_t and s_t) are congruent with each other. On this basis, CAP-ATT blends ȟ^{d2}_t into s_t to form the joint representation s'_t. Element-wise feature fusion (Cao and Xiong, 2018) is used to compute s'_t: s'_t = s_t ⊙ ȟ^{d2}_t. Using the joint representations S' = (s'_1, ..., s'_N), CAP-ATT updates the attention-aware context c_t which is fed into the CGRU decoder of the baseline NMT: c_t = α_t S'. By substituting the updated context c_t into the computation of the CGRU decoder, CAP-ATT further refines the decoder hidden state h^{d2}_t and the prediction of target-language words. Equation 2 formulates the decoding process, where D_t is the shorthand of equation (1).
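The CAP-ATT fusion is a simple element-wise product followed by a recomputed context. A numpy sketch under assumptions: the two feature sequences are taken to be step-aligned and of equal dimension, and the attention weights are a uniform stand-in:

```python
import numpy as np

def cap_att_fuse(S, H2):
    """Element-wise fusion of encoder states with the image-dependent
    attention-aware linguistic features: s'_t = s_t * h2_t."""
    return S * H2

rng = np.random.default_rng(4)
S = rng.normal(size=(6, 512))    # BiGRU encoder states s_1..s_N
H2 = rng.normal(size=(6, 512))   # CAP layer-2 states (assumed aligned with S)
S_joint = cap_att_fuse(S, H2)    # joint representations s'_t

alpha = np.full(6, 1 / 6)        # stand-in attention weights (normally learned)
c = alpha @ S_joint              # updated context fed to the CGRU decoder
print(S_joint.shape, c.shape)
```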
CAP-ENC reinitializes the BiGRU encoder of the baseline NMT with the final image-dependent attention-aware linguistic feature ȟ^{d2}_N (produced by CAP). Similarly, CAP-DEC uses ȟ^{d2}_N to reinitialize h^{d1}_0 and h^{d2}_0, the initial decoder hidden states of the CGRU. CAP-TKN modulates the predicted target-language word embedding y_t with ȟ^{d2}_N at each decoding step: y'_t = y_t ⊙ tanh(W_tkn ȟ^{d2}_N), where W_tkn is a trainable parameter. CAP-ALL equips an MNMT system with all the variants.
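The three reinitialization/modulation variants reduce to small projections of the final CAP state. A numpy sketch; the learned projections for CAP-ENC/CAP-DEC and all dimensions are our assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(5)
d_enc, d_dec, k = 256, 256, 128
h2_final = rng.normal(size=2 * d_enc)          # final CAP state (dimension assumed)

# CAP-ENC / CAP-DEC: derive initial encoder / decoder hidden states from the
# final image-dependent feature through a learned projection (our assumption).
W_enc = rng.normal(scale=0.1, size=(2 * d_enc, d_enc))
h_enc0 = np.tanh(h2_final @ W_enc)             # initial BiGRU state
W_dec = rng.normal(scale=0.1, size=(2 * d_enc, d_dec))
h_dec0 = np.tanh(h2_final @ W_dec)             # initial CGRU states h1_0, h2_0

# CAP-TKN: modulate each predicted target-language embedding y_t.
W_tkn = rng.normal(scale=0.1, size=(2 * d_enc, k))
y_t = rng.normal(size=k)
y_mod = y_t * np.tanh(h2_final @ W_tkn)        # y'_t = y_t * tanh(W_tkn h2_N)
print(h_enc0.shape, h_dec0.shape, y_mod.shape)
```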

Resource and Experimental Datasets
We perform experiments on Multi30k-16 and Multi30k-17, which are provided by WMT for the shared tasks of multilingual captioning and multimodal MT. The corpora are used as extended versions of Flickr30k (Young et al., 2014), since they contain not only English (En) image captions but also their translations in German (De), French (Fr) and Czech (Cz). Hereinafter, we specify an example in Multi30k as an image accompanied by three caption-translation pairs (En→De, En→Fr and En→Cz). Each of Multi30k-16 and Multi30k-17 contains about 31K examples. We experiment on the corpora separately and, as usual, divide each of them into training, validation and test sets of 29K, 1,014 and 1K examples, respectively.
In addition, we carry out a complementary experiment on the ambiguous COCO, which contains 461 examples (Elliott et al., 2017). Due to the inclusion of ambiguous verbs, the examples in ambiguous COCO can be used for the evaluation of visual sense disambiguation in an MNMT scenario.

Training and Hyperparameter Settings
For preprocessing, we apply Byte-Pair Encoding (BPE) (Sennrich et al., 2015) to tokenize all the captions and translations in Multi30k and COCO, and use the open-source toolkit of Moses (Koehn et al., 2007) for lowercasing and punctuation normalization. The CAP we use reproduces the neural network architecture of Anderson et al. (2018)'s top-down attentive CAP. The only difference is that it merely utilizes ResNet-101 to generate the input set of visual features V, without the use of Faster R-CNN (Ren et al., 2015). This CAP has been trained on the MSCOCO captions dataset (Lin et al., 2014) using the same hyperparameter settings as in Anderson et al. (2018)'s work.
Besides the baseline NMT (Bahdanau et al., 2014) mentioned in Section 2, we compare our model with Caglayan et al. (2017a)'s convolutional visual-feature based MNMT. In this paper, we follow Caglayan et al. (2017a)'s practice to implement and train our model. First of all, we implement our model with the nmtpy framework (Caglayan et al., 2017b) using Theano v0.9. During training, ADAM with a learning rate of 4e-4 is used and the batch size is set to 32. We initialize all the parameters (i.e., transformation matrices and biases) using Xavier initialization and clip the total gradient norm to 5. We drop out the input embeddings, hidden states and output states with probabilities of (0.3, 0.5, 0.5) for En→De MT, (0.2, 0.4, 0.4) for En→Fr and (0.1, 0.3, 0.3) for En→Cz. In order to avoid overfitting, we apply an L2 regularization term with a factor of 1e-5. We specify the dimension as 128 for all token embeddings (k = 128) and 256 for hidden states.
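For reference, the training configuration above can be collected into one place. This is a sketch only: the key names are ours, not nmtpy's actual option names.

```python
# Training configuration as described in the text (key names are illustrative,
# not nmtpy's actual option names).
config = {
    "optimizer": "adam",
    "learning_rate": 4e-4,
    "batch_size": 32,
    "init": "xavier",
    "grad_clip_norm": 5.0,
    "l2_factor": 1e-5,
    "embedding_dim": 128,     # k
    "hidden_dim": 256,
    # dropout probabilities: (input embeddings, hidden states, output states)
    "dropout": {
        "en-de": (0.3, 0.5, 0.5),
        "en-fr": (0.2, 0.4, 0.4),
        "en-cs": (0.1, 0.3, 0.3),
    },
}
print(config["dropout"]["en-fr"])
```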

Comparison to the Baseline
We carry out 5 independent experiments (5 runs) for each of the proposed MNMT variants. In each run, each variant is retrained and redeveloped under cold-start conditions using a set of randomly-selected seeds by MultEval. Eventually, the resultant models are evaluated on the test set with BLEU (Papineni et al., 2002), METEOR (Denkowski and Lavie, 2014) and TER (Snover et al., 2006).
For each variant, we report not only the comprehensive performance (denoted as ensemble), which is obtained using ensemble learning (Garmash and Monz, 2016), but also the performance without ensemble learning. In the latter case, the average performance (µ) and standard deviation (σ) over the 5 runs are reported.
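The per-run reporting amounts to computing the mean and standard deviation over the 5 runs. A trivial sketch with made-up BLEU scores (the numbers below are illustrative, not experimental results):

```python
import numpy as np

# Hypothetical BLEU scores from 5 cold-start runs of one variant.
runs = np.array([39.8, 40.1, 40.7, 39.5, 40.4])
mu = runs.mean()            # reported as the average performance (mu)
sigma = runs.std(ddof=0)    # reported as the deviation (sigma)
print(f"BLEU mu={mu:.2f} sigma={sigma:.2f}")
```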

Performance on Multi30k
Tables 1 and 2 respectively show the performance of our models on Multi30k-16 and Multi30k-17 for the translation scenarios of En→De, En→Fr and En→Cz. Each of our MNMT models in the tables is denoted with a symbol "+", which indicates that the MNMT model is constructed with the baseline and one of our cross-modality learning models. The baseline is specified as the monomodal NMT model developed by Bahdanau et al. (2014) (mentioned in Section 2) and redeveloped as the baseline in a variety of research studies on multimodal NMT (Calixto et al., 2017a,b; Caglayan et al., 2017a). We quote the results reported in Caglayan et al. (2017a)'s work as they were better.
It can be observed that our MNMT models outperform the baseline. They benefit from the performance gains yielded by the variants of CAP-based cross-modality learning, which are no less than 1.5% BLEU when ensemble learning is used, and 0.6% when it is not. In particular, +CAP-ATT obtains a performance increase of up to 7.6% BLEU (µ) in the En→Fr scenario. The gains in METEOR score are less obvious than those in BLEU, about 5.3% (µ) at best.
We follow Clark et al. (2011) to perform a significance test. The test results show that +CAP-ATT, +CAP-DEC and +CAP-TKN achieve p-values of 0.02, 0.01 and 0.007, respectively. Following Clark et al. (2011), performance improvements are considered significant only if the p-value is less than 0.05. Therefore, the proposed method yields statistically significant performance gains.

Performance on Ambiguous COCO

Table 3 shows the translation performance. It can be found that our models yield a certain amount of gains (in BLEU) for En→De translation, and raise both BLEU and METEOR scores for En→Fr. The METEOR scores for En→De are comparable to those achieved by the baseline. However, the improvement is less significant than that obtained on Multi30k-16&17 (see Table 1). Considering that the ambiguous COCO contains a larger number of ambiguous words than Multi30k-16&17, we suggest that our method fails to largely shield the baseline from being misled by ambiguous words.
Nevertheless, our method does not result in two-fold error propagation; on the contrary, it alleviates the negative influence of the errors because:
• Error propagation, in general, is inevitable when a GRU or LSTM unit is used. Both are trained to predict a sequence of words one by one, and appropriate prediction of previous words is crucial for ensuring the correctness of subsequent words. Thus, once a mistake is made at a certain decoding step, the error will be propagated forward and mislead the prediction of subsequent words.
• The baseline is equipped with a GRU decoder and therefore suffers from error propagation. More seriously, ambiguous words increase the risk of error propagation. This causes a significant performance reduction on Ambiguous COCO: for example, the BLEU score for En→De is 28.7% at best, far below that (40.7%) obtained on Multi30k-16&17.
• Two-fold error propagation might be suspected to occur when the LSTM-based CAP is integrated with the baseline. However, the opposite is actually true: after CAP is used, the translation performance improves instead of falling.

Comparison to the state of the art
We survey the state-of-the-art research activities in the field of MNMT and compare them with ours (as shown in Table 4). Comparisons are made for all the WMT translation scenarios (En→De, En→Fr and En→Cz) on Multi30k-16&17, but merely for En→De and En→Fr on ambiguous COCO (as shown in Table 3). To the best of our knowledge, there is no previous attempt to evaluate the performance of an En→Cz translation model on ambiguous COCO, and thus a precise comparison for that is not available. It is noteworthy that some of the cited work reports ensemble learning results for MNMT, while others make no mention of it. We label the former in Tables 3 and 4 to ease the comparison. It can be observed that our best model outperforms the state of the art for most scenarios over the different corpora, except the En→Fr case on Multi30k-17. The performance increases are most apparent in the case of En→Fr on Multi30k-16 when ensemble learning is used, where the BLEU and METEOR scores reach levels of more than 63% and 77%, with improvements of 6.6% and 4.1%.
We regard the work of Caglayan et al. (2017a) and Calixto et al. (2017a) as the most comparable to ours.

Performance in Adversarial Evaluation
We examine the use efficiency of images for MNMT using Elliott (2018)'s adversarial evaluation. Elliott supposes that if a model uses images efficiently during MNMT, its performance should degrade when it is cheated by incongruent images. Table 5 shows the test results, where "C" is specified as the METEOR score evaluated when there is no incongruent image in the test set, while "I" is the score when some incongruent images are used to replace the original images.
If the value of "C" is larger than "I", a positive E-Awareness is obtained, illustrating an acceptable use efficiency. On the contrary, a negative E-Awareness is a warning of low efficiency. It can be observed that our +CAP-ATT and +CAP-ALL models achieve positive E-Awareness for all the translation scenarios on Multi30k-16. In addition, the models obtain higher values of E-Awareness than Caglayan et al. (2017a)'s decinit and hierattn models. As mentioned above, Caglayan et al. directly use visual features to enhance MNMT, while we use the image-dependent linguistic features transformed from visual features. Therefore, we suppose that modality transformation leads to a higher use efficiency of images.
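As described above, the awareness score is simply the drop in METEOR when congruent images are replaced by incongruent ones. A one-line sketch (the scores below are illustrative, not results from Table 5):

```python
def e_awareness(meteor_congruent, meteor_incongruent):
    """E-Awareness as described in the text: C minus I.
    Positive: the model exploits the image; negative: low use efficiency."""
    return meteor_congruent - meteor_incongruent

print(round(e_awareness(57.9, 57.1), 1))   # positive: acceptable use efficiency
print(round(e_awareness(56.2, 56.5), 1))   # negative: warning of low efficiency
```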

RELATED WORK
We have mentioned the previous work on MNMT in Section 1, where the research interest was classified into image encoding, encoder-decoder NMT construction and cross-modality learning. Besides, we present the methods of Caglayan et al. (2017a) and Calixto et al. (2017a) in Section 4.4.2, along with a systematic analysis. In addition, many scholars within the research community have made great efforts in the development of sophisticated NMT architectures, including multi-source (Zoph and Knight, 2016), multi-task (Dong et al., 2015) and multi-way NMT, as well as those equipped with attention mechanisms (Sennrich et al., 2015). These research activities are particularly important since they broaden the range of cross-modality learning strategies.
Current research interest has concentrated on the incorporation of visual features into NMT (Lala et al., 2018), by means of visual-linguistic context vector concatenation (Libovickỳ et al., 2016), doubly-attentive decoding (Calixto et al., 2017a), hierarchical attention combination (Libovickỳ and Helcl, 2017), cross-attention networks, gated attention networks (Zhang et al., 2019), and joint (Zhou et al., 2018) and ensemble (Zheng et al., 2018) learning. In addition, image attention optimization (Delbrouck and Dupont, 2017) and monolingual data expansion (Hitschler et al., 2016) have been proven effective in this field. Ive et al. (2019) use an off-the-shelf object detector and an additional image dataset (Kuznetsova et al., 2018) to form a bag of category-level object embeddings. Conditioned on the embeddings, Ive et al. (2019) propose a sophisticated MNMT model which integrates self-attention and cross-attention mechanisms into an encoder-decoder based deliberation architecture. This paper also touches on the research area of image captioning. Mao et al. (2014) provide an interpretable image modeling method using a multimodal RNN. Vinyals et al. (2015) design a caption generator (IDG) within the Seq2Seq framework. Further, Xu et al. (2015) propose an attention-based IDG.

CONCLUSION
We demonstrate that the captioning based harmonization model reduces incongruence between multimodal features. This contributes to the performance improvement of MNMT. It is proven that our method increases the use efficiency of images.
An interesting phenomenon we observed in the experiments is that modality incongruence reduction is more effective in the En→Fr translation scenario than in En→De and En→Cz. This raises the problem of adaptation to languages. In the future, we will study the distinct grammatical and syntactic principles of target languages, as well as their influence on adaptation. For example, the syntax of French can be considered the most strict; thus, a sequence-dependent feature vector may be more adaptive to MNMT towards French. Accordingly, we will attempt to develop a generative adversarial network based adaptation enhancement model, whose goal is to refine the generated linguistic features by learning to detect and eliminate features of low adaptability.