Aligning Multilingual Word Embeddings for Cross-Modal Retrieval Task

In this paper, we propose a new approach to learning multimodal multilingual embeddings for matching images with their relevant captions in two languages. We combine two existing objective functions to bring images and captions close in a joint embedding space while adapting the alignment of word embeddings between the languages in our model. We show that our approach enables better generalization, achieving state-of-the-art performance on text-to-image and image-to-text retrieval tasks as well as the caption-caption similarity task. Two multimodal multilingual datasets are used for evaluation: Multi30k, with German and English captions, and Microsoft COCO, with English and Japanese captions.


Introduction
In recent years, there has been a significant amount of research on text and image retrieval tasks, which require joint modeling of both modalities. A large number of image-text datasets have become available (Elliott et al., 2016; Hodosh et al., 2013; Young et al., 2014; Lin et al., 2014), and several models have been proposed to generate captions for the images in these datasets (Lu et al., 2018; Bernardi et al., 2016; Anderson et al., 2017; Lu et al., 2016; Mao et al., 2014; Rennie et al., 2016). A large body of work learns a joint embedding space for texts and images for use in sentence-based image search or cross-modal retrieval (Frome et al., 2013; Kiros et al., 2014; Donahue et al., 2014; Lazaridou et al., 2015; Socher et al., 2013; Hodosh et al., 2013; Karpathy et al., 2014).
Previous work on image captioning and on learning a joint embedding space for texts and images has mostly targeted English; recently, however, a growing body of research covers other languages thanks to the availability of multilingual datasets (Funaki and Nakayama, 2015; Elliott et al., 2016; Rajendran et al., 2015; Miyazaki and Shimizu, 2016; Lucia Specia and Elliott, 2016; Young et al., 2014; Hitschler and Riezler, 2016; Yoshikawa et al., 2017). These models map images and their captions in a single language into a joint embedding space (Rajendran et al., 2015; Calixto et al., 2017). Closest to our work, Gella et al. (2017) proposed a model that learns a multilingual multimodal embedding by using the image as a pivot between the caption languages. While Gella et al. (2017) train a separate text encoder for each language, we instead propose a model that learns a shared, language-independent text encoder, yielding better generalization. It is generally important to adapt word embeddings to the task at hand. Our model enables tuning of the word embeddings while keeping the two languages aligned during training, building a task-specific shared embedding space for the languages involved.
To this end, we define a new objective function that combines a pairwise ranking loss with a loss that maintains the alignment between the languages. For the latter, we use the objective function proposed by Joulin et al. (2018) for learning a linear mapping between languages, inspired by the cross-domain similarity local scaling (CSLS) retrieval criterion (Conneau et al., 2017), which obtains state-of-the-art performance on the word translation task.
In the next sections, the proposed approach is called Aligning Multilingual Embeddings for cross-modal retrieval (AME). We use two multilingual image-caption datasets to evaluate our model, Multi30k (Elliott et al., 2016) and Microsoft COCO (Lin et al., 2014).

The pairwise ranking loss is defined as:

L_rank = \sum_S \sum_i \Big[ \sum_j \max(0, \alpha - s(c_i^S, i) + s(c_j^S, i)) + \sum_j \max(0, \alpha - s(c_i^S, i) + s(c_i^S, j)) \Big],

where S stands for both languages and α is the margin; c_j^S and j are an irrelevant caption and image for the gold-standard pair (c_i^S, i).
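A minimal sketch of this bi-directional ranking loss, with the paper's encoders abstracted away as precomputed vectors; the cosine scorer and function names here are our own choices for illustration:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ranking_loss(images, captions, margin=0.2):
    """Bi-directional pairwise ranking loss over matched (image, caption)
    rows: every non-matching row serves as a contrastive example."""
    n, loss = len(images), 0.0
    for i in range(n):
        pos = cosine(images[i], captions[i])  # score of the gold pair
        for j in range(n):
            if j == i:
                continue
            # irrelevant caption for image i, and irrelevant image for caption i
            loss += max(0.0, margin - pos + cosine(images[i], captions[j]))
            loss += max(0.0, margin - pos + cosine(images[j], captions[i]))
    return loss
```

In the full model, this loss would be summed over the captions in both languages sharing the same image.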

Alignment Model
Each word k in language X is represented by a word embedding x_k ∈ R^d (respectively y_k ∈ R^d in language Y). Given a bilingual lexicon of N pairs of words, we assume the first n pairs {(x_i, y_i)}_{i=1}^n are the initial seeds, and our aim is to extend the alignment to all word pairs not in the initial lexicon. Mikolov et al. (2013) proposed learning a linear mapping W ∈ R^{d×d} between the source and target languages:

\min_W \frac{1}{n} \sum_{i=1}^n \ell(W x_i, y_i),

where ℓ is a square loss, ℓ(a, b) = ||a − b||^2. One can then find the translation of a source word in the target language by a nearest-neighbor search under the Euclidean distance. However, this model suffers from a "hubness problem": some word embeddings become the nearest neighbors of an unusually large number of other words (Doddington et al., 1998; Dinu and Baroni, 2014).
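The least-squares mapping and the nearest-neighbor translation step can be sketched as follows (a minimal illustration, not the constrained solver used later in the paper):

```python
import numpy as np

def learn_mapping(X, Y):
    """Least-squares solution of min_W sum_i ||W x_i - y_i||^2, i.e. the
    Mikolov-style linear map from source to target embeddings.
    Rows of X and Y are paired seed word vectors."""
    W_T, *_ = np.linalg.lstsq(X, Y, rcond=None)
    return W_T.T  # W such that W @ x maps a source word into the target space

def translate(W, x, target_vocab, target_vecs):
    """Nearest-neighbor lookup (Euclidean) of the mapped source word."""
    dists = np.linalg.norm(target_vecs - W @ x, axis=1)
    return target_vocab[int(np.argmin(dists))]
```

It is exactly this Euclidean nearest-neighbor retrieval that suffers from hubness, motivating the CSLS-based objective below.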
To resolve this issue, Joulin et al. (2018) proposed a new objective function, inspired by the CSLS criterion, to learn the linear mapping:

\min_W \frac{1}{n} \sum_{i=1}^n \Big[ -2\cos(W x_i, y_i) + \frac{1}{k} \sum_{y_j \in N_Y(W x_i)} \cos(W x_i, y_j) + \frac{1}{k} \sum_{W x_j \in N_X(y_i)} \cos(W x_j, y_i) \Big],

where N_X(y_i) denotes the k nearest neighbors of y_i among the mapped vectors of the source language X, and N_Y(W x_i) the k nearest neighbors of W x_i in the target language. The linear mapping W is constrained to be orthogonal, and the word vectors are l2-normalized.
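The CSLS retrieval score that this loss relaxes can be computed directly; a minimal sketch (function name ours) of how it discounts hub vectors:

```python
import numpy as np

def csls_scores(Wx, Y, k=2):
    """CSLS score between mapped source vectors Wx and target vectors Y:
    2*cos(Wx_i, y_j) minus each vector's mean cosine to its k nearest
    neighbors on the other side, which penalizes 'hub' vectors that are
    close to everything."""
    Wx = Wx / np.linalg.norm(Wx, axis=1, keepdims=True)
    Y = Y / np.linalg.norm(Y, axis=1, keepdims=True)
    cos = Wx @ Y.T                                      # pairwise cosines
    r_src = np.sort(cos, axis=1)[:, -k:].mean(axis=1)   # mean cos of k-NN in Y
    r_tgt = np.sort(cos, axis=0)[-k:, :].mean(axis=0)   # mean cos of k-NN in Wx
    return 2 * cos - r_src[:, None] - r_tgt[None, :]
```

Translating a source word then amounts to taking the argmax of its row of CSLS scores rather than a plain nearest-neighbor lookup.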
The overall loss is the equally weighted sum of the two objective functions described above:

L = L_rank + L_align.

The model architecture is illustrated in Figure 1. We observe that updating the parameters of the alignment objective (3) every T iterations with learning rate lr_align obtains the best performance.
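The alternating update schedule can be sketched as below; the optimizer steps are abstracted as callbacks, and the default learning-rate values are placeholders of our own, not the paper's hyperparameters:

```python
def train(num_iters, T=100, lr_retrieval=2e-4, lr_align=1e-3,
          retrieval_step=None, align_step=None):
    """Alternating schedule for the equally weighted sum L = L_rank + L_align:
    the retrieval loss is optimized at every iteration, while the alignment
    objective is updated only every T iterations with its own learning rate."""
    for it in range(num_iters):
        retrieval_step(lr_retrieval)      # step on the pairwise ranking loss
        if it % T == 0:
            align_step(lr_align)          # periodic step on the alignment loss
```

This keeps the word embeddings from drifting out of alignment without letting the alignment term dominate every update.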
We use two different similarity functions, one symmetric and one asymmetric. For the former we use cosine similarity; for the latter we use the metric proposed by Vendrov et al. (2015), which encodes the partial order structure of the visual-semantic hierarchy. The asymmetric similarity is defined as:

s(a, b) = -\|\max(0, b - a)\|^2,

where a and b are the embeddings of the image and the caption.
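A minimal sketch of this order-embedding similarity, following the convention in the text (a for the image, b for the caption; the function name is ours):

```python
import numpy as np

def order_sim(a, b):
    """Asymmetric order-embedding similarity of Vendrov et al. (2015):
    s(a, b) = -||max(0, b - a)||^2. The score is maximal (zero) exactly
    when every coordinate of b lies below the corresponding coordinate
    of a, encoding the partial order of the visual-semantic hierarchy."""
    return -float(np.sum(np.maximum(0.0, b - a) ** 2))
```

Unlike cosine similarity, swapping the arguments changes the score, which is what lets the metric encode hierarchy rather than mere proximity.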
To extract image features, we use a ResNet152 (He et al., 2015) CNN architecture pretrained on ImageNet and extract the image features from FC7, the penultimate fully connected layer. We use averaged features from 10 crops of the rescaled images.

Table 1: Image-to-text and text-to-image ranking results (R@1, R@5, R@10, Mr) in symmetric and asymmetric alignment modes, comparing UVS (Kiros et al., 2014), EmbeddingNet (Wang et al., 2017), and our models.

For the bilingual lexicons, we use the Multilingual Unsupervised and Supervised Embeddings (MUSE) benchmark (Lample et al., 2017). MUSE provides large-scale, high-quality bilingual dictionaries for training and evaluating the translation task. We extract the training words from the descriptions in the two languages. For training, we combine the "full" and "test" sections of MUSE, then filter them to the training words. For evaluation, we filter the "train" section of MUSE to the training words.3

To evaluate the benefit of the proposed objective function, we compare AME with monolingual training (Mono) and with multilingual training without the alignment model described in Section 3.2; in the latter, the pre-aligned word embeddings are frozen during training (FME). We add Mono because the model proposed in Gella et al. (2017) did not use pre-trained word embeddings for initialization, and its image encoder differs from ours (VGG19 vs. ResNet152).
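The lexicon-filtering step described above reduces to keeping only dictionary pairs whose words occur in the training captions; a minimal sketch (names ours):

```python
def filter_lexicon(pairs, src_vocab, tgt_vocab):
    """Keep only (source, target) dictionary pairs whose words both appear
    in the vocabularies extracted from the training captions."""
    src_vocab, tgt_vocab = set(src_vocab), set(tgt_vocab)
    return [(s, t) for s, t in pairs if s in src_vocab and t in tgt_vocab]
```

The same function serves both splits: applied to the combined "full"+"test" MUSE sections it yields the training lexicon, and applied to the "train" section it yields the evaluation lexicon.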
3 The code for building the bilingual lexicons is available on GitHub.
We compare models using two retrieval metrics: recall at position k (R@k) and median rank (Mr).
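Both metrics can be computed from a query-by-candidate score matrix; a minimal sketch (function name ours) assuming the gold match for query i is candidate i:

```python
import numpy as np

def recall_and_median_rank(scores, ks=(1, 5, 10)):
    """Given a score matrix where entry (i, i) is the gold pair, return
    R@k for each k (fraction of queries whose gold candidate ranks in
    the top k) and the median rank Mr (1-based)."""
    order = np.argsort(-scores, axis=1)  # best-scoring candidate first
    ranks = np.array([int(np.where(order[i] == i)[0][0]) + 1
                      for i in range(len(scores))])
    recalls = {k: float(np.mean(ranks <= k)) for k in ks}
    return recalls, float(np.median(ranks))
```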

Multi30k Results
Tables 1 and 2 show the results for English and German captions. For English captions, we see a 21.28% average improvement over Kiros et al. (2014). There is a 1.8% average boost over Mono, owing to more training data and the multilingual text encoder. AME outperforms FME in both the symmetric and asymmetric modes, which demonstrates the advantage of fine-tuning the word embeddings during training. In the asymmetric mode, we obtain a 25.26% average boost over Kiros et al. (2014).
For German descriptions, the results are 11.05% better on average than Gella et al. (2017) in the symmetric mode. AME also achieves competitive or better results than FME on the German descriptions.

MS-COCO Results 4
Tables 3 and 4 show the performance of AME and the baselines for English and Japanese captions. We achieve a 10.42% average improvement over Kiros et al. (2014) in the symmetric mode. Adapting the word embeddings to the task at hand boosts overall performance: AME significantly outperforms FME in both languages.
For Japanese captions, AME achieves 6.25% and 3.66% better results on average than the monolingual model in the symmetric and asymmetric modes, respectively.

Alignment results
As Tables 1 and 2 show, the alignment ratio for AME is 6.80% lower than for FME, meaning the model keeps the two languages nearly aligned on the Multi30k dataset. On MS-COCO, the alignment ratio for AME is 8.93% lower than for FME.
We compute the alignment ratio and recall at position 1 (R@1) at each validation step. Figure 2 shows the trade-off between the alignment and retrieval tasks. In the first few epochs, the model improves the alignment ratio, since the retrieval task has not yet seen enough training instances. The retrieval task then fine-tunes the word embeddings. Finally, the two objectives reach an equilibrium around the halfway point of training. At this point, we update the learning rate of the retrieval task to improve performance, and the alignment ratio remains constant.
Additionally, we trained the AME model without the alignment objective; the model then breaks the alignment between the initially aligned word embeddings, so adding the alignment objective to the retrieval task is essential.

Caption-Caption Similarity Scores
Given a caption in one language, the task is to retrieve the corresponding caption in the other language. Table 5 shows the performance on the Multi30k dataset in the asymmetric mode. AME outperforms FME, confirming the importance of adapting the word embeddings.

Conclusion
We proposed a multimodal model with a shared multilingual text encoder that adapts the alignment between languages during training for the image-description retrieval task. We introduced a loss function that combines a pairwise ranking loss with a loss that maintains the alignment of word embeddings across languages. Through experiments on several multimodal multilingual datasets, we have shown that our approach yields better generalization on image-to-text and text-to-image retrieval, as well as on the caption-caption similarity task.
In future work, we plan to investigate applying self-attention models such as the Transformer (Vaswani et al., 2017) to the shared text encoder to obtain more comprehensive representations of the descriptions. We could also explore a weighted sum of the two loss functions instead of weighting them equally.

Figure 1: The AME model architecture

Table 3: Image-caption ranking results for English (MS-COCO)

Table 4: Image-caption ranking results for Japanese (MS-COCO)