Image Pivoting for Learning Multilingual Multimodal Representations

In this paper we propose a model to learn multimodal multilingual representations for matching images and sentences in different languages, with the aim of advancing multilingual versions of image search and image understanding. Our model learns a common representation for images and their descriptions in two different languages (which need not be parallel) by considering the image as a pivot between two languages. We introduce a new pairwise ranking loss function which can handle both symmetric and asymmetric similarity between the two modalities. We evaluate our models on image-description ranking for German and English, and on semantic textual similarity of image descriptions in English. In both cases we achieve state-of-the-art performance.


Introduction
In recent years there has been a significant amount of research in language and vision tasks which require the joint modeling of texts and images. Examples include text-based image retrieval, image description and visual question answering. An increasing number of large image description datasets has become available (Hodosh et al., 2013;Young et al., 2014;Lin et al., 2014) and various systems have been proposed to handle the image description task as a generation problem (Bernardi et al., 2016;Mao et al., 2015;Vinyals et al., 2015;Fang et al., 2015). There has also been a great deal of work on sentence-based image search or cross-modal retrieval where the objective is to learn a joint space for images and text (Hodosh et al., 2013;Frome et al., 2013;Kiros et al., 2015;Socher et al., 2014;Donahue et al., 2015).
Previous work on image description generation or learning a joint space for images and text has mostly focused on English due to the availability of English datasets. Recently there have been attempts to create image descriptions and models for other languages (Funaki and Nakayama, 2015;Rajendran et al., 2016;Miyazaki and Shimizu, 2016;Li et al., 2016;Hitschler et al., 2016;Yoshikawa et al., 2017).
Most work on learning a joint space for images and their descriptions is based on Canonical Correlation Analysis (CCA) or neural variants of CCA over representations of image and its descriptions (Hodosh et al., 2013;Andrew et al., 2013;Yan and Mikolajczyk, 2015;Gong et al., 2014;. Besides CCA, a few others learn a visual-semantic or multimodal embedding space of image descriptions and representations by optimizing a ranking cost function (Kiros et al., 2015;Socher et al., 2014;Vendrov et al., 2016) or by aligning image regions (objects) and segments of the description Plummer et al., 2015) in a common space. Recently Lin and Parikh (2016) have leveraged visual question answering models to encode images and descriptions into the same space.
However, all of this work is targeted at monolingual descriptions, i.e., mapping images and descriptions in a single language onto a joint embedding space. The idea of pivoting or bridging is not new and language pivoting is well explored for machine translation (Wu and Wang, 2007;Firat et al., 2016) and to learn multilingual multimodal representations (Rajendran et al., 2016;Calixto et al., 2017). Rajendran et al. (2016) propose a Two men playing soccer on field a zwei männer kämpfen um einen fussball Joint space Figure 1: Our multilingual multimodal model with image as pivot model to learn common representations between M views and assume there is parallel data available between a pivot view and the remaining M −1 views. Their multimodal experiments are based on English as the pivot and use large parallel corpora available between languages to learn their representations. Related to our work Calixto et al. (2017) proposed a model for creating multilingual multimodal embeddings. Our work is different from theirs in that we choose the image as the pivot and use a different similarity function. We also propose a single model for learning representations of images and multiple languages, whereas their model is language-specific.
In this paper, we learn multimodal representations in multiple languages, i.e., our model yields a joint space for images and text in multiple languages using the image as a pivot between languages. We propose a new objective function in a multitask learning setting and jointly optimize the mappings between images and text in two different languages.

Dataset
We experiment with the Multi30k dataset, a multilingual extension of Flickr30k corpus (Young et al., 2014) consisting of English and German image descriptions . The Multi30K dataset has 29k, 1k and 1k images in the train, validation and test splits respectively, and contains two types of multilingual annotations: (i) a corpus of one English description per image and its translation into German; and (ii) a corpus of five independently collected English and German descriptions per image. We use the independently collected English and German descriptions to train our models. Note that these descriptions are not translations of each other, i.e., they are not parallel, although they describe the same image.

Problem Formulation
Given an image i and its descriptions c 1 and c 2 in two different languages our aim is to learn a model which maps i, c 1 and c 2 onto same common space R N (where N is the dimensionality of the embedding space) such that the image and its gold-standard descriptions in both languages are mapped close to each other (as shown in Figure 1). Our model consists of the embedding functions f i and f c to encode images and descriptions and a scoring function S to compute the similarity between a description-image pair.
In the following we describe two models: (i) the PIVOT model that uses the image as pivot between the description in both the languages; (ii) the PAR-ALLEL model that further forces the image descriptions in both languages to be closer to each other in the joint space. We build two variants of PIVOT and PARALLEL with different similarity functions S to learn the joint space.

Multilingual Multimodal Representation Models
In both PIVOT and PARALLEL we use a deep convolutional neural network architecture (CNN) to represent the image i denoted by f i (i) = W i · CNN(i) where W i is a learned weight matrix and CNN(i) is the image vector representation. For each language we define a recurrent neural network encoder f c (c k ) = GRU(c k ) with gated recurrent units (GRU) activations to encode the description c k . In PIVOT, we use monolingual corpora from multiple languages of sentences aligned with images to learn the joint space. The intuition of this model is that an image is a universal representation across all languages, and if we constrain a sentence representation to be closer to image, sentences in different languages may also come closer. Accordingly we design a loss function as follows: where k stands for each language. This loss function encourages the similarity S(c k , i) between gold-standard description c k and image i to be greater than any other irrelevant description c k by a margin α. A similar loss function is useful for learning multimodal embeddings in a single language (Kiros et al., 2015). For each minibatch, we obtain invalid descriptions by selecting descriptions of other images except the current image of interest and vice-versa.
In PARALLEL, in addition to making an image similar to a description, we make multiple descriptions of the same image in different languages similar to each other, based on the assumption that these descriptions, although not parallel, share some commonalities. Accordingly we enhance the previous loss function with an additional term: Note that we are iterating over all pairs of descriptions (c 1 , c 2 ), and maximizing the similarity between descriptions of the same image and at the same time minimizing the similarity between descriptions of different images.
We learn models using two similarity functions: symmetric and asymmetric. For the former we use cosine similarity and for the latter we use the metric of Vendrov et al. (2016) which is useful for learning embeddings that maintain an order, e.g., dog and cat are more closer to pet than animal while being distinct. Such ordering is shown to be useful in building effective multimodal space of images and texts. An analogy in our setting would be two descriptions of an image are closer to the image while at the same time preserving the identity of each (which is useful when sentences describe two different aspects of the image). The similarity metric is defined as: where a and b are embeddings of image and description.
We call the symmetric similarity variants of our models as PIVOT-SYM and PARALLEL-SYM, and the asymmetric variants PIVOT-ASYM and PARALLEL-ASYM.

Experiments and Results
We test our model on the tasks of imagedescription ranking and semantic textual similarity. We work with each language separately. Since we learn embeddings for images and languages in the same semantic space, our hope is that the training data for each modality or language acts complementary data for the another modality or language, and thus helps us learn better embeddings.

Experiment Setup
We sampled minibatches of size 64 images and their descriptions, and drew all negative samples from the minibatch. We trained using the Adam optimizer with learning rate 0.001, and early stopping on the validation set. Following Vendrov et al. (2016) we set the dimensionality of the embedding space and the GRU hidden layer N to 1024 for both English and German. We set the dimensionality of the learned word embeddings to 300 for both languages, and the margin α to 0.05 and 0.2, respectively, to learn asymmetric and symmetric similarity-based embeddings. 1 We keep all hyperparameters constant across all models. We used the L2 norm to mitigate over-fitting (Kiros et al., 2015). We tokenize and truecase both English and German descriptions using the Moses Decoder scripts. 2 To extract image features, we used a convolutional neural network model trained on 1.2M images of 1000 class ILSVRC 2012 object classification dataset, a subset of ImageNet (Russakovsky et al., 2015). Specifically, we used VGG 19-layer CNN architecture and extracted the activations of the penultimate fully connected layer to obtain features for all images in the dataset (Simonyan and Zisserman, 2015). We use average features from 10 crops of the re-scaled images. 3 Baselines As baselines we use monolingual models, i.e., models trained on each language separately. Specifically, we use Visual Semantic Embeddings (VSE) of Kiros et al. (2015) and Order Embeddings (OE) of Vendrov et al. (2016). We System Text to Image Image to Text R@1 R@5 R@10 Mr R@1 R@5 R@10 Mr VSE (Kiros et al., 2015) 23.3 53.6 65.8 5 31.6 60.4 72.7 3 OE (Vendrov et al., 2016) (Kiros et al., 2015) 20.3 47.2 60.1 6 29.3 58.1 71.8 4 OE (Vendrov et al., 2016)

Image-Description Ranking Results
To evaluate the multimodal multilingual embeddings, we report results on an image-description ranking task. Given a query in the form of a description or an image, the task its to retrieve all images or descriptions sorted based on the relevance. We use the standard ranking evaluation metrics of recall at position k (R@K, where higher is better) and median rank (Mr, where lower is better) to evaluate our models. We report results for both English and German descriptions. Note that we have one single model for both languages.
In Tables 1 and 2 we present the ranking results of the baseline models of Kiros et al. (2015) and Vendrov et al. (2016) and our proposed PIVOT and PARALLEL models. We do not compare our image-description ranking results with Calixto et al. (2017) since they report results on half of validation set of Multi30k whereas our results are on the publicly available test set of Multi30k. For English, PIVOT with asymmetric similarity is either competitive or better than monolingual models 4 https://github.com/ivendrov/order-embedding and symmetric similarity, especially in the R@10 category it obtains state-of-the-art. For German, both PIVOT and PARALLEL with the asymmetric scoring function outperform monolingual models and symmetric similarity. We also observe that the German ranking experiments benefit the most from the multilingual signal. A reason for this could be that the German description corpus has many singleton words (more than 50% of the vocabulary) and English description mapping might have helped in learning better semantic embeddings. These results suggest that the multilingual signal could be used to learn better multimodal embeddings, irrespective of the language. Our results also show that the asymmetric scoring function can help learn better embeddings. In Table 3 we present a few examples where PIVOT-ASYM and PARALLEL-ASYM models performed better on both the languages compared to baseline order embedding model even using descriptions of very different lengths as queries.
In Table 4, we present the Pearson correlation coefficients of our model predicted scores with the gold-standard similarity scores provided as part of the STS image/video description tasks. We compare with the best reported scores for the STS shared tasks, achieved by MLMME (Calixto et al., 2017), paraphrastic sentence embeddings (Wieting et al., 2017), visual semantic embeddings (Kiros et al., 2015), and order embeddings (Vendrov et al., 2016). The shared task baseline is computed based on word overlap and is high for both the 2014 and the 2015 dataset, indicating that there is substantial lexical overlap between the STS image description datasets. Our models outperform both the baseline system and the best system submitted to the shared task. For the 2012 video paraphrase corpus, our multilingual methods performed better than the monolingual methods showing that similarity across paraphrases can be learned using multilingual signals. Similarly, Wieting et al. (2017) have reported to learn better paraphrastic sentence embeddings with multilingual signals. Overall, we observe that models learned using the asymmetric scoring function outperform the state-of-theart on these datasets, suggesting that multilingual S1 S2 GT Pred Black bird standing on concrete.
Blue bird standing on green grass.  Table 5: Example sentences with gold-standard semantic textual similarity score and the predicted score using our best performing PARALLEL-ASYM model.
sharing is beneficial. Although the task has nothing to do German, because our models can make use of datasets from different languages, we were able to train on significantly larger training dataset of approximately 145k descriptions. Calixto et al. (2017) also train on a larger dataset like ours, but could not exploit this to their advantage. In Table 5 we present the example sentences with the highest and lowest difference between gold-standard and predicted semantic textual similarity scores using our best performing PARALLEL-ASYM model.

Conclusions
We proposed a new model that jointly learns multilingual multimodal representations using the image as a pivot between languages. We introduced new objective functions that can exploit similarities between images and descriptions across languages. We obtained state-of-the-art results on two tasks: image-description ranking and semantic textual similarity. Our results suggest that exploiting multilingual and multimodal resources can help in learning better semantic representations.