Bridge Correlational Neural Networks for Multilingual Multimodal Representation Learning

Recently there has been a lot of interest in learning common representations for multiple views of data. Typically, such common representations are learned using a parallel corpus between the two views (say, 1M images and their English captions). In this work, we address a real-world scenario where no direct parallel data is available between two views of interest (say, $V_1$ and $V_2$) but parallel data is available between each of these views and a pivot view ($V_3$). We propose a model for learning a common representation for $V_1$, $V_2$ and $V_3$ using only the parallel data available between $V_1V_3$ and $V_2V_3$. The proposed model is generic and even works when there are $n$ views of interest and only one pivot view which acts as a bridge between them. There are two specific downstream applications that we focus on (i) transfer learning between languages $L_1$,$L_2$,...,$L_n$ using a pivot language $L$ and (ii) cross modal access between images and a language $L_1$ using a pivot language $L_2$. Our model achieves state-of-the-art performance in multilingual document classification on the publicly available multilingual TED corpus and promising results in multilingual multimodal retrieval on a new dataset created and released as a part of this work.


Introduction
The proliferation of multilingual and multimodal content online has ensured that multiple views of the same data exist. For example, it is common to find the same article published online in multiple languages, as with multilingual news articles, multilingual Wikipedia articles, etc. Such multiple views can even belong to different modalities. For example, images and their textual descriptions are two views of the same entity. Similarly, audio, video and subtitles of a movie are multiple views of the same entity.
Learning common representations for such multiple views of data will help in several downstream applications. For example, learning a common representation for images and their textual descriptions could help in finding images which match a given textual description. Further, such common representations can also facilitate transfer learning between views. For example, a document classifier trained on one language (view) can be used to classify documents in another language by representing documents of both languages in a common subspace.
Existing approaches to common representation learning (Ngiam et al., 2011;Klementiev et al., 2012;Chandar et al., 2013;Chandar et al., 2014;Andrew et al., 2013;Wang et al., 2015) except (Hermann and Blunsom, 2014b) typically require parallel data between all views. However, in many real-world scenarios such parallel data may not be available. For example, while there are many publicly available datasets containing images and their corresponding English captions, it is very hard to find datasets containing images and their corresponding captions in Russian, Dutch, Hindi, Urdu, etc. In this work, we are interested in addressing such scenarios. More specifically, we consider scenarios where we have n different views but parallel data is available only between each of these views and a pivot view. In particular, there is no parallel data available between the non-pivot views.
To this end, we propose Bridge Correlational Neural Networks (Bridge CorrNets) which learn aligned representations across multiple views using a pivot view. We build on the work of (Chandar et al., 2016) but unlike their model, which only addresses scenarios where direct parallel data is available between two views, our model can work for n(≥2) views even when no parallel data is available between all of them. Our model only requires parallel data between each of these n views and a pivot view. During training, our model maximizes the correlation between the representations of the pivot view and each of the n views. Intuitively, the pivot view ensures that similar entities across different views get mapped close to each other since the model would learn to map each of them close to the corresponding entity in the pivot view.
We evaluate our approach using two downstream applications. First, we employ our model to facilitate transfer learning between multiple languages using English as the pivot language. For this, we do an extensive evaluation using 110 source-target language pairs and show that we outperform the current state-of-the-art approach (Hermann and Blunsom, 2014b). Second, we employ our model to enable cross modal access between images and French/German captions using English as the pivot view. For this, we created a test dataset consisting of images and their captions in French and German in addition to the English captions which were publicly available. To the best of our knowledge, this task of retrieving images given French/German captions (and vice versa) without direct parallel training data between them has not been addressed in the past. Even on this task we report promising results. Code and data used in this paper can be downloaded from http://sarathchandar.in/bridge-corrnet.

Related Work
Canonical Correlation Analysis (CCA) and its variants (Hotelling, 1936;Vinod, 1976;Nielsen et al., 1998;Cruz-Cano and Lee, 2014;Akaho, 2001) are the most commonly used methods for learning a common representation for two views. However, most of these models generally work with two views only. Even though there are multi-view generalizations of CCA (Tenenhaus and Tenenhaus, 2011;Luo et al., 2015), their computational complexity makes them unsuitable for larger data sizes.
Another class of algorithms for multiview learning is based on Neural Networks. One of the earliest neural network based models for learning common representations was proposed in (Hsieh, 2000). Recently, there has been a renewed interest in this field and several neural network based models have been proposed, for example, Multimodal Autoencoder (Ngiam et al., 2011), Deep Canonically Correlated Autoencoder (Wang et al., 2015), Deep CCA (Andrew et al., 2013) and Correlational Neural Networks (CorrNet) (Chandar et al., 2016). CorrNet performs better than most of the above mentioned methods and we build on it as discussed in the next section.
One of the tasks that we address in this work is multilingual representation learning where the aim is to learn aligned representations for words across languages. Some notable neural network based approaches here include the works of (Klementiev et al., 2012;Zou et al., 2013;Mikolov et al., 2013;Hermann and Blunsom, 2014b;Hermann and Blunsom, 2014a;Chandar et al., 2014;Soyer et al., 2015;Gouws et al., 2015). However, except for (Hermann and Blunsom, 2014a;Hermann and Blunsom, 2014b), none of these other works handle the case when parallel data is not available between all languages. Our model addresses this issue and outperforms the model of Hermann and Blunsom (2014b).
The task of cross modal access between images and text addressed in this work comes under Multimodal Representation Learning where each view belongs to a different modality. Ngiam et al. (2011) proposed an autoencoder based solution to learning common representation for audio and video. Srivastava and Salakhutdinov (2014) extended this idea to RBMs and learned common representations for image and text. Other solutions for image/text representation learning include (Zheng et al., 2014a;Zheng et al., 2014b;Socher et al., 2014). All these approaches require parallel data between the two views and do not address multimodal, multilingual learning in situations where parallel data is available only between different views and a pivot view.
In the past, pivot/bridge languages have been used to facilitate MT (for example, (Wu and Wang, 2007;Cohn and Lapata, 2007;Utiyama and Isahara, 2007;Nakov and Ng, 2009)), transitive CLIR (Ballesteros, 2000;Lehtokangas et al., 2008), transliteration and transliteration mining (Khapra et al., 2010a;Khapra et al., 2010b;Zhang et al., 2011). None of these works use neural networks but it is important to mention them here because they use the concept of a pivot language (view) which is central to our work.

Bridge Correlational Neural Network
In this section, we describe Bridge CorrNet which is an extension of the CorrNet model proposed by (Chandar et al., 2016). They address the problem of learning common representations between two views when parallel data is available between them. We propose an extension to their model which simultaneously learns a common representation for M views when parallel data is available only between one pivot view and the remaining M − 1 views.
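Let these views be denoted by $V_1, V_2, ..., V_M$ and let $d_1, d_2, ..., d_M$ be their respective dimensionalities. Let the training data be $Z = \{z^i\}_{i=1}^{N}$ where each training instance contains only two views, i.e., $z^i = (v^i_j, v^i_M)$ where $j \in \{1, 2, ..., M-1\}$ and $M$ is a pivot view. To be more clear, the training data contains $N_1$ instances for which $(v^i_1, v^i_M)$ are available, $N_2$ instances for which $(v^i_2, v^i_M)$ are available, and so on up to $N_{M-1}$ instances for which $(v^i_{M-1}, v^i_M)$ are available (such that $N_1 + N_2 + ... + N_{M-1} = N$). We denote each of these disjoint pairwise training sets by $Z_1$, $Z_2$ to $Z_{M-1}$ such that $Z$ is the union of all these sets.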
As an illustration, consider the case when English, French and German texts are the three views of interest with English as the pivot view. As training data, we have $N_1$ instances containing English and their corresponding French texts and $N_2$ instances containing English and their corresponding German texts. We are then interested in learning a common representation for English, French and German even though we do not have any training instance containing French and their corresponding German texts.
Bridge CorrNet uses an encoder-decoder architecture with a correlation based regularizer to achieve this. It contains one encoder-decoder pair for each of the M views. For each view $V_j$, we have,
$$h_{V_j}(v_j) = f(W_j v_j + b) \quad (1)$$
where f is any non-linear function such as sigmoid or tanh, $W_j \in \mathbb{R}^{k \times d_j}$ is the encoder matrix for view $V_j$, and $b \in \mathbb{R}^{k}$ is the common bias shared by all the encoders. We also compute a hidden representation for the concatenated training instance $z = (v_j, v_M)$ using the following encoder function:
$$h_Z(z) = f(W_j v_j + W_M v_M + b) \quad (2)$$
In the remainder of this paper, whenever we drop the subscript for the encoder, the encoder is determined by its argument.
Our model also has a decoder corresponding to each view as follows:
$$g_{V_j}(h) = p(W'_j h + c_j) \quad (3)$$
where p can be any activation function, $W'_j \in \mathbb{R}^{d_j \times k}$ is the decoder matrix for view $V_j$, and $c_j \in \mathbb{R}^{d_j}$ is the decoder bias for view $V_j$. We also define g(h) as simply the concatenation of $[g_{V_j}(h), g_{V_M}(h)]$. In effect, $h_{V_j}(.)$ encodes the input $v_j$ into a hidden representation h and then $g_{V_j}(.)$ tries to decode/reconstruct $v_j$ from this hidden representation h. Note that h can be computed using $h(v_j)$ or $h(v_M)$. The decoder can then be trained to decode/reconstruct both $v_j$ and $v_M$ given a hidden representation computed using any one of them. More formally, we train Bridge CorrNet by minimizing the following objective function:
$$J_Z(\theta) = \sum_{i=1}^{N} L(z^i, g(h(z^i))) + \sum_{i=1}^{N} L(z^i, g(h(v^i_{l(i)}))) + \sum_{i=1}^{N} L(z^i, g(h(v^i_M))) - \lambda \, corr(h(V_{l(i)}), h(V_M)) \quad (4)$$
where $l(i) = j$ if $z^i \in Z_j$ and the correlation term corr is defined as follows:
$$corr(h(X), h(Y)) = \frac{\sum_{i=1}^{N} (h(x_i) - \overline{h(X)})(h(y_i) - \overline{h(Y)})}{\sqrt{\sum_{i=1}^{N} (h(x_i) - \overline{h(X)})^2 \sum_{i=1}^{N} (h(y_i) - \overline{h(Y)})^2}} \quad (5)$$
Note that $g(h(z^i))$ is the reconstruction of the input $z^i$ after passing through the encoder and decoder. L is a loss function which captures the error in this reconstruction, λ is the scaling parameter to scale the last term with respect to the remaining terms, $\overline{h(X)}$ is the mean vector for the hidden representations of the first view and $\overline{h(Y)}$ is the mean vector for the hidden representations of the second view.
We now explain the intuition behind each term in the objective function. The first term captures the error in reconstructing the concatenated input $z^i$ from itself. The second term captures the error in reconstructing both views given the non-pivot view, $v^i_{l(i)}$. The third term captures the error in reconstructing both views given the pivot view, $v^i_M$. Minimizing the second and third terms ensures that both the views can be predicted from any one view. Finally, the correlation term ensures that the network learns correlated common representations for all views.
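To make the roles of the four terms concrete, the following is a minimal NumPy sketch of the objective evaluated on one mini-batch drawn from a single pairwise set. It is an illustration under stated assumptions rather than the released implementation: sigmoid is used for both f and p, squared error stands in for the loss L, the correlation is computed per hidden unit and summed, and all function and argument names (corr, bridge_corrnet_loss, Wj_dec, WM_dec, and so on) are hypothetical.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def corr(hx, hy, eps=1e-8):
    """Correlation between two batches of hidden vectors (rows = instances).

    Assumption: computed per hidden unit and summed over units.
    """
    cx = hx - hx.mean(axis=0)
    cy = hy - hy.mean(axis=0)
    num = (cx * cy).sum(axis=0)
    den = np.sqrt((cx ** 2).sum(axis=0) * (cy ** 2).sum(axis=0)) + eps
    return (num / den).sum()

def bridge_corrnet_loss(vj, vM, Wj, WM, Wj_dec, WM_dec, b, cj, cM, lam):
    """Objective for one mini-batch of (non-pivot view, pivot view) pairs.

    vj: (n, d_j), vM: (n, d_M); encoders Wj: (k, d_j), WM: (k, d_M);
    decoders Wj_dec: (d_j, k), WM_dec: (d_M, k); shared encoder bias b: (k,).
    """
    def decode(h):
        # g(h): concatenation of the reconstructions of both views
        return np.hstack([sigmoid(h @ Wj_dec.T + cj),
                          sigmoid(h @ WM_dec.T + cM)])

    def squared_error(target, recon):
        # Assumed choice for the reconstruction loss L
        return ((target - recon) ** 2).sum()

    z = np.hstack([vj, vM])                    # concatenated input
    h_j = sigmoid(vj @ Wj.T + b)               # hidden from non-pivot view only
    h_M = sigmoid(vM @ WM.T + b)               # hidden from pivot view only
    h_z = sigmoid(vj @ Wj.T + vM @ WM.T + b)   # hidden from both views

    recon = (squared_error(z, decode(h_z))     # first term
             + squared_error(z, decode(h_j))   # second term
             + squared_error(z, decode(h_M)))  # third term
    return recon - lam * corr(h_j, h_M)        # correlation term
```

In this sketch the pivot encoder and decoder (WM, WM_dec, cM) and the bias b would be shared across all pairwise sets, while Wj, Wj_dec and cj are specific to the non-pivot view of the current mini-batch.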
Our model can be viewed as a generalization of the two-view CorrNet model proposed in (Chandar et al., 2016). By learning joint representations for multiple views using disjoint training sets $Z_1$, $Z_2$ to $Z_{M-1}$ it eliminates the need for ${}^{n}C_{2}$ pair-wise parallel datasets between all views of interest. The pivot view acts as a bridge and ensures that similar entities across different views get mapped close to each other since all of them would be close to the corresponding entity in the pivot view.
Note that unlike the objective function of CorrNet (Chandar et al., 2016), the objective function of Equation 4 is a dynamic objective function which changes with each training instance. In other words, $l(i) \in \{1, 2, ..., M-1\}$ varies for each $i \in \{1, 2, ..., N\}$. For efficient implementation, we construct mini-batches where each mini-batch comes from only one of the sets $Z_1$ to $Z_{M-1}$. We randomly shuffle these mini-batches and use the corresponding objective function for each mini-batch.
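The mini-batch construction described above can be sketched as follows. This is a minimal illustration, not the released code: the function name make_minibatches, the dictionary layout of the pairwise sets and the seed argument are all assumptions introduced here.

```python
import random

def make_minibatches(pairwise_sets, batch_size, seed=0):
    """Build shuffled mini-batches, each drawn from a single pairwise set.

    pairwise_sets: dict mapping a non-pivot view index j to a list of
    (v_j, v_M) training instances (hypothetical layout).
    Returns a list of (j, batch) tuples so that the training loop knows
    which encoder/decoder pair and objective to use for each batch.
    """
    rng = random.Random(seed)
    batches = []
    for j, instances in pairwise_sets.items():
        order = list(instances)
        rng.shuffle(order)                      # shuffle within the set
        for start in range(0, len(order), batch_size):
            batches.append((j, order[start:start + batch_size]))
    rng.shuffle(batches)                        # interleave the sets
    return batches
```

Keeping the view index alongside each batch means the per-batch objective only ever involves one non-pivot view and the pivot view.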
As a side note, we would like to mention that in addition to $Z_1$, $Z_2$ to $Z_{M-1}$ as defined earlier, if additional parallel data is available between some of the non-pivot views then the objective function can be suitably modified to use this parallel data to further improve the learning. However, this is not the focus of this work and we leave it as possible future work.

Datasets
In this section, we describe the two datasets that we used for our experiments.

Multilingual TED corpus
Hermann and Blunsom (2014b) provide a multilingual corpus based on the TED corpus for IWSLT 2013 (Cettolo et al., 2012). It contains English transcriptions of several talks from the TED conference and their translations in multiple languages. We use the parallel data between English and other languages for training Bridge CorrNet (English, thus, acts as the pivot language). Hermann and Blunsom (2014b) also propose a multilingual document classification task using this corpus. The idea is to use the keywords associated with each talk (document) as class labels and then train a classifier to predict these classes. There are one or more such keywords associated with each talk but only the 15 most frequent keywords across all documents are considered as class labels. We used the same pre-processed splits as provided by (Hermann and Blunsom, 2014b). The training corpus consists of a total of 12,078 parallel documents distributed across 12 language pairs.

Multilingual Image Caption dataset
The MSCOCO dataset contains images and their English captions. On average there are 5 captions per image. The standard train/valid/test splits for this dataset are also available online. However, the reference captions for the images in the test split are not provided. Since we need such reference captions for evaluation, we create a new train/valid/test split of this dataset. Specifically, we take 80K images from the standard train split and 40K images from the standard valid split. We then randomly split the merged 120K images into train (118K), validation (1K) and test (1K) sets.
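The re-splitting step can be sketched in a few lines. This is only an illustration of merging two pools of image ids and drawing a random split; the function name resplit, its arguments and the fixed seed are assumptions, not the procedure used to produce the released splits.

```python
import random

def resplit(train_ids, valid_ids, n_valid=1000, n_test=1000, seed=0):
    """Merge two id pools and randomly re-split them into train/valid/test."""
    merged = list(train_ids) + list(valid_ids)
    random.Random(seed).shuffle(merged)
    test = merged[:n_test]
    valid = merged[n_test:n_test + n_valid]
    train = merged[n_test + n_valid:]   # everything that is left
    return train, valid, test
```

With 80K and 40K input ids and the default sizes, this yields 118K training ids, 1K validation ids and 1K test ids.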
We then create a multilingual version of the test data by collecting French and German translations for all the 5 captions for each image in the test set. We use crowdsourcing to do this. We used the CrowdFlower platform and solicited one French and one German translation for each of the 5000 captions using native speakers. We got each translation verified by 3 annotators. We restricted the geographical location of annotators based on the target language. We found that roughly 70% of the French translations and 60% of the German translations were marked as correct by a majority of the verifiers. On further inspection with the help of in-house annotators, we found that the errors were mainly syntactic and the content words were translated correctly in most of the cases. Since none of the approaches described in this work rely on syntax, we decided to use all the 5000 translations as test data. This multilingual image caption test data (MIC test data) will be made publicly available (http://sarathchandar.in/bridge-corrnet) and will hopefully assist further research in this area.

Experiment 1: Transfer learning using a pivot language
From the TED corpus described earlier, we consider English transcriptions and their translations in 11 languages, viz., Arabic, German, Spanish, French, Italian, Dutch, Polish, Portuguese (Brazilian), Romanian, Russian and Turkish. Following the setup of Hermann and Blunsom (2014b), we consider the task of cross language learning between each of the ${}^{11}C_{2}$ non-English language pairs. The task is to classify documents in a language when no labeled training data is available in this language but training data is available in another language. This involves the following steps:
1. Train classifier: Consider one language as the source language and the remaining 10 languages as target languages. Train a document classifier using the labeled data of the source language, where each training document is represented using the hidden representation computed using a trained Bridge CorrNet model. As in (Hermann and Blunsom, 2014b) we used an averaged perceptron trained for 10 epochs as the classifier for all our experiments. The train split provided by (Hermann and Blunsom, 2014b) is used for training.
2. Cross language classification: For every target language, compute a hidden representation for every document in its test set using Bridge CorrNet. Now use the classifier trained in the previous step to classify this document. The test split provided by (Hermann and Blunsom, 2014b) is used for testing.
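The two steps above can be illustrated with a small averaged perceptron operating on hidden representations. This is a sketch under assumptions: it treats each keyword as a separate binary problem with labels in {-1, +1}, and the function names train_averaged_perceptron and predict are hypothetical; the classifier actually used follows (Hermann and Blunsom, 2014b) and may differ in detail.

```python
import numpy as np

def train_averaged_perceptron(X, y, epochs=10):
    """Train a binary averaged perceptron.

    X: (n, k) array of hidden representations of source-language documents.
    y: (n,) array of labels in {-1, +1} for one keyword (assumed setup).
    Returns the weights and bias averaged over all training steps.
    """
    n, k = X.shape
    w, b = np.zeros(k), 0.0
    w_sum, b_sum = np.zeros(k), 0.0
    for _ in range(epochs):
        for i in range(n):
            if y[i] * (X[i] @ w + b) <= 0:   # mistake: update the model
                w = w + y[i] * X[i]
                b = b + y[i]
            w_sum += w                        # accumulate for averaging
            b_sum += b
    steps = epochs * n
    return w_sum / steps, b_sum / steps

def predict(X, w, b):
    """Classify representations; X may come from any language's encoder."""
    return np.where(X @ w + b > 0, 1, -1)
```

Because source and target documents are encoded into the same space, the weights learned on source-language representations can be applied unchanged to target-language representations in the second step.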

Training and tuning Bridge CorrNet
For the above process to work, we first need to train Bridge CorrNet so that it can then be used for computing a common hidden representation for documents in different languages. For training Bridge CorrNet, we treat English as the pivot language (view) and construct parallel training sets $Z_1$ to $Z_{11}$. Every instance in $Z_1$ contains the English and Arabic view of the same talk (document). Similarly, every instance in $Z_2$ contains the English and German view of the same talk (document) and so on. For every language, we first construct a vocabulary containing all words appearing more than 5 times in the corpus (all talks) of that language. We then use this vocabulary to construct a bag-of-words representation for each document. The size of the vocabulary ($|V|$) for different languages varied from 31213 to 60326 words. To be more clear, $v_1 = v_{arabic} \in \mathbb{R}^{|V|_{arabic}}$, $v_2 = v_{german} \in \mathbb{R}^{|V|_{german}}$ and so on. We train our model for 10 epochs using the above training data $Z = \{Z_1, Z_2, ..., Z_{11}\}$. We use hidden representations of size D = 128, as in (Hermann and Blunsom, 2014b). Further, we used stochastic gradient descent with mini-batches of size 20. Each mini-batch contains data from only one of the $Z_i$s. We get a stochastic estimate for the correlation term in the objective function using this mini-batch. The hyperparameter λ was tuned to each task using a training/validation split for the source language and using the performance on the validation set of an averaged perceptron trained on the training set (notice that this corresponds to a monolingual classification experiment, since the general assumption is that no labeled data is available in the target language).
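The vocabulary and bag-of-words construction described above can be sketched as follows. The sketch assumes documents are already tokenized and uses raw counts as vector entries; whether counts or binary indicators were used is not stated here, and the function names build_vocab and bag_of_words are hypothetical.

```python
from collections import Counter

def build_vocab(documents, min_count=5):
    """Map each word seen more than min_count times to an index.

    documents: list of token lists for one language (assumed pre-tokenized).
    """
    counts = Counter(tok for doc in documents for tok in doc)
    kept = sorted(w for w, c in counts.items() if c > min_count)
    return {w: i for i, w in enumerate(kept)}

def bag_of_words(doc, vocab):
    """Return a count vector of length |V|; out-of-vocabulary words are dropped."""
    vec = [0] * len(vocab)
    for tok in doc:
        idx = vocab.get(tok)
        if idx is not None:
            vec[idx] += 1
    return vec
```

One vocabulary is built per language, so the resulting vectors for different languages have different dimensionalities, matching the per-view encoder matrices.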

Results
We now present the results of our cross language classification task in Table 1. Each row corresponds to a source language and each column corresponds to a target language. We report the average F1-scores over all the 15 classes. We compare our results with the best results reported in (Hermann and Blunsom, 2014b) (see Table 2). Out of the 110 experiments, our model outperforms the model of (Hermann and Blunsom, 2014b) in 107 experiments. This suggests that our model efficiently exploits the pivot language to facilitate cross language learning between other languages.
Finally, we present the results for a monolingual classification task in Table 3. The idea here is to see if learning common representations for multiple views can also help in improving the performance of a task involving only one view. Hermann and Blunsom (2014b) argue that a Naive Bayes (NB) classifier trained using a bag-of-words representation of the documents is a very strong baseline. In fact, a classifier trained on document representations learned using their model does not beat a NB classifier for the task of monolingual classification. Rows 2 to 5 in Table 3 show the different settings tried by them (we refer the reader to (Hermann and Blunsom, 2014b) for a detailed description of these settings). On the other hand, our model is able to beat NB for 5/11 languages. Further, for 4 other languages (German, French, Romanian, Russian) its performance is only marginally poorer than that of NB.

Experiment 2: Cross modal access using a pivot language
In this experiment, we are interested in retrieving images given their captions in French (or German) and vice versa. However, for training we do not have any parallel data containing images and their French (or German) captions. Instead, we have the following datasets: (i) a dataset $Z_1$ containing images and their English captions and (ii) a dataset $Z_2$ containing English documents and their parallel French (or German) documents. For $Z_1$, we use the training split of MSCOCO dataset which contains 118K images and their English captions (see Section 4.2). For $Z_2$, we use the English-French (or German) parallel documents from the train split of the TED corpus (see Section 4.1). We use English as the pivot language and train Bridge CorrNet using $Z = \{Z_1, Z_2\}$ to learn common representations for images, English text and French (or German) text. For text, we use a bag-of-words representation and for images, we use the 4096-dimensional (fc6) representation obtained from a pretrained ConvNet (BVLC Reference CaffeNet (Jia et al., 2014)).
We learn hidden representations of size D = 200 by training Bridge CorrNet for 20 epochs using stochastic gradient descent with mini-batches of size 20. Each mini-batch contains data from only one of the $Z_i$s. For the task of retrieving captions given an image, we consider the 1000 images in our test set (see Section 4.2) as queries. The 5000 French (or German) captions corresponding to these images (5 per image) are considered as documents. The task is then to retrieve the relevant captions for each image. We represent all the captions and images in the common space as computed using Bridge CorrNet. For a given query, we rank all the captions based on the Euclidean distance between the representation of the image and the caption. For the task of retrieving images given a caption, we simply reverse the role of the captions and images. In other words, each of the 5000 captions is treated as a query and the 1000 images are treated as documents. λ was tuned to each task using a training/validation split. For the task of retrieving French/German captions given an image, λ was tuned using the performance on the validation set for retrieving French (or German) sentences for a given English sentence. For the other task, λ was tuned using the performance on the validation set for retrieving images, given English captions. We do not use any image-French/German parallel data for tuning the hyperparameters.
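The ranking step can be sketched with NumPy as follows, together with one way of scoring the resulting rankings with a recall@k style measure. The definition of recall@k used here (a query counts as a hit if at least one of its relevant items appears among the top k) is an assumption, and the names rank_by_distance and recall_at_k are hypothetical.

```python
import numpy as np

def rank_by_distance(query, docs):
    """Return document indices sorted by Euclidean distance to the query.

    query: (k,) common-space representation; docs: (n_docs, k).
    """
    dists = np.linalg.norm(docs - query, axis=1)
    return np.argsort(dists, kind="stable")   # closest first

def recall_at_k(queries, docs, relevant, k):
    """Fraction of queries with at least one relevant item in the top k.

    relevant[i] is the set of document indices that are correct for query i.
    """
    hits = 0
    for i, q in enumerate(queries):
        top_k = rank_by_distance(q, docs)[:k]
        if relevant[i] & set(top_k.tolist()):
            hits += 1
    return hits / len(queries)
```

For image-to-caption retrieval, queries would hold the image representations and docs the caption representations; swapping the two arguments gives caption-to-image retrieval.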
We use recall@k as the performance metric and compare the following methods in Table 4:
1. En-Image CorrNet: This is the CorrNet model trained using only $Z_1$ as defined earlier in this section. The task is to retrieve English captions for a given image (or vice versa). This gives us an idea about the performance we could expect if direct parallel data is available between images and their captions in some language. We used the publicly available implementation of CorrNet provided by (Chandar et al., 2016).
2. Bridge CorrNet: This is the Bridge CorrNet model trained using $Z_1$ and $Z_2$ as defined earlier in this section. The task is to retrieve French (or German) captions for a given image (or vice versa).
3. Bridge MAE: The Multimodal Autoencoder (MAE) proposed by (Ngiam et al., 2011) was the only competing model which was easily extendable to the bridge case. We train their model using $Z_1$ and $Z_2$ to minimize a suitably modified objective function. We then use the representations learned to retrieve French (or German) captions for a given image (or vice versa).
4. 2-CorrNet: Here, we train two individual CorrNets using $Z_1$ and $Z_2$ respectively. For the task of retrieving images given a French (or German) caption we first find its nearest English caption using the Fr-En (or De-En) CorrNet. We then use this English caption to retrieve images using the En-Image CorrNet. Similarly, for retrieving captions given an image we use the En-Image CorrNet followed by the En-Fr (or En-De) CorrNet.
5. CorrNet + MT: Here, we train an En-Image CorrNet using $Z_1$ and an Fr/De-En MT system using $Z_2$. For the task of retrieving images given a French (or German) caption we translate the caption to English using the MT system. We then use this English caption to retrieve images using the En-Image CorrNet. For retrieving captions given images, we first translate all the 5000 French (or German) captions to English. We then embed these English translations (documents) and images (queries) in the common space computed using Image-En CorrNet and do a retrieval as explained earlier.
6. Random: A random image is returned for the given caption (and vice versa).
Table 4: Performance of different models for image to caption (I to C) and caption to image (C to I) retrieval.
From Table 4, we observe that CorrNet + MT is a very strong competitor and gives the best results. The main reason for this is that over the years MT has matured enough for language pairs such as Fr-En and De-En and it can generate almost perfect translations for short sentences (such as captions). In fact, the results for this method are almost comparable to what we could have hoped for if we had direct parallel data between Fr-Images and De-Images (as approximated by the first row in the table which reports cross-modal retrieval results between En-Images using direct parallel data between them for training). However, we would like to argue that learning a joint embedding for multiple views instead of having multiple pairwise systems is a more elegant solution and merits further attention. Further, a "translation system" may not be available when we are dealing with modalities other than text (for example, there are no audio-to-video translation systems). In such cases, Bridge CorrNet could still be employed. In this context, the performance of Bridge CorrNet is promising and shows that a model which jointly learns representations for multiple views can perform better than methods which learn pair-wise common representations (2-CorrNet).

Qualitative Analysis
To get a qualitative feel for our model's performance, we refer the reader to Tables 5 and 6. The first row in Table 5 shows an image and its top-5 nearest German captions (based on Euclidean distance between their common representations). As per our parallel image caption test set, only the second and fourth captions actually correspond to this image. However, we observe that the first and fifth captions are also semantically closely related to the image. Both these captions talk about horses, grass or a water body (ocean), etc. Similarly, the last row in Table 5 shows an image and its top-5 nearest French captions. None of these captions actually correspond to the image as per our parallel image caption test set. However, the first, third and fourth captions are clearly semantically relevant to this image as all of them talk about baseball. Even the remaining two captions capture the concept of a sport and a racquet. We can make a similar observation from Table 6 where most of the top-5 retrieved images do not correspond to the French/German caption but are semantically very similar. It is indeed impressive that the model is able to capture such cross modal semantics between images and French/German even without any direct parallel data between them.

Conclusion
In this paper, we propose Bridge Correlational Neural Networks which can learn common representations for multiple views even when parallel data is available only between these views and a pivot view. Our method performs better than the existing state-of-the-art approaches on the cross language classification task and gives very promising results on the cross modal access task. We also release a new multilingual image caption benchmark (MIC benchmark) which will help in further research in this field.
Table 6: French and German queries and their top-5 nearest images based on representations learned using Bridge CorrNet. The first two queries are in German and the last two queries are in French. English translations are given in parentheses.