Bridging Languages through Images with Deep Partial Canonical Correlation Analysis

We present a deep neural network that leverages images to improve bilingual text embeddings. Relying on bilingual image tags and descriptions, our approach conditions text embedding induction on the shared visual information for both languages, producing highly correlated bilingual embeddings. In particular, we propose a novel model based on Partial Canonical Correlation Analysis (PCCA). While the original PCCA finds linear projections of two views in order to maximize their canonical correlation conditioned on a shared third variable, we introduce a non-linear Deep PCCA (DPCCA) model, and develop a new stochastic iterative algorithm for its optimization. We evaluate PCCA and DPCCA on multilingual word similarity and cross-lingual image description retrieval. Our models outperform a large variety of previous methods, despite not having access to any visual signal at test time.


Introduction
Research in multi-modal semantics deals with the grounding problem (Harnad, 1990), motivated by evidence that many semantic concepts, irrespective of the actual language, are grounded in the perceptual system (Barsalou and Wiemer-Hastings, 2005). In particular, recent studies have shown that performance on NLP tasks can be improved by joint modeling of text and vision, with multi-modal and perceptually enhanced representation learning outperforming purely textual representations (Feng and Lapata, 2010; Kiela and Bottou, 2014; Lazaridou et al., 2015).
These findings are not surprising, and can be explained by the fact that humans understand language not only through its words, but also through their visual/perceptual context. The ability to connect vision and language has also enabled new tasks which require both visual and language understanding, such as visual question answering (Antol et al., 2015; Fukui et al., 2016; Xu and Saenko, 2016), image-to-text and text-to-image retrieval (Kiros et al., 2014; Mao et al., 2014), image caption generation (Farhadi et al., 2010; Mao et al., 2015; Vinyals et al., 2015), and visual sense disambiguation (Gella et al., 2016).
While the main focus is still on monolingual settings, the fact that visual data can serve as a natural bridge between languages has sparked additional interest in multilingual multi-modal modeling. Such models induce bilingual multi-modal spaces based on multi-view learning (Gella et al., 2017; Rajendran et al., 2016).
In this work, we propose a novel effective approach for learning bilingual text embeddings conditioned on shared visual information. This additional perceptual modality bridges the gap between languages and reveals latent connections between concepts in the multilingual setup. The shared visual information in our work takes the form of images with word-level tags or sentence-level descriptions assigned in more than one language.
We propose a deep neural architecture termed Deep Partial Canonical Correlation Analysis (DPCCA), based on the Partial CCA (PCCA) method (Rao, 1969). To the best of our knowledge, PCCA has not been used in multilingual settings before. In short, PCCA is a variant of CCA which learns maximally correlated linear projections of two views (e.g., two language-specific "text-based views") conditioned on a shared third view (e.g., the "visual view"). We discuss the PCCA and DPCCA methods in §3 and show how they can be applied without access to the shared images at test time.
PCCA inherits one disadvantageous property from CCA: both methods compute estimates for covariance matrices based on all training data. This would prevent feasible training of their deep nonlinear variants, since deep neural nets (DNNs) are predominantly optimized via stochastic optimization algorithms. To resolve this major hindrance, we propose an effective optimization algorithm for DPCCA, inspired by the work of Wang et al. (2015b) on Deep CCA (DCCA) optimization.
We evaluate our DPCCA architecture on two semantic tasks: 1) multilingual word similarity and 2) cross-lingual image description retrieval. For the former, we construct and provide to the community a new Word-Image-Word (WIW) dataset containing bilingual lexicons for three languages with shared images for 5K+ concepts. WIW is used as training data for word similarity experiments, while evaluation is conducted on the standard multilingual SimLex-999 dataset (Hill et al., 2015;Leviant and Reichart, 2015).
The results reveal stable improvements over a large space of non-deep and deep CCA-style baselines in both tasks. Most importantly, 1) PCCA is overall better than other methods which do not use the additional perceptual view; 2) DPCCA outperforms PCCA, indicating the importance of nonlinear transformations modeled through DNNs; 3) DPCCA outscores DCCA, again verifying the importance of conditioning multilingual text embedding induction on the shared visual view; and 4) DPCCA outperforms two recent multi-modal bilingual models which also leverage visual information (Gella et al., 2017;Rajendran et al., 2016).

Related Work
This work is related to two research threads: 1) multi-modal models that combine vision and language, with a focus on multilingual settings; 2) correlational multi-view models based on CCA which learn a shared vector space for multiple views.

Multi-Modal Modeling in Multilingual Settings
Research in cognitive science suggests that human meaning representations are grounded in our perceptual system and sensori-motor experience (Harnad, 1990; Lakoff and Johnson, 1999; Louwerse, 2011). Visual context serves as a useful cross-lingual grounding signal (Bruni et al., 2014; Glavaš et al., 2017) due to its language invariance, even enabling the induction of word-level bilingual semantic spaces solely through tagged images obtained from the Web (Bergsma and Van Durme, 2011; Kiela et al., 2015). Vulić et al. (2016) combine text embeddings with visual features via simple concatenation and averaging to obtain bilingual multi-modal representations, with noted improvements over text-only embeddings on word similarity and bilingual lexicon extraction. However, similar to the monolingual model of Kiela and Bottou (2014), their models lack a training phase and require the visual signal at test time.
Recent work from Gella et al. (2017) exploits visual content as a bridge between multiple languages by optimizing a contrastive loss function. Furthermore, Rajendran et al. (2016) propose to use a pivot representation in multimodal multilingual setups, with English representations serving as the pivot. While these works learn shared multimodal multilingual vector spaces, we demonstrate improved performance with our models (see §7).
Finally, although not directly comparable, recent work in neural machine translation has constructed models that can translate image descriptions by additionally relying on visual features of the provided image (Elliott et al., 2015; Hitschler et al., 2016; Huang et al., 2016; Nakayama and Nishida, 2017, inter alia).
Correlational Models CCA-based techniques support multiple views on related data: e.g., when coupled with a bilingual dictionary, input monolingual word embeddings for two different languages can be seen as two views of the same latent semantic signal. Recently, CCA-based models for bilingual text embedding induction were proposed. These models rely on the basic CCA model (Faruqui and Dyer, 2014), its deep variant, and a CCA extension which supports more than two views (Funaki and Nakayama, 2015; Rastogi et al., 2015). In this work, we propose to use (D)PCCA, which organically supports our setup: it conditions the two (textual) views on a shared (visual) view.
CCA-based methods (including PCCA) require the estimation of covariance matrices over all training data (Kessy et al., 2017). This hinders the use of DNNs with these models, as DNNs are typically trained via stochastic optimization over mini-batches on very large training sets. To address this limitation, various optimization methods for Deep CCA were proposed. Andrew et al. (2013) use L-BFGS (Byrd et al., 1995) over all training samples, while Yan and Mikolajczyk (2015) train with large batches. However, these methods suffer from high memory complexity and unstable numerical computations. Wang et al. (2015b) recently proposed a stochastic approach for CCA and DCCA which copes well with both small and large batch sizes while preserving high model performance. They use orthogonal iterations to estimate a moving average of the covariance matrices, which improves memory consumption. We therefore base our novel optimization algorithm for DPCCA on this approach.

Methodology: Deep Partial CCA
Given two image descriptions x and y in two languages and an image z that they refer to, the task is to learn a shared bilingual space such that similar descriptions obtain similar representations in the induced space. The image z serves as a shared third view on the textual data during training. The representation model is then utilized in cross-lingual and monolingual tasks. In this paper we focus on the more realistic scenario where no relevant visual content is available at test time. For this goal we propose a novel Deep Partial CCA (DPCCA) framework.
In what follows, we first review the CCA model and its deep variant: DCCA. We then introduce our DPCCA architecture, and describe our new stochastic optimization algorithm for DPCCA.

CCA and Deep CCA
DCCA (Andrew et al., 2013) extends CCA by learning non-linear (instead of linear) transformations of the features contained in the input matrices X ∈ R^{D_x×N} and Y ∈ R^{D_y×N}, where D_x and D_y are the input vector dimensionalities, and N is the number of input items. Since CCA is a special case of the non-linear DCCA (see below), we briefly outline the more general DCCA model here.
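As background, plain linear CCA (the special case recovered when f and g are identity maps) can be sketched in NumPy. This is an illustrative sketch with function names of our own choosing, not the authors' implementation:

```python
import numpy as np

def linear_cca(X, Y, L, eps=1e-8):
    """Linear CCA on column-wise data matrices X (Dx, N) and Y (Dy, N).

    Returns projections U (Dx, L), V (Dy, L) and the top-L canonical
    correlations, such that U^T X and V^T Y are maximally correlated
    under the whitening constraints U^T S_xx U = V^T S_yy V = I.
    """
    N = X.shape[1]
    Xc = X - X.mean(axis=1, keepdims=True)
    Yc = Y - Y.mean(axis=1, keepdims=True)
    Sxx = Xc @ Xc.T / N + eps * np.eye(X.shape[0])
    Syy = Yc @ Yc.T / N + eps * np.eye(Y.shape[0])
    Sxy = Xc @ Yc.T / N
    # Whitening matrices: Wx @ Sxx @ Wx.T = I (Cholesky-based).
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx))
    Wy = np.linalg.inv(np.linalg.cholesky(Syy))
    # SVD of the whitened cross-covariance yields the canonical directions.
    A, corrs, Bt = np.linalg.svd(Wx @ Sxy @ Wy.T)
    U = Wx.T @ A[:, :L]
    V = Wy.T @ Bt.T[:, :L]
    return U, V, corrs[:L]
```

DCCA replaces the identity maps with the DNNs f and g and optimizes their weights jointly with the projections.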
The DCCA architecture is illustrated in Figure 1a. Non-linear transformations are achieved through two DNNs, f : R^{D_x×N} → R^{D'_x×N} and g : R^{D_y×N} → R^{D'_y×N}, for X and Y, where D'_x and D'_y are the output dimensionalities. A final linear layer is added to resemble the linear CCA projection.
The goal is to project the features of X and Y into a shared L-dimensional space. U ∈ R^{D'_x×L} and V ∈ R^{D'_y×L} are projection matrices: they project the final outputs of the DNNs to the shared space, F(X) = U^⊤ f(X) and G(Y) = V^⊤ g(Y). W_f and V_g (the parameters of f and g) and the projection matrices are the model parameters. Formally, the DCCA objective can be written as:

max_{W_f, V_g, U, V} (1/N) tr( F(X) G(Y)^⊤ )  s.t.  Σ_FF = Σ_GG = I,    (1)

where Σ_FF = (1/N) F(X) F(X)^⊤ and Σ_GG = (1/N) G(Y) G(Y)^⊤ are the estimations of the autocovariance matrices of the outputs.³ Further, following Wang et al. (2015b), the optimal solution of Eq. (1) is equivalent to the optimal solution of the following:

min_{W_f, V_g, U, V} (1/N) ||F(X) − G(Y)||²_F  s.t.  Σ_FF = Σ_GG = I.    (2)

The main disadvantage of DCCA is its inability to support more than two views, and to learn conditioned on an additional shared view, which is why we introduce Deep Partial CCA.


New Model: Deep Partial CCA

Figure 1b illustrates the architecture of DPCCA. The training data now consists of triplets (x_i, y_i, z_i)_{i=1}^{N} from three views, forming the columns of X, Y and Z, where x_i ∈ R^{D_x}, y_i ∈ R^{D_y}, z_i ∈ R^{D_z} for i = 1, ..., N. The objective is to maximize the canonical correlation of the first two views X and Y conditioned on the shared third variable Z. Following Rao (1969)'s work on Partial CCA, we first consider two multivariate linear multiple regression models:

F(X) = A Z + F(X|Z),    (3)
G(Y) = B Z + G(Y|Z),    (4)

where A, B ∈ R^{L×D_z} are matrices of coefficients, and F(X|Z), G(Y|Z) ∈ R^{L×N} are normal random error matrices: the residuals. We then minimize the mean-squared error regression criterion:

min_{A,B} ||F(X) − A Z||²_F + ||G(Y) − B Z||²_F.    (5)

After obtaining the optimal solutions for the coefficients, Â and B̂, the residuals are as follows:

F(X|Z) = F(X) − Â Z,    G(Y|Z) = G(Y) − B̂ Z,    (6)

with Â = Σ_FZ Σ_ZZ^{−1} and B̂ = Σ_GZ Σ_ZZ^{−1}. The canonical correlation between the residual matrices F(X|Z) and G(Y|Z) is referred to as the partial canonical correlation. The Deep PCCA objective can be obtained by replacing F(X) and G(Y) with their residuals in Eq. (2):

min_{W_f, V_g, U, V} (1/N) ||F(X|Z) − G(Y|Z)||²_F  s.t.  Σ_FF|Z = Σ_GG|Z = I.    (7)

The computation of the conditional covariance matrix Σ_FF|Z can be formulated as follows:⁴

Σ_FF|Z = Σ_FF − Σ_FZ Σ_ZZ^{−1} Σ_ZF,    (8)

where Σ_FZ = (1/N) F(X) Z^⊤ and Σ_ZZ = (1/N) Z Z^⊤. A small value ε > 0 is added to the main diagonal of the covariance estimators for numerical stability.
The other conditional covariance matrix Σ_GG|Z is computed analogously, replacing F with G and X with Y.⁵ While the (D)PCCA objective is computed over the residuals, once the network is trained (using multilingual texts and corresponding images) we can compute the representations F(X) and G(Y) at test time without having access to images (see the network structure in Figure 1b). This heuristic enables the use of DPCCA in a real-life scenario in which images are unavailable at test time, and its encouraging results are demonstrated in §7.
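The residual computation at the heart of (D)PCCA can be sketched in NumPy as follows, with F and G standing for the (already projected) network outputs. This is an illustrative sketch under simplified centering and ε handling, not the authors' implementation:

```python
import numpy as np

def pcca_residuals(F, G, Z, eps=1e-6):
    """Regress the shared view Z out of F and G (columns are samples).

    F, G: (L, N) projected outputs of the two views; Z: (Dz, N) shared view.
    Returns the residuals F(X|Z), G(Y|Z) and the conditional covariance
    estimate S_FF|Z = S_FF - S_FZ S_ZZ^{-1} S_ZF (+ eps on the diagonal).
    Inputs are assumed centered.
    """
    N = F.shape[1]
    Szz = Z @ Z.T / N + eps * np.eye(Z.shape[0])
    Szz_inv = np.linalg.inv(Szz)
    Sfz = F @ Z.T / N
    Sgz = G @ Z.T / N
    # Least-squares regression coefficients (A_hat, B_hat in the text).
    A_hat = Sfz @ Szz_inv
    B_hat = Sgz @ Szz_inv
    F_res = F - A_hat @ Z
    G_res = G - B_hat @ Z
    Sff_z = F @ F.T / N - Sfz @ Szz_inv @ Sfz.T + eps * np.eye(F.shape[0])
    return F_res, G_res, Sff_z
```

By construction the residuals are (numerically) uncorrelated with the shared view Z, which is exactly the property the partial canonical correlation exploits.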

Model Variants
We consider two DPCCA variants: 1) in DPCCA Variant A, the shared view Z is kept fixed; 2) DPCCA Variant B also optimizes over Z, as illustrated in Figure 1b. Variant A may be seen as a special case of Variant B.⁶ Variant B learns a non-linear function H(Z) of the shared variable through a third DNN h, using the same architecture as f and g. U ∈ R^{D'_z×L} is the final linear layer of H, such that, overall, the additional parameters of the model are U_H = {U_h, U}. Instead of assuming a linear connection of F(X) and G(Y) to Z, as in Variant A, we now assume that the linear connection takes place with H(Z). This assumption changes Eq. (3) and Eq. (4) to:⁷

F(X) = A H(Z) + F(X|Z),    G(Y) = B H(Z) + G(Y|Z).

DPCCA: Optimization Algorithm
Training deep variants of CCA-style multi-view models is non-trivial, since the whitening constraints (i.e., the orthogonality of the covariance matrices) require estimation over the entire training set. To overcome this issue, Wang et al. (2015b) proposed a stochastic optimization algorithm for DCCA via non-linear orthogonal iterations (DCCA NOI). Relying on the solution for DCCA (§4.1), we develop a new optimization algorithm for DPCCA in §4.2.

Optimization of DCCA
The DCCA optimization from Wang et al. (2015b), fully provided in Algorithm 1, relies on three key steps. First, the estimation of the covariance matrices, in the form of Σ_FF^t at time t, is calculated by a moving average over the minibatches:

Σ_FF^t = ρ Σ_FF^{t−1} + (1 − ρ) (1/|b_t|) F(X_{b_t}) F(X_{b_t})^⊤,

where b_t is the minibatch at time t, X_{b_t} is the current input matrix at time t, and ρ ∈ [0, 1] controls the ratio between the overall covariance estimation and the covariance estimation of the current minibatch.⁸ This step eliminates the need to estimate the covariances over all training data, as well as the inherent bias when the estimate relies only on the current minibatch. Second, the DCCA NOI algorithm forces the whitening constraints to hold by performing an explicit matrix transformation of the form:

F̃(X_{b_t}) = (Σ_FF^t)^{−1/2} F(X_{b_t}),    G̃(Y_{b_t}) = (Σ_GG^t)^{−1/2} G(Y_{b_t}).

According to Horn et al. (1988), if ρ = 0, the whitening constraints hold exactly for the current minibatch. Finally, in order to optimize the DCCA objective (see Eq. (2)), the weights of the two DNNs are decoupled: i.e., the objective is disassembled into two separate mean-squared error objectives. Instead of trying to bring F(X_{b_t}) and G(Y_{b_t}) closer in one gradient descent step, two steps are performed iteratively: one view is held fixed while a gradient step is taken over the other, and vice versa. The final objective functions at each time step are:

min_{W_f} (1/|b_t|) ||F(X_{b_t}) − G̃(Y_{b_t})||²_F,    min_{V_g} (1/|b_t|) ||G(Y_{b_t}) − F̃(X_{b_t})||²_F,

where the whitened view (marked with a tilde) is kept fixed. Wang et al. (2015b) show that the projection matrices U and V converge to the exact solutions of CCA as t → ∞ when considering linear CCA.

Algorithm 1 The non-linear orthogonal iterations (NOI) algorithm for DCCA
Input: Data matrices X ∈ R^{D_x×N}, Y ∈ R^{D_y×N}, time constant ρ, learning rate η.
Initialization: Initialize the weights (W_F, V_G). Randomly choose a minibatch (X_{b_0}, Y_{b_0}) and initialize the covariances.
For t = 1, 2, . . .:
  Randomly choose a minibatch (X_{b_t}, Y_{b_t}).
  Update the covariances Σ_FF^t and Σ_GG^t by the moving average above.
  Whiten: compute F̃(X_{b_t}) and G̃(Y_{b_t}).
  Take a gradient step over W_F with G̃(Y_{b_t}) fixed, and over V_G with F̃(X_{b_t}) fixed.
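The moving-average covariance update and the explicit whitening step can be sketched as follows (illustrative NumPy, with function names of our own choosing):

```python
import numpy as np

def update_covariance(S_prev, F_batch, rho):
    """Moving-average minibatch estimate of the output covariance.

    S_prev: previous estimate; F_batch: (L, b) network outputs on the
    current minibatch; rho in [0, 1] weighs history against the batch.
    """
    b = F_batch.shape[1]
    return rho * S_prev + (1.0 - rho) * (F_batch @ F_batch.T) / b

def whiten(F_batch, S, eps=1e-8):
    """Apply the explicit whitening transform S^{-1/2} F so that the
    whitened outputs satisfy the orthogonality (whitening) constraint."""
    w, Q = np.linalg.eigh(S + eps * np.eye(S.shape[0]))
    S_inv_sqrt = Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T
    return S_inv_sqrt @ F_batch
```

With rho = 0 the estimate reduces to the current minibatch and the whitening constraint holds exactly for it, matching the remark above.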

Optimization of DPCCA
Our DPCCA optimization is based on the DCCA NOI algorithm, with several adjustments. Besides the sample covariances Σ_FF and Σ_GG, when calculating the conditional variables F(X|Z), G(Y|Z), Σ_FF|Z and Σ_GG|Z, we additionally have to obtain the stochastic estimators Σ_FZ, Σ_GZ and Σ_ZZ. To this end, we use the same moving-average estimation as in §4.1. Next, we define the whitening transformation on the residuals:

F̃(X_{b_t}|Z_{b_t}) = (Σ_FF|Z^t)^{−1/2} F(X_{b_t}|Z_{b_t}),    G̃(Y_{b_t}|Z_{b_t}) = (Σ_GG|Z^t)^{−1/2} G(Y_{b_t}|Z_{b_t}).

As before, the whitening constraints hold when ρ = 0. From here, we derive our two final objective functions over the residuals at time t:

min_{W_f} (1/|b_t|) ||F(X_{b_t}|Z_{b_t}) − G̃(Y_{b_t}|Z_{b_t})||²_F,    min_{V_g} (1/|b_t|) ||G(Y_{b_t}|Z_{b_t}) − F̃(X_{b_t}|Z_{b_t})||²_F,

which are optimized with alternating gradient steps, one view held fixed at a time, exactly as in the DCCA NOI objectives. The full procedure for Variant B is given in Algorithm 2.

Cross-lingual Image Description Retrieval
The cross-lingual image description retrieval task is formulated as follows: taking an image description as a query in the source language, the system has to retrieve a set of relevant descriptions in the target language which describe the same image. Our evaluation assumes a single-best scenario, where only a single target description is relevant for each query. In addition, in our setup, images are not available during inference: retrieval is performed based solely on text queries. This enables a fair comparison between our model and many baseline models that cannot represent images and text in a shared space. Moreover, it allows us to test our model in the realistic setup where images are not available at test time. To avoid the use of images at retrieval time with DPCCA, we perform the retrieval on F (X) and G(Y ), rather than on F (X|Z) and G(Y |Z) (see §3.2).
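Under the single-best assumption, R@1 with cosine similarity over text representations can be sketched as follows (a hypothetical helper of our own, not the paper's evaluation script):

```python
import numpy as np

def recall_at_1(queries, targets):
    """Single-best retrieval: the gold target of query i is target i.

    queries: (N, L) source-language description vectors; targets: (N, L)
    target-language description vectors. Retrieval uses cosine similarity
    on the text representations only, so no image features are needed.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    sims = q @ t.T                      # (N, N) cosine similarities
    top1 = sims.argmax(axis=1)          # best-ranked target per query
    return float((top1 == np.arange(len(queries))).mean())
```

For DPCCA, `queries` and `targets` would hold F(X) and G(Y), computed without any image input.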
We use the Multi30K dataset (Elliott et al., 2016), which extends Flickr30K (Young et al., 2014), a dataset of Flickr images paired with 1-5 English descriptions per image. Multi30K adds German descriptions to a total of 30,014 images: most were written independently of the English descriptions, while some are direct translations. Each image is associated with one English and one German description. We rely on the original Multi30K splits with 29,000, 1,014, and 1,000 triplets for training, validation, and test, respectively.

Algorithm 2 The non-linear orthogonal iterations (NOI) algorithm for DPCCA Variant B
Input: Data matrices X ∈ R^{D_x×N}, Y ∈ R^{D_y×N}, Z ∈ R^{D_z×N}, time constant ρ, learning rate η.
Initialization: Initialize the weights (W_F, V_G, U_H). Randomly choose a minibatch (X_{b_0}, Y_{b_0}, Z_{b_0}) and initialize the covariances.
For t = 1, 2, . . .:
  Randomly choose a minibatch (X_{b_t}, Y_{b_t}, Z_{b_t}).
  Update the covariances, then compute the residuals, their whitened versions, and the gradient steps as described in §4.2.

Multilingual Word Similarity
The word similarity task tests the correlation between automatic and human-generated word similarity scores. We evaluate with the Multilingual SimLex-999 dataset (Leviant and Reichart, 2015): the 999 English (EN) word pairs from SimLex-999 (Hill et al., 2015) were translated to German (DE), Italian (IT), and Russian (RU), and similarity scores were crowdsourced from native speakers. We introduce a new dataset termed Word-Image-Word (WIW), which we use to train word-level models for the multilingual word similarity task. WIW contains three bilingual lexicons (EN-DE, EN-IT, EN-RU) with images shared between the words in a lexicon entry. Each WIW entry is a triplet: an English word, its translation in DE/IT/RU, and a set of images relevant to the pair.
English words were taken from the January 2017 Wikipedia dump. After removing stop words and punctuation, we extract the 6,000 most frequent words from the cleaned corpus that are not present in SimLex. DE/IT/RU words were obtained semi-automatically from the EN words using Google Translate. The images are crawled from the Bing search engine using MMFeat⁹ (Kiela, 2016) by querying the EN words only. Following suggestions from prior work, we save the top 20 images as relevant images.¹⁰ Table 1 provides a summary of the WIW dataset. The dataset contains both concrete and abstract words, and words of different POS tags.¹¹ This property has an influence on the image collection: we have noticed that images of more concrete concepts are less dispersed (see also the examples in Figure 2).

Experimental Setup
Data Preprocessing and Embeddings For the sentence-level task, all descriptions were lowercased and tokenized. Each sentence is represented with one vector: the average of its word embeddings. For English, we rely on 500-dimensional English skip-gram word embeddings (Mikolov et al., 2013) trained on the January 2017 Wikipedia dump with bag-of-words contexts (window size of 5). For German, we use the deWaC 1.7B corpus (Baroni et al., 2009) to obtain 500-dimensional German embeddings using the same word embedding model. For word similarity, to be directly comparable to previous work, we rely on 300-dimensional word vectors in EN, DE, IT, and RU from Mrkšić et al. (2017).

Figure 2: WIW examples from each of the three bilingual lexicons. Note that the designated words can be abstract (true), express an action (dance), or be more concrete (plant).
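The sentence representation described above (averaging word embeddings) can be sketched as follows; the whitespace tokenizer and vocabulary handling are simplifying assumptions of ours:

```python
import numpy as np

def sentence_vector(description, embeddings, dim=500):
    """Average the word embeddings of a lowercased, tokenized description.

    `embeddings` maps a word to a `dim`-dimensional vector; out-of-vocabulary
    tokens are skipped, and an all-zero vector is returned if none are known.
    (Whitespace splitting stands in for the actual tokenizer.)
    """
    tokens = description.lower().split()
    vecs = [embeddings[w] for w in tokens if w in embeddings]
    if not vecs:
        return np.zeros(dim)
    return np.mean(vecs, axis=0)
```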
Visual features are extracted from the penultimate layer (FC7) of the VGG-19 network (Simonyan and Zisserman, 2015), and compressed to the dimensionality of the textual inputs by a Principal Component Analysis (PCA) step. For the word similarity task, we average the visual vectors across all images of each word pair as done in, e.g., (Vulić et al., 2016), before the PCA step.
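The PCA compression step can be sketched with a minimal SVD-based implementation; the paper's exact PCA setup (centering, component count) is an assumption here:

```python
import numpy as np

def pca_compress(features, k):
    """Project row-wise features (N, D) onto their top-k principal
    components, e.g. to bring 4096-dim FC7 vectors down to the text
    embedding dimensionality."""
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of Vt are the principal directions, sorted by explained variance.
    _, _, Vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ Vt[:k].T
```

For the word similarity task, the visual vectors of a word pair's images would be averaged first and compressed afterwards, as described above.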
Baseline Models We consider a wide variety of multi-view CCA-based baselines. First, we compare against the original (linear) CCA model (Hotelling, 1936), and its deep non-linear extension DCCA (Andrew et al., 2013). For DCCA: 1) we rely on its improved optimization algorithm from Wang et al. (2015a) which uses a stochastic approach with large minibatches; 2) we compare against the DCCA NOI variant (Wang et al., 2015b) described by Algorithm 1, and another recent DCCA variant with the optimization algorithm based on a stochastic decorrelational loss (Chang et al., 2017) (DCCA SDL); and 3) we also test the DCCA Autoencoder model (DCCAE) (Wang et al., 2015a), which offers a trade-off between maximizing the canonical correlation of two sets of variables and finding informative features for their reconstruction.
Another baseline is Generalized CCA (GCCA) (Funaki and Nakayama, 2015;Horst, 1961;Rastogi et al., 2015): a linear model which extends CCA to three or more views. Unlike PCCA, GCCA does not condition two variables on the third shared one, but rather seeks to maximize the canonical correlations of all pairs of views. We also compare to Nonparametric CCA (NCCA) (Michaeli et al., 2016), and to a probabilistic variant of PCCA (PPCCA, Mukuta and Harada (2014)).
Finally, we compare with the two recent models which operate in the setup most similar to ours: 1) Bridge Correlational Networks (BCN) (Rajendran et al., 2016); and 2) Image Pivoting (IMG PIVOT) from Gella et al. (2017). For both models, we report results only with the strongest variant, based on the findings from the original papers and verified by additional experimentation in our work.¹²

Hyperparameter Tuning The hyperparameters of the different models are tuned with a grid search over the following values: {2, 3, 4, 5} for the number of layers; {tanh, sigmoid, ReLU} for the activation function (we use the same activation function in all layers of the same network); {64, 128, 256} for the minibatch size; {0.001, 0.0001} for the learning rate; and {128, 256} for L (the size of the output vectors). The dimensions of all mid-layers are set to the input size. We use the Adam optimizer (Kingma and Ba, 2015), with the number of epochs set to 300.
For all participating models, we report test performance of the configuration that performs best on the validation set. For word similarity, following standard practice (Levy et al., 2015), we tune all models on one half of the SimLex data and evaluate on the other half, and vice versa. The reported score is the average over the two halves. Similarity scores for all tasks were computed using the cosine similarity measure.
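The split-half tuning protocol can be sketched as follows; `sims_per_config` and the tie-free Spearman implementation are our simplifications:

```python
import numpy as np

def spearman(a, b):
    """Spearman's rank correlation (ties ignored, for illustration)."""
    ra = np.argsort(np.argsort(a)).astype(float)
    rb = np.argsort(np.argsort(b)).astype(float)
    ra -= ra.mean()
    rb -= rb.mean()
    return float(ra @ rb / np.sqrt((ra @ ra) * (rb @ rb)))

def split_half_score(sims_per_config, gold):
    """Tune on one half of the word pairs, evaluate on the other half,
    and average the two test scores, as described above.

    sims_per_config: dict mapping a hyperparameter configuration to the
    model's cosine similarities for all pairs; gold: human scores.
    """
    n = len(gold) // 2
    halves = [(slice(0, n), slice(n, None)), (slice(n, None), slice(0, n))]
    scores = []
    for tune, test in halves:
        best = max(sims_per_config.values(),
                   key=lambda s: spearman(s[tune], gold[tune]))
        scores.append(spearman(best[test], gold[test]))
    return sum(scores) / 2.0
```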

Results and Discussion
Cross-lingual Image Description Retrieval We report two standard evaluation metrics: 1) Recall at 1 (R@1), and 2) the sentence-level BLEU+1 metric (Lin and Och, 2004), a variant of BLEU which smooths the terms for higher-order n-grams, making it more suitable for evaluating short sentences. The scores for the retrieval task with all models are summarized in Table 2.¹²

¹² More details about preprocessing and baselines (including all links to their code) are provided in the supplementary material. We use the original, readily available implementations of all baselines whenever possible, and our in-house implementations for baselines for which no code is provided by the original authors.

Table 2 (excerpt): Cross-lingual image description retrieval scores.
DCCA NOI (Wang et al., 2015b): 0.812, 0.788, 0.849, 0.830
DCCA SDL (Chang et al., 2017): 0.507, 0.487, 0.552, 0.533
DCCA (Wang et al., 2015a): 0.619, 0.621, 0.664, 0.673
DCCAE (Wang et al., 2015a): 0.564, 0.542, 0.607, 0.598
IMG PIVOT (Gella et al., 2017): 0.772, 0.763, 0.789, 0.781
BCN (Rajendran et al., 2016): 0.579, 0.570, 0.628, 0.629
PCCA (Rao, 1969): 0.785, 0.737, 0.825, 0.787
CCA (Hotelling, 1936): 0.764, 0.704, 0.803, 0.754
GCCA (Funaki and Nakayama, 2015): 0.699, 0.690, 0.742, 0.743
NCCA (Michaeli et al., 2016): 0.157, 0.165, 0.205, 0.213
PPCCA (Mukuta and Harada, 2014): 0.035, 0.050, 0.063, 0.086

The results clearly demonstrate the superiority of DPCCA (with a slight advantage to the more complex Variant B), and of the concatenation of its representations with those of the strongest baseline, DCCA NOI. Furthermore, the non-deep, linear PCCA achieves strong results: it outscores all non-deep models, as well as all deep models except for DCCA NOI, IMG PIVOT in one case, and its own deep version, DPCCA. This emphasizes our contribution in proposing PCCA for multilingual processing with images as a cross-lingual bridge.
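The BLEU+1 metric used above (add-one smoothing of the higher-order n-gram precisions) can be sketched as follows; this is an illustrative implementation of Lin and Och (2004), not the paper's evaluation code:

```python
import math
from collections import Counter

def ngram_counts(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_plus_1(candidate, reference, max_n=4):
    """Sentence-level BLEU with add-one smoothing of the n-gram precisions
    for n > 1, which keeps short sentences from scoring zero whenever a
    single higher-order n-gram is missing."""
    c, r = candidate.split(), reference.split()
    log_p = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngram_counts(c, n), ngram_counts(r, n)
        match = sum(min(cnt, ref[g]) for g, cnt in cand.items())
        total = max(sum(cand.values()), 1)
        if n == 1:
            if match == 0:
                return 0.0
            log_p += math.log(match / total)
        else:
            log_p += math.log((match + 1) / (total + 1))  # add-one smoothing
    brevity = min(1.0, math.exp(1.0 - len(r) / max(len(c), 1)))
    return brevity * math.exp(log_p / max_n)
```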
The results suggest that: 1) the inclusion of visual information in the training process helps the retrieval task even without such information during inference. DPCCA outscores all DCCA variants (either alone or through a concatenation with the DCCA NOI representation), and PCCA outscores the original two-view CCA model; and 2) deep, non-linear architectures are useful: our DPCCA outperforms the linear PCCA model.
We also note clear improvements over the two recent models which also rely on visual information: IMG PIVOT and BCN. The gain over IMG PIVOT is observed despite the fact that IMG PIVOT is a more complex multi-modal model which relies on RNNs, and is tailored to sentence-level tasks. Finally, the scores from Table 2 suggest that improved performance can be achieved by an ensemble model, that is, a simple concatenation of DPCCA (B) and DCCA NOI.

Multilingual Word Similarity
The results, presented as standard Spearman's rank correlation scores, are summarized in Table 3 for a selection of the strongest baselines, among them PCCA (Rao, 1969) (0.614, 0.296, 0.340, 0.305, 0.143, 0.340) and CCA (Hotelling, 1936) (0.557, 0.297, 0.321, 0.284, 0.157, 0.346). Further, Table 4 presents results on all SimLex word pairs. The POS-class result patterns for EN-IT and EN-RU are very similar to the patterns in Table 3 and are provided in the supplementary material. First, the results over the initial monolingual embeddings before training (INIT EMB) clearly indicate that multilingual information is beneficial for the word similarity task. We observe improvements with all models (the only exceptions being the extremely low-scoring PPCCA and NCCA, not shown). Moreover, by additionally grounding concepts from two languages in the visual modality, it is possible to further boost word similarity scores. This result is in line with prior work in monolingual settings (Chrupała et al., 2015; Kiela and Bottou, 2014; Lazaridou et al., 2015), which also demonstrated gains from multi-modal features.
The results on the POS classes represented in SimLex-999 (nouns, verbs, adjectives; Table 3) form our main finding: conditioning the multilingual representations on a shared image leads to improvements in verb and adjective representations. While for nouns one of the DPCCA variants is the best performing model for both languages, the gaps from the best performing baselines are much smaller. This is interesting since, e.g., verbs are more abstract than nouns (Hartmann and Søgaard, 2017). Considering the fact that SimLex-999 consists of 666 noun pairs, 222 verb pairs and 111 adjective pairs, this explains why the gains of DPCCA over the strongest baselines across the entire evaluation set are more modest (Table 4). We note again that the same patterns presented in Table 3 for EN-DE (more prominent verb and adjective gains and a smaller gain on nouns) also hold for EN-IT and EN-RU (see the supplementary material).

Conclusion and Future Work
We addressed the problem of utilizing images as a bridge between languages to learn improved bilingual text representations. Our main contribution is two-fold. First, we proposed to use the Partial CCA (PCCA) method. In addition, we proposed a stochastic optimization algorithm for the deep version of PCCA that overcomes the challenges posed by the covariance estimation the method requires. Our experiments reveal the effectiveness of these methods for both sentence-level and word-level tasks. Crucially, our proposed solution does not require access to images at inference/test time, in line with the realistic scenario where images that describe sentential queries are not readily available.
In future work we plan to improve our methods by exploiting the internal structure of images and sentences as well as by effectively integrating signals from more than two languages.