Do Neural Network Cross-Modal Mappings Really Bridge Modalities?

Feed-forward networks are widely used in cross-modal applications to bridge modalities by mapping distributed vectors of one modality to the other, or to a shared space. The predicted vectors are then used to perform tasks such as retrieval or labeling. Thus, the success of the whole system relies on the ability of the mapping to make the neighborhood structure (i.e., the pairwise similarities) of the predicted vectors akin to that of the target vectors. However, whether this is achieved has not been investigated yet. Here, we propose a new similarity measure and two ad hoc experiments to shed light on this issue. In three cross-modal benchmarks we learn a large number of language-to-vision and vision-to-language neural network mappings (up to five layers) using a rich diversity of image and text features and loss functions. Our results reveal that, surprisingly, the neighborhood structure of the predicted vectors consistently resembles that of the input vectors more than that of the target vectors. In a second experiment, we further show that untrained nets do not significantly disrupt the neighborhood (i.e., semantic) structure of the input vectors.


Introduction
Neural network mappings are widely used to bridge modalities or spaces in cross-modal retrieval (Qiao et al., 2017; Wang et al., 2016; Zhang et al., 2016), zero-shot learning (Lazaridou et al., 2015b, 2014; Socher et al., 2013), in building multimodal representations (Collell et al., 2017), and in word translation (Lazaridou et al., 2015a), to name a few. Typically, a neural network is first trained to predict the distributed vectors of one modality (or space) from the other. At test time, some operation such as retrieval or labeling is performed based on the nearest neighbors of the predicted (mapped) vectors. For instance, in zero-shot image classification, image features are mapped to the text space and the label of the nearest neighbor word is assigned. Thus, the success of such systems relies entirely on the ability of the map to make the predicted vectors similar to the target vectors in terms of semantic or neighborhood structure. However, whether neural nets achieve this goal in general has not been investigated yet. In fact, recent work provides evidence that considerable information about the input modality propagates into the predicted modality (Collell et al., 2017; Lazaridou et al., 2015b; Frome et al., 2013).
To shed light on these questions, we first introduce what is, to the best of our knowledge, the first measure to quantify the similarity between the neighborhood structures of two sets of vectors. Second, we perform extensive experiments in three benchmarks where we learn image-to-text and text-to-image neural net mappings using a rich variety of state-of-the-art text and image features and loss functions. Our results reveal that, contrary to expectation, the semantic structure of the mapped vectors consistently resembles that of the input vectors more than that of the target vectors of interest. In a second experiment, using six concept similarity tasks, we show that the semantic structure of the input vectors is preserved after mapping them with an untrained network, further evidence that feed-forward nets naturally preserve semantic information about the input. Overall, we uncover and raise awareness of a largely ignored phenomenon relevant to a wide range of cross-modal / cross-space applications such as retrieval, zero-shot learning or image annotation.
Ultimately, this paper aims at: (1) Encouraging the development of better architectures to bridge modalities / spaces; (2) Advocating for the use of semantic-based criteria to evaluate the quality of predicted vectors such as the neighborhood-based measure proposed here, instead of purely geometric measures such as mean squared error (MSE).

Related Work and Motivation
Neural network and linear mappings are popular tools to bridge modalities in cross-modal retrieval systems. Lazaridou et al. (2015b) leverage a text-to-image linear mapping to retrieve images given text queries. Weston et al. (2011) map label and image features into a shared space with a linear mapping to perform image annotation. Alternatively, Frome et al. (2013), Lazaridou et al. (2014) and Socher et al. (2013) perform zero-shot image classification with an image-to-text neural network mapping. Instead of mapping to latent features, Collell et al. (2018) use a 2-layer feed-forward network to map word embeddings directly to image pixels in order to visualize spatial arrangements of objects. Neural networks are also popular in other cross-space applications such as cross-lingual tasks. Lazaridou et al. (2015a) learn a linear map from language A to language B and then translate new words by returning the nearest neighbor of the mapped vector in the B space.
In the context of zero-shot learning, shortcomings of cross-space neural mappings have also been identified. For instance, "hubness" (Radovanović et al., 2010) and "pollution" (Lazaridou et al., 2015a) relate to the high dimensionality of the feature spaces and to overfitting respectively. Crucially, we do not assume that our cross-modal problem has any class labels, and we study the similarity between input and mapped vectors and between output and mapped vectors.
Recent work provides evidence that the predicted vectors of cross-modal neural net mappings are still largely informative about the input vectors. Lazaridou et al. (2015b) qualitatively observe that abstract textual concepts are grounded with the visual input modality. Counterintuitively, Collell et al. (2017) find that the vectors "imagined" from a language-to-vision neural map outperform the original visual vectors in concept similarity tasks. The paper argued that the reconstructed visual vectors become grounded with language because the map preserves topological properties of the input. Here, we go one step further and show that the mapped vectors often resemble the input vectors more than the target vectors in semantic terms, which goes against the goal of a cross-modal map.
Well-known theoretical work shows that networks with as few as one hidden layer are able to approximate any function (Hornik et al., 1989). However, this result reveals little about either test performance or the semantic structure of the mapped vectors. Instead, the phenomenon described is more closely tied to other properties of neural networks. In particular, continuity guarantees that topological properties of the input, such as connectedness, are preserved (Armstrong, 2013). Furthermore, continuity in a topology induced by a metric also ensures that points that are close together are mapped close together. As a toy example, Fig. 1 illustrates the distortion of a manifold after being mapped by a neural net. In a noiseless world with fully statistically dependent modalities, the vectors of one modality could be perfectly predicted from those of the other. However, in real-world problems this is unrealistic given the noise of the features and the fact that modalities encode complementary information (Collell and Moens, 2016). Such unpredictability, combined with the continuity and topology-preserving properties of neural nets, propels the phenomenon identified, namely mapped vectors resembling the input more than the target vectors, in nearest-neighbor terms.

Proposed Approach
To bridge modalities X and Y, we consider two popular cross-modal mappings f : X → Y.
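(i) Linear mapping (lin): $f(x) = W_0 x + b_0$ with $W_0 \in \mathbb{R}^{d_y \times d_x}$ and $b_0 \in \mathbb{R}^{d_y}$, where $d_x$ and $d_y$ are the input and output dimensions respectively.
(ii) Feed-forward neural network (nn): $f(x) = W_1 \sigma(W_0 x + b_0) + b_1$ with $W_1 \in \mathbb{R}^{d_y \times d_h}$, $W_0 \in \mathbb{R}^{d_h \times d_x}$, $b_0 \in \mathbb{R}^{d_h}$ and $b_1 \in \mathbb{R}^{d_y}$, where $d_h$ is the number of hidden units and $\sigma(\cdot)$ the non-linearity (e.g., tanh or sigmoid). Although single hidden layer networks are already universal approximators (Hornik et al., 1989), we explored whether deeper nets with 3 and 5 hidden layers could improve the fit (see Supplement).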
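As a minimal sketch of the two mappings, the snippet below assumes the usual affine form f(x) = W0 x + b0 for lin and the one-hidden-layer form f(x) = W1 σ(W0 x + b0) + b1 for nn, with the parameter names W0, W1, b0, b1 that the paper uses for the untrained networks. The helper names (`affine`, `f_lin`, `f_nn`) and the toy values are hypothetical, and plain Python lists are used only to keep the example self-contained (the paper's implementation is in Keras).

```python
import math

def affine(W, b, x):
    """Return W x + b, with W given as a list of rows."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i
            for row, b_i in zip(W, b)]

def f_lin(x, W0, b0):
    """Linear map (lin): W0 has shape (d_y, d_x), b0 has length d_y."""
    return affine(W0, b0, x)

def f_nn(x, W0, b0, W1, b1, sigma=math.tanh):
    """One-hidden-layer map (nn): W0 is (d_h, d_x), W1 is (d_y, d_h)."""
    hidden = [sigma(a) for a in affine(W0, b0, x)]
    return affine(W1, b1, hidden)

# Toy example with d_x = 2 and d_y = 3 (illustrative values only).
x = [1.0, 2.0]
W0 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b0 = [0.0, 0.0, 1.0]
print(f_lin(x, W0, b0))  # [1.0, 2.0, 4.0]
```

Deeper variants (3 or 5 hidden layers) would simply repeat the hidden step before the final affine layer.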

Loss:
Our primary choice is the MSE: $\frac{1}{2}\|f(x) - y\|^2$, where y is the target vector. We also tested other losses such as the cosine: $1 - \cos(f(x), y)$ and the max-margin: $\max\{0, \gamma + \|f(x) - y\| - \|f(\tilde{x}) - y\|\}$, where $\tilde{x}$ belongs to a different class than (x, y), and γ is the margin. As in Lazaridou et al. (2015a) and Weston et al. (2011), we choose the first $\tilde{x}$ that violates the constraint. Notice that losses that do not require class labels, such as MSE, are suitable for a wider, more general set of tasks than discriminative losses (e.g., cross-entropy). In fact, cross-modal retrieval tasks often do not exhibit any class labels. Additionally, our research question concerns the cross-space mapping problem in isolation (independently of class labels).
Let us denote a set of N input and output vectors by $X \in \mathbb{R}^{N \times d_x}$ and $Y \in \mathbb{R}^{N \times d_y}$ respectively. Each input vector $x_i$ is paired to the output vector $y_i$ of the same index ($i = 1, \dots, N$). Let us henceforth denote the mapped input vectors by $f(X) \in \mathbb{R}^{N \times d_y}$. In order to explore the similarity between f(X) and X, and between f(X) and Y, we propose two ad hoc settings below.
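The three losses can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the 1/2 scaling of the squared error and the hinge form max{0, γ + ‖f(x) − y‖ − ‖f(x̃) − y‖} with Euclidean distances are assumed, and all function names are hypothetical. Here `pred` stands for f(x) and `pred_neg` for f(x̃), the mapped vector of an example from a different class.

```python
import math

def norm(u):
    """Euclidean norm of a vector."""
    return math.sqrt(sum(a * a for a in u))

def sub(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def mse_loss(pred, y):
    """Half the squared Euclidean distance (the 1/2 factor is an assumption)."""
    return 0.5 * sum(d * d for d in sub(pred, y))

def cosine_loss(pred, y):
    """1 - cos(f(x), y)."""
    dot = sum(a * b for a, b in zip(pred, y))
    return 1.0 - dot / (norm(pred) * norm(y))

def max_margin_loss(pred, y, pred_neg, gamma):
    """Hinge loss: zero once f(x) is closer to y than f(x~) by at least gamma."""
    return max(0.0, gamma + norm(sub(pred, y)) - norm(sub(pred_neg, y)))
```

Only the max-margin loss needs class information (to pick the negative example), which is why the label-free MSE and cosine losses apply to a broader set of tasks.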

Neighborhood Structure of Mapped Vectors (Experiment 1)
To measure the similarity between the neighborhood structure of two sets of paired vectors V and Z, we propose the mean nearest neighbor overlap measure ($\mathrm{mNNO}_K(V, Z)$). We define the nearest neighbor overlap $\mathrm{NNO}_K(v_i, z_i)$ as the number of K nearest neighbors that two paired vectors $v_i$, $z_i$ share in their respective spaces. Let $V = \{v_i\}_{i=1}^{N}$ and $Z = \{z_i\}_{i=1}^{N}$ be two sets of N paired vectors. We define:
$$\mathrm{mNNO}_K(V, Z) = \frac{1}{KN} \sum_{i=1}^{N} \mathrm{NNO}_K(v_i, z_i)$$
The normalizing constant K simply scales $\mathrm{mNNO}_K(V, Z)$ between 0 and 1, making it independent of the choice of K.
Thus, a mNNO K (V, Z) = 0.7 means that the vectors in V and Z share, on average, 70% of their nearest neighbors. Notice that mNNO implicitly performs retrieval for some similarity measure (e.g., Euclidean or cosine), and quantifies how semantically similar two sets of paired vectors are.
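A minimal sketch of the measure, assuming cosine similarity for retrieval, neighbors identified by their index, and a vector never counted as its own neighbor (ties are broken by index; both choices are assumptions of this example). The function names and the toy vectors are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nearest_neighbors(vectors, i, k):
    """Indices of the k vectors most similar to vectors[i], excluding i itself."""
    scored = [(cosine(vectors[i], vectors[j]), j)
              for j in range(len(vectors)) if j != i]
    scored.sort(key=lambda t: (-t[0], t[1]))  # most similar first, ties by index
    return {j for _, j in scored[:k]}

def mnno(V, Z, k):
    """Mean nearest neighbor overlap between paired sets V and Z, in [0, 1]."""
    n = len(V)
    shared = sum(len(nearest_neighbors(V, i, k) & nearest_neighbors(Z, i, k))
                 for i in range(n))
    return shared / (k * n)

# Toy paired sets: Z holds the same four points as V, but paired differently.
V = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
Z = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [0.1, 0.9]]
print(mnno(V, V, 1))  # 1.0
print(mnno(V, Z, 2))  # 0.625
```

In the experiments one would compare `mnno(X, f_X, 10)` against `mnno(Y, f_X, 10)` on test data, where `f_X` holds the mapped vectors. The brute-force neighbor search here is quadratic in N and is meant only to make the definition concrete.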

Mapping with Untrained Networks (Experiment 2)
To complement the setting above (Sect. 3.1), it is instructive to consider the limit case of an untrained network. Concept similarity tasks provide a suitable setting to study the semantic structure of distributed representations (Pennington et al., 2014). That is, semantically similar concepts should ideally be close together. In particular, our interest is in comparing X with its projection f (X) through a mapping with random parameters, to understand the extent to which the mapping may disrupt or preserve the semantic structure of X.

Hyperparameters and Implementation
The parameters in $W_0$, $W_1$ are drawn uniformly at random from [−1, 1], and $b_0$, $b_1$ are set to zero. We use a tanh activation σ(·).⁶ The output dimension $d_y$ is set to 2,048 for all embeddings.
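The untrained network can be sketched as below, assuming a single hidden layer with weights drawn uniformly from [−1, 1], zero biases and a tanh activation. The hidden size `d_h`, the seed handling and the function names are hypothetical, and the toy dimensions are far smaller than the 2,048 output units used in the paper.

```python
import math
import random

def random_matrix(rows, cols, rng):
    """Matrix with entries drawn uniformly from [-1, 1]."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def untrained_nn(d_x, d_h, d_y, seed=0):
    """Return f(x) = W1 tanh(W0 x) with random weights and zero biases."""
    rng = random.Random(seed)
    W0 = random_matrix(d_h, d_x, rng)
    W1 = random_matrix(d_y, d_h, rng)

    def f(x):
        hidden = [math.tanh(sum(w * x_j for w, x_j in zip(row, x))) for row in W0]
        return [sum(w * h for w, h in zip(row, hidden)) for row in W1]

    return f

f = untrained_nn(d_x=3, d_h=4, d_y=5, seed=1)
print(len(f([0.1, 0.2, 0.3])))  # 5
```

Repeating this with several seeds gives the multiple sets of randomly generated parameters over which results are averaged.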

Image and Text Features
Textual and visual features are the same as described in Sect. 4.1.3 for the ImageNet dataset.

Similarity Predictions
We compute the prediction of similarity between two vectors $z_1$, $z_2$ with both the cosine $\frac{z_1^\top z_2}{\|z_1\| \|z_2\|}$ and the Euclidean similarity $\frac{1}{1 + \|z_1 - z_2\|}$.⁷
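Both similarity predictions are short enough to write out directly. The sketch below uses hypothetical function names and plain Python lists.

```python
import math

def cosine_similarity(z1, z2):
    """z1 . z2 / (||z1|| ||z2||)."""
    dot = sum(a * b for a, b in zip(z1, z2))
    norm1 = math.sqrt(sum(a * a for a in z1))
    norm2 = math.sqrt(sum(b * b for b in z2))
    return dot / (norm1 * norm2)

def euclidean_similarity(z1, z2):
    """1 / (1 + ||z1 - z2||), which equals 1 for identical vectors."""
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(z1, z2)))
    return 1.0 / (1.0 + distance)

print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))     # 0.0
print(euclidean_similarity([0.0, 0.0], [0.0, 0.0]))  # 1.0
```

Unlike the cosine, the Euclidean similarity is sensitive to vector norms, so the two can rank the same concept pairs differently.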

Performance Metrics
As is common practice, we evaluate the predictions of similarity of the embeddings (Sect. 4.2.4) against the human similarity ratings with the Spearman correlation ρ. We report the average of 10 sets of randomly generated parameters.
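The evaluation step can be illustrated with a small, dependency-free Spearman correlation (the Pearson correlation of average ranks). This is a sketch with hypothetical names; in practice a library routine such as SciPy's `spearmanr` would normally be used instead.

```python
import math

def ranks(values):
    """1-based ranks, with tied values receiving their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result

def pearson(a, b):
    """Pearson correlation between two equally long sequences."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def spearman(predicted, human):
    """Spearman correlation between predicted similarities and human ratings."""
    return pearson(ranks(predicted), ranks(human))

def mean_over_runs(scores):
    """Average of the correlations obtained with different random parameters."""
    return sum(scores) / len(scores)
```

For each random parameter set, `predicted` would hold the model similarity of every concept pair in a task and `human` the corresponding ratings; `mean_over_runs` then averages the per-run correlations.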

Results and Discussion
We test statistical significance with a two-sided Wilcoxon rank sum test adjusted with Bonferroni. The null hypothesis is that a compared pair is equal. In Tab. 1, * indicates that mNNO(X, f (X)) differs from mNNO(Y, f (X)) (p < 0.001) on the same mapping, embedding and direction. In Tab. 2, * indicates that performance of mapped and input vectors differs (p < 0.05) in the 10 runs.

Experiment 1
Results below are with cosine neighbors and K = 10. Euclidean neighbors yield similar results and are thus left to the Supplement. Similarly, results in ImageNet with GloVe embeddings are shown below and word2vec results in the Supplement. The choice of K ∈ {5, 10, 30} had no visible effect on results. Results with 3- and 5-layer nets did not differ substantially from the results below (see Supplement). The cosine and max-margin losses [...] 2011) find that max-margin performs the best in their tasks, we do not find our result entirely surprising given that max-margin focuses on inter-class differences while we also look at intra-class neighbors (in fact, we do not require classes). Tab. 1 shows our core finding, namely that the semantic structure of f(X) resembles that of X more than that of Y, for both lin and nn maps.
Table 1: Test mean nearest neighbor overlap. Boldface indicates the largest score at each $\mathrm{mNNO}_{10}(X, f(X))$ and $\mathrm{mNNO}_{10}(Y, f(X))$ pair, which are abbreviated by X, f(X) and Y, f(X).
Fig. 2 is particularly revealing. If we only looked at train performance (and allowed train MSE to reach 0), then f(X) = Y and clearly train mNNO(f(X), Y) = 1, while mNNO(f(X), X) can only be smaller than 1. However, the interest is always in test samples, and (near-)perfect test prediction is unrealistic. Notice in fact in Fig. 2 that even if we look at train fit, MSE needs to be close to 0 for mNNO(f(X), Y) to be reasonably large. In all the combinations from Tab. 1, the test mNNO(f(X), Y) never surpasses test mNNO(f(X), X) for any number of epochs, even with an oracle (not shown).
⁶ We find that sigmoid and ReLU yield similar results.
⁷ Notice that papers generally use only cosine similarity (Lazaridou et al., 2015b; Pennington et al., 2014).

Experiment 2
Tab. 2 shows that untrained linear ($f_{lin}$) and neural net ($f_{nn}$) mappings preserve the semantic structure of the input X, thus complementing the findings of Experiment 1. Experiment 1 concerns learning, whereas Experiment 2, by "ablating" the learning part and randomizing the weights, reveals the natural tendency of neural nets to preserve semantic information about the input, regardless of the choice of the target vectors and loss function.

Conclusions
Overall, we uncovered a phenomenon neglected so far, namely that neural net cross-modal mappings can produce mapped vectors more akin to the input vectors than to the target vectors, in terms of semantic structure. This finding was made possible by the proposed measure, which explicitly quantifies the similarity between the neighborhood structures of two sets of vectors. While other measures such as mean squared error can be misleading, our measure provides a more realistic estimate of the semantic similarity between predicted and target vectors. In fact, it is the semantic structure (or pairwise similarities) that ultimately matters in cross-modal applications.
Using different numbers of hidden units (and selecting the best-performing one) is important in order to guarantee that our conclusions are not influenced by, or merely a product of, underfitting or overfitting. Similarly, we learned the mappings at different levels of dropout {0.25, 0.5, 0.75}, which did not yield any improvement w.r.t. zero dropout (shown in our results).
We use a ReLU activation, the RMSprop optimizer ($\rho = 0.9$, $\epsilon = 10^{-8}$) and a batch size of 64. We find that sigmoid and tanh yield results similar to ReLU. Our implementation is in Keras (Chollet et al., 2015).
Since ImageNet does not have any set of "test concepts", we employ 5-fold CV. Reported results are either averages over 5 folds (ImageNet) or over 5 runs with different model weight initializations (IAPR TC-12 and Wiki).

B Textual Feature Extraction
Unlike ImageNet, where we associate a word embedding to each concept, the textual modality in IAPR TC-12 and Wiki consists of sentences. In order to extract state-of-the-art textual features in these datasets we train the following separate network (prior to the cross-modal mapping). First, the embedded input sentences are passed to a bidirectional GRU of 64 units, then fed into a fully-connected layer, followed by a cross-entropy loss on the vector of class labels. We collect the 64-d averaged GRU hidden states of both directions as features. The network is trained with the Adam optimizer.
In Wiki and IAPR TC-12 we verify that the extracted text and image features are indeed informative and useful by computing their mean average precision (mAP) in retrieval (considering that a document B is relevant for document A if A and B share at least one class label). In Wiki we find mAPs of: biGRU = 0.77, ResNet = 0.22 and VGG-128 = 0.21. In IAPR TC-12 we find mAPs of: biGRU = 0.77, ResNet = 0.49 and VGG-128 = 0.46. Notice that ImageNet has a single data point per class in our setting, and thus mAP cannot be computed. However, we employ standard GloVe, word2vec, VGG-128 and ResNet vectors in ImageNet, which are known to perform well.
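The retrieval check can be sketched as follows, using the relevance criterion stated above (two documents are relevant to each other if they share at least one class label). The rest is assumed for illustration: every other document is ranked for each query, a query with no relevant document contributes an average precision of 0, and the similarity matrix is given as input. All names and toy values are hypothetical.

```python
def shares_label(labels_a, labels_b):
    """Relevance criterion: at least one class label in common."""
    return bool(set(labels_a) & set(labels_b))

def average_precision(ranked_relevance):
    """Mean of precision@rank over the ranks where a relevant item appears."""
    hits, precisions = 0, []
    for rank, relevant in enumerate(ranked_relevance, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(similarity, labels):
    """similarity[q][j] scores document j for query q; labels[i] are class labels."""
    n = len(labels)
    scores = []
    for q in range(n):
        others = sorted((j for j in range(n) if j != q),
                        key=lambda j: -similarity[q][j])
        scores.append(average_precision(
            [shares_label(labels[q], labels[j]) for j in others]))
    return sum(scores) / n

labels = [{"a"}, {"a"}, {"b"}]
similarity = [[1.0, 0.9, 0.1], [0.9, 1.0, 0.2], [0.1, 0.2, 1.0]]
print(mean_average_precision(similarity, labels))  # 2/3, about 0.667
```

With a single data point per class, as in the ImageNet setting, no query has a relevant document, which is consistent with mAP not being computable there.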

C Additional Results
Results with mNNO(X, Y) (omitted in the main paper for space reasons): Interestingly, the similarity mNNO(X, Y) between the original input X and output Y vectors is generally low (between 1.5 and 2.3), indicating that these spaces are originally quite different. However, mNNO(X, Y) always remains lower than mNNO(f(X), Y), thus indicating that the mapping makes a difference.

C.1.1 Results with 3 and 5 layers
Table 3: Test mean nearest neighbor overlap with 3- and 5-hidden layer neural networks, using cosine-based neighbors and MSE loss. Boldface indicates best performance between each $\mathrm{mNNO}_{10}(X, f(X))$ and $\mathrm{mNNO}_{10}(Y, f(X))$ pair, which are abbreviated by X, f(X) and Y, f(X).
It is interesting to notice that even though the difference between $\mathrm{mNNO}_{10}(X, f(X))$ and $\mathrm{mNNO}_{10}(Y, f(X))$ has narrowed down w.r.t. the linear and 1-hidden layer models (in the main paper) in some cases (e.g., ImageNet), this does not seem to be caused by better predictions, i.e., an increase of $\mathrm{mNNO}_{10}(Y, f(X))$, but rather by a decrease of $\mathrm{mNNO}_{10}(X, f(X))$. This is expected since with more layers the information about the input is less preserved.
Table 9: Test mean nearest neighbor overlap with Euclidean-based neighbors and MSE loss. Boldface indicates best performance between each $\mathrm{mNNO}_{10}(X, f(X))$ and $\mathrm{mNNO}_{10}(Y, f(X))$ pair, which are abbreviated by X, f(X) and Y, f(X).
Table 12: Spearman correlations between human ratings and similarities (cosine or Euclidean) predicted from the embeddings, using word2vec and VGG-128 embeddings.