Gaussian Visual-Linguistic Embedding for Zero-Shot Recognition

An exciting outcome of research at the intersection of language and vision is zero-shot learning (ZSL). ZSL promises to scale visual recognition by borrowing distributed semantic models (DSMs) learned from linguistic corpora and turning them into visual recognition models. However, the popular word-vector DSM embeddings are relatively impoverished in their expressivity, as they model each word as a single vector point. In this paper we explore word-distribution embeddings for ZSL. We present a visual-linguistic mapping for ZSL in the case where words and visual categories are both represented by distributions. Experiments show improved results on ZSL benchmarks, which we attribute to better exploitation of intra-concept variability in each modality.


Introduction
Learning vector representations of word meaning is a topical area in computational linguistics. Based on the distributional hypothesis (Harris, 1954), that words in similar contexts have similar meanings, distributed semantic models (DSMs) build vector representations based on corpus-extracted context. DSM approaches such as topic models (Blei et al., 2003) and, more recently, neural networks (Collobert et al., 2011; Mikolov et al., 2013) have had great success in a variety of lexical and semantic tasks (Arora et al., 2015; Schwenk, 2007).
However, despite their successes, classic DSMs are severely impoverished compared to humans because they learn solely from word co-occurrence, without grounding in the outside world. This has motivated a wave of recent research into multi-modal and cross-modal learning that aims to ground DSMs in non-linguistic modalities (Kiela and Bottou, 2014; Silberer and Lapata, 2014). Such multi-modal DSMs are attractive because they learn richer representations than language-only models (e.g., that bananas are yellow fruit (Bruni et al., 2012b)), and thus often outperform language-only models in various lexical tasks (Bruni et al., 2012a).
In this paper, we focus on a unique and practically valuable capability enabled by cross-modal DSMs: zero-shot learning (ZSL). Zero-shot recognition aims to recognise visual categories in the absence of any training examples by cross-modal transfer from language. The idea is to use a limited set of training data to learn a linguistic-visual mapping, and then apply the induced function to map images from novel visual categories (unseen during training) to a linguistic embedding, thus enabling recognition in the absence of visual training examples. ZSL has had considerable impact (Lampert et al., 2009; Socher et al., 2013; Lazaridou et al., 2014) due to the potential of leveraging language to help visual recognition scale to many categories without labor-intensive image annotation.
DSMs typically generate vector embeddings of words, and hence ZSL is typically realised by variants of vector-valued cross-modal regression. However, such vector representations have limited expressivity: each word is represented by a point, with no notion of intra-class variability. In this paper, we consider ZSL in the case where both visual and linguistic concepts are represented by Gaussian distribution embeddings. Specifically, our Gaussian-embedding approach to ZSL learns concept distributions in both domains: Gaussians representing individual words (as in Vilnis and McCallum (2015)) and Gaussians representing visual concepts. Simultaneously, it learns a cross-domain mapping that warps language-domain Gaussian concept representations into alignment with visual-domain concept Gaussians. Some existing vector DSM-based cross-modal ZSL mappings (Akata et al., 2013; Frome et al., 2013) can be seen as special cases of ours in which the within-domain model is fixed in advance to vectors corresponding to the Gaussian means alone, and only the cross-domain mapping is learned. Our results show that modeling linguistic and visual concepts as Gaussian distributions rather than vectors can significantly improve zero-shot recognition results.

Background
Vector Word Embeddings In a typical setup for unsupervised learning of word vectors, we observe a sequence of tokens $\{w_i\}$ and their context words $\{c(w)_i\}$. The goal is to map each word $w$ to a $d$-dimensional vector $e_w$ reflecting its distributional properties. The popular skip-gram and CBOW models (Mikolov et al., 2013) learn a matrix $W \in \mathbb{R}^{|V| \times d}$ of word embeddings, one row for each of the $|V|$ vocabulary words ($e_w = W_{(w,:)}$), based on the objective of predicting words given their contexts.
Another way to formalise the word-vector learning problem is to search for a representation $W$ such that each word $w$ has high representational similarity with its co-occurring words $c(w)$ and low similarity with non-co-occurring words $\neg c(w)$. This can be expressed as optimisation of a max-margin loss $J$, requiring that each word's representation $e_w$ is more similar to that of context words $e_p$ than to that of non-context words $e_n$ by a margin $\Delta$:
$$J(W) = \sum_{w,\ p \in c(w),\ n \in \neg c(w)} \max\big(0,\ \Delta - E(e_w, e_p) + E(e_w, e_n)\big) \qquad (1)$$
where the similarity measure $E(\cdot,\cdot)$ is computed in $\mathbb{R}^d$, for example cosine similarity or negated Euclidean distance.
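As a minimal illustration of the max-margin criterion described above, the following sketch scores a single (word, context, non-context) triple using cosine similarity. The function names and the default margin of 1 are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def hinge_loss(e_w, e_p, e_n, margin=1.0):
    """Max-margin loss for one (word, context, non-context) triple.

    The loss is zero once the context word is more similar to the
    word than the non-context word by at least `margin`.
    """
    return max(0.0, margin - cosine(e_w, e_p) + cosine(e_w, e_n))

e_w = np.array([1.0, 0.0])
e_p = np.array([1.0, 0.0])    # same direction as e_w: similarity 1
e_n = np.array([-1.0, 0.0])   # opposite direction: similarity -1
print(hinge_loss(e_w, e_p, e_n))  # 0.0, the margin is already satisfied
```

Swapping the roles of `e_p` and `e_n` gives a positive loss, which is the signal that gradient-based training would use to move the embeddings.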
Gaussian Word Embeddings Vector-space models are successful, but have limited expressivity in terms of modelling the variance of a concept, asymmetric distances between words, and so on. This has motivated recent work on distribution-based embeddings (Vilnis and McCallum, 2015). Rather than learning word vectors $e_w$, the goal is now to learn a distribution for each word, represented by a per-word mean $\mu_w$ and covariance $\Sigma_w$.
In order to extend word representation learning approaches such as Eq. (1) to learning Gaussians, we need to replace the vector similarity measure $E(\cdot,\cdot)$ with a similarity measure for Gaussians. We follow Vilnis and McCallum (2015) in using the inner product between distributions $f$ and $g$, the probability product kernel (Jebara et al., 2004):
$$E(f, g) = \int f(x)\, g(x)\, dx \qquad (2)$$
The probability product kernel (PPK) has a convenient closed form in the case of Gaussians:
$$E(f, g) = \int \mathcal{N}(x; \mu_f, \Sigma_f)\, \mathcal{N}(x; \mu_g, \Sigma_g)\, dx = \mathcal{N}(0;\ \mu_f - \mu_g,\ \Sigma_f + \Sigma_g) \qquad (3)$$
where $\mu_f, \mu_g$ are the means and $\Sigma_f, \Sigma_g$ are the covariances of the distributions $f$ and $g$.
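For two Gaussians, the PPK integral evaluates to a Gaussian density at zero with mean $\mu_f - \mu_g$ and covariance $\Sigma_f + \Sigma_g$. The sketch below computes its logarithm with NumPy. Working in the log domain is an assumption made here for numerical stability in high dimensions, and the function name `log_ppk` is hypothetical.

```python
import numpy as np

def log_ppk(mu_f, cov_f, mu_g, cov_g):
    """Log probability product kernel between two Gaussians.

    Returns log N(0; mu_f - mu_g, cov_f + cov_g), i.e. the log of the
    integral of the product of the two densities.
    """
    d = mu_f.shape[0]
    diff = mu_f - mu_g
    cov = cov_f + cov_g
    _, logdet = np.linalg.slogdet(cov)          # log-determinant of the summed covariance
    maha = diff @ np.linalg.solve(cov, diff)    # squared Mahalanobis distance between means
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)

# Two identical unit-variance 1-D Gaussians: the kernel equals 1 / (2 * sqrt(pi)).
mu = np.array([0.0])
cov = np.array([[1.0]])
print(np.exp(log_ppk(mu, cov, mu, cov)))
```

The kernel is symmetric in its arguments and decreases as the means move apart relative to the combined covariance, so a broad (high-variance) concept tolerates larger mean offsets than a narrow one.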

Cross-Modal Distribution Mapping
Gaussian models of words can be learned as in the previous section, and Gaussian models of image categories can be obtained directly by maximum likelihood. The central task is therefore to establish a mapping between word- and image-Gaussians, which will be of different dimensions $d_w$ and $d_x$.
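The maximum-likelihood Gaussian for an image category is simply the sample mean and (biased) sample covariance of that category's feature vectors. A minimal sketch follows; the `ridge` term is an assumption added here, since with high-dimensional features and few images the sample covariance is singular, and the text does not say how this case is handled.

```python
import numpy as np

def fit_class_gaussian(features, ridge=1e-3):
    """Fit a Gaussian to the image features of one visual class.

    features: array of shape (n_images, d_x).
    ridge:    assumed diagonal regulariser (not part of the MLE itself)
              that keeps the covariance invertible.
    """
    mu = features.mean(axis=0)
    centred = features - mu
    cov = centred.T @ centred / features.shape[0]   # MLE divides by n, not n - 1
    return mu, cov + ridge * np.eye(features.shape[1])

rng = np.random.default_rng(0)
mu, cov = fit_class_gaussian(rng.normal(size=(50, 4)))
print(mu.shape, cov.shape)  # (4,) (4, 4)
```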
We aim to find a projection matrix $A \in \mathbb{R}^{d_x \times d_w}$ such that a word $w$ generates an image vector as $e_x = A e_w$. Working with distributions, this implies $\mu_x = A\mu_w$ and $\Sigma_x = A\Sigma_w A^T$. We can now evaluate the similarity of concept distributions across modalities. Applying the Gaussian closed form of Eq. (3) after projection, the similarity between an image-domain Gaussian $f$ and a text-domain Gaussian $g$ is:
$$E(f, g) = \mathcal{N}(0;\ \mu_f - A\mu_g,\ \Sigma_f + A\Sigma_g A^T) \qquad (4)$$
Using this metric, we can train our cross-modal projection $A$ via the cross-domain max-margin loss:
$$J(A) = \sum_{(f,g) \in P}\ \sum_{(f,k) \in N} \max\big(0,\ \Delta - E(f, g) + E(f, k)\big) \qquad (5)$$
where $P$ is the set of matching pairs that should be aligned (e.g., the word Gaussian 'plane' and the Gaussian of plane images) and $N$ is the set of mismatching pairs that should be separated (e.g., 'plane' and images of dogs). This loss can be optimised with SGD using its gradient with respect to $A$.
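The cross-modal similarity and max-margin loss can be sketched as follows. Several choices here are assumptions rather than details confirmed by the text: the energy is the log of the PPK, the image and word Gaussians are stored in lists aligned by class index, and every other class's word serves as a mismatching pair. All function names are hypothetical.

```python
import numpy as np

def log_ppk(mu_f, cov_f, mu_g, cov_g):
    """Log N(0; mu_f - mu_g, cov_f + cov_g)."""
    d = mu_f.shape[0]
    diff = mu_f - mu_g
    cov = cov_f + cov_g
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def cross_modal_energy(A, image_gaussian, word_gaussian):
    """Similarity of an image Gaussian and a word Gaussian projected by A."""
    mu_x, cov_x = image_gaussian
    mu_w, cov_w = word_gaussian
    return log_ppk(mu_x, cov_x, A @ mu_w, A @ cov_w @ A.T)

def cross_modal_loss(A, image_gaussians, word_gaussians, margin=1.0):
    """Hinge loss over matching (i, i) and mismatching (i, j != i) pairs."""
    loss = 0.0
    for i, img in enumerate(image_gaussians):
        pos = cross_modal_energy(A, img, word_gaussians[i])
        for j, word in enumerate(word_gaussians):
            if j != i:
                neg = cross_modal_energy(A, img, word)
                loss += max(0.0, margin - pos + neg)
    return loss

# Toy setup: d_w = 2, d_x = 3, two classes with well-separated means.
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
words = [(np.array([5.0, 0.0]), 0.1 * np.eye(2)),
         (np.array([-5.0, 0.0]), 0.1 * np.eye(2))]
images = [(np.array([5.0, 0.0, 0.0]), 0.1 * np.eye(3)),
          (np.array([-5.0, 0.0, 0.0]), 0.1 * np.eye(3))]
print(cross_modal_loss(A, images, words))        # aligned classes
print(cross_modal_loss(A, images, words[::-1]))  # deliberately misaligned
```

In practice the loss would be minimised over $A$ by stochastic gradient steps; an automatic-differentiation library could supply the gradient instead of a hand-derived expression.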

Joint Representation and Mapping
The cross-domain mapping $A$ can be learned (Eq. 5) for fixed within-domain representations (word and image Gaussians). It is also possible to simultaneously learn the text- and image-domain Gaussians by optimising the sum of three coupled losses: Eq. 1 with the similarity of Eq. 3, Eq. 5, and max-margin image classification using Gaussians. We found that jointly learning the image-classification Gaussians did not bring much benefit over the MLE Gaussians, so we only jointly learn the text Gaussians and the cross-domain mapping.

Application to Zero-Shot Recognition
Once the text-domain Gaussians and cross-domain mapping have been trained for a set of known words/classes, we can use the learned model to recognise any novel (unseen) but nameable visual category $w$ as follows:
1. Get the word-Gaussians of the target categories $w$, $\mathcal{N}(\mu_w, \Sigma_w)$.
2. Project those Gaussians into the image modality, $\mathcal{N}(A\mu_w, A\Sigma_w A^T)$.
3. Classify a test image $x$ by evaluating its likelihood under each Gaussian and picking the most likely Gaussian: $p(w|x) \propto \mathcal{N}(x \mid A\mu_w, A\Sigma_w A^T)$.
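The three steps above can be sketched in a few lines of NumPy. One caveat is worth flagging: when $d_x > d_w$, the projected covariance $A\Sigma_w A^T$ is rank-deficient, so the sketch adds a small diagonal `ridge` term before evaluating the density. That regulariser, like the function names, is an assumption of this example and not something the text specifies.

```python
import numpy as np

def gaussian_logpdf(x, mu, cov):
    """Log density of N(x | mu, cov)."""
    d = x.shape[0]
    diff = x - mu
    _, logdet = np.linalg.slogdet(cov)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet
                   + diff @ np.linalg.solve(cov, diff))

def zero_shot_classify(x, A, word_gaussians, ridge=1e-3):
    """Return the unseen class whose projected word Gaussian best explains x.

    word_gaussians: dict mapping class name -> (mu_w, cov_w).
    ridge: assumed regulariser making A cov_w A^T invertible.
    """
    scores = {}
    for name, (mu_w, cov_w) in word_gaussians.items():
        mu_x = A @ mu_w                                        # step 2: project the mean
        cov_x = A @ cov_w @ A.T + ridge * np.eye(A.shape[0])   # step 2: project the covariance
        scores[name] = gaussian_logpdf(x, mu_x, cov_x)         # step 3: log-likelihood of x
    return max(scores, key=scores.get), scores

A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
unseen = {"zebra": (np.array([5.0, 0.0]), np.eye(2)),
          "whale": (np.array([-5.0, 0.0]), np.eye(2))}
label, _ = zero_shot_classify(np.array([4.5, 0.2, 0.0]), A, unseen)
print(label)  # zebra
```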

Contextual Query
To illustrate our approach, we also experiment with a new variant of the ZSL setting. In conventional ZSL, a novel word can be matched against images by projecting it into image space and sorting images by their distance to the word (vector) or their likelihood under the word (Gaussian). However, results may be unreliable for polysemous words, or for words with large appearance variability. In this case we may wish to enrich the query with contextual words that disambiguate the visual meaning of the query. With regular vector-based queries, the typical approach is to sum the word vectors. For contextual disambiguation of polysemy, we may hope that vec('bank')+vec('river') retrieves a very different set of images than vec('bank')+vec('finance'). For specification of a particular subcategory or variant, we may hope that vec('plane')+vec('military') retrieves a different set of images than vec('plane')+vec('passenger'). By using distributions rather than vectors, our framework provides a richer means of making such queries, one that accounts for the intra-class variability of each concept. When each word is represented by a Gaussian, a two-word query can be represented by their product, which is (up to normalisation) the new Gaussian $\mathcal{N}(\mu, \Sigma)$ with $\Sigma = (\Sigma_1^{-1} + \Sigma_2^{-1})^{-1}$ and $\mu = \Sigma(\Sigma_1^{-1}\mu_1 + \Sigma_2^{-1}\mu_2)$, where $(\mu_1, \Sigma_1)$ and $(\mu_2, \Sigma_2)$ are the parameters of the two word Gaussians.
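The product of two Gaussian densities is proportional to a Gaussian whose precision is the sum of the two precisions and whose mean is the precision-weighted average of the two means. A short sketch of this standard identity follows; the function name is hypothetical and the toy numbers are chosen only for illustration.

```python
import numpy as np

def gaussian_product(mu1, cov1, mu2, cov2):
    """Parameters of the Gaussian proportional to N(mu1, cov1) * N(mu2, cov2)."""
    prec1 = np.linalg.inv(cov1)
    prec2 = np.linalg.inv(cov2)
    cov = np.linalg.inv(prec1 + prec2)          # precisions add
    mu = cov @ (prec1 @ mu1 + prec2 @ mu2)      # precision-weighted mean
    return mu, cov

# A broad concept (large variance on the first axis) combined with a narrower context word.
mu_q, cov_q = gaussian_product(np.array([0.0, 0.0]), np.diag([4.0, 1.0]),
                               np.array([2.0, 0.0]), np.diag([1.0, 1.0]))
print(mu_q, np.diag(cov_q))
```

Unlike a plain vector sum, the combined mean is pulled towards whichever word is more certain along each direction, and the resulting covariance is tighter than either input.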

Datasets and Settings
Datasets: We evaluate our method using the main Animals with Attributes (AwA) and ImageNet1K benchmarks. We use the VGG-16 CNN (Simonyan and Zisserman, 2015) to extract a $d_x = 4096$ dimensional feature for each image. To train the word Gaussian representation, we use a combination of UkWAC (Ferraresi et al., 2008) and a Wikipedia corpus of 25 million tokens, and learn a $d_w = 100$ dimensional Gaussian representation. We set our margin parameter to $\Delta = 1$.
Settings: Our zero-shot setting involves training a visual recogniser (i.e., our mapping $A$) on a subset of classes, and evaluating it on a disjoint subset. For AwA, we use the standard 40/10 class split (Lampert et al., 2009), and for ImageNet we use a standard 800/200 class split (Mensink et al., 2012).
Competitors: We implement a set of representative alternatives for direct comparison with ours on the same visual features and text corpus. These include: cross-modal linear regression (LinReg, (Dinu et al., 2015)), non-linear regression (NLinReg, (Lazaridou et al., 2014; Socher et al., 2013)), ES-ZSL (Romera-Paredes and Torr, 2015), and a max-margin cross-modal energy function method (CME, (Akata et al., 2013; Frome et al., 2013)). Note that the CME strategy is the most closely related to ours in that it also trains a $d_x \times d_w$ matrix with a max-margin loss, but uses it in a bilinear energy function on vectors, $E(x, y) = x^T A y$, whereas our energy function operates on Gaussians.

Results
Table 1 compares our results on the AwA benchmark against alternatives using the same visual features, and word vectors trained on the same corpus. We observe that: (i) Our Gaussian embedding obtains the best performance overall. (ii) Our method outperforms CME, which shares an objective function and optimisation strategy with ours but operates on vectors rather than Gaussians. This suggests that embedding concepts as distributions rather than vectors does indeed bring significant benefit. A comparison to published results obtained by other studies on the same ZSL splits is given in Table 2, where we see that our results are competitive even though other methods exploit supervised embeddings such as attributes (Fu et al., 2014) or combinations of embeddings (Akata et al., 2013).
We next demonstrate our approach qualitatively by means of the contextual query idea introduced in Sec 2.5. Fig. 1 shows examples of how the top retrieved images differ intuitively when querying ImageNet for zero-shot categories 'plane' and 'horse' with different context words. To ease interpretation, we constrain the retrieval to the true target class, and focus on the effect of the context word. Our learned Gaussian method retrieves more relevant images than the word-vector sum baseline. E.g., with the Gaussian model all of the top-4 retrieved images for Passenger+Plane are relevant, while only two are relevant with the vector model. Similarly, the retrieved black horses are more clearly black.

Table 2: Comparison with published results on the same ZSL splits.

ImageNet
  ConSE (Norouzi et al., 2014)                 28.5%
  DeVISE (Frome et al., 2013)                  31.8%
  Large Scale Metric (Mensink et al., 2012)    35.7%
  Semantic Manifold (Fu et al., 2015)          41.0%
  Gaussian Embedding                           45.7%
AwA
  DAP (CNN feat) (Lampert et al., 2009)        53.2%
  ALE (Akata et al., 2013)                     43.5%
  TMV-BLP (Fu et al., 2014)                    47.1%
  ES-ZSL (Romera-Paredes and Torr, 2015)       49.3%
  Gaussian Embedding                           65.4%

Further Analysis
To provide insight into our contribution, we repeat the analysis of the AwA dataset and evaluate several variants of our full method. These use the same features and are trained with the same cross-domain max-margin loss (Eq. 5), but vary in the energy function and representation. From the results in Table 3, we make the following observations: (i) Bilinear-MeanVec outperforming Bilinear-WordVec shows that cross-modal (Sec 2.3) training of word Gaussians learns better point estimates of words than conventional word-vector training, since these two variants differ only in the choice of vector representation of class names. (ii) PPK-Gaussian outperforming PPK-MeanVec shows that having a model of intra-class variability (as provided by the word-Gaussians) allows better zero-shot recognition, since these two variants differ only in whether covariance is used at testing time.

Related Work and Discussion
Our approach models intra-class variability in both images and text: for example, the variability in visual appearance of military versus passenger planes, and the variability in context according to whether the word 'plane' is being used in a military or civilian sense. Given distribution-based representations in each domain, we find a cross-modal map that warps the two distributions into alignment.
Concurrently with our work, Ren et al. (2016) present a related study on distribution-based visual-text embeddings. Methodologically, they benefit from end-to-end learning of deep features as well as cross-modal mapping, but they only discriminatively train word covariances, rather than jointly training both means and covariances as we do.
With regard to efficiency, our model is fast to train when the pre-trained word-Gaussians are fixed and only the cross-modal mapping $A$ is optimised. However, training the mapping jointly with the word-Gaussians requires updating the representations of all words in the dictionary, and is thus much slower.
In terms of future work, an immediate improvement would be to generalise our Gaussian embeddings to model concepts as mixtures of Gaussians or other exponential family distributions (Rudolph et al., 2016; Chen et al., 2015). This would, for example, allow polysemy to be represented more cleanly as a mixture, rather than as a wide-covariance Gaussian as happens now. We would also like to explore distribution-based embeddings of sentences/paragraphs for zero-shot recognition based on class descriptions rather than class names (Reed et al., 2016). Finally, besides end-to-end deep learning of visual features, training non-linear cross-modal mappings is also of interest.

Conclusion
In this paper, we advocate using distribution-based embeddings of text and images when bridging the gap between vision and text modalities. This is in contrast to the common practice of point vector-based embeddings. Our distribution-based approach provides a representation of intra-class variability that improves zero-shot recognition, allows more meaningful retrieval by multiple keywords, and also produces better point estimates of word vectors.