A Probabilistic Model for Joint Learning of Word Embeddings from Texts and Images

Several recent studies have shown the benefits of combining language and perception to infer word embeddings. These multimodal approaches either simply combine pre-trained textual and visual representations (e.g. features extracted from convolutional neural networks), or use the latter to bias the learning of textual word embeddings. In this work, we propose a novel probabilistic model to formalize how linguistic and perceptual inputs can work in concert to explain the observed word-context pairs in a text corpus. Our approach learns textual and visual representations jointly: latent visual factors couple together a skip-gram model for co-occurrence in linguistic data and a generative latent variable model for visual data. Extensive experimental studies validate the proposed model. Concretely, on the tasks of assessing pairwise word similarity and image/caption retrieval, our approach attains equally competitive or stronger results when compared to other state-of-the-art multimodal models.


Introduction
Continuous-valued vector representation of words has been one of the key components in neural architectures for natural language processing (Mikolov et al., 2013; Pennington et al., 2014; Levy and Goldberg, 2014). The main idea is based on the distributional hypothesis (Harris, 1954), which states that words used in similar contexts have similar semantic meanings. To this end, words are mapped to points in a Euclidean space such that the displacement between their coordinates (i.e., embeddings) reflects similarity and difference in semantics (Pennington et al., 2014). As such, word embeddings have been shown to be useful in determining semantic and syntactic similarity between individual words (Mikolov et al., 2013; Levy et al., 2015), as well as in downstream NLP tasks, e.g., sentiment analysis, question answering, and coreference resolution, to name a few.
* On leave from U of Southern California
Most existing approaches rely solely on text corpora to infer word representations. While successful, the embeddings produced by such models do not necessarily reflect all inherent aspects of human semantic knowledge, such as the perceptual aspect (Feng and Lapata, 2010). This has motivated many researchers to explore different ways to infuse visual information, often represented in the form of pre-computed visual features, into word embeddings (Kiela and Bottou, 2014; Silberer et al., 2017; Collell et al., 2017; Lazaridou et al., 2015). The main theme is to take the text embeddings, the visual features, or both as they are and derive multimodal embeddings from them: through concatenation (Kiela and Bottou, 2014), or by treating visual features as regression targets (Lazaridou et al., 2015; Collell et al., 2017).
Despite the success of these prior efforts in yielding multimodal embeddings and applying them to downstream NLP tasks, there are still several deficiencies. In particular, the visual features (as such) are not guaranteed to be suitable for the word embedding task since they are typically optimized independently for another objective (e.g., image classification). Hence, fusing pre-computed word representations and visual features may not be a good strategy.
To address the above issues, we explore a new way to integrate linguistic and perceptual information. We develop a new model which jointly learns word embeddings from text and extracts latent visual information, from pre-computed visual features, that could supplement the linguistic embeddings in modeling the co-occurrence of words and their contexts in a corpus. Instead of using pre-trained visual features as they are or as regression targets, we posit that they contain latent perceptual information that could complement text in representing words.
More specifically, the proposed model consists of two components. The visual component is an unsupervised probabilistic model for learning latent factors that generate the visual data. The linguistic component is a revised SKIP-GRAM model in which the text embeddings work in concert with the latent visual factors to explain the occurrence of word-context pairs in a corpus. One advantage of our joint modeling is that it allows two-way interaction. On one hand, the linguistic information can guide the extraction of latent visual factors. On the other hand, the extracted visual factors can improve the modeling of word-context co-occurrences in text data. Another appealing property of our model is its natural ability to propagate perceptual information to the embeddings of words lacking visual features (e.g., abstract words) during learning.
We conduct extensive quantitative and qualitative experiments to examine and understand the effectiveness of our approach on the tasks of word similarity and image/caption retrieval. We show that it matches or outperforms other state-of-the-art approaches for learning multimodal embeddings.

Our Approach
We start by introducing the problem setup and notations. We then describe our model, namely PIXIE (ProbabIlistic teXtual Image Embeddings), for joint learning of word representations from text and perceptual information.

Setup and Background
We are given a corpus of $H$ tokens (words): $w_1, \ldots, w_i, \ldots, w_H$. From the corpus, we form a collection of word-context pairs $\delta_{wc} = (w, c)$, such that $w \in V_w$, $c \in V_c$, with $V_w$ and $V_c$ denoting respectively the word and context vocabularies. As in most previous work, the contexts for word $w_i$ are the words that surround it in an $L$-sized window. We introduce the binary indicator variables $y_{wc}$, such that $y_{wc} = 1$ if $\delta_{wc}$ appears in our collection, and $y_{wc} = 0$ otherwise.
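As a rough illustration of this setup, the sketch below collects word-context pairs from a toy token list and exposes the indicator y_wc. The function names `extract_pairs` and `y`, the toy sentence, and the reading of an L-sized window as L tokens on each side of the word are assumptions made for this example, not details given in the text.

```python
from collections import Counter


def extract_pairs(tokens, window):
    """Count (word, context) pairs, taking `window` tokens on each side.

    The symmetric-window reading is an assumption of this sketch.
    """
    counts = Counter()
    for i, word in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                counts[(word, tokens[j])] += 1
    return counts


# Toy corpus (hypothetical), with a window of one token on each side.
tokens = "the swan swims on the lake".split()
pairs = extract_pairs(tokens, window=1)


def y(word, context):
    """Binary indicator y_wc: 1 if the pair was observed, 0 otherwise."""
    return 1 if (word, context) in pairs else 0
```

The counts also record how often each pair occurs, which is the weighting that running SGD over the corpus applies implicitly.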
For some words with visual grounding (we will refer to them as visual words), we have access to a visual representation $x_w$. In practice, we use convolutional net features (see Section 4 for details).

SKIP-GRAM WITH NEGATIVE SAMPLING (SGNS)
The objective of SGNS is to learn word representations that are good at distinguishing the observed pairs ($y_{wc} = 1$) from non-observed or "negative" pairs ($y_{wc} = 0$), using logistic regression. Formally, SGNS maximizes the following log-likelihood:

$$\sum_{(w,c):\, y_{wc}=1} \log \sigma(e_w^\top v_c) \;+\; \sum_{(w,c):\, y_{wc}=0} \log \sigma(-e_w^\top v_c) \qquad (1)$$

where $\sigma(\cdot)$ is the sigmoid function, and $v_c$, $e_w$ denote respectively the vectors for the context $c$ and the target word $w$. The second term in (1) is intractable due to the large number of possible negative pairs, and is approximated by sampling $N$ negative examples $\{c_i\}_{i=1}^{N}$ for every observed pair of a word and its context. This gives rise to the following objective function for each observed pair:

$$\log \sigma(e_w^\top v_c) \;+\; \sum_{i=1}^{N} \log \sigma(-e_w^\top v_{c_i}) \qquad (2)$$

where $c_i$ is a (negative) context that does not appear in the context of $w$ (Mikolov et al., 2013). In practice, criterion (2) is optimized in an online fashion, using Stochastic Gradient Descent (SGD) over the observed pairs $\delta_{wc}$ in the corpus. Each observed pair $\delta_{wc}$ typically occurs several times in the corpus; performing SGD over the corpus therefore amounts to weighting equation (2) by the number of occurrences of each pair.
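To make criterion (2) concrete, here is a minimal sketch that evaluates the negative-sampling objective for a single observed pair. The helper names, the embedding dimension, and the randomly drawn toy vectors are assumptions of this example. It only computes the value of the objective and performs no gradient update.

```python
import math
import random


def log_sigmoid(t):
    """Numerically stable log(sigmoid(t))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sgns_pair_objective(e_w, v_c, negative_contexts):
    """log sigma(e_w . v_c) + sum_i log sigma(-e_w . v_ci) for one observed pair."""
    positive = log_sigmoid(dot(e_w, v_c))
    negative = sum(log_sigmoid(-dot(e_w, v_n)) for v_n in negative_contexts)
    return positive + negative


# Toy setting (hypothetical sizes): K-dimensional vectors, N negative contexts.
rng = random.Random(0)
K, N = 4, 3
e_w = [rng.gauss(0, 0.1) for _ in range(K)]
v_c = [rng.gauss(0, 0.1) for _ in range(K)]
negatives = [[rng.gauss(0, 0.1) for _ in range(K)] for _ in range(N)]
value = sgns_pair_objective(e_w, v_c, negatives)
```

Each term is the log of a probability, so the value is always negative. Training pushes it toward zero by raising the score of the observed pair and lowering the scores of the sampled negatives.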

Joint Visual and Text Modeling
We now describe our model, namely PIXIE (ProbabIlistic teXtual Image Embeddings), illustrated in Fig. 1, for joint learning of word representations from textual and perceptual information.
Formally, PIXIE is a probabilistic model of image features $x$ and word-context pairs' labels $y$. Similar to SKIP-GRAM, PIXIE represents each word $w$ and context $c$ with low-dimensional embeddings denoted respectively $e_w \in \mathbb{R}^K$ and $v_c \in \mathbb{R}^K$. PIXIE further assumes latent visual factors, $z_w \in \mathbb{R}^K$, for each word's visual representation $x_w$. Next we describe the two main components of PIXIE, namely textual and perceptual, in more detail.
Perceptual Component Each visual vector $x_w$ is drawn conditional on its latent representation $z_w$, i.e., $x_w \sim p_\theta(x \mid z_w)$, with prior $p_\theta(z) = \mathcal{N}(0, I)$. Since $x_w$ is real valued, we let $p_\theta(x_w \mid z_w)$ be a Gaussian parameterized by a generative neural network (or decoder). That is,

$$p_\theta(x_w \mid z_w) = \mathcal{N}\big(x_w \mid \mu_\theta(z_w), \Sigma_\theta(z_w)\big) \qquad (3)$$

For tractability purposes, $\Sigma_\theta$ is restricted to be diagonal. Moreover, both the covariance $\Sigma_\theta(z_w)$ (its diagonal) and the mean $\mu_\theta(z_w)$ are the outputs of a decoder network with parameters $\theta$ and input $z_w$.
Textual Component To model the occurrence/absence of a word-context pair $\delta_{wc}$ in the linguistic corpus, we adopt a Bernoulli (Ber) model:

$$p(y_{wc} \mid z_w) = \mathrm{Ber}\big(\sigma(f(e_w, z_w)^\top v_c)\big) \qquad (4)$$

The function $f(\cdot)$ defines how multimodal embeddings are fused. While many choices could be explored, we use the simple additive model:

$$f(e_w, z_w) = e_w + z_w \qquad (5)$$

For words without visual representation, we simply set the corresponding latent factors $z_w$ to the zero vector. Note that, without the visual factors $z_w$, equation (4) reduces to the Skip-Gram with negative sampling objective (1).
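The textual component can be sketched in a few lines. The example below computes the Bernoulli parameter of a word-context pair under the additive fusion of the linguistic embedding and the latent visual factors, and shows that a zero visual factor gives back the text-only probability. All vectors and helper names are hypothetical, chosen only for illustration.

```python
import math


def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))


def fuse_additive(e_w, z_w):
    """Additive fusion of the linguistic embedding and the visual factors."""
    return [e + z for e, z in zip(e_w, z_w)]


def pair_probability(e_w, z_w, v_c):
    """Bernoulli parameter p(y_wc = 1 | z_w) = sigma(f(e_w, z_w) . v_c)."""
    fused = fuse_additive(e_w, z_w)
    return sigmoid(sum(f * v for f, v in zip(fused, v_c)))


# Toy vectors (hypothetical values).
e_w = [0.2, -0.1, 0.4]
v_c = [0.5, 0.3, -0.2]
z_visual = [0.1, 0.0, -0.3]
z_missing = [0.0, 0.0, 0.0]  # word without a visual representation

p_with_visual = pair_probability(e_w, z_visual, v_c)
p_text_only = pair_probability(e_w, z_missing, v_c)
```

With `z_missing`, the fused vector equals `e_w`, so `p_text_only` is exactly the probability that a text-only skip-gram model would assign to the pair.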

Joint Model
The perceptual and the textual information interact through the shared latent $z_w$. The joint model of the above two sources of information takes the following form:

$$p(x, y) = \prod_{w} \int p_\theta(z_w)\, p_\theta(x_w \mid z_w) \prod_{c} p(y_{wc} \mid z_w)\, dz_w \qquad (6)$$

The intuition behind our joint formulation is to let the textual information guide the extraction of latent visual factors $z_w$. Through equations (4) and (5), the model will put high probability on factors $z_w$ reflecting patterns that can supplement the linguistic embeddings $e_w$ in explaining the word-context co-occurrences. Thus, the extracted latent visual factors can help predict the occurrence of a word with its contexts in the linguistic corpus, which encourages the model to leverage the perceptual information. The underlying goal is to infer visual and textual embeddings that work in concert to represent words.
Visual Information Propagation Equation (5) implies that the embeddings $z$, $e$ and $v$ will affect each other during the learning process. Interestingly, if a non-visual word $w_1$ shares a similar context $c$ with a visual word $w_2$, then the factor $z_{w_2}$ will affect $e_{w_1}$ via $v_c$. In other words, our formulation makes it possible to implicitly propagate perceptual information from one word to another through shared contexts. We illustrate this aspect in our experiments.

Approximate Inference and Learning
Training PIXIE amounts to inferring the posterior over the visual latent factors, $p_\theta(z \mid x, y)$, as well as finding the decoder's parameters $\theta$ and the word and context embeddings, $e$ and $v$, that maximize the likelihood (6). However, as in many complex probabilistic models, the likelihood (due to the integral over $z$) and the posterior are intractable. We therefore resort to approximation techniques. More precisely, we rely on Variational Inference (VI) (Blei et al., 2017). The idea of VI is to introduce a tractable approximate posterior distribution $q_\phi(z \mid x)$ (the variational distribution) and optimize a lower bound on the likelihood, known as the Evidence Lower BOund (ELBO). The latter can be written for each word $w$ as follows:

$$\mathcal{L} = \mathbb{E}_{q_\phi(z_w \mid x_w)}\Big[\log p_\theta(x_w \mid z_w) + \sum_{c} \log p(y_{wc} \mid z_w)\Big] - \mathrm{KL}\big(q_\phi(z_w \mid x_w) \,\|\, p_\theta(z_w)\big) \qquad (7)$$

where $\mathrm{KL}(\cdot \,\|\, \cdot)$ is the Kullback-Leibler divergence. The variational distribution is chosen to be a multivariate Gaussian parameterized by an inference network (or encoder) which takes $x$ as input, namely

$$q_\phi(z_w \mid x_w) = \mathcal{N}\big(z_w \mid \mu_\phi(x_w), \Sigma_\phi(x_w)\big) \qquad (8)$$

where we drop the dependency on all $y_{wc}$ variables to remain computationally tractable. The pair of encoder and decoder neural networks gives rise to the interpretation of PIXIE's visual component (formed by $z$ and $x$) as a probabilistic autoencoder. In fact, if we drop the textual part in PIXIE, namely $y$, $e$ and $v$, then we recover the Variational Auto-Encoder (VAE) (Kingma and Welling, 2013).

Lastly, we approximate the intractable (i) expectation with respect to $q_\phi(z \mid x)$ and (ii) sum over the negative pairs in (7), by relying on a Monte Carlo estimator of $\mathcal{L}$. Concretely, for (ii) we use negative sampling as in (2). Concerning (i), for every observed $x_w$, we sample $\{z_w^{(j)}\}_{j=1}^{J}$ using the reparameterization trick (Kingma and Welling, 2013), i.e., $z_w^{(j)} = \mu_\phi(x_w) + \Sigma_\phi(x_w)^{1/2}\, \epsilon^{(j)}$ with $\epsilon^{(j)} \sim \mathcal{N}(0, I)$. Then we approximate $\mathcal{L}$ with:

$$\mathcal{L} \approx -\mathrm{KL}\big(q_\phi(z_w \mid x_w) \,\|\, p_\theta(z_w)\big) + \frac{1}{J} \sum_{j=1}^{J} \Big[ \log p_\theta(x_w \mid z_w^{(j)}) + \sum_{c:\, y_{wc}=1} \Big( \log \sigma\big((e_w + z_w^{(j)})^\top v_c\big) + \sum_{i=1}^{N} \log \sigma\big(-(e_w + z_w^{(j)})^\top v_{c_i}\big) \Big) \Big] \qquad (9)$$

The last two summands correspond to the familiar conditional likelihood term in the SGNS model, augmented with latent visual factors. We optimize the objective (9) via SGD with respect to both the encoder/decoder network parameters ($\theta$ and $\phi$) and the embeddings ($e$ and $v$).
We evaluate the gradients of L, with respect to θ and φ using backpropagation. Similarly, the gradient with respect to e and v can be easily carried out using automatic differentiation tools. Our learning procedure is summarized in Algorithm 1.
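The sampling step and the Monte Carlo average can be sketched as follows. This is a simplified illustration under stated assumptions rather than the training procedure itself: the encoder outputs are replaced by fixed lists `mu` and `log_var` (a log-variance parameterization is assumed), the image reconstruction term is left out, only one observed context is used, and no gradients are computed. All names and values are hypothetical.

```python
import math
import random


def sample_z(mu, log_var, rng):
    """Reparameterization: z = mu + sigma * eps, with eps ~ N(0, I)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_to_standard_normal(mu, log_var):
    """Closed-form KL( N(mu, diag(exp(log_var))) || N(0, I) )."""
    return 0.5 * sum(math.exp(lv) + m * m - 1.0 - lv
                     for m, lv in zip(mu, log_var))


def log_sigmoid(t):
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))


def text_term(e_w, z, v_pos, v_negs):
    """SGNS-style term for one observed context, with additive fusion e_w + z."""
    fused = [e + zi for e, zi in zip(e_w, z)]

    def score(v):
        return sum(f * vi for f, vi in zip(fused, v))

    return log_sigmoid(score(v_pos)) + sum(log_sigmoid(-score(v)) for v in v_negs)


def mc_text_elbo(mu, log_var, e_w, v_pos, v_negs, J, rng):
    """Monte Carlo estimate with J samples; the reconstruction term is omitted."""
    samples = [sample_z(mu, log_var, rng) for _ in range(J)]
    expected = sum(text_term(e_w, z, v_pos, v_negs) for z in samples) / J
    return expected - kl_to_standard_normal(mu, log_var)


# Toy inputs (hypothetical values standing in for encoder outputs and embeddings).
rng = random.Random(0)
mu, log_var = [0.1, -0.2], [-1.0, -1.0]
e_w, v_pos = [0.3, 0.1], [0.4, 0.2]
v_negs = [[-0.1, 0.5], [0.2, -0.3]]
estimate = mc_text_elbo(mu, log_var, e_w, v_pos, v_negs, J=5, rng=rng)
```

Because the noise is drawn independently of `mu` and `log_var`, the sampled `z` is a deterministic function of the encoder outputs, which is what allows gradients to flow back to the encoder in an automatic differentiation framework.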

Algorithm 1 Variational PIXIE
Input: x, y, sample sizes B and J
Steps:
• Use G to update θ, φ, e and v (e.g., with ADAM)
until convergence
return θ, φ, e and v

Inference Once the parameters of the model are learned, for any given word with or without visual representation, we can compute its multimodal embedding. As a shorthand, let the binary variable $m_w$ denote whether or not the word $w$ has a visual representation. The multimodal embedding for $w$ can be written as

$$s_w = e_w + m_w\, \mu(x_w) \qquad (10)$$

where $\mu(x_w) = \mathbb{E}_q(z_w)$ is the output of the encoding neural network, cf. Eq. (8).
In our experiments, we have also studied an alternative way to compose multimodal embeddings by concatenating the two vectors $e_w$ and $\mu(x_w)$:

$$t_w = e_w \oplus m_w\, \mu(x_w) \qquad (11)$$

where $\oplus$ denotes concatenation. Note that for non-visual words, only zeros are appended to $e_w$. One advantage of $t$ over $s$ is that if one uses distances to measure similarity, $t$ can be seen as a simple summation of distances in two different spaces (in terms of $e$ and $\mu$ respectively).
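The two compositions can be illustrated with a small sketch. The function names and the toy vectors below are assumptions of this example, and the additive form of the first composition is assumed to mirror the additive fusion used in the model. The final check uses the squared Euclidean distance, which is one reading of the remark on distances: under that distance, the concatenated embedding decomposes exactly into a linguistic part and a visual part.

```python
def additive_embedding(e_w, mu_x, has_visual):
    """Composition by addition: e_w + m_w * mu(x_w)."""
    m = 1.0 if has_visual else 0.0
    return [e + m * u for e, u in zip(e_w, mu_x)]


def concatenated_embedding(e_w, mu_x, has_visual):
    """Composition by concatenation: zeros are appended for non-visual words."""
    m = 1.0 if has_visual else 0.0
    return list(e_w) + [m * u for u in mu_x]


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy vectors (hypothetical values) for two visual words.
e_goose, mu_goose = [0.2, 0.1], [0.5, -0.4]
e_swan, mu_swan = [0.3, 0.0], [0.4, -0.5]

t_goose = concatenated_embedding(e_goose, mu_goose, has_visual=True)
t_swan = concatenated_embedding(e_swan, mu_swan, has_visual=True)

# Squared distance between concatenations = linguistic part + visual part.
total = sq_dist(t_goose, t_swan)
parts = sq_dist(e_goose, e_swan) + sq_dist(mu_goose, mu_swan)
```

For a non-visual word, `mu_x` can be any placeholder of the right length, since it is multiplied by zero.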

Related Work
Combining language and perception has been recently considered in various NLP tasks such as machine translation (Calixto and Liu, 2017), visual question generation (Mostafazadeh et al., 2016), image captioning (Klein et al., 2015), etc. In this work, we focus on learning word embeddings from images and texts. Multimodal embeddings have been studied in several recent research works. One strategy is to obtain word embeddings from linguistic data and visual data independently and then proceed with some kind of fusion step. Kiela and Bottou (2014) simply concatenate pre-trained linguistic word embeddings and visual features computed by convolutional nets. Bruni et al. (2014) perform an additional step of dimensionality reduction via singular value decomposition. Silberer et al. (2017) extend this work by feeding the linguistic embedding and visual features into a stacked auto-encoder for nonlinear dimensionality reduction. The above-mentioned approaches perform a two-stage process to derive multimodal representations (unimodal inference followed by fusion) and have been evaluated only on words for which both perceptual and textual representations are available.
A standing question is how to propagate visual information from words with visual features to words lacking them (for instance, abstract words). While the previous methods fall short on that, the recent work by Collell et al. (2017) addresses this challenge by learning a mapping from language to vision, using a set of words with known linguistic embeddings and visual features. This mapping can then be used to infer visual representations for new words from their textual embeddings.
All the aforementioned methods rely on independently pre-trained linguistic embeddings and visual features. In this work, we propose a different strategy, which consists in adapting those representations so that the information can be fused in earlier stages. In this respect, the closest work to ours is (Lazaridou et al., 2015), which proposes to augment the SKIP-GRAM objective function with a term mapping the textual embeddings to the visual features. Crudely, the linguistic embeddings must therefore predict both the text co-occurrences and (pre-trained) visual features. We emphasize two key differences with our approach. First, instead of performing a regression or mapping from the textual embeddings to the visual features, our model learns to infer perceptual latent factors to retain only the portion of visual information that can supplement the linguistic embeddings in representing words. Second, while Lazaridou et al. (2015) combine two objectives, we use a joint probabilistic model integrating both visual and text information in a principled way. Specifically, our model seeks latent factors that are good at explaining the word-context co-occurrences. For instance, a visual feature of (an image of) OCEAN often contains information about SKY and BLUE; such visual information could be beneficial to predict co-occurrence of tokens in the context of OCEAN. This desideratum further encourages the learned embeddings to be visually grounded. In our experiments, we show that our approach tends to group concrete visually similar concepts together.

Experiments
In this section, we evaluate our model and contrast it to other competing approaches on two different tasks: word similarity and image/caption retrieval.

Setup
Text corpus We use the Text8 WIKIPEDIA corpus containing over 17 million tokens. Text8 was pre-processed to contain only letters and nonconsecutive spaces. After removing infrequent words, we obtain a vocabulary of 50,000 unique words.
Visual representation of words For each word in the vocabulary, we recover all the synsets that it belongs to using the WordNet interface of the NLTK module (Python) (Bird et al., 2009). We then remove the synsets not covered by our ImageNet dataset. This results in 9,713 words, out of the 50,000 words in the vocabulary. For each visual word, we randomly draw 1,000 distinct images in ImageNet. If the number of images for a word is less than 1,000, we increase the coverage using images belonging to the hypernyms of the considered word's synsets, as in (Kiela and Bottou, 2014). We then take the average of these features as the word's visual representation x.
Hyper-parameter setting For all models, we set the dimension of linguistic and visual embeddings, e and z, to 100, following many previous works. In our model, the encoder/decoder neural networks are implemented as one-hidden-layer neural nets with 500 hidden units each. The dimensions of the inputs and the outputs of the decoder neural networks are 100 and 1,024 respectively (1,024 and 100 for the encoder). For the encoder, the hidden units are hyperbolic tangent, and the output units are linear. For the decoder, the hidden units are hyperbolic tangent while the outputs are sigmoid. For SGNS, we set the window size L to 10 and the number of negative samples to 64. Our model is learned by Stochastic Gradient Descent using the ADAM optimizer (Kingma and Ba, 2014) with a learning rate set to 0.001.

Table 1: Word similarity datasets (reference, number of word pairs).
EN-RG (Rubenstein and Goodenough, 1965): 65
SimLex (Hill et al., 2015): 999
MTurk (Radinsky et al., 2011; Halawi et al., 2012): 287
WORDSIM (Finkelstein et al., 2001): 350
REL (Agirre et al., 2009): 150
SIM (Agirre et al., 2009): 200
SEMSIM (Silberer and Lapata, 2014): 5494
VISSIM (Silberer and Lapata, 2014): 5494

Table legend: (Lazaridou et al., 2015), §: (Collell et al., 2017)

Task 1: Word Similarity
Datasets Word similarity is a common type of evaluation task for measuring the effectiveness of word embeddings. To this end, we retain 10 benchmark datasets consisting of pairs of words associated with similarity scores given by human judges. Table 1 summarizes their basic properties. There are different types of similarities being assessed. SEMSIM, SimLex, SIM, EN-RG and EN-MC focus on semantic or taxonomic similarity (e.g., CAR is similar to AUTOMOBILE). MEN, REL and MTurk consider general relatedness (e.g., CAR is related to GARAGE). VISSIM is about visual similarity (e.g., GOOSE looks like SWAN). Note that SIM and REL are respectively the similarity and relatedness subsets of the full WORDSIM dataset (Finkelstein et al., 2001). VISSIM contains the same word pairs as SEMSIM.
Competing models We benchmark our model PIXIE against several strong uni- and multi-modal models listed below:
• SGNS: Skip-Gram with Negative Sampling (Mikolov et al., 2013). Without the visual component, PIXIE reduces to SGNS. We can thus assess the impact of the perceptual information by comparing PIXIE to SGNS.
• VAE: Variational Auto-Encoder (Kingma and Welling, 2013), which corresponds to the visual-specific component of PIXIE.
• CNN: Visual features extracted from a convolutional neural net as described in Section 4.1.
• CNN⊕SGNS (Kiela and Bottou, 2014): Concatenation of CNN and SKIP-GRAM embeddings.
• VAE⊕SGNS: Concatenation of VAE and SKIP-GRAM embeddings.
• V-SGNS (Lazaridou et al., 2015): A multimodal approach which augments SGNS with a term that treats CNN visual features as regression targets. Comparisons with V-SGNS will allow us to evaluate the impact of our modeling assumptions.
• IV-SGNS (Collell et al., 2017): Learns a mapping from SGNS embeddings to CNN visual features.
Due to large discrepancies in experimental setups across previously published methods and results, we re-implemented all the baselines and evaluated them under the same conditions. For (Lazaridou et al., 2015), we implemented its model "A" as model "B" is comparable according to the original authors. For (Collell et al., 2017), we implemented both linear and nonlinear variants.
Evaluation metrics We use the cosine to measure the similarity between word representations. To assess the coherence between human ratings and models' predictions, we use the Spearman correlation coefficient.
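The evaluation protocol can be sketched in plain Python as follows. The embeddings and human ratings below are made up for illustration, and the helper names are assumptions of this example. The Spearman coefficient is computed as the Pearson correlation of average ranks, which is its standard definition in the presence of ties.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def average_ranks(values):
    """Ranks starting at 1; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def spearman(x, y):
    return pearson(average_ranks(x), average_ranks(y))


# Toy embeddings and human ratings (hypothetical values).
embeddings = {
    "car": [1.0, 0.1], "automobile": [0.9, 0.2],
    "garage": [0.5, 0.8], "swan": [-0.2, 1.0],
}
human = [(("car", "automobile"), 9.0), (("car", "garage"), 6.5), (("car", "swan"), 1.0)]
model_scores = [cosine(embeddings[a], embeddings[b]) for (a, b), _ in human]
human_scores = [score for _, score in human]
rho = spearman(model_scores, human_scores)
```

Since Spearman correlation depends only on ranks, a model is rewarded for ordering the pairs as humans do, regardless of the scale of its cosine values.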

Main results
The results across different datasets are shown in Table 2. We perform evaluations under two settings: by considering (i) word similarity between visual words only and (ii) between all words (column 100% in Table 2). For the models CNN, VAE and their concatenation with SGNS embeddings, the latter setting is not applicable. The two last rows correspond to the multimodal embeddings inferred from our model. In particular, PIXIE + (resp. PIXIE ⊕ ) represents the multimodal embeddings built using Eq. (10) (resp. Eq. (11)).
Overall, we note that PIXIE ⊕ offers the best performance in almost all situations. This provides strong empirical support for the proposed model. Below, we discuss the above results in more depth to better understand them and characterize the circumstances in which our model performs better.
How relevant is our formulation? Except PIXIE and V-SGNS, most of the multimodal competing methods rely on independently pre-computed linguistic embeddings. As Table 2 shows, PIXIE and V-SGNS are often the best performing multimodal models, which provides empirical evidence that accounting for perceptual information while learning word embeddings from text is beneficial. Moreover, the superior performance of PIXIE ⊕ over V-SGNS suggests that our model does a better job at combining perception and language to learn word representations.
Joint learning is beneficial PIXIE ⊕ outperforms VAE⊕SGNS in almost all cases, which demonstrates the importance of joint learning.
Where does our approach perform better? On datasets that focus on semantic/taxonomic similarity, our approach dominates all other methods.
On datasets focusing on general relatedness, our approach obtains mixed results. While dominating other approaches on MEN, it tends to perform worse than SGNS on MTurk and REL (under the 100% setting). One possible explanation is that general relatedness tends to focus more on "extrapolating" from one word to another word (such as SWAN is related to LAKE), while our approach better models more concrete relationships (such as SWAN is related to GOOSE). The low performance of CNN and VAE is consistent with this hypothesis.
On the VISSIM dataset focusing on visual similarity, both CNN⊕SGNS and VAE⊕SGNS perform the best, strongly suggesting that visual and linguistic data are complementary. Our approach comes very close to these two methods. Note that our learning objective is to jointly explain visual features and word-context co-occurrences. Thus, two visually similar words, which never occur within the same context, could be mapped into slightly different directions in the latent space.
Visual Propagation Here we wish to evaluate the ability of our model to propagate perceptual information to words lacking visual features. To this end, we randomly select a subset of 2,000 words for which we have visual features, and we train our model under two different settings: the visual features of the selected 2K words (i) are taken into account (PIXIE ⊕ ), (ii) are ignored, i.e. set to zero (PIXIE ⊕ ( * ) ). We then perform evaluations, under the two settings, on the datasets of Table 1 considering only pairs composed of words in the above subset of 2K words. As baselines for this experiment, we consider SGNS and the multimodal approaches which can propagate perceptual information, namely V-SGNS and IV-SGNS, as well as their outputs when the 2K visual features are ignored (denoted by V-SGNS ( * ) and IV-SGNS ( * ) ).
The results are given in Table 3. We observe that PIXIE ⊕ ( * ) outperforms SGNS in almost all cases. Recall that, if we ignore the visual features for all words, PIXIE reduces to SGNS. We can therefore attribute the performance improvement of PIXIE ⊕ ( * ) over SGNS to the propagation of visual information to the subset of 2K words. Compared to multimodal methods, PIXIE ⊕ ( * ) (resp. PIXIE ⊕ ) performs better than V-SGNS ( * ) (resp. V-SGNS) and IV-SGNS ( * ) (resp. IV-SGNS) in almost all situations. This suggests that our formulation allows perceptual information to propagate better. In Table 4, we report the cosine similarity between 5 semantically/visually coherent word pairs (from our subset of 2K words). Although the visual vectors of these words were removed during training, the PIXIE word embeddings of each pair are closer to each other than their SGNS counterparts, which provides further support for the propagation of visual information under PIXIE.

Qualitative analysis
Table 5 displays several qualitative examples of word similarity. We have selected 4 words: goose, brave, birthstone and savagery. The first two have visual feature representations in our training dataset and the last two do not. Furthermore, for each case, we chose one concrete and one abstract word. For each word, we identify its nearest neighbors in the embedding space.
For the visual words, there is a noticeable difference between our method and others. For instance, for the word goose, SGNS expresses more "general" relatedness and returns other animals like pig or shark, while our approach is more specific and tends to give visually similar neighbors by focusing on birds that look like a goose. V-SGNS's result is somewhat in between. On the abstract word brave, we observe that PIXIE ⊕ tends to select more explicit embodiments of the adjective brave than SGNS and V-SGNS.
Moving towards the non-visual words, we do not seem to find a consistent discrepancy pattern between V-SGNS and PIXIE ⊕ , though, as for visual words, both methods seem to select more explicit exemplars compared to SGNS. For instance, for the abstract word savagery, both multimodal approaches suggest cannibals and zombies.

Task 2: Image and Caption Retrieval
We now study the usefulness of the learned word embeddings for the tasks of image and caption retrieval. Our hypothesis is that multimodal word embeddings will perform better for downstream tasks involving multimodal information.
Experimental setup We use the Flickr30K dataset (Young et al., 2014) containing 31,000 images and 155,000 sentences (5 captions per image). The sentences describe the images. The task is to identify the best sentence describing an image or to identify the best image depicting a sentence. We follow the data split setting provided by , in which 1,000 images are used for validation and 1,000 for testing. The rest is used for training.
The retrieval models compute the proximity between the image features and the sentence embeddings. For image features, we use the precomputed features provided by Faghri et al. (2017), which are extracted from the FC7 layer of VGG-19 (Simonyan and Zisserman, 2014). These 4,096-dimensional features are then linearly mapped to 1,024-dimensional features. For sentences, we use a one-layer GRU over the sequences of word embeddings, resulting in 1,024-dimensional sentence embeddings.
We use a triplet loss to train the retrieval model such that the inner product between the corresponding image feature and the sentence is greater than the inner products with incorrect sentences (or images) (Kiros et al., 2014). The linear mapping and the GRU layer are then optimized to minimize the loss. We use the ADAM optimizer with a learning rate of 0.0002, which we divide by 10 every 15 epochs, and we train the model for 30 epochs. Note that we do not fine-tune either the original visual features or the word embeddings.
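A hinge-based triplet loss of this kind can be sketched as below. The margin value, the summation over all negatives, and the function names are assumptions of this example, since the text does not specify them. The vectors stand in for the mapped image features and the GRU sentence embeddings.

```python
def inner(a, b):
    return sum(x * y for x, y in zip(a, b))


def triplet_loss(image, sentence, wrong_sentences, wrong_images, margin=0.2):
    """Hinge triplet loss over both retrieval directions for one matching pair.

    The margin of 0.2 and the sum over negatives are assumptions of this sketch.
    """
    positive = inner(image, sentence)
    loss = 0.0
    for s_neg in wrong_sentences:   # caption retrieval: image vs. wrong sentences
        loss += max(0.0, margin - positive + inner(image, s_neg))
    for i_neg in wrong_images:      # image retrieval: sentence vs. wrong images
        loss += max(0.0, margin - positive + inner(i_neg, sentence))
    return loss


# Toy vectors (hypothetical values).
image = [1.0, 0.0]
sentence = [0.9, 0.1]
wrong_sentences = [[0.0, 1.0]]
wrong_images = [[0.8, 0.0]]
loss = triplet_loss(image, sentence, wrong_sentences, wrong_images)
```

In this toy case the wrong sentence is already separated by more than the margin and contributes nothing, while the wrong image scores 0.72 against a matching score of 0.9 and therefore still incurs a small penalty.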
Results Table 6 summarizes the results. The evaluation metrics are accuracies at top-K (K=1, 5, or 10) retrieved sentences or images. Our model consistently outperforms SGNS and other competing multimodal methods, which provides additional support for the benefits of our approach.
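The top-K accuracy used here can be computed as in the following sketch. The function name and the score matrix are hypothetical, and the sketch assumes a single correct candidate per query, which is a simplification: for caption retrieval on Flickr30K each image has five matching captions.

```python
def recall_at_k(scores, correct, k):
    """Fraction of queries whose correct item is among the k highest-scoring candidates.

    scores[q][j] is the similarity of query q to candidate j;
    correct[q] is the index of the matching candidate (one per query, an assumption).
    """
    hits = 0
    for row, target in zip(scores, correct):
        top = sorted(range(len(row)), key=lambda j: row[j], reverse=True)[:k]
        hits += target in top
    return hits / len(scores)


# Toy similarity matrix (hypothetical values): 3 queries, 3 candidates.
scores = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.1],
]
correct = [0, 1, 2]
r_at_1 = recall_at_k(scores, correct, k=1)
r_at_2 = recall_at_k(scores, correct, k=2)
```

Running the same function on the transposed score matrix gives the metric for the opposite retrieval direction.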

Conclusion
We propose PIXIE, a novel probabilistic model joining textual and perceptual information to infer multimodal word embeddings. In our model, both linguistic and visual latent factors work in concert to explain the co-occurrences of words and their contexts in a corpus. Empirical results show that our model achieves equally competitive or stronger results when compared to state-of-the-art methods for multimodal embeddings.
Currently our model relies on unsupervised learning to infer visual factors. Explicit knowledge of similar and dissimilar visual categories could potentially disentangle latent factors better for alignment with linguistic data. How to incorporate visual domain knowledge more explicitly into the model would be an interesting direction for future research. While we build on skip-gram, the idea of PIXIE could be extended to other word embedding models, e.g., GloVe (Pennington et al., 2014), ELMo (Peters et al., 2018), etc.