Learning Visually Grounded Sentence Representations

We investigate grounded sentence representations, where we train a sentence encoder to predict the image features of a given caption—i.e., we try to “imagine” how a sentence would be depicted visually—and use the resultant features as sentence representations. We examine the quality of the learned representations on a variety of standard sentence representation quality benchmarks, showing improved performance for grounded models over non-grounded ones. In addition, we thoroughly analyze the extent to which grounding contributes to improved performance, and show that the system also learns improved word embeddings.


Introduction
Following the word embedding upheaval of the past few years, one of NLP's next big challenges has become the hunt for universal sentence representations: generic representations of sentence meaning that can be "plugged into" any kind of system or pipeline. Examples include Paragraph2Vec (Le & Mikolov, 2014), C-Phrase (Pham et al., 2015), SkipThought (Kiros et al., 2015) and FastSent (Hill et al., 2016a). These representations tend to be learned from large corpora in an unsupervised setting, much like word embeddings, and effectively "transferred" to the task at hand. It has been argued that purely text-based semantic models, which represent word meaning as a distribution over other words (Harris, 1954; Turney & Pantel, 2010; Clark, 2015), suffer from the grounding problem (Harnad, 1990). That is, despite ample evidence of humans representing meaning with respect to an external environment and sensorimotor experience (Barsalou, 2008; Louwerse, 2008), such systems rely solely on textual data. This gives rise to an infinite regress in text-only semantic representations, i.e., words are defined in terms of other words, ad infinitum. Current sentence representation models are doubly exposed to the grounding problem, especially if they represent sentence meaning as a distribution over other sentences. From a more practical standpoint, recent work has shown that grounding word representations yields improvements on a wide variety of tasks (Baroni, 2016; Kiela, 2017).
In short, the grounding problem is characterized by the lack of an alignment between symbols and external information. Here, we address this problem by aligning textual information with paired visual data. Specifically, we hypothesize that sentence representations can be enriched with external information, i.e., grounded, by imposing the constraint that they be useful for tasks that involve visual semantics. We investigate the performance of these sentence representations and the effect of grounding on a wide variety of NLP tasks.
There has been much recent interest in generating actual images from text (Goodfellow et al., 2014; van den Oord et al., 2016; Mansimov et al., 2016). Our method takes a slightly different approach: instead of predicting actual images, we train a deep recurrent neural network to predict the latent feature representations of images. That is, we are specifically interested in the semantic content of visual representations and in how useful that information is for learning sentence representations. One can think of this as trying to imagine, or form a "mental picture" of, a sentence's meaning. Much like a sentence's meaning in classical semantics is given by its model-theoretic ground truth (Tarski, 1944), our ground truth is provided by images.
Grounding is much more useful for concrete words and sentences: a sentence such as "democracy is a political system" does not yield any coherent mental picture. In fact, there is evidence that meaning is dually coded in the human brain: while abstract concepts are processed in linguistic areas, concrete concepts are processed in both linguistic and visual areas (Paivio, 1990). Anderson et al. (2017) recently corroborated this hypothesis using semantic representations and fMRI studies. In order to accommodate the fact that much of language is abstract, we take sentence representations obtained using text-only data (which are better at representing abstract meaning) and combine them with the grounded representations that our system learns (which are presumably better at representing concrete meaning).
In what follows, we introduce a system for grounding sentence representations by learning to predict visual content. We first examine how well this system grounds, by evaluating on the COCO5K caption and image retrieval dataset. We then analyze the performance of grounded representations on a variety of NLP transfer tasks, showing that grounding increases performance over text-only representations. In the remainder, we investigate the contribution of grounding in more depth, and show that our system learns grounded word embedding projections that outperform non-grounded ones. To the best of our knowledge, this is the first work to comprehensively study grounding for distributed sentence representations.

Related work
Sentence representations Although there appears to be a consensus with regard to the methodology for learning word representations, this is much more of an open problem for sentence representations. Recent work has ranged from trying to learn to compose word embeddings (Le & Mikolov, 2014; Pham et al., 2015; Wieting et al., 2016; Arora et al., 2017), to neural architectures for predicting the previous and next sentences (Kiros et al., 2015), to learning representations via large-scale supervised tasks (Conneau et al., 2017). Hill et al. (2016a) compare a wide selection of unsupervised and supervised methods, including a basic caption prediction system that is similar to ours. That study finds that "different learning methods are preferable for different intended applications", i.e., that the matter of optimal universal sentence representations is as yet far from decided. Here, we show how grounding can help for a variety of NLP classification tasks.
Multi-modal semantics Language grounding in semantics has been motivated by evidence that human meaning representations are grounded in perceptual experience (Jones et al., 1991; Perfetti, 1998; Andrews et al., 2009; Riordan & Jones, 2011). The field of multi-modal semantics, which aims to address this issue by enriching textual representations with information from other modalities, has mostly been concerned with word representations (Bruni et al., 2014; Baroni, 2016; Kiela, 2017). This work is closely related to Chrupała et al. (2015), who also aim to ground language by relating images to captions, but with a different architecture and objective, and less of a focus on universal sentence representations.

Bridging vision and language
There is a large body of work that involves jointly embedding images and text, at the word level (Frome et al., 2013; Joulin et al., 2016), the phrase level (Karpathy et al., 2014; Li et al., 2016), and the sentence level (Karpathy & Fei-Fei, 2015; Klein et al., 2015; Kiros et al., 2015; Reed et al., 2016). Our model similarly learns to map sentence representations to be consistent with a visual semantic space, and we focus on studying how these grounded text representations transfer to NLP tasks. Moreover, there has been a lot of work in recent years on the task of image captioning (Bernardi et al., 2016; Vinyals et al., 2015; Mao et al., 2015; Fang et al., 2015). Here, we do the opposite: we predict the correct image (features) from the caption, rather than the caption from the image (features). A similar idea was recently successfully applied to multi-modal machine translation (Elliott & Kádár, 2017).

Figure 2: Bidirectional LSTM with element-wise max architecture.

Approach
In the following, let D = {(I_k, C_k)}_{k=1}^N be a dataset where each image I_k is associated with one or multiple captions C_k. A prominent example of such a dataset is the COCO corpus (Lin et al., 2014), which consists of 123,287 images with up to 5 corresponding captions for each image. The objective of our approach is to encode a given sentence, i.e., a caption C, and learn to ground it in the corresponding image I. To encode the sentence, we train a bidirectional LSTM (BiLSTM) on the caption, where the input is a sequence of projected word embeddings. We combine the final left-to-right and right-to-left hidden states of the LSTM by taking the element-wise maximum, obtaining a sentence encoding.
We then examine three distinct methods for forcing the sentence encoding to be grounded. In the first method, we try to predict the image features (Cap2Img). That is, we learn to map the caption to the same space as the image features that represent the correct image. We call this strong perceptual grounding, since we take the visual input directly into account. An alternative method considers the fact that one image in COCO has multiple captions (Cap2Cap): we learn to predict which other captions are valid descriptions of the same image, which leads to what we call implicit or "weak" grounding. Finally, we experiment with a model that optimizes both these objectives jointly, and so incorporates both strong and weak grounding: that is, we predict both images and alternative captions for the same image (Cap2Both). Please see Figure 1 for an illustration of the various models. In what follows, we discuss them in more technical detail.

Bidirectional LSTM
To learn sentence representations, we employ a bidirectional LSTM architecture. In particular, let x = (x_1, . . . , x_T) be an input sequence where each word is represented via an embedding x_t ∈ R^n. Using a standard LSTM (Hochreiter & Schmidhuber, 1997), the hidden state at time t, denoted h_t ∈ R^m, is computed as

h_t, c_t = LSTM(x_t, h_{t−1}, c_{t−1}; Θ),

where c_t denotes the cell state of the LSTM and Θ denotes its parameters.
To exploit contextual information in both input directions, we process input sentences using a bidirectional LSTM that reads the input sequence in both normal and reverse order. In particular, for an input sequence x of length T, we compute the hidden state at time t, h_t ∈ R^{2m}, via

h_t = [h_t^fwd ; h_t^bwd],

where the two LSTMs process x in forward and backward order, respectively. We subsequently use max : R^m × R^m → R^m, the element-wise maximum, to combine the final states of the two directions, yielding the representation of a caption after it has been processed with the BiLSTM:

h_T = max(h_T^fwd, h_T^bwd).

We use GloVe vectors (Pennington et al., 2014) for our word embeddings. The embeddings are kept fixed during training, which allows a trained sentence encoder to transfer to tasks (and a vocabulary) that it has not yet seen, provided GloVe embeddings are available. Since GloVe representations are not tuned to represent grounded information, we learn a global transformation from GloVe space to grounded word space. Specifically, let x ∈ R^n be an original GloVe embedding. We then learn a linear map U ∈ R^{n×n} such that x′ = Ux and use x′ as input to the BiLSTM. The linear map U and the BiLSTM are trained jointly.
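A minimal NumPy sketch of the encoding step, with random stand-ins for the learned map U and the two directional LSTM states (in the real model these are learned jointly; the dimensionalities below follow the implementation details):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, T = 300, 1024, 7   # GloVe dim, LSTM hidden dim, caption length

# Hypothetical GloVe embeddings for one caption, mapped into "grounded
# word space" by the learned linear transformation U (random here).
glove = rng.normal(size=(T, n))
U = rng.normal(size=(n, n))
x_grounded = glove @ U.T          # x' = Ux for every token

# Stand-ins for the final forward and backward LSTM hidden states.
h_fwd = rng.normal(size=(m,))
h_bwd = rng.normal(size=(m,))

# Element-wise maximum combines the two directions into the caption encoding.
h_T = np.maximum(h_fwd, h_bwd)
assert h_T.shape == (m,)
```

Note that the element-wise max keeps, per dimension, whichever direction produced the stronger activation, rather than concatenating both.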

Cap2Img
Let v ∈ R^I be the latent representation of an image (e.g. the final layer of a ResNet). To ground captions in the images that they describe, we map h_T into the latent space of image representations such that their similarity is maximized. In other words, we aim to predict the latent features of an image from its caption. The mapping of caption to image space is performed via a series of projections

v̂ = P_L ψ(P_{L−1} ψ(· · · P_1 h_T)),

where ψ denotes a non-linearity such as a ReLU or tanh.
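The projection stack can be sketched as follows; the layer sizes and random weights are illustrative only (the paper learns them jointly with the BiLSTM), and ELU is used for ψ, following the implementation details:

```python
import numpy as np

def elu(z):
    # ELU non-linearity (the choice of psi reported in the implementation details)
    return np.where(z > 0, z, np.exp(z) - 1.0)

def project_to_image_space(h, layers):
    """Apply a series of projections P_1 ... P_L with a non-linearity
    between them, mapping a caption encoding toward image-feature space."""
    for P in layers[:-1]:
        h = elu(P @ h)
    return layers[-1] @ h  # final layer kept linear

rng = np.random.default_rng(0)
layers = [rng.normal(size=(512, 1024)) * 0.01,
          rng.normal(size=(2048, 512)) * 0.01]
v_hat = project_to_image_space(rng.normal(size=(1024,)), layers)
assert v_hat.shape == (2048,)   # same dimensionality as the image features
```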
By jointly training the BiLSTM with these latent projections, we can ground the language model in its visual counterpart. In particular, let Θ = Θ_BiLSTM ∪ {P_ℓ}_{ℓ=1}^L be the parameters of the BiLSTM as well as the projection layers. We then minimize the following ranking loss:

L(Θ) = Σ_{v′ ∈ N_a} [α − sim(v, v̂) + sim(v′, v̂)]_+ ,

where [x]_+ = max(0, x) denotes the threshold function at zero and α is a margin. Furthermore, N_a denotes the set of negative samples for an image or caption and sim(·, ·) denotes a similarity measure between vectors.
In the following, we employ the cosine similarity, i.e., sim(u, v) = u · v / (‖u‖ ‖v‖). Although this loss is not smooth at zero, it can be trained end-to-end using subgradient methods.
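A sketch of this ranking objective in NumPy, using cosine similarity and a hypothetical margin value; negatives are passed in explicitly here, whereas a real implementation would typically sample them from the batch:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def ranking_loss(v, v_hat, neg_images, alpha=0.2):
    """Sum of hinge terms [alpha - sim(v, v_hat) + sim(v', v_hat)]_+ :
    the true image v should outscore each negative v' by the margin."""
    pos = cosine(v, v_hat)
    return sum(max(0.0, alpha - pos + cosine(v_neg, v_hat))
               for v_neg in neg_images)

v = np.array([1.0, 0.0])
v_hat = np.array([1.0, 0.0])          # a perfect prediction
neg = [np.array([-1.0, 0.0])]         # a very dissimilar negative
assert ranking_loss(v, v_hat, neg) == 0.0   # margin comfortably satisfied
```

When a negative scores as high as the true image, each hinge term contributes exactly the margin, which is what drives the subgradient updates.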
Compared to e.g. an ℓ2 regression loss, Equation (1) is less susceptible to error incurred by subspaces of the visual representation that are irrelevant to the high-level visual semantics. Empirically, we found it to be more robust to overfitting.

Cap2Cap
Let x = (x_1, . . . , x_T) and y = (y_1, . . . , y_S) be a caption pair that describes the same image. To learn weakly grounded representations, we employ a standard sequence-to-sequence model (Sutskever et al., 2014), whose task is to predict y from x. As in the Cap2Img model, let h_T be the representation of the input sentence after it has been processed with a BiLSTM. We then model the probability of y given x as

P(y | x) = ∏_{s=1}^S P(y_s | y_1, . . . , y_{s−1}, x).

To model the conditional probability of y_s we use the usual multiclass classification approach over the vocabulary of the corpus V, such that

P(y_s = w | y_1, . . . , y_{s−1}, x) = exp(ŷ_{s,w}) / Σ_{w′ ∈ V} exp(ŷ_{s,w′}).

Here, ŷ_s = ψ(W_V g_s + b) and g_s is the hidden state of the decoder LSTM at time s.
To learn the model parameters, we minimize the negative log-likelihood over all caption pairs, i.e.,

L(Θ) = − Σ_{(x,y) ∈ D} log P(y | x).
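The decoder objective can be sketched as a per-step softmax cross-entropy; the logit rows below are arbitrary stand-ins for ψ(W_V g_s + b):

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def caption_nll(step_logits, target_ids):
    """Negative log-likelihood of a target caption: -sum_s log P(y_s | y_<s, x),
    with one logit vector over the vocabulary V per decoding step."""
    return -sum(float(np.log(softmax(z)[t]))
                for z, t in zip(step_logits, target_ids))

V, S = 10, 3                       # toy vocabulary size and caption length
logits = np.zeros((S, V))          # uniform predictions at every step
nll = caption_nll(logits, [4, 2, 7])
assert abs(nll - S * np.log(V)) < 1e-9   # uniform model: S * log|V|
```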

Cap2Both
Finally, we also integrate both concepts of grounding into a joint model, where we optimize the sum of the two objectives:

L_Cap2Both(Θ) = L_Cap2Img(Θ) + L_Cap2Cap(Θ).

GroundSent
On their own, features from this system are likely to suffer from the fact that training on COCO introduces biases: aside from the inherent dataset bias in COCO itself, the system will only have coverage for concrete concepts. COCO is also a much smaller dataset than e.g. the Toronto Books Corpus used in purely text-based methods (Kiros et al., 2015). As such, the grounded representations are potentially less "universal" than alternatives. Hence, we complement our systems' representations with more abstract universal sentence representations trained on language-only data. We do so using concatenation, i.e., r_gs = r_cap || r_ling. Concatenation has proven to be a strong and straightforward mid-level multi-modal fusion method, previously explored in multi-modal semantics for word representations (Bruni et al., 2014; Kiela & Bottou, 2014). Although it would be interesting to examine multi-task scenarios where these representations are jointly learned, we leave this for future work. We call the combined system GroundSent (GS), and distinguish between sentences perceptually grounded in images (GroundSent-Img), weakly grounded in captions (GroundSent-Cap) or grounded in both (GroundSent-Both).
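The fusion step itself is just vector concatenation; a sketch with hypothetical dimensionalities (2048-d text-only features, 1024-d grounded caption encoding):

```python
import numpy as np

r_ling = np.random.default_rng(0).normal(size=2048)  # e.g. Skip-Thought-style features
r_cap = np.random.default_rng(1).normal(size=1024)   # grounded caption encoding

# Mid-level fusion: r_gs = r_cap || r_ling
r_gs = np.concatenate([r_cap, r_ling])
assert r_gs.shape == (3072,)
```

A downstream classifier then sees both the abstract text-only signal and the grounded signal in a single feature vector.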

Implementation details
We use GloVe (Pennington et al., 2014) embeddings for the initial word representations and optimize using Adam (Kingma & Ba, 2015). We use ELU (Clevert et al., 2016) for the non-linearity in the projection layers, set dropout to 0.5 and use a dimensionality of 1024 for the LSTM. The network was initialized with orthogonal matrices for the recurrent layers (Saxe et al., 2014) and He initialization (He et al., 2015) for all other layers. The learning rate and margin were tuned on the validation set using grid search.

Datasets
We use the same COCO splits as Karpathy & Fei-Fei (2015) for training (113,287 images), validation (5,000 images) and test (5,000 images). Image features for COCO were obtained by transferring the final layer of a pre-trained ResNet.

Table 1: COCO5K caption and image retrieval results (R@1, R@5, R@10, median rank).

We compare against Skip-Thought vectors (Kiros et al., 2015) and layer-normalized Skip-Thought vectors (Ba et al., 2016). In the latter case, we additionally evaluate that system on the exact same evaluations, with identical hyperparameters and settings, to ensure a fair comparison (results from the paper are also reported).

Semantic classification
A standard set of benchmarks has been established in the research community when it comes to evaluating sentence representations.We evaluate on the following well-known and widely used evaluations: movie review sentiment (MR; Pang & Lee, 2005), product reviews (CR; Hu & Liu, 2004), subjectivity classification (SUBJ; Pang & Lee, 2004), opinion polarity (MPQA; Wiebe et al., 2005), paraphrase identification (MSRP; Dolan et al., 2004) and sentiment classification (SST; Socher et al., 2013, using the binary version).

Entailment
Recent years have seen an increased interest in entailment classification as an appropriate evaluation of sentence representations.We evaluate representations on two well-known entailment, or natural language inference, datasets: the large-scale SNLI dataset (Bowman et al., 2015) and the SICK dataset (Marelli et al., 2014).

Implementation
We implement a simple logistic regression on top of the transferred features.In the cases of SNLI and SICK, as is the standard with these datasets, we use u, v, u * v, |u − v| as features.We tune the seed and an l 2 penalty on the validation sets for each, and train using Adam (Kingma & Ba, 2015), with an initial learning rate of 1e-3 and a mini-batch size of 32.
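The SNLI/SICK feature construction for a premise encoding u and hypothesis encoding v can be sketched directly:

```python
import numpy as np

def pair_features(u, v):
    """Standard premise/hypothesis features: [u; v; u*v; |u - v|]."""
    return np.concatenate([u, v, u * v, np.abs(u - v)])

u = np.array([1.0, -2.0])
v = np.array([0.5, 1.0])
feats = pair_features(u, v)
# concatenation, element-wise product, then absolute difference
assert feats.tolist() == [1.0, -2.0, 0.5, 1.0, 0.5, -2.0, 0.5, 3.0]
```

The product and absolute-difference terms give the linear classifier direct access to symmetric interactions between the two sentence vectors.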

Results
We first examine the capability of our system to perform grounding, i.e., its ability to correctly predict image features from captions, and captions from images. With the quality of grounding established, we then show that grounding high-quality text-only universal sentence representations leads to improved performance.

Table 3: Results on semantic classification transfer tasks.

Grounding
Table 1 shows the results on the COCO5K caption and image retrieval tasks for the two models that predict image features, compared to various other methods. As the results show, Cap2Img performs very well on this task, outperforming the compared models on caption retrieval and coming very close to order embeddings on image retrieval.[1] The fact that the system outperforms order embeddings on caption retrieval suggests that it has a better sentence encoder. Cap2Both does not work as well as the image-only case, probably because interference from the language signal makes the problem harder to optimize.

Transfer
Having established that we can learn high-quality grounded sentence encodings, we now examine how well grounding transfers. In this section, we combine our grounded features with the state-of-the-art universal sentence representations of Ba et al. (2016), as described in Section 3.5. That is, we concatenate Cap2Img, Cap2Cap or Cap2Both representations with Skip-Thought with Layer Normalization (ST-LN) representations, and examine their performance. Since we found slightly different results when we used ST-LN features in our own evaluation pipeline,[2] we report both the results from the paper and the results we achieved.
Table 2 shows the results for the semantic classification tasks. Note that the four final rows use the exact same evaluation pipeline, which makes them more directly comparable. We can see that in all cases, grounding increases performance. The question of which type of grounding works best is more difficult: generally, grounding with Cap2Cap and Cap2Both appears to do slightly better on most tasks, but on e.g. SST, Cap2Img works better.
[1] In fact, we found that we can achieve even better performance on this task by reducing the dimensionality of the encoder. A lower dimensionality in the encoder also reduces the transferability of the features, unfortunately, so we leave a more thorough investigation of this phenomenon for future work.
[2] This is probably due to different seeds, optimization or other minor implementational details.

The entailment task results in Table 3 show a similar picture: in all cases grounding helps. We include a comparison to two feature-based models from Bowman et al. (2015): in the case of SNLI, it appears that unigram and bigram features work better, so there is still some way to go on SNLI; however, in the case of SICK, using grounded distributed representations works much better.

Discussion
There are a few other important questions to investigate with respect to the results above.First and foremost, it is an open question whether the increased performance is due to qualitatively different information from grounding, or simply due to the fact that we have more parameters.Second, the abstractness or concreteness of the evaluation datasets may have a large impact on performance.We examine these two questions here, and also show that the projected word embeddings used in our model are better than non-grounded ones.

The contribution of grounding
We implement a SkipThought-like model that also uses a bidirectional LSTM with element-wise max on the final hidden layer (STb). This model is architecturally identical to Cap2Cap, except that its objective is not to predict an alternative caption, but to predict the previous and next sentence in the Toronto Books Corpus, just like SkipThought (Kiros et al., 2015). We train a 1024-dimensional and a 2048-dimensional STb model (for one epoch, with all other hyperparameters identical to Cap2Cap) to compare against: if grounding improves results because it introduces qualitatively different information, rather than just from having more parameters (i.e., a higher embedding dimensionality), we should expect the multi-modal 2048-dimensional GS(STb-1024) model to perform better not only than STb-1024, but also than STb-2048, which has the same number of parameters. In addition, we compare against an "ensemble" of two different STb-1024 models (i.e., a concatenation of two separately trained STb-1024 models), to ensure that we are not (just) observing an ensemble effect.
As Table 4 shows, a more nuanced picture emerges in this comparison: grounding helps more for some datasets than for others. Grounded models outperform the STb-1024 model (which uses much more data) in all cases, often even without the concatenation. The ensemble of two STb-1024 models performs better than the individual one, and so does the higher-dimensional one. In the cases of CR and MSRP (F1), it appears that improved performance is due to having more data or to ensemble effects. These are small datasets, however, so it is hard to be definitive. For the other datasets, grounding clearly yields better results.

Concreteness
As we have seen, performance across datasets and models can vary substantially. A dataset's concreteness is an important factor in the relative merit of applying grounding: a dataset consisting mostly of abstract words is less likely to benefit from grounding than one that uses mostly concrete words, for which it is easy (or easier) to predict image features. In order to examine this effect, we calculate the average concreteness of the evaluation datasets used in this study. For all words in each dataset, we obtain (where available) a human-annotated concreteness rating, and average the ratings. The concreteness ratings were obtained by Brysbaert et al. (2014) in a large-scale study that yielded ratings for 40,000 English words. The averaged concreteness scores of each dataset can be found in Table 5.
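The dataset-concreteness computation can be sketched as a simple average over rated tokens; the ratings dict below is a tiny made-up stand-in for the Brysbaert et al. (2014) norms:

```python
# Hypothetical concreteness norms on a 1-5 scale (illustrative values only).
ratings = {"dog": 4.85, "run": 4.0, "democracy": 1.78}

def avg_concreteness(tokens, norms):
    """Average the human concreteness rating of every rated word,
    silently skipping words without a rating."""
    rated = [norms[t] for t in tokens if t in norms]
    return sum(rated) / len(rated) if rated else float("nan")

tokens = "the dog can run in a democracy".split()
score = avg_concreteness(tokens, ratings)
assert abs(score - (4.85 + 4.0 + 1.78) / 3) < 1e-9
```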
We observe that the two entailment datasets are more concrete, which is due to the fact that the premises are derived from caption datasets (Flickr30K in the case of SNLI; Flickr8K and video captions in the case of SICK). This explains why grounding can clearly be seen to help in these cases.
For the semantic classification tasks, the more concrete datasets are MSRP and SST. The picture is less clear for the former, but on SST we see that the grounded representations clearly work better.
Interestingly, these concreteness values make performance easier to analyze, but they are apparently not direct indicators of representational quality.

Grounded word embeddings
Our models contain a projection layer that maps the GloVe word embeddings they receive as input to a different embedding space. There has been a lot of interest in grounded word representations in recent years, so it is interesting to examine what kind of word representations our models learn. We omit Cap2Cap for reasons of space (it performs similarly to Cap2Both). As shown in Table 6, the grounded word projections that our network learns yield higher-quality word embeddings on four standard lexical semantic similarity benchmarks: MEN (Bruni et al., 2014), SimLex-999 (Hill et al., 2016b), Rare Words (Luong et al., 2013) and WordSim-353 (Finkelstein et al., 2001).

Conclusion
We have investigated grounding for universal sentence representations. We achieved good performance on caption and image retrieval tasks on the large-scale COCO dataset. We subsequently showed how the sentence encodings that the system learns can be transferred to various NLP tasks, and that grounded universal sentence representations lead to improved performance. In the discussion, we analyzed the source of the improvements that grounding yielded, and showed that the increased performance appears to be due to the introduction of qualitatively different information (i.e., grounding), rather than simply to having more parameters or applying ensemble methods. Lastly, we showed that our systems learn high-quality grounded word embeddings that outperform non-grounded ones on standard semantic similarity benchmarks.
In future work, we aim to investigate full multi-task or joint-learning, where we train on a large-scale text corpus while grounding concepts when considered useful.It would also be interesting to further examine the relative contributions of individual aspects of the approaches outlined in this work, in particular the type of sentence encoder and the loss function.

Table 4: Comparison of grounded and ungrounded models on the transfer datasets, with equal numbers of components, illustrating the contribution of grounding. STb = SkipThought-like model with bidirectional LSTM+max. 2×STb-1024 = ensemble of 2 different STb models. GS = GroundSent. These results indicate that grounding captures qualitatively different information, yielding better universal sentence representations.

Table 5: Mean and variance of dataset concreteness.

Table 6: Spearman ρ correlation on four standard semantic similarity evaluation benchmarks.