Incorporating Visual Semantics into Sentence Representations within a Grounded Space

Language grounding is an active field aiming at enriching textual representations with visual information. Generally, textual and visual elements are embedded in the same representation space, which implicitly assumes a one-to-one correspondence between modalities. This hypothesis does not hold when representing words, and becomes problematic when used to learn sentence representations — the focus of this paper — as a visual scene can be described by a wide variety of sentences. To overcome this limitation, we propose to transfer visual information to textual representations by learning an intermediate representation space: the grounded space. We further propose two new complementary objectives ensuring that (1) sentences associated with the same visual content are close in the grounded space and (2) similarities between related elements are preserved across modalities. We show that this model outperforms the previous state-of-the-art on classification and semantic relatedness tasks.


Introduction
Representing text by vectors that capture meaningful semantics is a long-standing issue in Artificial Intelligence. Distributional Semantic Models (Mikolov et al., 2013; Peters et al., 2018) are well-known recent efforts in this direction, based on the distributional hypothesis (Harris, 1954). They rely on large text corpora to learn word embeddings. At another granularity level, having high-quality and general-purpose sentence representations is crucial for all models that encode sentences into semantic vectors, such as the ones used in machine translation (Bahdanau et al., 2014) or relation extraction (Wang et al., 2019). Moreover, encoding the semantics of sentences is paramount because sentences describe relationships between objects, and thus convey complex and high-level knowledge better than individual words (Norman, 1972).

* Equal contribution.
Relying only on text can lead to biased representations and unrealistic predictions such as "the sky is green" (Baroni, 2016). Besides, it has been shown that human understanding of language is grounded in physical reality and perceptual experience (Fincher-Kiefer, 2001). To overcome this limitation, an emerging approach is to ground language in the visual world*: this consists in leveraging visual information, usually from images, to enrich textual representations. Leveraging images has resulted in improved linguistic representations on intrinsic and downstream tasks (Bruni et al., 2014; Silberer and Lapata, 2014). In most of these approaches, cross-modal projections are learned to incorporate visual semantics in the final representations (Lazaridou et al., 2015b; Collell et al., 2017; Kiela et al., 2018). These works rely on paired textual and visual data, and the hypothesis of a one-to-one correspondence between modalities is implicitly assumed: an image of an object univocally represents a word. However, there is no obvious reason why the structure of the two spaces should match. Indeed, Collell and Moens (2018) empirically show that the cross-modal projection of a source modality does not resemble the target modality in terms of neighborhood structure. This is especially the case for sentences, where many different sentences can describe the same image. Therefore, we argue that learning grounded representations through projections to a visual space is particularly inadequate in the case of sentences.
To overcome this issue, we propose an alternative approach where the structure of the visual space is partially transferred to the textual space. This is done by distinguishing two types of complementary information sources. First, the cluster information: the implicit knowledge that sentences associated with the same image refer to the same underlying reality. Second, the perceptual information, which is contained within high-level representations of images. These two sources of information aim at transferring the structure of the visual space to the textual space. Besides, to preserve textual semantics and to avoid an over-constrained textual space, we propose to incorporate visual information into textual representations using an intermediate representation space that we call the grounded space, on which the cluster and perceptual objectives are trained.

* In the Computer Vision community, grounding can also refer to the task of linking phrases with image regions (Xiao et al., 2017), but this is not the focus of the present paper.

arXiv:2002.02734v1 [cs.CL] 7 Feb 2020
Our contributions are the following: (1) we define two complementary objectives to ground the textual space, based on implicit and explicit visual information; (2) we propose to incorporate visual semantics through an intermediate space, within which these objectives are learned. Moreover, (3) we perform quantitative and qualitative evaluations on several transfer tasks, showing the advantages of our approach with respect to previous grounding methods.

Related work
In recent years, several approaches have been proposed to learn semantic representations for sentences. These include supervised and task-specific techniques with recursive networks (Socher et al., 2013), convolutional networks (Kalchbrenner et al., 2014), or self-attentive networks (Lin et al., 2017; Conneau et al., 2017), but also unsupervised methods producing universal representations from large text corpora. Examples of the latter include models such as FastSent (Hill et al., 2016), QuickThought (Logeswaran and Lee, 2018), Word Information Series (Arroyo-Fernández et al., 2019), Universal Sentence Encoder (Cer et al., 2018), and SkipThought (Kiros et al., 2015), where a sentence is encoded with a Gated Recurrent Unit (GRU), and two GRU decoders are trained to reconstruct the adjacent sentences in a dataset of ordered sentences.
To model the way language conveys meaning, traditional approaches consider language as a purely symbolic system based on words and syntactic rules (Chomsky, 1980; Burgess and Lund, 1997). However, W. Barsalou (1999) and Fincher-Kiefer (2001) insist on the intuition that language has to be grounded in physical reality and perceptual experience. The importance of language grounding is underlined by Gordon and Van Durme (2013), who report an important bias: the frequency at which objects, relations, or events occur in natural language differs significantly from their real-world frequency (e.g., in texts, people are murdered four times more often than they breathe). Thus, leveraging visual resources, in addition to textual resources, is a promising way to acquire commonsense knowledge (Lin and Parikh, 2015; Yatskar et al., 2016) and to cope with the bias between text and reality. This intuition has motivated several works on learning visually grounded representations for words, using images or abstract scenes (Kottur et al., 2016). Two lines of work can be distinguished. First, sequential techniques combine textual and visual representations that were learned separately (Bruni et al., 2014; Silberer and Lapata, 2014; Collell et al., 2017). Second, joint methods learn a multimodal representation from multiple sources simultaneously. The advantage is that visual information associated with concrete words can be transferred to more abstract ones, which usually have no associated visual data (Hill et al., 2014; Lazaridou et al., 2015b). Closer to our contribution, some approaches learn grounded word embeddings by building upon the skip-gram objective (Mikolov et al., 2013) and enforcing word vectors to be close to their corresponding visual features (Lazaridou et al., 2015b) or their visual context (Zablocki et al., 2018) in a multimodal representation space. These approaches learn word representations, while we specifically target sentences. This task is more challenging since sentences are inherently different from words due to their sequential and compositional nature. Moreover, a great number of different sentences can be generated for the same image.
Finally, some works learn sentence representations by aligning visual data with sentences within captioning datasets. Chrupala et al. (2015) propose the IMAGINET model, in which two sentence encoders share word embeddings: a first GRU encoder learns a language model objective, and the other is trained to predict the visual features associated with a sentence. The model of Kiela et al. (2018) is close to IMAGINET and additionally hypothesizes that associated captions ground the meaning of a sentence. Both of these works learn a projection of the sentence representation to the corresponding image, which we argue is problematic, as it over-constrains the textual space and can degrade textual representations. Indeed, Collell and Moens (2018) empirically demonstrated that when a cross-modal mapping is learned, the projection of the source modality does not resemble the target modality, in the sense of nearest neighbor comparison. This suggests that cross-modal projections are not appropriate for incorporating visual semantics into text representations.
3 Incorporating visual semantics within an intermediate grounded space

Model overview
In this work, we aim at learning grounded representations by jointly leveraging the textual and visual contexts of a sentence. We note S a sentence and s = F_t(S; θ_t) its representation computed with a sentence encoder F_t parametrized by θ_t. We follow the classical approach developed in the language grounding literature at the word level (Lazaridou et al., 2015b; Zablocki et al., 2018), which balances a textual objective L_T with an additional grounding objective L_G:

L(θ_t, θ_i) = L_T(θ_t) + L_G(θ_t, θ_i)    (1)

The parameters θ_t of the sentence encoder F_t are shared between L_T and L_G, and therefore benefit from both the textual and grounding objectives. θ_i denotes extra grounding parameters, including the weights of the image encoder F_i. Note that any textual objective L_T and sentence encoder F_t can be used. In our experiments, we choose the well-known SkipThought model (Kiros et al., 2015), trained on a corpus of ordered sentences.
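As a minimal illustration of this shared-parameter setup, the overall objective can be sketched as follows; the scalar "parameters" and quadratic losses are hypothetical stand-ins for illustration only, not the paper's actual objectives:

```python
def joint_objective(theta_t, theta_i, textual_loss, grounding_loss):
    """Overall loss L(theta_t, theta_i) = L_T(theta_t) + L_G(theta_t, theta_i).
    Both terms depend on theta_t, so the sentence encoder F_t receives
    gradients from the textual and the grounding objectives alike."""
    return textual_loss(theta_t) + grounding_loss(theta_t, theta_i)

# Toy quadratic stand-ins for L_T and L_G (illustration only).
L_T = lambda t: (t - 1.0) ** 2
L_G = lambda t, i: (t - 3.0) ** 2 + i ** 2
print(joint_objective(2.0, 0.5, L_T, L_G))  # 1.0 + 1.0 + 0.25 = 2.25
```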
In what follows, we focus on the modeling of the grounding objective L_G, learned on a captioning corpus where each image is associated with several captions. Grounding approaches generally leverage visual information by embedding textual and visual elements within the same multimodal space (Silberer and Lapata, 2014; Kiela et al., 2018). However, this is not satisfying, since texts and images are forced into one-to-one correspondence. Moreover, a caption can (1) have a wide variety of paraphrases and related sentences describing the same scene (e.g., the kitten is devouring a mouse versus a cat eating a mouse), (2) be visually ambiguous (e.g., a cat is eating can be associated with many different images, depending on the visual scene/context), or (3) carry non-visual information (e.g., cats often think about their meals). Usual grounding objectives, which embed sentences in the visual space, can discard non-visual information (3) through the projection function. They can handle (1) by projecting related sentences to the same location in the visual space. However, they are over-sensitive to visual ambiguity (2), because ambiguous sentences should be projected to different locations of the visual space, which is not possible with current grounding models.
To overcome this lack of flexibility, we propose the following approach, illustrated in Figure 1. To cope with (1), sentences associated with the same image should be close; we call this cluster information. To cope with (2), we want to avoid projecting sentences to a particular point of the visual space: instead, we require that the similarity between two images in the visual space (which is linked to the context discrepancy) should be close to the similarity between their associated sentences in the textual space. We call this perceptual information. Finally, as we want to preserve non-visual information in sentence representations (3), we make use of an intermediate space, called the grounded space, that allows textual representations to benefit from visual properties without degrading the semantics brought by the textual objective L_T.

Grounded space and objectives
In this section, we introduce more formally the grounded space and the different information sources (cluster and perceptual) captured in the grounding loss L_G.

Grounded space
The grounded space relaxes the assumption that textual and visual representations should be guided by one-to-one correspondences. It rather assumes that the structure of the textual space might be partially modeled on the structure of the visual space. Thus, instead of directly applying the grounding objectives to a sentence embedding s, we propose to train the grounding objective L_G on an intermediate space called the grounded space. Practically, we use a projection g(s; θ_g) of a sentence s from the textual space to the grounded space. We denote it g(s) for simplicity, where g is a multi-layer perceptron with input s = F_t(S; θ_t) and parameters θ_g.

Cluster information (C_g). The cluster information leverages the fact that two sentences describe, or not, the same underlying reality. In other words, the goal is to measure whether two sentences are visually equivalent (assumption (1) in Section 3.1) without considering the content of related images. For convenience, two sentences are said to be visually equivalent (resp. visually different) if they are associated with the same image (resp. different images), i.e., if they describe the same (resp. different) underlying reality. We call a set of visually equivalent sentences a cluster. For instance, in Figure 1, the sentences The tenniswoman starts on her serve and The woman plays tennis are visually equivalent and belong to the same cluster.
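The grounded-space projection g described above can be sketched with numpy; the paper only states that g is a multi-layer perceptron, so the hidden size and tanh activation below are assumptions for illustration:

```python
import numpy as np

def grounding_mlp(s, W1, b1, W2, b2):
    """Project a sentence embedding s from the textual space into the
    grounded space via a 2-layer MLP g (tanh hidden layer, linear output)."""
    h = np.tanh(s @ W1 + b1)
    return h @ W2 + b2

# Toy dimensions: textual space d_t = 8, grounded space d_g = 4.
rng = np.random.default_rng(0)
d_t, d_h, d_g = 8, 6, 4
W1, b1 = rng.normal(size=(d_t, d_h)), np.zeros(d_h)
W2, b2 = rng.normal(size=(d_h, d_g)), np.zeros(d_g)

s = rng.normal(size=d_t)          # a sentence embedding from F_t (toy)
g_s = grounding_mlp(s, W1, b1, W2, b2)
print(g_s.shape)  # (4,)
```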
Our hypothesis is that the similarity between visually equivalent sentences (s, s+) should be higher than that between visually different sentences (s, s−). We translate this hypothesis into the following constraint in the grounded space: cos(g(s), g(s+)) ≥ cos(g(s), g(s−)). Following Karpathy and Li (2015) and Carvalho et al. (2018), we use a max-margin ranking loss to ensure that the gap between both terms is higher than a fixed margin γ (cf. red elements in Figure 1), resulting in the cluster loss L_C:

L_C = E[max(0, γ − cos(g(s), g(s+)) + cos(g(s), g(s−)))]    (2)

where s+ (resp. s−) is a randomly sampled sentence that is visually equivalent to (resp. different from) s. This loss function is also used in the cross-modal retrieval literature to enforce structure-preserving constraints between sentences describing a same image (Wang et al., 2016).
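A minimal numpy sketch of this max-margin cluster loss for a single triplet (the toy 2-d vectors are purely illustrative):

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cluster_loss(g_s, g_pos, g_neg, gamma=0.5):
    """Max-margin ranking loss: a visually equivalent sentence g_pos should be
    closer to g_s (in cosine) than a visually different one g_neg, by margin gamma."""
    return max(0.0, gamma - cos(g_s, g_pos) + cos(g_s, g_neg))

anchor = np.array([1.0, 0.0])
same   = np.array([0.9, 0.1])    # caption of the same image (toy)
other  = np.array([-1.0, 0.2])   # caption of a different image (toy)
print(round(cluster_loss(anchor, same, other), 3))  # 0.0, margin satisfied
```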
Perceptual information (P_g). The cluster hypothesis alone ignores the structure of the visual space and only uses the visual modality as a proxy to assess whether two sentences are visually equivalent or different. Moreover, the ranking loss L_C simply drives visually different sentences apart in the representation space, which can be a problem when two images have closely related content. For instance, the baseball and tennis images in Figure 1 may be different, but they are both sports images, and thus their corresponding sentences should be somewhat close in the grounded space. Finally, it supposes that we have a dataset of images associated with several captions.
To cope with these limitations, we consider the structure of the visual space and use the content of images. The intuition is that the structure of the textual space should be modeled on the structure of the visual one to extract visual semantics. We choose to preserve similarities between related elements across spaces (cf. green elements in Figure 1). We thus assume that the similarity between two sentences in the grounded space should be correlated with the similarity between their corresponding images in the visual space. We translate this hypothesis into the perceptual loss L_P:

L_P = −ρ(cos(g(s), g(s')), cos(v, v'))    (3)

where ρ is the Pearson correlation, and cos(g(s), g(s')) and cos(v, v') are respectively the textual and visual similarities, computed over several randomly sampled pairs of matching sentences and images.
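A numpy sketch of the perceptual loss over a batch of sampled pairs; the batch shapes and sampling are assumptions, and the sanity check simply reuses the sentence matrices as "visual" features so the two spaces are perfectly aligned:

```python
import numpy as np

def cosine_rows(A, B):
    """Row-wise cosine similarity between two matrices of embeddings."""
    An = A / np.linalg.norm(A, axis=1, keepdims=True)
    Bn = B / np.linalg.norm(B, axis=1, keepdims=True)
    return np.sum(An * Bn, axis=1)

def perceptual_loss(GS1, GS2, V1, V2):
    """Negative Pearson correlation between grounded-space similarities
    cos(g(s), g(s')) and visual similarities cos(v, v') over sampled pairs."""
    t_sim = cosine_rows(GS1, GS2)   # textual similarities
    v_sim = cosine_rows(V1, V2)     # visual similarities
    return -float(np.corrcoef(t_sim, v_sim)[0, 1])

rng = np.random.default_rng(0)
GS1, GS2 = rng.normal(size=(20, 8)), rng.normal(size=(20, 8))
# When the two spaces are perfectly aligned, the loss reaches its minimum of -1.
print(round(perceptual_loss(GS1, GS2, GS1.copy(), GS2.copy()), 3))  # -1.0
```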
Grounded loss. Taken altogether, the grounded space and the cluster/perceptual information lead to the grounding objective L_G(θ_t, θ_i), a linear combination of the aforementioned objectives:

L_G(θ_t, θ_i) = α_C L_C + α_P L_P    (4)

where α_C and α_P are hyper-parameters weighting the contributions of L_C and L_P. θ_i corresponds to all the grounding-related parameters, i.e., those of the image encoder F_i and of the projection function g (i.e., θ_g).
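The linear combination itself is a one-liner; the default weights below follow the hyper-parameters reported in the implementation details (α_C = α_P = 0.01):

```python
def grounding_loss(l_c, l_p, alpha_c=0.01, alpha_p=0.01):
    """L_G as a weighted sum of the cluster loss l_c and perceptual loss l_p."""
    return alpha_c * l_c + alpha_p * l_p

# With toy loss values l_c = 2.0 and l_p = 3.0:
print(grounding_loss(2.0, 3.0))
```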
4 Evaluation protocol

4.1 Datasets

Textual dataset. Following Kiros et al. (2015) and Hill et al. (2016), we use the Toronto BookCorpus dataset as the textual corpus. This corpus consists of 11K books and 74M ordered sentences, with an average of 13 words per sentence.
Visual dataset. We use the MS COCO dataset (Lin et al., 2014) as the visual corpus. This image captioning dataset consists of 118K/5K/41K (train/val/test) images, each with five English descriptions. Note that the number of sentences in the training set of COCO (590K sentences) represents only 0.8% of the sentence data in BookCorpus, which is negligible; the additional textual training data cannot account for performance discrepancies between textual and grounded models.

Baselines and Scenarios
In the experiments, we focus on one of the most established sentence models, SkipThought (noted T), as the textual baseline: the parameters of the sentence embedding model are obtained by minimizing L_T. Then, we derive several baselines and scenarios based on T, each representing a different approach to grounding. Since our focus is to study the impact of grounding on sentence representations, all baselines and scenarios share the same representation dimension d_t = 2048 and are trained on the same datasets (cf. sect. 4.1). We also report a textual model of dimension d_t/2, which we call T_1024, to compare with the GroundSent model of Kiela et al. (2018).
Model scenarios. We test variants of our grounding model presented in sect. 3, all based on T: T + C_g, T + P_g, and T + C_g + P_g, where C_g (resp. P_g) represents the loss L_C (resp. L_P). We also consider scenarios where g equals the identity function (no grounded space), which we note C_id, P_id, C_id + P_id, etc. Finally, we also performed preliminary analyses learning only from the visual modality: C_g/id, P_g/id, C_g/id + P_g/id.
Baselines. We adapt two classical multimodal word embedding models to sentences. Accordingly, models from the two existing model families are considered.

Cross-modal Projection (CM): Inspired by Lazaridou et al. (2015b), this baseline learns to project sentences into the visual space using a max-margin loss:

L_CM = E[max(0, γ' − cos(f(s), i) + cos(f(s), i−))]

where f is an MLP, γ' a fixed margin, and i− a non-matching image. Similarly to our scenarios, the sentence encoder is initialized with T.
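This baseline's loss can be sketched for a single triplet; the toy 2-d feature vectors are purely illustrative, and the standard max-margin form below is assumed from the description (the source does not spell the equation out):

```python
import numpy as np

def cos(a, b):
    """Cosine similarity between two vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def cm_loss(f_s, i_pos, i_neg, gamma=0.5):
    """Cross-modal max-margin loss: the projected sentence f(s) should be closer
    (in cosine) to its matching image i_pos than to a non-matching image i_neg."""
    return max(0.0, gamma - cos(f_s, i_pos) + cos(f_s, i_neg))

f_s   = np.array([1.0, 0.0])   # projected sentence f(s) (toy)
i_pos = np.array([0.8, 0.2])   # matching image feature (toy)
i_neg = np.array([-0.9, 0.4])  # non-matching image feature (toy)
print(cm_loss(f_s, i_pos, i_neg))  # 0.0 on this toy example: margin satisfied
```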
Sequential (SEQ): Inspired by Collell et al. (2017), we learn a linear regression model (W, b) to predict the visual representation of an image from the representation of a matching caption. The grounded embedding is the concatenation of the original SkipThought vector T and its predicted ("imagined") representation W T + b, which is projected with a PCA into dimension d_t.
In both cases, the parameters to be learned, in addition to the sentence encoder, are the cross-modal projections, and the sentence representation is obtained by averaging word vectors.
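The SEQ pipeline (regression, concatenation, PCA) can be sketched end-to-end with numpy on random toy data; the dimensions and least-squares/SVD implementations are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_t, d_v, n = 16, 10, 50            # text dim, visual dim, #caption-image pairs
S = rng.normal(size=(n, d_t))       # caption embeddings (toy stand-in for T)
V = rng.normal(size=(n, d_v))       # image features (toy)

# 1) Linear regression (W, b): predict visual features from caption embeddings.
X = np.hstack([S, np.ones((n, 1))])            # append a bias column
Wb, *_ = np.linalg.lstsq(X, V, rcond=None)
imagined = X @ Wb                              # "imagined" visual vectors

# 2) Concatenate text and imagined vectors, then PCA back down to d_t dims.
Z = np.hstack([S, imagined])
Zc = Z - Z.mean(axis=0)                        # center before PCA
U, sing, Vt = np.linalg.svd(Zc, full_matrices=False)
grounded = Zc @ Vt[:d_t].T                     # projection onto top d_t components
print(grounded.shape)  # (50, 16)
```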

GroundSent Model
We re-implement the GroundSent models of Kiela et al. (2018), obtaining comparable results. The authors propose two objectives to learn a grounded vector: (a) Cap2Img, where the cross-modal projections of sentences are pushed towards their respective images via a max-margin ranking loss, and (b) Cap2Cap, where a visually equivalent sentence is predicted via an LSTM sentence decoder. The Cap2Both objective is a combination of these two objectives. Once the grounded vectors are learned, they are concatenated with a textual vector (learned via a SkipThought objective) to form the GS-Img, GS-Cap, and GS-Both vectors.

Evaluation tasks and metrics
In line with previous works (Kiros et al., 2015; Hill et al., 2016), we consider several benchmarks to evaluate the quality of our grounded embeddings.

Table 1: Intrinsic evaluations carried out on the grounded space for models with g = MLP; on the textual space for T, CM (text), and models with g = id; and on the visual space for CM (vis).

Semantic relatedness. We use two semantic similarity benchmarks, STS (Cer et al., 2017) and SICK (Marelli et al., 2014a), which consist of
pairs of sentences that are associated with human-labeled similarity scores. STS is subdivided into three textual sources: Captions contains concrete sentences describing daily-life actions, whereas the others contain more abstract sentences: news headlines in News and posts from user forums in Forum. Spearman correlations are measured between the cosine similarity of our learned sentence embeddings and the human-labeled scores.
Structural measures. To probe the learned grounded space, we define structural measures and report their values on the validation set of MS COCO (5K images, 25K captions). First, we report the mean Nearest Neighbor Overlap (mNNO) metric, as defined in Collell and Moens (2018), which indicates the proportion of shared nearest neighbors between image representations and their corresponding captions in their respective spaces.
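A minimal sketch of an mNNO-style measure (Euclidean distance is used here for brevity; the distance function and k = 1 are assumptions, not necessarily the original metric's exact configuration):

```python
import numpy as np

def mnno(X, Y, k=1):
    """Mean nearest-neighbor overlap between two aligned sets of
    representations X and Y (e.g. images and their captions): the average
    fraction of shared k-NN indices, computed within each space."""
    def knn(Z):
        d = np.linalg.norm(Z[:, None] - Z[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)          # exclude self-neighbors
        return np.argsort(d, axis=1)[:, :k]
    nx, ny = knn(X), knn(Y)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(nx, ny)]
    return float(np.mean(overlap))

X = np.array([[0.0], [0.1], [5.0], [5.1]])
print(mnno(X, X.copy()))  # 1.0: identical spaces share all nearest neighbors
```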
To study perceptual information, we define ρ_vis, the Pearson correlation ρ(cos(s, s'), cos(v_s, v_s')) between the similarities of images and those of their corresponding sentences. For cluster information, we introduce C_intra = E_{v_s = v_s'}[cos(s, s')], which measures the homogeneity of each cluster, and C_inter = E_{v_s ≠ v_s'}[cos(s, s')], which measures how well clusters are separated from each other.
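The two cluster measures can be sketched with numpy; the toy 2-d embeddings and cluster labels below are hypothetical:

```python
import numpy as np

def cos_matrix(X):
    """Pairwise cosine similarity matrix of row embeddings."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    return Xn @ Xn.T

def c_intra_inter(S, cluster_ids):
    """C_intra: mean cosine similarity between sentences of the same image
    (cluster); C_inter: mean cosine similarity between sentences of
    different images."""
    sim = cos_matrix(S)
    ids = np.asarray(cluster_ids)
    same = (ids[:, None] == ids[None, :]) & ~np.eye(len(ids), dtype=bool)
    diff = ids[:, None] != ids[None, :]
    return sim[same].mean(), sim[diff].mean()

# Two toy clusters of two sentences each.
S = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
c_intra, c_inter = c_intra_inter(S, [0, 0, 1, 1])
print(c_intra > c_inter)  # True: clusters are homogeneous and well separated
```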

Implementation details
Images are processed using a pretrained Inception-v3 network (Szegedy et al., 2016). The model is trained with ADAM (Kingma and Ba, 2014) and a learning rate l_r = 8·10^-4. As done in Kiros et al. (2015), our sentence encoder is a GRU with a vocabulary of 20K words, represented in dimension 620; we perform vocabulary expansion at inference. All hyperparameters are tuned using the Pearson correlation measure on the validation set of the SICK benchmark: γ = γ' = 0.5, α_C = α_P = 0.01, d_g = 512; the functions f and g are 2-layer MLPs. As done in Kiela et al. (2018), we set d_t = 2048.

Experiments and Results
Our main objective is to study the contribution brought by the visual modality to grounded sentence representations; we do not attempt to outperform purely textual sentence encoders from the literature. We show that textual models can benefit from grounding approaches without requiring any changes to the original textual objective L_T. We report quantitative and qualitative insights (sect. 5.1), and quantitative results on the SentEval benchmark (sect. 5.2).

Study of the grounded space
We study the impact of the various grounding hypotheses on the structure of the grounded space, using intrinsic measures. In Table 1, we report the structural measures and the semantic relatedness scores of the baselines, namely T and CM, and of the various scenarios of our model. The textual loss is discarded to isolate the effect of the different grounding hypotheses.
[Figure 2 shows a query caption with its nearest-neighbor sentences under the textual and grounded models, together with the query image and its nearest image; see the Figure 2 caption below.]

The impact of grounding. We investigate the effect of grounding on sentence representations. Results highlight that all grounded models improve over the baseline T. Moreover, our model C_g + P_g is generally the most effective regarding the mNNO measure and the semantic relatedness tasks.

Influence of concreteness
To understand in which cases grounding is useful, we compute the average visual concreteness c of the STS benchmark, which is divided into three categories (Captions, News, Forum). This is done using the concreteness dataset built by Brysbaert et al. (2013), which consists of human ratings of concreteness (between 0 and 5) for 40,000 English words; for a given benchmark, we average these scores over all words that appear in the concreteness dataset. The performance gain ∆ between C_g + P_g and T is largest when the visual concreteness c is high: for Captions (c = 3.10), the improvement is substantial (∆ = +43); for benchmarks with lower concreteness (News with c = 2.61 and Forum with c = 2.39), the improvement is smaller (∆ = +12). Thus, grounding brings useful complementary information, especially for concrete sentences.
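The benchmark-level concreteness computation can be sketched as follows; the word ratings here are hypothetical toy values standing in for the Brysbaert et al. (2013) norms, and the tokenization is deliberately naive:

```python
# Toy word -> concreteness ratings (hypothetical stand-ins for the real norms).
concreteness = {"dog": 4.9, "ball": 4.8, "idea": 1.6, "truth": 1.3}

def avg_concreteness(sentences, ratings):
    """Average the concreteness ratings over all benchmark words that
    appear in the ratings dictionary."""
    words = [w for s in sentences for w in s.lower().split() if w in ratings]
    return sum(ratings[w] for w in words) / len(words)

captions = ["A dog chases a ball", "The dog eats"]       # concrete sentences
forum = ["An idea about truth", "Truth is an idea"]      # abstract sentences
print(avg_concreteness(captions, concreteness) >
      avg_concreteness(forum, concreteness))  # True: captions are more concrete
```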

t-SNE visualization
This finding is also supported by a qualitative experiment showing that grounding groups together similar visual situations. Using sentences from CMPlaces (Castrejon et al., 2016), which describe visual scenes (e.g., coast, shoe-shop, plaza, etc.) and are classified into 205 scene categories, we randomly sample 5 visual scenes and plot the corresponding sentences in Figure 3 using t-SNE (Maaten and Hinton, 2008). We notice that our grounded model is better able than the text-only model to cluster sentences that have a close visual meaning. This is reinforced by the structural measures computed on the five clusters of Figure 3: C_inter = 19, C_intra = 22 for T; C_inter = 11, C_intra = 27 for C_g + P_g. Indeed, C_inter (resp. C_intra) is lower (resp. higher) for the grounded model C_g + P_g than for T, which shows that clusters corresponding to different scenes are more clearly separated (resp. sentences corresponding to a given scene are more tightly packed).

Nearest neighbors search
Furthermore, we show in Table 2 that concrete knowledge acquired via our grounded model can also be transferred to abstract sentences. To do so, we manually build sentences using words with low concreteness (between 2.5 and 3.5) from the USF dataset (Nelson et al., 2004). Then, nearest neighbors are retrieved from the set of sentences of Flickr30K (Plummer et al., 2015). In this sample, we see that our grounded model is more accurate than the purely textual model at capturing visual meaning. The observation that visual information propagates from concrete sentences to abstract ones is analogous to findings made in previous research on word embeddings (Hill and Korhonen, 2014).

Table 3 notes: 'AVG' stands for the average accuracies reported in the other columns. '†': the model has been re-implemented (we obtained higher scores than those given in the original papers). '‡': the baseline is an adaptation of the model to the case of sentences. '*': significantly differs from the best scenario among our models.

Neighboring structure
To illustrate the discrepancy on the mNNO metric observed between C_g + P_g and T, we select a query image Q in the validation set of MS COCO, along with its corresponding caption S; we display, in Figure 2, the nearest neighbor of Q in the visual space, noted N, and the nearest neighbors of S in the grounded space. With our grounded model, the neighborhood of S is mostly made of sentences corresponding to Q or N.

Hypotheses validation
We now validate our hypotheses (cf. sect. 3.1) on the grounded space, using the Cross-Modal Projection baseline (CM) and our model scenarios, as outlined in Table 1. For a fair comparison, metrics for the baseline CM are estimated either on the visual or the textual space, depending on whether our models rely on the grounded space (g) or not (id). These results correspond to the rows CM (text) and CM (vis.) in Table 1.
Results highlight that: (1) using a grounded space is beneficial; indeed, semantic relatedness and mNNO scores are higher in the lower half of Table 1, e.g., C_g > C_id, P_g > P_id and C_g + P_g > C_id + P_id; (2) solely using cluster information leads to the highest C_intra and lowest C_inter, which suggests that C• is the most efficient model at separating visually different sentences; (3) using only perceptual information in P• logically leads to highly correlated textual and visual spaces (highest ρ_vis), but the local neighborhood structure is not well preserved (lowest C_intra); (4) our model C• + P• is better than CM at capturing cluster information (higher C_intra, lower C_inter) and perceptual information (higher ρ_vis). This also translates into a higher mNNO measure for C• + P•, leading us to think that the conjunction of perceptual and cluster information produces a high correlation of modalities in terms of neighborhood structure. Moreover, this high mNNO score results in better performances for our model C• + P• in terms of semantic relatedness.

Evaluation on transfer tasks
We now focus on the extrinsic evaluation of the embeddings. Table 3 reports evaluations of our baselines and scenarios on SentEval (Conneau and Kiela, 2018), a classical benchmark for evaluating sentence embeddings. Before further analysis, we find that our grounded models systematically outperform the textual baseline T on all benchmarks, which constitutes the first substantial improvement brought by grounding and visual information to a sentence representation model. Indeed, the models GS-Cap, GS-Img and GS-Both from Kiela et al. (2018), despite improving over T_1024, perform worse than the textual model of the same dimension T; this is consistent with what they report in their paper.
Our interpretation of the results is the following: (1) our joint approach shows superior performance over the sequential one, confirming results reported at the word level (Zablocki et al., 2018); indeed, both sequential models, the GS models (Kiela et al., 2018) and SEQ (inspired by Collell et al. (2017)), are systematically worse than our grounded models on all benchmarks. (2) Preserving the structure of the visual space is more effective than learning cross-modal projections; indeed, all our models outperform T + CM on average ('AVG' column). (3) Making use of a grounded space yields slightly improved sentence representations: our models that use the grounded space (g = MLP) can take advantage of the additional expressive power provided by the trainable g, compared to models that integrate grounded information directly in the textual space (g = id). (4) Among our model scenarios, T + P_g achieves maximal scores on most tasks; however, it shows lower scores on SNLI and SICK, which are entailment tasks. Models using cluster information C_g are naturally more suited to these tasks and hence obtain higher results. Finally, the combined model T + C_g + P_g shows a good balance between classification and entailment tasks.

Conclusion
We proposed a multimodal model aiming at preserving the structure of the visual and textual spaces to learn grounded sentence representations. Our contributions include (1) leveraging both perceptual and cluster information and (2) using an intermediate grounded space that relaxes the constraints on the textual space. Our approach is the first to report consistent positive results against purely textual baselines on a variety of natural language tasks. As future work, we plan to use visual information to specifically target complex downstream tasks requiring commonsense and reasoning, such as question answering or visual dialogue.
Figure 1: Model overview. Red circles indicate visual clusters. Red arrows represent the gradient of the cluster loss, which gathers visually equivalent sentences; the contrastive term in the loss L_C is not represented. The green arrow and angles illustrate the perceptual loss, ensuring that cosine similarities correlate across modalities. The origin is at the center of each space.

Figure 2: Nearest neighbors of a selected sentence in the validation set of MS COCO, for both the grounded and purely textual models. Q is the query image, N is the nearest neighbor of Q in the visual space. Sentences that are captions of Q or N are prefixed with Q or N.

Table 2 content (query | textual model | grounded model):
Two people are in love | Two people are fencing indoors | A couple just got married and are taking a picture with family
A man is horrified | A man and a woman are smiling | A teenage boy wearing a cap looks irritated
This is a tragedy | A group of people are at a party | Men doing a war reenactment

Figure 3: t-SNE visualization of CMPlaces sentences for a set of randomly sampled visual scenes. Left: textual model T. Right: grounded model C_g + P_g.

Table 2: Qualitative analysis: nearest neighbor of a given query (containing an abstract word) among Flickr30K sentences.