Learning to Represent Image and Text with Denotation Graph

Learning to fuse vision and language information and representing them is an important research problem with many applications. Recent progress has leveraged the ideas of pre-training (from language modeling) and attention layers in Transformers to learn representations from datasets containing images aligned with linguistic expressions that describe the images. In this paper, we propose learning representations from a set of implied, visually grounded expressions between image and text, automatically mined from those datasets. In particular, we use denotation graphs to represent how specific concepts (such as sentences describing images) can be linked to abstract and generic concepts (such as short phrases) that are also visually grounded. This type of generic-to-specific relation can be discovered using linguistic analysis tools. We propose methods to incorporate such relations into learning representations. We show that state-of-the-art multimodal learning models can be further improved by leveraging automatically harvested structural relations. The representations lead to stronger empirical results on the downstream tasks of cross-modal image retrieval, referring expression, and compositional attribute-object recognition. Both our code and the extracted denotation graphs on the Flickr30K and the COCO datasets are publicly available at https://sha-lab.github.io/DG.


Introduction
There is an abundant amount of aligned visual and language data, such as text passages describing images, narrated videos, subtitles in movies, etc. Thus, learning how to represent visual and language information when they are semantically related has been a very actively studied topic. There are many vision + language applications: image retrieval with descriptive sentences or captions (Barnard and Forsyth, 2001; Barnard et al., 2003; Hodosh et al., 2013; Young et al., 2014), image captioning (Chen et al., 2015; Xu et al., 2015), visual question answering (Antol et al., 2015), visual navigation with language instructions (Anderson et al., 2018b), localization of visual objects via short text phrases (Plummer et al., 2015), and others. A recurring theme is to learn representations of these two streams of information so that they correspond to each other, highlighting the notion that many language expressions are visually grounded.
Two ideas underlie much of the recent progress. The first is to contextualize the embeddings of one modality using information from the other one. This is achieved by using co-attention or cross-attention (in addition to self-attention) in Transformer layers. The second is to leverage the power of pre-training (Radford et al., 2019; Devlin et al., 2019): given a large parallel corpus of images and their descriptions, it is beneficial to identify pre-trained embeddings on these data such that they are useful for downstream vision + language tasks.
Despite such progress, there is a missed opportunity to learn stronger representations from those parallel corpora. As a motivating example, suppose we have two paired examples: an image $x_1$ corresponding to the text $y_1$ of TWO DOGS SAT IN FRONT OF PORCH, and an image $x_2$ corresponding to the text $y_2$ of TWO DOGS RUNNING ON THE GRASS. Existing approaches treat the two pairs independently and compute the embeddings for each pair without acknowledging that both texts share the common phrase $y_1 \cap y_2 = {}$TWO DOGS and that the images share the same visual category of two dogs.
We hypothesize that learning the correspondence between the common phrase $y_1 \cap y_2$ and the set of images $\{x_1, x_2\}$, though not explicitly annotated in the training data, is beneficial. Enforcing the alignment due to this additionally constructed pair introduces a form of structural constraint: the embeddings of $x_1$ and $x_2$ have to convey similar visual information that is congruent with the similar text information in the embeddings of $y_1$ and $y_2$.
In this paper, we validate this hypothesis and show that extracting additional, implied correspondences between the texts and the visual information, and then using them for learning, leads to better representations and stronger performance on downstream tasks. The additional alignment information forms a graph whose edges indicate how visually grounded concepts can be instantiated at both abstract levels (such as TWO DOGS) and specific levels (such as TWO DOGS SAT IN FRONT OF THE PORCH). These edges and the nodes that represent the concepts at different abstraction levels form a graph known as a denotation graph, previously studied in the NLP community (Young et al., 2014; Lai and Hockenmaier, 2017; Plummer et al., 2015) for grounding language expressions visually.
Our contributions are to propose creating visually grounded denotation graphs to facilitate representation learning. Concretely, we apply the technique originally developed for the FLICKR30K dataset (Young et al., 2014) to the COCO dataset (Lin et al., 2014) as well, obtaining denotation graphs grounded in each domain respectively ( § 3). We then show how the denotation graphs can be used to augment training samples for aligning text and image ( § 4). Finally, we show empirically that the representations learned with denotation graphs lead to stronger performance on downstream tasks ( § 5).

Related Work
Learning representation for image and text A large body of work has focused on improving the visual or text embedding functions (Socher et al., 2014; Eisenschtat and Wolf, 2017; Nam et al., 2017; Gu et al., 2018).
Another line of work, referred to as cross-stream methods, infers fine-grained alignments between local patterns of the visual input (i.e., local regions) and the linguistic input (i.e., words) for a pair of image and text, then uses them to derive the similarity between the image and the text. One such approach uses a cross-modal attention mechanism (Xu et al., 2015) to discover such latent alignments.

Figure 1: (Left) A schematic example of a denotation graph showing the hierarchical organization of linguistic expressions (adapted from https://shannon.cs.illinois.edu/DenotationGraph/). (Right) A random subgraph from the denotation graph extracted from the FLICKR30K dataset, with images attached to concepts at different levels of the hierarchy.
In contrast to that work, we focus on exploiting additional correspondences between image and text that are not explicitly given in the many image-text datasets. By analyzing the linguistic structures of the texts in those datasets, we are able to discover more correspondences that can be used for representation learning. We show that the learned representations are more powerful on downstream tasks.

Denotation Graph (DG)
Visually grounded text expressions denote the images (or videos) they describe. When examined together, these expressions reveal structural relations that are not apparent when each expression is studied in isolation. In particular, through linguistic analysis, these expressions can be grouped and partially ordered, thus forming a relation graph that represents how (visually grounded) concepts are shared among different expressions and how different concepts are related. This insight was explored by Young et al. (2014), and the resulting graph is referred to as a denotation graph, schematically shown in the left part of Fig. 1. In this work, we focus on constructing denotation graphs from the FLICKR30K and the COCO datasets, where the text expressions are sentences describing images.
Formally, a denotation graph $G$ is a polytree where a node $v_i$ corresponds to a pair of a linguistic expression $y_i$ and a set of images $X_i = \{x_1, x_2, \cdots, x_{n_i}\}$. A directed edge $e_{ij}$ from a node $v_i$ to its child $v_j$ represents a subsumption relation between $y_i$ and $y_j$. Semantically, $y_i$ is more abstract (generic) than $y_j$, and the tokens in $y_i$ can be a subset of $y_j$'s. For example, TWO DOGS describes all the images which TWO DOGS ARE RUNNING describes, though less specifically. Note that the subsumption relation is defined on the semantics of these expressions; the tokens do not have to match exactly on their surface forms. For instance, IN FRONT OF PERSON and IN FRONT OF CROWD are generic concepts whose tokens need not appear verbatim in the more specific expressions they subsume; see Fig. 1 for another example.
More formally, the set of images that corresponds to $v_i$ is the union of all the images corresponding to $v_i$'s children $\mathrm{ch}(v_i)$:
$$X_i = \bigcup_{v_j \in \mathrm{ch}(v_i)} X_j.$$
We also use $\mathrm{pa}(v_j)$ to denote the set of $v_j$'s parents.
Denotation graphs (DGs) can be seen as a hierarchical organization of semantic knowledge among concepts and their visual groundings. In this sense, they generalize the tree-structured object hierarchies that have often been used in computer vision. The nodes in a DG are composite phrases that are semantically richer than object names, and the relationships among them are also richer.
Constructing DG We used the publicly available tool, following Young et al. (2014). For details, please refer to the Appendix and the references therein. Once the graph is constructed, we attach images to the proper nodes by taking the set union of the images of each node's children, starting from the sentence-level nodes.
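To make the attachment step concrete, below is a minimal sketch (in Python) of this bottom-up set-union propagation, assuming each sentence-level node already carries its annotated images. The `Node` class and all names are illustrative, not from the released codebase.

```python
class Node:
    """A DG node: a linguistic expression plus its image set X_i."""
    def __init__(self, name, children=(), images=()):
        self.name = name
        self.children = list(children)
        self.images = set(images)  # non-empty only for sentence-level nodes

def attach_images(node):
    """Set X_i to the union of the children's image sets, recursively.
    The DG is a polytree, so a node may be reached via several parents;
    recomputing the union is idempotent and therefore harmless here."""
    for child in node.children:
        node.images |= attach_images(child)
    return node.images

# Example: TWO DOGS subsumes two sentence-level captions.
leaf1 = Node("two dogs sat in front of porch", images={"img_1"})
leaf2 = Node("two dogs running on the grass", images={"img_2"})
root = Node("two dogs", children=[leaf1, leaf2])
attach_images(root)
assert root.images == {"img_1", "img_2"}
```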
DG-FLICKR30K and DG-COCO We regenerate a DG on the FLICKR30K dataset (Young et al., 2014) and construct a new DG on the COCO dataset (Lin et al., 2014). The two datasets come from different visual and text domains: the former contains more iconic social-media photos, while the latter focuses on photos of complex scenes with more objects. Figure 1 shows a random subgraph of DG-FLICKR30K. Note that in both graphs, a large number of internal nodes (more abstract concepts or phrases) are introduced. For such concepts, the linguistic expressions are much shorter and the number of images they correspond to is larger.

Learning with Denotation Graphs
The denotation graphs, as described in the previous section, provide rich structures for learning representations of text and image. In what follows, we describe three learning objectives, starting from the most obvious one that matches images and their descriptions ( § 4.1), followed by learning to discriminate between general and specialized concepts ( § 4.2) and learning to predict concept relatedness ( § 4.3). We perform ablation studies of those objectives in § 5.4.

Matching Texts with Images
We suppose the image $x$ and the text $y$ are represented by (a set of) vectors $\phi(x)$ and $\psi(y)$, respectively. A common choice for $\phi(\cdot)$ is the last layer of a convolutional neural network (He et al., 2015; Xie et al., 2017), and for $\psi(\cdot)$ the contextualized word embeddings from a Transformer network (Vaswani et al., 2017). The embedding of the multimodal pair is a vector-valued function of $\phi(x)$ and $\psi(y)$:
$$v = f(\phi(x), \psi(y)).$$
There are many choices of $f(\cdot, \cdot)$. The simplest one is to concatenate the two arguments. We can also use the element-wise product between the two if they have the same embedding dimension.
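As an illustration, here is a small sketch of the two fusion choices just described, assuming $\phi(x)$ and $\psi(y)$ have already been pooled into single vectors; the dimensions and names are illustrative.

```python
import torch

def fuse_concat(phi_x: torch.Tensor, psi_y: torch.Tensor) -> torch.Tensor:
    # Simplest choice: concatenate the two embeddings.
    return torch.cat([phi_x, psi_y], dim=-1)

def fuse_product(phi_x: torch.Tensor, psi_y: torch.Tensor) -> torch.Tensor:
    # Element-wise product; valid only when the two dimensions match.
    return phi_x * psi_y

phi_x = torch.randn(4, 768)  # a batch of pooled image embeddings
psi_y = torch.randn(4, 768)  # a batch of pooled text embeddings
v = fuse_product(phi_x, psi_y)  # fused embedding v = f(phi(x), psi(y))
theta = torch.randn(768)
score = v @ theta               # matching score s(x, y) = theta^T v (see below)
```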

Matching Model
We use the following probabilistic model to characterize the joint distribution:
$$p(x, y) \propto \exp\left(s(x, y)\right),$$
where the exponent $s(x, y) = \theta^\top v$ is referred to as the matching score. To estimate $\theta$, we use maximum likelihood estimation:
$$\theta^* = \arg\max_\theta \sum_i \sum_k \log p(x_{ik}, y_i),$$
where $x_{ik}$ is the $k$th element in the set $X_i$. However, this probability is intractable to compute, as it requires normalizing over all possible pairs $(x, y)$. To approximate it, we use negative sampling.

Negative Sampling
For each (randomly selected) positive sample $(x_{ik}, y_i)$, we explore 4 types of negative examples and assemble them into a negative sample set $D^-_{ik}$: Visually mismatched pair We randomly sample an image $x^- \notin X_i$ to pair with $y_i$, i.e., $(x^-, y_i)$. Note that we automatically exclude the images from $v_i$'s children.
Semantically mismatched pair We randomly sample a text $y_j \neq y_i$ to form the pair $(x_{ik}, y_j)$. Note that we constrain $y_j$ not to be a concept more abstract than $y_i$, as the more abstract concept could certainly be used to describe the specific image $x_{ik}$.
Semantically hard pair We randomly sample a text $y_j$ that corresponds to an image $x_j$ visually similar to $x_{ik}$ to form $(x_{ik}, y_j)$. See (Lu et al., 2019) for details.

DG Hard Negatives
We randomly sample a sibling (but not cousin) node $v_j$ of $v_i$ such that $x_{ik} \notin X_j$ to form $(x_{ik}, y_j)$. Note that the last 3 types of pairs have increasing degrees of semantic confusability. In particular, the 4th type of negative sampling is only possible with the help of a denotation graph: in that type of negative sample, $y_j$ is semantically very close to $y_i$ (by construction), yet they denote different images. The "semantically hard pair", on the other hand, is not as hard as the last type, as $y_i$ and $y_j$ could be very different despite high visual similarity.
With the negative samples, we estimate $\theta$ as the minimizer of the following negative log-likelihood:
$$\ell_{\text{MATCH}} = -\sum_i \sum_k \log \frac{\exp\left(s(x_{ik}, y_i)\right)}{\exp\left(s(x_{ik}, y_i)\right) + \sum_{(x^-, y^-) \in D^-_{ik}} \exp\left(s(x^-, y^-)\right)}.$$
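A hedged sketch of this loss: the softmax over the positive pair and its sampled negative set $D^-_{ik}$ approximates the intractable partition function. The batching scheme is an assumption here; the scoring model is left abstract.

```python
import torch
import torch.nn.functional as F

def match_loss(pos_score: torch.Tensor, neg_scores: torch.Tensor) -> torch.Tensor:
    """pos_score: (B,) scores s(x_ik, y_i) of the positive pairs;
    neg_scores: (B, K) scores of the K sampled negatives per positive."""
    logits = torch.cat([pos_score.unsqueeze(1), neg_scores], dim=1)  # (B, 1+K)
    targets = torch.zeros(logits.size(0), dtype=torch.long)  # positive is class 0
    return F.cross_entropy(logits, targets)  # the negative log-likelihood above

loss = match_loss(torch.randn(8), torch.randn(8, 4))  # e.g., 4 negatives each
```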

Learning to Be More Specific
The hierarchy in the denotation graph introduces an opportunity for learning image and text representations that are sensitive to fine-grained distinctions. Concretely, consider a parent node $v_i$ with an edge to the child node $v_j$. While the description $y_j$ matches any image in its children nodes, the parent node's description $y_i$, one level higher, is more abstract: a parent concept such as INSTRUMENT describes any image that its more specific children describe, but less precisely. To incorporate this modeling notion, we introduce a specificity loss,
$$\ell_{\text{SPEC}} = \left[s(x, y_i) - s(x, y_j)\right]_+,$$
computed for an image $x \in X_j$ and an edge from $v_i$ to $v_j$, where $[h]_+ = \max(0, h)$ denotes the hinge loss. The loss is minimized such that the matching score for the less specific description $y_i$ is smaller than that for the more specific description $y_j$.
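A minimal sketch of this loss, assuming the scores of a parent description $y_i$ and a child description $y_j$ on the same image have been computed; the zero margin matches the hinge $[h]_+$ as written, though the released code may use a different margin.

```python
import torch

def spec_loss(s_parent: torch.Tensor, s_child: torch.Tensor,
              margin: float = 0.0) -> torch.Tensor:
    # Penalize cases where the abstract description y_i scores as high as
    # (or higher than) the more specific description y_j on the same image.
    return torch.clamp(s_parent - s_child + margin, min=0.0).mean()

loss = spec_loss(torch.randn(8), torch.randn(8))
```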

Learning to Predict Structures
Given the graph structure of the denotation graph, we can also improve the accuracy of the image and text representations by modeling higher-order relationships. Specifically, for a pair of nodes $v_i$ and $v_j$, we want to predict whether there is an edge from $v_i$ to $v_j$, based on each node's fused embedding of its image-text pair. Concretely, this is achieved by minimizing the following negative log-likelihood:
$$\ell_{\text{EDGE}} = -\sum_{(v_i, v_j)} \log p\left(e_{ij} \mid v_i, v_j\right).$$
We use a multi-layer perceptron with a binary output to parameterize the log-probability.
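The edge predictor can be sketched as follows: a small MLP with a binary output applied to the two nodes' fused embeddings. The hidden size and the use of concatenation as input are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class EdgePredictor(nn.Module):
    def __init__(self, dim: int = 768, hidden: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(2 * dim, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, v_i: torch.Tensor, v_j: torch.Tensor) -> torch.Tensor:
        # Logit for "there is an edge from v_i to v_j".
        return self.mlp(torch.cat([v_i, v_j], dim=-1)).squeeze(-1)

predictor = EdgePredictor()
logits = predictor(torch.randn(8, 768), torch.randn(8, 768))
edge_loss = F.binary_cross_entropy_with_logits(logits, torch.ones(8))
```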

The Final Learning Objective
We combine the above loss functions into the final learning objective for learning on the DG:
$$\ell = \ell_{\text{MATCH}} + \lambda_1\,\ell_{\text{SPEC}} + \lambda_2\,\ell_{\text{EDGE}}, \quad (7)$$
where $\lambda_1, \lambda_2$ are hyper-parameters that trade off the different losses. Setting them both to 1.0 works well; performance under different $\lambda_1$ and $\lambda_2$ is reported in Table 12 and Table 13. We study how each component affects the learned representation in § 5.4.
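In code, the combined objective is simply a weighted sum, mirroring Eq. 7:

```python
def total_loss(l_match, l_spec, l_edge, lambda_1=1.0, lambda_2=1.0):
    # Eq. 7: MATCH plus the DG-specific losses, weighted by lambda_1, lambda_2.
    return l_match + lambda_1 * l_spec + lambda_2 * l_edge
```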

Experiments
We examine the effectiveness of using denotation graphs to learn image and text representations. We first describe the experimental setup and key implementation details ( § 5.1). We then describe key image-text matching results in § 5.2, followed by studies about the transfer capability of our learned representation ( § 5.3). Next, we present ablation studies over different components of our model ( § 5.4). Finally, we validate how well abstract concepts can be used to retrieve images, using our model ( § 5.5).

Experimental Setup
We list major details in the following to provide context, with the full details documented in the Appendix for reproducibility.

Embeddings and Matching Models
Our aim is to show that denotation graphs improve state-of-the-art methods. To this end, we experiment with two recently proposed state-of-the-art approaches and their variants for learning from multimodal data: ViLBERT (Lu et al., 2019) and UNITER. The architecture diagrams and implementation details are in the Appendix, with key elements summarized in the following.
Both approaches start with an image encoder, which produces a set of embeddings of image patches, and a text encoder, which produces a sequence of word (or word-piece) embeddings. In ViLBERT, the text tokens are processed with Transformer layers and fused with the image information through 6 layers of co-attention Transformers; the outputs of the two streams are then element-wise multiplied to give the fused embedding of both streams. In UNITER, both streams are fed into 12 Transformer layers with cross-modal attention; a special token CLS is used, and its embedding is taken as the fused embedding of both streams.
For ablation studies, we use a smaller ViLBERT model (described in § 5.4). We follow the data splits of (Karpathy and Fei-Fei, 2015). Key characteristics of the two DGs are reported in Table 1.

Evaluation Tasks
We evaluate the learned representations on three common vision + language tasks. In text-based image retrieval, we evaluate two settings: the text is either a sentence or a phrase from the test corpus. In the former setting, the sentence is a leaf node of the denotation graph; in the latter, the phrase is an inner node of the denotation graph, representing a more general concept. We evaluate on the FLICKR30K and the COCO datasets, respectively. The main evaluation metrics are the recalls R@M with M = 1, 5, or 10, and RSUM, the sum of the 3 recalls (Wu et al., 2019). In the reverse direction, we also evaluate on the task of image-based text retrieval, retrieving the right descriptive text for an image.
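For reference, a minimal sketch of these metrics, assuming each query has a single ground-truth item whose 1-based rank in the retrieved list is known:

```python
import numpy as np

def recall_at(ranks: np.ndarray, m: int) -> float:
    # Fraction of queries whose ground-truth item appears in the top-m results.
    return float((ranks <= m).mean() * 100)

def rsum(ranks: np.ndarray) -> float:
    # RSUM: the sum of R@1, R@5, and R@10.
    return sum(recall_at(ranks, m) for m in (1, 5, 10))

ranks = np.array([1, 3, 12, 7, 2])  # e.g., ranks of the correct images
print(recall_at(ranks, 1), recall_at(ranks, 10), rsum(ranks))
```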
In addition to the above cross-modal retrieval, we also consider two downstream evaluation tasks, i.e., Referring Expression and Compositional Attribute-Object Recognition.
(1) Referring Expression is a task where the goal is to localize the corresponding object in the image given an expression (Kazemzadeh et al., 2014). We evaluate on the dataset REFCOCO+, which contains 141,564 expressions with 19,992 images. We follow the previously established protocol to evaluate on the validation split, the TestA split, and the TestB split.
We are primarily interested in zero-shot/few-shot learning performance.
(2) Compositional Attribute-Object Recognition is a task that requires a model to recognize unseen compositions of attributes and objects (Misra et al., 2017); see § 5.3 for details.

Results on Image-Text Matching

The hyper-parameters $\lambda_1$ and $\lambda_2$ in Eq. (7) are set to 1.0, unless specified otherwise (see the Appendix). Table 2 and Table 3 report the performance on cross-modal retrieval. On both datasets, models trained with denotation graphs considerably outperform the corresponding ones trained without. For the image-based text retrieval task, ViLBERT and UNITER on FLICKR30K suffer a small drop in R@10 when the DG is used. On the same task, UNITER on the COCO 5K test split decreases more when the DG is used. However, note that on both splits of COCO, ViLBERT is a noticeably stronger model, and using the DG improves its performance.

Zero/Few-Shot and Transfer Learning
Transfer across Datasets Table 4 illustrates that the learned representations assisted by the DG transfer better when applied to another dataset (the TARGET DOMAIN) that differs from the SOURCE DOMAIN dataset on which the DG is based. Note that the representations are not fine-tuned on the TARGET DOMAIN. The improvement in the COCO → FLICKR30K direction is stronger than the reverse, presumably because the COCO dataset is bigger than FLICKR30K. (R@5 and R@10 are reported in the Appendix.) Zero/Few-shot Learning for Referring Expression We evaluate our model on the task of referring expression, a supervised learning task, in the setting of zero/few-shot transfer learning. In zero-shot learning, we do not fine-tune the model on the referring expression dataset (i.e., REFCOCO+). Instead, we perform a "counterfactual" inference: we measure the drop in the compatibility score (between the text describing the referred object and the image) as we remove each candidate region in turn. The region that causes the biggest drop in compatibility score is selected, as it is the one most likely to correspond to the description. In the few-shot setting, we fine-tune our COCO-pre-trained model end-to-end on the referring expression dataset (i.e., REFCOCO+).
The results in Table 5 suggest that when the amount of labeled data is limited, training with DG performs better than training without. When the amount of data is sufficient for end-to-end training, the advantage of training with DG diminishes.

Compositional Attribute-Object Recognition
We evaluate our model on supervised compositional attribute-object recognition (Misra et al., 2017) and report results on recognizing UNSEEN attribute-object labels on the MIT-STATE test data (Isola et al., 2015). Specifically, we treat the text of the image labels (i.e., attribute-object pairs as compound phrases) as the sentences used to fine-tune the ViLBERT models, using the MATCH objective. Table 6 reports the results (in top-K accuracies) of both prior methods and variants of ViLBERT that are trained from scratch (N/A), pre-trained on COCO, and pre-trained on DG-COCO, respectively. ViLBERT models pre-trained with parallel pairs of images and texts (i.e., COCO and DG-COCO) improve significantly over the baseline trained on MIT-STATE from scratch. The model pre-trained with DG-COCO achieves the best results among the ViLBERT variants: it performs on par with the previous state-of-the-art method in top-1 accuracy and outperforms it in top-2 and top-3 accuracies.

Ablation Studies
The rich structures encoded in the DGs give rise to several components that can be incorporated into learning representations. We study whether they benefit performance on the downstream task of text-based image retrieval. In the notation of § 4, those components are: (1) the DG HARD NEGATIVES in the MATCH loss, which we ablate by using only the other 3 types of negative samples ( § 4.1); (2) aligning images with more specific text descriptions ( § 4.2); (3) predicting the existence of edges between pairs of nodes ( § 4.3). Table 7 shows the results of the ablation studies. We report results on two versions of ViLBERT. In ViLBERT (Reduced), the number of parameters is significantly reduced by making the model less deep, and thus faster for development. Instead of being pre-trained, these models are trained on the FLICKR30K dataset directly for 15 epochs with a minibatch size of 96 and a learning rate of 4e-5.
In ViLBERT (Full), we use the aforementioned settings. We report RSUM on the FLICKR30K dataset for the task of text-based image retrieval.
All models with DG perform better than the models without DG. Moreover, the components of DG HARD NEGATIVES, SPEC, and EDGE each contribute positively, and their gains are cumulative.

Image Retrieval from Abstract Concepts
The leaf nodes in a DG correspond to complete sentences describing images. The inner nodes are shorter phrases that describe more abstract concepts and correspond to broader sets of images; refer to Table 1 for key statistics in this respect. Fig. 2 contrasts how well abstract concepts can be used to retrieve images. The concepts are the language expressions corresponding to the leaf nodes, the nodes one level above the leaves (LEAF-1), or two levels above (LEAF-2) in DG-FLICKR30K. Since abstract concepts tend to correspond to multiple images, we use mean average precision (mAP) to measure the retrieval results. ViLBERT + DG outperforms ViLBERT significantly, and the improvement is stronger when the concepts are more abstract.
It is interesting to note that while the MATCH loss used in ViLBERT w/ DG learns to align images with text at both specific and abstract levels, such learning benefits all levels: the improvement in retrieval at abstract levels does not come at the expense of retrieval at specific levels.

Conclusion
Image and text aligned data is rich in semantic correspondence. Beyond treating text annotations as "categorical" labels, in this paper we show that we can make fuller use of those labels. Concretely, denotation graphs (DGs) encode structural relations that can be automatically extracted from the texts with linguistic analysis tools. We proposed several ways to incorporate DGs into representation learning and validated the proposed approach on several tasks. As future directions, we plan to investigate other automatic tools for curating more accurate denotation graphs with complex compositions of fine-grained concepts.

A.1 Details of the Denotation Graph Construction

We additionally train ViLBERT with versions of DG-FLICKR30K that have larger maximum depths (up to 7 levels). The training hyper-parameters remain the same as for ViLBERT + DG-FLICKR30K with 3 maximum levels. The aim is to check how much gain we can get from the additional annotations. We report the results in Table 8: the model trained with 3 levels of DG actually achieves the best performance. This might be because the high-level layers of the DG (counting from the sentences) contain very abstract concepts, such as "entity" and "physical object", which are uninformative for learning visual grounding.
Once the graph is constructed, we attach images to the proper nodes by taking the set union of the images of each node's children, starting from the sentence-level nodes.

A.2 Model architectures of ViLBERT and UNITER
A comparison of these models is schematically illustrated in Fig. 3. The ViLBERT model contains 121 million parameters, while UNITER contains 111 million parameters.

A.3 Training Details
All models are optimized with the Adam optimizer (Kingma and Ba, 2015). The learning rate is initialized to 4e-5. Following ViLBERT (Lu et al., 2019), a warm-up training session is employed, during which we linearly increase the learning rate from 0 to 4e-5 over the first 1.5% of the training epochs. The learning rate is dropped to 4e-6 and 4e-7 at the 10th and the 15th epochs, respectively. For ViLBERT (Reduced), we randomly initialize the model parameters in the image stream; the text stream is initialized from the first 3 layers of the pre-trained BERT model, and its co-attention Transformer layers are randomly initialized. For ViLBERT (Full) and UNITER, we initialize with the models' weights pre-trained on the Conceptual Captions dataset. After tokenization, the tokens are transformed into 768-dimensional features by a word embedding initialized from the pre-trained BERT model. 768-dimensional position features are included in the input to represent the position of each token.

A.5 Visual Pre-processing
For both ViLBERT and UNITER, we use the image patch features generated by bottom-up attention, as suggested by the original papers (Anderson et al., 2018a). The image patch features contain up to 100 image patches, each with a 2048-dimensional feature. In addition, a positional feature is used to represent the spatial location of each bounding box for both ViLBERT and UNITER. Specifically, ViLBERT uses a 5-dimensional position feature that encodes the normalized coordinates of the upper-left and lower-right corners of the bounding box, plus one additional dimension encoding the normalized patch size. UNITER uses two additional spatial features that encode the normalized width and height of the object bounding box.
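A small sketch of the 5-dimensional ViLBERT position feature described above; reading "patch size" as the normalized area of the bounding box is our assumption.

```python
def position_feature(x1, y1, x2, y2, img_w, img_h):
    """Normalized upper-left and lower-right corners, plus normalized size."""
    return [
        x1 / img_w, y1 / img_h,                    # upper-left corner
        x2 / img_w, y2 / img_h,                    # lower-right corner
        (x2 - x1) * (y2 - y1) / (img_w * img_h),   # patch size (area), assumed
    ]

print(position_feature(10, 20, 110, 220, 640, 480))
```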

B Full Experimental Results
In this section, we include additional experimental results referred to by the main text. Specifically, we include results from a variety of models (e.g., ViLBERT, ViLBERT + DG, UNITER, and UNITER + DG) on the COCO 5K test split (Karpathy and Fei-Fei, 2015) in § B.1. We then provide a comprehensive ablation study on the impact of $\lambda_1$ and $\lambda_2$ of Eq. 7 in the main text in § B.3.

B.1 Complete Results on COCO Dataset
We report the full results on the COCO dataset (1K test split and 5K test split) in Table 9 and Table 10, and additionally contrast them with other existing approaches on these tasks. ViLBERT + DG and UNITER + DG improve performance over their counterparts without DG by a significant margin on both the COCO 1K and 5K test splits; the only exception is that, on the task of image-based text retrieval, UNITER performs better than UNITER + DG.
These results support our claim that training with DG helps the model learn better visual and linguistic representations.

B.2 Complete Results on FLICKR30K Dataset
We contrast to other existing approaches in Table 11 on the task of text-based image retrieval on the FLICKR30K dataset.

B.3 Ablation Study on $\lambda_1$ and $\lambda_2$
We conduct an ablation study on the impact of the two hyper-parameters $\lambda_1$ and $\lambda_2$ in Eq. 7 of the main text, using two ViLBERT variants: ViLBERT Reduced and ViLBERT. The results are reported in Table 12 and Table 13. As there are two hyper-parameters, we analyze the impact of each by fixing the other to 1. Fixing $\lambda_2 = 1$ and varying $\lambda_1$, we observe that ViLBERT prefers a larger $\lambda_1$, while ViLBERT Reduced performs slightly worse when $\lambda_1$ is either smaller or larger than 1. Fixing $\lambda_1 = 1$ and varying $\lambda_2$, we observe that the performance of both architectures is slightly reduced at $\lambda_2 = 0.5$ and $\lambda_2 = 2$.

B.4 Full Results on Zero/Few-Shot and Transfer Learning

Implementation Details for Zero-shot Referring Expression The learned ViLBERT and ViLBERT w/ DG models are first used to produce a base matching score $s_{\text{BASE}}$ between the expression to be grounded and the whole image. We then compute the matching score $s_{\text{MASKED}}$ between the expression and the image with each region feature replaced by a random feature in turn. As the masked image region might be a noisy region, $s_{\text{MASKED}}$ might be larger than $s_{\text{BASE}}$. Therefore, the model's prediction of which region the expression refers to is the masked region that causes the largest score
$$s_{\text{REGION}} = \mathbb{I}\left[s_{\text{BASE}} > s_{\text{MASKED}}\right] \cdot \left(s_{\text{BASE}} - s_{\text{MASKED}}\right),$$
where $\mathbb{I}[\cdot]$ is an indicator function. Table 5 shows that ViLBERT + DG-COCO outperforms ViLBERT on this task. Table 14 reports the full set of evaluation metrics on transferring across datasets. Training with DG noticeably improves over training without DG.
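A hedged sketch of this counterfactual inference, where `score_fn` stands in for the learned matching score $s(\cdot, \cdot)$ and each region is a feature vector; the masking-by-random-feature step follows the description above.

```python
import torch

def select_region(score_fn, expression, regions):
    """Pick the region whose masking causes the largest drop from the base score."""
    s_base = score_fn(expression, regions)
    drops = []
    for r in range(len(regions)):
        masked = list(regions)
        masked[r] = torch.randn_like(regions[r])  # replace region r's feature
        s_masked = score_fn(expression, masked)
        # Keep only genuine drops: s_masked can exceed s_base for noisy regions.
        drops.append(max(s_base - s_masked, 0.0))
    return max(range(len(regions)), key=lambda r: drops[r])
```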

C Visualization of Model's Predictions on Denotation Graphs
We show several qualitative examples of both success and failure cases of ViLBERT + DG, when retrieving images matched to a text query, in Fig. 4 and Fig. 5. The image and text correspondences are generated by the denotation graph, derived from the caption and image alignments. In Fig. 4, ViLBERT + DG successfully recognizes the images that align with the text "man wear reflective vest", while ViLBERT fails to retrieve the matched image. In the failure case in Fig. 5, although ViLBERT + DG fails to retrieve the images exactly matching the text, it still retrieves very relevant images given the query. (In the figures, correct samples are marked in green and incorrect ones in red.)