An Effective Framework for Weakly-Supervised Phrase Grounding

Phrase localization is a task that studies the mapping from textual phrases to regions of an image. Given difficulties in annotating phrase-to-object datasets at scale, we develop a Multimodal Alignment Framework (MAF) to leverage more widely-available caption-image datasets, which can then be used as a form of weak supervision. We first present algorithms to model phrase-object relevance by leveraging fine-grained visual representations and visually-aware language representations. By adopting a contrastive objective, our method uses information in caption-image pairs to boost the performance in weakly-supervised scenarios. Experiments conducted on the widely-adopted Flickr30k dataset show a significant improvement over existing weakly-supervised methods. With the help of the visually-aware language representations, we can also improve the previous best unsupervised result by 5.56%. We conduct ablation studies to show that both our novel model and our weakly-supervised strategies significantly contribute to our strong results.


Introduction
Language grounding involves mapping language to real objects or data. Among language grounding tasks, phrase localization, which maps phrases to regions of an image, is a fundamental building block for other tasks. In the phrase localization task, each data point consists of one image and its corresponding caption, i.e., d = (I, S), where I denotes an image and S denotes a caption. Typically, the caption S contains several query phrases P = {p_n}_{n=1}^{N}, where each phrase is grounded to a particular object in the image. The goal is to find the correct relationship between (query) phrases in the caption and particular objects in the image. Existing work (Rohrbach et al., 2016; Kim et al., 2018; Li et al., 2019; Yu et al., 2018; Liu et al., 2020) mainly focuses on the supervised phrase localization setting. This requires a large-scale annotated dataset of phrase-object pairs for model training. However, given the difficulties associated with manual annotation of objects, the size of grounding datasets is often limited. For example, the widely-adopted Flickr30k (Plummer et al., 2015) dataset has 31k images, while the caption dataset MS COCO (Lin et al., 2014) contains 330k images.
To address this limited data challenge, two different approaches have been proposed. First, a weakly-supervised setting, which requires only caption-image annotations (i.e., no phrase-object annotations), was proposed by Rohrbach et al. (2016). This is illustrated in Figure 1. Second, an unsupervised setting, which does not need any training data (i.e., neither caption-image nor phrase-object annotations), was proposed by Wang and Specia (2019). To bring more semantic information into such a setting, previous work (Yeh et al., 2018; Wang and Specia, 2019) used the detected object labels from an off-the-shelf object detector (which we will generically denote by PreDet) and achieved promising results. In more detail, for a given image, PreDet first detects a set of objects O. Afterward, all the query phrases P and the detected objects O are fed into an alignment model to predict the final phrase-object pairs. However, relying purely on the object labels causes ambiguity. For example, in Figure 2, the grounded objects of phrases "an older man" and "the man with a red accordion" are both labeled as "man," and thus they are hard to differentiate.
Given these observations, we propose a Multimodal Alignment Framework (MAF), which is illustrated in Figure 3. Instead of using only the label features from the PreDet (in our case, a Faster R-CNN (Ren et al., 2015; Anderson et al., 2018a)), we also enhance the visual representations by integrating visual features from the Faster R-CNN into object labels. (This is shown in Figure 2.) Next, we build visually-aware language representations for phrases, which can thus be better aligned with the visual representations. Based on these representations, we develop a multimodal similarity function to measure the caption-image relevance with phrase-object matching scores. Furthermore, we use a training objective to score relevant caption-image pairs higher than irrelevant caption-image pairs, which guides the alignment between visual and textual representations.
We evaluate MAF on the public phrase localization dataset, Flickr30k Entities (Plummer et al., 2015). Under the weakly-supervised setting (i.e., using only caption-image annotations without the more detailed phrase-object annotations), our method achieves an accuracy of 61.43%, outperforming the previous weakly-supervised results by 22.72%. In addition, in the unsupervised setting, our visually-aware phrase representation improves the performance from the previous 50.49% by 5.56% up to 56.05%. Finally, we validate the effectiveness of model components, learning methods, and training techniques by showing their contributions to our final results.

Related Work
With recent advances in computer vision and computational linguistics, multimodal learning, which aims to explore the explicit relationship across vision and language, has drawn significant attention. Multimodal learning involves diverse tasks such as Captioning (Vinyals et al., 2015; Xu et al., 2015; Karpathy and Fei-Fei, 2015; Venugopalan et al., 2015), Visual Question Answering (Anderson et al., 2018a; Kim et al., 2018; Tan and Bansal, 2019), and Vision-and-Language Navigation (Anderson et al., 2018b; Thomason et al., 2020). Most of these tasks would benefit from better phrase-to-object localization, a task which attempts to learn a mapping between phrases in the caption and objects in the image by measuring their similarity. Existing works consider the phrase-to-object localization problem under various training scenarios, including supervised learning (Rohrbach et al., 2016; Yu et al., 2018; Liu et al., 2020; Plummer et al., 2015; Li et al., 2019) and weakly-supervised learning (Rohrbach et al., 2016; Yeh et al., 2018; Chen et al., 2018). Besides the standard phrase-object matching setup, previous works (Xiao et al., 2017; Akbari et al., 2019; Datta et al., 2019) have also explored a pixel-level "pointing-game" setting, which is easier to model and evaluate but less realistic. Unsupervised learning was studied by Wang and Specia (2019), who directly use word similarities between object labels and query phrases to tackle phrase localization without paired examples. Similar to the phrase-localization task, Hessel et al. (2019) leverage document-level supervision to discover image-sentence relationships over the web.
[Figure 3: A dataset of images and their captions is the input to our model. PreDet predicts bounding boxes for objects in the image and their labels, attributes, and features, which are then integrated into visual feature representations. Attention is applied between word embeddings and visual representations to compute the visually-aware language representations for phrases. Finally, a multimodal similarity function is used to measure the caption-image relevance based on the phrase-object similarity matrix.]
final output feature of PreDet (denoted as f_m) as the visual feature representation (VFR), and Wang and Specia (2019) use the label embedding (denoted as l_m) of the predicted label from PreDet as the VFR. Such a single-source VFR usually lacks the information carried by the other sources. Hence, we exploit different aspects of the features extracted from PreDet for each object o_m in the image. In particular, we combine the output feature f_m, the label embedding l_m, and the attribute embedding t_m of the object o_m into the VFR v_m, using two projection matrices W_t and W_f. Naively initializing W_t and W_f leads the model to a sub-optimal solution. In Section 4, we discuss the effectiveness of different initializations.
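As a rough illustration of how the three PreDet outputs could be fused, the sketch below assumes an additive combination, v_m = l_m + W_t t_m + W_f f_m. This form is an assumption that is merely consistent with the description (two projection matrices, label embedding left unprojected); the function names and toy dimensions are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def matvec(W: Matrix, x: Vector) -> Vector:
    """Multiply a (d x k) matrix by a length-k vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def visual_feature(l_m: Vector, t_m: Vector, f_m: Vector,
                   W_t: Matrix, W_f: Matrix) -> Vector:
    """Fuse label, attribute, and detector features into one VFR.

    Assumption (not confirmed by the text): the fusion is additive,
    v_m = l_m + W_t t_m + W_f f_m.
    """
    proj_t = matvec(W_t, t_m)  # attribute embedding mapped to label space
    proj_f = matvec(W_f, f_m)  # detector feature mapped to label space
    return [l + a + b for l, a, b in zip(l_m, proj_t, proj_f)]


# Toy example: 2-d label/attribute embeddings, 3-d detector feature.
l_m = [1.0, 0.0]
t_m = [0.0, 2.0]
f_m = [1.0, 1.0, 1.0]
W_t = [[1.0, 0.0], [0.0, 1.0]]            # identity projection for attributes
W_f = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # zero projection for detector features
v_m = visual_feature(l_m, t_m, f_m, W_t, W_f)
print(v_m)  # [1.0, 2.0]
```

With a zero W_f, the detector feature contributes nothing at the start and the VFR is driven by the label and attribute embeddings alone.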

Textual Feature Representations. Existing works for textual feature representation (TFR) (Kim et al., 2018; Yu et al., 2018; Wang and Specia, 2019) commonly treat it independently of the VFR. From a different angle, we use the attention between the textual feature and the VFR v_m to integrate the visual information from the object into the TFR. In more detail, we first use the GloVe embedding (Pennington et al., 2014) to encode the K_n words in the phrase p_n as {h_{n,k}}_{k=1}^{K_n}, where h_{n,k} ∈ R^d. Here, the dimension of h_{n,k} is the same as that of v_m. We then define a word-object matching score a^m_{n,k} between each h_{n,k} in the phrase and every object feature v_m. In particular, for each word h_{n,k} in the phrase, we select the object with the highest matching score. Finally, we normalize the resulting attention weights over the words in the phrase p_n, which gives the weights β_{n,k}, and use them together with a projection matrix W_p to obtain the final TFR, e_n. In Section 4, we study the benefit of the weights β_{n,k} over simply averaging the h_{n,k}, as well as the importance of the initialization of W_p.
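The attention-based phrase representation can be sketched as follows. This is a minimal illustration under stated assumptions: the word-object matching score is taken to be a dot product, the normalization over words is a softmax, and the phrase feature is the weighted sum of word embeddings followed by a projection. None of these exact forms is confirmed by the text, and all function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_weights(words: List[Vector], objects: List[Vector]) -> List[float]:
    """Weight each word by its best match against any object.

    Assumption: matching score = dot product, normalization = softmax.
    """
    best = [max(dot(h, v) for v in objects) for h in words]
    return softmax(best)


def phrase_feature(words: List[Vector], objects: List[Vector],
                   W_p: Matrix) -> Vector:
    """Weighted sum of word embeddings, then projection by W_p."""
    beta = word_weights(words, objects)
    dim = len(words[0])
    pooled = [sum(b * h[i] for b, h in zip(beta, words)) for i in range(dim)]
    return [dot(row, pooled) for row in W_p]


# Toy example: the first word matches the only object strongly,
# so it should dominate the phrase representation.
words = [[1.0, 0.0], [0.0, 1.0]]
objects = [[3.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
beta = word_weights(words, objects)
e_n = phrase_feature(words, objects, identity)
print(beta, e_n)
```

Replacing `word_weights` with uniform weights recovers plain average pooling, which is the baseline the weighting is compared against.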

Training Objective and Learning Settings
Contrastive loss. For the weakly-supervised setting, we use a contrastive loss to train our model, due to the lack of phrase-object annotations. The contrastive objective L aims to learn the visual and textual features by maximizing the similarity score between paired image-caption elements and minimizing the score between the negative samples (i.e., other irrelevant images). Inspired by previous work in caption ranking (Fang et al., 2015), we define this loss over the similarity scores sim(I, S), where sim(I, S) is the similarity function defined below. In particular, for each caption sentence, we use all the images I in the current batch as candidate examples.
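One common form of such a batch-level ranking objective is a softmax cross-entropy over the candidate images of each caption. The sketch below assumes that form, L = -log(exp(sim(I, S)) / Σ_{I'} exp(sim(I', S))), averaged over captions; the exact loss is not recoverable from the text, and the function name and input layout are hypothetical.

```python
import math
from typing import List


def contrastive_loss(sims: List[List[float]]) -> float:
    """Mean softmax ranking loss over a batch.

    sims[i][j] holds sim(I_j, S_i): the score of caption i against image j.
    The paired (positive) image of caption i is assumed to sit at j == i;
    every other image in the batch acts as a negative.
    """
    total = 0.0
    for i, row in enumerate(sims):
        top = max(row)
        # log-sum-exp over all candidate images, computed stably
        log_denominator = top + math.log(sum(math.exp(s - top) for s in row))
        total += log_denominator - row[i]
    return total / len(sims)


# Uninformative scores give log(batch size); well-separated scores give ~0.
uniform = contrastive_loss([[0.0, 0.0], [0.0, 0.0]])
separated = contrastive_loss([[10.0, 0.0], [0.0, 10.0]])
print(uniform, separated)
```

Using the whole batch as candidates means no separate negative sampling step is needed: each caption sees one positive and (batch size - 1) negatives.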
Multimodal Similarity Functions. Following the document-level dense correspondence function in Hessel et al. (2019), our multimodal similarity function sim(I, S) is defined on top of the phrase-object similarity matrix A ∈ R^{N×M}, whose component A_{n,m} is the similarity score between the n-th phrase and the m-th object. sim(I, S) measures the image-caption similarity and is calculated from the similarity scores between each phrase in the caption and each object in the image. Note that the maximum function max_m A_{n,m} directly connects our training objective and inference target, which alleviates the discrepancy between training and inference.
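To make the role of max_m A_{n,m} concrete, the following sketch builds a phrase-object similarity matrix and aggregates it into one caption-image score. Two choices here are assumptions rather than details taken from the text: entries of A are cosine similarities, and the per-phrase maxima are averaged over the N phrases. Function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def similarity_matrix(phrases: List[Vector],
                      objects: List[Vector]) -> List[List[float]]:
    """A[n][m] = similarity between phrase n and object m (N x M)."""
    return [[cosine(e, v) for v in objects] for e in phrases]


def caption_image_similarity(phrases: List[Vector],
                             objects: List[Vector]) -> float:
    """Average, over phrases, of each phrase's best object score."""
    A = similarity_matrix(phrases, objects)
    return sum(max(row) for row in A) / len(A)


phrases = [[1.0, 0.0], [0.0, 1.0]]
objects = [[2.0, 0.0], [1.0, 1.0]]
A = similarity_matrix(phrases, objects)
score = caption_image_similarity(phrases, objects)
print(A, score)
```

Because the aggregate only looks at each row's maximum, the object that drives the training signal for a phrase is the same object that would be selected for that phrase at inference time.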
Weakly-supervised setting. During training, our PreDet model is frozen. The word embeddings, W_t, W_f, and W_p are trainable parameters. Here, the word embeddings are initialized with GloVe (Pennington et al., 2014). We study different initialization methods for the remaining parameters in Section 4. During inference, for the n-th phrase p_n in an image-caption pair, we choose as the localized object the one with the highest phrase-object similarity, i.e., the object m that maximizes A_{n,m}.
Unsupervised setting. In the unsupervised setting, the localized object is determined without any trained parameters: we drop W_t, W_f, and W_p, because there is no training in this setting, and β_{n,k} is calculated based only on l_m (instead of v_m).
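Inference in either setting comes down to picking, for each phrase, the detected object with the highest score. The sketch below assumes the scores are already arranged as an N x M matrix (one row per phrase, one column per object); the function name `localize` and the toy boxes are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def localize(scores: List[List[float]], boxes: List[Box]) -> List[Box]:
    """Return, for each phrase (row), the box of its best-scoring object.

    Ties are broken in favour of the lower object index.
    """
    chosen = []
    for row in scores:
        best = max(range(len(row)), key=lambda m: row[m])
        chosen.append(boxes[best])
    return chosen


boxes = [(0, 0, 50, 80), (60, 10, 120, 90)]
scores = [[0.2, 0.7],   # phrase 0 prefers object 1
          [0.9, 0.1]]   # phrase 1 prefers object 0
print(localize(scores, boxes))  # [(60, 10, 120, 90), (0, 0, 50, 80)]
```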

Empirical Results
Dataset details. The Flickr30k Entities dataset contains 224k phrases and 31k images in total, where each image is associated with 5 captions and multiple localized bounding boxes. We use 30k images from the training set for training and 1k images for validation. The test set consists of 1k images with 14,481 phrases. Our evaluation metric is the same as that of Plummer et al. (2015). We consider a prediction to be correct if the IoU (Intersection over Union) score between our predicted bounding box and the ground-truth box is larger than 0.5. Following Rohrbach et al. (2016), if there are multiple ground-truth boxes, we use their union region as a single ground-truth bounding box for evaluation.
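The evaluation rule can be sketched as below. One interpretation is assumed and should be checked against the reference evaluation code: the "union region" of several ground-truth boxes is taken to be the smallest axis-aligned box enclosing all of them. Boxes are (x1, y1, x2, y2) tuples, and all function names are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def area(box: Box) -> float:
    """Area of a box; degenerate or inverted boxes count as empty."""
    return max(0.0, box[2] - box[0]) * max(0.0, box[3] - box[1])


def iou(a: Box, b: Box) -> float:
    """Intersection over Union of two boxes."""
    overlap = (max(a[0], b[0]), max(a[1], b[1]),
               min(a[2], b[2]), min(a[3], b[3]))
    inter = area(overlap)
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0


def enclosing_box(boxes: List[Box]) -> Box:
    """Smallest box containing every ground-truth box (assumed merge rule)."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))


def is_correct(pred: Box, gts: List[Box], threshold: float = 0.5) -> bool:
    """A prediction counts only if IoU is strictly larger than the threshold."""
    return iou(pred, enclosing_box(gts)) > threshold


def accuracy(preds: List[Box], gts_per_phrase: List[List[Box]]) -> float:
    hits = sum(is_correct(p, g) for p, g in zip(preds, gts_per_phrase))
    return hits / len(preds)


preds = [(0, 0, 3, 2), (10, 10, 11, 11)]
gts = [[(0, 0, 2, 2), (2, 0, 4, 2)],  # merged into (0, 0, 4, 2)
       [(0, 0, 5, 5)]]
print(iou(preds[0], enclosing_box(gts[0])))  # 0.75
print(accuracy(preds, gts))                  # 0.5
```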
Weakly-supervised Results. We report our weakly-supervised results on the test split in Table 1. We include here upper bounds (UB), which are determined by the correct objects detected by the object detectors (if available). Our MAF with a ResNet-101-based Faster R-CNN detector pretrained on Visual Genome (VG) (Krishna et al., 2017) achieves an accuracy of 61.43%. This outperforms previous weakly-supervised methods by 22.72%, and it narrows the gap between weakly-supervised and supervised methods to 15%. We also implement MAF with a VGG-based Faster R-CNN feature extractor pretrained on PASCAL VOC 2007 (Everingham et al., 2010), following the setting in KAC (Chen et al., 2018), and we use the same bounding box proposals as our ResNet-based detector. We achieve an accuracy of 44.39%, which is 5.68% higher than existing methods, showing a solid improvement under the same backbone model.
Unsupervised Results. We report our unsupervised results for the phrase localization method (described in Section 3.2) in Table 2. For a fair comparison, we re-implemented Wang and Specia (2019) with a Faster R-CNN model trained on Visual Genome (Krishna et al., 2017). This achieves 49.72% accuracy (similar to the 50.49% reported in their paper). Overall, our result (with the VG detector) outperforms the previous best result by 5.56%, which demonstrates the effectiveness of our visually-aware language representations.
Table 2 notes: w2v-max refers to the similarity algorithm proposed in Wang and Specia (2019); Glove-att refers to our unsupervised inference strategy in Section 3.2; CC, OI, and PL stand for detectors trained on MS COCO (Lin et al., 2014), Open Images (Krasin et al., 2017), and Places (Zhou et al., 2017).
Ablation Experiments. In this section, we study the effectiveness of each component and learning strategy in MAF. The comparison of different feature representations is shown in Table 3.
Replacing the visual-attention-based TFR with an average-pooling-based one decreases the result from 61.43% to below 60%. For the VFR, using only the object label l_m or only the visual feature f_m decreases the accuracy by 4.20% and 2.94%, respectively. One interesting finding is that the model with all visual features (last row) performs worse than the model with only l_m and f_m. We infer that attributes provide little information for localization (24.08% accuracy if used alone), partly because attributes are not frequently used to differentiate objects in Flickr30k captions.
We then investigate the effects of different initialization methods for the two weight matrices, W_f and W_p. The results are presented in Table 4. Here, ZR means zero initialization, RD means random initialization with Xavier (Glorot and Bengio, 2010), and ID+RD means identity initialization with small random noise. We run each experiment five times with different random seeds and compute the variance. According to Table 4, the best combination is zero initialization for W_f and identity+random initialization for W_p. Two observations explain this: (i) for W_f, an RD initialization disturbs the feature from l_m (see Table 3); (ii) for W_p, an RD initialization disrupts the information from the attention mechanism, while ID+RD both ensures basic text/visual feature matching and introduces a small random noise for training.
(More unsupervised results are available in Appendix B.)
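The three initialization schemes (ZR, RD, ID+RD) can be written out as follows. This is a sketch under assumptions: RD uses the uniform variant of Xavier initialization, with bound sqrt(6 / (fan_in + fan_out)), and the noise scale for ID+RD is a hypothetical value, since the text only says the noise is small.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def zero_init(rows: int, cols: int) -> Matrix:
    """ZR: every entry starts at zero."""
    return [[0.0] * cols for _ in range(rows)]


def xavier_init(rows: int, cols: int, rng: random.Random) -> Matrix:
    """RD: Xavier/Glorot uniform initialization (uniform variant assumed)."""
    bound = math.sqrt(6.0 / (rows + cols))
    return [[rng.uniform(-bound, bound) for _ in range(cols)]
            for _ in range(rows)]


def identity_noise_init(dim: int, rng: random.Random,
                        noise_scale: float = 0.01) -> Matrix:
    """ID+RD: identity matrix plus small Gaussian noise.

    noise_scale is a hypothetical value chosen for illustration.
    """
    return [[(1.0 if i == j else 0.0) + rng.gauss(0.0, noise_scale)
             for j in range(dim)]
            for i in range(dim)]


rng = random.Random(0)          # fixed seed for reproducibility
W_f = zero_init(4, 6)           # zero start for the visual-feature projection
W_p = identity_noise_init(4, rng)  # near-identity start for the phrase projection
W_rd = xavier_init(4, 6, rng)   # random alternative for comparison
```

A zero W_f leaves the label-based part of the representation untouched at the start of training, and a near-identity W_p initially passes the attention-pooled phrase vector through almost unchanged.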

Conclusions
We present a Multimodal Alignment Framework, a novel method with fine-grained visual and textual representations for phrase localization, and we train it under a weakly-supervised setting, using a contrastive objective to guide the alignment between visual and textual representations. We evaluate our model on Flickr30k Entities and achieve substantial improvements over the previous state-of-the-art methods with both weakly-supervised and unsupervised training strategies. We also provide a detailed analysis to help future work investigate other critical feature enrichment and alignment methods for this task.

B Baselines
In Table 5, we report the results of different unsupervised methods:
• Random: Randomly localize to a detected object.
• Center-obj: Localize to the object which is closest to the center of the image, where we use an L1 distance D = |x − x_center| + |y − y_center|.
• Max-obj: Localize to the object with the maximal area.
• Whole Image: Always localize to the whole image.
• Direct Match: Localize with the direct match between object labels and words in the phrase, e.g., localize "a red apple" to the object with the label "apple." If multiple labels are matched, we choose the one with the largest bounding box.
• Glove-max: Consider every word-label similarity independently and select the object label with the highest semantic similarity with any word.
• Glove-avg: Represent a phrase using average pooling over GloVe word embeddings and select the object label with the highest semantic similarity with the phrase representation.
• Glove-att: Use our visual-attention-based phrase representation, as described in the Methodology (Section 3.1).
Note that in all label-based methods (Direct Match, the method of Wang and Specia (2019), and our unsupervised method), if multiple bounding boxes share the same label, we choose the largest one as the predicted box.
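Several of the baselines listed above are simple enough to sketch directly. The code below illustrates Center-obj, Max-obj, Glove-max, and Glove-avg under assumptions: (x, y) in the L1 distance is taken to be the box center, semantic similarity is taken to be cosine similarity, and embeddings are passed in as plain lists rather than loaded from GloVe. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)
Vector = Sequence[float]


def argmax(values: List[float]) -> int:
    return max(range(len(values)), key=lambda i: values[i])


def center_obj(boxes: List[Box], width: float, height: float) -> int:
    """Index of the box whose center is L1-closest to the image center."""
    cx, cy = width / 2.0, height / 2.0
    dists = [abs((b[0] + b[2]) / 2.0 - cx) + abs((b[1] + b[3]) / 2.0 - cy)
             for b in boxes]
    return argmax([-d for d in dists])


def max_obj(boxes: List[Box]) -> int:
    """Index of the box with the largest area."""
    return argmax([(b[2] - b[0]) * (b[3] - b[1]) for b in boxes])


def cosine(a: Vector, b: Vector) -> float:
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return sum(x * y for x, y in zip(a, b)) / norm if norm > 0 else 0.0


def glove_max(words: List[Vector], labels: List[Vector]) -> int:
    """Score each label by its best similarity with any single word."""
    return argmax([max(cosine(w, l) for w in words) for l in labels])


def glove_avg(words: List[Vector], labels: List[Vector]) -> int:
    """Score each label against the average of the word embeddings."""
    dim = len(words[0])
    avg = [sum(w[i] for w in words) / len(words) for i in range(dim)]
    return argmax([cosine(avg, l) for l in labels])


boxes = [(0, 0, 10, 10), (40, 40, 60, 60), (0, 0, 100, 50)]
print(center_obj(boxes, 100, 100), max_obj(boxes))  # 1 2

# Toy embeddings: the two strategies can disagree on the same input.
words = [[1.0, 0.0], [0.0, 1.0]]
labels = [[0.0, 1.0], [1.0, 1.0]]
print(glove_max(words, labels), glove_avg(words, labels))  # 0 1
```

The toy embeddings show how the two pooling strategies differ: Glove-max rewards a label that matches one word exactly, while Glove-avg rewards a label that is moderately close to all words.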

C Qualitative Analysis
To analyze our model qualitatively, we show some visualization results in Figure 4 and Figure 5. Figure 4 shows examples with consistent predictions between supervised and unsupervised models. In these cases, both methods successfully learn to localize various objects, including persons ("mother"), clothes ("shirt"), landscapes ("wave"), and numbers ("56"). Figure 5 shows examples where supervised and unsupervised methods localize to different objects. In the first image, they both localize the phrase "entrance" incorrectly. In the remaining three images, the supervised method learns to predict a tight bounding box on the correct object, while the unsupervised method localizes to other irrelevant objects. For example (bottom-left image in Figure 5), if the object detector fails to detect the "blanket," then the unsupervised method can never localize "green blanket" to the right object. Still, the supervised method can learn from negative examples and obtain more information.