Decoupled Box Proposal and Featurization with Ultrafine-Grained Semantic Labels Improve Image Captioning and Visual Question Answering

Object detection plays an important role in current solutions to vision and language tasks like image captioning and visual question answering. However, popular models like Faster R-CNN rely on a costly process of annotating ground-truths for both the bounding boxes and their corresponding semantic labels, which makes object detection less amenable as a primitive task for transfer learning. In this paper, we examine the effect of decoupling box proposal and featurization for downstream tasks. The key insight is that this allows us to leverage a large amount of labeled annotations that were previously unavailable for standard object detection benchmarks. Empirically, we demonstrate that this leads to effective transfer learning and improved image captioning and visual question answering models, as measured on publicly available benchmarks.


Introduction
Object detection has been employed extensively as a primitive task for vision and language tasks such as image captioning and visual question answering (VQA); see (Anderson et al., 2018) and the work that follows it. One motivation is that the ability to recognize salient regions and objects may be too difficult to learn from weakly-supervised top-down signals, in the form of captions and question-answer pairs. Indeed, bottom-up signals provided by object detection often correspond to semantic units of language such as words or phrases, making them suitable for text generation and image-text alignment.
However, object detection itself can be broken down into multiple subtasks (Liu et al., 2018). A family of "two-stage" object detectors first proposes category-agnostic bounding box candidates and then featurizes and classifies the cropped regions into one of the available semantic labels.

Figure 1: Ultrafine-grained semantic labels (at "instance level") provide transfer learning power to downstream tasks like visual question answering. Example 1. Q: "How much money is this?" FRCNN (VG) A: "1 dollar"; Ultra A: "20". Example 2. Q: "What is this?" FRCNN (VG) A: "beer"; Ultra A: "bbq sauce".
Even "one-stage" object detection approaches, where these boxes become category-specific, can be formulated in a bottom-up manner as detecting and grouping extreme and center points (Zhou et al., 2019).
Can we take advantage of this observation to learn to transfer more effectively? In this work, we take a step in this direction by examining the effect of decoupling box proposal and featurization on downstream vision and language tasks. In particular, we consider a two-stage object detector and set a goal of pushing the "featurization" aspect of the task further than before. Our choice to break free from "featurization by object detection models" has at least two advantages. First, there is a larger amount of labeled data that can be leveraged to train a better featurization module, even if such data do not support learning box proposals. To put it another way, the quality of features directly provided by object detectors is limited by the fact that annotating ground-truths for both the bounding boxes and their corresponding semantic labels is costly and scales poorly. By separating them, we reintroduce the freedom to annotate for object-agnostic box segmentation, without the burden of baking in annotation decisions related to the granularity level of the semantic labels (e.g., should the semantic label be "money", "euro", or "20 euro"?). As illustrated in Figure 1, the granularity level of the semantic labels plays a crucial role for downstream tasks such as VQA.
Second, this approach is better suited to downstream tasks whose domains are different from the one the object detector is trained on. In other words, it allows us to benefit from transfer learning, which is a great advantage given the relatively modest amount of available supervised data for these downstream vision and language tasks.
We empirically demonstrate the above-mentioned advantages through a focused study of the effect of improved featurization on image captioning and VQA in transfer learning settings. In particular, we (i) leverage ultrafine-grained semantic labels (e.g., "golden gate bridge" vs. "bridge") for featurization (Juan et al., 2019); and (ii) focus on scenarios in which object detection modules trained on Visual Genome (VG) (Krishna et al., 2017) are applied to out-of-domain images: image captioning on the Conceptual Captions dataset (Sharma et al., 2018), and VQA on the VizWiz dataset (Gurari et al., 2018). Our results indicate that there are ways to incorporate low-level pre-training tasks that benefit vision and language models via higher-quality bottom-up signals.

Related Work
Attention-based deep models are popular in image captioning and VQA. Early work used fixed partitions of images as candidate regions (Xu et al., 2015). However, variable-sized regions that are better correlated with object boundaries have gained momentum (Fu et al., 2017; Pedersoli et al., 2017; Anderson et al., 2018). Indeed, Anderson et al. (2018) established new state-of-the-art performance on both image captioning and VQA tasks on the MSCOCO and VQA2 benchmarks using a Faster R-CNN detector trained on Visual Genome. As both Visual Genome and VQA2 were built on images from MSCOCO, the object detector was applied largely to in-domain images. In contrast, our work focuses on more realistic settings in which domains of different tasks may not be perfectly aligned.
We leverage image representations extracted from a network pre-trained over large amounts of labeled data. Prior work demonstrated the power of pre-training with image classification at scale (Sun et al., 2017; Mahajan et al., 2018; Wu et al., 2019). However, we consider downstream vision and language tasks (image captioning and visual question answering), in contrast to the less complex vision-only tasks explored in such work: object detection and, in some cases, semantic segmentation and human pose estimation. Furthermore, our transfer learning technique is based on decoupled region proposal and ultrafine-grained featurization, not on fine-tuning the pre-trained network.
Another set of closely related work utilized additional data for scaling up either vision tasks (Hoffman et al., 2016; Tang et al., 2017; Redmon and Farhadi, 2017) or vision and language tasks (Venugopalan et al., 2017; Lu et al., 2018; Noh et al., 2019). For instance, YOLO9000 (Redmon and Farhadi, 2017) built a "WordTree" hierarchy based on the WordNet synsets (Miller et al., 1990), mapped categories in both COCO object detection and ImageNet classification datasets into the hierarchy, and proposed a joint detection and classification training framework. Our approach to transfer learning with ultrafine-grained featurization can similarly address the long-tail nature of target vocabulary (see Figure 2) while being simpler (e.g., it does not require carefully merging different sets of vocabulary as in YOLO9000). The number of classes we consider is also several orders of magnitude larger.
Incorporating object detection signals in downstream tasks appropriately is non-trivial and an active subject of research (Santoro et al., 2017). In this work, we ask the orthogonal question of whether it is necessary to accept the object detector's output as-is.

Features and Experimental Setup
Our starting point is a two-stage object detector, which consists of two core modules. One is responsible for category-agnostic box proposal, and the other for featurizing each cropped region for semantic label prediction. In this paper, we select Faster R-CNN (Ren et al., 2015b), a widely-used object detector in image captioning and VQA.
Faster R-CNN Model We reimplement the Faster R-CNN model, training it to predict both 1,600 object and 400 attribute labels in Visual Genome (Krishna et al., 2017), following the standard setting from Anderson et al. (2018). ResNet-101 (He et al., 2016) pre-trained on ImageNet (Russakovsky et al., 2015) is used as the core featurization network 1 . We achieve a mAP@50 of 10.96 for object detection and 1.5 for attribute detection. Given an image, Faster R-CNN proposes K bounding box regions, each of which comes with a D-dimensional feature vector as well as object/attribute class predictions (along with their scores). K is set to 100 and D to 2048 in our experiments. Using output features on the task of VQA and our model described in Section 5, we obtain an accuracy of 66.9% on the validation set of the VQA2 dataset (Goyal et al., 2017). For comparison, this number already surpasses all validation accuracy numbers in Table 2 for a strong model by , suggesting that our Faster R-CNN features are of high-quality.
Decoupled Box Proposal and Featurization with Ultrafine-grained Semantic Labels In standard use of object detectors following Anderson et al. (2018), downstream tasks receive "knowledge" merely about the 1,600 object classes and 400 attributes the detector was trained on. Here, we exploit the fact that box proposal and featurization can be decoupled, and work on improving the object representation (featurization).
More concretely, we conduct a study toward understanding the utility of improved featurization on downstream tasks. To this end, we exploit a graph-based, semi-supervised representation learning approach called Graph-Regularized Image Semantic Embedding (Graph-RISE) (Juan et al., 2019). Specifically, Graph-RISE is based on ResNet-101 where the 10x10x2K feature map is first average pooled to 4x4x2K, and then flattened and projected to a 64-dimensional embedding before the softmax layer. Learned from O(260M) web images and O(40M) (noisy) semantic labels, these compact 64-dimensional feature vectors are trained to capture a whole spectrum of semantic similarity, ranging from coarse-grained / category-level (e.g., "bridge"), through fine-grained level (e.g., "steel red bridge"), to ultrafine-grained / instance-level (e.g., "golden gate bridge").
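The pooling-and-projection step described above can be illustrated with a minimal sketch. It is not the Graph-RISE implementation: the pooling window (4x4 with stride 2, which maps 10x10 to 4x4) is an assumption, since the text only gives the input and output sizes, the projection weights are random placeholders, and the demo uses 8 channels instead of roughly 2K to keep it fast. The softmax classification layer used during training is omitted.

```python
import random

def average_pool(fmap, kernel=4, stride=2):
    """Average-pool an H x W x C feature map (nested lists) over space.

    kernel=4, stride=2 is an assumed setting that turns 10x10 into 4x4.
    """
    height, width, channels = len(fmap), len(fmap[0]), len(fmap[0][0])
    out_h = (height - kernel) // stride + 1
    out_w = (width - kernel) // stride + 1
    pooled = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            cell = []
            for c in range(channels):
                total = sum(
                    fmap[i * stride + di][j * stride + dj][c]
                    for di in range(kernel)
                    for dj in range(kernel)
                )
                cell.append(total / (kernel * kernel))
            row.append(cell)
        pooled.append(row)
    return pooled

def embed(fmap, weights):
    """Pool, flatten, and linearly project a feature map to an embedding."""
    pooled = average_pool(fmap)
    flat = [v for row in pooled for cell in row for v in cell]
    return [sum(w * x for w, x in zip(w_row, flat)) for w_row in weights]

# Demo with placeholder inputs: a 10x10x8 feature map and a 64 x (4*4*8)
# projection matrix (hypothetical values, not trained weights).
rng = random.Random(0)
channels, embed_dim = 8, 64
fmap = [[[rng.random() for _ in range(channels)] for _ in range(10)]
        for _ in range(10)]
weights = [[rng.gauss(0.0, 0.01) for _ in range(4 * 4 * channels)]
           for _ in range(embed_dim)]
embedding = embed(fmap, weights)
print(len(embedding))  # 64
```

With the full channel count, the flattened vector would have 4 x 4 x 2K entries before projection, which is why the 64-dimensional output is described as compact.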
Our Objective The main goal is to compare two approaches to using bottom-up signals: 1) FRCNN: use the default visual features from the Faster R-CNN detector; 2) Ultra: use bounding boxes from the Faster R-CNN detector, then featurize the cropped regions with Graph-RISE.
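The idea of keeping the box proposals fixed while swapping the featurizer can be sketched as follows. This is only a toy illustration under stated assumptions: the image is a small grid of numbers, the box list stands in for the output of a box proposal module, and `mean_intensity` and `min_max` are hypothetical stand-in featurizers, not Faster R-CNN or Graph-RISE.

```python
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]   # (top, left, bottom, right), end-exclusive
Image = List[List[float]]         # toy single-channel image

def crop(image: Image, box: Box) -> Image:
    """Cut the region covered by a box out of the image."""
    top, left, bottom, right = box
    return [row[left:right] for row in image[top:bottom]]

def featurize_boxes(image: Image,
                    boxes: List[Box],
                    featurizer: Callable[[Image], List[float]]) -> List[List[float]]:
    """Apply any featurizer to each proposed region, independent of the proposer."""
    return [featurizer(crop(image, box)) for box in boxes]

def mean_intensity(region: Image) -> List[float]:
    """Hypothetical featurizer A: a 1-dimensional feature."""
    values = [v for row in region for v in row]
    return [sum(values) / len(values)]

def min_max(region: Image) -> List[float]:
    """Hypothetical featurizer B: a 2-dimensional feature."""
    values = [v for row in region for v in row]
    return [min(values), max(values)]

# The same boxes (assumed to come from a box proposal module) are reused
# with two different featurizers.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
boxes = [(0, 0, 2, 2), (2, 2, 4, 4)]
print(featurize_boxes(image, boxes, mean_intensity))  # [[2.5], [12.5]]
print(featurize_boxes(image, boxes, min_max))         # [[0.0, 5.0], [10.0, 15.0]]
```

Because the featurizer is passed in as an argument, it can be trained on data that has no box annotations at all, which is the property the comparison relies on.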

Image Captioning
Dataset We use the Conceptual Captions (CC) dataset (Sharma et al., 2018), consisting of 3.3 million training and 15,000 validation image/caption pairs. Another 12,000 image/caption pairs comprise the hidden test set. Official scores on the test set are obtained by submitting models to the CC Challenge server. Unlike other image captioning datasets, images from CC are pulled from across the web and thus exhibit a wide variety of both images and image-caption styles. Most notably, the domain of images can be very different from Visual Genome, unlike in popular benchmarks such as MSCOCO (Lin et al., 2014).
Model We adopt the encoder-decoder model from Sharma et al. (2018), whose basic building block is a Transformer Network (Vaswani et al., 2017). To convert multi-modal inputs to a sequence of encoder feature vectors, we use up to three types of image features:
G: Global features, obtained by featurizing the whole image with Graph-RISE.
B: Box features, obtained by featurizing box regions with either Faster R-CNN (B-FRCNN) or Graph-RISE (B-Ultra).
L: Label embeddings, obtained by embedding predicted object semantic labels from Google Cloud Vision APIs into a 512D feature vector. These semantic labels are mapped to embeddings using an embedding layer pre-trained to predict label co-occurrences in web documents using a word2vec model (Mikolov et al., 2013).
For both B and L, we select the inputs with the highest scores and order the sequence inputs by score, from high to low. Additionally, for B, we remove box regions whose scores are lower than 0.001. We use beam search with width 5 for the decoder in all of our experiments.
Metrics We adopt the standard automatic metrics for image captioning: CIDEr (Vedantam et al., 2015), ROUGE-L (Lin and Och, 2004), and SPICE (Anderson et al., 2016), as implemented in the COCO-caption evaluation toolkit 5.
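The selection rule for the B and L inputs in the Model description (keep the highest-scoring inputs, order them from high to low, and drop boxes scoring below 0.001) can be sketched as below. The function name, the `(name, score)` input format, and the `max_inputs` cap are hypothetical; the text does not say how many inputs are kept.

```python
from typing import List, Optional, Tuple

def select_inputs(scored: List[Tuple[str, float]],
                  max_inputs: int,
                  min_score: Optional[float] = None) -> List[Tuple[str, float]]:
    """Keep the top-scoring inputs, sorted from high to low score.

    min_score, when given, removes inputs whose score is below it
    (used for boxes only, with a threshold of 0.001).
    """
    kept = [item for item in scored
            if min_score is None or item[1] >= min_score]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept[:max_inputs]

# Made-up box scores for illustration.
boxes = [("box_a", 0.9), ("box_b", 0.0005), ("box_c", 0.4), ("box_d", 0.75)]
print(select_inputs(boxes, max_inputs=10, min_score=0.001))
# [('box_a', 0.9), ('box_d', 0.75), ('box_c', 0.4)]
print(select_inputs(boxes, max_inputs=2, min_score=0.001))
# [('box_a', 0.9), ('box_d', 0.75)]
```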

Results
We report results on both the dev and test sets for Conceptual Captions in Table 1. "Base" uses the G feature only. We first compare the Base G against each of the other feature types (B-FRCNN, B-Ultra, and L). We then perform ablations by adding B (FRCNN or Ultra) to the Base G model or to the stronger G + L model. According to dev CIDEr scores, the Graph-RISE features, whether global (G) or box (B-Ultra), are individually clearly stronger than the Faster R-CNN box features (B-FRCNN) or the label embeddings (L). Nevertheless, these features are highly complementary. Specifically, the box features B-Ultra complement the Base G, pushing the score from 0.868 to 0.912. It is also worth noting that, despite their low individual scores, B-FRCNN and L improve each model they are added to.
Our models with Ultra features clearly outperform the ones with FRCNN. This is demonstrated in three conditions: when they are on their own, when they are added to the simple G model, and when they are added to the stronger G + L model. Manual inspection of the models' predictions further supports this; a qualitative comparison of B-Ultra vs. B-FRCNN in Figure 2 suggests that ultrafine-grained featurization leads to an improved correspondence between visual inputs and caption tokens of unfamiliar objects (such as "monks" and "staircase").
5 https://github.com/tylin/coco-caption
To get test scores, we submit our best model using FRCNN and our best model using Ultra (based on dev CIDEr) to the CC Challenge server. Test scores for other models were not obtained due to the limited number of submissions per time period. As of August 30, 2019, the G + B-Ultra + L model outperforms all other single baselines on both CIDEr and SPICE, and ties on ROUGE-L.

Visual Question Answering
Dataset We use the recently proposed VizWiz dataset (Gurari et al., 2018), in which both images and questions originate from visually-impaired or blind people. It consists of 20,000/3,173 (image, question, answers) triplets in the train/val splits, and an additional 8,000 triplets for the test split. Each question is independently annotated with 10 answers. We choose the VizWiz benchmark specifically because it is a more suitable benchmark for measuring transfer learning effects. Other VQA datasets, including VQA1.0 (Antol et al., 2015), VQA2.0 (Goyal et al., 2017), Visual7W (Zhu et al., 2016), COCOQA (Ren et al., 2015a), and GQA (Hudson and Manning, 2019), are completely or partly based on MSCOCO or Visual Genome. As such, they may not provide unbiased grounds for measuring the impact of object-detection features based on Visual Genome versus alternative featurization techniques.
Model We follow the setting described in Pythia v0.1 (Jiang et al., 2018), the winning entry to the VQA challenge 2018. In particular, the architecture is a simplified "up-down" model from Anderson et al. (2018). The featurization of the bounding boxes follows the description from Section 4. For the base condition, we use the box features based on Faster R-CNN (B-FRCNN), following the majority of previous work. For the test condition, we replace them with the Ultra-based features (B-Ultra).
[Figure 2: example captions from the ground truth and from the models using Box FRCNN (VG) and Box Ultra features.]
Example 1. Ground-truth: "monks clean a garden at a temple ." Box FRCNN (VG): "a woman walks through the streets ." Box Ultra: "monks walking in front of a temple"
Example 2. Ground-truth: "black sesame seeds on a white background" Box FRCNN (VG): "a pile of dried flowers" Box Ultra: "black chia seeds on a white background"
Example 3. Ground-truth: "a photo of a staircase inside a historic house" Box FRCNN (VG): "the interior of the church" Box Ultra: "the staircase of the house"
Example 4. Ground-truth: "car & tree ornament this heart of mine" Box FRCNN (VG): "digital art selected for the #" Box Ultra: "christmas tree in a toy car"
Table 2: Accuracy (%) on the test-standard split for the VQA task on the VizWiz dataset. Additionally, we provide accuracy per answer type: yes/no (y/n), number (num), unanswerable (unans), and the rest (other). The baselines include VizWiz (Gurari et al., 2018) and BAN (Kim et al., 2018).
Metrics As commonly done in previous work (Antol et al., 2015), we use as our accuracy metric the average score over all 9-answer subsets of the 10 ground-truth answers, where each score is computed by the formula: min(# humans that provided that answer / 3, 1). Accuracy on the test-dev and test-standard splits is obtained by submitting the models to the VizWiz Challenge server 8.
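The accuracy formula under Metrics can be made concrete with a short sketch. It assumes the usual reading of the metric, in which the score min(# matching answers / 3, 1) is computed on every subset of 9 of the 10 human answers and then averaged, and it compares answers by exact string match; official evaluation scripts may normalize answers first. The function name and the example answers are made up for illustration.

```python
from itertools import combinations
from typing import List

def vqa_accuracy(predicted: str, human_answers: List[str]) -> float:
    """Average of min(#matches / 3, 1) over all leave-one-out answer subsets."""
    scores = []
    for subset in combinations(human_answers, len(human_answers) - 1):
        matches = sum(1 for answer in subset if answer == predicted)
        scores.append(min(matches / 3, 1.0))
    return sum(scores) / len(scores)

# Ten made-up human answers for one question.
humans = ["20"] * 3 + ["1 dollar"] * 2 + ["unanswerable"] * 5
print(round(vqa_accuracy("20", humans), 3))            # 0.9
print(round(vqa_accuracy("1 dollar", humans), 3))      # 0.6
print(round(vqa_accuracy("unanswerable", humans), 3))  # 1.0
print(round(vqa_accuracy("beer", humans), 3))          # 0.0
```

An answer given by exactly three annotators scores 0.9 rather than 1.0 because three of the ten subsets leave out one of the matching annotators.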

Results
We report results on the VizWiz benchmark in Table 2. Our model with FRCNN provides a strong baseline, slightly outperforming the previous-best model, BAN (Kim et al., 2018), a different architecture that also uses the FRCNN-based features for object bounding boxes. The model using Ultra features further improves upon this; at 53.7%, it outperforms the one using FRCNN by a significant margin (1.8% accuracy on "all" question types). Moreover, this 1.8% improvement is a weighted average across answer types; the per-answer-type numbers indicate that our approach achieves even larger improvements on two of the more difficult answer types, "number" (+4.5%) and "rest" (+3.3%). These improvements are illustrated by the examples provided in Figure 1.
This illustrates the effectiveness of decoupling bounding box proposal and featurization, and quantifies the impact of using transfer learning via large amounts of training data and ultrafine-grained semantic labels used for object representations.
8 evalai.cloudcv.org/web/challenges/challengepage/102/overview

Conclusion
In this work, we propose to (re)decouple box proposal and featurization. We show that this allows us to leverage additional signals and annotations, leading to more effective transfer learning for downstream vision and language tasks: image captioning and visual question answering. This result suggests that large-scale datasets with fine-grained image-level semantic labels, even when they do not dissect complex visual scenes, can benefit current state-of-the-art models, especially when applied to benchmarks where images are from diverse domains.