Discriminative Learning of Open-Vocabulary Object Retrieval and Localization by Negative Phrase Augmentation

Thanks to the success of object detection technology, we can retrieve objects of the specified classes even from huge image collections. However, the current state-of-the-art object detectors (such as Faster R-CNN) can only handle pre-specified classes. In addition, large amounts of positive and negative visual samples are required for training. In this paper, we address the problem of open-vocabulary object retrieval and localization, where the target object is specified by a textual query (e.g., a word or phrase). We first propose Query-Adaptive R-CNN, a simple extension of Faster R-CNN adapted to open-vocabulary queries, by transforming the text embedding vector into an object classifier and localization regressor. Then, for discriminative training, we propose negative phrase augmentation (NPA) to mine hard negative samples that are visually similar to the query and at the same time semantically mutually exclusive with the query. The proposed method can retrieve and localize objects specified by a textual query from one million images in only 0.5 seconds with high precision.


Introduction
Our goal is to retrieve objects from a large-scale image database and localize their spatial locations given a textual query. The task of object retrieval and localization has many applications, such as spatial position-aware image searches (Hinami et al., 2017), and it has recently gathered much attention from researchers. While much of the previous work mainly focused on object instance retrieval, wherein the query is an image (Shen et al., 2012; Tao et al., 2014; Tolias et al., 2016), recent approaches (Aytar and Zisserman, 2014; Hinami and Satoh, 2016) enable retrieval of more generic concepts such as an object category. Although such approaches are built on the recent successes of object detection, including that of R-CNN, object detection methods can generally handle only closed sets of categories (e.g., the PASCAL 20 classes), which severely limits the variety of queries when they are used as retrieval systems. Open-vocabulary object localization is also a hot topic and many approaches have been proposed to solve this problem (Plummer et al., 2015; Chen et al., 2017). However, most of them are not scalable enough to be useful for large-scale retrieval.
We first describe Query-Adaptive R-CNN as an extension of the Faster R-CNN (Ren et al., 2015) object detection framework to open-vocabulary object detection simply by adding a component called a detector generator. While Faster R-CNN learns the class-specific linear classifier as learnable parameters of the neural network, we generate the weight of the classifier adaptively from text descriptions by learning the detector generator (Fig. 2b). All of its components can be trained in an end-to-end manner. In spite of its simple architecture, it outperforms all state-of-the-art methods in the Flickr30k Entities phrase localization task. It can also be used for large-scale retrieval in the manner presented in (Hinami and Satoh, 2016).
However, training a discriminative classifier is harder in the open-vocabulary setting. Closed-vocabulary object detection models such as Faster R-CNN are trained using many negative examples, where a sufficient amount of good-quality negative examples is shown to be important for learning a discriminative classifier (Felzenszwalb et al., 2010; Shrivastava et al., 2016). While closed-vocabulary object detection can use all regions without positive labels as negative data, in open-vocabulary detection, it is not guaranteed that a region without a positive label is negative. For example, as shown in Fig. 1b, a region with the annotation a man is not always negative for skier. Since training data for open-vocabulary object detection is generally composed of images, each having region annotations with free descriptions, it is nearly impossible to do an exhaustive annotation throughout the dataset for all possible descriptions. Another possible approach is to use the regions without positive labels in the image that contains positive examples, as shown in Fig. 1c. Although such regions can be guaranteed to be negative by carefully annotating the datasets, negative examples are then limited to the objects that co-occur with the learned class.
To exploit negative data in open-vocabulary object detection, we use mutually exclusive relationships between categories. For example, an object with the label dog is guaranteed to be negative for the cat class because dog and cat are mutually exclusive. In addition, we propose an approach to select hard negative phrases that are difficult to discriminate (e.g., selecting zebra for horse). This approach, called negative phrase augmentation (NPA), significantly improves the discriminative ability of the classifier and boosts retrieval performance by a large margin.
Our contributions are as follows. 1) We propose Query-Adaptive R-CNN, an extension of Faster R-CNN to open vocabulary, which is a simple yet strong method for open-vocabulary object detection that outperforms all state-of-the-art methods in the phrase localization task. 2) We propose negative phrase augmentation (NPA) to exploit hard negative examples when training for open-vocabulary object detection, which makes the classifier more discriminative and robust to distractors in retrieval. Our method can accurately find objects among one million images in 0.5 seconds.

Related work
Phrase localization. Object grounding with natural language descriptions has recently drawn much attention, and several tasks and approaches have been proposed for it (Guadarrama et al., 2014; Kazemzadeh et al., 2014; Mao et al., 2016; Plummer et al., 2015). The task most related to ours is phrase localization, introduced by Plummer et al. (2015), whose goal is to localize the objects in an image that correspond to noun phrases in textual descriptions. Chen et al. (2017) is the closest to our work in terms of learning region proposals and performing regression conditioned upon a query. However, most phrase localization methods are not scalable and cannot be used for retrieval tasks. Some approaches (Plummer et al., 2017b; Wang et al., 2016a) learn a common subspace between text and images for phrase localization. Instead of learning the subspace between an image and a sentence as in standard cross-modal searches, they learn the subspace between a region and a phrase. In particular, Wang et al. (2016a) use a deep neural network to learn the joint embedding of images and text; their training uses structure-preserving constraints based on structured matching. Although these approaches can be used for large-scale retrieval, their accuracy is not as good as that of recent state-of-the-art methods.
Object retrieval and localization. Object retrieval and localization have been researched in the context of particular object retrieval (Shen et al., 2012;Tao et al., 2014;Tolias et al., 2016), where a query is given as an image.
Aytar et al. (Aytar and Zisserman, 2014) proposed retrieval and localization of generic category objects by extending the object detection technique to large-scale retrieval.
Hinami and Satoh (2016) extended R-CNN to large-scale retrieval by using approximate nearest neighbor search techniques. However, they assumed that the detector of the category is given as a query, and many sample images with bounding box annotations are required to learn the detector. Several other approaches have used external search engines (e.g., Google image search) to obtain training images from textual queries (Arandjelović et al., 2012; Chatfield et al., 2015). Instead, we generate an object detector directly from the given textual query by using a neural network.
Parameter prediction by neural network. Query-Adaptive R-CNN generates the weights of the detector from the query instead of learning them by backpropagation. The dynamic filter network (De Brabandere et al., 2016) is one of the first methods that generate neural network parameters dynamically conditioned on an input. Several subsequent approaches use this idea in zero-shot learning (Ba et al., 2016) and visual question answering (Noh et al., 2016). Zhang et al. (2017) integrate this idea into the Fast R-CNN framework by dynamically generating the classifier from text in a similar manner to (Ba et al., 2016). We extend this work to the case of large-scale retrieval. The proposed Query-Adaptive R-CNN generates the regressor weights and learns the region proposal network following Faster R-CNN. This enables precise localization with fewer proposals, which makes the retrieval system more memory efficient. In addition, we propose a novel hard negative mining approach, called negative phrase augmentation, which makes the generated classifier more discriminative.

Query-Adaptive R-CNN
Query-Adaptive R-CNN is a simple extension of Faster R-CNN to open-vocabulary object detection. While Faster R-CNN detects objects of fixed categories, Query-Adaptive R-CNN detects any object specified by a textual phrase. Figure 2 illustrates the difference between the two: while Faster R-CNN learns a class-specific classifier and regressor as parameters of the neural network, Query-Adaptive R-CNN generates them adaptively from the query text by learning a detector generator that transforms the text into a classifier and regressor. Query-Adaptive R-CNN is a simple but effective method that surpasses state-of-the-art phrase localization methods and can be easily extended to large-scale retrieval. Furthermore, its retrieval accuracy is significantly improved by a novel training strategy called negative phrase augmentation (Sec. 3.2).

Architecture
The network is composed of two subnetworks: a region feature extractor and a detector generator, both of which are trained in an end-to-end manner. The region feature extractor takes an image as input and outputs features extracted from sub-regions that are candidate objects. Following Faster R-CNN (Ren et al., 2015), regions are detected using a region proposal network (RPN) and the features of the last layer (e.g., fc7 in the VGG network) are used as region features. The detector generator takes a text description as input and outputs a linear classifier and regressor for the description (e.g., if a dog is given, a dog classifier and regressor are output). Finally, a confidence and a regressed bounding box are predicted for each region by applying the classifier and regressor to the region features.
Detector generator. The detector generator transforms the given text t into a classifier w_c and a regressor (w_x^r, w_y^r, w_w^r, w_h^r), where w_c is the weight of a linear classifier and (w_x^r, w_y^r, w_w^r, w_h^r) are the weights of a linear regressor in terms of x, y, width w, and height h. We first transform a text t of variable length into a text embedding vector v. Other phrase localization approaches use the Fisher vector encoding of word2vec (Klein et al., 2015; Plummer et al., 2015) or long short-term memory (LSTM) (Chen et al., 2017) for the phrase embedding. However, we found that simple mean pooling of word2vec (Mikolov et al., 2013) performs better than these methods for our model (comparisons given in the supplemental material). The text embedding is then transformed into a detector. Here, we use a linear transformation for G_c (i.e., w_c = Wv, where W is a projection matrix). For the regressor, we use a multi-layer perceptron with one hidden layer to predict each of (w_x^r, w_y^r, w_w^r, w_h^r) = G_r(v). We tested various architectures for G_r and found that sharing the hidden layer and reducing the dimension of the hidden layer (down to 16 units) does not adversely affect the performance, while at the same time it significantly reduces the number of parameters (see Sec. 5.2 for details).
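The detector generator described above can be sketched in a few lines of NumPy. This is an illustrative toy, not the trained model: all parameter matrices and word vectors below are random stand-ins (in the paper they are learned end-to-end), and only the dimensions follow the text (300-d word2vec embeddings, 4096-d fc7 region features, a 16-unit shared hidden layer for G_r).

```python
import numpy as np

EMB_DIM = 300    # word2vec dimensionality
FEAT_DIM = 4096  # region feature dimensionality (fc7 of VGG16)
HID_DIM = 16     # shared hidden layer of the regressor MLP G_r

rng = np.random.default_rng(0)

# Hypothetical parameters; in the paper these are learned by backpropagation.
W_cls = rng.standard_normal((FEAT_DIM, EMB_DIM)) * 0.01     # G_c: linear map
W_h1 = rng.standard_normal((HID_DIM, EMB_DIM)) * 0.01       # G_r: shared hidden layer
W_out = rng.standard_normal((4, FEAT_DIM, HID_DIM)) * 0.01  # one head per (x, y, w, h)

def embed_phrase(phrase, word_vectors):
    """Mean-pool the word2vec vectors of the words in the phrase."""
    vecs = [word_vectors[w] for w in phrase.split() if w in word_vectors]
    return np.mean(vecs, axis=0)

def generate_detector(v):
    """Transform a 300-d text embedding into a classifier and a regressor."""
    w_c = W_cls @ v                # linear classifier weights (4096-d)
    h = np.maximum(W_h1 @ v, 0.0)  # shared ReLU hidden layer
    w_r = W_out @ h                # four regressor weight vectors, each 4096-d
    return w_c, w_r

# Toy word vectors (random stand-ins for a pretrained word2vec model).
vocab = {w: rng.standard_normal(EMB_DIM) for w in ["a", "running", "dog"]}
v = embed_phrase("a running dog", vocab)
w_c, w_r = generate_detector(v)

# Applying the generated detector to one region feature.
region_feature = rng.standard_normal(FEAT_DIM)
score = float(w_c @ region_feature)  # classification score for this region
offsets = w_r @ region_feature       # (dx, dy, dw, dh) regression outputs
```

Because the query-dependent part reduces to a dot product against region features, the same generated `w_c` can later be used as a query vector in a large-scale index.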

Training with Negative Phrase Augmentation
All components of Query-Adaptive R-CNN can be jointly trained in an end-to-end manner. The training strategy basically follows that of Faster R-CNN. The differences are shown in Figure 3. Faster R-CNN is trained with a fixed closed set of categories (Fig. 3a), where all regions without a positive label can be used as negative examples. On the other hand, Query-Adaptive R-CNN is trained using the open-vocabulary phrases annotated to regions (Fig. 3b), where sufficient negative examples cannot be used for each phrase compared to Faster R-CNN, because a region without a positive label is not guaranteed to be negative in open-vocabulary object detection. We solve this problem by proposing negative phrase augmentation (NPA), which enables us to use good-quality negative examples by using linguistic relationships (e.g., mutual exclusiveness) and the confusion between categories (Fig. 3c). It significantly improves the discriminative ability of the generated classifiers.

Basic Training
First, we describe the basic training strategy without NPA (Fig. 3b). Training Query-Adaptive R-CNN requires phrases and their corresponding bounding boxes to be annotated. For the i-th image (we use one image as a minibatch), let us assume that C_i phrases are associated with the image. Training labels are assigned to the region proposals generated by the RPN (the dotted rectangles in Fig. 3b): a positive label is assigned if the box overlaps the ground truth box by more than 0.5 in IoU, and negative labels are assigned to the other RoIs under the assumption that all positive objects of the C_i classes are annotated (i.e., regions without annotations are negative within the image). We then compute the classification loss by using the training labels and classification scores. The loss in terms of the RPN and bounding box regression is computed in the same way as in Faster R-CNN (Ren et al., 2015).
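The label assignment described above can be sketched as follows. `iou` and `assign_labels` are hypothetical helper names for this illustration; boxes are plain (x1, y1, x2, y2) tuples and the 0.5 IoU threshold matches the text.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1 = max(box_a[0], box_b[0]); y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2]); y2 = min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def assign_labels(proposals, gt_boxes_per_phrase, thresh=0.5):
    """For each annotated phrase, mark a proposal positive (1) if it overlaps
    any ground-truth box of that phrase by more than `thresh` IoU; all other
    proposals are treated as negatives (0) for that phrase, under the
    assumption that all positives of the C_i classes are annotated."""
    labels = {}
    for phrase, gt_boxes in gt_boxes_per_phrase.items():
        labels[phrase] = [
            1 if any(iou(p, g) > thresh for g in gt_boxes) else 0
            for p in proposals
        ]
    return labels

# Toy example: three RPN proposals, one annotated phrase.
proposals = [(0, 0, 10, 10), (0, 0, 5, 5), (20, 20, 30, 30)]
labels = assign_labels(proposals, {"a dog": [(0, 0, 10, 10)]})
```

In this toy example only the first proposal exceeds 0.5 IoU with the ground truth, so the phrase "a dog" gets labels [1, 0, 0].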

Negative Phrase Augmentation
Here, we address the difficulty of using negative examples in the training of open-vocabulary object detection. As shown in Fig. 1b, our generated classifier is not discriminative enough. The reason is the scarcity of negative examples when using the training strategy described in Sec. 3.2.1; e.g., the horse classifier is not learned with zebra as a negative example except in the rare case that both a zebra and a horse appear in the same image. Using hard negative examples has proven to be effective in object detection for training a discriminative detector (Felzenszwalb et al., 2010; Shrivastava et al., 2016). However, adding negative examples is usually not easy in the open-vocabulary setting, because it is not guaranteed that a region without a positive label is negative. For example, an object with the label man is not a negative of person even though person is not annotated. There are an infinite number of categories in open-vocabulary settings, which makes it difficult to exhaustively annotate all categories throughout the dataset. How can we exploit hard examples that are guaranteed to be negative? We can make use of the mutually exclusive relationships between categories: e.g., an object with a dog label is negative for cat because dog and cat are mutually exclusive. There are two ways to add negatives to a minibatch: adding negative images (regions) or adding negative phrases. Adding negative phrases (as in Fig. 3c) is generally better because it involves a much smaller additional training cost than adding images in terms of both computational cost and GPU memory usage. In addition, to improve the discriminative ability of the classifier, we select only hard negative phrases by mining the confusing categories. This approach, called negative phrase augmentation (NPA), is a generic way of exploiting hard negative examples in open-vocabulary object detection and leads to large improvements in accuracy, as we show in Sec. 5.3.
Confusion table. We create a confusion table that associates a category with its hard negative categories, from which negative phrases are picked as illustrated in Fig. 3c. To create the entry for category c, we first generate the candidate list of hard negative categories by retrieving the top 500 scored objects from all objects in the validation set of Visual Genome (Krishna et al., 2016) (using c as a query). After that, we remove the categories that are not mutually exclusive with c from the list. Finally, we aggregate the list by category and assign a weight to each category. A registered entry looks like dog:{cat:0.5, horse:0.3, cow:0.2}. The weight corresponds to the probability of selecting the category in NPA, which is computed based on the number of appearances and the ranks in the candidate list. Removal of mutually non-exclusive phrases. To remove non-mutually exclusive phrases from the confusion table, we use two approaches that estimate whether two categories are mutually exclusive or not. 1) The first approach uses the WordNet hierarchy: if two categories have a parent-child relationship in WordNet (Miller, 1995), they are not mutually exclusive. However, the converse is not necessarily true; e.g., man and skier are not mutually exclusive but do not have a parent-child relationship in the WordNet hierarchy. 2) As an alternative approach, we propose to use the Visual Genome annotations: if two categories frequently co-occur in the Visual Genome dataset (Krishna et al., 2016), they are considered to be not mutually exclusive. These two approaches are complementary, and they improve detection performance by removing the mutually non-exclusive words (see Sec. 5.3).
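A rough sketch of how one confusion-table entry might be built. The two knowledge sources are replaced by toy stand-ins here (a small hypernym map in place of WordNet and raw co-occurrence counts in place of Visual Genome annotations); all names, counts, and weights below are illustrative assumptions, not the paper's data.

```python
from collections import defaultdict

# Toy stand-ins for the two knowledge sources.
HYPERNYMS = {"dog": "animal", "cat": "animal", "man": "person", "skier": "person"}
CO_OCCUR = {frozenset(["skier", "man"]): 20}   # objects annotated with both labels
OBJ_COUNT = {"dog": 5000, "cat": 4000, "man": 8000, "skier": 1000}

def wordnet_exclusive(a, b):
    """Heuristic 1: categories in a parent-child relationship are not exclusive."""
    return HYPERNYMS.get(a) != b and HYPERNYMS.get(b) != a

def cooccur_exclusive(a, b, ratio=0.01):
    """Heuristic 2: categories co-occurring on more than `ratio` of the objects
    of either category are considered not mutually exclusive."""
    n = CO_OCCUR.get(frozenset([a, b]), 0)
    return n <= ratio * min(OBJ_COUNT[a], OBJ_COUNT[b])

def build_entry(query, ranked_negatives, top_k=500):
    """Build one confusion-table entry: weight each surviving category by
    (top_k - rank) summed over its appearances, normalized to sum to one."""
    scores = defaultdict(float)
    for rank, cat in enumerate(ranked_negatives[:top_k]):
        if cat != query and wordnet_exclusive(query, cat) and cooccur_exclusive(query, cat):
            scores[cat] += top_k - rank
    total = sum(scores.values())
    return {c: s / total for c, s in scores.items()}

# Example: dog's candidate list (would come from scoring the validation set).
entry = build_entry("dog", ["cat", "man", "dog", "cat"])
```

With these toy counts, skier/man co-occur on 2% of skier objects, so the co-occurrence heuristic correctly marks them as not mutually exclusive, matching the example in the text.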
The training pipeline with NPA is as follows: (1) Update the confusion table: The confusion table is updated periodically (after every 10k iterations in our study). Entries are created for categories that appear frequently in the preceding 10k batches (or in the whole training set if the dataset is not large).
(2) Add hard negative phrases: Negative phrases are added for each of the C_i phrases in a minibatch. We replace the name of the category in each phrase with its hard negative category (e.g., a running woman is generated for a running man), where the category name is obtained by extracting nouns. A negative phrase is randomly selected from the confusion table on the basis of the assigned probability; the weight of each category is computed as the sum of (500 minus the rank) over all of its ranked appearances in the candidate lists, normalized over all categories to sum to one. For the mutual exclusiveness test, we set the co-occurrence ratio threshold at 1% of the objects in either category; for example, if there are 1,000 objects with the skier label and 20 of those objects are also annotated with man (20/1000 = 2%), we consider that skier and man are not mutually exclusive.
(3) Add losses: As illustrated in Fig. 3c, we only add negative labels to the regions to which a positive label is assigned for the original phrase. The classification loss is computed only for these regions and is added to the original loss.
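Step (2) above can be sketched as follows, with a hypothetical two-entry confusion table; in the real pipeline the table is built as described earlier and the nouns would be obtained by part-of-speech tagging rather than supplied by hand.

```python
import random

# Hypothetical confusion-table entries (category -> {hard negative: probability}).
CONFUSION = {
    "man": {"woman": 0.6, "boy": 0.4},
    "horse": {"zebra": 0.7, "cow": 0.3},
}

def augment_phrase(phrase, nouns, rng=None):
    """Replace the category noun in `phrase` with a hard negative category
    sampled from the confusion table according to the stored probabilities.
    Returns None if no noun in the phrase has a table entry. The generated
    phrase is used only as a negative label on the regions that are positive
    for the original phrase (step (3))."""
    rng = rng or random.Random(0)
    for noun in nouns:
        if noun in CONFUSION:
            cats, probs = zip(*CONFUSION[noun].items())
            negative = rng.choices(cats, weights=probs)[0]
            return phrase.replace(noun, negative)
    return None

neg = augment_phrase("a running man", nouns=["man"])   # e.g., "a running woman"
```

Sampling by the stored probabilities means frequently confused categories (e.g., zebra for horse) are selected most often as hard negatives.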

Large-Scale Object Retrieval
Query-Adaptive R-CNN can be used for large-scale object retrieval and localization because it can be decomposed into a query-independent part and a query-dependent part, i.e., the region feature extractor and the detector generator. We follow the approach used in large-scale R-CNN (Hinami and Satoh, 2016), but we overcome its two critical drawbacks. First, large-scale R-CNN can only predict boxes included in the region proposals; these are detected offline, before the query is known; therefore, to obtain high recall, a large number of object proposals must be used, which is memory inefficient. Instead, we generate a regressor as well as a classifier, which enables more accurate localization with fewer proposals. Second, large-scale R-CNN assumes that a classifier is given as a query, and learning a classifier requires many samples with bounding box annotations. We generate the classifier directly from a text query by using the detector generator of Query-Adaptive R-CNN. The resulting system is able to retrieve and localize objects from a database with one million images in less than one second. Database indexing. For each image in the database, the region feature extractor extracts region proposals and the corresponding features. We create an index for the region features in order to speed up the search. For this, we use the IVFADC system (Jégou et al., 2011) in the manner described in (Hinami and Satoh, 2016).
Searching. Given a text query, the detector generator generates a linear classifier and bounding box regressor. The regions with high classification scores are then retrieved from the database by making an IVFADC-based search. Finally, the regressor is applied to the retrieved regions to obtain the accurately localized bounding boxes.
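The query-time path can be sketched with a brute-force stand-in for the index. This is a toy: the database features are random, the generated classifier and regressor are random vectors, and the exhaustive dot product below would be replaced by an IVFADC-based approximate search in the actual system.

```python
import numpy as np

rng = np.random.default_rng(0)
FEAT_DIM = 64  # reduced for the sketch; fc7 region features are 4096-d

# Offline index: region features for the whole database (random stand-ins).
db_features = rng.standard_normal((10000, FEAT_DIM)).astype(np.float32)

def search(w_c, w_r, k=5):
    """Score every region with the generated classifier, keep the top k,
    then refine each retrieved box with the generated regressor.
    A real system replaces the exhaustive dot product with IVFADC."""
    scores = db_features @ w_c
    top = np.argsort(-scores)[:k]        # indices of the k highest scores
    offsets = db_features[top] @ w_r.T   # (k, 4) box refinements (dx, dy, dw, dh)
    return top, scores[top], offsets

# Stand-ins for the output of the detector generator.
w_c = rng.standard_normal(FEAT_DIM).astype(np.float32)
w_r = rng.standard_normal((4, FEAT_DIM)).astype(np.float32)
ids, scores, offsets = search(w_c, w_r)
```

Note that the regressor is applied only to the k retrieved regions, so its cost is negligible at query time regardless of database size.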

Experimental Setup
Model: Query-Adaptive R-CNN is based on VGG16 (Simonyan and Zisserman, 2015), as in other work on phrase localization. We first initialized the weights of the VGG and RPN by using Faster R-CNN trained on Microsoft COCO (Lin et al., 2014); the weights were then fine-tuned on each evaluation dataset. For training on Flickr30k Entities, we first pretrained the model on the Visual Genome dataset using the object name annotations. We used Adam (Kingma and Ba, 2015) with a learning rate starting from 1e-5 and ran it for 200k iterations.
Tasks and datasets: We evaluated our approaches on two tasks: phrase localization and open-vocabulary object detection and retrieval. The phrase localization task was performed on the Flickr30k Entities dataset (Plummer et al., 2015). Given an image and a sentence that describes the image, the task is to localize the regions that correspond to the noun phrases in the sentence. The Flickr30k Entities dataset contains 44,518 unique phrases, where each phrase has 1-8 words (2.1 words on average). We followed the evaluation protocol of (Plummer et al., 2015). We did not use Flickr30k Entities for the retrieval task because the dataset is not exhaustively annotated (e.g., not all men appearing in the dataset are annotated with man), which makes it difficult to evaluate with a retrieval metric such as AP, as discussed in Plummer et al. (2017b). Although we cannot evaluate the retrieval performance directly on the phrase localization task, we can make comparisons with other approaches and show that our method can handle a wide variety of phrases.
The open-vocabulary object detection and retrieval task was evaluated in the same way as the standard object detection task. The difference is the assumption that we do not know the target category at training time in open-vocabulary settings; i.e., the method is not tuned to a specific category, unlike in the standard object detection task. We used the Visual Genome dataset (Krishna et al., 2016) and selected the 100 most frequent object categories as queries from among its roughly 100k categories. We split the dataset into training, validation, and test sets. We also evaluated our approaches on the PASCAL VOC 2007 dataset, a widely used object detection benchmark; we used the model trained on Visual Genome even for this evaluation because of the assumption that the target category is unknown. As metrics, we used top-k precision and average precision (AP), computed from the region-level ranked list as in the standard object detection task; we did not evaluate the detection and retrieval tasks separately because both can be evaluated with the same metric.
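For reference, the region-level AP used here can be computed from a ranked list as in standard object detection. This is a minimal sketch where `ranked_labels` marks each retrieved region as a correct detection (1) or a false positive (0), and `n_positives` is the total number of ground-truth positive objects for the query.

```python
def average_precision(ranked_labels, n_positives):
    """AP over a score-ranked list of region-level relevance labels:
    the average of precision@i at each correct detection, normalized by
    the total number of ground-truth positives."""
    hits, total = 0, 0.0
    for i, rel in enumerate(ranked_labels, start=1):
        if rel:
            hits += 1
            total += hits / i  # precision at this recall point
    return total / n_positives if n_positives else 0.0
```

For example, with hits at ranks 1 and 3 out of 2 ground-truth positives, AP is (1/1 + 2/3) / 2 = 5/6.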

Phrase localization
Comparison with state-of-the-art. We compared our method with state-of-the-art methods on the Flickr30k Entities phrase localization task. We categorized the methods into two types, i.e., non-scalable and scalable methods (Tab. 1). 1) Non-scalable methods cannot be used for large-scale retrieval because their query-dependent components are too complex to process a large number of images online, and 2) scalable methods can be used for large-scale retrieval because their query-dependent components are easy to scale up (e.g., the L2 distance computation); these include common subspace-based approaches such as CCA.
Our method also belongs to the scalable category.
We used a simple model without a regressor and NPA in these experiments. Table 1 compares Query-Adaptive R-CNN with the state-of-the-art methods. Our model achieved 65.21% accuracy and outperformed all previous state-of-the-art models, including the non-scalable and joint localization methods. Moreover, it significantly outperformed the scalable methods, which suggests that predicting the classifier is a better approach than learning a common subspace for the open-vocabulary detection problem.
Bounding box regressor. To demonstrate the effectiveness of the bounding box regressor for precise localization, we conducted evaluations with the regressor at different IoU thresholds. As explained in Sec. 3.1, the regressor was generated using G_r, which transforms the 300-d text embedding v into the 4096-d regressor weights w_x^r, w_y^r, w_w^r, and w_h^r. We compared three network architectures for G_r: 1) a 300-n-4096 MLP having a hidden layer with n units that is shared across the four outputs, 2) a 300-n-4096 MLP whose hidden layer is not shared, and 3) a 300-4096 linear transformation (without a hidden layer). Table 2 shows the results with and without the regressor. The regressor significantly improved the accuracy at high IoU thresholds, which demonstrates that it improves the localization accuracy. In addition, the accuracy did not decrease as a result of sharing the hidden layer or reducing the number of units in the hidden layer. This suggests that the regressor lies in a very low-dimensional manifold because the regressor for one concept can be shared by many concepts (e.g., the person regressor can be used for man, woman, girl, boy, etc.). The number of parameters was significantly reduced by these tricks, to even fewer than in the linear transformation. The accuracy slightly decreased at a threshold of 0.5, because the regressor was not learned properly for categories that did not frequently appear in the training data.

Open-Vocabulary Object Retrieval
Main comparison. Open-vocabulary object detection and retrieval is a much more difficult task than phrase localization, because we do not know how many objects are present in an image. We used NPA to train our model and compared it with region-based CCA (Plummer et al., 2017b), which is scalable and has been shown to be effective for phrase localization; for a fair comparison, the subspace was learned using the same dataset as ours. An approximate search was not used, in order to evaluate the actual performance of open-vocabulary object detection. Table 3 compares different training strategies. NPA significantly improved the performance: more than 25% relative improvement for all metrics. Removing mutually non-exclusive words also contributed to the performance: WN and VG both improved performance (5.8% and 6.9% relative AP gain, respectively). Performance improved even further by combining them (11.8% relative AP gain), which shows that they are complementary. AP was much improved by NPA on the PASCAL dataset as well (47% relative gain). However, the performance was still much poorer than that of state-of-the-art object detection methods (Redmon and Farhadi, 2017; Ren et al., 2015), which suggests that there is a large gap between open-vocabulary and closed-vocabulary object detection.
Detailed results of NPA. To investigate the effect of NPA, we show the AP with and without NPA for individual categories in Figure 5, sorted by relative AP improvement. AP improved especially for animals (elephant, cow, horse, etc.) and people (skier, surfer, girl), which are visually similar within the same upper category. Table 4 shows the most confused category and its total count in the top 100 search results for each query (for each query, we scored all the objects in the Visual Genome test set and counted the false alarms among the top 100 scored objects); this shows which concept is confusing for each query and how much the confusion is reduced by NPA. Visually similar categories resulted in false positives without NPA, while their number was suppressed by training with NPA. The reason is that these confusing categories were added as negative phrases in NPA, and the network learned to reject them. Figure 4 shows qualitative search results for each query with and without NPA (and CCA as a baseline), which also show that NPA can discriminate confusing categories (e.g., horse and zebra). These results clearly demonstrate that NPA significantly improves the discriminative ability of classifiers by adding hard negative categories.
Large-scale experiments. Finally, we evaluated the scalability of our method on a large image database. We used one million images from the ILSVRC 2012 training set for this evaluation. Table 5 shows the speed and memory usage. The mean and standard deviation of the speed are computed over 20 queries of the PASCAL VOC dataset. Our system could retrieve objects from one million images in around 0.5 seconds. We did not evaluate accuracy because there is no such large dataset with bounding box annotations. Figure 6 shows the retrieval results from one million images, which demonstrate that our system can accurately retrieve and localize objects from a very large-scale database.

Conclusion
Query-Adaptive R-CNN is a simple yet strong framework for open-vocabulary object detection and retrieval. It achieves state-of-the-art performance on the Flickr30k phrase localization benchmark, and it can be used for large-scale object retrieval by textual query. In addition, its retrieval accuracy can be further increased by using a novel training strategy called negative phrase augmentation (NPA) that appropriately selects hard negative examples by using linguistic relationships and the confusion between categories. This simple and generic approach significantly improves the discriminative ability of the generated classifier. Acknowledgements: This work was supported by JST CREST JPMJCR1686 and JSPS KAKENHI 17J08378.