CoNAN: A Complementary Neighboring-based Attention Network for Referring Expression Generation

Daily scenes in the real world are complex due to occlusion, undesired lighting conditions, etc. Although humans handle such complicated environments well, they pose challenges for machine learning systems that must identify and describe a target without ambiguity. Most previous research focuses on mining discriminating features for the target object within the same category. On the other hand, as the scene becomes more complicated, humans frequently use neighboring objects as complementary information to describe the target one. Motivated by that, we propose a novel Complementary Neighboring-based Attention Network (CoNAN) that explicitly utilizes the visual differences between the target object and its highly related neighbors. These highly related neighbors, determined by an attentional ranking module, serve as complementary features that highlight the discriminating aspects of the target object. The speaker module then takes the visual difference features as an additional input to generate the expression. Our qualitative and quantitative results on the RefCOCO, RefCOCO+, and RefCOCOg datasets demonstrate that our generated expressions outperform those of other state-of-the-art models by a clear margin.


Introduction
Generating referring expressions (Mao et al., 2016; Yu et al., 2016; Liu et al., 2017; Yu et al., 2017; Tanaka et al., 2019), which identify target objects with simple words and phrases in everyday discourse, has attracted attention from both the computer vision (CV) and natural language processing (NLP) communities. With the rapid development of RNNs (Hochreiter and Schmidhuber, 1997) and the emergence of transformers (Vaswani et al., 2017), machine learning systems can generate linguistically correct expressions in most cases. However, the remaining issue in the referring expression generation (REG) field is to avoid ambiguity (i.e. the generated expression should refer to a unique target object). This issue becomes increasingly important when referring to an object in complex daily scenes, where occlusion, undesired lighting conditions, and complex formations of objects occur regularly. This complexity inhibits the system from automatically mining features that are unique to the target object among its visually similar ones.
Previous works have mainly investigated model architectures to generate less ambiguous expressions. Speaker-listener models (Mao et al., 2016; Liu et al., 2017) are widely adopted to encourage a speaker to generate expressions that can be comprehended by a listener model. Further, Yu et al. (2017) employ a reinforcer module that rewards the system if the generated expression can be grounded to the target object. In order to find discriminative features for the target object and generate less ambiguous expressions, much research utilizes the visual differences between the target object and objects of the same category, as determined by an object detector.
However, when multiple visually similar objects appear in the scene, mining discriminative features for the target object becomes challenging. Instead, a human would use the surrounding objects to help clarify the target one. Motivated by that, we propose a Complementary Neighboring-based Attention Network (CoNAN) that explicitly utilizes and highlights the visual differences between the target object and its neighbors, instead of mining discriminative features within a class. CoNAN first finds the k spatial neighbors of each target object and computes their visual differences, and then uses an attentional ranking module to rank the potential contribution of each neighbor object. Finally, the speaker (expression generator) in CoNAN additionally takes the top-M ranked visual differences, together with the target object and the global representation, as inputs to generate referring expressions.
Note that CoNAN is compatible with most current learning-based expression generation systems. In particular, we adopt SLR (Yu et al., 2017) as our baseline system. Experimental evaluation shows a significant improvement in the generated expressions compared to the state of the art on the three RefCOCO datasets.
Related Work

Image Captioning
Image captioning (Vinyals et al., 2015; Anderson et al., 2018) is the task of generating textual sentences for a given image. Similarly, the referring expression generation task aims at unambiguously describing a specific object in an everyday environment. It therefore requires a machine learning system to figure out the key discriminating aspects of the target object for unambiguity, while image captions only describe the general visual content. Most recent approaches use either recurrent models (Anderson et al., 2018) or visual transformers (Lu et al., 2019) on top of object-based bottom-up attention for the speaker model. To achieve unambiguity, REG models employ an additional comprehension module to check whether the generated expressions can be grounded back to the target object.

Referring Expression Datasets
Referring expression generation (REG) has long been studied on artificial datasets. The field became more active with the appearance of RefCLEF (Kazemzadeh et al., 2014), a large-scale dataset with 20,000 real-world images. RefCLEF was collected in a two-player game, where one player clicks on the correct object given the expression generated by another player. If the player correctly matches the object and the expression, both players earn points and their roles switch. With the same scheme, the authors collected the RefCOCO and RefCOCO+ datasets from COCO images (Yu et al., 2016). The two datasets each contain about 50,000 objects. RefCOCO+ additionally prohibits location words in the expressions, which are allowed in RefCOCO. RefCOCOg (Mao et al., 2016) uses a non-interactive framework to build more complex and detailed expressions, and contains 54,822 objects with 85,474 referring expressions. Tanaka et al. (2019) propose RefGTA, which contains complex compositions of images from GTA V with sufficiently diverse appearances and locations.

Referring Expression Generation
Referring expression generation aims at generating an unambiguous sentence for a specific region or object in a full image. Early works studied rule-based approaches (Gupta and Stent, 2005; Janarthanam and Lemon, 2010). Since large-scale datasets (RefCOCO, RefCOCO+, RefCOCOg, etc.) were collected, many studies have applied the CNN-LSTM framework to real-world images (Mao et al., 2016; Yu et al., 2016; Liu et al., 2017; Yu et al., 2017; Tanaka et al., 2019) for automation.
To reduce the ambiguity of object descriptions, Mao et al. (2016) introduced Maximum Mutual Information (MMI) training, which induces the speaker to generate more discriminative sentences based on the listener's response. In detail, the speaker is trained to generate more descriptive captions for the specific object so that the listener can easily localize the corresponding region. Yu et al. (2016) proposed to incorporate a better measure of visual context into the speaker to jointly generate expressions for all same-category objects depicted in an image. Liu et al. (2017) introduced attribute embedding generation, which improves the visual representation of the generation model. Yu et al. (2017) proposed a unified framework for the tasks of generation and comprehension, where the speaker and listener are trained complementarily by end-to-end learning, with the reinforcer guiding the speaker to generate a more discriminative sentence. Tanaka et al. (2019) focused on utilizing the environment around the target to make it easy for a human to locate the target region.

Figure 1: Framework of CoNAN: we extract the target feature and detected object features. We select the neighbors of the target based on a Euclidean distance metric. We calculate the visual differences between the target and neighbor features. We perform a scaled dot-product attention function with a ranking strategy. The aggregated features consist of the global feature, target feature, weighted visual difference features, and location/size difference features. We then train the expression generator on those aggregated features by minimizing the total loss.

Model
In this section, we present the Complementary Neighboring-based Attention Network (CoNAN) for generating unambiguous referring expressions. In particular, we first extract k neighbor objects for each target object, as detailed in Section 3.1. To mine discriminating features for the target object, visual differences between it and its neighbors are utilized as complementary inputs. To better encode the local context, we also employ an attentional ranking strategy that weighs the neighbors and selects meaningful ones, described in Section 3.2. Finally, we present an expression generation module in Section 3.3 that takes the attentional visual differences, the target object feature, and the global features as inputs to generate high-quality expressions.

Extracting Neighbor Objects
We present the approach for extracting the set of neighbor objects for the target object o. To avoid duplication, we first perform non-maximum suppression (NMS) to filter out objects whose intersection-over-union (IoU) with the target box is over 0.5. Then, we extract the k-nearest neighbors according to the Euclidean distance between the centers of the neighbors' bounding boxes and that of the target. We denote the target object feature as o and the i-th neighbor object's feature as o_i.
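The extraction step above can be sketched in plain Python (a minimal illustration with hypothetical helper names; in the actual pipeline the boxes come from Faster R-CNN detections):

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def center(box):
    return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

def extract_neighbors(target_box, boxes, k=20, iou_thresh=0.5):
    """Drop boxes overlapping the target above iou_thresh, then return
    the indices of the k nearest remaining boxes by the Euclidean
    distance between bounding-box centers."""
    tx, ty = center(target_box)
    keep = [i for i, b in enumerate(boxes) if iou(target_box, b) <= iou_thresh]
    keep.sort(key=lambda i: ((center(boxes[i])[0] - tx) ** 2 +
                             (center(boxes[i])[1] - ty) ** 2) ** 0.5)
    return keep[:k]
```

Duplicates of the target itself are removed by the IoU filter before the distance ranking is applied.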

Visual Difference as Complementary Features
Yu et al. (2016) emphasize the importance of using the visual differences between the target object and other objects from the same category to reduce ambiguity. As a result, both unique attributes and spatial relationships that characterize the target object can be considered.
Instead of comparing to objects from the same category, our visual differences compare the target object with all of its neighbors to mine the complementary aspects of the target in the local complex scene. To avoid an overly complex context and preserve the brevity of the expression, we construct an attentional ranking module that ranks and selects informative neighbor objects when generating the expression for the target object.

Computing Visual Differences
We adopt bottom-up features as the representations of the target and its neighbor objects. In particular, following (Anderson et al., 2018), a Faster R-CNN (Ren et al., 2015) pretrained on Visual Genome (Krishna et al., 2017) is used as the object feature extractor, resulting in a 2,048-d vector for each object in the image. The visual difference δ^v_i between the target object o and the i-th neighbor object o_i is calculated as the feature difference normalized by its Euclidean norm, δ^v_i = (o − o_i) / ‖o − o_i‖.
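Assuming the normalized-difference formulation of Yu et al. (2016), the computation per neighbor is simple (an illustrative sketch; `visual_difference` is a hypothetical name, and real inputs would be 2,048-d detector features):

```python
def visual_difference(target, neighbor):
    """Feature-wise difference between target and neighbor features,
    normalized by its Euclidean norm (the visdif formulation of
    Yu et al., 2016)."""
    diff = [t - n for t, n in zip(target, neighbor)]
    norm = sum(d * d for d in diff) ** 0.5 or 1.0  # guard identical features
    return [d / norm for d in diff]
```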

Complementary Neighboring-based Attention
In contrast to (Yu et al., 2016), our system utilizes the visual differences between the target object and its neighbors across all categories to mine complementary features. However, considering all the neighbors may introduce an overly complex local context. To address this issue, our system learns sparse attention over the neighbors and selects only the top-M meaningful objects as a concise local context.
Technically, given the target feature o and the visual difference feature δ^v_i for the i-th neighbor, we compute the scaled attention logit α_i as shown in Eq. 1. In particular, both the target feature o and the visual difference δ^v_i first go through separate feed-forward networks; we then compute the attention logits as the inner product of the projected features, scaled by 1/√d. Note that f denotes a linear transformation, where different f do not share parameters, and d is the dimension of the hidden feature vector.
To focus on helpful neighbors for generating unambiguous yet concise expressions, we select only the top-M neighbors according to the learnt attention logits α_i to form a complementary neighbor object set S = {i | α_i is among the M largest attention logits}.
Then, the final complementary visual difference feature δ^v is computed as the weighted sum of the visual differences of the objects in S, as shown in Eq. 2.
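The attention, top-M selection, and weighted sum of Eqs. 1-2 can be sketched as follows (a toy, framework-free version; in the model the projections f are learned feed-forward layers and the module is trained end-to-end, so the weight matrices below are placeholders):

```python
import math

def project(x, w):
    """Toy linear projection f(x) = W x (no bias); stands in for a
    learned feed-forward network with its own parameters."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]

def complementary_attention(target, diffs, w_q, w_k, m):
    """Scaled dot-product attention over visual differences (Eq. 1),
    followed by top-M selection and a weighted sum (Eq. 2)."""
    d = len(w_q)  # hidden dimension
    q = project(target, w_q)
    logits = [sum(qi * ki for qi, ki in zip(q, project(dv, w_k))) / math.sqrt(d)
              for dv in diffs]
    # keep only the M neighbors with the largest attention logits
    top = sorted(range(len(logits)), key=lambda i: -logits[i])[:m]
    exps = {i: math.exp(logits[i]) for i in top}
    z = sum(exps.values())
    alphas = {i: e / z for i, e in exps.items()}  # softmax over the selected set
    # weighted sum of the selected visual differences
    fused = [sum(alphas[i] * diffs[i][j] for i in top)
             for j in range(len(diffs[0]))]
    return fused, alphas
```

Normalizing the weights only over the selected set S keeps the attention sparse, which matches the motivation of discarding uninformative neighbors.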

Referring Expression Generator
We employ five different types of features to generate referring expressions using a CNN-LSTM framework. In particular, we consider the target object feature o, the global context g, the target location/size l, the target context δ^v, and the target location/size context δ^l.
The global context g is modeled as the averaged feature vector of all objects detected by the pretrained Faster R-CNN in the image. The location/size representation of the target is modeled as a 5-dimensional vector l = [x_tl/W, y_tl/H, x_br/W, y_br/H, (w·h)/(W·H)], where w and h denote the width and height of the target bounding box, W and H are the width and height of the image, and (x_tl, y_tl) and (x_br, y_br) are the coordinates of the top-left and bottom-right corners. This feature represents the relative position and size of the object.
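The 5-d location/size feature is a direct transcription of the formula above (`location_feature` is a hypothetical name for illustration):

```python
def location_feature(box, img_w, img_h):
    """5-d relative location/size feature:
    [x_tl/W, y_tl/H, x_br/W, y_br/H, (w*h)/(W*H)]."""
    x_tl, y_tl, x_br, y_br = box
    w, h = x_br - x_tl, y_br - y_tl
    return [x_tl / img_w, y_tl / img_h,
            x_br / img_w, y_br / img_h,
            (w * h) / (img_w * img_h)]
```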
With the selected neighbors, we perform complementary neighboring attention to obtain the fine-grained target context δ^v as described in the previous section. The final visual representation v is a combination of the above features followed by one linear layer. We use v_i to denote the joint feature v that regards the i-th object as the target object, and r_i to denote the human expression for the i-th object. To generate the expression for each referred object, the joint feature v_i is fed into an LSTM, and we minimize the negative log-likelihood with respect to the parameters θ as shown in Eq. 3.

Training Objectives
Following Mao et al. (2016), we use the Maximum Mutual Information (MMI) constraint to encourage the model to generate an expression for the target object o_i that can be discriminated from the expressions for other objects. In particular, we incorporate two pieces of prior knowledge: (1) the ground-truth expression r_i should be more likely to be generated from the target object o_i than from another randomly sampled object o_k; (2) the target object should be more likely to generate the ground-truth expression r_i than another expression r_j. Therefore, we adopt a margin loss as shown in Eq. 4, where λ_s1, λ_s2, M_1, and M_2 are hyper-parameters. We also use a reinforcer model (Yu et al., 2017) to encourage a more precise and discriminative expression for the target object. Specifically, we build an MLP network to evaluate the consistency between the generated expression and the visual features, and use the evaluation score as a reward. In particular, we use the local-scene-aware target object feature v_i as the visual feature and an LSTM to encode the generated expression as the sentence feature. We adopt the policy-gradient technique to optimize the reward function as shown in Eq. 5.
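The two margin constraints of Eq. 4 can be sketched as follows (an illustrative reading of the loss, not the paper's exact formulation; how the negative object o_k and negative expression r_j are sampled is left abstract):

```python
def mmi_margin_loss(lp_pos, lp_wrong_obj, lp_wrong_expr,
                    lam1=1.0, lam2=0.1, m1=1.0, m2=1.0):
    """Hinge-style MMI loss.
    lp_pos        = log p(r_i | o_i), the positive pair
    lp_wrong_obj  = log p(r_i | o_k), same expression, sampled other object
    lp_wrong_expr = log p(r_j | o_i), sampled other expression, same object
    Each term pushes the positive pair above a negative by a margin."""
    return (lam1 * max(0.0, m1 + lp_wrong_obj - lp_pos) +
            lam2 * max(0.0, m2 + lp_wrong_expr - lp_pos))
```

With the paper's settings (λ_s1 = 1, λ_s2 = 0.1, M_1 = M_2 = 1), the loss is zero once the positive log-probability exceeds both negatives by the margin.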
To achieve better performance, we adopt a re-ranking mechanism that selects the generated expression whose object, as grounded by the listener module, is closest to the target.
The overall loss of our speaker model L_s is a summation of (Eq. 3), (Eq. 4), and (Eq. 5), where λ_r is a hyper-parameter weighting the reward loss term.

Experiments

Datasets
Our model is trained and evaluated on three standard referring expression datasets: RefCOCO, RefCOCO+, and RefCOCOg. Each dataset uses image data from COCO (Lin et al., 2014); RefCOCO and RefCOCO+ were collected using ReferitGame (Kazemzadeh et al., 2014), while RefCOCOg was collected in a non-interactive setting. Further details of each dataset are listed in the following sections.

Table 1: Comparison of our results with state-of-the-art baseline methods on the referring expression datasets RefCOCO, RefCOCO+, and RefCOCOg. "+rerank" denotes the reranking process for the generated expression according to the listener module. "+attn" indicates the addition of scaled dot-product attention with the ranking strategy. SLR denotes the original SLR model, and re-SLR is a reimplemented version from (Tanaka et al., 2019) that uses ResNet as the image feature extractor.

Implementation
We optimize the speaker module using the Adam (Kingma and Ba, 2014) optimizer with a batch size of 128 and an initial learning rate of 4e-4. The learning rate decays by 0.5 every 500 iterations. The size of the hidden state and word embedding is set to 512. We empirically found that taking 20 neighbors with an IoU threshold of 0.2 in the NMS stage achieves optimal results. For the ranking strategy, we set M to 8 to obtain the sparse attention weights. For reinforcement learning, our model generates 3 sampled sentences to estimate the rewards. During the test phase, we use beam search with a beam size of 10. We set λ_s1 = 1, λ_s2 = 0.1, and M_1 = M_2 = 1 as the hyper-parameters of the margin loss, and set the weight of the reward loss in the total loss function to λ_r = 1.
For the object representation, following (Anderson et al., 2018), we use object detection as bottom-up attention, which provides salient image regions with clear boundaries. In particular, a Faster R-CNN head (Ren et al., 2015) in conjunction with a ResNet-101 base network (He et al., 2016) is adopted as our detection module. The detection head is first pre-trained on the Visual Genome dataset (Krishna et al., 2017) and is capable of detecting 1,600 object categories and 400 attributes. To generate the output set of object features for an image, we take the final detection outputs and perform non-maximum suppression (NMS) for each object category using an IoU threshold of 0.7. Finally, a fixed number of 36 detected objects per image are extracted as the image features (a 2,048-dimensional vector for each object).

Training Details
We trained our referring expression generator on the RefCOCO, RefCOCO+, and RefCOCOg datasets with the LSTM loss, reward loss, and hinge loss. In particular, we first train the reinforcer model by maximizing the reward for the consistency between image features and sentence features. We then jointly train the speaker and listener models with the reinforcer's reward.

Comparison with State-Of-The-Arts Models
In this section, we perform both quantitative and qualitative comparisons against SLR (Yu et al., 2017) and RefGTA (Tanaka et al., 2019). For the quantitative analysis, we evaluate our generated referring expressions on the RefCOCO, RefCOCO+, and RefCOCOg datasets. To evaluate the quality of the expressions, we also adopt the METEOR and CIDEr automatic metrics commonly used in the field of image captioning. We confirm the effectiveness of our listener module in the following sections.

Table 2: Comprehension evaluation on RefCOCO, RefCOCO+, and RefCOCOg. "Ensemble" refers to the use of both speaker and listener or reinforcer. Our three modules (speaker, listener, reinforcer) show better performance in most cases compared to the previous state-of-the-art models. "+attn" states that the model applies the scaled dot-product attention with the ranking strategy. SLR denotes the original SLR model, and re-SLR is a reimplemented version from (Tanaka et al., 2019) that uses ResNet as the image feature extractor.

Quantitative Results
Evaluation of referring expression generation. We compare our generated expressions with recent models, including SLR (Yu et al., 2017), re-SLR (Tanaka et al., 2019), and RefGTA (Tanaka et al., 2019). We observed that using the reranking mechanism with the listener module generally improves performance, although it was not particularly helpful for the RefGTA model. The listener module contributed the most to enhancing the quality of the generated expressions: in particular, SLR without the listener performs much worse than with the listener, as shown in the last three or four rows of Table 1. Since the reranking technique has a greater effect on our listener model than on RefGTA, our model outperforms RefGTA on the comprehension evaluation, as shown in Table 2. We also found that our neighboring-based attention function improves the performance of both the speaker and listener modules compared to our baseline model without attention. We attribute this to the attention function selecting meaningful neighboring objects for generating the referring expression while eliminating unnecessary ones, which helps the listener model focus on the target object through its discrimination from the surroundings.
Evaluation of referring expression comprehension. To examine the impact of each module on generation, we validate the performance of the speaker, listener, and reinforcer modules on the comprehension evaluation. For a fair evaluation, we compare against the two speaker-listener-reinforcer models (Yu et al., 2017; Tanaka et al., 2019). Given an expression r and ground-truth bounding boxes for all objects, the comprehension prediction is computed as o* = argmax_i F(r, o_i) for the reinforcer and o* = argmax_i P(r | o_i) for the speaker.
We report the expression comprehension results in Table 2; our model improves the quality of the expressions compared to the others. This is because our system additionally considers the neighbor objects' features for the target object. In particular, our speaker and listener modules perform better when using the attended visual difference features than when using simple visual difference features between the target and its neighbors. This is partly due to the scaled dot-product attention with the ranking strategy, which not only reduces the complexity of the visual difference features used to generate the referring expressions but also makes the expressions easier to ground back from the listener module. As a result, our proposed method, which considers the target's neighbors and performs an attention mechanism between the target and the visual difference features, improves the performance of the speaker and listener modules by selecting the important context objects that identify the target's surroundings.

Qualitative Results
In this section, we qualitatively analyze the test data and results in comparison to SLR and RefGTA. Fig. 2 and Fig. 3 show the generated sentences for each referring expression dataset: RefCOCO, RefCOCO+, and RefCOCOg. Particular objects are expressed in more detail, as shown in Fig. 2 for the RefCOCO dataset, e.g., "skater in air", "black keyboard". Moreover, some commonly mistaken objects are correctly described with additional location information on the RefCOCO+ dataset, e.g., "man behind fence", "bed closest to us". This indicates that our proposed method CoNAN has the potential to describe the target object well, covering its location, attributes, and interactions with other objects. Considering the relationship between the target and its neighbors, together with the base ideas, effectively simplifies the listener's task of retrieving the object from the spoken expression without ambiguity, e.g., "blue couch", "the person on right".
Since RefCOCOg is known to contain longer and more complex expressions, the expected performance boost from CoNAN is much higher than on the other datasets. As the sentences are allowed to be long and complex, it is important to include as many details as possible. Fig. 3 shows our superior results compared to SLR and RefGTA. For the first image, CoNAN generates a detailed expression for the baby along with interaction information with other objects (e.g. holding a cell phone). CoNAN also correctly generates the expression for the "arm behind", which SLR and RefGTA falsely describe as a head or a person in a white shirt. Similarly, for the second and third images, CoNAN describes the target object in far more detail than the other methods, differing only slightly from the ground truth.

Figure 3: Qualitative comparisons of the generated referring expressions with (Yu et al., 2017), (Tanaka et al., 2019), and human annotation on the RefCOCOg dataset. The order of expressions corresponds to the green, orange, yellow, and red boxes, respectively.
Interestingly, CoNAN can sometimes produce expressions that are even clearer than the ground truth, as shown in the fourth image. While it is not easy to retrieve "a black laptop with the screen open", it is more intuitive to retrieve "a black laptop that is being used by a man in a black shirt". This shows that taking the relationships with neighboring objects into account further helps the model semantically understand the complex scene. The base generator and reinforcer are expected to benefit strongly from this additional object-level relation information, since, in the real world, humans tend to understand a given object in relation to its surroundings.

Conclusion
In this work, we present an approach that explicitly mines complementary aspects of the target object in the local scene. In particular, the visual differences between the target and its neighbors are adopted. Instead of using all of the neighbors, we employ an attentional ranking module to filter out irrelevant neighbor objects. Finally, the speaker module is built upon the global features, target object features, and our complementary neighbor features to generate the expression. Our quantitative results show that CoNAN effectively enhances referring expression generation, outperforming other state-of-the-art methods by a clear margin. Moreover, our qualitative results suggest that CoNAN has the potential to produce more descriptive expressions for each target object, sometimes even superior to the ground truth.