Language-Conditioned Feature Pyramids for Visual Selection Tasks

Referring expression comprehension, the ability to link language to an object in an image, plays an important role in creating common ground. Many models that fuse visual and linguistic features have been proposed. However, few models consider fusing linguistic features with multiple visual features that have different receptive field sizes, even though the appropriate receptive field size of the visual features intuitively varies depending on the expression. In this paper, we introduce a neural network architecture that uses linguistic features to modulate visual features with varying receptive field sizes. We evaluate our architecture on tasks related to referring expression comprehension in two visual dialogue games. The results show the advantages and broad applicability of our architecture. Source code is available at https://github.com/Alab-NII/lcfp.


Introduction
Referring expressions are a ubiquitous part of human communication (Krahmer and Van Deemter, 2012) that must be studied in order to create machines that work smoothly with humans. Much effort has been devoted to methods for creating visual common ground between humans and machines, which have limited means of expression and knowledge of the real world, from the perspectives of both referring expression comprehension and generation (Moratz et al., 2002; Tenbrink and Moratz, 2003; Funakoshi et al., 2004, 2005, 2006). Researchers are now exploring ways to design more realistic scenarios for applications, such as visual dialogue games (De Vries et al., 2017; Haber et al., 2019; Udagawa and Aizawa, 2019).
Many models have been proposed for referring expression comprehension. As image recognition matured, Guadarrama et al. (2014) studied object retrieval methods based on the category labels predicted by recognition models. Hu et al. (2016b) extended this approach to broader natural language expressions, including object categories, attributes, positional configurations, and interactions. In recent years, models that fuse linguistic features with visual features using deep learning have been studied (Hu et al., 2016a,b; Anderson et al., 2018; Deng et al., 2018; Misra et al., 2018; Yang et al., 2019a,b; Liu et al., 2019; Can et al., 2020).
When fusing the linguistic features of a spatial referring expression with visual features, the size of the receptive field of the visual features 1 is important. Let us take Figure 1 as an example. We can refer to the gray dot in the figure in various ways:

• a gray dot
• a dot next to the small dot
• a dot below and to the right of the large dot
• the rightmost dot in a triangle consisting of three dots
• the third largest dot of four dots

As shown in the figure, there is an optimum receptive field size when fusing the features of these expressions with the visual features. Although the small receptive field (in the second panel from the left) matches the expression a gray dot, it does not capture information about the triangle consisting of three dots to the upper left. Conversely, the largest receptive field (in the rightmost panel) includes the triangle but contains too much information to determine the color of the gray dot. Thus, linguistic and visual features have an optimum receptive field size for fusion.

Few existing models, however, fuse linguistic features with visual features of different receptive field sizes. This is possibly because major datasets for referring expression comprehension, for example, Kazemzadeh et al. (2014); Plummer et al. (2015); Mao et al. (2016); Yu et al. (2016), use photographs and emphasize expressions related to object category more than positional relationships. Tenbrink and Moratz (2003); Tanaka et al. (2004); Liu et al. (2012) reveal that people often use group-based expressions (relative positional relationships of multiple objects) when there is no clear difference between objects; such expressions are therefore not unusual. Further investigation is needed into methods that handle referring expressions based on positional relationships.

1 In this paper, we picture the size of the receptive field of visual features as the grid size in the input image. Note that the receptive field in a real model is generally wider than the grid size because of multiple convolutional layers.
For this reason, we focus on the OneCommon corpus (Udagawa and Aizawa, 2019), a recently proposed corpus on a visual dialogue game using composite images of simple figures. It captures various expressions based on positional relationships, such as group-based expressions, as shown in Figure 2.
In this paper, we introduce a neural network architecture for referring expression comprehension that considers visual features with different receptive field sizes, and we evaluate it on the OneCommon task. Our structure combines feature pyramid networks (FPN) (Lin et al., 2017) and feature-wise linear modulation (FiLM) (Perez et al., 2018) and modulates visual features of different receptive field sizes with the linguistic features of referring expressions. FPN is an architecture that uses each layer of a hierarchical convolutional neural network (CNN) feature extractor for object detection, whereas FiLM is a structure that robustly fuses linguistic features with visual features.

Figure 2: Example of a OneCommon view and dialogue. In the OneCommon framework, two players observe slightly different views due to a parallel shift. The game requires them to create common ground about the views through free conversation and to identify the same dot. We show part of an utterance and underline some expressions that refer to an object or a group.
To confirm the broad applicability of our architecture, we further evaluate it on another task, which is expected to require object category recognition more than OneCommon does because it uses photographs. We find that our architecture achieves better accuracy on these tasks than some existing models, suggesting the advantage of fusing linguistic features with multiple visual features that have different receptive fields.
The contributions of this paper are as follows:

1. We propose the language-conditioned feature pyramid (LCFP) architecture, which modulates visual features with multiple receptive field sizes using language features.
2. We apply LCFP to dialogue history object retrieval; our evaluation demonstrates the advantage of our architecture on referring expression comprehension in visual dialogue.

Dialogue History Object Retrieval
The main focus of this paper is the task of predicting the final object selected by the speaker given a dialogue history, a scene image, and candidate objects in the image. A dialogue history consists of a list of speaker and utterance pairs. We consider dialogues where speakers switch every turn. Candidate objects are indicated by bounding boxes in the image. Some task instances provide additional information, such as object categories. Here, we call this task dialogue history object retrieval.
OneCommon Target Selection Task OneCommon is a dialogue corpus for common grounding. It contains 6,760 dialogues from a collaborative referring game where two players are given a view that contains 7 dots, as shown in Figure 2. Dots have four attributes: x/y coordinates on a plane, size, and color. Only some dots are seen in common because the centers of the players' views are different. The goal of the game is to select the same dot after talking. Target selection is a subtask of the game, requiring prediction of the dot that a player chose based on a given player's view and dialogue history.
GuessWhat?! Guesser Subtask GuessWhat?! (De Vries et al., 2017) is a game related to multimodal dialogue. Two players play the roles of oracle and questioner. They are given a photo, and the oracle mentally selects an object. The questioner then asks the oracle yes-or-no questions to guess the object. The goal of the game is to select the object at the end of the question sequence. A published collection of game records consists of 150,000 games with human players, with a total of 800,000 visual question-answer pairs on 66,000 images extracted from the MS COCO dataset (Lin et al., 2014). The guesser subtask is to predict the correct object from 3-20 candidate objects based on a given photo and set of question-answer pairs. Candidate information includes bounding boxes and object categories.
In addition to dialogue history object retrieval, there is a growing amount of research on task design for visual dialogue games that require building common ground. For example, in the PhotoBook dataset (Haber et al., 2019), two participants are presented with multiple images and, through conversation, predict whether an image is presented only to them or also to the other person.

Related Work
This section first gives an overview of models for referring expression comprehension and then details models related to the OneCommon corpus and GuessWhat?!.

Models for Referring Expression Comprehension
Models for extracting objects from an image are often based on object detection (Ren et al., 2015; Liu et al., 2016; Lin et al., 2017; Redmon and Farhadi, 2018) or image segmentation (Ronneberger et al., 2015). Object detection considers only the bounding boxes of objects, whereas image segmentation extracts the areas indicated by object outlines. Correspondingly, referring expression comprehension includes reference detection (Hu et al., 2016b; Anderson et al., 2018; Deng et al., 2018; Yang et al., 2019a,b) and reference segmentation (Hu et al., 2016a; Misra et al., 2018; Liu et al., 2019; Can et al., 2020). Standard reference detection consists of two stages: detecting candidate objects and selecting the candidates that match the expression. Essentially, these models do not fuse visual feature maps with language when detecting candidates. Yang et al. (2019b) propose a one-stage model that combines the feature map of the object detector with language to directly select the referred object. Whereas their model fuses linguistic and visual features after reducing the visual features of different receptive field sizes, ours fuses them before the reduction. Zhao et al. (2018) also propose a model that fuses multiple scales and language for weakly supervised learning. However, they use concatenation for fusion, whereas we use FiLM.
For reference segmentation, prior work has pointed out a lack of multi-scale semantics and proposed recursively fusing feature maps of different scales using a recurrent neural network (RNN). However, this method concatenates linguistic features with only the first input of the RNN; hence, the feature map at each scale and the linguistic features may be poorly fused. U-Net-based models (Misra et al., 2018; Can et al., 2020) have the structure most similar to ours. They produce hierarchical feature maps with CNNs, modulate those maps with language, and unify them into a single map through consecutive deconvolution operations.
The major difference between those U-Net-based models and ours is the fusion architecture. The U-Net-based models generate kernels from linguistic features to convolve the visual features. Our model instead applies an affine transformation to the visual features using coefficients computed from the linguistic features in FiLM blocks. Suppose the dimensions of the source and modulated visual features are $D_s$ and $D_m$, respectively. Then the size of the kernel for convolution is $D_s D_m$, whereas the size of the coefficients for the affine transformation is $2 D_m$ (e.g., for $D_s = 2048$ and $D_m = 256$, a generated kernel has $2048 \times 256 = 524{,}288$ elements, whereas the affine coefficients number only $512$). Because of this independence from $D_s$, our model has the advantage of handling visual features with large dimensions, such as the last layer of ResNet50 (He et al., 2016), which typically has 2048 dimensions.

Models for Dialogue History Object Retrieval
OneCommon Target Selection Udagawa and Aizawa (2019) proposed the baseline model TSEL, which creates the features of a candidate from its attributes (size, color, and position) and the average of the differences between its attributes and those of the other candidates. This model does not use visual features directly. Udagawa and Aizawa (2020) extended TSEL with additional reference resolution annotation (TSEL-REF) and with fine-tuning through self-play dialogue (TSEL-REF-DIAL).

GuessWhat?! Guesser Subtask The GuessWhat?! paper proposes baseline models that use object category and position to create candidate features. Although the paper reports that extending their baseline model with visual features from object recognition brings no advantage, some models that use visual features, for example, A-ATT (Deng et al., 2018) and HACAN (Yang et al., 2019a), have recently improved performance on GuessWhat?!. Their approach, based on reference detection and an attention mechanism, fuses linguistic features with visual features that have a single receptive field size.

Preliminary
We introduce two prerequisite architectures to describe our proposal.

Feature-wise Linear Modulation
A feature-wise linear modulation (FiLM) (Perez et al., 2018) block fuses a given language vector and feature map to make a new feature map. Let the output feature map dimension be $d_{out}$, the language vector be $v_{lang}$ with dimension $d_{lang}$, and the input feature map be $f_{in}$ with dimension $d_{in}$ and shape $(h, w)$.
First, the block performs a linear transformation on $v_{lang}$ to obtain the coefficients of the affine transformation:

$(\gamma, \beta) = W v_{lang} + b, \quad \gamma, \beta \in \mathbb{R}^{d_{out}}.$

Second, it applies the first convolutional layer $\mathrm{CNV}^{(1)}$ to $f_{in}$ after concatenating a positional encoding (PE):

$f_{vis} = F(\mathrm{CNV}^{(1)}(\mathrm{PE}(f_{in}))),$

where $F$ is an activation function, typically a rectified linear unit (ReLU) (Nair and Hinton, 2010), and $\mathrm{PE}(f_{in})$ denotes the concatenation of the two-dimensional position of each pixel in $f_{in}$, normalized to the range $[-1, 1]$ on each axis.

Last, the second convolutional layer $\mathrm{CNV}^{(2)}$ with batch normalization (BN) and the affine transformation is applied:

$f_{film} = f_{vis} + \underbrace{F(\gamma \odot \mathrm{BN}(\mathrm{CNV}^{(2)}(f_{vis})) + \beta)}_{f_{fuse}}, \qquad (1)$

where $\odot$ denotes the element-wise product. Language and vision are fused in this equation. $f_{film}$ is the FiLMed feature map. Note that $f_{film}$ can be divided into a language-independent part $f_{vis}$ and a language-dependent part $f_{fuse}$. We analyze the effect of these terms in Section 6.3.
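As a concrete illustration, the following is a minimal PyTorch sketch of this block under the residual reading of Equation 1 above; the names (FiLMBlock, d_lang, d_in, d_out) are ours, not from the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class FiLMBlock(nn.Module):
    """Sketch of a FiLM block; names and the residual form are our assumptions."""

    def __init__(self, d_lang: int, d_in: int, d_out: int):
        super().__init__()
        self.affine = nn.Linear(d_lang, 2 * d_out)              # produces (gamma, beta)
        self.cnv1 = nn.Conv2d(d_in + 2, d_out, kernel_size=1)   # +2 channels for PE
        self.cnv2 = nn.Conv2d(d_out, d_out, kernel_size=3, padding=1)
        self.bn = nn.BatchNorm2d(d_out, affine=False)           # affine comes from language

    def forward(self, v_lang: torch.Tensor, f_in: torch.Tensor) -> torch.Tensor:
        b, _, h, w = f_in.shape
        # PE: concatenate x/y coordinates normalized to [-1, 1]
        ys = torch.linspace(-1, 1, h, device=f_in.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=f_in.device).view(1, 1, 1, w).expand(b, 1, h, w)
        f_vis = F.relu(self.cnv1(torch.cat([f_in, ys, xs], dim=1)))
        gamma, beta = self.affine(v_lang).chunk(2, dim=-1)
        gamma = gamma.view(b, -1, 1, 1)
        beta = beta.view(b, -1, 1, 1)
        # language-dependent part, added to the language-independent part
        f_fuse = F.relu(gamma * self.bn(self.cnv2(f_vis)) + beta)
        return f_vis + f_fuse
```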

Feature Pyramid Networks
Feature Pyramid Networks (FPN) (Lin et al., 2017) use an object recognition model as a backbone and reconstruct semantically rich feature maps from the feature extraction results. Here, we suppose that the backbone is ResNet.

ResNet and Stages of Feature Map
The ResNet family has a common structure for reducing the size of the input images. First, it converts an input image into a feature map with half the resolution of the image using a convolutional layer. Next, it halves the map again with a pooling operation. Subsequently, it applies residual blocks, gradually halving the resolution until it reaches 1/32 of the original image. We define the final layer of each resolution as the feature map of that stage and denote the maps by C1 to C5. FPN then reconstructs the feature maps in a top-down manner: 2

$P_i = \mathrm{CNV}^{(i)}(C_i) + \mathrm{Resize}_2(P_{i+1}), \qquad (2)$

where $P_6 = 0$ and $\mathrm{Resize}_2$ denotes the operation that enlarges a map to twice its size. This means that $P_i$ contains information from the higher, coarser stages, which in general hold more complex semantics because of their wider receptive fields.

2 We do not mention P1 here because the original paper does not use C1 and P1 owing to their large memory footprint.
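As a concrete illustration, the following minimal PyTorch sketch implements the top-down reconstruction of Equation 2; the class name FPNTopDown and the nearest-neighbor resizing are our assumptions.

```python
import torch.nn as nn
import torch.nn.functional as F


class FPNTopDown(nn.Module):
    """Sketch of FPN's top-down reconstruction (Equation 2)."""

    def __init__(self, stage_dims, d_out: int = 256):
        super().__init__()
        # one lateral 1x1 convolution CNV^(i) per stage
        self.laterals = nn.ModuleList(
            [nn.Conv2d(d, d_out, kernel_size=1) for d in stage_dims]
        )

    def forward(self, cs):  # cs = [c1, ..., c5], finest to coarsest
        p = None  # plays the role of P6 = 0
        ps = []
        for cnv, c in zip(reversed(self.laterals), reversed(cs)):
            lateral = cnv(c)
            if p is not None:
                # Resize_2: enlarge the coarser map to twice its size
                lateral = lateral + F.interpolate(p, scale_factor=2, mode="nearest")
            p = lateral
            ps.append(p)
        return list(reversed(ps))  # [P1, ..., P5]
```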

Proposed Method
Our architecture consists of language-conditioned feature pyramids (LCFP) for general feature extraction and a task-specific feature extractor, as shown in Figure 3. In this section, we describe LCFP and the subsequent structure for dialogue history object retrieval.

Language-Conditioned Feature Pyramids
Language Encoder LCFP requires a fixed-length vector of language information to generate the input for the FiLM blocks. We can use any fixed-length vector, such as the last hidden layers of RNNs or of transformer-based language models such as Devlin et al. (2019). Our proposal adopts a gated recurrent unit (GRU) (Cho et al., 2014) in accordance with the FiLM paper (Perez et al., 2018). Supposing that $d_{lang}$ is the dimension of the hidden layer, $h_{lang} = \mathrm{GRU}(text) \in \mathbb{R}^{d_{lang}}$.
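A minimal PyTorch sketch of this encoder follows, using the embedding and hidden dimensions from Section 6 (256 and 1024); the class name LanguageEncoder and the use of the final hidden state as h_lang are our assumptions.

```python
import torch.nn as nn


class LanguageEncoder(nn.Module):
    """Sketch of the GRU language encoder producing h_lang."""

    def __init__(self, vocab_size: int, d_emb: int = 256, d_lang: int = 1024):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_emb)
        self.gru = nn.GRU(d_emb, d_lang, batch_first=True)

    def forward(self, token_ids):  # token_ids: (batch, seq_len)
        _, h_last = self.gru(self.embed(token_ids))
        return h_last.squeeze(0)   # h_lang: (batch, d_lang)
```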
Visual Feature Extractor We use ResNet as our backbone. In addition to the C2-C5 described in Section 4.2, we use C1 because our goal is to incorporate information in the low stages, i.e., visual features with small receptive fields.

Fusing Language and Vision
The key idea for combining the two aforementioned architectures is to replace the convolutional layers of FPN in Equation 2 with FiLM blocks.
We represent the block as a function $\mathrm{FiLM}(v_{lang}, f_{in})$. Then, our feature reconstruction can be expressed as follows:

$P_i = \mathrm{FiLM}^{(i)}(h_{lang}, C_i) + \mathrm{Resize}_2(P_{i+1}),$

where the weights of the FiLM block differ from stage to stage. Following Perez et al. (2018), we set the kernel sizes of $\mathrm{CNV}^{(1)}_i$ and $\mathrm{CNV}^{(2)}_i$ in each FiLM block to $1 \times 1$ and $3 \times 3$, respectively. $\{P_i;\ i = 1, ..., 5\}$ is the output of LCFP.
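Under the same assumptions as the earlier sketches (reusing FiLMBlock), LCFP itself can be sketched as follows.

```python
import torch.nn as nn
import torch.nn.functional as F


class LCFP(nn.Module):
    """Sketch of LCFP: Equation 2 with lateral convolutions replaced by FiLM blocks."""

    def __init__(self, d_lang: int, stage_dims, d_out: int = 256):
        super().__init__()
        # one FiLM block per stage, with unshared weights
        self.films = nn.ModuleList(
            [FiLMBlock(d_lang, d, d_out) for d in stage_dims]
        )

    def forward(self, h_lang, cs):  # cs = [c1, ..., c5], finest to coarsest
        p = None  # P6 = 0
        ps = []
        for film, c in zip(reversed(self.films), reversed(cs)):
            fused = film(h_lang, c)  # modulate this stage with language
            if p is not None:
                fused = fused + F.interpolate(p, scale_factor=2, mode="nearest")
            p = fused
            ps.append(p)
        return list(reversed(ps))  # [P1, ..., P5]
```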

LCFP-Based Dialogue History Object Retrieval
We formulate dialogue history object retrieval as a classification task that predicts the selected object based on a dialogue history, a scene image, and a set of candidate information. The candidate information consists of a bounding box $(x_1, y_1, x_2, y_2)$ in the image and a fixed-length vector $v$ that represents the additional information.
Candidate Features We extract the region corresponding to the bounding box of each candidate from the feature map $P1$ obtained via LCFP. For candidate $i$, the features in the region are averaged to form a fixed-length vector:

$f_i = \frac{1}{|region_i|} \sum_{k \in region_i} P1_k,$

where $region_i$ and $P1_k$ indicate the region of candidate $i$ and the vector at position $k$ in feature map $P1$, respectively. We concatenate $f_i$ with the additional information vector $v_i$ of candidate $i$ to make the full feature vector $[f_i; v_i]$.

Probability Calculation We apply a linear layer with ReLU activation to each feature vector and another linear layer with a one-dimensional output to obtain a logit for each candidate. We apply a softmax over all candidate logits when we need the probability of the selected candidate.
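A minimal PyTorch sketch of this candidate scoring follows; cropping regions with integer feature-map coordinates and the names CandidateScorer and d_add are our assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CandidateScorer(nn.Module):
    """Sketch of candidate feature pooling and probability calculation."""

    def __init__(self, d_map: int = 256, d_add: int = 256, d_hidden: int = 1024):
        super().__init__()
        self.hidden = nn.Linear(d_map + d_add, d_hidden)
        self.logit = nn.Linear(d_hidden, 1)

    def forward(self, p1, boxes, v_add):
        """p1: (d_map, H, W) map from LCFP; boxes: list of non-empty
        (x1, y1, x2, y2) in feature-map pixels; v_add: (num_candidates, d_add)."""
        feats = []
        for (x1, y1, x2, y2) in boxes:
            region = p1[:, y1:y2, x1:x2]           # crop the candidate region
            feats.append(region.mean(dim=(1, 2)))  # average over the region -> f_i
        f = torch.stack(feats)                     # (num_candidates, d_map)
        h = F.relu(self.hidden(torch.cat([f, v_add], dim=-1)))
        logits = self.logit(h).squeeze(-1)
        return F.softmax(logits, dim=-1)           # probability per candidate
```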

Experiments
We first validate the advantage of our architecture on two tasks in dialogue history object retrieval described in Section 2. We then investigate the cause of the advantage through ablation studies.
Common Text Processing We consider a dialogue history as a text that starts with the task name followed by a <text> token, continues with a sequence of utterances, and ends with a <selection> token. Each utterance is interposed between a speaker token, <you> or <them>, and an end-of-sequence token <eos>. Tokenization of utterances differs for each task; a sketch of this serialization is shown below.

Common Implementation We implemented our model with the PyTorch framework (Paszke et al., 2019). As a backbone, we used the ResNet50 provided by the PyTorch vision package, pretrained on object recognition with the ImageNet dataset (Deng et al., 2009). All weights of the backbone, including the statistics for batch normalization, are fixed. The dimensions of the token embeddings, GRU hidden states, feature maps, additional information, and last linear layer are 256, 1024, 256, 256, and 1024, respectively. For optimization, we used ADAM (Kingma and Ba, 2014) with alpha 5e-4, eps 1e-9, and a mini-batch size of 32. No regularization was used except for BN. We ran 5 epochs per trial and chose the weight set with the lowest validation loss.
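The following sketch illustrates the serialization described above; the exact token formatting and the helper name serialize_dialogue are our assumptions.

```python
def serialize_dialogue(task_name: str, turns):
    """Sketch of dialogue serialization.

    turns: list of (speaker, utterance_tokens), speaker in {"you", "them"};
    task-specific tokenization of utterances happens upstream.
    """
    tokens = [task_name, "<text>"]
    for speaker, utterance in turns:
        tokens.append(f"<{speaker}>")
        tokens.extend(utterance)
        tokens.append("<eos>")
    tokens.append("<selection>")
    return tokens


# Example:
# serialize_dialogue("onecommon",
#                    [("you", ["do", "you", "see", "a", "large", "black", "dot", "?"])])
```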

OneCommon Target Selection Task
Model Detail Tokenization was performed by splitting on white space; all tokens are uncased. Tokens that appear fewer than five times in the training dataset were replaced with an <unk> token. We drew the game views from the candidate dot data as 224px square images. The additional information vector is disabled by inputting a vector that denotes that no information is provided.
Results Table 1 compares the accuracy of the existing models and ours. Our model achieves better accuracy than the three models described in Section 3.2, although it remains below human performance. In particular, our model outperforms even the models trained with additional reference resolution annotation.

GuessWhat?! Guesser Subtask
Although it contains many referring expressions related to positional relationships, OneCommon uses views composed of simple figures. We next evaluated our architecture on the guesser subtask of GuessWhat?!, which uses photographs, to verify whether our structure can be applied to more complex visual information.
Model Detail We tokenized utterances with NLTK's TweetTokenizer under case-insensitive conditions and omitted tokens appearing fewer than five times in the training dataset. We resized the photos to 224px squares, regardless of their aspect ratio. As additional information, we input the object categories provided by the dataset, converting them into one-hot embedding vectors.
Results Table 2 shows the error rates on the task, along with the learning method of each model. Our model achieves the lowest error rate among the supervised learning models, including those that use visual features (LSTM+VGG, HRED+VGG, A-ATT, and HACAN w/o HAST). This demonstrates that our architecture can be applied to visual input of natural objects as well as simple figures. Our method alone does not match the results of the method that uses reinforcement learning; however, our method can be combined with such more sophisticated learning methods. Examining such combinations will be an interesting topic for future work.

Ablation
To confirm the importance of fusing multiple visual features that have different receptive field sizes with linguistic features, we performed ablation in two settings: stage ablation and language-conditioned parts ablation. The former examines the effect of applying FiLM to small receptive fields by removing FiLM from some stages. The latter examines the effect of language modulation by leaving only the language-independent parts of FiLM.

Stage Ablation

Table 3 compares the A5, A3, and Full models. A5 uses only the last stage of the image extractor, Full uses all stages, and A3 is in between. The same trend holds for both OneCommon and GuessWhat?!: the Full model outperforms A5 and achieves a slightly better result than A3. This shows that considering visual features with small receptive field sizes improves performance.

Language-Conditioned Parts Ablation
This ablation introduces the A5' and A3' models, which use the language-independent part $f_{vis}$ in all stages but do not use the language-dependent part $f_{fuse}$ in some stages (see Equation 1 in Section 4.1 for the definitions of $f_{vis}$ and $f_{fuse}$). Comparing A5 with A5', and A3 with A3', shows that the models consistently achieve better results when using the language-dependent part, suggesting that the language fusion has a positive impact. The small difference between the Full and A3' models indicates that the impact of language fusion in stages 1 and 2 is relatively small, as expected, but these stages still contribute to performance.
Combining these results, we conclude that the advantage evaluated in the previous subsection comes from fusing linguistic features with multiple visual features of different receptive field sizes.

Discussion
Finally, this section focuses on linguistic expressions. Using OneCommon, we discuss the effect of our architecture on group-based referring expressions and revisit our initial intuition regarding the relationship between expressions and receptive fields.

Effect on Group-Based Expression Comprehension
To gain insight into performance on group-based referring expressions, we aggregated results over examples whose dialogue includes tokens related to groups. We took the six tokens shown in Table 4 as markers indicating that a dialogue contains a group-based referring expression. If the model struggles to handle group-based referring expressions, the accuracy on these examples should be lower than the overall accuracy. Table 4 shows the results. The baseline model TSEL yields low accuracy on triangle, group, pair, square, and trapezoid, with large drops ranging from 6% to 24% compared with the overall accuracy. Conversely, our architecture reduces the drop; in the worst case, triangle, accuracy drops by 3%. This supports the idea that our architecture improves the understanding of group-based referring expressions.
Note that dialogue history object retrieval resolves the final reference of a dialogue. The existence of a group-based referring expression does not necessarily mean that it relates to the answer; hence, this is indirect support.

Expressions and the Size of Receptive Fields
We visualized the activation patterns of the modulated features in our architecture to verify our initial intuition that linguistic and visual features have an optimum receptive field size for fusion. Figure 4 shows the results. For the visualization, we input simple expressions related to single attributes, such as select the largest dot (size) or select the darkest dot (color). The stage with the most activated pattern varies depending on the attribute in the expression, and we observed this phenomenon on view inputs other than the one in Figure 4. The model pays the most attention to stage 1, which has the smallest receptive field, when it receives an expression related to color; attention then moves to stages with larger receptive fields as the input changes to size and position. This likely corresponds to the typical extent of localization that each attribute requires.
These results suggest that the model selects visual features by the size of the receptive field according to the referring expression, supporting our first intuition.
Failure Cases Although the model makes good predictions regarding size and color, it does not handle position well. Thus, although our model improves the handling of expressions related to positional relationships, there is still room for improvement.
Through this visualization, we observed that our model tends to attend to the wrong range. For example, for the four position-related expressions in Figure 4, the model predicts answers only from the dots in the salient triangle formed by dots c, d, and e.
A possible explanation for this observation is data bias. Because the OneCommon game framework rewards players who successfully create common ground with each other, players may prefer to mention more salient dots to increase the success rate. As a result, the variation of expressions could be restricted. In fact, Udagawa and Aizawa (2019) report such trends for the color and size attributes. This suggests the importance of exploring task designs for data collection that elicit a wide range of general referring expressions.

Conclusion
To improve referring expression comprehension, this paper proposes a neural network architecture that uses linguistic features to modulate visual features with different receptive field sizes at each level of a CNN hierarchy. Because our architecture applies affine transformations to visual features using coefficients derived from linguistic features, it requires a lower calculation cost than methods that generate convolution kernels.
Our evaluation on referring expression comprehension tasks in two visual dialogue games demonstrates the advantage of our model in understanding referring expressions and the broad applicability of our architecture. Ablation studies support the importance of fusion at multiple stages.
We expect that hierarchical visual information is also important for generation. However, our architecture is difficult to apply directly to referring expression generation because it outputs modulated feature maps. Extending our architecture to language generation is therefore a future direction.