Dual Attention Networks for Visual Reference Resolution in Visual Dialog

Visual dialog (VisDial) is a task that requires a dialog agent to answer a series of questions grounded in an image. Unlike in visual question answering (VQA), answering the series of questions requires capturing temporal context from the dialog history and utilizing visually grounded information. Visual reference resolution is a problem that addresses these challenges, requiring the agent to resolve ambiguous references in a given question and to find the references in a given image. In this paper, we propose Dual Attention Networks (DAN) for visual reference resolution in VisDial. DAN consists of two kinds of attention modules, REFER and FIND. Specifically, the REFER module learns latent relationships between a given question and the dialog history by employing a multi-head attention mechanism. The FIND module takes image features and reference-aware representations (i.e., the output of the REFER module) as input, and performs visual grounding via a bottom-up attention mechanism. We qualitatively and quantitatively evaluate our model on the VisDial v1.0 and v0.9 datasets, showing that DAN outperforms the previous state-of-the-art model by a significant margin.


Introduction
Thanks to recent progress in natural language processing and computer vision, there has been an extensive amount of effort towards developing a cognitive agent that jointly understands natural language and visual information. Over the last few years, vision-language tasks such as image captioning (Xu et al., 2015) and visual question answering (VQA) (Antol et al., 2015; Anderson et al., 2018) have provided a testbed for developing such an agent. However, an agent performing these tasks still has a long way to go before it can be used in real-world applications (e.g., aiding visually impaired users, interacting with humanoid robots), in that it does not consider continuous interaction over time. Specifically, the interaction in image captioning is that the agent simply talks to a human about visual content, without any input from the human. The VQA agent does take a question as input, but it is only required to answer a single question about a given image.
The visual dialog (VisDial) task (Das et al., 2017) has been introduced as a generalized version of VQA. A dialog agent needs to answer a series of questions such as "How many people are in the image?" and "Are they indoors or outside?", utilizing not only visually grounded information but also contextual information from the dialog history. To address these two challenges, researchers have recently tackled a problem called visual reference resolution in VisDial. The problem of visual reference resolution is to resolve expressions that are ambiguous on their own (e.g., it, they, any other) and ground them in a given image.
In this paper, we address visual reference resolution in the visual dialog task. We first hypothesize that humans address visual reference resolution through a two-step process: (1) linguistically resolve the ambiguous question by recalling the dialog history from memory and (2) find a local region of the given image for the resolved question. For example, as shown in Figure 1, the question "Does it look like a nice one?" is ambiguous on its own because we do not know what "it" refers to. We therefore believe that humans try to recall the dialog history and implicitly work out that "it" refers to the "skateboard". After this resolution step, we believe that they finally try to find the skateboard in the image and answer the question. To model these processes, we propose Dual Attention Networks (DAN), which consist of two kinds of attention modules, REFER and FIND. The REFER module learns to retrieve the relevant previous dialogs for clarifying ambiguous questions. Inspired by the self-attention mechanism (Vaswani et al., 2017), the REFER module computes multi-head attention over all previous dialogs in a sentence-level fashion, followed by feed-forward networks, to obtain the reference-aware representations. The FIND module takes image features and the reference-aware representations, and performs visual grounding via a bottom-up attention mechanism. With this pipeline, we expect our proposed model to disambiguate questions by using the REFER module and to ground the resolved references properly in the given image.
Figure 1: We propose two kinds of attention modules, REFER and FIND. REFER learns latent relationships between a given question and a dialog history to retrieve the relevant previous dialogs. FIND performs visual grounding, taking image features and reference-aware representations (i.e., the output of REFER). ⊗, ⊕, and ⊙ denote matrix multiplication, concatenation, and element-wise multiplication, respectively. The multi-layer perceptron is omitted in this figure for simplicity.
The main contributions of this paper are as follows. First, we propose Dual Attention Networks (DAN) for visual reference resolution in visual dialog, based on the REFER and FIND modules. Second, we validate our proposed model on the large-scale datasets VisDial v1.0 and v0.9. Our model achieves new state-of-the-art results compared to other methods. We also conduct ablation studies along four criteria to demonstrate the effectiveness of our proposed components. Third, we compare DAN with our baseline model to demonstrate the performance improvements on semantically incomplete questions that need to be clarified. Finally, we perform a qualitative analysis of our model, showing that DAN reasonably attends to the dialog history and salient image regions.

Related Work
Visual Dialog. The visual dialog (VisDial) task was recently proposed by Das et al. (2017), providing a testbed for research on the interplay between computer vision and dialog systems. Accordingly, a dialog agent performing this task is required not only to find visual groundings of linguistic expressions but also to capture semantic nuances from human conversation. Attention-based approaches were primarily proposed to address these challenges, including memory networks (Das et al., 2017), a history-conditioned image attentive encoder (Lu et al., 2017), sequential co-attention (Wu et al., 2018), and synergistic co-attention networks (Guo et al., 2019).

Visual Reference Resolution. Recently, researchers have tackled a problem called visual reference resolution (Seo et al., 2017; Kottur et al., 2018; Niu et al., 2018) in VisDial. To resolve visual references, Seo et al. (2017) proposed an attention memory which stores a sequence of previous visual attention maps in memory slots. They retrieved the previous visual attention maps by applying soft attention over all the memory slots and combined the result with the current visual attention. Furthermore, Kottur et al. (2018) attempted to resolve visual references at the word level, relying on an off-the-shelf parser. Similar to the attention memory (Seo et al., 2017), they proposed a reference pool which stores visual attention maps of recognized entities, and retrieved the weighted sum of the visual attention maps by applying soft attention. Niu et al. (2018) proposed a recursive visual attention model that recursively reviews the previous dialogs and refines the current visual attention. The recursion continues until the question itself is determined to be unambiguous. The binary decision of whether the question is ambiguous or not is made by the Gumbel-Softmax approximation (Jang et al., 2016; Maddison et al., 2016). To resolve the visual references, the above approaches attempted to retrieve the visual attention of the previous dialogs and applied it to the current visual attention. These approaches are limited in that they store all previous visual attentions, while research on the human memory system shows that visual sensory memory, due to its rapid decay, hardly stores all previous visual attentions (Sperling, 1960; Sergent et al., 2011). Based on this biologically inspired motivation, our proposed model calculates the current visual attention by using linguistic cues (i.e., the dialog history).

Proposed Algorithm
In this section, we formally describe the visual dialog task and our proposed algorithm, Dual Attention Networks (DAN). The visual dialog task (Das et al., 2017) is defined as follows. A dialog agent is given as input an image $I$, a follow-up question $Q_t$ at round $t$, and a dialog history $H$ (including the image caption) up to round $t-1$. $A_t^{gt}$ denotes the ground-truth answer (i.e., the human response) at round $t$. Using these inputs, the agent is asked to rank a list of 100 candidate answers, $A_t = \{A_t^1, \cdots, A_t^{100}\}$. Given the problem setup, DAN for the visual dialog task can be framed as an encoder-decoder architecture: (1) an encoder that jointly embeds the input $(I, Q_t, H)$ and (2) a decoder that converts the embedded representation into the ranked list $\hat{A}_t$. From this point of view, DAN consists of three components: REFER, FIND, and the answer decoder. As shown in Figure 1, the REFER module learns to attend to relevant previous dialogs to resolve the ambiguous references in a given question $Q_t$. The FIND module learns to attend to the spatial image features that the output of the REFER module describes. The answer decoder ranks the list of candidate answers $A_t$ given the output of the FIND module.
We first introduce the language features, as well as the image features in Sec. 3.1. Then we describe the detailed architectures of the REFER and FIND modules in Sec. 3.2 and 3.3, respectively. Finally, we present the answer decoder in Sec. 3.4.

Input Representation
Language Features. We first embed each word in the follow-up question $Q_t$ to $\{w_{t,1}, \cdots, w_{t,T}\}$ by using pre-trained GloVe (Pennington et al., 2014) embeddings, where $T$ denotes the number of tokens in $Q_t$. We then use a two-layer LSTM, generating a sequence of hidden states $\{u_{t,1}, \cdots, u_{t,T}\}$. Note that we use the last hidden state of the LSTM, $u_{t,T}$, as the question feature, denoted as $q_t \in \mathbb{R}^{L}$.
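Each element of the dialog history $\{H_i\}_{i=0}^{t-1}$ and each of the candidate answers $\{A_t^i\}_{i=1}^{100}$ is embedded in the same way as the follow-up question, yielding the history features $\{h_i\}_{i=0}^{t-1}$ and one $L$-dimensional feature per candidate answer. $Q_t$, $H$, and $A_t$ are embedded with the same word embedding vectors and three different LSTMs.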
Image Features. Inspired by bottom-up attention (Anderson et al., 2018), we use Faster R-CNN (Ren et al., 2015) pre-trained on Visual Genome (Krishna et al., 2017) to extract object-level image features. We denote the output features as $v \in \mathbb{R}^{K \times V}$, where $K$ and $V$ are the total number of object detection features per image and the dimension of each feature, respectively. We adaptively extract the number of object features $K$, ranging from 10 to 100, to reflect the complexity of each image. $K$ is fixed during training.

REFER Module
In this section, we formally describe the single-layer REFER module. Given the question and dialog history features, the REFER module aims to attend to the most relevant elements of the dialog history with respect to the given question. Specifically, we first compute scaled dot-product attention (Vaswani et al., 2017) in a multi-head setting, which is called multi-head attention. Let $q_t$ and $M_t = \{h_i\}_{i=0}^{t-1}$ be the question and dialog history feature vectors, respectively. $q_t$ and $M_t$ are projected to $d_{ref}$ dimensions by different, learnable projection matrices. We then take the dot product of the two projections, divide by $\sqrt{d_{ref}}$, and apply the softmax function to obtain the attention weights over all elements of the dialog history. Note that dot-product attention is computed $h$ times with different projection matrices, yielding $\{head_n\}_{n=1}^{h}$. Accordingly, we obtain the multi-head representation $x_t$ by concatenating (⊕) all $\{head_n\}_{n=1}^{h}$ and applying a linear projection with the matrix $W^o \in \mathbb{R}^{h d_{ref} \times L}$. We then compute $\hat{x}_t$ by applying a residual connection to $x_t$, followed by layer normalization (Ba et al., 2016).
Next, we apply $\hat{x}_t$ to a two-layer feed-forward network with a ReLU in between, with weights $W_1^f \in \mathbb{R}^{L \times 2L}$ and $W_2^f \in \mathbb{R}^{2L \times L}$. A residual connection and layer normalization are also applied in this step, yielding the contextual representation $\hat{c}_t$.
Finally, the REFER module returns the reference-aware representations by concatenating the contextual representation $\hat{c}_t$ and the original question representation $q_t$, denoted as $e_t^{ref} \in \mathbb{R}^{2L}$. In this work, we use $d_{ref} = 256$. Figure 2 illustrates the pipeline of the REFER module.
Furthermore, we stack the REFER modules in multiple layers to get a high-level abstraction of the reference-aware representations. Details are to be discussed in Sec. 4.5.
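The single-layer REFER computation described above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: bias terms and the learnable gain and bias of layer normalization are omitted, the attention values are assumed to be the projected history features, the first residual connection is assumed to add $q_t$, and all names (`refer`, `matvec`, `rand_matrix`, the toy sizes) are hypothetical.

```python
import math
import random

def matvec(x, W):
    """Multiply row vector x (length n) by matrix W (n x m)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def layer_norm(x, eps=1e-6):
    # Assumption: learnable gain and bias are left out for brevity.
    mean = sum(x) / len(x)
    var = sum((xi - mean) ** 2 for xi in x) / len(x)
    return [(xi - mean) / math.sqrt(var + eps) for xi in x]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def refer(q, history, params, d_ref):
    """Return (e_ref, attention weights per head) for one question."""
    concat_heads, attn_per_head = [], []
    for W_q, W_m in params["heads"]:
        q_proj = matvec(q, W_q)                         # L -> d_ref
        m_proj = [matvec(h_i, W_m) for h_i in history]  # each L -> d_ref
        scores = [sum(a * b for a, b in zip(q_proj, m)) / math.sqrt(d_ref)
                  for m in m_proj]
        weights = softmax(scores)                       # one weight per history element
        # Assumption: the values are the projected history features.
        head = [sum(w * m[j] for w, m in zip(weights, m_proj)) for j in range(d_ref)]
        concat_heads.extend(head)                       # concatenation of heads
        attn_per_head.append(weights)
    x = matvec(concat_heads, params["W_o"])             # h * d_ref -> L
    x_hat = layer_norm([a + b for a, b in zip(x, q)])   # residual (assumed + q) and norm
    hidden = [max(0.0, z) for z in matvec(x_hat, params["W_f1"])]  # L -> 2L, ReLU
    c = matvec(hidden, params["W_f2"])                  # 2L -> L
    c_hat = layer_norm([a + b for a, b in zip(c, x_hat)])
    return c_hat + q, attn_per_head                     # list concatenation -> 2L

# Toy sizes for illustration only (the paper uses L = 512, d_ref = 256).
rng = random.Random(0)
L, d_ref, n_heads, rounds = 8, 4, 2, 3
params = {
    "heads": [(rand_matrix(L, d_ref, rng), rand_matrix(L, d_ref, rng))
              for _ in range(n_heads)],
    "W_o": rand_matrix(n_heads * d_ref, L, rng),
    "W_f1": rand_matrix(L, 2 * L, rng),
    "W_f2": rand_matrix(2 * L, L, rng),
}
q_t = [rng.gauss(0.0, 1.0) for _ in range(L)]
M_t = [[rng.gauss(0.0, 1.0) for _ in range(L)] for _ in range(rounds)]
e_ref, attn = refer(q_t, M_t, params, d_ref)
```

In this sketch the second half of `e_ref` is the unchanged question feature, which mirrors the concatenation of $\hat{c}_t$ and $q_t$ described above.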

FIND Module
Instead of relying on the visual attention maps of the previous dialogs as in (Seo et al., 2017; Kottur et al., 2018; Niu et al., 2018), we expect the FIND module to attend to the most relevant regions of the image with respect to the reference-aware representations (i.e., the output of the REFER module). To implement visual grounding for the reference-aware representations, we take inspiration from the bottom-up attention mechanism (Anderson et al., 2018). Let $v \in \mathbb{R}^{K \times V}$ and $e_t^{ref} \in \mathbb{R}^{2L}$ be the image feature vectors and the reference-aware representations, respectively. We first project both to $d_{find}$ dimensions with $f_v(\cdot)$ and $f_{ref}(\cdot)$, which denote two-layer multi-layer perceptrons, and combine the projections with the Hadamard product ⊙ (i.e., element-wise multiplication). We then compute soft attention over all the object detection features, where $W_r \in \mathbb{R}^{d_{find} \times 1}$ is the projection matrix for the softmax activation. This yields the visual attention weights $\alpha_t \in \mathbb{R}^{K \times 1}$. Next, we apply the visual attention weights to $v$ and compute the vision-language joint representations, where $f_v(\cdot)$ and $f_{ref}(\cdot)$ in this step also denote two-layer multi-layer perceptrons which convert to $d_{find}$ dimensions, and $W_z \in \mathbb{R}^{d_{find} \times L}$ is the projection matrix.
Note that $e_t^{find} \in \mathbb{R}^{L}$ is the output representation of the encoder as well as of the FIND module, and it is decoded to score the list of candidate answers. In this work, we use $d_{find} = 1024$.
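The two FIND steps, soft attention over object features and the subsequent joint representation, can be sketched as follows. This is a hedged illustration rather than the authors' code: bias terms are omitted, the MLP activation is assumed to be ReLU, the second step is assumed to use its own pair of MLPs and a Hadamard product, and every name (`find`, `two_layer`, `f_v2`, `f_ref2`, the toy sizes) is hypothetical.

```python
import math
import random

def linear(x, W):
    """Multiply row vector x (length n) by matrix W (n x m)."""
    return [sum(x[i] * W[i][j] for i in range(len(x))) for j in range(len(W[0]))]

def mlp(x, W1, W2):
    # Two-layer perceptron; the ReLU in between is an assumption.
    hidden = [max(0.0, z) for z in linear(x, W1)]
    return linear(hidden, W2)

def softmax(scores):
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def find(v, e_ref, p):
    """Return (e_find, alpha) for K object features v and one e_ref."""
    ref = mlp(e_ref, *p["f_ref"])                       # 2L -> d_find
    scores = []
    for obj in v:
        joint = [a * b for a, b in zip(mlp(obj, *p["f_v"]), ref)]  # Hadamard product
        scores.append(sum(j * w for j, w in zip(joint, p["W_r"])))  # d_find -> 1
    alpha = softmax(scores)                             # K attention weights
    n_dims = len(v[0])
    attended = [sum(a * obj[j] for a, obj in zip(alpha, v)) for j in range(n_dims)]
    # Assumption: the second step fuses the two streams by a Hadamard product.
    fused = [a * b for a, b in zip(mlp(attended, *p["f_v2"]),
                                   mlp(e_ref, *p["f_ref2"]))]
    return linear(fused, p["W_z"]), alpha               # d_find -> L

# Toy sizes for illustration only (the paper uses V = 2048, d_find = 1024).
rng = random.Random(0)
K, V, L, d_find = 5, 6, 4, 8

def two_layer(n_in):
    return (rand_matrix(n_in, d_find, rng), rand_matrix(d_find, d_find, rng))

p = {
    "f_v": two_layer(V), "f_ref": two_layer(2 * L),
    "f_v2": two_layer(V), "f_ref2": two_layer(2 * L),
    "W_r": [rng.gauss(0.0, 0.1) for _ in range(d_find)],
    "W_z": rand_matrix(d_find, L, rng),
}
v = [[rng.gauss(0.0, 1.0) for _ in range(V)] for _ in range(K)]
e_ref = [rng.gauss(0.0, 1.0) for _ in range(2 * L)]
e_find, alpha = find(v, e_ref, p)
```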

Answer Decoder
The answer decoder computes the score of each candidate answer via a dot product with the embedded representation $e_t^{find}$, followed by a softmax activation to obtain a categorical distribution over the candidates.
Let the feature vectors of the 100 candidate answers form a matrix in $\mathbb{R}^{100 \times L}$. The distribution $p_t$ is obtained by applying the softmax to the dot products between these feature vectors and $e_t^{find}$. In the training phase, DAN is optimized by minimizing the cross-entropy loss between the one-hot encoded label vector (i.e., $y_t$) and the probability distribution (i.e., $p_t$), that is, $-\sum_{k=1}^{100} y_{t,k} \log p_{t,k}$,
where $p_{t,k}$ denotes the probability of the $k$-th candidate answer at round $t$. In the test phase, the list of candidate answers is ranked by the distribution $p_t$ and evaluated by the given metrics.
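As a small worked example of the decoding step, the sketch below scores three toy candidates by dot product, applies a softmax, computes the cross-entropy loss for a one-hot label, and ranks the candidates. The function names and the toy vectors are illustrative assumptions; the real model uses 100 candidates with $L$-dimensional features.

```python
import math

def decode(e_find, answer_features):
    """Softmax over dot-product scores between e_find and each candidate."""
    scores = [sum(a * b for a, b in zip(e_find, ans)) for ans in answer_features]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(p, gt_index):
    """Cross-entropy against a one-hot label reduces to -log p[gt_index]."""
    return -math.log(p[gt_index])

def rank_candidates(p):
    """Candidate indices sorted from most to least probable."""
    return sorted(range(len(p)), key=lambda k: p[k], reverse=True)

# Toy example: three candidates with two-dimensional features.
e_find = [1.0, 0.0]
answer_features = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]  # scores 0, 2, 1
p_t = decode(e_find, answer_features)
loss = cross_entropy(p_t, gt_index=1)
ranking = rank_candidates(p_t)
```

Here the second candidate has the largest dot product, so it is ranked first and the loss for that label is the smallest of the three possible labels.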

Experiments
In this section, we describe the details of our experiments on the VisDial v1.0 and v0.9 datasets. We first introduce the VisDial datasets, evaluation metrics, and implementation details in Sec. 4.1, Sec. 4.2, and Sec. 4.3, respectively. We then report the quantitative results by comparing our proposed model with the state-of-the-art approaches and the baseline model in Sec. 4.4. Next, we conduct ablation studies along four criteria to report the relative contribution of each component in Sec. 4.5. Finally, we provide the qualitative results in Sec. 4.6.

Datasets
We evaluate our proposed model on the VisDial v0.9 and v1.0 datasets. The VisDial v0.9 dataset (Das et al., 2017) was collected from the chat logs of two annotators talking about MS-COCO (Lin et al., 2014) images. Each dialog is made up of an image, a caption from the MS-COCO dataset, and 10 QA pairs. As a result, the VisDial v0.9 dataset contains 83k and 40k dialogs as train and validation splits, respectively. Recently, the VisDial v1.0 dataset (Das et al., 2017) was released with an additional 10k COCO-like images from Flickr. Dialogs for the additional images were collected in the same way as for v0.9. Overall, the VisDial v1.0 dataset contains 123k (all dialogs from v0.9), 2k, and 8k dialogs as train, validation, and test splits, respectively.

Evaluation Metrics
We evaluate individual responses to each question in a retrieval setting, following Das et al. (2017). Specifically, the dialog agent is given a list of 100 candidate answers for each question and asked to rank the list. There are three kinds of evaluation metrics for retrieval performance: (1) mean rank of the human response, (2) recall@k (i.e., existence of the human response in the top-k ranked responses), and (3) mean reciprocal rank (MRR). Mean rank, recall@k, and MRR are highly correlated with the rank of the human response. In addition, Das et al. (2017) proposed to use a more robust evaluation metric, normalized discounted cumulative gain (NDCG). NDCG takes into account all relevant answers from the ranked list, where the relevance scores are densely annotated for the VisDial v1.0 test split. NDCG penalizes low ranks of candidate answers with high relevance scores.
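For concreteness, the metrics above can be computed as in the following sketch. It is not the official evaluation code: the function names are hypothetical, ranks are assumed to be 1-indexed, and the NDCG variant (linear gain, log2 discount, cutoff defaulting to the number of candidates with nonzero relevance) is an assumption made for illustration.

```python
import math

def retrieval_metrics(gt_ranks, ks=(1, 5, 10)):
    """Mean rank, MRR and recall@k from 1-indexed ranks of the human response."""
    n = len(gt_ranks)
    metrics = {
        "mean_rank": sum(gt_ranks) / n,
        "mrr": sum(1.0 / r for r in gt_ranks) / n,
    }
    for k in ks:
        metrics["r@%d" % k] = sum(1 for r in gt_ranks if r <= k) / n
    return metrics

def ndcg(relevance_in_ranked_order, k=None):
    """NDCG for one question; input is the relevance of each candidate,
    listed in the order the model ranked them (best first)."""
    if k is None:
        # Assumption: cut off at the number of candidates with nonzero relevance.
        k = sum(1 for rel in relevance_in_ranked_order if rel > 0)

    def dcg(rels):
        return sum(rel / math.log2(pos + 2) for pos, rel in enumerate(rels[:k]))

    ideal = dcg(sorted(relevance_in_ranked_order, reverse=True))
    return dcg(relevance_in_ranked_order) / ideal if ideal > 0 else 0.0

# Toy example: the human response was ranked 1st, 2nd and 4th in three questions.
metrics = retrieval_metrics([1, 2, 4])
# Toy example: the model put an irrelevant candidate first.
score = ndcg([0.0, 1.0, 0.5])
```

Placing a highly relevant candidate lower in the list shrinks its discounted gain, which is how NDCG penalizes such rankings.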

Implementation Details
The dimension of the image features $V$ and of the hidden states in all LSTMs $L$ is 2048 and 512, respectively. All the language inputs are embedded into 300-dimensional vectors initialized by GloVe (Pennington et al., 2014). The number of attention heads $h$ is fixed to 4, except in the ablation study that varies it. We apply the Adam optimizer (Kingma and Ba, 2014) with an initial learning rate of $1 \times 10^{-3}$, decreased by $1 \times 10^{-4}$ per epoch until epoch 7, and then decayed by a factor of 0.5 per epoch from epoch 8 to epoch 12.
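One possible reading of this learning-rate schedule is sketched below. The exact indexing is not specified in the text, so the sketch assumes 1-indexed epochs, the initial rate at epoch 1, a linear decrease through epoch 7, and halving on each epoch from 8 to 12; the function name and its parameters are hypothetical.

```python
def learning_rate(epoch, base=1e-3, step=1e-4, linear_until=7, factor=0.5):
    """Learning rate at a 1-indexed epoch under the assumed schedule."""
    if epoch <= linear_until:
        # Linear phase: subtract `step` once per completed epoch.
        return base - step * (epoch - 1)
    # Multiplicative phase: start from the rate reached at `linear_until`.
    rate_at_switch = base - step * (linear_until - 1)
    return rate_at_switch * factor ** (epoch - linear_until)

schedule = [learning_rate(epoch) for epoch in range(1, 13)]
```

Under these assumptions the rate reaches 4e-4 at epoch 7 and 1.25e-5 at epoch 12; a different indexing convention would shift these values by one step.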
Results on VisDial v1.0 and v0.9 datasets. As shown in Table 1, our proposed model ranks higher than all other methods on both the single ground-truth answer (R@1) and all relevant answers on average (NDCG).
Results on ensemble model. We report the performance of our ensemble model in comparison with the top three entries on the leaderboard of the VisDial Challenge 2018. We ensemble six DAN models, with the number of attention heads (i.e., $h$) ranging from one to six. We average the probability distributions (i.e., $p_t$) of the six models to rank the candidate answers. In Table 2, our model significantly outperforms all three challenge entries, including the challenge winner, Synergistic (Guo et al., 2019). They ensembled ten models with different weight initializations and also used bottom-up attention features (Anderson et al., 2018) as image features.
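The ensembling step, averaging the per-model distributions $p_t$ and ranking by the average, can be illustrated with a short sketch. The function name and the toy distributions (three models, three candidates) are assumptions for illustration only.

```python
def ensemble_average(distributions):
    """Average several probability distributions over the same candidates."""
    n_models = len(distributions)
    n_candidates = len(distributions[0])
    return [sum(d[k] for d in distributions) / n_models for k in range(n_candidates)]

# Toy example: three models scoring the same three candidates.
per_model = [
    [0.6, 0.3, 0.1],
    [0.2, 0.5, 0.3],
    [0.1, 0.6, 0.3],
]
p_ensemble = ensemble_average(per_model)
ranking = sorted(range(len(p_ensemble)), key=lambda k: p_ensemble[k], reverse=True)
```

In this toy case the first model alone would rank candidate 0 first, while the averaged distribution prefers candidate 1.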
Results on semantically complete & incomplete questions. We first define the questions that contain one or more pronouns (i.e., it, its, they, their, them, these, those, this, that, he, his, him, she, her) as semantically incomplete (SI) questions. Likewise, we define the questions that do not contain pronouns as semantically complete (SC) questions. We then examine the contribution of the reference-aware representations for the SC and SI questions, respectively. Specifically, we compare DAN, which utilizes the reference-aware representations (i.e., $e_t^{ref}$), with No REFER, which exploits the question representations (i.e., $q_t$) only. From Table 3, we draw three observations: (1) DAN shows significantly better results than the No REFER model for SC questions. This validates that the context from the dialog history enriches the question information, even when the question is semantically complete. (2) SI questions obtain more benefits from the dialog history than SC questions. This indicates that DAN is more robust to SI questions than to SC questions. (3) A dialog agent faces greater difficulty in answering SI questions than SC questions. No REFER is equivalent to the FIND + RPN model in the ablation study section.
Table 3: VisDial v1.0 validation performance on the semantically complete (SC) and incomplete (SI) questions. We observe that SI questions obtain more benefits from the dialog history than SC questions.
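The pronoun-based split into SI and SC questions can be sketched as a simple lookup. The pronoun list is the one given above; the tokenization (lowercasing and splitting on non-letters) and the function name are assumptions, since the text does not specify how questions were tokenized.

```python
import re

# Pronoun list taken from the definition of SI questions above.
PRONOUNS = {
    "it", "its", "they", "their", "them", "these", "those",
    "this", "that", "he", "his", "him", "she", "her",
}

def is_semantically_incomplete(question):
    """True if the question contains at least one listed pronoun."""
    # Assumption: lowercase and keep runs of letters as tokens.
    tokens = re.findall(r"[a-z]+", question.lower())
    return any(token in PRONOUNS for token in tokens)

questions = [
    "Does it look like a nice one?",
    "How many people are in the image?",
    "Are they indoors or outside?",
]
labels = ["SI" if is_semantically_incomplete(q) else "SC" for q in questions]
```

Note that a surface-level rule of this kind also flags non-referential uses such as "that" in a relative clause.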

Ablation Study
In this section, we perform an ablation study on the VisDial v1.0 validation split with the following four model variants: (1) a model using only a single attention module, (2) a model that uses different image features (pre-trained VGG-16 is used), (3) a model that does not use the residual connection in the REFER module, and (4) a model that stacks the REFER modules up to four layers, each with a different number of attention heads.
Single Module. The first four rows in Table 4 show the performance of a single module. FIND denotes the use of the FIND module only, and REFER denotes the use of the single-layer REFER module only. Specifically, REFER uses the output of the REFER module as the encoder output. On the other hand, FIND takes the question feature (i.e., $q_t$) instead of the reference-aware representations (i.e., $e_t^{ref}$). The single-module models show relatively poor performance compared with the dual-module model. We believe that the results validate two hypotheses: (1) the VisDial task requires contextual information from the dialog history as well as visually grounded information, and (2) the REFER and FIND modules have complementary modeling abilities.
Image Features in FIND Module. To report the impact of image features, we replace the bottom-up attention features (Anderson et al., 2018) with ImageNet pre-trained VGG-16 (Simonyan and Zisserman, 2014) features. In detail, we use the output of the VGG-16 pool5 layer as image features. In Table 4, RPN denotes the use of region proposal networks (Ren et al., 2015), which is equivalent to the use of bottom-up attention features. Similar to the VQA task, we observe that DAN with bottom-up attention features achieves better performance than with VGG-16 features. In other words, the use of object-level features boosts the MRR performance of DAN.
Figure 4: Qualitative results on the VisDial v1.0 dataset. We visualize the attention over the dialog history from the REFER module and the visual attention from the FIND module. The object detection features with the top five attention weights are marked with colored boxes. A red box indicates the most salient visual feature. The attention from the REFER module is represented as shading; darker shading indicates a larger attention weight for each element of the dialog history. Our proposed model not only responds with the correct answer, but also selectively pays attention to the previous dialogs and salient image regions.
Residual Connection in REFER Module. We also conduct an ablation study to investigate the effectiveness of the residual connection in the REFER module. As shown in Table 4, the use of the residual connection (i.e., Res) boosts the MRR score of DAN. In other words, DAN benefits from deep residual learning, as in (Rocktäschel et al., 2015; Yang et al., 2016; Kim et al., 2016; Vaswani et al., 2017).

Stack of REFER Modules & Attention Heads. We stack the REFER modules up to four layers, each with a different number of attention heads, $h \in \{1, 2, 4, 8, 16, 32, 64\}$. In other words, we conduct the ablation experiments with twenty-eight models to set the hyperparameters of our model. Figure 3 shows the results of the ablation experiments. For $n \geq 2$, REFER (n) indicates that DAN uses a stack of $n$ identical REFER modules. Specifically, for each pair of successive modules, the output of the previous REFER module is fed into the next REFER module as a query (i.e., $q_t$). Due to the small number of elements in each dialog history, performance tends to decrease as the number of attention heads increases. The two-layer REFER module with four attention heads (i.e., REFER (2) and $h = 4$) performs the best among all models in the ablation study, recording 64.17% on MRR.
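The count of twenty-eight models follows from crossing the four stack depths with the seven head counts. The short sketch below simply enumerates that grid; the variable names are illustrative assumptions.

```python
from itertools import product

stack_depths = [1, 2, 3, 4]              # number of stacked REFER modules
head_counts = [1, 2, 4, 8, 16, 32, 64]   # number of attention heads h

# Every (depth, heads) pair is one ablation configuration.
configs = list(product(stack_depths, head_counts))
best_reported = (2, 4)                   # REFER (2) with h = 4, as reported above
```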

Qualitative Results
In this section, we visualize the inference mechanism of our proposed model. Figure 4 shows the qualitative results of DAN. Given a question that needs to be clarified, DAN correctly answers the question by selectively attending to each element of the dialog history and to salient image regions. For the visual attention, we mark the object detection features with the top five attention weights in each image. The attention weights from the REFER module are represented as shading; darker shading indicates a larger attention weight for each element of the dialog history. These attention weights are calculated by averaging over all the attention heads.

Conclusion
We introduce Dual Attention Networks (DAN) for visual reference resolution in the visual dialog task. DAN explicitly divides the visual reference resolution problem into a two-step process. Rather than relying on the previous visual attention maps as in prior work, DAN first linguistically resolves ambiguous references in a given question by using the REFER module. Then, it grounds the resolved references in the image by using the FIND module. We empirically validate our proposed model on the VisDial v1.0 and v0.9 datasets. DAN achieves new state-of-the-art performance, while being simpler and more grounded.