Visual-Textual Alignment for Graph Inference in Visual Dialog

As a conversational intelligence task, visual dialog entails answering a series of questions grounded in an image, using the dialog history as context. To generate correct answers, it is critical to comprehend the semantic dependencies among implicit visual and textual contents. Prior works usually ignored these underlying relations and failed to reason over them. In this paper, we propose a Visual-Textual Alignment for Graph Inference (VTAGI) network, which compensates for the lack of structural inference in visual dialog. The whole system consists of two modules, Visual and Textual Alignment (VTA) and Visual Graph Attended by Text (VGAT). Specifically, the VTA module aims at representing an image with a set of integrated visual regions and corresponding textual concepts, reflecting certain semantics. The VGAT module views the visual features with semantic information as observed nodes, and each node learns its relationships with the others in a visual graph. We evaluate the model qualitatively and quantitatively on the VisDial v1.0 dataset, showing that VTAGI outperforms previous state-of-the-art models.


Introduction
Cross-modal semantic understanding has become an attractive challenge in natural language processing and computer vision, inspiring many tasks such as image captioning (Xu et al., 2015; Vinyals et al., 2015) and visual question answering (VQA) (Antol et al., 2015; Anderson et al., 2018; Shimizu et al., 2018). However, in these tasks, the co-reference between vision and language is usually performed in a single round, without sustained interaction with humans over a period of time. In 2017, Das et al. introduced a continuous conversational task, visual dialog (Das et al., 2017). This task requires an AI agent to answer a sequence of questions based on visually grounded information and contextual information from the dialog history.
Recently, a manual investigation (Kim et al., 2020) of the Visual Dialog dataset (VisDial) examined how many questions can be answered from the image alone and how many require the conversation history. The investigation shows that around 80% of the questions can be answered with images and about 20% of the questions need knowledge from the dialog history. Therefore, one of the key challenges in visual dialog is how to effectively utilize the underlying contents in the textual and visual information, i.e., the input question, the dialog history and the input image. Previous works such as RvA (Niu et al., 2019) and DAN (Kang et al., 2019) tended to explicitly reason over past dialog interactions by referring back to previous references, but they ignored the underlying relational structure that contributes to dialog inference. More recently, researchers have attempted to use fixed graph attention or embedding to address the problem with structural representations (Schwartz et al., 2019). They focused on the textual modality but neglected the rich underlying information in the image. Despite its significance to artificial intelligence and human-computer interaction, this task requires the agent to understand a series of multi-modal entities and to reason over the rich information in both vision and language. An ideal inference algorithm should be able to find the underlying relational structure and give a reasonable answer based on this structure.
Figure 1: An overview of VTAGI. We present two modules, VTA and VGAT. VTA takes the attended visual features and the attended textual features (question and history) as inputs, resulting in integrated image representations that reflect the semantics of certain objects in the image. VGAT aims to construct a visual graph combined with textual context. The relations shown among the nodes are the top-5 most important ones. For example, the thickest link, between the nodes $v_1$ and $v_k$, indicates the most important dependency between them. ⊗ and ⊙ denote matrix multiplication and element-wise product, respectively.
To address the aforementioned problem, we pay more attention to the visual-textual relation and propose VTAGI (Figure 1) to explore potential information for structural inference. The agent first obtains the question features, history features and visual features by employing different attention mechanisms. However, the semantics of the visual features and the textual concepts are usually inconsistent, and the image representations lack global structural information. Thus, the VTA module aligns the visual features and global textual contents with their relevant counterparts in each image domain. As a result, the visual features contain more specific semantic information. For example, in Figure 1, the visual feature $v_1$ contains the semantic information of "giraffe, only, beside, trees", because each visual feature considers all contextual information. In order to infer more reasonably and connect all of the individual visual features, we design the VGAT module to construct a visual graph that captures the different relationships among the visual features. For example, considering the feature $v_1$ in Figure 1, the thickest link, between $v_1$ and $v_k$, indicates the most important relationship between the two features. This module learns how to select the other nodes related to the current node. In the last step of this process, each visual feature node in this structural module is connected to its related nodes. Through the two modules, the final visual features possess more related information and are connected to one another in the graph, both of which benefit inference.

Related Work
Visual Dialog. Most studies on the visual dialog task introduced by Das et al. (2017) can be categorized into four groups. Fusion-based Models: late fusion (LF) (Das et al., 2017) and hierarchical recurrent network (HRE) (Das et al., 2017) directly encoded the multi-modal inputs (image, question, dialog history) and decoded the answer. Attention-based Models: memory network (MN) (Das et al., 2017), history-conditioned image attention (HCIAE) (Lu et al., 2017), sequential co-attention (CoAtt) (Wu et al., 2018) and synergistic co-attention network (Sync) (Guo et al., 2019) computed attended representations of the inputs. Visual Co-reference Resolution (VCoR)-based Models: attention memory (AMEM) (Seo et al., 2017), neural module networks (CorefNMN) (Kottur et al., 2018), recursive visual attention mechanism (RvA) (Niu et al., 2019) and dual attention network (DAN) (Kang et al., 2019) clarified ambiguous expressions (e.g., he, she, they) in the text and focused on explicit visual co-reference resolution. Graph-based Models: these attempt to construct structures to obtain more underlying information. Zheng et al. designed a structural inference model based on EM-style (expectation-maximization) graph neural networks (GNNs) to conduct textual co-reference. Schwartz et al. (2019) proposed a factor graph mechanism and constructed the graph over all the multi-modal features. Guo et al. (2020) utilized the word-level attention of the question to construct a context-aware graph. The aforementioned graph-based models did not highlight the visual features and their relationships. In contrast, our work, which also belongs to the fourth group, builds a relational graph over visual objects in order to capture more information.
Visual-semantic Alignment. In image captioning, Karpathy and Fei-Fei (2015) introduced the notion of visual-semantic alignment, which was based on a novel combination of convolutional neural networks over image regions, bidirectional recurrent neural networks over sentences and a structured objective that aligned the two modalities through a multi-modal embedding. In the field of VQA, some recent efforts (Nam et al., 2017; Kim et al., 2018; Nguyen and Okatani, 2018; Ben-Younes et al., 2017) have also been dedicated to studying a similar alignment between image and question. To acquire integrated image representations, they normally aligned the visual features and textual concepts, which helped to explore the latent relations. In this paper, we align heterogeneous modalities (image, question and history) based on distinct attention mechanisms, so that each region in the image possesses more specific and detailed contents. In particular, we make the visual features contain two levels of semantic information: by adding history features, the visual features acquire the global textual information, and by integrating question features, they also obtain the logical contents.
Graph Neural Network. The concept of the graph neural network (GNN) was first proposed by Scarselli et al. (2008), who extended existing neural networks to process data represented in the graph domain. GNNs have been applied in various tasks (Gu et al., 2019; Liu et al., 2018). The core idea is to combine the graph-structured representation with neural networks. A GNN follows a strategy that controls how the representation vector of a node is calculated from its neighboring nodes in order to capture specific patterns of a graph. The neighborhood connectivity in GNNs is unrestricted and potentially irregular, giving them greater applicability than convolutional neural networks (CNNs), which impose a fixed, regular neighborhood structure. In this paper, we apply a GNN to learn the relations among visual features with multi-modal contexts for inferring answers. Through this GNN, each visual feature can connect with other associated features.

Proposed Approach
In this section, we first define the visual dialog task as in Das et al. (2017). Formally, a visual dialog agent takes an image $I$, a question $Q_t$ and a dialog history $H_t$ as input. $Q_t$ is the question asked in the current round $t$. $H_t$ consists of the Q&A pairs up to round $t-1$, while in the first round it contains only the caption $C$ of the image $I$. The agent is required to return an answer to $Q_t$ by ranking a list of 100 candidate answers $A_t = \{A_t^1, A_t^2, \cdots, A_t^{100}\}$ in a discriminative manner.
We present the language features and the image features in Section 3.1, followed by Section 3.2 describing the VTA module. Finally, the details of the VGAT module are provided in Section 3.3.

Feature Representation
Language Features. We first embed each word in the question $Q_t$ as $W_Q = \{w_{t,1}, w_{t,2}, \cdots, w_{t,T}\}$ using GloVe (Pennington et al., 2014) embeddings, where $T$ denotes the number of tokens in $Q_t$. Then we use a Bi-LSTM to encode $W_Q$ into a sequence $U_Q = \{q_t^1, q_t^2, \cdots, q_t^T\}$. Similarly, we obtain the history embedding vectors $W_H$ and the sequence representation $U_H = \{h_i\}_{i=0}^{t-1}$. We then adopt the attention mechanism (Vaswani et al., 2017) of Eq.1 to obtain the question features $Q$ by setting the inputs $Q_A$, $K_A$ and $V_A$ to $U_Q$. In the same way, the history features $H$ attended by the question $Q$ are obtained with Eq.1, but the inputs $K_A$ and $V_A$ come from $U_H$ and $Q_A$ comes from the attended vectors of $Q$. As shown in Figure 2, the question features pay particular attention to pronouns and nouns, such as "it" and "zoo". The history features, attended by the question, pay more attention to global and logical content, such as "giraffe beside trees" and "No people".
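The attention step described above can be illustrated with a small sketch. It shows single-head scaled dot-product attention in the sense of Vaswani et al. (2017), written in plain Python; it is not the authors' implementation. The function names `softmax` and `scaled_dot_product_attention` and the toy vectors are assumptions made for this example, and the learned projections and multiple heads are omitted.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    top = max(xs)
    exps = [math.exp(x - top) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(queries, keys, values):
    """Each query attends over all keys; returns one output vector per query."""
    scale = math.sqrt(len(keys[0]))  # sqrt of the key dimension
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors, one output dimension at a time.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

# Toy stand-ins for U_Q (3 question tokens) and U_H (2 history entries).
u_q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
u_h = [[0.5, 0.5], [1.0, 0.0]]

# Question features: queries, keys and values all come from U_Q.
q_att = scaled_dot_product_attention(u_q, u_q, u_q)
# History features: queries come from the attended question, keys/values from U_H.
h_att = scaled_dot_product_attention(q_att, u_h, u_h)
```

In both calls the output has one row per query, so `h_att` keeps the length of the question sequence while its content is drawn from the history.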
Visual Features. Inspired by bottom-up attention (Anderson et al., 2018), we use Faster R-CNN (Ren et al., 2015) to extract object-level image features $\{v_1, v_2, \cdots, v_k\}$. First, we fuse the question and history features by the matrix multiplication of Eq.2. Then, the co-attention (Lu et al., 2016) of Eq.3 is exploited to obtain the attended visual features $V$, where $C$ is the affinity matrix and $H_v$ is the image attention map. In this paper, we select the top-$k$ region proposals from each image, where $k$ is simply fixed to 36.

Visual and Textual Alignment
In most previous works in this area, visual features generally contain low-level visual information and are difficult to align with textual contents. The purpose of the VTA module is to form an accurate alignment between the visual regions and the textual words. To this end, the global textual concepts are introduced to compensate for the lack of high-level semantic information in the visual features. As shown in Figure 2, to obtain the final visual features with matched semantic features, such as $v_1$ matched with "giraffe only, beside trees", we process the history and question features successively. We adopt the attention mechanism of Vaswani et al. (2017) to learn the correlated features in one domain by querying the other domain. The multi-head attention is composed of $h$ parallel heads, and each head is formulated as a scaled dot-product attention. We compute the alignment between visual and history features with this multi-head attention, where $V \in \mathbb{R}^{k \times d_h}$ and $H \in \mathbb{R}^{t \times d_h}$ hold the $k$ visual features and the $t$ history features respectively, and $W^o_1 \in \mathbb{R}^{d_h \times d_k}$ is a parameter to be learned. The multi-head attention integrates the $t$ history features into the $k$ visual features. In this step, the visual features are matched with the global textual features. As illustrated in Figure 2, $v_k$ matches with "trees, yes, giraffe". Similarly, we integrate the $t$ question features into the $k$ visual features with Eq.6 and Eq.7, which adds the logical semantic information "fence, no" to the feature $v_k$.
Figure 2: Visual and Textual Alignment (VTA) module. We first integrate the attended visual features and history features as inputs to obtain global semantic information. For example, $v_k$ is first matched with "trees yes, giraffe". Next, the question features are deployed to supply the logical semantic information for the visual features. For example, the logical semantic features ("fence no") corresponding to $v_k$ are added, and the logical information is underlined in green. $v_1, v_2, \cdots, v_k$ are visual features matched with certain textual contents.
Finally, the visual features $V_f$ contain more consistent semantic features, including global and logical contexts, which makes the subsequent graph construction more accurate and effective.
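The two-stage alignment can be sketched as follows. This is only an illustration of the data flow (history first, then question), not the paper's formulation: it uses a single attention head without learned projections, and it combines each visual feature with its attended textual context by a residual sum, which is an assumption of this sketch. The names `attend`, `align` and `vta` are hypothetical.

```python
import math

def attend(queries, keys, values):
    """Single-head scaled dot-product attention over lists of vectors."""
    scale = math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs

def align(visual, textual):
    """Integrate textual features into each visual feature.

    ASSUMPTION: the attended text is added residually to the visual feature;
    the paper's own alignment equations are not reproduced here.
    """
    attended = attend(visual, textual, textual)
    return [[v + a for v, a in zip(v_row, a_row)]
            for v_row, a_row in zip(visual, attended)]

def vta(visual, history, question):
    """Two-stage alignment: history (global) first, then question (logical)."""
    v_h = align(visual, history)
    return align(v_h, question)

# Toy inputs: k = 3 visual features, t = 2 history features, 4 question features.
visual = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
history = [[0.2, 0.1], [0.0, 0.3]]
question = [[0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.2, 0.0]]
v_f = vta(visual, history, question)
```

Whatever the number of textual features, the result keeps one row per visual feature, which mirrors the statement that the textual features are integrated into the $k$ visual features.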

Visual Graph Attended by Text
The previous VTA module ensures that the refined visual features contain only homogeneous information. For example, the visual feature $v_k$ in Figure 2 contains only the related information of "fence no", but not "only". However, the visual features remain independent and isolated. In order to establish the latent connections among them, we introduce the VGAT module. This module aims to build a graph that takes both visual and textual contents into account. Here, we build the visual graph by finding the visual relationships in the sentence-level and word-level textual information and the corresponding semantics in the visual features. The graph is denoted as $G = \{V_f, \varepsilon\}$, where the node $v_i$ denotes a joint visual feature and the directed edge $\varepsilon_{i \to j}$ represents the relational dependency from node $v_i$ to node $v_j$ ($i, j = 1, 2, \cdots, k$). As Figure 3 shows, at each step $S_i$ ($i = 0, 1, \cdots, k$) the construction of the graph involves two textual operations, drawn in different colors. The graph at step $S_i$ is denoted as $G^{(i)} = \{V_f, \varepsilon^{(i)}\}$, where $[;]$ is the concatenation operation, $T_s$ is the visual-related textual feature in the sentence-level stage and $T_w^i$ is the textual feature in the word-level stage. To construct the original visual graph in $S_0$ by introducing sentence-level information, we first calculate the question and history features attended by the visual features that are generated by the VTA module, and then concatenate the two textual features.
Figure 3: In two different stages, we use different colors to represent the different textual operations influencing the construction of the visual graph. For example, in step $S_0$, we construct the graph based on the visual features that are guided by sentence-level question and history features (pink links). In step $S_1$, to select related neighbor visual features, we focus on the feature $v_1$ and integrate the word-level textual features (blue links). The thicker the connection between nodes, the more important the relationship between them. Finally, in step $S_k$, we focus on the feature $v_k$ and the whole VGAT module is finished.
In each subsequent step $S_i$ ($i > 0$), we adopt the word-level textual features $T_w^i$. Next, we describe the correlation among different nodes in the graph $G$. We define $A^{(i)} \in \mathbb{R}^{k \times k}$ as the adjacency correlation matrix of $G^{(i)}$. In this matrix, the value $A^{(i)}_{p \to q}$ represents the connection weight of the edge $\varepsilon^{(i)}_{p \to q}$; it is computed with the learnable parameters $W_1$, $W_2$, $W_3$ and the element-wise product $\odot$.
Typically, only a subset of the detected objects in an image is related to a given textual content. Therefore, at each step the node in focus is required to connect only with its most relevant neighbor nodes. To obtain the set of relevant nodes $R^{(i)}$ in $G^{(i)}$ ($i = 1, 2, \cdots, k$), we adopt a ranking method: $R^{(i)} = \text{top-5}(A^{(i)})$, where top-5 returns the indices of the 5 largest values in the matrix $A^{(i)}$. $R^{(i)}$ retains the most relevant nodes contributing to the final answer inference. As a result, the learning on each node in the graph not only integrates visual and textual features, but also involves context-visual relational learning. In this module, we establish links among all the previously independent visual features.
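The top-5 selection can be illustrated with a short sketch. It assumes that, at step $S_i$, the selection is applied to the row of the adjacency matrix belonging to the node in focus; the text above does not spell out this detail, so the row-wise reading, the function name `top_k_neighbors` and the toy weights are assumptions of this example.

```python
import heapq

def top_k_neighbors(adjacency, node, k=5):
    """Return the indices of the k largest weights in row `node`, largest first."""
    row = adjacency[node]
    return heapq.nlargest(k, range(len(row)), key=row.__getitem__)

# Toy 7 x 7 adjacency matrix with distinct weights in every row.
# These numbers are illustrative only, not model outputs.
adjacency = [[((3 * p + 5 * q) % 7) / 10 for q in range(7)] for p in range(7)]

relevant = top_k_neighbors(adjacency, node=0)
print(relevant)  # indices of the five strongest connections of node 0
```

With $k = 36$ regions per image, keeping five neighbors per node removes most of the 36 candidate edges of each node, so only the strongest relations take part in the later inference.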
Finally, we learn the joint representation $e_t$ of the textual and visual features, which is fed into the discriminative decoder. Here $W_g$, $b_g$, $W_e$ and $P_g$ are learnable parameters, $Q$ and $H$ are the attended textual features described in Section 3.1, and $e_{vg}$ denotes the attended graph visual representation.
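How a discriminative decoder turns the joint representation into a ranking over candidate answers can be sketched as below. The sketch assumes that each candidate is already encoded as a vector and scored by its dot product with the joint representation, in the style of the discriminative decoder of Das et al. (2017); the scoring rule, the function name `rank_candidates` and the toy vectors are assumptions for illustration rather than details given in this paper.

```python
def rank_candidates(encoding, candidates):
    """Rank candidate answer vectors by dot-product score with the encoding.

    Returns a list where entry i is the 1-indexed rank of candidate i
    (rank 1 = highest score).
    """
    scores = [sum(e * c for e, c in zip(encoding, cand)) for cand in candidates]
    order = sorted(range(len(candidates)), key=lambda i: scores[i], reverse=True)
    ranks = [0] * len(candidates)
    for position, idx in enumerate(order, start=1):
        ranks[idx] = position
    return ranks

# Toy example with 3 candidates instead of 100.
e_t = [1.0, 0.0]
candidates = [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]]
ranks = rank_candidates(e_t, candidates)  # candidate 1 scores highest
```

The rank assigned to the ground-truth candidate is the quantity that the retrieval metrics in the evaluation are computed from.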

Dataset and Evaluation Metrics
We evaluate the proposed approach on VisDial v1.0 (Das et al., 2017), which includes an additional 10k COCO-like images from Flickr compared with v0.9 (Das et al., 2017). The collection of dialogs on Flickr images is similar to that on MS-COCO images (Lin et al., 2014). The train, validation and test sets in the v1.0 dataset contain 123k, 2k and 8k dialogs, respectively. Unlike the train and validation sets in v1.0, where each image is associated with a 10-round dialog, each dialog in the test set has a random length of up to 10 rounds. We follow Das et al. (2017) in evaluating the response at each round. Specifically, the dialog agent is given a list of 100 candidate answers, and the model is expected to rank the candidates and return the ranked list for evaluation. The standard retrieval metrics are: mean rank of the ground-truth response (Mean); recall@K (R@K, K = 1, 5, 10), the fraction of questions for which the ground truth is ranked within the top K of the sorted list; mean reciprocal rank (MRR), the average of the reciprocal of the rank at which the ground-truth answer is positioned; and normalized discounted cumulative gain (NDCG), which evaluates the relative relevance of the predicted answers. Higher values of R@K, MRR and NDCG are better, while a lower value of Mean is better.
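The rank-based metrics above can be computed directly from the rank of the ground-truth answer in each round. The following is a minimal sketch under that assumption; the function name `retrieval_metrics` and the example ranks are hypothetical, and NDCG is left out because it additionally requires per-candidate relevance annotations.

```python
def retrieval_metrics(gt_ranks, ks=(1, 5, 10)):
    """Compute Mean, MRR and R@K from 1-indexed ground-truth ranks."""
    n = len(gt_ranks)
    metrics = {
        "mean": sum(gt_ranks) / n,                 # lower is better
        "mrr": sum(1.0 / r for r in gt_ranks) / n  # higher is better
    }
    for k in ks:
        # Fraction of rounds whose ground truth is ranked within the top k.
        metrics[f"r@{k}"] = sum(r <= k for r in gt_ranks) / n
    return metrics

# Four hypothetical rounds in which the ground truth was ranked 1st, 2nd, 4th and 10th.
example = retrieval_metrics([1, 2, 4, 10])
```

For these four ranks, Mean is 4.25, MRR is 0.4625, and R@1, R@5 and R@10 are 0.25, 0.75 and 1.0.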

Quantitative Results
Compared Methods. We compare our proposed model with the state-of-the-art approaches on the VisDial v1.0 dataset. Based on the design of their encoders, these methods can be grouped into: Fusion-based Models (LF and HRE (Das et al., 2017)), which fused image, question and history features at different stages; Attention-based Models (MN (Das et al., 2017) and Sync (Guo et al., 2019)), which established attention mechanisms over image, question and history; Visual Co-reference Resolution (VCoR)-based Models (CorefNMN (Kottur et al., 2018), RvA (Niu et al., 2019), DAN (Kang et al., 2019) and HACAN (Yang et al., 2019)), which focused on explicit visual co-reference resolution based on textual features; and Graph-based Models (GNN, FGA (Schwartz et al., 2019) and CAG (Guo et al., 2020)), which proposed graph structures to explore more information from different modalities. The first two groups did not fully integrate textual and image information. In the third group, the extracted information was too scattered and lacked structural guidance.
Results on VisDial v1.0. As shown in Table 1, our VTAGI outperforms the state-of-the-art methods across all the metrics. We mainly compare our method with the graph-based ones. GNN constructed a graph exploring the dependencies within the textual history. In contrast, our model builds a graph over the visual objects integrated with question and history contexts. Compared with GNN, our model achieves a 5.2% improvement on NDCG. FGA (Schwartz et al., 2019) constructed a graph that simply combined the representations of all modalities. In contrast, our method focuses more on the relationships among visual features. Compared with FGA, our model achieves about a 6% improvement on NDCG. CAG (Guo et al., 2020), which designed a visual graph guided by the current question, achieved the best NDCG among previous graph-based methods for visual dialog. In our method, however, the visual graph is guided by question and history features at different (sentence and word) levels in different stages, and the alignment operation makes the information extraction more accurate. Specifically, compared with CAG (Guo et al., 2020), our result lifts NDCG from 56.64 to 58.02. In Figure 4, we show four examples of our graphical inference. For convenience, in these examples we only show the top-2 related objects. Each example has three processes, $P_1$, $P_2$ and $P_3$. For example, in Figure 4(a), $P_3$ means that the dialog history already includes $C$, $Q_1$&$A_1$ and $Q_2$&$A_2$, and the current question is $Q_3$. In this process, the most important relationships are between the boy and his clothes. Thus, the agent can relate the boy to his clothes and infer that the answer to $Q_3$ is "Yes.".

Ablation Study
In this section, we perform an ablation study on the VisDial v1.0 dataset with the following two model variants: a model using only the VGAT module (B+VGAT) and a model using only the VTA module (B+VTA). The baseline model (B) was introduced by Niu et al. (2019), who proposed a novel attention mechanism, RvA, to capture question-relevant dialog history but ignored structural visual inference based on semantics. In our work, the main system not only aligns the visual and textual contents, but also constructs a graph of visual relational features for effective inference. The variants B+VTA+VGAT (w/o question) and B+VTA+VGAT (w/o history) confirm the importance of the question and history information in the VGAT module, respectively. In Table 2, B+VGAT and B+VTA each improve the NDCG by about 2%. Meanwhile, the combined architecture (B+VGAT+VTA) raises the NDCG from 55.59% to 58.02%.
Figure 4: Visualization results of our model. It shows the relationships among various regions in the given image; the values on the links are calculated from the semantics in the contextual and visual information. The higher the value, the thicker the line and the more important the relationship between the two objects. Different objects are linked with differently colored lines. For example, in (c) $P_3$, the question $Q_3$ focuses on the bed in the given image. In the visual graph, we can see that the object "bed" is connected to "quilt" and "pillow" by the two lines with the highest weights, 0.42 and 0.38, respectively.
Our ablation experiments illustrate the necessity and rationality of each part of our model. The VTA module allows the visual representations to describe salient image regions from a semantic perspective through the alignment between textual and visual features. This module provides more of the underlying information in the image, so that the following VGAT module can make use of the information in both textual and visual features to learn the relationships among all features in the given image. The VGAT module makes the visual features more fine-grained and correlated, and the structure of the visual graph is helpful for answer inference. The experimental results show that our method is superior to the baseline and to the other graph-based models.

Conclusion
In this paper, we introduce the Visual-Textual Alignment for Graph Inference (VTAGI) network, a graph-based method for the visual dialog task. Rather than relying on visual attention maps as in prior works, VTAGI introduces an alignment operation influenced by textual information together with a graph neural network approach. Our method aims to obtain more fine-grained and semantically grounded image representations with the help of linguistic clues. We empirically validate our proposed model on the VisDial v1.0 dataset. Results show that our method is able to find and utilize underlying information for dialog inference, demonstrating its effectiveness. In future work, we aim to integrate positional relationships among visual objects by understanding the context.