Open-Ended Visual Question Answering by Multi-Modal Domain Adaptation

We study the problem of visual question answering (VQA) by exploiting supervised domain adaptation, where there is a large amount of labeled data in the source domain but only limited labeled data in the target domain, with the goal of training a good target model. A straightforward solution is to fine-tune a pre-trained source model using the limited labeled target data, but this usually does not work well due to the considerable difference between the data distributions of the source and target domains. Moreover, the availability of multiple modalities (i.e., images, questions and answers) in VQA poses further challenges in modeling the transferability between modalities. In this paper, we address these issues by proposing a novel supervised multi-modal domain adaptation method for VQA that learns joint feature embeddings across different domains and modalities. Specifically, we align the data distributions of the source and target domains by considering the modalities both jointly and separately. Extensive experiments on the benchmark VQA 2.0 and VizWiz datasets demonstrate that our proposed method outperforms existing state-of-the-art baselines for open-ended VQA in this challenging domain adaptation setting.


Introduction
The task of visual question answering (VQA) is to build a model that answers questions about a given image. Recently, it has received much attention from researchers in the area of computer vision [1,13,14,24,28,29]. VQA requires techniques from both image recognition and natural language processing: most existing works use Convolutional Neural Networks (CNNs) to extract visual features from images and Recurrent Neural Networks (RNNs) to generate textual features from questions, and combine the two to generate the final answers.
However, most of the existing VQA datasets are artificially created and thus may not be suitable as training data for real-world applications. For example, VQA 2.0 [7] and Visual7W [30], arguably two of the most popular datasets for VQA, were created using images from MSCOCO [18] with questions asked by crowd workers. Therefore, the images are typically of high quality and the questions are less conversational than in reality. On the contrary, the recently proposed VizWiz [9] dataset was collected from blind people who take photos and ask questions about them. Therefore, the images in VizWiz are often of poor quality, and the questions are more conversational, while some of them might be unanswerable due to the poor quality of the images. The VizWiz dataset reflects a more realistic setting for VQA, but its size is much smaller due to the difficulty of collecting such data. A straightforward method to bridge this gap is to first train a model on the VQA 2.0 dataset and then fine-tune it using the VizWiz data. However, this solution provides only limited improvement, for two major reasons. First, the VQA datasets are constructed in different ways, making them differ significantly in visual features, textual features and answers. [22] ran an experiment to classify different VQA datasets with a simple multi-layer perceptron (MLP) of one hidden layer, which achieved over 98% accuracy. This is a strong indication of the significant bias across different datasets. Our experiments also show that directly fine-tuning the model trained on VQA 2.0 yields only minor improvement on VizWiz. Second, the two modalities (visual and textual) also pose a big challenge for generalizability across datasets: it is difficult to bridge the domain gap in a coordinated fashion when multiple modalities with no common feature representation are involved. Domain adaptation methods, which handle the difference between two domains,
have been developed to address the first issue [4,6,8,11,15,23,25,27]. However, most existing domain adaptation methods focus on single-modal tasks such as image classification and sentiment classification, and thus may not be directly applicable to multi-modal settings. Moreover, these methods are usually subject to a strong assumption on the label distribution, namely that the source domain and the target domain share the same (usually small) label space, which may be unrealistic in real-world applications. [20] proposed a new framework for unsupervised multi-modal domain adaptation, but it did not target VQA tasks. Recently, several VQA domain adaptation methods have been proposed to address the multi-modal challenge. However, to the best of our knowledge, all existing VQA domain adaptation methods focus on the multiple-choice setting, where several answer candidates are provided and the model only needs to select one of them. In contrast, we focus on the more challenging open-ended setting, where there is no prior knowledge of answer choices and the model can select any term from a vocabulary.
In this paper, we address the aforementioned challenges by proposing a novel multi-modal domain adaptation framework. We develop a method under this framework that simultaneously learns a domain-invariant and downstream-task-discriminative multi-modal feature embedding based on an adversarial loss and a classification loss. We additionally incorporate the maximum mean discrepancy (MMD) to further reduce the domain distribution mismatch for multiple modalities, i.e., visual embeddings, textual embeddings and joint embeddings. We conduct experiments on two popular VQA benchmark datasets. The results show that the proposed model outperforms the state-of-the-art VQA models and the proposed domain adaptation method surpasses other state-of-the-art domain adaptation methods on the VQA task. Our contributions are summarized as follows:
• We propose a novel supervised multi-modal domain adaptation framework.
• We tackle the more challenging open-ended VQA task with the proposed domain adaptation method. To the best of our knowledge, this is the first attempt to use domain adaptation for open-ended VQA.
• The proposed method simultaneously learns a domain-invariant and downstream-task-discriminative multi-modal feature embedding with an adversarial loss and a classification loss. At the same time, it minimizes the difference of cross-domain feature embeddings jointly over multiple modalities.
• We conduct extensive experiments between two popular VQA benchmark datasets, VQA 2.0 and VizWiz, and the results show the proposed method outperforms the existing state-of-the-art methods.

Related Works
VQA Datasets Over the past few years, several VQA datasets [2,7,9,16,30] and tasks have been proposed to encourage researchers to develop algorithms that answer visual questions. One limitation of many existing datasets is that they were created either automatically or from an existing large vision dataset like MSCOCO [18], with questions either generated automatically or contrived by human annotators on Amazon Mechanical Turk (AMT). Therefore, the images in these datasets are typically of high quality but the questions are less conversational. They might not be directly applicable to real-world applications such as [9], which aims to answer the visual questions asked by blind people in their daily life. The main differences between [9] and other artificial VQA datasets are as follows: 1) both the image and question quality of [9] are lower, as they suffer from poor lighting, out-of-focus images and audio recording problems such as a question being clipped at either end or catching background audio content; 2) the questions can be unanswerable, since blind people cannot verify whether the images contain the visual content they are asking about, due to blurring, inadequate lighting, framing errors, fingers covering the lens, etc. Our experiments also reveal that fine-tuning the model trained on the somewhat artificial VQA 2.0 dataset provides limited improvement on VizWiz, due to the significant difference in bias between these two datasets.
VQA Settings There are two main VQA settings, namely multiple choice and open-ended. Under the multiple-choice setting, the model is provided with multiple answer candidates and is expected to select the correct one. VQA models following this setting usually take characteristics of all answer candidates, such as word embeddings, as input to make a selection [12,22]. In the open-ended setting, by contrast, there is neither prior knowledge nor answer candidates provided, and the model can respond with any free-form answer. This makes the setting more challenging and realistic [1,13,14,24].
VQA Models Recently, a plethora of VQA models have been proposed [1,13,14,24,29]. Most of them consist of image and question encoders, and a multi-modal fusion module followed by a classification module. [13] used an LSTM to encode the question and a residual network [10] to compute the image features with a soft attention mechanism. [1] implemented bottom-up attention using Faster R-CNN [21] to extract features of detected image regions, and then a top-down mechanism used task-specific context to predict an attention distribution over the image regions; the final output was generated by an MLP after fusing the image and question features. [14] used bilinear attention between two groups of input channels on top of low-rank bilinear pooling, which extracted joint representations for each pair of channels. [24] proposed an approach that takes original image features, bottom-up attention features from an object detection module, question features and optical character recognition (OCR) strings detected in the image as input, and answers either with an answer from a fixed answer vocabulary or by selecting one of the detected OCR strings. Similar to the state-of-the-art model [24], our VQA base model also takes original image features, bottom-up attention features and question features to predict the final answer. Details of our VQA base model are described in the next section.

Domain Adaptation Domain adaptation techniques have been proposed to learn a common domain-invariant latent feature space in which the distributions of two domains are aligned. Recent works typically focus on transferring neural networks from a labeled source domain to a target domain with no or limited labeled data [4,6,8,11,15,23,25]. [11] optimized for domain invariance to facilitate domain transfer and used a soft label distribution matching loss to transfer information between tasks. [25] proposed a framework which combines discriminative modeling, untied
weight sharing and a GAN loss to reduce the difference between domains. [23] estimated the empirical Wasserstein distance between source and target samples and optimized the feature extractor network to minimize it in an adversarial manner. [4] utilized a gradient reversal layer to combine the training of the domain classifier, label classifier and feature extractor to align domains. Similarly, [8] simultaneously minimized the classification error, preserved the structure within and across domains, and imposed a similarity restriction on target samples. The major difference between our work and these works is that we propose a novel multi-modal domain adaptation framework, while these works assume a single modality.

Domain Adaptation for VQA Although domain adaptation has been successfully applied to computer vision tasks, its applicability to VQA has yet to be well studied. A recent work investigated domain adaptation for VQA [22]; it reduces the difference in statistical distributions by transforming the feature representation of the data in the target domain. However, one major limitation is the assumption of a multiple-choice setting, where four answer candidates are provided as input to the model. This is unrealistic in real-world applications because one can never guarantee that the ground-truth answer is among four candidates; moreover, it is unclear how to create answer candidates for an image-question pair. On the contrary, our model is only provided with an image-question pair and can generate any free-form answer, which makes our task more challenging and realistic.
The VQA Framework
In this section, we describe our base VQA framework. Given an image I and a question Q, the VQA model estimates the most likely answer â from a large vocabulary based on the content of the image, i.e., â = argmax_a p(a | I, Q). Our base framework consists of four components: 1) a question encoder; 2) an image encoder; 3) a multi-modal fusion module; and 4) a classification module at the output end. We elaborate on each component in the following subsections.

Question Encoding The question Q of length T is first tokenized and encoded using word embeddings based on pretrained GloVe [19] as S = {x_0, x_1, ..., x_T}. These embeddings are then fed into a GRU cell [3]. The encoded question q ∈ R^{d_q} is obtained from the last hidden state at time step T, where d_q is the feature dimension.
Image Encoding Similar to [1] and [24], we first feed the input image I to an object detector [5] pretrained on the Visual Genome dataset [16], based on Feature Pyramid Networks (FPN) [17] with ResNeXt [26] as the backbone. The output of the fully connected fc6 layer is used as the region-based features, i.e., V_r = {v_1, v_2, ..., v_K}, where v_i is the feature of the i-th object. Meanwhile, we divide the entire image into a 7 × 7 grid and obtain the grid-based features V_g by average-pooling features from the penultimate layer 5c of a ResNet-101 network [10] pretrained on the ImageNet dataset. Finally, we combine V_r and V_g as well as the question embedding q to obtain the joint feature embedding in a multi-modal fusion module, as described next.
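The question encoder described above can be sketched as follows. This is a minimal, hypothetical NumPy illustration of a single GRU cell run over GloVe word vectors (the actual model uses a learned GRU layer); all weight names and shapes are assumptions for exposition.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU step (Cho et al. [3]) with update gate z and reset gate r."""
    z = sigmoid(Wz @ x + Uz @ h)              # update gate
    r = sigmoid(Wr @ x + Ur @ h)              # reset gate
    h_tilde = np.tanh(Wh @ x + Uh @ (r * h))  # candidate hidden state
    return (1.0 - z) * h + z * h_tilde

def encode_question(word_vecs, params, d_q):
    """Run the GRU over the token embeddings; the question embedding q
    is the hidden state after the last time step T."""
    h = np.zeros(d_q)
    for x in word_vecs:  # one GloVe vector per token
        h = gru_step(x, h, *params)
    return h
```

In the full model the hidden dimension d_q is 1024 (see Implementation Details); here any small dimension works for illustration.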

Multi-Modal Fusion and Classification
Figure 1: The proposed multi-modal domain adaptation framework. X^a_s, X^b_s, X^a_t, X^b_t denote the original features for the two modalities. Blue arrows denote forward propagation, while orange arrows denote loss calculation. Purple and green arrows denote backward propagation for the discriminator loss L_adv. Note that the sign is reversed when the loss backpropagates through the gradient reversal layer in the domain discriminator.

The question embedding q is used to obtain the top-down, i.e., region-based attention on the image features V_r. The region-based features V_r are then averaged based on the attention weights to obtain the weighted region-based image features. Similarly, the grid-based features V_g are fused with the question embedding q by concatenation. The fused grid-based features and the weighted region-based image features are then concatenated to obtain the final image features, denoted as v = f_v(q, I; θ_v). The final joint embedding e = f_j(q, v) is calculated by taking the Hadamard product of q and v, and is then fed to an MLP f_c(e; θ_c) for classification, i.e., a = f_c(e; θ_c). The final answer is â = argmax_a f_c(e; θ_c).
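The attention-weighted fusion can be sketched roughly as follows. The scorer w_att, which scores each region from the concatenated [q; v_k], is an illustrative assumption (the paper does not specify the attention network at this level of detail), and q and v are taken to share a dimension so the Hadamard product is well defined.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # shift for numerical stability
    return e / e.sum()

def fuse(q, v_regions, w_att):
    """Question-guided attention over K region features, followed by a
    Hadamard product to form the joint embedding e = q * v."""
    K, d = v_regions.shape
    # score each region from [q; v_k] (assumed linear scorer)
    logits = np.array([w_att @ np.concatenate([q, v_regions[k]])
                       for k in range(K)])
    alpha = softmax(logits)                       # attention weights
    v = (alpha[:, None] * v_regions).sum(axis=0)  # weighted average
    return q * v                                  # Hadamard product
```

With a zero scorer the attention is uniform and v reduces to the plain average of the region features, which makes the sketch easy to sanity-check.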

Multi-Modal Domain Adaptation
In this section, we present our framework for supervised multi-modal domain adaptation. We assume there are two modalities of source samples X_s = [X^a_s, X^b_s], where a, b denote the two modalities, and labels Y_s drawn from a source joint distribution P_s(x, y), as well as two modalities of target samples X_t = [X^a_t, X^b_t] and labels Y_t drawn from a target joint distribution P_t(x, y). We also assume there are sufficient source data, so that a good pretrained source model can be built, but the amount of target data is limited, so that learning on the target data alone leads to poor performance. Our goal is to learn target representations for the two modalities f^a_t, f^b_t, a multi-modal fusion f^j_t and a target classifier f^c_t with the help of the pretrained source representations f^a_s, f^b_s, f^j_s and the source classifier f^c_s. For the VQA task in our work, a and b denote the visual and textual modalities, respectively.
A typical approach to achieving this goal is to regularize the learning of the source and target joint representations by minimizing the distance between the empirical distributions of the source and target domains, i.e., between f^j_s(f^a_s(X^a_s; θ^a_s), f^b_s(X^b_s; θ^b_s); θ^j_s) and f^j_t(f^a_t(X^a_t; θ^a_t), f^b_t(X^b_t; θ^b_t); θ^j_t). In this way, the data from the source and target domains are projected onto a similar latent space, such that a well-performing source model can lead to a well-performing target model. Following this idea, we propose a novel multi-modal domain adaptation framework, as shown in Figure 1.

Joint Embedding Alignment
We propose to reduce the difference of joint embeddings between the source and the target domains by minimizing the Maximum Mean Discrepancy (MMD). The intuition is that two distributions are identical if and only if all of their moments coincide. Suppose we have two distributions P_s, P_t over a set X. Let ϕ : X → H, where H is a reproducing kernel Hilbert space (RKHS). Then we have MMD(P_s, P_t) = ||μ_{P_s} − μ_{P_t}||_H, where μ_P = ∫ k(x, ·) P(dx) is the kernel mean embedding of P and k is a kernel function such as a Gaussian kernel. Let X_s = {x^s_1, ..., x^s_{n_s}} ∼ P_s and X_t = {x^t_1, ..., x^t_{n_t}} ∼ P_t. The empirical estimate of the squared distance between P_s and P_t is

MMD²(X_s, X_t) = (1/n_s²) Σ_{i,j} k(x^s_i, x^s_j) + (1/n_t²) Σ_{i,j} k(x^t_i, x^t_j) − (2/(n_s n_t)) Σ_{i,j} k(x^s_i, x^t_j).

We then define the loss function as L_j = MMD²(e_s, e_t), where e_s = f^j_s(f^a_s(X^a_s; θ^a_s), f^b_s(X^b_s; θ^b_s); θ^j_s) and e_t = f^j_t(f^a_t(X^a_t; θ^a_t), f^b_t(X^b_t; θ^b_t); θ^j_t). By minimizing the difference between the source and target joint embeddings, we enforce that the joint embeddings of both domains are projected onto a similar latent space.
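The empirical MMD estimate above can be sketched with a Gaussian kernel as follows. This is the standard biased (V-statistic) estimate; the kernel bandwidth sigma is a free parameter not fixed by the paper.

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    """k(x, y) = exp(-||x - y||^2 / (2 sigma^2)) for all row pairs."""
    sq_dists = np.sum((x[:, None, :] - y[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / (2.0 * sigma ** 2))

def mmd2(xs, xt, sigma=1.0):
    """Biased empirical estimate of squared MMD between two samples:
    mean k(s, s) + mean k(t, t) - 2 mean k(s, t)."""
    return (gaussian_kernel(xs, xs, sigma).mean()
            + gaussian_kernel(xt, xt, sigma).mean()
            - 2.0 * gaussian_kernel(xs, xt, sigma).mean())
```

For identical samples the estimate is exactly zero, and it grows toward 2 (for this kernel) as the two samples become well separated.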

Multi-Modal Embedding Alignment
Reducing multi-modal domain shift is more challenging than reducing conventional single-modal domain shift. The loss L_j above does not explicitly consider the multi-modal property, and aligning only the joint feature embedding is insufficient to adapt the source domain to the target domain. This is because the feature extractor for each modality has its own complexity of domain shift, which often differs across modalities (e.g., visual vs. textual), so aligning only the fused features cannot fully reduce the domain differences.
Therefore, we introduce terms that minimize the maximum mean discrepancy for each single modality, i.e., MMD²(f^a_s(X^a_s; θ^a_s), f^a_t(X^a_t; θ^a_t)) and MMD²(f^b_s(X^b_s; θ^b_s), f^b_t(X^b_t; θ^b_t)). The loss function we minimize can then be written as

L_mmd = L_j + γ_a MMD²(f^a_s(X^a_s; θ^a_s), f^a_t(X^a_t; θ^a_t)) + γ_b MMD²(f^b_s(X^b_s; θ^b_s), f^b_t(X^b_t; θ^b_t)),

where γ_a and γ_b are trade-off parameters for the two modalities.

Classification
While minimizing the distance between the source and target embeddings, we also want to maintain the classification performance on both the source domain and the target domain. As in a standard supervised learning setting, we employ the cross entropy loss for classification:

L_c = CE(f^c_s(e_s; θ^c_s), Y_s) + γ_c CE(f^c_t(e_t; θ^c_t), Y_t),

where CE denotes the standard cross entropy loss and γ_c is a trade-off parameter between the two domains.
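The two-domain classification loss can be sketched as below; the logits are assumed to come from the source and target classifiers f^c_s and f^c_t respectively.

```python
import numpy as np

def cross_entropy(logits, label):
    """CE for one example: -log softmax(logits)[label], with the usual
    max-shift for numerical stability."""
    z = logits - logits.max()
    return -(z[label] - np.log(np.exp(z).sum()))

def classification_loss(logits_s, y_s, logits_t, y_t, gamma_c=1.0):
    """L_c = CE(source) + gamma_c * CE(target), averaged per domain."""
    l_s = np.mean([cross_entropy(l, y) for l, y in zip(logits_s, y_s)])
    l_t = np.mean([cross_entropy(l, y) for l, y in zip(logits_t, y_t)])
    return l_s + gamma_c * l_t
```

Setting gamma_c = 0 recovers a source-only objective, which makes the trade-off role of the parameter explicit.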

Domain Discriminator
We also propose to use a domain classifier f_d to reduce the mismatch between the source domain and the target domain by confusing the domain classifier, such that it cannot correctly distinguish whether a sample comes from the source or the target domain. The domain classifier f_d has a similar structure to f^c_t or f^c_s, except that its last layer outputs a scalar in [0, 1] indicating how likely the sample is to come from the source domain. Thus, f_d can be optimized with a standard cross-entropy loss.
To make the features domain-invariant, the source and target mappings are optimized according to a constrained adversarial objective: the domain classifier minimizes this objective while the encoding model maximizes it. The generic formulation of the domain adversarial objective is

L_adv = −E_{x∼P_s}[log f_d(e(x))] − E_{x∼P_t}[log(1 − f_d(e(x)))].

For simplicity, we denote θ_F = (θ^a_s, θ^a_t, θ^b_s, θ^b_t, θ^j_s, θ^j_t) as the parameters of all feature mappings and θ_C = (θ^c_s, θ^c_t) as the parameters of all label predictors. Putting everything together, we obtain the final objective function to minimize:

L(θ_F, θ_C, θ_d) = L_mmd + L_c − λ L_adv,

where λ is a trade-off parameter and we seek the parameters (θ̂_F, θ̂_C, θ̂_d) that attain a saddle point of L:

(θ̂_F, θ̂_C) = argmin_{θ_F, θ_C} L(θ_F, θ_C, θ̂_d),  θ̂_d = argmax_{θ_d} L(θ̂_F, θ̂_C, θ_d).

At the saddle point, the parameters θ_d of the domain classifier minimize the domain classification loss L_adv, while the parameters θ_C of the label predictors minimize the label prediction loss L_c. The feature mapping parameters θ_F minimize the label prediction loss so that the features are discriminative, while maximizing the domain classification loss so that the features are domain-invariant.
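The gradient reversal layer referenced in Figure 1 can be sketched as below: an identity map in the forward pass whose backward pass flips the gradient sign (scaled by a factor lam), so that plain gradient descent on the domain loss trains the discriminator normally while pushing the feature extractor in the opposite direction. This is a toy NumPy illustration of the mechanism, not an autograd implementation.

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; the backward pass multiplies the
    incoming gradient by -lam, turning descent on L_adv into ascent for
    every layer below this one (i.e., the feature mappings)."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x  # pass features through unchanged

    def backward(self, grad_output):
        return -self.lam * grad_output  # flip and scale the gradient
```

In a real framework this is typically implemented as a custom autograd function; the behavior is the same.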

Experiments
In this section, we validate our method on the open-ended VQA task and compare it with state-of-the-art methods.

Datasets
Two standard VQA benchmarks are used in our experiments, VQA 2.0 [7] and VizWiz [9]. A comparison of the statistics of these datasets is listed in Table 1, which shows that the scale of VizWiz is much smaller in terms of the numbers of images and questions. Although VizWiz has more unique answers, only 824 of its top 3,000 answers overlap with the top 3,000 answers in VQA 2.0. This explains why models trained on VQA 2.0 perform poorly on VizWiz and have limited transferability. We find that 28.63% of the questions in VizWiz are not even answerable, for the reasons mentioned before, making the domain gap even more significant. Figure 2 shows some examples from both the VQA 2.0 and VizWiz datasets.

Evaluation Metrics
In VQA, each question is usually associated with 10 valid answers from 10 annotators. We follow the conventional evaluation metric for the open-ended VQA setting and compute the accuracy of a predicted answer a as

Acc(a) = min(#{annotators that provided a} / 3, 1),

so an answer is considered fully correct if at least three annotators agree on it. Note that the true answers of the VizWiz test set are not publicly available; to obtain performance on the test set, results need to be uploaded to the official online submission system (https://evalai.cloudcv.org/web/challenges/challenge-page/102).
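The per-answer accuracy rule above can be sketched as below. Note that the official evaluation additionally averages this score over all choose-9 subsets of the 10 annotators; the simplified form here captures only the min(·/3, 1) rule stated above.

```python
def vqa_accuracy(predicted, human_answers):
    """Open-ended VQA accuracy: an answer scores min(m / 3, 1), where m
    is the number of annotators who gave exactly that answer."""
    m = sum(1 for a in human_answers if a == predicted)
    return min(m / 3.0, 1.0)
```

Under this rule an answer matching three or more of the ten annotators gets full credit, and partial agreement gives partial credit.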

Implementation Details
In all our experiments, we extract K = 100 objects per image to construct the region-based features V_r and set the visual feature dimension to 2048. We set the hidden dimension of the GRU to 1024 and the hidden dimension after fusion to 4096. Questions are truncated at 24 tokens. In the training phase, we apply a warm-up strategy by gradually increasing the learning rate η from 0.001 to 0.01 during the first 2000 iterations; it is then multiplied by 0.15 after every 4000 iterations. We use a batch size of 128.
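The learning-rate schedule above can be sketched as follows. The shape of the warm-up (linear here) and counting decay steps from the end of warm-up are assumptions, since the text does not specify either.

```python
def learning_rate(step, warmup_lr=0.001, base_lr=0.01,
                  warmup_steps=2000, decay_every=4000, decay=0.15):
    """Warm up from warmup_lr to base_lr over warmup_steps iterations
    (assumed linear), then multiply by `decay` every `decay_every`
    iterations thereafter."""
    if step < warmup_steps:
        frac = step / warmup_steps
        return warmup_lr + (base_lr - warmup_lr) * frac
    n_decays = (step - warmup_steps) // decay_every
    return base_lr * decay ** n_decays
```

For example, under these assumptions the rate is 0.0055 halfway through warm-up and drops to 0.0015 after the first decay step.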

Experimental Setup
First, we conduct experiments using the VQA 2.0 dataset as the source domain and the VizWiz dataset as the target domain, to evaluate the effectiveness of our proposed method for multi-modal domain adaptation. We also conduct experiments in the opposite direction, using VizWiz as the source domain and VQA 2.0 as the target domain, to further demonstrate the effectiveness of our approach.
We emphasize that we deliberately do not use an overly strong base model (e.g., question embeddings from FastText, complex fusion techniques, OCR tokens, etc.), as our focus is on multi-modal adaptation rather than the base model itself. Despite that, we will show that our proposed domain adaptation method with a weaker base model still outperforms the fine-tuned state-of-the-art model.

Results and Analysis
Adaptation from VQA 2.0 to VizWiz As discussed in previous sections, we first pretrain a source model on the VQA 2.0 dataset and then adapt it to the target dataset VizWiz. The results of our proposed method and other leading methods are shown in Table 2.
We first compare our method with the original VizWiz baseline proposed by [9], the previous state-of-the-art VQA model BAN [14] and the current state-of-the-art VQA model Pythia [24]. Table 2 makes clear that our method outperforms these state-of-the-art models by a significant margin.
To validate that the better performance of our method is not due to a strong base model, we additionally report results for our method in Table 3, covering: 1) training our single base model from scratch using only the VizWiz dataset (Target only); 2) fine-tuning from the model pretrained on the VQA 2.0 dataset (Fine-tune); and 3) our proposed domain adaptation method (DA).

Results breakdown into answer categories Table 4 shows the accuracy breakdown over different answer categories. The results show that our model achieves new state-of-the-art performance on the "Number" and "Other" categories as well as in overall accuracy. Note that the overall accuracy for Pythia in this table is 54.22% rather than the reported 54.72%, which we were unable to reproduce using the released code and for which no breakdown numbers were reported. The best we could achieve with Pythia (after fine-tuning from VQA 2.0) is 54.22%, and the corresponding breakdown numbers are reported in the table.
Ablation study We conduct an ablation study to show the contributions of the different components of our method. Specifically, we consider:
1. Target only: training the base model using only the data in the target domain.
2. +Fine-tune: pretraining a model on the source VQA 2.0 dataset and then fine-tuning it on the target VizWiz dataset. Note that fine-tuning is not part of our adaptation method, so its numbers are marked inside "()".
3. +MMD on V and Q, CLS: our domain adaptation method with MMD alignment on visual and textual features separately, and classification modules applied to both domains.
4. +MMD, GRL on joint: our domain adaptation method with MMD alignment also on the joint embeddings of both domains, along with the domain discriminator via the gradient reversal layer.
5. +Ensemble of 3 models.
The results show that the multi-modal MMD brings the most significant performance gain, which validates that aligning every single modality is beneficial to the transferability of multi-modal tasks. In addition, MMD on the joint embedding and the discriminator bring further improvements.

Conclusion
We have presented a novel supervised multi-modal domain adaptation framework for open-ended visual question answering. Under this framework, we have developed a new method for VQA which simultaneously learns a domain-invariant and downstream-task-discriminative multi-modal feature embedding. We validate our proposed method on two popular VQA benchmark datasets, VQA 2.0 and VizWiz, in both directions of adaptation. The experimental results show that our method outperforms the state-of-the-art methods.

Figure 2 :
Figure 2: Sample image-question pairs and valid answers from the VQA 2.0 and VizWiz datasets. Note that for each image-question pair, there are 10 answers provided by 10 different crowd workers.

Table 1 :
The statistics of the VQA 2.0 and VizWiz datasets. Numbers denote train/validation/test information, respectively.

Table 2 :
Accuracy (in %) of different methods on VizWiz.

Table 3 :
Accuracy (in %) comparison for our base model. Target only denotes training from scratch, Fine-tune denotes fine-tuning, and DA denotes our domain adaptation method.

From Table 3, we see that our model fine-tuned from VQA 2.0 is about 0.75 percentage points worse than Pythia fine-tuned from VQA 2.0 (53.97% vs. 54.72%), indicating that the better performance of our final model over the state-of-the-art does not come from a stronger base model. Moreover, the accuracy of our base model trained from scratch is 53.11%, 0.6 percentage points behind Pythia trained from scratch, which is consistent with our observation that our method can achieve superior final results even with a weaker base model.

Table 4 :
Results breakdown into different categories for different methods for domain adaptation from VQA 2.0 to VizWiz. Breakdown numbers are performance on the VizWiz test-dev split.

Table 7 :
Results comparison using less data.

Table 8 :
Accuracy (in %) comparison for our single base model adapted from VizWiz to VQA 2.0. Our adapted model achieves comparable performance to the state-of-the-art on VQA 2.0.