Multimodal Differential Network for Visual Question Generation

Generating natural questions from an image is a semantic task that requires using visual and language modalities to learn multimodal representations. Images can have multiple visual and language contexts that are relevant for generating questions, namely places, captions, and tags. In this paper, we propose the use of exemplars for obtaining the relevant context. We obtain this by using a Multimodal Differential Network to produce natural and engaging questions. The generated questions show a remarkable similarity to the natural questions, as validated by a human study. Further, we observe that the proposed approach substantially improves over state-of-the-art benchmarks on the quantitative metrics (BLEU, METEOR, ROUGE, and CIDEr).


Introduction
To understand the progress towards multimedia vision and language understanding, a visual Turing test was proposed by (Geman et al., 2015) that was aimed at visual question answering (Antol et al., 2015). Visual Dialog is a natural extension of VQA. The evaluation of current dialog systems in (Chattopadhyay et al., 2017) shows that AI-AI dialog systems improve when trained between bots, but that this does not translate to actual improvement for Human-AI dialog. We believe that this is because the questions generated by bots are not natural (human-like) and therefore do not translate to improved human dialog. Therefore an improvement in the quality of questions could enable dialog agents to perform well in human interactions. Further, (Ganju et al., 2017) show that unanswered questions can be used for improving VQA, image captioning and object classification. An interesting line of work in this respect is the work of (Mostafazadeh et al., 2016). Here the authors have proposed the challenging task of generating natural questions for an image. One aspect that is central to a question is the context that is relevant to generate it. However, this context changes for every image. As can be seen in Figure 1, an image with a person on a skateboard would result in questions related to the event, whereas for a little girl, the questions could be related to age rather than the action. How can one provide such widely varying context for generating questions? To solve this problem, we use the context obtained by considering exemplars; specifically, we use the difference between relevant and irrelevant exemplars. We consider different contexts in the form of Location, Caption, and Part of Speech tags. Our method implicitly uses a differential context obtained through supporting and contrasting exemplars to obtain a differential embedding. This embedding is used by a question decoder to decode the appropriate question.
As discussed later, we observe this implicit differential context to perform better than an explicit keyword based context. The difference between the two approaches is illustrated in Figure 2. This also allows for better optimization as we can backpropagate through the whole network. We provide detailed empirical evidence to support our hypothesis. As seen in Figure 1, our method generates natural questions and improves over the state-of-the-art techniques for this problem.
Figure 2: Here we provide intuition for using implicit embeddings instead of explicit ones. As explained in section 1, the questions obtained by the implicit embeddings are more natural and holistic than those obtained by the explicit ones.
To summarize, we propose a multimodal differential network to solve the task of visual question generation. Our contributions are: (1) A method to incorporate exemplars to learn differential embeddings that capture the subtle differences between supporting and contrasting examples and aid in generating natural questions. (2) Multimodal differential embeddings, as image or text alone does not capture the whole context; we show that these embeddings outperform the ablations which incorporate cues such as only image, tags or place information. (3) A thorough comparison of the proposed network against state-of-the-art benchmarks along with a user study and statistical significance test.

Related Work
Generating a natural and engaging question is an interesting and challenging task for a smart robot (such as a chatbot). It is a step towards having a natural visual dialog instead of the widely prevalent visual question answering bots. Further, having the ability to ask natural questions based on different contexts is also useful for artificial agents that can interact with visually impaired people. While the task of generating questions automatically is well studied in the NLP community, it has been relatively less studied for image-related natural questions. This is still a difficult task (Mostafazadeh et al., 2016) that has gained recent interest in the community.
Recently there have also been many deep learning based approaches for solving the text-based question generation task, such as (Du et al., 2017). Further, (Serban et al., 2016) have proposed a method to generate a factoid based question based on the triplet set {subject, relation, object} to capture the structural representation of text and the corresponding generated question.
These methods, however, were limited to text-based question generation. There has been extensive work done in the Vision and Language domain for solving image captioning, paragraph generation, Visual Question Answering (VQA) and Visual Dialog. (Barnard et al., 2003; Farhadi et al., 2010; Kulkarni et al., 2011) proposed conventional machine learning methods for image description. (Socher et al., 2014; Vinyals et al., 2015; Karpathy and Fei-Fei, 2015; Fang et al., 2015; Chen and Lawrence Zitnick, 2015; Johnson et al., 2016; Yan et al., 2016) have generated descriptive sentences from images with the help of Deep Networks. There have been many works for solving Visual Dialog (Chappell et al., 2004; Das et al., 2016). A variety of methods have been proposed by (Malinowski and Fritz, 2014; Lin et al., 2014; Antol et al., 2015; Ren et al., 2015; Ma et al., 2016; Noh et al., 2016) for solving the VQA task, including attention-based methods (Zhu et al., 2016; Fukui et al., 2016; Xu and Saenko, 2016; Shih et al., 2016; Patro and Namboodiri, 2018). However, Visual Question Generation (VQG) is a separate task which is of interest in its own right and has not been so well explored (Mostafazadeh et al., 2016). This is a novel vision based task aimed at generating natural and engaging questions for an image. (Yang et al., 2015) proposed a method for continuously generating questions from an image and subsequently answering those questions. The works closely related to ours are those of (Mostafazadeh et al., 2016) and (Jain et al., 2017). In the former work, the authors used an encoder-decoder based framework, whereas in the latter work, the authors extend it by using a variational autoencoder based sequential routine to obtain natural questions by performing sampling of the latent variable.
Figure 3: An illustrative example shows the validity of our obtained exemplars with the help of an object classification network, RESNET-101. We see that the probability scores of the target and supporting exemplar images are similar. That is not the case with the contrasting exemplar. The corresponding generated questions when considering the individual images are also shown.

Approach
In this section, we clarify the basis for our approach of using exemplars for question generation. To use exemplars for our method, we need to ensure that our exemplars can provide context and that our method generates valid exemplars.
We first analyze whether the exemplars are valid or not. We illustrate this in Figure 3. We used a pre-trained RESNET-101 object classification network on the target, supporting and contrasting images. We observed that the supporting image and target image have quite similar probability scores. The contrasting exemplar image, on the other hand, has completely different probability scores.
Exemplars aim to provide appropriate context. To better understand the context, we experimented by analysing the questions generated through an exemplar. We observed that a supporting exemplar could indeed identify relevant tags (cows in Figure 3) for generating questions. We improve the use of exemplars by using a triplet network. This network ensures that the joint image-caption embedding for the supporting exemplar is closer to that of the target image-caption and vice-versa. We empirically evaluated whether question generation is improved more by an explicit approach that uses the differential set of tags as a one-hot encoding or by the implicit embedding obtained from the triplet network. We observed that the implicit multimodal differential network empirically provided better context for generating questions. Our understanding of this phenomenon is that both target and supporting exemplars generate similar questions, whereas contrasting exemplars generate very different questions from the target question. The triplet network that enhances the joint embedding thus helps improve the generation of the target question. These embeddings are observed to be better than the explicitly obtained context tags, as can be seen in Figure 2. We now explain our method in detail.

Method
The task in visual question generation (VQG) is to generate a natural language question $\hat{Q}$ for an image $I$. We consider a set of pre-generated context $C$ from image $I$. We maximize the conditional probability of the generated question given the image and context as follows:
$$\hat{\theta} = \arg\max_{\theta} \sum_{(I, C, Q)} \log P(Q \mid I, C, \theta)$$
where $\theta$ is a vector of all the parameters of our model and $Q$ is the ground truth question. The log probability for the question is calculated by using the joint probability over $\{q_0, q_1, \ldots, q_N\}$ with the help of the chain rule. For a particular question, the above term is obtained as:
$$\log P(\hat{Q} \mid I, C) = \sum_{t=0}^{N} \log P(q_t \mid I, C, q_0, \ldots, q_{t-1})$$
where $N$ is the length of the sequence, and $q_t$ is the $t$-th word of the question. We have removed $\theta$ for simplicity. Our method is based on a sequence to sequence network (Sutskever et al., 2014; Vinyals et al., 2015; Bahdanau et al., 2014). The sequence to sequence network has a text sequence as input and output. In our method, we take an image as input and generate a natural question as output. The architecture for our model is shown in Figure 4. Our model contains three main modules: (a) a Representation Module that extracts multimodal features, (b) a Mixture Module that fuses the multimodal representation and (c) a Decoder that generates the question using an LSTM-based language model.
During inference, we sample a question word $q_i$ from the softmax distribution and continue sampling until the end token or the maximum length for the question is reached. We experimented with both sampling and argmax and found that argmax works better. This result is provided in the supplementary material.
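The two inference strategies mentioned above (sampling from the softmax distribution versus taking the argmax) can be illustrated with a minimal sketch. The `step_fn` interface, the toy vocabulary and `toy_step` are hypothetical stand-ins for the LSTM decoder and are not taken from the paper.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(step_fn, start_id, stop_id, max_len, greedy=True, rng=None):
    """Generate token ids until STOP is produced or max_len tokens exist.

    step_fn(prefix) -> logits over the vocabulary (hypothetical interface
    standing in for one LSTM decoding step).
    """
    rng = rng or random.Random(0)
    tokens = [start_id]
    while len(tokens) < max_len:
        probs = softmax(step_fn(tokens))
        if greedy:
            # argmax decoding: pick the most probable token
            next_id = max(range(len(probs)), key=probs.__getitem__)
        else:
            # sampling: draw a token according to the softmax distribution
            next_id = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_id)
        if next_id == stop_id:
            break
    return tokens


# Toy decoder (assumption): strongly prefers the next word in the vocabulary.
VOCAB = ["<start>", "what", "is", "this", "?", "<stop>"]


def toy_step(prefix):
    logits = [0.0] * len(VOCAB)
    logits[min(prefix[-1] + 1, len(VOCAB) - 1)] = 5.0
    return logits


greedy_ids = decode(toy_step, start_id=0, stop_id=5, max_len=10, greedy=True)
sampled_ids = decode(toy_step, start_id=0, stop_id=5, max_len=10,
                     greedy=False, rng=random.Random(1))
print(" ".join(VOCAB[i] for i in greedy_ids))
```

Argmax decoding is deterministic, whereas sampling can produce a different question on each run; the sketch only illustrates the mechanics and says nothing about which strategy yields better questions.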

Multimodal Differential Network
The proposed Multimodal Differential Network (MDN) consists of a representation module and a joint mixture module.

Finding Exemplars
We used an efficient KNN-based approach (k-d tree) with Euclidean metric to obtain the exemplars. This is obtained through a coarse quantization of nearest neighbors of the training examples into 50 clusters, and selecting the nearest as supporting and farthest as the contrasting exemplars. We experimented with ITML based metric learning (Davis et al., 2007) for image features. Surprisingly, the KNN-based approach outperforms the latter one. We also tried random exemplars and different number of exemplars and found that k = 5 works best. We provide these results in the supplementary material.
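As a rough illustration of the selection rule (nearest neighbour as supporting exemplar, farthest as contrasting exemplar), the following sketch uses a brute-force Euclidean search in place of the k-d tree and the 50-cluster quantization described above. The function names and the toy feature vectors are assumptions made for this example only.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pick_exemplars(features, target_idx):
    """Return (supporting_idx, contrasting_idx) for one target example.

    Brute-force stand-in for the k-d tree search: every other example is
    ranked by distance to the target; the nearest becomes the supporting
    exemplar and the farthest the contrasting exemplar.
    """
    target = features[target_idx]
    others = [i for i in range(len(features)) if i != target_idx]
    ranked = sorted(others, key=lambda i: euclidean(target, features[i]))
    return ranked[0], ranked[-1]


# Toy 2-D "image features" (hypothetical values).
features = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [5.0, 5.0]]
support, contrast = pick_exemplars(features, target_idx=0)
print(support, contrast)
```

A brute-force search costs O(n) per query; a k-d tree, as used in the paper, serves the same purpose more efficiently for large training sets.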

Representation Module
We use a triplet network (Frome et al., 2007; Hoffer and Ailon, 2015) in our representation module. We referred to similar work done in (Patro and Namboodiri, 2018) for building our triplet network. The triplet network consists of three subparts: target, supporting, and contrasting networks. All three networks share the same parameters. Given an image $x_i$, we obtain an embedding $g_i$ using a CNN parameterized by a function $G(x_i, W_c)$, where $W_c$ are the weights for the CNN. The caption $C_i$ results in a caption embedding $f_i$ through an LSTM parameterized by a function $F(C_i, W_l)$, where $W_l$ are the weights for the LSTM. This is shown in part 1 of Figure 4.
Similarly, we obtain image embeddings $g_s$ and $g_c$ and caption embeddings $f_s$ and $f_c$.

Mixture Module
The Mixture module brings the image and caption embeddings to a joint feature embedding space. The input to the module is the embeddings obtained from the representation module. We have evaluated four different approaches for fusion, viz., joint, element-wise addition, Hadamard and attention methods. Each of these variants receives the image features $g_i$ and the caption embedding $f_i$, and outputs a fixed-dimensional feature vector $s_i$. The Joint method concatenates $g_i$ and $f_i$ and maps them to a fixed-length feature vector $s_i$, where $g_i$ is the 4096-dimensional convolutional feature from the FC7 layer of the pretrained VGG-19 Net (Simonyan and Zisserman, 2014). The aim of the triplet network (Schroff et al., 2015) is to obtain context vectors that bring the supporting exemplar embeddings closer to the target embedding and vice-versa. This is obtained as follows:
$$D(t(s_i), t(s_i^+)) + \alpha < D(t(s_i), t(s_i^-)) \quad \forall\, (t(s_i), t(s_i^+), t(s_i^-)) \in M$$
where $D(t(s_i), t(s_j)) = \|t(s_i) - t(s_j)\|_2^2$ is the (squared) Euclidean distance between two embeddings $t(s_i)$ and $t(s_j)$. $M$ is the training dataset that contains the set of all possible triplets. $T(s_i, s_i^+, s_i^-)$ is the triplet loss function over the target, supporting and contrasting embeddings. This is decomposed into two terms, one that brings the supporting sample closer and one that pushes the contrasting sample further. This is given by
$$T(s_i, s_i^+, s_i^-) = \max(0, D^+ + \alpha - D^-) \qquad (5)$$
Here $D^+$ and $D^-$ represent the distance between the target and supporting sample, and between the target and contrasting sample, respectively. The parameter $\alpha$ (= 0.2) controls the separation margin between these and is obtained through validation data.
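The margin-based triplet objective can be sketched in a few lines. This is a minimal illustration that assumes the standard hinge form max(0, D+ + α − D−) with squared Euclidean distances and the margin α = 0.2 stated above; the toy embeddings are hypothetical.

```python
def squared_distance(a, b):
    """Squared Euclidean distance ||a - b||_2^2."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def triplet_loss(target, supporting, contrasting, margin=0.2):
    """Hinge-style triplet loss: max(0, D+ + margin - D-).

    D+ is the distance from target to supporting embedding,
    D- the distance from target to contrasting embedding.
    """
    d_pos = squared_distance(target, supporting)
    d_neg = squared_distance(target, contrasting)
    return max(0.0, d_pos + margin - d_neg)


# Toy embeddings (hypothetical values).
s_i = [0.0, 0.0]
s_pos = [0.1, 0.0]   # close to the target
s_neg = [1.0, 0.0]   # far from the target

satisfied = triplet_loss(s_i, s_pos, s_neg)   # margin already respected
violated = triplet_loss(s_i, s_neg, s_pos)    # roles swapped, margin violated
print(satisfied, violated)
```

The loss is zero once the contrasting embedding is farther from the target than the supporting embedding by at least the margin, so only violating triplets contribute gradients.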

Decoder: Question Generator
The role of the decoder is to predict the probability of a question, given $s_i$. An RNN provides a convenient way to condition on previous state values using a fixed-length hidden vector. The conditional probability of a question token $q_t$ at a particular time step is modeled using an LSTM, as used in machine translation (Sutskever et al., 2014). At time step $t$, the conditional probability is denoted by $P(q_t \mid I, C, q_0, \ldots, q_{t-1}) = P(q_t \mid I, C, h_t)$, where $h_t$ is the hidden state of the LSTM cell at time step $t$, which is conditioned on all the previously generated words $\{q_0, q_1, \ldots, q_{t-1}\}$. The word with maximum probability in the probability distribution of the LSTM cell at step $k$ is provided as input to the LSTM cell at step $k+1$, as shown in part 3 of Figure 4. At $t = -1$, we feed the output of the mixture module to the LSTM. $\hat{Q} = \{\hat{q}_0, \hat{q}_1, \ldots, \hat{q}_{N-1}\}$ are the predicted question tokens for the input image $I$. Here, we use $\hat{q}_0$ and $\hat{q}_{N-1}$ as the special tokens START and STOP respectively. The softmax probability for the predicted question token at each time step is computed with the standard LSTM cell equations, and the loss at each step is given by:
$$\mathrm{Loss}_{t+1} = \mathrm{loss}(\hat{y}_{t+1}, y_{t+1})$$
where $\hat{y}_{t+1}$ is the probability distribution over all question tokens and loss is the cross entropy loss.

Cost function
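Our objective is to minimize the total loss, that is, the sum of the cross entropy loss and the triplet loss over all training examples. The total loss is:
$$L = \frac{1}{M} \sum_{i=1}^{M} \left( L_{cross} + \gamma L_{triplet} \right)$$
where $M$ is the total number of samples and $\gamma$ is a constant which controls the balance between the two losses. $L_{triplet}$ is the triplet loss function (Equation 5). $L_{cross}$ is the cross entropy loss between the predicted and ground truth questions and is given by:
$$L_{cross} = -\frac{1}{N} \sum_{t=1}^{N} y_t \log P(\hat{q}_t \mid I_i, C_i, \hat{q}_0, \ldots, \hat{q}_{t-1})$$
where $N$ is the total number of question tokens and $y_t$ is the ground truth label. The code for the MDN-VQG model is provided.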
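To make the combination of the two losses concrete, here is a small sketch. It assumes that the total loss averages, over the training examples, the per-question cross entropy plus γ times the triplet loss; the value of `gamma`, the data layout and the toy numbers are hypothetical and not taken from the paper.

```python
import math


def cross_entropy(step_probs, target_ids):
    """Mean negative log-likelihood of the ground-truth tokens.

    step_probs[t] is the predicted distribution at step t and
    target_ids[t] the index of the ground-truth token at that step.
    """
    nll = [-math.log(probs[t]) for probs, t in zip(step_probs, target_ids)]
    return sum(nll) / len(nll)


def total_loss(batch, gamma=1.0):
    """Average of (cross entropy + gamma * triplet loss) over the batch.

    batch: list of (step_probs, target_ids, triplet_value) tuples, where
    triplet_value is a precomputed triplet loss for that example.
    gamma is a hypothetical weighting constant.
    """
    per_example = [cross_entropy(p, t) + gamma * trip for p, t, trip in batch]
    return sum(per_example) / len(per_example)


# One toy example: two decoding steps over a two-word vocabulary.
batch = [([[0.5, 0.5], [0.25, 0.75]], [0, 1], 0.2)]
loss_value = total_loss(batch, gamma=0.5)
print(round(loss_value, 4))
```

Because both terms are computed from the same embedding, minimizing the sum trains the representation, mixture and decoder modules jointly.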

Variations of Proposed Method
While we advocate the use of a multimodal differential network for generating embeddings that can be used by the decoder for generating questions, we also evaluate several variants of this architecture. These are as follows:
Tag Net: In this variant, we consider extracting the part-of-speech (POS) tags for the words present in the caption and obtaining a Tag embedding by considering different methods of combining the one-hot vectors. Further details and experimental results are present in the supplementary material. This Tag embedding is then combined with the image embedding and provided to the decoder network.
Place Net: In this variant we explore obtaining embeddings based on the visual scene understanding. This is obtained using a pre-trained PlaceCNN (Zhou et al., 2017) that is trained to classify 365 different types of scene categories. We then combine the activation map for the input image and the VGG-19 based place embedding to obtain the joint embedding used by the decoder.
Differential Image Network: Instead of using the multimodal differential network for generating embeddings, we also evaluate a differential image network for the same purpose. In this case, the embedding does not include the caption but is based only on the image feature. We also experimented with using multiple exemplars and random exemplars. Further details, pseudocode and results regarding these are present in the supplementary material.

Dataset
We conduct our experiments on the Visual Question Generation (VQG) dataset (Mostafazadeh et al., 2016), which contains human-annotated questions based on images of the MS-COCO dataset. This dataset was developed for generating natural and engaging questions based on common sense reasoning. We use the VQG-COCO dataset for our experiments, which contains a total of 2500 training images, 1250 validation images, and 1250 testing images. Each image in the dataset contains five natural questions and five ground truth captions. It is worth noting that the work of (Jain et al., 2017) also used the questions from the VQA dataset (Antol et al., 2015) for training purposes, whereas the work by (Mostafazadeh et al., 2016) uses only the VQG-COCO dataset. The VQA-1.0 dataset is also built on images from the MS-COCO dataset. It contains a total of 82783 images for training, 40504 for validation and 81434 for testing. Each image is associated with 3 questions. We used a pretrained caption generation model (Karpathy and Fei-Fei, 2015) to extract captions for the VQA dataset, as human-annotated captions are not available in that dataset. We also get good results on the VQA dataset (as shown in Table 2), which shows that our method does not require ground truth captions. We train our model separately for the VQG-COCO and VQA datasets.

Inference
We made use of the 1250 validation images to tune the hyperparameters and report the results on the test set of the VQG-COCO dataset. During inference, we use the Representation module to find the embeddings for the image and ground truth caption without using the supporting and contrasting exemplars. The mixture module provides the joint representation of the target image and ground truth caption. Finally, the decoder takes in the joint features and generates the question. We also experimented with the captions generated by an image captioning network (Karpathy and Fei-Fei, 2015) for the VQG-COCO dataset; the results for that, as well as training details, are present in the supplementary material.

Experiments
We evaluate our proposed MDN method in the following ways: First, we evaluate it against other variants described in section 4.4 and 4.1.3. Second, we further compare our network with state-of-the-art methods for the VQA 1.0 and VQG-COCO datasets. We perform a user study to gauge human opinion on the naturalness of the generated questions and analyze the word statistics in Figure 6. This is an important test as humans are the best deciders of naturalness. We further consider the statistical significance for the various ablations as well as the state-of-the-art models.
Figure 6: Sunburst plot for VQG-COCO: The i-th ring captures the frequency distribution over words for the i-th word of the generated question. The angle subtended at the center is proportional to the frequency of the word. While some words have high frequency, the outer rings illustrate a fine blend of words. We have restricted the plot to 5 rings for easy readability. Best viewed in color.
The quantitative evaluation is conducted using standard metrics like BLEU (Papineni et al., 2002), METEOR (Banerjee and Lavie, 2005), ROUGE (Lin, 2004) and CIDEr (Vedantam et al., 2015). Although these metrics have not been shown to correlate with the 'naturalness' of the question, they still provide a reasonable quantitative measure for comparison. Here we only provide the BLEU-1 scores, but the remaining BLEU-n metric scores are present in the supplementary material. We observe that the proposed MDN provides improved embeddings to the decoder. We believe that these embeddings capture instance-specific differential information that helps in guiding the question generation. Details regarding the metrics are given in the supplementary material.
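For readers unfamiliar with BLEU-1, the following is a minimal sentence-level sketch: clipped unigram precision multiplied by the brevity penalty. It assumes whitespace tokenization and lowercasing, and it is not the evaluation script used for the reported numbers; the example questions are hypothetical.

```python
import math
from collections import Counter


def bleu1(candidate, references):
    """Sentence-level BLEU-1: clipped unigram precision times brevity penalty."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand:
        return 0.0

    # For each word, the maximum count observed in any single reference.
    max_ref = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref[word] = max(max_ref[word], count)

    # Clip candidate counts by the reference counts.
    clipped = sum(min(count, max_ref[word])
                  for word, count in Counter(cand).items())
    precision = clipped / len(cand)

    # Brevity penalty with the reference length closest to the candidate
    # length (ties resolved in favour of the shorter reference).
    ref_len = min((len(r) for r in refs),
                  key=lambda n: (abs(n - len(cand)), n))
    bp = 1.0 if len(cand) > ref_len else math.exp(1.0 - ref_len / len(cand))
    return bp * precision


score = bleu1("what is the man doing", ["what is the boy doing"])
short_score = bleu1("what is", ["what is the boy doing"])
print(round(score, 3), round(short_score, 3))
```

In the first call four of the five candidate words match and the lengths are equal, so the score is 0.8; in the second, every word matches but the brevity penalty exp(1 − 5/2) lowers the score.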

Ablation Analysis
We considered different variations of our method mentioned in section 4.4 and the various ways to obtain the joint multimodal embedding as described in section 4.

Baseline and State-of-the-Art
The comparison of our method with various baselines and state-of-the-art methods is provided in Table 2 for VQA 1.0 and Table 3 for the VQG-COCO dataset. The comparable baselines for our method are the image based and caption based models, in which we use either only the image or only the caption embedding and generate the question. In both tables, the first block consists of the current state-of-the-art methods on that dataset and the second contains the baselines. We observe that for the VQA dataset we achieve an improvement of 8% in BLEU and 7% in METEOR metric scores over the baselines, whereas for the VQG-COCO dataset this is 15% for both metrics. We improve over the previous state-of-the-art (Yang et al., 2015) for the VQA dataset by around 6% in BLEU score and 10% in METEOR score. In the VQG-COCO dataset, we improve over (Mostafazadeh et al., 2016) by 3.7% and (Jain et al., 2017) by 3.5% in terms of METEOR scores.

Statistical Significance Analysis
We have analysed the statistical significance (Demšar, 2006) of our MDN model for the different variations of the mixture module mentioned in section 4.1.3 and also against the state-of-the-art methods. The Critical Difference (CD) for the Nemenyi (Fišer et al., 2016) test depends upon the given α (confidence level, which is 0.05 in our case) for average ranks and N (number of tested datasets). If the difference in the rank of the two methods lies within CD, then they are not significantly different, and vice-versa. Figure 7 visualizes the post hoc analysis using the CD diagram. From the figure, it is clear that MDN-Joint works best and is statistically significantly different from the state-of-the-art methods.
Figure 7: The mean ranks of all the models on the basis of METEOR score are plotted on the x-axis. Here Joint refers to our MDN-Joint model and the others are the different variations described in section 4.1.3, along with Natural (Mostafazadeh et al., 2016) and Creative (Jain et al., 2017). A colored line between two models indicates that these models are not significantly different from each other.
Figure 8: Here every question has a different number of responses, and hence the threshold, which is half of the total responses for each question, varies. This plot is only for 50 of the 100 questions involved in the survey. See section 5.4 for more details.
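The critical-difference check used in this kind of analysis can be sketched as follows, based on the formula CD = q_α · sqrt(k(k+1) / (6N)) from Demšar (2006), where k is the number of compared methods and N the number of datasets. The q_α values are the tabulated Nemenyi constants for α = 0.05; the number of methods, the number of datasets and the ranks in the example are hypothetical and do not come from the paper.

```python
import math

# Critical values q_alpha for the two-tailed Nemenyi test at alpha = 0.05,
# indexed by the number of compared methods k (from Demsar, 2006).
Q_ALPHA_005 = {2: 1.960, 3: 2.343, 4: 2.569, 5: 2.728}


def critical_difference(num_methods, num_datasets, q_table=Q_ALPHA_005):
    """CD = q_alpha * sqrt(k * (k + 1) / (6 * N))."""
    q_alpha = q_table[num_methods]
    return q_alpha * math.sqrt(
        num_methods * (num_methods + 1) / (6.0 * num_datasets))


def significantly_different(rank_a, rank_b, cd):
    """Two methods differ significantly if their mean ranks differ by more than CD."""
    return abs(rank_a - rank_b) > cd


# Hypothetical setting: 5 methods compared on 10 datasets.
cd = critical_difference(num_methods=5, num_datasets=10)
print(round(cd, 3))
print(significantly_different(1.2, 3.5, cd))  # hypothetical mean ranks
```

In a CD diagram, methods whose mean ranks differ by less than this value are joined by a line, which is how the groups of statistically indistinguishable models are read off.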

Perceptual Realism
A human is the best judge of the naturalness of any question. We evaluated our proposed MDN method using a 'Naturalness' Turing test on 175 people. People were shown an image with 2 questions, just as in Figure 1, and were asked to rate the naturalness of both questions on a scale of 1 to 5, where 1 means 'Least Natural' and 5 means 'Most Natural'. We provided 175 people with 100 such images from the VQG-COCO validation dataset, which has 1250 images. Figure 8 indicates the number of people who were fooled (rated the generated question greater than or equal to the ground truth question). For the 100 images, on average 59.7% of people were fooled in this experiment, which shows that our model is able to generate natural questions.
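The "fooled" criterion can be made explicit with a short sketch: a respondent counts as fooled when the rating given to the generated question is at least the rating given to the ground truth question. The data layout and the ratings below are hypothetical and only illustrate the computation.

```python
def fooled_fraction(ratings):
    """Fraction of respondents who rated the generated question at least
    as natural as the ground-truth question.

    ratings: list of (generated_score, ground_truth_score) pairs on a 1-5 scale.
    """
    fooled = sum(1 for generated, truth in ratings if generated >= truth)
    return fooled / len(ratings)


# Hypothetical survey responses; each image may have a different number of raters.
survey = {
    "image_1": [(4, 5), (5, 5), (3, 2), (2, 4)],
    "image_2": [(5, 3), (4, 4)],
}

per_image = {name: fooled_fraction(r) for name, r in survey.items()}
average_fooled = sum(per_image.values()) / len(per_image)
print(per_image, average_fooled)
```

Computing the fraction per image before averaging keeps images with many responses from dominating the overall figure.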

Conclusion
In this paper we have proposed a novel method for generating natural questions for an image. The approach relies on obtaining multimodal differential embeddings from an image and its caption. We also provide an ablation analysis and a detailed comparison with state-of-the-art methods, perform a user study to evaluate the naturalness of our generated questions, and verify that the results are statistically significant. In the future, we would like to analyse means of obtaining composite embeddings. We also aim to consider the generalisation of this approach to other vision and language tasks.