Question Relevance in VQA: Identifying Non-Visual And False-Premise Questions

Visual Question Answering (VQA) is the task of answering natural-language questions about images. We introduce the novel problem of determining the relevance of questions to images in VQA. Current VQA models do not reason about whether a question is even related to the given image (e.g. What is the capital of Argentina?) or if it requires information from external resources to answer correctly. This can break the continuity of a dialogue in human-machine interaction. Our approaches for determining relevance are composed of two stages. Given an image and a question, (1) we first determine whether the question is visual or not, (2) if visual, we determine whether the question is relevant to the given image or not. Our approaches, based on LSTM-RNNs, VQA model uncertainty, and caption-question similarity, are able to outperform strong baselines on both relevance tasks. We also present human studies showing that VQA models augmented with such question relevance reasoning are perceived as more intelligent, reasonable, and human-like.


Introduction
Visual Question Answering (VQA) is the task of predicting a suitable answer given an image and a question about it. VQA models (e.g., (Antol et al., 2015; Ren et al., 2015)) are typically discriminative models that take in image and question representations and output one of a set of possible answers.
Our work is motivated by the following key observation: all current VQA systems always output an answer regardless of whether the input question makes any sense for the given image or not. Fig. 1 shows examples of relevant and irrelevant questions.
[Figure 1: examples of non-visual, visual true-premise, and visual false-premise questions ("Who is the president of the USA?", "What is the girl wearing?", "What is the cat wearing?") for a given image.]
When VQA systems are fed irrelevant questions as input, they understandably produce nonsensical answers (Q: "What is the capital of Argentina?" A: "fire hydrant"). Humans, on the other hand, are unlikely to provide such nonsensical answers and will instead answer that this is irrelevant or use another knowledge source to answer correctly, when possible. We argue that this implicit assumption by all VQA systems - that an input question is always relevant for the input image - is simply untenable as VQA systems move beyond standard academic datasets to interacting with real users, who may be unfamiliar, or malicious. The goal of this work is to make VQA systems more human-like by providing them the capability to identify relevant questions.
While existing work has reasoned about cross-modal similarity, being able to identify whether a question is relevant to a given image is a novel problem with real-world applications. In human-robot interaction, being able to identify questions that are dissociated from the perception data available is important. The robot must decide whether to process the scene it perceives or query external world knowledge resources to provide a response.
As shown in Fig. 1, we study three types of question-image pairs:
Non-Visual. These questions are not about images at all - they do not require information from any image to be answered (e.g., "What is the capital of Argentina?").
Visual False-Premise. While visual, these questions do not apply to the given image. For instance, the question "What is the girl wearing?" makes sense only for images that contain a girl in them.
Visual True-Premise. These questions are relevant to (i.e., have a premise which is true for) the image at hand.
We introduce datasets and train models to recognize both non-visual and false-premise question-image (QI) pairs in the context of VQA. First, we identify whether a question is visual or non-visual; if visual, we identify whether the question has a true-premise for the given image. For visual vs. non-visual question detection, we use a Long Short-Term Memory (LSTM) recurrent neural network (RNN) trained on part-of-speech (POS) tags to capture visual-specific linguistic structure. For true- vs. false-premise question detection, we present one set of approaches that use the uncertainty of a VQA model, and another set that use pre-trained captioning models to generate relevant captions (or questions) for the given image and then compare them to the given question to determine relevance.
Our proposed models achieve accuracies of 94% for detecting non-visual, and 76% for detecting false-premise questions, which significantly outperform strong baselines. We also show through human studies that a VQA system that reasons about question relevance is picked significantly more often as being more intelligent, human-like, and reasonable than a baseline VQA system which does not. Our code and datasets will be made publicly available.

Related Work
There is a large body of existing work that reasons about cross-modal similarity: how well an image matches a query tag (Liu et al., 2009) in text-based image retrieval, how well an image matches a caption (Feng and Lapata, 2013; Xu et al., 2015; Ordonez et al., 2011; Karpathy and Fei-Fei, 2015; Fang et al., 2015), and how well a video matches a description (Donahue et al., 2014; Lin et al., 2014a).
In our work, if a question is deemed irrelevant, the VQA model says so, as opposed to answering the question anyway. This is related to perception systems that do not respond to an input where the system is likely to fail. Such failure prediction systems have been explored in vision (Zhang et al., 2014; Devarakota et al., 2007) and speech (Zhao et al., 2012; Sarma and Palmer, 2004; Choularton, 2009; Voll et al., 2008). A related idea is to avoid a highly specific prediction if there is a chance of being wrong, instead making a more generic prediction that is more likely to be right (Deng et al., 2012).
To the best of our knowledge, our work is the first to study the relevance of questions in VQA. Chen et al. (2012) classify the intent of users' questions in community question answering services. Most related to our work is (Dodge et al., 2012), which extracts visual text from within Flickr photo captions to be used as supervisory signals for training image captioning systems. Our motivation is to endow VQA systems with the ability to detect non-visual questions in order to respond in a human-like fashion. Moreover, we also detect a more fine-grained notion of question relevance via true- and false-premise.

Datasets
For the task of detecting visual vs. non-visual questions, we assume that all the questions in the VQA dataset (Antol et al., 2015) are visual, since the Amazon Mechanical Turk (AMT) workers were specifically instructed to ask questions about a displayed image while creating that dataset. We also collected non-visual philosophical and general knowledge questions from the internet (see appendix). Combining the two, we have 121,512 visual questions from the validation set of VQA and 9,952 generic non-visual questions collected from the internet. We call this dataset Visual vs. Non-Visual Questions (VNQ).
We also collect a dataset of true- vs. false-premise questions by showing AMT workers images paired with random questions from the VQA dataset and asking them to annotate whether they are applicable or not. We had three workers annotate each QI pair. We take the majority vote as the final ground truth label. We have 10,793 QI pairs on 1,500 unique images, out of which 79% are not applicable (false-premise). We refer to this visual true- vs. false-premise questions dataset as VTFQ.
Since there is a class imbalance in both of these datasets, we report the average per-class (i.e., normalized) accuracy for all approaches. All datasets will be made publicly available.
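As a concrete illustration, this metric is simply the unweighted mean of the per-class accuracies. A minimal sketch in Python (the function name is ours, not the authors'):

```python
import numpy as np

def normalized_accuracy(y_true, y_pred):
    """Unweighted mean of per-class accuracies, robust to class imbalance."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))
```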

Approach
Here we present our approaches for detecting (1) visual vs. non-visual QI pairs, and (2) true- vs. false-premise QI pairs.

Visual vs. Non-Visual Detection
Recall that the task here is to distinguish visual questions from non-visual ones. Non-visual questions, such as "Do dogs fly?" or "Who is the president of the USA?", often tend to have a linguistic structure different from that of visual questions, such as "Does this bird fly?" or "What is this man doing?". We compare our approach (LSTM) with a baseline (RULE-BASED):
1. RULE-BASED. A rule-based approach to detect non-visual questions based on the part-of-speech (POS) tags and dependencies of the words in the question. E.g., if a question has a plural noun with no determiner before it and is followed by a singular verb ("Do dogs fly?"), it is classified as non-visual.
2. LSTM. We train an LSTM with 100-dim hidden vectors to embed the question into a vector and predict visual vs. non-visual. Instead of feeding in the question words (['what', 'is', 'the', 'man', 'doing', '?']), the input to our LSTM is embeddings of the POS tags of the words (['pronoun', 'verb', 'determiner', 'noun', 'verb']). Embeddings of the POS tags are learned end-to-end. This captures the structure of image-grounded questions, rather than visual vs. non-visual topics; the latter are less likely to generalize across domains.
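Below is a minimal sketch of the LSTM approach in item 2, assuming spaCy for POS tagging and PyTorch for the model. The 100-dim hidden state follows the paper; the tag set, embedding size, and all names are our assumptions rather than the authors' code.

```python
import spacy
import torch
import torch.nn as nn

nlp = spacy.load("en_core_web_sm")  # assumed POS tagger; the authors' exact tagger may differ

# Coarse universal POS tags (assumed tag set).
POS_TAGS = ["NOUN", "PROPN", "VERB", "AUX", "DET", "PRON", "ADJ", "ADP",
            "ADV", "NUM", "PART", "PUNCT", "CCONJ", "SCONJ", "INTJ", "SYM", "X"]
POS2ID = {t: i for i, t in enumerate(POS_TAGS)}

class PosLSTMClassifier(nn.Module):
    """Embeds a question's POS-tag sequence with an LSTM and predicts visual vs. non-visual."""
    def __init__(self, n_tags=len(POS_TAGS), emb_dim=25, hidden_dim=100):
        super().__init__()
        self.emb = nn.Embedding(n_tags, emb_dim)                    # POS embeddings, learned end-to-end
        self.lstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True)  # 100-dim hidden vectors (as in the paper)
        self.out = nn.Linear(hidden_dim, 2)                         # visual vs. non-visual logits

    def forward(self, tag_ids):          # tag_ids: (batch, seq_len) tensor of POS-tag indices
        h, _ = self.lstm(self.emb(tag_ids))
        return self.out(h[:, -1, :])     # classify from the last hidden state

def question_to_tag_ids(question):
    """Map a question string to a (1, seq_len) tensor of POS-tag indices."""
    return torch.tensor([[POS2ID.get(tok.pos_, POS2ID["X"]) for tok in nlp(question)]])

model = PosLSTMClassifier()
logits = model(question_to_tag_ids("Do dogs fly?"))  # untrained model; for illustration only
```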

True- vs. False-Premise Detection
Our second task is to detect whether a question Q entails a false premise for an image I. We present two families of approaches to measure this QI 'compatibility': (i) using the uncertainty of VQA models, and (ii) using pre-trained captioning models.
Using VQA Uncertainty. Here we work with the hypothesis that if a VQA model is uncertain about the answer to a QI pair, the question may be irrelevant for the given image, since the uncertainty may mean it has not seen similar QI pairs in the training data. We test two approaches:
1. ENTROPY. We compute the entropy of the softmax output from a state-of-the-art VQA model (Antol et al., 2015; Lu et al., 2015) for a given QI pair and train a decision stump classifier on it.
2. VQA-MLP. We feed the softmax output to a two-layer multilayer perceptron (MLP) with 30 hidden units in each layer, and train it as a binary classifier to predict whether a question has a true- or false-premise for the given image.
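A rough sketch of both uncertainty-based classifiers, assuming the VQA model's softmax over answers is available as a vector per QI pair. We use scikit-learn stand-ins; the decision stump and the 30-unit two-layer MLP follow the paper, while everything else (names, hyperparameters) is assumed.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier

def answer_entropy(p, eps=1e-12):
    """Entropy of the VQA model's softmax over answers for one QI pair."""
    p = np.asarray(p, dtype=np.float64)
    return float(-np.sum(p * np.log(p + eps)))

# ENTROPY: a decision stump (depth-1 tree) thresholds the entropy value.
# softmax_train: (N, n_answers) softmax outputs; labels: 1 = true-premise, 0 = false-premise.
def fit_entropy_stump(softmax_train, labels):
    feats = np.array([[answer_entropy(p)] for p in softmax_train])
    return DecisionTreeClassifier(max_depth=1).fit(feats, labels)

# VQA-MLP: a two-layer MLP with 30 hidden units per layer reads the full softmax vector.
def fit_vqa_mlp(softmax_train, labels):
    return MLPClassifier(hidden_layer_sizes=(30, 30), max_iter=500).fit(softmax_train, labels)
```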
Using Pre-trained Captioning Models. Here we utilize (a) an image captioning model, and (b) an image question-generation model to measure QI compatibility. Note that both of these models generate natural language capturing the semantics of an image - one in the form of a statement, the other in the form of a question. Our hypothesis is that a given question is relevant to the given image if it is similar to the language generated by these models for that image. Specifically:
1. Question-Caption Similarity (Q-C SIM). We use NeuralTalk2 (Karpathy and Fei-Fei, 2015), pre-trained on the MSCOCO dataset (Lin et al., 2014b) (images and associated captions), to generate a caption C for the given image, and then compute a learned similarity between Q and C (details below).
2. Question-Question Similarity (Q-Q' SIM). We use NeuralTalk2 re-trained (from scratch) on the questions in the VQA dataset to generate a question Q' for the image. Then, we compute a learned similarity between Q and Q'.
We now describe our learned Q-C similarity function (the Q-Q' similarity is analogous). Our Q-C similarity model is a 2-channel LSTM+MLP (one channel for Q, another for C). Each channel sequentially reads word2vec embeddings of the corresponding language via an LSTM. The last hidden state vectors (40-dim) from the 2 LSTMs are concatenated and fed as inputs to the MLP, which outputs a 2-class (relevant vs. not) softmax. The entire model is learned end-to-end on the VTFQ dataset. We also experimented with other representations (e.g., bag of words) for Q, Q', and C, which are included in the appendix for completeness.
Finally, we also compare our proposed models to a simpler baseline (Q-GEN SCORE), where we compute the probability of the input question Q under the learned question-generation model. The intuition here is that since the question-generation model has been trained only on relevant questions (from the VQA dataset), it will assign a high probability to Q if it is relevant.
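A minimal sketch of the 2-channel Q-C similarity model in PyTorch, assuming pre-computed word2vec embeddings for each word. The 40-dim LSTM hidden states and the 2-class output follow the paper; the word2vec dimensionality, MLP width, and all names are our assumptions.

```python
import torch
import torch.nn as nn

class QCSimilarity(nn.Module):
    """Two LSTM channels (question Q, caption C); their final hidden states are
    concatenated and fed to an MLP that predicts relevant vs. not."""
    def __init__(self, w2v_dim=300, hidden_dim=40, mlp_hidden=40):
        super().__init__()
        self.q_lstm = nn.LSTM(w2v_dim, hidden_dim, batch_first=True)
        self.c_lstm = nn.LSTM(w2v_dim, hidden_dim, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(2 * hidden_dim, mlp_hidden),
            nn.ReLU(),
            nn.Linear(mlp_hidden, 2),   # 2-class (relevant vs. not) logits
        )

    def forward(self, q_w2v, c_w2v):
        # q_w2v, c_w2v: (batch, seq_len, w2v_dim) sequences of word2vec embeddings
        _, (q_h, _) = self.q_lstm(q_w2v)
        _, (c_h, _) = self.c_lstm(c_w2v)
        joint = torch.cat([q_h[-1], c_h[-1]], dim=1)   # concatenated final 40-dim states
        return self.mlp(joint)

# Usage sketch: the caption C would come from a pre-trained captioning model (e.g. NeuralTalk2);
# the whole model is trained end-to-end on VTFQ with a cross-entropy loss over the 2-class output.
model = QCSimilarity()
logits = model(torch.randn(1, 8, 300), torch.randn(1, 12, 300))  # dummy embeddings for illustration
```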

Experiments and Results
Visual vs. Non-Visual Detection. We use a random set of 100,000 questions from the VNQ dataset for training, and the remaining 31,464 for testing. The results are shown in Table 1. We see that LSTM performs 18.59% better than RULE-BASED.
True- vs. False-Premise Detection. We use a random set of 7,195 (66%) QI pairs from the VTFQ dataset to train and the remaining 3,597 (33%) to test. Table 1 shows the results. While the VQA model uncertainty based approaches (ENTROPY, VQA-MLP) perform reasonably well (with the MLP helping over raw entropy), the learned similarity approaches perform much better (12% gain in normalized accuracy). High uncertainty of the model may suggest that a similar QI pair was not seen during training; however, that does not seem to translate to detecting irrelevance. The language generation models (Q-C SIM, Q-Q' SIM) seem to work significantly better at modeling the semantic interaction between the question and the image. The generative approach (Q-GEN SCORE) is outperformed by the discriminative approaches (VQA-MLP, Q-C SIM, Q-Q' SIM) that are trained explicitly for the task at hand.

Human Qualitative Evaluation
We also perform human studies in which we compare two agents: (1) AGENT-BASELINE, which always answers every question, and (2) AGENT-OURS, which reasons about question relevance before responding. If the question is classified as visual true-premise, AGENT-OURS answers it using the same VQA model as AGENT-BASELINE (using (Lu et al., 2015)). Otherwise, it responds with a prompt indicating that the question does not seem meaningful for the image. A total of 100 questions (22% relevant, 78% irrelevant, mimicking the distribution of the VTFQ dataset) were used. Of the relevant questions, 54% were answered correctly by the VQA model. Five human subjects on AMT were shown the responses of both agents and asked to pick the agent that sounded more intelligent, more reasonable, and more human-like. AGENT-OURS was picked 65% of the time as the winner, AGENT-BASELINE was picked only 2% of the time, and both were considered equally (un)reasonable in the remaining cases. Interestingly, humans often prefer AGENT-OURS over AGENT-BASELINE even when both models are wrong - AGENT-BASELINE answers the question incorrectly and AGENT-OURS incorrectly predicts that the question is irrelevant and refuses to answer a legitimate question. Users seem more tolerant of mistakes in relevance prediction than in VQA.

Conclusion
We introduced the novel problem of identifying irrelevant (i.e., non-visual or visual false-premise) questions for VQA. Our proposed models significantly outperform strong baselines on both tasks. A VQA agent that utilizes our detector and refuses to answer certain questions significantly outperforms a baseline (that answers all questions) in human studies. Such an agent is perceived as more intelligent, reasonable, and human-like. Our system can be further augmented to communicate to users which premise of the question is not satisfied by the image, e.g., by responding to "What is the woman wearing?" for an image of a cat with "There is no woman."


Appendix

As noted in the main text, we experimented with other representations for the question and the caption (or generated question) in our similarity models:
1. BOW. A bag-of-words representation of the question-caption pair, with a 1 for words that are present in either the question or the caption and a 2 when the word is present in both. This means each question-caption pair is represented by a 9,952-dim (vocab length) vector. The MLP used on top of BOW is a 3-layer MLP with 30, 20 and 10 hidden units respectively.
2. AVG. W2V. We extract word2vec (Mikolov et al., 2013) features for the question and caption words, compute the average of the features separately for the question and caption, and then concatenate them. Similar to BOW, we train a 3-layer MLP with 200, 150 and 80 hidden units respectively.
3. LSTM W2V. These are the features used in the main paper. The LSTM has 40 hidden units, followed by a 2-layer MLP with 40 and 20 hidden units respectively.
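A small sketch of the BOW encoding from item 1 above (the 1/2 coding over the shared vocabulary); the vocabulary handling and names here are ours, not the authors'.

```python
import numpy as np

def bow_qc_vector(question_tokens, caption_tokens, vocab_index):
    """Bag-of-words vector for a question-caption pair over a fixed vocabulary:
    1 if a word appears in either the question or the caption, 2 if it appears in both."""
    vec = np.zeros(len(vocab_index), dtype=np.float32)
    q_ids = {vocab_index[w] for w in question_tokens if w in vocab_index}
    c_ids = {vocab_index[w] for w in caption_tokens if w in vocab_index}
    for i in q_ids | c_ids:
        vec[i] = 1.0
    for i in q_ids & c_ids:
        vec[i] = 2.0
    return vec
```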
Table 2 shows a comparison of the performance in recall, precision and normalized accuracy.

Figure 2: Success cases. The first row illustrates examples that our model predicted to be True-Premise and that were also labeled so by humans. The second row shows success cases for False-Premise detection.

Figure 3: Failure cases. The first row illustrates examples that our model predicted to be False-Premise but that were actually labeled as True-Premise by humans, and vice versa in the second row.

Table 1: Normalized accuracy results for visual vs. non-visual detection and true- vs. false-premise detection.

Table 2: True- vs. false-premise question detection with different question and caption representations.