Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

We conduct large-scale studies on 'human attention' in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple game-inspired novel attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation). Overall, our experiments show that current attention models in VQA do not seem to be looking at the same regions as humans.


Introduction
It helps to pay attention. Humans have the ability to quickly perceive a scene by selectively attending to parts of the image instead of processing the whole scene in its entirety (Rensink, 2000). Inspired by human attention, a recent trend in computer vision and deep learning is to build computational models of attention. Given an input signal, these models learn to attend to parts of it for further processing and have been successfully applied in machine translation (Bahdanau et al., 2014; Firat et al., 2016), object recognition (Mnih et al., 2014; Sermanet et al., 2014), image captioning (Cho et al., 2015) and visual question answering (Yang et al., 2015; Lu et al., 2016; Xu and Saenko, 2015; Xiong et al., 2016). In this work, we study attention for the task of Visual Question Answering (VQA). Unlike image captioning, where a coarse understanding of an image is often sufficient for producing generic descriptions (Devlin et al., 2015), visual questions selectively target different areas of an image including background details and underlying context. This suggests that a VQA model may benefit from an explicit or implicit attention mechanism to answer a question correctly. In this work, we are interested in the following questions: 1) Which image regions do humans choose to look at in order to answer questions about images? 2) Do deep VQA models with attention mechanisms attend to the same regions as humans?
We design and conduct studies to collect "human attention maps". Figure 1 shows human attention maps on the same image for two different questions. When asked 'What type is the surface?', humans choose to look at the floor, while attention for 'Which game is being played?' is concentrated around the player and racket. These human attention maps can be used both for evaluating machine-generated attention maps and for explicitly training attention-based models.
Contributions. First, we design and test multiple game-inspired novel interfaces for collecting human attention maps of where humans choose to look to answer questions from the large-scale VQA dataset (Antol et al., 2015); this VQA-HAT (Human ATtention) dataset will be released publicly. Second, we perform qualitative and quantitative comparison of the maps generated by state-of-the-art attention-based VQA models (Yang et al., 2015;Lu et al., 2016) and a task-independent saliency baseline (Judd et al., 2009) against our human attention maps through visualizations and rank-order correlation. We find that machine-generated attention maps from the most accurate VQA model have a mean rank-correlation of 0.26 with human attention maps, which is worse than task-independent saliency maps that have a mean rank-correlation of 0.49. It is well understood that task-independent saliency maps have a 'center bias' (Tatler, 2007;Judd et al., 2009). After we control for this center bias in our human attention maps, we find that the correlation of task-independent saliency is poor (as expected), while trends for machine-generated VQA-attention maps remain the same (which is promising).

Related Work
Our work draws on recent work in attention-based VQA and human studies in saliency prediction.
We work with the free-form and open-ended VQA dataset released by Antol et al. (2015).
VQA Models. Attention-based models for VQA typically use convolutional neural networks to highlight relevant regions of the image given a question. Stacked Attention Networks (SAN), proposed by Yang et al. (2015), use LSTM encodings of question words to produce a spatial attention distribution over the convolutional layer features of the image. Hierarchical Co-Attention Network (Lu et al., 2016) generates multiple levels of image attention based on words, phrases and complete questions, and is the top entry on the VQA Challenge as of the time of this submission. Another interesting approach uses question parsing to compose the neural network from modules, attention being one of the sub-tasks addressed by these modules (Andreas et al., 2016). Note that all these works are unsupervised attention models, where "attention" is simply an intermediate variable (a spatial distribution) that is produced by the model to optimize downstream loss (VQA cross-entropy). The fact that some (it's unclear how many) of these spatial distributions end up being interpretable is simply fortuitous. In contrast, we study where humans choose to look to answer visual questions. These human attention maps can be used to evaluate unsupervised maps.
Human Studies. There's a rich history of work in collecting eye tracking data from human subjects to gain an understanding of image saliency and visual perception (Jiang et al., 2014; Judd et al., 2009; Fei-Fei et al., 2007; Yarbus, 1967). Eye tracking data to study natural visual exploration (Jiang et al., 2014; Judd et al., 2009) is useful but difficult and expensive to collect on a large scale. Jiang et al. (2015) established mouse tracking as an accurate approach to collecting attention maps. They collected large-scale attention annotations for MS COCO (Lin et al., 2014) on Amazon Mechanical Turk (AMT). While Jiang et al. (2015) study natural exploration and collect task-independent human annotations by asking subjects to freely move the mouse cursor to anywhere they wanted to look on a blurred image, our approach is task-driven. Specifically, as described in Section 3, we collect ground truth attention annotations by instructing subjects to sharpen parts of a blurred image that are important for answering the questions accurately. Section 4 covers evaluation of unsupervised attention maps generated by VQA models against our human attention maps.
Figure 3: Deblurring procedure to collect attention maps. We present subjects with a blurred image and ask them to sharpen regions of the image that will help them answer the question correctly, in a smooth, click-and-drag, 'coloring' motion with the mouse.

VQA-HAT (Human ATtention) Dataset
We design and test multiple game-inspired novel interfaces for conducting large-scale human studies on AMT. Our basic interface design consists of a "deblurring" exercise for answering visual questions. Specifically, we present subjects with a blurred image and a question about the image, and ask subjects to sharpen regions of the image that will help them answer the question correctly, in a smooth, click-and-drag, 'coloring' motion with the mouse. The sharpening is gradual: successively scrubbing the same region progressively sharpens it. Figure 3 shows intermediate steps in our attention annotation interface, from a completely blurry image to a deblurred attention map.
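The gradual sharpening described above can be illustrated with a small sketch. This is not the interface's actual implementation: the per-pixel sharpness mask, the linear blend between blurred and original pixels, the square brush, and the names `apply_stroke` and `render` are all assumptions made for illustration. Under these assumptions, the accumulated mask doubles as the collected attention map.

```python
def apply_stroke(mask, center, radius=1, step=0.25):
    """Raise sharpness in a square brush around `center`, capped at 1.0.

    Hypothetical brush model: repeated strokes over the same pixels
    accumulate, which mimics progressive sharpening.
    """
    rows, cols = len(mask), len(mask[0])
    r0, c0 = center
    for r in range(max(0, r0 - radius), min(rows, r0 + radius + 1)):
        for c in range(max(0, c0 - radius), min(cols, c0 + radius + 1)):
            mask[r][c] = min(1.0, mask[r][c] + step)
    return mask


def render(original, blurred, mask):
    """Blend per pixel: mask 0 shows the blurred pixel, mask 1 the original."""
    return [
        [m * o + (1.0 - m) * b for o, b, m in zip(orow, brow, mrow)]
        for orow, brow, mrow in zip(original, blurred, mask)
    ]


# Toy 4x4 grayscale image; a constant image stands in for the blurred version.
original = [[float(10 * r + c) for c in range(4)] for r in range(4)]
mean = sum(map(sum, original)) / 16
blurred = [[mean] * 4 for _ in range(4)]

mask = [[0.0] * 4 for _ in range(4)]
for _ in range(4):                      # scrub the same pixel four times
    apply_stroke(mask, (1, 1), radius=0)
view = render(original, blurred, mask)  # pixel (1, 1) is now fully sharp
```

A real interface would use an actual blur and a soft circular brush; the sketch only shows why scrubbing the same region repeatedly reveals it progressively.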

Attention Annotation Interface
Our interface starts by showing a low-resolution blurry version of the image. This is to convey a partial 'holistic' understanding of the scene to the subjects so they may intelligently choose which regions to sharpen. Gradual sharpening with strokes was intended to capture both the subjects' initial exploration, as they tried to get a better sense of the scene, and their eventual focused sharpening to answer the question. Next, we describe the three variants of our attention annotation interface that we experimented with.

Blurred Image without Answer
In our first interface, subjects were shown a blurred image and a question without the answer, and were asked to deblur regions and enter the answer. We found that this interface sometimes resulted in 'exploratory attention', where the subject lightly sharpens large regions of an image to find salient regions that eventually lead them to the answer. However, subjects often ended up with 'incomplete' attention maps since they did not see the high-resolution image and the answer, so they did not know when to stop deblurring or exploring. For instance, for an image with 3 players playing a sport, if the question is "How many players are visible in the image?", the subject might sharpen a region that seems to have the players, count the 2 players in there and answer 2, and completely miss another region of the image that had 1 more. The resulting attention map in this case is incomplete since there are 3 players in the image. This effect of incomplete human attention maps was seen in counting ("How many ...") and binary ("Is there ...") types of questions, and as a result, the answers to these were often incorrect.

Blurred Image with Answer
In our second interface, subjects were shown the correct answer in addition to the question and blurred image. They were asked to sharpen as few regions as possible such that someone can answer the question just by looking at the blurred image with sharpened regions. This interface is shown in Figure 4b. Providing the answer fixed the failure cases of the first interface: for counting and binary questions, since the subjects now knew the answer, they continued to explore until they found the answer region in the image.
Figure 4: [...] They were asked to sharpen as few regions as possible such that someone can answer the question just by looking at the blurred image with sharpened regions. (c) To encourage exploitation instead of exploration, in our third interface, subjects were shown the question-answer pair and full-resolution original image. Out of the three interfaces, Blurred Image with Answer (b) struck the right balance between exploration and exploitation, and gives the highest accuracy on evaluation by humans as described in section 3.2.

Blurred and Original Image with Answer
To encourage exploitation instead of exploration, in our third interface, subjects were shown the question-answer pair and full-resolution original image. In principle, seeing the original (full-resolution) image, the question, and the answer provides the most information to subjects, thus enabling them to provide the most 'accurate' attention maps. However, this task turns out to be fairly counterintuitive: subjects are shown full-resolution images and the answer, and asked to imagine a scenario where someone else has to answer the question without looking at the original image. Figure 4 shows screen-captures of the 3 attention annotation interfaces.

Dataset Evaluation
We ran pilot studies on AMT to experiment with the three interfaces described above. In order to quantitatively evaluate the interfaces, we conducted a second human study where (a second set of) subjects were shown the attention-sharpened images generated from each of the attention interfaces from the first experiment and asked to answer the question. The intuition behind this experiment is that if the attention map revealed too little information, this second set of subjects would answer the question incorrectly. Table 1 shows VQA accuracies of the answers given by human subjects under these 3 interfaces. We can see that the "Blurred Image with Answer" interface (section 3.1.2) gives the highest accuracy on evaluation by humans.
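For readers unfamiliar with how a "VQA accuracy" is typically scored, the sketch below shows the consensus metric introduced with the VQA dataset (Antol et al., 2015), in which an answer counts as fully correct if at least 3 reference annotators gave it. That this exact metric was used for Table 1 is an assumption, and the `normalize` helper is a simplified, hypothetical stand-in for the more extensive answer normalization of the official evaluation.

```python
def normalize(answer):
    """Simplified normalization (assumption): trim whitespace and lowercase."""
    return answer.strip().lower()


def vqa_accuracy(predicted, reference_answers):
    """Consensus accuracy: min(number of matching references / 3, 1)."""
    matches = sum(normalize(predicted) == normalize(a) for a in reference_answers)
    return min(matches / 3.0, 1.0)


def mean_accuracy(predictions, references):
    """Average the per-question accuracy over a set of questions."""
    scores = [vqa_accuracy(p, refs) for p, refs in zip(predictions, references)]
    return sum(scores) / len(scores)
```

With this scoring, an answer given by the second set of subjects earns partial credit when only one or two reference annotators agree with it, so an interface whose maps reveal too little would show up as a lower mean accuracy.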
Since the payment structure on AMT encourages completing tasks as quickly as possible, this implicitly incentivizes subjects to deblur as few regions as possible, and our human study shows that humans can still answer questions. Thus, overall we achieve a balance between highlighting too little or too much. Figure 2 shows examples of collected human attention maps. This VQA-HAT dataset will be released publicly. To visualize the collected dataset, we cluster the human attention maps and visualize the average attention map and example questions falling in each of them for 6 selected clusters in Figure 5.

Human Attention Maps vs Unsupervised Attention Models
Now that we have collected these human attention maps, we can ask the following question: do unsupervised attention models learn to predict attention maps that are similar to human attention maps? To rephrase, do neural networks look at the same regions as humans to answer a visual question?
VQA Attention Models. We evaluate maps generated by the following unsupervised models:

• Stacked Attention Network (SAN-2) (Yang et al., 2015) (see footnote 2);
• Hierarchical Co-Attention Network (HieCoAtt) (Lu et al., 2016) with word-level (HieCoAtt-W), phrase-level (HieCoAtt-P) and question-level (HieCoAtt-Q) attention maps; we evaluate all three maps (see footnote 3).

Figure 6: Random samples of human attention (column 2) v/s machine-generated attention (columns 3-5).
Comparison Metric: Rank Correlation. We first scale both the machine-generated and human attention maps to 14x14, rank the pixels according to their spatial attention and then compute correlation between these two ranked lists. We choose an order-based metric so as to make the evaluation invariant to absolute spatial probability values which can be made peaky or diffuse by tweaking a 'temperature' parameter.
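The rank-correlation comparison described above can be sketched as follows. This is a minimal illustration and not the authors' evaluation code: the text does not say how maps are scaled to 14x14, so average pooling is an assumption, and the helper names (`downscale`, `average_ranks`, `rank_correlation`, `softmax`) are hypothetical. The correlation is computed as the Pearson correlation of the two rank lists, with tied values sharing their average rank.

```python
import math


def downscale(att, size=14):
    """Average-pool a 2D map to size x size.

    Assumption: both map dimensions are exact multiples of `size`.
    """
    bh, bw = len(att) // size, len(att[0]) // size
    return [
        [
            sum(att[r * bh + i][c * bw + j] for i in range(bh) for j in range(bw))
            / (bh * bw)
            for c in range(size)
        ]
        for r in range(size)
    ]


def average_ranks(values):
    """Rank values (1 = smallest); tied values share the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either list is constant."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rank_correlation(map_a, map_b, size=14):
    """Rank-order correlation between two attention maps after downscaling."""
    a = [v for row in downscale(map_a, size) for v in row]
    b = [v for row in downscale(map_b, size) for v in row]
    return pearson(average_ranks(a), average_ranks(b))


def softmax(scores, temperature=1.0):
    """Softmax with temperature: low values give peaky maps, high values diffuse ones."""
    peak = max(scores)
    exps = [math.exp((s - peak) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

Because the softmax is strictly increasing in each score, changing the temperature rescales the probabilities without reordering them, so `average_ranks(softmax(scores, t))` is the same for any positive `t` as long as the outputs stay numerically distinct. This is the invariance that motivates an order-based metric.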
Table 2 shows rank-order correlation averaged over all image-question pairs on the validation set. We compare with random attention maps and task-independent saliency maps generated by a model trained to predict human eye fixation locations where subjects are asked to freely view an image for 3 seconds (Judd et al., 2009). Both SAN-2 and HieCoAtt attention maps are positively correlated with human attention maps, but not as strongly as task-independent Judd saliency maps. Our findings lead to two take-away messages with significant potential impact on future research in this active field. First, current VQA attention models do not seem to be 'looking' at the same regions as humans to produce an answer. Second, as attention-based VQA models become more accurate (58.9% SAN → 62.1% HieCoAtt), they seem to be (slightly) better correlated with humans in terms of where they look. Our dataset will allow for a more thorough validation of this observation as future attention-based VQA models are proposed. Figure 6 shows examples of human attention and machine-generated attention maps with corresponding rank-correlation coefficients.
Footnote 2: [...]imageqa-san.
Footnote 3: Code available at https://github.com/jiasenlu/HieCoAttenVQA
To put these numbers in perspective, we computed inter-human agreement on the validation set by collecting 3 human attention maps per image-question pair and computing mean rank-correlation, which is 0.623. Lastly, all reported correlation values are averaged over 3 trials by adding random noise (order of 10^-14) to the human attention maps to account for ranking variations in case of uniformly weighted regions.
Center Bias. Judd saliency maps aim to predict human eye fixations during natural visual exploration. These tend to have a strong center bias (Tatler, 2007; Judd et al., 2009). Although our human attention maps dataset is not an eye tracking study, the center bias still exists, albeit not as severe. One potential source of this center bias is the fact that the VQA dataset was human-generated by subjects looking at the images. Thus, salient objects in the center of the image are likely to be potential subjects of the questions. We compute rank-correlation of a synthetically generated central attention map with Judd saliency and human attention maps. With this central map, Judd saliency maps have a mean rank-correlation of 0.877 and human attention maps have a mean rank-correlation of 0.458 on the validation set. To eliminate the effect of center bias in this evaluation, we removed human attention maps that have a positive rank-correlation with the center attention map. We compute rank-correlation of machine-generated attention with human attention on this reduced set. See Table 3. Mean correlation goes down significantly for Judd saliency maps since they have a strong center bias. Relative trends among SAN-2 & HieCoAtt are similar to those over the whole validation set (reported in Table 2). HieCoAtt-Q now has a higher correlation with human attention maps than Judd saliency. This demonstrates that, discounting the center bias, VQA-specific machine attention maps correlate better with VQA-specific human attention maps than task-independent machine saliency maps.

Model                           Rank-correlation
SAN-2 (Yang et al., 2015)       0.038 ± 0.011
HieCoAtt-W (Lu et al., 2016)    0.062 ± 0.012
HieCoAtt-P (Lu et al., 2016)    0.048 ± 0.010
HieCoAtt-Q (Lu et al., 2016)    0[...]

Table 3: Mean rank-correlation coefficients (higher is better) on the reduced set without center bias; error bars show standard error of means. We can see that correlation goes down significantly for Judd saliency maps since they have a strong center bias. Relative trends among SAN-2 & HieCoAtt are similar to those over the whole validation set (reported in Table 2).
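The tie-breaking noise and the center-bias filtering described in this section can be sketched as below. Several details are assumptions rather than the authors' procedure: the text only says the central map is "synthetically generated", so an isotropic Gaussian with a hypothetical `sigma` is used here; the noise is drawn uniformly from [0, 1e-14]; and all function names are illustrative. Maps are given as flat lists of 14x14 = 196 values.

```python
import math
import random


def ordinal_ranks(values):
    """Rank values from 0 (smallest); remaining ties are broken by index order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for position, index in enumerate(order):
        ranks[index] = position
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def noisy_rank_correlation(human, other, trials=3, scale=1e-14, seed=0):
    """Average rank correlation over `trials` noisy copies of the human map.

    The tiny noise perturbs uniformly weighted regions of the human map so
    that their relative order varies from trial to trial.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        jittered = [h + rng.uniform(0.0, scale) for h in human]
        total += pearson(ordinal_ranks(jittered), ordinal_ranks(other))
    return total / trials


def center_map(size=14, sigma=3.0):
    """Synthetic central attention map (assumption: isotropic Gaussian)."""
    mid = (size - 1) / 2.0
    return [
        math.exp(-((r - mid) ** 2 + (c - mid) ** 2) / (2.0 * sigma ** 2))
        for r in range(size)
        for c in range(size)
    ]


def drop_center_biased(human_maps, size=14):
    """Keep only human maps that are not positively correlated with the center map."""
    center = center_map(size)
    return [m for m in human_maps if noisy_rank_correlation(m, center) <= 0.0]
```

Machine-generated maps would then be compared only against the maps returned by `drop_center_biased`, which corresponds to the reduced set reported in Table 3.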

Conclusion & Discussion
We introduce and release the VQA-HAT dataset. This dataset can be used to evaluate attention maps generated in an unsupervised manner by attention-based VQA models, or to explicitly train models with attention supervision for VQA. We quantify whether current attention-based VQA models are 'looking' at the same regions of the image as humans do to produce an answer.
Necessary vs Sufficient Maps. Are human attention maps 'necessary' and/or 'sufficient'? If regions highlighted by the human attention maps are sufficient to answer the question accurately, then so is any region that is a superset. For example, if attention mass is concentrated on a 'cat' for 'What animal is present in the picture?', then an attention map that assigns weights to any arbitrary-sized region that includes the 'cat' is sufficient as well. On the contrary, a necessary and sufficient attention map would be the smallest visual region sufficient for answering the question accurately. It is an ill-posed problem to define a necessary attention map in the space of pixels; random pixels can be blacked out and chances are that humans would still be able to answer the question given the resulting subset attention map. Our work thus poses an interesting question for future work: what is the right semantic space in which it is meaningful to talk about necessary and sufficient attention maps for humans?