Visuo-Linguistic Question Answering (VLQA) Challenge

Understanding images and text together is an important aspect of cognition and of building advanced Artificial Intelligence (AI) systems. As a community, we have achieved good benchmarks over the language and vision domains separately; however, joint reasoning is still a challenge for state-of-the-art computer vision and natural language processing (NLP) systems. We propose a novel task of deriving joint inference over a given image-text context and compile the Visuo-Linguistic Question Answering (VLQA) challenge corpus in a question answering setting. Each dataset item consists of an image and a reading passage, where questions are designed to combine both visual and textual information, i.e., ignoring either modality would make the question unanswerable. We first explore the best existing vision-language architectures on VLQA subsets and show that they are unable to reason well over the joint context. We then develop a modular method with slightly better baseline performance, which still falls far behind human performance. We believe that VLQA will be a good benchmark for reasoning over a visuo-linguistic context. The dataset, code, and leaderboard are available at https://shailaja183.github.io/vlqa/.


Introduction
Question answering (QA) is a crucial way to evaluate a system's ability to understand text and images. In recent years, a large body of natural language QA (NLQA) datasets and visual QA (VQA) datasets have been compiled for this purpose. For most VQA datasets, however, the text is used merely as a question-answering mechanism rather than an actual modality that provides contextual information. On the other hand, deriving inference from combined visual and textual information is an important skill for humans performing day-to-day tasks: for example, product assembly using instruction manuals, navigating roads while following street signs, interpreting visual representations (e.g., charts) in documents such as newspapers and reports, and understanding concepts through textbook-style learning.

Figure 1: Example of the Visuo-Linguistic Question Answering (VLQA) task for joint reasoning over an image-text context.

The importance of joint reasoning has also been emphasized in the design of standardized/psychometric tests such as PISA (OECD, 2019) and the GRE, as evident from Figure 2. PISA assessments conducted after 2018 take into account "the evolving nature of reading in digital societies, which requires an ability to compare, contrast and integrate information from multiple sources". The GRE has 'data interpretation' questions that assess a student's ability to "analyze given data as a combination of text and charts." Both of these motivate the need to develop Visuo-Linguistic QA (VLQA) systems, posing a further challenge to state-of-the-art vision and language research.

To the best of our knowledge, there are no benchmark datasets that focus on reasoning over both images and text. We formalize the task of deriving joint inference, where a system must utilize both visual and textual information to correctly answer a question, as demonstrated in Figure 1. To create a benchmark for this task, we develop and present a new dataset, VLQA (Visuo-Linguistic Question Answering), as our main contribution. The VLQA dataset pairs text with a diverse range of visual elements. Since manuals, documents, and books containing both text and visuals are ubiquitous, the VLQA dataset is very much grounded in the real world. It is curated from multiple resources (books, encyclopedias, web crawls, existing datasets, etc.) through combined automated and manual efforts, and consists of 9267 image-passage-QA tuples with detailed annotations, meticulously crafted to assure quality.
We then evaluate the best existing vision-language architectures on our VLQA dataset, including LXMERT (Tan and Bansal, 2019), VL-BERT (Su et al., 2019), ViLBERT, and VisualBERT. Our results demonstrate that despite significant improvements on vision and language tasks separately, the best existing techniques cannot reason well on the joint task. We then propose a modular method, HOLE (HOpping and Logical Entailment), which demonstrates slightly better baseline performance and offers more transparency through interpretable intermediate outputs. The results indicate that the VLQA task is harder than existing vision-language tasks due to the diversity of figures and the additional textual component, demanding better approaches to multi-modal question answering. The VLQA challenge thus has the potential to open new research avenues spanning language and vision.

Related Work
We identify Image-Text Multi-modality, Multi-hop Reasoning, and variants of Visual Question Answering (VQA) as the areas closest to VLQA and compare with relevant datasets in these areas (refer to Appendix A.1 for a comprehensive comparison with more datasets).

Image-Text Multi-modality
Multimodal learning aims to build models that can process and relate information from two or more modalities. Image-text multi-modality has recently received growing interest from the Artificial Intelligence (AI) community. The Diagram QA component of TQA (Kembhavi et al., 2017) and the portion of AI2D with additional text are most relevant to ours. They share similarities with VLQA in terms of the presence of additional text, diagram-style images, and QA-style evaluation, but there are important distinctions.
First, TQA uses long lessons (~50 sentences and 4-5 images) to describe concepts in textbook-style learning, whereas text passages for the relevant subsets of AI2D and for VLQA are short (1-5 sentences). The goal of TQA aligns with the careful selection of necessary facts from long-tailed contexts, which is less important in VLQA as the context is much smaller. AI2D, in turn, aims at AI-based diagram understanding; contrary to that, we focus on enhancing the capability of AI models for joint reasoning. Secondly, AI2D and TQA are curated from the school science curriculum, whereas we cover a broader horizon of possible reasoning. Lastly, TQA and AI2D do not impose that one must use both modalities while answering, unlike VLQA: 40% of TQA text QA can be answered using a single sentence and 50% of its diagram QA using only the image. In that case, a significant portion of the dataset becomes analogous to machine comprehension or ordinary VQA, losing out on the actual purpose of multi-modality.

Multi-Hop Reasoning
In the natural language processing (NLP) domain, multi-hop reasoning has been proposed to encourage the development of models that can reason over two or more textual contexts. QAngaroo (Welbl et al., 2018) and ComplexWebQuestions (Talmor and Berant, 2018) include multi-hop questions that can be answered by linking entities from a knowledge base (KB). HotpotQA is a multi-hop benchmark over pairs of text paragraphs from Wikipedia, not constrained by retrieval from fixed KB schemas. The QASC (Khot et al., 2019) dataset made this task even more challenging by first requiring the retrieval of necessary facts from a large corpus (knowledge ranking) and then composing them to answer a multi-hop question.
Solving VLQA examples requires linking information from image and text. Therefore, VLQA can be considered a novel kind of multi-hop task involving images and text, which we believe will drive future vision-language research.

Visual Question Answering (VQA)
Following the success of the VQA dataset (Antol et al., 2015), several variants of visual QA have been proposed. The following are most relevant.

Reasoning-based VQA Reasoning-based VQA datasets aim at measuring a system's capability to reason about a set of objects, their attributes, and their relationships. HowManyQA (Trott et al., 2017) and TallyQA (Acharya et al., 2019) have object-counting questions over images. SNLI-VE (Xie et al., 2019) and VCOPA (Yeo et al., 2018) focus on causal reasoning, whereas CLEVR (Johnson et al., 2017) and NLVR (Suhr et al., 2017) target spatial reasoning. FigureQA (Kahou et al., 2017) and DVQA (Kafle et al., 2018) are testbeds for QA over charts/plots. The objective of VLQA is to equip AI models with diverse reasoning capabilities over an image-text context. A model solving the VCR dataset first answers a question in VQA style and then needs to provide a rationale explaining why the answer is true. Therefore, items in VCR could be turned into particular VLQA data items. However, images in VCR are much more specific than ours, e.g., they do not include charts, diagrams, or multiple images. Also, the rationale selection is limited to 'Why' questions, which is not the case in VLQA. We identify 10 broad reasoning categories needed to solve VLQA, which are described in Section 3.3.
Knowledge-based VQA There are several vision-language tasks that require additional knowledge beyond the provided image and text. F-VQA, KB-VQA (Wang et al., 2015), and KVQA (Shah et al., 2019) rely on retrieving commonsense or world knowledge from a Knowledge Base (KB), whereas OK-VQA (Marino et al., 2019) relies on open-ended knowledge extraction from the web. In VLQA, 61% of samples require commonsense or domain knowledge that is not explicitly stated in the image-text context. Knowledge extraction for VLQA is kept open-ended as of now.

VLQA Dataset
We formally define the VLQA task, explain our approach to curating this dataset, and describe the necessary measures for quality assurance below.

Task Overview
A datapoint in VLQA is a 4-tuple <I, P, Q, A>.

Image(I) The provided imagery, which ranges from daily-life scenes and a variety of data representations to complex diagrams. A portion of VLQA examples also requires reasoning over multiple images. For simplicity of processing and retrieval, we compose all images into a single file. Each image is bounded by a red box and given an explicit detection tag ([0], [1], ..) for identification purposes, inspired by the VCR annotations. This also provides a convenient way to reference images in the passage, question, or answers.

Passage(P)
It is the textual modality that provides additional contextual information related to the image. Passages in the VLQA dataset are composed of 1-5 sentences, consisting of facts, imaginary scenarios, or a combination of the two.

Question(Q) It is a question in natural language that tests the reasoning capability of a model over the given image-passage context. In addition to standard 'Wh' patterns and fact-checking style (True/False), some questions in VLQA are of the 'do-as-directed' form, similar to standardized tests.
Answer Choices(A) VLQA is framed as a classification task over 2-way or 4-way plausible choices, with exactly one of the candidate answers being correct. Answer choices may contain boolean values, alphanumeric phrases, image tags, or their combination.
Task Given the VLQA dataset as a collection of 4-tuples <I, P, Q, A> as shown in Figure 3, the task is to build an AI model that can answer a given question using the image-text multi-modal context. The correctness of a prediction is measured against the ground-truth answer. Additionally, we provide rich annotations and classifications along several aspects, such as image types, question types, required reasoning capability, and the need for external knowledge. This metadata is optional and useful for researchers interested in tackling specific subsets of VLQA.
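To make the tuple format concrete, the snippet below sketches how a single VLQA item might be represented in code; the field names, example values, and metadata keys are illustrative assumptions rather than the released annotation schema.

```python
# A minimal sketch of one VLQA <I, P, Q, A> item as a Python dict.
# Field names and values are illustrative, not the released schema.
vlqa_item = {
    "image": "composed_image_0042.png",   # single file; sub-images tagged [0], [1], ...
    "passage": "Tag [0] shows the rainfall recorded in 2019, which was "
               "20% less than the rainfall of the previous year.",
    "question": "Based on [0] and the passage, how much rainfall did the "
                "city receive in 2018?",
    "choices": ["96 mm", "120 mm", "144 mm", "150 mm"],  # 4-way text MCQ
    "answer_idx": 3,  # 120 mm read off the (hypothetical) chart / 0.8 = 150 mm
    # Optional metadata; not required for modeling:
    "meta": {
        "image_type": "template-based/bar",
        "answer_type": "4-way text",
        "reasoning": ["conditional retrieval", "math operations"],
        "needs_external_knowledge": False,
        "difficulty": "moderate",
    },
}
```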

Data Collection
The main goal of our work is to collect a QA dataset that requires deriving joint inference from an image-text context. We classify our data sources as primary and secondary. We obtain raw textual/visual information through primary sources, which can later be used as a modality in VLQA. For example, text crawled from Wikipedia containing facts and images crawled by keyword search can be used as passage and image, respectively. Similarly, we collect tabular data from the CIA 'World Factbook' (Central Intelligence Agency, 2019) and WikiTables (Pasupat and Liang, 2015) and convert them into templated figures like bar charts, pie charts, scatter plots, etc. We consider existing structured or semi-structured materials as secondary data sources, which can be quickly adapted for our purpose; important examples include educational materials, standardized tests, and existing vision-language datasets. We used scrapers to collect textbook exercises, encyclopedias, practice worksheets, and question banks. Further, we obtained a subset of interesting samples from existing datasets.

Figure 4: VLQA data creation process: collect data using primary and secondary sources, then perform post-processing (if any), and finally create question-answers that require joint reasoning.
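As an illustration of the table-to-figure conversion used for primary sources, the sketch below renders a small table as a templated bar chart with matplotlib; the function name, styling, and example values are ours, and the actual figure templates used to build VLQA are likely richer.

```python
import matplotlib.pyplot as plt

def table_to_bar_chart(rows, title, out_path):
    """Render a small table (list of (label, value) pairs) as a templated
    bar chart. Only illustrates the table-to-figure conversion step; the
    templates used for the actual dataset are likely more varied."""
    labels, values = zip(*rows)
    fig, ax = plt.subplots(figsize=(5, 3))
    ax.bar(labels, values, color="steelblue")
    ax.set_title(title)
    ax.set_ylabel("value")
    fig.tight_layout()
    fig.savefig(out_path)
    plt.close(fig)

# Hypothetical 'World Factbook'-style rows (values made up for illustration):
# table_to_bar_chart([("2016", 3.1), ("2017", 3.4), ("2018", 2.9)],
#                    "GDP growth (%)", "gdp_growth.png")
```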
We then refactor the textual/visual information collected from the above sources and mold it to our task requirements; Figure 4 illustrates this process. Refactoring includes manual or semi-automated post-processing such as replacing given textual/visual attributes with equivalent visual/textual counterparts, adding/removing partial information to/from text or visuals, and creating factual or hypothetical situations around images. We then standardize all information collected using the above methods as Multiple Choice Questions (MCQ), obtaining the initial version of the dataset.
Since we impose the condition that a question must be answered through joint reasoning over both modalities, our annotation process becomes non-trivial and requires careful manual annotation. For quality, we opted for a limited number of in-house expert annotators rather than a noisier, harder-to-control crowdsourcing alternative.

Ensuring dataset integrity
A combined understanding of visual and textual inputs is the key aspect of the VLQA task. Since we model it as a classification task, some models might exploit various biases in the dataset to achieve good performance without proper reasoning. To discourage such models, we employ a 3-level verification over the full dataset to ensure its quality.
Firstly, for all collected image-passage pairs, human annotators quickly verify whether a portion of the image and the passage represent identical information; all such image-passage pairs are discarded from the dataset. Secondly, we create three baselines, question-only, passage-only, and image-only, which ignore at least one modality (among image and passage) and try to predict answers. We repeat this experiment 3 times by shuffling answer choices with a fixed seed, and remove samples that are answered correctly by any unimodal baseline in all trials.
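A minimal sketch of this second, automated filtering level is shown below, assuming each unimodal baseline exposes a predict(item, choices) wrapper; the wrapper API, the seed value, and the item fields are assumptions for illustration.

```python
import random

NUM_TRIALS = 3
SEED = 13  # fixed shuffling seed; placeholder value

def answered_by_unimodal(item, unimodal_models):
    """Return True if any single-modality baseline (question-only,
    passage-only, or image-only) answers the item correctly in all
    shuffled trials; such items get removed from the dataset."""
    gold = item["choices"][item["answer_idx"]]
    for model in unimodal_models:
        rng = random.Random(SEED)
        correct_in_all_trials = True
        for _ in range(NUM_TRIALS):
            choices = list(item["choices"])
            rng.shuffle(choices)                 # shuffle answer choices
            pred = model.predict(item, choices)  # hypothetical wrapper API
            if pred != gold:
                correct_in_all_trials = False
                break
        if correct_in_all_trials:
            return True
    return False

# kept = [x for x in dataset if not answered_by_unimodal(x, baselines)]
```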
Finally, we perform another round of manual quality checks. We instruct annotators to first try to answer a question based only on the image(s) and then based only on the text passage. If a question can be answered using a single modality, annotators mark a corresponding checkbox. We then look over all flagged samples and either fix or remove them on a case-by-case basis (refer to Appendix A.2 for a detailed explanation of the dataset creation process).

VLQA Dataset Analysis
In this section, we analyze VLQA along the following aspects; Table 1 provides a summary of relevant statistics.

Multi-modal Contexts
The final version of the VLQA dataset has 9267 unique image-passage-QA items. For each item, the multi-modal context is created by pairing images (roughly 10k collected) with the relevant text passages (roughly 9k retrieved or manually written).

Text-length Analysis
We analyze the lengths of the various textual components in our dataset, i.e., passages, questions, and answers. The length of each textual component is calculated by counting whitespace-separated tokens and then averaged across the dataset. The average passage length of 34.1 tokens indicates that textual contexts in VLQA are considerably shorter than those in typical reading comprehension tasks and, in most cases, contain precisely the context necessary for joint reasoning. The average question length of 10.0 tokens is larger than that of most other VQA datasets reported in Hudson and Manning (2019). Short answer lengths (1.7 tokens) suggest that most questions have short answers, which provides inherent flexibility for anyone who wants to leverage generative models to solve this task. The dataset has a vocabulary size of 13259, contributed by all three textual components together.
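The sketch below shows one way to reproduce these statistics under the whitespace tokenization described above; the item fields are the illustrative ones used earlier, and whether answer lengths are computed over the correct choice only or over all choices is an assumption.

```python
def text_stats(items):
    """Average whitespace-token lengths of passages, questions, and answers,
    plus overall vocabulary size (a sketch of the reported analysis)."""
    totals = {"passage": 0, "question": 0, "answer": 0}
    vocab = set()
    for it in items:
        fields = {
            "passage": it["passage"],
            "question": it["question"],
            "answer": it["choices"][it["answer_idx"]],  # assumption: correct choice only
        }
        for key, text in fields.items():
            tokens = text.split()        # tokens separated by whitespace
            totals[key] += len(tokens)
            vocab.update(tokens)
    averages = {k: v / len(items) for k, v in totals.items()}
    return averages, len(vocab)
```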
Image types We categorize images in VLQA into 3 major kinds: natural images, template-based figures, and free-form figures. Natural images capture day-to-day scenes around us, containing abundant objects and actions. Template-based figures are visuals that follow a common structure for information representation; we further categorize them into 20 sub-types like bar charts, pie charts, maps, tables, cycles, processes, etc. Images that neither fit any template nor are natural are put into the free-form category (e.g., science experiments, hypothetical scenarios, etc.). In VLQA, the visual context may also consist of multiple related images to reason about.
Answer types 4-way and 2-way image MCQs contain 4 and 2 images as plausible answer choices respectively, where the model needs to pick the image best described by the passage and question. 4-way and 2-way text MCQs contain 4 and 2 alphanumeric text choices respectively, where the model needs to reason about the given image-text scenario and pick the most likely answer to the question. The 4-way sequencing task assesses a model's capability to order 4 spatial or temporal events represented as a combination of images and text. Binary classification (Yes/No or True/False) can be considered a fact-checking task where we want to determine the truth value of a question given the image-passage context.
Knowledge and Reasoning types 61% of VLQA items are observed to require some commonsense or domain knowledge beyond the provided context; this missing knowledge has to be retrieved, e.g., from the web. The remaining 39% of samples can be answered through a simple join of information from the visuo-linguistic context. We observe the following 10 most-frequent reasoning types needed to solve VLQA questions: conditional retrieval, math operations, deduction, temporal, spatial, causal, abductive, logical, and verbal reasoning. We further categorize VLQA samples based on whether they require single-step or multi-step inference to answer the question; by multi-step inference, we mean that answering a question involves more than one reasoning type.

Difficulty Level Determining difficulty levels is a subjective notion; therefore, we asked an odd number of annotators to rate VLQA items as 'easy', 'moderate', or 'hard' based on their personal judgment. We then take a majority vote across annotators to assign a difficulty level to each question.

Dataset Splits VLQA contains 9267 items in <I,P,Q,A> format, with detailed classification based on figure types, answer types, reasoning skills, requirement of external knowledge, and difficulty levels as explained above. The data is split into train, validation, and test sets (80-10-10%), ensuring a uniform distribution over the above taxonomies. To preserve the integrity of test results, we do not release the test set publicly. Note that the use of the metadata for model design is completely optional.
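A minimal sketch of such a taxonomy-aware split is given below, using the illustrative metadata fields introduced earlier; stratifying on a composite of answer type and image type is our simplification of "the above taxonomies", not necessarily the exact procedure used.

```python
import random
from collections import defaultdict

def stratified_split(items, seed=0):
    """Roughly 80-10-10 train/val/test split that keeps each taxonomy bucket
    (here: answer type x image type) uniformly distributed across splits."""
    rng = random.Random(seed)
    buckets = defaultdict(list)
    for it in items:
        key = (it["meta"]["answer_type"], it["meta"]["image_type"])
        buckets[key].append(it)

    train, val, test = [], [], []
    for group in buckets.values():
        rng.shuffle(group)
        n_train = int(0.8 * len(group))
        n_val = int(0.1 * len(group))
        train += group[:n_train]
        val += group[n_train:n_train + n_val]
        test += group[n_train + n_val:]
    return train, val, test
```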

Benchmarking
Human Performance We performed human evaluation on 927 test samples with a balanced variety of questions in terms of image types, answer types, knowledge/reasoning types, and hardness. We asked 3 in-house experts to take the tests in isolation and to rate questions by difficulty level (easy/medium/hard), with an option to mark a sample as 'ambiguous'. We then matched their predictions against ground-truth answers, which resulted in an accuracy of 84%.

Random Baseline The VLQA dataset contains 4-way and 2-way multiple choice questions (MCQs), where a random guess is correct with 25% and 50% chance, respectively. Based on the answer-type distribution provided in Table 1, the performance of the random baseline is 31.36%.
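This expected accuracy is simply a chance-weighted average over answer types, as the snippet below illustrates; the counts used here are placeholders rather than the actual Table 1 values, so they only approximate the reported 31.36%.

```python
# Expected accuracy of random guessing, weighted by the answer-type mix.
# Counts are illustrative placeholders (the real values are in Table 1).
answer_type_counts = {"4-way": 6900, "2-way": 2367}
chance = {"4-way": 0.25, "2-way": 0.50}

total = sum(answer_type_counts.values())
expected_acc = sum(answer_type_counts[t] / total * chance[t] for t in chance)
print(f"random baseline: {expected_acc:.2%}")
```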
Question-only, Passage-only and Image-only Baselines We use three unimodal baselines solely for automated quality assurance of VLQA data (and do not train them on VLQA) to prevent models from exploiting biases in the data. The question-only, passage-only, and image-only models are implemented using RoBERTa (Liu et al., 2019) finetuned on ARC, ALBERT (Lan et al., 2019) finetuned on RACE, and LXMERT (Tan and Bansal, 2019) finetuned on VQA (Antol et al., 2015), respectively. We report the poor performance of these baselines over the resulting VLQA data to indicate the need for joint reasoning over the multi-modal context.

Best Existing Architectures
Recently, several attempts have been made to derive transformer-based, pre-trainable generic representations for visuo-linguistic tasks. We pick the top-performing single-model architectures VL-BERT (Su et al., 2019), VisualBERT, ViLBERT, and LXMERT (Tan and Bansal, 2019) that support the Visual Question Answering (VQA) downstream task. For the VQA task, the input is an image and a question. To finetune VQA-style models with VLQA data, we compose all images into one (in case of multiple images) as a single visual input, and concatenate the passage and question as a single language input. Hyperparameters and performance of all 4 architectures are reported in Tables 2 and 3, respectively.
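The input preparation step can be sketched as below; the composition by horizontal concatenation and the item fields are assumptions for illustration, since the paper only states that multiple images are composed into one and that passage and question are concatenated.

```python
from PIL import Image

def prepare_vqa_style_inputs(item, image_dir="images/"):
    """Collapse a VLQA item into the (image, question) format expected by
    VQA-style models: one composed image and one concatenated text string."""
    # Visual input: paste all sub-images side by side on a single canvas
    # (one simple choice of composition; not necessarily the paper's).
    imgs = [Image.open(image_dir + name) for name in item["image_files"]]
    canvas = Image.new("RGB",
                       (sum(im.width for im in imgs), max(im.height for im in imgs)),
                       "white")
    x_offset = 0
    for im in imgs:
        canvas.paste(im, (x_offset, 0))
        x_offset += im.width
    # Language input: passage concatenated with the question.
    language_input = item["passage"] + " " + item["question"]
    return canvas, language_input
```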

Fusion of HOpping and Logical Entailment (HOLE) to solve VLQA
We propose 'HOLE', a fusion of modality HOpping (Image-to-Passage hop and Passage-to-Image hop) and Logical Entailment, as a modular baseline for VLQA, shown in Figure 5. We leverage the 'answer types' metadata from the annotations and learn a simple 5-class classifier ('4-way Image', '2-way Image', '4-way Sequencing', 'Binary Classification' or '4-way Text') in order to decide between modality hopping and logical entailment. Note that our model is not end-to-end.
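At a high level, this routing can be sketched as a small dispatcher; the class names follow the list above, while the component interfaces are placeholders.

```python
ANSWER_TYPES = ["4-way Image", "2-way Image", "4-way Sequencing",
                "Binary Classification", "4-way Text"]

def hole_solve(item, type_classifier, hopping_solver, entailment_reasoner):
    """Top-level HOLE routing sketch: the answer-type classifier decides
    whether an item is handled by the modality-hopping solver (4-way text)
    or by the logical-entailment reasoner (all other answer types)."""
    answer_type = type_classifier.predict(item)     # one of ANSWER_TYPES
    if answer_type == "4-way Text":
        return hopping_solver(item)                 # lower half of Figure 5
    return entailment_reasoner(item, answer_type)   # upper half of Figure 5
```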

Modality Hopping based Solver
Figure 5: Proposed HOLE method to solve VLQA. Based on the answer-type classification, a dataset item is either solved through a sequence of Logical Entailment operations or through Hopping between modalities to find the correct answer.

4-way text MCQs are solved using the modality hopping approach (lower half of the pipeline in Figure 5). We first compute Image-to-Question Attention (I2Q) and Passage-to-Question Attention (P2Q) scores to determine which modality should be used as the starting point for solving a question. I2Q is computed using a Stacked Attention Network (SAN) (Yang et al., 2016), which takes Convolutional Neural Network (CNN) encodings of I and Q. P2Q, in turn, is computed using a variant of Bi-Directional Attention Flow (BiDAF) trained using Embeddings from Language Models (ELMo) (Peters et al., 2018) over Long Short-Term Memory (LSTM) encodings of Q and P.
A higher I2Q score suggests that Q has more overlap with I than with P. Therefore, the image modality should be used first and the passage incorporated afterwards to compute the answer; we term this an 'Image-to-Passage Hop'. This is nearly identical to a Visual Question Answering (VQA) scenario that takes an image and a question as input; since we have P as an additional text component, we combine the passage and question (P+Q) as the language input. This hop is implemented using the pre-trained LXMERT (Tan and Bansal, 2019) architecture, which is state-of-the-art on VQA, to pick the most likely answer choice.
Similarly, a higher P2Q score suggests that Q has more overlap with P than with I. Therefore, the passage modality should be used first and the image incorporated afterwards to compute the answer; we term this a 'Passage-to-Image Hop'. This can be achieved by a machine comprehension model followed by a VQA model. We use ALBERT (Lan et al., 2019) as the machine comprehension model, which takes in P and Q and generates an open-ended response in the style of SQuAD (Rajpurkar et al., 2016), which we refer to as A'. We then want to determine where A' is located in the image I, so we formulate a new question Q' as "Where is A'?", where A' is substituted by the answer from ALBERT. We then use LXMERT (Tan and Bansal, 2019), which takes the image I, the new question Q', and the original answer choices A to pick the most likely one.

For all other answer types, we leverage Logical Entailment (upper half of the pipeline in Figure 5) of image and text to answer questions. We create an 'Entailment Toolbox' consisting of image-image, image-text (Xie et al., 2019), text-image, and textual entailment sub-modules and use them as required. For image-image and image-text entailment, we augment the Visual COPA (Yeo et al., 2018) dataset and train a custom network for each (refer to Supplementary Material B for more details). 4-way and 2-way image MCQs contain images as answer choices, which is similar to an image selection task (Hu et al., 2019). The goal here is to identify the image that best matches the description of P or, mathematically, to determine the P ⊢ A_k (text-image entailment) with the maximum score, where ⊢ denotes entailment and A_k represents the answer choices, with k=4 and k=2 for 4-way and 2-way image problems, respectively.
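The two hops described above can be summarized in a few lines of Python; the wrapper objects (san_i2q, bidaf_p2q, albert_rc, lxmert) and their method names are placeholders standing in for the trained SAN, BiDAF, ALBERT, and LXMERT components.

```python
def modality_hopping_solver(item, san_i2q, bidaf_p2q, albert_rc, lxmert):
    """Sketch of the 4-way text solver: compare I2Q and P2Q scores to pick
    the starting modality, then hop to the other one for the final answer."""
    i2q = san_i2q.score(item["image"], item["question"])      # image-question overlap
    p2q = bidaf_p2q.score(item["passage"], item["question"])  # passage-question overlap

    if i2q >= p2q:
        # Image-to-Passage Hop: VQA with the passage folded into the question.
        language = item["passage"] + " " + item["question"]
        return lxmert.answer(item["image"], language, item["choices"])
    else:
        # Passage-to-Image Hop: answer from text first, then locate it in the image.
        a_prime = albert_rc.answer(item["passage"], item["question"])  # open-ended A'
        q_prime = f"Where is {a_prime}?"
        return lxmert.answer(item["image"], q_prime, item["choices"])
```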

Logical Entailment based Reasoner
Binary Classification can be considered a fact-checking task where we want to determine the truth value of a question given the image-passage context or, mathematically, P ∪ I ⊢ Q. We use textual entailment to determine P ⊢ Q and image-text entailment to determine I ⊢ Q. If both entailment modules' confidence scores are above 0.65, the answer is determined to be True; otherwise, False.

The 4-way Sequencing task assesses a model's capability to order 4 spatial or temporal events. If we consider I-II-III-IV as a sequence of events, it is equivalent to 3 entailment tasks: I-II, II-III, and III-IV, where each of I to IV can be an image or a text. Among the answer choices, the sequence with the maximum overall confidence is selected as the answer.
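The two decision rules just described can be sketched as follows; the entailment-module wrappers and the way per-pair confidences are combined for sequencing (here, a product) are assumptions, since the paper only states that the sequence with maximum overall confidence is chosen.

```python
THRESHOLD = 0.65  # confidence threshold used for binary questions

def binary_answer(item, text_entail, image_text_entail):
    """True iff both P |- Q and I |- Q exceed the confidence threshold."""
    p_entails_q = text_entail.confidence(item["passage"], item["question"])
    i_entails_q = image_text_entail.confidence(item["image"], item["question"])
    return (p_entails_q > THRESHOLD) and (i_entails_q > THRESHOLD)

def sequencing_answer(item, pairwise_entail):
    """Score each candidate ordering I-II-III-IV by its three pairwise
    entailment confidences (combined here by a product, an assumption)
    and pick the highest-scoring sequence."""
    best_choice, best_score = None, float("-inf")
    for choice in item["choices"]:       # each choice is a 4-event sequence
        score = 1.0
        for a, b in zip(choice, choice[1:]):   # events can be images or text
            score *= pairwise_entail.confidence(a, b)
        if score > best_score:
            best_choice, best_score = choice, score
    return best_choice
```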

Results & Discussion
Multi-modality brings both pros and cons when developing new Artificial Intelligence (AI) benchmarks. The presence of multiple modalities provides natural flexibility for varied inference tasks, while simultaneously making the reasoning process more complex, as information is spread across modalities and requires cross-inferencing. In this work, we focused on joint reasoning over an image-text multi-modal context and developed the Visuo-Linguistic Question Answering (VLQA) dataset. The proposed VLQA dataset has important distinctions from existing VQA datasets. Firstly, it incorporates a text passage that contains additional contextual information. Secondly, it offers various figure types, including natural images, templated images, and free-form (unstructured) images, which is not common in other VQA datasets. Thirdly, it tests diverse reasoning capabilities, including cross-inferencing between the visual and textual modalities.
We then benchmark several baselines on the resulting VLQA dataset. As VLQA consists of multiple choice questions with exactly one correct answer, we use standard accuracy as the evaluation metric. From the results in Table 3, we observe that pre-trained vision-language models fail to solve a significant portion of the VLQA items. Our proposed modular method HOLE slightly outperforms them and is more interpretable for analysis. We also report the performance of the question-only, image-only, and passage-only baselines used for quality checks. The poor performance of these baselines indicates that the VLQA dataset requires models to jointly understand both image and text modalities and is relatively harder than other vision-language tasks.
For human evaluation on the VLQA test set, the reported accuracy is 84.0%. We group the 148 wrongly predicted answers according to 4 reasons for failure, listed in Table 4.

Table 4: Classification of incorrectly predicted answers in human evaluation of the VLQA test data.
  Lacked necessary knowledge: 27 (18.2%)
  Misunderstood the provided info: 47 (31.7%)
  Mistake in deduction/calculation: 63 (42.5%)
  Felt that the data item is ambiguous: 11 (7.4%)

The results demonstrate room for significant improvement in existing vision-language models, which are far behind human performance. This stimulates the need for more complex reasoning capabilities in AI models. We suspect that VLQA questions that rely purely on facts might be exploited by the latest language models, despite strong measures taken through manual and automated quality control during dataset creation. We would like to explore this further in the future.

Conclusion
In this work, we introduced the Visuo-Linguistic Question Answering (VLQA) challenge, which we believe has the potential to open new research avenues in joint vision and language. Our experiments show that systems equipped with state-of-the-art vision-language pre-training do not perform well on a task that requires joint image-text inference, leaving significant room for improvement in the capability of these models to tackle multi-modal contexts. Our future work includes further expansion of this dataset and building generic AI models that can learn novel visual concepts from a small set of examples.