MIMOQA: Multimodal Input Multimodal Output Question Answering

Multimodal research has picked up significantly in the space of question answering, with the task being extended to visual question answering, charts question answering, and multimodal input question answering. However, all these explorations produce a unimodal textual output as the answer. In this paper, we propose a novel task, MIMOQA (Multimodal Input Multimodal Output Question Answering), in which the output is also multimodal. Through human experiments, we empirically show that such multimodal outputs provide better cognitive understanding of the answers. We also propose a novel multimodal question-answering framework, MExBERT, that incorporates joint textual and visual attention towards producing such a multimodal output. Our method relies on a novel multimodal dataset curated for this problem from publicly available unimodal datasets. We show the superior performance of MExBERT against strong baselines on both automatic and human metrics.


Introduction
Multimodal content is at the heart of the digital revolution happening around the world. While the term modality has multiple connotations, one of its common usages is to indicate the content modality, i.e. images, text, audio etc. It has been shown that multimodal content is more engaging and provides better cognitive understanding to the end user (Dale, 1969; Moreno and Mayer, 2007; Sankey et al., 2010). With recent improvements in vision-language grounding and multimodal understanding (Bisk et al., 2020; Luo et al., 2020; Sanabria et al., 2018; Das et al., 2018), several works have explored beyond unimodal machine comprehension (Hermann et al., 2015; Kočiskỳ et al., 2018; Nguyen et al., 2016; Kwiatkowski et al., 2019) towards a holistic multimodal comprehension (Antol et al., 2015; Das et al., 2017; Anderson et al., 2018; Zhu et al., 2018; Goyal et al., 2017; Fayek and Johnson, 2020) with significant improvements.
However, all these explorations on multimodal understanding, question answering in particular, have limited their focus to unimodal outputs even with multimodal inputs. For example, the Visual Question Answering (VQA) task takes a textual query and an image to produce a textual answer. The multimodal question answering tasks (Antol et al., 2015; Kafle et al., 2018) take multiple input modalities, but the output is limited to text only. Even the recently proposed ManyModalQA (Hannan et al., 2020) relies on multimodal understanding to produce a textual answer. These works implicitly assume that textual answers can satisfy the needs of the query across multiple input modalities. We posit that such an assumption is not always true; while a textual answer can address several queries, a multimodal answer almost always enhances the cognitive understanding of the end user, since understanding the answer through visuals is faster and provides enhanced user satisfaction.
In this paper, we propose a new task, Multimodal Input Multimodal Output Question Answering (MIMOQA), which not only takes multimodal input but also answers the question with a multimodal output. Our key contributions are: 1) We introduce the problem of multimodal input multimodal output question answering. We establish the importance of such multimodal outputs in question-answering for enhanced cognitive understanding via human experiments. 2) We propose MExBERT, a novel multimodal framework for extracting multimodal answers to a given question and compare it against relevant strong baselines. Our proposed method includes a novel pretraining methodology and uses a proxy supervision technique for the image selection.
3) We curate a large dataset for the introduced problem by extending the MS-MARCO (Nguyen et al., 2016) and Natural Questions (Kwiatkowski et al., 2019) datasets to account for multimodal outputs. We propose the use of different automatic metrics and conduct human experiments to show their effectiveness.

Multimodal Output
Multimodal output not only provides better understanding to the end user but also provides grounding to the actual answer. For example, the multimodal output for the question in Figure 1(a) aids in better comprehension of the answer, while also providing grounding to words like 'stick' and 'knob'. In some cases, a textual answer might even be insufficient, especially for questions which seek explicit visual understanding (questions about colors, structures, etc.). In such cases, existing systems apply image understanding on top of the images to arrive at a 'textual description' of the desired answer. While this might suffice in some cases, a multimodal output can almost always enhance the quality of such answers. In Figure 1(b), the textual answer is insufficient and gets completed only with the help of the final image-text combination.

Figure 1: (a) Question: "What is a Shillelagh?" Answer: "A wooden walking stick and club or cudgel typically made from a stout knotty stick with a large knob at the top." The textual answer is sufficient, but the image provides better understanding. (b) Question: "How does Coronavirus look?" Answer: "They have characteristic club-shaped spikes that project from their surface," The textual answer is insufficient and is completed by an image.
To verify the hypothesis, we collated 200 question-answer pairs (refer to supplementary for details); for each pair, we created its unimodal and multimodal answers. We conducted a human experiment where each question-answer pair was judged by 5 annotators, each annotator rating whether the textual answer is sufficient for the input query. Irrespective of its sufficiency, the annotators were also asked whether the image in the multimodal variant enhances the understanding of the answer and adds value to it. To avoid the natural bias towards richer multimodal responses in such experiments, we explicitly inserted a few questions with irrelevant images (20%) and only considered the annotations which did not exhibit any bias in such questions.
Out of the 80.27% of the total responses where the annotators felt that the textual answers were sufficient, 87.5% felt the image enhanced their understanding even with such a sufficient textual answer, validating the importance of a multimodal answer. However, only 22.2% of the annotators felt the same when an irrelevant image was shown, indicating the absence of a strong bias towards richer responses. When the text was insufficient (19.73% of the responses), the relevant image boosted the understanding in 90.62% of the cases, further indicating that text-only answers are not always sufficient and that, in such cases, an appropriate image can aid in better understanding. Here again, only 27.65% felt that an irrelevant image would add such value, again indicating the lack of a strong bias towards multimodal answers just because they are richer. This experiment establishes that multimodal answers almost always improve the overall understanding irrespective of the sufficiency of the textual answer. Motivated by this, we propose the novel problem of multimodal input, multimodal output (MIMO) QA, which attends to multiple modalities and provides responses in multiple modalities.

Multimodal Output QA
Formally, given a piece of input text T along with a set of related images I and a query Q, our problem is to extract a multimodal answer M from {I, T}.
In an ideal case, the answer does not have to be multimodal, especially when there is no relevant image in the input. However, for the sake of simplicity, we assume that there is at least one image in the input that can complement the textual answer, even if the image is not critical for the textual answer to make sense. This follows our human experiments, which showed that an image adds value to the response over 90% of the time, irrespective of the sufficiency of the textual answer. Thus, our multimodal answer $M$ consists of a text $M_T$ and an accompanying image $M_I$.
Multimodal Extractive BERT (MExBERT): As we show later, a major problem with independently extracting the textual answer and matching an image is the absence of a joint understanding of the visual and textual requirements of the query. We, therefore, propose a joint attention Multimodal Extractive BERT based framework (MExBERT) that uses the query Q over both the input text T and the input images I. Figure 2 shows the overall architecture of our proposed MExBERT framework. Inspired by recent visuo-lingual models, our framework has two separate streams, a textual stream and a visual stream; the textual stream takes the query and input passage as input, while the visual stream takes the images as input. The textual stream is extended from the BERT-QA framework (Devlin et al., 2018) and consists of self-attention transformer (Vaswani et al., 2017) layers. The input to the textual stream, as shown in Figure 2, is the tokenized BERT embedding of the words in both passage and query. We also use the standard [CLS] and [SEP] tokens, the former prepended at the beginning and the latter placed between the query and the input passage. We additionally use positional and segment embeddings so that MExBERT can better distinguish between query and passage. Unlike the canonical BERT-QA, our textual stream employs two types of layers: regular self-attention layers and additional cross-attention layers.
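The input layout of the textual stream described above can be illustrated with a minimal sketch. It assumes whitespace tokens instead of BERT WordPiece tokens, and the helper name `build_textual_input` and the choice of assigning [CLS] and [SEP] to the query segment are assumptions made for illustration, not details stated in this paper.

```python
def build_textual_input(query_tokens, passage_tokens):
    """Lay out tokens as [CLS] query [SEP] passage with segment and position ids."""
    tokens = ["[CLS]"] + list(query_tokens) + ["[SEP]"] + list(passage_tokens)
    # Assumption: segment 0 covers [CLS], the query and [SEP]; segment 1 covers the passage.
    n_query_side = len(query_tokens) + 2
    segment_ids = [0] * n_query_side + [1] * len(passage_tokens)
    # Position ids simply index the tokens from left to right.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = build_textual_input(
    ["what", "is", "a", "shillelagh"],
    ["a", "wooden", "walking", "stick"],
)
```

In a real system the three lists would be mapped to token, segment and positional embeddings and summed before entering the first transformer layer.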
The initial layers of the textual stream include $N_{T_a}$ regular self-attention based transformer layers, similar to the canonical BERT-QA. The latter half of the textual stream is composed of $N_{T_b}$ layers, each of which consists of an additional cross-attention block along with the regular self-attention. Representing the attention computation in query-key-value format, the cross-attention block uses textual tokens as the query and the image representations from the visual stream as keys and values. This is different from self-attention, where the query, keys and values are all input textual tokens of the textual stream. The cross-attention block enables the framework to choose spans that are also coherent with the visual stream. Let $T^{i}_{k-1}$ and $V^{j}_{k-1}$ denote the features of the $i$-th textual token and the $j$-th image used as input to the $k$-th textual stream layer and the $(k - N_{T_a})$-th visual stream layer (as discussed later), and let $\mathrm{attn}(q, k, v)$ denote attention with query $q$, keys $k$ and values $v$. The self-attention and cross-attention are then given by
$$T^{i}_{k,\mathrm{self}} = \mathrm{attn}(T^{i}_{k-1}, T_{k-1}, T_{k-1}), \qquad T^{i}_{k,\mathrm{cross}} = \mathrm{attn}(T^{i}_{k-1}, V_{k-1}, V_{k-1}),$$
where $T_k = \{T^{0}_{k}, \ldots, T^{n}_{k}\}$ and $V_k = \{V^{0}_{k}, \ldots, V^{m}_{k}\}$. Here, $n$ is the number of textual tokens and $m$ is the number of input images. The final layer of the textual stream is used to calculate the start and end positions of the answer, similar to the canonical BERT-QA (Devlin et al., 2018), where one linear layer predicts the starting token and another predicts the ending token through a softmax applied over all tokens. The goal is to optimize the cross-entropy loss over both token position predictions.
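The difference between the two attention types, and the span prediction on top of the textual stream, can be sketched as follows. This is a simplified illustration and not the MExBERT implementation: it assumes single-head scaled dot-product attention without learned projections, feed-forward layers or residual connections, and the rule that the end token may not precede the start token is an added assumption.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attn(q, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(q))
    weights = softmax([dot(q, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]


def textual_layer_attention(text_feats, image_feats):
    """Return (self-attended, cross-attended) features for every textual token."""
    # Self-attention: query, keys and values are all textual tokens.
    self_out = [attn(t, text_feats, text_feats) for t in text_feats]
    # Cross-attention: textual token as query, image features as keys and values.
    cross_out = [attn(t, image_feats, image_feats) for t in text_feats]
    return self_out, cross_out


def predict_span(start_logits, end_logits):
    """Pick the most likely start token, then the most likely end at or after it."""
    start_probs = softmax(start_logits)
    end_probs = softmax(end_logits)
    start = max(range(len(start_probs)), key=start_probs.__getitem__)
    end = max(range(start, len(end_probs)), key=end_probs.__getitem__)
    return start, end


text_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # three toy token vectors
image_feats = [[1.0, 0.0], [0.0, 2.0]]             # two toy image vectors
self_out, cross_out = textual_layer_attention(text_feats, image_feats)
span = predict_span([0.1, 2.0, 0.3], [0.2, 0.1, 1.5])
```

Each cross-attended token is a convex combination of the image vectors, which is what lets the visual content influence the span that is eventually selected.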
The visual stream is similar to the textual stream with two key differences: (i) there is only one type of layer in the network, and the number of layers is $N_V = N_{T_b}$; and (ii) all the layers consist of only cross-attention blocks (along with feedforward layers and residual connections) and do not contain a self-attention block, as shown in Figure 2. Self-attention was not used because the images mostly derive their relevance and context from their textual counterparts in the input passage or query (through the cross-attention) rather than from other input images. The cross-attention is similar to that of the textual stream, except that the query is an image feature vector and the keys and values are the textual tokens' representations from the corresponding textual stream layer. The input to the visual stream is the global VGG-19 (Simonyan and Zisserman, 2014) features of each of the images. We do not use positional or segment encodings in the visual stream. We use a linear head on top of the visual features to predict whether a particular image should be in the output answer, and use weighted binary cross-entropy for training, where the weights w and 1 − w come from the proxy supervision values (as discussed later). The image with the highest confidence score for inclusion in the answer is regarded as the predicted image during inference.
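One plausible reading of the weighted binary cross-entropy described above is a loss in which the positive term is weighted by the proxy score w and the negative term by 1 − w. The sketch below follows that reading; the function names, the averaging over images and the example numbers are assumptions for illustration only.

```python
import math


def weighted_bce(logit, w):
    """Binary cross-entropy with the positive term weighted by w and the negative by 1 - w."""
    p = 1.0 / (1.0 + math.exp(-logit))  # sigmoid confidence that the image belongs in the answer
    eps = 1e-12                          # guards against log(0)
    return -(w * math.log(p + eps) + (1.0 - w) * math.log(1.0 - p + eps))


def image_selection_loss(logits, proxy_scores):
    """Average the weighted loss over all input images of one example (averaging is an assumption)."""
    losses = [weighted_bce(z, w) for z, w in zip(logits, proxy_scores)]
    return sum(losses) / len(losses)


def predict_image(logits):
    """At inference, return the index of the image with the highest confidence."""
    return max(range(len(logits)), key=logits.__getitem__)


logits = [2.0, -1.0, 0.5]       # hypothetical linear-head outputs for 3 images
proxy_scores = [0.9, 0.1, 0.4]  # hypothetical normalized proxy targets in [0, 1]
loss = image_selection_loss(logits, proxy_scores)
best = predict_image(logits)
```

With w close to 1 the loss rewards a high confidence for that image, and with w close to 0 it rewards a low one, so the proxy scores act as soft targets.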
Extract & Match: A natural framework to output a multimodal response would be to combine existing state-of-the-art frameworks in question answering and visuo-lingual understanding. To illustrate the shortcomings of such an assembled framework and motivate the need for a holistic framework, we implement such a framework using existing models as our Extract & Match (E&M) baseline. Given the input query (Q), the input text (T) and the images (I), we first extract the textual answer using unimodal BERT-QA (Devlin et al., 2018). We then use this extracted answer, the query, and the input text to select an image from the input images, using UNITER to rank the images. UNITER has been trained on millions of image-text pairs for the image-text matching task, i.e. the task of identifying whether a given image-text pair is actually an image and its caption. Due to strong pretraining, UNITER has achieved SOTA performance on a variety of vision and language tasks, including zero-shot image-text matching. So, we use it as our baseline for image selection. We provide each image along with the text (answer, query and input text) to UNITER and use the classification confidence predicted by the image-text matching head to rank the images. The image which receives the highest confidence score for a given text is taken as the matched output.
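The two-stage control flow of this baseline can be sketched as below. The extractor and matcher are passed in as callables; the toy stand-ins `toy_extract_answer` and `toy_match_score` are hypothetical helpers that only make the sketch runnable and are not BERT-QA or UNITER.

```python
def extract_and_match(query, passage, images, extract_answer, match_score):
    """Two-stage baseline: extract a text span, then rank images against the text."""
    answer = extract_answer(query, passage)
    # The matching text concatenates the answer, the query and the input passage.
    matching_text = " ".join([answer, query, passage])
    scores = [match_score(image, matching_text) for image in images]
    best = max(range(len(images)), key=scores.__getitem__)
    return answer, images[best], scores


# Toy stand-ins (assumptions): a first-sentence extractor and a tag-overlap matcher.
def toy_extract_answer(query, passage):
    return passage.split(".")[0]


def toy_match_score(image, text):
    words = set(text.lower().split())
    return len(words & set(image["tags"])) / max(len(image["tags"]), 1)


images = [
    {"id": "img_a", "tags": ["virus", "spikes"]},
    {"id": "img_b", "tags": ["stick", "knob"]},
]
answer, image, scores = extract_and_match(
    "what is a shillelagh",
    "a wooden walking stick with a large knob. it is associated with ireland.",
    images,
    toy_extract_answer,
    toy_match_score,
)
```

The sketch makes the limitation visible: the image ranking never influences which span is extracted, because the two stages only communicate through the already extracted text.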

Dataset & Pretraining
Since there is no existing dataset which satisfies the requirements of the task, we curate a new dataset (refer to supplementary for details on the curation strategy and data samples) by utilizing existing public datasets. We observe that several QA datasets contain answers that come from a Wikipedia article. Since most Wikipedia articles come with a set of related images, such images can feature as the input I in our setup. Extending this heuristic, we use two QA datasets, MS-MARCO (Nguyen et al., 2016) and Natural Questions (NQ) (Kwiatkowski et al., 2019), to extract those question-answer pairs which were originally extracted from Wikipedia, and scrape all images from the original article. More details about the curation process and examples of the images scraped for questions can be found in the appendix. Table 1 shows various statistics about the dataset. The dataset includes a large number of images, making the task of selecting an appropriate image non-trivial. The variety of images also necessitates robust visual and language understanding by our model. The passages have been formed by combining the answer source passage with 2 to 3 randomly chosen 'distractor' passages from the original Wikipedia article. This allows the model to learn to find the right answer in unseen conditions as well. The number of tokens in our input passages is large enough for a passage to be regarded as a full input (instead of using the entire article), considering that the focus here is on multimodal output and not on article-passage ranking.
Proxy Supervision: Although we have scraped the images from the original articles, we do not have any supervision for these images in our dataset. In order to train the model to judge which images are relevant to an answer, we heuristically compute proxy targets by using two types of information about the image: its position in the original article and its caption.
We use the caption and position information only to obtain the target scores during training and not as an explicit input to our model, since such information is not always readily available. Thus, our model is able to infer the correct multimodal response irrespective of the availability of such information at inference time. Since MS-MARCO and Natural Questions provide information about the original source passage for the final answer, we know the position of the source passage. We calculate the proximity distance P between the first token of the source passage of the answer and an image, with the number of tokens chosen as the distance unit. We further normalize this by the total number of tokens present in the entire article. We calculate the TF-IDF similarity of the caption against the query, the answer and the source passage (Figure 3). The overall supervision score is calculated as a weighted sum of these 4 scores, where the proximity score is calculated as 1 − P. The normalized supervision scores (between 0 and 1) are used as targets for the linear layer of the visual stream.
Figure 3: Calculation of proxy supervision scores for an image. We compute the TF-IDF similarities of the caption with the question, answer and relevant paragraph, and also compute the distance of the image from the answer paragraph. These are summed in a weighted fashion to get the final score.
Pretraining: Vision and language tasks have relied on pretraining to address the complexities in building visuo-lingual relationships (Tan and Bansal, 2019; Lu et al., 2019). Following this, we leverage pretraining to better initialize our model. Further, our signals (even after including proxy supervision) are relatively sparse for a visuo-lingual task, calling for a stronger model initialization. We use Conceptual Captions (Sharma et al., 2018), as it has been shown to impart a generic V-L understanding. We use the standard Masked Language Modelling (MLM) task over Conceptual Captions to pretrain the textual stream and employ the cross-entropy loss over the masked tokens. While the task is intended to train the textual stream, the visual stream is also fine-tuned in this process, since the entire caption is generated from the visual information through the cross-attention mechanism. Since our final model uses segment IDs, we randomly assign a segment ID of either query or passage to each caption during pretraining in order to imbibe language understanding for both types of tokens. For pretraining the visual stream, we modify Conceptual Captions (Sharma et al., 2018) by choosing a random number N between 3 and 10 for each caption, followed by selecting N − 1 negative images (i.e. images which have different captions) along with the image that is associated with the caption. We provide the caption as input to the textual stream and these N images as input to the visual stream. We train the model to predict the image corresponding to the caption by using a binary cross-entropy loss over the images. Again, while this task is focused mainly on visual stream initialization, the textual stream is also fine-tuned due to the cross-attention layers between the two streams.
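The construction of one visual pretraining example described above (a caption, its own image and N − 1 negative images) can be sketched as follows. The uniform and inclusive draw of N, the shuffling of candidates and the helper name `build_pretraining_example` are assumptions; the paper does not specify them.

```python
import random


def build_pretraining_example(pairs, index, rng):
    """Build one caption-to-image selection example with N - 1 negative images."""
    image, caption = pairs[index]
    n = rng.randint(3, 10)  # assumption: N is drawn uniformly from 3..10 inclusive
    # Negatives are images whose caption differs from the current caption.
    candidates = [img for img, cap in pairs if cap != caption]
    negatives = rng.sample(candidates, n - 1)
    images = negatives + [image]
    rng.shuffle(images)  # avoid leaking the positive through its position
    labels = [1 if img == image else 0 for img in images]
    return caption, images, labels


# Toy corpus standing in for (image, caption) pairs from a captioning dataset.
pairs = [(f"image_{i}.jpg", f"caption {i}") for i in range(50)]
caption, images, labels = build_pretraining_example(pairs, 7, random.Random(0))
```

The caption would be fed to the textual stream and the N images to the visual stream, with the binary labels serving as targets for the image-selection head.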

Experiments
We conduct extensive experiments and ablations for the proposed MExBERT framework and compare it against the E&M baseline. We divide our curated dataset into train, development and test sets as shown in Table 1. As mentioned before, we used the 3.2 million image-caption pairs from the Conceptual Captions dataset (Sharma et al., 2018) for pretraining the MExBERT layers. For proxy supervision, we empirically determine the weights after analyzing the manually selected images in the dev set (as discussed later): the proximity weight $w_{px} = 0.4$, the passage weight $w_p = 0.3$, the query weight $w_q = 0.15$ and the answer weight $w_a = 0.15$.
For the E&M baseline, we pretrain the text extraction on the SQuAD dataset (Rajpurkar et al., 2016) and finetune it on our dataset. For the image matching, we rank images using the input query (Q), the input passage (P) and the extracted answer (A), all concatenated together. For MExBERT, we tested different variants with and without proxy supervision (PS) and with different pretraining setups (pretraining the textual stream alone, the visual stream alone, and both) to test the independent value of each type of pretraining.
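Using the weights above, the proxy supervision score of an image can be sketched as a weighted sum of the proximity score (1 − P) and the three caption similarities. The sketch assumes whitespace tokenization, a smoothed TF-IDF computed over just the four texts being compared, and token offsets for the image and source passage; all names are hypothetical, and any further normalization of the scores is not shown.

```python
import math
from collections import Counter

# Weights as reported in the experiments: proximity, passage, query, answer.
WEIGHTS = {"proximity": 0.4, "passage": 0.3, "query": 0.15, "answer": 0.15}


def tfidf_vectors(docs):
    """Smoothed TF-IDF vectors (as dicts) for a small list of whitespace-tokenized docs."""
    tokenized = [doc.lower().split() for doc in docs]
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append(
            {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1.0) for t, c in tf.items()}
        )
    return vectors


def cosine(u, v):
    num = sum(u[t] * v[t] for t in u if t in v)
    den = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return num / den if den else 0.0


def proxy_score(caption, query, answer, passage, image_pos, passage_pos, article_len):
    """Weighted sum of (1 - P) and the three caption TF-IDF similarities."""
    cap, q, a, p = tfidf_vectors([caption, query, answer, passage])
    proximity = abs(image_pos - passage_pos) / article_len  # normalized distance P
    return (
        WEIGHTS["proximity"] * (1.0 - proximity)
        + WEIGHTS["passage"] * cosine(cap, p)
        + WEIGHTS["query"] * cosine(cap, q)
        + WEIGHTS["answer"] * cosine(cap, a)
    )


score = proxy_score(
    caption="a shillelagh walking stick",
    query="what is a shillelagh",
    answer="a wooden walking stick",
    passage="a shillelagh is a wooden walking stick and club",
    image_pos=120, passage_pos=100, article_len=1000,
)
```

Because the four weights sum to 1 and every component lies in [0, 1], the resulting score also lies in [0, 1] and can be used directly as a soft target.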
Except for the pretraining experiments and baseline experiments, all our experiments on MExBERT have been conducted with 3 random seeds, and the reported scores have been averaged over the 3 seeds. We use BERT pretrained embeddings for the textual stream of MExBERT and use $N_{T_a} = N_{T_b} = N_V = 6$. For finetuning MExBERT, we use the Adam optimizer with an initial learning rate of 0.0001 and train until the validation loss saturates. The model was trained over 4 V100 machines using a batch size of 8 for finetuning and 64 for pretraining. For pretraining, we use the Adam optimizer with a learning rate of 0.0001 for 2 epochs over the 3.2 million image-text pairs for all our ablations during the pretraining stage. We use 768 dimensional textual embeddings with a vocabulary size of 30,522 and an intermediate hidden embedding size of 3072 for both textual and visual features. We project the 4096 dimensional VGG-19 image features into 2048 dimensions and use them as input to the visual stream.
Evaluation Metrics: We independently evaluate the text and image parts of the extracted answer using various metrics. For the text, we consider standard metrics such as ROUGE and BLEU, which are popularly used in the literature for textual question answering. For images, we use precision@1, 2 and 3, in which we measure whether the predicted image is among the top-1, top-2 or top-3 images as selected in the ground truth. Although these metrics are standard, we verify their utility in the multimodal case by conducting a human experiment and calculating their correlations with human judgments.
To further validate the choice of our metrics, we collated a subset of 200 examples which have their ground truth available (collected as discussed later). We then apply our best performing model to these examples and generate the multimodal answers. For each of the 200 pairs, we have both its predicted and its ground truth counterparts. We conduct a human experiment where the annotators are asked to rate the quality of both the textual and image parts of the answer on relevance R and user satisfaction S. The overall quality of the answer is high if it is both relevant and provides high user satisfaction. For each pair, 5 different annotators rate the answers, resulting in independent ratings for both predicted and ground truth answers. We calculate the overall quality of a predicted answer $Q_a$ with respect to the ground truth as the ratio between the quality (which we represent by $R \cdot S$) of the predicted answer and that of the ground truth answer,
$$Q_a = \frac{(R \cdot S)_{\text{predicted}}}{(R \cdot S)_{\text{ground truth}}}.$$
We compute the Pearson correlation between the different metrics and $Q_a$. We observe that Rouge-1, Rouge-2, Rouge-L and BLEU yield correlation scores of 0.2899, 0.2716, 0.2918 and 0.2132 respectively, indicating a moderate correlation and reassuring us of their viability for evaluating the textual answer even in our multimodal setup. For image metrics, we found precision@1 to be most strongly correlated with human judgement (0.5421). While the expectation might be that such a metric has a perfect correlation, the user judgement is also biased by the corresponding textual answer, leading to different scores even if the image is the same in the actual and predicted answers.
Table 2 shows the performance of E&M against MExBERT (and its ablations) on extracting the right textual part of the multimodal answer.
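The quality ratio and its correlation with an automatic metric can be computed as in the sketch below. The rating scale, the way the 5 annotators' ratings are aggregated and all numbers are hypothetical; only the ratio of R * S values and the use of Pearson correlation come from the text.

```python
import math


def answer_quality_ratio(pred_r, pred_s, gt_r, gt_s):
    """Q_a: quality (R * S) of the predicted answer relative to the ground truth."""
    return (pred_r * pred_s) / (gt_r * gt_s)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ratings (predicted R, predicted S, ground-truth R, ground-truth S)
# and hypothetical per-example metric values, for illustration only.
ratings = [(4, 4, 5, 5), (2, 3, 4, 5), (5, 5, 5, 5), (3, 2, 4, 4)]
q_a = [answer_quality_ratio(*r) for r in ratings]
metric = [0.61, 0.25, 0.93, 0.30]
corr = pearson(metric, q_a)
```

A ratio of 1 means the predicted answer was rated as highly as the ground truth; a metric is useful to the extent that it rises and falls with this ratio across examples.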
In order to test whether the visual attention on its own makes any difference to the text answer quality, we also compare two variants of MExBERT: one where the visual input is zeroed out and another where the images are given as input without any supervision on the image selection. In the latter case, we use the average attention weights of an image to determine its relevance to an answer. While the difference is not drastically large, we observed noticeable improvements with the visual input as compared to zero visual input, affirming our understanding about the value of utilizing multimodal input and cross-modal learning. We notice a marginal improvement in the text scores if we use proxy supervision scores during training. Intuitively, this is because of a better focus of the query on the target image, which further enhances its attention over the correct part of the answer in the input. Since our corpus is relatively small compared to the text-only QA datasets typically used in recent works, we considered pretraining to be a natural choice to improve our model further. While the improvements in text scores with visual pretraining are marginal (which is expected, since this training is directed at the visual stream), language pretraining yields reasonable improvements, as shown in Table 2.
Evaluating Image Output: We rank the images in the test set using our proxy supervision scores. We also select the image with the highest score as predicted by the respective model. We deem this image as precise @1, 2 or 3 depending on whether it is present in the top-1, top-2 or top-3 images as ranked by our proxy supervision mechanism. While conducting the evaluation, we skip those data points which have no image or only a single image in the input, to avoid any bias in the evaluation. After removing such data points, there were 2,800 test data points with 2 or more images.
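The precision@k evaluation described above, including the exclusion of data points with fewer than two input images, can be sketched as follows. The dictionary layout of an example and the function names are assumptions made for illustration.

```python
def precision_at_k(predicted, reference_ranking, k):
    """1.0 if the predicted image is among the top-k reference images, else 0.0."""
    return 1.0 if predicted in reference_ranking[:k] else 0.0


def evaluate(examples, ks=(1, 2, 3)):
    """Average precision@k over examples that have at least two input images."""
    kept = [ex for ex in examples if len(ex["ranking"]) >= 2]
    return {
        k: sum(precision_at_k(ex["predicted"], ex["ranking"], k) for ex in kept) / len(kept)
        for k in ks
    }


# Hypothetical examples: "ranking" lists image ids sorted by reference (e.g. proxy) score.
examples = [
    {"predicted": "a", "ranking": ["a", "b", "c"]},
    {"predicted": "c", "ranking": ["a", "c", "b"]},
    {"predicted": "d", "ranking": ["a", "b", "c", "d"]},
    {"predicted": "x", "ranking": ["x"]},  # skipped: only one input image
]
scores = evaluate(examples)
```

The same routine applies when the reference ranking comes from human preference scores instead of proxy scores; only the `ranking` field changes.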
As mentioned before, in E&M we retrieve the highest scoring image based on the concatenation of the query Q, the passage P, and the extracted answer A as the matching text, so that the model has access to the whole textual input. Evidently, the results obtained are better than random but are still far from accurate. In fact, they are just more than half as good as those obtained with our heuristically created proxy scores when compared with human preferences, as shown in Table 4. This shows that the problem is much harder than just using image retrieval models, calling for joint attention to understand the relevance of question, passage and answer. Results using only questions and answers as input text for UNITER were either poorer or similar, and hence are not reported due to space limitations. The power of joint multimodal attention is strongly evident, as even without any visuo-lingual pretraining we obtain meaningful (better than random) scores with just the averaged attention. The assumption, while using the highest average attention weights for selecting the image, is that the model learns to focus on relevant images while being trained to optimize for better textual answer generation. Applying our proxy supervision mechanism while training the model, we find a very significant improvement, especially in precision@1 scores. Precision@2 and 3 scores are, however, similar to what we obtained with E&M. That is perhaps because UNITER is good at establishing the relationships between text and images, resulting in good precision@2 and 3 scores, but it fails at deciding the top image with high confidence due to a lack of explicit understanding of where to focus in the text. Such a joint understanding is the main strength of MExBERT. Visual pretraining yields larger improvements on the precision@1 metric, while language pretraining provides marginal improvements.
Figure 4: Comparison of the retrieved images for MExBERT and E&M models. We observe that the joint attention mechanism incorporates a better multimodal understanding, enabling MExBERT to extract the correct images.
Table 3: Results showing the performance of E&M and MExBERT over the image modality of the multimodal answer as measured against the proxy scores over the test set.
Human Evaluation: While our proxy scores have been intuitively designed, they are error prone. We therefore collected human annotations over the entire test corpus to further validate our model's performance. We conduct a Mechanical Turk experiment where the turkers were asked to select, from a given set of input images, the image which embellishes the textual response for a (question, answer, source passage) triplet. Every question-answer pair was annotated by 5 annotators, with each annotator annotating 5 such pairs; we pay $0.2 for every such annotation. We also provide an option of selecting 'no image', since some inputs might not have any relevant image that could go well with the answer.
We find an agreement rate of over 50% for the selected image in over 90% of the cases. We, therefore, use the average number of votes per image as a 'preference' score for the image, and use this to compute the precision values in Table 4. The performance of MExBERT against such human annotations is better than its performance when calculated over proxy scores, indicating that the proposed MExBERT is robust to the noise that might have crept into the proxy supervision and generalizes well. This also explains why the precision is lower in the noisy setting of proxy supervision than in the low-noise setting based on the human annotations. High precision values of proxy scores over the human preference scores demonstrate the effectiveness of our proposed heuristic for preparing proxy training targets.
Table 4: Results comparing the performance of E&M and MExBERT over the image modality of the multimodal answer based on human evaluation over the test set.

Related Works
Machine reading comprehension and question answering have been explored for a while, with the earliest works dating back to 1999 (Hirschman et al., 1999). Until recently, most of these works dealt with a single modality at a time. While earlier datasets were small, beginning with SQuAD (Rajpurkar et al., 2016) several large datasets (Rajpurkar et al., 2018; Yang et al., 2018; Choi et al., 2018; Reddy et al., 2019; Kwiatkowski et al., 2019) have been proposed. Though many of these are extractive in nature, there are a few multiple-choice datasets (Mihaylov et al., 2018; Richardson et al., 2013). Datasets like QAngaroo and HotpotQA (Welbl et al., 2018; Yang et al., 2018) enable reasoning across multiple documents. Recently, several Table-QA datasets have also been proposed, aimed at providing a natural language answer by reasoning over tables. While some datasets like WikiTableQuestions (Pasupat and Liang, 2015) and MLB (Cho et al., 2018) have natural language questions, others like TabMCQ (Jauhar et al., 2016) have multiple choice questions. A popular exploration in multimodal question answering is Visual Question Answering or VQA (Antol et al., 2015; Goyal et al., 2017; Anderson et al., 2018; Lu et al., 2016, 2019; Tan and Bansal, 2019), where the input is a textual query along with an image and the output is a text answer. Another variant of this, Charts Question Answering (Kafle et al., 2018, 2020; Kahou et al., 2017; Chaudhry et al., 2020), allows the input to be a chart instead of a natural image. While both of these problems involve multimodality (image + question or chart + question), the output is still textual (specifically an answer class, since this is usually modelled as a classification problem). While the question is received as text in these problems, the reasoning is performed over a single modality only.
In our work, we reason out across multimodal input by simultaneously attending to images and text in the input to arrive at our target output.
To overcome unimodal reasoning, there are attempts at truly multimodal reasoning with datasets such as ManyModalQA (Hannan et al., 2020), RecipeQA (Yagcioglu et al., 2018), and TVQA. While RecipeQA aims at reasoning over recipes and the associated pictures, TVQA involves multimodal comprehension over videos and their subtitles. The recently proposed ManyModalQA goes a step further by adding tables to the multimodal reasoning as well. However, these datasets provide responses in a single modality only, either an MCQ or a textual response. With the rate at which multimodal consumption is taking place in our lives, it is important that answering systems also enable multimodal output which, as discussed, can provide better cognitive understanding when combined with the textual modality.

Conclusion
We presented one of the first exploration, to the best of our knowledge, of multimodal output question answering from multimodal inputs and proposed usage of publicly available textual datasets for it. We proposed strong baselines by utilizing the existing frameworks for extract textual answers and independently match them with an appropriate image. We demonstrate the value of a joint-multimodal understanding for multimodal outputs in our problem setup by developing a multimodal framework MExBERT which outperformed the baselines significantly on several metrics. We also developed a proxy supervision technique in absence of labelled outputs and showed its effectiveness for improved multimodal question answering. We used some existing metrics to compare the different models and justified the usage of these metrics based on a human experiment.
While it is an interesting and challenging task even in its current shape, we believe there are several limitations in our proposed framework. While our datasets had multimodal elements, modeling multimodal reasoning from multimodal inputs and using it to arrive at a multimodal answer calls for a more careful question curation that includes these challenges. Recently proposed datasets such as MultimodalQA have created questions explicitly aimed at reasoning across multimodal input, but lack the multimodal output component. Future work could include questions which specifically aim for visual elements, making the output requirement multimodal. Also, free-form answer generation in the multimodal input/output context is another interesting subject for further research.

A Implementation details
Our models were trained on 4 V100 machines and take just 1 sec for the whole people in such a setting. As mentioned, except for the pretraining and baseline experiments, all our experiments on MExBERT have been conducted with 3 random seeds, and the reported scores have been averaged over the 3 seeds.

B Human Evaluation
We conduct elaborate human experiments for analyzing the performance of our models as well as the utility of the task. As mentioned, we perform an experiment to establish the need for such a task and how multimodal outputs provide enhanced understanding to the end user. Before that, however, we perform a human experiment to label the relevant image for each question in the test set, as discussed in the section on human evaluation. The interface for the experiment is shown in Fig. 5. For each HIT, we provide the turkers with 5 (Question, Answer, Passage) triplets and multiple-choice options from which they select the most relevant image corresponding to the question-answer pair.
We demonstrate what relevance means with the help of an example, as shown. We also provide them with the option 'None of these', as in some cases no image might be relevant. In order to ensure the quality of responses (to accept or reject turkers' responses), we insert random images in one random question out of the 5. Ideally, a turker paying attention while providing responses is expected to select 'None of these' for that question. We find an acceptance ratio of more than 90% in this first experiment, indicating high-quality annotation.
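The accept/reject rule described above can be sketched in a few lines. This is only an illustration under assumptions: the dictionary layout of a question, the function names, and the pool of random images are hypothetical and are not taken from our annotation pipeline.

```python
import random

NONE_OPTION = "None of these"  # option label shown in the annotation interface


def add_control_question(questions, random_image_pool, rng):
    """Return a copy of the HIT with one question's images swapped for random ones.

    `questions` is a list of dicts with an "images" list (hypothetical layout).
    """
    control_idx = rng.randrange(len(questions))
    hit = [dict(q) for q in questions]  # shallow copies; originals stay untouched
    n_images = len(hit[control_idx]["images"])
    hit[control_idx]["images"] = rng.sample(random_image_pool, n_images)
    return hit, control_idx


def accept_hit(selected_options, control_idx):
    """Accept the HIT only if the control question was answered 'None of these'."""
    return selected_options[control_idx] == NONE_OPTION


def acceptance_ratio(decisions):
    """Fraction of accepted HITs; `decisions` is a list of booleans."""
    return sum(decisions) / len(decisions) if decisions else 0.0
```

The sketch assumes that the images in `random_image_pool` are unrelated to the question, so that 'None of these' is the only attentive response for the control question.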
After creating the test set (over 3.5k examples), we randomly select 200 examples from the test set (ensuring there are at least 2 images in the selected examples) and provide a unimodal as well as a multimodal answer for the annotators to analyze in another experiment. As shown in Figure 6, we ask the annotators a set of 6 questions overall. We have already discussed the outcomes of the experiment in the main paper. Here, we highlight how we maintain the quality of the responses. In some random inputs to the annotator, we make the text-image pair incompatible, while in some cases we make the answer non-recoverable from the input passage. A turker paying appropriate attention to the task will easily identify the answer 'No' to the two additional questions given at the end. The answers to those two questions determine whether a particular HIT is accepted or rejected. Since we provided a reasonable amount for annotation, we find a >95% acceptance ratio, indicating that the evaluation so performed is clean and can be reliably used to draw conclusions.

C Dataset
In this section, we describe the dataset collection process and present some statistics about the dataset.
As already described in the main paper, we create our dataset by curating and subsampling a set of questions with images from the MS-MARCO and Natural Questions datasets. Fig. 7 shows the distribution of different types of tokens in the dataset. For simplicity of representation, we have retained only those frequently occurring tokens (for both levels) which have more than 5% of the total frequency for their category.
Filtering for MS-MARCO: From the MS-MARCO dataset we filter out the entries which do not have a Wikipedia page as a source for the answer paragraph. Since we are focusing on extractive multimodal outputs in this paper, we further eliminate all those question-answer pairs where the answer does not appear in the selected passages. Instead of eliminating answers without an exact match, we use edit distance to retain answers that include minor edits (e.g. removal of parentheses) in our dataset.
Filtering for Natural Questions: For the Natural Questions dataset, all answers are guaranteed to be grounded in Wikipedia entries. We use the short answer provided by the authors as our target answer, and use the long answer along with distractor passages as the input to our model. However, to reduce the noise from NQ, we removed questions with a single-word answer and questions where the original Wikipedia article had no images.
Figure 5: Instructions provided to the human annotators for labelling the relevant image for the triplet
Figure 6: Interface shown to the human annotators for the task of identifying the need of the multimodal output
Scraping images from Wikipedia: Our main motivation for using answers grounded in Wikipedia articles for our corpus was to exploit the structure of such articles to scrape images and get proxy supervision. To this end, we prefix the title of the article provided in the url field of MS-MARCO with http://en.wikipedia.com/wiki/ to get the URL of the appropriate Wikipedia article. 
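The edit-distance filter used for MS-MARCO above can be illustrated with a small sketch. It computes the smallest Levenshtein distance between the answer and any substring of the passage (approximate substring matching) and keeps the pair when that distance is small. The function names, the lowercasing, and the threshold `max_edits=2` are assumptions made for illustration; the exact matching procedure and threshold are not specified here.

```python
def min_substring_edit_distance(answer: str, passage: str) -> int:
    """Smallest Levenshtein distance between `answer` and any substring of `passage`."""
    # Row for the empty answer prefix: a match may start anywhere at zero cost.
    prev = [0] * (len(passage) + 1)
    for i, a_ch in enumerate(answer, start=1):
        # Aligning i answer characters with an empty passage prefix costs i edits.
        curr = [i]
        for j, p_ch in enumerate(passage, start=1):
            substitution = prev[j - 1] + (0 if a_ch == p_ch else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    # The match may also end anywhere in the passage.
    return min(prev)


def keep_pair(answer: str, passage: str, max_edits: int = 2) -> bool:
    """Retain a question-answer pair if the answer appears up to `max_edits` edits.

    `max_edits` is a hypothetical threshold chosen for this example.
    """
    return min_substring_edit_distance(answer.lower(), passage.lower()) <= max_edits
```

For example, `keep_pair("New York (city)", "She lives in New York city now.")` retains the pair, since only the two parentheses differ. A fixed threshold is too permissive for very short answers, so a length-dependent threshold may be preferable in practice.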
We use the BeautifulSoup package to find all img elements in the HTML page and scrape the largest available resolution of each image (found from the srcset property). Further, ing the task fairly difficult. This has also been demonstrated by the large difference between the UNITER accuracy and MExBERT's accuracy. We show below some randomly chosen samples from the dataset (which were also correctly chosen by our model MExBERT) to provide the reader with an idea of the variety of inputs and input images. The question is shown at the top of the box, while the input passage and the set of images are shown inside the box. The red boundary around one of the images denotes the image which was selected during annotation and was also predicted correctly by the MExBERT framework.
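The image-scraping step described earlier in this section, selecting the largest candidate listed in the srcset attribute of each img element, can be sketched as follows. To keep the example dependency-free, it uses Python's standard html.parser module instead of BeautifulSoup, so it is a stand-in rather than the code used for the dataset. The function and class names are hypothetical, and the srcset parsing is simplified (it assumes candidate URLs contain no commas).

```python
from html.parser import HTMLParser


def largest_from_srcset(src, srcset):
    """Pick the candidate with the largest descriptor; fall back to `src`."""
    best_url, best_size = src, 1.0  # a plain src counts as the 1x candidate
    for candidate in (srcset or "").split(","):
        parts = candidate.split()
        if len(parts) != 2:
            continue  # skip empty or malformed candidates
        url, descriptor = parts
        try:
            size = float(descriptor[:-1])  # "2x" -> 2.0, "640w" -> 640.0
        except ValueError:
            continue
        if descriptor[-1] in "wx" and size > best_size:
            best_url, best_size = url, size
    return best_url


class ImgCollector(HTMLParser):
    """Collects one URL per img element, preferring the largest srcset entry."""

    def __init__(self):
        super().__init__()
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        url = largest_from_srcset(attr.get("src"), attr.get("srcset"))
        if url:
            self.urls.append(url)


def scrape_image_urls(html: str):
    """Return the chosen image URL for every img element in `html`."""
    collector = ImgCollector()
    collector.feed(html)
    return collector.urls
```

The function `scrape_image_urls` expects the HTML of an already downloaded page; fetching the page itself is left out of the sketch.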