Be Different to Be Better! A Benchmark to Leverage the Complementarity of Language and Vision

This paper introduces BD2BB, a novel language and vision benchmark that requires multimodal models to combine complementary information from the two modalities. Recently, impressive progress has been made in developing universal multimodal encoders suitable for virtually any language and vision task. However, current approaches often require them to combine redundant information provided by language and vision. Inspired by real-life communicative contexts, we propose a novel task where either modality is necessary but not sufficient to make a correct prediction. To do so, we first build a dataset of images and corresponding sentences provided by human participants. Second, we evaluate state-of-the-art models and compare their performance against human speakers. We show that, while the task is relatively easy for humans, the best-performing models struggle to achieve similar results.


Introduction
Human communication, in real-life situations, is multimodal (Kress, 2010): To convey and understand a message uttered in natural language, people build on what is present in the multimodal context surrounding them. As such, speakers do not need to "repeat" something that is already provided by the environment; similarly, listeners leverage information from various modalities, such as vision, to interpret the linguistic message. Integrating information from multiple modalities is indeed crucial for attention and perception (Partan and Marler, 1999) since combined information from concurrent modalities can give rise to different messages (McGurk and MacDonald, 1976).
The argument that language and vision convey different, possibly complementary aspects of meaning has largely been made to motivate the need for multimodal semantic representations of words (Baroni, 2016; Beinborn et al., 2018). However, computational approaches to language and vision typically do not fully explore this complementarity. To illustrate, given an image (e.g., the one depicted in Figure 1), popular tasks involve describing it in natural language, e.g., "A tennis player about to hit the ball" (Image Captioning; see Bernardi et al., 2016); answering questions that are grounded in it, e.g., Q: "What sport is he playing?", A: "Tennis" (Visual Question Answering; see Antol et al., 2015); or having a dialogue about its entities, e.g., Q: "Is the person holding a racket?", A: "Yes." (visually-grounded dialogue; see De Vries et al., 2017; Das et al., 2017). While all these tasks challenge models to perform visual grounding, i.e., an effective alignment of language and vision, none of them requires a genuine combination of complementary information provided by the two modalities. All the information is fully available in the visual scene, and language is used to describe or retrieve it.
In this work, we propose a novel benchmark, Be Different to Be Better (in short, BD2BB), where the different, complementary information provided by the two modalities should push models to develop a better, richer multimodal representation. As illustrated in Figure 1, models are asked to choose, among a set of candidate actions, the one a person who sees the visual context depicted by the image would perform based on a certain intention (i.e., their goal, attitude, or feeling). Crucially, the resulting multimodal input (the sum of the image and the intention) is richer than that conveyed by either modality in isolation; in fact, the two modalities convey complementary, non-redundant information (Partan and Marler, 1999).
Figure 1: Given an image depicting, e.g., a tennis player during a match and the intention "If I have tons of energy", the task involves choosing, from a list of 5 candidate actions, the target action that unequivocally applies to the combined multimodal input: "I will play a game of tennis with the man". The task is challenging: a model exploiting a language or vision bias could fall into the trap of decoy actions containing words highlighted in blue or orange, respectively. Therefore, selecting the target action requires models to perform a genuine integration of the two modalities, whose information is complementary. Best viewed in color.

To illustrate, a model that only relies on the (non-grounded) linguistic information conveyed by the intention, i.e., "If I have tons of energy", might consider as equally plausible any actions that have to do with playing a sport, e.g., "I will play baseball with the men" or "I will play a game of tennis with the man". Similarly, a model that only relies on the visual information conveyed by the image, a tennis player during a match, might consider as equally plausible any actions that have to do with 'tennis' and/or 'player', e.g., "I will applaud my favourite tennis player of all time" or "I will play a game of tennis with the man". In contrast, a model that genuinely combines information conveyed by both modalities should be able to select the target action, namely the only one that is both consistent with the intention and grounded in the image, i.e., "I will play a game of tennis with the man". Moreover, similarly to real-life communicative scenarios, in our approach different language inputs modulate the same visual context differently, which gives rise to various multimodal messages. To illustrate, if the image in Figure 1 is paired with the intention "If I am tired watching", the target action "I will play a game of tennis with the man" is no longer valid. Indeed, the target action in this context is "I will leave the tennis court" (see Figure 3).

Our work makes the following key contributions:
• We introduce a novel multimodal benchmark: the set of ∼ 10K ⟨image, intention, action⟩ datapoints collected via crowdsourcing and enriched with meta-annotation, and the multiple-choice task, BD2BB, which requires a proper integration of language and vision and is specifically aimed at testing SoA pretrained multimodal models. The benchmark, together with the code and trained models, is available at: https://sites.google.com/view/bd2bb
• We test various models (including the SoA multimodal, transformer-based LXMERT; Tan and Bansal, 2019) and show that, while BD2BB is a relatively easy task for humans (∼ 80% acc.), the best systems struggle to achieve a similar performance (∼ 60% acc.).
• We extensively analyze the results and show the advantage of exploiting multimodal pretrained representations. This confirms they are effective, but not enough to solve the task.

Related Work
Since the introduction of the earliest multimodal tasks, such as Image Captioning (IC; see Bernardi et al., 2016) and Visual Question Answering (VQA; Antol et al., 2015), a plethora of tasks dealing with language and vision have been proposed. In parallel, baseline models have been replaced by more powerful attention-based systems (Anderson et al., 2018) and, more recently, by transformer-based architectures pretrained on several tasks (Tan and Bansal, 2019; Lu et al., 2019; Chen et al., 2019). These latter models build on multimodal representations that are meant to be task-agnostic; as such, they can be transferred to virtually any other multimodal task with minimal fine-tuning. Our work contributes to these two lines of research by (1) introducing a novel multimodal task, and (2) evaluating a SoA multimodal encoder on it.
Multimodal tasks VQA was originally proposed to overcome the challenge of quantitatively evaluating IC models. The task (and its evaluation) is straightforward: given an image and a question about its visible objects, systems have to provide the correct answer by aligning information from the two modalities (Antol et al., 2015). Driven by VQA, several datasets have been proposed to minimize the bias observed in natural images (Goyal et al., 2017; Ray et al., 2019); to force models to "reason" over a joint visual and linguistic input (Johnson et al., 2017; Suhr et al., 2019); to deal with objects' attributes and relations (Krishna et al., 2017); and to encompass more diverse (Zhu et al., 2016) and goal-oriented questions and answers (Gurari et al., 2018). At the same time, some work proposed higher-level evaluations of VQA models and showed their limitations (Hodosh and Hockenmaier, 2016; Shekhar et al., 2017); similarly, recent attention has been paid to understanding what makes a question "difficult" for a model (Bhattacharya et al., 2019; Terao et al., 2020). Despite impressive progress, current approaches to VQA do not tackle one crucial limitation of the task: the answer to a question is given by the alignment of language and vision rather than by their complementary integration.
Moving from objects to actions, several tasks have been proposed to mimic more realistic settings where a higher degree of integration between modalities is required. One is visual storytelling (Huang et al., 2016; Gonzalez-Rico and Pineda, 2018; Lukin et al., 2018), where models have to understand the actions depicted in each photo, and their relations, to generate a story. Similar abilities are required in the task of generating non-grounded, human-like questions about an image (Mostafazadeh et al., 2016; Jain et al., 2017), and in that of asking discriminative questions over pairs of similar scenes (Li et al., 2017). Related tasks are also those of predicting motivations of visually-grounded actions (Vondrick et al., 2016) or generating explanations for a given answer (Park et al., 2018; Hendricks et al., 2018).
An even higher level of understanding of vision and language is required in the tasks of filling in the blank with the correct answer (Yu et al., 2015); answering questions from videos and subtitles (Lei et al., 2018); having a dialogue about objects (De Vries et al., 2017; Das et al., 2017) or events (Mostafazadeh et al., 2017); and answering and justifying commonsense questions (Zellers et al., 2019). However, all these tasks require making commonsense inferences over the two modalities rather than integrating their complementary information to answer a grounded question.
More akin to ours are the approach by Iyyer et al. (2017), which aims to predict the subsequent scene and dialogue in a comic strip, and that by Kruk et al. (2019), where the goal is to compute the communicative intent of a social media post. Though they both require a challenging integration of language and vision, these tasks (as well as the type of data they use) are crucially different from BD2BB, where the task is to predict the action that follows from a given intention based on the image.
Transformer-based multimodal models Developing universal multimodal encoders whose pretrained representations are suitable for virtually any multimodal task is a crucial challenge. Inspired by the success of BERT, a pretrained transformer-based language encoder (Devlin et al., 2019), similar architectures have recently been proposed in the domain of language and vision (Lu et al., 2019; Tan and Bansal, 2019; Chen et al., 2019; Su et al., 2020; Nan Duan et al., 2020). While these architectures achieve state-of-the-art performance on many tasks, their novelty and complexity leave several questions open, and further work is needed to better understand, e.g., which layers are more suitable for transferability (Tamkin et al., 2020), or what the relation is between pretraining and downstream tasks (Zamir et al., 2018; Singh et al., 2020). Moreover, to prove they are readily applicable to novel multimodal benchmarks, pretrained universal encoders should ideally be effective with only minimal fine-tuning on the target tasks.
In this light, we believe that more effort should be put into developing datasets that are challenging and yet relatively small, in line with the 'diagnostic' datasets proposed for VQA (Johnson et al., 2017) and the easy vs. hard subsets introduced by Akula et al. (2020) for visual referring expression recognition. Our contribution follows this line of thought.

Data Collection
Our images come from the split of MS-COCO by Vondrick et al. (2016), where each of the 10,191 images depicts at least one person. This choice was aimed at making the participants' task more natural: indeed, the presence of people in the image allows more possibilities for interaction, and therefore guarantees that some actions can be performed in that situation.

We set up an annotation tool on Figure-Eight (see Figure 2) where annotators were shown an image and asked to imagine themselves being in that situation, as ideal observers not represented in the picture. We instructed them to carefully look at the image and think about 1) an intention, i.e., how they might feel/behave if they were in that situation; and 2) an action, i.e., what they would do based on that feeling/behavior. Intentions and actions were typed in free form by participants in two separate text boxes; by instruction, their sentences had to complete the provided opening words "If I. . ." and "I will. . .", respectively. To ensure that intentions conveyed information that was complementary (non-redundant) to that conveyed by the image, participants were instructed not to mention any of the entities (people, objects, etc.) shown in the image. In contrast, to ensure that actions contained information that was grounded in the image, participants were asked to mention at least one visible entity when writing their action (see errors and warnings in Figure 2).

We randomly selected ∼ 3.6K images from the split by Vondrick et al. (2016) and, for each of them, we collected on average 5 ⟨intention, action⟩ tuples by 5 participants. In total, ∼ 18K unique ⟨image, intention, action⟩ datapoints were collected. Participants were recruited from native English-speaking countries only. Overall, 477 annotators (based on their IP addresses) took part in the data collection; on average, each of them provided 38 annotations. Participants were paid $0.04 per tuple. In total, the data collection cost ∼ $900.

Figure 3: Intentions and actions provided by the five annotators for the image in Figure 1.
A few filtering steps were needed to get rid of datapoints with invalid annotations. First, we discarded those datapoints where intentions and/or actions were either not in English (e.g., bot-generated Lorem Ipsum sequences) or nonsense strings (e.g., random sequences of characters). This step was done semi-manually and filtered out ∼ 3K datapoints. Second, we removed datapoints where the action did not contain any noun or pronoun. After this, we were left with 12,457 valid datapoints.
To illustrate the type of data collected, Figure 3 reports the 5 ⟨intention, action⟩ tuples provided by 5 annotators for the image in Figure 1. As can be noted, the same visual context elicits different intentions, which in turn give rise to different possible actions. Crucially, no intention refers to anything that is visible in the image, which makes the intentions suitable for virtually any visual context. The actions, in contrast, all 1) mention at least one entity that is grounded in the given scene, e.g., "player" or "tennis court", which makes them plausible only for sports contexts, particularly 'tennis'; and 2) match their corresponding intention, but not (or to a much lesser extent) the others; i.e., different intentions trigger different actions, and the verb in the action is a proxy for this diversity. Below, we describe the meta-annotation process we performed to categorize each datapoint with respect to: 1) the topic of its action, e.g., 'tennis'; and 2) the argument structure of the verbs in its action.

Meta-Annotation
Topic For each of the 12,457 datapoints, we built a 512-d semantic representation of its action using the off-the-shelf Universal Sentence Encoder (USE; Cer et al., 2018). We then ran a k-means clustering algorithm over these vectors and obtained 60 topic clusters. By manual inspection, 54 clusters were found to consistently group together actions revolving around the same topic, e.g., 'tennis' or 'birthday', in such a way that it was easy to label them using such terms. Since for the remaining 6 clusters this was not straightforward, due to the presence of rather disconnected actions, we filtered these clusters out. We further polished the 54 clusters (a) by manually moving actions to clusters that fit them better, and (b) by removing actions that were not in line with the cluster topic. Moreover, we removed actions that did not comply with the instructions provided to annotators during the data collection. After these steps, we were left with 10,287 ⟨image, intention, action⟩ datapoints.
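For illustration, the topic-clustering step can be sketched with off-the-shelf tools as follows; the exact USE module version, the k-means settings, and the load_actions helper are illustrative assumptions, not our released code:

# Minimal sketch of the topic-clustering step: encode each action with the
# Universal Sentence Encoder (512-d) and group the vectors into 60 clusters.
import tensorflow_hub as hub
from sklearn.cluster import KMeans

actions = load_actions()  # hypothetical helper returning the list of action strings

use = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")
embeddings = use(actions).numpy()            # shape: (n_actions, 512)

kmeans = KMeans(n_clusters=60, random_state=0).fit(embeddings)
topic_of_action = kmeans.labels_             # one cluster id per action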
Argument structure Using the Stanford NLP Parser (Chen and Manning, 2014), we annotated the actions in each of the 10,287 topic-categorized datapoints by means of a 4-code annotation schema. In particular, from each parsed action we extracted its main verb (code1) and its direct or indirect object (code2). Moreover, when present, the verb of the coordinated or subordinated sentence was also extracted (code3), as well as other nouns in any complement position of the main or secondary verb (code4). All the outputs of the parser were manually checked and fixed where needed (see Appendix A.3).
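A minimal sketch of this extraction step is given below; it uses spaCy purely as a stand-in for the Stanford parser we actually employed, so the dependency labels and helper names are only indicative:

# Sketch of the 4-code extraction (spaCy used as a stand-in for the Stanford parser).
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_codes(action: str) -> dict:
    doc = nlp(action)
    codes = {"code1": None, "code2": [], "code3": None, "code4": []}
    root = next((t for t in doc if t.dep_ == "ROOT"), None)
    if root is not None:
        codes["code1"] = root.lemma_                          # main verb
        codes["code2"] = [t.lemma_ for t in root.children
                          if t.dep_ in ("dobj", "obj", "iobj", "dative")]  # its objects
        second = next((t for t in root.children
                       if t.pos_ == "VERB" and t.dep_ in ("conj", "xcomp", "advcl")), None)
        if second is not None:
            codes["code3"] = second.lemma_                    # secondary verb
            codes["code4"] = [t.lemma_ for t in second.subtree
                              if t.pos_ in ("NOUN", "PROPN")]  # its nominal complements
    return codes

# e.g., extract_codes("join this man playing frisbee")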

Task
We introduce the Be Different to Be Better (BD2BB) task, where the different, i.e., complementary, information provided by the two modalities should push models to develop a better, i.e., richer, multimodal representation. To evaluate these abilities, we frame our task as a multiple-choice problem (similar to Antol et al., 2015; Yu et al., 2015; Zhu et al., 2016) where either modality is necessary but not sufficient to make a correct prediction. The task is the following (see Figure 1): given an image and a corresponding intention, the model has to choose the correct action from a set of 5 candidate actions. We refer to the correct action as the target action and to the wrong actions as the decoy actions. Similarly to Chao et al. (2018), decoy actions are carefully selected to be as plausible as possible when evaluated against either the intention only (2 decoys) or the image only (the other 2). Below, we explain how language-based and image-based decoys were selected based on the meta-annotation.
Language-based decoys For each of the 10,287 ⟨image, intention, action⟩ datapoints, we randomly selected a number of datapoints from the entire data that met the following criteria: 1) their action belonged to a different topic cluster than the one including the target action; and 2) their action did not share any noun with the target action, i.e., their code2 and code4 were different. We then computed a similarity score between the target action and each of these selected actions by means of the cosine of their USE representations. We ranked these scores and selected as our language-based decoys the two actions with the highest similarity. This way, we obtained language-based decoys that are semantically very similar to the target action, but are on a different topic and do not share any noun with it.
Figure 4: Four samples from the dataset. For each sample, we report the intention (I), the two language-based decoys (L), the target action (T), and the two vision-based decoys (V).
I: If I want to protect myself, I will. . . L: sit on my skateboard instead of actually riding it; wear jeans when racing on a skateboard. T: wear a helmet while riding my motor bike. V: look at the motorcycle display; challenge the people to a race.
I: If I want to enjoy the sun, I will. . . L: take a huge bite out of my sandwich; take a bite of the burger. T: eat my food on the roof patio. V: use my phone to order from a take out menu; assist the group with cutting food.
I: If I want to get the blood pumping, I will. . . L: take a ride on the aerial tramway; ride a horse in the rodeo. T: ride a motorcycle. V: seat next to a bike and read a book; help the person who has fallen off their bike.
I: If I want to be noticed, I will. . . L: put on a costume and join the parade; join the men on the street. T: wear a sign. V: at least match my colors to look fancy; teach him how to tie a tie.

Vision-based decoys For each datapoint, we randomly selected a number of datapoints from the entire data that met the following criteria: 1) their action belonged to the same topic cluster as the target one; and 2) their action did not share any verb with the target action, i.e., their code1 and code3 were different. We then ranked these actions with respect to their USE similarity with the target one, and selected as our vision-based decoys the two with the lowest score. This way, we obtained vision-based decoys that are about the same topic as the target action; at the same time, they do not share any verbs with it and are semantically different.
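For concreteness, the selection of both decoy types can be sketched as follows; this is a simplified version of the procedure (it ranks the whole filtered pool rather than a random subset), and the field names are illustrative:

# Sketch of the decoy-selection step. `pool` is the full list of datapoints; each
# entry is assumed to carry its topic cluster, its 4-code annotation, and the USE
# vector of its action.
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def select_decoys(target, pool, n=2):
    # Language-based decoys: different topic cluster, no shared nouns
    # (code2/code4), highest USE similarity with the target action.
    t_nouns = set(target["code2"]) | set(target["code4"])
    lang_pool = [d for d in pool
                 if d["topic"] != target["topic"]
                 and not ((set(d["code2"]) | set(d["code4"])) & t_nouns)]
    lang = sorted(lang_pool,
                  key=lambda d: cosine(d["use"], target["use"]),
                  reverse=True)[:n]

    # Vision-based decoys: same topic cluster, no shared verbs
    # (code1/code3), lowest USE similarity with the target action.
    t_verbs = {v for v in (target["code1"], target["code3"]) if v}
    vis_pool = [d for d in pool
                if d["topic"] == target["topic"]
                and not ({v for v in (d["code1"], d["code3"]) if v} & t_verbs)]
    vis = sorted(vis_pool,
                 key=lambda d: cosine(d["use"], target["use"]))[:n]
    return lang, vis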

Dataset
Our final dataset includes 10,265 samples such as the ones depicted in Figure 4 (for 22 datapoints it was not possible to find all the decoys, hence they were discarded during the creation of the dataset): each sample consists of a unique ⟨image, intention, action⟩ datapoint paired with 4 carefully-selected decoy actions. Consistently with our purpose of making BD2BB a challenging benchmark for pretrained multimodal architectures (see Section 1), we split the dataset into "unusual" train/val/test partitions; i.e., we selected 20% of the samples for training and the remaining ones for validation (40%) and test (40%). We propose that having small training data and larger validation and test sets should become a standard, as pretrained models already build on a massive amount of data. Table 1 reports the descriptive statistics of the dataset, including the number of unique images, intentions and actions per split, and the average length of the sentences. All the experiments reported in the paper are performed on these splits.

Experiments
To test the importance of combining information from the two modalities, as well as the independent contribution of either modality, we experiment with 3 settings of the BD2BB task: L, where the target action among the 5 candidates has to be guessed based on the intention only; V, where only the image is provided; and LV, where both the image and the intention are provided. For each setting of the task, we evaluate the performance of (1) a simple baseline trained from scratch on the task; (2) a state-of-the-art transformer-based pretrained model fine-tuned on the task; and (3) the same transformer-based model trained from scratch on the task. Moreover, results by models are compared to (4) human performance.

Models
Baseline For each ⟨image, intention, action⟩ datapoint in the sample, baseline LV builds a multimodal representation by concatenating the 2048-d visual features of the image with the representations of the intention and of each candidate action; the concatenated vector is fed to a hidden layer followed by a softmax over the 5 candidates (see Appendix B.1). Finally, to account for any bias due to unavoidable association and repetition patterns among the actions, we test a version of the baseline which only encodes the actions. We refer to it as actions-only.
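A minimal sketch of the baseline scorer is given below; the hidden size and dropout follow Appendix B.1, while the use of 512-d sentence encodings for the textual inputs is an illustrative assumption:

# Sketch of baseline_LV: concatenate image features with the encodings of the
# intention and of one candidate action, and score the candidate with an MLP.
import torch
import torch.nn as nn

class BaselineLV(nn.Module):
    def __init__(self, img_dim=2048, txt_dim=512, hidden=8192, p_drop=0.0):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim + 2 * txt_dim, hidden),
            nn.ReLU(),
            nn.Dropout(p_drop),
            nn.Linear(hidden, 1),            # one score per candidate action
        )

    def forward(self, img, intention, actions):
        # img: (B, 2048); intention: (B, 512); actions: (B, 5, 512)
        context = torch.cat([img, intention], dim=-1)            # (B, 2560)
        context = context.unsqueeze(1).expand(-1, actions.size(1), -1)
        scores = self.mlp(torch.cat([context, actions], dim=-1)).squeeze(-1)
        return scores.softmax(dim=-1)        # probability over the 5 candidates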
RoBERTa In setting L, we employ the robustly optimized version of BERT, RoBERTa (Liu et al., 2019); this model is a universal language encoder pretrained on the task of masked language modeling, which achieves strong performance on challenging multiple-choice tasks. For each candidate action, RoBERTa encodes the intention and the action as a single sequence, and we use the encoding of the special CLS token to obtain a probability distribution over the 5 candidate actions of a sample (see Appendix B.1).

Table 1: Descriptive statistics of the dataset per split (the training split contains 2,102 samples): number of samples (%), unique images (#img), intentions (#int), actions (#act), target actions (#t-act), decoy actions (#d-act), and average intention and action length.

LXMERT In settings LV and V, we employ LXMERT (Learning Cross-Modality Encoder Representations from Transformers; Tan and Bansal, 2019), a universal multimodal encoder pretrained on five language and vision tasks, which is state-of-the-art on VQA2.0 (Goyal et al., 2017). This model represents an image by the set of position-aware object embeddings for the 36 most salient regions detected by Faster R-CNN (Ren et al., 2015), and processes the textual input by means of position-aware, randomly-initialized word embeddings. Like RoBERTa, LXMERT uses the special tokens CLS and SEP but, differently from RoBERTa, here SEP is used both to separate sequences and to denote the end of the textual input. Hence, we take this into account when adapting LXMERT to our task. As with RoBERTa, we use the encoding corresponding to CLS (768-d) to obtain a probability distribution over the 5 candidate actions of a sample. For each task setting, we evaluate each model in two versions, i.e., a pretrained model fine-tuned on our task (LXMERT LV and LXMERT V ) and a model trained from scratch (LXMERT s LV and LXMERT s V ).
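For the L setting, the multiple-choice adaptation can be sketched as follows; we use the HuggingFace transformers multiple-choice head purely as a stand-in for our own adaptation (LXMERT follows the same scheme, with visual features added), and the fifth candidate below is made up for illustration:

# Sketch of 5-way multiple-choice scoring with a pretrained encoder (setting L).
import torch
from transformers import RobertaTokenizer, RobertaForMultipleChoice

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForMultipleChoice.from_pretrained("roberta-base")

intention = "If I have tons of energy"
candidates = [
    "I will play a game of tennis with the man",           # target (Figure 1)
    "I will play baseball with the men",                    # language-based decoy
    "I will applaud my favourite tennis player of all time",# vision-based decoy
    "I will leave the tennis court",                         # vision-based decoy
    "I will take a nap on the couch",                        # made-up filler, for illustration only
]

enc = tokenizer([intention] * len(candidates), candidates,
                padding=True, truncation=True, return_tensors="pt")
inputs = {k: v.unsqueeze(0) for k, v in enc.items()}   # (1, 5, seq_len)
with torch.no_grad():
    logits = model(**inputs).logits                    # (1, 5)
prediction = logits.argmax(dim=-1)                     # index of the chosen action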
Experimental setup For the baseline models, we perform hyperparameter search on learning rate, dropout, and hidden size; for the transformer-based models, we use the best configurations reported in the source papers (reproducibility details in Appendix B). All models are trained with 3 random seeds for 50 epochs with Adam (Kingma and Ba, 2015), minimizing a cross-entropy loss between the probability distribution over the 5 candidate actions of a sample and the ground-truth action. For each of the 3 runs, we consider the model with the highest validation accuracy. Average accuracy and standard deviation over the 3 runs are reported.

Human Evaluation
We randomly extracted 300 unique samples from the dataset and split them into 3 partitions of 100 samples each. For each partition, we collected judgments by 3 participants in each setting of the task: L, V, and LV. Crucially, participants did the task only once per partition; i.e., they judged each sample only in one of the 3 task settings. Using Quiz Maker, we collected 2,700 unique responses from 11 subjects who participated on a voluntary basis. For each setting of the task, we counted as 'correctly predicted' the samples where at least 2 out of 3 annotators converged on the target action. Moreover, for each task setting we computed the 'best' accuracy, i.e., the average accuracy of the participants who achieved the highest accuracy in each of the 3 partitions.
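For clarity, the majority-vote aggregation can be sketched as follows (the data structure is illustrative):

# A sample counts as correct if at least 2 of its 3 annotators chose the target action.
def majority_accuracy(samples):
    correct = sum(1 for s in samples
                  if sum(choice == s["target"] for choice in s["choices"]) >= 2)
    return correct / len(samples)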

Results
Results by both models and humans are reported in Table 2. Several key observations can be made.
Multimodal integration is key. The overall best-performing model on BD2BB is LXMERT LV (62.2%), which outperforms the other pretrained models, i.e., RoBERTa L (56.2%) and LXMERT V (59.2%). On the one hand, this shows that having access to both modalities is beneficial for the task. This is in line with the results by human participants, who achieve the highest accuracy in the multimodal setting (79% vs. 50% in L and 72.3% in V). On the other hand, the finding that LXMERT V surpasses RoBERTa L (+3%) confirms that the image provides more information than the intention. This, again, is consistent with the human results, where the gap between V and LV (−7%) is much smaller than that between L and LV (−29%). For humans, this visual advantage is likely due to the (MS-COCO) images depicting complex events that elicit a broad range of aspects related to people's experience of the world. As for the models, it confirms that LXMERT, thanks to its massive pretraining, is effective in extracting fine-grained information from images.
Models are far from humans. Humans achieve around 80% accuracy ('best': 82%) on the multimodal version of the task. This is a high result, in line with previous work with a similar setup (consider, e.g., SWAG, where 'expert' human accuracy is around 85% with 4 choices, i.e., a chance level of 25%; Zellers et al., 2018). At the same time, the non-perfect human accuracy reveals that the benchmark is challenging due to the careful selection of plausible decoys. Compared to humans, the best-performing LXMERT LV achieves much lower results (−17%), which indicates that BD2BB is challenging and far from being solved. Since the gap between the best-performing models and human participants in the unimodal settings is smaller (−13% in V and −6% in L), the biggest computational challenge lies in the integration of complementary information from different modalities.
Pretrained is better. Pretrained models neatly outperform the baseline in all the versions of the task and, more interestingly, also all their counterparts trained from scratch. As can be seen in Table 2, transformer-based models trained from scratch achieve results that are only slightly better than those of the baseline in both LV and L; as for V, LXMERT s V turns out to perform worse than baseline s V (and even worse than the actions-only baseline). This clearly shows that these architectures are very effective when building on their pretraining, but suffer when challenged to learn a task from scratch with relatively few samples.

Analysis
Best models' errors We perform an analysis of the errors made by the 3 pretrained models to check whether they fall more often into the language-based or the vision-based decoys. To do so, we focus on each model's best run, and compute the proportion of wrong predictions on the test set that belong to one or the other decoy type. For comparison, a model that makes modality-balanced wrong predictions should fall into language-/vision-based decoys 50% of the time. Quite surprisingly, RoBERTa L has only a moderate bias toward language-based decoys: in fact, only 60.2% of its errors are of this type. As for LXMERT V , no bias at all is observed toward the vision-based decoys (48.6%). Finally, the best-performing LXMERT LV is halfway between these models, with only a slight preference for language-based (55.1%) over vision-based decoys (44.9%).
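The breakdown of errors by decoy type is computed as sketched below (field names are illustrative):

# For each wrong prediction, record whether the chosen action was a language-based
# or a vision-based decoy, and report the proportions over all errors.
def decoy_error_breakdown(predictions, samples):
    errors = {"language": 0, "vision": 0}
    for pred, s in zip(predictions, samples):
        if pred != s["target"]:
            # s["decoy_type"] is assumed to map each decoy index to its type
            errors[s["decoy_type"][pred]] += 1
    total = sum(errors.values())
    return {k: v / total for k, v in errors.items()}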
In Figure 5, we report two cherry-picked examples where LXMERT LV either correctly predicts the target action (left) or chooses a wrong one, in this case a vision-based decoy (right). It is worth mentioning that these two cases are challenging: for both of them, human annotators were able to pick the correct action only in the multimodal version of the task, but neither in L nor in V. As can be seen, in the leftmost example the model does a good job in combining complementary information from language and vision. In the rightmost one, instead, it picks an action that is highly plausible based on the image, but not in the presence of the given intention, which contains a negation (don't). Taken together, these analyses indicate that no simple strategies can be exploited by models to detect and rule out decoy types. Language- and vision-based decoys are equally challenging, and combining complementary information is needed to solve the task.

Figure 5: Two test samples with intention (I), language-based decoys (L), target action (T), and vision-based decoys (V). Left: I: If I am in the mood to act silly, I will. . .; L: attend a dinner like this man holding a gift; buy him a cake and invite his friends to party; T: act silly with this man and eat cake; V: help my child cut their cake; have cake with soldiers. Right: I: If I don't like this, I will. . .; L: sit next to the woman on the bench; get my face painted; T: avert my eyes from the man who looks silly; V: teach him how to tie a tie; wear a costume and march in a parade.
Hard test To explore the robustness of the pretrained models, we check how well they perform on a subset of the test set where several features of the samples were unseen in training. In particular, neither the image nor the intention was seen in training; moreover, the target action could have been seen as a decoy but never as the target. In Table 3 we report the results of the 3 pretrained models on this subset (1,505 samples); we refer to it as the hard test. As can be seen, all models experience a small decrease in accuracy compared to the whole test set, while humans do not. This indicates that the hard test is indeed more challenging. However, pretrained models are overall robust to unseen features.
In line with the standard test set, LXMERT LV still outperforms the unimodal models, though its drop in performance (−4%) is more pronounced compared to theirs (−1/2%). This suggests that part of the advantage of the multimodal system over the unimodal ones is due to its fine-tuning. Indeed, pretraining on its own is not enough to properly combine complementary information from the intention and the image. Finally, since humans do not perform worse on these samples, the performance gap with LXMERT LV increases to ∼ 20%.

Conclusion
Inspired by real-life communicative contexts where language and vision are non-redundant, we proposed a novel benchmark that challenges models to combine complementary multimodal information. This is a crucial ability that, we believe, our benchmark will help push further. In particular, recently proposed universal multimodal encoders can greatly benefit from relatively small but challenging resources such as BD2BB, which can be used to shed light on model abilities and help develop architectures that exhibit more human-like skills.
Here, we evaluated LXMERT and showed that it struggles to achieve results that are comparable to those of humans. In the future, we plan to evaluate other multimodal encoders on BD2BB and to contribute to the development of better multimodal systems.

A.1 Data Collection
Crowdworkers are presented with detailed instructions and examples before starting the annotation task. First, we introduce the task and provide them with some details to familiarize with the annotation tool. Then, we give them instructions regarding the constraints to be observed, i.e., for intentions: (1) use the present tense and (2) do not mention any of the entities depicted in the image; for actions: (1) use the present tense and (2) do mention entities that are visible in the image. To make instructions and constraints clearer, we show them several examples of good/wrong annotations (see Figure 2). Moreover, to make sure participants are performing the task properly (and, crucially, to avoid collecting fake data from automatic bots), a verification question is asked at the beginning of each image's annotation phase. The verification question has multiple correct answers, and only by picking one of these answers can participants proceed with the annotation phase (see Figure 6). In addition, we add two sanity checks to the collected intentions. We check that (1) they have a length of at least 5 tokens; if this is not the case, participants are shown a warning and asked to fix their sentence; and (2) they do not contain any noun referring to an entity that is grounded in the image; this is checked by means of a simple heuristic which extracts all the nouns from the given image's MS-COCO captions. Nouns with frequency > 1 are not allowed, and when typing them participants are warned to modify their sentence.

Figure 6: Data collection. One annotation sample presented to participants. Given an image, participants are asked to provide an intention and an action. To ensure they are doing the task properly, a verification question is asked preliminarily. Answering the question correctly (multiple correct answers) leads to the proper annotation phase.
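The second sanity check can be sketched as follows; spaCy is used here purely for illustration, and the threshold follows the description above:

# Heuristic check that an intention does not mention entities visible in the image:
# collect the nouns appearing in the image's MS-COCO captions and reject intentions
# containing any noun whose caption frequency is greater than 1.
from collections import Counter
import spacy

nlp = spacy.load("en_core_web_sm")

def grounded_nouns(captions, min_freq=2):
    counts = Counter(t.lemma_.lower() for c in captions for t in nlp(c)
                     if t.pos_ in ("NOUN", "PROPN"))
    return {noun for noun, n in counts.items() if n >= min_freq}

def intention_is_valid(intention, captions):
    banned = grounded_nouns(captions)
    nouns = {t.lemma_.lower() for t in nlp(intention) if t.pos_ in ("NOUN", "PROPN")}
    return len(intention.split()) >= 5 and not (nouns & banned)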

A.2 BD2BB Dataset Statistics
As described in Section 4, the final BD2BB dataset includes 10,265 samples, where each sample includes an ⟨image, intention, target action⟩ triple associated with 4 selected decoy actions. These triples were provided by 430 unique annotators. In particular, 253 were from the USA, 111 from the United Kingdom, 53 from Canada, 6 from Ireland, 5 from New Zealand, and 2 from Australia. Each of them provided, on average, 23.87 of the ⟨image, intention, target action⟩ tuples contained in the dataset (min 1, max 192).
Each sample contains 5 actions. On average, these actions were provided by 4.90 unique annotators (min 3, max 5); moreover, they were collected for 4.96 (min 3, max 5) unique images, i.e., the decoy actions in each sample refer to different images than the target one in most of the cases.

A.3 Meta-Annotation
Topics We manually inspected the 60 clusters obtained through k-means clustering and removed 6 clusters for which we could not identify a coherent topic. Examples of the actions in each of the remaining 54 clusters, and the corresponding labels we assigned to them, are provided in Table 4. The 60 clusters were reviewed by two of the authors; we kept only clusters on which full agreement was reached.
Numeric 4-Code Annotation We organize our data through a two-step system of wordcodes, using codes to mark the syntactic class and the word-type. With the Stanford NLP parser (Chen and Manning, 2014), we extract syntactic information from each action and mark: 1) the main verb: "code1"; 2) the direct or indirect object of the main verb, as well as other complements related to the main verb: "code2"; 3) the second verb, if present (i.e., the verb of the coordinated or subordinated sentence): "code3"; 4) the object of the second verb, if present: "code4". In this case, we considered not only the direct object of the second verb, but also all the words referring to an object grounded in the corresponding image that specify the action expressed by the sentence. This way, for each action in which this was possible, we have a word that underlines the link between the linguistic and the visual aspects of the annotation. All the outputs of the parser were manually checked and fixed where needed. This was done by two of the authors: first, a subset of the data was annotated by the two authors together; then, each of the authors annotated a different subset. Only doubtful cases were discussed. In Table 4, for each action given as an example of a cluster, we highlight the words corresponding to each of the four codes. Statistics about this meta-annotation are reported in Table 5.

Table 4: We report the label assigned to each of the 54 clusters (which summarizes its main topic), and one example of the actions included in it. Each action was annotated with codes to mark the verb (code1) and the complement object (code2) of the main sentence, and the verb (code3) and complements (code4) of the secondary sentence. Clusters are listed by their size, in descending order from biggest to smallest.

Table 5: Statistics on the meta-annotation of the data. For each cluster, we report the number of actions, the number of verbs in the main (code1) and in the secondary sentence (code3), and the number of nouns occurring as complements in the main (code2) and in the secondary sentence (code4).
cluster | #act | #verbs (code1) | #verbs (code3) | #nouns (code2) | #nouns (code4)
frisbee 2 | 172 | 25 | 25 | 29 | 22
birthday | 170 | 62 | 71 | 46 | 59
water sports | 165 | 87 | 60 | 38 | 41
photo | 163 | 39 | 21 | 30 | 44
zoo animals | 161 | 57 | 25 | 32 | 39
public transports | 159 | 46 | 28 | 23 | 22
skateboard 2 | 158 | 45 | 36 | 35 | 25
frisbee 1 | 154 | 39 | 11 | 31 | 27
wii | 149 | 36 | 22 | 35 | 22
bedtime | 144 | 53 | 38 | 51 | 29
manual work / hobbies | 139 | 69 | 75 | 44 | 60
animals farm | 139 | 69 | 41 | 32 | 26
good intentions | 132 | 66 | 64 | 44 | 32
kite | 125 | 28 | 18 | 31 | 17
horse riding | 118 | 49 | 22 | 22 | 29
toilet things | 105 | 43 | 38 | 29 | 24
skateboard 3 | 98 | 22 | 16 | 18 | 14
street scenes | 96 | 56 | 37 | 26 | 35
ski and snow | 95 | 48 | 26 | 31 | 23
snowboard 1 | 94 | 27 | 26 | 21 | 17
airport | 93 | 48 | 30 | 35 | 12
fruit | 89 | 33 | 18 | 24 | 20
haircut | 54 | 31 | 21 | 19 | 15
women and food | 43 | 24 | 18 | 22 | 14
reading | 32 | 11 | 11 | 11 | 7

Table 6: Examples of actions and corresponding word-type codes. Note that: (1) a given verb, e.g., join, is assigned different codes in different clusters (lines 1 and 3); (2) a given object within the same cluster, e.g., frisbee at line 4, is assigned different codes in different syntactic positions; (3) a given object, e.g., frisbee at lines 3 and 4, is assigned the same code if belonging to the same cluster and in the same syntactic position.
cluster | action | code1 | code2 | code3 | code4
food | join the people in the restaurant to enjoy a meal | join 1 | people 77 | enjoy 15 | meal 28
food | get some food with the people | get 107 | food 6 | 0 | people 666
frisbee | join this man playing frisbee | join 9 | man 11 | play 13 | frisbee 14
frisbee | catch the frisbee and throw it again | catch 777 | frisbee 777 | throw 8 | frisbee 14

Furthermore, for each topic cluster, we assign a numeric wordcode to each unique word-type in the 4 syntactic classes described above. In other words, each sentence is translated into a code composed of 4 numbers, each one representing a unique word in the corresponding syntactic class (when we consider more than one object, we create a compositional code, using the '+' mark). Illustrative examples are given in Table 6 (there, numbers are assigned randomly, just to provide a concrete example of our meta-annotation).
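A minimal sketch of this per-cluster code assignment is given below; the data structure and helper names are illustrative, and the numeric values are arbitrary, as in Table 6:

# Per-cluster assignment of numeric codes to word types in each syntactic slot.
# Multiple objects in the same slot are joined into a compositional code with '+'.
from collections import defaultdict

def assign_wordcodes(cluster_actions):
    # cluster_actions: list of dicts with keys "code1".."code4" holding the
    # extracted word(s) for each slot (None or [] when absent).
    books = defaultdict(dict)                    # one codebook per slot

    def code(slot, word):
        if not word:
            return "0"                           # absent element
        book = books[slot]
        return book.setdefault(word, str(len(book) + 1))

    coded = []
    for a in cluster_actions:
        coded.append({
            slot: "+".join(code(slot, w) for w in val) if isinstance(val, list) and val
                  else code(slot, val)
            for slot, val in a.items()
        })
    return coded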

B.1 Models
The number of parameters of each model is reported in Table 7. The number of parameters is the same both in models trained from scratch and in pre-trained ones. The validation accuracy and epoch of the best models for each one of the three runs are reported in Table 8. For each of the three runs, we consider the model obtaining the best validation accuracy. For each model, we report mean and standard deviation of the test accuracies obtained across the three runs.
Baseline Our baseline is inspired by Jabri et al. (2016), but we use Softmax instead of Sigmoid as the final activation function, to compute a probability distribution over all the candidates and choose the best one. We consider a version receiving image, intention, and actions (baseline LV ), a version receiving image and actions (baseline V ), and a version receiving intention and actions (baseline L ). We used PyTorch 1.4.0. Baseline models were run on a CPU and their training took 33 seconds per epoch on average. We used a batch size equal to 32. We performed a grid search over two hyperparameters: the size of the hidden layer receiving the concatenated features (we tried values 8192 and 2048) and the dropout probability of zeroing elements of the input tensor right after the ReLU activation function (we tried values 0.0 and 0.5). The combination of parameters which led to the best validation accuracy was a hidden layer of size 8192 and a dropout probability of 0.0, corresponding to not having any dropout.
RoBERTa The RoBERTa BASE model we used has 12 self-attention layers with 12 heads each. It uses three special tokens, namely CLS, which is taken to be the representation of the given sequence, SEP, which separates sequences, and EOS, which denotes the end of the input. For each of the 5 ⟨image, intention, action⟩ datapoints in a sample, RoBERTa encodes the input as a sequence composed of CLS, the intention, SEP, the action, and EOS. As in the original work, we use the representation corresponding to the CLS token to employ the encoder in the downstream task. For RoBERTa we used PyTorch 1.0.1 and we started from the source code available at https://github.com/huggingface/transformers. Both when fine-tuning the pretrained model and when training the model from scratch, we used a batch size of 32 with 8 gradient accumulation steps (thereby having an effective batch size of 256), a weight decay of 0.01, gradient clipping at 5, and a learning rate which is warmed up over the first 10% of steps to a peak value of 0.00005 and then linearly decayed.
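The optimization setup shared by both transformer-based models can be sketched as follows (assuming the transformers scheduler utilities; variable names are illustrative):

# Optimizer and schedule used for fine-tuning: AdamW with weight decay 0.01,
# gradient clipping at 5, and a learning rate warmed up over the first 10% of
# steps to 5e-5 and then linearly decayed; 8 gradient-accumulation steps over
# mini-batches of size 32 give an effective batch size of 256.
import torch
from transformers import get_linear_schedule_with_warmup

def make_optimizer(model, num_training_steps, lr=5e-5, weight_decay=0.01):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=int(0.1 * num_training_steps),
        num_training_steps=num_training_steps,
    )
    return optimizer, scheduler

# Inside the training loop (accumulation over 8 mini-batches):
#   loss = loss / 8
#   loss.backward()
#   torch.nn.utils.clip_grad_norm_(model.parameters(), 5.0)
#   optimizer.step(); scheduler.step(); optimizer.zero_grad()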
LXMERT The LXMERT model we used has an Object-Relationship Encoder and a Language Encoder, which encode relationships between regions and between words, respectively, through a self-attention mechanism, and a Cross-Modality Encoder, which encodes relationships between regions and words (and vice versa) through a cross-modal attention mechanism followed by a self-attention mechanism. The numbers of layers in the Language Encoder, Object-Relationship Encoder, and Cross-Modality Encoder are 9, 5, and 5, respectively. As in RoBERTa, LXMERT uses the special tokens CLS and SEP. Differently from RoBERTa, LXMERT uses the special token SEP both to separate sequences and to denote the end of the textual input. As in the original work, we use the representation corresponding to the CLS token to employ the encoder in the downstream task. For LXMERT we used PyTorch 1.0.1 and we started from the source code available at https://github.com/airsplay/lxmert. As with RoBERTa, both when fine-tuning the pretrained model and when training the model from scratch, we used a batch size of 32 with 8 gradient accumulation steps (thereby having an effective batch size of 256), a weight decay of 0.01, gradient clipping at 5, and a learning rate which is warmed up over the first 10% of steps to a peak value of 0.00005 and then linearly decayed.