Fatality Killed the Cat or: BabelPic, a Multimodal Dataset for Non-Concrete Concepts

Thanks to the wealth of high-quality annotated images available in popular repositories such as ImageNet, multimodal language-vision research is in full bloom. However, events, feelings and many other kinds of concepts which can be visually grounded are not well represented in current datasets. Nevertheless, we would expect a wide-coverage language understanding system to be able to classify images depicting recess and remorse, not just cats, dogs and bridges. We fill this gap by presenting BabelPic, a hand-labeled dataset built by cleaning the image-synset associations found within the BabelNet Lexical Knowledge Base (LKB). BabelPic explicitly targets non-concrete concepts, thus providing refreshing new data for the community. We also show that pre-trained language-vision systems can be used to further expand the resource by exploiting natural language knowledge available in the LKB. BabelPic is available for download at http://babelpic.org.


Introduction
There is growing research interest in developing effective systems capable of achieving some understanding of the content of an image. As in most fields of applied AI, this requires annotated data to train a supervised system on. While ImageNet (Deng et al., 2009), one of the most influential projects in computer vision, was undeniably an important milestone towards image understanding, there is still a lot of ground to be covered. ImageNet's initial aim was to collect pictures for most WordNet synsets (Miller, 1995). Yet, at the time of writing, only 21,841 nominal synsets are covered according to ImageNet's official website (http://www.image-net.org).
One issue with ImageNet and most other image repositories like COCO (Lin et al., 2014) and Flickr30kEntities (Plummer et al., 2015) is their focus on concepts denoting concrete, tangible things, such as CAT, TRAFFIC LIGHT and so on. Concepts whose denotation is not clearly identifiable with a set of objects having distinct boundaries, such as events (e.g., FATALITY, COMPETITION), emotions (e.g., SADNESS) and psychological features (e.g., SHARPNESS), have enjoyed less attention. For lack of a better term, we will henceforth refer to them as non-concrete (NC) concepts.
On the one hand, the inclusion of NC concepts would be an important step towards wide-coverage image semantic understanding. On the other hand, it also goes in the same direction as recent multimodal language-vision approaches, e.g., mono- and cross-lingual Visual Sense Disambiguation (Barnard and Johnson, 2005; Loeff et al., 2006; Saenko and Darrell, 2008; Gella et al., 2016, 2019). Taking into account NC concepts could also be of crucial importance for fascinating language-focused applications, such as Multimodal Machine Translation. Last but not least, NC concepts would represent a significant benchmark for real-world multimodal applications. In fact, traditional computer vision approaches rely on the detection of objects within the image, but many NC concepts are not well described by a bag of objects. Consider, for instance, Figure 1. The two images illustrate different NC concepts (i.e., HIGH JUMP and POLE VAULT) which are different configurations of the same elementary objects (i.e., PERSON, ROD, BLEACHERS). Thus, NC concepts require complex image understanding, integrating a fair amount of common-sense knowledge.
As a contribution towards this goal of expanding the scope of research, we introduce BabelPic, the first dataset for multimodal language-vision tasks that focuses on NC concepts and is also linked to WordNet. BabelPic has been built by manually validating synset-image associations available in BabelNet (Navigli and Ponzetto, 2012), a large multilingual resource linking WordNet to Wikipedia and other resources.
Furthermore, we provide a methodology to extend BabelPic's coverage to all the BabelNet synsets. To this end, we adapt the recently introduced Vision-Language Pre-training (VLP) model (Zhou et al., 2020). We define the verification of synset-image associations as a Visual Question Answering (VQA) task with two possible answers. The evaluation demonstrates that our methodology achieves high performance on zero-shot classification as well, thus enabling verification across the whole inventory. Thanks to the automatic production of a silver dataset, BabelPic constitutes a significant extension of ImageNet. A few examples from BabelPic (both gold and silver) are shown in Figure 2.

Related Work
To the best of our knowledge, no dataset of annotated images exists which has a focus on NC nominal and verbal concepts and is also linked to Lexical Knowledge Bases (LKBs) such as WordNet and BabelNet. For example, the very popular ImageNet dataset, which includes images belonging to around 21,800 categories organized according to the WordNet nominal hierarchy, offers only sparse coverage of NC concepts. JFT (Hinton et al., 2015; Chollet, 2017; Sun et al., 2017) is an internal dataset at Google containing 300M images annotated with over 19,000 classes including objects, scenes (e.g., SUNSET), events (e.g., BIRTHDAY) and attributes (e.g., RED). JFT differs from our work in not being linked to an LKB and in not being publicly released. The Open Images dataset (Kuznetsova et al., 2018) contains 9M images annotated with 19,794 classes taken from JFT. While Open Images does contain NC labels, the classes are not linked to an LKB, thus limiting their usefulness. The Tencent ML-Images dataset (Wu et al., 2019) was created starting from a subset of ImageNet and Open Images and includes images annotated with 11,166 categories, which are then linked to WordNet synsets. The dataset differs from our work since all NC labels have been explicitly discarded. Our work is in some sense similar to MultiSense (Gella et al., 2019) and VerSe (Gella et al., 2016), two datasets including images annotated with verbal senses. However, MultiSense is not directly linked to an LKB, and neither of these two datasets deals with nominal synsets. Finally, we note that datasets including images annotated with object-level categories (Lin et al., 2014; Plummer et al., 2015) or videos (Loui et al., 2007; Dollár et al., 2009; Moneglia et al., 2014; Heilbron et al., 2015; Abu-El-Haija et al., 2016) are outside the scope of this work, since we are only interested in the main NC concepts depicted within images.

Gold Dataset
BabelPic is built by exploiting the link between WordNet (Miller, 1995) and Wikipedia within BabelNet (Navigli and Ponzetto, 2012). Our approach is organised as a three-step process. First, we select a set of NC synsets from WordNet, on the basis of both their paradigmatic nature and their relations in the knowledge base. Second, we gather all the corresponding images in BabelNet, which are themselves mostly taken from Wikipedia pages. Third, we manually validate the synset-image mapping. Note that, having defined the task as a validation of concept-image associations, we do allow images to be mapped to more than one concept and vice versa. For instance, both images in Figure 1 could be mapped to the concept COMPETITION as well. The result is a gold dataset containing 2,733 synsets and 14,931 images.

Synset selection
We decided to build our gold dataset starting from concepts related to events and emotions, because these have been shown to be the most appealing NC concepts for the multimodal and vision communities (see Section 2). As a first step towards this goal, we select the nominal synsets belonging to the transitive closure of the hyponymy relation, rooted in the following set of WordNet synsets: {feeling.n.01, event.n.01}. To ensure that only NC concepts are selected, we filter out any synset connected by the hypernymy relation to at least one of the following synsets: physical_entity.n.01, shape.n.02, color.n.01. This is done in order to discard concepts denoting tangible things that inherit from abstraction.n.06 in WordNet (e.g., THUNDERBOLT). Furthermore, we select all the synsets belonging to the following WordNet lexicographer files: verb.competition, verb.motion and verb.social. This is done to create a dataset with an explicit focus on events, properties and verbs. A sketch of this selection procedure is given below.
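As an illustration, the selection just described can be approximated with NLTK's WordNet interface. This is a minimal sketch under our reading of the procedure, not the exact code used to build the dataset; the synset identifiers follow those given above.

```python
from nltk.corpus import wordnet as wn  # requires: nltk.download("wordnet")

# Roots whose hyponym closures we keep, and hypernyms used to filter
# out concrete concepts (identifiers as given in the text).
ROOTS = [wn.synset("feeling.n.01"), wn.synset("event.n.01")]
BLOCKLIST = {wn.synset("physical_entity.n.01"),
             wn.synset("shape.n.02"),
             wn.synset("color.n.01")}
LEX_FILES = {"verb.competition", "verb.motion", "verb.social"}

def hyponym_closure(root):
    """All synsets reachable from `root` via the hyponymy relation."""
    return set(root.closure(lambda s: s.hyponyms())) | {root}

def is_non_concrete(synset):
    """Discard synsets inheriting from any blocklisted hypernym."""
    ancestors = set(synset.closure(lambda s: s.hypernyms()))
    return not (ancestors & BLOCKLIST)

candidates = set()
for root in ROOTS:
    candidates |= {s for s in hyponym_closure(root) if is_non_concrete(s)}
# Add the verbal synsets from the selected lexicographer files.
candidates |= {s for s in wn.all_synsets("v") if s.lexname() in LEX_FILES}
print(len(candidates), "candidate NC synsets")
```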
As a second step, we discard all the concepts belonging to either the mathematics or the physics domain, since their images are often not relevant (e.g., ROUNDING). Finally, we associate each selected synset with the first 15 corresponding images in BabelNet 4.0. Note that, in order to improve the quality of the dataset, we filter out images on the basis of simple heuristics. For example, we filter out all images where transparency is used and at least half of the pixels are white-coloured, as these are not likely to be relevant; a sketch of this check follows. Most of the noisy images from Wikipedia are removed as a result of this step.
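The transparency-plus-white heuristic is straightforward to implement with Pillow. The sketch below reflects our reading of the rule; the near-white cutoff (250 per channel) is an illustrative assumption, as the text only says "white-coloured".

```python
from PIL import Image

def looks_like_noise(path, white_threshold=0.5):
    """Heuristic from the text: reject images that use transparency
    and whose pixels are at least half white (e.g., formulas, icons)."""
    img = Image.open(path)
    has_alpha = img.mode in ("RGBA", "LA") or "transparency" in img.info
    if not has_alpha:
        return False
    rgb = img.convert("RGB")
    pixels = list(rgb.getdata())
    white = sum(1 for (r, g, b) in pixels if r > 250 and g > 250 and b > 250)
    return white / len(pixels) >= white_threshold
```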

Manual validation
The synset-image associations found are manually validated during phase 3. We have decided to use the services of two expert annotators who are familiar with the BabelNet resource, and the whole annotation process is performed through an ad hoc graphical interface. Annotators are shown tuples of the form ⟨s, l, g, i⟩, where s is the target synset, i is a candidate image for s, and l and g are, respectively, the main lemma and gloss (i.e., definition) of s. Annotators are asked to answer the question "is i pertinent to g?". Possible answers are yes (i.e., i is an illustration of g), no (i.e., i is either not pertinent or in contradiction with g) and discard (i.e., i is a bad image). To maximize coverage, each annotator is assigned roughly half of the concept-image association candidates. However, in order to establish and agree on useful guidelines for the evaluation, the annotators collaboratively validate a first sample of 500 instances. We also provide them with a few extra directions. For instance, we ask them to discard images for which the association cannot be verified without reading text depicted in the image. In addition to this collaboratively annotated sample, we select an intersection of 100 annotation instances which we then use to compute inter-annotator agreement. The level of agreement achieved is 80.39%, with a κ value of 0.6078 (moderate agreement). For these shared examples, we include in our gold dataset only those instances approved by both annotators. Our gold dataset is hence composed of all the validated synset-image associations.
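For reference, the reported figures correspond to raw agreement and a standard Cohen's κ computed over the 100 shared instances; a minimal sketch of the computation:

```python
from collections import Counter

def cohens_kappa(a, b):
    """Cohen's kappa for two annotators' parallel label sequences."""
    assert len(a) == len(b)
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    # Chance agreement: product of each annotator's marginal label rates.
    expected = sum(ca[lab] * cb[lab] for lab in set(a) | set(b)) / (n * n)
    return (observed - expected) / (1 - expected)

# On the 100 shared instances: observed agreement 80.39%, kappa ~0.61.
```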

Model
Since manual validation is time-consuming, we are interested in developing a methodology for the automatic verification of synset-image associations. In the recent past there has been a great research effort to develop models for vision-language pre-training. Many such models, e.g., VLP (Zhou et al., 2020), are BERT-based and achieve state-of-the-art scores on many language-vision tasks; hence they represent a promising resource for our task.
The system that we use to perform classification is the fine-tuned VLP model. Despite the fact that LXMERT (Tan and Bansal, 2019) achieves a slightly higher score on yes/no questions on the VQA 2.0 dataset (Goyal et al., 2017), our preference goes to the VLP system, since it is pre-trained on a wider and more general dataset. More specifically, the VLP model is pre-trained on Conceptual Captions (CC) (Sharma et al., 2018), a dataset including more than 3M image-caption pairs, using two unsupervised vision-language tasks: bidirectional and sequence-to-sequence masked language prediction. The input images are preprocessed using Faster R-CNN (Ren et al., 2015) pre-trained on Visual Genome (Krishna et al., 2017; Anderson et al., 2018), hence obtaining 100 object regions per image. The model input consists of both class-aware region embeddings and word embeddings, the former obtained by combining the corresponding region features with the probability of each object label and with region geometric information. Furthermore, a Multi-Layer Perceptron (MLP) is trained during the fine-tuning phase to select the answer starting from the hidden state of the encoder.
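To make the notion of class-aware region embeddings concrete, here is a hedged PyTorch sketch of the combination step described above. All dimensions and the exact fusion strategy (concatenation followed by a linear projection) are illustrative assumptions rather than VLP's actual implementation.

```python
import torch
import torch.nn as nn

class RegionEmbedding(nn.Module):
    """Sketch of a class-aware region embedding: each region is encoded
    from its R-CNN feature, its probability distribution over object
    labels, and its geometry (normalized box coordinates plus area).
    Dimensions are illustrative, not taken from the VLP codebase."""
    def __init__(self, feat_dim=2048, n_labels=1600, geo_dim=5, hidden=768):
        super().__init__()
        self.proj = nn.Linear(feat_dim + n_labels + geo_dim, hidden)
        self.norm = nn.LayerNorm(hidden)

    def forward(self, feats, label_probs, geometry):
        # feats: (regions, feat_dim), label_probs: (regions, n_labels),
        # geometry: (regions, geo_dim)
        x = torch.cat([feats, label_probs, geometry], dim=-1)
        return self.norm(self.proj(x))
```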
In order to adapt the VLP model so as to extend BabelPic's coverage to all the BabelNet synsets, we define the verification of synset-image associations as a VQA task with two possible answers. More specifically, we define the question template "Does the image depict l (g)?", where l is the main lemma and g is the WordNet gloss of the target synset. We instantiate the template for each synset-image pair in the dataset, thus obtaining a textual question for each instance. We set the ground-truth answers to either "yes" or "no", hence reducing our classification task to VQA.
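Instantiating the template is mechanical; for example (the gloss below is paraphrased from WordNet):

```python
def make_question(lemma, gloss):
    """Instantiate the paper's VQA question template for a synset."""
    return f"Does the image depict {lemma} ({gloss})?"

# e.g., for the POLE VAULT synset:
print(make_question("pole vault",
                    "a competition that involves jumping over a high "
                    "crossbar with the aid of a long pole"))
```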

Experiments
To test the reliability of our approach for the automatic verification of concept-image associations, we experiment in a zero-shot setting (see Section 5.3). As a first step toward this goal, we need to augment our dataset with negative instances (see Section 5.1) and select the most suitable VLP version (see Section 5.2). A deeper analysis of how the sampling of negative instances affects the performance of the system is described in Section 5.4.

Setting
In order to evaluate our methodology for the automatic verification of synset-image associations, we need to define a procedure for the generation of negative instances (i.e., irrelevant ⟨synset, image⟩ pairs). More specifically, we define a negative instance ⟨s, i⟩ by picking two different synsets s and s′ and an image i associated with s′ in our gold dataset. Negative instances can be distinguished on the basis of the relation connecting s to s′:

Polysemy: both s and s′ contain the same lemma (e.g., the synsets swim.v.01 and swim.v.02).

Sibling: s and s′ share a direct hypernym (e.g., DISAPPOINTMENT and BOREDOM).

Unrelated: there exists no relation connecting s to s′ in BabelNet (e.g., RACING and GLADFULNESS).
Exploiting the WordNet relations as mentioned above is also very effective in handling potential issues due to images that are instances of multiple concepts. For instance, the images in Figure 1 could also be mapped to COMPETITION (see Section 3), but since HIGH JUMP and POLE VAULT are related to COMPETITION in the knowledge base, they are never sampled as negative instances for it. The result is a dataset which is perfectly balanced between the two output classes. We split the dataset into training, validation and test sets following the 80%/10%/10% rule. Each class is proportionally distributed between the splits, as are the relations used to define the negative instances. In order to test the system's capability to handle previously unseen concepts, we force both the validation and test sets to also contain instances referring to synsets that are not present in the training set. We refer to the subset of the test set given by these instances as the zero-shot test set. Statistics are reported in Table 1.
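The following sketch shows one way to implement this negative sampling. The helpers `same_lemma`, `siblings` and `related` are hypothetical stand-ins for lookups in the WordNet/BabelNet graph, not functions from the paper.

```python
import random

def sample_negative(gold, synsets, same_lemma, siblings, related, relation):
    """Pick a negative <s, i> pair: image i belongs to a different synset
    s2, chosen according to the requested relation to s. `gold` maps each
    synset to its validated images; the relation helpers are assumed to
    query the LKB."""
    while True:  # retry until a synset with a non-empty candidate pool is drawn
        s = random.choice(synsets)
        if relation == "polysemy":
            pool = same_lemma(s)                     # shares a lemma with s
        elif relation == "sibling":
            pool = siblings(s)                       # shares a direct hypernym
        else:  # "unrelated"
            pool = [x for x in synsets if not related(s, x)]
        pool = [s2 for s2 in pool if s2 != s and gold.get(s2)]
        if pool:
            s2 = random.choice(pool)
            return s, random.choice(gold[s2])
```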

Pre-Trained vs. Fine-Tuned
In this work we refer to the VLP model (Zhou et al., 2020) pre-trained on CC as P-VLP, and to its version further fine-tuned for the VQA task on the VQA 2.0 dataset as F-VLP. Note that both P-VLP and F-VLP are then further fine-tuned for the verification of concept-image associations on BabelPic's training split. Our experiments show that both systems are reliable on our task, achieving precision and F1 scores over 70% on all the splits (see Table 2). However, the F-VLP model proves to be the more stable for the task. Indeed, in a common use-case scenario it is more important to accept only correct synset-image associations than it is to detect all the correct pairs. We therefore value precision over recall, and thus prefer the fine-tuned VLP model.
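The scores in Table 2 are standard binary precision/recall/F1 over the yes/no answers; a minimal evaluation sketch with scikit-learn, where encoding "yes" as 1 is our assumption:

```python
from sklearn.metrics import precision_recall_fscore_support

def evaluate(gold, pred):
    """Binary evaluation for the verification task; 1 ("yes") is the
    positive class, matching the paper's preference for precision."""
    p, r, f1, _ = precision_recall_fscore_support(
        gold, pred, average="binary", pos_label=1)
    return {"precision": p, "recall": r, "f1": f1}
```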

Zero-Shot Classification
Our main interest is in developing a model capable of annotating images with synsets even when the target concept is new to the system (i.e., zero-shot). As shown in the last column of Table 2, both the P-VLP and F-VLP models are robust in zero-shot classification, achieving scores comparable to the performance registered on the other splits. The F-VLP system, in particular, is able to verify the associations between unseen synsets and images with a precision of 77.67%, hence enabling the automatic extension of BabelPic to any other synset.

Fine-Grained Analysis
Finally, we analyse the system's performance on the different types of negative instances. The accuracy scores achieved by F-VLP are listed in Table 3. As one would expect, when the input synset-image pair is unrelated, the system is able to correctly classify most of the instances. When considering the instances labelled as sibling, the difficulty increases, and F-VLP achieves an accuracy of 62.07%. This is not surprising, considering that discriminating between images representing sibling concepts (e.g., DISAPPOINTMENT and BOREDOM) can be tricky for humans as well. Finally, the instances labelled as polysemy prove to be the hardest, demonstrating that BabelPic can be an interesting benchmark for Visual Sense Disambiguation as well. The performance of P-VLP follows the same trend.

Conclusions
In this work we introduced BabelPic, a new resource for language-vision tasks, built by validating the existing image-to-synset associations in the BabelNet resource. BabelPic is innovative in being the first dataset with a focus on nominal and verbal non-concrete concepts linked to the WordNet and BabelNet Lexical Knowledge Bases. Furthermore, we presented a methodology to extend the resource by fine-tuning VLP, a state-of-the-art pre-trained language-vision architecture. In our approach, we automatically verify the synset-image associations by exploiting the natural language definitions in WordNet, showing strong results on zero-shot classification as well. We exploited our method for the automatic generation of a wide-coverage silver dataset containing 10,013 synsets. We make BabelPic (both gold and silver data) available to the community for download at http://babelpic.org.