“A Distorted Skull Lies in the Bottom Center...” Identifying Paintings from Text Descriptions

Most question answering systems use symbolic or text information. We present a dataset for a task that requires understanding descriptions of visual themes and their layout: identifying paintings from their descriptions. We annotate paintings with contour data, align regions with entity mentions from an ontology, and associate image regions with text spans from descriptions. A simple embedding-based method applied to text-to-image coreferences achieves state-of-the-art results on our task when paired with bipartite matching. The scarcity of training data makes the task all the more difficult.

Knowledge from Images
Question answering is a standard NLP task that typically requires gathering information from knowledge sources such as raw text, ontologies, and databases. Recently, vision and language have been amalgamated into an exciting and difficult task: using images to ask or answer questions. While humans can easily answer complex questions using knowledge gleaned from images, visual question answering (VQA) is difficult for computers. Humans excel at this task because they abstract key concepts away from the minutiae of visual representations, but computers often fail to synthesise prior knowledge with confusing visual representations.
We present a new instance of visual question answering: can a computer identify an artistic work given only a textual description? Our dataset contains images of paintings, tapestries, and sculptures covering centuries of artistic movements from dozens of countries. Since these images are of cultural importance, we have access to many redundant descriptions of the same works, allowing us to create a naturalistic but inexpensive dataset. Due to the complex and oblique nature of questions about paintings, their visual complexity, and the relatively small data size, prior approaches used for VQA over natural images are infeasible for our task.
We formalise the task in Section 3, where we also present a preliminary system (ARTMATCH) and compare it with a data-driven text baseline to illustrate the usefulness and versatility of our method (Section 4). Finally, in Section 5 we compare our task and system to previous work that combines NLP and vision.

Describing Art
University Challenge (UK) or quiz bowl (USA) has previously been studied for question answering using text-based methods (Boyd-Graber et al., 2012). However, some quiz bowl questions are inherently visual in that their answers are works of art.
Figure 1 shows an example of a painting description and associated annotations (to be described later) from a quiz bowl question. Identifying paintings from textual descriptions of their contents is difficult; for example, many disparate paintings feature two men (Stag at Sharky's, The Sacrifice of Isaac, and Kindred Spirits). Given their varied style, composition, and depiction, how do we teach computers to infer the meaning of a painting?
To capture the meaning, we rely on redundant descriptions of entities in paintings offered by multiple text spans in these questions. The man on the right in Figure 1 is variously described as a "Frenchman", "a diplomat", "a man in black", and "George de Selve". We can use this redundancy to learn the meaning of the pixels within the red contour.
In text, this is the problem of coreference resolution (Radford, 2004), as the multiple text spans refer to the same "real world" entity. Trivia questions have complex descriptive coreferent groupings (Guha et al., 2015). Thus, to annotate our dataset we map coreference groups in question text to regions in paintings using LabelMe (Russell et al., 2008), providing a direct mapping of text spans to groups of pixels in the images and their spatial properties.
Our dataset contains 128 paintings, where each painting is the answer to a single quiz bowl question. First, we assign each object in a painting to a single class from an ontology with eight coarse and fifty-two fine (level two) classes. This ontology is three levels deep and follows the hyponymy structure of ImageNet (Deng et al., 2009). Then, we map each coreference group from the question text to an image contour from the painting (see Table 1). As the questions come from a game, the mentions are often oblique, making them hard to answer with text alone. For instance, a description of Rain, Steam, and Speed will avoid explicitly mentioning the painting's central object by name (a "locomotive") in favor of describing it in a roundabout way (e.g., a "conveyance"). Given one of the questions in our dataset, our goal is to provide the name of the painting that it describes. Because our focus is not on building better feature extractors for paintings, we assume that we have gold visual annotations (e.g., object contours, classes, and locations). The task is challenging due to the size of our dataset (only 128 annotated question/painting pairs), which prevents the training of most machine learning models, as well as high visual complexity and vagueness in question text.
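The annotation layout described above can be pictured as a small schema; this is a hypothetical sketch for illustration (the class names, field names, and example values below are invented, not taken from the released data files):

```python
from dataclasses import dataclass, field

@dataclass
class VisualObject:
    coarse_class: str   # one of the 8 coarse ontology classes
    fine_class: str     # one of the 52 fine ontology classes
    contour: list       # polygon vertices from LabelMe, [(x, y), ...]
    location: str       # spatial attribute, e.g. "right"
    number: str         # count attribute, e.g. "single"

@dataclass
class AnnotatedPainting:
    title: str
    objects: list = field(default_factory=list)
    # coreference group (tuple of mentions) -> index of the object it describes
    coref_to_object: dict = field(default_factory=dict)

# One painting with one annotated object and its coreference group:
p = AnnotatedPainting(title="The Ambassadors")
p.objects.append(VisualObject("person", "diplomat",
                              [(610, 120), (780, 120), (780, 600), (610, 600)],
                              "right", "single"))
p.coref_to_object[("a Frenchman", "a diplomat",
                   "a man in black", "George de Selve")] = 0
```

Each question/painting pair thus supplies both the gold visual annotations and the text-side coreference groups that the matching system consumes.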

A Text-Only Baseline
Our baseline model is "blind" in that it does not use any visual features to solve the task. We use the deep averaging network (DAN; Iyyer et al., 2015), which takes as input a textual description of a painting and learns a 128-label classifier over an average of embeddings from words in the question. Since this model does not do any visual mapping, we collect unannotated questions about our 128 paintings to form a respectably-sized training set of 503 questions. While the DAN has access to more data than our non-blind model, we hope that we can improve over the baseline using visual information.
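A minimal sketch of the DAN-style forward pass: average the word embeddings of a question and feed the result through a feed-forward layer into a softmax over the 128 painting labels. The vocabulary, dimensions, and weights below are toy stand-ins (random and untrained), not the actual model of Iyyer et al. (2015):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy word embeddings standing in for pretrained vectors.
vocab = {w: rng.normal(size=50) for w in
         ["skull", "lute", "diplomat", "angel", "locomotive"]}

def average_embedding(words):
    """Average the question's word embeddings; unknown words are skipped."""
    vecs = [vocab[w] for w in words if w in vocab]
    return np.mean(vecs, axis=0)

# One hidden layer plus a softmax over the 128 painting labels.
W1 = rng.normal(size=(50, 64)); b1 = np.zeros(64)
W2 = rng.normal(size=(64, 128)); b2 = np.zeros(128)

def dan_forward(words):
    h = np.tanh(average_embedding(words) @ W1 + b1)
    logits = h @ W2 + b2
    e = np.exp(logits - logits.max())
    return e / e.sum()  # probability distribution over the 128 paintings

probs = dan_forward(["skull", "lute", "diplomat"])
```

In training, the cross-entropy loss over the 503 collected questions would fit these weights; here they are left random purely to show the data flow.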

Answer Questions Using Annotated Paintings
Our method, which we call ARTMATCH, assumes that some of the groups of coreferent text in a question describe visual objects in the associated painting.
If we have a unified vector representation of visual object classes from painting regions and textual coreferent groups, a bipartite mapping can match them.

Matching Mentions to Images
To identify the painting described by a question, we first convert every question to a list of objects obtained from coreference chains (e.g., a lute, a distorted skull). On the painting side, we have a list of annotated visual objects. These two lists form the nodes of a bipartite graph on which we perform a maximum cardinality match (Hopcroft and Karp, 1973), where edge weights represent match strength. We consider the painting with the most matched edges as the answer; in case of a tie, the painting with the highest cumulative edge weight wins.
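The scoring step can be sketched as follows. This toy version uses Kuhn's augmenting-path algorithm (a simpler cousin of Hopcroft-Karp) for the maximum cardinality match, keeping only edges whose similarity clears a threshold; the similarity values and threshold below are invented for illustration:

```python
def max_matching(n_left, n_right, edges):
    """Maximum bipartite matching via augmenting paths.
    edges: dict (chain_idx, object_idx) -> weight. Returns matched pairs."""
    adj = [[] for _ in range(n_left)]
    for (i, j) in edges:
        adj[i].append(j)
    match_right = [-1] * n_right

    def try_augment(i, seen):
        for j in adj[i]:
            if j not in seen:
                seen.add(j)
                if match_right[j] == -1 or try_augment(match_right[j], seen):
                    match_right[j] = i
                    return True
        return False

    for i in range(n_left):
        try_augment(i, set())
    return [(match_right[j], j) for j in range(n_right) if match_right[j] != -1]

def score_painting(n_chains, n_objects, sim, threshold=0.5):
    """Score a candidate painting as (matched edges, cumulative weight),
    mirroring the count-first, weight-as-tiebreak rule."""
    edges = {(i, j): sim[i][j]
             for i in range(n_chains)
             for j in range(n_objects)
             if sim[i][j] > threshold}
    pairs = max_matching(n_chains, n_objects, edges)
    return len(pairs), sum(edges[p] for p in pairs)

# Two coreference chains vs. two annotated objects (toy similarities):
sim = [[0.9, 0.1],
       [0.2, 0.8]]
score = score_painting(2, 2, sim)
```

The candidate painting with the lexicographically largest score tuple is returned as the answer.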
This process requires that our visual object classes are in the same vector space as objects found in textual coreference chains. For one chain, we compute a vector representation by averaging the embeddings of its words; we use publicly available 300-dimensional word2vec embeddings trained on Google News (Mikolov et al., 2013). Also, for each visual object class, we obtain a set of synonyms and hyponyms and compute an averaged word vector over this set. Similarly, we produce averaged vectors over location and number attributes (e.g., the single attribute is represented by a vector average over {single, one, a, an}). Since distance between word embeddings measures semantic similarity, we assign to each chain the object class and attributes with the highest cosine similarity to that chain's vector representation, as shown in Figure 2.
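The class-assignment step can be sketched with hand-made toy vectors standing in for the 300-dimensional word2vec embeddings; all words, vectors, and synonym sets below are illustrative:

```python
import math

# Toy 3-d embeddings; semantically related words get nearby vectors.
emb = {
    "angel":   [1.0, 0.1, 0.0],
    "seraph":  [0.9, 0.2, 0.0],
    "cherub":  [0.8, 0.1, 0.1],
    "person":  [0.0, 1.0, 0.1],
    "man":     [0.1, 0.9, 0.0],
    "seraphs": [0.9, 0.1, 0.1],
    "robed":   [0.3, 0.2, 0.2],
}

def avg_vec(words):
    """Average the embeddings of known words (out-of-vocabulary words are
    skipped, as with any fixed word2vec vocabulary)."""
    vecs = [emb[w] for w in words if w in emb]
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(3)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(y * y for y in b)))

# Each class vector averages the class name with its synonyms/hyponyms.
class_vecs = {
    "angel":  avg_vec(["angel", "seraph", "cherub"]),
    "person": avg_vec(["person", "man"]),
}

# A coreference chain is averaged the same way and assigned the most
# cosine-similar class.
chain_vec = avg_vec(["these", "robed", "seraphs"])
best = max(class_vecs, key=lambda c: cosine(chain_vec, class_vecs[c]))
```

Location and number attributes would be assigned identically, each from its own small set of averaged attribute vectors.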
We easily combine ARTMATCH's matching with the DAN by multiplying each bipartite edge weight by the probability, as given by the DAN, that the corresponding question-answer pair is correct.
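The combination amounts to a per-painting rescaling of edge weights before matching; a minimal sketch, with illustrative values:

```python
def combined_edge_weights(edges, dan_prob):
    """Scale every edge weight for a candidate painting by the DAN's
    probability that this painting answers the question.
    edges: dict (chain_idx, object_idx) -> match strength."""
    return {pair: w * dan_prob for pair, w in edges.items()}

# Toy edges for one candidate painting, rescaled by a DAN probability:
edges = {(0, 0): 0.9, (1, 1): 0.8}
reweighted = combined_edge_weights(edges, dan_prob=0.5)
```

Paintings the DAN considers likely thus keep strong edges, while its low-probability candidates are uniformly penalised in the matching.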

Performance and Analysis
We investigate the performance both locally (matching specific objects) and globally (identifying the correct image) before doing an error analysis.
First, we examine the individual matching on objects and measure accuracy on three different tasks:
1. Does ARTMATCH properly match coarse and fine visual object classes to question text (e.g., is "an angel" mentioned in the question matched to an image region depicting an angel)?
2. Are the matches in the correct locations (e.g., is the "angel" in the top-left corner)?
3. Is the number of objects correctly matched (e.g., are there two or three angels)?
Table 2 shows the results of these experiments. Additionally, for the highest-scoring paintings (the answers output by ARTMATCH), 13.2% of objects are exactly matched with location and number; without considering those attributes, 20.4% of fine-grained classes are matched.
Next, we look at the main task of identifying the correct painting. As Table 3 shows, ARTMATCH nearly matches the blind DAN baseline with just the coarse and fine object classes. Spatial location and number attributes boost ARTMATCH above the baseline, and combining both systems pushes accuracy up by four absolute points, indicating that the models are learning complementary information.
Having established that our simple method of incorporating visual information can achieve significant gains in accuracy, we now proceed to analyse instances in which our system does well and the DAN does not (and vice-versa).

Error Analysis
There are 34 questions for which the DAN fails but ARTMATCH succeeds. For many of these questions, the DAN fails because it overfits to common clues. Given a test question about Melencolia I, the DAN answers Madonna with the Long Neck, as the training questions about both paintings repeatedly mention a female figure and cherubs. However, the question also mentions geometric figures, the spatial locations of which enable ARTMATCH to answer correctly.
Conversely, there are 31 questions where ARTMATCH fails but the DAN succeeds. Some of these questions contain text constructs such as the painter's name that are repeated in both training and test questions, which makes them easy for the DAN to solve (e.g., "Identify this most famous work of Claude Monet"). In other cases, ARTMATCH answers incorrectly because of spurious matches due to substantial visual similarity between various objects in paintings. For example, in a question about The Holy Trinity by Masaccio, "St. John" is assigned the close but incorrect class of "statue" while "Jesus" is correctly identified as a person. Further confused by spatial similarities between the paintings, ARTMATCH answers Supper at Emmaus, which has Jesus but no St. John. In other cases, peripheral similarity leads to the central mismatch being overlooked, motivating an attention mechanism to focus on "significant" entities as future work (Mnih et al., 2014).

Related Work
Our work is specifically related to previous work on visual question answering and more generally to multimodal applications of vision and language.
Visual QA has previously focused on content questions (Antol et al., 2015; Ren et al., 2015; Andreas et al., 2015), while we focus on identity questions. Relatedly, Zhu et al. (2015) find semantic links between images and text via an attention model.
We use coreference to connect text and image regions, similar to Kong et al. (2014). However, not all text is "visual" (Dodge et al., 2012) and not all image regions can be described textually (Berg et al., 2012). While we focus on meaning, the structure of text (Elsner et al., 2014) can also be inferred from images. Socher et al. (2014) match sentences to images; however, our dataset is unique in that the text is intentionally oblique (rather than a direct description) and our images, being paintings, are more visually varied.
Aside from QA, images have been successfully used to generate captions (Karpathy and Fei-Fei, 2014; Mao et al., 2014; Vinyals et al., 2014; Xu et al., 2015; Chen and Zitnick, 2014). While we use vision to aid an NLP task, others have gone in the opposite direction, inducing correspondences between words and video clips (Yu and Siskind, 2013), words and action models (Ramanathan et al., 2013), and language and perception (Matuszek et al., 2012).

Conclusion and Future Work
The major contribution of this work is to extend question answering to a complex visual setting by presenting an annotated dataset and a simple system that manages to exceed the performance of a strong text-only baseline QA system. The next challenge is to scale up this dataset to enable end-to-end training pipelines for answering questions using raw images.

Figure 1: A painting (left) with image regions matched to coreference chains in a question (right). The question uses a variety of oblique mentions to make the trivia question more difficult; some entities (e.g., de Selve) are mentioned again later in the question.

Figure 2: Using word2vec representations from coreferent groupings in a description to deduce object class and attributes by cosine similarity.

Table 1: Statistics of our new question answering dataset.

Unique paintings           128
Objects with contours    1,436
Coreferring text groups  1,104
Object coarse classes        8
Object fine classes         52

Annotated data provided at www.cs.umd.edu/~aguha/data/paintdata.rar; the ontology is provided as supplementary material.

Table 2: Individual metrics of classes and features detected by word embeddings from coreference chains describing objects.

Table 3: Our system vs. the blind baseline. DAN is trained on 503 questions but has no visual information. ARTMATCH has visual features from paintings but no training data. Combining both leads to a significant increase in performance.