Grounding Semantic Roles in Images

We address the task of visual semantic role labeling (vSRL), the identification of the participants of a situation or event in a visual scene, and their labeling with their semantic relations to the event or situation. We render candidate participants as image regions of objects, and train a model which learns to ground roles in the regions which depict the corresponding participant. Experimental results demonstrate that we can train a vSRL model without reliance on prohibitive image-based role annotations, by utilizing noisy data which we extract automatically from image captions using a linguistic SRL system. Furthermore, our model induces frame-semantic visual representations, and their comparison to previous work on supervised visual verb sense disambiguation yields overall better results.


Introduction
Images of everyday scenes can be interpreted and described in many ways, depending on the perceiver and the context in which the image is presented. The latter may be natural language data or a visual sequence. As an example, consider the two scenes in Figure 1 and the question What is the man doing? The interpretation of the first target image (left) in isolation would allow many answers. Taking into account the visual context, however, may disprove many of those answers (e.g., He is questioning the women.). For the target image on the right, the reason for Why there is so much food on the table? can be inferred from its textual context.
As the examples illustrate, the interpretation of a (visual) scene is related to the determination of its events, their participants and the roles they play therein (i.e., distilling who did what to whom, where, why and how), and this may require joint processing of, or reasoning over, possibly multiple (extra-)linguistic information sources (e.g., text, images). In NLP, the well-established and well-studied task of semantic role labeling (SRL) aims to extract such knowledge in the form of shallow semantic structures from natural language texts (e.g., questioning(Agent:man, Theme:women); see, e.g., Gildea and Jurafsky (2002) and Palmer et al. (2010) for an overview). It is considered an essential task towards text understanding, and was shown to be beneficial for applications such as information extraction (see Roth and Lapata (2016) and the references therein) and question answering (Shen and Lapata, 2007). In computer vision research, recent efforts have been made on visual SRL or situation recognition, a task coined by transferring the use of semantic roles to produce similar structured meaning descriptions for visual scenes (e.g., Yang et al. (2016); Yatskar et al. (2016)).
To facilitate the endeavor of joint processing over multiple sources, it is desirable to induce representations of texts and visual scenes which encode this kind of information in an essentially congruent and generic way. Such representations would furthermore support the induction of a desired level of abstraction as needed.
In this paper we propose an approach towards this goal: we address the task of visual SRL (vSRL) and learn frame-semantic representations of images. Specifically, we present a model that learns to ground the semantic roles of a semantic frame in image regions, which may be crucial for, e.g., human-robot interaction or surveillance (e.g., Who/Where is the robber?). For example, the image shown in Figure 2 evokes the ARREST frame, and its semantic roles Authorities, Suspect, and Place are grounded in the image regions (delineated by bounding boxes) which depict their corresponding fillers. While being trained on this task, our model learns distributed situation representations (for images and frames), and participant representations (for image regions and roles) which capture the visual-frame-semantic features of situations and participants, respectively. We train our model on data that we automatically extract by running a linguistic SRL system on image captions, i.e., human-produced data that is abundant and requires less time and expertise than frame-semantic annotations. Supervised SRL has suffered from data sparsity since it relies on labor-intensive human annotations. Analogous issues on manually annotated images have been addressed by Yatskar et al. (2017). By leveraging existing efforts made in NLP, we explore whether we can alleviate the supervision bottleneck in visual SRL. Our experiments yield promising results, and our models are even able to make correct predictions for erroneous data points. Furthermore, we evaluate the induced situation representations on the task of supervised visual verb sense disambiguation, where they outperform or are comparable to previous work (on motion or non-motion verbs, respectively).

Related Work
Yatskar et al. (2016) introduced the ImSitu dataset for the task of situation recognition, i.e., the problem of, given an image, predicting a structured output which specifies the depicted activity (e.g., jumping) and its associated semantic roles paired with their nominal fillers (e.g., {(agent, bear), (obstacle, water)}). To address the task, Yatskar et al. (2016, 2017) train conditional random field (CRF) models on ImSitu (Yatskar et al., 2016) and on additional training data for rarely occurring noun-role combinations which they source from the web (Yatskar et al., 2017). Mallya and Lazebnik (2017) assume that the roles associated with each activity are in a fixed order, and treat the above task as one of recognizing activities and generating a sequence of nouns, for which they use a recurrent neural network. They show how the features learned in this way can be transferred to tackle image caption generation. Li et al. (2017) explicitly model role dependencies through a gated graph neural network. Given an image, they instantiate a fully connected graph with a verb and its roles as nodes. Each node's hidden state vector is initialized with image features from two CNNs, which were pre-trained for the prediction of verbs and nouns, respectively. Using a softmax layer augmented with hidden state vectors, they predict the verb and the nominal fillers of its roles.
In contrast to the above work on ImSitu, we do not link the roles of a verb to their lexical fillers. We address the related task of explicitly grounding roles in the corresponding image regions, since our focus is on the relation between semantic roles and the typical visual features of their fillers (e.g., a Body part is typically not a bike but arms). Gupta and Malik (2015) introduced this task as visual semantic role labeling. Similarly, Yang et al. (2016) formulate a CRF that jointly processes a cooking video and its natural language descriptions in order to ground the semantic roles associated with the verbs in corresponding object tracks. Both of these studies are limited to a small number of activities performed by people and a few semantic roles (26 and 11 verbs, 3 and 6 roles, respectively).
Unlike related work, our approach does not rely on manual role annotations of images, but exploits a linguistic SRL system for data creation. With more than 1k frame-specific roles, our data is of a larger scope than Gupta and Malik (2015) and Yang et al. (2016). Further, unlike the CRF-based approaches, our model induces frame-semantic representations during training.
Figure 2 (target outputs): ARREST with Authorities: r_1, r_2; Suspect: r_5; Place: r_3. PLACING with Agent: r_1; Theme: r_5; Place: r_3; Goal: r_4.

Grounding Semantic Roles in Images
We first define the task of vSRL and then present our model and our approach for data creation.

Task Definition: vSRL
Our approach is based on the linguistic theory of frame semantics (Fillmore, 1982), which builds on the idea that words evoke semantic frames. Frames describe prototypical situations or events and contain semantic roles. For example, in the sentence They arrested him for assault, the argument they fills the Authorities role, him is the Suspect, and assault the Charges of the ARREST frame, which was evoked by the verb arrest.
Let F be a set of frames, E be the set of all semantic role labels, and E_f be the inventory of roles associated with the frame f (e.g., E_ARREST = {Authorities, Suspect, Charges, Offense, Place}). Assume we are given an image i, which evokes a frame f, and a set of image regions R_i, which render one or several objects in i. The task of vSRL is to link each role e ∈ E_f to the object r ∈ R_i that fills role e in the situation or event which f describes. We say that a role e is realized in an image if it can be grounded in an image (region). The object r shown in the image region is called the filler or realization of e. The structure A_f = {(r, e) | r ∈ R_i, e ∈ E_f} overall represents the frame f in the image i.
In SRL, the task of identifying the frame which a predicate evokes is a prerequisite, but it is usually treated as a subtask of SRL. We follow this approach and consider the identification of the frames evoked by an image as a subtask of vSRL. We formulate two further subtasks for vSRL, namely role prediction (determining the correct role for a relevant image region) and role grounding (linking a realized role to its filler).
Note that not all roles of a frame may be realized in an image, and not all objects may play a role in an evoked frame. Figure 2, for instance, shows an image with some of its objects delineated by six bounding boxes R_i = {r_1, r_2, r_3, r_4, r_5, r_6}. The target outputs (bottom, Fig. 2) are the frames ARREST and PLACING, as well as their realized roles which are aligned with their fillers (marked by colors). The FrameNet roles Charges and Offense are not realized in the image, i.e., they cannot be grounded. The vehicle, box r_4, in turn, does not participate in the ARREST frame.

Model: Visual-Frame-Semantic Embedder
Our model, illustrated in Figure 3, is formulated as a neural network architecture. Its input is a tuple q = (i, r, f, e) ∈ Q of an image i, an object which is delineated by bounding box r, a frame f ∈ F, and a role label e ∈ E_f (e.g., q = (img_1, r_5, ARREST, Suspect); cf. Fig. 2). The model output is a score s(q) ∈ [−1, 1] which quantifies the visual-frame-semantic correspondence between the box r and the role e of f (Fig. 3, right).
More specifically, the model maps visual encodings of i and r (e.g., vectors of a pre-trained CNN), and frame-semantic representations of f and e (randomly initialized embeddings) to common visual-frame-semantic spaces (cross-modal layers in Fig. 3).
We assume that images capture different frame-semantic features than image regions: an image encodes the whole scene and its participants and thus evokes a frame, while individual image regions of participants capture the participant-specific features of the semantic roles they fill. We therefore distinguish between two different cross-modal spaces: a situation space for images and frames, and a participant space for regions and roles. Using the respective representations in these spaces, the model then estimates the situation similarity, sim_s(i, f), between the image and the frame, and the participant-role similarity, sim_p(r, e), between the box and the role. Finally, the overall frame-semantic score s(q) is the aggregation of sim_s and sim_p (Equ. 1), where the parameter b_f ∈ θ weights the contribution of the situation and participant scores to the overall score and is learned along with all other model parameters θ.
Figure 3: The ImgObjLoc model, which scores the correspondence between a semantic role and a candidate role filler (an image region), and between a frame and the whole image.
By definition of the output function s (Equ. 1), each role-object pair is scored independently of the decisions made for the other roles and regions of the same frame and image, respectively. Technically, this allows for the use of partially labeled training data, where not every realized role of a frame has been linked to its filler, as we will explain in Section 3.3.
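The scoring scheme can be illustrated with a minimal sketch. It uses cosine similarity, as stated in the model details. The form of the aggregation is an assumption made only for illustration: the sketch takes it to be a convex combination weighted by b_f, which keeps s(q) within [−1, 1] but is not confirmed by the text. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all-zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def score(img_vec, frame_vec, box_vec, role_vec, b_f):
    """Overall score s(q) for q = (i, r, f, e).

    ASSUMPTION: the aggregation is a convex combination with b_f in [0, 1];
    the actual form of Equ. 1 may differ.
    """
    sim_s = cosine(img_vec, frame_vec)   # situation similarity sim_s(i, f)
    sim_p = cosine(box_vec, role_vec)    # participant-role similarity sim_p(r, e)
    return b_f * sim_s + (1.0 - b_f) * sim_p
```

Because each call depends only on one (image, box, frame, role) tuple, a role-object pair can be scored without knowing the assignments of the other roles, which is the property that permits partially labeled training data.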
Below we describe how we use our model to address the subtasks of role prediction and grounding (Section 3.1), for which we will report experimental results in Section 5. In both cases, the method is based on the visual-frame-semantic correspondence s(q) (Equ. 1), where we discard all candidate role-filler pairings with a score less than zero.

Role Prediction
Given an image i, we formulate the role prediction problem as a mapping L. That is, the predicted role (and the frame it is associated with) which an image region r ∈ R_i of i fills is that e ∈ E to which r is most similar in the visual-frame-semantic space.
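Role prediction as described here amounts to an argmax over frame-role pairs. The following is a small sketch under that reading; `score_fn` is a hypothetical stand-in for s(q), and `frame_roles` is an assumed dictionary mapping each frame f to its role inventory E_f.

```python
def predict_role(score_fn, image, box, frame_roles):
    """Return the best-scoring (frame, role) for a box, or None.

    score_fn(image, box, frame, role) -> float stands in for s(q);
    frame_roles maps each frame f to its role inventory E_f.
    """
    best, best_score = None, float("-inf")
    for frame, roles in frame_roles.items():
        for role in roles:
            s = score_fn(image, box, frame, role)
            if s > best_score:
                best, best_score = (frame, role), s
    # Candidates with a score below zero are discarded.
    return best if best_score >= 0 else None
```

Returning None models the case in which a region plays no role in any frame, e.g., an object that does not participate in the depicted event.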
Role Grounding is the equivalent to linguistic semantic role labeling. Given a frame f realized in i, we ground each role e ∈ E_f in the region r ∈ R_i with the highest visual-frame-semantic similarity to e.
Training We train the model by using a ranking criterion designed to give higher scores to true cross-modal frame-semantic combinations (i, r, f, e) than to mismatches, by a margin M. To this end, for each positive example q = (i, r, f, e) of a training set Q, we sample K negative examples q′_k = (i, r, f′, e′) of a frame f′ and role e′ ∈ E_f′ not true for image i and box r, and learn model parameters θ by minimizing the maximum margin hinge loss function on the tuples (q, q′) (Equ. 4). Ideally, using this loss function would guide the parameter learning towards mapping images and the frames they evoke, and regions and the roles they fill, respectively, nearby each other in the cross-modal spaces.
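The ranking criterion can be sketched as follows. The loss is written in the standard max-margin form, summing max(0, M − s(q) + s(q′_k)) over the K negatives, which is an assumption about the exact shape of Equ. 4. The helper names and the `true_pairs` argument (the frame-role pairs known to hold for the image and box) are hypothetical.

```python
import random

def sample_negatives(positive, true_pairs, frame_roles, k, rng):
    """Draw up to k tuples (i, r, f', e') whose (f', e') is not true for (i, r)."""
    i, r, _, _ = positive
    pool = [(f, e) for f, roles in frame_roles.items() for e in roles
            if (f, e) not in true_pairs]
    return [(i, r, f, e) for f, e in rng.sample(pool, min(k, len(pool)))]

def hinge_loss(score_fn, positive, negatives, margin):
    """Max-margin ranking loss for one positive and its sampled negatives.

    ASSUMPTION: standard form, sum_k max(0, M - s(q) + s(q'_k)).
    """
    s_pos = score_fn(*positive)
    return sum(max(0.0, margin - s_pos + score_fn(*neg)) for neg in negatives)
```

A negative contributes to the loss only if its score comes within the margin M of the positive's score, so well-separated mismatches produce no gradient.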

Using Linguistic Knowledge for Data Creation
SRL systems in NLP research use training data which have been carefully created by linguistic experts (e.g., Ruppenhofer et al. (2006); Palmer et al. (2005)) for many years. To train our model on the visual SRL task, we build upon the annotation efforts made in NLP. Exploiting existing resources which were developed for the analogous goal avoids the time-consuming and costly annotation effort involved in the creation of training data. Moreover, adopting an established framework in NLP for shallow semantic representations (FrameNet, Ruppenhofer et al. (2006), in our case), including the therein defined frame and role labels, could facilitate cross-modal interactions: advances in vSRL can help to improve SRL and vice versa, or inferences can be drawn jointly from both modalities (e.g., a text and its illustration). Our data creation approach is to use a (linguistic) SRL system to extract frame-semantic annotations from a corpus of images paired with captions. We use the Flickr30k Entities dataset (Plummer et al., 2015) 5 which contains 30k images and five captions per image. We chose this dataset since its captions are augmented with entity mention annotations, associating them with the 276k manually annotated bounding boxes (i.e., entities are grounded in the image). To create the set Q = {(i^(j), r^(k_j), f^(l_j), e^(l_j,k_j)) | j ∈ {1, ..., 30k}} of training instances, we run PathLSTM (Roth, 2016; Roth and Lapata, 2016) on all captions, and extract all semantic frame annotations whose roles are filled by a grounded entity. As a result, our training corpus comprises images, the frames they evoke, and the associated semantic roles paired with their grounded fillers (i.e., bounding boxes).
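The extraction step can be pictured with a small sketch. The input structures are hypothetical simplifications: `srl_frames` stands in for the SRL system's output on one caption, and `mention_to_boxes` for the caption's entity-to-box links. Neither reflects the actual file formats of PathLSTM or Flickr30k Entities.

```python
def extract_instances(image_id, srl_frames, mention_to_boxes):
    """Turn SRL output for one caption into (image, box, frame, role) tuples.

    srl_frames: list of {"frame": str, "roles": {role_label: mention_id}}
                (hypothetical, simplified stand-in for the SRL output).
    mention_to_boxes: {mention_id: [box_id, ...]} from the entity annotations.
    """
    instances = []
    for ann in srl_frames:
        for role, mention in ann["roles"].items():
            # Keep only role fillers that are grounded in at least one box.
            for box in mention_to_boxes.get(mention, []):
                instances.append((image_id, box, ann["frame"], role))
    return instances
```

Roles whose fillers have no bounding box (such as an abstract Charges argument) simply yield no instance, which is why the resulting data can be partially labeled.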
5 See web.engr.illinois.edu/~bplumme2/Flickr30kEntities
Sentences (1a)-(1c) in Figure 4 (left), for example, are three human-produced captions for the image in Figure 2, in which entity mentions are linked to their image regions (indicated by colors). Using PathLSTM, we extract the grounded frame-semantic annotations (2a)-(2c) (Fig. 4, right), which results in the following six instances of our corpus Q:

Data
Training Data We adopt the training, validation and test splits provided in the Flickr30k Entities dataset (Plummer et al., 2015) and create our dataset Q with the method described above. Some verbs and the frame types which they evoke occur very frequently in the set of annotations (e.g., BEING LOCATED) and therefore allow the induction of a finer-grained frame inventory. Specifically, we transform each frame which is evoked by an individual verb (e.g., stand or sit) for at least 100 images (as obtained from the captions) in the Flickr30k Entities training split to a finer-grained frame type by concatenating it with the verb (e.g., BEING LOCATED-sit). Finally, we keep all frame types (fine-grained or coarse) which had been assigned to at least 100 different images. This amounts to an inventory of 252 frame types (102 coarse types, e.g., STATEMENT), 1,409 frame-specific role types (e.g., STATEMENT.speaker), 169 role labels (e.g., Speaker) and 76,939 training instances. We derive our validation and test splits from the original splits on the basis of the above modifications. See Table 1 for the quantitative details on the dataset, which we henceforth call Flickr30k Roles.
Reference Data Flickr30k Roles may contain false instances due to its creation on the basis of automatic frame-semantic annotations. ImSitu (Yatskar et al., 2016) is, to the best of our knowledge, the only existing benchmark dataset for vSRL. As explained in Section 2, however, it is image-based, and does not provide explicit links between roles and the regions which depict their fillers. It cannot be used for the evaluation of role prediction and grounding without additional annotations. We therefore created a set of reference instances by presenting a subset of the Flickr30k Roles test data to two human subjects (both students of computational linguistics) for annotation. We chose all instances which agree in their frame label with instances extracted from at least two other captions of the underlying image. This amounts to 201 images and 715 instances. The annotators were presented with an image with relevant objects rendered by bounding boxes, along with the automatically grounded semantic frame annotations. Figure 5 gives an example image along with the 4 automatically obtained instances. They were asked to judge the correctness of the frame (e.g., INGESTION, Fig. 5), the verb (in the case of a fine-grained frame type; e.g., eat) and each of the role-filler links (e.g., Ingestor-226403). They further linked wrong role assignments to their correct fillers when possible. We created the reference set as the intersection of all correct instances of the two annotators (frame and role-filler linkings), which amounts to 554 instances.
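The frame inventory refinement described under Training Data can be sketched as a two-pass count over (image, frame, verb) annotations. The function name and input layout are hypothetical; the threshold of 100 images is the one given in the text, and the sketch assumes the counts are taken on the training split.

```python
from collections import defaultdict

def refine_frames(annotations, min_images=100):
    """Return the set of kept frame types.

    annotations: list of (image_id, frame, verb) tuples from the training split.
    """
    # Pass 1: count distinct images per (frame, verb) pair.
    images_per_pair = defaultdict(set)
    for image_id, frame, verb in annotations:
        images_per_pair[(frame, verb)].add(image_id)

    def frame_type(frame, verb):
        # Frequent (frame, verb) pairs become fine-grained types.
        if len(images_per_pair[(frame, verb)]) >= min_images:
            return f"{frame}-{verb}"
        return frame

    # Pass 2: keep every type assigned to enough distinct images.
    images_per_type = defaultdict(set)
    for image_id, frame, verb in annotations:
        images_per_type[frame_type(frame, verb)].add(image_id)
    return {t for t, imgs in images_per_type.items() if len(imgs) >= min_images}
```

Note that a coarse type can survive next to its fine-grained variants when its remaining, less frequent verbs together still cover enough images.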

Visual Representations
We use high-dimensional distributed vectors to represent images and regions (bounding boxes), and represent the latter by additional contextual features. These encode a region's relative location and size with respect to the whole image (cf. Mao et al., 2016):
[x_tl / W, y_tl / H, x_br / W, y_br / H, (w · h) / (W · H)]   (5)
where (x_tl, y_tl) and (x_br, y_br) are the coordinates of the top left and bottom right corners of the bounding box, H and W are the height and width of the image, and h and w the height and width of the box. These features have been found useful for referring expression generation/interpretation for objects in images (Mao et al., 2016). We hypothesize that the relative position and size of an object can be likewise informative for the roles it can(not) realize. For example, an object that is located at the bottom of an image is more likely to be the Patient of a KICKing event than the Agent.
Figure 5: [...] Roles (colored, left columns) and the human correctness judgments of the frame, verb, and role fillers (right-most column; 1 is correct, 0 wrong). The object names were presented to facilitate the annotation, but are not part of the instance.
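As a worked illustration of the contextual features, the sketch below computes a five-dimensional vector of relative corner coordinates and relative area. The exact layout is an assumption that follows the features of Mao et al. (2016); the function name is hypothetical.

```python
def box_context_features(x_tl, y_tl, x_br, y_br, img_w, img_h):
    """Relative location and size of a box within its image.

    ASSUMPTION: five features as in Mao et al. (2016): the four corner
    coordinates normalized by image width/height, plus the area ratio.
    """
    w, h = x_br - x_tl, y_br - y_tl
    return [
        x_tl / img_w,
        y_tl / img_h,
        x_br / img_w,
        y_br / img_h,
        (w * h) / (img_w * img_h),
    ]
```

For a box with corners (10, 20) and (60, 120) in a 200 × 400 image, this gives [0.05, 0.05, 0.3, 0.3, 0.0625]: the box sits in the upper left and covers about 6% of the image.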

Experiments
We first evaluate our model in terms of different aspects related to visual SRL on the two subtasks role prediction and grounding (see Section 3). Our second experiment assesses the usefulness of the learned frame-semantic image representations on the task of visual verb disambiguation: given an image and a verb, assign the correct sense of the verb, i.e., the one that describes the action depicted in the image (e.g., play an instrument; play sport). This task is different from visual SRL, but forms a prerequisite for it, since in frame semantics, roles are defined on the basis of frames evoked by verb senses.
Model Details For each bounding box and image, we use the VGG16 network (Simonyan and Zisserman, 2014), trained on ImageNet (Deng et al., 2009), to extract a 4,096-dimensional feature vector from the fully connected fc7 layer. To transform the feature vectors into the visual-frame-semantic embedding space, we use two two-layer networks which are composed of a layer with rectified linear activation units (relu) followed by a layer with tanh activations (see Fig. 3, top left). We furthermore concatenate the first hidden layer (relu layer) of each image region (i.e., box) with a vector of contextual features (relative box size and location, Equ. 5).
Frames and roles, in turn, are encoded as one-hot vectors and mapped to randomly initialized embedding layers, which are then transformed into the visual-frame-semantic representations using tanh activation layers (Fig. 3, bottom). We use the cosine similarity to quantify visual-frame-semantic correspondences in the cross-modal space (Equ. 1).
Throughout our experiments we compare our model (ImgObjLoc), which takes into account the contextual features (Equ. 5), to a model that does not use contextual box features (ImgObject), and one that only uses the image as visual input (Image-only). Image-only derives its cross-modal role representation by feeding the image's fc7 feature vector to both the image and the box input layers.
The network parameters were optimized using AdaGrad (Duchi et al., 2011) with a learning rate of 0.003. We monitored the role prediction performance on the validation set of Flickr30k Roles and kept the best performing model. See Appendix A.1 for further details on the model hyperparameters.

Exp.1: Semantic Role Prediction and Local Grounding
In the role prediction evaluation, the model is given an image and a bounding box, which represents a candidate role filler, and needs to predict the frame and role which the entity (or entities) in the box fills.
In the grounding experiment, the model is given an image, a frame and an associated role which is realized in the image, and needs to determine the correct role filler from a list of boxes. We report results on using ground truth boxes as well as box proposals, extracted with selective search (Uijlings et al., 2013). Regarding the latter, we apply the intersection over union (IoU) metric (e.g., Everingham et al. (2010)), and consider a role to be grounded in the correct box proposal r̂ if the area of overlap between r̂ and the reference box, divided by the area of their union, exceeds 50%.
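The IoU criterion for box proposals can be written out as a short sketch. Boxes are assumed to be (x_tl, y_tl, x_br, y_br) tuples; the function names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_tl, y_tl, x_br, y_br)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def is_correct_grounding(proposal, reference, threshold=0.5):
    """A proposal counts as correct if its IoU with the reference exceeds 50%."""
    return iou(proposal, reference) > threshold
```

The comparison is strict, so a proposal that covers exactly half of the union with the reference box is not counted as correct.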

Results
We report top-1 and top-k accuracy (i.e., the frame and role is among the top-k scored predictions) on the Flickr30k Roles test and reference sets for both subtasks (recall that Flickr30k Entities provides ground truth alignments between entity mentions and objects). Table 2 gives the results on role prediction with ground truth bounding boxes (i.e., for all entities which fill at least one semantic role). We report the accuracy for predicting the correct frame and role (columns fr.role), for predicting the correct frame (columns frame), and the correct role regardless of its frame (columns role; e.g., a prediction of STATEMENT.Speaker would be considered correct even if the reference was SPEAK ON TOPIC.Speaker). We further give results for the coarse frame types, where verbs are stripped off the frame labels (i.e., STATEMENT-speak is STATEMENT). Since the role prediction performance is equal for both frame types, we report the results for the fine-grained frames only.
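Top-k accuracy as used here can be sketched in a few lines. The sketch assumes that each test item comes with a list of predicted labels sorted by decreasing score; the function name is hypothetical.

```python
def top_k_accuracy(ranked_predictions, references, k):
    """Fraction of items whose reference label is among the k best predictions.

    ranked_predictions: one list of labels per item, best-scored first.
    references: the reference label of each item, in the same order.
    """
    if not references:
        return 0.0
    hits = sum(1 for ranked, ref in zip(ranked_predictions, references)
               if ref in ranked[:k])
    return hits / len(references)
```

The same function covers the fr.role, frame and role columns by changing what counts as a label, e.g., a (frame, role) pair, a frame alone, or a role label with its frame stripped.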
As Table 2 shows, the models which use participant representations extracted from the relevant image regions (ImgObject and ImgObjLoc) perform better than Image-only, which considers the global image only, except for the top-1 frame prediction. This indicates that the two models are able to learn useful role-specific visual representations. Contextual features in the form of the relative size and location of a region (cf. Equ. 5) seem to be beneficial as well, since ImgObjLoc yields the overall best results. These features are furthermore beneficial for role grounding in automatically selected bounding boxes: when using automatically selected boxes, ImgObjLoc is significantly more effective than ImgObject in all settings (rows props, right block in Table 3). The Random baseline, which assigns each role randomly to a box in the image, unsurprisingly performs worst.
Interestingly, the models perform substantially better on the reference set than on the noisy test set (top and bottom blocks in Tables 2, 3). 6 This indicates that they were able to generalize over wrong role-filler pairs in the training data, and are able to make correct predictions even for erroneous instances (see the qualitative analysis below). When assuming that the correct frame has been identified (columns gt fr.), the best role prediction accuracy reaches 70.3% on the reference set, and grounding accuracy with box proposals is at 35.5% (ImgObjLoc, Tables 2, 3, respectively).
6 The accuracy scores on the uncorrected instances in the reference set are comparable to or worse than those on the test set, except for the top-5 predicted frames.
Finally, frame prediction proves to be a difficult task, especially for fine-grained frame types (e.g., BEING LOCATED-sit; left block in Table 2).
Qualitative Analysis Notably, our analysis revealed that ImgObjLoc could correctly predict roles for cases in which PathLSTM failed, especially for highly visual entities (e.g., performance vs. location, goal vs. path). Overall, ImgObjLoc was often able to identify location roles which PathLSTM had missed, but may confuse the specific labels (e.g., area vs. path or location) for reasons discussed below. See Figure 6 for the recall of ImgObjLoc on the reference set for individual roles (top-20).
In an error analysis of the predictions of ImgObjLoc we identified several classes of errors. Typical errors in role prediction were in cases in which an image region contained multiple objects, and the system predicted a label for an object which was occluded by the target or vice versa (e.g., ingestibles vs. source; clothing vs. wearer or body part; path vs. area). We found that this error was propagated from noise in the training data. Table 4 shows the roles which were most difficult to predict by ImgObjLoc, and which the textual SRL system (PathLSTM) could predict with a high precision (top; as calculated from the human annotations, cf. Section 4), or with a low precision (bottom), respectively. As may be expected, among these are also highly non-visual roles, such as manner and purpose.
Other noise propagated from the training data was caused by wrong frame predictions of PathLSTM (e.g., TRAVERSE-pass instead of BRINGING-carry; CONTAINING-hold.contents vs. INGESTION.ingestibles). Frequent patterns of incorrect frame predictions were furthermore a failure of the system to distinguish between fine-grained frames (e.g., BEING LOCATED-sit vs. -lie or SELF MOTION-walk vs. -run), or between motion and non-motion actions (e.g., POSTURE vs. SELF MOTION).
Figure 6: Prediction recall of ImgObjLoc on the reference set for the top-20 roles, ordered by their frequency.
Finally, we observed that the reference often did not contain an actually valid frame which had been predicted by the system for an image, due to different levels of frame specificity, i.e., the output of ImgObjLoc was more specific (e.g., ASSISTANCE-help.helper vs. WEARING.wearer; OPERATE VEHICLE-ride.vehicle vs. PERCEPTION ACTIVE-look.location of perceiver) or it was more general (e.g., WEARING.wearer vs. IMPACT-hit.impactor).

Exp.2: Visual Verb Sense Disambiguation
We evaluate the effectiveness of the frame-semantic image representations that can be extracted with our ImgObjLoc model on the VerSe (visual Verb Sense disambiguation) dataset (Gella et al., 2018). It covers 90 verbs and 163 senses used to annotate 3,510 images. We follow the supervised method applied in Gella et al. (2018), divide VerSe into training and test data, and train logistic regression classifiers for sense prediction on 19 motion verbs and 19 non-motion verbs (those which have at least 20 images and at least 2 senses). Input to the sense classifiers are the frame-semantic image representations (second top cross-modal layer in Fig. 3) of the VerSe images, which we extract with the ImgObjLoc model, trained on Flickr30k Roles. Table 5 gives the mean accuracy obtained on the test data (of 100 runs). Our ImgObjLoc vectors outperform all comparison models on motion verbs, including CNN-based image features and the best-performing models of Gella et al. (2018), namely Gella-CNN+O and Gella-CNN+C (CNN features concatenated with predicted object labels and image captions, respectively). On non-motion verbs, the best models, including our own, perform only comparably to the most frequent sense heuristic. Note that we examine the simplest representation ImgObjLoc can yield, i.e., frame-semantic representations for individual images. More complex representations are left for future work. See Appendix A.3 for examples.
Table 5: Sense prediction accuracy for motion (left) and non-motion verbs (right) using different image representations. + marks results taken from Gella et al. (2018). MFS is the most frequent sense heuristic.
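For orientation, the most frequent sense (MFS) heuristic that serves as a baseline can be sketched as follows. The sketch assumes that the most frequent sense is determined per verb on the training portion; the data layout and function name are hypothetical.

```python
from collections import Counter

def mfs_accuracy(train, test):
    """Accuracy of predicting each verb's most frequent training sense.

    train, test: lists of (verb, sense) pairs, one per annotated image.
    ASSUMPTION: the most frequent sense is computed per verb on `train`.
    """
    counts = {}
    for verb, sense in train:
        counts.setdefault(verb, Counter())[sense] += 1
    mfs = {verb: c.most_common(1)[0][0] for verb, c in counts.items()}
    if not test:
        return 0.0
    correct = sum(1 for verb, sense in test if mfs.get(verb) == sense)
    return correct / len(test)
```

A strongly skewed sense distribution makes this baseline hard to beat, which is one plausible reading of the non-motion results.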

Conclusions
We addressed the task of grounding semantic roles of frames which an image evokes in the corresponding image regions of its fillers. We found that our model can be trained without the need for manual role annotations of image data, and that the frame-semantic image representations it learns can be used for related tasks. Encouraged by our findings, future work includes the exploration of the model and its learned frame-semantic representations for tasks such as the interpretation of multimodal scenes and stories and referring expressions.