Semantic Frames and Visual Scenes: Learning Semantic Role Inventories from Image and Video Descriptions

Frame-semantic parsing and semantic role labelling, which aim to automatically assign semantic roles to the arguments of verbs in a sentence, have become an active strand of research in NLP. However, to date these methods have relied on a predefined inventory of semantic roles. In this paper, we present a method to automatically learn argument role inventories for verbs from large corpora of text, images and videos. We evaluate the method against manually constructed role inventories in FrameNet and show that the visual model outperforms the language-only model and operates with high precision.


Introduction
The theory of frame semantics (Fillmore, 1976) postulates that our interpretation of word meanings is not limited to isolated concepts, but rather instantiates complex knowledge structures about events and their participants, known as semantic frames. For instance, the COMMERCIAL TRANSACTION frame includes elements such as a seller, a buyer, goods and money, which can be mapped to higher-level semantic roles such as agent, patient, instrument etc. The verbs linked to this frame are buy, sell, pay, cost and charge, each evoking different aspects of the frame.
This theory has been implemented in a lexical-semantic resource called FrameNet (Fillmore et al., 2003). Each semantic frame is encoded in FrameNet as a list of lexical units that evoke this frame (typically verbs) and the roles that their semantic arguments may take given the scenario represented by the frame. FrameNet has inspired two directions in NLP research, semantic role labelling (Gildea and Jurafsky, 2002; Màrquez et al., 2008) and frame-semantic parsing (Das et al., 2014), whose goal is to assign semantic roles to the arguments of the verbs in a sentence. However, these works point out the coverage limitations of the hand-constructed FrameNet database, suggesting that a data-driven frame acquisition method is needed to enable the integration of frame semantics into real-world NLP applications. In this paper, we propose such a method, experimenting with semantic frame induction from linguistic and visual data. Our system first clusters verb arguments to identify their possible semantic roles and then computes the level of association between a given argument role and the verb, thus deriving the structure of the semantic frame in which the verb participates.
Frame semantics emphasizes the relation between our lexical semantic knowledge and our experience in the world, suggesting that semantic frames are not merely a linguistic construct but also a result of our sensory-motor and perceptual experience. However, frame-semantic approaches in NLP typically rely on textual data. Our method, in contrast, induces semantic frames from both a text corpus and a corpus of tagged images and videos. We evaluate the method against hand-constructed frames in FrameNet. Our results show that the visual model outperforms the language-only model and achieves high precision. This frame induction method can be used to complement existing FrameNets or to construct a new resource of automatically mined semantic frames, free from manual annotation bias.

Experimental Data
Textual data. We extracted linguistic features for our model from the British National Corpus (BNC) (Burnard, 2007). We parsed the corpus using the RASP parser (Briscoe et al., 2006) and extracted subject-verb and verb-object relations from its dependency output. These relations were then used as features for clustering to obtain argument classes, which we then used as proxies for frame elements, i.e. argument roles.
Image and video data. We used the Yahoo! Webscope Flickr-100M dataset (Shamma, 2014) to extract visual relations between verbs and their arguments. Flickr-100M contains 99.3 million images and 0.7 million videos with natural language tags for scenes, objects and actions annotated by users. We first stem the tags and remove words that are absent from WordNet (e.g. named entities and misspellings). We then identify the part of speech of each tag based on its visual context using the method of Shutova et al. (2015) and extract verb-noun co-occurrences.
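The co-occurrence extraction step can be illustrated with a minimal sketch. It assumes that stemming and part-of-speech identification have already been applied upstream, so that each image or video arrives as a list of (tag, part-of-speech) pairs; the function name, the "v"/"n" labels and the small `vocab` set standing in for the WordNet lookup are all hypothetical and not taken from the original system.

```python
from collections import Counter
from itertools import product


def verb_noun_cooccurrences(tagged_items, vocabulary):
    """Count verb-noun pairs that co-occur in the tag set of one image or video.

    tagged_items: iterable of lists of (tag, pos) pairs, pos being "v" or "n"
                  (assumed to come from an upstream stemming and POS step)
    vocabulary:   set of known words; a stand-in for the WordNet filter
    """
    counts = Counter()
    for tags in tagged_items:
        # Lowercase, drop out-of-vocabulary tags, and deduplicate within the item.
        kept = {(t.lower(), pos) for t, pos in tags if t.lower() in vocabulary}
        verbs = sorted(t for t, pos in kept if pos == "v")
        nouns = sorted(t for t, pos in kept if pos == "n")
        # Every verb tag is paired with every noun tag of the same item.
        for verb, noun in product(verbs, nouns):
            counts[(verb, noun)] += 1
    return counts


# Toy data; "IMG_0412" plays the role of a tag that is absent from the vocabulary.
items = [
    [("ride", "v"), ("horse", "n"), ("beach", "n"), ("IMG_0412", "n")],
    [("ride", "v"), ("bike", "n")],
    [("eat", "v"), ("cake", "n"), ("Ride", "v"), ("horse", "n")],
]
vocab = {"ride", "eat", "horse", "beach", "bike", "cake"}
cooc = verb_noun_cooccurrences(items, vocab)
```

Pairing every verb tag with every noun tag of the same item is one simple reading of "co-occurrence"; the original extraction may have applied further filtering.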

Argument Clustering
We use a clustering method to obtain semantic classes of arguments of verbs, thus generalising from individual arguments to their semantic types, which correspond to frame roles. We obtain argument classes by means of spectral clustering of nouns with lexico-syntactic features, which has been shown to be effective in previous lexical classification tasks (Sun and Korhonen, 2009).
Spectral clustering partitions the data relying on a similarity matrix that records similarities between all pairs of data points. We use Jensen-Shannon divergence to measure similarity between the feature vectors of two nouns, $w_i$ and $w_j$, defined as follows:

$$d_{JS}(w_i, w_j) = \frac{1}{2} d_{KL}(w_i \,\|\, m) + \frac{1}{2} d_{KL}(w_j \,\|\, m),$$

where $d_{KL}$ is the Kullback-Leibler divergence and $m$ is the average of $w_i$ and $w_j$. We construct the similarity matrix $S$ by computing the similarities $S_{ij}$ as $S_{ij} = \exp(-d_{JS}(w_i, w_j))$. The matrix $S$ then encodes a similarity graph $G$ (over our nouns), where $S_{ij}$ are the adjacency weights. The clustering problem can then be defined as identifying the optimal partition, or cut, of the graph into clusters, such that the intra-cluster weights are high and the inter-cluster weights are low. We use the multiway normalized cut (MNCut) algorithm of Meila and Shi (2001) for this purpose. The algorithm transforms $S$ into a stochastic matrix $P$ containing transition probabilities between the vertices in the graph as $P = D^{-1} S$, where the degree matrix $D$ is a diagonal matrix with $D_{ii} = \sum_{j=1}^{N} S_{ij}$. It then computes the $K$ leading eigenvectors of $P$, where $K$ is the desired number of clusters. The graph is partitioned by finding approximately equal elements in the eigenvectors using a simpler clustering algorithm, such as k-means. Meila and Shi (2001) have shown that the partition $I$ derived in this way minimizes the MNCut criterion:

$$MNCut(I) = \sum_{k=1}^{K} \left(1 - P(I_k \to I_k \mid I_k)\right),$$

which is the sum of transition probabilities across different clusters. Since k-means starts from a random cluster assignment, we ran the algorithm multiple times and used the partition that minimizes the cluster distortion, i.e. the distances of the data points to their cluster centroids.
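To make these quantities concrete, the following sketch computes the Jensen-Shannon divergence, the similarity matrix S, the transition matrix P and the MNCut value of a given partition, using only the Python standard library. It deliberately leaves out the eigenvector and k-means steps, so it is not an implementation of the full algorithm; the function names and the toy vectors are illustrative assumptions. The within-cluster transition probability is computed with respect to the stationary distribution of the random walk, which is proportional to the vertex degrees.

```python
import math


def kl(p, q):
    """Kullback-Leibler divergence d_KL(p || q) for discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js(p, q):
    """Jensen-Shannon divergence: mean KL divergence to the average m of p and q."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def similarity_matrix(vectors):
    """S_ij = exp(-d_JS(w_i, w_j))."""
    return [[math.exp(-js(wi, wj)) for wj in vectors] for wi in vectors]


def transition_matrix(S):
    """P = D^{-1} S: each row of S divided by its row sum D_ii."""
    return [[s / sum(row) for s in row] for row in S]


def mncut(S, clusters):
    """Sum over clusters of the probability of leaving the cluster in one step."""
    degree = [sum(row) for row in S]
    total = 0.0
    for members in clusters:
        volume = sum(degree[i] for i in members)
        within = sum(S[i][j] for i in members for j in members)
        total += 1.0 - within / volume  # 1 - P(I_k -> I_k | I_k)
    return total


# Four toy nouns, each a normalized feature vector (illustrative values only).
vectors = [
    [0.7, 0.2, 0.1, 0.0],
    [0.6, 0.3, 0.1, 0.0],
    [0.0, 0.1, 0.2, 0.7],
    [0.0, 0.1, 0.3, 0.6],
]
S = similarity_matrix(vectors)
P = transition_matrix(S)
good = mncut(S, [[0, 1], [2, 3]])  # groups similar nouns together
bad = mncut(S, [[0, 2], [1, 3]])   # mixes dissimilar nouns
```

On this toy input the partition that groups similar nouns has the lower MNCut value, which is the behaviour the criterion is meant to reward.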
We clustered the 2,000 most frequent nouns in the BNC, using their grammatical relations as features. The features consisted of verb lemmas appearing in the subject, direct object and indirect object relations with the given nouns in the RASP-parsed BNC, indexed by relation type. The feature vectors were first constructed from the corpus counts, and subsequently normalized by the sum of the feature values.
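A minimal sketch of this feature construction is given below. It assumes the parser output has already been reduced to (noun, relation type, verb lemma) triples; the relation labels "subj" and "dobj" and the function name are hypothetical and do not reproduce RASP's actual output format.

```python
from collections import Counter, defaultdict


def build_feature_vectors(relations):
    """Build normalized noun feature vectors from dependency triples.

    relations: iterable of (noun, relation_type, verb_lemma) triples
    Returns the ordered feature list and a {noun: vector} mapping.
    """
    counts = defaultdict(Counter)
    for noun, rel, verb in relations:
        # Features are verb lemmas indexed by relation type.
        counts[noun][(rel, verb)] += 1
    features = sorted({f for c in counts.values() for f in c})
    vectors = {}
    for noun, c in counts.items():
        total = sum(c.values())
        # Normalize the raw counts by the sum of the feature values.
        vectors[noun] = [c[f] / total for f in features]
    return features, vectors


# Toy triples (illustrative only).
relations = [
    ("dog", "subj", "bark"),
    ("dog", "subj", "bark"),
    ("dog", "dobj", "feed"),
    ("cat", "dobj", "feed"),
    ("cat", "subj", "sleep"),
]
features, vectors = build_feature_vectors(relations)
```

Because each vector sums to one, it can be treated as a probability distribution, which is what the Jensen-Shannon divergence used for clustering requires.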
Our use of linguistic dependency features for argument clustering is motivated by previous research (Sun and Korhonen, 2011; Shutova et al., 2015), which has shown that such features lead to clusters of nouns belonging to the same semantic type, as opposed to the same topic or scene, as is the case with linguistic window-based features or image-derived features (Shutova et al., 2015). Since the argument roles in semantic frames correspond to semantic types (such as location or instrument), the linguistic dependency features are best suited to generalise the predicate-argument structure in semantic frames. Example clusters produced by our method are shown in Fig. 1. The resulting clusters represent frame elements, i.e. argument roles, in our model.

Predicate-Argument Association
We then use the verb-noun co-occurrence information extracted from the visual data to quantify the strength of association of a given verb with each of the argument classes, thus identifying the relevant argument roles for the verb. We adopt an information-theoretic measure originally proposed by Resnik (1993) in his selectional preference model. Resnik first measures the selectional preference strength (SPS) of a verb $v$ as the Kullback-Leibler divergence between the distribution of argument classes given the verb, $P(c|v)$, and the prior distribution of argument classes, $P(c)$:

$$SPS(v) = \sum_{c} P(c|v) \log \frac{P(c|v)}{P(c)}.$$

SPS measures how strongly the predicate constrains its arguments. Selectional association of the verb with a particular argument class is then defined as the relative contribution of that argument class to the overall SPS of the verb:

$$A(v, c) = \frac{1}{SPS(v)} P(c|v) \log \frac{P(c|v)}{P(c)}.$$

We use this measure to quantify the strength of verb-argument association based on the visual co-occurrence information. We extract verb-noun co-occurrences from Flickr-100M, map the nouns to argument classes and quantify the selectional association of a given verb with each argument class, thus acquiring its semantic frame structure. An example argument distribution for the verb kill, and thus the KILLING frame, is presented in Fig. 2. One can see from the figure that the argument clusters correspond to specific roles in FrameNet, e.g. the killer and the victim, the motive, the weapon (instrument) and death (result).

Figure 1: Example clusters of nouns produced by our method.
official officer inspector journalist detective constable police policeman reporter
fire pipe torch candle lamp cigarette
potato apple slice food cake meat bread fruit
lifetime quarter period century succession stage generation decade phase interval future
disorder infection illness disease virus cancer
profit surplus earnings income turnover revenue
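The selectional association computation can be sketched as follows. This is an illustration under stated assumptions rather than the original implementation: the prior P(c) is estimated from the co-occurrences of all verbs, nouns without a cluster are ignored, and the function name, toy counts and class labels are hypothetical.

```python
import math
from collections import Counter


def selectional_association(cooc, noun_to_class, verb):
    """Return (SPS, {class: association}) for one verb, following Resnik (1993).

    cooc:          {(verb, noun): count} verb-noun co-occurrence counts
    noun_to_class: {noun: class id}, e.g. from the argument clustering step
    """
    class_counts = Counter()       # counts over all verbs, used for P(c)
    verb_class_counts = Counter()  # counts with this verb, used for P(c|v)
    for (v, noun), n in cooc.items():
        c = noun_to_class.get(noun)
        if c is None:
            continue  # noun not covered by any cluster
        class_counts[c] += n
        if v == verb:
            verb_class_counts[c] += n
    total = sum(class_counts.values())
    verb_total = sum(verb_class_counts.values())
    if verb_total == 0:
        raise ValueError(f"no clustered arguments observed for verb {verb!r}")
    # Per-class terms of the KL divergence between P(c|v) and P(c).
    terms = {}
    for c, n in verb_class_counts.items():
        p_c_given_v = n / verb_total
        p_c = class_counts[c] / total
        terms[c] = p_c_given_v * math.log(p_c_given_v / p_c)
    sps = sum(terms.values())
    if sps == 0:
        # The verb does not constrain its arguments at all.
        return 0.0, {c: 0.0 for c in terms}
    return sps, {c: t / sps for c, t in terms.items()}


# Toy counts and classes (illustrative only).
cooc = {
    ("kill", "soldier"): 6, ("kill", "gun"): 3, ("kill", "cake"): 1,
    ("eat", "cake"): 8, ("eat", "soldier"): 2,
}
noun_to_class = {"soldier": "person", "gun": "weapon", "cake": "food"}
sps, assoc = selectional_association(cooc, noun_to_class, "kill")
ranked = sorted(assoc, key=assoc.get, reverse=True)
```

Ranking the classes by association yields the kind of per-verb argument distribution from which the highest-ranked verb-argument class pairings can be read off. Association scores sum to one by construction, and a class can receive a negative score when it is less likely with the verb than under the prior.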

Evaluation against FrameNet
Baseline. We evaluate the effectiveness of visual information for our task by comparing the model based on vision and language (VIS) to a baseline model using language alone (LING). In the LING system, the predicate-argument association scores are computed from verb-argument co-occurrence information extracted from verb-subject, verb-direct object and verb-indirect object relations in the BNC. In the case of indirect object relations, the accompanying prepositions were discarded and the noun counts were aggregated.
A verb was considered concrete if its concreteness score (Wilson, 1988) was ≥ 400 and abstract if it was < 400. We extracted the 10 highest-ranked verb-argument class pairings produced by the system for each verb. Each pairing was then evaluated against the argument roles listed for this verb in FrameNet via manual comparison. This resulted in a dataset of 500 verb-argument pairings for VIS and 500 for LING. A pairing was considered correct if the argument cluster corresponded to the semantic type of the role listed in FrameNet and contained nouns listed in the linguistic examples (if these were provided in FrameNet). We evaluated the system performance in terms of precision at the top 10 argument classes and recall of the Core Frame Elements (FEs) among the top 10 argument classes.
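The two evaluation measures can be sketched for a single verb as follows. The sketch assumes the manual judgements are available as a ranked list of booleans and that each correct pairing has been mapped to a role name; the function names, the judgements and the role labels are illustrative and are not copied from FrameNet or from the original evaluation data. How scores were aggregated across verbs is not specified here, so the sketch stops at the per-verb level.

```python
def precision_at_k(judgements, k=10):
    """Fraction of the top-k ranked pairings judged correct.

    judgements: list of booleans in rank order, True if the pairing was correct
    """
    top = judgements[:k]
    return sum(top) / len(top) if top else 0.0


def core_fe_recall(matched_roles, core_roles):
    """Fraction of a verb's core frame elements matched by some top-k cluster."""
    core = set(core_roles)
    return len(core & set(matched_roles)) / len(core) if core else 0.0


# Hypothetical judgements for the top 10 argument classes of one verb.
judgements = [True, True, False, True, True, True, False, True, True, False]
precision = precision_at_k(judgements)

# Hypothetical role names: roles matched by the top 10 clusters versus the
# roles treated as core for this verb.
recall = core_fe_recall(
    matched_roles=["Killer", "Victim", "Time"],
    core_roles=["Killer", "Victim", "Instrument"],
)
```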

Results
The VIS model attained P = 0.74 and R = 0.78, outperforming the LING model, which attained P = 0.72 and R = 0.76. When evaluated on the subsets of concrete and abstract verbs separately, VIS attains P = 0.76 and R = 0.80 on concrete verbs and P = 0.72 and R = 0.75 on abstract verbs, while LING attains P = 0.67 and R = 0.75 on concrete verbs and P = 0.78 and R = 0.76 on abstract verbs.

Discussion and Data Analysis
Our results show that the vision-based model outperforms the language-only model on our dataset. The difference in performance is particularly pronounced for the concrete verbs. For the abstract verbs in isolation, however, LING attains higher precision and recall. This is not surprising, as visual information is better suited to capturing the properties of concrete concepts than those of abstract ones (Kiela et al., 2014). However, our results indicate that integrating linguistic and visual information provides a better overall model than linguistic information alone.
Our qualitative analysis of the data revealed a number of interesting trends. Some of the errors of both systems can be traced back to the clustering step. Different FrameNet argument roles are sometimes found in one cluster. For instance, both the killer and the victim are in the same cluster, as shown in Figure 2. Conversely, one FrameNet role can be split into several clusters, e.g. the Victim role in the KILLING frame is represented by two clusters, one of humans and one of animate beings more generally.
A common error of the LING model concerns frame mixing, i.e. both literal and metaphorical arguments of the verb are present in the output. For instance, eat has a disease cluster as one of its arguments; however, disease is not part of the INGESTION frame, but rather an instance of its metaphorical transfer. A common trend in the LING output is that it is dominated by the Agent and Theme roles, with situational roles (e.g. Location) typically ranked lower or not appearing at all. In contrast, the output of VIS encompasses a range of situational roles, such as Instrument, Location, Time etc. The two models also sometimes differ in the roles that they identify. For instance, for the verb risk the VIS output is dominated by arguments of type Asset and the LING output by arguments related to the Bad outcome role in FrameNet.
Related Work

Semantic Role Induction
Approaches most similar in spirit to ours are those concerned with unsupervised semantic role labeling. A number of methods represent semantic roles as latent variables in a graphical model that relates the verb, its semantic roles and their syntactic realisations (Grenager and Manning, 2006; Lang and Lapata, 2010; Garg and Henderson, 2012). The induction process then relies on inferring the state of the latent variable. Other researchers adopted a similarity-based argument clustering framework to derive semantic roles. The investigated methods include graph partitioning algorithms (Lang and Lapata, 2014), Bayesian clustering based on the Chinese Restaurant Process (Titov and Klementiev, 2012) and integer linear programming to incorporate semantic and structural constraints during clustering (Woodsend and Lapata, 2015). Titov and Khoddam (2015) proposed a reconstruction-error minimization approach using a log-linear model to predict roles given syntactic and lexical features and a probabilistic tensor factorization model to identify argument fillers based on the role predictions and the predicate. To the best of our knowledge, ours is the first approach to this task exploiting visual data, in the form of image and video descriptions.

Multi-modal Methods in Semantics
Visual data has previously been used to learn meaning representations that project multiple modalities into the same vector space. Semantic models integrating linguistic and visual information have been shown to be successful in tasks such as modeling semantic similarity and relatedness (Silberer and Lapata, 2014; Bruni et al., 2012), lexical entailment (Kiela et al., 2015a), compositionality (Roller and Schulte im Walde, 2013), bilingual lexicon induction (Kiela et al., 2015b) and metaphor identification (Shutova et al., 2016).
Other applications of multimodal data include language modeling (Kiros et al., 2014) and knowledge mining from images (Chen et al., 2013; Divvala et al., 2014). Young et al. (2014) show that large collections of image captions can be exploited for entailment tasks. Shutova et al. (2015) used image and video descriptions to induce verb selectional preferences enhanced with visual information.

Conclusion
We have presented a method for semantic frame induction from text, images and videos and shown that it operates with high precision and recall. Although our experiments relied on manually annotated tags for images and videos, recent research shows that such tags can be generated automatically (Bernardi et al., 2016). In the future, our model can be applied to such automatically generated tags, reducing its dependence on manual annotation. While our current experiments focused on nominal arguments of verbs for semantic role identification, in principle our model can be applied to other parts of speech, e.g. adverbs, to better incorporate argument roles such as Manner.