Cross-media Structured Common Space for Multimedia Event Extraction

We introduce a new task, MultiMedia Event Extraction, which aims to extract events and their arguments from multimedia documents. We develop the first benchmark and collect a dataset of 245 multimedia news articles with extensively annotated events and arguments. We propose a novel method, Weakly Aligned Structured Embedding (WASE), that encodes structured representations of semantic information from textual and visual data into a common embedding space. The structures are aligned across modalities by employing a weakly supervised training strategy, which enables exploiting available resources without explicit cross-media annotation. Compared to uni-modal state-of-the-art methods, our approach achieves 4.0% and 9.8% absolute F-score gains on text event argument role labeling and visual event extraction. Compared to state-of-the-art multimedia unstructured representations, we achieve 8.3% and 5.0% absolute F-score gains on multimedia event extraction and argument role labeling, respectively. By utilizing images, we extract 21.4% more event mentions than traditional text-only methods.


Introduction
Traditional event extraction methods target a single modality, such as text (Wadden et al., 2019), images (Yatskar et al., 2016) or videos (Ye et al., 2015; Caba Heilbron et al., 2015; Soomro et al., 2012). However, the practice of contemporary journalism (Stephens, 1998) distributes news via multimedia. By randomly sampling 100 multimedia news articles from the Voice of America (VOA), we find that 33% of the images in the articles contain visual objects that serve as event arguments and are not mentioned in the text. Take Figure 1 as an example: we can extract the Agent and Person arguments of the Movement.Transport event from the text, but can extract the Vehicle argument only from the image. Nevertheless, event extraction is studied independently in Computer Vision (CV) and Natural Language Processing (NLP), with major differences in task definition, data domain, methodology, and terminology. Motivated by the complementary and holistic nature of multimedia data, we propose MultiMedia Event Extraction (M²E²), a new task that aims to jointly extract events and arguments from multiple modalities. We construct the first benchmark and evaluation dataset for this task, which consists of 245 fully annotated news articles.
We propose the first method, Weakly Aligned Structured Embedding (WASE), for extracting events and arguments from multiple modalities. Complex event structures have not been covered by existing multimedia representation methods (Wu et al., 2019b; Faghri et al., 2017; Karpathy and Fei-Fei, 2015), so we propose to learn a structured multimedia embedding space. More specifically, given a multimedia document, we represent each image or sentence as a graph, where each node represents an event or entity and each edge represents an argument role. The node and edge embeddings live in a common multimedia semantic space, as they are trained to resolve event coreference across modalities and to match images with relevant sentences. This enables us to jointly classify events and argument roles from both modalities. A major challenge is the lack of multimedia event argument annotations, which are costly to obtain due to the annotation complexity. Therefore, we propose a weakly supervised framework, which takes advantage of annotated uni-modal corpora to separately learn visual and textual event extraction, and uses an image-caption dataset to align the modalities.
We evaluate WASE on the new task of M²E². Compared to state-of-the-art uni-modal methods and multimedia flat representations, our method significantly outperforms on both the event extraction and argument role labeling tasks in all settings. Moreover, it extracts 21.4% more event mentions than text-only baselines. Training and evaluation are done on heterogeneous datasets from multiple sources, domains and data modalities, demonstrating the scalability and transferability of the proposed model. In summary, this paper makes the following contributions:
• We propose a new task, MultiMedia Event Extraction, and construct the first annotated news dataset as a benchmark to support deep analysis of cross-media events.
• We develop a weakly supervised training framework, which utilizes existing single-modal annotated corpora, and enables joint inference without cross-modal annotation.
• Our proposed method, WASE, is the first to leverage structured representations and graph-based neural networks for multimedia common space embedding.
2 Task Definition

Problem Formulation
Each input document consists of a set of images M = {m_1, m_2, ...} and a set of sentences S = {s_1, s_2, ...}. Each sentence s can be represented as a sequence of tokens s = (w_1, w_2, ...), where w_i is a token from the document vocabulary W. The input also includes a set of entities T = {t_1, t_2, ...} extracted from the document text. An entity is an individually unique object in the real world, such as a person, an organization, a facility, a location, a geopolitical entity, a weapon, or a vehicle. The objective of M²E² is twofold:

Event Extraction: Given a multimedia document, extract a set of event mentions, where each event mention e has a type y_e and is grounded on a text trigger word w, an image m, or both, i.e., e = (y_e, {w, m}).
Note that for an event, w and m can both exist, which means the visual event mention and the textual event mention refer to the same event. For example, in Figure 1, deploy indicates the same Movement.Transport event as the image. We consider the event e a text-only event if it only has a textual mention w, an image-only event if it only has a visual mention m, and a multimedia event if both w and m exist.
Argument Extraction: The second task is to extract a set of arguments of event mention e. Each argument a has an argument role type y_a, and is grounded on a text entity t or an image object o (represented as a bounding box), or both, i.e., a = (y_a, {t, o}).
The arguments of visual and textual event mentions are merged if they refer to the same real-world event, as shown in Figure 1.
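The problem formulation above maps naturally onto a small set of containers. The sketch below is illustrative only; the class and field names are ours, not from the benchmark release:

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical container types for M2E2 mentions (names are illustrative).

@dataclass
class Argument:
    role: str                                        # argument role type y_a
    entity: Optional[str] = None                     # grounding text entity t
    box: Optional[Tuple[int, int, int, int]] = None  # image object o as (x1, y1, x2, y2)

@dataclass
class EventMention:
    event_type: str                  # event type y_e, e.g. "Movement.Transport"
    trigger: Optional[str] = None    # text trigger word w (None for image-only events)
    image_id: Optional[str] = None   # image m (None for text-only events)
    arguments: List[Argument] = field(default_factory=list)

    @property
    def modality(self) -> str:
        # text-only / image-only / multimedia, per the task definition
        if self.trigger and self.image_id:
            return "multimedia"
        return "text-only" if self.trigger else "image-only"

e = EventMention("Movement.Transport", trigger="deploy", image_id="voa_001.jpg",
                 arguments=[Argument("Vehicle", box=(10, 20, 200, 180))])
print(e.modality)  # -> multimedia
```

A mention with only a trigger (or only an image) is classified accordingly, mirroring the three event categories in Section 2.1.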

The M²E² Dataset
We define multimedia newsworthy event types by exhaustively mapping between the event ontology used in the NLP community for the news domain (ACE) and the event ontology used in the CV community for the general domain (imSitu (Yatskar et al., 2016)). These cover the largest event training resources in each community. Table 1 shows the complete intersection, which contains 8 ACE types (i.e., 24% of all ACE types), mapped to 98 imSitu types (i.e., 20% of all imSitu types). We expand the ACE event role set by adding visual arguments from imSitu, such as instrument, bolded in Table 1. This set encompasses 52% of ACE events in a news corpus, which indicates that the selected eight types are salient in the news domain. We reuse these existing ontologies because they enable us to train event and argument classifiers for both modalities without requiring joint multimedia event annotation as training data. We collect 108,693 multimedia news articles from the Voice of America (VOA) website (2006-2017), covering a wide range of newsworthy topics such as military, economy and health. We select 245 documents as the annotation set based on three criteria: (1) Informativeness: articles with more event mentions; (2) Illustration: articles with more images (> 4); (3) Diversity: articles that balance the event type distribution regardless of true frequency. The data statistics are shown in Table 2. Among all of these events, 192 textual event mentions and 203 visual event mentions can be aligned as 309 cross-media event mention pairs. The dataset can be divided into 1,105 text-only event mentions, 188 image-only event mentions, and 395 multimedia event mentions. We follow the ACE event annotation guidelines (Walker et al., 2006) for textual event and argument annotation, and design an annotation guideline for multimedia event annotation.
One unique challenge in multimedia event annotation is localizing visual arguments in complex scenarios, where images include a crowd of people or a group of objects. It is hard to delineate each of them using a bounding box. To solve this problem, we define two types of bounding boxes: (1) union bounding box: for each role, we annotate the smallest bounding box covering all constituents; and (2) instance bounding box: for each role, we annotate a set of bounding boxes, where each box is the smallest region that covers an individual participant (e.g., one person in a crowd), following the VOC2011 Annotation Guidelines. Figure 2 shows an example. Eight NLP and CV researchers complete the annotation work in two independent passes and reach an Inter-Annotator Agreement (IAA) of 81.2%. Two expert annotators perform adjudication.

Approach Overview
As shown in Figure 3, the training phase contains three tasks: text event extraction (Section 3.2), visual situation recognition (Section 3.3), and cross-media alignment (Section 3.4). We learn a cross-media shared encoder, a shared event classifier, and a shared argument classifier. In the testing phase (Section 3.5), given a multimedia news article, we encode the sentences and images into the structured common space, and jointly extract textual and visual events and arguments, followed by cross-modal coreference resolution.

Text Event Extraction
Text Structured Representation: As shown in Figure 4, we choose Abstract Meaning Representation (AMR) (Banarescu et al., 2013) to represent text because it includes a rich set of 150 fine-grained semantic roles. To encode each sentence, we run the CAMR parser (Wang et al., 2015b) to generate an AMR graph, based on the named entity recognition and part-of-speech (POS) tagging results from Stanford CoreNLP. To represent each word w in a sentence s, we concatenate its

Figure 3: Approach overview. During training (left), we jointly train three tasks to establish a cross-media structured embedding space. During test (right), we jointly extract events and arguments from multimedia articles.
pre-trained GloVe word embedding (Pennington et al., 2014), POS embedding, entity type embedding and position embedding. We then input the word sequence to a bi-directional long short-term memory (Bi-LSTM) (Graves et al., 2013) network to encode the word order and obtain the representation of each word w. Given the AMR graph, we apply a Graph Convolutional Network (GCN) (Kipf and Welling, 2016) to encode the graph contextual information, following (Liu et al., 2018a):

$$ w_i^{(k+1)} = f\Big( \sum_{j \in N(i)} g_{ij} \big( W_{E(i,j)}^{(k)} w_j^{(k)} + b_{E(i,j)}^{(k)} \big) \Big), \quad (1) $$

where N(i) is the set of neighbor nodes of w_i in the AMR graph, E(i, j) is the edge type between w_i and w_j, g_ij is the gate following (Liu et al., 2018a), k is the GCN layer index, and f is the Sigmoid function. W and b denote parameters of neural layers throughout this paper. We take the hidden states of the last GCN layer for each word as the common-space representation w^C, where C stands for the common (multimedia) embedding space. For each entity t, we obtain its representation t^C by averaging the embeddings of its tokens.

Event and Argument Classifier: We classify each word w into event types y_e (we use the BIO tag schema to determine trigger word boundaries, i.e., the prefix B- marks the beginning of a trigger, I- the inside, and O none) and classify each entity t into argument role y_a:

$$ P(y_e \mid w) = \mathrm{softmax}\big( W_e w^C + b_e \big), \qquad P(y_a \mid t, w) = \mathrm{softmax}\big( W_a [w^C; t^C] + b_a \big). \quad (2) $$

We take ground-truth text entity mentions as input during training, following (Ji and Grishman, 2008), and obtain entity mentions at test time using a named entity extractor (Lin et al., 2019).
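A minimal NumPy sketch of one gated, edge-type-aware GCN layer in the spirit of Equation 1. The dimensions, the scalar-gate form, and the random initialization are illustrative assumptions, not the authors' implementation:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_gcn_layer(h, edges, W, b, w_gate):
    """One gated GCN layer over an AMR graph (sketch of Eq. 1).
    h: (num_nodes, dim) node states; edges: (i, j, edge_type) triples;
    W: (num_edge_types, dim, dim) per-edge-type weights; b: matching biases;
    w_gate: (2 * dim,) weights of an assumed scalar gate g_ij."""
    out = np.zeros_like(h)
    for i, j, e in edges:
        msg = h[j] @ W[e] + b[e]                            # edge-type-specific transform
        g = sigmoid(np.concatenate([h[i], h[j]]) @ w_gate)  # gate g_ij
        out[i] += g * msg
    return sigmoid(out)                                     # f: sigmoid nonlinearity

dim, n_types = 8, 3
h = rng.normal(size=(5, dim))                    # 5 AMR nodes
W = rng.normal(scale=0.1, size=(n_types, dim, dim))
b = np.zeros((n_types, dim))
w_gate = rng.normal(scale=0.1, size=2 * dim)
edges = [(0, 1, 0), (0, 2, 1), (1, 3, 2)]        # (node, neighbor, edge-type)
h_next = gated_gcn_layer(h, edges, W, b, w_gate)
```

Stacking a few such layers (the paper uses 3, per Section 4) and reading off the last layer's states yields the common-space representations w^C.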

Image Event Extraction
Image Structured Representation: To obtain image structures similar to AMR graphs, and inspired by situation recognition (Yatskar et al., 2016), we represent each image with a situation graph, which is a star-shaped graph as shown in Figure 4: the central node is labeled with a verb v (e.g., destroying), and the neighbor nodes are arguments labeled as {(n, r)}, where n is a noun (e.g., ship) derived from WordNet synsets (Miller, 1995) to indicate the entity type, and r indicates the role (e.g., item) played by the entity in the event, based on FrameNet (Fillmore et al., 2003). We develop two methods to construct situation graphs from images, and train them on the imSitu dataset (Yatskar et al., 2016) as follows.
(1) Object-based Graph: Similar to extracting entities to obtain candidate arguments, we employ an object detector (a Faster R-CNN trained with bounding box annotations) to detect a set of objects {o_i} in each image m, and embed the image and each object to obtain a predicted verb embedding m̂ and predicted noun embeddings ô_i. We compare the predicted verb embedding to all verbs v in the imSitu taxonomy in order to classify the verb, and similarly compare each predicted noun embedding to all imSitu nouns n, which results in the probability distributions:

$$ P(v \mid m) = \frac{\exp\big( \hat{m}^\top v \big)}{\sum_{v'} \exp\big( \hat{m}^\top v' \big)}, \qquad P(n_i \mid o_i) = \frac{\exp\big( \hat{o}_i^\top n_i \big)}{\sum_{n'} \exp\big( \hat{o}_i^\top n' \big)}, $$

where v and n are word embeddings initialized with GloVe (Pennington et al., 2014). We use another MLP with one hidden layer followed by a Softmax (σ) to classify the role r_i for each object o_i:

$$ P(r_i \mid o_i) = \sigma\big( \mathrm{MLP}\big( [\hat{m}; \hat{o}_i] \big) \big). $$

Given the verb v* and role-noun (r_i*, n_i*) annotations for an image (from the imSitu corpus), we define the situation loss functions:

$$ \mathcal{L}_{sit} = -\log P(v^* \mid m) \; - \; \sum_i \big( \log P(r_i^* \mid o_i) + \log P(n_i^* \mid o_i) \big). $$

(2) Attention-based Graph: State-of-the-art object detection methods cover only a limited set of object types, such as the 600 types defined in Open Images. Many salient objects such as bomb, stone and stretcher are not covered by these ontologies. Hence, we propose an open-vocabulary alternative to the object-based graph construction model. To this end, we construct a role-driven attention graph, where each argument node is derived from a spatially distributed attention (heatmap) conditioned on a role r. More specifically, we use a VGG-16 CNN to extract a 7×7 convolutional feature map for each image m, which can be regarded as attention keys k_i for the 7×7 local regions. Next, for each role r defined in the situation recognition ontology (e.g., agent), we build an attention query vector q_r by concatenating the role embedding r with the image feature m as context and applying a fully connected layer:

$$ q_r = W_q \, [r; m] + b_q. $$

Then, we compute the dot product of each query with all keys, followed by a Softmax, which forms a heatmap h on the image, i.e.,

$$ h_{r,i} = \frac{\exp\big( q_r^\top k_i \big)}{\sum_j \exp\big( q_r^\top k_j \big)}. $$
We use the heatmap to obtain a weighted average of the feature map to represent the argument o_r of each role r in the visual space:

$$ o_r = \sum_i h_{r,i} \, k_i. $$

Similar to the object-based model, we embed o_r to ô_r and compare it to the imSitu noun embeddings to define a distribution and a classification loss function. The verb embedding m̂, the verb prediction probability P(v|m) and its loss are defined in the same way as in the object-based method.

Event and Argument Classifier: We use either the object-based or attention-based formulation and pre-train it on the imSitu dataset (Yatskar et al., 2016). Then we apply a GCN to obtain the structured embedding of each node in the common space, similar to Equation 1. This yields m^C and o_i^C. We use the same classifiers as defined in Equation 2 to classify each visual event and argument using the common-space embeddings:

$$ P(y_e \mid m) = \mathrm{softmax}\big( W_e m^C + b_e \big), \qquad P(y_a \mid o_i, m) = \mathrm{softmax}\big( W_a [m^C; o_i^C] + b_a \big). \quad (3) $$
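The role-driven attention construction can be sketched as follows. The feature dimensions and the projection W_q are illustrative stand-ins for the VGG-16 feature map and fully connected layer described above:

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def role_attention_argument(feature_map, role_emb, image_emb, W_q):
    """Build a query from the role embedding and global image feature,
    attend over the 7x7 feature map, and return the heatmap plus the
    weighted-average argument feature o_r (a sketch; weights illustrative)."""
    keys = feature_map.reshape(-1, feature_map.shape[-1])  # 49 local-region keys k_i
    query = np.concatenate([role_emb, image_emb]) @ W_q    # fully connected layer
    heat = softmax(keys @ query)                           # attention heatmap h
    o_r = heat @ keys                                      # weighted average of keys
    return heat.reshape(feature_map.shape[:2]), o_r

rng = np.random.default_rng(0)
d = 16
fmap = rng.normal(size=(7, 7, d))     # stand-in for the 7x7 conv feature map
role = rng.normal(size=d)             # embedding of role r (e.g. "agent")
img = fmap.mean(axis=(0, 1))          # global image context feature
W_q = rng.normal(scale=0.1, size=(2 * d, d))
heat, o_r = role_attention_argument(fmap, role, img, W_q)
```

Because the heatmap is a softmax over all 49 regions, each role's argument feature is an open-vocabulary, spatially grounded average rather than a detector crop.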

Cross-Media Joint Training
In order to share the event and argument classifiers across modalities, the image and text graphs should be encoded into the same space. However, it is extremely costly to obtain parallel text and image event annotation. Hence, we use event and argument annotations in separate modalities (i.e., the ACE and imSitu datasets) to train the classifiers, and simultaneously use VOA news image-caption pairs to align the two modalities. To this end, we learn to embed the nodes of each image graph close to the nodes of the corresponding caption graph, and far from those in irrelevant caption graphs. Since there is no ground-truth alignment between image nodes and caption nodes, we use image-caption pairs for weakly supervised training, to learn a soft alignment from each word to the image objects and vice versa:
$$ \alpha_{ij} = \frac{\exp\big( w_i^{C\top} o_j^{C} \big)}{\sum_{j'} \exp\big( w_i^{C\top} o_{j'}^{C} \big)}, \quad (4) $$

where w_i indicates the i-th word in caption sentence s and o_j represents the j-th object of image m. Then, we compute a weighted average of the softly aligned nodes for each node in the other modality, i.e.,

$$ w_i' = \sum_j \alpha_{ij} \, o_j^C. $$

We define the alignment cost of the image-caption pair as the Euclidean distance between each node and its aligned representation,

$$ d(s, m) = \sum_i \big\| w_i^C - w_i' \big\|_2. $$

We use a triplet loss to pull relevant image-caption pairs close while pushing irrelevant ones apart:

$$ \mathcal{L}_{align} = \max\big( 0, \; 1 + d(s, m) - d(s, m^-) \big), $$

where m^- is a randomly sampled negative image that does not match s. Note that in order to learn the alignment between the image and the trigger word, we treat the image itself as a special object when learning the cross-media alignment. The common space enables the event and argument classifiers to share weights across modalities and to be trained jointly on the ACE and imSitu datasets, by minimizing the overall objective

$$ \mathcal{L} = \mathcal{L}_{ACE} + \mathcal{L}_{imSitu} + \mathcal{L}_{align}. $$
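A compact sketch of the weakly supervised alignment objective: softmax soft alignment of caption words to image objects, a Euclidean alignment cost, and a triplet loss over a negative image. The margin value and the exact cost form are assumptions for illustration:

```python
import numpy as np

def soft_align(words, objects):
    """Softly align each caption word to image objects and return the
    aligned representations w'_i. words: (n_w, d); objects: (n_o, d)."""
    scores = words @ objects.T                          # (n_w, n_o) dot products
    attn = np.exp(scores - scores.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)             # softmax over objects
    return attn @ objects                               # aligned w'_i

def alignment_cost(words, objects):
    # Euclidean distance between each word and its softly aligned representation
    return float(np.linalg.norm(words - soft_align(words, objects), axis=1).sum())

def triplet_loss(words, pos_objects, neg_objects, margin=1.0):
    # pull the matching image close, push a random negative image apart
    return max(0.0, margin + alignment_cost(words, pos_objects)
                           - alignment_cost(words, neg_objects))

rng = np.random.default_rng(0)
words = rng.normal(size=(4, 8))   # caption-graph node embeddings
pos = rng.normal(size=(3, 8))     # matching image's object embeddings
neg = rng.normal(size=(5, 8))     # a randomly sampled negative image
loss = triplet_loss(words, pos, neg)
```

No word-to-object alignment labels are required; the triplet only needs to know which image belongs to which caption, which is exactly the weak supervision the VOA pairs provide.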

Cross-Media Joint Inference
In the test phase, our method takes a multimedia document with sentences S = {s_1, s_2, ...} and images M = {m_1, m_2, ...} as input. We first generate the structured common embedding for each sentence and each image, and then compute the pairwise similarities ⟨s, m⟩. We pair each sentence s with the closest image m, and aggregate the features of each word of s with the aligned representation from m by weighted averaging:

$$ w_i'' = \frac{ w_i^C + \gamma \, w_i' }{ 1 + \gamma }, $$

where γ = exp(−⟨s, m⟩) and w_i' is derived from m using Equation 4. We use w_i'' to classify each

Table 3: Event and argument extraction results (%). We compare three categories of baselines in three evaluation settings. The main contribution of the paper is joint training and joint inference on multimedia data (bottom right).
word into an event type and each entity into an argument role with the multimedia classifiers in Equation 2, where t_i'' is defined analogously to w_i'' using t_i and t_i'. Similarly, for each image m we find the closest sentence s, compute the aggregated multimedia features m'' and o_i'', and feed them into the shared classifiers (Equation 3) to predict visual events and argument roles. Finally, we corefer cross-media events of the same event type if the similarity ⟨s, m⟩ is above a threshold.
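The test-time aggregation and coreference decision can be sketched as below. The averaging form for w''_i and the threshold value are hedged reconstructions rather than the authors' exact choices:

```python
import numpy as np

def multimedia_features(w_common, w_aligned, sim):
    """w''_i: combine a word's common-space embedding with its aligned
    cross-media representation, weighted by gamma = exp(-<s, m>) as in
    Section 3.5 (the averaging form itself is an assumption)."""
    gamma = np.exp(-sim)
    return (w_common + gamma * w_aligned) / (1.0 + gamma)

def corefer(text_event_type, visual_event_type, sim, threshold=0.6):
    """Declare a textual and a visual event mention coreferent when their
    types match and the sentence-image similarity clears a threshold
    (the 0.6 value is illustrative)."""
    return text_event_type == visual_event_type and sim > threshold

w = np.ones(4)          # w_i^C
w_prime = np.zeros(4)   # aligned representation w'_i
print(multimedia_features(w, w_prime, sim=0.0))  # gamma = 1 -> elementwise 0.5
```

With a high similarity, gamma shrinks and the aggregated feature stays close to the word's own common-space embedding; matching typed mentions above the threshold are merged into one multimedia event.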

Evaluation Setting
Evaluation Metrics: We conduct evaluation on the text-only, image-only, and multimedia event mentions in the M²E² dataset (Section 2.2). We adopt the traditional event extraction measures, i.e., Precision, Recall and F1. For text-only event mentions, we follow (Ji and Grishman, 2008; Li et al., 2013): a textual event mention is correct if its event type and trigger offsets match a reference trigger; and a textual event argument is correct if its event type, offsets, and role label match a reference argument. We define image-only event mentions analogously: a visual event mention is correct if its event type and image match a reference visual event mention; and a visual event argument is correct if its event type, localization, and role label match a reference argument. A visual argument is correctly localized if the Intersection over Union (IoU) of the predicted bounding box with the ground-truth bounding box is over 0.5. Finally, we define a multimedia event mention to be correct if its event type and trigger offsets (or image) match the reference trigger (or reference image). The arguments of multimedia events are either textual or visual arguments, and are evaluated accordingly. To generate bounding boxes for the attention-based model, we threshold the heatmap at the adaptive value 0.75 · p, where p is the peak value of the heatmap, and then compute the tightest bounding box that encloses the whole thresholded region. Examples are shown in Figure 7 and Figure 8.
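The localization criterion and the heatmap-to-box procedure are easy to make concrete; this sketch follows the IoU > 0.5 rule and the 0.75-of-peak thresholding described above:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over Union for (x1, y1, x2, y2) boxes; a predicted
    visual argument counts as correctly localized when IoU > 0.5."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    union = area(box_a) + area(box_b) - inter
    return inter / union if union else 0.0

def heatmap_to_box(heat, ratio=0.75):
    """Threshold an attention heatmap at ratio * peak and return the
    tightest (x1, y1, x2, y2) box enclosing all surviving cells."""
    mask = heat >= ratio * heat.max()
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()) + 1, int(ys.max()) + 1

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # ~0.143, below the 0.5 threshold
```

In practice the 7×7 heatmap would be upsampled to image resolution before boxing; that step is omitted here for brevity.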
Baselines: The baselines include: (1) Text-only models: We use the state-of-the-art models JMEE (Liu et al., 2018a) and GAIL for comparison. We also evaluate the effectiveness of cross-media joint training by including a version of our model trained only on ACE, denoted WASE_T. (2) Image-only models: Since we are the first to extract newsworthy events from images, and the most similar work, situation recognition, cannot localize arguments, we use our model trained only on the image corpus as a baseline. Our visual branch has two versions, object-based and attention-based, denoted WASE^I_obj and WASE^I_att.
(3) Multimedia models: To show the effectiveness of structured embedding, we include a baseline obtained by removing the text and image GCNs from our model, denoted Flat. The Flat baseline ignores edges and treats images and sentences as sets of vectors. We also compare to the state-of-the-art cross-media common representation model, Contrastive Visual Semantic Embedding VSE-C (Shi et al., 2018), trained the same way as WASE.
Parameter Settings: The common space dimension is 300. The dimension is 512 for the image position embedding and feature map, and 50 for the word position embedding, entity type embedding, and POS tag embedding. The number of GCN layers is 3.

Quantitative Performance
As shown in Table 3, our complete methods (WASE_att and WASE_obj) outperform all baselines in all three evaluation settings in terms of F1. The comparison with the other multimedia models demonstrates the effectiveness of our model architecture and training strategy. The advantage of structured embedding is shown by the better performance over the Flat baseline. Our model outperforms its text-only and image-only variants on multimedia events, showing the inadequacy of single-modal information for complex news understanding. Furthermore, our model achieves better performance on text-only and image-only events, which demonstrates the effectiveness of the multimedia training framework in transferring knowledge between modalities.
WASE_obj and WASE_att are both superior to the state of the art, and each has its own advantages. WASE_obj predicts more accurate bounding boxes since it is based on a Faster R-CNN pretrained on bounding box annotations, resulting in higher argument precision, while WASE_att achieves higher argument recall as it is not limited by the predefined object classes of the Faster R-CNN. Furthermore, to evaluate cross-media event coreference performance, we pair textual and visual event mentions in the same document, and calculate Precision, Recall and F1 against the ground-truth event mention pairs (we do not use coreference clustering metrics because we focus on mention-level cross-media event coreference rather than full coreference across all documents). As shown in Table 4, WASE_obj outperforms all multimedia embedding models, as well as the rule-based baseline using event type matching. This demonstrates the effectiveness of our cross-media soft alignment.

Qualitative Analysis
Our cross-media joint training approach successfully boosts both event extraction and argument role labeling performance. For example, in Figure 5 (a), the text-only model cannot extract the Justice.Arrest event, but the joint model can use the image as background to detect the event type. In Figure 5 (b), the image-only model detects the image as Conflict.Demonstration, but the sentences in the same document help our model avoid labeling it as Conflict.Demonstration. Compared with the multimedia flat embedding in Figure 6, WASE can learn structures such as the Artifact being on top of the Vehicle, and that the person in the middle of the Justice.Arrest event is the Entity rather than the Agent.

Remaining Challenges
One of the biggest challenges in M²E² is localizing arguments in images. Object-based models suffer from limited object type coverage. The attention-based method is not able to precisely localize the objects for each argument, since there is no supervision on attention extraction during training. For example, in Figure 7, the Entity argument of the Conflict.Demonstrate event is correctly predicted as troops, but its localization is incorrect because the Place argument shares a similar attention pattern. When one argument covers too many instances, attention heatmaps tend to lose focus and cover the whole image, as shown in Figure 8.

Related Work
Text Event Extraction: Text event extraction has been extensively studied for the general news domain (Ji and Grishman, 2008; Liao and Grishman, 2011; Huang and Riloff, 2012; Li et al., 2013; Chen et al., 2015; Nguyen et al., 2016; Hong et al., 2018; Liu et al., 2018b; Liu et al., 2018a; Yang et al., 2019; Wadden et al., 2019). Multimedia features have been shown to effectively improve text event extraction (Zhang et al., 2017).

Visual Event Extraction: "Events" in NLP usually refer to complex events that involve multiple entities over a large span of time (e.g., protest), while in CV (Chang et al., 2016; Zhang et al., 2007; Ma et al., 2017) events are less complex single-entity activities (e.g., washing dishes) or actions (e.g., jumping). Visual event ontologies focus on daily-life domains, such as "dog show" and "wedding ceremony" (Perera et al., 2012). Moreover, most efforts ignore the structure of events, including arguments. A few methods aim to localize the agent (Gu et al., 2018; Duarte et al., 2018) or classify the recipient (Sigurdsson et al., 2016; Kato et al., 2018; Wu et al., 2019a) of events, but neither detects the complete set of arguments for an event. The most similar to our work is Situation Recognition (SR) (Yatskar et al., 2016; Mallya and Lazebnik, 2017), which predicts an event and multiple arguments from an input image but does not localize the arguments. We use SR as an auxiliary task for training our visual branch, but exploit object detection and attention to enable localization of arguments. Silberer and Pinkal (2018) redefine the problem of visual argument role labeling with event types and bounding boxes as input. Different from their work, we extend the problem scope to include event identification and coreference, and further advance argument localization by proposing an attention framework which requires no bounding boxes for training or testing.

Conclusions and Future Work
In this paper we propose a new task of multimedia event extraction and set up a new benchmark. We also develop a novel multimedia structured common space construction method that takes advantage of existing image-caption pairs and single-modal annotated data for weakly supervised training. Experiments demonstrate its effectiveness as a new step towards semantic understanding of events in multimedia data. In the future, we aim to extend our framework to extract events from videos and to make it scalable to new event types. We plan to expand our annotations by including event types from other text event ontologies, as well as new event types not in existing text ontologies. We will also apply our extraction results to downstream applications including cross-media event inference and timeline generation.