GAIA: A Fine-grained Multimedia Knowledge Extraction System

We present the first comprehensive, open source multimedia knowledge extraction system, which takes a massive stream of unstructured, heterogeneous multimedia data from various sources and languages as input and creates a coherent, structured knowledge base that indexes entities, relations, and events, following a rich, fine-grained ontology. Our system, GAIA, enables seamless search with complex graph queries and retrieves multimedia evidence, including text, images, and videos. GAIA achieved top performance in the recent NIST TAC SM-KBP 2019 evaluation. The system is publicly available on GitHub and DockerHub, along with a narrated video that documents it.


Introduction
Knowledge Extraction (KE) aims to find entities, relations, and events involving those entities in unstructured data, and to link them to existing knowledge bases. Open source KE tools are useful for many real-world applications, including disaster monitoring (Zhang et al., 2018a), intelligence analysis (Li et al., 2019a), and scientific knowledge mining (Luan et al., 2017; Wang et al., 2019). Recent years have witnessed the great success and wide usage of open source Natural Language Processing tools (Manning et al., 2014; Fader et al., 2011; Khashabi et al., 2018; Honnibal and Montani, 2017), but there is no comprehensive open source system for KE. We release GAIA to fill this gap.
GAIA is inherently designed for multimedia, which is rapidly replacing text-only data in many domains. We extract complementary knowledge from text as well as from related images or video frames, and integrate the knowledge across modalities. Taking Figure 1 as an example, the text entity extraction system extracts the nominal mention troops, but cannot link or relate it further because the textual context is vague. From the image, the entity linking system recognizes the flag as Ukrainian and represents this as a NationalityCitizen relation in the knowledge base. It can be tentatively deduced that the detected people are Ukrainian. Meanwhile, our cross-media fusion system grounds troops to the people detected in the image. This establishes a connection between the knowledge extracted from the two modalities, allowing us to infer that the troops are Ukrainian and that They refers to the Ukrainian government.

Figure 2: User-facing views of knowledge networks constructed with events automatically extracted from multimedia multilingual news reports. We display the event arguments, type, summary, and similar events, as well as visual knowledge extracted from the corresponding image and video.
Compared to the coarse-grained event types of previous work (Li et al., 2019a), we follow a richer ontology to extract fine-grained types, which are crucial to scenario understanding and event prediction. For example, an event of type Movement.TransportPerson involving an entity of type PER.Politician.HeadOfGovernment differs in implications from the same event type involving a PER.Combatant.Sniper entity (i.e., a political trip versus a military deployment). Similarly, an event of type Conflict.Attack.Invade is far more likely to lead to a Contact.Negotiate.Meet event, while a Conflict.Attack.Hanging event is more likely to be followed by an event of type Contact.FuneralVigil.Meet. The knowledge base extracted by GAIA can support various applications, such as multimedia news event understanding and recommendation. We use the Russia-Ukraine conflicts of 2014-2015 as a case study, and develop a knowledge exploration interface that recommends events related to the user's ongoing search based on previously selected attribute values and the dimensions of the events being viewed, as shown in Figure 2. The system thus automatically gives the user more comprehensive exposure to the collected events, their importance, and their interconnections. Extensions of this system to real-time applications would be particularly useful for tracking current events, providing alerts, and predicting possible changes, as well as for following topics related to ongoing incidents.

Overview
The architecture of our multimedia knowledge extraction system is illustrated in Figure 3. The system pipeline consists of a Text Knowledge Extraction (TKE) branch and a Visual Knowledge Extraction (VKE) branch (Sections 3 and 4, respectively). Each branch takes the same set of documents as input and initially creates a separate knowledge base (KB) that encodes the information from its respective modality. Both output knowledge bases use the same types from the DARPA AIDA ontology, as listed in Table 1. Therefore, while the branches each encode their modality-specific extractions into their own KBs, they do so with types defined in the same semantic space. This shared space allows us to fuse the two KBs into a single, coherent multimedia KB through the Cross-Media Knowledge Fusion module (Section 5), which performs visual grounding and cross-modal entity linking. Our user-facing system demo accesses one such resulting KB, where attack events have been extracted from multimedia documents related to the 2014-2015 Russia-Ukraine conflict scenario. In response to user queries, the system recommends information around a primary event and its connected events from the knowledge graph (screenshot in Figure 2).

Text Knowledge Extraction
As shown in Figure 3, the Text Knowledge Extraction (TKE) system extracts entities, relations, and events from input documents. Then it clusters identical entities through entity linking and coreference, and clusters identical events using event coreference.

Text Entity Extraction and Coreference
Coarse-grained Mention Extraction. We extract coarse-grained named and nominal entity mentions using an LSTM-CRF model (Lin et al., 2019). We use pretrained ELMo (Peters et al., 2018) word embeddings as input features for English, and pretrain Word2Vec (Le and Mikolov, 2014) models on Wikipedia data to generate Russian and Ukrainian word embeddings.

Entity Linking and Coreference. We link entity mentions to pre-existing entities in the background KBs (Pan et al., 2015), including Freebase (LDC2015E42) and GeoNames (LDC2019E43). Mentions that link to the same Freebase entity are marked as coreferential. For name mentions that cannot be linked to the KB, we apply heuristic rules (Li et al., 2019b) to same-named mentions within each document to form NIL clusters, where a NIL cluster is a cluster of entity mentions that refer to the same entity but have no corresponding KB entry (Ji et al., 2014).

Fine-grained Entity Typing. We develop an attentive fine-grained type classification model with latent type representation (Lin and Ji, 2019). It takes as input a mention with its context sentence and predicts the most likely fine-grained type. We obtain YAGO (Suchanek et al., 2008) fine-grained types from the results of Freebase entity linking, and map these types to the DARPA AIDA ontology. For mentions with identified coarse-grained GPE and LOC types, we further determine their fine-grained types using the GeoNames attributes feature class and feature code from the GeoNames entity linking result. Given that most nominal mentions are descriptions and thus do not link to entries in Freebase or GeoNames, we develop a nominal keyword list (Li et al., 2019b) for each type to incorporate these mentions into the entity analyses.

Entity Salience Ranking. To better distill the information, we assign each entity a salience score in each document. We rank entities by the weighted sum of their mentions, with higher weights for name mentions. If an entity appears only in nominal and pronoun mentions, we reduce its salience score so that it ranks below entities with name mentions. The salience score is normalized over all entities in each document.
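The exact weighting scheme is not specified here; the following minimal sketch illustrates one way such a salience scorer could work, with the mention-type weights and the demotion factor as illustrative assumptions:

```python
# Hypothetical mention-type weights: the text states only that name mentions
# receive higher weights; these exact values are illustrative assumptions.
MENTION_WEIGHTS = {"NAM": 1.0, "NOM": 0.5, "PRO": 0.1}

def salience_scores(doc_entities):
    """doc_entities maps entity_id -> list of mention types, e.g. ["NAM", "NOM"].
    Returns per-document salience scores normalized to sum to 1."""
    raw = {}
    for entity, mentions in doc_entities.items():
        score = sum(MENTION_WEIGHTS.get(m, 0.0) for m in mentions)
        if "NAM" not in mentions:
            # Demote entities that never appear as a name mention
            # (simplified here as a constant factor).
            score *= 0.5
        raw[entity] = score
    total = sum(raw.values()) or 1.0
    return {e: s / total for e, s in raw.items()}
```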

Text Relation Extraction
For fine-grained relation extraction, we first apply a language-independent CNN-based model (Shi et al., 2018) to extract coarse-grained relations from English, Russian, and Ukrainian documents. Then we apply entity type constraints and dependency patterns to the detected relations and re-categorize them into fine-grained types (Li et al., 2019b). To extract dependency paths for these relations in the three languages, we run each language's Universal Dependency parser (Nivre et al., 2016). For fine-grained types without corresponding coarse-grained training data in ACE/ERE, we instead design dependency path-based patterns and implement a rule-based system that detects these fine-grained relations directly from the text (Li et al., 2019b).
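As a rough illustration of the re-categorization step, the sketch below maps a coarse-grained relation to a fine-grained type when entity-type constraints and dependency-path patterns match; the rules shown are invented examples, not the actual patterns of Li et al. (2019b):

```python
# Hypothetical refinement rules: (coarse type, subject type, object type,
# required dependency edges) -> fine-grained type. Rule contents are
# illustrative only.
REFINEMENT_RULES = [
    ("PartWhole", "ORG", "ORG", ("nmod:of",), "PartWhole.Subsidiary"),
    ("Physical",  "PER", "GPE", ("nmod:in",), "Physical.LocatedNear"),
]

def refine_relation(coarse_type, subj_type, obj_type, dep_path):
    """dep_path: dependency edges on the path between the two arguments."""
    for rel, s, o, edges, fine in REFINEMENT_RULES:
        if (coarse_type, subj_type, obj_type) == (rel, s, o) \
                and all(e in dep_path for e in edges):
            return fine
    return coarse_type  # no rule fires: keep the coarse label
```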

Text Event Extraction and Coreference
We start by extracting coarse-grained events and arguments using a Bi-LSTM CRF model and a CNN-based model (Zhang et al., 2018b) for the three languages, and then detect fine-grained event types by applying verb-based, context-based, and argument-based rules (Li et al., 2019b). We also extract FrameNet frames (Chen et al., 2010) from English corpora to enrich the fine-grained events.
We apply a graph-based algorithm (Al-Badrashiny et al., 2017) for our language-independent event coreference resolution. For each event type, we cast the event mentions as nodes in a graph, so that the undirected, weighted edges between these nodes represent coreference confidence scores between their corresponding events. We then apply hierarchical clustering to obtain event clusters, and train a Maximum Entropy binary classifier on the cluster features (Li et al., 2019b).
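A minimal sketch of the clustering step, assuming a matrix of pairwise coreference confidences (e.g., produced by the classifier) is already available; the cutoff threshold is an illustrative assumption:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform

def cluster_events(pairwise_conf, threshold=0.5):
    """pairwise_conf: symmetric (n x n) matrix of coreference confidences
    between event mentions of one type. Returns a cluster id per mention."""
    dist = 1.0 - np.asarray(pairwise_conf, dtype=float)  # confidence -> distance
    np.fill_diagonal(dist, 0.0)
    condensed = squareform(dist, checks=False)
    tree = linkage(condensed, method="average")  # agglomerative clustering
    # Cut the dendrogram where linked pairs fall below `threshold` confidence.
    return fcluster(tree, t=1.0 - threshold, criterion="distance")
```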

Visual Knowledge Extraction
The Visual Knowledge Extraction (VKE) branch of GAIA takes images and video key frames as input and creates a single, coherent (visual) knowledge base, relying on the same ontology as GAIA's Text Knowledge Extraction (TKE) branch. Similar to TKE, the VKE consists of entity extraction, linking, and coreference modules. Our VKE system also extracts some events and relations.

Visual Entity Extraction
We use an ensemble of visual object detection and concept localization models to extract entities and some events from a given image. To detect generic objects such as person and vehicle, we employ two off-the-shelf Faster R-CNN models (Ren et al., 2015) trained on the Microsoft Common Objects in COntext (MS COCO) (Lin et al., 2014) and Open Images (Kuznetsova et al., 2018) datasets. To detect scenario-specific entities and events, we train a Class Activation Map (CAM) model (Zhou et al., 2016) in a weakly supervised manner using a combination of Open Images with image-level labels and Google image search.
Given an image, each Faster R-CNN model produces a set of labeled bounding boxes, and the CAM model produces a set of labeled heat maps, which are thresholded to produce bounding boxes. The union of all bounding boxes is then post-processed with a set of heuristic rules to remove duplicates and ensure quality. We separately apply a face detector, MTCNN (Zhang et al., 2016), and add the results to the pool of detected objects as additional person entities. Finally, we represent each detected bounding box as an entity in the visual knowledge base; since the CAM model covers some event types, bounding boxes classified as events are instead recorded as event entries.
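A minimal sketch of the duplicate-removal step, assuming pooled (box, label, score) detections; the IoU threshold and the keep-highest-score policy are illustrative assumptions rather than the system's actual heuristics:

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def dedup_detections(detections, iou_thresh=0.7):
    """detections: (box, label, score) triples pooled from all detectors.
    Keeps the highest-scoring box among same-label near-duplicates."""
    kept = []
    for det in sorted(detections, key=lambda d: -d[2]):
        if all(d[1] != det[1] or iou(d[0], det[0]) < iou_thresh for d in kept):
            kept.append(det)
    return kept
```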

Visual Entity Linking
Once entities are added to the (visual) knowledge base, we try to link each one to real-world entities in a curated background knowledge base. Due to the complexity of this task, we develop distinct models for each coarse-grained entity type. For the type person, we train a FaceNet model (Schroff et al., 2015) that takes each cropped human face (detected by the MTCNN model mentioned in Section 4.1) and classifies it as one, or none, of a predetermined set of identities. We compile a list of recognizable and scenario-relevant identities by automatically searching for each person name in the background KB via Google Image Search, collecting the top retrieved results that contain a face, training a binary classifier on half of the results, and evaluating on the other half. If the accuracy exceeds a threshold, we include that person name in our list of recognizable identities. For example, the visual entity in Figure 4 (a) is linked to the Wikipedia entry Rudy Giuliani (https://en.wikipedia.org/wiki/Rudy_Giuliani).
To recognize location, facility, and organization entities, we use a DELF model (Noh et al., 2017), pre-trained on Google Landmarks, to match each image containing detected buildings against a predetermined list. We use a similar approach as above to create a list of recognizable, scenario-relevant landmarks, such as buildings and other structures that identify a specific location, facility, or organization. For example, the visual entity in Figure 4 (b) is linked to the Wikipedia entry Maidan Square (https://en.wikipedia.org/wiki/Maidan_Nezalezhnosti). Finally, to recognize geopolitical entities, we train a CNN to classify flags against a predetermined list of entities, such as all the nations of the world. Take Figure 4 (c) as an example: the flags of Ukraine, the US, and Russia are linked to the Wikipedia entries of the corresponding countries. Once a flag in an image is recognized, we apply a set of heuristic rules to create a nationality affiliation relationship in the knowledge base between certain entities in the scene and the detected country. For instance, a person holding a Ukrainian flag would be affiliated with the country Ukraine.
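As a rough sketch of such a heuristic, the code below asserts a nationality affiliation whenever a detected person box intersects a recognized flag box; treating raw box overlap as a proxy for holding a flag is an illustrative simplification, not the system's actual rule set:

```python
def boxes_overlap(a, b):
    """True if two (x1, y1, x2, y2) boxes intersect at all."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def flag_affiliations(person_boxes, flag_detections):
    """person_boxes: boxes of detected people; flag_detections: (box, country)
    pairs from the flag classifier. Returns relation triples for the KB."""
    return [(person, "NationalityCitizen", country)
            for person in person_boxes
            for flag_box, country in flag_detections
            if boxes_overlap(person, flag_box)]
```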

Visual Entity Coreference
While we cast each detected bounding box as an entity node in the output knowledge base, we must resolve potential coreference links between boxes, since one unique real-world entity can be detected multiple times. Cross-image coreference resolution aims to identify the same entity appearing in multiple images, where the entity may appear in different poses and from different angles. Take Figure 5 as an example: the red bounding boxes in the two images refer to the same person, so they are coreferential and are placed in the same NIL cluster. Within-image coreference resolution requires detecting duplicates, such as those in a collage image. To resolve entity coreference, we train an instance-matching CNN on the YouTube-BB dataset (Real et al., 2017), where the model is asked to match an object bounding box to the same object in a different video frame, rather than to a different object. We use this model to extract features for each detected bounding box and run the DBSCAN clustering algorithm (Ester et al., 1996) on the box features across all images. Entities in the same cluster are coreferential and are represented by a NIL cluster in the output (visual) KB. Similarly, we use a pretrained FaceNet model (Schroff et al., 2015) followed by DBSCAN to cluster face features.

Figure 5: The two green bounding boxes are coreferential since they are both linked to "Kirstjen Nielsen", and the two red bounding boxes are coreferential based on face features. The yellow bounding boxes are unlinkable and not coreferential with any other bounding boxes.
We also define heuristic rules to complement the above procedure in special cases. For example, if the entity linking process (Section 4.2) links several entities to the same real-world entity, we consider them coreferential. In addition, since face detection and person detection yield two entities for each person instance, we merge them into a single entity based on their bounding box intersection.
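A minimal sketch of the feature-clustering step, assuming embeddings from the instance-matching CNN have already been extracted for every detected box; scikit-learn's DBSCAN stands in for the algorithm of Ester et al. (1996), and the eps value is an illustrative assumption:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

def corefer_boxes(box_features, eps=0.4):
    """box_features: (n, d) embeddings from the instance-matching CNN, one
    row per detected bounding box across all images. Returns a NIL-cluster
    id per box; -1 marks boxes left as singletons."""
    feats = normalize(np.asarray(box_features))  # unit-norm rows
    return DBSCAN(eps=eps, min_samples=2).fit_predict(feats)
```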

Cross-Media Knowledge Fusion
Given a set of multimedia documents consisting of textual data, such as written articles and transcribed speech, as well as visual data, such as images and video key frames, the TKE and VKE branches take their respective modality's data as input, extract knowledge elements, and create separate knowledge bases. These textual and visual knowledge bases rely on the same ontology but contain complementary information. Some knowledge elements in a document may not be explicitly mentioned in the text yet appear visually, such as the Ukrainian flag in Figure 1. Even coreferential knowledge elements that exist in both knowledge bases are not completely redundant, since each modality has its own granularity. For example, the word troops in text could be considered coreferential with the individuals in military uniforms detected in the image, but the uniforms being worn may provide additional visual features useful for identifying the military ranks, organizations, and nationalities of those individuals.
To exploit the complementary nature of the two modalities, we combine the two modality-specific knowledge bases into a single, coherent multimedia knowledge base in which each knowledge element can be grounded in either or both modalities. To fuse the two bases, we develop a state-of-the-art visual grounding system (Akbari et al., 2019) that resolves entity coreference across modalities. More specifically, for each entity mention extracted from text, we feed the mention along with its whole sentence into an ELMo model that extracts contextualized features for the mention, and we compare these with CNN feature maps of the surrounding images. This yields a relevance score for each image, as well as a granular relevance map (heatmap) within each image. For sufficiently relevant images, we threshold the heatmap to obtain a bounding box, compare the box content with known visual entities, and assign it to the entity with the most overlapping match; if no overlapping entity is found, we create a new visual entity from the heatmap bounding box (as sketched below). We then link the matching textual and visual entities through a NIL cluster. Additionally, through visual linking (Section 4.2), we corefer cross-modal entities that are linked to the same background KB node.
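A minimal sketch of the heatmap-to-entity matching step described above, with both thresholds as illustrative assumptions:

```python
import numpy as np

def box_iou(a, b):
    """Intersection over union of (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda box: (box[2] - box[0]) * (box[3] - box[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def ground_mention(heatmap, entity_boxes, relevance,
                   rel_thresh=0.5, heat_frac=0.6):
    """heatmap: (H, W) grounding map for one text mention; entity_boxes:
    boxes of known visual entities in that image. Returns (box, index),
    where index is the matched visual entity, or None if a new visual
    entity should be created."""
    if relevance < rel_thresh:
        return None, None  # image not relevant to this mention
    ys, xs = np.nonzero(heatmap >= heat_frac * heatmap.max())
    box = (xs.min(), ys.min(), xs.max() + 1, ys.max() + 1)
    overlaps = [box_iou(box, eb) for eb in entity_boxes]
    if overlaps and max(overlaps) > 0:
        return box, int(np.argmax(overlaps))
    return box, None  # no overlap: create a new visual entity
```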

Quantitative Performance
The performance of each component is shown in Table 2, with benchmarks including ACE (Walker et al., 2006), ERE (Song et al., 2015), AIDA (LDC2018E01: AIDA Seedling Corpus V2.0), MS COCO (Lin et al., 2014), FDDB (Jain and Learned-Miller, 2010), LFW (Huang et al., 2008), Oxf105k (Philbin et al., 2007), YouTube-BB (Real et al., 2017), and Flickr30k (Plummer et al., 2015). To evaluate end-to-end performance, we participated in the TAC SM-KBP 2019 evaluation with our system. The input corpus contains 1999 documents (756 English, 537 Russian, 703 Ukrainian), 6194 images, and 322 videos. We populated a multimedia, multilingual knowledge base with 457,348 entities, 67,577 relations, and 38,517 events. The system's performance was evaluated based on its responses to class queries and graph queries, and GAIA was awarded first place.

Class queries evaluated cross-lingual, cross-modal, fine-grained entity extraction and coreference: the query is an entity type, such as FAC.Building.GovernmentBuilding, and the result is a ranked list of entities of the given type. Our entity ranking is generated by the entity salience score described in Section 3.1.

Graph queries evaluated cross-lingual, cross-modal, fine-grained relation extraction, event extraction, and coreference: the query is an argument role of an event (e.g., Victim of Life.Die.DeathCausedByViolentEvents) or relation (e.g., Parent of PartWhole.Subsidiary), and the result is a list of entities filling that role. The evaluation metrics were Precision, Recall, and F1.

Qualitative Analysis
To demonstrate the system, we selected Ukraine-Russia relations in 2014-2015 as a case study to visualize attack events extracted from the topic-related corpus released by LDC. The system displays recommended events related to the user's ongoing search based on their previously selected attribute values and the dimensions of the events being viewed, such as the fine-grained type, place, time, attacker, target, and instrument. The demo is publicly available, with a user interface as shown in Figure 2, displaying extracted text entities and events across languages; visual entities; visual entity linking and coreference results from face, landmark, and flag recognition; and the results of grounding text entities to visual entities.

Related Work
Existing knowledge extraction systems mainly focus on text (Manning et al., 2014; Fader et al., 2011; Khashabi et al., 2018; Honnibal and Montani, 2017; Li et al., 2019a) and do not readily support fine-grained knowledge extraction. Visual knowledge extraction is typically limited to atomic, everyday concepts with distinctive visual features (Ren et al., 2015; Schroff et al., 2015; Fernández et al., 2017; Gu et al., 2018; Lin et al., 2014); it thus lacks more complex concepts, making the extracted elements challenging to integrate with text. Existing multimedia systems overlook the connections and distinctions between modalities (Yazici et al., 2018). Our system uses a multi-modal ontology with concepts from real-world, newsworthy topics, resulting in rich cross-modal as well as intra-modal connectivity.

Conclusion
We demonstrate a state-of-the-art multimedia multilingual knowledge extraction and event recommendation system. This system enables the user to readily search a knowledge network of extracted, linked, and summarized complex events from multimedia, multilingual sources (e.g., text, images, videos, speech and OCR).