RESIN: A Dockerized Schema-Guided Cross-document Cross-lingual Cross-media Information Extraction and Event Tracking System

We present a new information extraction system that automatically constructs temporal event graphs from a collection of news documents drawn from multiple sources, multiple languages (English and Spanish in our experiments), and multiple data modalities (speech, text, image and video). The system advances the state of the art in two respects: (1) it extends sentence-level event extraction to cross-document, cross-lingual, cross-media event extraction, coreference resolution and temporal event tracking; (2) it uses a human-curated event schema library to match and enhance the extraction output. We have made the dockerized system publicly available for research purposes on GitHub, together with a demo video.

However, event-related information is much harder to remember than entity-related information. For example, most people in the United States can answer the question "Which city is Columbia University located in?", but very few can give a complete answer to "Who died from COVID-19?". Progress in natural language understanding and computer vision has helped automate some parts of event understanding, but the current, first-generation, automated event understanding is overly simplistic, since most methods focus on sentence-level sequence labeling for event extraction. Existing methods for complex event understanding also fail to incorporate knowledge in the form of a repository of abstracted event schemas (complex event templates), to track the progression of time via temporal event tracking, to use background knowledge, and to perform global inference and enhancement.
To address these limitations, in this paper we demonstrate a new end-to-end open-source dockerized research system that extracts temporally ordered events from a collection of news documents from multiple sources, multiple languages (English and Spanish in our experiments), and multiple data modalities (speech, text, image and video). Our system consists of a pipeline of components for schema-guided entity, relation and complex event extraction, entity and event coreference resolution, temporal event tracking, and cross-media entity and event grounding. Event schemas encode knowledge of stereotypical structures of events and their connections. Our end-to-end system has been dockerized and made publicly available for research purposes.

Overview
The architecture of our system is illustrated in Figure 1 (Figure 1: The architecture of the RESIN schema-guided information extraction and temporal event tracking system). Each document cluster contains documents about a specific complex event. Our textual pipeline takes text and transcribed speech as input. It first extracts entity, relation and event mentions (Section 2.2-2.3) and then performs cross-document cross-lingual entity and event coreference resolution (Section 2.4). The extracted events are then ordered by temporal relation extraction (Section 2.5). Our visual pipeline takes images and videos as input, extracts events and arguments from visual signals, and grounds the extracted knowledge elements onto our extracted graph via cross-media event coreference resolution (Section 2.6). Finally, our system selects the schema from a schema repository that best matches the extracted IE graph and merges the two graphs (Section 2.7). Our system can extract 24 types of entities, 46 types of relations and 67 types of events as defined in the DARPA KAIROS ontology.3

Joint Entity, Relation and Event Mention Extraction and Linking from Speech and Text
For speech input, we apply the Amazon Transcribe API4 to convert English and Spanish speech to text. When the language is not specified, it is automatically detected from the audio signal. The API returns the transcription with start and end times for each detected word, as well as potential alternative transcriptions.
3 https://www.darpa.mil/program/knowledge-directed-artificial-intelligence-reasoning-over-schemas
4 https://aws.amazon.com/transcribe/
Then, from the speech recognition results and the text input, we extract entity, relation, and event mentions using OneIE, a state-of-the-art joint neural model for sentence-level information extraction. Given a sentence, the goal of this module is to extract an information graph G = (V, E), where V is the node set containing entity mentions and event triggers, and E is the edge set containing entity relations and event-argument links. We use a pre-trained BERT encoder (Devlin et al., 2018) to obtain contextualized word representations for the input sentence. Next, we adopt separate conditional random field-based taggers to identify entity mention and event trigger spans in the sentence. We represent each span, or node in the information graph, by averaging the vectors of the words in the span. After that, we calculate label scores for each node or edge using separate task-specific feed-forward networks. To capture the interactions among knowledge elements, we incorporate schema-guided global features when decoding information graphs. For a candidate graph G, we define a global feature vector f(G) = (f1(G), ..., fM(G)), where each fi(·) evaluates whether G matches a specific global feature. We compute the global feature score as u · f(G), where u is a learnable weight vector. Finally, we use a beam search-based decoder to generate the information graph with the highest global score. After extracting these mentions, we apply a syntactic parser (Honnibal et al., 2020) to extend mention head words to their full extents. Then we apply a cross-lingual entity linker (Pan et al., 2017) to link entity mentions to WikiData (Vrandečić and Krötzsch, 2014).
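The global-feature scoring step can be sketched as follows. This is a minimal illustration, not OneIE's actual implementation: the feature function, graph representation, and weight value are all invented for the example.

```python
# Minimal sketch of global-feature scoring over candidate graphs. The
# feature function and weight below are illustrative, not from OneIE.

def global_score(graph, feature_fns, u):
    """Dot product u . f(G), where f(G) = (f1(G), ..., fM(G))."""
    return sum(w * fn(graph) for w, fn in zip(u, feature_fns))

# Example global feature: an event should rarely have two Agent arguments.
def at_most_one_agent(graph):
    agents = [e for e in graph["edges"] if e["role"] == "Agent"]
    return 1.0 if len(agents) <= 1 else 0.0

feature_fns = [at_most_one_agent]
u = [0.7]  # learned weight (illustrative value)

candidates = [  # two candidate graphs from the decoder's beam
    {"edges": [{"role": "Agent"}, {"role": "Victim"}]},
    {"edges": [{"role": "Agent"}, {"role": "Agent"}]},
]
best = max(candidates, key=lambda g: global_score(g, feature_fns, u))
```

In the full model this score is added to the local node and edge scores, and beam search keeps the top-scoring partial graphs at each decoding step.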

Document-level Event Argument Extraction
The previous module operates only at the sentence level, yet event arguments can often be found in neighboring sentences. To compensate, we further develop a document-level event argument extraction model (Li et al., 2021) and use the union of the extracted arguments from both models as the final output. We formulate argument extraction as conditional text generation. Our model easily handles missing arguments and multiple arguments in the same role without threshold tuning, and extracts all arguments in a single pass. The condition consists of the original document and a blank event template. For example, the template for the Transportation event type is "arg1 transported arg2 in arg3 from arg4 place to arg5 place". The desired output is the template filled with the arguments. Our model is based on BART (Lewis et al., 2020), an encoder-decoder language model. To utilize the encoder-decoder LM for argument extraction, we construct an input sequence of the form <s> template </s> document </s>. All argument names (arg1, arg2, etc.) in the template are replaced by a special placeholder token <arg>. The model is trained end-to-end by directly optimizing the generation probability.
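The input construction above can be illustrated with a short sketch. The `<s>`/`</s>` layout follows BART's special-token convention; the exact tokens and placeholder string used by the authors' model may differ.

```python
# Illustrative construction of the conditional-generation input: blank the
# numbered argument slots in the event template, then concatenate template
# and document with BART-style separators.
import re

def build_input(template, document, placeholder="<arg>"):
    # Replace numbered argument slots (arg1, arg2, ...) with one placeholder.
    blanked = re.sub(r"arg\d+", placeholder, template)
    return f"<s> {blanked} </s> {document} </s>"

template = "arg1 transported arg2 in arg3 from arg4 place to arg5 place"
doc = "A truck carried the soldiers from the base to the border."
seq = build_input(template, doc)
# seq begins with "<s> <arg> transported <arg> in <arg> ..."
```

The decoder is then trained to emit the same template with each `<arg>` slot replaced by the extracted argument text (or left as the placeholder when the role is unfilled).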
To align the extracted arguments back to the document, we adopt a simple postprocessing procedure and find the matching text span closest to the corresponding event trigger.

Cross-document Cross-lingual Entity and Event Coreference Resolution
After extracting all mentions of entities and events, we apply our cross-document cross-lingual entity coreference resolution model, which extends the e2e-coref model (Lee et al., 2017) in two ways. First, we use the multilingual XLM-RoBERTa (XLM-R) Transformer model (Conneau et al., 2020) so that our coreference resolution model can handle non-English data. Second, we port the e2e-coref model to the cross-lingual cross-document setting.
Given N mixed English and Spanish input documents, we create N(N−1)/2 pairs of documents and treat each pair as a single "mega-document". We apply our model to each mega-document and, at the end, aggregate the predictions across all mega-documents to extract the coreference clusters. Finally, we apply a simple heuristic rule that prevents two entity mentions from being merged if they are linked to different entities with high confidence.
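The aggregation step can be sketched as follows. The neural coreference model itself is stubbed out; `pairwise_links` stands in for its per-pair predictions, and the union-find merge is one natural way to realize the aggregation described above.

```python
# Sketch: merge pairwise coreference links predicted on each mega-document
# into global clusters with union-find.
from itertools import combinations

def aggregate_clusters(num_docs, pairwise_links):
    """pairwise_links maps frozenset({i, j}) of document indices to a list
    of (mention_a, mention_b) coreference links found on that pair."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for i, j in combinations(range(num_docs), 2):  # the N(N-1)/2 pairs
        for a, b in pairwise_links.get(frozenset((i, j)), []):
            parent[find(a)] = find(b)

    clusters = {}
    for m in list(parent):
        clusters.setdefault(find(m), set()).add(m)
    return list(clusters.values())

links = {frozenset((0, 1)): [("m1", "m2")],
         frozenset((1, 2)): [("m2", "m3")]}
clusters = aggregate_clusters(3, links)  # transitively merges m1, m2, m3
```

Note that aggregation is transitive: a link found on the (0, 1) pair and another on the (1, 2) pair place all three mentions in one cluster.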
Our event coreference resolution method (Lai et al., 2021) is similar to entity coreference resolution, but incorporates additional symbolic features such as event type information. If the input documents are all about one specific complex event, we apply schema-guided heuristic rules to further refine the predictions of the neural event coreference resolution model. For example, in a bombing schema there is typically only one bombing event. Therefore, if a document cluster contains two event mentions of type bombing that share several arguments, the two mentions are considered coreferential.
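A hedged sketch of this refinement rule: within one complex-event cluster, merge two mentions of a "singleton" event type when they share enough arguments. The type name and threshold below are illustrative, not the system's actual values.

```python
# Schema-guided refinement rule (illustrative): merge mentions of a
# singleton event type that share at least min_shared_args arguments.

SINGLETON_TYPES = {"Bombing"}  # event types a schema expects exactly once

def should_merge(m1, m2, min_shared_args=2):
    if m1["type"] != m2["type"] or m1["type"] not in SINGLETON_TYPES:
        return False
    return len(set(m1["args"]) & set(m2["args"])) >= min_shared_args

a = {"type": "Bombing", "args": {"market", "attacker_x", "baghdad"}}
b = {"type": "Bombing", "args": {"market", "baghdad"}}
c = {"type": "Arrest", "args": {"attacker_x"}}
# should_merge(a, b) -> True; should_merge(a, c) -> False
```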

Cross-document Temporal Event Ordering
Based on the event coreference resolution component described above, we group all mentions into clusters. Next we aim to order events along a timeline. We follow prior work to design a component for temporal event ordering. Specifically, we further pre-train a T5 model (Raffel et al., 2020) with distant temporal ordering supervision signals, acquired through two sets of syntactic patterns: (1) before/after keywords in text and (2) explicit date and time mentions. We take this pre-trained temporal T5 model, fine-tune it on MATRES (Ning et al., 2018b), and use it for temporal event ordering, performing pairwise temporal relation classification for all event mention pairs in a document. We further train an alternative model by fine-tuning RoBERTa (Liu et al., 2019) on MATRES (Ning et al., 2018b); this model has also been successfully applied to event time prediction (Wen et al., 2021; Li et al., 2020a). We only consider event mention pairs that are within neighboring sentences or can be connected by shared arguments.
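The candidate-pair filter at the end of this paragraph can be sketched as below. Field names and the sentence-gap threshold are illustrative assumptions.

```python
# Sketch: only event mention pairs in neighboring sentences, or pairs
# connected by a shared argument, reach the pairwise temporal classifier.
from itertools import combinations

def candidate_pairs(mentions, max_sent_gap=1):
    pairs = []
    for a, b in combinations(mentions, 2):
        neighboring = abs(a["sent"] - b["sent"]) <= max_sent_gap
        shared_arg = bool(set(a["args"]) & set(b["args"]))
        if neighboring or shared_arg:
            pairs.append((a["id"], b["id"]))
    return pairs

mentions = [
    {"id": "e1", "sent": 0, "args": {"truck"}},
    {"id": "e2", "sent": 1, "args": {"soldiers"}},
    {"id": "e3", "sent": 5, "args": {"truck"}},  # distant, but shares "truck"
]
# candidate_pairs(mentions) -> [("e1", "e2"), ("e1", "e3")]
```

This keeps the classification quadratic only over nearby or argument-linked events rather than over all mention pairs in the document.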
Besides model prediction, we also learn high-confidence patterns from the schema repository, treating temporal relations that appear very frequently as prior knowledge. For each document cluster, we apply these patterns as high-precision rules before running the two statistical temporal ordering models separately. The schema matching algorithm then selects the better match of the two resulting graphs as the final instantiated schema.
Because annotation for non-English data can be expensive and time-consuming, the temporal event tracking component has only been trained on English input. To extend temporal event tracking to the cross-lingual setting, we apply Google Cloud neural machine translation6 to translate Spanish documents into English and apply the FastAlign algorithm (Dyer et al., 2013) to obtain word alignments.

Cross-media Information Grounding and Fusion
Visual event and argument role extraction: Our goal is to extract visual events along with their argument roles from visual data, i.e., images and videos. To train an event extractor on visual data, we have collected a new dataset called Video M2E2, which contains 1,500 video-article pairs gathered by searching YouTube news channels using 18 event primitives related to visual concepts as search keywords. We have extensively annotated the videos and sampled key frames for annotating bounding boxes of argument roles. Our visual event and argument role extraction system consists of an event classification model (ResNet-50 (He et al., 2016)) and an argument role extraction model (JSL (Marasović et al., 2020)). To extract events and associated argument roles, we leverage a public dataset, Situation with Groundings (SWiG) (Marasović et al., 2020), to pre-train our system. SWiG is designed for event and argument understanding in images with object groundings but uses a different ontology. We map the event types, argument role types and entity names in SWiG to our ontology (covering 12 event sub-types) so that our model can extract event information from both images and videos. For videos, we sample frames at a rate of 1 frame per second and process them as individual images. In this way, we have a unified model for both image and video inputs.
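The 1-frame-per-second sampling step amounts to simple index arithmetic over the video's native frame rate, as in this sketch (the function name and interface are illustrative):

```python
# Illustrative arithmetic for 1-fps sampling: compute which native frame
# indices to decode and feed to the image-level event classifier.

def sampled_frame_indices(total_frames, native_fps, target_fps=1.0):
    step = native_fps / target_fps  # native frames per sampled frame
    indices, pos = [], 0.0
    while int(pos) < total_frames:
        indices.append(int(pos))
        pos += step
    return indices

# A 5-second clip at 30 fps yields frames [0, 30, 60, 90, 120].
```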
Multimodal event coreference: We further extend the visual event extraction model above to find coreference links between visual and textual events. For video frames with detected events, we apply a weakly-supervised grounding model (Akbari et al., 2019) to find sentence-frame pairs with high frame-to-sentence similarity, indicating that the sentence content resembles the frame content. We then apply a rule-based approach to decide whether a visual event mention and a textual event mention are coreferential: (1) their event types match; (2) the entity types for the same argument role do not contradict each other across modalities; and (3) the video frame and sentence have a high semantic similarity score. Based on this pipeline, we can add visual provenance of events to the event graph, as well as visual-only arguments, which makes the event graph more informative.
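A minimal sketch of the three rules; the threshold, entity-type labels, and field names are illustrative assumptions rather than the system's actual values.

```python
# Rule-based cross-media event coreference check (illustrative).

def visual_text_coref(vis_event, txt_event, similarity, sim_threshold=0.5):
    # (1) event types must match
    if vis_event["type"] != txt_event["type"]:
        return False
    # (2) entity types for shared argument roles must not contradict
    for role, etype in vis_event["args"].items():
        if role in txt_event["args"] and txt_event["args"][role] != etype:
            return False
    # (3) the frame and sentence must be semantically similar
    return similarity >= sim_threshold

vis = {"type": "Attack", "args": {"Attacker": "PER"}}
txt = {"type": "Attack", "args": {"Attacker": "PER", "Place": "GPE"}}
# visual_text_coref(vis, txt, similarity=0.8) -> True
```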

Schema Matching
Once we have acquired a large-scale schema repository via schema induction methods (Li et al., 2020c), we can view it as a scaffolding that we instantiate with incoming data to construct temporal event graphs. For each document cluster, we need to find the best-matching schema in the repository. We therefore design a schema matching algorithm that aligns our extracted events, entities and relations to a schema.
We first perform a topological sort of the events based on temporal relations, for both the IE graph and the schema graph, to obtain linearized event sequences in chronological order. Then, for each pair of IE graph and schema graph, we apply the longest common subsequence (LCS) method to find the best matching. However, our schema matching also considers coreference and relations, which breaks the optimal-substructure property that holds when only event sequences are considered. We therefore extend the algorithm by replacing the single best result for each subproblem with a beam of candidates, ranked by a scoring metric over matched events, arguments and relations. The candidates consist of matched event pairs, whose arguments and relations we then match greedily for scoring. Finally, we merge the best-matched IE graph and schema graph to form the final instantiated schema.
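The base alignment step can be sketched as a standard LCS over event type sequences. The beam extension that also scores arguments and relations is omitted here, and the schema contents are invented for illustration.

```python
# Sketch: pick the schema whose event sequence has the longest common
# subsequence with the chronologically sorted IE event sequence.

def lcs(seq_a, seq_b):
    n, m = len(seq_a), len(seq_b)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if seq_a[i - 1] == seq_b[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[n][m]

def best_schema(ie_events, schema_repo):
    return max(schema_repo, key=lambda s: lcs(ie_events, s["events"]))

repo = [
    {"name": "bombing", "events": ["Attack", "Injure", "Investigate", "Arrest"]},
    {"name": "protest", "events": ["Demonstrate", "Arrest"]},
]
ie_events = ["Attack", "Injure", "Arrest"]
# best_schema(ie_events, repo)["name"] -> "bombing"
```

In the full algorithm each DP cell keeps a beam of candidate alignments scored on matched events plus their greedily matched arguments and relations, rather than a single subsequence length.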

Data
We have conducted evaluations of both schema induction and schema-guided information extraction.

Quantitative Performance
Schema Induction. To induce schemas, we collect Wikipedia articles describing complex events related to improvised explosive devices (IEDs) and extract event graphs by applying our IE system. The data statistics are shown in Table 1. We induce schemas by applying the path language model (Li et al., 2020c) over event paths in the training data, and merge top-ranked paths into schema graphs for human curation. The statistics of the human-curated schema repository are shown in Table 2.
Schema-guided Information Extraction. The performance of each component is shown in Table 3. We evaluate the end-to-end performance of our system on a complex event corpus (LDC2020E39), which contains multi-lingual multi-media document clusters; the data statistics are shown in Table 4. We train the coreference components on CoNLL 2002 (Tjong Kim Sang, 2002), DCEP (Dias, 2016) and SemEval 2010 (Recasens et al., 2010); the temporal ordering component on MATRES (Ning et al., 2018b); and the visual event and argument extraction components on Video M2E2 and SWiG (Marasović et al., 2020). The statistics of our output are shown in Table 5. The DARPA program's phase 1 human assessment on about 25% of our system output shows that about 70% of events are correctly extracted. We can see that our system extracts events, entities and relations and aligns them well with the selected schema; the final instantiated schema is a hybrid of the two graphs obtained by merging the matched elements.

Related Work
Text Information Extraction. Existing end-to-end Information Extraction (IE) systems (Wadden et al., 2019; Li et al., 2020b; Li et al., 2019) mainly focus on extracting entities, events and entity relations from individual sentences. In contrast, we extract and infer arguments over the global document context. Furthermore, our IE system is guided by a schema repository.
The extracted graph will be used to instantiate a schema graph, which can be applied to predict future events.
Multimedia Information Extraction. Previous multimedia IE systems (Li et al., 2020b; Yazici et al., 2018) only include cross-media entity coreference resolution, grounding the extracted visual entities to text. We are the first to perform cross-media joint event extraction and coreference resolution to obtain coreferential events from text, images and videos.
Temporal Event Ordering. Previous work extracts temporal relations between neighboring events within a sentence (Ning et al., 2017, 2018a, 2019; Han et al., 2019), ignoring the temporal dependencies between events across sentences. We perform document-level event ordering and propagate temporal attributes through shared arguments. Furthermore, we take advantage of schema repository knowledge by using frequent temporal orders between event types to guide the ordering of events.

Conclusions and Future Work
We demonstrate a state-of-the-art schema-guided cross-document cross-lingual cross-media information extraction and event tracking system. The system is publicly available, enabling users to effectively harness rich information from a variety of sources, languages and modalities. In the future, we plan to develop more advanced graph neural network-based methods for schema matching and schema-guided event prediction.

Broader Impact
Our goal in developing cross-document cross-lingual cross-media information extraction and event tracking systems is to advance the state of the art and enhance the field's ability to fully understand real-world events from multiple sources, languages and modalities. We believe that to make real progress in event-centric Natural Language Understanding, we should not focus only on datasets, but also ground our work in real-world applications. The application we focus on is navigating news, and the examples shown here and in the paper demonstrate the potential use in news understanding. For our demo, the distinction between beneficial and harmful use depends, in part, on the data: proper use of the technology requires that input documents and images are legally and ethically obtained. We are particularly excited about the potential use of these technologies in applications of broad societal impact, such as disaster monitoring and emergency response. Training and assessment data are often biased in ways that limit system accuracy on less well-represented populations and in new domains. The performance of our system components reported in the experiment section is based on specific benchmark datasets, which could be affected by such data biases. Thus, questions concerning generalizability and fairness should be carefully considered.
A general approach to ensuring proper, rather than malicious, application of dual-use technology should incorporate ethics considerations as first-order principles in every step of system design, and maintain a high degree of transparency and interpretability of data, algorithms, models, and functionality throughout the system. We intend to make our software available as open source and as shared docker containers for public verification and auditing, and to explore countermeasures to protect vulnerable groups.